summaryrefslogtreecommitdiff
path: root/strings/ctype-ujis.c
Commit message (Collapse)AuthorAgeFilesLines
* A cleanup for MDEV-30695 Refactor case folding data types in Asian collationsAlexander Barkov2023-03-031-13/+13
| | | | Adding "const" qualifiers to casefold_info_st::page
* MDEV-30695 Refactor case folding data types in Asian collationsAlexander Barkov2023-02-211-1064/+1068
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a non-functional change and should not change the server behavior. Casefolding information is now stored in items of a new data type MY_CASEFOLD_CHARACTER: typedef struct casefold_info_char_t { uint32 toupper; uint32 tolower; } MY_CASEFOLD_CHARACTER; Before this change, casefolding tables for Asian collations were stored in: typedef struct unicase_info_char_st { uint32 toupper; uint32 tolower; uint32 sort; } MY_UNICASE_CHARACTER; The "sort" member was not used in the code handling Asian collations, it only wasted space. (it's only used by Unicode _general_ci and _general_mysql500_ci collations). Unicode collations (at least UCA and _bin) should also be refactored later, but under terms of a separate task.
* MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations ↵Alexander Barkov2023-02-171-13/+7
| | | | | | | | | | | | | | | | | | | | | for utf8 String length growth during upper/lower conversion in Unicode collations depends only on the underlying MY_UNICASE_INFO used in the collation. Maintaining a separate member CHARSET_INFO::caseup_multiply and CHARSET_INFO::casedn_multiply duplicated this information and caused bugs like this (when MY_UNICASE_INFO and case??_multiply when out of sync because of incomplete CHARSET_INFO initialization). Fix: Changing CHARSET_INFO::caseup_multiply and CHARSET_INFO::casedn_multiply from members to virtual functions. The virtual functions in Unicode collations calculate case conversion growth factors from the MY_UNICASE_INFO. This guarantees that the growth factors are always in sync with the MY_UNICASE_INFO.
* MDEV-27009 Add UCA-14.0.0 collationsAlexander Barkov2022-08-101-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Added one neutral and 22 tailored (language specific) collations based on Unicode Collation Algorithm version 14.0.0. Collations were added for Unicode character sets utf8mb3, utf8mb4, ucs2, utf16, utf32. Every tailoring was added with four accent and case sensitivity flag combinations, e.g: * utf8mb4_uca1400_swedish_as_cs * utf8mb4_uca1400_swedish_as_ci * utf8mb4_uca1400_swedish_ai_cs * utf8mb4_uca1400_swedish_ai_ci and their _nopad_ variants: * utf8mb4_uca1400_swedish_nopad_as_cs * utf8mb4_uca1400_swedish_nopad_as_ci * utf8mb4_uca1400_swedish_nopad_ai_cs * utf8mb4_uca1400_swedish_nopad_ai_ci - Introducing a conception of contextually typed named collations: CREATE DATABASE db1 CHARACTER SET utf8mb4; CREATE TABLE db1.t1 (a CHAR(10) COLLATE uca1400_as_ci); The idea is that there is no a need to specify the character set prefix in the new collation names. It's enough to type just the suffix "uca1400_as_ci". The character set is taken from the context. In the above example script the context character set is utf8mb4. So the CREATE TABLE will make a column with the collation utf8mb4_uca1400_as_ci. Short collations names can be used in any parts of the SQL syntax where the COLLATE clause is understood. - New collations are displayed only one time (without character set combinations) by these statements: SELECT * FROM INFORMATION_SCHEMA.COLLATIONS; SHOW COLLATION; For example, all these collations: - utf8mb3_uca1400_swedish_as_ci - utf8mb4_uca1400_swedish_as_ci - ucs2_uca1400_swedish_as_ci - utf16_uca1400_swedish_as_ci - utf32_uca1400_swedish_as_ci have just one entry in INFORMATION_SCHEMA.COLLATIONS and SHOW COLLATION, with COLLATION_NAME equal to "uca1400_swedish_as_ci", which is the suffix without the character set name: SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci'; +-----------------------+ | COLLATION_NAME | +-----------------------+ | uca1400_swedish_as_ci | +-----------------------+ Note, the behaviour of old collations did not change. Non-unicode collations (e.g. latin1_swedish_ci) and old UCA-4.0.0 collations (e.g. utf8mb4_unicode_ci) are still displayed with the character set prefix, as before. - The structure of the table INFORMATION_SCHEMA.COLLATIONS was changed. The NOT NULL constraint was removed from these columns: - CHARACTER_SET_NAME - ID - IS_DEFAULT and from the corresponding columns in SHOW COLLATION. For example: SELECT COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci'; +-----------------------+--------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | +-----------------------+--------------------+------+------------+ | uca1400_swedish_as_ci | NULL | NULL | NULL | +-----------------------+--------------------+------+------------+ The NULL value in these columns now means that the collation is applicable to multiple character sets. The behavioir of old collations did not change. Make sure your client programs can handle NULL values in these columns. - The structure of the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY was changed. Three new NOT NULL columns were added: - FULL_COLLATION_NAME - ID - IS_DEFAULT New collations have multiple entries in COLLATION_CHARACTER_SET_APPLICABILITY. The column COLLATION_NAME contains the collation name without the character set prefix. The column FULL_COLLATION_NAME contains the collation name with the character set prefix. Old collations have full collation name in both FULL_COLLATION_NAME and COLLATION_NAME. SELECT COLLATION_NAME, FULL_COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE FULL_COLLATION_NAME RLIKE '^(utf8mb4|latin1).*swedish.*ci$'; +-----------------------------+-------------------------------------+--------------------+------+------------+ | COLLATION_NAME | FULL_COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | +-----------------------------+-------------------------------------+--------------------+------+------------+ | latin1_swedish_ci | latin1_swedish_ci | latin1 | 8 | Yes | | latin1_swedish_nopad_ci | latin1_swedish_nopad_ci | latin1 | 1032 | | | utf8mb4_swedish_ci | utf8mb4_swedish_ci | utf8mb4 | 232 | | | uca1400_swedish_ai_ci | utf8mb4_uca1400_swedish_ai_ci | utf8mb4 | 2368 | | | uca1400_swedish_as_ci | utf8mb4_uca1400_swedish_as_ci | utf8mb4 | 2370 | | | uca1400_swedish_nopad_ai_ci | utf8mb4_uca1400_swedish_nopad_ai_ci | utf8mb4 | 2372 | | | uca1400_swedish_nopad_as_ci | utf8mb4_uca1400_swedish_nopad_as_ci | utf8mb4 | 2374 | | +-----------------------------+-------------------------------------+--------------------+------+------------+ - Other INFORMATION_SCHEMA queries: SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS; SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES; SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS; display full collation names, including character sets prefix, for all collations, including new collations. Corresponding SHOW commands also display full collation names in collation related columns: SHOW CREATE TABLE t1; SHOW CREATE DATABASE db1; SHOW TABLE STATUS; SHOW CREATE FUNCTION f1; SHOW CREATE PROCEDURE p1; SHOW CREATE EVENT ev1; SHOW CREATE TRIGGER tr1; SHOW CREATE VIEW; These INFORMATION_SCHEMA queries and SHOW statements may change in the future, to display show collation names.
* Merge branch '10.6' into 10.7Oleksandr Byelkin2022-02-041-5/+9
|\
| * Merge branch '10.5' into 10.6Oleksandr Byelkin2022-02-031-5/+9
| |\
| | * Merge branch '10.4' into 10.5Oleksandr Byelkin2022-02-011-5/+9
| | |\
| | | * Merge branch '10.3' into 10.4Oleksandr Byelkin2022-01-301-5/+5
| | | |\
| | | | * MDEV-27494 Rename .ic files to .inlVladislav Vaintroub2022-01-171-5/+5
| | | | |
| | | * | MDEV-25904 New collation functions to compare InnoDB style trimmed NO PAD ↵bb-10.4-bar-MDEV-25904Alexander Barkov2022-01-211-0/+4
| | | |/ | | | | | | | | | | | | strings
* | | | Merge 10.6 into 10.7Marko Mäkelä2021-09-301-4/+13
|\ \ \ \ | |/ / /
| * | | MDEV-26669 Add MY_COLLATION_HANDLER functions min_str() and max_str()bb-10.6-bar-MDEV-26669Alexander Barkov2021-09-271-4/+13
| | | |
* | | | MDEV-26572 Improve simple multibyte collation performance on the ASCII rangebb-10.7-bar-MDEV-26572Alexander Barkov2021-09-131-0/+4
|/ / /
* | | Change CHARSET_INFO character set and collaction names to LEX_CSTRINGMonty2021-05-191-8/+9
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change removed 68 explict strlen() calls from the code. The following renames was done to ensure we don't use the old names when merging code from earlier releases, as using the new variables for print function could result in crashes: - charset->csname renamed to charset->cs_name - charset->name renamed to charset->coll_name Almost everything where mechanical changes except: - Changed to use the new Protocol::store(LEX_CSTRING..) when possible - Changed to use field->store(LEX_CSTRING*, CHARSET_INFO*) when possible - Changed to use String->append(LEX_CSTRING&) when possible Other things: - There where compiler issues with ensuring that all character set names points to the same string: gcc doesn't allow one to use integer constants when defining global structures (constant char * pointers works fine). To get around this, I declared defines for each character set name length.
* | MDEV-7947 strcmp() takes 0.37% in OLTP ROMonty2020-07-231-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch ensures that all identical character sets shares the same cs->csname. This allows us to replace strcmp() in my_charset_same() with comparisons of pointers. This fixes a long standing performance issue that could cause as strcmp() for every item sent trough the protocol class to the end user. One consequence of this patch is that we don't allow one to add a character definition in the Index.xml file that changes the csname of an existing character set. This is by design as changing character set names of existing ones is extremely dangerous, especially as some storage engines just records character set numbers. As we now have a hash over character set's csname, we can in the future use that for faster access to a specific character set. This could be done by changing the hash to non unique and use the hash to find the next character set with same csname.
* | MDEV-22043 Special character leads to assertion in ↵Alexander Barkov2020-05-091-0/+1
|/ | | | | | | | | | | my_wc_to_printable_generic on 10.5.2 (debug) The code did not take into account that: - U+005C (backslash) can occupy more than mbminlen characters (e.g. in sjis) - Some character sets do not have a code for U+005C (e.g. swe7) Adding a new function my_wc_to_printable into MY_CHARSET_HANDLER to cover all special cases easier.
* Merge 10.1 into 10.2Marko Mäkelä2019-05-131-1/+1
|\
| * Merge branch '5.5' into 10.1Vicențiu Ciorbaru2019-05-111-1/+1
| |\
| | * Update FSF AddressVicențiu Ciorbaru2019-05-111-1/+1
| | | | | | | | | | | | * Update wrong zip-code
* | | Merge 10.1 into 10.2Marko Mäkelä2018-08-021-4/+4
|\ \ \ | |/ /
| * | Merge branch '10.0' into bb-10.1-merge-sanjaOleksandr Byelkin2018-07-251-4/+4
| |\ \
| | * | Simplify caseup() and casedn() in charsetsAlexander Barkov2018-07-191-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the MDEV-13118 fix there's no code in the server that wants caseup/casedn to change the argument in place for simple charsets. Let's remove this logic and always return the result in a new string for all charsets, both simple and complex. 1. Removing the optimization that *some* character sets used in casedn() and caseup(), which allowed (and required) to change the case in-place, overwriting the string passed as the "src" argument. Now all CHARSET_INFO's work in the same way: non of them change the source string in-place, all of them now convert case from the source string to the destination string, leaving the source string untouched. 2. Adding "const" qualifier to the "char *src" parameter to caseup() and casedn(). 3. Removing duplicate implementations in ctype-mb.c. Now both caseup() and casedn() implementations for all CJK character sets use internally the same function my_casefold_mb() (the former my_casefold_mb_varlen()). 4. Removing the "unused" attribute from parameters of some my_case{up|dn}_xxx() implementations, as the affected parameters are now *used* in the code. Previously these parameters were used only in DBUG_ASSERT().
* | | | MDEV-7769 MY_CHARSET_INFO refactoring# On branch 10.2Alexander Barkov2016-10-101-1/+0
| | | | | | | | | | | | | | | | Part 3 (final): removing MY_CHARSET_HANDLER::well_formed_len().
* | | | MDEV-9711 NO PAD collationsAlexander Barkov2016-09-061-0/+118
| | | | | | | | | | | | | | | | Based on the patch from Daniil Medvedev (a Google Summer of Code task)
* | | | MDEV-6353 my_ismbchar() and my_mbcharlen() refactoringAlexander Barkov2016-05-171-7/+0
| | | |
* | | | MDEV-9823 LOAD DATA INFILE silently truncates incomplete byte sequencesAlexander Barkov2016-04-061-0/+1
| | | |
* | | | MDEV-9665 Remove cs->cset->ismbchar()Alexander Barkov2016-03-161-11/+0
|/ / / | | | | | | | | | Using a more powerfull cs->cset->charlen() instead.
* | | Adding MY_CHARSET_HANDLER::native_to_mb().Alexander Barkov2015-08-141-0/+1
| | | | | | | | | | | | | | | | | | | | | This is a pre-requisite patch for: - MDEV-8433 Make field<'broken-string' use indexes - MDEV-8625 Bad result set with ignorable characters when using a prefix key - MDEV-8626 Bad result set with contractions when using a prefix key
* | | MDEV-8215 Asian MB3 charsets: compare broken bytes as "greater than any ↵Alexander Barkov2015-07-031-5/+40
| | | | | | | | | | | | non-broken character"
* | | MDEV-6566 Different INSERT behaviour on bad bytes with and without character ↵Alexander Barkov2015-03-131-2/+4
| | | | | | | | | | | | set conversion
* | | Adding a shared include file ctype-mb.ic and removing a numberAlexander Barkov2015-03-041-61/+20
| | | | | | | | | | | | | | | of very similar copies of my_well_formed_len_xxx(), implemented for big5, cp932, euckr, eucjpms, gb2312m gbk, sjis, ujis.
* | | A preparatory patch for MDEV-6566.Alexander Barkov2015-03-021-1/+2
|/ / | | | | | | | | | | Adding a new virtual function MY_CHARSET_HANDLER::copy_abort(). Moving character set specific code into the correspoding implementations (for simple, multi-byte and mbmaxlen>1 character sets).
* | MDEV-6776 ujis and eucjmps erroneously accept 0x8EA0 as a valid byte sequenceAlexander Barkov2014-09-241-9/+8
| |
* | 5.5.39 mergeSergei Golubchik2014-08-071-3/+3
|\ \ | |/
| * mysql-5.5.39 mergeSergei Golubchik2014-08-021-3/+3
| |\ | | | | | | | | | | | | | | | | | | ~40% bugfixed(*) applied ~40$ bugfixed reverted (incorrect or we're not buggy) ~20% bugfixed applied, despite us being not buggy (*) only changes in the server code, e.g. not cmakefiles
| | * Bug#18850241 WRONG COPYRIGHT HEADER IN SOME STRINGS/CTYPE-* FILESErlend Dahl2014-06-231-1/+2
| | |
* | | MDEV-5163 Merge WEIGHT_STRING function from MySQL-5.6Alexander Barkov2013-10-231-1/+3
| | |
* | | MDEV-4928 Merge collation customization improvements Alexander Barkov2013-10-021-20/+27
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | Merging the following MySQL-5.6 changes: - WL#5624: Collation customization improvements http://dev.mysql.com/worklog/task/?id=5624 - WL#4013: Unicode german2 collation http://dev.mysql.com/worklog/task/?id=4013 - Bug#62429 XML: ExtractValue, UpdateXML max arg length 127 chars http://bugs.mysql.com/bug.php?id=62429 (required by WL#5624)
* | mysql-5.5.32 mergeSergei Golubchik2013-07-161-2/+2
|\ \ | |/
| * Fix for Bug 16395495 - OLD FSF ADDRESS IN GPL HEADERMurthy Narkedimilli2013-03-191-2/+2
| |
* | 5.3 mergeSergei Golubchik2012-01-131-1/+1
|\ \
| * \ Merge with MariaDB 5.1Michael Widenius2011-11-241-2/+3
| |\ \
| | * \ Initail merge with MySQL 5.1 (XtraDB still needs to be merged)Michael Widenius2011-11-211-2/+3
| | |\ \ | | | | | | | | | | | | | | | Fixed up copyright messages.
* | | \ \ mysql-5.5.18 mergeSergei Golubchik2011-11-031-1/+1
|\ \ \ \ \ | | |_|_|/ | |/| | |
| * | | | Updated/added copyright headersKent Boortz2011-06-301-1/+1
| |\ \ \ \ | | | |_|/ | | |/| |
| | * | | Updated/added copyright headersKent Boortz2011-06-301-3/+4
| | |\ \ \
* | | \ \ \ merge with 5.3Sergei Golubchik2011-10-191-4/+5
|\ \ \ \ \ \ | | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sql/sql_insert.cc: CREATE ... IF NOT EXISTS may do nothing, but it is still not a failure. don't forget to my_ok it. ****** CREATE ... IF NOT EXISTS may do nothing, but it is still not a failure. don't forget to my_ok it. sql/sql_table.cc: small cleanup ****** small cleanup
| * | | | | Merge with MariaDB 5.1Michael Widenius2011-05-031-1/+3
| |\ \ \ \ \ | | | |_|_|/ | | |/| | |
| | * | | | Merge with MySQL 5.1.57/58Michael Widenius2011-05-021-1/+3
| | |\ \ \ \ | | | | |/ / | | | |/| | | | | | | | Moved some BSD string functions from Unireg
| * | | | | Merge with 5.1 to get in changes from MySQL 5.1.55Michael Widenius2011-02-281-3/+2
| |\ \ \ \ \ | | |/ / / /