path: root/enc/unicode.c
Commit message | Author | Age | Files | Lines
* Suppress warnings by gcc 10.1.0-RC-20200430 | Nobuyoshi Nakada | 2020-05-04 | 1 | -0/+4
  * Folding results should not be empty.
    If `OnigCodePointCount(to->n)` were 0, the `for` loop using `fn` would not execute and the `ncs` elements would be left uninitialized.
    ```
    enc/unicode.c:557:21: warning: 'ncs[0]' may be used uninitialized in this function [-Wmaybe-uninitialized]
      557 |       for (i = 0; i < ncs[0]; i++) {
          |                       ~~~^~~
    ```
  * Cast to `enum yytokentype`.
    Additional enums for scanner events by ripper are not included in `yytokentype`.
    ```
    ripper.y:7274:28: warning: implicit conversion from 'enum <anonymous>' to 'enum yytokentype' [-Wenum-conversion]
    ```
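The `-Wmaybe-uninitialized` situation described in the first item can be reproduced in miniature. The sketch below is not the code from enc/unicode.c: `first_count`, `MAX_FOLD`, and the doubling step are hypothetical names and logic chosen for illustration. It shows one way to keep `ncs[0]` defined when the filling loop runs zero times.

```c
#include <stddef.h>

enum { MAX_FOLD = 3 };  /* hypothetical upper bound on folded code points */

/* Hypothetical helper: fills ncs[i] for i < n, then reads ncs[0].
 * Without the initializer, ncs[0] would be indeterminate when n == 0,
 * which is the pattern gcc flags as "may be used uninitialized". */
static int
first_count(const int *lens, size_t n)
{
    int ncs[MAX_FOLD] = {0};  /* zero-initialize so every element is defined */
    size_t i;

    if (n > MAX_FOLD) n = MAX_FOLD;  /* never write past the array */
    for (i = 0; i < n; i++) {
        ncs[i] = lens[i] * 2;        /* placeholder computation */
    }
    return ncs[0];                   /* safe even if the loop body never ran */
}
```

An alternative with the same effect is an early return when `n == 0`, which matches the commit's statement that folding results should not be empty.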
* implement special behavior for Georgian for String#capitalize | duerst | 2018-12-09 | 1 | -1/+5
  The modern Georgian script is special in that it has an 'uppercase' variant called MTAVRULI which can be used for emphasis of whole words, for screamy headlines, and so on. However, in contrast to all other bicameral scripts, there is no usage of capitalizing the first letter in a word or a sentence. Words with mixed capitalization are not used at all.

  We therefore implement special behavior for String#capitalize. Formally, we define String#capitalize as first applying String#downcase for the whole string, then using titlecase on the first letter. Because Georgian defines titlecase as the identity function both for MTAVRULI ('uppercase') and Mkhedruli (lowercase), this results in String#capitalize being equivalent to String#downcase for Georgian. This avoids undesirable mixed case.

  * enc/unicode.c: Actual implementation
  * string.c: Add mention of this special case for documentation
  * test/ruby/enc/test_case_mapping.rb: Add two tests, a general one that uses String#capitalize on some (including nonsensical) combinations of MTAVRULI and Mkhedruli, and a canary test to detect the potential assignment of characters to the currently open slots (holes) at U+1CBB and U+1CBC.
  * test/ruby/enc/test_case_comprehensive.rb: Tweak generation of expectation data.

  Together with r65933, this closes issue #14839.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66300 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
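The formal definition above (downcase everything, then titlecase the first letter) can be sketched as follows. This is a toy model, not the implementation in enc/unicode.c: `CodePoint`, `toy_downcase`, `toy_titlecase`, and `toy_capitalize` are hypothetical names, and only ASCII and Georgian are handled. It assumes the Mtavruli letters at U+1C90..U+1CBA and U+1CBD..U+1CBF map to Mkhedruli at a fixed offset, with the holes at U+1CBB and U+1CBC left alone.

```c
#include <stddef.h>

typedef unsigned int CodePoint;  /* stand-in for OnigCodePoint */

/* Mtavruli letters, skipping the unassigned holes U+1CBB and U+1CBC. */
static int
is_mtavruli(CodePoint c)
{
    return (c >= 0x1C90 && c <= 0x1CBA) || (c >= 0x1CBD && c <= 0x1CBF);
}

/* Toy downcase: ASCII uppercase and Mtavruli only. */
static CodePoint
toy_downcase(CodePoint c)
{
    if (c >= 'A' && c <= 'Z') return (c - 'A') + 'a';
    if (is_mtavruli(c)) return (c - 0x1C90) + 0x10D0;  /* Mtavruli -> Mkhedruli */
    return c;
}

/* Toy titlecase: ASCII lowercase goes to uppercase; Georgian is the identity. */
static CodePoint
toy_titlecase(CodePoint c)
{
    if (c >= 'a' && c <= 'z') return (c - 'a') + 'A';
    return c;
}

/* capitalize = downcase the whole string, then titlecase the first letter. */
static void
toy_capitalize(CodePoint *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) s[i] = toy_downcase(s[i]);
    if (len > 0) s[0] = toy_titlecase(s[0]);
}
```

Because `toy_titlecase` leaves Georgian untouched, `toy_capitalize` on Georgian input produces the same result as downcasing alone, so no mixed-case word can come out.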
* remove obsolete data from unicode.c | duerst | 2018-12-06 | 1 | -26/+0
  * unicode.c: Remove the arrays onigenc_unicode_GCB_ranges_GAZ, onigenc_unicode_GCB_ranges_E_Base, and onigenc_unicode_GCB_ranges_Emoji, because they are not needed anymore for Unicode 11.0.0.
  * regparse.c: Remove external declarations for above arrays.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66232 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* solve the genie/zombie/wrestlers bug | duerst | 2018-12-02 | 1 | -8/+10
  enc/unicode.c:
  - Add U+1F93C (WRESTLERS), U+1F9DE (GENIE), and U+1F9DF to onigenc_unicode_GCB_ranges_E_Base.
  - Add comments with character names.

  test/ruby/enc/test_emoji_breaks.rb: Activate tests for genie/zombie/wrestlers.

  This closes issue #15343.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66133 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* Added words in the comment at r65088 [ci skip] | nobu | 2018-11-30 | 1 | -2/+2
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66103 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* deal with ONIGENC_CASE_IS_TITLECASE flag on lowercase characters | duerst | 2018-11-25 | 1 | -4/+9
  In the function onigenc_unicode_case_map() in enc/unicode.c, deal with the case that the ONIGENC_CASE_IS_TITLECASE flag is set on lowercase characters. This is in preparation for Georgian Mtavruli, which are uppercase but not titlecase, in Unicode 11.0.0.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65971 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode.c: 'a' is bigger than 'A' | shyouhei | 2018-11-16 | 1 | -2/+4
  In ASCII, 'a' is bigger than 'A', which means 'A' - 'a' is a negative number (-32, to be precise). In C, the type of 'a' and 'A' is signed int (cf: ISO/IEC 9899:1990 section 6.1.3.4), so 'A' - 'a' is also a signed int. It is `(signed int)-32`.

  The problem is that OnigCodePoint is unsigned int. Adding a negative number to a variable of type OnigCodePoint (`code` here) introduces an unintentional cast of `(unsigned)(signed)-32`, which is 4,294,967,264. Adding this value to `code` then overflows, and the result eventually becomes a normal codepoint.

  This series of operations is not a serious problem, but because `code >= 'a'` holds, we can write `(code - 'a') + 'A'` to reroute this.

  See also: https://github.com/k-takata/Onigmo/pull/107

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65752 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
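The two expression shapes discussed above can be compared side by side. In this minimal sketch, `CodePoint` is a stand-in for OnigCodePoint and the function names are hypothetical. Both functions assume `code` is in the range 'a'..'z'.

```c
typedef unsigned int CodePoint;  /* stand-in for OnigCodePoint */

/* Original shape: 'A' - 'a' is the int -32. In the addition it is converted
 * to unsigned (4,294,967,264 when unsigned int is 32 bits wide), and the sum
 * wraps around modulo 2^32 back to the intended value. */
static CodePoint
upcase_wrapping(CodePoint code)
{
    return code + ('A' - 'a');
}

/* Rerouted shape: since code >= 'a', code - 'a' is a small non-negative
 * offset, so no intermediate value depends on wraparound. */
static CodePoint
upcase_rerouted(CodePoint code)
{
    return (code - 'a') + 'A';
}
```

Unsigned wraparound is well defined in C, so both functions return the same value for lowercase ASCII input. The rerouted form simply avoids the surprising intermediate value and the compiler warnings it can trigger.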
* revert r65091, r65090 because ci fails | duerst | 2018-10-16 | 1 | -9/+4
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65093 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* update to Unicode 11.0.0 (basic step, not complete yet) | duerst | 2018-10-16 | 1 | -4/+9
  - common.mk: Change Unicode version to 11.0.0
  - enc/unicode/case-folding.rb, enc/unicode.c: Initial changes to deal with Georgian Mtavruli. This should bring us up to the same level as e.g. Python 3.7, by following the Unicode tables exactly. But it will produce undesirable (mixed-case) results for String#capitalize. This will be addressed in a later commit.
  - enc/unicode/11.0.0, enc/unicode/11.0.0/casefold.h, enc/unicode/name2ctype.h: Add generated files.
  - lib/unicode_normalize/tables.rb: Updated table.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65091 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* Removed data for old Unicode [ci skip] | nobu | 2018-10-16 | 1 | -28/+2
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65088 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* unicode.c: moved additional GCB ranges | nobu | 2018-10-15 | 1 | -0/+52
  * enc/unicode.c: moved additional Grapheme Cluster Break ranges which depend on the Unicode version.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65087 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* regparse.c: Suppress duplicated range warning by mere \X | nobu | 2018-10-15 | 1 | -2/+0
  * regparse.c (node_extended_grapheme_cluster): as Unicode 10 has added Grapheme_Cluster_Break properties to some characters, remove duplicated ranges for Unicode 9.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65086 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* Merge Onigmo 6.0.0 | naruse | 2016-12-10 | 1 | -123/+125
  * https://github.com/k-takata/Onigmo/blob/Onigmo-6.0.0/HISTORY
  * fix for ruby 2.4: https://github.com/k-takata/Onigmo/pull/78
  * suppress warning: https://github.com/k-takata/Onigmo/pull/79
  * include/ruby/oniguruma.h: include onigmo.h.
  * template/encdb.h.tmpl: ignore duplicated definition of EUC-CN in enc/euc_kr.c. It is defined in enc/gb2312.c with CRuby macro.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@57045 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* remove special processing for U+03B9/U+03BC/U+A64B | duerst | 2016-12-04 | 1 | -13/+4
  * enc/unicode.c: Remove special processing for U+03B9/U+03BC/U+A64B (GREEK SMALL LETTERs IOTA/MU, CYRILLIC SMALL LETTER MONOGRAPH UK) from onigenc_unicode_case_map and simplify code.
  * enc/unicode/case-folding.rb: Remove check for U+03B9/U+03BC/U+A64B.

  This and the previous few related commits make sure that we won't hit the equivalent of bug #12990 anymore for future updates of Unicode versions.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56976 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* constify CaseMappingSpecials | nobu | 2016-12-01 | 1 | -1/+1
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56951 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* fix uppercasing for U+A64B, CYRILLIC SMALL LETTER MONOGRAPH UK | duerst | 2016-11-30 | 1 | -2/+2
  * enc/unicode.c: Add U+A64B to the special cases 03B9 and 03BC at the end of onigenc_unicode_case_map (Bug #12990).
  * enc/unicode/case-folding.rb: Add U+A64B to the special cases 03B9 and 03BC. Add a comment pointing to enc/unicode.c. Change warnings to exceptions for unpredicted cases, because this would have been more easily noticed (the warning was not noticed when upgrading to Unicode 9.0.0).
  * test/ruby/enc/test_case_comprehensive.rb: Remove temporary exclusion of U+A64B from testing.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56941 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * regenc.h/c, include/ruby/oniguruma.h, enc/ascii.c, big5.c, cp949.c, | duerst | 2016-07-24 | 1 | -8/+0
  emacs_mule.c, euc_jp.c, euc_kr.c, euc_tw.c, gb18030.c, gbk.c, iso_8859_1|2|3|4|5|6|7|8|9|10|11|13|14|15|16.c, koi8_r.c, koi8_u.c, shift_jis.c, unicode.c, us_ascii.c, utf_16|32be|le.c, utf_8.c, windows_1250|51|52|53|54|57.c, windows_31j.c, unicode.c: Remove conditional compilation macro ONIG_CASE_MAPPING. [Feature #12386].

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55740 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* Move generated headers to unicode data directory | nobu | 2016-07-17 | 1 | -2/+20
  * common.mk, enc/depend (casefold.h, name2ctype.h): move to unicode data directory per version.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55701 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * string.c: Raise ArgumentError when invalid string is detected in | duerst | 2016-06-02 | 1 | -1/+7
  case mapping methods.
  * enc/unicode.c: Check for invalid string and signal with negative length value.
  * test/ruby/enc/test_case_mapping.rb: Add tests for above.
  * test/ruby/test_m17n_comb.rb: Add a message to clarify test failure.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55253 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Handle DOTLESS_i by hand because it isn't involved in folding. | duerst | 2016-05-25 | 1 | -1/+5
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55164 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Fix flag error for switch from titlecase to lowercase. | duerst | 2016-05-24 | 1 | -1/+4
  * test/ruby/enc/test_case_mapping.rb: Tests for above error.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55153 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.h: Additional uses of ONIG_CASE_MAPPING compilation switch | duerst | 2016-05-16 | 1 | -0/+4
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55020 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * append newline at EOF. | svn | 2016-05-16 | 1 | -1/+1
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55019 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * include/ruby/oniguruma.h: Introducing ONIG_CASE_MAPPING compilation | duerst | 2016-05-16 | 1 | -0/+4
  switch
  * include/ruby/oniguruma.h, enc/unicode.h: Using ONIG_CASE_MAPPING compilation switch

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55018 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode/case-folding.rb, casefold.h: Data generation to implement | duerst | 2016-04-01 | 1 | -1/+8
  swapcase functionality for titlecase characters. Swapcase isn't defined by Unicode, because the purpose/usage of swapcase is unclear anyway. The implementation follows a proposal from Nobu, swapping the case of each component of a titlecase character individually. This means that the titlecase characters have to be decomposed.
  * enc/unicode.c: Code using the above data.
  * test/ruby/enc/test_case_mapping.rb: Tests for the above.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54469 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
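The per-component approach to swapcase described above can be illustrated with a single titlecase character. The sketch below is a toy, not the table-driven code in enc/unicode.c: `toy_swap` and `toy_swapcase` are hypothetical names, and only U+01C5 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON) is decomposed, into 'D' followed by U+017E.

```c
#include <stddef.h>

typedef unsigned int CodePoint;  /* stand-in for OnigCodePoint */

/* Toy case swap: ASCII letters plus the Z WITH CARON pair U+017D/U+017E. */
static CodePoint
toy_swap(CodePoint c)
{
    if (c >= 'A' && c <= 'Z') return (c - 'A') + 'a';
    if (c >= 'a' && c <= 'z') return (c - 'a') + 'A';
    if (c == 0x017D) return 0x017E;
    if (c == 0x017E) return 0x017D;
    return c;
}

/* Swapcase for one code point. The titlecase character U+01C5 is first
 * decomposed into its components, then each component is swapped on its own.
 * Writes the result to out (room for 2) and returns the count written. */
static size_t
toy_swapcase(CodePoint c, CodePoint out[2])
{
    if (c == 0x01C5) {
        out[0] = toy_swap('D');     /* uppercase component becomes lowercase */
        out[1] = toy_swap(0x017E);  /* lowercase component becomes uppercase */
        return 2;
    }
    out[0] = toy_swap(c);
    return 1;
}
```

The result for U+01C5 is two code points, 'd' followed by U+017D, which is why the data generator has to emit decomposed entries for titlecase characters.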
* fix a typo [ci skip] | kazu | 2016-03-29 | 1 | -1/+1
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54400 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode/case-folding.rb, casefold.h: Tweaked handling of 6 | duerst | 2016-03-29 | 1 | -5/+10
  special cases in CaseUnfold_11_Table.
  * enc/unicode.c: Adjustments for above.
  * test/ruby/enc/test_case_mapping.rb: Tests for the above: Some tests in test_titlecase activated; test_greek added. A test in test_cherokee fixed.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54383 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Cleaned up some comments. | duerst | 2016-03-29 | 1 | -7/+6
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54349 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode/case-folding.rb, casefold.h: Removing data for idempotent | duerst | 2016-03-29 | 1 | -8/+6
  titlecasing.
  * enc/unicode.c: Adjust code to data removal.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54347 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Refactoring in preparation for data reduction for | duerst | 2016-03-28 | 1 | -5/+8
  titlecase.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54313 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Minor refactoring for I WITH DOT ABOVE. | duerst | 2016-03-28 | 1 | -4/+3
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54312 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Removed code now covered by data from table. | duerst | 2016-03-28 | 1 | -6/+0
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54311 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Adding comments. [ci skip] | duerst | 2016-03-28 | 1 | -7/+7
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54310 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * include/ruby/oniguruma.h: Additional flag for characters that are titlecase. | duerst | 2016-03-22 | 1 | -1/+6
  * enc/unicode/case-folding.rb, casefold.h: Using above flag in data.
  * enc/unicode.c: Marking capitalized character as unmodified if it is already titlecase.
  * test/ruby/enc/test_case_mapping.rb: Tests for above functionality.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54229 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Fixed two macro definitions. | duerst | 2016-03-17 | 1 | -2/+2
  * test/ruby/enc/test_case_mapping.rb: Test cases that detected the above bugs.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54140 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Eliminating common code. | duerst | 2016-03-15 | 1 | -29/+13
  (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54118 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Expansion of some code repetition in preparation for | duerst | 2016-03-15 | 1 | -9/+13
  elimination of common code pieces. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54117 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * remove trailing spaces. | svn | 2016-03-15 | 1 | -1/+1
  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54113 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Additional macros and code to use mapping data in | duerst | 2016-03-15 | 1 | -21/+67
  CaseMappingSpecials array. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54112 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * include/ruby/oniguruma.h, enc/unicode.c: Adjusting flag assignments | duerst | 2016-03-14 | 1 | -0/+7
  and macros to work with unified CaseMappingSpecials array. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54101 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* unicode.c: off-by-one error | nobu | 2016-03-12 | 1 | -1/+1
  * enc/unicode.c (CodePointListValidP): fix off-by-one error.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54091 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* unicode.c: boundary check | nobu | 2016-03-12 | 1 | -6/+14
  * enc/unicode.c (CodePointListValidP): add pathological boundary check, for gcc 4.9.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54090 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode/case-folding.rb, casefold.h: Streamlining approach to | duerst | 2016-03-11 | 1 | -2/+10
  case mapping data not available from case folding by unifying all three cases (special title, special upper, special lower).
  * enc/unicode.c: Adjust macro names for above (macros are currently inactive). (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54085 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * include/ruby/oniguruma.h: Rearranging flag assignments and making | duerst | 2016-02-24 | 1 | -5/+1
  space for titlecase indices; adding additional macros to add or extract titlecase index; adding comments for better documentation.
  * enc/unicode.c: Moving some macros to include/ruby/oniguruma.h; activating use of titlecase indices. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53915 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode/case-folding.rb, casefold.h: Outputting actual titlecase | duerst | 2016-02-23 | 1 | -2/+2
  data (new table, with indices from other tables).
  * enc/unicode.c: Ignoring titlecase data indices for the moment. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53906 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Activated use of case mapping data in CaseUnfold_11 array. | duerst | 2016-02-19 | 1 | -0/+9
  (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53870 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * string.c, enc/unicode.c: Disassociating ONIGENC_CASE_FOLD flag from | duerst | 2016-02-08 | 1 | -1/+1
  ONIGENC_CASE_DOWNCASE. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53778 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* unicode.c: magic numbers | nobu | 2016-02-08 | 1 | -33/+37
  * enc/unicode.c (I_WITH_DOT_ABOVE, DOTLESS_i, DOT_ABOVE): name magic numbers.

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53776 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * enc/unicode.c: Shortened macros for enc/unicode/casefold.h to | duerst | 2016-02-08 | 1 | -10/+11
  single-letter; use flags in casefold.h for logic.
  * enc/unicode/case-folding.rb: Added flag for case folding. Changed parameter passing.
  * enc/unicode/casefold.h: New flags added. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53775 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* * common.mk: Added two more precondition files for enc/unicode/casefold.h | duerst | 2016-02-07 | 1 | -0/+10
  * enc/unicode.c: Added shortening macros for enc/unicode/casefold.h
  * enc/unicode/case-folding.rb: Fixed file encoding for CaseFolding.txt to ASCII-8BIT (should fix some ci errors). Clarified usage. Created class MapItem. Partially implemented class CaseMapping. (with Kimihito Matsui)

  git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@53767 b2dd03c8-39d4-4d8f-98ff-823fe69b080e