path: root/regcharclass.h
Commit message | Author | Age | Files | Lines
* pod/perluniprops: Split run-on lines before '\' | Karl Williamson | 2020-05-27 | 1 | -1/+1
| | | | | | | | | | | | This changes mktables, which generates this pod, to consider long pod lines to be splittable before most backslashes. On os390, the lack of this caused a line to not be split at all, creating a Porting test failure. There is also a current rule that you can split at a lowercase/uppercase boundary. This works for the limited domain this code is run on. But it shouldn't split \cK. So don't do the split if the lowercase is a single letter preceded by a backslash.
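The splitting rule described above can be sketched as follows. This is only an illustration in Python, not the mktables code: the function name `split_points` is hypothetical, and the sketch treats every backslash as a break candidate, whereas the commit says "most" backslashes.

```python
def split_points(line):
    """Return indexes at which a long line may be broken.

    Assumed rules, modelled on the commit message:
      1. a break is allowed immediately before a backslash;
      2. a break is allowed at a lowercase/uppercase boundary, unless the
         lowercase letter is a single letter preceded by a backslash
         (so \\cK is never split between 'c' and 'K').
    """
    points = []
    for i in range(1, len(line)):
        prev, cur = line[i - 1], line[i]
        if cur == "\\":
            points.append(i)
        elif prev.islower() and cur.isupper():
            # Is the lowercase letter a lone letter right after a backslash?
            escaped_single_letter = i >= 2 and line[i - 2] == "\\"
            if not escaped_single_letter:
                points.append(i)
    return points
```

For example, `split_points("fooBar")` allows a break between `foo` and `Bar`, while `split_points(r"ab\cK")` only allows one before the backslash.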
* Fix a bunch of repeated-word typos | Dagfinn Ilmari Mannsåker | 2020-05-22 | 1 | -1/+1
| | | | | Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* mktables: Change named sequences to 5 digits | Karl Williamson | 2020-03-20 | 1 | -1/+1
| | | | | This makes them correspond to names for single characters, and will make parsing easier in the next commits.
* lib/unicore/mktables: use function signatures | Nicolas R | 2020-03-19 | 1 | -1/+1
| | | | | | Also regenerate files depending on lib/unicore/mktables: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
* Implement \p{Name=/.../} wildcards | Karl Williamson | 2020-03-11 | 1 | -1/+1
| | | | | This commit adds wildcard subpatterns for the Name and Name Aliases properties.
* mktables: Calculate legal chars in algorithmic names | Karl Williamson | 2020-03-11 | 1 | -1/+1
| | | | | | | | | | | Many ideographic character names are of the form 'prefix-code_point'. For these, we know that the legal characters are just the ones in the prefix, the dash, and uppercase hex digits. For each series of such names, this commit figures out which characters are legal in that series, and adds that info to the hash describing the series. This will be used in a later commit to rule out entire series when matching under some circumstances, without having to try any individual matches within it.
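A rough sketch of the idea, in Python rather than the Perl used by mktables. The function names are hypothetical, and the "required literal characters" of a pattern are assumed to have been extracted already; the commit does not say how that is done.

```python
HEX_DIGITS = set("0123456789ABCDEF")

def legal_chars(prefix):
    """Characters that can appear in any name of the form 'prefix-HEX'."""
    return set(prefix) | {"-"} | HEX_DIGITS

def can_rule_out_series(prefix, required_literals):
    """True if a pattern that must contain all of `required_literals`
    cannot match any name in the series, so the series can be skipped
    without trying the individual names."""
    return not set(required_literals) <= legal_chars(prefix)
```

With the prefix `CJK UNIFIED IDEOGRAPH`, a pattern that requires a literal `Z` can be rejected for the whole series at once, while one that requires only `4E` cannot.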
* Reformat lib/unicore/Name.pl | Karl Williamson | 2020-03-11 | 1 | -2/+2
| | | | | | | | | | | | | | | | | | | | This changes the format of this generated file so that it can more easily be used with the Unicode Name property in wildcard matching. Each line will now end with \n\n, and the \t characters are replaced by \n. Thus an entry will look like "00001\nSTART OF HEADING\n\n". This makes matching of user-defined patterns using anchors work under /m, which commit 4829f32decd128e6a122bd8ce35fe944bd87f104 forces. That commit also changed some anchors' definitions to make them match \n under /m with wildcards, so this makes it all transparent to user patterns. The double \n\n at the end of an entry is so that the code can distinguish between a line that contains a code point vs a name without relying on the content; it is a disambiguator, like the \t used to be.
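To illustrate how the trailing \n\n disambiguates the two kinds of line, here is a small Python sketch. It is not the Perl implementation: the three-entry `TABLE` and the function `names_matching` are made up for the example, and Python's `re.MULTILINE` stands in for /m.

```python
import re

# Tiny stand-in for the generated table: code point, newline, name, blank line.
TABLE = (
    "00001\nSTART OF HEADING\n\n"
    "00041\nLATIN CAPITAL LETTER A\n\n"
    "00042\nLATIN CAPITAL LETTER B\n\n"
)

def names_matching(user_pattern):
    """Return (code point, name) pairs whose name line matches the pattern."""
    hits = []
    for m in re.finditer(user_pattern, TABLE, re.MULTILINE):
        line_start = TABLE.rfind("\n", 0, m.start()) + 1
        line_end = TABLE.find("\n", m.end())
        # A name line is followed by "\n\n"; a code point line by one "\n".
        if line_end == -1 or TABLE[line_end:line_end + 2] != "\n\n":
            continue
        name = TABLE[line_start:line_end]
        # The code point is the line just before the name line.
        cp_end = line_start - 1
        cp_start = TABLE.rfind("\n", 0, cp_end) + 1
        entry = (int(TABLE[cp_start:cp_end], 16), name)
        if not hits or hits[-1] != entry:  # skip repeat hits in one name
            hits.append(entry)
    return hits
```

A pattern such as `^0+1$` does match the text of a code point line, but that line is followed by a single newline, so the hit is discarded without looking at what the line contains.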
* Restrict features in wildcards | Karl Williamson | 2020-02-19 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | | | | | The algorithm for dealing with Unicode property wildcards is to wrap the user-supplied pattern with /miaa. We don't want the user to be able to override the /m and /aa parts. Modifiers that are only specifiable as a modifier in a qr or similar op (like /gc) can't be included in things like (?gc). These normally incur a warning that they are ignored, but the texts of those warnings are misleading when using wildcards, so I chose to just make them illegal. Of course that could be changed to having custom useful warning texts, but I didn't think it was worth it. I also chose to forbid recursion of using nested \p{}, just from fear that it might lead to issues down the road, and it really isn't useful for this limited universe of strings to match against. Because wildcards currently can't handle '}' inside them, only the single letter \p,\P are valid anyway. Similarly, I forbid the '*' quantifier to make it harder for the constructed subpattern to take forever to make any progress and decide to halt. Again, using it would be overkill on the universe of possible match strings.
* Update to latest Unicode 13.0 | Karl Williamson | 2020-02-19 | 1 | -6/+6
| | | | | | Unicode has made minor changes in its data files since I added the beta versions to Perl 5.31. These are still beta; the final release date is March 10. I thought it best to get the latest into Perl 5.31.9.
* mktables: Handle versioning of non-UCD files | Karl Williamson | 2020-02-19 | 1 | -1/+1
| | | | | | | | | | | | | Unicode has lately been asking implementations to support non-Unicode Character Database properties. Files for these contain a different versioning syntax than the UCD files. Previously I was hand-editing those files before committing to bring them to use a consistent style. But that is tedious, and I decided to invest a little time to be able to handle all the current versioning syntaxes automatically, to save having to manually update in the future. This was complicated by the fact that some Unicode non-UCD files have BOM marks on many comment lines. I submitted a trouble report to them.
* perluniprops: Fix missing backslash | Karl Williamson | 2020-02-13 | 1 | -1/+1
|
* Add qr/\p{Name=...}/ | Karl Williamson | 2020-02-12 | 1 | -1/+1
| | | | | | | | | | | | | | | This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names. This commit includes a comparison of the two approaches, added to perlunicode. But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit. I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.
* Support Unicode properties Identifier_(Status|Type) | Karl Williamson | 2020-02-03 | 1 | -1/+3
| | | | | | | | | | These non-UCD properties are now being asked to be supported by the Unicode regular expression specification, UTS #18. These have a slightly different header syntax for giving the version than UCD files. In this commit, I modify these to fit, but will probably have to generalize at some point the parsing of versions in mktables.
* mktables: Generalize the scx property handling | Karl Williamson | 2020-02-03 | 1 | -1/+1
| | | | | | | | | Until now, this property was unique in that it specifies a set of possible values for scripts that a character can be in, rather than a single script. That multiplicity has been handled specially. But the next couple of commits will introduce another property that has similar characteristics. This commit makes the scx handling more general, so as to also be usable for the new property.
* mktables: Improve warning msg | Karl Williamson | 2020-02-03 | 1 | -1/+1
|
* mktables: Add capability to override match directory | Karl Williamson | 2020-02-03 | 1 | -1/+1
| | | | | | This is because this is still supposed to work on DOS 8.3 filesystems, and future commits will use non-Unicode-Character-Database tables which don't have shorter names.
* Use Unicode 13.0 (beta) | Unicode Consortium | 2020-01-30 | 1 | -47/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unicode has changed its yearly release cycle so that the final version is not available until early March of the year. This year it is March 10, 2020. However, all changes planned were finalized in early January, and the actual computer files have been updated to their presumably final substantive versions. The release has been authorized without further review needed. The release is awaiting final documentation additions, and soak time of the beta to verify there are no glitches. This commit causes Perl to participate in that soak. I don't anticipate any problems, and likely the only substantive change upon the official release will be to update perldelta. Comments in the files supplied by Unicode will likely also change to indicate these are no longer beta. There were very few changes affecting existing characters; most of the changes involved adding new characters, including emoji. The break characteristics of some existing characters were changed (GCB, LB, WB, and SB properties). The only perl code I really had to change to cope with the new release was about rules in the Line Break property, dealing around ellipses (...) and certain East Asian characters next to opening parentheses. If there are problems, we can revert this at any time, and ship with 12.0.
* Change Unicode property abbrev to upcoming official | Karl Williamson | 2020-01-30 | 1 | -1/+1
| | | | | | | | | | | | Unicode 12.0 used a new property file that was not from the Unicode Character Database. It only had a long property name. I incorporated it into our data, and rather than use the very long name all the time, I created my own short name, since there was no official one. Now, the upcoming 13.0 has moved the file to the UCD, and come up with a short name that differs from the one I had. This commit converts to use Unicode's name. This property is not exposed to user or XS space, so there is no user impact.
* PATCH GH #17025 \p{user-defined} overrides official Unicode | Karl Williamson | 2019-12-09 | 1 | -1/+1
| | | | Prior to this patch, they only sometimes overrode.
* Remove generation and use of NonFinalFold table | Karl Williamson | 2019-11-16 | 1 | -1/+1
| | | | | | With the revamping done in cc288b7a2732c37504039083ebb98241954636be, the table of Unicode case folds that are more than a single character is no longer used, so there is no need to generate it or keep it available.
* mktables: Fix non-final-fold table | Karl Williamson | 2019-11-16 | 1 | -1/+1
| | | | | | This wasn't generating the correct values. It is no longer used, and the next commit will remove it, but I wanted to get it right, in case it is ever needed again.
* regcharclass.h: Add some macros | Karl Williamson | 2019-11-16 | 1 | -8/+352
| | | | | | These macros will be used in a future commit, and are for three-character folds. regen/regcharclass*.pl are changed for this purpose.
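For readers unfamiliar with three-character folds, the following Python sketch lists code points whose full case fold is exactly three characters long. It relies on Python's `str.casefold()` and the Unicode data bundled with the interpreter, so it is only an approximation of the set Perl's generated macros cover; the function name is hypothetical.

```python
def three_char_folds():
    """Map each code point whose full case fold has length 3 to that fold."""
    folds = {}
    for cp in range(0x110000):
        folded = chr(cp).casefold()
        if len(folded) == 3:
            folds[cp] = folded
    return folds

FOLDS = three_char_folds()
# U+FB03 LATIN SMALL LIGATURE FFI folds to the three letters "ffi".
print(hex(0xFB03), repr(FOLDS[0xFB03]))
```

A regex engine matching case-insensitively has to recognise that the single character U+FB03 and the three-character string "ffi" are equivalent, which is why such folds get dedicated handling.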
* regen/regcharclass_multi_char_folds.pl: Simplify | Karl Williamson | 2019-11-16 | 1 | -1/+1
| | | | | This creates a simply named array instead of a more complicated array ref, so is easier to understand
* regen/regcharclass_multi_char_folds.pl: Use printable char | Karl Williamson | 2019-11-16 | 1 | -1/+1
| | | | | | | | It makes the result more legible if it uses the printable character instead of an escape sequence when appropriate. Although, currently, the value is re-escaped for output. This helped during debugging.
* regen/regcharclass_multi_char_folds.pl: Fix comments | Karl Williamson | 2019-11-16 | 1 | -1/+1
|
* Remove lib/unicore/Heavy.pl | Karl Williamson | 2019-11-06 | 1 | -2/+2
| | | | | | | This file was for the use of utf8_heavy.pl. But now that that is incorporated into Unicode::UCD, move the definitions from Heavy.pl to lib/unicore/UCD.pl which is used by Unicode::UCD. This allows removing package names.
* Remove utf8_heavy.pl | Karl Williamson | 2019-11-06 | 1 | -2/+2
| | | | | | | | | | | | | The only remaining user of this is Unicode::UCD, and so most of the code from utf8_heavy.pl is moved into that UCD.pm. It removes a no-longer relevant test (that had been changed into a skip anyway), and it changes or removes the no-longer relevant references in comments to utf8_heavy.pl Later commits will do some simplification as not all the previous functionality is needed. This commit removed only the parts that were preventing compilation and tests passing.
* Remove swashes from core | Karl Williamson | 2019-11-06 | 1 | -1/+1
| | | | Also references to the term.
* regen charclass_invlists.h | David Mitchell | 2019-10-03 | 1 | -2/+2
| | | | | | | This was missed from the previous commit. Also, fix a typo in regen/regcharclass.pl: it was still referring to itself as Porting/regcharclass.pl.
* mktables: Fix Named Sequences for EBCDIC | Karl Williamson | 2019-10-02 | 1 | -1/+1
| | | | This table wasn't being translated into native code points
* Fix "it it" typos | Dagfinn Ilmari Mannsåker | 2019-07-04 | 1 | -1/+1
| | | | And regen affected files
* /\p{InFoo} should only match blocks, or be user-defined | Karl Williamson | 2019-06-02 | 1 | -1/+1
| | | | | | | | | | For a property \p{Block=Foo}, we allow the synonym \p{InFoo} as documented variously, including perluniprops, even though this usage is discouraged, as a new Unicode release used in a new version of Perl could cause the synonym to no longer work. Prior to this commit, we erroneously allowed the synonym for other properties, such as \p{InKana} or \p{InS}.
* Unicode::UCD: Use L</Foo Bar>, not L<Foo Bar> | Karl Williamson | 2019-05-25 | 1 | -1/+1
|
* Update Unicode 12.1 | Karl Williamson | 2019-04-19 | 1 | -1/+1
| | | | | | | This takes the few latest changes in the draft Unicode 12.1, ahead of our freeze. None are substantive. No further non-substantive changes will be added. In the unlikely event that a substantive change is made, we will take it and potentially delay Perl 5.30.
* mktables: Silence warning | Karl Williamson | 2019-04-16 | 1 | -1/+1
| | | | A variable needed to be updated for Unicode 12.1
* mktables: Generalize handling of [perl #133979] | Karl Williamson | 2019-04-10 | 1 | -1/+1
| | | | | | I realized that commit f9c1e7e9ed13a16099c8471c2030b93deb482571 works now, but future Unicode versions may add fractions that fool it. This commit should handle any such event
* Preliminary Unicode 12.1 | Unicode Consortium | 2019-04-08 | 1 | -46/+46
|
* mktables: White-space only | Karl Williamson | 2019-04-06 | 1 | -1/+1
| | | | Indent block newly formed in previous commit
* PATCH: [perl #133979] uniprops02 failing on Windows | Karl Williamson | 2019-04-06 | 1 | -1/+1
| | | | | | | This turns out to be because Windows doesn't necessarily round to even on floating point %e conversions. The solution is to add an extra entry rounding up to odd when a fraction is precisely representable in binary. So far, the only case where this occurs is 1/32.
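The tie-breaking problem can be reproduced with exact decimal arithmetic. The Python sketch below is an assumption-laden illustration, not the mktables fix: it rounds to a number of decimal places rather than emulating %e, and both function names are hypothetical. It shows why 1/32 needs two table entries while a fraction like 1/3 needs only one.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP
from fractions import Fraction

def is_exact_in_binary(frac):
    """A fraction is exactly representable in binary floating point
    (range permitting) only if its denominator is a power of two."""
    d = frac.denominator
    return d & (d - 1) == 0

def candidate_roundings(frac, places):
    """Spellings a platform might print, depending on how it breaks ties."""
    exact = Decimal(frac.numerator) / Decimal(frac.denominator)
    quantum = Decimal(10) ** -places
    results = {
        str(exact.quantize(quantum, rounding=ROUND_HALF_EVEN)),
        str(exact.quantize(quantum, rounding=ROUND_HALF_UP)),
    }
    return sorted(results)
```

Because 1/32 is exactly 0.03125, rounding it to four places is a genuine tie: round-half-even gives 0.0312 and round-half-up gives 0.0313. Accepting both spellings makes the lookup independent of the platform's tie-breaking rule.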
* mktables: Turn off DEBUG | Karl Williamson | 2019-04-04 | 1 | -1/+1
| | | | This inadvertently was left on, slowing down the process a little
* Corrections to Unicode 12.0 | Unicode Consortium | 2019-04-02 | 1 | -18/+18
| | | | | | | | Somehow I missed updating some files with the result that a few official 12.0 final corrections did not make it into 906f46d96ca4ba2d1039d576954bc5a47868348c. These are mostly tests and break property changes for a few characters
* regcharclass.h: Change to use new inRANGE macro | Karl Williamson | 2019-03-30 | 1 | -296/+288
| | | | | This was done by changing regen/regcharclass.pl. This results in half the conditionals being needed, and in some cases better error checking.
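The saving in conditionals comes from a well-known trick: with unsigned arithmetic, a two-sided range test collapses into a single comparison. The Python sketch below emulates 32-bit unsigned wraparound with a mask; it illustrates the general technique and is not a transcription of Perl's actual inRANGE macro.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic

def in_range_two_tests(c, lo, hi):
    """The straightforward form: two comparisons."""
    return lo <= c and c <= hi

def in_range_one_test(c, lo, hi):
    """Single comparison: if c < lo, the subtraction wraps around to a
    huge unsigned value, which is necessarily greater than hi - lo."""
    return ((c - lo) & MASK) <= (hi - lo)
```

Both functions agree for any `0 <= c <= MASK` and `lo <= hi`, for example when testing whether a code point falls in `0x41..0x5A`.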
* Add tests for wildcards in Unicode property values | Karl Williamson | 2019-03-12 | 1 | -1/+1
|
* Add warnings category experimental::uniprop_wildcards | Karl Williamson | 2019-03-12 | 1 | -1/+1
|
* Check for \n in EBCDIC code pages | Karl Williamson | 2019-03-06 | 1 | -3/+3
| | | | | | | IBM says that there are 13 characters whose code point varies depending on the EBCDIC code page. They fail to mention that the \n character may also vary. This commit adds checks for \n, in addition to the checks for the 13 graphic variant ones.
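As a small illustration of code-page-dependent code points, the Python sketch below compares two EBCDIC codecs that ship with Python, `cp037` and `cp500`. These two pages are an assumption chosen for convenience; they expose only some of the variant characters, and whether \n differs depends on which pages are compared. The function name is hypothetical.

```python
import string

def variant_chars(page_a="cp037", page_b="cp500"):
    """Printable ASCII characters whose byte value differs between
    the two EBCDIC code pages."""
    return [
        ch for ch in string.printable
        if ch.encode(page_a) != ch.encode(page_b)
    ]

for ch in variant_chars():
    print(repr(ch), ch.encode("cp037").hex(), ch.encode("cp500").hex())
```

Letters and digits come out identical on both pages, while punctuation such as `[` lands on different bytes, which is the kind of difference the checks in this commit guard against.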
* Use Unicode 12.0 | Unicode Consortium | 2019-03-04 | 1 | -47/+47
| | | | Unicode 12.0 is finalized. Change to use it.
* mktables: Omit unnecessary duplicates | Karl Williamson | 2019-02-16 | 1 | -1/+1
| | | | These are in a generated structure.
* regen/regcharclass.pl: Remove obsolete macro | Karl Williamson | 2019-02-05 | 1 | -28/+1
| | | | This has been replaced by regen/unicode_constants.pl some releases ago.
* mktables: Make Turkic 'I' chars problematic | Karl Williamson | 2019-02-05 | 1 | -19/+46
| | | | | | | | | | | | In a Turkic locale, these are problematic because their mappings cross the 255/256 boundary. This change has the side effect of causing U+307 to be added to the problematic list, and it normally really isn't problematic, because in those locales where U+130 and U+131 are problematic, U+307 isn't used. But applications could switch in and out of Turkic locales, so it's best to keep considering it problematic. The consequences of making this mark problematic are simply slightly less optimized regex pattern code.
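A short Python sketch of what "crossing the 255/256 boundary" means here. The two mapping tables spell out the Turkic case pairs by hand, since Python's string methods are not locale-aware; the helper name is hypothetical. The last lines show where U+307 comes from under the default, non-Turkic mapping.

```python
# Turkic case pairs written out by hand (code point -> code point).
TURKIC_LOWER = {0x49: 0x131, 0x130: 0x69}   # I -> dotless i, dotted I -> i
TURKIC_UPPER = {0x69: 0x130, 0x131: 0x49}   # i -> dotted I, dotless i -> I

def crosses_255_256(cp_from, cp_to):
    """True if one side of a mapping is below 256 and the other is not."""
    return (cp_from < 256) != (cp_to < 256)

for table in (TURKIC_LOWER, TURKIC_UPPER):
    for src, dst in table.items():
        print(f"U+{src:04X} -> U+{dst:04X} crosses: {crosses_255_256(src, dst)}")

# Under the default mapping, U+0130 lowercases to 'i' plus
# U+0307 COMBINING DOT ABOVE, which is how U+307 gets involved.
default_lower = "\u0130".lower()
print([f"U+{ord(ch):04X}" for ch in default_lower])
```

Every one of the four Turkic mappings pairs a code point below 256 with one above it, so an optimization that assumes case changes stay on one side of that boundary is unsafe for these characters.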
* Move 2 property defns to mktables | Karl Williamson | 2018-12-25 | 1 | -1/+1
| | | | | | | | | These 2 Unicode-like property definitions used internally by the regular expression compiler are moved by this commit from regen/mk_invlists.pl to lib/unicore/mktables. By placing all these in the same place, maintainers only have to learn one bit of code, instead of two.