path: root/lib/Unicode
Commit message | Author | Age | Files | Lines
* Support Unicode 15.0 | Unicode Consortium | 2022-09-28 | 1 | -2/+2
|
* Add official Unicode normalization tests | Karl Williamson | 2022-06-06 | 1 | -0/+26
|
* Support Unicode 14.0 | Unicode Consortium | 2021-09-15 | 1 | -2/+2
|
* Unicode::UCD: Don't depend on a file's current syntax | Karl Williamson | 2021-08-31 | 1 | -17/+15
|   This generated file will be changed in a future commit. The code shouldn't
|   have been relying on its syntax anyway, but on the value it returns.
* UCD.t: Add test | Karl Williamson | 2021-08-31 | 1 | -0/+2
|
* Unicode::UCD: Fix typo in pod | Karl Williamson | 2021-08-31 | 1 | -2/+2
|
* Remove deprecated Unicode files | Karl Williamson | 2021-09-01 | 1 | -129/+14
|   These files were once apparently intended for use by modules to supplement
|   the core Unicode handling. They contain tables, in a form suitable for use
|   by Perl code, of the portions of the Unicode character database about
|   changing the case of characters and finding the numeric value of a given
|   \d character. In particular, they were designed for fast access using the
|   swash mechanism that has since been removed.
|
|   Unicode::UCD now contains more convenient methods of accessing the data
|   these files contain, and their use has been deprecated since 5.16. I could
|   not figure out a way to force a message should someone open and read one
|   of these files, but the text of each says that the file may be removed
|   without notice at any time. I did not find any uses of them on CPAN.
|
|   Unicode is adding new properties that the format of these files will not
|   be able to handle. Consequently I'm coming up with a new format. Though
|   these files don't contain the new properties, their existence means
|   carrying the burden of maintaining two separate mechanisms. Better to have
|   just one mechanism, suitable for going forward.
* Unicode::UCD: Bump version; regen | Karl Williamson | 2021-07-20 | 1 | -1/+1
|
* Unicode::UCD: Fix character name in POD | Jakub Wilk | 2021-07-20 | 1 | -1/+1
|
* Update UCD version. Remove changes to cpan Encode. Regen | Thibault DUPONCHELLE | 2021-06-17 | 1 | -1/+1
|
* Fix several unicode.org links | Thibault DUPONCHELLE | 2021-06-17 | 1 | -3/+3
|
* Reformat lib/unicore/Name.pl | Karl Williamson | 2020-03-11 | 2 | -24/+48
|   This changes the format of this generated file so that it can more easily
|   be used with the Unicode Name property in wildcard matching. Each line
|   will now end with \n\n, and the \t characters are replaced by \n. Thus an
|   entry will look like
|
|       00001\nSTART OF HEADING\n\n
|
|   This makes matching of user-defined patterns using anchors work under /m,
|   which commit 4829f32decd128e6a122bd8ce35fe944bd87f104 forces. That commit
|   also changed some anchors' definitions to make them match \n under /m with
|   wildcards, so this makes it all transparent to user patterns.
|
|   The double \n\n at the end of an entry is so that the code can distinguish
|   between a line that contains a code point and one that contains a name
|   without relying on the content; it is a disambiguator, as the \t used to
|   be.
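The entry layout described in this commit message can be illustrated with a small sketch. It is written in Python rather than Perl, the sample data and the helper name `find_code_points` are hypothetical, and it only mimics the idea of anchoring a user pattern to one name line under multi-line matching; it is not how the Perl core reads Name.pl.

```python
import re

# Hypothetical sample in the reformatted layout:
# code point, "\n", name, "\n\n".
SAMPLE = (
    "00001\nSTART OF HEADING\n\n"
    "00002\nSTART OF TEXT\n\n"
    "00041\nLATIN CAPITAL LETTER A\n\n"
)


def find_code_points(name_pattern, data=SAMPLE):
    """Return the code points whose whole name matches name_pattern."""
    # With re.MULTILINE, ^ and $ anchor at each "\n", so the user pattern is
    # confined to a single name line. The blank line that follows a name is
    # what tells a name line apart from a code point line.
    entry = re.compile(
        r"^([0-9A-F]+)\n(" + name_pattern + r")\n$", re.MULTILINE
    )
    return [int(m.group(1), 16) for m in entry.finditer(data)]


print(find_code_points(r"START OF .*"))
```

Because the name must be followed by an empty line, a pattern that matches only part of a name, such as `START`, finds nothing.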
* Use Unicode 13.0 (beta) | Unicode Consortium | 2020-01-30 | 1 | -1/+1
|   Unicode has changed its yearly release cycle so that the final version is
|   not available until early March of the year. This year it is March 10,
|   2020. However, all changes planned were finalized in early January, and
|   the actual computer files have been updated to their presumably final
|   substantive versions. The release has been authorized without further
|   review needed. The release is awaiting final documentation additions, and
|   soak time of the beta to verify there are no glitches.
|
|   This commit causes Perl to participate in that soak. I don't anticipate
|   any problems, and likely the only substantive change upon the official
|   release will be to update perldelta. Comments in the files supplied by
|   Unicode will likely also change to indicate these are no longer beta.
|
|   There were very few changes affecting existing characters; most of the
|   changes involved adding new characters, including emoji. The break
|   characteristics of some existing characters were changed (GCB, LB, WB,
|   and SB properties). The only perl code I really had to change to cope
|   with the new release was about rules in the Line Break property, dealing
|   around ellipses (...) and certain East Asian characters next to opening
|   parentheses.
|
|   If there are problems, we can revert this at any time, and ship with 12.0.
* Remove lib/unicore/Heavy.pl | Karl Williamson | 2019-11-06 | 2 | -82/+93
|   This file was for the use of utf8_heavy.pl. But now that that is
|   incorporated into Unicode::UCD, move the definitions from Heavy.pl to
|   lib/unicore/UCD.pl which is used by Unicode::UCD. This allows removing
|   package names.
* UCD.pm: Remove 'none' from swash | Karl Williamson | 2019-11-06 | 1 | -12/+4
|   This was only used by tr///, and hence is no longer relevant. I never
|   really understood it.
* Remove utf8_heavy.pl | Karl Williamson | 2019-11-06 | 2 | -46/+622
|   The only remaining user of this is Unicode::UCD, and so most of the code
|   from utf8_heavy.pl is moved into that UCD.pm.
|
|   It removes a no-longer relevant test (that had been changed into a skip
|   anyway), and it changes or removes the no-longer relevant references in
|   comments to utf8_heavy.pl.
|
|   Later commits will do some simplification as not all the previous
|   functionality is needed. This commit removed only the parts that were
|   preventing compilation and tests passing.
* Unicode::UCD: Use L</Foo Bar>, not L<Foo Bar> | Karl Williamson | 2019-05-25 | 1 | -2/+2
|
* lib/Unicode/UCD.t: Use standard Perl environment variable | Karl Williamson | 2019-05-25 | 1 | -3/+3
|   This test file invented its own environment variable, whereas everyone
|   else uses a different one. Make this one comply.
* Preliminary Unicode 12.1 | Unicode Consortium | 2019-04-08 | 1 | -1/+1
|
* Use Unicode 12.0 | Unicode Consortium | 2019-03-04 | 1 | -2/+2
|   Unicode 12.0 is finalized. Change to use it.
* fix typos | Alexandr Savca | 2018-10-09 | 1 | -2/+2
|   Committer: For porting tests: Update $VERSION in 4 files.
|   Run:
|       ./perl -Ilib regen/mk_invlists.pl
|       ./perl -Ilib regen/regcharclass.pl
* Use Unicode 11.0 | Unicode Consortium | 2018-07-20 | 1 | -2/+2
|   This completes the process of upgrading to Unicode 11.0.
* Unicode::UCD: Avoid uninit message | Karl Williamson | 2018-06-25 | 1 | -9/+12
|   I found a case where this array can be empty, so add a test for that to
|   avoid trying to look at the first (non-existent) element.
* Simplify Unicode::UCD::openunicode() and callers | Dagfinn Ilmari Mannsåker | 2017-12-28 | 1 | -43/+27
|   Get rid of the file-global filehandles and the unused filename return
|   value, instead return the filehandle and assign it to a lexical variable.
|
|   Also don't bother checking the return value; it croaks on failure anyway.
|
|   In passing, eliminate erroneous assignment of {} to %CASESPEC for
|   Unicode < 2.1.8.
* Fix typo in POD for new Unicode::UCD::num() variant | Dagfinn Ilmari Mannsåker | 2017-12-28 | 1 | -1/+1
|
* Unicode::UCD: Add optional parameter to num() | Karl Williamson | 2017-12-27 | 2 | -15/+56
|   As discussed in http://nntp.perl.org/group/perl.perl5.porters/244444,
|   this sets the optional scalar ref parameter to the length of the valid
|   initial portion of the first parameter passed to num(). This is useful in
|   teasing apart why the input is invalid.
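To make the "length of the valid initial portion" idea concrete, here is a rough Python sketch. It is not the Perl implementation: the function name `num_with_valid_length` is hypothetical, it returns the length instead of storing it through a reference, and it assumes a simplified validity rule (a non-empty run of decimal digits that all come from the same block of ten digits).

```python
import unicodedata


def num_with_valid_length(text):
    """Return (value, valid_len) for a string of decimal digits.

    value is None unless the whole string is valid. valid_len is the length
    of the valid initial portion, which shows where the input went wrong.
    """
    value = 0
    zero = None  # code point of DIGIT ZERO in the first digit's run of ten
    valid_len = 0
    for ch in text:
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            break  # not a decimal digit at all
        ch_zero = ord(ch) - digit
        if zero is None:
            zero = ch_zero
        elif ch_zero != zero:
            break  # a digit, but from a different run of ten digits
        value = value * 10 + digit
        valid_len += 1
    if valid_len == 0 or valid_len != len(text):
        return None, valid_len
    return value, valid_len


print(num_with_valid_length("12x4"))
```

For the input `"12x4"` the value is `None` and the length is 2, which points directly at the offending character.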
* Unicode::UCD.pm Add undocumented internal feature | Karl Williamson | 2017-12-24 | 1 | -4/+4
|   This allows charprop() to be called on a Perl-internal-only property
* Unicode::UCD: max code point is now IV_MAX | Karl Williamson | 2017-12-16 | 2 | -8/+12
|   Return the correct value when asked.
* Use Unicode 10.0 | Karl Williamson | 2017-06-20 | 1 | -1/+1
|   The new file from Unicode "extracted/DerivedName.txt" is not delivered
|   here, as Perl doesn't need it, as it duplicates information in other
|   files.
* Switch most open() calls to three-argument form. | John Lightsey | 2016-12-23 | 1 | -2/+2
|   Switch from two-argument form. Filehandle cloning is still done with the
|   two argument form for backward compatibility.
|
|   Committer: Get all porting tests to pass. Increment some $VERSIONs.
|   Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
|
|   For: RT #130122
* Fixup Unicode::UCD pod/version and regen dependent files | Matthew Horsfall | 2016-11-14 | 1 | -2/+2
|
* Additional warning of Name.pl going away | H.Merijn Brand | 2016-11-14 | 1 | -1/+2
|
* Unicode::UCD documentation for reading Name.pl as encouraged practice | H.Merijn Brand | 2016-11-14 | 1 | -2/+38
|
* Change \p{foo} to mean \p{scx: foo} | Karl Williamson | 2016-06-30 | 1 | -3/+10
|   when 'foo' is a script. Also update the pods correspondingly, and to
|   encourage scx property use.
|
|   See http://nntp.perl.org/group/perl.perl5.porters/237403
* Use Unicode 9.0 | Unicode Consortium | 2016-06-21 | 1 | -1/+1
|   This includes regenerating the files that depend on the Unicode 9 data
|   files
* Prepare for Unicode 9.0 | Karl Williamson | 2016-06-21 | 1 | -1/+12
|   The major code changes needed to support Unicode 9.0 are due to changes in
|   the boundary (break) rules, for things like \b{lb}, \b{wb}.
|
|   regen/mk_invlists.pl creates two-dimensional arrays for all these
|   properties. To see if a given point in the target string is a break or
|   not, regexec.c looks up the entry in the property's table whose row
|   corresponds to the code point before the potential break, and whose
|   column corresponds to the one after. Mostly this lookup fully determines
|   the answer, but for some cases, extra context is required, and the array
|   entry indicates this, and there has to be specially crafted code in
|   regexec.c to handle each such possibility. When a new release comes
|   along, mk_invlists.pl has to be changed to handle any new or changed
|   rules, and regexec.c has to be changed to handle any changes to the
|   custom code.
|
|   Unfortunately this is not a mature area of the Standard, and changes are
|   fairly common in new releases. In part, this is because new types of code
|   points come along, which need new rules. Sometimes it is because they
|   realized the previous version didn't work as well as it could. An example
|   of the latter is that Unicode now realizes that Regional Indicator (RI)
|   characters come in pairs, and that one should be able to break between
|   each pair, but not within a pair. Previous versions treated any run of
|   them as unbreakable. (Regional Indicators are a fairly recent type that
|   was added to the Standard in 6.0, and things are still getting shaken
|   out.)
|
|   The other main changes to these rules also involve a fairly new type of
|   character, emojis. We can expect further changes to these in the next
|   Unicode releases.
|
|   \b{gcb}, for the first time, now depends on context (in rarely
|   encountered cases, like RI's), so the function had to be changed from a
|   simple table look-up to be more like the functions handling the other
|   break properties.
|
|   Some years ago I revamped mktables in part to try to make it require as
|   few manual interventions as possible when upgrading to a new version of
|   Unicode. For example, a new data file in a release requires telling
|   mktables about it, but as long as it follows the format of existing
|   recent files, nothing else need be done to get whatever properties it
|   describes to be included.
|
|   Some of the changes to mktables involved guessing, from existing limited
|   data, what the underlying paradigm for that data was. The problem with
|   that is there may not have been a paradigm, just something they did ad
|   hoc, which can change at will; or I didn't understand their unstated
|   thinking, and guessed wrong.
|
|   Besides the boundary rule changes, the only change that the existing
|   mktables couldn't cope with was the addition of the Tangut script, whose
|   character names include the code point, like CJK UNIFIED IDEOGRAPH-3400
|   has always done. The paradigm for this wasn't clear, since CJK was the
|   only script that had this characteristic, and so I hard-coded it into
|   mktables. The way Tangut is structured may show that there is a paradigm
|   emerging (but we only have two examples, and there may not be a paradigm
|   at all), and so I have guessed one, and changed mktables to assume this
|   guessed paradigm. If other scripts like this come along, and I have
|   guessed correctly, mktables will cope with these automatically without
|   manual intervention.
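The row/column lookup with an "extra context needed" entry, as described in this commit message, can be sketched roughly as follows. This is a toy Python model, not the code in regexec.c: the two-class table, the constant names, and the helper functions are all assumptions made for illustration, and only the Regional Indicator pairing rule is modelled.

```python
BREAK, NO_BREAK, RI_CONTEXT = range(3)

# Hypothetical two-class table. Rows are the class of the code point before
# the candidate position, columns the class of the one after it.
CLASSES = {"Other": 0, "RI": 1}
TABLE = [
    # after: Other  RI
    [BREAK, BREAK],       # before: Other
    [BREAK, RI_CONTEXT],  # before: RI (the table alone cannot decide)
]


def classify(ch):
    # Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
    return "RI" if 0x1F1E6 <= ord(ch) <= 0x1F1FF else "Other"


def is_break(text, pos):
    """Report whether there is a break between text[pos-1] and text[pos]."""
    if pos <= 0 or pos >= len(text):
        return True  # start and end of text always break
    row = CLASSES[classify(text[pos - 1])]
    col = CLASSES[classify(text[pos])]
    entry = TABLE[row][col]
    if entry != RI_CONTEXT:
        return entry == BREAK
    # Extra context: count the run of RIs ending just before pos. An odd
    # count means pos falls inside a pair, so no break is allowed there.
    run = 0
    i = pos - 1
    while i >= 0 and classify(text[i]) == "RI":
        run += 1
        i -= 1
    return run % 2 == 0


flags = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"  # two RI pairs
print([p for p in range(len(flags) + 1) if is_break(flags, p)])
```

With four Regional Indicators in a row, the only interior break is between the second and third, which matches the "break between pairs, not within a pair" rule.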
* Unicode/UCD.t: better handling of errors | Karl Williamson | 2016-05-30 | 1 | -4/+9
|   This now looks for the PERL_DIFF_TOOL environment variable, and if found
|   uses that to display some problems. If not found, it uses is(), with a
|   message that better output is available through setting this variable.
|
|   PERL_DIFF_TOOL is a convention I wasn't familiar with.
* Add an example of the '0x' string format. | Jarkko Hietaniemi | 2016-05-30 | 1 | -2/+3
|   I am not certain that I find the 'leading zero means hex' format
|   recommendable ('0123' meaning '0x123', the octal format has poisoned the
|   well); but water under the bridge.
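A small Python sketch of the 'leading zero means hex' convention mentioned above. The only behavior taken from the commit message is that '0123' is read as '0x123'; the function name `parse_code_point`, the accepted `U+` prefix, and the treatment of other inputs are assumptions for illustration, not a description of the Unicode::UCD source.

```python
import re


def parse_code_point(arg):
    """Parse a code point given as a string, or return None if malformed."""
    # A string of digits with no leading zero is taken as decimal.
    if re.fullmatch(r"[1-9][0-9]*", arg):
        return int(arg, 10)
    # Anything else made of hex digits is taken as hexadecimal, with an
    # optional 0x or U+ prefix. This is where '0123' ends up as 0x123.
    m = re.fullmatch(r"(?:0[xX]|[Uu]\+)?([0-9A-Fa-f]+)", arg)
    if m:
        return int(m.group(1), 16)
    return None


print(parse_code_point("0123"), parse_code_point("123"))
```

The surprise the commit message alludes to is visible here: `"123"` and `"0123"` give different results, and neither is octal.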
* Unicode::UCD: Fix to work on very early Unicode versions | Karl Williamson | 2016-01-14 | 1 | -3/+9
|   Prior to this commit, it would not compile because 2 properties weren't
|   defined in very early Unicodes.
* PATCH [perl #120790] Unicode::UCD failure to warn on bad input | Karl Williamson | 2015-09-14 | 2 | -2/+25
|   This ticket was originally because the requester did not realize the
|   function Unicode::UCD::charscript took a code point argument instead of a
|   chr one. It was rejected on that basis. But discussion here suggested it
|   would be better to warn on bad input instead of just returning <undef>.
|   It turns out that all other routines in Unicode::UCD but charscript and
|   charblock already do warn. This commit extends that to the two outlier
|   routines.
* mktables: Fix up Name_Alias in early Unicodes | Karl Williamson | 2015-07-28 | 2 | -4/+4
|   perl needs the Name_Alias property accessible in all releases in order
|   for charnames to work properly. However the property was not created
|   until Unicode version 5.0. Previously, the property was made available to
|   all Unicode versions, which is contrary to the policy of exposing
|   properties to public use only when Unicode so exposes them. Thus the
|   behavior is as close as possible to Unicode-specified.
|
|   This commit creates an internal-only property for the perl core, and
|   removes the general access on early Unicode releases.
* Properly handle the Unicode kIICore property | Karl Williamson | 2015-07-28 | 2 | -3/+13
|   This property is not included in the standard Perl distribution, but it
|   is normative in the Unicode Unihan database and perl can be compiled to
|   include it.
|
|   This property is currently unique in that it operates much like how Perl
|   defines string truthfulness: non-empty values are considered true. That
|   is \p{kIICore} matches all characters which have a non-empty value for
|   this property, plus the actual values have meaning that may need to be
|   examined in some circumstances. These can be retrieved via
|   Unicode::UCD::prop_invmap().
|
|   Unicode 7.0 changed this property without my noticing, and went a very
|   different direction with it than I anticipated. And the perl interpreter
|   would loop when trying to deal with it under some circumstances.
|
|   This property is true for all 'core' Chinese/Japanese/Korean characters
|   that every implementation of CJK things should strive to handle, i.e.,
|   the minimally acceptable set, though the values now specify a precedence
|   as their first letter, A, B, or C (I suppose this means one could
|   implement just the A level ones first). The remaining letters in each
|   value encode the standards which were used as the source for the
|   character.
|
|   In previous versions of the Standard, every non-null value was the string
|   "2.1".
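The "non-empty value counts as true, and the value itself still carries meaning" behavior can be sketched in Python as below. The table contents are made-up sample values for illustration only and are not copied from the Unihan database; the names `K_IICORE`, `in_core`, and `precedence_and_sources` are likewise hypothetical.

```python
# Made-up sample values for illustration; real values come from Unihan.
K_IICORE = {
    0x4E00: "AGTJHKMP",  # precedence A, followed by source letters
    0x4E10: "CH",        # precedence C, one source letter
    0x3400: "",          # empty value: not in the core set
}


def in_core(cp):
    # Mimics the \p{kIICore} test: any non-empty value counts as true.
    return bool(K_IICORE.get(cp, ""))


def precedence_and_sources(cp):
    # The first letter is the precedence (A, B, or C); the remaining
    # letters identify the source standards.
    value = K_IICORE.get(cp, "")
    if not value:
        return None
    return value[0], list(value[1:])


print(in_core(0x4E00), precedence_and_sources(0x4E10))
```

A caller that only needs membership uses the boolean test, while one that wants to implement the A-level characters first would inspect the first letter of the value.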
* Unicode::UCD: Handle inverted input | Karl Williamson | 2015-07-28 | 1 | -1/+17
|   No current input comes inverted, but it could some time in the future,
|   and we wouldn't know. In one case, it's easy to handle, so do so; in
|   another, die with a message so it won't sneak past. At that point, if and
|   when it happens, time could be spent figuring out the best way to handle
|   the situation.
* lib/Unicode/UCD.t: White-space only | Karl Williamson | 2015-07-28 | 1 | -252/+252
|   Re-indent after the previous commit
* lib/Unicode/UCD.t: Fix to work on older Unicodes | Karl Williamson | 2015-07-28 | 1 | -54/+154
|   This commit causes this test file to pass even when run on old Unicodes,
|   back to 3.0, where it becomes just too much. Most tests aren't structured
|   to pass on older Unicodes, but it's somewhat important that this one
|   does.
* Unicode::UCD: Add pod text about old Unicodes | Karl Williamson | 2015-07-28 | 1 | -0/+9
|
* Unicode::UCD: Handle old Unicode files | Karl Williamson | 2015-07-28 | 1 | -1/+8
|   The formats of some older Unicode releases can be different than
|   previously expected.
* Unicode::UCD: Handle old Unicode Blocks file format | Karl Williamson | 2015-07-28 | 1 | -0/+4
|
* Handle Unicode 3.0.1 /i Turkish "i" rules | Karl Williamson | 2015-07-28 | 1 | -1/+23
|   Actually, there are no special rules for this Unicode release. All 4 "i"
|   characters (upper- and lowercase, dotted and dotless) are considered
|   equivalent under /i only in this release. This adds special cases that
|   are only compiled in for that release.
* Unicode::UCD: Remove dead code | Karl Williamson | 2015-07-28 | 1 | -2/+1
|   Nothing should get executed after a croak.