path: root/l1_char_class_tab.h
Commit message [Author, Age, Files, Lines]
* Change handy.h macro names to be C standard conformant [Karl Williamson, 2022-06-12, 1 file, -768/+768]
| | | | | | | C reserves symbols beginning with underscores for its own use. This commit moves the underscore so it is trailing, which is legal. The symbols changed here are many of the ones in handy.h that have significant uses outside it.
* Remove EBCDIC-only code [Karl Williamson, 2021-08-07, 1 file, -30/+30]
| | | | The previous commit stopped using this code, so can just get rid of it.
* l1_char_class_tab.h: Add bits for binary, octal digits [Karl Williamson, 2020-01-13, 1 file, -24/+24]
| | | | | | The motivation behind these extra bits is to allow three functions that deal with, respectively, binary, octal, and hex data to use the same paradigm, and hence be collapsible into a single function.
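A minimal sketch of the idea, assuming an ASCII platform: once each digit class has its own bit, a single routine parameterized by the class bit and the number of bits per digit can parse all three bases. Every name here (`NUM_BIT_*`, `num_digit_bits`, `num_digit_value`, `num_parse`) is hypothetical and not taken from the Perl source, and overflow checking is omitted.

```c
#include <stddef.h>

/* Hypothetical class bits, one per digit class. */
enum { NUM_BIT_BINARY = 1, NUM_BIT_OCTAL = 2, NUM_BIT_HEX = 4 };

/* Stand-in for a table lookup: which digit classes does c belong to? */
static unsigned num_digit_bits(unsigned char c)
{
    unsigned bits = 0;
    if (c == '0' || c == '1')
        bits |= NUM_BIT_BINARY;
    if (c >= '0' && c <= '7')
        bits |= NUM_BIT_OCTAL;
    if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                               || (c >= 'A' && c <= 'F'))
        bits |= NUM_BIT_HEX;
    return bits;
}

/* Numeric value of a digit already known to be a hex digit (ASCII only). */
static unsigned num_digit_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return (unsigned)(c - '0');
    return (unsigned)((c | 0x20) - 'a' + 10);   /* fold case, then offset */
}

/* One parser for all three bases: class_bit selects the legal digits,
 * shift is the number of bits each digit contributes (1, 3, or 4).
 * Stops at the first character outside the class. No overflow check. */
static unsigned long num_parse(const char *s, unsigned class_bit,
                               unsigned shift, size_t *consumed)
{
    unsigned long value = 0;
    size_t i = 0;

    while (s[i] != '\0' && (num_digit_bits((unsigned char)s[i]) & class_bit)) {
        value = (value << shift) | num_digit_value((unsigned char)s[i]);
        i++;
    }
    if (consumed)
        *consumed = i;
    return value;
}
```

With this shape, `num_parse("777", NUM_BIT_OCTAL, 3, NULL)` and `num_parse("fF", NUM_BIT_HEX, 4, NULL)` go through the same loop; only the two parameters differ.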
* l1_char_class_tab.h: Remove some special EBCDIC cases [Karl Williamson, 2019-10-09, 1 file, -180/+180]
| | | | These are no longer needed.
* Check for \n in EBCDIC code pages [Karl Williamson, 2019-03-06, 1 file, -2/+2]
| | | | | | | IBM says that there are 13 characters whose code point varies depending on the EBCDIC code page. They fail to mention that the \n character may also vary. This commit adds checks for \n, in addition to the checks for the 13 graphic variant ones.
* regen/mk_PL_charclass.pl: Revamp [Karl Williamson, 2018-06-25, 1 file, -1/+1]
| | | | | | | | | | | The change in 5.28 to having precompiled Unicode properties leaves this program with a chicken-and-egg problem. Prior to this commit, it used those properties to construct its output, relying on them to be using the latest Unicode data, but the code that generates the tables from that data uses the output of this program, with potentially disastrous results. This commit changes to use the data itself, through Unicode::UCD.
* regen/mk_PL_charclass.pl: sort output table [Karl Williamson, 2018-06-25, 1 file, -372/+372]
| | | | | This makes it easier to verify that future commits don't change anything.
* Stop accepting deprecated NBSP in \N{} [Karl Williamson, 2016-05-09, 1 file, -3/+3]
| | | | As scheduled for 5.26, this construct will no longer be accepted.
* Don't generate EBCDIC POSIX-BC tables [Karl Williamson, 2016-01-14, 1 file, -263/+0]
| | | | | | | | | | | This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tar ball smaller. See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Change EBCDIC macro definition [Karl Williamson, 2015-09-04, 1 file, -45/+45]
| | | | | | This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC platforms to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* l1_char_class_tab.h: Add comments [Karl Williamson, 2015-09-04, 1 file, -285/+285]
| | | | | | This adds the I8 value (used for generating UTF-EBCDIC) for bytes where it differs from the regular value on the EBCDIC portions of this header. This value is useful in debugging.
* l1_char_class_tab.h: Add bits for UTF-EBCDIC [Karl Williamson, 2015-09-04, 1 file, -270/+270]
| | | | This is for the next commit.
* regen/mk_PL_charclass.pl: Suppress extra null array element [Karl Williamson, 2015-08-01, 1 file, -4/+4]
| | | | | | We don't output a trailing comma after the final element in these C arrays, and thus prevent the C compiler from generating a useless null element
* \s matching VT is no longer experimental [Karl Williamson, 2015-02-21, 1 file, -32/+32]
| | | | | | | This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* regcomp.c: Make macro a lookup [Karl Williamson, 2014-09-06, 1 file, -28/+28]
| | | | | | | | | | | The recently introduced macro isMNEMONIC_CNTRL has a look-up and several tests in it, which occupy time and space. Since it was only used for debugging, that did not matter much, but future commits will use it in more mainline code. This commit changes it to be a single look-up, using up one of the spare bits available for that purpose in PL_charclass. There are enough available bits that we aren't likely to run out, really ever. (We can always add a 2nd word of bits if necessary.)
* Deprecate unescaped literal "{" in regex patterns [Karl Williamson, 2014-06-12, 1 file, -28/+28]
| This commit also causes escaped (by a backslash) "(", "[", and "{" to be considered literally. In the previous 2 Perl versions, the escaping was ignored, and a (default-on) deprecation warning was raised. Now that we have warned for 2 release cycles, we can change the meaning of escaping to actually do something.
|
| Warning when a literal left brace is not escaped by a backslash will allow us to eventually use this character in more contexts as being meta, allowing us to extend the language. For example, the lower limit of a quantifier could be omitted, and better error checking instituted, or things like \w could be followed by a {...} indicating some special word character, like \w{Greek} to restrict to just Greek word characters.
|
| We tried to do this in v5.16, and many CPAN modules changed to backslash their left braces at that time. However, we had to back out that change before 5.16 shipped because it turned out that escaping a left brace in some contexts didn't work, namely when the brace would normally be a metacharacter (for example surrounding a quantifier), and the pattern delimiters were { }. Instead we raised the useless backslash warning mentioned above, which has now been there for the requisite 2 cycles.
|
| This patch partially reverts 2 patches. The first, e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted the deprecation of unescaped literal left brace. The other, 4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of the useless left-characters.
|
| Note that, as in the original attempt to deprecate, we don't raise a warning if the left brace is the first character in the pattern. This is because in that position it can't be a metacharacter, so we don't require any disambiguation, and we found that if we did raise an error, there were quite a few places where this occurred.
* regcomp.c: Skip work that is a no-op [Karl Williamson, 2014-06-01, 1 file, -36/+36]
| | | | | | | | | | | | There are a few characters in the Latin1 range that can be folded to by above-Latin1 characters. Some of these are folded to as part of a single character fold, like KELVIN SIGN folds to 'k'. More are folded to as part of a multi-character fold. Until this commit, there wasn't a quick way to distinguish between the two classes. A couple of places only want the single-character ones. It is more efficient to look for just those than to include the multi-char ones which end up not doing anything. This uses a bit in l1_char_class_tab.h to indicate those characters that are in the desired class.
* regen/mk_PL_charclass.pl: Update to use EBCDIC utilities [Karl Williamson, 2014-05-31, 1 file, -0/+795]
| | | | | This causes the generated l1_char_class_tab.h to be valid on all supported platforms
* regen/mk_PL_charclass.pl: Rmv hard-coded char names [Karl Williamson, 2014-05-31, 1 file, -66/+66]
| | | | | | | | | | | | | Since this program was written, the abbreviated names of the control characters have become available from charnames::viacode(). We change to use these instead of hard-coding them in. At the same time, this shortens the names for some of the other characters in cases where it is easy to read the short ones. It also changes to use mnemonics instead of hard-coded ordinals, like using ASCII instead of x < 128. This allows it to be run on an EBCDIC platform.
* Deprecate certain rare uses of backslashes within regexes [Karl Williamson, 2013-01-19, 1 file, -7/+7]
| There are three pairs of characters that Perl recognizes as metacharacters in regular expression patterns: {}, [], and (). These can be used as well to delimit patterns, as in:
|
|     m{foo}
|     s(foo)(bar)
|
| Since they are metacharacters, they have special meaning to regular expression patterns, and it turns out that you can't turn off that special meaning by the normal means of preceding them with a backslash, if you use them, paired, within a pattern delimited by them. For example, in
|
|     m{foo\{1,3\}}
|
| the backslashes do not change the behavior, and this matches "f", "o" followed by one to three more occurrences of "o".
|
| Usages like this, where they are interpreted as metacharacters, are exceedingly rare; we think there are none, for example, in all of CPAN. Hence, this deprecation should affect very little code. It does give notice, however, that any such code needs to change, which will in turn allow us to change the behavior in future Perl versions so that the backslashes do have an effect, and without fear that we are silently breaking any existing code.
* regex: Add pseudo-Posix class: 'cased' [Karl Williamson, 2012-12-31, 1 file, -117/+117]
| /[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property \p{Cased}. This commit introduces a pseudo-Posix class, internally named 'cased', to represent this. This class isn't specifiable by the user, except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug output will say ':cased:'.
|
| The regex parsing of either :lower: or :upper: will change them into :cased:, where already existing logic can handle this, just like any other class.
|
| This commit fixes the regression introduced in 3018b823898645e44b8c37c70ac5c6302b031381, and also fixes the fact that these have never worked under 'use locale'.
|
| The next commit will un-TODO the tests for these things.
* handy.h: Create isALPHANUMERIC() and kin [Karl Williamson, 2012-12-22, 1 file, -127/+127]
| Perl has had an undocumented macro isALNUMC() for a long time. I want to document it, but the name is very obscure. Neither Yves nor I is sure what it stands for. My best guess is "C's alnum". It corresponds to /[[:alnum:]]/, and so its best name would be isALNUM(). But that is the name long given to what matches \w. A new synonym, isWORDCHAR(), has been in place for several releases for that, but the old isALNUM() should remain for backwards compatibility.
|
| I don't think that the name isALNUMC() should be published, as it is too close to isALNUM(). I finally came to the conclusion that isALPHANUMERIC() is the best name; it describes its purpose clearly; the disadvantage is its long length. I doubt that it will get much use, but we need something, I think, that we can publish to accomplish this functionality.
|
| This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I intended to do that separately, but made a mistake in rebasing, and combined the two patches; and it seemed like not a big enough problem to separate them out again.)
* regexes: Add \v to table of latin1 char classes [Karl Williamson, 2012-11-19, 1 file, -5/+5]
| | | | | | | This will be used in future commits to allow \v and \V to be treated consistently with other character classes. (Doing the same for \h isn't necessary, as it matches identically to [:blank:] in the entire Unicode range.)
* regen/mk_PL_charclass.pl: Use mktables table for charname [Karl Williamson, 2012-11-11, 1 file, -1/+1]
| | | | | | | | | This commit uses the mktables defined table for whether or not a character is a legitimate charname continuation. This will allow it to be kept in sync with other code that needs the definition. The only change this makes is to delete "colon" from being a legitimate continuation character. A colon was only accepted because it was used in the paradigm for names like "Greek: Alpha", and is not part of any actual character name.
* regen/mk_PL_charclass.pl: Move code to subroutine [Karl Williamson, 2012-10-20, 1 file, -1/+1]
| | | | | | | This code is for just this property and was kludged in to be executed in the general loop. It makes more sense for it to be in the subroutine that handles the property that was just added in a prior commit. It also changes the output slightly. The Latin1 sharp S isn't a non-final fold, unlike what was said previously.
* regen/mk_PL_charclass.pl: Add bit for if character folds [Karl Williamson, 2012-10-11, 1 file, -115/+115]
| | | | | | This takes the existing mktables-generated table that lists all characters that participate in any way in a fold, and creates a bit for it in l1_char_class_tab.h
* mktables: Generate tables for chars that aren't in final fold pos [Karl Williamson, 2012-08-02, 1 file, -19/+19]
| | | | | | | | | | This starts with the existing table that mktables generates that lists all the characters in Unicode that occur in multi-character folds, and aren't in the final positions of any such fold. It generates data structures with this information to make it quickly available to code that wants to use it. Future commits will use these tables.
* handy.h: Free up bits in PL_charclass[] [Karl Williamson, 2012-07-24, 1 file, -256/+256]
| This array is a bit map containing the Posix and similar character classes for the first 256 code points. Prior to this commit many character classes were represented by two bits, one for characters that are in it over the full Latin-1 range, and one for just the ASCII characters that are in it. The number of bits in use was approaching the 32-bit limit available without playing games.
|
| This commit takes advantage of a recent commit that adds a bit to the table for all the ASCII characters, and the fact that the ASCII characters in a character class are a subset of the full Latin1 range. So, a character is an ASCII-range character with the given character class iff both the full-range character class bit and the ASCII bit are set. A new internal macro is created to generate code to determine if a character is an ASCII range character with the given class. It's not clear if the generated code is faster or slower than the full range version.
|
| The result is that nearly half the bits are freed up, as the ones for the ASCII-range are now redundant.
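The two-bit test described above can be sketched as follows. This is an illustration only: `SK_CC_ALPHA`, `SK_CC_ASCII`, `sk_class_bits`, and both macros are hypothetical names, the function stands in for a lookup in the real table, and the alphabetic test is deliberately simplified and assumes an ASCII/Latin-1 platform.

```c
#include <stdint.h>

/* Hypothetical bits: one per class over the full Latin-1 range,
 * plus a single shared bit meaning "this code point is ASCII". */
#define SK_CC_ALPHA  (1u << 0)
#define SK_CC_ASCII  (1u << 1)

/* Stand-in for the table lookup. The Latin-1 letter test is simplified:
 * it treats 0xC0..0xFF except the multiplication and division signs
 * as alphabetic. */
static uint32_t sk_class_bits(unsigned char c)
{
    uint32_t bits = 0;
    if (c < 128)
        bits |= SK_CC_ASCII;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7))
        bits |= SK_CC_ALPHA;
    return bits;
}

/* Full-range test: only the class bit matters. */
#define SK_IS_CLASS(c, bit) \
    ((sk_class_bits(c) & (bit)) != 0)

/* ASCII-restricted test: true iff BOTH the class bit and the ASCII bit
 * are set, so no separate per-class "_A" bit is needed. */
#define SK_IS_CLASS_A(c, bit) \
    ((sk_class_bits(c) & ((bit) | SK_CC_ASCII)) == ((bit) | SK_CC_ASCII))
```

Under these assumptions, 0xE9 (e with acute) passes `SK_IS_CLASS` for `SK_CC_ALPHA` but fails `SK_IS_CLASS_A`, while 'q' passes both.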
* handy.h: Move bit shifting into base macro [Karl Williamson, 2012-07-24, 1 file, -256/+256]
| | | | | | This changes the #defines to be just the shift number, while doing the shifting in the macro that the number is passed to. This will prove useful in future commits.
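As a rough sketch of that arrangement (all names are hypothetical, not the ones handy.h uses): the class constants are plain bit numbers, and one base macro turns a number into a mask at the point of use.

```c
#include <stdint.h>

/* Class identifiers are bit numbers, not pre-shifted masks. */
#define EX_CC_DIGIT 0
#define EX_CC_SPACE 1

/* The base macro does the shifting, so callers pass only the number. */
#define EX_CC_MASK(classnum) ((uint32_t)1 << (classnum))

/* Test whether a table word has the bit for the given class number. */
#define EX_HAS_CLASS(word, classnum) \
    (((word) & EX_CC_MASK(classnum)) != 0)
```

For example, a word built as `EX_CC_MASK(EX_CC_SPACE)` satisfies `EX_HAS_CLASS(word, EX_CC_SPACE)` but not `EX_HAS_CLASS(word, EX_CC_DIGIT)`.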
* handy.h: l1_charclass.h: Add bit for matching ASCII [Karl Williamson, 2012-07-24, 1 file, -128/+128]
| | | | | | | This does not replace the isASCII macro definition, as I think the current one is more efficient than what this one provides. But future commits will rely on all the named character classes (e.g., /[[:ascii:]]/) having a bit, and this is the only one missing.
* mk_PL_charclass.pl: Allow to work on early Unicodes [Karl Williamson, 2012-06-02, 1 file, -2/+1]
| | | | | | If the version of Unicode being compiled doesn't have the modern casefolding .txt file, get the values from Unicode::UCD. Do the same for EBCDIC, where otherwise the file would have to be translated.
* handy.h: New defn of isOCTAL_A() to free up bit [Karl Williamson, 2012-05-22, 1 file, -8/+8]
| | | | | | | | | | | | The new definition is likely slightly faster, as it replaces an array lookup with a mask. Comments are also added, listing the other possible candidates for this treatment, though the speed differential is unclear as they would also add an extra test. A U32 is used to store the information about the various properties for a character. This frees up one bit of that for future other use.
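One way a mask can replace the table lookup, shown as a sketch that assumes an ASCII-compatible encoding in which '0' through '7' occupy 0x30 through 0x37; the function name `oct_is_octal_ascii` is hypothetical and this is not presented as the exact definition in handy.h.

```c
/* The eight octal digits differ only in their low three bits, so clearing
 * those bits leaves '0' for exactly the characters '0'..'7'.
 * Assumes an ASCII-compatible encoding ('0' == 0x30). */
static int oct_is_octal_ascii(unsigned int c)
{
    return (c & ~7u) == '0';
}
```

No array access is involved, and the test also rejects code points above the Latin-1 range without a separate bounds check.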
* Experimentally add VT to \s definition [Karl Williamson, 2012-05-22, 1 file, -1/+1]
| | | | | | | | | | | | | | This commit is the minimal necessary to get \s to match the vertical tab. It is being done early in the 5.17 series in order to see what repercussions there might be from doing this. It may well be that we decide that this change will require a 'use feature' to activate. In any event there is significant documentation of the behavior without the VT that this patch does not address at all. Tom Christiansen asked Larry Wall why \s did not include VT, and reported that Larry replied that he did not remember, but had no objections to adding it.
* l1_char_class_tab.h: Add field for quotemeta [Karl Williamson, 2012-02-15, 1 file, -117/+117]
| | | | | This changes this header to include a bit for each character indicating if it should be quoted by quotemeta under unicode_strings
* Unicode 6.1 [Karl Williamson, 2012-02-04, 1 file, -2/+2]
| | | | | | This commit delivers the official Unicode character database files for release 6.1, plus the final bits needed to cope with the changes in them from release 6.0, including documentation.
* mk_PL_charclass.pl: Revise comments, gen'd header [Karl Williamson, 2011-10-01, 1 file, -2/+2]
|
* Revert "l1_char_class_tab.h: Remove multi-char fold targets" [Karl Williamson, 2011-02-14, 1 file, -21/+21]
| | | | | | | | This reverts commit 88c8c9616516015e2fe0b502cdb92dc4efcc0c10. It turns out that these multi-char fold targets are now needed. In a future commit, I plan to compile in the dozen or so rules that are needed to avoid a Latin1-only regex from having to go out to the utf8 tables to avoid the performance penalty; or calling code can use the also forthcoming 'use re "/aa"'.
* l1_char_class_tab.h: Remove multi-char fold targets [Karl Williamson, 2011-02-04, 1 file, -21/+21]
| | | | | | | These are not currently used, and slow things down, as regular expressions that have them, such as /[Etl]/i now have to go out and load utf8 code. This remains the case, though, for bracketed character classes that include [KkSs].
* Move mk_PL_charclass.pl from Porting/ to regen/ [Nicholas Clark, 2011-01-24, 1 file, -1/+1]
|
* Convert mk_PL_charclass.pl to use regen_lib.pl [Nicholas Clark, 2011-01-24, 1 file, -1/+9]
| | | | | | | Change it to read CaseFolding.txt from lib/unicore, instead of the file installed with perl, so that it can run with an uninstalled perl. Add "read only" editor blocks to l1_char_class_tab.h
* Move the non-generated parts of l1_char_class_tab.h out into handy.h [Nicholas Clark, 2011-01-24, 1 file, -46/+0]
| | | | | Now l1_char_class_tab.h contains only the output of Porting/mk_PL_charclass.pl.
* l1_char_class_tab.h: include multi-char folds [Karl Williamson, 2010-12-15, 1 file, -21/+21]
| | | | This patch is the result of running mk_PL_charclass.pl
* l1_char_class_tab.h: Add new bit to table. [Karl Williamson, 2010-11-22, 1 file, -9/+9]
| | | | | | The output of the revised Porting/mk_PL_charclass.pl is here incorporated into this .h, with a #define for the new bit that signifies if a character participates in a fold with a non-latin1 character.
* l1_char_class_tab.h: Wrong for ALNUMC [Karl Williamson, 2010-10-31, 1 file, -65/+65]
| | | | | The generated table was wrong in the Latin1 range for characters with the ALNUMC property.
* Add 256 word bit table of character classes [Karl Williamson, 2010-09-25, 1 file, -0/+303]
This patch adds a table for looking up character classes. It is 256 words long, in l1_char_class_tab.h, with each word corresponding to the ordinal of a Latin1 character, and each word contains a bit map of all the properties that character matches. Each property has a bit or two. Ones named _CC_property_A are true only if the character is also in the ASCII character set. Ones named _CC_property_L1 do not have this restriction. (L1 stands for Latin1.) Also added is a script that generates the table. It is not anticipated that this will need to be used often. (This commit was changed from its original form by Steffen.)
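The layout described here can be illustrated with a small sketch. It is not the generated header: the bit names (`DEMO_CC_*`), the table `demo_charclass`, and the helpers are hypothetical, the table is filled at run time rather than emitted by a script, and the Latin-1 letter test is simplified and assumes an ASCII/Latin-1 platform.

```c
#include <stdint.h>

/* Hypothetical bit assignments: each property gets an ASCII-only bit (_A)
 * and a full Latin-1 bit (_L1). The real header defines its own set. */
#define DEMO_CC_DIGIT_A   (1u << 0)
#define DEMO_CC_DIGIT_L1  (1u << 1)
#define DEMO_CC_ALPHA_A   (1u << 2)
#define DEMO_CC_ALPHA_L1  (1u << 3)

/* One word per code point 0..255; each word is a bit map of properties. */
static uint32_t demo_charclass[256];

/* Fill the table. The real table is a generated static initializer;
 * computing it here just keeps the sketch short. */
static void demo_init(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        int ascii_alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        /* Simplified: 0xC0..0xFF minus the multiplication/division signs. */
        int high_alpha = c >= 0xC0 && c != 0xD7 && c != 0xF7;
        uint32_t bits = 0;

        if (c >= '0' && c <= '9')
            bits |= DEMO_CC_DIGIT_A | DEMO_CC_DIGIT_L1;
        if (ascii_alpha)
            bits |= DEMO_CC_ALPHA_A | DEMO_CC_ALPHA_L1;
        if (high_alpha)
            bits |= DEMO_CC_ALPHA_L1;   /* Latin-1 letter, not ASCII */
        demo_charclass[c] = bits;
    }
}

/* A class test is one array index plus one mask. */
static int demo_has(unsigned char c, uint32_t bit)
{
    return (demo_charclass[c] & bit) != 0;
}
```

After `demo_init()`, `demo_has('7', DEMO_CC_DIGIT_A)` is true, while 0xE9 has `DEMO_CC_ALPHA_L1` set but not `DEMO_CC_ALPHA_A`.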