path: root/l1_char_class_tab.h
Commit message | Author | Age | Files | Lines
* Deprecate certain rare uses of backslashes within regexes | Karl Williamson | 2013-01-19 | 1 | -7/+7
  There are three pairs of characters that Perl recognizes as metacharacters in regular expression patterns: {}, [], and (). These can also be used to delimit patterns, as in:

      m{foo}
      s(foo)(bar)

  Since they are metacharacters, they have special meaning in regular expression patterns, and it turns out that you can't turn off that special meaning by the normal means of preceding them with a backslash if you use them, paired, within a pattern delimited by them. For example, in

      m{foo\{1,3\}}

  the backslashes do not change the behavior, and this matches "f", "o" followed by one to three more occurrences of "o".

  Usages like this, where they are interpreted as metacharacters, are exceedingly rare; we think there are none, for example, in all of CPAN. Hence, this deprecation should affect very little code. It does give notice, however, that any such code needs to change, which will in turn allow us to change the behavior in future Perl versions so that the backslashes do have an effect, without fear that we are silently breaking any existing code.

  =head1 Performance Enhancements
* regex: Add pseudo-Posix class: 'cased' | Karl Williamson | 2012-12-31 | 1 | -117/+117
  /[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property \p{Cased}. This commit introduces a pseudo-Posix class, internally named 'cased', to represent this. This class isn't specifiable by the user, except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug output will say ':cased:'.

  When the regex parser sees either :lower: or :upper: under /i, it changes them into :cased:, which the already existing logic can handle just like any other class.

  This commit fixes the regression introduced in 3018b823898645e44b8c37c70ac5c6302b031381, and also fixes the fact that these have never worked under 'use locale'.

  The next commit will un-TODO the tests for these things.
* handy.h: Create isALPHANUMERIC() and kin | Karl Williamson | 2012-12-22 | 1 | -127/+127
  Perl has had an undocumented macro isALNUMC() for a long time. I want to document it, but the name is very obscure. Neither Yves nor I is sure what the name stands for. My best guess is "C's alnum". It corresponds to /[[:alnum:]]/, and so its best name would be isALNUM(). But that is the name long given to what matches \w. A new synonym, isWORDCHAR(), has been in place for several releases for that, but the old isALNUM() should remain for backwards compatibility.

  I don't think that the name isALNUMC() should be published, as it is too close to isALNUM(). I finally came to the conclusion that isALPHANUMERIC() is the best name; it describes its purpose clearly; the disadvantage is its length. I doubt that it will get much use, but I think we need something that we can publish to provide this functionality.

  This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I intended to do that separately, but made a mistake in rebasing and combined the two patches; it did not seem like a big enough problem to separate them out again.)
* regexes: Add \v to table of latin1 char classes | Karl Williamson | 2012-11-19 | 1 | -5/+5
  This will be used in future commits to allow \v and \V to be treated consistently with other character classes. (Doing the same for \h isn't necessary, as it matches identically to [:blank:] in the entire Unicode range.)
* regen/mk_PL_charclass.pl: Use mktables table for charname | Karl Williamson | 2012-11-11 | 1 | -1/+1
  This commit uses the mktables-defined table for whether or not a character is a legitimate charname continuation. This will allow it to be kept in sync with other code that needs the definition.

  The only change this makes is to delete "colon" from being a legitimate continuation character. A colon was only accepted because it was used in constructs like "Greek: Alpha", and is not part of any actual character name.
* regen/mk_PL_charclass.pl: Move code to subroutine | Karl Williamson | 2012-10-20 | 1 | -1/+1
  This code is for just this property and was kludged in to be executed in the general loop. It makes more sense for it to be in the subroutine that handles the property, which was added in a prior commit.

  It also changes the output slightly. The Latin1 sharp S isn't a non-final fold, unlike what was said previously.
* regen/mk_PL_charclass.pl: Add bit for if character folds | Karl Williamson | 2012-10-11 | 1 | -115/+115
  This takes the existing mktables-generated table that lists all characters that participate in any way in a fold, and creates a bit for it in l1_char_class_tab.h.
* mktables: Generate tables for chars that aren't in final fold pos | Karl Williamson | 2012-08-02 | 1 | -19/+19
  This starts with the existing table that mktables generates that lists all the characters in Unicode that occur in multi-character folds, and aren't in the final positions of any such fold.

  It generates data structures with this information to make it quickly available to code that wants to use it. Future commits will use these tables.
* handy.h: Free up bits in PL_charclass[] | Karl Williamson | 2012-07-24 | 1 | -256/+256
  This array is a bit map containing the Posix and similar character classes for the first 256 code points. Prior to this commit many character classes were represented by two bits, one for characters that are in it over the full Latin-1 range, and one for just the ASCII characters that are in it. The number of bits in use was approaching the 32-bit limit available without playing games.

  This commit takes advantage of a recent commit that adds a bit to the table for all the ASCII characters, and the fact that the ASCII characters in a character class are a subset of the full Latin1 range. So a character is an ASCII-range character with the given character class iff both the full-range character class bit and the ASCII bit are set. A new internal macro is created to generate code to determine if a character is an ASCII-range character with the given class. It's not clear if the generated code is faster or slower than the full-range version.

  The result is that nearly half the bits are freed up, as the ones for the ASCII range are now redundant.
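The combined test described above can be sketched roughly as follows. This is only an illustration, not the macro from handy.h: the names DEMO_CC_ASCII, DEMO_CC_ALPHA, demo_word(), demo_is_cc() and demo_is_cc_ascii() are hypothetical, the table lookup is replaced by a small function, and an ASCII platform is assumed.

```c
#include <stdint.h>

/* Hypothetical class numbers (shift counts, not masks). */
#define DEMO_CC_ASCII 0
#define DEMO_CC_ALPHA 1

/* Stand-in for the table lookup: builds the bit word for one code point. */
static uint32_t demo_word(unsigned char c)
{
    uint32_t w = 0;

    if (c < 128)
        w |= 1u << DEMO_CC_ASCII;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7))
        w |= 1u << DEMO_CC_ALPHA;
    return w;
}

/* Full Latin-1 range test: a single bit. */
static int demo_is_cc(unsigned char c, int classnum)
{
    return (int)((demo_word(c) >> classnum) & 1u);
}

/* ASCII-range test: the class bit and the ASCII bit must both be set. */
static int demo_is_cc_ascii(unsigned char c, int classnum)
{
    uint32_t both = (1u << classnum) | (1u << DEMO_CC_ASCII);

    return (demo_word(c) & both) == both;
}
```

With this layout, demo_is_cc_ascii(0xE9, DEMO_CC_ALPHA) is false even though demo_is_cc(0xE9, DEMO_CC_ALPHA) is true, so no separate ASCII-only bit per class is needed.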
* handy.h: Move bit shifting into base macro | Karl Williamson | 2012-07-24 | 1 | -256/+256
  This changes the #defines to be just the shift number, while doing the shifting in the macro that the number is passed to. This will prove useful in future commits.
* handy.h: l1_charclass.h: Add bit for matching ASCII | Karl Williamson | 2012-07-24 | 1 | -128/+128
  This does not replace the isASCII macro definition, as I think the current one is more efficient than the one this bit would provide. But future commits will rely on all the named character classes (e.g., /[[:ascii:]]/) having a bit, and this is the only one missing.
* mk_PL_charclass.pl: Allow to work on early Unicodes | Karl Williamson | 2012-06-02 | 1 | -2/+1
  If the version of Unicode being compiled doesn't have the modern casefolding .txt file, get the values from Unicode::UCD. The same is done for EBCDIC, where otherwise the file would have to be translated.
* handy.h: New defn of isOCTAL_A() to free up bit | Karl Williamson | 2012-05-22 | 1 | -8/+8
  The new definition is likely slightly faster, as it replaces an array lookup with a mask.

  Comments are also added, listing the other possible candidates for this treatment, though the speed differential is unclear as they would also add an extra test.

  A U32 is used to store the information about the various properties for a character. This frees up one bit of that for other future use.
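One way a mask can stand in for the table lookup is sketched below. It is an assumption about the general technique rather than a copy of the macro in handy.h, and demo_is_octal_mask() is a hypothetical name. It relies on the octal digits occupying eight consecutive code points that start on a multiple of eight, which holds for ASCII.

```c
/* '0'..'7' are 0x30..0x37 in ASCII. Clearing the low three bits of any of
 * them yields '0', and no value outside that range does. */
static int demo_is_octal_mask(unsigned int c)
{
    return (c & ~7u) == '0';
}
```

Because every bit above the low three takes part in the comparison, values above 255 are rejected without a separate range check.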
* Experimentally add VT to \s definition | Karl Williamson | 2012-05-22 | 1 | -1/+1
  This commit is the minimal necessary to get \s to match the vertical tab. It is being done early in the 5.17 series in order to see what repercussions there might be from doing this. It may well be that we decide that this change will require a 'use feature' to activate. In any event there is significant documentation of the behavior without the VT that this patch does not address at all.

  Tom Christiansen asked Larry Wall why \s did not include VT, and reported that Larry replied that he did not remember, but had no objections to adding it.
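The effect of the change on the ASCII range can be pictured with two small predicates. This is a sketch only: the function names are hypothetical, the "before" set (space, tab, newline, form feed, carriage return) is assumed from the message's statement that VT was previously excluded, and non-ASCII whitespace is ignored.

```c
/* ASCII-range \s without the vertical tab (assumed earlier behavior). */
static int demo_space_without_vt(unsigned int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* ASCII-range \s with the vertical tab added, as this commit does. */
static int demo_space_with_vt(unsigned int c)
{
    return demo_space_without_vt(c) || c == '\v';
}
```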
* l1_char_class_tab.h: Add field for quotemeta | Karl Williamson | 2012-02-15 | 1 | -117/+117
  This changes this header to include a bit for each character indicating if it should be quoted by quotemeta under unicode_strings.
* Unicode 6.1 | Karl Williamson | 2012-02-04 | 1 | -2/+2
  This commit delivers the official Unicode character database files for release 6.1, plus the final bits needed to cope with the changes in them from release 6.0, including documentation.
* mk_PL_charclass.pl: Revise comments, gen'd header | Karl Williamson | 2011-10-01 | 1 | -2/+2
* Revert "l1_char_class_tab.h: Remove multi-char fold targets" | Karl Williamson | 2011-02-14 | 1 | -21/+21
  This reverts commit 88c8c9616516015e2fe0b502cdb92dc4efcc0c10.

  It turns out that these multi-char fold targets are now needed. In a future commit, I plan to compile in the dozen or so rules that are needed to keep a Latin1-only regex from having to go out to the utf8 tables, thus avoiding the performance penalty; alternatively, calling code can use the also forthcoming 'use re "/aa"'.
* l1_char_class_tab.h: Remove multi-char fold targets | Karl Williamson | 2011-02-04 | 1 | -21/+21
  These are not currently used, and slow things down, as regular expressions that have them, such as /[Etl]/i, now have to go out and load utf8 code. This remains the case, though, for bracketed character classes that include any of [KkSs].
* Move mk_PL_charclass.pl from Porting/ to regen/ | Nicholas Clark | 2011-01-24 | 1 | -1/+1
* Convert mk_PL_charclass.pl to use regen_lib.pl | Nicholas Clark | 2011-01-24 | 1 | -1/+9
  Change it to read CaseFolding.txt from lib/unicore, instead of the file installed with perl, so that it can run with an uninstalled perl.

  Add "read only" editor blocks to l1_char_class_tab.h.
* Move the non-generated parts of l1_char_class_tab.h out into handy.h | Nicholas Clark | 2011-01-24 | 1 | -46/+0
  Now l1_char_class_tab.h contains only the output of Porting/mk_PL_charclass.pl.
* l1_char_class_tab.h: include multi-char folds | Karl Williamson | 2010-12-15 | 1 | -21/+21
  This patch is the result of running mk_PL_charclass.pl.
* l1_char_class_tab.h: Add new bit to table. | Karl Williamson | 2010-11-22 | 1 | -9/+9
  The output of the revised Porting/mk_charclass.pl is here incorporated into this .h file, with a #define for the new bit that signifies if a character participates in a fold with a non-latin1 character.
* l1_char_class_tab.h: Wrong for ALNUMC | Karl Williamson | 2010-10-31 | 1 | -65/+65
  The generated table was wrong in the Latin1 range for characters with the ALNUMC property.
* Add 256 word bit table of character classes | Karl Williamson | 2010-09-25 | 1 | -0/+303
  This patch adds a table for looking up character classes. It is 256 words long, in l1_char_class_tab.h, with each word corresponding to the ordinal of a Latin1 character, and each word contains a bit map of all the properties that character matches. Each property has a bit or two. Ones named _CC_property_A are true only if the character is also in the ASCII character set. Ones named CC_property_L1 do not have this restriction. (L1 stands for Latin1.)

  Also added is a script that generates the table. It is not anticipated that this will need to be used often.

  (This commit was changed from its original form by Steffen.)
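The layout this commit describes, one word per code point with up to two bits per property, can be sketched as below. Everything here is illustrative: the DEMO_ names, the choice of two properties, and the run-time fill routine are assumptions (the real table is emitted by the generator script), and an ASCII platform is assumed.

```c
#include <stdint.h>

/* Hypothetical masks, two per property: the _A bit is set only for ASCII
 * members, the _L1 bit for every Latin-1 member. */
#define DEMO_CC_ALPHA_A   (1u << 0)
#define DEMO_CC_ALPHA_L1  (1u << 1)
#define DEMO_CC_DIGIT_A   (1u << 2)
#define DEMO_CC_DIGIT_L1  (1u << 3)

/* One 32-bit word per code point 0..255. */
static uint32_t demo_charclass[256];

/* Fill the table at run time; a generated header would hold constants. */
static void demo_fill_table(void)
{
    for (unsigned c = 0; c < 256; c++) {
        uint32_t bits = 0;
        int ascii_alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        int high_alpha = c == 0xAA || c == 0xB5 || c == 0xBA
                         || (c >= 0xC0 && c != 0xD7 && c != 0xF7);

        if (ascii_alpha)
            bits |= DEMO_CC_ALPHA_A | DEMO_CC_ALPHA_L1;
        if (high_alpha)
            bits |= DEMO_CC_ALPHA_L1;
        if (c >= '0' && c <= '9')
            bits |= DEMO_CC_DIGIT_A | DEMO_CC_DIGIT_L1;
        demo_charclass[c] = bits;
    }
}

/* A class test is one array index plus one AND. */
static int demo_has(unsigned char c, uint32_t mask)
{
    return (demo_charclass[c] & mask) != 0;
}
```

After demo_fill_table() has run, demo_has(0xE9, DEMO_CC_ALPHA_L1) is true while demo_has(0xE9, DEMO_CC_ALPHA_A) is false, which is the distinction the two bits per property are meant to carry.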