From 0abd0d78a73da1c4d13b1c700526b7e5d03b32d4 Mon Sep 17 00:00:00 2001
From: Yves Orton
Date: Sun, 25 Oct 2009 20:37:08 +0100
Subject: disable non-unicode case insensitive trie matching

Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely masked
one symptom of the deeper problems.

Also fixes RT #69973, a segfault that was exposed by 8902bb05; see the
ticket for further details.

http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973

At the core of this is the problem that in unicode matching many code
points have case folding rules beyond just A-Z/a-z. Since the case
folding rules are decided at runtime by the string, we can't use the
same TRIE tables for both unicode and non-unicode matching. Until this
is reconciled or some other solution is found, case insensitive matching
only gets the TRIE optimisation when the pattern is unicode.

From CaseFolding.txt:

00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
---
 ext/re/t/regop.t | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 'ext/re')

diff --git a/ext/re/t/regop.t b/ext/re/t/regop.t
index 9118bf6203..46e6ec04f5 100644
--- a/ext/re/t/regop.t
+++ b/ext/re/t/regop.t
@@ -231,12 +231,12 @@ anchored "ABC" at 0
 #Freeing REx: "(\\.COM|\\.EXE|\\.BAT|\\.CMD|\\.VBS|\\.VBE|\\.JS|\\.JSE|\\."......
 %MATCHED%
 floating ""$ at 3..4 (checking floating)
-1:1[1] 3:2[1] 5:2[64] 45:83[1] 47:84[1] 48:85[0]
-stclass EXACTF <.> minlen 3
-Found floating substr ""$ at offset 30...
-Does not contradict STCLASS...
-Guessed: match at offset 26
-Matching stclass EXACTF <.> against ".exe"
+#1:1[1] 3:2[1] 5:2[64] 45:83[1] 47:84[1] 48:85[0]
+#stclass EXACTF <.> minlen 3
+#Found floating substr ""$ at offset 30...
+#Does not contradict STCLASS...
+#Guessed: match at offset 26
+#Matching stclass EXACTF <.> against ".exe"
 ---
 #Compiling REx "[q]"
 #size 12 nodes Got 100 bytes for offset annotations.
-- 
cgit v1.2.1
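
Illustration (not part of the patch): the commit message's point that one set of folded trie keys cannot serve both matching modes can be sketched in a few lines of Python. This is only a rough model under stated assumptions: fold_ascii is a hypothetical stand-in for non-unicode matching that folds A-Z only, and Python's str.casefold() stands in for the full Unicode rules quoted from CaseFolding.txt. Neither function reflects how the Perl regex engine is actually implemented.

```python
def fold_ascii(s: str) -> str:
    """Hypothetical non-unicode fold: map A-Z to a-z, leave all else untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def fold_unicode(s: str) -> str:
    """Full Unicode case folding (the C and F rules of CaseFolding.txt)."""
    return s.casefold()


# Code points taken from the CaseFolding.txt excerpt, plus one plain ASCII word.
samples = ["\u00b5", "\u00c0", "\u00df", "EXE"]

for s in samples:
    a, u = fold_ascii(s), fold_unicode(s)
    marker = "same" if a == u else "DIFFERENT"
    print(f"U+{ord(s[0]):04X}: ascii={a!r} unicode={u!r} ({marker})")

# Keys a case-insensitive trie would be built from, per folding mode.
words = ["\u00c0", "\u00e0"]
ascii_keys = {fold_ascii(w) for w in words}      # stays as two distinct keys
unicode_keys = {fold_unicode(w) for w in words}  # collapses to a single key
print(len(ascii_keys), len(unicode_keys))
```

Only the plain ASCII word folds identically in both modes; the Latin-1 code points either fold to a different code point or, for U+00DF, to a two-character sequence, so keys precomputed under one rule set are wrong under the other.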