author | Karl Williamson <public@khwilliamson.com> | 2012-06-18 13:09:38 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2012-08-02 09:24:53 -0600 |
commit | 26faadbdfacc510abb04832e4c81d1f71329c697 (patch) | |
tree | bf680a9c3d4787711b0dc3562060247271b9a41b /embedvar.h | |
parent | b72a36d4f2738e8e15eb2e22819c8ffee7421c93 (diff) | |
regcomp.c: Fix multi-char fold bug

Input text to be matched under /i is placed in EXACTFish nodes. The
current limit on such text is 255 bytes per node. Even if we raised
that limit, it would still be finite. If the input text is longer than
this, it is split across 2 or more nodes. A problem occurs when that
split occurs within a potential multi-character fold. For example, if
the final character that fits in a node is 'f', and the next character
is 'i', it should be matchable by LATIN SMALL LIGATURE FI, but because
Perl isn't structured to find multi-char folds that cross node
boundaries, we will miss it.

The solution presented here isn't optimum. What we do is try to prevent
all EXACTFish nodes from ending in a character that could be at the
beginning or middle of a multi-char fold. That prevents the problem.
But in actuality, the problem only occurs if the input text is actually
a multi-char fold, which happens much less frequently. For example,
we try to not end a full node with an 'f', but the problem doesn't
actually occur unless the adjacent following node begins with an 'i' (or
one of the other characters with which 'f' forms a multi-char fold).
That is, this patch splits when it doesn't need to.

At the point of execution for this patch, we only know that the final
character that fits in the node is that 'f'. The next character remains
unparsed, and could be in any number of forms, a literal 'i', or a hex,
octal, or named character constant, or it may need to be decoded (from
'use encoding'). So look-ahead is not really viable.

So finding if a real multi-character fold is involved would have to be
done later in the process, when we have full knowledge of the nodes, at
the places where join_exact() is now called, and would require inserting
one or more new nodes in the middle of existing ones.

This solution seems reasonable instead.

It does not yet address named character constants (\N{}) which currently
bypass the code added here.
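
The splitting rule described in the message above can be sketched roughly as follows. This is not the code from regcomp.c: the names `choose_node_len` and `is_non_final_fold_char` are hypothetical, the input is assumed to be plain ASCII bytes, and the lookup knows only 'f' and 's' for illustration, whereas the real set of characters that can begin or continue a multi-char fold is larger.

```c
#include <stddef.h>

#define MAX_NODE_LEN 255  /* per-node byte limit cited in the commit message */

/* Hypothetical stand-in for the real lookup: reports whether c could be at
 * the beginning or middle of a multi-character fold.  Only 'f' and 's'
 * (either case) are listed, purely for illustration. */
static int is_non_final_fold_char(unsigned char c)
{
    return c == 'f' || c == 'F' || c == 's' || c == 'S';
}

/* Return how many bytes of text[0..len) to place in the next node. */
static size_t choose_node_len(const unsigned char *text, size_t len)
{
    size_t n;

    if (len <= MAX_NODE_LEN)
        return len;             /* everything fits, so no split is needed */

    /* Start from a full node and back up while the last byte could still
     * be continued by a fold partner in the following node. */
    n = MAX_NODE_LEN;
    while (n > 0 && is_non_final_fold_char(text[n - 1]))
        n--;

    /* If every byte was a candidate, fall back to a full node rather than
     * emitting an empty one (an assumption of this sketch). */
    return n > 0 ? n : MAX_NODE_LEN;
}
```

As in the message, this backs up whenever the last byte is a candidate, without looking at what follows, so it sometimes produces a shorter node than strictly necessary.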
Diffstat (limited to 'embedvar.h')
-rw-r--r-- | embedvar.h | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/embedvar.h b/embedvar.h
index 01f3db139f..0a3c7fa2d9 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -67,6 +67,7 @@
 #define PL_Mem (vTHX->IMem)
 #define PL_MemParse (vTHX->IMemParse)
 #define PL_MemShared (vTHX->IMemShared)
+#define PL_NonL1NonFinalFold (vTHX->INonL1NonFinalFold)
 #define PL_PerlSpace (vTHX->IPerlSpace)
 #define PL_PosixAlnum (vTHX->IPosixAlnum)
 #define PL_PosixAlpha (vTHX->IPosixAlpha)
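
The added line has the same shape as its neighbours: a `PL_`-prefixed name that expands to an `I`-prefixed member reached through `vTHX`. A heavily simplified sketch of that pattern follows. `struct toy_interp`, `toy_current`, and `toy_set_and_get` are hypothetical, the `void *` member type is a placeholder, and `vTHX` is defined here as a plain pointer only for the purposes of the example.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified interpreter structure; the member type
 * is a placeholder, not the type Perl uses. */
struct toy_interp {
    void *INonL1NonFinalFold;
};

/* In this sketch vTHX is just a pointer to the current toy interpreter. */
static struct toy_interp *toy_current;
#define vTHX toy_current

/* Same shape as the line added by the patch. */
#define PL_NonL1NonFinalFold (vTHX->INonL1NonFinalFold)

/* Store a value through the macro and read it back. */
static void *toy_set_and_get(struct toy_interp *interp, void *value)
{
    toy_current = interp;
    PL_NonL1NonFinalFold = value;   /* writes interp->INonL1NonFinalFold */
    return PL_NonL1NonFinalFold;
}
```

Because the macro resolves through whichever structure `vTHX` points at, code that uses the `PL_` name reads and writes per-interpreter state without naming the structure explicitly.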