regcomp.c: Fix multi-char fold bug

Input text to be matched under /i is placed in EXACTFish nodes. The current limit on such text is 255 bytes per node. Even if we raised that limit, it would still be finite. If the input text is longer than the limit, it is split across two or more nodes. A problem occurs when that split falls within a potential multi-character fold. For example, if the final character that fits in a node is 'f', and the next character is 'i', the pair should be matchable by LATIN SMALL LIGATURE FI, but because Perl isn't structured to find multi-char folds that cross node boundaries, we will miss it.

The solution presented here isn't optimal. What we do is try to prevent all EXACTFish nodes from ending in a character that could be at the beginning or in the middle of a multi-char fold. That prevents the problem. But in actuality, the problem only occurs if the input text really is a multi-char fold, which happens much less frequently. For example, we try not to end a full node with an 'f', but the problem doesn't actually occur unless the adjacent following node begins with an 'i' (or one of the other characters that 'f' can combine with in a multi-char fold). That is, this patch splits when it doesn't need to.

At the point of execution for this patch, we only know that the final character that fits in the node is that 'f'. The next character remains unparsed, and could be in any number of forms: a literal 'i', or a hex, octal, or named character constant, or it may need to be decoded (from 'use encoding'). So look-ahead is not really viable. Finding out whether a real multi-character fold is involved would therefore have to be done later in the process, when we have full knowledge of the nodes, at the places where join_exact() is now called, and would require inserting one or more new nodes in the middle of existing ones. This solution seems reasonable instead.

It does not yet address named character constants (\N{}), which currently bypass the code added here.
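
The back-off described above can be sketched roughly as follows. This is not the code in regcomp.c: node_split_len() and is_non_final_fold() are hypothetical names, the character set tested is a small illustrative subset rather than Perl's actual table, and the caller chooses the node limit (255 bytes in the message above).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real lookup: returns nonzero if c could be
 * the first or a middle character of a multi-char fold. Only 'f' and 's'
 * are covered here (think "fi", "ffl", "ss", "st"); this is an assumption
 * made for illustration, not the full set. */
static int is_non_final_fold(unsigned char c)
{
    return c == 'f' || c == 'F' || c == 's' || c == 'S';
}

/* Hypothetical helper: return how many bytes of text[0..len) to place in
 * the next node, given a per-node limit of max_len bytes. */
static size_t node_split_len(const char *text, size_t len, size_t max_len)
{
    size_t n;

    assert(max_len > 0);

    if (len <= max_len)
        return len;     /* node is not full, so no forced split happens */

    /* Back up while the node would end in a character that could begin
     * or continue a multi-char fold. */
    n = max_len;
    while (n > 0 && is_non_final_fold((unsigned char) text[n - 1]))
        n--;

    /* If every candidate is problematic, give up and fill the node. */
    return n > 0 ? n : max_len;
}
```

With a limit of 4, node_split_len("abcfi", 5, 4) returns 3, so the text is split as "abc" + "fi" and the potential 'fi' fold stays within one node. The same back-off also happens for "abcfx", where no fold is possible, which is the unnecessary split the message mentions.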
author: Karl Williamson <public@khwilliamson.com> 2012-06-18 13:09:38 -0600
committer: Karl Williamson <public@khwilliamson.com> 2012-08-02 09:24:53 -0600
commit: 26faadbdfacc510abb04832e4c81d1f71329c697 (patch)
tree: bf680a9c3d4787711b0dc3562060247271b9a41b /embedvar.h
parent: b72a36d4f2738e8e15eb2e22819c8ffee7421c93 (diff)
download: perl-26faadbdfacc510abb04832e4c81d1f71329c697.tar.gz
1 files changed, 1 insertions, 0 deletions
diff --git a/embedvar.h b/embedvar.h
index 01f3db139f..0a3c7fa2d9 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -67,6 +67,7 @@
 #define PL_Mem			(vTHX->IMem)
 #define PL_MemParse		(vTHX->IMemParse)
 #define PL_MemShared		(vTHX->IMemShared)
+#define PL_NonL1NonFinalFold	(vTHX->INonL1NonFinalFold)
 #define PL_PerlSpace		(vTHX->IPerlSpace)
 #define PL_PosixAlnum		(vTHX->IPosixAlnum)
 #define PL_PosixAlpha		(vTHX->IPosixAlpha)