diff options
author | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2006-06-08 14:11:29 +0000 |
---|---|---|
committer | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2006-06-08 14:11:29 +0000 |
commit | b23a565decf7acb33d46fc5bb7bed5ad79774efe (patch) | |
tree | f6842bace41c25d8b2bc797e16287a02df5c49de /pod/perlreguts.pod | |
parent | ea4f570325887c35caa0c54fa18ba03348797db0 (diff) | |
download | perl-b23a565decf7acb33d46fc5bb7bed5ad79774efe.tar.gz |
Add the perlreguts manpage, by Yves Orton
p4raw-id: //depot/perl@28372
Diffstat (limited to 'pod/perlreguts.pod')
-rw-r--r-- | pod/perlreguts.pod | 722 |
1 files changed, 722 insertions, 0 deletions
diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod new file mode 100644 index 0000000000..ce9f3c0b03 --- /dev/null +++ b/pod/perlreguts.pod @@ -0,0 +1,722 @@ +=head1 NAME + +perlreguts - Description of the Perl regular expression engine. + +=head1 DESCRIPTION + +This document is an attempt to shine some light on the guts of the regex +engine and how it works. The regex engine represents a signifigant chunk +of the perl codebase, but is relatively poorly understood. This document +is a meagre attempt at addressing this situation. It is derived from the +authors experience, comments in the source code, other papers on the +regex engine, feedback in p5p, and no doubt other places as well. + +B<WARNING!> It should be clearly understood that this document +represents the state of the regex engine as the author understands it at +the time of writing. It is B<NOT> an API definition, it is purely an +internals guide for those who want to hack the regex engine, or +understand how the regex engine works. Readers of this document are +expected to understand perls regex syntax and its usage in detail, if +you are a beginner you are in the wrong the place. + +=head1 OVERVIEW + +=head2 A quick note on terms + +There is some debate as to whether to say 'regexp' or 'regex'. In this +document we will use the term "regex" unless there is a special reason +not to, and then we will explain why. + +When speaking about regexes we need to distinguish between their source +code form and their internal form. In this document we will use the term +"pattern" when we speak of their textual, source code form, the term +"program" when we speak of their internal representation. These +correspond to the terms C<S-regex> and C<B-regex> that Mark Jason +Dominus employs in his paper on "Rx"[1]. + +=head2 What is a regular expression engine? + +A regular expression engine is a program whose job is to efficiently +find a section of a string that matches a set criteria of criteria. The +criteria is expressed in text using a formal language. See perlre for a +full definition of the language. + +So the job in less grandiose terms is to some turn a pattern into +something the computer can efficiently use to find the matching point in +the string. + +To do this we need to produce a program by parsing the text. We then +need to execute the program to find the point in the string that +matches. And we need to do the whole thing efficiently. + +=head2 Structure of a Regexp Program + +=head3 High Level + +Although it is a bit confusing and some object to the terminology it +is worth taking a look at a comment that has +been in regexp.h for years: + +I<This is essentially a linear encoding of a nondeterministic +finite-state machine (aka syntax charts or "railroad normal form" in +parsing technology).> + +The term "railroad normal form" is a bit esoteric, with "syntax +diagram/charts", or "railroad diagram/charts" being more common terms. +Nevertheless it provides a useful mental image of a regex program: Each +node can be thought of as a unit of track, with a single entry and in +most cases a single exit point (there are pieces of track that fork, but +statistically not many), and the total forms a system of track with a +single entry and single exit point. The matching process can be thought +of as a car that moves on the track, with the particular route through +the system being determined by the character read at each possible +connector point. A car can roll off the track at any point but it may +not procede unless it matches the track... + +Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the +following chart: + + [start] + | + <foo> + | + +---+---+ + | | | + <\w+> | <\s+> + | <\d+> | + | | | + +---+---+ + | + <bar> + | + [end] + +The truth of the matter is that perls regular expressions these days are +way beyond such models, but they can help when trying to get your +bearings, and they do match pretty closely with the current +implementation. + +To be more precise we will say that a regex program is an encoding +of a graph. Each node in the graph corresponds to part of +the original regex pattern, such as a literal string or a branch, +and has a pointer to the nodes representing the next component +to be matched. Since "node" and opcode are overloaded terms in the +perl source, we will call the nodes in a regex program 'regops'. + +The program is represented by an array of regnode structures, one or +more of which together represent a single regop of the program. Struct +regnode is the smallest struct needed and has a field structure which is +shared with all the other larger structures. + +"Next" pointers of all regops except BRANCH implement concatenation; a +"next" pointer with a BRANCH on both ends of it is connecting two +alternatives. [Here we have one of the subtle syntax dependencies: an +individual BRANCH (as opposed to a collection of them) is never +concatenated with anything because of operator precedence. + +The operand of some types of regop is a literal string; for others, +it is a regop leading into a sub-program. In particular, the operand +of a BRANCH node is the first regop of the branch. + +B<NOTE>: As the railroad metaphor suggests this is B<not> a tree +structure: the tail of the branch connects to the thing following the +set of BRANCHes. It is a like a single line of railway track that +splits as it goes into a station or railway yard and rejoins as it comes +out the other side. + +=head3 Regops + +The base structure of a regop is defined in regexp.h as follows: + + struct regnode { + U8 flags; /* Various purposes, sometimes overriden */ + U8 type; /* Opcode value as specified by regnodes.h */ + U16 next_off; /* Offset in size regnode */ + }; + +Other larger regnode-like structures are defined in regcomp.h. They +are almost like subclasses in that they have the same fields as +regnode, with possibly additional fields following in +the structure, and in some cases the specific meaning (and name) +of some of base fields are overriden. The following is a more +complete description. + +=over 4 + +=item regnode_1 + +=item regnode_2 + +regnode_1 structures have the same header, followed by a single +four-byte argument; regnode_2 structures contain two two-byte +arguments instead: + + regnode_1 U32 arg1; + regnode_2 U16 arg1; U16 arg2; + +=item regnode_string + +regnode_string structures, used for literal strings, follow the header +with a one-byte length and then the string data. Strings are padded on +the end with zero bytes so that the total length of the node is a +multiple of four bytes: + + regnode_string char string[1]; + U8 str_len; (overides flags) + +=item regnode_charclass + +character classes are represented by regnode_charclass structures, +which have a four-byte argument and then a 32-byte (256-bit) bitmap +indicating which characters are included in the class. + + regnode_charclass U32 arg1; + char bitmap[ANYOF_BITMAP_SIZE]; + +=item regnode_charclass_class + +There is also a larger form of a char class structure used to represent +POSIX char classes called regnode_charclass_class which contains the +same fields plus an additional 4-byte (32-bit) bitmap indicating which +POSIX char class have been included. + + regnode_charclass_class U32 arg1; + char bitmap[ANYOF_BITMAP_SIZE]; + char classflags[ANYOF_CLASSBITMAP_SIZE]; + +=back + +regnodes.h defines an array called regarglen[] which gives the size +of each opcode in units of size regnode (4-byte). A macro is used +to calculate the size of an EXACT node based on its C<str_len> field. + +The opcodes are defined in regnodes.h which is generated from +regcomp.sym by regcomp.pl. Currently the maximum possible number +of distinct opcodes is restricted to 256, with about 1/4 already +used. + +There's a set of macros provided to make accessing the fields +easier and more consistent. These include C<OP()> which is used to tell +the type of a regnode-like structure, NEXT_OFF() which is the offset to +the next node (more on this later), ARG(), ARG1(), ARG2(), ARG_SET(), +and equivelents for reading and setting the arguments, STR_LEN(), +STRING(), and OPERAND() for manipulating strings and regop bearing +types. + +=head3 What opcode is next? + +There are three distinct concepts of "next" in the regex engine, and +it is important to keep them clear. + +=over 4 + +=item * + +There is the "next regnode" from a given regnode, a value which is +rarely useful except that sometimes it matches up in terms of value +with one of the others, and that sometimes the code assumes this to +always be so. + +=item * + +There is the "next opcode" from a given opcode/regnode. This is the +opcode physically located after the the current one, as determined by +the size of the current opcode. This is often useful, such as when +dumping the structure we use this order to traverse. Sometimes the code +assumes that the "next regnode" is the same as the "next opcode", or in +other words assumes that the sizeof a given opcode type is always going +to be 1 regnode large. + +=item * + +There is the "regnext" from a given opcode. This is the opcode which +is reached by jumping forward by the value of NEXT_OFF(), +or in a few cases for longer jumps by the arg1 field of the regnode_1 +structure. The subroutine regnext() handles this transparently. +This is the logical successor of the node, which in some cases, like +that of the BRANCH opcode, has special meaning. + +=back + +=head1 PROCESS OVERVIEW + +Broadly speaking performing a match of a string against a pattern +involves the following steps + + A. Compilation + 1. Parsing for size + 2. Parsing for construction + 3. Peep-hole Optimisation and Analysis + B. Execution + 4. Start position and no-match optimisations + 5. Program execution + +Where these steps occur in the actual execution of a perl program is +determined by whether the pattern involves interpolating any string +variables. If it does then compilation happens at run time. If it +doesn't then it happens at compile time. (The C</o> modifier changes this, +as does C<qr//> to a certain extent.) The engine doesn't really care that +much. + +=head2 Compilation + +This code exists primarily in regcomp.c, along with the header files +regcomp.h, regexp.h, regnodes.h. + +Compilation starts with C<pregcomp()>, which is mostly an initialization +wrapper which farms out two other routines for the heavy lifting. The +first being C<reg()> which is the start point for parsing, and +C<study_chunk()> which is responsible for optimisation. + +Initialization in C<pregcomp()> mostly involves the creation and data +filling of a special structure RExC_state_t, (defined in regcomp.c). +Almost all internally used routines in regcomp.h take a pointer to one +of these structures as their first argument, with the name *pRExC_state. +This structure is used to store the compilation state and contains many +fields. Likewise their are many macros defined which operate on this +variable. Anything that looks like RExC_xxxx is a macro that operates on +this pointer/structure. + +=head3 Parsing for size + +In this pass the input pattern is parsed in order to calculate how much +space is needed for each opcode we would need to emit. The size is also +used to determine whether long jumps will be required in the program. + +This stage is controlled by the macro SIZE_ONLY being set. + +The parse procedes pretty much exactly as it does during the +construction phase except that most routines are shortcircuited to +change the size field RExC_size and not actually do anything. + +=head3 Parsing for construcution + +Once the size of the program has been determine the pattern is parsed +again, but this time for real. Now SIZE_ONLY will be false, and the +actual construction can occur. + +C<reg()> is the start of the parse process. It is responsible for +parsing an arbitrary chunk of pattern up to either the end of the +string, or the first closing parenthesis it encounters in the pattern. +This means it can be used to parse the toplevel regex, or any section +inside of a grouping parenthesis. It also handles the "special parens" +that perls regexes have. For instance when parsing C</x(?:foo)y/> C<reg()> +will at one point be called to parse from the '?' symbol up to and +including the ')'. + +Additionally C<reg()> is responsible for parsing the one or more +branches from the pattern, and for "finishing them off" by correctly +setting their next pointers. In order to do the parsing it repeatedly +calls out to C<regbranch()> which is responsible for handling up to the +first C<|> symbol it sees. + +C<regbranch()> in turn calls C<regpiece()> which is responsible for +handling "things" followed by a quantifier. In order to parse the +"things" C<regatom()> is called. This is the lowest level routine which +is responsible for parsing out constant strings, char classes, and the +various special symbols like C<$>. If C<regatom()> encounters a '(' +character it in turn calls C<reg()>. + +The routine C<regtail()> is called by both C<reg()>, C<regbranch()> +in order to "set the tail pointer" correctly. When executing and +we get to the end of a branch we need to go to node following the +grouping parens. When parsing however we don't know where the end will +be until we get there, so when we do we must go back and update the +offsets as appropriate. C<regtail> is used to make this easier. + +A subtlety of the parse process means that a regex like C</foo/> is +originally parsed into an alternation with a single branch. It is only +afterwards that the optimizer converts single branch alternations into the +simpler form. + +=head3 Parse Call Graph and a Grammar + +The call graph looks like this: + + reg() # parse a top level regex, or inside of parens + regbranch() # parse a single branch of an alternation + regpiece() # parse a pattern followed by a quantifier + regatom() # parse a simple pattern + regclass() # used to handle a class + reg() # used to handle a parenthesized subpattern + .... + ... + regtail() # finish off the branch + ... + regtail() # finish off the branch sequence. Tie each + # branches tail to the tail of the sequence + # (NEW) In Debug mode this is + # regtail_study(). + +A grammar form might be something like this: + + atom : constant | class + quant : '*' | '+' | '?' | '{min,max}' + _branch: piece + | piece _branch + | nothing + branch: _branch + | _branch '|' branch + group : '(' branch ')' + _piece: atom | group + piece : _piece + | _piece quant + +=head3 Debug Output + +In bleadperl you can C<< use re Debug => 'PARSE'; >> to see some trace +information about the parse process. We will start with some simple +patterns and build up to more complex patterns. + +So when we parse C</foo/> we see something like the following table. The +left shows whats being parsed, the number indicates where the next regop +would go. The stuff on the right is the trace output of the graph. The +names are chosen to be short to make it less dense on the screen. 'tsdy' +is a special form of C<regtail()> which does some extra analysis. + + >foo< 1 reg + brnc + piec + atom + >< 4 tsdy~ EXACT <foo> (EXACT) (1) + ~ attach to END (3) offset to 2 + +The resulting program then looks like: + + 1: EXACT <foo>(3) + 3: END(0) + +As you can see, even though we parsed out a branch and a piece, it was ultimately +only an atom. The final program shows us how things work. We have an EXACT regop, +followed by an END regop. The number in parens indicates where the 'regnext' of +the node goes. The 'regnext' of an END regop is unused, as END regops mean +we have successfully matched. The number on the left indicates the position of +the regop in the regnode array. + +Now lets try a harder pattern. We will add a quantifier so we have the pattern +C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> regpiece twice. + + >foo+< 1 reg + brnc + piec + atom + >o+< 3 piec + atom + >< 6 tail~ EXACT <fo> (1) + 7 tsdy~ EXACT <fo> (EXACT) (1) + ~ PLUS (END) (3) + ~ attach to END (6) offset to 3 + +And we end up with the program: + + 1: EXACT <fo>(3) + 3: PLUS(6) + 4: EXACT <o>(0) + 6: END(0) + +Now we have a special case. The EXACT regop has a regnext of 0. This is +because if it matches it should try to match itself again. The PLUS regop +handles the actual failure of the EXACT regop and acts appropriately (going +to regnode 6 if the EXACT matched at least once, or failing if it didn't.) + +Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/> + + >x(?:foo*|b... 1 reg + brnc + piec + atom + >(?:foo*|b[... 3 piec + atom + >?:foo*|b[a... reg + >foo*|b[a][... brnc + piec + atom + >o*|b[a][rR... 5 piec + atom + >|b[a][rR])... 8 tail~ EXACT <fo> (3) + >b[a][rR])(... 9 brnc + 10 piec + atom + >[a][rR])(f... 12 piec + atom + >a][rR])(fo... clas + >[rR])(foo|... 14 tail~ EXACT <b> (10) + piec + atom + >rR])(foo|b... clas + >)(foo|bar)... 25 tail~ EXACT <a> (12) + tail~ BRANCH (3) + 26 tsdy~ BRANCH (END) (9) + ~ attach to TAIL (25) offset to 16 + tsdy~ EXACT <fo> (EXACT) (4) + ~ STAR (END) (6) + ~ attach to TAIL (25) offset to 19 + tsdy~ EXACT <b> (EXACT) (10) + ~ EXACT <a> (EXACT) (12) + ~ ANYOF[Rr] (END) (14) + ~ attach to TAIL (25) offset to 11 + >(foo|bar)$< tail~ EXACT <x> (1) + piec + atom + >foo|bar)$< reg + 28 brnc + piec + atom + >|bar)$< 31 tail~ OPEN1 (26) + >bar)$< brnc + 32 piec + atom + >)$< 34 tail~ BRANCH (28) + 36 tsdy~ BRANCH (END) (31) + ~ attach to CLOSE1 (34) offset to 3 + tsdy~ EXACT <foo> (EXACT) (29) + ~ attach to CLOSE1 (34) offset to 5 + tsdy~ EXACT <bar> (EXACT) (32) + ~ attach to CLOSE1 (34) offset to 2 + >$< tail~ BRANCH (3) + ~ BRANCH (9) + ~ TAIL (25) + piec + atom + >< 37 tail~ OPEN1 (26) + ~ BRANCH (28) + ~ BRANCH (31) + ~ CLOSE1 (34) + 38 tsdy~ EXACT <x> (EXACT) (1) + ~ BRANCH (END) (3) + ~ BRANCH (END) (9) + ~ TAIL (END) (25) + ~ OPEN1 (END) (26) + ~ BRANCH (END) (28) + ~ BRANCH (END) (31) + ~ CLOSE1 (END) (34) + ~ EOL (END) (36) + ~ attach to END (37) offset to 1<div></div> + +Resulting in the program + + 1: EXACT <x>(3) + 3: BRANCH(9) + 4: EXACT <fo>(6) + 6: STAR(26) + 7: EXACT <o>(0) + 9: BRANCH(25) + 10: EXACT <ba>(14) + 12: OPTIMIZED (2 nodes) + 14: ANYOF[Rr](26) + 25: TAIL(26) + 26: OPEN1(28) + 28: TRIE-EXACT(34) + [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf] + <foo> + <bar> + 30: OPTIMIZED (4 nodes) + 34: CLOSE1(36) + 36: EOL(37) + 37: END(0) + +Here we can see a much more complex program, with various optimisations in +play. At regnode 10 we can see an example where a char class with only +one character in it was turned into an EXACT node. We can also see where +an entire alternation was turned into a TRIE-EXACT node. As a consequence +some of the regnodes have been marked as optimised away. We can see that +the C<$> symbol has been converted into an EOL regop, a special piece of +code that looks for \n or the end of a string. + +The next pointer for BRANCHes is interesting in that it points at where +execution should go if the branch fails. When executing if the engine +tries to traverse from a branch to a regnext that isnt a branch then +the engine will know the overall series of branches have failed. + +=head3 Peep-hole Optimisation and Analysis + +The regular expression engine can be a weighty tool to wield. On long +strings and complex patterns it can end up having to do a lot of work +to find a match, and even more to decide that no match is possible. +Consider a situation like the following pattern. + + 'ababababababababababab' =~ /(a|b)*z/ + +The C<(a|b)*> part can match at every char in the string, and then fail +every time because there is no C<z> in the string. So obviously we can +not bother to use the regex engine unless there is a 'z' in the string. +Likewise in a pattern like: + + /foo(\w+)bar/ + +In this case we know that the string must contain a C<foo> which must be +followed by C<bar>. We can use Fast Boyer-More matching as implemented +in fbm_instr() to find the location of these strings. If they dont exist +then we dont need to resort to the much more expensive regex engine. +Even better if they do exist then we can use their positions to +reduce the search space that the regex engine needs to cover to determine +if the entire pattern does match. + +There are various aspects of the pattern that can be used to facilitate +optimisations along these lines: + + * anchored fixed strings + * floating fixed strings + * minimum and maximum length requirements + * start class + * Beginning/End of line positions + +Another form of optimisation that can occur is post-parse "peep-hole" +optimisations, where inefficient constructs are modified so that they +are more efficient. An example of this is TAIL regops which are used +during parsing to mark the end of branches and the end of groups. These +regops are used as place holders during construction and "always match" +so they can be "optimised away" by making the things that point to the +TAIL point to thing the TAIL points to, in essence "skipping" the node. + +Another optimisation that can occur is that of "EXACT merging" which is +where two consecutive EXACT nodes are merged into a single more efficient +to execute regop. An even more agressive form of this is that a branch +sequence of the form EXACT BRANCH ... EXACT can be converted into a TRIE +regop. + +All of this occurs in the routine study_chunk() which uses a special +structure scan_data_t to store the analysis that it has performed, and +as it goes does the "peep-hole" optimisations. + +The code involved in study_chunk() is extremely cryptic. Be careful. :-) + +=head2 Execution + +Execution of a regex generally involves two phases, the first being +finding the start point in the string where we should match from, +and the second being running the regop interpreter. + +If we can tell that there is no valid start point we don't bother running +interpreter at all. Likewise if we know from the analysis phase that we +can not optimise detection of the start position we go straight to the +interpreter. + +The two entry points are re_intuit_start() and pregexec(). These routines +have a somewhat incestuous relationship with overlap between their functions, +and pregexec() may even call re_intuit_start() on its own. Nevertheless +the perl source code may call into either, or both. + +Execution of the interpreter itself used to be recursive. Due to the +efforts of Dave Mitchel in blead perl it no longer is. Instead an +internal stack is maintained on the heap and the routine is fully +iterative. This can make it tricky as the code is quite conservative +about what state it stores which means that two consecutive lines in the +code can actually be running in totally different contexts due to the +simulated recursion. + +=head3 Start position and no-match optimisations + +re_intuit_start() is responsible for handling start point and no match +optimisations as determined by the results of the analysis done by +study_chunk() (and described in L<Peep-hole Optimisation and Analysis>). + +The basic structure of this routine is to try to find the start and/or +end points of where the pattern could match, and to ensure that the string +is long enough to match the pattern. It tries to use more efficent +methods over less efficient methods and may involve considerable cross +checking of constraints to find the place in the string that matches. +For instance it may try to determine that a given fixed string must be +not only present but a certain number of chars before the end of the +string, or whatever. + +It calls out into several other routines, like fbm_instr() which does +"Fast Boyer More" matching and find_byclass() which is responsible for +finding the start using the first mandatory regop in the program. + +When the optimisation criteria have been satisfied reg_try() is called +to perform the match. + +=head3 Program execution + +C<pregexec()> is the main entry point for running a regex. It contains +support for initializing the regex interpreters state, running +re_intuit_start() if needed, and running the intepreter on the string +from various start positions as needed. When its necessary to use +the regex interpreter C<pregexec()> calls C<regtry()>. + +C<regtry()> is the entry point into the regex interpreter. It expects +as arguments a pointer to a regmatch_info structure and a pointer to +a string. It returns an integer 1 for success and a 0 for failure. +It is basically a setup wrapper around C<regmatch()>. + +C<regmatch> is the main "recursive loop" of the interpreter. It is +basically a giant switch statement that executes the regops based on +their type. A few of the regops are implemented as subroutines but +the bulk are inline code. + +=head1 MISCELLANEOUS + +=head2 UNICODE and Localization Support + +No matter how you look at it unicode support is going to be a pain in a +regex engine. Tricks that might be fine when you have 256 possible +characters often won't scale to handle the size of the 'utf8' character +set. Things you can take for granted with ASCII may not be true with +unicode. For instance in ASCII its safe to assume that +C<sizeof(char1) == sizeof(char2)>, in utf8 it isn't. Unicode case folding is +vastly more complex than the simple rules of English, and even when not +using unicode but only localized single byte encodings things can get +tricky (technically GERMAN-SHARP-ESS should match 'ss' in localized case +insensitive matching.) + +Making things worse is that C<utf8> support was a later addition to the +regex engine (as it was to perl) and necessarily this made things a lot +more complicated. Obviously its easier to design a regex engine with +unicode support from the beginning than it is to retrofit one that +wasn't designed with it in mind. + +Pretty well every regop that involves looking at the input string has +two cases, one for 'utf8' and one not. In fact its often more complex +than that, as the pattern may be 'utf8' as well. + +Care must be taken when making changes to make sure that you handle +utf8 properly both at compile time and at execution time, including +when the string and pattern are mismatched. + +The following comment in regcomp.h gives an example of exactly how +tricky this can be: + + Two problematic code points in Unicode casefolding of EXACT nodes: + + U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS + U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS + + which casefold to + + Unicode UTF-8 + + U+03B9 U+0308 U+0301 0xCE 0xB9 0xCC 0x88 0xCC 0x81 + U+03C5 U+0308 U+0301 0xCF 0x85 0xCC 0x88 0xCC 0x81 + + This means that in case-insensitive matching (or "loose matching", + as Unicode calls it), an EXACTF of length six (the UTF-8 encoded + byte length of the above casefolded versions) can match a target + string of length two (the byte length of UTF-8 encoded U+0390 or + U+03B0). This would rather mess up the minimum length computation. + + What we'll do is to look for the tail four bytes, and then peek + at the preceding two bytes to see whether we need to decrease + the minimum length by four (six minus two). + + Thanks to the design of UTF-8, there cannot be false matches: + A sequence of valid UTF-8 bytes cannot be a subsequence of + another valid sequence of UTF-8 bytes. + +=head1 AUTHOR + +by Yves Orton, 2006. + +With excerpts from Perl, and contributions and suggestions from +Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, +and Stephen McCamant. + +=head1 LICENSE + +Same terms as Perl. + +=head1 REFERENCES + +[1] http://perl.plover.com/Rx/paper/ + +=cut |