author | Nicholas Clark <nick@ccl4.org> | 2013-02-04 17:54:33 +0100 |
---|---|---|
committer | Nicholas Clark <nick@ccl4.org> | 2013-03-19 11:53:19 +0100 |
commit | a35e7505f3e9cd4a87a7d911c4e7ae19e97cb9f6 (patch) | |
tree | 5e740981b2756747fad90602f3bbb8974e423738 /pod/perlreguts.pod | |
parent | d2d8c7119aed4a86948ff470a23161eb2f07f551 (diff) | |
Document the uses of NULL returns in the regex parsing code.
Diffstat (limited to 'pod/perlreguts.pod')
-rw-r--r-- | pod/perlreguts.pod | 46 |
1 files changed, 46 insertions, 0 deletions
diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod
index fbcb149dbc..bb7f372c66 100644
--- a/pod/perlreguts.pod
+++ b/pod/perlreguts.pod
@@ -386,6 +386,52 @@ A grammar form might be something like this:
     piece : _piece
           | _piece quant
 
+=head3 Parsing complications
+
+The implication of the above description is that a pattern containing nested
+parentheses will result in a call graph which cycles through C<reg()>,
+C<regbranch()>, C<regpiece()>, C<regatom()>, C<reg()>, C<regbranch()> I<etc>
+multiple times, until the deepest level of nesting is reached. All the above
+routines return a pointer to a C<regnode>, which is usually the last regnode
+added to the program. However, one complication is that reg() returns NULL
+for parsing C<(?:)> syntax for embedded modifiers, setting the flag
+C<TRYAGAIN>. The C<TRYAGAIN> propagates upwards until it is captured, in
+some cases by C<regatom()>, but otherwise unconditionally by
+C<regbranch()>. Hence it will never be returned by C<regbranch()> to
+C<reg()>. This flag permits patterns such as C<(?i)+> to be detected as
+errors (I<Quantifier follows nothing in regex; marked by <-- HERE in m/(?i)+
+<-- HERE />).
+
+Another complication is that the representation used for the program differs
+if it needs to store Unicode, but it's not always possible to know for sure
+whether it does until midway through parsing. The Unicode representation for
+the program is larger, and cannot be matched as efficiently. (See L</Unicode
+and Localisation Support> below for more details as to why.) If the pattern
+contains literal Unicode, it's obvious that the program needs to store
+Unicode. Otherwise, the parser optimistically assumes that the more
+efficient representation can be used, and starts sizing on this basis.
+However, if it then encounters something in the pattern which must be stored
+as Unicode, such as an C<\x{...}> escape sequence representing a character
+literal, then this means that all previously calculated sizes need to be
+redone, using values appropriate for the Unicode representation. Currently,
+all regular expression constructions which can trigger this are parsed by code
+in C<regatom()>.
+
+To avoid wasted work when a restart is needed, the sizing pass is abandoned
+- C<regatom()> immediately returns NULL, setting the flag C<RESTART_UTF8>.
+(This action is encapsulated using the macro C<REQUIRE_UTF8>.) This restart
+request is propagated up the call chain in a similar fashion, until it is
+"caught" in C<Perl_re_op_compile()>, which marks the pattern as containing
+Unicode, and restarts the sizing pass. It is also possible for constructions
+within run-time code blocks to turn out to need Unicode representation,
+which is signalled by C<S_compile_runtime_code()> returning false to
+C<Perl_re_op_compile()>.
+
+The restart was previously implemented using a C<longjmp> in C<regatom()>
+back to a C<setjmp> in C<Perl_re_op_compile()>, but this proved to be
+problematic as the latter is a large function containing many automatic
+variables, which interact badly with the emergent control flow of C<setjmp>.
+
 =head3 Debug Output
 
 In the 5.9.x development version of perl you can C<< use re Debug => 'PARSE' >>
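
The restart mechanism this patch documents (return NULL, set a flag, let each caller pass the request upwards until the top level catches it and redoes the sizing pass) can be sketched in a few lines of C. This is only an illustration of the control flow, not code from regcomp.c: every name prefixed with sk_ or SK_ is hypothetical, the flag value is made up, and the "size" arithmetic (1 unit per atom normally, 2 in Unicode mode) is an invented stand-in for the real regnode sizing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag; the real RESTART_UTF8 is defined in regcomp.c. */
#define SK_RESTART_UTF8 0x02

typedef struct { int unused; } sk_regnode;   /* stand-in for a regnode */
static sk_regnode sk_last_node;              /* dummy non-NULL result */

typedef struct {
    const char *p;   /* current parse position */
    bool utf8;       /* true once the pattern is known to need Unicode */
    size_t size;     /* running size estimate for this pass */
} sk_state;

/* Parse one atom. On meeting \x{...} while still in the non-Unicode
 * sizing pass, give up at once: set the flag and return NULL. */
static sk_regnode *sk_atom(sk_state *st, int *flagp)
{
    if (strncmp(st->p, "\\x{", 3) == 0) {
        if (!st->utf8) {
            *flagp |= SK_RESTART_UTF8;
            return NULL;
        }
        const char *close = strchr(st->p, '}');
        st->p = close ? close + 1 : st->p + strlen(st->p);
    } else {
        st->p++;                      /* any other character is one atom */
    }
    st->size += st->utf8 ? 2 : 1;     /* invented sizes, see lead-in */
    return &sk_last_node;
}

/* Intermediate level: does not handle the restart itself, only
 * passes the NULL (and the flag already set) up to its caller. */
static sk_regnode *sk_branch(sk_state *st, int *flagp)
{
    while (*st->p) {
        if (sk_atom(st, flagp) == NULL)
            return NULL;
    }
    return &sk_last_node;
}

/* Top level: "catches" the restart request, marks the pattern as
 * Unicode, and reruns the sizing pass from the beginning. */
size_t sk_compile(const char *pat, bool *utf8_out, int *restarts_out)
{
    assert(pat && utf8_out && restarts_out);
    bool utf8 = false;
    int restarts = 0;

    for (;;) {
        sk_state st = { pat, utf8, 0 };
        int flags = 0;

        if (sk_branch(&st, &flags) != NULL) {
            *utf8_out = utf8;
            *restarts_out = restarts;
            return st.size;
        }
        if (!(flags & SK_RESTART_UTF8))
            return 0;                 /* some other failure: not modelled */
        utf8 = true;                  /* sizes so far are discarded */
        restarts++;
    }
}
```

Calling sk_compile() on a pattern such as "ab\\x{100}c" (as a C string literal) abandons the first pass at the escape and sizes everything again in Unicode mode, while "abc" completes in a single pass. Because the request travels through ordinary return values, each function's local variables are unwound normally, which avoids the setjmp/longjmp interaction with automatic variables that the final paragraph of the patch describes.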