author | Ilya Zakharevich <ilya@math.berkeley.edu> | 1998-06-20 17:11:37 -0400 |
---|---|---|
committer | Gurusamy Sarathy <gsar@cpan.org> | 1998-06-21 21:23:41 +0000 |
commit | 75e14d17912ce8a35d5c2b04c0c6e30b903ab97f (patch) | |
tree | d77e5128f9ce6b8cdacef5657b07e4e6de5af2dd /pod/perlop.pod | |
parent | 87c6202a78304039fc9e5218cf4694c9f7e247f9 (diff) | |
download | perl-75e14d17912ce8a35d5c2b04c0c6e30b903ab97f.tar.gz |
Re docs
Message-Id: <199806210111.VAA17752@monk.mps.ohio-state.edu>
p4raw-id: //depot/perl@1181
Diffstat (limited to 'pod/perlop.pod')
-rw-r--r-- | pod/perlop.pod | 188 |
1 files changed, 188 insertions, 0 deletions
diff --git a/pod/perlop.pod b/pod/perlop.pod
index b3202e540b..c534234bdd 100644
--- a/pod/perlop.pod
+++ b/pod/perlop.pod
@@ -695,6 +695,18 @@ evaluation of variables when used within double quotes.
 Here are the quote-like operators that apply to pattern
 matching and related activities.
 
+Most of this section is related to use of regular expressions from Perl.
+Such a use may be considered from two points of view: Perl handles a
+a string and a "pattern" to RE (regular expression) engine to match,
+RE engine finds (or does not find) the match, and Perl uses the findings
+of RE engine for its operation, possibly asking the engine for other matches.
+
+RE engine has no idea what Perl is going to do with what it finds,
+similarly, the rest of Perl has no idea what a particular regular expression
+means to RE engine. This creates a clean separation, and in this section
+we discuss matching from Perl point of view only. The other point of
+view may be found in L<perlre>.
+
 =over 8
 
 =item ?PATTERN?
@@ -1172,6 +1184,182 @@ an eval():
 
 =back
 
+=head2 Gory details of parsing quoted constructs
+
+When presented with something which may have several different
+interpretations, Perl uses the principle B<DWIM> (expanded to Do What I Mean
+- not what I wrote) to pick up the most probable interpretation of the
+source. This strategy is so successful that Perl users usually do not
+suspect ambivalence of what they write. However, time to time Perl's ideas
+differ from what the author meant.
+
+The target of this section is to clarify the Perl's way of interpreting
+quoted constructs. The most frequent reason one may have to want to know the
+details discussed in this section is hairy regular expressions. However, the
+first steps of parsing are the same for all Perl quoting operators, so here
+they are discussed together.
+
+Some of the passes discussed below are performed concurrently, but as
+far as results are the same, we consider them one-by-one. For different
+quoting constructs Perl performs different number of passes, from
+one to five, but they are always performed in the same order.
+
+=over
+
+=item Finding the end
+
+First pass is finding the end of the quoted construct, be it multichar ender
+C<"\nEOF\n"> of C<<<EOF> construct, C</> which terminates C<qq/> construct,
+C<E<]>> which terminates C<qq[> construct, or C<E<gt>> which terminates a
+fileglob started with C<<>.
+
+When searching for multichar construct no skipping is performed. When
+searching for one-char non-matching delimiter, such as C</>, combinations
+C<\\> and C<\/> are skipped. When searching for one-char matching delimiter,
+such as C<]>, combinations C<\\>, C<\]> and C<\[> are skipped, and
+nested C<[>, C<]> are skipped as well.
+
+For 3-parts constructs C<s///> etc. the search is repeated once more.
+
+During this search no attension is paid to the semantic of the construct, thus
+
+  "$hash{"$foo/$bar"}"
+
+or
+
+  m/
+    bar # This is not a comment, this slash / terminated m//!
+  /x
+
+do not form legal quoted expressions. Note that since the slash which
+terminated C<m//> was followed by a C<SPACE>, this is not C<m//x>,
+thus C<#> was interpreted as a literal C<#>.
+
+=item Removal of backslashes before delimiters
+
+During the second pass the text between the starting delimiter and
+the ending delimiter is copied to a safe location, and the C<\> is
+removed from combinations consisting of C<\> and delimiter(s) (both starting
+and ending delimiter if they differ).
+
+The removal does not happen for multi-char delimiters.
+
+Note that the combination C<\\> is left as it was!
+
+Starting from this step no information about the delimiter(s) is used in the
+parsing.
+
+=item Interpolation
+
+Next step is interpolation in the obtained delimiter-independent text.
+There are many different cases.
+
+=over
+
+=item C<<<'EOF'>, C<m''>, C<s'''>, C<tr///>, C<y///>
+
+No interpolation is performed.
+
+=item C<''>, C<q//>
+
+The only interpolation is removal of C<\> from pairs C<\\>.
+
+=item C<"">, C<``>, C<qq//>, C<qx//>, C<<file*globE<gt>>
+
+C<\Q>, C<\U>, C<\u>, C<\L>, C<\l> (possibly paired with C<\E>) are converted
+to corresponding Perl constructs, thus C<"$foo\Qbaz$bar"> is converted to
+
+  $foo . (quotemeta("baz" . $bar));
+
+Other combinations of C<\> with following chars are substituted with
+appropriate expansions.
+
+Interpolated scalars and arrays are converted to C<join> and C<.> Perl
+constructs, thus C<"'@arr'"> becomes
+
+  "'" . (join $", @arr) . "'";
+
+Since all three above steps are performed simultaneously left-to-right,
+the is no way to insert a literal C<$> or C<@> inside C<\Q\E> pair: it
+cannot be protected by C<\>, since any C<\> (except in C<\E>) is
+interpreted as a literal inside C<\Q\E>, and any $ is
+interpreted as starting an interpolated scalar.
+
+Note also that the interpolating code needs to make decision where the
+interpolated scalar ends, say, whether C<"a $b -> {c}"> means
+
+  "a " . $b . " -> {c}";
+
+or
+
+  "a " . $b -> {c};
+
+Most the time the decision is to take the longest possible text which does
+not include spaces between components and contains matching braces/brackets.
+
+=item C<?RE?>, C</RE/>, C<m/RE/>, C<s/RE/foo/>,
+
+Processing of C<\Q>, C<\U>, C<\u>, C<\L>, C<\l> and interpolation happens
+(almost) as with qq// constructs, but I<the substitution of C<\> followed by
+other chars is not performed>! Moreover, inside C<(?{BLOCK})> no processing
+is performed at all.
+
+Interpolation has several quirks: $|, $( and $) are not interpolated, and
+constructs C<$var[SOMETHING]> are I<voted> (by several different estimators)
+to be an array element or $var followed by a RE alternative. This is
+the place where the notation C<${arr[$bar]}> comes handy: C</${arr[0-9]}/>
+is interpreted as an array element -9, not as a regular expression from
+variable $arr followed by a digit, which is the interpretation of
+C</$arr[0-9]/>.
+
+Note that absense of processing of C<\\> creates specific restrictions on the
+post-processed text: if the delimeter is C</>, one cannot get the combination
+C<\/> into the result of this step: C</> will finish the regular expression,
+C<\/> will be stripped to C</> on the previous step, and C<\\/> will be left
+as is. Since C</> is equivalent to C<\/> inside a regular expression, this
+does not matter unless the delimiter is special character for RE engine, as
+in C<s*foo*bar*>, C<m[foo]>, or C<?foo?>.
+
+=back
+
+This step is the last one for all the constructs except regular expressions,
+which are processed further.
+
+=item Interpolation of regular expressions
+
+All the previous steps were performed during the compilation of Perl code,
+this one happens in run time (though it may be optimized to be calculated
+at compile time if appropriate). After all the preprocessing performed
+above (and possibly after evaluation if catenation, joining, up/down-casing
+and quotemeta()ing are involved) the resulting I<string> is passed to RE
+engine for compilation.
+
+Whatever happens in the RE engine is better be discussed in L<perlre>,
+but for the sake of continuity let us do it here.
+
+This is the first step where presense of the C<//x> switch is relevant.
+RE engine scans the string left-to-right, and converts it to a finite
+automaton.
+
+Backslashed chars are either substituted by corresponding literal
+strings, or generate special nodes of the finite automaton. Characters
+which are special to RE engine generate corresponding nodes. C<(?#...)>
+comments are ignored. All the rest is either converted to literal strings
+to match, or is ignored (as is whitespace and C<#>-style comments if
+C<//x> is present).
+
+Note that the parsing of the construct C<[...]> is performed using
+absolutely different rules than the rest of the regular expression.
+Similarly, the C<(?{...})> is only checked for matching braces.
+
+=item Optimization of regular expressions
+
+This step is listed for compeleteness only. Since it does not change
+semantics, details of this step are not documented and are subject
+to change.
+
+=back
+
 =head2 I/O Operators
 
 There are several I/O operators you should know about.