summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2009-09-26 19:12:32 +0000
committerph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2009-09-26 19:12:32 +0000
commit5c3493df2827cb70ddc42899df2f9ee30f5e7a7b (patch)
tree94e6f1bf0230945da14fd4fe3408292572d90749
parent13ec83b84a6939e47ebabc1836caec7d94836896 (diff)
downloadpcre-5c3493df2827cb70ddc42899df2f9ee30f5e7a7b.tar.gz
Added lower bound length-finding to pcre_study() and use it when matching; make
the value available via pcre_fullinfo(); also fixed bugs connected with pcre_study() in pcre_dfa_exec(). git-svn-id: svn://vcs.exim.org/pcre/code/trunk@455 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--ChangeLog16
-rw-r--r--doc/pcre_fullinfo.31
-rw-r--r--doc/pcreapi.344
-rw-r--r--doc/pcretest.15
-rw-r--r--maint/README9
-rw-r--r--pcre.h.in1
-rw-r--r--pcre_compile.c17
-rw-r--r--pcre_dfa_exec.c131
-rw-r--r--pcre_exec.c132
-rw-r--r--pcre_fullinfo.c8
-rw-r--r--pcre_internal.h27
-rw-r--r--pcre_study.c418
-rw-r--r--pcre_try_flipped.c6
-rw-r--r--pcretest.c15
-rw-r--r--testdata/testinput2237
-rw-r--r--testdata/testinput718
-rw-r--r--testdata/testoutput2392
-rw-r--r--testdata/testoutput32
-rw-r--r--testdata/testoutput58
-rw-r--r--testdata/testoutput742
20 files changed, 1329 insertions, 200 deletions
diff --git a/ChangeLog b/ChangeLog
index dbc57de..a51dd8a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -135,6 +135,22 @@ Version 8.00 ??-???-??
pattern matches a fixed length string. PCRE did not allow this; now it
does. Neither allows recursion.
+25. I finally figured out how to implement a request to provide the minimum
+ length of subject string that was needed in order to match a given pattern.
+ (It was back references and recursion that I had previously got hung up
+ on.) This code has now been added to pcre_study(); it finds a lower bound
+ to the length of subject needed. It is not necessarily the greatest lower
+ bound, but using it to avoid searching strings that are too short does give
+ some useful speed-ups. The value is available to calling programs via
+ pcre_fullinfo().
+
+26. While implementing 25, I discovered to my embarrassment that pcretest had
+ not been passing the result of pcre_study() to pcre_dfa_exec(), so the
+ study optimizations had never been tested with that matching function.
+ Oops. What is worse, even when it was passed study data, there was a bug in
+ pcre_dfa_exec() that meant it never actually used it. Double oops. There
+ were also very few tests of studied patterns with pcre_dfa_exec().
+
Version 7.9 11-Apr-09
---------------------
diff --git a/doc/pcre_fullinfo.3 b/doc/pcre_fullinfo.3
index 6972da4..16a3c63 100644
--- a/doc/pcre_fullinfo.3
+++ b/doc/pcre_fullinfo.3
@@ -33,6 +33,7 @@ The following information is available:
PCRE_INFO_FIRSTTABLE Table of first bytes (after studying)
PCRE_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE_INFO_LASTLITERAL Literal last byte required
+ PCRE_INFO_MINLENGTH Lower bound length of matching strings
PCRE_INFO_NAMECOUNT Number of named subpatterns
PCRE_INFO_NAMEENTRYSIZE Size of name table entry
PCRE_INFO_NAMETABLE Pointer to name table
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index cf4a013..c406f60 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -772,19 +772,19 @@ help speed up matching, \fBpcre_study()\fP returns a pointer to a
results of the study.
.P
The returned value from \fBpcre_study()\fP can be passed directly to
-\fBpcre_exec()\fP. However, a \fBpcre_extra\fP block also contains other
-fields that can be set by the caller before the block is passed; these are
-described
+\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. However, a \fBpcre_extra\fP block
+also contains other fields that can be set by the caller before the block is
+passed; these are described
.\" HTML <a href="#extradata">
.\" </a>
below
.\"
in the section on matching a pattern.
.P
-If studying the pattern does not produce any additional information
+If studying the pattern does not produce any useful information,
\fBpcre_study()\fP returns NULL. In that circumstance, if the calling program
-wants to pass any of the other fields to \fBpcre_exec()\fP, it must set up its
-own \fBpcre_extra\fP block.
+wants to pass any of the other fields to \fBpcre_exec()\fP or
+\fBpcre_dfa_exec()\fP, it must set up its own \fBpcre_extra\fP block.
.P
The second argument of \fBpcre_study()\fP contains option bits. At present, no
options are defined, and this argument should always be zero.
@@ -804,9 +804,18 @@ This is a typical call to \fBpcre_study\fP():
0, /* no options exist */
&error); /* set to NULL or points to a message */
.sp
-At present, studying a pattern is useful only for non-anchored patterns that do
-not have a single fixed starting character. A bitmap of possible starting
-bytes is created.
+Studying a pattern does two things: first, a lower bound for the length of
+subject string that is needed to match the pattern is computed. This does not
+mean that there are any strings of that length that match, but it does
+guarantee that no shorter strings match. The value is used by
+\fBpcre_exec()\fP and \fBpcre_dfa_exec()\fP to avoid wasting time by trying to
+match strings that are shorter than the lower bound. You can find out the value
+in a calling program via the \fBpcre_fullinfo()\fP function.
+.P
+Studying a pattern is also useful for non-anchored patterns that do not have a
+single fixed starting character. A bitmap of possible starting bytes is
+created. This speeds up finding a position in the subject at which to start
+matching.
.
.
.\" HTML <a name="localesupport"></a>
@@ -971,6 +980,16 @@ follows something of variable length. For example, for the pattern
/^a\ed+z\ed+/ the returned value is "z", but for /^a\edz\ed/ the returned value
is -1.
.sp
+ PCRE_INFO_MINLENGTH
+.sp
+If the pattern was studied and a minimum length for matching subject strings
+was computed, its value is returned. Otherwise the returned value is -1. The
+value is a number of characters, not bytes (there may be a difference in UTF-8
+mode). The fourth argument should point to an \fBint\fP variable. A
+non-negative value is a lower bound to the length of any matching string. There
+may not be any strings of that length that do actually match, but every string
+that does match is at least that long.
+.sp
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
@@ -1059,7 +1078,8 @@ variable.
Return the size of the data block pointed to by the \fIstudy_data\fP field in
a \fBpcre_extra\fP block. That is, it is the value that was passed to
\fBpcre_malloc()\fP when PCRE was getting memory into which to place the data
-created by \fBpcre_study()\fP. The fourth argument should point to a
+created by \fBpcre_study()\fP. If \fBpcre_extra\fP is NULL, or there is no
+study data, zero is returned. The fourth argument should point to a
\fBsize_t\fP variable.
.
.
@@ -1121,7 +1141,7 @@ is different. (This seems a highly unlikely scenario.)
.P
The function \fBpcre_exec()\fP is called to match a subject string against a
compiled pattern, which is passed in the \fIcode\fP argument. If the
-pattern has been studied, the result of the study should be passed in the
+pattern was studied, the result of the study should be passed in the
\fIextra\fP argument. This function is the main matching facility of the
library, and it operates in a Perl-like manner. For specialist use there is
also an alternative matching function, which is described
@@ -2023,6 +2043,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 22 September 2009
+Last updated: 26 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 1bfafae..7a6f662 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -371,6 +371,9 @@ recognized:
\eR pass the PCRE_DFA_RESTART option to \fBpcre_dfa_exec()\fP
\eS output details of memory get/free calls during matching
.\" JOIN
+ \eY pass the PCRE_NO_START_OPTIMIZE option to \fBpcre_exec()\fP
+ or \fBpcre_dfa_exec()\fP
+.\" JOIN
\eZ pass the PCRE_NOTEOL option to \fBpcre_exec()\fP
or \fBpcre_dfa_exec()\fP
.\" JOIN
@@ -728,6 +731,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 11 September 2009
+Last updated: 26 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/maint/README b/maint/README
index aa904d6..c2e2282 100644
--- a/maint/README
+++ b/maint/README
@@ -178,15 +178,8 @@ others are relatively new.
o A required byte from alternatives - not just the last char, but an
earlier one if common to all alternatives.
- o Minimum length of subject needed (see also next . bullet).
-
o Friedl contains other ideas.
-. There was a request for a way of finding the minimum subject length that can
- match a given pattern. (If this were available, it could be usefully added
- to study() - see above.) This is easy for simple cases, but I haven't figured
- out how to handle recursion.
-
. If Perl gets to a consistent state over the settings of capturing sub-
patterns inside repeats, see if we can match it. One example of the
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
@@ -308,4 +301,4 @@ others are relatively new.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 September 2009
+Last updated: 26 September 2009
diff --git a/pcre.h.in b/pcre.h.in
index 3b2ab8b..0eecbbf 100644
--- a/pcre.h.in
+++ b/pcre.h.in
@@ -177,6 +177,7 @@ both, so we keep them all distinct. */
#define PCRE_INFO_OKPARTIAL 12
#define PCRE_INFO_JCHANGED 13
#define PCRE_INFO_HASCRORLF 14
+#define PCRE_INFO_MINLENGTH 15
/* Request types for pcre_config(). Do not re-arrange, in order to remain
compatible. */
diff --git a/pcre_compile.c b/pcre_compile.c
index 56c97cb..a60f876 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -1550,7 +1550,9 @@ for (;;)
/* This little function scans through a compiled pattern until it finds a
capturing bracket with the given number, or, if the number is negative, an
-instance of OP_REVERSE for a lookbehind.
+instance of OP_REVERSE for a lookbehind. The function is global in the C sense
+so that it can be called from pcre_study() when finding the minimum matching
+length.
Arguments:
code points to start of expression
@@ -1560,8 +1562,8 @@ Arguments:
Returns: pointer to the opcode for the bracket, or NULL if not found
*/
-static const uschar *
-find_bracket(const uschar *code, BOOL utf8, int number)
+const uschar *
+_pcre_find_bracket(const uschar *code, BOOL utf8, int number)
{
for (;;)
{
@@ -5072,7 +5074,8 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (lengthptr == NULL)
{
*code = OP_END;
- if (recno != 0) called = find_bracket(cd->start_code, utf8, recno);
+ if (recno != 0)
+ called = _pcre_find_bracket(cd->start_code, utf8, recno);
/* Forward reference */
@@ -6582,7 +6585,7 @@ while (errorcode == 0 && cd->hwm > cworkspace)
cd->hwm -= LINK_SIZE;
offset = GET(cd->hwm, 0);
recno = GET(codestart, offset);
- groupptr = find_bracket(codestart, utf8, recno);
+ groupptr = _pcre_find_bracket(codestart, utf8, recno);
if (groupptr == NULL) errorcode = ERR53;
else PUT(((uschar *)codestart), offset, groupptr - codestart);
}
@@ -6609,9 +6612,9 @@ if (cd->check_lookbehind)
of zero, but that is a pathological case, and it does no harm.) When we find
one, we temporarily terminate the branch it is in while we scan it. */
- for (cc = (uschar *)find_bracket(codestart, utf8, -1);
+ for (cc = (uschar *)_pcre_find_bracket(codestart, utf8, -1);
cc != NULL;
- cc = (uschar *)find_bracket(cc, utf8, -1))
+ cc = (uschar *)_pcre_find_bracket(cc, utf8, -1))
{
if (GET(cc, 1) == 0)
{
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index ca32e51..bdb4668 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -2651,7 +2651,7 @@ if (extra_data != NULL)
if ((flags & PCRE_EXTRA_TABLES) != 0)
md->tables = extra_data->tables;
}
-
+
/* Check that the first field in the block is the magic number. If it is not,
test for a regex that was compiled on a host of opposite endianness. If this is
the case, flipped values are put in internal_re and internal_study if there was
@@ -2790,8 +2790,8 @@ if (!anchored)
}
else
{
- if (startline && study != NULL &&
- (study->options & PCRE_STUDY_MAPPED) != 0)
+ if (!startline && study != NULL &&
+ (study->flags & PCRE_STUDY_MAPPED) != 0)
start_bits = study->start_bits;
}
}
@@ -2842,13 +2842,11 @@ for (;;)
}
/* There are some optimizations that avoid running the match if a known
- starting point is not found, or if a known later character is not present.
- However, there is an option that disables these, for testing and for
- ensuring that all callouts do actually occur. */
+ starting point is not found. However, there is an option that disables
+ these, for testing and for ensuring that all callouts do actually occur. */
if ((options & PCRE_NO_START_OPTIMIZE) == 0)
{
-
/* Advance to a known first byte. */
if (first_byte >= 0)
@@ -2914,66 +2912,75 @@ for (;;)
/* Restore fudged end_subject */
end_subject = save_end_subject;
- }
- /* If req_byte is set, we know that that character must appear in the subject
- for the match to succeed. If the first character is set, req_byte must be
- later in the subject; otherwise the test starts at the match point. This
- optimization can save a huge amount of work in patterns with nested unlimited
- repeats that aren't going to match. Writing separate code for cased/caseless
- versions makes it go faster, as does using an autoincrement and backing off
- on a match.
-
- HOWEVER: when the subject string is very, very long, searching to its end can
- take a long time, and give bad performance on quite ordinary patterns. This
- showed up when somebody was matching /^C/ on a 32-megabyte string... so we
- don't do this when the string is sufficiently long.
-
- ALSO: this processing is disabled when partial matching is requested, and can
- also be explicitly deactivated. Furthermore, we have to disable when
- restarting after a partial match, because the required character may have
- already been matched. */
-
- if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
- req_byte >= 0 &&
- end_subject - current_subject < REQ_BYTE_MAX &&
- (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT|PCRE_DFA_RESTART)) == 0)
- {
- register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);
-
- /* We don't need to repeat the search if we haven't yet reached the
- place we found it at last time. */
-
- if (p > req_byte_ptr)
- {
- if (req_byte_caseless)
- {
- while (p < end_subject)
- {
- register int pp = *p++;
- if (pp == req_byte || pp == req_byte2) { p--; break; }
- }
- }
- else
+ /* The following two optimizations are disabled for partial matching or if
+ disabling is explicitly requested (and of course, by the test above, this
+ code is not obeyed when restarting after a partial match). */
+
+ if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
+ (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
+ {
+ /* If the pattern was studied, a minimum subject length may be set. This
+ is a lower bound; no actual string of that length may actually match the
+ pattern. Although the value is, strictly, in characters, we treat it as
+ bytes to avoid spending too much time in this optimization. */
+
+ if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
+ end_subject - current_subject < study->minlength)
+ return PCRE_ERROR_NOMATCH;
+
+ /* If req_byte is set, we know that that character must appear in the
+ subject for the match to succeed. If the first character is set, req_byte
+ must be later in the subject; otherwise the test starts at the match
+ point. This optimization can save a huge amount of work in patterns with
+ nested unlimited repeats that aren't going to match. Writing separate
+ code for cased/caseless versions makes it go faster, as does using an
+ autoincrement and backing off on a match.
+
+ HOWEVER: when the subject string is very, very long, searching to its end
+ can take a long time, and give bad performance on quite ordinary
+ patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
+ string... so we don't do this when the string is sufficiently long. */
+
+ if (req_byte >= 0 && end_subject - current_subject < REQ_BYTE_MAX)
{
- while (p < end_subject)
+ register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);
+
+ /* We don't need to repeat the search if we haven't yet reached the
+ place we found it at last time. */
+
+ if (p > req_byte_ptr)
{
- if (*p++ == req_byte) { p--; break; }
+ if (req_byte_caseless)
+ {
+ while (p < end_subject)
+ {
+ register int pp = *p++;
+ if (pp == req_byte || pp == req_byte2) { p--; break; }
+ }
+ }
+ else
+ {
+ while (p < end_subject)
+ {
+ if (*p++ == req_byte) { p--; break; }
+ }
+ }
+
+ /* If we can't find the required character, break the matching loop,
+ which will cause a return or PCRE_ERROR_NOMATCH. */
+
+ if (p >= end_subject) break;
+
+ /* If we have found the required character, save the point where we
+ found it, so that we don't search again next time round the loop if
+ the start hasn't passed this character yet. */
+
+ req_byte_ptr = p;
}
- }
-
- /* If we can't find the required character, break the matching loop,
- which will cause a return or PCRE_ERROR_NOMATCH. */
-
- if (p >= end_subject) break;
-
- /* If we have found the required character, save the point where we
- found it, so that we don't search again next time round the loop if
- the start hasn't passed this character yet. */
-
- req_byte_ptr = p;
+ }
}
- }
+ } /* End of optimizations that are done when not restarting */
/* OK, now we can do the business */
diff --git a/pcre_exec.c b/pcre_exec.c
index fe741fa..a585481 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -5121,7 +5121,7 @@ if (!anchored)
}
else
if (!startline && study != NULL &&
- (study->options & PCRE_STUDY_MAPPED) != 0)
+ (study->flags & PCRE_STUDY_MAPPED) != 0)
start_bits = study->start_bits;
}
@@ -5247,74 +5247,86 @@ for(;;)
/* Restore fudged end_subject */
end_subject = save_end_subject;
-
-#ifdef DEBUG /* Sigh. Some compilers never learn. */
- printf(">>>> Match against: ");
- pchars(start_match, end_subject - start_match, TRUE, md);
- printf("\n");
-#endif
-
- /* If req_byte is set, we know that that character must appear in the
- subject for the match to succeed. If the first character is set, req_byte
- must be later in the subject; otherwise the test starts at the match point.
- This optimization can save a huge amount of backtracking in patterns with
- nested unlimited repeats that aren't going to match. Writing separate code
- for cased/caseless versions makes it go faster, as does using an
- autoincrement and backing off on a match.
-
- HOWEVER: when the subject string is very, very long, searching to its end
- can take a long time, and give bad performance on quite ordinary patterns.
- This showed up when somebody was matching something like /^\d+C/ on a
- 32-megabyte string... so we don't do this when the string is sufficiently
- long.
-
- ALSO: this processing is disabled when partial matching is requested, or if
+
+ /* The following two optimizations are disabled for partial matching or if
disabling is explicitly requested. */
-
- if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
- req_byte >= 0 &&
- end_subject - start_match < REQ_BYTE_MAX &&
- !md->partial)
- {
- register USPTR p = start_match + ((first_byte >= 0)? 1 : 0);
-
- /* We don't need to repeat the search if we haven't yet reached the
- place we found it at last time. */
-
- if (p > req_byte_ptr)
+
+ if ((options & PCRE_NO_START_OPTIMIZE) == 0 && !md->partial)
+ {
+ /* If the pattern was studied, a minimum subject length may be set. This is
+ a lower bound; no actual string of that length may actually match the
+ pattern. Although the value is, strictly, in characters, we treat it as
+ bytes to avoid spending too much time in this optimization. */
+
+ if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
+ end_subject - start_match < study->minlength)
+ {
+ rc = MATCH_NOMATCH;
+ break;
+ }
+
+ /* If req_byte is set, we know that that character must appear in the
+ subject for the match to succeed. If the first character is set, req_byte
+ must be later in the subject; otherwise the test starts at the match point.
+ This optimization can save a huge amount of backtracking in patterns with
+ nested unlimited repeats that aren't going to match. Writing separate code
+ for cased/caseless versions makes it go faster, as does using an
+ autoincrement and backing off on a match.
+
+ HOWEVER: when the subject string is very, very long, searching to its end
+ can take a long time, and give bad performance on quite ordinary patterns.
+ This showed up when somebody was matching something like /^\d+C/ on a
+ 32-megabyte string... so we don't do this when the string is sufficiently
+ long. */
+
+ if (req_byte >= 0 && end_subject - start_match < REQ_BYTE_MAX)
{
- if (req_byte_caseless)
+ register USPTR p = start_match + ((first_byte >= 0)? 1 : 0);
+
+ /* We don't need to repeat the search if we haven't yet reached the
+ place we found it at last time. */
+
+ if (p > req_byte_ptr)
{
- while (p < end_subject)
+ if (req_byte_caseless)
{
- register int pp = *p++;
- if (pp == req_byte || pp == req_byte2) { p--; break; }
+ while (p < end_subject)
+ {
+ register int pp = *p++;
+ if (pp == req_byte || pp == req_byte2) { p--; break; }
+ }
}
- }
- else
- {
- while (p < end_subject)
+ else
{
- if (*p++ == req_byte) { p--; break; }
+ while (p < end_subject)
+ {
+ if (*p++ == req_byte) { p--; break; }
+ }
}
+
+ /* If we can't find the required character, break the matching loop,
+ forcing a match failure. */
+
+ if (p >= end_subject)
+ {
+ rc = MATCH_NOMATCH;
+ break;
+ }
+
+ /* If we have found the required character, save the point where we
+ found it, so that we don't search again next time round the loop if
+ the start hasn't passed this character yet. */
+
+ req_byte_ptr = p;
}
-
- /* If we can't find the required character, break the matching loop,
- forcing a match failure. */
-
- if (p >= end_subject)
- {
- rc = MATCH_NOMATCH;
- break;
- }
-
- /* If we have found the required character, save the point where we
- found it, so that we don't search again next time round the loop if
- the start hasn't passed this character yet. */
-
- req_byte_ptr = p;
}
- }
+ }
+
+#ifdef DEBUG /* Sigh. Some compilers never learn. */
+ printf(">>>> Match against: ");
+ pchars(start_match, end_subject - start_match, TRUE, md);
+ printf("\n");
+#endif
/* OK, we can now run the match. If "hitend" is set afterwards, remember the
first starting point for which a partial match was found. */
diff --git a/pcre_fullinfo.c b/pcre_fullinfo.c
index af39c44..120f1ef 100644
--- a/pcre_fullinfo.c
+++ b/pcre_fullinfo.c
@@ -119,9 +119,15 @@ switch (what)
case PCRE_INFO_FIRSTTABLE:
*((const uschar **)where) =
- (study != NULL && (study->options & PCRE_STUDY_MAPPED) != 0)?
+ (study != NULL && (study->flags & PCRE_STUDY_MAPPED) != 0)?
((const pcre_study_data *)extra_data->study_data)->start_bits : NULL;
break;
+
+ case PCRE_INFO_MINLENGTH:
+ *((int *)where) =
+ (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0)?
+ study->minlength : -1;
+ break;
case PCRE_INFO_LASTLITERAL:
*((int *)where) =
diff --git a/pcre_internal.h b/pcre_internal.h
index db76647..4a8fb51 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -549,6 +549,7 @@ compatibility. */
/* Options for the "extra" block produced by pcre_study(). */
#define PCRE_STUDY_MAPPED 0x01 /* a map of starting chars exists */
+#define PCRE_STUDY_MINLEN 0x02 /* a minimum length field exists */
/* Masks for identifying the public options that are permitted at compile
time, run time, or study time, respectively. */
@@ -1491,7 +1492,7 @@ Because people can now save and re-use compiled patterns, any additions to this
structure should be made at the end, and something earlier (e.g. a new
flag in the options or one of the dummy fields) should indicate that the new
fields are present. Currently PCRE always sets the dummy fields to zero.
-NOTE NOTE NOTE:
+NOTE NOTE NOTE
*/
typedef struct real_pcre {
@@ -1518,8 +1519,9 @@ remark (see NOTE above) about extending this structure applies. */
typedef struct pcre_study_data {
pcre_uint32 size; /* Total that was malloced */
- pcre_uint32 options;
- uschar start_bits[32];
+ pcre_uint32 flags; /* Private flags */
+ uschar start_bits[32]; /* Starting char bits */
+ pcre_uint32 minlength; /* Minimum subject length */
} pcre_study_data;
/* Structure for building a chain of open capturing subpatterns during
@@ -1722,15 +1724,16 @@ extern const uschar _pcre_OP_lengths[];
one of the exported public functions. They have to be "external" in the C
sense, but are not part of the PCRE public API. */
-extern BOOL _pcre_is_newline(const uschar *, int, const uschar *,
- int *, BOOL);
-extern int _pcre_ord2utf8(int, uschar *);
-extern real_pcre *_pcre_try_flipped(const real_pcre *, real_pcre *,
- const pcre_study_data *, pcre_study_data *);
-extern int _pcre_valid_utf8(const uschar *, int);
-extern BOOL _pcre_was_newline(const uschar *, int, const uschar *,
- int *, BOOL);
-extern BOOL _pcre_xclass(int, const uschar *);
+extern const uschar *_pcre_find_bracket(const uschar *, BOOL, int);
+extern BOOL _pcre_is_newline(const uschar *, int, const uschar *,
+ int *, BOOL);
+extern int _pcre_ord2utf8(int, uschar *);
+extern real_pcre *_pcre_try_flipped(const real_pcre *, real_pcre *,
+ const pcre_study_data *, pcre_study_data *);
+extern int _pcre_valid_utf8(const uschar *, int);
+extern BOOL _pcre_was_newline(const uschar *, int, const uschar *,
+ int *, BOOL);
+extern BOOL _pcre_xclass(int, const uschar *);
/* Unicode character database (UCD) */
diff --git a/pcre_study.c b/pcre_study.c
index 778851d..29f5482 100644
--- a/pcre_study.c
+++ b/pcre_study.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2008 University of Cambridge
+ Copyright (c) 1997-2009 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -54,6 +54,354 @@ supporting functions. */
enum { SSB_FAIL, SSB_DONE, SSB_CONTINUE };
+
+/*************************************************
+* Find the minimum subject length for a group *
+*************************************************/
+
+/* Scan a parenthesized group and compute the minimum length of subject that
+is needed to match it. This is a lower bound; it does not mean there is a
+string of that length that matches. In UTF8 mode, the result is in characters
+rather than bytes.
+
+Arguments:
+ code pointer to start of group (the bracket)
+ startcode pointer to start of the whole pattern
+ options the compiling options
+
+Returns: the minimum length
+ -1 if \C was encountered
+ -2 internal error (missing capturing bracket)
+*/
+
+static int
+find_minlength(const uschar *code, const uschar *startcode, int options)
+{
+int length = -1;
+BOOL utf8 = (options & PCRE_UTF8) != 0;
+BOOL had_recurse = FALSE;
+register int branchlength = 0;
+register uschar *cc = (uschar *)code + 1 + LINK_SIZE;
+
+if (*code == OP_CBRA || *code == OP_SCBRA) cc += 2;
+
+/* Scan along the opcodes for this branch. If we get to the end of the
+branch, check the length against that of the other branches. */
+
+for (;;)
+ {
+ int d, min;
+ uschar *cs, *ce;
+ register int op = *cc;
+
+ switch (op)
+ {
+ case OP_CBRA:
+ case OP_SCBRA:
+ case OP_BRA:
+ case OP_SBRA:
+ case OP_ONCE:
+ case OP_COND:
+ case OP_SCOND:
+ d = find_minlength(cc, startcode, options);
+ if (d < 0) return d;
+ branchlength += d;
+ do cc += GET(cc, 1); while (*cc == OP_ALT);
+ cc += 1 + LINK_SIZE;
+ break;
+
+ /* Reached end of a branch; if it's a ket it is the end of a nested
+ call. If it's ALT it is an alternation in a nested call. If it is
+ END it's the end of the outer call. All can be handled by the same code. */
+
+ case OP_ALT:
+ case OP_KET:
+ case OP_KETRMAX:
+ case OP_KETRMIN:
+ case OP_END:
+ if (length < 0 || (!had_recurse && branchlength < length))
+ length = branchlength;
+ if (*cc != OP_ALT) return length;
+ cc += 1 + LINK_SIZE;
+ branchlength = 0;
+ had_recurse = FALSE;
+ break;
+
+ /* Skip over assertive subpatterns */
+
+ case OP_ASSERT:
+ case OP_ASSERT_NOT:
+ case OP_ASSERTBACK:
+ case OP_ASSERTBACK_NOT:
+ do cc += GET(cc, 1); while (*cc == OP_ALT);
+ /* Fall through */
+
+ /* Skip over things that don't match chars */
+
+ case OP_REVERSE:
+ case OP_CREF:
+ case OP_RREF:
+ case OP_DEF:
+ case OP_OPT:
+ case OP_CALLOUT:
+ case OP_SOD:
+ case OP_SOM:
+ case OP_EOD:
+ case OP_EODN:
+ case OP_CIRC:
+ case OP_DOLL:
+ case OP_NOT_WORD_BOUNDARY:
+ case OP_WORD_BOUNDARY:
+ cc += _pcre_OP_lengths[*cc];
+ break;
+
+ /* Skip over a subpattern that has a {0} or {0,x} quantifier */
+
+ case OP_BRAZERO:
+ case OP_BRAMINZERO:
+ case OP_SKIPZERO:
+ cc += _pcre_OP_lengths[*cc];
+ do cc += GET(cc, 1); while (*cc == OP_ALT);
+ cc += 1 + LINK_SIZE;
+ break;
+
+ /* Handle literal characters and + repetitions */
+
+ case OP_CHAR:
+ case OP_CHARNC:
+ case OP_NOT:
+ case OP_PLUS:
+ case OP_MINPLUS:
+ case OP_POSPLUS:
+ case OP_NOTPLUS:
+ case OP_NOTMINPLUS:
+ case OP_NOTPOSPLUS:
+ branchlength++;
+ cc += 2;
+#ifdef SUPPORT_UTF8
+ if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
+#endif
+ break;
+
+ case OP_TYPEPLUS:
+ case OP_TYPEMINPLUS:
+ case OP_TYPEPOSPLUS:
+ branchlength++;
+ cc += (cc[1] == OP_PROP || cc[1] == OP_NOTPROP)? 4 : 2;
+ break;
+
+ /* Handle exact repetitions. The count is already in characters, but we
+ need to skip over a multibyte character in UTF8 mode. */
+
+ case OP_EXACT:
+ case OP_NOTEXACT:
+ branchlength += GET2(cc,1);
+ cc += 4;
+#ifdef SUPPORT_UTF8
+ if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
+#endif
+ break;
+
+ case OP_TYPEEXACT:
+ branchlength += GET2(cc,1);
+ cc += (cc[3] == OP_PROP || cc[3] == OP_NOTPROP)? 6 : 4;
+ break;
+
+ /* Handle single-char non-literal matchers */
+
+ case OP_PROP:
+ case OP_NOTPROP:
+ cc += 2;
+ /* Fall through */
+
+ case OP_NOT_DIGIT:
+ case OP_DIGIT:
+ case OP_NOT_WHITESPACE:
+ case OP_WHITESPACE:
+ case OP_NOT_WORDCHAR:
+ case OP_WORDCHAR:
+ case OP_ANY:
+ case OP_ALLANY:
+ case OP_EXTUNI:
+ case OP_HSPACE:
+ case OP_NOT_HSPACE:
+ case OP_VSPACE:
+ case OP_NOT_VSPACE:
+ branchlength++;
+ cc++;
+ break;
+
+ /* "Any newline" might match two characters */
+
+ case OP_ANYNL:
+ branchlength += 2;
+ cc++;
+ break;
+
+ /* The single-byte matcher means we can't proceed in UTF-8 mode */
+
+ case OP_ANYBYTE:
+#ifdef SUPPORT_UTF8
+ if (utf8) return -1;
+#endif
+ branchlength++;
+ cc++;
+ break;
+
+ /* For repeated character types, we have to test for \p and \P, which have
+ an extra two bytes of parameters. */
+
+ case OP_TYPESTAR:
+ case OP_TYPEMINSTAR:
+ case OP_TYPEQUERY:
+ case OP_TYPEMINQUERY:
+ case OP_TYPEPOSSTAR:
+ case OP_TYPEPOSQUERY:
+ if (cc[1] == OP_PROP || cc[1] == OP_NOTPROP) cc += 2;
+ cc += _pcre_OP_lengths[op];
+ break;
+
+ case OP_TYPEUPTO:
+ case OP_TYPEMINUPTO:
+ case OP_TYPEPOSUPTO:
+ if (cc[3] == OP_PROP || cc[3] == OP_NOTPROP) cc += 2;
+ cc += _pcre_OP_lengths[op];
+ break;
+
+ /* Check a class for variable quantification */
+
+#ifdef SUPPORT_UTF8
+ case OP_XCLASS:
+ cc += GET(cc, 1) - 33;
+ /* Fall through */
+#endif
+
+ case OP_CLASS:
+ case OP_NCLASS:
+ cc += 33;
+
+ switch (*cc)
+ {
+ case OP_CRPLUS:
+ case OP_CRMINPLUS:
+ branchlength++;
+ /* Fall through */
+
+ case OP_CRSTAR:
+ case OP_CRMINSTAR:
+ case OP_CRQUERY:
+ case OP_CRMINQUERY:
+ cc++;
+ break;
+
+ case OP_CRRANGE:
+ case OP_CRMINRANGE:
+ branchlength += GET2(cc,1);
+ cc += 5;
+ break;
+
+ default:
+ branchlength++;
+ break;
+ }
+ break;
+
+ /* Backreferences and subroutine calls are treated in the same way: we find
+ the minimum length for the subpattern. A recursion, however, causes an
+ a flag to be set that causes the length of this branch to be ignored. The
+ logic is that a recursion can only make sense if there is another
+ alternation that stops the recursing. That will provide the minimum length
+ (when no recursion happens). A backreference within the group that it is
+ referencing behaves in the same way. */
+
+ case OP_REF:
+ ce = cs = (uschar *)_pcre_find_bracket(startcode, utf8, GET2(cc, 1));
+ if (cs == NULL) return -2;
+ do ce += GET(ce, 1); while (*ce == OP_ALT);
+ if (cc > cs && cc < ce)
+ {
+ d = 0;
+ had_recurse = TRUE;
+ }
+ else d = find_minlength(cs, startcode, options);
+ cc += 3;
+
+ /* Handle repeated back references */
+
+ switch (*cc)
+ {
+ case OP_CRSTAR:
+ case OP_CRMINSTAR:
+ case OP_CRQUERY:
+ case OP_CRMINQUERY:
+ min = 0;
+ cc++;
+ break;
+
+ case OP_CRRANGE:
+ case OP_CRMINRANGE:
+ min = GET2(cc, 1);
+ cc += 5;
+ break;
+
+ default:
+ min = 1;
+ break;
+ }
+
+ branchlength += min * d;
+ break;
+
+ case OP_RECURSE:
+ cs = ce = (uschar *)startcode + GET(cc, 1);
+ if (cs == NULL) return -2;
+ do ce += GET(ce, 1); while (*ce == OP_ALT);
+ if (cc > cs && cc < ce)
+ had_recurse = TRUE;
+ else
+ branchlength += find_minlength(cs, startcode, options);
+ cc += 1 + LINK_SIZE;
+ break;
+
+ /* Anything else does not or need not match a character. We can get the
+ item's length from the table, but for those that can match zero occurrences
+ of a character, we must take special action for UTF-8 characters. */
+
+ case OP_UPTO:
+ case OP_NOTUPTO:
+ case OP_MINUPTO:
+ case OP_NOTMINUPTO:
+ case OP_POSUPTO:
+ case OP_STAR:
+ case OP_MINSTAR:
+ case OP_NOTMINSTAR:
+ case OP_POSSTAR:
+ case OP_NOTPOSSTAR:
+ case OP_QUERY:
+ case OP_MINQUERY:
+ case OP_NOTMINQUERY:
+ case OP_POSQUERY:
+ case OP_NOTPOSQUERY:
+ cc += _pcre_OP_lengths[op];
+#ifdef SUPPORT_UTF8
+ if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
+#endif
+ break;
+
+ /* For the record, these are the opcodes that are matched by "default":
+ OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
+ OP_THEN. */
+
+ default:
+ cc += _pcre_OP_lengths[op];
+ break;
+ }
+ }
+/* Control never gets here */
+}
+
+
+
/*************************************************
* Set a bit and maybe its alternate case *
*************************************************/
@@ -500,13 +848,15 @@ Arguments:
set NULL unless error
Returns: pointer to a pcre_extra block, with study_data filled in and the
- appropriate flag set;
+ appropriate flags set;
NULL on error or if no optimization possible
*/
PCRE_EXP_DEFN pcre_extra * PCRE_CALL_CONVENTION
pcre_study(const pcre *external_re, int options, const char **errorptr)
{
+int min;
+BOOL bits_set = FALSE;
uschar start_bits[32];
pcre_extra *extra;
pcre_study_data *study;
@@ -533,30 +883,39 @@ code = (uschar *)re + re->name_table_offset +
(re->name_count * re->name_entry_size);
/* For an anchored pattern, or an unanchored pattern that has a first char, or
-a multiline pattern that matches only at "line starts", no further processing
-at present. */
-
-if ((re->options & PCRE_ANCHORED) != 0 ||
- (re->flags & (PCRE_FIRSTSET|PCRE_STARTLINE)) != 0)
- return NULL;
-
-/* Set the character tables in the block that is passed around */
+a multiline pattern that matches only at "line starts", there is no point in
+seeking a list of starting bytes. */
-tables = re->tables;
-if (tables == NULL)
- (void)pcre_fullinfo(external_re, NULL, PCRE_INFO_DEFAULT_TABLES,
- (void *)(&tables));
+if ((re->options & PCRE_ANCHORED) == 0 &&
+ (re->flags & (PCRE_FIRSTSET|PCRE_STARTLINE)) == 0)
+ {
+ /* Set the character tables in the block that is passed around */
+
+ tables = re->tables;
+ if (tables == NULL)
+ (void)pcre_fullinfo(external_re, NULL, PCRE_INFO_DEFAULT_TABLES,
+ (void *)(&tables));
+
+ compile_block.lcc = tables + lcc_offset;
+ compile_block.fcc = tables + fcc_offset;
+ compile_block.cbits = tables + cbits_offset;
+ compile_block.ctypes = tables + ctypes_offset;
+
+ /* See if we can find a fixed set of initial characters for the pattern. */
+
+ memset(start_bits, 0, 32 * sizeof(uschar));
+ bits_set = set_start_bits(code, start_bits,
+ (re->options & PCRE_CASELESS) != 0, (re->options & PCRE_UTF8) != 0,
+ &compile_block) == SSB_DONE;
+ }
+
+/* Find the minimum length of subject string. */
-compile_block.lcc = tables + lcc_offset;
-compile_block.fcc = tables + fcc_offset;
-compile_block.cbits = tables + cbits_offset;
-compile_block.ctypes = tables + ctypes_offset;
+min = find_minlength(code, code, re->options);
-/* See if we can find a fixed set of initial characters for the pattern. */
+/* Return NULL if no optimization is possible. */
-memset(start_bits, 0, 32 * sizeof(uschar));
-if (set_start_bits(code, start_bits, (re->options & PCRE_CASELESS) != 0,
- (re->options & PCRE_UTF8) != 0, &compile_block) != SSB_DONE) return NULL;
+if (!bits_set && min < 0) return NULL;
/* Get a pcre_extra block and a pcre_study_data block. The study data is put in
the latter, which is pointed to by the former, which may also get additional
@@ -579,8 +938,19 @@ extra->flags = PCRE_EXTRA_STUDY_DATA;
extra->study_data = study;
study->size = sizeof(pcre_study_data);
-study->options = PCRE_STUDY_MAPPED;
-memcpy(study->start_bits, start_bits, sizeof(start_bits));
+study->flags = 0;
+
+if (bits_set)
+ {
+ study->flags |= PCRE_STUDY_MAPPED;
+ memcpy(study->start_bits, start_bits, sizeof(start_bits));
+ }
+
+if (min >= 0)
+ {
+ study->flags |= PCRE_STUDY_MINLEN;
+ study->minlength = min;
+ }
return extra;
}
diff --git a/pcre_try_flipped.c b/pcre_try_flipped.c
index 0d2f3a2..49887a5 100644
--- a/pcre_try_flipped.c
+++ b/pcre_try_flipped.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2008 University of Cambridge
+ Copyright (c) 1997-2009 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -128,7 +128,9 @@ if (study != NULL)
{
*internal_study = *study; /* To copy other fields */
internal_study->size = byteflip(study->size, sizeof(study->size));
- internal_study->options = byteflip(study->options, sizeof(study->options));
+ internal_study->flags = byteflip(study->flags, sizeof(study->flags));
+ internal_study->minlength = byteflip(study->minlength,
+ sizeof(study->minlength));
}
return internal_re;
diff --git a/pcretest.c b/pcretest.c
index 08aa383..25080b9 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -1451,7 +1451,8 @@ while (!done)
{
pcre_study_data *rsd = (pcre_study_data *)(extra->study_data);
rsd->size = byteflip(rsd->size, sizeof(rsd->size));
- rsd->options = byteflip(rsd->options, sizeof(rsd->options));
+ rsd->flags = byteflip(rsd->flags, sizeof(rsd->flags));
+ rsd->minlength = byteflip(rsd->minlength, sizeof(rsd->minlength));
}
}
@@ -1628,10 +1629,14 @@ while (!done)
else
{
uschar *start_bits = NULL;
+ int minlength;
+
+ new_info(re, extra, PCRE_INFO_MINLENGTH, &minlength);
+ fprintf(outfile, "Subject length lower bound = %d\n", minlength);
+
new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits);
-
if (start_bits == NULL)
- fprintf(outfile, "No starting byte set\n");
+ fprintf(outfile, "No set of starting bytes\n");
else
{
int i;
@@ -2150,7 +2155,7 @@ while (!done)
{
int workspace[1000];
for (i = 0; i < timeitm; i++)
- count = pcre_dfa_exec(re, NULL, (char *)bptr, len, start_offset,
+ count = pcre_dfa_exec(re, extra, (char *)bptr, len, start_offset,
options | g_notempty, use_offsets, use_size_offsets, workspace,
sizeof(workspace)/sizeof(int));
}
@@ -2213,7 +2218,7 @@ while (!done)
else if (all_use_dfa || use_dfa)
{
int workspace[1000];
- count = pcre_dfa_exec(re, NULL, (char *)bptr, len, start_offset,
+ count = pcre_dfa_exec(re, extra, (char *)bptr, len, start_offset,
options | g_notempty, use_offsets, use_size_offsets, workspace,
sizeof(workspace)/sizeof(int));
if (count == 0)
diff --git a/testdata/testinput2 b/testdata/testinput2
index 5561e82..d64bcf8 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -2861,4 +2861,241 @@ a random value. /Ix
/(?<=b(?1))xyz(b+)pqrstuvew/
+/(a|bc)\1/SI
+
+/(a|bc)\1{2,3}/SI
+
+/(a|bc)(?1)/SI
+
+/(a|b\1)(a|b\1)/SI
+
+/(a|b\1){2}/SI
+
+/(a|bbbb\1)(a|bbbb\1)/SI
+
+/(a|bbbb\1){2}/SI
+
+/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/SI
+
+/ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* # optional leading comment
+(?: (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) # initial word
+(?: (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) )* # further okay, if led by a period
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+# address
+| # or
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) # one word, optionally followed by....
+(?:
+[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or...
+\(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) | # comments, or...
+
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+# quoted strings
+)*
+< (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* # leading <
+(?: @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+
+(?: (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* , (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+)* # further okay, if led by comma
+: # closing colon
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* )? # optional route
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) # initial word
+(?: (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) )* # further okay, if led by a period
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+# address spec
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* > # trailing >
+# name and address
+) (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* # optional trailing comment
+/xSI
+
+/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/isIS
+
+"(?>.*/)foo"SI
+
+/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /xSI
+
+/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/iSI
+
+/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/SI
+
+/<a[\s]+href[\s]*=[\s]* # find <a href=
+ ([\"\'])? # find single or double quote
+ (?(1) (.*?)\1 | ([^\s]+)) # if quote found, match up to next matching
+ # quote, otherwise match up to next space
+/isxSI
+
+/^(?!:) # colon disallowed at start
+ (?: # start of item
+ (?: [0-9a-f]{1,4} | # 1-4 hex digits or
+ (?(1)0 | () ) ) # if null previously matched, fail; else null
+ : # followed by colon
+ ){1,7} # end item; 1-7 of them required
+ [0-9a-f]{1,4} $ # final hex number at end of string
+ (?(1)|.) # check that there was an empty component
+ /xiIS
+
/-- End of testinput2 --/
diff --git a/testdata/testinput7 b/testdata/testinput7
index 7aafc95..f921835 100644
--- a/testdata/testinput7
+++ b/testdata/testinput7
@@ -4489,4 +4489,22 @@
/(?=C)/g+
ABCDECBA
+/(abc|def|xyz)/I
+ terhjk;abcdaadsfe
+ the quick xyz brown fox
+ \Yterhjk;abcdaadsfe
+ \Ythe quick xyz brown fox
+ ** Failers
+ thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+ \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+
+/(abc|def|xyz)/SI
+ terhjk;abcdaadsfe
+ the quick xyz brown fox
+ \Yterhjk;abcdaadsfe
+ \Ythe quick xyz brown fox
+ ** Failers
+ thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+ \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+
/-- End of testinput7 --/
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index ab6a7c4..85523dd 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -145,6 +145,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 3
Starting byte set: c d e
this sentence eventually mentions a cat
0: cat
@@ -156,6 +157,7 @@ Capturing subpattern count = 0
Options: caseless
No first char
No need char
+Subject length lower bound = 3
Starting byte set: C D E c d e
this sentence eventually mentions a CAT cat
0: CAT
@@ -167,6 +169,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d
/(a|[^\dZ])/IS
@@ -174,6 +177,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = >
@@ -194,6 +198,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x09 \x0a \x0c \x0d \x20 a b
/(ab\2)/
@@ -527,6 +532,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d
/(?i)[abcd]/IS
@@ -534,6 +540,7 @@ Capturing subpattern count = 0
Options: caseless
No first char
No need char
+Subject length lower bound = 1
Starting byte set: A B C D a b c d
/(?m)[xy]|(b|c)/IS
@@ -541,6 +548,7 @@ Capturing subpattern count = 1
Options: multiline
No first char
No need char
+Subject length lower bound = 1
Starting byte set: b c x y
/(^a|^b)/Im
@@ -605,13 +613,15 @@ Capturing subpattern count = 1
No options
First char = 'b' (caseless)
No need char
-Study returned NULL
+Subject length lower bound = 1
+No set of starting bytes
/(a*b|(?i:c*(?-i)d))/IS
Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: C a b c d
/a$/I
@@ -676,6 +686,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b
/(?<!foo)(alpha|omega)/IS
@@ -683,6 +694,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'a'
+Subject length lower bound = 5
Starting byte set: a o
/(?!alphabet)[ab]/IS
@@ -690,6 +702,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b
/(?<=foo\n)^bar/Im
@@ -1658,7 +1671,8 @@ Capturing subpattern count = 0
Options: anchored
No first char
Need char = 'd'
-Study returned NULL
+Subject length lower bound = 4
+No set of starting bytes
/\( # ( at start
(?: # Non-capturing bracket
@@ -1890,6 +1904,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
@@ -1951,6 +1966,7 @@ Contains explicit CR or LF match
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20
/^[[:cntrl:]]/DZ
@@ -3421,6 +3437,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b
/[^a]/I
@@ -3440,6 +3457,7 @@ Capturing subpattern count = 0
No options
No first char
Need char = '6'
+Subject length lower bound = 4
Starting byte set: 0 1 2 3 4 5 6 7 8 9
/a^b/I
@@ -3473,6 +3491,7 @@ Capturing subpattern count = 0
Options: caseless
No first char
No need char
+Subject length lower bound = 1
Starting byte set: A B a b
/[ab](?i)cd/IS
@@ -3480,6 +3499,7 @@ Capturing subpattern count = 0
No options
No first char
Need char = 'd' (caseless)
+Subject length lower bound = 3
Starting byte set: a b
/abc(?C)def/I
@@ -3830,6 +3850,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b
/(?R)/I
@@ -4614,7 +4635,8 @@ Capturing subpattern count = 3
Options: caseless
No first char
Need char = 'g' (caseless)
-Study returned NULL
+Subject length lower bound = 8
+No set of starting bytes
Baby Bjorn Active Carrier - With free SHIPPING!!
0: Baby Bjorn Active Carrier - With free SHIPPING!!
1: Baby Bjorn Active Carrier - With free SHIPPING!!
@@ -4632,7 +4654,8 @@ Capturing subpattern count = 0
No options
No first char
Need char = 'b'
-Study returned NULL
+Subject length lower bound = 1
+No set of starting bytes
/(a|b)*.?c/ISDZ
------------------------------------------------------------------
@@ -4652,7 +4675,8 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'c'
-Study returned NULL
+Subject length lower bound = 1
+No set of starting bytes
/abc(?C255)de(?C)f/DZ
------------------------------------------------------------------
@@ -5435,6 +5459,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b
Compiled regex written to testsavedregex
Study data written to testsavedregex
@@ -5455,6 +5480,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b
Compiled regex written to testsavedregex
Study data written to testsavedregex
@@ -6167,6 +6193,7 @@ Capturing subpattern count = 0
No options
No first char
Need char = ','
+Subject length lower bound = 1
Starting byte set: \x09 \x0a \x0c \x0d \x20 ,
\x0b,\x0b
0: ,
@@ -6471,6 +6498,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: C a b c d
/()[ab]xyz/IS
@@ -6478,6 +6506,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b
/(|)[ab]xyz/IS
@@ -6485,6 +6514,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b
/(|c)[ab]xyz/IS
@@ -6492,6 +6522,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c
/(|c?)[ab]xyz/IS
@@ -6499,6 +6530,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c
/(d?|c?)[ab]xyz/IS
@@ -6506,6 +6538,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c d
/(d?|c)[ab]xyz/IS
@@ -6513,6 +6546,7 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'z'
+Subject length lower bound = 4
Starting byte set: a b c d
/^a*b\d/DZ
@@ -6605,6 +6639,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d
/(a+|b*)[cd]/IS
@@ -6612,6 +6647,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d
/(a*|b+)[cd]/IS
@@ -6619,6 +6655,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a b c d
/(a+|b+)[cd]/IS
@@ -6626,6 +6663,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 2
Starting byte set: a b
/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
@@ -9069,6 +9107,7 @@ Capturing subpattern count = 1
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: x y z
/(?(?=.*b)b|^)/CI
@@ -9838,4 +9877,347 @@ Failed: reference to non-existent subpattern at offset 8
/(?<=b(?1))xyz(b+)pqrstuvew/
Failed: lookbehind assertion is not fixed length at offset 26
+/(a|bc)\1/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b
+
+/(a|bc)\1{2,3}/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 3
+Starting byte set: a b
+
+/(a|bc)(?1)/SI
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b
+
+/(a|b\1)(a|b\1)/SI
+Capturing subpattern count = 2
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b
+
+/(a|b\1){2}/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b
+
+/(a|bbbb\1)(a|bbbb\1)/SI
+Capturing subpattern count = 2
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b
+
+/(a|bbbb\1){2}/SI
+Capturing subpattern count = 1
+Max back reference = 1
+No options
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b
+
+/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/SI
+Capturing subpattern count = 1
+Options: anchored
+No first char
+Need char = ':'
+Subject length lower bound = 22
+No set of starting bytes
+
+/ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* # optional leading comment
+(?: (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) # initial word
+(?: (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) )* # further okay, if led by a period
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+# address
+| # or
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) # one word, optionally followed by....
+(?:
+[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or...
+\(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) | # comments, or...
+
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+# quoted strings
+)*
+< (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* # leading <
+(?: @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+
+(?: (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* , (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+)* # further okay, if led by comma
+: # closing colon
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* )? # optional route
+(?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) # initial word
+(?: (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+|
+" (?: # opening quote...
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote
+| # or
+\\ [^\x80-\xff] # Escaped something (something != CR)
+)* " # closing quote
+) )* # further okay, if led by a period
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* @ (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # initial subdomain
+(?: #
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* \. # if led by a period...
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* (?:
+[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters...
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom
+| \[ # [
+(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff
+\] # ]
+) # ...further okay
+)*
+# address spec
+(?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* > # trailing >
+# name and address
+) (?: [\040\t] | \(
+(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )*
+\) )* # optional trailing comment
+/xSI
+Capturing subpattern count = 0
+Contains explicit CR or LF match
+Options: extended
+No first char
+No need char
+Subject length lower bound = 3
+Starting byte set: \x09 \x20 ! " # $ % & ' ( * + - / 0 1 2 3 4 5 6 7 8
+ 9 = ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ ` a b c d e
+ f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f
+
+/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/isIS
+Capturing subpattern count = 11
+Options: caseless dotall
+First char = '<'
+Need char = '>'
+Subject length lower bound = 47
+No set of starting bytes
+
+"(?>.*/)foo"SI
+Capturing subpattern count = 0
+No options
+First char at start or follows newline
+Need char = 'o'
+Subject length lower bound = 4
+No set of starting bytes
+
+/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /xSI
+Capturing subpattern count = 0
+Options: extended
+No first char
+Need char = '-'
+Subject length lower bound = 8
+No set of starting bytes
+
+/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/iSI
+Capturing subpattern count = 1
+Options: caseless
+No first char
+No need char
+Subject length lower bound = 1
+Starting byte set: A B C a b c
+
+/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/SI
+Capturing subpattern count = 0
+No options
+No first char
+Need char = 'b'
+Subject length lower bound = 41
+Starting byte set: c d
+
+/<a[\s]+href[\s]*=[\s]* # find <a href=
+ ([\"\'])? # find single or double quote
+ (?(1) (.*?)\1 | ([^\s]+)) # if quote found, match up to next matching
+ # quote, otherwise match up to next space
+/isxSI
+Capturing subpattern count = 3
+Max back reference = 1
+Options: caseless extended dotall
+First char = '<'
+Need char = '='
+Subject length lower bound = 9
+No set of starting bytes
+
+/^(?!:) # colon disallowed at start
+ (?: # start of item
+ (?: [0-9a-f]{1,4} | # 1-4 hex digits or
+ (?(1)0 | () ) ) # if null previously matched, fail; else null
+ : # followed by colon
+ ){1,7} # end item; 1-7 of them required
+ [0-9a-f]{1,4} $ # final hex number at end of string
+ (?(1)|.) # check that there was an empty component
+ /xiIS
+Capturing subpattern count = 1
+Options: anchored caseless extended
+No first char
+Need char = ':'
+Subject length lower bound = 2
+No set of starting bytes
+
/-- End of testinput2 --/
diff --git a/testdata/testoutput3 b/testdata/testoutput3
index ce18ed2..7b0a3e9 100644
--- a/testdata/testoutput3
+++ b/testdata/testoutput3
@@ -87,6 +87,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
@@ -95,6 +96,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
+Subject length lower bound = 1
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index ded5edb..1aaa5be 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -351,6 +351,7 @@ Capturing subpattern count = 0
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4
@@ -388,7 +389,8 @@ Capturing subpattern count = 0
Options: utf8
First char = 196
Need char = 128
-Study returned NULL
+Subject length lower bound = 3
+No set of starting bytes
\x{100}\x{100}\x{100}\x{100\x{100}
0: \x{100}\x{100}\x{100}
@@ -407,6 +409,7 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: x \xc4
/(\x{100}*a|x)/8SDZ
@@ -425,6 +428,7 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a x \xc4
/(\x{100}{0,2}a|x)/8SDZ
@@ -443,6 +447,7 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: a x \xc4
/(\x{100}{1,2}a|x)/8SDZ
@@ -462,6 +467,7 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
+Subject length lower bound = 1
Starting byte set: x \xc4
/\x{100}*(\d+|"(?1)")/8
diff --git a/testdata/testoutput7 b/testdata/testoutput7
index b9f9bb0..8cac766 100644
--- a/testdata/testoutput7
+++ b/testdata/testoutput7
@@ -7472,4 +7472,46 @@ Partial match: +ab
0:
0+ CBA
+/(abc|def|xyz)/I
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+ terhjk;abcdaadsfe
+ 0: abc
+ the quick xyz brown fox
+ 0: xyz
+ \Yterhjk;abcdaadsfe
+ 0: abc
+ \Ythe quick xyz brown fox
+ 0: xyz
+ ** Failers
+No match
+ thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+ \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+
+/(abc|def|xyz)/SI
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+Subject length lower bound = 3
+Starting byte set: a d x
+ terhjk;abcdaadsfe
+ 0: abc
+ the quick xyz brown fox
+ 0: xyz
+ \Yterhjk;abcdaadsfe
+ 0: abc
+ \Ythe quick xyz brown fox
+ 0: xyz
+ ** Failers
+No match
+ thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+ \Ythejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
+No match
+
/-- End of testinput7 --/