summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2007-06-13 08:44:34 +0000
committerph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2007-06-13 08:44:34 +0000
commitcc7768ebfd86be3e6126739dd2c66c76bc197ca1 (patch)
treefae05267b3db5a2d5909f322dd4b2cfc2c2c3a80
parent9e1e6e4433ecff8a263222322ff99e0d96f8efe5 (diff)
downloadpcre-cc7768ebfd86be3e6126739dd2c66c76bc197ca1.tar.gz
Add support for \h, \H, \v, \V.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@178 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--ChangeLog2
-rw-r--r--NEWS2
-rw-r--r--doc/pcrepattern.362
-rw-r--r--pcre_compile.c150
-rw-r--r--pcre_dfa_exec.c414
-rw-r--r--pcre_exec.c502
-rw-r--r--pcre_internal.h205
-rw-r--r--testdata/testinput14
-rw-r--r--testdata/testinput251
-rw-r--r--testdata/testinput538
-rw-r--r--testdata/testinput735
-rw-r--r--testdata/testinput838
-rw-r--r--testdata/testoutput16
-rw-r--r--testdata/testoutput2114
-rw-r--r--testdata/testoutput588
-rw-r--r--testdata/testoutput760
-rw-r--r--testdata/testoutput866
17 files changed, 1694 insertions, 143 deletions
diff --git a/ChangeLog b/ChangeLog
index 89e29a8..d1052c5 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -43,6 +43,8 @@ Version 7.2 05-June-07
(g) (?| introduces a group in which the numbering of parentheses in each
alternative starts with the same number.
+
+ (h) \h, \H, \v, and \V match horizontal and vertical whitespace.
7. Added two new calls to pcre_fullinfo(): PCRE_INFO_OKPARTIAL and
PCRE_INFO_JCHANGED.
diff --git a/NEWS b/NEWS
index a82f2a1..f585cb5 100644
--- a/NEWS
+++ b/NEWS
@@ -29,6 +29,8 @@ Some more features from Perl 5.10 have been added:
(?| introduces a group where the capturing parentheses in each alternative
start from the same number; for example, (?|(abc)|(xyz)) sets capturing
parentheses number 1 in both cases.
+
+ \h, \H, \v, \V match horizontal and vertical whitespace, respectively.
Release 7.1 24-Apr-07
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index e760623..1acb7d7 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -260,10 +260,14 @@ parenthesized subpatterns.
Another use of backslash is for specifying generic character types. The
following are always recognized:
.sp
- \ed any decimal digit
+ \ed any decimal digit
\eD any character that is not a decimal digit
+ \eh any horizontal whitespace character
+ \eH any character that is not a horizontal whitespace character
\es any whitespace character
\eS any character that is not a whitespace character
+ \ev any vertical whitespace character
+ \eV any character that is not a vertical whitespace character
\ew any "word" character
\eW any "non-word" character
.sp
@@ -277,9 +281,49 @@ there is no character to match.
.P
For compatibility with Perl, \es does not match the VT character (code 11).
This makes it different from the the POSIX "space" class. The \es characters
-are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is
+are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
included in a Perl script, \es may match the VT character. In PCRE, it never
-does.)
+does.
+.P
+In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or
+\ew, and always match \eD, \eS, and \eW. This is true even when Unicode
+character property support is available. These sequences retain their original
+meanings from before UTF-8 support was available, mainly for efficiency
+reasons.
+.P
+The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
+other sequences, these do match certain high-valued codepoints in UTF-8 mode.
+The horizontal space characters are:
+.sp
+ U+0009 Horizontal tab
+ U+0020 Space
+ U+00A0 Non-break space
+ U+1680 Ogham space mark
+ U+180E Mongolian vowel separator
+ U+2000 En quad
+ U+2001 Em quad
+ U+2002 En space
+ U+2003 Em space
+ U+2004 Three-per-em space
+ U+2005 Four-per-em space
+ U+2006 Six-per-em space
+ U+2007 Figure space
+ U+2008 Punctuation space
+ U+2009 Thin space
+ U+200A Hair space
+ U+202F Narrow no-break space
+ U+205F Medium mathematical space
+ U+3000 Ideographic space
+.sp
+The vertical space characters are:
+.sp
+ U+000A Linefeed
+ U+000B Vertical tab
+ U+000C Formfeed
+ U+000D Carriage return
+ U+0085 Next line
+ U+2028 Line separator
+ U+2029 Paragraph separator
.P
A "word" character is an underscore or any character less than 256 that is a
letter or digit. The definition of letters and digits is controlled by PCRE's
@@ -295,19 +339,15 @@ in the
.\"
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
or "french" in Windows, some character codes greater than 128 are used for
-accented letters, and these are matched by \ew.
-.P
-In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or
-\ew, and always match \eD, \eS, and \eW. This is true even when Unicode
-character property support is available. The use of locales with Unicode is
-discouraged.
+accented letters, and these are matched by \ew. The use of locales with Unicode
+is discouraged.
.
.
.SS "Newline sequences"
.rs
.sp
Outside a character class, the escape sequence \eR matches any Unicode newline
-sequence. This is an extension to Perl. In non-UTF-8 mode \eR is equivalent to
+sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is equivalent to
the following:
.sp
(?>\er\en|\en|\ex0b|\ef|\er|\ex85)
@@ -1933,6 +1973,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 11 June 2007
+Last updated: 13 June 2007
Copyright (c) 1997-2007 University of Cambridge.
.fi
diff --git a/pcre_compile.c b/pcre_compile.c
index e412eef..350f50d 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -58,6 +58,11 @@ used by pcretest. DEBUG is not defined when building a production library. */
#endif
+/* Macro for setting individual bits in class bitmaps. */
+
+#define SETBIT(a,b) a[b/8] |= (1 << (b%8))
+
+
/*************************************************
* Code parameters and static tables *
*************************************************/
@@ -87,12 +92,12 @@ static const short int escapes[] = {
0, 0, 0, 0, 0, 0, 0, 0, /* 0 - 7 */
0, 0, ':', ';', '<', '=', '>', '?', /* 8 - ? */
'@', -ESC_A, -ESC_B, -ESC_C, -ESC_D, -ESC_E, 0, -ESC_G, /* @ - G */
- 0, 0, 0, -ESC_K, 0, 0, 0, 0, /* H - O */
--ESC_P, -ESC_Q, -ESC_R, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */
+-ESC_H, 0, 0, -ESC_K, 0, 0, 0, 0, /* H - O */
+-ESC_P, -ESC_Q, -ESC_R, -ESC_S, 0, 0, -ESC_V, -ESC_W, /* P - W */
-ESC_X, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */
'`', 7, -ESC_b, 0, -ESC_d, ESC_e, ESC_f, 0, /* ` - g */
- 0, 0, 0, -ESC_k, 0, 0, ESC_n, 0, /* h - o */
--ESC_p, 0, ESC_r, -ESC_s, ESC_tee, 0, 0, -ESC_w, /* p - w */
+-ESC_h, 0, 0, -ESC_k, 0, 0, ESC_n, 0, /* h - o */
+-ESC_p, 0, ESC_r, -ESC_s, ESC_tee, 0, -ESC_v, -ESC_w, /* p - w */
0, 0, -ESC_z /* x - z */
};
@@ -106,18 +111,18 @@ static const short int escapes[] = {
/* 70 */ 0, 0, 0, 0, 0, 0, 0, 0,
/* 78 */ 0, '`', ':', '#', '@', '\'', '=', '"',
/* 80 */ 0, 7, -ESC_b, 0, -ESC_d, ESC_e, ESC_f, 0,
-/* 88 */ 0, 0, 0, '{', 0, 0, 0, 0,
+/* 88 */-ESC_h, 0, 0, '{', 0, 0, 0, 0,
/* 90 */ 0, 0, -ESC_k, 'l', 0, ESC_n, 0, -ESC_p,
/* 98 */ 0, ESC_r, 0, '}', 0, 0, 0, 0,
-/* A0 */ 0, '~', -ESC_s, ESC_tee, 0, 0, -ESC_w, 0,
+/* A0 */ 0, '~', -ESC_s, ESC_tee, 0,-ESC_v, -ESC_w, 0,
/* A8 */ 0,-ESC_z, 0, 0, 0, '[', 0, 0,
/* B0 */ 0, 0, 0, 0, 0, 0, 0, 0,
/* B8 */ 0, 0, 0, 0, 0, ']', '=', '-',
/* C0 */ '{',-ESC_A, -ESC_B, -ESC_C, -ESC_D,-ESC_E, 0, -ESC_G,
-/* C8 */ 0, 0, 0, 0, 0, 0, 0, 0,
+/* C8 */-ESC_H, 0, 0, 0, 0, 0, 0, 0,
/* D0 */ '}', 0, 0, 0, 0, 0, 0, -ESC_P,
/* D8 */-ESC_Q,-ESC_R, 0, 0, 0, 0, 0, 0,
-/* E0 */ '\\', 0, -ESC_S, 0, 0, 0, -ESC_W, -ESC_X,
+/* E0 */ '\\', 0, -ESC_S, 0, 0,-ESC_V, -ESC_W, -ESC_X,
/* E8 */ 0,-ESC_Z, 0, 0, 0, 0, 0, 0,
/* F0 */ 0, 0, 0, 0, 0, 0, 0, 0,
/* F8 */ 0, 0, 0, 0, 0, 0, 0, 0
@@ -2530,7 +2535,7 @@ for (;; ptr++)
case ESC_E: /* Perl ignores an orphan \E */
continue;
-
+
default: /* Not recognized; fall through */
break; /* Need "default" setting to stop compiler warning. */
}
@@ -2539,6 +2544,133 @@ for (;; ptr++)
else if (c == -ESC_d || c == -ESC_D || c == -ESC_w ||
c == -ESC_W || c == -ESC_s || c == -ESC_S) continue;
+
+ /* We need to deal with \H, \h, \V, and \v in both phases because
+ they use extra memory. */
+
+ if (-c == ESC_h)
+ {
+ SETBIT(classbits, 0x09); /* VT */
+ SETBIT(classbits, 0x20); /* SPACE */
+ SETBIT(classbits, 0xa0); /* NSBP */
+#ifdef SUPPORT_UTF8
+ if (utf8)
+ {
+ class_utf8 = TRUE;
+ *class_utf8data++ = XCL_SINGLE;
+ class_utf8data += _pcre_ord2utf8(0x1680, class_utf8data);
+ *class_utf8data++ = XCL_SINGLE;
+ class_utf8data += _pcre_ord2utf8(0x180e, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x2000, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x200A, class_utf8data);
+ *class_utf8data++ = XCL_SINGLE;
+ class_utf8data += _pcre_ord2utf8(0x202f, class_utf8data);
+ *class_utf8data++ = XCL_SINGLE;
+ class_utf8data += _pcre_ord2utf8(0x205f, class_utf8data);
+ *class_utf8data++ = XCL_SINGLE;
+ class_utf8data += _pcre_ord2utf8(0x3000, class_utf8data);
+ }
+#endif
+ continue;
+ }
+
+ if (-c == ESC_H)
+ {
+ for (c = 0; c < 32; c++)
+ {
+ int x = 0xff;
+ switch (c)
+ {
+ case 0x09/8: x ^= 1 << (0x09%8); break;
+ case 0x20/8: x ^= 1 << (0x20%8); break;
+ case 0xa0/8: x ^= 1 << (0xa0%8); break;
+ default: break;
+ }
+ classbits[c] |= x;
+ }
+
+#ifdef SUPPORT_UTF8
+ if (utf8)
+ {
+ class_utf8 = TRUE;
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x0100, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x167f, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x1681, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x180d, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x180f, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x1fff, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x200B, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x202e, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x2030, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x205e, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x2060, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x2fff, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x3001, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x7fffffff, class_utf8data);
+ }
+#endif
+ continue;
+ }
+
+ if (-c == ESC_v)
+ {
+ SETBIT(classbits, 0x0a); /* LF */
+ SETBIT(classbits, 0x0b); /* VT */
+ SETBIT(classbits, 0x0c); /* FF */
+ SETBIT(classbits, 0x0d); /* CR */
+ SETBIT(classbits, 0x85); /* NEL */
+#ifdef SUPPORT_UTF8
+ if (utf8)
+ {
+ class_utf8 = TRUE;
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x2028, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x2029, class_utf8data);
+ }
+#endif
+ continue;
+ }
+
+ if (-c == ESC_V)
+ {
+ for (c = 0; c < 32; c++)
+ {
+ int x = 0xff;
+ switch (c)
+ {
+ case 0x0a/8: x ^= 1 << (0x0a%8);
+ x ^= 1 << (0x0b%8);
+ x ^= 1 << (0x0c%8);
+ x ^= 1 << (0x0d%8);
+ break;
+ case 0x85/8: x ^= 1 << (0x85%8); break;
+ default: break;
+ }
+ classbits[c] |= x;
+ }
+
+#ifdef SUPPORT_UTF8
+ if (utf8)
+ {
+ class_utf8 = TRUE;
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x0100, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x2027, class_utf8data);
+ *class_utf8data++ = XCL_RANGE;
+ class_utf8data += _pcre_ord2utf8(0x2029, class_utf8data);
+ class_utf8data += _pcre_ord2utf8(0x7fffffff, class_utf8data);
+ }
+#endif
+ continue;
+ }
/* We need to deal with \P and \p in both phases. */
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index 01dab00..120c2f6 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -63,11 +63,14 @@ applications. */
/* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
into others, under special conditions. A gap of 20 between the blocks should be
-enough. */
+enough. The resulting opcodes don't have to be less than 256 because they are
+never stored, so we push them well clear of the normal opcodes. */
-#define OP_PROP_EXTRA 100
-#define OP_EXTUNI_EXTRA 120
-#define OP_ANYNL_EXTRA 140
+#define OP_PROP_EXTRA 300
+#define OP_EXTUNI_EXTRA 320
+#define OP_ANYNL_EXTRA 340
+#define OP_HSPACE_EXTRA 360
+#define OP_VSPACE_EXTRA 380
/* This table identifies those opcodes that are followed immediately by a
@@ -82,7 +85,8 @@ static uschar coptable[] = {
0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
0, 0, /* Any, Anybyte */
- 0, 0, 0, 0, /* NOTPROP, PROP, EXTUNI, ANYNL */
+ 0, 0, 0, /* NOTPROP, PROP, EXTUNI */
+ 0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
0, 0, 0, 0, 0, /* \Z, \z, Opt, ^, $ */
1, /* Char */
1, /* Charnc */
@@ -559,10 +563,10 @@ for (;;)
permitted.
We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
- argument that is not a data character - but is always one byte long.
- Unfortunately, we have to take special action to deal with \P, \p, and
- \X in this case. To keep the other cases fast, convert these ones to new
- opcodes. */
+ argument that is not a data character - but is always one byte long. We
+ have to take special action to deal with \P, \p, \H, \h, \V, \v and \X in
+ this case. To keep the other cases fast, convert these ones to new opcodes.
+ */
if (coptable[codevalue] > 0)
{
@@ -580,6 +584,10 @@ for (;;)
case OP_PROP: codevalue += OP_PROP_EXTRA; break;
case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
+ case OP_NOT_HSPACE:
+ case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
+ case OP_NOT_VSPACE:
+ case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
default: break;
}
}
@@ -1090,6 +1098,96 @@ for (;;)
break;
/*-----------------------------------------------------------------*/
+ case OP_VSPACE_EXTRA + OP_TYPEPLUS:
+ case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
+ case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
+ count = current_state->count; /* Already matched */
+ if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
+ if (clen > 0)
+ {
+ BOOL OK;
+ switch (c)
+ {
+ case 0x000a:
+ case 0x000b:
+ case 0x000c:
+ case 0x000d:
+ case 0x0085:
+ case 0x2028:
+ case 0x2029:
+ OK = TRUE;
+ break;
+
+ default:
+ OK = FALSE;
+ break;
+ }
+
+ if (OK == (d == OP_VSPACE))
+ {
+ if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
+ {
+ active_count--; /* Remove non-match possibility */
+ next_active_state--;
+ }
+ count++;
+ ADD_NEW_DATA(-state_offset, count, 0);
+ }
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
+ case OP_HSPACE_EXTRA + OP_TYPEPLUS:
+ case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
+ case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
+ count = current_state->count; /* Already matched */
+ if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
+ if (clen > 0)
+ {
+ BOOL OK;
+ switch (c)
+ {
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ OK = TRUE;
+ break;
+
+ default:
+ OK = FALSE;
+ break;
+ }
+
+ if (OK == (d == OP_HSPACE))
+ {
+ if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
+ {
+ active_count--; /* Remove non-match possibility */
+ next_active_state--;
+ }
+ count++;
+ ADD_NEW_DATA(-state_offset, count, 0);
+ }
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
#ifdef SUPPORT_UCP
case OP_PROP_EXTRA + OP_TYPEQUERY:
case OP_PROP_EXTRA + OP_TYPEMINQUERY:
@@ -1233,6 +1331,111 @@ for (;;)
break;
/*-----------------------------------------------------------------*/
+ case OP_VSPACE_EXTRA + OP_TYPEQUERY:
+ case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
+ case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
+ count = 2;
+ goto QS4;
+
+ case OP_VSPACE_EXTRA + OP_TYPESTAR:
+ case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
+ case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
+ count = 0;
+
+ QS4:
+ ADD_ACTIVE(state_offset + 2, 0);
+ if (clen > 0)
+ {
+ BOOL OK;
+ switch (c)
+ {
+ case 0x000a:
+ case 0x000b:
+ case 0x000c:
+ case 0x000d:
+ case 0x0085:
+ case 0x2028:
+ case 0x2029:
+ OK = TRUE;
+ break;
+
+ default:
+ OK = FALSE;
+ break;
+ }
+ if (OK == (d == OP_VSPACE))
+ {
+ if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
+ codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
+ {
+ active_count--; /* Remove non-match possibility */
+ next_active_state--;
+ }
+ ADD_NEW_DATA(-(state_offset + count), 0, 0);
+ }
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
+ case OP_HSPACE_EXTRA + OP_TYPEQUERY:
+ case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
+ case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
+ count = 2;
+ goto QS5;
+
+ case OP_HSPACE_EXTRA + OP_TYPESTAR:
+ case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
+ case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
+ count = 0;
+
+ QS5:
+ ADD_ACTIVE(state_offset + 2, 0);
+ if (clen > 0)
+ {
+ BOOL OK;
+ switch (c)
+ {
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ OK = TRUE;
+ break;
+
+ default:
+ OK = FALSE;
+ break;
+ }
+
+ if (OK == (d == OP_HSPACE))
+ {
+ if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
+ codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
+ {
+ active_count--; /* Remove non-match possibility */
+ next_active_state--;
+ }
+ ADD_NEW_DATA(-(state_offset + count), 0, 0);
+ }
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
#ifdef SUPPORT_UCP
case OP_PROP_EXTRA + OP_TYPEEXACT:
case OP_PROP_EXTRA + OP_TYPEUPTO:
@@ -1361,6 +1564,103 @@ for (;;)
}
break;
+ /*-----------------------------------------------------------------*/
+ case OP_VSPACE_EXTRA + OP_TYPEEXACT:
+ case OP_VSPACE_EXTRA + OP_TYPEUPTO:
+ case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
+ case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
+ if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
+ { ADD_ACTIVE(state_offset + 4, 0); }
+ count = current_state->count; /* Number already matched */
+ if (clen > 0)
+ {
+ BOOL OK;
+ switch (c)
+ {
+ case 0x000a:
+ case 0x000b:
+ case 0x000c:
+ case 0x000d:
+ case 0x0085:
+ case 0x2028:
+ case 0x2029:
+ OK = TRUE;
+ break;
+
+ default:
+ OK = FALSE;
+ }
+
+ if (OK == (d == OP_VSPACE))
+ {
+ if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
+ {
+ active_count--; /* Remove non-match possibility */
+ next_active_state--;
+ }
+ if (++count >= GET2(code, 1))
+ { ADD_NEW_DATA(-(state_offset + 4), 0, 0); }
+ else
+ { ADD_NEW_DATA(-state_offset, count, 0); }
+ }
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
+ case OP_HSPACE_EXTRA + OP_TYPEEXACT:
+ case OP_HSPACE_EXTRA + OP_TYPEUPTO:
+ case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
+ case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
+ if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
+ { ADD_ACTIVE(state_offset + 4, 0); }
+ count = current_state->count; /* Number already matched */
+ if (clen > 0)
+ {
+ BOOL OK;
+ switch (c)
+ {
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ OK = TRUE;
+ break;
+
+ default:
+ OK = FALSE;
+ break;
+ }
+
+ if (OK == (d == OP_HSPACE))
+ {
+ if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
+ {
+ active_count--; /* Remove non-match possibility */
+ next_active_state--;
+ }
+ if (++count >= GET2(code, 1))
+ { ADD_NEW_DATA(-(state_offset + 4), 0, 0); }
+ else
+ { ADD_NEW_DATA(-state_offset, count, 0); }
+ }
+ }
+ break;
+
/* ========================================================================== */
/* These opcodes are followed by a character that is usually compared
to the current subject character; it is loaded into d. We still get
@@ -1460,6 +1760,102 @@ for (;;)
break;
/*-----------------------------------------------------------------*/
+ case OP_NOT_VSPACE:
+ if (clen > 0) switch(c)
+ {
+ case 0x000a:
+ case 0x000b:
+ case 0x000c:
+ case 0x000d:
+ case 0x0085:
+ case 0x2028:
+ case 0x2029:
+ break;
+
+ default:
+ ADD_NEW(state_offset + 1, 0);
+ break;
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
+ case OP_VSPACE:
+ if (clen > 0) switch(c)
+ {
+ case 0x000a:
+ case 0x000b:
+ case 0x000c:
+ case 0x000d:
+ case 0x0085:
+ case 0x2028:
+ case 0x2029:
+ ADD_NEW(state_offset + 1, 0);
+ break;
+
+ default: break;
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
+ case OP_NOT_HSPACE:
+ if (clen > 0) switch(c)
+ {
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ break;
+
+ default:
+ ADD_NEW(state_offset + 1, 0);
+ break;
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
+ case OP_HSPACE:
+ if (clen > 0) switch(c)
+ {
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ ADD_NEW(state_offset + 1, 0);
+ break;
+ }
+ break;
+
+ /*-----------------------------------------------------------------*/
/* Match a negated single character. This is only used for one-byte
characters, that is, we know that d < 256. The character we are
checking (c) can be multibyte. */
diff --git a/pcre_exec.c b/pcre_exec.c
index faac023..f5a2340 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -1482,6 +1482,102 @@ for (;;)
ecode++;
break;
+ case OP_NOT_HSPACE:
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINCTEST(c, eptr);
+ switch(c)
+ {
+ default: break;
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ RRETURN(MATCH_NOMATCH);
+ }
+ ecode++;
+ break;
+
+ case OP_HSPACE:
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINCTEST(c, eptr);
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ break;
+ }
+ ecode++;
+ break;
+
+ case OP_NOT_VSPACE:
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINCTEST(c, eptr);
+ switch(c)
+ {
+ default: break;
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ RRETURN(MATCH_NOMATCH);
+ }
+ ecode++;
+ break;
+
+ case OP_VSPACE:
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINCTEST(c, eptr);
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ break;
+ }
+ ecode++;
+ break;
+
#ifdef SUPPORT_UCP
/* Check the next character by Unicode property. We will get here only
if the support is in the binary; otherwise a compile-time error occurs. */
@@ -2814,6 +2910,110 @@ for (;;)
}
break;
+ case OP_NOT_HSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINC(c, eptr);
+ switch(c)
+ {
+ default: break;
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ RRETURN(MATCH_NOMATCH);
+ }
+ }
+ break;
+
+ case OP_HSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINC(c, eptr);
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ break;
+ }
+ }
+ break;
+
+ case OP_NOT_VSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINC(c, eptr);
+ switch(c)
+ {
+ default: break;
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ RRETURN(MATCH_NOMATCH);
+ }
+ }
+ break;
+
+ case OP_VSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ GETCHARINC(c, eptr);
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ break;
+ }
+ }
+ break;
+
case OP_NOT_DIGIT:
for (i = 1; i <= min; i++)
{
@@ -2925,6 +3125,70 @@ for (;;)
}
break;
+ case OP_NOT_HSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ switch(*eptr++)
+ {
+ default: break;
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ RRETURN(MATCH_NOMATCH);
+ }
+ }
+ break;
+
+ case OP_HSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ switch(*eptr++)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ break;
+ }
+ }
+ break;
+
+ case OP_NOT_VSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ switch(*eptr++)
+ {
+ default: break;
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ RRETURN(MATCH_NOMATCH);
+ }
+ }
+ break;
+
+ case OP_VSPACE:
+ for (i = 1; i <= min; i++)
+ {
+ if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
+ switch(*eptr++)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ break;
+ }
+ }
+ break;
+
case OP_NOT_DIGIT:
for (i = 1; i <= min; i++)
if ((md->ctypes[*eptr++] & ctype_digit) != 0) RRETURN(MATCH_NOMATCH);
@@ -3116,6 +3380,90 @@ for (;;)
}
break;
+ case OP_NOT_HSPACE:
+ switch(c)
+ {
+ default: break;
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ RRETURN(MATCH_NOMATCH);
+ }
+ break;
+
+ case OP_HSPACE:
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ break;
+ }
+ break;
+
+ case OP_NOT_VSPACE:
+ switch(c)
+ {
+ default: break;
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ RRETURN(MATCH_NOMATCH);
+ }
+ break;
+
+ case OP_VSPACE:
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ break;
+ }
+ break;
+
case OP_NOT_DIGIT:
if (c < 256 && (md->ctypes[c] & ctype_digit) != 0)
RRETURN(MATCH_NOMATCH);
@@ -3187,6 +3535,54 @@ for (;;)
}
break;
+ case OP_NOT_HSPACE:
+ switch(c)
+ {
+ default: break;
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ RRETURN(MATCH_NOMATCH);
+ }
+ break;
+
+ case OP_HSPACE:
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ break;
+ }
+ break;
+
+ case OP_NOT_VSPACE:
+ switch(c)
+ {
+ default: break;
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ RRETURN(MATCH_NOMATCH);
+ }
+ break;
+
+ case OP_VSPACE:
+ switch(c)
+ {
+ default: RRETURN(MATCH_NOMATCH);
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ break;
+ }
+ break;
+
case OP_NOT_DIGIT:
if ((md->ctypes[c] & ctype_digit) != 0) RRETURN(MATCH_NOMATCH);
break;
@@ -3448,6 +3844,70 @@ for (;;)
}
break;
+ case OP_NOT_HSPACE:
+ case OP_HSPACE:
+ for (i = min; i < max; i++)
+ {
+ BOOL gotspace;
+ int len = 1;
+ if (eptr >= md->end_subject) break;
+ GETCHARLEN(c, eptr, len);
+ switch(c)
+ {
+ default: gotspace = FALSE; break;
+ case 0x09: /* HT */
+ case 0x20: /* SPACE */
+ case 0xa0: /* NBSP */
+ case 0x1680: /* OGHAM SPACE MARK */
+ case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
+ case 0x2000: /* EN QUAD */
+ case 0x2001: /* EM QUAD */
+ case 0x2002: /* EN SPACE */
+ case 0x2003: /* EM SPACE */
+ case 0x2004: /* THREE-PER-EM SPACE */
+ case 0x2005: /* FOUR-PER-EM SPACE */
+ case 0x2006: /* SIX-PER-EM SPACE */
+ case 0x2007: /* FIGURE SPACE */
+ case 0x2008: /* PUNCTUATION SPACE */
+ case 0x2009: /* THIN SPACE */
+ case 0x200A: /* HAIR SPACE */
+ case 0x202f: /* NARROW NO-BREAK SPACE */
+ case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
+ case 0x3000: /* IDEOGRAPHIC SPACE */
+ gotspace = TRUE;
+ break;
+ }
+ if (gotspace == (ctype == OP_NOT_HSPACE)) break;
+ eptr += len;
+ }
+ break;
+
+ case OP_NOT_VSPACE:
+ case OP_VSPACE:
+ for (i = min; i < max; i++)
+ {
+ BOOL gotspace;
+ int len = 1;
+ if (eptr >= md->end_subject) break;
+ GETCHARLEN(c, eptr, len);
+ switch(c)
+ {
+ default: gotspace = FALSE; break;
+ case 0x0a: /* LF */
+ case 0x0b: /* VT */
+ case 0x0c: /* FF */
+ case 0x0d: /* CR */
+ case 0x85: /* NEL */
+ case 0x2028: /* LINE SEPARATOR */
+ case 0x2029: /* PARAGRAPH SEPARATOR */
+ gotspace = TRUE;
+ break;
+ }
+ if (gotspace == (ctype == OP_NOT_VSPACE)) break;
+ eptr += len;
+ }
+ break;
+
case OP_NOT_DIGIT:
for (i = min; i < max; i++)
{
@@ -3574,6 +4034,48 @@ for (;;)
}
break;
+ case OP_NOT_HSPACE:
+ for (i = min; i < max; i++)
+ {
+ if (eptr >= md->end_subject) break;
+ c = *eptr;
+ if (c == 0x09 || c == 0x20 || c == 0xa0) break;
+ eptr++;
+ }
+ break;
+
+ case OP_HSPACE:
+ for (i = min; i < max; i++)
+ {
+ if (eptr >= md->end_subject) break;
+ c = *eptr;
+ if (c != 0x09 && c != 0x20 && c != 0xa0) break;
+ eptr++;
+ }
+ break;
+
+ case OP_NOT_VSPACE:
+ for (i = min; i < max; i++)
+ {
+ if (eptr >= md->end_subject) break;
+ c = *eptr;
+ if (c == 0x0a || c == 0x0b || c == 0x0c || c == 0x0d || c == 0x85)
+ break;
+ eptr++;
+ }
+ break;
+
+ case OP_VSPACE:
+ for (i = min; i < max; i++)
+ {
+ if (eptr >= md->end_subject) break;
+ c = *eptr;
+ if (c != 0x0a && c != 0x0b && c != 0x0c && c != 0x0d && c != 0x85)
+ break;
+ eptr++;
+ }
+ break;
+
case OP_NOT_DIGIT:
for (i = min; i < max; i++)
{
diff --git a/pcre_internal.h b/pcre_internal.h
index 5ff4b15..d87b95c 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -606,8 +606,8 @@ consume characters. If any new escapes are put in between that don't consume a
character, that code will have to change. */
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
- ESC_W, ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_X, ESC_Z, ESC_z,
- ESC_E, ESC_Q, ESC_k, ESC_REF };
+ ESC_W, ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H, ESC_h,
+ ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF };
/* Opcode table: OP_BRA must be last, as all values >= it are used for brackets
@@ -643,120 +643,124 @@ enum {
OP_NOTPROP, /* 14 \P (not Unicode property) */
OP_PROP, /* 15 \p (Unicode property) */
OP_ANYNL, /* 16 \R (any newline sequence) */
- OP_EXTUNI, /* 17 \X (extended Unicode sequence */
- OP_EODN, /* 18 End of data or \n at end of data: \Z. */
- OP_EOD, /* 19 End of data: \z */
-
- OP_OPT, /* 20 Set runtime options */
- OP_CIRC, /* 21 Start of line - varies with multiline switch */
- OP_DOLL, /* 22 End of line - varies with multiline switch */
- OP_CHAR, /* 23 Match one character, casefully */
- OP_CHARNC, /* 24 Match one character, caselessly */
- OP_NOT, /* 25 Match one character, not the following one */
-
- OP_STAR, /* 26 The maximizing and minimizing versions of */
- OP_MINSTAR, /* 27 these six opcodes must come in pairs, with */
- OP_PLUS, /* 28 the minimizing one second. */
- OP_MINPLUS, /* 29 This first set applies to single characters.*/
- OP_QUERY, /* 30 */
- OP_MINQUERY, /* 31 */
-
- OP_UPTO, /* 32 From 0 to n matches */
- OP_MINUPTO, /* 33 */
- OP_EXACT, /* 34 Exactly n matches */
-
- OP_POSSTAR, /* 35 Possessified star */
- OP_POSPLUS, /* 36 Possessified plus */
- OP_POSQUERY, /* 37 Posesssified query */
- OP_POSUPTO, /* 38 Possessified upto */
-
- OP_NOTSTAR, /* 39 The maximizing and minimizing versions of */
- OP_NOTMINSTAR, /* 40 these six opcodes must come in pairs, with */
- OP_NOTPLUS, /* 41 the minimizing one second. They must be in */
- OP_NOTMINPLUS, /* 42 exactly the same order as those above. */
- OP_NOTQUERY, /* 43 This set applies to "not" single characters. */
- OP_NOTMINQUERY, /* 44 */
-
- OP_NOTUPTO, /* 45 From 0 to n matches */
- OP_NOTMINUPTO, /* 46 */
- OP_NOTEXACT, /* 47 Exactly n matches */
-
- OP_NOTPOSSTAR, /* 48 Possessified versions */
- OP_NOTPOSPLUS, /* 49 */
- OP_NOTPOSQUERY, /* 50 */
- OP_NOTPOSUPTO, /* 51 */
-
- OP_TYPESTAR, /* 52 The maximizing and minimizing versions of */
- OP_TYPEMINSTAR, /* 53 these six opcodes must come in pairs, with */
- OP_TYPEPLUS, /* 54 the minimizing one second. These codes must */
- OP_TYPEMINPLUS, /* 55 be in exactly the same order as those above. */
- OP_TYPEQUERY, /* 56 This set applies to character types such as \d */
- OP_TYPEMINQUERY, /* 57 */
-
- OP_TYPEUPTO, /* 58 From 0 to n matches */
- OP_TYPEMINUPTO, /* 59 */
- OP_TYPEEXACT, /* 60 Exactly n matches */
-
- OP_TYPEPOSSTAR, /* 61 Possessified versions */
- OP_TYPEPOSPLUS, /* 62 */
- OP_TYPEPOSQUERY, /* 63 */
- OP_TYPEPOSUPTO, /* 64 */
-
- OP_CRSTAR, /* 65 The maximizing and minimizing versions of */
- OP_CRMINSTAR, /* 66 all these opcodes must come in pairs, with */
- OP_CRPLUS, /* 67 the minimizing one second. These codes must */
- OP_CRMINPLUS, /* 68 be in exactly the same order as those above. */
- OP_CRQUERY, /* 69 These are for character classes and back refs */
- OP_CRMINQUERY, /* 70 */
- OP_CRRANGE, /* 71 These are different to the three sets above. */
- OP_CRMINRANGE, /* 72 */
-
- OP_CLASS, /* 73 Match a character class, chars < 256 only */
- OP_NCLASS, /* 74 Same, but the bitmap was created from a negative
+ OP_NOT_HSPACE, /* 17 \H (not horizontal whitespace) */
+ OP_HSPACE, /* 18 \h (horizontal whitespace) */
+ OP_NOT_VSPACE, /* 19 \V (not vertical whitespace) */
+ OP_VSPACE, /* 20 \v (vertical whitespace) */
+ OP_EXTUNI, /* 21 \X (extended Unicode sequence */
+ OP_EODN, /* 22 End of data or \n at end of data: \Z. */
+ OP_EOD, /* 23 End of data: \z */
+
+ OP_OPT, /* 24 Set runtime options */
+ OP_CIRC, /* 25 Start of line - varies with multiline switch */
+ OP_DOLL, /* 26 End of line - varies with multiline switch */
+ OP_CHAR, /* 27 Match one character, casefully */
+ OP_CHARNC, /* 28 Match one character, caselessly */
+ OP_NOT, /* 29 Match one character, not the following one */
+
+ OP_STAR, /* 30 The maximizing and minimizing versions of */
+ OP_MINSTAR, /* 31 these six opcodes must come in pairs, with */
+ OP_PLUS, /* 32 the minimizing one second. */
+ OP_MINPLUS, /* 33 This first set applies to single characters.*/
+ OP_QUERY, /* 34 */
+ OP_MINQUERY, /* 35 */
+
+ OP_UPTO, /* 36 From 0 to n matches */
+ OP_MINUPTO, /* 37 */
+ OP_EXACT, /* 38 Exactly n matches */
+
+ OP_POSSTAR, /* 39 Possessified star */
+ OP_POSPLUS, /* 40 Possessified plus */
+ OP_POSQUERY, /* 41 Posesssified query */
+ OP_POSUPTO, /* 42 Possessified upto */
+
+ OP_NOTSTAR, /* 43 The maximizing and minimizing versions of */
+ OP_NOTMINSTAR, /* 44 these six opcodes must come in pairs, with */
+ OP_NOTPLUS, /* 45 the minimizing one second. They must be in */
+ OP_NOTMINPLUS, /* 46 exactly the same order as those above. */
+ OP_NOTQUERY, /* 47 This set applies to "not" single characters. */
+ OP_NOTMINQUERY, /* 48 */
+
+ OP_NOTUPTO, /* 49 From 0 to n matches */
+ OP_NOTMINUPTO, /* 50 */
+ OP_NOTEXACT, /* 51 Exactly n matches */
+
+ OP_NOTPOSSTAR, /* 52 Possessified versions */
+ OP_NOTPOSPLUS, /* 53 */
+ OP_NOTPOSQUERY, /* 54 */
+ OP_NOTPOSUPTO, /* 55 */
+
+ OP_TYPESTAR, /* 56 The maximizing and minimizing versions of */
+ OP_TYPEMINSTAR, /* 57 these six opcodes must come in pairs, with */
+ OP_TYPEPLUS, /* 58 the minimizing one second. These codes must */
+ OP_TYPEMINPLUS, /* 59 be in exactly the same order as those above. */
+ OP_TYPEQUERY, /* 60 This set applies to character types such as \d */
+ OP_TYPEMINQUERY, /* 61 */
+
+ OP_TYPEUPTO, /* 62 From 0 to n matches */
+ OP_TYPEMINUPTO, /* 63 */
+ OP_TYPEEXACT, /* 64 Exactly n matches */
+
+ OP_TYPEPOSSTAR, /* 65 Possessified versions */
+ OP_TYPEPOSPLUS, /* 66 */
+ OP_TYPEPOSQUERY, /* 67 */
+ OP_TYPEPOSUPTO, /* 68 */
+
+ OP_CRSTAR, /* 69 The maximizing and minimizing versions of */
+ OP_CRMINSTAR, /* 70 all these opcodes must come in pairs, with */
+ OP_CRPLUS, /* 71 the minimizing one second. These codes must */
+ OP_CRMINPLUS, /* 72 be in exactly the same order as those above. */
+ OP_CRQUERY, /* 73 These are for character classes and back refs */
+ OP_CRMINQUERY, /* 74 */
+ OP_CRRANGE, /* 75 These are different to the three sets above. */
+ OP_CRMINRANGE, /* 76 */
+
+ OP_CLASS, /* 77 Match a character class, chars < 256 only */
+ OP_NCLASS, /* 78 Same, but the bitmap was created from a negative
class - the difference is relevant only when a UTF-8
character > 255 is encountered. */
- OP_XCLASS, /* 75 Extended class for handling UTF-8 chars within the
+ OP_XCLASS, /* 79 Extended class for handling UTF-8 chars within the
class. This does both positive and negative. */
- OP_REF, /* 76 Match a back reference */
- OP_RECURSE, /* 77 Match a numbered subpattern (possibly recursive) */
- OP_CALLOUT, /* 78 Call out to external function if provided */
+ OP_REF, /* 80 Match a back reference */
+ OP_RECURSE, /* 81 Match a numbered subpattern (possibly recursive) */
+ OP_CALLOUT, /* 82 Call out to external function if provided */
- OP_ALT, /* 79 Start of alternation */
- OP_KET, /* 80 End of group that doesn't have an unbounded repeat */
- OP_KETRMAX, /* 81 These two must remain together and in this */
- OP_KETRMIN, /* 82 order. They are for groups the repeat for ever. */
+ OP_ALT, /* 83 Start of alternation */
+ OP_KET, /* 84 End of group that doesn't have an unbounded repeat */
+ OP_KETRMAX, /* 85 These two must remain together and in this */
+ OP_KETRMIN, /* 86 order. They are for groups the repeat for ever. */
/* The assertions must come before BRA, CBRA, ONCE, and COND.*/
- OP_ASSERT, /* 83 Positive lookahead */
- OP_ASSERT_NOT, /* 84 Negative lookahead */
- OP_ASSERTBACK, /* 85 Positive lookbehind */
- OP_ASSERTBACK_NOT, /* 86 Negative lookbehind */
- OP_REVERSE, /* 87 Move pointer back - used in lookbehind assertions */
+ OP_ASSERT, /* 87 Positive lookahead */
+ OP_ASSERT_NOT, /* 88 Negative lookahead */
+ OP_ASSERTBACK, /* 89 Positive lookbehind */
+ OP_ASSERTBACK_NOT, /* 90 Negative lookbehind */
+ OP_REVERSE, /* 91 Move pointer back - used in lookbehind assertions */
/* ONCE, BRA, CBRA, and COND must come after the assertions, with ONCE first,
as there's a test for >= ONCE for a subpattern that isn't an assertion. */
- OP_ONCE, /* 88 Atomic group */
- OP_BRA, /* 89 Start of non-capturing bracket */
- OP_CBRA, /* 90 Start of capturing bracket */
- OP_COND, /* 91 Conditional group */
+ OP_ONCE, /* 92 Atomic group */
+ OP_BRA, /* 83 Start of non-capturing bracket */
+ OP_CBRA, /* 94 Start of capturing bracket */
+ OP_COND, /* 95 Conditional group */
/* These three must follow the previous three, in the same order. There's a
check for >= SBRA to distinguish the two sets. */
- OP_SBRA, /* 92 Start of non-capturing bracket, check empty */
- OP_SCBRA, /* 93 Start of capturing bracket, check empty */
- OP_SCOND, /* 94 Conditional group, check empty */
+ OP_SBRA, /* 96 Start of non-capturing bracket, check empty */
+ OP_SCBRA, /* 97 Start of capturing bracket, check empty */
+ OP_SCOND, /* 98 Conditional group, check empty */
- OP_CREF, /* 95 Used to hold a capture number as condition */
- OP_RREF, /* 96 Used to hold a recursion number as condition */
- OP_DEF, /* 97 The DEFINE condition */
+ OP_CREF, /* 99 Used to hold a capture number as condition */
+ OP_RREF, /* 100 Used to hold a recursion number as condition */
+ OP_DEF, /* 101 The DEFINE condition */
- OP_BRAZERO, /* 98 These two must remain together and in this */
- OP_BRAMINZERO /* 99 order. */
+ OP_BRAZERO, /* 102 These two must remain together and in this */
+ OP_BRAMINZERO /* 103 order. */
};
@@ -766,8 +770,8 @@ for debugging. The macro is referenced only in pcre_printint.c. */
#define OP_NAME_LIST \
"End", "\\A", "\\G", "\\K", "\\B", "\\b", "\\D", "\\d", \
"\\S", "\\s", "\\W", "\\w", "Any", "Anybyte", \
- "notprop", "prop", "anynl", "extuni", \
- "\\Z", "\\z", \
+ "notprop", "prop", "\\R", "\\H", "\\h", "\\V", "\\v", \
+ "extuni", "\\Z", "\\z", \
"Opt", "^", "$", "char", "charnc", "not", \
"*", "*?", "+", "+?", "?", "??", "{", "{", "{", \
"*+","++", "?+", "{", \
@@ -797,7 +801,8 @@ in UTF-8 mode. The code that uses this table must know about such things. */
1, 1, 1, 1, 1, /* \A, \G, \K, \B, \b */ \
1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */ \
1, 1, /* Any, Anybyte */ \
- 3, 3, 1, 1, /* NOTPROP, PROP, EXTUNI, ANYNL */ \
+ 3, 3, 1, /* NOTPROP, PROP, EXTUNI */ \
+ 1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */ \
1, 1, 2, 1, 1, /* \Z, \z, Opt, ^, $ */ \
2, /* Char - the minimum length */ \
2, /* Charnc - the minimum length */ \
diff --git a/testdata/testinput1 b/testdata/testinput1
index 4d619a8..7e8e9f0 100644
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -1494,8 +1494,8 @@
/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/
abcdefghijk\12S
-/ab\hdef/
- abhdef
+/ab\idef/
+ abidef
/a{0}bc/
bc
diff --git a/testdata/testinput2 b/testdata/testinput2
index 357b1a9..fdac749 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -28,9 +28,9 @@
*** Failers
def\nabc
-/ab\hdef/X
+/ab\idef/X
-/(?X)ab\hdef/X
+/(?X)ab\idef/X
/x{5,4}/
@@ -2240,4 +2240,51 @@ a random value. /Ix
** Failers
xyzxyz
+/\H\h\V\v/
+ X X\x0a
+ X\x09X\x0b
+ ** Failers
+ \xa0 X\x0a
+
+/\H*\h+\V?\v{3,4}/
+ \x09\x20\xa0X\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\xa0\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\xa0\x0a\x0b\x0c
+ ** Failers
+ \x09\x20\xa0\x0a\x0b
+
+/\H{3,4}/
+ XY ABCDE
+ XY PQR ST
+
+/.\h{3,4}./
+ XY AB PQRS
+
+/\h*X\h?\H+Y\H?Z/
+ >XNNNYZ
+ > X NYQZ
+ ** Failers
+ >XYZ
+ > X NY Z
+
+/\v*X\v?Y\v+Z\V*\x0a\V+\x0b\V{2,3}\x0c/
+ >XY\x0aZ\x0aA\x0bNN\x0c
+ >\x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
+
+/[\h]/BZ
+ >\x09<
+
+/[\h]+/BZ
+ >\x09\x20\xa0<
+
+/[\v]/BZ
+
+/[\H]/BZ
+
+/[^\h]/BZ
+
+/[\V]/BZ
+
+/[\x0a\V]/BZ
+
/ End of testinput2 /
diff --git a/testdata/testinput5 b/testdata/testinput5
index c99bee7..e8e3cf7 100644
--- a/testdata/testinput5
+++ b/testdata/testinput5
@@ -355,4 +355,42 @@ can't tell the difference.) --/
a\n\n\n\rb
a\r
+/\H\h\V\v/8
+ X X\x0a
+ X\x09X\x0b
+ ** Failers
+ \x{a0} X\x0a
+
+/\H*\h+\V?\v{3,4}/8
+ \x09\x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\x{a0}\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\x{a0}\x0a\x0b\x0c
+ ** Failers
+ \x09\x20\x{a0}\x0a\x0b
+
+/\H\h\V\v/8
+ \x{3001}\x{3000}\x{2030}\x{2028}
+ X\x{180e}X\x{85}
+ ** Failers
+ \x{2009} X\x0a
+
+/\H*\h+\V?\v{3,4}/8
+ \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x0c\x0d\x0a
+ \x09\x{205f}\x{a0}\x0a\x{2029}\x0c\x{2028}\x0a
+ \x09\x20\x{202f}\x0a\x0b\x0c
+ ** Failers
+ \x09\x{200a}\x{a0}\x{2028}\x0b
+
+/[\h]/8BZ
+ >\x{1680}
+
+/[\h]{3,}/8BZ
+ >\x{1680}\x{180e}\x{2000}\x{2003}\x{200a}\x{202f}\x{205f}\x{3000}<
+
+/[\v]/8BZ
+
+/[\H]/8BZ
+
+/[\V]/8BZ
+
/ End of testinput5 /
diff --git a/testdata/testinput7 b/testdata/testinput7
index 55ed629..2722980 100644
--- a/testdata/testinput7
+++ b/testdata/testinput7
@@ -1931,8 +1931,8 @@
/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/
abcdefghijk\12S
-/ab\hdef/
- abhdef
+/ab\idef/
+ abidef
/a{0}bc/
bc
@@ -4267,4 +4267,35 @@
** Failers
xyzxyz
+/\H\h\V\v/
+ X X\x0a
+ X\x09X\x0b
+ ** Failers
+ \xa0 X\x0a
+
+/\H*\h+\V?\v{3,4}/
+ \x09\x20\xa0X\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\xa0\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\xa0\x0a\x0b\x0c
+ ** Failers
+ \x09\x20\xa0\x0a\x0b
+
+/\H{3,4}/
+ XY ABCDE
+ XY PQR ST
+
+/.\h{3,4}./
+ XY AB PQRS
+
+/\h*X\h?\H+Y\H?Z/
+ >XNNNYZ
+ > X NYQZ
+ ** Failers
+ >XYZ
+ > X NY Z
+
+/\v*X\v?Y\v+Z\V*\x0a\V+\x0b\V{2,3}\x0c/
+ >XY\x0aZ\x0aA\x0bNN\x0c
+ >\x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
+
/ End of testinput7 /
diff --git a/testdata/testinput8 b/testdata/testinput8
index a19493f..61e70e5 100644
--- a/testdata/testinput8
+++ b/testdata/testinput8
@@ -590,4 +590,42 @@
a\n\n\n\rb
a\r
+/\h+\V?\v{3,4}/8
+ \x09\x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+
+/\V?\v{3,4}/8
+ \x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+
+/\h+\V?\v{3,4}/8
+ >\x09\x20\x{a0}X\x0a\x0a\x0a<
+
+/\V?\v{3,4}/8
+ >\x09\x20\x{a0}X\x0a\x0a\x0a<
+
+/\H\h\V\v/8
+ X X\x0a
+ X\x09X\x0b
+ ** Failers
+ \x{a0} X\x0a
+
+/\H*\h+\V?\v{3,4}/8
+ \x09\x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\x{a0}\x0a\x0b\x0c\x0d\x0a
+ \x09\x20\x{a0}\x0a\x0b\x0c
+ ** Failers
+ \x09\x20\x{a0}\x0a\x0b
+
+/\H\h\V\v/8
+ \x{3001}\x{3000}\x{2030}\x{2028}
+ X\x{180e}X\x{85}
+ ** Failers
+ \x{2009} X\x0a
+
+/\H*\h+\V?\v{3,4}/8
+ \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x0c\x0d\x0a
+ \x09\x{205f}\x{a0}\x0a\x{2029}\x0c\x{2028}\x0a
+ \x09\x20\x{202f}\x0a\x0b\x0c
+ ** Failers
+ \x09\x{200a}\x{a0}\x{2028}\x0b
+
/ End of testinput 8 /
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 0bfad1e..209b0d3 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -2189,9 +2189,9 @@ No match
10: j
11: k
-/ab\hdef/
- abhdef
- 0: abhdef
+/ab\idef/
+ abidef
+ 0: abidef
/a{0}bc/
bc
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index d975fa1..ab5a6a1 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -78,10 +78,10 @@ No match
def\nabc
No match
-/ab\hdef/X
+/ab\idef/X
Failed: unrecognized character follows \ at offset 3
-/(?X)ab\hdef/X
+/(?X)ab\idef/X
Failed: unrecognized character follows \ at offset 7
/x{5,4}/
@@ -8474,4 +8474,114 @@ No match
xyzxyz
No match
+/\H\h\V\v/
+ X X\x0a
+ 0: X X\x0a
+ X\x09X\x0b
+ 0: X\x09X\x0b
+ ** Failers
+No match
+ \xa0 X\x0a
+No match
+
+/\H*\h+\V?\v{3,4}/
+ \x09\x20\xa0X\x0a\x0b\x0c\x0d\x0a
+ 0: \x09 \xa0X\x0a\x0b\x0c\x0d
+ \x09\x20\xa0\x0a\x0b\x0c\x0d\x0a
+ 0: \x09 \xa0\x0a\x0b\x0c\x0d
+ \x09\x20\xa0\x0a\x0b\x0c
+ 0: \x09 \xa0\x0a\x0b\x0c
+ ** Failers
+No match
+ \x09\x20\xa0\x0a\x0b
+No match
+
+/\H{3,4}/
+ XY ABCDE
+ 0: ABCD
+ XY PQR ST
+ 0: PQR
+
+/.\h{3,4}./
+ XY AB PQRS
+ 0: B P
+
+/\h*X\h?\H+Y\H?Z/
+ >XNNNYZ
+ 0: XNNNYZ
+ > X NYQZ
+ 0: X NYQZ
+ ** Failers
+No match
+ >XYZ
+No match
+ > X NY Z
+No match
+
+/\v*X\v?Y\v+Z\V*\x0a\V+\x0b\V{2,3}\x0c/
+ >XY\x0aZ\x0aA\x0bNN\x0c
+ 0: XY\x0aZ\x0aA\x0bNN\x0c
+ >\x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
+ 0: \x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
+
+/[\h]/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x09 \xa0]
+ Ket
+ End
+------------------------------------------------------------------
+ >\x09<
+ 0: \x09
+
+/[\h]+/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x09 \xa0]+
+ Ket
+ End
+------------------------------------------------------------------
+ >\x09\x20\xa0<
+ 0: \x09 \xa0
+
+/[\v]/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x0a-\x0d\x85]
+ Ket
+ End
+------------------------------------------------------------------
+
+/[\H]/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff]
+ Ket
+ End
+------------------------------------------------------------------
+
+/[^\h]/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff] (neg)
+ Ket
+ End
+------------------------------------------------------------------
+
+/[\V]/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x00-\x09\x0e-\x84\x86-\xff]
+ Ket
+ End
+------------------------------------------------------------------
+
+/[\x0a\V]/BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x00-\x0a\x0e-\x84\x86-\xff]
+ Ket
+ End
+------------------------------------------------------------------
+
/ End of testinput2 /
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index cf80cd9..1f0b2b1 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -1385,4 +1385,92 @@ No match
a\r
No match
+/\H\h\V\v/8
+ X X\x0a
+ 0: X X\x{0a}
+ X\x09X\x0b
+ 0: X\x{09}X\x{0b}
+ ** Failers
+No match
+ \x{a0} X\x0a
+No match
+
+/\H*\h+\V?\v{3,4}/8
+ \x09\x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+ 0: \x{09} \x{a0}X\x{0a}\x{0b}\x{0c}\x{0d}
+ \x09\x20\x{a0}\x0a\x0b\x0c\x0d\x0a
+ 0: \x{09} \x{a0}\x{0a}\x{0b}\x{0c}\x{0d}
+ \x09\x20\x{a0}\x0a\x0b\x0c
+ 0: \x{09} \x{a0}\x{0a}\x{0b}\x{0c}
+ ** Failers
+No match
+ \x09\x20\x{a0}\x0a\x0b
+No match
+
+/\H\h\V\v/8
+ \x{3001}\x{3000}\x{2030}\x{2028}
+ 0: \x{3001}\x{3000}\x{2030}\x{2028}
+ X\x{180e}X\x{85}
+ 0: X\x{180e}X\x{85}
+ ** Failers
+No match
+ \x{2009} X\x0a
+No match
+
+/\H*\h+\V?\v{3,4}/8
+ \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x0c\x0d\x0a
+ 0: \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x{0c}\x{0d}
+ \x09\x{205f}\x{a0}\x0a\x{2029}\x0c\x{2028}\x0a
+ 0: \x{09}\x{205f}\x{a0}\x{0a}\x{2029}\x{0c}\x{2028}
+ \x09\x20\x{202f}\x0a\x0b\x0c
+ 0: \x{09} \x{202f}\x{0a}\x{0b}\x{0c}
+ ** Failers
+No match
+ \x09\x{200a}\x{a0}\x{2028}\x0b
+No match
+
+/[\h]/8BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x09 \xa0\x{1680}\x{180e}\x{2000}-\x{200a}\x{202f}\x{205f}\x{3000}]
+ Ket
+ End
+------------------------------------------------------------------
+ >\x{1680}
+ 0: \x{1680}
+
+/[\h]{3,}/8BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x09 \xa0\x{1680}\x{180e}\x{2000}-\x{200a}\x{202f}\x{205f}\x{3000}]{3,}
+ Ket
+ End
+------------------------------------------------------------------
+ >\x{1680}\x{180e}\x{2000}\x{2003}\x{200a}\x{202f}\x{205f}\x{3000}<
+ 0: \x{1680}\x{180e}\x{2000}\x{2003}\x{200a}\x{202f}\x{205f}\x{3000}
+
+/[\v]/8BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x0a-\x0d\x85\x{2028}-\x{2029}]
+ Ket
+ End
+------------------------------------------------------------------
+
+/[\H]/8BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff\x{100}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{7fffffff}]
+ Ket
+ End
+------------------------------------------------------------------
+
+/[\V]/8BZ
+------------------------------------------------------------------
+ Bra 0
+ [\x00-\x09\x0e-\x84\x86-\xff\x{100}-\x{2027}\x{2029}-\x{7fffffff}]
+ Ket
+ End
+------------------------------------------------------------------
+
/ End of testinput5 /
diff --git a/testdata/testoutput7 b/testdata/testoutput7
index 9127526..a77186d 100644
--- a/testdata/testoutput7
+++ b/testdata/testoutput7
@@ -3039,9 +3039,9 @@ No match
abcdefghijk\12S
0: abcdefghijk\x0aS
-/ab\hdef/
- abhdef
- 0: abhdef
+/ab\idef/
+ abidef
+ 0: abidef
/a{0}bc/
bc
@@ -7018,4 +7018,58 @@ No match
xyzxyz
No match
+/\H\h\V\v/
+ X X\x0a
+ 0: X X\x0a
+ X\x09X\x0b
+ 0: X\x09X\x0b
+ ** Failers
+No match
+ \xa0 X\x0a
+No match
+
+/\H*\h+\V?\v{3,4}/
+ \x09\x20\xa0X\x0a\x0b\x0c\x0d\x0a
+ 0: \x09 \xa0X\x0a\x0b\x0c\x0d
+ 1: \x09 \xa0X\x0a\x0b\x0c
+ \x09\x20\xa0\x0a\x0b\x0c\x0d\x0a
+ 0: \x09 \xa0\x0a\x0b\x0c\x0d
+ 1: \x09 \xa0\x0a\x0b\x0c
+ \x09\x20\xa0\x0a\x0b\x0c
+ 0: \x09 \xa0\x0a\x0b\x0c
+ ** Failers
+No match
+ \x09\x20\xa0\x0a\x0b
+No match
+
+/\H{3,4}/
+ XY ABCDE
+ 0: ABCD
+ 1: ABC
+ XY PQR ST
+ 0: PQR
+
+/.\h{3,4}./
+ XY AB PQRS
+ 0: B P
+ 1: B
+
+/\h*X\h?\H+Y\H?Z/
+ >XNNNYZ
+ 0: XNNNYZ
+ > X NYQZ
+ 0: X NYQZ
+ ** Failers
+No match
+ >XYZ
+No match
+ > X NY Z
+No match
+
+/\v*X\v?Y\v+Z\V*\x0a\V+\x0b\V{2,3}\x0c/
+ >XY\x0aZ\x0aA\x0bNN\x0c
+ 0: XY\x0aZ\x0aA\x0bNN\x0c
+ >\x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
+ 0: \x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
+
/ End of testinput7 /
diff --git a/testdata/testoutput8 b/testdata/testoutput8
index f8251ff..db0f4ad 100644
--- a/testdata/testoutput8
+++ b/testdata/testoutput8
@@ -1138,4 +1138,70 @@ No match
a\r
No match
+/\h+\V?\v{3,4}/8
+ \x09\x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+ 0: \x{09} \x{a0}X\x{0a}\x{0b}\x{0c}\x{0d}
+ 1: \x{09} \x{a0}X\x{0a}\x{0b}\x{0c}
+
+/\V?\v{3,4}/8
+ \x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+ 0: X\x{0a}\x{0b}\x{0c}\x{0d}
+ 1: X\x{0a}\x{0b}\x{0c}
+
+/\h+\V?\v{3,4}/8
+ >\x09\x20\x{a0}X\x0a\x0a\x0a<
+ 0: \x{09} \x{a0}X\x{0a}\x{0a}\x{0a}
+
+/\V?\v{3,4}/8
+ >\x09\x20\x{a0}X\x0a\x0a\x0a<
+ 0: X\x{0a}\x{0a}\x{0a}
+
+/\H\h\V\v/8
+ X X\x0a
+ 0: X X\x{0a}
+ X\x09X\x0b
+ 0: X\x{09}X\x{0b}
+ ** Failers
+No match
+ \x{a0} X\x0a
+No match
+
+/\H*\h+\V?\v{3,4}/8
+ \x09\x20\x{a0}X\x0a\x0b\x0c\x0d\x0a
+ 0: \x{09} \x{a0}X\x{0a}\x{0b}\x{0c}\x{0d}
+ 1: \x{09} \x{a0}X\x{0a}\x{0b}\x{0c}
+ \x09\x20\x{a0}\x0a\x0b\x0c\x0d\x0a
+ 0: \x{09} \x{a0}\x{0a}\x{0b}\x{0c}\x{0d}
+ 1: \x{09} \x{a0}\x{0a}\x{0b}\x{0c}
+ \x09\x20\x{a0}\x0a\x0b\x0c
+ 0: \x{09} \x{a0}\x{0a}\x{0b}\x{0c}
+ ** Failers
+No match
+ \x09\x20\x{a0}\x0a\x0b
+No match
+
+/\H\h\V\v/8
+ \x{3001}\x{3000}\x{2030}\x{2028}
+ 0: \x{3001}\x{3000}\x{2030}\x{2028}
+ X\x{180e}X\x{85}
+ 0: X\x{180e}X\x{85}
+ ** Failers
+No match
+ \x{2009} X\x0a
+No match
+
+/\H*\h+\V?\v{3,4}/8
+ \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x0c\x0d\x0a
+ 0: \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x{0c}\x{0d}
+ 1: \x{1680}\x{180e}\x{2007}X\x{2028}\x{2029}\x{0c}
+ \x09\x{205f}\x{a0}\x0a\x{2029}\x0c\x{2028}\x0a
+ 0: \x{09}\x{205f}\x{a0}\x{0a}\x{2029}\x{0c}\x{2028}
+ 1: \x{09}\x{205f}\x{a0}\x{0a}\x{2029}\x{0c}
+ \x09\x20\x{202f}\x0a\x0b\x0c
+ 0: \x{09} \x{202f}\x{0a}\x{0b}\x{0c}
+ ** Failers
+No match
+ \x09\x{200a}\x{a0}\x{2028}\x0b
+No match
+
/ End of testinput 8 /