author ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2018-07-07 16:10:29 +0000
committer ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2018-07-07 16:10:29 +0000
commit 2f04a0431dbcfd6a3d1e83ab2475667d40bfa6ca
tree 42b2765d206b26205f1f2e2c4c89555aed8ca6d7 /doc/pcre2pattern.3
parent c75868f77eb2ce2ff277355afcd966e3179e65a8
Update to Unicode 11.0.0
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@958 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/pcre2pattern.3')
-rw-r--r-- doc/pcre2pattern.3 34
1 files changed, 21 insertions, 13 deletions
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 2b534f2..cd9a99c 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "30 June 2018" "PCRE2 10.32"
+.TH PCRE2PATTERN 3 "07 July 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -788,6 +788,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
@@ -798,9 +799,11 @@ Gothic,
Grantha,
Greek,
Gujarati,
+Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
+Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
@@ -828,11 +831,13 @@ Lisu,
Lycian,
Lydian,
Mahajani,
+Makasar,
Malayalam,
Mandaic,
Manichaean,
Marchen,
Masaram_Gondi,
+Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -855,6 +860,7 @@ Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
+Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
@@ -875,6 +881,7 @@ Shavian,
Siddham,
SignWriting,
Sinhala,
+Sogdian,
Sora_Sompeng,
Soyombo,
Sundanese,
@@ -1003,7 +1010,10 @@ grapheme cluster", and treats the sequence as an atomic group
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
-Unicode Standard Annex 29, "Unicode Text Segmentation".
+Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
+abandoned the use of some previous properties that had been used for emojis.
+Instead it introduced various emoji-specific properties. PCRE2 uses only the
+Extended Pictographic property.
.P
\eX always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
@@ -1018,22 +1028,20 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be follwed only by a T character.
.P
4. Do not end before extending characters or spacing marks or the "zero-width
-joiner" characters. Characters with the "mark" property always have the
+joiner" character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
.P
5. Do not end after prepend characters.
.P
-6. Do not break within emoji modifier sequences (a base character followed by a
-modifier). Extending characters are allowed before the modifier.
+6. Do not break within emoji modifier sequences or emoji zwj sequences. That
+is, do not break between characters with the Extended_Pictographic property.
+Extend and ZWJ characters are allowed between the characters.
.P
-7. Do not break within emoji zwj sequences (zero-width joiner followed by
-"glue after ZWJ" or "base glue after ZWJ").
-.P
-8. Do not break within emoji flag sequences. That is, do not break between
+7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
.P
-6. Otherwise, end the cluster.
+8. Otherwise, end the cluster.
.
.
.\" HTML <a name="extraprops"></a>
@@ -1112,8 +1120,8 @@ lead to odd effects. For example, consider this pattern:
.sp
(?<=\eKfoo)bar
.sp
-If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
-offset of 3 succeeds and reports the matching string as "foobar", that is, the
+If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
.
.
@@ -3517,6 +3525,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 30 June 2018
+Last updated: 07 July 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi
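
Note on rule 7 in the grapheme cluster hunk of this patch: the rule says not to break between regional indicator (RI) characters when an odd number of RI characters precede the break point. The Python sketch below illustrates only that one rule and is not PCRE2's implementation. The function names are hypothetical, and the sketch assumes that "before the break point" means the contiguous run of RI characters immediately preceding it, which is how Unicode Standard Annex 29 states the rule. All other cluster rules are ignored, so every other position is treated as a break.

```python
def is_regional_indicator(ch: str) -> bool:
    """True for the 26 regional indicator symbols, U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_break_allowed(text: str, pos: int) -> bool:
    """Return True if a cluster may end just before text[pos].

    Only the regional indicator rule is applied (hypothetical helper).
    """
    if pos <= 0 or pos >= len(text):
        return True  # the edges of the text are always boundaries
    if not (is_regional_indicator(text[pos - 1])
            and is_regional_indicator(text[pos])):
        return True  # the rule only concerns a break between two RIs
    # Count the contiguous run of RI characters that ends just before pos.
    count = 0
    i = pos - 1
    while i >= 0 and is_regional_indicator(text[i]):
        count += 1
        i -= 1
    # An odd count means text[pos] completes a flag pair: do not break.
    return count % 2 == 0


def split_on_ri_rule(text: str) -> list[str]:
    """Split text at every position the RI rule allows (sketch only)."""
    if not text:
        return []
    clusters, start = [], 0
    for pos in range(1, len(text)):
        if ri_break_allowed(text, pos):
            clusters.append(text[start:pos])
            start = pos
    clusters.append(text[start:])
    return clusters


if __name__ == "__main__":
    france = "\U0001F1EB\U0001F1F7"   # regional indicators F, R
    germany = "\U0001F1E9\U0001F1EA"  # regional indicators D, E
    for cluster in split_on_ri_rule("a" + france + germany):
        print([f"U+{ord(c):04X}" for c in cluster])
```

With four consecutive RI characters the sketch pairs them two by two, so a run such as the French flag followed by the German flag yields two clusters rather than one or four. A trailing unpaired RI character forms a cluster on its own.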