From 2f04a0431dbcfd6a3d1e83ab2475667d40bfa6ca Mon Sep 17 00:00:00 2001
From: ph10
Date: Sat, 7 Jul 2018 16:10:29 +0000
Subject: Update to Unicode 11.0.0

git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@958 6239d852-aaf2-0410-a92c-79f79f948069
---
 doc/html/pcre2pattern.html | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

(limited to 'doc/html/pcre2pattern.html')

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 9adc426..9d241b7 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -789,6 +789,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
@@ -799,9 +800,11 @@ Gothic,
Grantha,
Greek,
Gujarati,
+Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
+Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
@@ -829,11 +832,13 @@ Lisu,
Lycian,
Lydian,
Mahajani,
+Makasar,
Malayalam,
Mandaic,
Manichaean,
Marchen,
Masaram_Gondi,
+Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -856,6 +861,7 @@ Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
+Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
@@ -876,6 +882,7 @@ Shavian,
Siddham,
SignWriting,
Sinhala,
+Sogdian,
Sora_Sompeng,
Soyombo,
Sundanese,
@@ -1006,7 +1013,10 @@ grapheme cluster", and treats the sequence as an atomic group
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
-Unicode Standard Annex 29, "Unicode Text Segmentation".
+Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
+abandoned the use of some previous properties that had been used for emojis.
+Instead it introduced various emoji-specific properties. PCRE2 uses only the
+Extended Pictographic property.

\X always matches at least one character. Then it decides whether to add
@@ -1026,27 +1036,24 @@ character; an LVT or T character may be follwed only by a T character.

4. Do not end before extending characters or spacing marks or the "zero-width
-joiner" characters. Characters with the "mark" property always have the
+joiner" character. Characters with the "mark" property always have the
"extend" grapheme breaking property.

5. Do not end after prepend characters.

-6. Do not break within emoji modifier sequences (a base character followed by a
-modifier). Extending characters are allowed before the modifier.
+6. Do not break within emoji modifier sequences or emoji zwj sequences. That
+is, do not break between characters with the Extended_Pictographic property.
+Extend and ZWJ characters are allowed between the characters.

-7. Do not break within emoji zwj sequences (zero-width joiner followed by
-"glue after ZWJ" or "base glue after ZWJ").
-

-

-8. Do not break within emoji flag sequences. That is, do not break between
+7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.

-6. Otherwise, end the cluster.
+8. Otherwise, end the cluster.


PCRE2's additional properties
@@ -1119,8 +1126,8 @@ lead to odd effects. For example, consider this pattern:
   (?<=\Kfoo)bar
 
-If the subject is "foobar", a call to pcre2_match() with a starting
-offset of 3 succeeds and reports the matching string as "foobar", that is, the
+If the subject is "foobar", a call to pcre2_match() with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.


@@ -3490,7 +3497,7 @@ Cambridge, England.


REVISION

-Last updated: 30 June 2018
+Last updated: 07 July 2018
Copyright © 1997-2018 University of Cambridge.
-- cgit v1.2.1
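
Note on the renumbered rule 7 (emoji flag sequences): the parity condition on
regional indicator (RI) characters can be sketched as below. This is only an
illustration of that one rule, not PCRE2's implementation; the function names
are hypothetical, and every non-RI character is treated as its own cluster,
which ignores all the other \X rules in the patched text.

```python
# Regional indicator symbols occupy U+1F1E6..U+1F1FF.
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF


def is_ri(ch):
    """Return True if ch is a regional indicator character."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_flag_sequences(text):
    """Split text using only the RI parity rule (hypothetical helper).

    Simplifying assumption: any character that is not joined by the RI
    rule starts a new cluster.
    """
    clusters = []
    ri_run = 0  # consecutive RI characters immediately before this position
    for ch in text:
        if is_ri(ch) and ri_run % 2 == 1:
            # An odd number of RI characters precedes this point: no break.
            clusters[-1] += ch
        else:
            clusters.append(ch)
        ri_run = ri_run + 1 if is_ri(ch) else 0
    return clusters
```

With three RI characters in a row, the first two pair up and the third is left
on its own, because an even number of RI characters precedes it.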