Prepare for Unicode 9.0

The major code changes needed to support Unicode 9.0 are changes to the boundary (break) rules, for things like \b{lb} and \b{wb}. regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see whether a given point in the target string is a break, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break and whose column corresponds to the one after. Usually the entry alone settles the question, but in some cases extra context is required; the array entry indicates this, and regexec.c needs specially crafted code to handle each such possibility.

When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and the custom code in regexec.c has to be changed to match. Unfortunately, this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along that need new rules. Sometimes it is because Unicode realized that the previous version didn't work as well as it could.

An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between pairs, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emojis. We can expect further changes to these in the next Unicode releases.

For the first time, \b{gcb} now depends on context (in rarely encountered cases, like RIs), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.
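
The table-plus-context lookup described above can be sketched in isolation. This is a minimal illustration, not the code in regexec.c: the class names, the three-class table, and the is_break() helper are all hypothetical simplifications, and the input is an array of already-classified characters rather than a UTF-8 string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced set of break classes (the real tables are
 * generated by regen/mk_invlists.pl and have many more classes). */
typedef enum { CL_OTHER, CL_EXTEND, CL_RI, CL_COUNT } cls;

/* A table entry is either a final answer or a request for more context. */
typedef enum { NOBREAK, BREAKABLE, RI_THEN_RI } action;

/* Row = class before the candidate break, column = class after it. */
static const action break_table[CL_COUNT][CL_COUNT] = {
    /*              OTHER      EXTEND   RI          */
    /* OTHER  */ { BREAKABLE, NOBREAK, BREAKABLE  },
    /* EXTEND */ { BREAKABLE, NOBREAK, BREAKABLE  },
    /* RI     */ { BREAKABLE, NOBREAK, RI_THEN_RI },
};

/* Is there a break between s[pos - 1] and s[pos]?  Needs 0 < pos < len. */
static bool is_break(const cls *s, size_t len, size_t pos)
{
    assert(pos > 0 && pos < len);

    switch (break_table[s[pos - 1]][s[pos]]) {
    case BREAKABLE:
        return true;
    case NOBREAK:
        return false;
    case RI_THEN_RI: {
        /* The pair of classes is not enough: count the run of RIs that
         * ends just before 'pos'. */
        size_t ri_count = 0;
        size_t i = pos;

        while (i > 0 && s[i - 1] == CL_RI) {
            ri_count++;
            i--;
        }

        /* An even count means only complete pairs precede 'pos', so a
         * break is allowed; an odd count means 'pos' would split a pair. */
        return ri_count % 2 == 0;
    }
    }
    return true;    /* Not reached with the table above */
}
```

With the classified sequence OTHER RI RI RI RI EXTEND, is_break() reports a break after the second RI (position 3) but not after the first or third (positions 2 and 4), which is the pairing behavior described above.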

Some years ago I revamped mktables, in part to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included. Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is that there may not have been a paradigm, just something they did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong.

Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, as CJK UNIFIED IDEOGRAPH-3400 always has. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with them automatically, without manual intervention.
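
The "name includes the code point" pattern can be illustrated with a small table-driven sketch. This is not how mktables (a Perl program) implements it; the name_range struct, the algorithmic_name() helper, and the two ranges listed are assumptions chosen for illustration, and the ranges are not a complete list.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of a block whose character names are formed
 * as "<prefix>-<code point in hex>". */
typedef struct {
    unsigned long first;
    unsigned long last;
    const char *prefix;
} name_range;

/* Illustrative ranges only; a real table would cover every such block. */
static const name_range ranges[] = {
    { 0x3400,  0x4DB5,  "CJK UNIFIED IDEOGRAPH" },
    { 0x17000, 0x187EC, "TANGUT IDEOGRAPH" },
};

/* Write the derived name of 'cp' into 'buf'.  Returns 1 on success, or 0
 * if 'cp' is in no listed range or the buffer is too small. */
static int algorithmic_name(unsigned long cp, char *buf, size_t buflen)
{
    size_t i;

    assert(buf != NULL && buflen > 0);

    for (i = 0; i < sizeof ranges / sizeof ranges[0]; i++) {
        if (cp >= ranges[i].first && cp <= ranges[i].last) {
            /* At least four upper-case hex digits, as in the names */
            int n = snprintf(buf, buflen, "%s-%04lX", ranges[i].prefix, cp);
            return n > 0 && (size_t) n < buflen;
        }
    }
    return 0;
}
```

Under this scheme, supporting another script of the same kind means adding one row to the table rather than writing new code, which is the sort of automatic coping the guessed paradigm is meant to allow. For instance, algorithmic_name(0x17000, buf, sizeof buf) fills buf with TANGUT IDEOGRAPH-17000.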
author: Karl Williamson <khw@cpan.org> 2016-06-16 11:59:24 -0600
committer: Karl Williamson <khw@cpan.org> 2016-06-21 18:10:38 -0600
commit: b0e24409fd3623db353286c203d33b56e622bae6 (patch)
tree: bfba4fb7ba71ee2c200933eb6ac850448345e30d /regexec.c
parent: 6295dc14882a54531ce4542f1d80fa8ae7b4f8f0 (diff)
download: perl-b0e24409fd3623db353286c203d33b56e622bae6.tar.gz
1 files changed, 171 insertions, 20 deletions
diff --git a/regexec.c b/regexec.c
index fdcffc5ae6..448a605437 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2118,7 +2118,11 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     while (s < strend) {
                         GCB_enum after = getGCB_VAL_UTF8((U8*) s,
                                                         (U8*) reginfo->strend);
-                        if (   (to_complement ^ isGCB(before, after))
+                        if (   (to_complement ^ isGCB(before,
+                                                      after,
+                                                      (U8*) reginfo->strbeg,
+                                                      (U8*) s,
+                                                      utf8_target))
                             && (reginfo->intuit || regtry(reginfo, &s)))
                         {
                             goto got_it;
@@ -4289,13 +4293,108 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
     return TRUE;
 }
 
-PERL_STATIC_INLINE bool
-S_isGCB(const GCB_enum before, const GCB_enum after)
+STATIC bool
+S_isGCB(pTHX_ const GCB_enum before, const GCB_enum after, const U8 * const strbeg, const U8 * curpos, const bool utf8_target)
 {
     /* returns a boolean indicating if there is a Grapheme Cluster Boundary
-     * between the inputs.  See http://www.unicode.org/reports/tr29/ */
+     * between the inputs.  See http://www.unicode.org/reports/tr29/. */
+
+    PERL_ARGS_ASSERT_ISGCB;
+
+    switch (GCB_table[before][after]) {
+        case GCB_BREAKABLE:
+            return TRUE;
+
+        case GCB_NOBREAK:
+            return FALSE;
+
+        case GCB_RI_then_RI:
+            {
+                int RI_count = 1;
+                U8 * temp_pos = (U8 *) curpos;
+
+                /* Do not break within emoji flag sequences. That is, do not
+                 * break between regional indicator (RI) symbols if there is an
+                 * odd number of RI characters before the break point.
+                 *  GB12     ^ (RI RI)* RI × RI
+                 *  GB13 [^RI] (RI RI)* RI × RI */
+
+                while (backup_one_GCB(strbeg,
+                                    &temp_pos,
+                                    utf8_target) == GCB_Regional_Indicator)
+                {
+                    RI_count++;
+                }
+
+                return RI_count % 2 != 1;
+            }
+
+        case GCB_EX_then_EM:
+
+            /* GB10  ( E_Base | E_Base_GAZ ) Extend* ×  E_Modifier */
+            {
+                U8 * temp_pos = (U8 *) curpos;
+                GCB_enum prev;
+
+                do {
+                    prev = backup_one_GCB(strbeg, &temp_pos, utf8_target);
+                }
+                while (prev == GCB_Extend);
+
+                return prev != GCB_E_Base && prev != GCB_E_Base_GAZ;
+            }
 
-    return GCB_table[before][after];
+        default:
+            break;
+    }
+
+#ifdef DEBUGGING
+    Perl_re_printf( aTHX_  "Unhandled GCB pair: GCB_table[%d, %d] = %d\n",
+                                  before, after, GCB_table[before][after]);
+    assert(0);
+#endif
+    return TRUE;
+}
+
+STATIC GCB_enum
+S_backup_one_GCB(pTHX_ const U8 * const strbeg, U8 ** curpos, const bool utf8_target)
+{
+    GCB_enum gcb;
+
+    PERL_ARGS_ASSERT_BACKUP_ONE_GCB;
+
+    if (*curpos < strbeg) {
+        return GCB_EDGE;
+    }
+
+    if (utf8_target) {
+        U8 * prev_char_pos = reghopmaybe3(*curpos, -1, strbeg);
+        U8 * prev_prev_char_pos;
+
+        if (! prev_char_pos) {
+            return GCB_EDGE;
+        }
+
+        if ((prev_prev_char_pos = reghopmaybe3((U8 *) prev_char_pos, -1, strbeg))) {
+            gcb = getGCB_VAL_UTF8(prev_prev_char_pos, prev_char_pos);
+            *curpos = prev_char_pos;
+            prev_char_pos = prev_prev_char_pos;
+        }
+        else {
+            *curpos = (U8 *) strbeg;
+            return GCB_EDGE;
+        }
+    }
+    else {
+        if (*curpos - 2 < strbeg) {
+            *curpos = (U8 *) strbeg;
+            return GCB_EDGE;
+        }
+        (*curpos)--;
+        gcb = getGCB_VAL_CP(*(*curpos - 1));
+    }
+
+    return gcb;
 }
 
 /* Combining marks attach to most classes that precede them, but this defines
@@ -4326,7 +4425,7 @@ S_isLB(pTHX_ LB_enum before,
 
     PERL_ARGS_ASSERT_ISLB;
 
-    /* Rule numbers in the comments below are as of Unicode 8.0 */
+    /* Rule numbers in the comments below are as of Unicode 9.0 */
 
   redo:
     before = prev;
@@ -4420,14 +4519,14 @@ S_isLB(pTHX_ LB_enum before,
              * that is overriden */
             return LB_table[prev][after] != LB_NOBREAK_EVEN_WITH_SP_BETWEEN;
 
-        case LB_CM_foo:
+        case LB_CM_ZWJ_foo:
 
             /* We don't know how to treat the CM except by looking at the first
-             * non-CM character preceding it */
+             * non-CM character preceding it.  ZWJ is treated as CM */
             do {
                 prev = backup_one_LB(strbeg, &temp_pos, utf8_target);
             }
-            while (prev == LB_Combining_Mark);
+            while (prev == LB_Combining_Mark || prev == LB_ZWJ);
 
             /* Here, 'prev' is that first earlier non-CM character.  If the CM
              * attatches to it, then it inherits the behavior of 'prev'.  If it
@@ -4500,6 +4599,28 @@ S_isLB(pTHX_ LB_enum before,
             return LB_various_then_PO_or_PR;
         }
 
+        case LB_RI_then_RI + LB_NOBREAK:
+        case LB_RI_then_RI + LB_BREAKABLE:
+            {
+                int RI_count = 1;
+
+                /* LB30a Break between two regional indicator symbols if and
+                 * only if there are an even number of regional indicators
+                 * preceding the position of the break.
+                 *
+                 *  sot (RI RI)* RI × RI
+                 *  [^RI] (RI RI)* RI × RI */
+
+                while (backup_one_LB(strbeg,
+                                     &temp_pos,
+                                     utf8_target) == LB_Regional_Indicator)
+                {
+                    RI_count++;
+                }
+
+                return RI_count % 2 == 0;
+            }
+
         default:
             break;
     }
@@ -4884,7 +5005,7 @@ S_isWB(pTHX_ WB_enum previous,
 
     PERL_ARGS_ASSERT_ISWB;
 
-    /* Rule numbers in the comments below are as of Unicode 8.0 */
+    /* Rule numbers in the comments below are as of Unicode 9.0 */
 
   redo:
     before = prev;
@@ -4910,11 +5031,11 @@ S_isWB(pTHX_ WB_enum previous,
          * the beginning of a region of text', the rule is to break before
          * them, just like any other character.  Therefore, the default rule
          * applies and we don't have to look in more depth.  Should this ever
-         * change, we would have to have 2 'case' statements, like in the
-         * rules below, and backup a single character (not spacing over the
-         * extend ones) and then see if that is one of the region-end
-         * characters and go from there */
-        case WB_Ex_or_FO_then_foo:
+         * change, we would have to have 2 'case' statements, like in the rules
+         * below, and backup a single character (not spacing over the extend
+         * ones) and then see if that is one of the region-end characters and
+         * go from there */
+        case WB_Ex_or_FO_or_ZWJ_then_foo:
             prev = backup_one_WB(&previous, strbeg, &before_pos, utf8_target);
             goto redo;
 
@@ -5007,6 +5128,30 @@ S_isWB(pTHX_ WB_enum previous,
             return WB_table[before][after]
                                 - WB_NU_then_MB_or_MN_or_SQ == WB_BREAKABLE;
 
+        case WB_RI_then_RI + WB_NOBREAK:
+        case WB_RI_then_RI + WB_BREAKABLE:
+            {
+                int RI_count = 1;
+
+                /* Do not break within emoji flag sequences. That is, do not
+                 * break between regional indicator (RI) symbols if there is an
+                 * odd number of RI characters before the potential break
+                 * point.
+                 *
+                 * WB15     ^ (RI RI)* RI × RI
+                 * WB16 [^RI] (RI RI)* RI × RI */
+
+                while (backup_one_WB(&previous,
+                                     strbeg,
+                                     &before_pos,
+                                     utf8_target) == WB_Regional_Indicator)
+                {
+                    RI_count++;
+                }
+
+                return RI_count % 2 != 1;
+            }
+
         default:
             break;
     }
@@ -5087,8 +5232,8 @@ S_backup_one_WB(pTHX_ WB_enum * previous, const U8 * const strbeg, U8 ** curpos,
             *previous = (*curpos <= strbeg) ? WB_EDGE : WB_UNKNOWN;
         }
 
-        /* And we always back up over these two types */
-        if (wb != WB_Extend && wb != WB_Format) {
+        /* And we always back up over these three types */
+        if (wb != WB_Extend && wb != WB_Format && wb != WB_ZWJ) {
             return wb;
         }
     }
@@ -5119,7 +5264,7 @@ S_backup_one_WB(pTHX_ WB_enum * previous, const U8 * const strbeg, U8 ** curpos,
                 *curpos = (U8 *) strbeg;
                 return WB_EDGE;
             }
-        } while (wb == WB_Extend || wb == WB_Format);
+        } while (wb == WB_Extend || wb == WB_Format || wb == WB_ZWJ);
     }
     else {
         do {
@@ -6001,7 +6146,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
                                                         (U8*)(reginfo->strbeg)),
                                                 (U8*) reginfo->strend),
                                           getGCB_VAL_UTF8((U8*) locinput,
-                                                        (U8*) reginfo->strend));
+                                                        (U8*) reginfo->strend),
+                                          (U8*) reginfo->strbeg,
+                                          (U8*) locinput,
+                                          utf8_target);
                         }
                         break;
 
@@ -6383,7 +6531,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
                 while (locinput < reginfo->strend) {
                     GCB_enum cur_gcb = getGCB_VAL_UTF8((U8*) locinput,
                                                          (U8*) reginfo->strend);
-                    if (isGCB(prev_gcb, cur_gcb)) {
+                    if (isGCB(prev_gcb, cur_gcb,
+                              (U8*) reginfo->strbeg, (U8*) locinput,
+                              utf8_target))
+                    {
                         break;
                     }