author | Karl Williamson <public@khwilliamson.com> | 2013-09-22 21:36:29 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2013-09-24 11:36:19 -0600 |
commit | cdd87c1d4df41f9a54cccff996fa64d291adcee8 (patch) | |
tree | 433feb19de18b9fb38c9a4dc8c1830ad3ae4be6c /embed.h | |
parent | fb38762fa113a105b623d0eb7681d2cc03b0c161 (diff) | |
download | perl-cdd87c1d4df41f9a54cccff996fa64d291adcee8.tar.gz |
Teach regex optimizer to handle above-Latin1
Until this commit, the regular expression optimizer has essentially
punted on above-Latin1 code points. Under some circumstances, they
would be taken into account, more or less, but often, the generated
synthetic start class would end up matching all above-Latin1 code
points. With the advent of inversion lists, it becomes feasible to
actually fully handle such code points, as inversion lists are a
convenient way to express arbitrary lists of code points and take their
union, intersection, etc. This commit changes the optimizer to use
inversion lists for operating on the code points the synthetic start
class can match.
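
To make the inversion-list idea above concrete, here is a minimal sketch in C. It is not Perl's internal implementation: the `invlist` type and the `invlist_contains` and `invlist_combine` names are hypothetical, chosen only for illustration. The list is a sorted array of code points in which each entry toggles membership, so entries at even indices start a range that is in the set, entries at odd indices start a range that is not, and an odd-length list extends to the highest code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type, not Perl's internal representation.
 * v[] holds strictly increasing code points; each one toggles membership. */
typedef struct {
    const uint32_t *v;
    size_t len;
} invlist;

/* Membership test: binary-search for the number of boundaries <= cp.
 * An odd count means the last toggle switched membership on. */
static int invlist_contains(const invlist *l, uint32_t cp)
{
    size_t lo = 0, hi;

    assert(l != NULL);
    hi = l->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (l->v[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)(lo & 1);
}

/* Merge two lists in one pass. want_both == 0 gives the union,
 * want_both != 0 gives the intersection. out must have room for
 * a->len + b->len entries; the number of entries written is returned. */
static size_t invlist_combine(const invlist *a, const invlist *b,
                              int want_both, uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    int in_a = 0, in_b = 0, in_out = 0;

    assert(a != NULL && b != NULL && out != NULL);
    while (i < a->len || j < b->len) {
        uint32_t cp;
        int now;

        /* The next boundary is the smaller of the two list heads. */
        if (j >= b->len || (i < a->len && a->v[i] <= b->v[j]))
            cp = a->v[i];
        else
            cp = b->v[j];

        /* Apply every toggle that happens at this code point. */
        if (i < a->len && a->v[i] == cp) { in_a = !in_a; i++; }
        if (j < b->len && b->v[j] == cp) { in_b = !in_b; j++; }

        /* Emit a boundary only when the combined membership changes. */
        now = want_both ? (in_a && in_b) : (in_a || in_b);
        if (now != in_out) {
            out[n++] = cp;
            in_out = now;
        }
    }
    return n;
}
```

With this representation, `[A-Z]` is the two-entry list `{0x41, 0x5B}` and "every code point above Latin1" is the one-entry list `{0x100}`; their union is `{0x41, 0x5B, 0x100}`. Both operations cost time linear in the number of boundaries, regardless of how many code points the ranges cover.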
I don't much understand the overall operation of the optimizer. I'm
told that previous porters found that perturbing it caused unexpected
behaviors. I had promised to get this change in 5.18, but didn't. I'm
trying to get it in early enough into the 5.20 preliminary series that
any problems will surface before 5.20 ships.
This commit doesn't change the macro level logic, but does significantly
change various micro level things. Thus the 'and' and 'or' subroutines
have been rewritten to use inversion lists. I'm pretty confident that
they do what their names suggest. I re-derived the equations for what
these operations should do, getting the same results in some cases, but
extending others where the previous code mostly punted. The derivations
are given in comments in the respective routines.
Some of the code is greatly simplified, as it no longer has to treat
above-Latin1 specially.
It is now feasible for /i matching of above-Latin1 code points to know
explicitly the folds that should be in the synthetic start class. But
more preparatory work needs to be done before putting that into place.
...
Diffstat (limited to 'embed.h')
-rw-r--r-- | embed.h | 8 |
1 files changed, 4 insertions, 4 deletions
@@ -947,16 +947,16 @@
 #define set_ANYOF_arg(a,b,c,d,e,f) S_set_ANYOF_arg(aTHX_ a,b,c,d,e,f)
 #define ssc_add_range(a,b,c) S_ssc_add_range(aTHX_ a,b,c)
 #define ssc_and(a,b,c) S_ssc_and(aTHX_ a,b,c)
-#define ssc_anything S_ssc_anything
+#define ssc_anything(a,b) S_ssc_anything(aTHX_ a,b)
 #define ssc_clear_locale(a) S_ssc_clear_locale(aTHX_ a)
 #define ssc_cp_and(a,b) S_ssc_cp_and(aTHX_ a,b)
 #define ssc_finalize(a,b) S_ssc_finalize(aTHX_ a,b)
 #define ssc_flags_and S_ssc_flags_and
-#define ssc_init S_ssc_init
+#define ssc_init(a,b) S_ssc_init(aTHX_ a,b)
 #define ssc_intersection(a,b,c) S_ssc_intersection(aTHX_ a,b,c)
-#define ssc_is_anything S_ssc_is_anything
+#define ssc_is_anything(a) S_ssc_is_anything(aTHX_ a)
 #define ssc_is_cp_posixl_init(a,b) S_ssc_is_cp_posixl_init(aTHX_ a,b)
-#define ssc_or S_ssc_or
+#define ssc_or(a,b,c) S_ssc_or(aTHX_ a,b,c)
 #define ssc_union(a,b,c) S_ssc_union(aTHX_ a,b,c)
 #define study_chunk(a,b,c,d,e,f,g,h,i,j,k) S_study_chunk(aTHX_ a,b,c,d,e,f,g,h,i,j,k)
 # endif
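
The four changed macros in this hunk move from plain aliases to function-like macros that pass `aTHX_`, which suggests the underlying `S_ssc_*` functions now take the interpreter context as their first argument. The pattern can be sketched as follows. All names here (`interp`, `pDEMO_`, `aDEMO_`, `S_demo_is_anything`, `demo_caller`) are hypothetical stand-ins, not Perl's definitions, and only a build with an explicit context argument is shown. In a build without one, Perl's corresponding macros expand to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an interpreter context structure. */
typedef struct interp {
    int calls;              /* counts helper invocations, for the demo only */
} interp;

/* Parameter-side and argument-side context macros. The trailing
 * underscore signals that a comma is included, so they can be placed
 * directly in front of the remaining parameters or arguments. */
#define pDEMO_  interp *my_interp,
#define aDEMO_  my_interp,

/* The helper receives the context explicitly as its first parameter. */
static int S_demo_is_anything(pDEMO_ int flags)
{
    assert(my_interp != NULL);
    my_interp->calls++;
    return flags == 0;
}

/* Function-like wrapper: callers write demo_is_anything(x) and the macro
 * supplies whatever variable named my_interp is in scope at the call site. */
#define demo_is_anything(a)  S_demo_is_anything(aDEMO_ a)

/* Any caller that itself has my_interp in scope can use the short name. */
static int demo_caller(interp *my_interp, int flags)
{
    return demo_is_anything(flags);
}
```

A plain alias such as `#define demo_is_anything S_demo_is_anything` cannot inject the extra argument, which is why a function that starts needing the context also needs its macro rewritten in the function-like form.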