author | Karl Williamson <public@khwilliamson.com> | 2011-11-22 12:06:41 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2012-01-13 09:58:34 -0700 |
commit | c4a5db0c445f65e05b7efe68951a24a5f1826c18 (patch) | |
tree | ababcfae06e346506975815726d89977c636def1 /embed.h | |
parent | 4a7e937e5e38b45d73f74bc6acde04dfb15ad1aa (diff) | |
download | perl-c4a5db0c445f65e05b7efe68951a24a5f1826c18.tar.gz |
utf8.c: New function to retrieve non-copy of swash
Currently, swash_init returns a copy of the swash it finds. The core
portions of the swash are read-only, and the non-read-only portions are
derived from them. When the value for a code point is looked up, the
results for it and adjacent code points are stored in a new element,
so that the lookup never has to be performed again. But since a copy is
returned, those results are stored only in the copy, and any other uses
of the same logical swash don't have access to them, so the lookups have
to be performed for each logical use.
Here's an example. If you have 2 occurrences of /\p{Upper}/ in your
program, there are 2 different swashes created, both initialized
identically. As you start matching against code points, say "A" =~
/\p{Upper}/, the swashes diverge, as the results for each match are
saved in the one applicable to that match. If you match "A" in each
swash, it has to be looked up in each swash, and an (identical) element
will be saved for it in each swash. This is wasteful of both time and
memory.
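The duplication described above can be sketched in a few lines of C. This is only a toy model, not Perl's implementation: `toy_swash`, `slow_is_upper`, and `toy_match` are hypothetical names, the "expensive" lookup is a trivial ASCII range check, and the cache is a flat array rather than a hash of bitmaps.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a swash: a per-instance cache of results
 * that have already been looked up (ASCII range only). */
typedef struct {
    bool cached[128];   /* has this code point been looked up yet? */
    bool value[128];    /* the cached answer, valid if cached[cp] */
} toy_swash;

static int slow_lookups = 0;   /* counts trips to the "expensive" source */

/* Stand-in for the expensive lookup behind \p{Upper}. */
static bool slow_is_upper(unsigned cp) {
    slow_lookups++;
    return cp >= 'A' && cp <= 'Z';
}

/* Answer from the cache if possible, otherwise do the slow lookup
 * and remember the result in this particular swash. */
static bool toy_match(toy_swash *sw, unsigned cp) {
    if (cp >= 128)
        return false;
    if (!sw->cached[cp]) {
        sw->value[cp] = slow_is_upper(cp);
        sw->cached[cp] = true;
    }
    return sw->value[cp];
}

/* Two occurrences of the pattern, each holding its own copy. */
static int lookups_with_copies(void) {
    toy_swash first, second;
    memset(&first, 0, sizeof first);
    memset(&second, 0, sizeof second);
    slow_lookups = 0;
    toy_match(&first, 'A');    /* slow lookup, cached in first  */
    toy_match(&second, 'A');   /* slow lookup again, in second  */
    toy_match(&first, 'A');    /* served from first's cache     */
    return slow_lookups;
}
```

Calling `lookups_with_copies()` performs the slow lookup once per copy, even though both copies end up holding the same cached element.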
This patch renames the function and returns the original and not a copy,
thus eliminating the overhead for swashes accessed through the new
interface. The old function name is serviced by a new function which
merely wraps the result of the new function in a copy, thus preserving
the interface for existing calls.
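The shape of that change can be illustrated with a minimal C sketch. The names `toy_core_swash_init` and `toy_swash_init` are hypothetical stand-ins, and the one-field struct is an assumption made for brevity; the real functions take arguments and return Perl SVs.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a swash and its lookup cache. */
typedef struct {
    int cache_entries;   /* number of results cached so far */
} toy_swash;

/* The single original that every new-style caller shares. */
static toy_swash shared_upper = { 0 };

/* New-style entry point: hands back the original, not a copy. */
static toy_swash *toy_core_swash_init(void) {
    return &shared_upper;
}

/* Old-style entry point: wraps the new one and returns a private
 * copy, so existing callers see the same behaviour as before.
 * The caller owns the copy and must free() it. */
static toy_swash *toy_swash_init(void) {
    toy_swash *copy = malloc(sizeof *copy);
    if (copy != NULL)
        memcpy(copy, toy_core_swash_init(), sizeof *copy);
    return copy;
}
```

In this sketch, anything cached through the pointer from `toy_core_swash_init()` is visible to every other new-style caller, while a copy from `toy_swash_init()` keeps its own state.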
Thus, in the example above, there is only one swash, and matching "A"
against it results in only one new element, and so the second use will
find that, and not have to go out looking again. In a program with lots
of regular expressions, the savings in time and memory can be quite
large.
The new name is restricted to use only in regcomp.c, regexec.c, and
utf8.c (unless XS code cheats the preprocessor), where we will code so
as to not destroy the original's data. Otherwise, a change to that would
change the definition of a Unicode property everywhere in the program.
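The restriction is enforced by the preprocessor: the short name is only defined as a macro when the including file has announced itself with a marker symbol. A rough sketch of the idea follows; the `TOY_IN_*` markers and `toy_*` names are hypothetical, chosen to mirror the `PERL_IN_*` convention without reproducing it.

```c
/* Hypothetical: pretend this translation unit is one of the files
 * permitted to use the restricted name. */
#define TOY_IN_REGCOMP_C

/* The function itself always exists under its long name. */
static int toy__core_init(int x) {
    return x + 1;
}

/* The convenient short name is only defined for the permitted files. */
#if defined(TOY_IN_REGCOMP_C) || defined(TOY_IN_UTF8_C)
#  define toy_core_init(x) toy__core_init(x)
#endif

/* Returns the function's result if the short name is visible here,
 * or -1 if this file was not granted access to it. */
static int use_short_name(void) {
#ifdef toy_core_init
    return toy_core_init(41);
#else
    return -1;
#endif
}
```

A file that does not define one of the markers never sees the macro. As the message notes, code can still define the marker itself to get around this, so the guard is a convention rather than a hard barrier.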
Note that there are no current callers of the new interface; these will
be added in future commits.
Diffstat (limited to 'embed.h')
-rw-r--r-- | embed.h | 3 |
1 files changed, 3 insertions, 0 deletions
@@ -942,6 +942,9 @@
 #define set_regclass_bit_fold(a,b,c,d,e) S_set_regclass_bit_fold(aTHX_ a,b,c,d,e)
 #define study_chunk(a,b,c,d,e,f,g,h,i,j,k) S_study_chunk(aTHX_ a,b,c,d,e,f,g,h,i,j,k)
 # endif
+# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
+#define _core_swash_init(a,b,c,d,e) Perl__core_swash_init(aTHX_ a,b,c,d,e)
+# endif
 # if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_UTF8_C)
 #define _append_range_to_invlist(a,b,c) Perl__append_range_to_invlist(aTHX_ a,b,c)
 #define _invlist_intersection(a,b,c) Perl__invlist_intersection(aTHX_ a,b,c)