utf8.c: New function to retrieve non-copy of swash

Currently, swash_init returns a copy of the swash it finds. The core portions of the swash are read-only, and the non-read-only portions are derived from them. When the value for a code point is looked up, the results for it and adjacent code points are stored in a new element, so that the lookup never has to be performed again. But since a copy is returned, those results are stored only in the copy, and any other uses of the same logical swash don't have access to them, so the lookups have to be performed for each logical use.

Here's an example. If you have 2 occurrences of /\p{Upper}/ in your program, there are 2 different swashes created, both initialized identically. As you start matching against code points, say "A" =~ /\p{Upper}/, the swashes diverge, as the results for each match are saved in the one applicable to that match. If you match "A" in each swash, it has to be looked up in each swash, and an (identical) element will be saved for it in each swash. This is wasteful of both time and memory.

This patch renames the function and returns the original and not a copy, thus eliminating the overhead for swashes accessed through the new interface. The old function name is serviced by a new function which merely wraps the new function's result with a copy, thus preserving the interface for existing calls.

Thus, in the example above, there is only one swash, and matching "A" against it results in only one new element, and so the second use will find that, and not have to go out looking again. In a program with lots of regular expressions, the savings in time and memory can be quite large.

The new name is restricted to use only in regcomp.c and utf8.c (unless XS code cheats the preprocessor), where we will code so as to not destroy the original's data. Otherwise, a change to that would change the definition of a Unicode property everywhere in the program.

Note that there are no current callers of the new interface; these will be added in future commits.
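
The difference between handing out copies and handing out the original can be sketched with a toy model in C. This is only an illustration, not Perl's implementation: toy_swash, toy_slow_lookup, toy_core_swash_init, toy_swash_init, toy_swash_fetch and the slow_lookups counter are all hypothetical names, the "property" covers ASCII only, and a flat array stands in for the cached elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_CP 128          /* toy range: ASCII only (assumption) */

/* Hypothetical stand-in for a swash: a read-only core plus a cache
 * of results derived from it. */
typedef struct {
    const char *property;               /* read-only core */
    signed char cache[TOY_MAX_CP];      /* -1 = not looked up yet, else 0/1 */
} toy_swash;

static int slow_lookups = 0;    /* counts how often the slow path runs */

/* Stand-in for the expensive lookup in the property's definition. */
static bool toy_slow_lookup(unsigned cp)
{
    slow_lookups++;
    return cp >= 'A' && cp <= 'Z';
}

/* Modeled on the new interface: return the one shared original. */
static toy_swash *toy_core_swash_init(void)
{
    static toy_swash original;
    static bool initialized = false;

    if (!initialized) {
        original.property = "Upper";
        memset(original.cache, -1, sizeof original.cache);
        initialized = true;
    }
    return &original;
}

/* Modeled on the old interface: wrap the new one and return a copy,
 * so cached results never flow back to the original. */
static toy_swash *toy_swash_init(void)
{
    toy_swash *copy = malloc(sizeof *copy);

    if (copy != NULL)
        *copy = *toy_core_swash_init();
    return copy;
}

/* Consult the cache first; run the slow lookup only on a miss and
 * remember its result in this particular swash. */
static bool toy_swash_fetch(toy_swash *sw, unsigned cp)
{
    assert(cp < TOY_MAX_CP);
    if (sw->cache[cp] < 0)
        sw->cache[cp] = toy_slow_lookup(cp) ? 1 : 0;
    return sw->cache[cp] == 1;
}
```

In this model, two swashes obtained from toy_swash_init each run the slow lookup the first time "A" is fetched, while two uses of the pointer returned by toy_core_swash_init share a single cache, so the slow lookup runs only once between them. It also shows why the original must be handled carefully: anything written through that pointer is seen by every user.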
author: Karl Williamson <public@khwilliamson.com> 2011-11-22 12:06:41 -0700
committer: Karl Williamson <public@khwilliamson.com> 2012-01-13 09:58:34 -0700
commit: c4a5db0c445f65e05b7efe68951a24a5f1826c18 (patch)
tree: ababcfae06e346506975815726d89977c636def1 /embed.h
parent: 4a7e937e5e38b45d73f74bc6acde04dfb15ad1aa (diff)
download: perl-c4a5db0c445f65e05b7efe68951a24a5f1826c18.tar.gz
1 files changed, 3 insertions, 0 deletions
diff --git a/embed.h b/embed.h
index 9d140ee869..3bf886a92b 100644
--- a/embed.h
+++ b/embed.h
@@ -942,6 +942,9 @@
 #define set_regclass_bit_fold(a,b,c,d,e)	S_set_regclass_bit_fold(aTHX_ a,b,c,d,e)
 #define study_chunk(a,b,c,d,e,f,g,h,i,j,k)	S_study_chunk(aTHX_ a,b,c,d,e,f,g,h,i,j,k)
 #  endif
+#  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
+#define _core_swash_init(a,b,c,d,e)	Perl__core_swash_init(aTHX_ a,b,c,d,e)
+#  endif
 #  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_UTF8_C)
 #define _append_range_to_invlist(a,b,c)	Perl__append_range_to_invlist(aTHX_ a,b,c)
 #define _invlist_intersection(a,b,c)	Perl__invlist_intersection(aTHX_ a,b,c)