regen/mk_invlists.pl: Collapse unused boundary values

Each Unicode property that specifies boundary conditions, like Word_Break, partitions all the Unicode code points into equivalence classes. So, for example, combining marks are placed into the Extend class, because they are usually used to extend the previous character and don't stand on their own.

mk_invlists.pl creates a boolean table of all pairwise combinations of these classes, so that a simple lookup tells whether a break is permitted between a first character of class X and a following character of class Y. However, in some cases the answer isn't as simple as this, and other means, such as the characters in the vicinity of X and Y, must be used to disambiguate. In these cases the table value in the cell (X,Y) isn't a boolean, but some other number indicating a specially crafted code section to execute to resolve the issue.

Over the years, Unicode has tended to subdivide partitions into smaller ones, as they've refined their algorithms. But with Unicode 11, they used another method and effectively removed partitions: the partitions are retained, but no code point actually takes on the value of an obsolete partition.

In order to not have to change the algorithm unnecessarily between Unicode releases (who knows, they might change their minds and unobsolete these next time), mk_invlists has just kept the tables around, but those cells won't ever get accessed because no code point in the current release evaluates to them. That makes the tables unnecessarily large.

We can achieve the same thing by mapping each unused equivalence class to the same value, which we call 'unused'. The algorithms that refer to the obsolete partitions go through the data assigning values to the cells, but now the cells overlap, since all obsolete classes map to the same row or column. Thus the data in those cells is total garbage. But that doesn't matter, since that row or column is never read when processing data from the Unicode release the table is constructed for.

mk_invlists can also compile older Unicode releases, and this makes those tables smaller than before, with all unused classes in a given release collapsed into a single row and single column of (unused) garbage.
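
The collapsing idea can be sketched in C roughly as follows. This is only an illustration and not the code in mk_invlists.pl, which is Perl: the enum values, `collapse_classes`, and `lookup` are hypothetical names, and the real class sets are much larger. It assumes a per-class flag saying whether any code point in the compiled release takes on that class.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical break classes, not the real Word_Break value set.
 * WB_OLD_A and WB_OLD_B stand in for classes that no code point
 * takes on in the Unicode release being compiled. */
enum wb_class { WB_OTHER, WB_EXTEND, WB_OLD_A, WB_LETTER, WB_OLD_B, WB_COUNT };

/* Fill remap[] so each used class gets its own compact index and all
 * unused classes share one trailing 'unused' index.
 * Returns the dimension of the collapsed square table. */
static size_t collapse_classes(const int in_use[WB_COUNT],
                               size_t remap[WB_COUNT])
{
    size_t next = 0;
    size_t n_unused = 0;

    for (size_t i = 0; i < WB_COUNT; i++) {
        if (in_use[i])
            remap[i] = next++;
        else
            n_unused++;
    }
    if (n_unused == 0)
        return next;            /* nothing to collapse */

    for (size_t i = 0; i < WB_COUNT; i++) {
        if (!in_use[i])
            remap[i] = next;    /* every unused class lands here */
    }
    return next + 1;
}

/* Look up the cell for (before, after) in a dim x dim table stored
 * row by row. The value is either a boolean or an action number. */
static unsigned char lookup(const unsigned char *table, size_t dim,
                            const size_t remap[WB_COUNT],
                            enum wb_class before, enum wb_class after)
{
    assert(before < WB_COUNT && after < WB_COUNT);
    return table[remap[before] * dim + remap[after]];
}
```

With two of five classes unused, the table shrinks from 5x5 to 4x4. Writes for WB_OLD_A and WB_OLD_B hit the same row and column, so whichever write comes last wins; that is harmless only because no code point in the compiled release maps to either class.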
author: Karl Williamson <khw@cpan.org> 2018-07-21 14:09:24 -0600
committer: Karl Williamson <khw@cpan.org> 2018-07-21 15:26:48 -0600
commit: 2027d3658f4b767823e788e70ad97e67d3aa4ff2 (patch)
tree: cc77f3ac4b4807c55e5a9e7b08511755ac4d27af /uni_keywords.h
parent: 1c7f0e60a2c7edac13fe96fa358750553f733f66 (diff)
download: perl-2027d3658f4b767823e788e70ad97e67d3aa4ff2.tar.gz
1 files changed, 1 insertions, 1 deletions
diff --git a/uni_keywords.h b/uni_keywords.h
index f71cd5c38f..590c158397 100644
--- a/uni_keywords.h
+++ b/uni_keywords.h
@@ -6994,6 +6994,6 @@ MPH_VALt match_uniprop( const unsigned char * const key, const U16 key_len ) {
  * 7bd6bcbe3813e0cd55e0998053d182b7bc8c97dcfd0b85028e9f7f55af4ad61b lib/unicore/version
  * 4bb677187a1a64e39d48f2e341b5ecb6c99857e49d7a79cf503bd8a3c709999b regen/charset_translations.pl
  * 03e51b0f07beebd5da62ab943899aa4934eee1f792fa27c1fb638c33bf4ac6ea regen/mk_PL_charclass.pl
- * a232f3394a90476c4c315c1d067d92243d0d0e38ed4e4bc8d4bb46978c45eaeb regen/mk_invlists.pl
+ * 97779934175e3b7dc8c302762e8e9f342bb828583dd7f862ed85be5906aa939c regen/mk_invlists.pl
  * c42c035b18a0426443184e9f889aa2b16bef5a9add9805cd853c4e2a783712ff regen/mph.pl
  * ex: set ro: */