path: root/invlist_inline.h
Commit message | Author | Age | Files | Lines
* regcomp.c - decompose into smaller files | Yves Orton | 2022-12-09 | 1 | -5/+95
    This splits a bunch of the subcomponents of the regex engine into smaller files:

        regcomp_debug.c
        regcomp_internal.h
        regcomp_invlist.c
        regcomp_study.c
        regcomp_trie.c

    The only real change, besides those to the build machine needed to achieve the split, is that it also adds some new defines which can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance, all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
* Improve parallel structure in one inline comment | James E Keenan | 2022-07-10 | 1 | -4/+5
* regex: Refactor bitmap vs non-bitmap of qr/[]/ | Karl Williamson | 2022-07-10 | 1 | -0/+36
    A bracketed character class in a pattern is generally represented by some form of ANYOF node, with matches of characters in the Latin1 range handled by a bitmap, and an inversion list for higher code point matches. But some patterns only have low matches, and some only high, and some match everything that is high.

    This commit refactors a little so that the distinction between nothing high matches vs everything high matches is done through the same technique. Previously one was indicated by a flag, and the other by a special value in the node's structure. Now there are two special values, and the flag is freed up for a potential future use.

    In the past the meaning of the flags has had to be overloaded to accommodate all the needs. freeing of a flag means

    This all allows for some slight simplifications.
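The two-tier lookup described in this message can be sketched as follows. This is only an illustration under stated assumptions, not the layout the core uses: `class_node`, `HIGH_MATCHES_NONE`, `HIGH_MATCHES_ALL`, `invlist_contains` and `class_matches` are hypothetical names, and the inversion list is modeled as a plain sorted array in which even-indexed elements start ranges that are in the set and odd-indexed elements start ranges that are not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinels standing in for the two "special values". */
#define HIGH_MATCHES_NONE ((size_t)-1)   /* nothing above 255 matches    */
#define HIGH_MATCHES_ALL  ((size_t)-2)   /* everything above 255 matches */

/* Hypothetical node: a bitmap for code points 0..255 plus an inversion
 * list (or one of the sentinels in high_len) for higher code points. */
typedef struct {
    uint8_t         bitmap[32];  /* one bit per code point below 256    */
    size_t          high_len;    /* element count, or a sentinel        */
    const uint32_t *high;        /* sorted boundaries; unused if sentinel */
} class_node;

/* A code point is in the set iff an odd number of boundaries are <= cp. */
static bool invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;
    while (lo < hi) {                       /* binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;                   /* lo == count of boundaries <= cp */
}

static bool class_matches(const class_node *n, uint32_t cp)
{
    if (cp < 256)
        return ((n->bitmap[cp >> 3] >> (cp & 7)) & 1) != 0;
    if (n->high_len == HIGH_MATCHES_NONE)
        return false;
    if (n->high_len == HIGH_MATCHES_ALL)
        return true;
    return invlist_contains(n->high, n->high_len, cp);
}
```

Handling both extremes through the same field means the high-code-point branch never has to consult a separate flag, which is the kind of simplification the message alludes to.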
* Make argument to S_is_invlist pointer to const | Dagfinn Ilmari Mannsåker | 2022-04-13 | 1 | -1/+1
    This lets us avoid casting away the const in S_invlist_max()
* style: Detabify indentation of the C code maintained by the core. | Michael G. Schwern | 2021-01-17 | 1 | -5/+5
    This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue.

    Done with `expand -i`.

    * vutil.* left alone, it's part of version.
    * Left regen managed files alone for now.
* PATCH: gh #17275 Silence new warning | Karl Williamson | 2019-11-21 | 1 | -0/+3
    This was caused by a static inline function in a header that was #included in a file that didn't use it. Normally, these functions are #ifdef'd so as to be visible only to files in which they are used. Some compilers warn that the function is defined but not used otherwise. The solution is to remove this function's visibility from the file that didn't use it.
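The #ifdef technique mentioned here can be sketched like this. The names are made up for illustration: `EXAMPLE_IN_MATCHER_C` plays the role of a per-file macro that a source file defines before including the header, and `example_range_count` is a hypothetical helper.

```c
/* Normally the including .c file defines this before its #includes;
 * it is defined here only so the sketch compiles on its own. */
#define EXAMPLE_IN_MATCHER_C

/* --- what would live in the shared header --- */
#ifdef EXAMPLE_IN_MATCHER_C
/* Visible only to files that opted in, so files that never call it do
 * not see an unused static function and have nothing to warn about. */
static inline unsigned example_range_count(unsigned boundary_count)
{
    /* n boundaries in an inversion list describe ceil(n / 2) ranges */
    return (boundary_count + 1) / 2;
}
#endif
```

A file that does not define the macro compiles the header to nothing for this function, which is how the warning is avoided.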
* Reimplement tr/// without swashes | Karl Williamson | 2019-11-06 | 1 | -2/+3
    This large commit removes the last use of swashes from core.

    It replaces swashes by inversion maps. This data structure is already in use for some Unicode properties, such as case changing.

    The inversion map data structure leads to straightforward implementation code, so I collapsed the two doop.c routines do_trans_complex_utf8() and do_trans_simple_utf8() into one. A few conditionals could be avoided in the loop if this function were split so that one version didn't have to test for, e.g., squashing, but I suspect these are in the noise in the loop, which has to deal with UTF-8 conversions. This should be faster than the previous implementation anyway. I measured the differences some releases back, and inversion maps were faster than the equivalent swash for up to 512 or 1024 different ranges. These numbers are unlikely to be exceeded in tr/// except possibly in machine-generated ones.

    Inversion maps are capable of handling both UTF-8 and non-UTF-8 cases, but I left in the existing non-UTF-8 implementation, which uses tables, because I suspect it is faster. This means that there is extra code, purely for runtime performance.

    An inversion map is always created from the input, and then if the table implementation is to be used, the table is easily derived from the map. Prior to this commit, the table implementation was used in certain edge cases involving code points above 255. Those cases are now handled by the inversion map implementation, because it would have taken extra code to detect them, and I didn't think it was worth it. That could be changed if I am wrong.

    Creating an inversion map for all inputs essentially normalizes them, and then the same logic is usable for all. This fixes some false negatives in the previous implementation. It also allows for detecting if the actual transliteration can be done in place. Previously, the code mostly punted on that detection for the UTF-8 case.

    This also allows for accurate counting of the lengths of the two sides, fixing some longstanding TODO warning tests.

    A new flag is created, OPpTRANS_CAN_FORCE_UTF8, when the tr/// has a below 256 character resolving to one that requires UTF-8. If this isn't set, the code knows that a non-UTF-8 input won't become UTF-8 in the process, and so can take shortcuts. The bit representing this flag is the same as OPpTRANS_FROM_UTF, which is no longer used. That name is left in so that the dozen-ish modules in CPAN that refer to it can still compile. AFAICT none of them actually use the flag, as well they shouldn't since it is private to the core.

    Inversion maps are ideally suited for tr/// implementations. An issue with them in general is that for some pathological data, they can become fragmented, requiring more space than you would expect to represent the underlying data. However, the typical tr/// would not have this issue, requiring only very short inversion maps to represent; in some cases shorter than the table implementation.

    Inversion maps are also easier to deparse than swashes. A deparse TODO was also fixed by this commit, and the code to deparse UTF-8 inputs is simplified.

    One could implement specialized data structures for specific types of inputs. For example, a common tr/// form is a single range, like tr/A-Z/a-z/. That could be implemented without a table and be quite fast. An intermediate step would be to use the inversion map implementation always when the transliteration is a single range, and then special case length=1 maps at execution time.

    Thanks to Nicholas Rochemagne for his help on B
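To make the data structure concrete, here is a minimal sketch of an inversion map for the tr/A-Z/a-z/ example from the message. It is an illustration only and does not reproduce the core's representation: `inv_map`, `MAP_UNCHANGED` and `inv_map_lookup` are hypothetical names. The assumed convention is two parallel arrays, where `starts[i]` is the first code point of range i and `maps[i]` is what that first code point maps to, with the rest of the range mapping to consecutive values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: code points in this range are left as they are. */
#define MAP_UNCHANGED UINT32_MAX

typedef struct {
    const uint32_t *starts;  /* sorted; starts[0] must be 0             */
    const uint32_t *maps;    /* mapping of starts[i], or MAP_UNCHANGED  */
    size_t          len;     /* number of ranges                        */
} inv_map;

static uint32_t inv_map_lookup(const inv_map *m, uint32_t cp)
{
    size_t lo = 0, hi = m->len;

    /* Binary search for the number of range starts that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->starts[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    size_t i = lo - 1;       /* safe because starts[0] == 0 <= cp */
    if (m->maps[i] == MAP_UNCHANGED)
        return cp;
    return m->maps[i] + (cp - m->starts[i]);   /* offset within the range */
}

/* tr/A-Z/a-z/ needs only three ranges. */
static const uint32_t upper_to_lower_starts[] = { 0, 'A', 'Z' + 1 };
static const uint32_t upper_to_lower_maps[]   = { MAP_UNCHANGED, 'a', MAP_UNCHANGED };
static const inv_map  upper_to_lower = {
    upper_to_lower_starts, upper_to_lower_maps, 3
};
```

Three entries cover the whole code point space here, whereas a lookup table for the same transliteration would need one slot per byte value; this matches the message's remark that typical tr/// maps are very short.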
* op.c: Add debugging dump function | Karl Williamson | 2019-11-06 | 1 | -1/+2
    This function dumps out an inversion map
* Move some static fcns from regcomp.c to invlist_inline.h | Karl Williamson | 2019-11-06 | 1 | -3/+144
    They are still only accessible from regcomp.c, but this is in preparation for them to be usable from other core files as well.
* invlist_inline.h: White space only | Karl Williamson | 2019-11-06 | 1 | -1/+5
    Fold a too-long line
* invlist_inline.h: Restrict files symbols are in | Karl Williamson | 2019-11-06 | 1 | -1/+1
    These are only needed in regcomp.c, so restrict them to that file
* is_invlist(): Allow NULL input | Karl Williamson | 2019-03-12 | 1 | -3/+1
    For generality, it should allow a NULL and return FALSE.
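The behavior described above can be sketched with stand-in types. This is not the core's code: `value`, `value_kind` and `is_invlist_sketch` are hypothetical, and the real function tests a Perl SV rather than this toy struct.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a tagged value type. */
typedef enum { KIND_OTHER, KIND_INVLIST } value_kind;
typedef struct { value_kind kind; } value;

/* Accepts NULL and reports false for it, so callers need no separate
 * NULL check. The pointer is to const because the predicate only reads. */
static bool is_invlist_sketch(const value *v)
{
    return v != NULL && v->kind == KIND_INVLIST;
}
```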
* pp.c: Avoid use of unsafe function | Karl Williamson | 2019-02-04 | 1 | -1/+1
    The function is unsafe because it doesn't check for running off the end of the buffer if presented with illegal UTF-8. The only remaining use now is from mathoms.c.
* Provide header guards to prevent re-inclusion | James E Keenan | 2018-12-04 | 1 | -0/+5
    Per LGTM analysis:
    https://lgtm.com/projects/g/Perl/perl5/alerts/?mode=tree&ruleFocus=2163210746

    and LGTM recommendation:
    https://lgtm.com/rules/2163210746/

    For: RT 133699
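A header guard of the kind this commit adds looks roughly like the sketch below. The guard macro `EXAMPLE_INLINE_H_` and the function `example_twice` are hypothetical; the guarded section is written out twice to simulate the header being #included twice in one translation unit.

```c
/* First inclusion: the guard macro is not yet defined, so the body is seen. */
#ifndef EXAMPLE_INLINE_H_
#define EXAMPLE_INLINE_H_

static inline int example_twice(int x)
{
    return 2 * x;
}

#endif /* EXAMPLE_INLINE_H_ */

/* Simulated second inclusion: the guard is now defined, so the whole body
 * is skipped. Without the guard this would be a redefinition error. */
#ifndef EXAMPLE_INLINE_H_
#define EXAMPLE_INLINE_H_

static inline int example_twice(int x)
{
    return 2 * x;
}

#endif /* EXAMPLE_INLINE_H_ */
```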
* invlist_inline.h: Two symbols no longer needed in utf8.c | Karl Williamson | 2018-08-20 | 1 | -1/+1
    This is because the code using them has been moved to regcomp.c in cef721997e14497f2fbc4be17ab736ad7ddfda29
* Add inline function to hide implementation details | Karl Williamson | 2018-08-20 | 1 | -2/+10
* Use charnames inversion lists | Karl Williamson | 2018-03-31 | 1 | -1/+1
    This commit makes the inversion lists for parsing character names global instead of interpreter-level, so they can be initialized once per process, and no copies are created upon new thread instantiation. More importantly, this is another instance where utf8_heavy.pl no longer needs to be loaded, and the definition files read from disk.
* inline_invlist.c -> invlist_inline.h | Jarkko Hietaniemi | 2015-07-22 | 1 | -0/+87