path: root/doop.c
Commit message | Author | Age | Files | Lines
* doop.c: do_vecget(): Add trivial case to the switch()Karl Williamson2021-08-061-8/+9
| | | | | We can save another conditional by adding a default: case to the switch statement created by the previous commit.
* doop.c: Refactor do_vecget()Karl Williamson2021-08-061-116/+50
| | | | | By using a switch statement this function can be cut in half, with fewer conditionals executed.
* doop.c: White space onlyKarl Williamson2021-08-061-2/+2
|
* doop.c: Call the macro instead of reinventing itKarl Williamson2021-08-061-1/+1
|
* doop.c: Refactor do_vecset()Karl Williamson2021-08-061-25/+20
| | | | | By converting to a switch statement with fall through, some redundancies can be removed and conditionals avoided.
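The fall-through idea can be sketched as follows. This is an illustration only, not the code in doop.c: the function name `store_bits_be` is hypothetical, big-endian byte order is an assumption of the sketch, and the sub-byte widths (1, 2 and 4 bits) are left out.

```c
#include <stdint.h>

/* Hypothetical helper: write the low `bits` bits of `val` to `s`,
 * most significant byte first. Returns 0 on success, -1 for a width
 * this sketch does not handle. */
int store_bits_be(uint8_t *s, unsigned bits, uint64_t val)
{
    switch (bits) {
    case 64:
        *s++ = (uint8_t)(val >> 56);
        *s++ = (uint8_t)(val >> 48);
        *s++ = (uint8_t)(val >> 40);
        *s++ = (uint8_t)(val >> 32);
        /* FALLTHROUGH: the remaining four bytes are stored as for 32 */
    case 32:
        *s++ = (uint8_t)(val >> 24);
        *s++ = (uint8_t)(val >> 16);
        /* FALLTHROUGH: the remaining two bytes are stored as for 16 */
    case 16:
        *s++ = (uint8_t)(val >> 8);
        /* FALLTHROUGH: the last byte is stored as for 8 */
    case 8:
        *s = (uint8_t)val;
        return 0;
    default:
        return -1;
    }
}
```

Each wider case stores only the bytes that the narrower cases do not, so the byte-storing statements appear once and the width is tested once.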
* doop.c: Rmv redundant '&' instrsKarl Williamson2021-07-301-15/+15
| | | | | Casting to U8 has the same effect as ANDing with 0xFF. Remove the redundant '&'
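A small sketch of why the mask is redundant, using the standard `uint8_t` as a stand-in for Perl's `U8` (an assumption of this example; the function names are hypothetical). Converting an unsigned value to an 8-bit unsigned type reduces it modulo 256, which is the same as keeping the low eight bits.

```c
#include <stdint.h>

/* Low byte with an explicit mask followed by the cast. */
uint8_t low_byte_masked(unsigned long v)
{
    return (uint8_t)(v & 0xFF);
}

/* Low byte with the cast alone: conversion to an unsigned 8-bit type
 * is defined as reduction modulo 256, so the result is identical. */
uint8_t low_byte_cast(unsigned long v)
{
    return (uint8_t)v;
}
```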
* style: Detabify indentation of the C code maintained by the core.Michael G. Schwern2021-01-171-375/+375
| This just detabifies to get rid of the mixed tab/space indentation.
|
| Applying consistent indentation and dealing with other tabs are another issue.
|
| Done with `expand -i`.
|
| * vutil.* left alone, it's part of version.
| * Left regen managed files alone for now.
* doop.c: Comment, white-space onlyKarl Williamson2020-08-081-2/+2
| | | | This removes an obsolete comment
* (perl #17844) don't update SvCUR until after we've done movingTony Cook2020-07-301-1/+1
| | | | | | | | | | Updating SvCUR() before the SvGROW() calls could result in reading beyond the end of a buffer. It wasn't a problem in the normal case, since sv_grow() just calls realloc() which has its own notion of how big the memory block is, but if the SV is SvOOK() sv_backoff() tries to move SvCUR()+1 bytes, which might be larger than the currently allocated size of the PV.
* doop.c: Remove unnecessary cautiousnessKarl Williamson2020-07-171-3/+0
| | | | | | The code this commit removes was used to make sure there was enough space allocated. It actually isn't necessary to be so cautious. The computed value, rounded up, is sufficient.
* handy.h: Create nBIT_MASK(n) macroKarl Williamson2020-07-171-2/+2
| | | | | This encapsulates a common paradigm, making sure that it is done correctly for the platform's size.
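The kind of pitfall such a helper guards against can be sketched as below. This is not the definition in handy.h; `low_bits_mask` is a hypothetical function written for illustration. The usual hand-rolled form `(1 << n) - 1` shifts a plain `int`, which goes wrong once `n` reaches the width of `int`.

```c
#include <stdint.h>

/* Hypothetical stand-in for a "mask of the low n bits" helper.
 * The shift is done on a 64-bit operand, and n >= 64 is handled
 * separately because shifting by the full width is undefined in C. */
uint64_t low_bits_mask(unsigned n)
{
    return n >= 64 ? UINT64_MAX : (((uint64_t)1 << n) - 1);
}
```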
* doop.c: Fix typo in comment; add commentKarl Williamson2020-04-021-2/+3
|
* PATCH: GH #17391 tr/// regressionKarl Williamson2019-12-291-4/+4
| | | | | | This was the result of my not thinking through how things should work. I added tests for the incorrect behavior. This commit modifies them, so that there is no need for the test in the ticket.
* Clean up -Dy debuggingKarl Williamson2019-11-181-5/+5
| | | | | Commit 5d7580af4b14229eafb27db9b7a34b8b918876b4 didn't have it quite right.
* Add -Dy debugging of tr///, y///Karl Williamson2019-11-171-2/+42
|
* doop.c, op.c: Silence some compiler warningsKarl Williamson2019-11-151-3/+3
|
* remove leak in tr/ascii/utf8/David Mitchell2019-11-121-0/+1
| | | | | | | | | | | | The recent change to use invlists left a bug in S_do_trans_invmap() whereby it allocated a new temp buf if it knew the resulting string would be too long, but failed to free the buffer at the end. Showed up as smokes under ASAN failing these tests: op/tr_latin1.t op/tr.t uni/tr_utf8.t
* Silence some compiler warningsKarl Williamson2019-11-071-1/+1
| | | | | These were introduced in the tr/// changes in the series merged in 240494d6992696a7a350217c131e1d5dc1444a0c
* Remove swashes from coreKarl Williamson2019-11-061-1/+1
| | | | Also references to the term.
* Reimplement tr/// without swashesKarl Williamson2019-11-061-287/+219
| This large commit removes the last use of swashes from core.
|
| It replaces swashes by inversion maps. This data structure is already in use for some Unicode properties, such as case changing.
|
| The inversion map data structure leads to straightforward implementation code, so I collapsed the two doop.c routines do_trans_complex_utf8() and do_trans_simple_utf8() into one. A few conditionals could be avoided in the loop if this function were split so that one version didn't have to test for, e.g., squashing, but I suspect these are in the noise in the loop, which has to deal with UTF-8 conversions.
|
| This should be faster than the previous implementation anyway. I measured the differences some releases back, and inversion maps were faster than the equivalent swash for up to 512 or 1024 different ranges. These numbers are unlikely to be exceeded in tr/// except possibly in machine-generated ones.
|
| Inversion maps are capable of handling both UTF-8 and non-UTF-8 cases, but I left in the existing non-UTF-8 implementation, which uses tables, because I suspect it is faster. This means that there is extra code, purely for runtime performance. An inversion map is always created from the input, and then if the table implementation is to be used, the table is easily derived from the map.
|
| Prior to this commit, the table implementation was used in certain edge cases involving code points above 255. Those cases are now handled by the inversion map implementation, because it would have taken extra code to detect them, and I didn't think it was worth it. That could be changed if I am wrong.
|
| Creating an inversion map for all inputs essentially normalizes them, and then the same logic is usable for all. This fixes some false negatives in the previous implementation. It also allows for detecting if the actual transliteration can be done in place. Previously, the code mostly punted on that detection for the UTF-8 case.
|
| This also allows for accurate counting of the lengths of the two sides, fixing some longstanding TODO warning tests.
|
| A new flag is created, OPpTRANS_CAN_FORCE_UTF8, when the tr/// has a below 256 character resolving to one that requires UTF-8. If this isn't set, the code knows that a non-UTF-8 input won't become UTF-8 in the process, and so can take short cuts. The bit representing this flag is the same as OPpTRANS_FROM_UTF, which is no longer used. That name is left in so that the dozen-ish modules in CPAN that refer to it can still compile. AFAICT none of them actually use the flag, as well they shouldn't since it is private to the core.
|
| Inversion maps are ideally suited for tr/// implementations. An issue with them in general is that for some pathological data, they can become fragmented requiring more space than you would expect, to represent the underlying data. However, the typical tr/// would not have this issue, requiring only very short inversion maps to represent; in some cases shorter than the table implementation.
|
| Inversion maps are also easier to deparse than swashes. A deparse TODO was also fixed by this commit, and the code to deparse UTF-8 inputs is simplified.
|
| One could implement specialized data structures for specific types of inputs. For example, a common tr/// form is a single range, like tr/A-Z/a-z/. That could be implemented without a table and be quite fast. An intermediate step would be to use the inversion map implementation always when the transliteration is a single range, and then special case length=1 maps at execution time.
|
| Thanks to Nicholas Rochemagne for his help on B
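For readers unfamiliar with the data structure, here is a simplified sketch of an inversion-map lookup. It is not Perl's implementation: the struct layout, the names `inv_map` and `inv_map_lookup`, and the `IMAP_IDENTITY` marker are all assumptions made for this illustration. The idea is a sorted array of range starts plus a parallel array giving what the first code point of each range maps to; a binary search finds the range, and the offset within the range is added to the range's target.

```c
#include <stddef.h>

#define IMAP_IDENTITY (-1L)   /* hypothetical marker: range maps to itself */

typedef struct {
    size_t len;                   /* number of ranges */
    const unsigned long *starts;  /* ascending; starts[0] must be 0 */
    const long *maps;             /* target of starts[i], or IMAP_IDENTITY */
} inv_map;

/* Map one code point through the inversion map. */
unsigned long inv_map_lookup(const inv_map *m, unsigned long cp)
{
    size_t lo = 0, hi = m->len;

    /* Binary search for the last range whose start is <= cp.
     * Invariant: starts[lo] <= cp, and cp < starts[hi] when hi < len. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->starts[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }

    if (m->maps[lo] == IMAP_IDENTITY)
        return cp;
    /* Offset within the range carries over to the target range. */
    return (unsigned long)m->maps[lo] + (cp - m->starts[lo]);
}
```

Under these assumptions, tr/A-Z/a-z/ needs only three ranges: everything below 'A' (identity), 'A' through 'Z' (mapped starting at 'a'), and everything above 'Z' (identity). That is far shorter than a 256-slot table, which matches the commit's remark that typical maps are very short.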
* doop.c: Refactor do_trans_complex()Karl Williamson2019-11-061-10/+27
| | | | | | | I had trouble understanding how this uncommented routine worked. And it turned out to be broken, squeezing the pre-transliterated characters instead of the post-transliterated ones. This fixes the TODO test added in the previous commit.
* doop.c: Change name of variableKarl Williamson2019-11-061-9/+9
| | | | This helped me understand what was going on in this function
* doop.c: Change out-of-bounds valueKarl Williamson2019-11-061-2/+2
| | | | | | This currently uses 0xfeedface as a marker for something that isn't a legal value. But that could in fact become legal at some point. This defines a value TR_OOB that can be guaranteed not to become legal.
* doop.c: Add, revise commentsKarl Williamson2019-11-061-16/+31
|
* op.c, doop.c Use mnemonics instead of numeric valuesKarl Williamson2019-11-061-7/+7
| | | | For legibility and maintainability
* doop.c: Add a parameter to a few fcnsKarl Williamson2019-11-061-18/+6
| | | | | instead of deriving it each time from inside the function. This is in preparation for future commits.
* doop.c, op.c: White-space onlyKarl Williamson2019-11-061-3/+3
| | | | Remove trailing blanks and outdent a doubly indented block
* remove CONSERVATIVE and LIBERALTomasz Konojacki2019-10-301-4/+1
| | | | | | | | | | | | These constants were undocumented and don't do anything useful. Saving a few kilobytes of memory doesn't justify the complexity caused by adding a new build flag. All platforms except 64-bit Windows were using LIBERAL. It's not clear why win64 was using -DCONSERVATIVE, but removing it doesn't break anything. [gh #17232]
* use PTR2nat() instead of casting pointers to unsigned longTomasz Konojacki2019-10-301-3/+3
| | | | | | | Casting a pointer to unsigned long will result in truncation when sizeof(void*) > sizeof(unsigned long) [gh #17232]
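The truncation risk is real on LLP64 platforms such as 64-bit Windows, where `unsigned long` is 32 bits wide but pointers are 64. The sketch below uses the C99 type `uintptr_t` as a stand-in for the integer type that PTR2nat() converts to; that substitution, and the function name `misalignment`, are assumptions of this example rather than Perl's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Return how many bytes `p` lies past the previous `align`-byte
 * boundary. `align` must be a power of two. The pointer is converted
 * to uintptr_t, which can hold any object pointer, instead of to
 * unsigned long, which may be narrower than a pointer. */
size_t misalignment(const void *p, size_t align)
{
    return (size_t)((uintptr_t)p & (uintptr_t)(align - 1));
}
```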
* Fix do_vecget and do_vecset to process GET magic only oncePali2019-09-021-3/+3
|
* Use of code points over 0xFF in string bitwise operatorsJames E Keenan2019-05-311-23/+4
| | | | | | | | | | | | | | Implement complete fatalization. Some instances of these were fatalized in 5.28. However, in cases where the wide characters did not affect the end result, no deprecation notice was raised. So they remained legal, though deprecated. Now, all occurrences are fatal (as of 5.32). Modify source code in doop.c. Adapt test file. Update perldiag and perldeprecation. For: RT 134140 (Committer changed a verb to past tense in the pod)
* Use of strings with code points over 0xFF as arguments to "vec"James E Keenan2019-05-301-4/+1
| | | | | | | | | | | Implement scheduled fatalization. Adapt existing tests in t/op/vec.t. Eliminate t/lib/warnings/doop and move one test to t/op/vec.t. Document this fatalization in perldiag and perlfunc. Documentation improvement recommended by Karl Williamson. For: RT # 134139
* Remove remaining assignments to SvCUR and SvLEN in coreDagfinn Ilmari Mannsåker2019-05-281-2/+2
| | | | Also make the macros non-lvalues under PERL_CORE
* rmv/de-dup static const char array "strings"Daniel Dragan2018-03-071-3/+3
| MSVC due to a bug doesn't merge identicals between .o'es or discard these vars and their contents.
|
| MEM_WRAP_CHECK_2 has never been used outside of core according to cpan grep. MEM_WRAP_CHECK_2 was removed on the "have PERL_MALLOC_WRAP" branch in commit fabdb6c0879 "pre-likely cleanup" without explanation, probably because it was unused. But MEM_WRAP_CHECK_2 was still left on the "no PERL_MALLOC_WRAP" branch, so remove it from the "no" side for tidiness since it was a mistake to leave it there if it was removed from the "yes" side of the #ifdef.
|
| Add MEM_WRAP_CHECK_s API, letter "s" means argument is string or static. This lets us get rid of the "%s" argument passed to Perl_croak_nocontext at a couple call sites since we fully control the next and only argument and it's guaranteed to be a string literal. This allows merging of 2 "Out of memory during array extend" C strings by linker now.
|
| Also change the 2 op.h messages into macros which become string literals at their call sites instead of "read char * from a global char **" which was going on before.
|
| VC 2003 32b perl527.dll section size
|
| before
|   .text name
|     DE503 virtual size
|   .rdata name
|     4B621 virtual size
| after
|   .text name
|     DE503 virtual size
|   .rdata name
|     4B5D1 virtual size
* S_do_trans_complex(): outdent a block of codeDavid Mitchell2018-02-201-33/+33
| | | | whitespace-only change left over from my recent tr///c fix work
* PATCH: [perl #132750] Silence uninit warningKarl Williamson2018-01-211-1/+1
| | | | | I inspected the code, and there is no problem here; it's a compiler mistake. Nevertheless, simply initializing the variable silences it.
* doop.c: White-space onlyKarl Williamson2018-01-191-10/+10
| | | | Indent to correspond with the new block placed by the previous commit.
* Deprecate above \xFF in bitwise string opsKarl Williamson2018-01-191-0/+5
| | | | | | | | | | | | This is already a fatal error for operations whose outcome depends on them, but in things like "abc" & "def\x{100}" the wide character doesn't actually need to participate in the AND, and so perl doesn't. As a result of the discussion in the thread beginning with http://nntp.perl.org/group/perl.perl5.porters/244884, it was decided to deprecate these ones too.
* doop.c: Use MIN()Karl Williamson2018-01-191-1/+1
| | | | This is slightly cleaner than hand rolling the min.
* tr///: eliminate I32 from the do_trans*() fnsDavid Mitchell2018-01-191-15/+15
| | | | Replace each with a more appropriate type
* tr///: return Size_t count rather than I32David Mitchell2018-01-191-13/+13
| | | | | | Change the signature of all the internal do_trans*() functions to return Size_t rather than I32, so that the count returned by tr/// can cope with strings longer than 2GB.
* tr///; simplify $utf8 =~ tr/nonutf8/nonutf8/David Mitchell2018-01-191-94/+20
| The run-time code to handle a non-utf8 tr/// against a utf8 string is complex, with many variants of similar code repeated depending on the presence of the /s and /c flags. Simplify them all into a single code block by changing how the translation table is stored.
|
| Formerly, the tr struct contained possibly two tables: the basic 0-255 slot one, plus in the presence of /c, a second one to map the implicit search range (\x{100}...) against any residual replacement chars not consumed by the first table.
|
| This commit merges the two tables into a single unified whole. For example
|
|     tr/\x00-\xfe/abcd/c
|
| is equivalent to
|
|     tr/\xff-\x{7fffffff}/abcd/
|
| which generates a 259-entry translation table consisting of:
|
|     0x00  => -1
|     0x01  => -1
|     ...
|     0xfe  => -1
|     0xff  => a
|     0x100 => b
|     0x101 => c
|     0x102 => d
|
| In addition we store:
|
| 1) the size of the translation table (0x103 in the example above);
|
| 2) an extra 'wildcard' entry stored 1 slot beyond the main table, which specifies the action for any codepoints outside the range of the table (i.e. chars 0x103..0x7fffffff). This can be either:
|
|     a) a character, when the last replacement char is repeated;
|     b) -1 when /c isn't in effect;
|     c) -2 when /d is in effect;
|     d) -3 identity: when the replacement list is empty but not /d.
|
|    In the example above, this would be
|
|        0x103 => d
|
| The addition of -3 as a valid slot value is new.
|
| This makes the main runtime code for the utf8 string with non-utf8 tr// case look like, at its core:
|
|     size = tbl->size;
|     mapped_ch = tbl->map[ch >= size ? size : ch];
|
| which then processes mapped_ch based on whether it's >= 0, or -1/-2/-3.
|
| This is a lot simpler than the old scheme, and should generally be faster too.
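The two-line core quoted in the commit message can be expanded into a self-contained sketch. Everything beyond those two lines is an assumption made for illustration: the struct `tr_table`, the function `tr_lookup`, and the enum names are hypothetical, and the sketch treats -1 and -3 alike as "leave the character unchanged", which ignores whatever bookkeeping the real code does for each.

```c
#include <stddef.h>

/* Negative slot values are actions rather than replacement characters.
 * The numeric values follow the commit message; the names are made up. */
enum { TR_UNMAPPED = -1, TR_DELETE = -2, TR_IDENTITY = -3 };

typedef struct {
    size_t size;       /* number of ordinary slots */
    const long *map;   /* size + 1 entries; map[size] is the wildcard */
} tr_table;

/* Translate one code point. Returns 1 and sets *out when a character
 * should be emitted, or 0 when the character is deleted. */
int tr_lookup(const tr_table *tbl, unsigned long ch, unsigned long *out)
{
    size_t size = tbl->size;
    /* Anything at or beyond the table's end uses the wildcard slot. */
    long mapped_ch = tbl->map[ch >= size ? size : ch];

    if (mapped_ch >= 0) {
        *out = (unsigned long)mapped_ch;
        return 1;
    }
    if (mapped_ch == TR_DELETE)
        return 0;
    *out = ch;          /* TR_UNMAPPED or TR_IDENTITY in this sketch */
    return 1;
}
```

The single clamped index is what removes the second table: out-of-range code points need no separate code path, only one extra slot.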
* tr///c: handle len(replacement charlist) > 32767David Mitchell2018-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | RT #132608 In the non-utf8 case, the /c (complement) flag to tr adds an implied \x{100}-\x{7fffffff} range to the search charlist. If the replacement list contains more chars than are paired with the 0-255 part of the search list, then the excess chars are stored in an extended part of the table. The excess char count was being stored as a short, which caused problems if the replacement list contained more than 32767 excess chars: either substituting the wrong char, or substituting for a char located up to 0xffff bytes in memory before the real translation table. So change it to SSize_t. Note that this is only a problem when the search and replacement charlists are non-utf8, the replacement list contains around 0x8000+ entries, and where the string being translated is utf8 with at least one codepoint >= U+8000.
* add two structs for OP_TRANSDavid Mitchell2018-01-191-23/+28
| | | | | | | | | | | | | | | | Originally, the op_pv of an OP_TRANS op pointed to a 256-slot array of shorts, which contained the translations. However, in the presence of tr///c, extra information needs to be stored to handle utf8 strings. The 256 slot array was extended, with slot 0x100 holding a length, and slots 0x101 onwards holding some extra chars. This has made things a bit messy, so this commit adds two structs, one being an array of 256 shorts, and the other being the same but with some extra fields. So for example tbl->[0x100] has been replaced with tbl->excess_len. This commit should make no functional difference, but will allow us shortly to fix a bug by changing the type of the excess_len field from short to something bigger, for example.
* S_do_trans_complex(): re-indentDavid Mitchell2018-01-191-6/+6
| | | | outdent a code block following previous commit.
* fix "\x{100}..." =~ tr/.../.../cdDavid Mitchell2018-01-191-25/+39
| In transliterations where the search and replacement charlists are non-utf8, but where the string being modified contains codepoints >= 0x100, then tr/.../.../cd would always delete all such codepoints, rather than potentially mapping some of them.
|
| In more detail: in the presence of /c (complement), an implicit 0x100..0x7fffffff is added to a non-utf8 search charlist. If the replacement list is longer than the < 0x100 part of the search list, then the last few replacement chars should in principle be paired off against the first few of (\x{100}, \x{101}, ...). However, this wasn't happening. For example,
|
|     tr/\x00-\xfd/ABCD/cd
|
| should be equivalent to
|
|     tr/\xfe-\x{7fffffff}/ABCD/d
|
| which should map:
|
|     \xfe => A, \xff => B, \x{100} => C, \x{101} => D,
|
| and delete \x{102} onwards. But instead, it behaved like
|
|     tr/\xfe-\x{7fffffff}/AB/d
|
| and deleted all codepoints >= 0x100.
|
| This commit fixes that by using the extended mapping table format for all /c variants (formerly it excluded /cd).
|
| I also changed a variable holding the mapped char from being I32 to UV: principally to avoid a casting mess in the fixed code. This may (or may not), as a side-effect, have fixed possible issues with very large codepoints.
* OP_TRANS: change extended table formatDavid Mitchell2018-01-191-9/+15
| For non-utf8, OP_TRANS(R) ops have a translation table consisting of an array of 256 shorts attached. For tr///c, this table is extended to hold information about chars in the replacement list which aren't paired with chars in the search list. For example,
|
|     tr/\x00-AE-\xff/bcdefg/c
|
| is equivalent to
|
|     tr/BCD\x{100}-\x{7fffffff}/bcdefg/
|
| which is equivalent to
|
|     tr/BCD\x{100}-\x{7fffffff}/bcdefggggggggg..../
|
| Only the BCD => bcd mappings can be stored in the basic 256-slot table, so potentially the following extra information needs recording in an extended table to handle codepoints > 0xff in the string being modified:
|
|     1) the extra replacement chars ("efg");
|     2) the number of extra replacement chars (3);
|     3) the "repeat" char ('g').
|
| Currently 2) and 3) are combined: the repeat char is found as the last extra char, and if there are no extra chars, the repeat char is treated as an extra char list of length 1.
|
| Similarly, an 'extra chars' length value of 1 can imply either one extra char, or no extra chars with the repeat char being faked as an extra char. An 'extra chars' length of 0 implies an empty replacement list, i.e. tr/....//c.
|
| This commit changes it so that the repeat char is *always* stored (in slot 0x101), with the extra chars stored beginning at slot 0x102. The 'extra chars' length value (located at slot 0x100) has changed its meaning slightly: now
|
|     -1 implies tr/....//c
|      0 implies no more replacement chars than search chars
|     1+ the number of excess replacement chars.
|
| This (should) make no functional difference, but the extra information stored will make it easier to fix some bugs shortly.
* remove fossil debugging statement from do_trans()David Mitchell2018-01-191-2/+0
| | | | | | | | | This: DEBUG_t( Perl_deb(aTHX_ "2.TBL\n")); has been around in one form or another since perl1, but it makes no sense since perl 5.000, where -Dt now shows the name of the op being executed.
* tr/// functions: add some basic code commentsDavid Mitchell2018-01-191-0/+63
| | | | | | | | | | | | | For the various C functions which implement the compile-time and run-time aspects of OP_TRANS, add some basic code comments at the top of each function explaining what its purpose is. Also add lots of code comments to the body of S_pmtrans() (which compiles a tr///). Also comment what the OPpTRANS_ private flag bits mean. No functional changes.
* doop.c: Change to use is_utf8_invariant_string()Karl Williamson2017-11-231-27/+9
| | | | | | | This commit changes 3 occurrences of byte-at-a-time looking to see if a string is invariant under UTF-8, to using the inlined is_utf8_invariant_string() which now does much faster word-at-a-time looking.
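The difference between the two approaches can be sketched as follows. This is not the code of is_utf8_invariant_string(); the function name `is_ascii_only` is hypothetical, and the sketch assumes an ASCII platform, where a byte is invariant under UTF-8 exactly when its high bit is clear. Instead of testing each byte, it tests the high bits of eight bytes at once.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if every byte in s[0..len) is below 0x80, else 0. */
int is_ascii_only(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Word-at-a-time pass: one test covers eight bytes. */
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);  /* safe for unaligned input */
        if (word & high_bits)
            return 0;
    }

    /* Byte-at-a-time pass for the remaining tail. */
    for (; i < len; i++) {
        if (s[i] & 0x80)
            return 0;
    }
    return 1;
}
```

Because the mask has the same bit set in every byte, the result does not depend on the machine's byte order.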