path: root/proto.h
Commit message | Author | Age | Files | Lines
* regcomp.c: Remove unused parameter in static functionKarl Williamson2013-09-241-4/+3
    This parameter is no longer used, since a few commits ago in this series.
* Teach regex optimizer to handle above-Latin1Karl Williamson2013-09-241-12/+12
    Until this commit, the regular expression optimizer has essentially punted on above-Latin1 code points. Under some circumstances, they would be taken into account, more or less, but often, the generated synthetic start class would end up matching all above-Latin1 code points.

    With the advent of inversion lists, it becomes feasible to actually fully handle such code points, as inversion lists are a convenient way to express arbitrary lists of code points and take their union, intersection, etc. This commit changes the optimizer to use inversion lists for operating on the code points the synthetic start class can match.

    I don't much understand the overall operation of the optimizer. I'm told that previous porters found that perturbing it caused unexpected behaviors. I had promised to get this change in 5.18, but didn't. I'm trying to get it in early enough into the 5.20 preliminary series that any problems will surface before 5.20 ships.

    This commit doesn't change the macro level logic, but does significantly change various micro level things. Thus the 'and' and 'or' subroutines have been rewritten to use inversion lists. I'm pretty confident that they do what their names suggest. I re-derived the equations for what these operations should do, getting the same results in some cases, but extending others where the previous code mostly punted. The derivations are given in comments in the respective routines.

    Some of the code is greatly simplified, as it no longer has to treat above-Latin1 specially.

    It is now feasible for /i matching of above-Latin1 code points to know explicitly the folds that should be in the synthetic start class. But more preparatory work needs to be done before putting that into place. ...
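To make the inversion-list idea in the message above concrete, here is a minimal sketch in C. It is not the code in regcomp.c: the `demo_invlist` type and both function names are hypothetical, and the representation assumed is the usual one, a sorted array in which even-indexed entries start a range that is in the set and odd-indexed entries start a range that is not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type, not Perl's internal representation. */
typedef struct {
    const uint32_t *v;   /* sorted range boundaries */
    size_t len;          /* number of boundaries */
} demo_invlist;

/* Returns 1 if cp is in the set, 0 otherwise. */
static int demo_invlist_contains(const demo_invlist *l, uint32_t cp)
{
    size_t lo = 0, hi = l->len;

    /* Binary search for the number of boundaries <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (l->v[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)(lo & 1);   /* an odd count means cp is inside a range */
}

/* Writes the union of a and b to out and returns its length.
 * out must have room for a->len + b->len entries. */
static size_t demo_invlist_union(const demo_invlist *a, const demo_invlist *b,
                                 uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    int depth = 0;   /* how many inputs contain the current position */

    assert(out != NULL);
    while (i < a->len || j < b->len) {
        int from_a, entering;
        uint32_t cp;

        if (j >= b->len)
            from_a = 1;
        else if (i >= a->len)
            from_a = 0;
        else if (a->v[i] != b->v[j])
            from_a = a->v[i] < b->v[j];
        else
            from_a = (i % 2 == 0);   /* on ties, take a range start first */

        if (from_a) { cp = a->v[i]; entering = (i % 2 == 0); i++; }
        else        { cp = b->v[j]; entering = (j % 2 == 0); j++; }

        if (entering) {
            if (depth++ == 0) out[n++] = cp;   /* union range begins */
        } else {
            if (--depth == 0) out[n++] = cp;   /* union range ends */
        }
    }
    return n;
}
```

Membership is a binary search and union is a single merge pass, which is why such lists are convenient for sets that may span all of Unicode. Processing range starts before range ends on ties is what lets overlapping or adjacent ranges collapse into one.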
* regcomp.c: Add some static functionsKarl Williamson2013-09-241-0/+51
    This commit adds some functions that are currently unused, but will be used in a future commit. This commit is essentially to make the differences smaller in that commit, as 'diff' is getting confused and not outputting the logical differences. The functions are added in a block at the beginning of the file to avoid the 'diff' issues. A later white-space only commit will move them to more appropriate positions.
* regcomp.c: Use STR_WITH_LEN to avoid bookkeepingKarl Williamson2013-09-241-2/+2
    By changing the order of the parameters to the static function S_add_data, we can call it with STR_WITH_LEN and avoid a human having to count characters.
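The parameter-order point above can be sketched as follows. The macro has the same shape as STR_WITH_LEN in Perl's handy.h; `demo_add_data` is a hypothetical stand-in, not the real S_add_data, and only illustrates why the string and its length must be adjacent parameters in that order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands a string literal into two arguments: the literal and its length. */
#define STR_WITH_LEN(s) ("" s ""), (sizeof(s) - 1)

/* Hypothetical stand-in: takes (string, length) in that order, so the
 * macro above can supply both arguments at once. */
static size_t demo_add_data(const char *s, size_t len)
{
    assert(strlen(s) == len);   /* the macro keeps these in sync */
    return len;
}
```

A call such as `demo_add_data(STR_WITH_LEN("nn"))` passes `"nn"` and `2` without anyone counting characters by hand. With the length first and the string second, the macro could not be used.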
* regcomp.c: Change some static parameters to constKarl Williamson2013-09-241-1/+1
    I found I needed const in a future commit.
* regcomp.c: Add parameter to static functionKarl Williamson2013-09-241-4/+5
    This parameter will be used in future commits. This commit is really only to make the difference listing smaller in those, by committing separately just the book-keeping parts. This parameter also requires passing the aTHX_ thread parameter.
* Make typedef fully typedefKarl Williamson2013-09-241-34/+33
    The regcomp.c struct RExC_state_t has not been usable fully as a typedef, requiring the 'struct' at times. This has caused me, and I presume others, wasted time when we forget to use it under those circumstances when it should be used, but it's never been a big enough issue to cause me to spend tuits on it.

    But, working on something else, I finally came to the realization of what the problem is. It is because proto.h is #included before regcomp.h is, and so functions that are declared in proto.h that have something that is a RExC_state_t as a parameter don't know that it is a typedef because that is defined in regcomp.h.

    A way around this is already used for other similar structures, and that is to declare them in perl.h which is always read in before proto.h, leaving the definitions to regcomp.h. Thus proto.h knows enough to compile. The structure was already declared in perl.h; just not typedef'd. Otherwise proto.h would not know about it at all.

    This patch moves two regcomp.c related declarations in perl.h to the same section as the others, and changes the one for RExC_state_t to be a typedef. All the 'struct' uses are removed.
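The include-order problem described above can be reproduced in a single file. This is a sketch only: the three commented sections play the roles of perl.h, proto.h, and regcomp.h, and `demo_depth` and the `depth` member are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Role of perl.h: declare the typedef name early, without a definition. */
typedef struct RExC_state_t RExC_state_t;

/* Role of proto.h: prototypes can now use the bare typedef name,
 * because a pointer to an incomplete type is allowed here. */
static int demo_depth(const RExC_state_t *state);

/* Role of regcomp.h: the full definition arrives later. */
struct RExC_state_t {
    int depth;   /* hypothetical member */
};

static int demo_depth(const RExC_state_t *state)
{
    assert(state != NULL);
    return state->depth;
}
```

Without the early typedef, the prototype would have to be spelled with `struct RExC_state_t`, which is the inconvenience the commit message describes.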
* regcomp.c: Change names of some static functionsKarl Williamson2013-09-241-31/+31
    The term 'class' is very overloaded in regex code and documentation. perlrecharclass.pod calls the dot (matching any char) a class, and calls the [] form "bracketed character classes". There are other meanings as well. This is the first commit in a short series that removes some of those overloadings.

    One instance of class is the "synthetic start class", generated by the regex optimizer to be a list of all the code points a successful match could possibly start with. This is useful in more quickly finding where to start looking in matching against a target string. Prior to this commit, the routines that referred to this began with 'cl_', and the formal parameters were 'cl', which could mean any class. This commit changes those instances of 'cl' to 'ssc' to indicate this is the only type of class that is being handled.
* regcomp.c: Rework static function call; commentsKarl Williamson2013-09-241-4/+3
    The previous commit just extracted out code into a function. This commit renames a parameter for clarity, combines two parameters to make the interface cleaner, and adds and moves comments around.
* regcomp.c: Extract code into separate functionKarl Williamson2013-09-241-0/+7
    A future commit will use this functionality from a second place. For now, just cut and paste, and do the minimal ancillary work to get it to compile and pass.
* Add regnode struct for synthetic start classKarl Williamson2013-09-241-6/+7
    As part of extending the regular expression optimizer to properly handle above Latin1 code points, I need an inversion list to contain which code points the synthetic start class (ssc) matches. The ssc currently is the same as a locale-aware ANYOF node, which uses the struct of a regular ANYOF node, plus some extra fields at the end. This commit creates a new typedef for ssc use, which is the locale-aware ANYOF node, plus an extra SV* at the end to hold the inversion list.
* regcomp.c: Extract code into separate functionKarl Williamson2013-09-241-0/+6
    This is in preparation for it to be called from more than one place, in a future commit.
* Removed DUMP_FDS and dump_fds()Brian Fraser2013-09-211-7/+0
    If perl was compiled with -DDUMP_FDS, it would define dump_fds and add it to the API, although even then nothing used it. dump_fds() itself was buggy, only checking for fds 0 through 32.
* toke.c, S_scan_ident(): Don't take a "end of buffer" argument, use PL_bufendBrian Fraser2013-09-181-4/+3
    All but one of scan_ident()'s callers already passed PL_bufend as the removed argument; the one deviant was intuit_more(), which was setting the "end of buffer" argument to the next close-bracket. This commit modifies intuit_more() to temporarily set PL_bufend and then restore it.

    This was done as groundwork for the following commit, which will add more uses of PEEKSPACE() to scan_ident() in order to fix some whitespace and line number bugs, and PEEKSPACE() modifies PL_bufend directly if it encounters a newline at the end of the buffer. That last bit is why changing intuit_more() to modify-and-restore PL_bufend is safe, since the end of the buffer will always be a ']'.
* [perl #115928] a consistent (public) rand() implementationTony Cook2013-09-131-0/+10
    Based on Yves's random branch work. This version makes the new random number visible to external modules, for example, List::Util's XS shuffle() implementation.

    I've also added a 64-bit implementation when HAS_QUAD is true, this should be significantly faster, even on 32-bit CPUs. This is intended to produce exactly the same sequence as the original implementation.

    The original version of this commit retained the "freebsd" name from Yves's original work for the function and data structure names. I've removed "freebsd" from most function names so the name isn't an issue if we choose to replace the implementation.
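As a rough picture of what a 64-bit implementation of such a generator can look like, here is a sketch of the classic 48-bit linear congruential step used by the drand48 family. The constants are the standard drand48 ones; that the code in this commit uses exactly this recurrence and seeding is an assumption, and every `demo_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MULT UINT64_C(0x5DEECE66D)      /* drand48 multiplier */
#define DEMO_ADD  UINT64_C(0xB)              /* drand48 increment */
#define DEMO_MASK UINT64_C(0xFFFFFFFFFFFF)   /* keep the low 48 bits */

typedef struct {
    uint64_t x;   /* 48-bit state held in one 64-bit integer */
} demo_rand_state;

/* Seeds the state the way srand48 does: seed in the high 32 bits of
 * the 48-bit state, 0x330E in the low 16. */
static void demo_rand_seed(demo_rand_state *st, uint32_t seed)
{
    assert(st != NULL);
    st->x = ((uint64_t)seed << 16) | UINT64_C(0x330E);
}

/* Advances the state and returns a value in [0, 1). */
static double demo_rand_next(demo_rand_state *st)
{
    assert(st != NULL);
    /* Unsigned 64-bit arithmetic wraps modulo 2**64, so masking to
     * 48 bits yields the result modulo 2**48. */
    st->x = (st->x * DEMO_MULT + DEMO_ADD) & DEMO_MASK;
    return (double)st->x / 281474976710656.0;   /* divide by 2**48 */
}
```

Holding the whole state in one 64-bit integer replaces the multi-word arithmetic a 16-bit-limb version needs, which is the plausible source of the speedup the message mentions.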
* gv.c: Split part of find_default_stash into gv_is_in_main.Brian Fraser2013-09-111-0/+5
    gv_is_in_main() checks if an unqualified identifier is in the main:: stash.
* gv.c: Rename magicalize_gv into gv_magicalize, make it more specific.Brian Fraser2013-09-111-7/+7
    Namely, gv_magicalize no longer stores the GV into the stash, which is gv_fetchpvn_flags' job.
* gv.c, gv_fetchpvn_flags: Split another chunk of magic-checking code.Brian Fraser2013-09-111-0/+6
    This bit is called when a GV already exists, but its name is length-one and it's on the main:: stash, so it might have multiple kinds of magic, like $! and %!, or @+ and %+.
* gv.c: Move the code that magicalizes new globs into magicalize_gv().Brian Fraser2013-09-111-0/+7
* gv.c: Begin splitting gv_fetchpvn_flags into smaller helper functions.Brian Fraser2013-09-111-0/+15
    This commit takes a chunk of code out of gv_fetchpvn_flags and turns it into two functions: parse_gv_stash_name and find_default_stash.
* regcomp.c: Make all warnings and error messages UTF-8 cleanBrian Fraser2013-09-101-3/+3
* [perl #117265] correctly handle overloaded stringsTony Cook2013-09-091-3/+3
* Revert "Let av_push accept NULL values"Father Chrysostomos2013-09-071-2/+3
    This reverts commit 7b6e8075e45ebc684565efbe3ce7b70435f20c79.

    It turns out to be problematic, because it causes NULLs on the stack, which XSUBs may trip on.

    My main reason for it was actually to try to resolve some CPAN failures, but it turns out that other fixes have removed the need for that.
* Fix PerlIO_get_cnt and friendsLeon Timmermans2013-09-071-4/+4
    These functions worked with ints instead of SSize_t.
* Let av_push accept NULL valuesFather Chrysostomos2013-09-061-3/+2
    Now that NULL is used for a nonexistent element, it is easy for XS code to pass it to av_push(). av_store already accepts NULL, and av_push already works with it on non-debugging builds, so there is really no need for this restriction.
* Put AV defelem creation code in one placeFather Chrysostomos2013-09-061-0/+7
* [perl #115768] improve (caller)[2] line numbersFather Chrysostomos2013-09-011-5/+5
    warn and die have special code (closest_cop) to find a nulled nextstate op closest to the warn or die op, to get the line number from it. This commit extends that capability to caller, so that

        if (1) {
            foo();
        }
        sub foo { warn +(caller)[2] }

    shows the right line number.
* utf8.c: Remove wrapper functions.Karl Williamson2013-08-291-34/+9
    Now that the Unicode data is stored in native character set order, it is rare to need to work with the Unicode order. Traditionally, the real work was done in functions that worked with the Unicode order, and wrapper functions (or macros) were used to translate to/from native. There are two groups of functions: one that translates from code point to UTF-8, and the other group goes the opposite direction.

    This commit changes the base function that translates from UTF-8 to code point to output native instead of Unicode. Those extremely rare instances where Unicode output is needed instead will have to hand-wrap calls to this function with a translation macro, as now described in the API pod. Prior to this, it was the other way, the native was wrapped, and the rare, strict Unicode wasn't. This eliminates a layer of function call overhead for a common case.

    The base function that translates from code point to UTF-8 retains its Unicode input, as that is more natural to process. However, it is de-emphasized in the pod, with the functionality description moved to the pod for a native input wrapper function. And, those wrappers are now macros in all cases; previously there was function call overhead sometimes. (Equivalent exported functions are retained, however, for XS code that uses the Perl_foo() form.)

    I had hoped to rebase this commit, squashing it with an earlier commit in this series, eliminating the use of a temporary function name change, but the work involved turns out to be large, with no real payoff.
* utf8.c: Stop using two functionsKarl Williamson2013-08-291-0/+10
    This is in preparation for deprecating these functions, to force any code that has been using these functions to change. Since the Unicode tables are now stored in native order, these functions should only rarely be needed.

    However, the functionality of these is needed, and in actuality, on ASCII platforms, the native functions are #defined to these. So what this commit does is rename the functions to something else, and create wrappers with the old names, so that anyone using them will get the deprecation when it actually goes into effect: we are waiting for CPAN files distributed with the core to change before doing the deprecation. According to grep.cpan.me, this should affect fewer than 10 additional CPAN distributions.
* Convert uvuni_to_utf8() to functionKarl Williamson2013-08-291-2/+2
    Code should almost never be dealing with non-native code points. This is in preparation for later deprecation when our CPAN modules have been converted away from using it.
* Deprecate utf8_to_uni_buf()Karl Williamson2013-08-291-0/+1
    Now that the tables are stored in native order, there is almost no need for code to be dealing in Unicode order. According to grep.cpan.me, there are no uses of this function in CPAN.
* Deprecate valid_utf8_to_uvuni()Karl Williamson2013-08-291-0/+1
    Now that all the tables are stored in native format, there is very little reason to use this function; and those who do need this kind of functionality should be using the bottom level routine, so as to make it clear they are doing nonstandard stuff. According to grep.cpan.me, there are no uses of this function in CPAN.
* utf8.c: Swap which fcn wraps the otherKarl Williamson2013-08-291-10/+5
    This is in preparation for the current wrapee becoming deprecated.
* Deprecate NATIVE_TO_NEED and ASCII_TO_NEEDKarl Williamson2013-08-291-0/+10
    These macros are no longer called in the Perl core. This commit turns them into functions so that they can use gcc's deprecation facility.

    I believe these were defective right from the beginning, and I have struggled to understand what's going on. From the name, it appears NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the appropriate parameter indicates that. But that is impossible to do correctly from that API, as for variant characters, it needs to return two bytes. It could only work correctly if ch is an I8 byte, which isn't native, and hence the name would be wrong. Similar arguments for ASCII_TO_NEED.

    The function S_append_utf8_from_native_byte(const U8 byte, U8** dest) does what I think NATIVE_TO_NEED intended.
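The "needs to return two bytes" argument above is easy to see in code. The sketch below mirrors the signature quoted in the message but is not the core's implementation: it assumes an ASCII platform where the native byte is a Latin-1 code point (EBCDIC platforms need a translation step that is omitted here), and the `demo_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char U8;   /* local stand-in for Perl's U8 */

/* Appends the UTF-8 encoding of one native (Latin-1) byte to *dest and
 * advances *dest past what was written: one byte for invariants,
 * two bytes for variants. */
static void demo_append_utf8_from_native_byte(const U8 byte, U8 **dest)
{
    assert(dest != NULL && *dest != NULL);
    if (byte < 0x80) {
        *(*dest)++ = byte;                        /* same in UTF-8 */
    } else {
        *(*dest)++ = (U8)(0xC0 | (byte >> 6));    /* lead byte */
        *(*dest)++ = (U8)(0x80 | (byte & 0x3F));  /* continuation byte */
    }
}
```

Because the output length depends on the input byte, an interface that returns a single byte, as the old macros did, cannot express the variant case. Passing a destination pointer to advance avoids that limitation.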
* Extract common code to an inline functionKarl Williamson2013-08-291-0/+5
    This fairly short paradigm is repeated in several places; a later commit will improve it.
* Only predeclare S_sv_or_pv_pos_u2b for -DPERL_CORE or -DPERL_EXTNicholas Clark2013-08-281-6/+8
    Otherwise when compiling XS code, there is a declaration for a function which is never used, which can cause some compilers to issue a warning.
* [perl #117265] safesyscalls: check embedded nul in syscall argsTony Cook2013-08-261-0/+8
    Check for the nul char in pathnames and string arguments to syscalls, return undef and set errno to ENOENT. Added to the io warnings category syscalls.

    Strings with embedded \0 chars were previously ignored in the syscall but kept in perl. The hidden payloads in these invalid string args may cause unnoticed security problems, as they are hard to detect, ignored by the syscalls but kept around in perl PVs.

    Allow an ending \0 though, as several modules add a \0 to such strings without adjusting the length.

    This is based on a change originally by Reini Urban, but pretty much all of the code has been replaced.
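The rule described above (reject an embedded NUL, tolerate a single trailing one) can be sketched like this. The function name is hypothetical and the sketch only classifies a buffer; the real check also sets errno to ENOENT and emits a warning in the syscalls category, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first NUL among the len counted bytes of s is
 * anywhere other than the last counted byte, 0 otherwise. */
static int demo_has_embedded_nul(const char *s, size_t len)
{
    const char *nul;

    assert(s != NULL);
    nul = memchr(s, '\0', len);
    if (nul == NULL)
        return 0;                  /* no NUL inside the counted bytes */
    return nul != s + len - 1;     /* a lone trailing NUL is tolerated */
}
```

The length has to be passed explicitly because a Perl string carries its own length. A plain `strlen` would stop at the first NUL and silently drop the hidden payload.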
* Use SSize_t/STRLEN in more places in regexp codeFather Chrysostomos2013-08-251-7/+7
    As part of getting the regexp engine to handle long strings, this commit changes any variables, parameters and struct members that hold lengths of the string being matched against (or parts thereof) to use SSize_t or STRLEN instead of [IU]32.

    To avoid having to change any logic, I kept the signedness the same.

    I did not change anything that affects the length of the regular expression itself, so regexps are still practically limited to I32_MAX. Changing that would involve changing the size of regnodes, which would be a lot more involved.

    These changes should fix bugs, but are very hard to test. In most cases, I don’t know the regexp engine well enough to come up with test cases that test the paths in question with long strings. In other cases I don’t have a box with enough memory to test the fix.
* Stop substr re optimisation from rejecting long strsFather Chrysostomos2013-08-251-1/+1
    Using I32 for the fields that record information about the location of a fixed string that must be found for a regular expression to match can result in match failures, because I32 is not large enough to store offsets >= 2**31.

    SSize_t is appropriate, since it is 64 bits on 64-bit platforms and 32 bits on 32-bit platforms.

    This commit changes enough instances of I32 to SSize_t to get the added test passing and suppress compiler warnings. A later commit will change many more.
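A small sketch of the range problem described above. Here `int64_t` stands in for SSize_t on a 64-bit platform and `int32_t` for I32; both the mapping and the function name are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a string offset can be stored in a 32-bit signed field
 * without loss, 0 otherwise. */
static int demo_offset_fits_i32(int64_t offset)
{
    assert(offset >= 0);   /* offsets into a string are non-negative */
    return offset <= INT32_MAX;
}
```

An offset of 2**31 - 1 still fits, while 2**31 does not, so a fixed substring located at or beyond that point in a long string cannot be recorded correctly in an I32 field.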
* [perl #116907] Allow //g matching past 2**31 thresholdFather Chrysostomos2013-08-251-1/+1
    Change the internal fields for storing positions so that //g in scalar context can move past the 2**31 character threshold. Before this commit, the numbers would wrap, resulting in assertion failures.

    The changes in this commit are only enough to get the added test passing. Stay tuned for more.
* Stop pos() from being confused by changing utf8nessFather Chrysostomos2013-08-251-0/+6
    The value of pos() is stored as a byte offset. If it is stored on a tied variable or a reference (or glob), then the stringification could change, resulting in pos() now pointing to a different character offset or pointing to the middle of a character:

        $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, a; print pos $x'
        2
        $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, "\x{1000}"; print pos $x'
        Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.
        0

    So pos() should be stored as a character offset.

    The regular expression engine expects byte offsets always, so allow it to store bytes when possible (a pure non-magical string) but use characters otherwise.

    This does result in more complexity than I should like, but the alternative (always storing a character offset) would slow down regular expressions, which is a big no-no.
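The difference between a byte offset and a character offset in a UTF-8 string, and why a stale byte offset can land in the middle of a character, can be sketched as follows. The helper names are hypothetical and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if byte_off falls on a character boundary of the UTF-8
 * string s of length len. Continuation bytes have the form 10xxxxxx. */
static int demo_is_char_boundary(const unsigned char *s, size_t len,
                                 size_t byte_off)
{
    assert(byte_off <= len);
    return byte_off == len || (s[byte_off] & 0xC0) != 0x80;
}

/* Converts a byte offset to a character offset by counting the bytes
 * before it that are not continuation bytes. */
static size_t demo_bytes_to_chars(const unsigned char *s, size_t byte_off)
{
    size_t chars = 0;
    size_t i;

    for (i = 0; i < byte_off; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For a string that begins with U+0100 (bytes C4 80), byte offset 1 is not a character boundary, which corresponds to the kind of malformed-character failure shown above. The conversion is linear in the offset, which fits the message's concern that always storing character offsets would slow the regular expression engine down.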
* Use SSize_t for arraysFather Chrysostomos2013-08-251-13/+13
    Make the array interface 64-bit safe by using SSize_t instead of I32 for array indices.

    This is based on a patch by Chip Salzenberg.

    This completes what the previous commit began when it changed av_extend.
* Use SSize_t when extending the stackFather Chrysostomos2013-08-251-3/+3
    (I am referring to what is usually known simply as The Stack.)

    This partially fixes #119161.

    By casting the argument to int, we can end up truncating/wrapping it on 64-bit systems, so EXTEND(SP, 2147483648) translates into EXTEND(SP, -1), which does not extend the stack at all. Then writing to the stack in code like ()=1..1000000000000 goes past the end of allocated memory and crashes.

    I can’t really write a test for this, since instead of crashing it will use more memory than I have available (and then I’ll start forgetting things).
* Use SSize_t for tmps stack offsetsFather Chrysostomos2013-08-251-1/+6
    This is a partial fix for #119161.

    On 64-bit platforms, I32 is too small to hold offsets into a stack that can grow larger than I32_MAX. What happens is the offsets can wrap so we end up referencing and modifying elements with negative indices, corrupting memory, and causing crashes.

    With this commit, ()=1..1000000000000 stops crashing immediately. Instead, it gobbles up all your memory first, and then, if your computer still survives, crashes. The second crash happens because of a similar bug with the argument stack, which the next commit will take care of.
* PATCH: [perl #119443] Blead won't compile on winceKarl Williamson2013-08-231-3/+1
    This commit adds #if's to cause locale handling code to compile on platforms that don't have full-featured locale handling. The commits mentioned in the ticket did not adequately cover these situations.
* [perl #3330] warn on increment of an non number/non-magically incable valueTony Cook2013-08-121-0/+11
* add adjust_size_and_find_bucket to embed.fncLukas Mai2013-08-111-0/+7
* Revert "[perl #117855] Store CopFILEGV in a pad under ithreads"Father Chrysostomos2013-08-091-10/+0
    This reverts commit c82ecf346.

    It turns out to be faulty, because a location shared between threads (the cop) was holding a reference count on a pad entry in a particular thread. So when you free the cop, how do you know where to do SvREFCNT_dec?

    In reverting c82ecf346, this commit still preserves the bug fix from 1311cfc0a7b, but shifts it around.
* Stop ‘used once’ warnings from crashing on circularitiesFather Chrysostomos2013-08-051-1/+1
    gv_check was only checking for stashes nested directly inside themselves (*foo:: = *foo::foo) and the main stash. Other stash circularities would cause infinite recursion, blowing the C stack and crashing.
* [perl #117855] Store CopFILEGV in a pad under ithreadsFather Chrysostomos2013-08-051-0/+10
    This saves having to allocate a separate string buffer for every cop (control op; every statement has one).

    Under non-threaded builds, every cop has a pointer to the GV for that source file, namely *{"_<filename"}.

    Under threaded builds, the name of the GV used to be stored instead.

    Now we store an offset into the per-interpreter PL_filegvpad, which points to the GV.

    This makes no significant speed difference, but it reduces memory usage.