path: root/toke.c
Commit message (Author, Age; Files, Lines)
* Revert "Forbid labels with keyword names" (Jan Dubois, 2010-03-02; 1 file, -2/+0)
    This reverts commit f71d6157c7933c0d3df645f0411d97d7e2b66b2f.

    Revert "Add new error "Can't use keyword '%s' as a label""
    This reverts commit 28ccebc469d90664106fcc1cb73d7321c4b60716.
* PATCH: deprecation warnings for unreasonable charnames (Karl Williamson, 2010-02-20; 1 file, -1/+64)
    Prior to now just about anything has been legal for a character name in \N{...}. This means that legal code was broken by having \N{3,4} for example mean [^\n]{3,4}. Such code doesn't come from standard charnames, but from legal custom translators. This patch deprecates "unreasonable" names.

    handy.h is changed by the addition of macros that, taken together, define the names we deem reasonable: names that begin with an alphabetic character and continue with alphanumerics and some punctuation. toke.c is changed to parse each name and to raise a warning if any problematic characters are found. Some tests and diagnostic documentation are also included.
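The "reasonable name" rule described above can be sketched as a small check. This is only an illustration under stated assumptions: the function names are hypothetical (they are not the handy.h macros), the check is ASCII-only, and the exact set of permitted punctuation is a guess, since the commit message says only "some punctuation".

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical continuation set. The commit message does not list the
 * permitted punctuation, so the characters below are an assumption. */
static bool is_name_continue(unsigned char c)
{
    return isalnum(c) || c == ' ' || c == '-' || c == '(' || c == ')' || c == ':';
}

/* Returns true if `name` starts with an alphabetic character and every
 * later character belongs to the continuation set. An empty name fails. */
bool charname_is_reasonable(const char *name)
{
    assert(name != NULL);
    if (!isalpha((unsigned char)name[0]))
        return false;
    for (size_t i = 1; name[i] != '\0'; i++) {
        if (!is_name_continue((unsigned char)name[i]))
            return false;
    }
    return true;
}
```

Under this sketch a name such as "LATIN SMALL LETTER A" passes, while "3,4" fails on its first character, which is the kind of name the patch deprecates.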
* Add some missing dVAR's (Marcus Holland-Moritz, 2010-02-20; 1 file, -0/+2)
    Commits c3acb9e0760135dfd888c0ee1b415777d784aabc, 867fa1e2da145229b4db2c6e8d5b51700c15f114 and f0e67a1d29102aa9905aecf2b0f98449697d5af3 added or changed functions that now require a dVAR declaration to compile with -DPERL_GLOBAL_STRUCT.
* Avoid returning an undefined SV* (Rafael Garcia-Suarez, 2010-02-19; 1 file, -1/+2)
* Make a missing right brace on \N{ fatal (Karl Williamson, 2010-02-19; 1 file, -24/+9)
    It was decided that this should be a fatal error instead of a warning. Also some comments were updated.
* PATCH: [perl #56444] delayed interpolation of \N{...} (Karl Williamson, 2010-02-19; 1 file, -85/+299)
    make regen embed.fnc needs to be run on this patch.

    This patch fixes Bugs #56444 and #62056.

    Hopefully we have finally gotten this right. The parser used to handle all the escaped constants, expanding \x2e to its single byte equivalent. The problem is that for regexp patterns, this is a '.', which is a metacharacter and has special meaning that \x2e does not. So things were changed so that the parser didn't expand things in patterns. But this causes problems for \N{NAME}, when the pattern doesn't get evaluated until runtime, as for example when it has a scalar reference in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was in effect at the point during the parsing phase that this regex was encountered in, but we don't actually look at it until runtime, when these bug reports show that it is gone.

    The solution is for the tokenizer to parse \N{NAME}, but to compile it into an intermediate value that won't ever be considered a metacharacter. We have chosen to compile NAME to its equivalent code point value, and express it in the already existing \N{U+...} form. This indicates to the regex compiler that the original input was a named character and retains the value it had at that point in the parse.

    This means that \N{U+...} now always must imply Unicode semantics for the string or pattern it appeared in. Previously there was an inconsistency, where effectively \N{NAME} implied Unicode semantics, but \N{U+...} did not necessarily. So now, any string or pattern that has either of these forms is utf8 upgraded.

    A complication is that a charnames handler can return a sequence of multiple characters instead of just one. To deal with this case, the tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where c1 etc are the individual characters. Perhaps this will be made a public interface someday, but I decided to not expose it externally as far as possible for now in case we find reason to change it. It is possible to defeat this by passing it in a single quoted string to the regex compiler, so the documentation will be changed to discourage that.

    A further complication is that \N can have an additional meaning: to match a non-newline. This means that the two meanings have to be disambiguated. embed.fnc was changed to make public the function regcurly() in regcomp.c so that it could be referred to in toke.c to see if the ... in \N{...} is a legal quantifier like {2,}. This is used in the disambiguation.

    toke.c was changed to update some out-dated relevant comments. It now parses \N in patterns. If it determines that it isn't a named sequence, it passes it through unchanged. This happens when there is no brace after the \N, or no closing brace, or if the braces enclose a legal quantifier. Previously there has been essentially no restriction on what can come between the braces so that a custom translator can accept virtually anything. Now, legal quantifiers are assumed to mean that the \N is a "match non-newline that quantity of times".

    I removed the #ifdef'd out code that had been left in in case pack U reverted to earlier behavior. I did this because it complicated things, and because the change to pack U has been in long enough and shown that it is correct so it's not likely to be reverted.

    \N meaning a named character is handled differently depending on whether this is a pattern or not. In all cases, the output will be upgraded to utf8 because a named character implies Unicode semantics. If not a pattern, the \N is parsed into a utf8 string, as before. Otherwise it will be parsed into the intermediate \N{U+...} form. If the original was already a valid \N{U+...} constant, it is passed through unchanged.

    I now check that the sequence returned by the charnames handler is not malformed, which was lacking before.

    The code in regcomp.c which dealt with interfacing with the charnames handler has been removed. All the values should be determined by the time regcomp.c gets involved. The affected subroutine is necessarily restructured.

    An EXACT-type node is generated for the character sequence. Such a node has a capacity of 255 bytes, and so it is possible to overflow it. This wasn't checked for before, but now it is, and a warning issued and the overflowing characters are discarded.
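The disambiguation and the intermediate form described in this commit can be sketched roughly as follows. This is a minimal illustration, not the toke.c code: `looks_like_quantifier` is a hypothetical stand-in for regcurly() that accepts only {n}, {n,} and {n,m}, and `classify_backslash_N` and `format_named_sequence` are names invented here. The sketch follows this commit's rule that a missing closing brace means "not a named sequence"; the later commit listed above makes that case fatal instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* True if s begins with a brace quantifier of the form {n}, {n,} or {n,m}.
 * Illustrative stand-in only; this is not regcurly() from regcomp.c. */
static bool looks_like_quantifier(const char *s)
{
    if (*s++ != '{')
        return false;
    if (!isdigit((unsigned char)*s))
        return false;
    while (isdigit((unsigned char)*s))
        s++;
    if (*s == ',') {
        s++;
        while (isdigit((unsigned char)*s))
            s++;
    }
    return *s == '}';
}

enum n_escape_kind { N_NON_NEWLINE, N_NAMED };

/* `after` points just past "\N" inside a pattern. */
enum n_escape_kind classify_backslash_N(const char *after)
{
    assert(after != NULL);
    if (after[0] != '{')
        return N_NON_NEWLINE;            /* no brace after \N */
    if (strchr(after, '}') == NULL)
        return N_NON_NEWLINE;            /* no closing brace */
    if (looks_like_quantifier(after))
        return N_NON_NEWLINE;            /* e.g. \N{3,4} */
    return N_NAMED;
}

/* Writes the intermediate form, e.g. \N{U+41.301}, for n code points
 * (hexadecimal, separated by dots). Returns the length written, or -1
 * if buf is too small. */
int format_named_sequence(char *buf, size_t size,
                          const unsigned long *cps, size_t n)
{
    assert(buf != NULL && cps != NULL);
    int w = snprintf(buf, size, "\\N{U+");
    if (w < 0 || (size_t)w >= size)
        return -1;
    size_t used = (size_t)w;
    for (size_t i = 0; i < n; i++) {
        w = snprintf(buf + used, size - used, "%s%lX", i ? "." : "", cps[i]);
        if (w < 0 || (size_t)w >= size - used)
            return -1;
        used += (size_t)w;
    }
    w = snprintf(buf + used, size - used, "}");
    if (w < 0 || (size_t)w >= size - used)
        return -1;
    return (int)(used + (size_t)w);
}
```

In this sketch a handler result of two characters, U+0041 followed by U+0301, would be written as \N{U+41.301}, which contains no regex metacharacter that could be misread at runtime.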
* Allow arbitrary whitespace between NAME and VERSION in "package NAME VERSION;" statements (Jesse Vincent, 2010-02-03; 1 file, -0/+1)
    Fixes [perl #72432]
* Parse 'use NAME VERSION' with C locale (David Golden, 2010-01-16; 1 file, -0/+6)
* Omnibus strict and lax version parsing (David Golden, 2010-01-13; 1 file, -1/+48)
    Authors: John Peacock, David Golden and Zefram

    The goal of this mega-patch is to enforce strict rules for version numbers provided to 'package NAME VERSION' while formalizing the prior, lax rules used for version object creation. Parsing for use() is unchanged.

    version.pm adds two globals, $STRICT and $LAX, containing regular expressions that define the rules. There are two additional functions -- version::is_strict and version::is_lax -- that test an argument against these rules.

    However, parsing of strings that might contain version numbers is done in core via the Perl_scan_version function, which may be called during compilation or may be called later when version objects are created by Perl_new_version or Perl_upg_version.

    A new helper function, Perl_prescan_version, has been added to validate a string under either strict or lax rules. This is used in toke.c for 'package NAME VERSION' in strict mode and by Perl_scan_version in lax mode. It matches the behavior of the version.pm regular expressions, but does not use them directly.

    A new test file, comp/packagev.t, validates strict and lax behaviors of 'package NAME VERSION' and 'version->new(VERSION)' respectively and verifies their behavior against the $STRICT and $LAX regular expressions, as well. Validating these two implementations should help ensure they each work as intended.

    Other files and tests have been modified as necessary to support these changes.

    There is remaining work to be done in a few areas:

    * documenting all changes in behavior and new functions
    * determining proper treatment of "," as decimal separators in various locales
    * updating diagnostics for new error messages
    * porting changes back to the version.pm distribution on CPAN, including pure-Perl versions
* Move prototype parsing related warnings from the 'syntax' top level warnings category to a new 'illegalproto' subcategory. (Matt S Trout, 2010-01-10; 1 file, -4/+4)
    Two warnings can be emitted when parsing a prototype -

        Illegal character in prototype for %s : %s
        Prototype after '%c' for %s : %s

    The first one is emitted when any invalid character is found, the latter when further prototype-type stuff is found after a slurpy entry (i.e. valid character but in such a place as to be a no-op, and therefore likely a bug).

    These warnings are distinct from those emitted when a sub is overwritten by one with a different prototype, and when calls are made to subroutines with prototypes - those are in the pre-existing sub-category 'prototype'.

    Since modules such as signatures.pm and Web::Simple only need to disable the warnings during parsing, I chose to add a new category containing only these. Moving these warnings into the 'prototype' sub-category would have forced authors to disable more warnings than they intended, and the entire raison d'etre of this patch is to allow the specific warnings involved to be disabled.

    In order to maintain compatibility with existing code, the new location needed to be a sub-category of 'syntax' - this means that no warnings 'syntax'; will continue to work as expected - even in cases like Web::Simple where all subcategories extant prior to this patch are re-enabled (this is another reason why a move into the 'prototype' category would not achieve the desired goal).

    The category name 'illegalproto' was chosen because the most common warning to encounter is the "Illegal character" one, and therefore 'illegalproto' while minorly inaccurate by ignoring the (relatively recent and unknown) second warning is an easy name to spot on an initial skim of perllexwarn and will behave as expected by also disabling the case of an unusual prototype that happens to look like a normal one.

    This patch updates pod/perllexwarn.pod, perldiag.pod and perl5113delta.pod to document the new category, toke.c and warnings.pl to create and implement the new category, and a new test t/op/protowarn.t that verifies the new behaviour in a number of cases. It also includes the files generated by regen.pl that are found in the repo - notably warnings.h and lib/warnings.pm.
* [perl #71748] Bleadperl f0e67a1 breaks CPAN: Template::Plugin::YAML::Encode 0.02 (Zefram, 2010-01-05; 1 file, -12/+8)
    Unsurprisingly, the nature of the bug is that I accidentally changed the logic of one of the several types of space skipping. Fix attached.
* Allow "{sub f}" to compile (Vincent Pit, 2010-01-03; 1 file, -1/+1)
* Remove spurious case of warning "Use of %s without parentheses is ambiguous" (Rafael Garcia-Suarez, 2009-12-20; 1 file, -2/+0)
    Eric Brine pointed out that this warning doesn't apply to ".", as in C<rand . 1>, which shouldn't warn since C<. 1> cannot be mistaken for a floating point number.
* Introduce C<use feature "unicode_strings"> (Rafael Garcia-Suarez, 2009-12-20; 1 file, -1/+1)
    This turns on the unicode semantics for uc/lc/ucfirst/lcfirst operations on strings without the UTF8 bit set but with characters higher than 127 (i.e. outside ASCII). This replaces the "legacy" pragma experiment.

    Note that currently this feature sets both a bit in $^H and a (unused) key in %^H. The bit in $^H could be replaced by a flag on the uc/lc/etc op. It's probably not feasible to test a key in %^H in pp_uc and friends each time we want to know which semantics to apply.
* Make eval {} compile directly to OP_ENTERTRY (Rafael Garcia-Suarez, 2009-12-20; 1 file, -2/+8)
    This way, it's correctly caught and blocked by Safe, separately from eval "".
* Fix for [perl #70910] wrong line number in syntax error message (Zefram, 2009-12-09; 1 file, -1/+2)
* -Dmad: double free or corruption (Tony Cook, 2009-12-01; 1 file, -1/+1)
    > If your perl has -Dmad, the following program crashes:
    >
    > $ bleadperl -we '$x="x" x 257; eval "for $x"'
    > *** glibc detected *** bleadperl: double free or corruption (!prev): 0x0000000001dca670 ***

    Change 6136c704 changed S_scan_ident from:

        e = d + destlen - 3;

    to:

        register char * const e = d + destlen + 3;

    where e is used to mark the end of the buffer. This meant that the various buffer end checks allowed the various buffers supplied to S_scan_ident to overflow.

    Attached is a fix, various tests with fencepost checks on different identifier lengths, and the specific case mentioned in the ticket.

    Tony

    Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
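The fencepost at issue can be illustrated with a small bounded copy. This is a sketch under assumptions, not S_scan_ident itself: `copy_ident` is a hypothetical function, and the 3-byte reserve simply mirrors the "destlen - 3" quoted in the message without claiming to know what the real code keeps that slack for.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copies the leading identifier characters of src into dest, whose capacity
 * is destlen bytes. Returns the number of characters copied, or -1 if the
 * identifier does not fit. */
long copy_ident(char *dest, size_t destlen, const char *src)
{
    assert(dest != NULL && src != NULL && destlen >= 4);
    char *d = dest;
    /* End marker placed INSIDE the buffer. Writing "+ 3" here instead of
     * "- 3" would put it past the end, so the check below could never
     * stop an overflow. */
    char *const e = dest + destlen - 3;

    while (isalnum((unsigned char)*src) || *src == '_') {
        if (d >= e)
            return -1;          /* identifier too long: refuse to write */
        *d++ = *src++;
    }
    *d = '\0';                  /* d <= e here, so this is still in bounds */
    return (long)(d - dest);
}
```

With an 8-byte buffer this accepts identifiers of up to five characters and rejects a sixth, which is the kind of boundary the commit's fencepost tests exercise on different identifier lengths.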
* Fix -DPERL_NO_UTF16_FILTER (Eric Brine, 2009-11-30; 1 file, -7/+15)
* Allow a closing brace after a "use VERSION" (Vincent Pit, 2009-11-28; 1 file, -1/+2)
    This fixes [perl #70884] : use VERSION in BLOCK without semicolon -> syntax error
* -Dmad minitest failure bisect (Zefram, 2009-11-26; 1 file, -3/+0)
    Tony Cook wrote:
    >Smokes with -Dmad have been failing during make minitest since the
    >middle of last month.

    Mostly fixed by the attached patch. The fault is a logic error on my part, probably from the early phase of developing the lexer API patch, when I didn't properly understand the various buffer pointer variables.

    In my tests with -Dmad, I'm still getting a test failure ("panic: input overflow") from t/op/incfilter.t. The underlying problem is the filter layer mishandling things when a filter function gives it a multiline string, so it generates an invalid SV state (strlen(SvPVX(PL_linestr)) > SvCUR(PL_linestr)). This faulty state also occurs without -Dmad, and so doesn't appear to be Mad-related, it just doesn't in practice cause the test panic without -Dmad. I'm investigating this bug now.

    -zefram

    Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
* perl-5.11.2 breaks NYTProf savesrc option (Lexer API suspected) (Zefram, 2009-11-25; 1 file, -5/+8)
    Tim Bunce wrote:
    >The primary issue is the off-by-one error in the array indexing.

    There's a bit more to it than that. The indexing was off-by-one for *some* places that process a new line, but correct for others, so the saved source as a whole was mangled rather than simply offset. Also, there were some redundant calls to update_debugger_info(), so some lines got saved twice, in some cases off-by-one for one saving and not for the other. The saved source is, therefore, hopelessly broken in 5.11.2.

    Attached patch fixes the source saving. Includes a new test, which works through all reachable places that source lines get saved. This should close RT #70804.

    -zefram
* Also skip spaces after variable if we are within lexical brackets. Fixes #70091: Segmentation fault in hash lookup in regex substitution (Gerard Goossen, 2009-11-25; 1 file, -1/+1)
* lexer API fixes (Zefram, 2009-11-19; 1 file, -3/+6)
    The attached patch contains these fixes for the lexer API work:

    * fix MinGW-revealed problem in BOM logic (replacing Jan's patch)
    * fix warnings from t/op/incfilter.t
    * probably fix g++ failure due to goto bypassing initialisation
    * perl5112delta update

    -zefram

    Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
* Remove dead preprocessor code from toke.c (Jan Dubois, 2009-11-16; 1 file, -13/+0)
    The symbol FTELL_FOR_PIPE_IS_BROKEN is no longer being used and should have been removed with the commit 4c84d7f2, which removed the -P option.
* Fix crash in refactored lexer internals (Jan Dubois, 2009-11-16; 1 file, -1/+1)
    Commit f0e67a1d29102aa9905aecf2b0f98449697d5af3 changed the control flow so that PerlIO_tell(PL_rsfp) could be called when PL_rsfp was NULL, which produces a crash at least on Windows with the MSVCRT runtime.

    This change moves the check for whether PL_rsfp is NULL closer to the location where it is actually tested, which gets rid of the crashes. I have, however, *not* verified whether the changes in control flow in f0e67a1d are otherwise correct.
* lexer API (Zefram, 2009-11-15; 1 file, -213/+758)
    Attached is a patch that adds a public API for the lowest layers of lexing. This is meant to provide a solid foundation for the parsing that Devel::Declare and similar modules do, and it complements the pluggable keyword mechanism. The API consists of some existing variables combined with some new functions, all marked as experimental (which making them public certainly is).
* Add length and flags arguments to Perl_allocmy(). (Nicholas Clark, 2009-11-09; 1 file, -2/+2)
    Currently no flags bits are used, and the length is cross-checked against strlen() on the pointer, but the intent is to re-work the entire pad API to be UTF-8 aware, from the current situation of char * pointers only.
* Bareword sub lookups (Zefram, 2009-11-08; 1 file, -30/+41)
    Attached is a patch that changes how the tokeniser looks up subroutines, when they're referenced by a bareword, for prototype and const-sub purposes. Formerly, it has looked up bareword subs directly in the package, which is contrary to the way the generated op tree looks up the sub, via an rv2cv op. The patch makes the tokeniser generate the rv2cv op earlier, and dig around in that.

    The motivation for this is to allow modules to hook the rv2cv op creation, to affect the name->subroutine lookup process. Currently, such hooking affects op execution as intended, but everything goes wrong with a bareword ref where the tokeniser looks at some unrelated CV, or a blank space, in the package. With the patch in place, an rv2cv hook correctly affects the tokeniser and therefore the prototype-based aspects of parsing.

    The patch also changes ck_subr (which applies the argument context and checking parts of prototype behaviour) to handle subs referenced by an RV const op inside the rv2cv, where formerly it would only handle a gv op inside the rv2cv. This is to support the most likely kind of modified rv2cv op.

    The attached patch is the resulting revised version of the bareword sub patch. It incorporates the original patch (allowing rv2cv op hookers to control prototype processing), the GV-downgrading addition, and a mention in perldelta.
* Add length and flags arguments to Perl_pad_findmy(), moving it to the public API. (Nicholas Clark, 2009-11-07; 1 file, -2/+2)
    Currently no flags bits are used, and the length is cross-checked against strlen() on the pointer, but the intent is to re-work the entire pad API to be UTF-8 aware, from the current situation of char * pointers only.
* Placate a warning from Borland's compiler. (Nicholas Clark, 2009-11-07; 1 file, -1/+1)
* Implement facility to plug in syntax triggered by keywords (Jesse Vincent, 2009-11-05; 1 file, -17/+58)
    Date: Tue, 27 Oct 2009 01:29:40 +0000
    From: Zefram <zefram@fysh.org>
    To: perl5-porters@perl.org
    Subject: bareword sub lookups

    Attached is a patch that changes how the tokeniser looks up subroutines, when they're referenced by a bareword, for prototype and const-sub purposes. Formerly, it has looked up bareword subs directly in the package, which is contrary to the way the generated op tree looks up the sub, via an rv2cv op. The patch makes the tokeniser generate the rv2cv op earlier, and dig around in that.

    The motivation for this is to allow modules to hook the rv2cv op creation, to affect the name->subroutine lookup process. Currently, such hooking affects op execution as intended, but everything goes wrong with a bareword ref where the tokeniser looks at some unrelated CV, or a blank space, in the package. With the patch in place, an rv2cv hook correctly affects the tokeniser and therefore the prototype-based aspects of parsing.

    The patch also changes ck_subr (which applies the argument context and checking parts of prototype behaviour) to handle subs referenced by an RV const op inside the rv2cv, where formerly it would only handle a gv op inside the rv2cv. This is to support the most likely kind of modified rv2cv op.

    [This commit includes the Makefile.PL for XS-APITest-KeywordRPN missing from the original patch, as well as updates to perldiag.pod and a MANIFEST sort]
* Deprecate use of := to mean an empty attribute list in my $pi := 4; (Nicholas Clark, 2009-11-04; 1 file, -0/+3)
    An accident of Perl's parser meant that

        my $pi := 4;

    was parsed as an empty attribute list. Empty attribute lists are ignored, hence the above is equivalent to

        my $pi = 4;

    However, the fact that it is currently valid syntax means that := cannot be used as a new token, without silently changing the meaning of existing code. Hence it is now deprecated, so that it can subsequently be removed, allowing the possibility of := to be used as a new token with new semantics.
* S_utf16_textfilter() was not returning EOF correctly in some situations. (Nicholas Clark, 2009-11-01; 1 file, -2/+6)
* Remove Perl_pmflag() from the public API, and mark it as deprecated. (Nicholas Clark, 2009-11-01; 1 file, -11/+18)
    regcomp.c stopped using it before 5.10, leaving only toke.c. The only code on CPAN that uses it is copies of regcomp.c. Replace it with a static function, with a cleaner interface.
* In S_pending_ident(), only call gv_fetchpvn_flags() if the warning is enabled. (Nicholas Clark, 2009-10-24; 1 file, -4/+5)
    ckWARN(WARN_AMBIGUOUS) is cheaper than Perl_gv_fetchpvn_flags().
* S_utf16_textfilter() can use the filter GV itself for an SV buffer. (Nicholas Clark, 2009-10-23; 1 file, -3/+3)
    This saves allocating an extra SV head and body.
* In S_utf16_textfilter() replace sv_chop() with code, as we move 1 byte at most. (Nicholas Clark, 2009-10-22; 1 file, -1/+9)
* S_utf16_textfilter() needs to avoid splitting UTF-16 surrogate pairs. (Nicholas Clark, 2009-10-22; 1 file, -1/+20)
    Easier said than done.
* Re-write S_utf16_textfilter() to correctly handle partial reads of UTF-16. (Nicholas Clark, 2009-10-22; 1 file, -30/+71)
    Treat any (and all) octets after the BOM (or all, if there was no BOM) as initial read data for the filter, and call it to convert them to the first line, reading more if necessary. This correctly handles the "problem" that UTF-16LE read as a line, on the assumption that it's ASCII/ISO-8859-*/UTF-8/etc, will be truncated after the first octet of the "\n\0" pair that is "\n" encoded as UTF-16LE. This fixes bug #69678.

    Read from the upstream filter in block mode, rather than line mode.
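The partial-read problem described here, together with the surrogate-pair concern from the commit just below it in the log, can be sketched with a converter that consumes only complete UTF-16LE units. This is an illustration, not the S_utf16_textfilter() code: the function name and interface are hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch that the source does not describe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts complete UTF-16LE code units from in[0..inlen) to UTF-8 in out.
 * Stops before a trailing odd octet, before a high surrogate whose partner
 * has not been read yet, or when fewer than 4 octets of output space remain.
 * Returns the number of input octets consumed; *outlen receives the number
 * of octets written. Unconsumed input should be kept for the next call. */
size_t utf16le_to_utf8(const uint8_t *in, size_t inlen,
                       uint8_t *out, size_t outcap, size_t *outlen)
{
    assert(in != NULL && out != NULL && outlen != NULL);
    size_t i = 0, o = 0;

    while (inlen - i >= 2 && outcap - o >= 4) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        size_t used = 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
            if (inlen - i < 4)
                break;                                 /* wait for its partner */
            uint32_t lo = (uint32_t)in[i + 2] | ((uint32_t)in[i + 3] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                used = 4;
            } else {
                cp = 0xFFFD;                           /* unpaired (assumption) */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                               /* stray low surrogate */
        }

        if (cp < 0x80) {
            out[o++] = (uint8_t)cp;
        } else if (cp < 0x800) {
            out[o++] = (uint8_t)(0xC0 | (cp >> 6));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (uint8_t)(0xE0 | (cp >> 12));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (uint8_t)(0xF0 | (cp >> 18));
            out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
        i += used;
    }
    *outlen = o;
    return i;
}
```

A line-mode read of "a\n" in UTF-16LE stops after the octet 0x0A and yields the three octets 61 00 0A. The sketch converts the "a", leaves the lone 0x0A unconsumed, and finishes the newline once the trailing 0x00 arrives.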
* Remove the "hack" that removes SVt_UTF8 in the UTF16 filter, by fixing t/TEST (Nicholas Clark, 2009-10-22; 1 file, -4/+0)
    Given that t/TEST already had code to add -I../lib when testing UTF-8 with -utf8, do likewise for testing UTF-16 with -utf16.
* Refactor S_utf16_textfilter() to use a second SV for the UTF-16 input. (Nicholas Clark, 2009-10-21; 1 file, -14/+28)
    Re-use the same SV for each call. Store it in IoTOP_GV(filter).
* Remove the PERLIO * argument to S_filter_gets(), as it's always PL_rsfp (Nicholas Clark, 2009-10-21; 1 file, -10/+12)
    Conceptually it's also wrong: if there are source filters, the passed-in file handle is not passed up the stack of filters for the topmost filter to use to read from. It was in the parameter list from the first creation of filter_gets() in 16d20bd98cd29be76029ebf04027a7edd34d817b, when calls to sv_gets() were replaced by it.
* S_utf16_textfilter() needs FILTER_DATA() to get the filter's state SV. (Nicholas Clark, 2009-10-21; 1 file, -1/+1)
    aa6dbd607b0a3d8a wrongly assumed that the filter's state SV was the SV passed in as an argument to the filter read function.
* S_utf16_textfilter() was failing to honour error returns from FILTER_READ() (Nicholas Clark, 2009-10-21; 1 file, -1/+1)
* panic if S_utf16_textfilter() is called in block mode. (Nicholas Clark, 2009-10-21; 1 file, -0/+7)
* Make filter_read() in block mode create a well-formed SV with a trailing '\0' (Nicholas Clark, 2009-10-21; 1 file, -1/+2)
* Pull out filter setup code from S_swallow_bom() into S_add_utf16_textfilter() (Nicholas Clark, 2009-10-20; 1 file, -31/+25)
* MAD-only code in S_swallow_bom() duplicated the actions of sv_setpvn() (Nicholas Clark, 2009-10-20; 1 file, -9/+0)
    Remove it. All its writes were byte-for-byte identical with the memory they overwrote. The bugs it attempts to fix are real, but caused by the design and implementation of other parts of this routine and S_utf16_textfilter().
* Merge S_utf16_textfilter and S_utf16rev_textfilter(). (Nicholas Clark, 2009-10-18; 1 file, -27/+15)
    Use IoLINES() on the filter's SV to determine which encoding is in use.
* Note why S_pending_ident's prototype can't be generated by embed.fnc (Nicholas Clark, 2009-10-18; 1 file, -0/+2)