delta/perl.git - github.com: perl/perl5.git

	Commit message	Author	Age	Files	Lines
*	Change name of ibcmp to foldEQ	Karl Williamson	2010-06-05	1	-0/+10
	As discussed on p5p, ibcmp has different semantics from other cmp functions in that it is a binary (true/false) rather than a ternary (less/equal/greater) function. It is less confusing then to have a name that implies true/false. There are three functions affected: ibcmp, ibcmp_locale and ibcmp_utf8. ibcmp is actually equivalent to foldNE, but for the same reason that things like 'unless' and 'until' are cautioned against, I changed the functions to foldEQ, so that the existing names, like ibcmp_utf8, are defined as macros that are the complement of foldEQ. This patch also changes the one file where turning ibcmp into a macro causes problems, so that it uses the new name. It also documents for the first time ibcmp, ibcmp_locale and their new names.
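A minimal sketch of the renaming pattern described in this commit, under stated assumptions: `fold_eq_sketch` and `ibcmp_sketch` are hypothetical names, the comparison is ASCII-only, and none of this is the actual Perl implementation. It only shows how an old "compare" name can be kept as a macro that is the complement of a new true/false function.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a foldEQ-style function: returns 1 when the
 * first len bytes of s1 and s2 are equal ignoring ASCII case, else 0. */
static int fold_eq_sketch(const char *s1, const char *s2, size_t len)
{
    assert(s1 != NULL && s2 != NULL);
    for (size_t i = 0; i < len; i++) {
        if (tolower((unsigned char)s1[i]) != tolower((unsigned char)s2[i]))
            return 0;
    }
    return 1;
}

/* The legacy-style name survives as a macro: true means "not equal". */
#define ibcmp_sketch(s1, s2, len) (!fold_eq_sketch((s1), (s2), (len)))
```

With this shape, callers of the old name keep their "non-zero means different" behaviour, while new code can read as a plain equality test.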
*	handle perl extended utf8 start bytes	Tony Cook	2010-05-31	1	-1/+3
	perl uses UTF8_IS_START() to test whether a byte is a valid start byte; this didn't take perl's extended UTF-8 range into account.
*	Remove unused, wrong #define in utf8.h	Karl Williamson	2010-05-25	1	-2/+0
	is unused in the code, and is wrong for EBCDIC platforms, as there can be invariants there that aren't ASCII. I simply removed it.
*	Update .pods	Karl Williamson	2009-12-25	1	-4/+4
	Signed-off-by: Abigail <abigail@abigail.be>
*	Introduce C<use feature "unicode_strings">	Rafael Garcia-Suarez	2009-12-20	1	-1/+1
	This turns on the unicode semantics for uc/lc/ucfirst/lcfirst operations on strings without the UTF8 bit set but with characters above 127. This replaces the "legacy" pragma experiment. Note that currently this feature sets both a bit in $^H and an (unused) key in %^H. The bit in $^H could be replaced by a flag on the uc/lc/etc op. It's probably not feasible to test a key in %^H in pp_uc and friends each time we want to know which semantics to apply.
*	qr/\X/ expansion	Karl Williamson	2009-12-05	1	-13/+13
*	Make unicode semantics the default	Karl Williamson	2009-11-23	1	-1/+2
*	add code for Unicode semantics for non-utf8 latin1 chars	Karl Williamson	2009-11-14	1	-0/+1
*	More cleanup of utfebcdic.h and utf8.h	karl williamson	2009-11-09	1	-26/+42
	Attached is a patch that removes from utfebcdic.h most definitions that are common to it and utf8.h, and moves them to the common area of utf8.h. The duplicate ones that are retained are each an integral part of a larger related set that do differ between the headers. Some of the definitions had started to drift, so this brings them back into line, with a lowered possibility of future drift. In particular the ones for the 'lazy' macros did not do quite as intended, especially in the EBCDIC case. The bugs were a small performance hit only, in that the macro was not quite as lazy as expected, and so loaded utf8_heavy.pl possibly unnecessarily. In examining these, I noted that the utf8.h definition of the start byte of a utf8 encoded string accepts invalid start bytes 0xC0 and 0xC1. These are invalid because they are for overlong encodings of ASCII code points. One is not supposed to allow these, and there have been security attacks, according to Wikipedia, against code that does. But I don't know all the ramifications for Perl of changing to exclude these, so I left it alone, but added a comment (and an item on my personal todo list to check into it). I made some comment clarifications, and removed some definitions marked as obsolete in utf8.h that are in fact no longer used. I added some synonyms for existing macros that more clearly reflect the use that I intend to put them to in future patches. From ba581aa4db767e5531ec0c0efdea5de4e9b09921 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Mon, 9 Nov 2009 08:38:24 -0700 Subject: [PATCH] Clean up utf headers Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
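To illustrate why start bytes 0xC0 and 0xC1 are called overlong here, the following is a rough sketch with hypothetical helpers. `is_strict_utf8_start` follows the standard UTF-8 range (0xC2..0xF4), which is an assumption of this example; it is not Perl's UTF8_IS_START(), and Perl's extended UTF-8 accepts higher start bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for bytes that may begin a multi-byte
 * sequence in standard UTF-8. 0xC0 and 0xC1 are deliberately excluded. */
static bool is_strict_utf8_start(uint8_t b)
{
    return b >= 0xC2 && b <= 0xF4;
}

/* Decode a two-byte sequence with no overlong check, to show what a
 * lenient decoder would produce. b1 must be a continuation byte. */
static uint32_t decode_two_bytes(uint8_t b0, uint8_t b1)
{
    assert((b1 & 0xC0) == 0x80);
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}
```

A lenient decoder turns the bytes 0xC0 0xAF into 0x2F ('/'), an ASCII code point that already has a one-byte encoding. Every two-byte sequence starting with 0xC0 or 0xC1 decodes to a value below 0x80, which is what makes those start bytes overlong.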
*	Define specially handled chars; and clean-up ebcdic vs unicode	Karl Williamson	2009-11-08	1	-16/+9
*	Put parentheses around macro arguments	Rafael Garcia-Suarez	2009-02-01	1	-2/+2
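As a small illustration of why macro arguments are parenthesized, consider the two hypothetical macros below. Neither is taken from utf8.h; they are made up for this example.

```c
/* Hypothetical macros for illustration only. */
#define DOUBLE_UNSAFE(x) (x * 2)    /* argument not parenthesized */
#define DOUBLE_SAFE(x)   ((x) * 2)  /* argument parenthesized */
```

With the argument `1 + 2`, the first macro expands to `(1 + 2 * 2)` and the second to `((1 + 2) * 2)`, so operator precedence changes the result unless the argument is wrapped.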
*	Bump copyright year in embed.pl and various files that were just touched	Rafael Garcia-Suarez	2009-01-02	1	-1/+1
	(and run "make regen")
*	Faster sv_utf8_upgrade()	karl williamson	2009-01-02	1	-1/+2
	Consider what currently happens when the tokenizer is scanning a string. It looks through it byte-by-byte until it finds a character that forces it to decide to go to utf8. It then calls sv_utf8_upgrade() with the portion of the string scanned so far. sv_utf8_upgrade() starts over from the beginning, and scans the string byte-by-byte until it finds a character that varies between non-utf8 and utf8. It then calls bytes_to_utf8(). bytes_to_utf8() allocates a new string that can handle the worst case expansion, 2n+1, of the entire string, and starts over from the beginning, and scans the input string byte-by-byte copying and converting each character to the output string as it goes. It doesn't return the size of the new string, so sv_utf8_upgrade() assumes it is only as big as what actually got converted, throwing away knowledge of any spare. It then returns to the tokenizer, which immediately does a grow to get space for the unparsed input. This is likely to cause a new string to be allocated and copied from the one we had just created, even if that string in actuality had enough space in it. Thus, the invariant head portion of the string is scanned 3 times, and probably 2 strings will be allocated and copied. My solution to cutting this down is to do several things. First, I added an extra flag for sv_utf8_upgrade that says don't bother to check if the string needs to be converted to utf8, just assume it does. This eliminates one of the passes. I also added a new parameter to sv_utf8_upgrade that says when you return, I want this much unused space in the string. That eliminates the extra grow.
	This was all done by renaming the current work-horse function from sv_utf8_upgrade_flags to be sv_utf8_upgrade_flags_grow() and making the current function name be a macro which calls the revised one with a 0 grow parameter. I also improved the internal efficiency of sv_utf8_upgrade so that when it does scan the string, it doesn't call bytes_to_utf8, but does the conversion itself, using a fast memory copy instead of the byte-oriented one for the invariant header, and it uses that header to get a better estimate of the needed size of the new string, and it doesn't throw away the knowledge of the allocated size. And, if it is clear without scanning the whole string that the conversion will fit in the already allocated string, it just uses that instead of allocating and copying a new one, using the algorithm I copied from the tokenizer. (In this case it does have to finish scanning the whole string to get the correct size.) The comments have details. It still is byte-oriented. Vectorization et al. could yield performance improvements. One idea for that is in the comments. The patch also includes a new synonym I created which is a more accurate name than NATIVE_TO_ASCII.
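The ideas in this message (copy the invariant head in one go, size the buffer from what remains, and leave requested spare room) can be sketched roughly as follows. This is a simplified illustration, not Perl's sv_utf8_upgrade_flags_grow(): the function name, its parameters, the use of malloc, and the Latin-1 input assumption are all hypothetical, and it does no overflow checking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: convert len Latin-1 bytes to UTF-8 in a fresh
 * buffer that keeps `extra` spare bytes plus a trailing NUL. Returns NULL
 * on allocation failure. *out_len receives the converted length and
 * *out_cap the allocated size, so the caller keeps knowledge of the spare. */
static unsigned char *latin1_to_utf8_grow(const unsigned char *src, size_t len,
                                          size_t extra, size_t *out_len,
                                          size_t *out_cap)
{
    assert(src != NULL && out_len != NULL && out_cap != NULL);

    /* Find the invariant head: bytes below 0x80 are the same in UTF-8. */
    size_t head = 0;
    while (head < len && src[head] < 0x80)
        head++;

    /* Only the tail can expand, and by at most a factor of two. */
    size_t cap = head + 2 * (len - head) + extra + 1;
    unsigned char *dst = malloc(cap);
    if (dst == NULL)
        return NULL;

    memcpy(dst, src, head);   /* fast block copy of the invariant head */

    unsigned char *d = dst + head;
    for (size_t i = head; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            *d++ = c;
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));
            *d++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *d = '\0';

    *out_len = (size_t)(d - dst);
    *out_cap = cap;
    return dst;
}
```

Because the capacity is reported back, a caller that asked for `extra` bytes can append that much without another allocation, which is the point of the grow parameter described above.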
*	Add editor blocks to some header files.	Marcus Holland-Moritz	2008-01-01	1	-0/+10
	p4raw-id: //depot/perl@32793
*	Fix up copyright years for files modified in 2007.	Nicholas Clark	2007-11-07	1	-1/+1
	p4raw-id: //depot/perl@32237
*	Update copyright years in .h files. Also, in .pl	Rafael Garcia-Suarez	2007-01-05	1	-1/+1
	files that generate .h files, so they'll be ready next time. p4raw-id: //depot/perl@29695
*	Re: [perl #38293] chr(65535) should be allowed in regexes	SADAHIRO Tomoyuki	2006-04-02	1	-4/+6
	Message-Id: <20060402224657.B942.BQW10602@nifty.com> p4raw-id: //depot/perl@27688
*	Change 27677 missed two direct accesses to op_private in COPs.	Nicholas Clark	2006-04-02	1	-1/+1
	I believe that all are now found, as redefining CopHINTS_get(c) to (~(c)->op_private) (with corresponding changes to CopHINTS_set() and the initialisation of PL_compiling) works. p4raw-id: //depot/perl@27687
*	Compilation help for EBCDIC platforms, from Jarkko,	Rafael Garcia-Suarez	2005-12-22	1	-3/+5
	tested by Rajarshi Das p4raw-id: //depot/perl@26452
*	Re: XS-assisted SWASHGET (esp. for t/uni/class.t speedup)	SADAHIRO Tomoyuki	2005-11-30	1	-1/+1
	Message-Id: <20051127170016.A786.BQW10602@nifty.com> p4raw-id: //depot/perl@26229
*	A more elegant way to deal with utf8n_to_uvchr() and utf8n_to_uvuni().	Nicholas Clark	2005-10-31	1	-2/+2
	p4raw-id: //depot/perl@25926
*	Replace uvuni_to_utf8() with a macro that passes the extra 0 argument	Nicholas Clark	2005-10-30	1	-0/+1
	to uvuni_to_utf8_flags(). Move the old body to mathoms.c p4raw-id: //depot/perl@25905
*	Replace is_utf8_string_loc() with a macro that passes the extra 0	Nicholas Clark	2005-10-30	1	-1/+3
	argument to is_utf8_string_loclen(). Correct the description of its parameters in its POD. p4raw-id: //depot/perl@25903
*	undef IS_UTF8_CHAR() on EBCDIC	SADAHIRO Tomoyuki	2005-10-09	1	-0/+4
	Message-Id: <20051008165752.348A.BQW10602@nifty.com> p4raw-id: //depot/perl@25716
*	one more round of is_utf8_foo tuneup	Jarkko Hietaniemi	2005-06-07	1	-0/+2
	Message-ID: <42A314E4.8060608@gmail.com> p4raw-id: //depot/perl@24730
*	speed up is_utf8_char()	Jarkko Hietaniemi	2005-06-03	1	-0/+70
	Message-ID: <429F557E.3090007@gmail.com> p4raw-id: //depot/perl@24687
*	Symbian port of Perl	Jarkko Hietaniemi	2005-04-21	1	-1/+1
	Message-ID: <B356D8F434D20B40A8CEDAEC305A1F2453D653@esebe105.NOE.Nokia.com> p4raw-id: //depot/perl@24271
*	Third consting batch	Andy Lester	2005-03-24	1	-5/+5
	Message-Id: <2f14220e7101a03f7659dbe79a03b115@petdance.com> p4raw-id: //depot/perl@24074
*	Re: uc($long_utf8_string) exhausts memory	Jarkko Hietaniemi	2005-01-22	1	-8/+20
	Message-Id: <41F1801C.3080201@iki.fi> Make buffer size estimates for utf8 case conversion less maximally pessimistic p4raw-id: //depot/perl@23857
*	UTF8_ALLOW_ANYUV should not allow overlong sequences [PATCH]	Gisle Aas	2004-12-09	1	-2/+1
	Message-ID: <lrmzwrae0j.fsf_-_@caliper.activestate.com> p4raw-id: //depot/perl@23632
*	It's UTF-8, not UTF8. (Note: not s/UTF-8/UTF8/,	Jarkko Hietaniemi	2003-09-12	1	-2/+2
	since that would break a lot of code.) Also a few stray UTF16s, UTF32s, and "encoded in Unicode". p4raw-id: //depot/perl@21198
*	Fix up Larry's copyright statements to my best knowledge.	Jarkko Hietaniemi	2003-04-16	1	-1/+1
	(Lots of Perl 5 source code archaeology was involved.) Larry didn't make strangled noises when I showed him the patch, either :-) p4raw-id: //depot/perl@19242
*	Reverse copyright update (#18801) for files not changed in 2003.	Hugo van der Sanden	2003-03-02	1	-1/+1
	p4raw-id: //depot/perl@18807
*	Update all copyrights to 2003, from Jarkko	Hugo van der Sanden	2003-03-02	1	-1/+1
	p4raw-id: //depot/perl@18801
*	As noted by Philip Newton: nothing wrong with BOM,	Jarkko Hietaniemi	2002-04-06	1	-14/+12
	but 0xFFFE quite wrong. p4raw-id: //depot/perl@15762
*	Explain the "gaps" in the UTF-8 encoding.	Jarkko Hietaniemi	2002-04-06	1	-0/+4
	p4raw-id: //depot/perl@15761
*	What started as a small nit (the charnames test, nit found	Jarkko Hietaniemi	2002-04-02	1	-5/+5
	by Hugo), ballooned a bit... the goal is Larry's wish that illegal Unicode (such as U+FFFF) by default doesn't warn, since what if somebody WANTS to create illegal Unicode? Now getting close to this in the regex runtime. (Also, fix more of my fixation that BOM would be U+FFFE.) p4raw-id: //depot/perl@15689
*	Mysterious characters.	Jarkko Hietaniemi	2002-03-10	1	-6/+6
	p4raw-id: //depot/perl@15148
*	Update the UTF-8 explanation table.	Jarkko Hietaniemi	2002-02-27	1	-2/+25
	p4raw-id: //depot/perl@14900
*	Not extending enough.	Jarkko Hietaniemi	2002-02-19	1	-2/+4
	p4raw-id: //depot/perl@14758
*	EBCDIC: SHARP S is different.	Jarkko Hietaniemi	2002-02-05	1	-1/+14
	p4raw-id: //depot/perl@14561
*	Copyright++. (Not all the toplevel *.h have one, it seems.)	Jarkko Hietaniemi	2002-01-23	1	-1/+1
	p4raw-id: //depot/perl@14391
*	AIX cpp bug: having macro arguments and character constants	Jarkko Hietaniemi	2002-01-23	1	-7/+7
	"the same" means trouble (here s and 's'). What broke now was tests 841 and 842 of t/op/pat.t, because of the ANYOF_UNICODE_FOLD_SHARP_S() in utf8.h, ccversion 5.0.1.0 (note that breakage happened only under cc_r and usethreads + useithreads). p4raw-id: //depot/perl@14379
*	Sharp S as a special treat for our German UTF-8 testers :-)	Jarkko Hietaniemi	2002-01-12	1	-0/+8
	p4raw-id: //depot/perl@14222
*	More regex and utf8 debug dumping.	Jarkko Hietaniemi	2002-01-07	1	-0/+3
	p4raw-id: //depot/perl@14114
*	Finish up (ha!) the Unicode case folding;	Jarkko Hietaniemi	2002-01-05	1	-0/+2
	enhance regex dumping code. p4raw-id: //depot/perl@14096
*	The funky final sigma casefolding.	Jarkko Hietaniemi	2001-12-23	1	-0/+5
	p4raw-id: //depot/perl@13866
*	Make using U+FDD0..U+FDEF (noncharacters since Unicode 3.1),	Jarkko Hietaniemi	2001-12-21	1	-0/+11
	U+...FFFE, U+...FFFF, and characters beyond U+10FFFF (the Unicode maximum code point) warnable offenses. p4raw-id: //depot/perl@13823
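The code point classes named in this commit can be expressed as simple range and mask tests. The sketch below uses hypothetical function names and is not the set of macros added to utf8.h; it only encodes the ranges listed in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for U+FDD0..U+FDEF and for any code point up
 * to U+10FFFF whose low 16 bits are FFFE or FFFF. */
static bool is_unicode_nonchar(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: true for values beyond the Unicode maximum. */
static bool is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}
```

The mask test works because FFFE and FFFF differ only in the lowest bit, so clearing that bit maps both to FFFE in every plane.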
*	Unadorned numbers evil.	Jarkko Hietaniemi	2001-12-13	1	-1/+6
	p4raw-id: //depot/perl@13672
*	PATCH Resubmission - was Re: [ID 20010902.001] v strings over 2*31 barf	John Peacock	2001-09-10	1	-1/+1
	Message-ID: <3B9D23D6.90BCCC25@rowman.com> p4raw-id: //depot/perl@11986