| Commit message | Author | Age | Files | Lines |
| |
| |
(and run "make regen")
| |
Consider what currently happens when the tokenizer is scanning a string.
It looks through it byte-by-byte until it finds a character that forces
it to decide to go to utf8. It then calls sv_utf8_upgrade() with the
portion of the string scanned so far.

sv_utf8_upgrade() starts over from the beginning, and scans the string
byte-by-byte until it finds a character that varies between non-utf8 and
utf8. It then calls bytes_to_utf8().

bytes_to_utf8() allocates a new string that can handle the worst case
expansion, 2n+1, of the entire string, and starts over from the
beginning, and scans the input string byte-by-byte, copying and
converting each character to the output string as it goes.

It doesn't return the size of the new string, so sv_utf8_upgrade()
assumes it is only as big as what actually got converted, throwing away
knowledge of any spare.

It then returns to the tokenizer, which immediately does a grow to get
space for the unparsed input. This is likely to cause a new string to
be allocated and copied from the one we had just created, even if that
string in actuality had enough space in it.

Thus, the invariant head portion of the string is scanned 3 times, and
probably 2 strings will be allocated and copied.

My solution to cutting this down is to do several things.

First, I added an extra flag for sv_utf8_upgrade that says don't bother
to check if the string needs to be converted to utf8, just assume it
does. This eliminates one of the passes.

I also added a new parameter to sv_utf8_upgrade that says when you
return, I want this much unused space in the string. That eliminates
the extra grow.

This was all done by renaming the current work-horse function from
sv_utf8_upgrade_flags to sv_utf8_upgrade_flags_grow() and making the
current function name a macro which calls the revised one with a 0
grow parameter.

I also improved the internal efficiency of sv_utf8_upgrade so that when
it does scan the string, it doesn't call bytes_to_utf8, but does the
conversion itself, using a fast memory copy instead of the byte-oriented
one for the invariant header, and it uses that header to get a better
estimate of the needed size of the new string, and it doesn't throw away
the knowledge of the allocated size.

And, if it is clear without scanning the whole string that the
conversion will fit in the already allocated string, it just uses that
instead of allocating and copying a new one, using the algorithm I
copied from the tokenizer. (In this case it does have to finish
scanning the whole string to get the correct size.) The comments have
details.

It still is byte-oriented. Vectorization et al. could yield
performance improvements. One idea for that is in the comments.

The patch also includes a new synonym I created which is a more accurate
name than NATIVE_TO_ASCII.
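
The approach described above can be sketched roughly as follows. This is
not the code from the patch: Utf8Buf, latin1_to_utf8_grow() and
latin1_to_utf8() are hypothetical names invented for illustration, and the
sketch assumes a plain Latin-1 byte buffer on an ASCII platform rather than
a Perl SV. It shows the invariant head being copied with one memcpy, the
size estimate doubling only the tail, the caller-requested spare space being
reserved up front, and the old entry point becoming a macro that passes a
grow of 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type (not a Perl structure). */
typedef struct {
    unsigned char *buf; /* NUL-terminated UTF-8 bytes */
    size_t len;         /* bytes used, excluding the NUL */
    size_t cap;         /* bytes allocated, so spare space is not forgotten */
} Utf8Buf;

/* Hypothetical helper: convert n Latin-1 bytes to UTF-8, leaving at least
 * `extra` unused bytes after the result. Returns 0 on success, -1 on
 * allocation failure. */
static int latin1_to_utf8_grow(const unsigned char *src, size_t n,
                               size_t extra, Utf8Buf *out)
{
    assert(src != NULL || n == 0);

    /* One scan to find the invariant (ASCII) head. */
    size_t head = 0;
    while (head < n && src[head] < 0x80)
        head++;

    /* Only the tail can expand, at most 2 bytes per input byte. */
    size_t cap = head + 2 * (n - head) + extra + 1;
    unsigned char *d = malloc(cap);
    if (d == NULL)
        return -1;

    /* Fast block copy for the head, no per-byte conversion. */
    if (head > 0)
        memcpy(d, src, head);

    /* Byte-oriented conversion for the remainder. */
    size_t o = head;
    for (size_t i = head; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            d[o++] = c;
        } else {
            d[o++] = (unsigned char)(0xC0 | (c >> 6));
            d[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    d[o] = '\0';

    out->buf = d;
    out->len = o;
    out->cap = cap;
    return 0;
}

/* The old name becomes a macro that asks for no extra space. */
#define latin1_to_utf8(src, n, out) latin1_to_utf8_grow((src), (n), 0, (out))
```

A caller that knows how much unparsed input remains would pass that count as
`extra`, so no second grow is needed after the conversion returns.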
| |
p4raw-id: //depot/perl@32793
| |
p4raw-id: //depot/perl@32237
| |
files that generate .h files, so they'll be ready
next time.
p4raw-id: //depot/perl@29695
| |
Message-Id: <20060402224657.B942.BQW10602@nifty.com>
p4raw-id: //depot/perl@27688
| |
I believe that all are now found, as redefining CopHINTS_get(c)
to (~(c)->op_private) (with corresponding changes to CopHINTS_set()
and the initialisation of PL_compiling) works.
p4raw-id: //depot/perl@27687
| |
tested by Rajarshi Das
p4raw-id: //depot/perl@26452
| |
Message-Id: <20051127170016.A786.BQW10602@nifty.com>
p4raw-id: //depot/perl@26229
| |
p4raw-id: //depot/perl@25926
| |
to uvuni_to_utf8_flags(). Move the old body to mathoms.c
p4raw-id: //depot/perl@25905
| |
argument to is_utf8_string_loc(). Correct the description of its
parameters in its POD.
p4raw-id: //depot/perl@25903
| |
Message-Id: <20051008165752.348A.BQW10602@nifty.com>
p4raw-id: //depot/perl@25716
| |
Message-ID: <42A314E4.8060608@gmail.com>
p4raw-id: //depot/perl@24730
| |
Message-ID: <429F557E.3090007@gmail.com>
p4raw-id: //depot/perl@24687
| |
Message-ID: <B356D8F434D20B40A8CEDAEC305A1F2453D653@esebe105.NOE.Nokia.com>
p4raw-id: //depot/perl@24271
| |
Message-Id: <2f14220e7101a03f7659dbe79a03b115@petdance.com>
p4raw-id: //depot/perl@24074
| |
Message-Id: <41F1801C.3080201@iki.fi>
Make buffer size estimates for utf8 case conversion less maximally
pessimistic
p4raw-id: //depot/perl@23857
| |
Message-ID: <lrmzwrae0j.fsf_-_@caliper.activestate.com>
p4raw-id: //depot/perl@23632
| |
since that would break a lot of code.) Also few
stray UTF16s, UTF32s, and "encoded in Unicode".
p4raw-id: //depot/perl@21198
| |
(Lots of Perl 5 source code archaeology was involved.)
Larry didn't make strangled noises when I showed him
the patch, either :-)
p4raw-id: //depot/perl@19242
| |
p4raw-id: //depot/perl@18807
| |
p4raw-id: //depot/perl@18801
| |
but 0xFFFE quite wrong.
p4raw-id: //depot/perl@15762
| |
p4raw-id: //depot/perl@15761
| |
be Hugo), ballooned a bit... the goal is Larry's wish that
illegal Unicode (such as U+FFFF) by default doesn't warn,
since what if somebody WANTS to create illegal Unicode?
Now getting close to this in the regex runtime.
(Also, fix more of my fixation that BOM would be U+FFFE.)
p4raw-id: //depot/perl@15689
| |
p4raw-id: //depot/perl@15148
| |
p4raw-id: //depot/perl@14900
| |
p4raw-id: //depot/perl@14758
| |
p4raw-id: //depot/perl@14561
| |
p4raw-id: //depot/perl@14391
| |
"the same" means trouble (here s and 's')
What broke now was 841 and 842 of t/op/pat.t, because of the
ANYOF_UNICODE_FOLD_SHARP_S() in utf8.h, ccversion 5.0.1.0
(note that breakage happened only under cc_r and usethreads+
useithreads)
p4raw-id: //depot/perl@14379
| |
p4raw-id: //depot/perl@14222
| |
p4raw-id: //depot/perl@14114
| |
enhance regex dumping code.
p4raw-id: //depot/perl@14096
| |
p4raw-id: //depot/perl@13866
| |
U+...FFFE, U+...FFFF, and characters beyond U+10FFFF
(the Unicode maximum code point) warnable offenses.
p4raw-id: //depot/perl@13823
| |
p4raw-id: //depot/perl@13672
| |
Message-ID: <3B9D23D6.90BCCC25@rowman.com>
p4raw-id: //depot/perl@11986
| |
and the Perl will be built to do that by default (adding that
will break scripts having non-UTF-8 binary data, such as Latin-1.)
p4raw-id: //depot/perl@11656
| |
p4raw-id: //depot/perl@11652
| |
Message-Id: <200107061339.JAA12582@bottesini.harvard.edu>
p4raw-id: //depot/perl@11184
| |
patch: rename HINT_BYTE and IN_BYTE to HINT_BYTES and IN_BYTES
to match the pragma name; various robustness cleanups.
p4raw-id: //depot/perl@10339
| |
Message-Id: <5.0.2.1.1.20010421192107.01ce5a50@ix.netcorps.com>
p4raw-id: //depot/perl@9775
| |
into mainline.
fix a broken workaround for Borland compiler in change#4739
(caused weird "short reads" on DATA, which caused op/misc.t to fail)
nits spotted by Borland compiler
avoid redefinition warnings under Borland 5.02
various nits identified by the Borland 5.5 compiler; remove suppression
of a few warnings
p4raw-link: @9496 on //depot/maint-5.6/perl: 9d05ad52b0aa7d1f7d147da0c4dbc14de5fe4a37
p4raw-link: @9495 on //depot/maint-5.6/perl: 759997f1e719f33541bed70dd7f79bfa26a930b3
p4raw-link: @9494 on //depot/maint-5.6/perl: 01b59bde1cb7ff62776f3b83c0f2575c79a950a6
p4raw-link: @9493 on //depot/maint-5.6/perl: eea7051a8d4ef81c032143ab3193bc1240ab2e8f
p4raw-link: @4739 on //depot/perl: c39cd00800303e8967294e98aa4c427a1872a251
p4raw-id: //depot/perl@9497
p4raw-integrated: from //depot/maint-5.6/perl@9492 'merge in' sv.c
utf8.h (@9288..) toke.c (@9292..) ext/File/Glob/bsd_glob.c
(@9415..) win32/makefile.mk (@9426..) win32/win32.h (@9494..)
| |
- Lose the extra level of function on ASCII.
- spotted a chr(0) issue in sv.c
- re-work of UTF-X tr/// ranges to work in Unicode
space. Still issues with the "0xff is illegal UTF-8" hack.
- Yet another ad hoc utf8 'upgrade' in op.c recoded
(why do it once when you can do it all over the place :-(
- Enable HINTS_UTF8 on EBCDIC - then ignore it in toke.c,
need utf8.pm for swashes.
- Simplified and commented scan_const() in toke.c
Still something wrong regexp and tr (swashes?).
p4raw-id: //depot/perlio@9267
| |
p4raw-id: //depot/perlio@9246
| |
encoding on EBCDIC platforms. This has the property that U+0000..U+009F,
i.e. a superset of ASCII, are invariant under the encoding. This is EBCDIC
friendly, as an encoded string can be looked at as being EBCDIC by the lexer,
sprintf("%d",...) etc. in the same manner that a UTF-8 string can be
considered ASCII on ASCII machines.
- re-arrange utf8.h to get ASCII specific vs Unicode generic bits
separate.
- Add some more macros to comprehend different shift amounts and
possible swizzle in UTF-EBCDIC vs UTF-8. Change utf8.c to use them.
- add utfebcdic.h which provides UTF-EBCDIC versions of the macros,
and conditionally #include it.
EBCDIC build as yet untested. ASCII still fails the one test.
p4raw-id: //depot/perlio@9185
| |
p4raw-id: //depot/perlio@9184