path: root/utf8.h
Commit message | Author | Age | Files | Lines
* It's UTF-8, not UTF8. (Note: not s/UTF-8/UTF8/, since that would break a lot of code.) Also few stray UTF16s, UTF32s, and "encoded in Unicode".
  (Jarkko Hietaniemi, 2003-09-12, 1 file, -2/+2)
  p4raw-id: //depot/perl@21198
* Fix up Larry's copyright statements to my best knowledge. (Lots of Perl 5 source code archaeology was involved.) Larry didn't make strangled noises when I showed him the patch, either :-)
  (Jarkko Hietaniemi, 2003-04-16, 1 file, -1/+1)
  p4raw-id: //depot/perl@19242
* Reverse copyright update (#18801) for files not changed in 2003.
  (Hugo van der Sanden, 2003-03-02, 1 file, -1/+1)
  p4raw-id: //depot/perl@18807
* Update all copyrights to 2003, from Jarkko
  (Hugo van der Sanden, 2003-03-02, 1 file, -1/+1)
  p4raw-id: //depot/perl@18801
* As noted by Philip Newton: nothing wrong with BOM, but 0xFFFE quite wrong.
  (Jarkko Hietaniemi, 2002-04-06, 1 file, -14/+12)
  p4raw-id: //depot/perl@15762
* Explain the "gaps" in the UTF-8 encoding.
  (Jarkko Hietaniemi, 2002-04-06, 1 file, -0/+4)
  p4raw-id: //depot/perl@15761
* What started as a small nit (the charnames test, nit found by Hugo), ballooned a bit... the goal is Larry's wish that illegal Unicode (such as U+FFFF) by default doesn't warn, since what if somebody WANTS to create illegal Unicode? Now getting close to this in the regex runtime. (Also, fix more of my fixation that BOM would be U+FFFE.)
  (Jarkko Hietaniemi, 2002-04-02, 1 file, -5/+5)
  p4raw-id: //depot/perl@15689
* Mysterious characters.
  (Jarkko Hietaniemi, 2002-03-10, 1 file, -6/+6)
  p4raw-id: //depot/perl@15148
* Update the UTF-8 explanation table.
  (Jarkko Hietaniemi, 2002-02-27, 1 file, -2/+25)
  p4raw-id: //depot/perl@14900
* Not extending enough.
  (Jarkko Hietaniemi, 2002-02-19, 1 file, -2/+4)
  p4raw-id: //depot/perl@14758
* EBCDIC: SHARP S is different.
  (Jarkko Hietaniemi, 2002-02-05, 1 file, -1/+14)
  p4raw-id: //depot/perl@14561
* Copyright++. (Not all the toplevel *.h have one, it seems.)
  (Jarkko Hietaniemi, 2002-01-23, 1 file, -1/+1)
  p4raw-id: //depot/perl@14391
* AIX cpp bug: having macro arguments and character constants "the same" means trouble (here s and 's').
  What broke now was 841 and 842 of t/op/pat.t, because of the ANYOF_UNICODE_FOLD_SHARP_S() in utf8.h, ccversion 5.0.1.0 (note that breakage happened only under cc_r and usethreads+useithreads).
  (Jarkko Hietaniemi, 2002-01-23, 1 file, -7/+7)
  p4raw-id: //depot/perl@14379
* Sharp S as a special treat for our German UTF-8 testers :-)
  (Jarkko Hietaniemi, 2002-01-12, 1 file, -0/+8)
  p4raw-id: //depot/perl@14222
* More regex and utf8 debug dumping.
  (Jarkko Hietaniemi, 2002-01-07, 1 file, -0/+3)
  p4raw-id: //depot/perl@14114
* Finish up (ha!) the Unicode case folding; enhance regex dumping code.
  (Jarkko Hietaniemi, 2002-01-05, 1 file, -0/+2)
  p4raw-id: //depot/perl@14096
* The funky final sigma casefolding.
  (Jarkko Hietaniemi, 2001-12-23, 1 file, -0/+5)
  p4raw-id: //depot/perl@13866
* Make using U+FDD0..U+FDEF (noncharacters since Unicode 3.1), U+...FFFE, U+...FFFF, and characters beyond U+10FFFF (the Unicode maximum code point) warnable offenses.
  (Jarkko Hietaniemi, 2001-12-21, 1 file, -0/+11)
  p4raw-id: //depot/perl@13823
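  The code point classes named in the entry above can be tested with a few comparisons. The following is a minimal sketch, not the code that commit added to utf8.h; the function name `is_warnable_code_point` is hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a macro from utf8.h): returns true for the
 * code points the commit message lists as warnable. */
static bool is_warnable_code_point(uint32_t cp)
{
    /* Beyond U+10FFFF, the Unicode maximum code point. */
    if (cp > 0x10FFFF)
        return true;

    /* U+FDD0..U+FDEF, noncharacters since Unicode 3.1. */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;

    /* U+...FFFE and U+...FFFF: the last two code points of every plane.
     * Masking off bit 0 maps both endings onto 0xFFFE. */
    return (cp & 0xFFFE) == 0xFFFE;
}
```

  Note that U+FEFF (the BOM) is deliberately not flagged, which matches the later entries saying there is nothing wrong with the BOM while 0xFFFE is quite wrong.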
* Unadorned numbers evil.
  (Jarkko Hietaniemi, 2001-12-13, 1 file, -1/+6)
  p4raw-id: //depot/perl@13672
* PATCH Resubmission - was Re: [ID 20010902.001] v strings over 2*31 barf
  (John Peacock, 2001-09-10, 1 file, -1/+1)
  Message-ID: <3B9D23D6.90BCCC25@rowman.com>
  p4raw-id: //depot/perl@11986
* If you want you can now add -DUSE_UTF8_SCRIPTS to your cflags and the Perl will be built to do that by default (adding that will break scripts having non-UTF-8 binary data, such as Latin-1.)
  (Jarkko Hietaniemi, 2001-08-12, 1 file, -0/+9)
  p4raw-id: //depot/perl@11656
* There is no IN_UTF8.
  (Jarkko Hietaniemi, 2001-08-12, 1 file, -1/+0)
  p4raw-id: //depot/perl@11652
* QNX patch extended for NTO
  (Norton T. Allen, 2001-07-06, 1 file, -1/+3)
  Message-Id: <200107061339.JAA12582@bottesini.harvard.edu>
  p4raw-id: //depot/perl@11184
* Salvage bits and pieces from the experimental 'utf8 everywhere' patch: rename HINT_BYTE and IN_BYTE to HINT_BYTES and IN_BYTES to match the pragma name; various robustness cleanups.
  (Jarkko Hietaniemi, 2001-05-31, 1 file, -4/+4)
  p4raw-id: //depot/perl@10339
* Typo in utf8.h
  (Jesús Quiroga, 2001-04-21, 1 file, -1/+1)
  Message-Id: <5.0.2.1.1.20010421192107.01ce5a50@ix.netcorps.com>
  p4raw-id: //depot/perl@9775
* Integrate changes #9493,9494,9495,9496 from maintperl into mainline.
  (Jarkko Hietaniemi, 2001-04-01, 1 file, -1/+1)
  fix a broken workaround for Borland compiler in change#4739 (caused weird "short reads" on DATA, which caused op/misc.t to fail)
  nits spotted by Borland compiler
  avoid redefinition warnings under Borland 5.02
  various nits identified by the Borland 5.5 compiler; remove suppression of a few warnings
  p4raw-link: @9496 on //depot/maint-5.6/perl: 9d05ad52b0aa7d1f7d147da0c4dbc14de5fe4a37
  p4raw-link: @9495 on //depot/maint-5.6/perl: 759997f1e719f33541bed70dd7f79bfa26a930b3
  p4raw-link: @9494 on //depot/maint-5.6/perl: 01b59bde1cb7ff62776f3b83c0f2575c79a950a6
  p4raw-link: @9493 on //depot/maint-5.6/perl: eea7051a8d4ef81c032143ab3193bc1240ab2e8f
  p4raw-link: @4739 on //depot/perl: c39cd00800303e8967294e98aa4c427a1872a251
  p4raw-id: //depot/perl@9497
  p4raw-integrated: from //depot/maint-5.6/perl@9492 'merge in' sv.c utf8.h (@9288..) toke.c (@9292..) ext/File/Glob/bsd_glob.c (@9415..) win32/makefile.mk (@9426..) win32/win32.h (@9494..)
* More EBCDIC stuff:
  (Nick Ing-Simmons, 2001-03-20, 1 file, -0/+4)
  - Lose the extra level of function on ASCII.
  - spotted a chr(0) issue in sv.c
  - re-work of UTF-X tr/// ranges to work in Unicode space. Still issues with the "0xff is illegal UTF-8" hack.
  - Yet another ad hoc utf8 'upgrade' in op.c recoded (why do it once when you can do it all over the place :-(
  - Enable HINTS_UTF8 on EBCDIC - then ignore it in toke.c, need utf8.pm for swashes.
  - Simplified and commented scan_const() in toke.c
  Still something wrong regexp and tr (swashes?).
  p4raw-id: //depot/perlio@9267
* More EBCDIC fixes.
  (Nick Ing-Simmons, 2001-03-19, 1 file, -1/+3)
  p4raw-id: //depot/perlio@9246
* Infrastructure to use UTF-EBCDIC rather than UTF-8 as the internal encoding on EBCDIC platforms.
  (Nick Ing-Simmons, 2001-03-17, 1 file, -68/+69)
  This has the property that U+0000..U+009F, i.e. a superset of ASCII, are invariant under the encoding. This is EBCDIC friendly as an encoded string can be looked at as being EBCDIC by lexer sprintf("%d",...) etc. in the same manner that a UTF-8 string can be considered ASCII on ASCII machines.
  - re-arrange utf8.h to get ASCII specific vs Unicode generic bits separate.
  - Add some more macros to comprehend different shift amounts and possible swizzle in UTF-EBCDIC vs UTF-8. Change utf8.c to use them.
  - add utfebcdic.h which provides UTF-EBCDIC versions of the macros, and conditionally #include it.
  EBCDIC build as yet untested. ASCII still fails the one test.
  p4raw-id: //depot/perlio@9185
* Minor naming change UTF8_IS_ASCII => UTF8_IS_INVARIANT
  (Nick Ing-Simmons, 2001-03-17, 1 file, -0/+1)
  p4raw-id: //depot/perlio@9184
* EBCDIC Fixes.
  (Nick Ing-Simmons, 2001-03-16, 1 file, -9/+13)
  p4raw-id: //depot/perlio@9180
* #ifdef'ed out code for 'USE_BYTES_DOWNGRADES' case.
  (Nick Ing-Simmons, 2001-03-12, 1 file, -0/+4)
  p4raw-id: //depot/perlio@9110
* EBCDIC sanity - phase I
  (Nick Ing-Simmons, 2001-03-10, 1 file, -11/+7)
  - rename utf8/uv functions to indicate what sort of uv they provide (uvuni/uvchr)
  - use utf8n_xxxx (c.f. pvn) for forms which take length.
  - back out vN.N and $^V exceptions to e2a/a2e
  - make "locale" isxxx macros be uvchr (may be redundant?)
  Not clear yet that toUPPER_uni et al. return being handled correctly. The tr// and regexp stuff still needs an audit, assumption is they are working in Unicode space. Need to provide v5.6 names for XS modules (decide is uni or chr ?).
  p4raw-id: //depot/perlio@9096
* Re: Unicode/EBCDIC
  (Peter Prymmer, 2001-03-09, 1 file, -0/+19)
  Message-ID: <Pine.OSF.4.10.10103081617390.377472-100000@aspara.forte.com>
  p4raw-id: //depot/perl@9082
* UTF-8 documentation.
  (Jarkko Hietaniemi, 2001-02-11, 1 file, -0/+16)
  p4raw-id: //depot/perl@8770
* Macrofy a magic UTF-8 test.
  (Jarkko Hietaniemi, 2001-01-31, 1 file, -0/+1)
  p4raw-id: //depot/perl@8647
* Unify UTF-8 malformedness handling.
  (Jarkko Hietaniemi, 2001-01-05, 1 file, -10/+12)
  p4raw-id: //depot/perl@8323
* Bump up Larry's copyright.
  (Jarkko Hietaniemi, 2001-01-01, 1 file, -1/+1)
  p4raw-id: //depot/perl@8289
* (Retracted by #8264) More join() testing which was good because it revealed a bug in #8248 (the UTF8_EIGHT_BIT_LO() was wrong).
  (Jarkko Hietaniemi, 2000-12-29, 1 file, -3/+3)
  p4raw-id: //depot/perl@8249
* (Retracted by #8264)
  (Jarkko Hietaniemi, 2000-12-29, 1 file, -5/+8)
  Externally: join() was still quite UTF-8-unaware.
  Internally: sv_catsv() wasn't quite okay on UTF-8, it assumed that the only cases to care about are byte+byte and byte+character.
  TODO: See how well pp_concat() could be implemented in terms of sv_catsv().
  p4raw-id: //depot/perl@8248
* Use the UTF8 macros a bit. They can't be used with abandon everywhere because we do generate illegal UTF-8 in some situations. This is of course naughty.
  (Jarkko Hietaniemi, 2000-12-08, 1 file, -0/+5)
  p4raw-id: //depot/perl@8033
* Introduce macros for UTF8 decoding.
  (Jarkko Hietaniemi, 2000-12-08, 1 file, -1/+16)
  p4raw-id: //depot/perl@8028
* UINT64_C() work continues.
  (Jarkko Hietaniemi, 2000-11-15, 1 file, -2/+0)
  p4raw-id: //depot/perl@7700
* Use UINT64_C().
  (Jens Hamisch, 2000-11-15, 1 file, -1/+5)
  Subject: [ID 20001114.006] 5.7.0-7680 Solaris 8, 64 bit, utf8 patch
  Message-Id: <20001114191623.G20559@Strawberry.COM>
  p4raw-id: //depot/perl@7691
* [ID 20001113.003] utf8_to_uv on malformed utf returns wrong values
  (Yitzchak Scott-Thoennes, 2000-11-14, 1 file, -0/+2)
  Message-Id: <200011132249.eADMnek09679@garcia.efn.org>
  p4raw-id: //depot/perl@7677
* Allow poking holes at the UTF-8 decoding strictness.
  (Jarkko Hietaniemi, 2000-10-25, 1 file, -1/+12)
  p4raw-id: //depot/perl@7438
* Rename UTF8LEN() to be UNISKIP(), too confusing to have UTF8LEN() and UTF8SKIP().
  (Jarkko Hietaniemi, 2000-10-25, 1 file, -2/+2)
  p4raw-id: //depot/perl@7437
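  The two names in the entry above are easy to mix up because they answer different questions. As far as the macros are generally understood (an assumption, since the log itself does not define them), UNISKIP() maps a code point to the number of bytes needed to encode it, while UTF8SKIP() maps the first byte of an already encoded character to that character's length. A minimal sketch of the distinction for standard UTF-8 of up to four bytes follows; the function names are hypothetical, and Perl's own macros also cover longer extended sequences that this sketch does not.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the UNISKIP() idea: bytes needed to encode
 * the code point cp in standard UTF-8. Returns 0 beyond U+10FFFF.
 * Surrogates are not treated specially in this sketch. */
static size_t bytes_to_encode(uint32_t cp)
{
    if (cp < 0x80)      return 1;   /* 7 bits of payload  */
    if (cp < 0x800)     return 2;   /* 11 bits of payload */
    if (cp < 0x10000)   return 3;   /* 16 bits of payload */
    if (cp <= 0x10FFFF) return 4;   /* 21 bits of payload */
    return 0;
}

/* Hypothetical stand-in for the UTF8SKIP() idea: sequence length implied
 * by a lead byte. Returns 0 for continuation bytes (10xxxxxx) and for
 * bytes that cannot start a sequence of at most four bytes. It does not
 * reject overlong lead bytes such as 0xC0. */
static size_t bytes_from_lead(uint8_t lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}
```

  For a well-formed character the two agree: U+20AC needs three bytes, and its encoding starts with 0xE2, a three-byte lead.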
* Make the UTF-8 decoding stricter and more verbose when malformation happens.
  (Jarkko Hietaniemi, 2000-10-24, 1 file, -1/+3)
  This involved adding an argument to utf8_to_uv_chk(), which involved changing its prototype, and prefer STRLEN over I32 for the UTF-8 length, which as a domino effect necessitated changing the prototypes of scan_bin(), scan_oct(), scan_hex(), and reg_uni().
  The stricter UTF-8 decoding checking uses Markus Kuhn's UTF-8 Decode Stress Tester from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
  p4raw-id: //depot/perl@7416
* Make ~(chr(a).chr(b)) eq chr(~a).chr(~b) on utf8.
  (Simon Cozens, 2000-10-15, 1 file, -0/+18)
  Subject: [PATCH] Re: [ID 20000918.005] ~ on wide chars
  Message-ID: <20001014205213.A9645@pembro4.pmb.ox.ac.uk>
  p4raw-id: //depot/perl@7235
* Tweak #7153.
  (Jarkko Hietaniemi, 2000-10-06, 1 file, -2/+7)
  p4raw-id: //depot/perl@7154