diff options
author | Jarkko Hietaniemi <jhi@iki.fi> | 2003-06-30 09:39:29 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2003-06-30 09:39:29 +0000 |
commit | 1aa6899ff053e0baf49f50bd5410ea0984993c58 (patch) | |
tree | 4f6d84bc0c92fcb088c5ea187eaca2a655bc747c /ext/Encode | |
parent | 6d686c581ff926a22874273b88cec04fcb536a31 (diff) | |
download | perl-1aa6899ff053e0baf49f50bd5410ea0984993c58.tar.gz |
Integrate:
[ 19857]
Regen Changes.
[ 19858]
perlhack update, by Steve Grazzini
about macro support in gdb and gcc.
[ 19859]
Hash/Util.t and Encode/t/Aliases.t seem to be having
random failures. To make these easier to reproduce,
add a variable, PERL_HASH_SEED_DEBUG, to display the
hash seed. E.g. in Debian/x86 Linux 3.0 PERL_HASH_SEED
of 82972356 makes the first one to fail.
[ 19860]
Subject: Re: Change 19854: Bite the bullet and apply the hash randomisation patch.
From: Tim Bunce <Tim.Bunce@pobox.com>
Date: Thu, 26 Jun 2003 10:53:22 +0100
Message-ID: <20030626095322.GE97463@dansat.data-plan.com>
[ 19861]
Do not obey PERL_HASH_SEED or PERL_HASH_SEED_DEBUG
if tainting-- but is this a good thing or a bad thing?
(At least it makes debugging lib/Hash/Util.t harder,
since it has, for no apparent good reason, -T: one must
make a copy of it without the -T.)
[ 19862]
Make doing_taint() always available (though not
part of the public API).
[ 19863]
Introduce (global) variable PL_earlytaint which
is set very early in main(), before perl_parse()
has been called and PL_tainting (or PL_taint_warn)
might have been set.
[ 19864]
Use the PL_earlytaint. (PL_earlytaint is a global,
not per-interp, since perl_construct() is not passed
the argc, argv, and therefore it can't set the per-interp
PL_tainting.)
[ 19865]
atoi() doesn't cut the mustard if the PERL_HASH_SEED
is larger than INT_MAX (atoi() returns -1 in that case).
[ 19866]
Some warnings about the (im)proper uses of the hash randomisation.
[ 19867]
The two-for-loops is no more a valid way to walk through
a hash (this was the reason the Hash/Util.t intermittently
failed, the two-loop didn't find all the SVs of the HV).
[ 19868]
Integrate mainline
[ 19869]
Fix test count, by Abe Timmerman.
[ 19870]
Two debugging patches.
The first allows to hold symbolic switches in $^D
and more generally fixes assignment to $^D. The
second one improves the information given by -Dl.
Subject: [PATCH] allow $^D = "flags"
From: Dave Mitchell <davem@fdgroup.com>
Date: Fri, 27 Jun 2003 22:26:24 +0100
Message-ID: <20030627212624.GB12887@fdgroup.com>
Subject: [PATCH] make -Dl show more scope info
From: Dave Mitchell <davem@fdgroup.com>
Date: Fri, 27 Jun 2003 23:00:36 +0100
Message-ID: <20030627220036.GC12887@fdgroup.com>
[ 19871]
Subject: [Encode] pre-1.97 patches
From: Dan Kogai <dankogai@dan.co.jp>
Date: Sat, 28 Jun 2003 01:20:59 +0900
Message-Id: <56D5BFEE-A8BB-11D7-9092-000393AE4244@dan.co.jp>
[ 19872]
Some clarification about the current semantics of CHECK and
INIT blocks. See bug [perl #22826].
[ 19873]
Using $1 without testing success of the regexp, bad.
[ 19874]
Retract #19867; the bug was really much simpler:
the < max must be <= max instead.
[ 19875]
Duh.
[ 19876]
Subject: Re: your malloc patches
From: Ilya Zakharevich <ilya@Math.Berkeley.EDU>
Date: Fri, 27 Jun 2003 06:54:06 -0700
Message-ID: <20030627135406.GA8914@math.berkeley.edu>
More malloc patches: now they seem to work even in Tru64.
[ 19877]
The #19842 is no more needed thanks to #19876,
and the #19842 was wrong anyway (it affected
only the threaded case.)
[ 19878]
Move the PL_earlytaint initialization to the PERL_SYS_INIT()
as per suggestion from Sarathy.
[ 19879]
Another spot where a zero $test{$max} can make things go boom.
[ 19880]
argc, argv.
[ 19881]
More coffee...
[ 19882]
Perl_doing_taint must be public, for programs that embed perl
[ 19883]
More on the macro debugging and expansion.
[ 19884]
The joy of $0. Undoing the #16399 makes Andreas'
tests (see [perl #22811]) pass (yes, padding with space instead
of nul makes no sense, but that seems to work, maybe Linux does
some deep magic in ps(1)?); moving the PL_origalen computation
earlier makes also the threaded-first case fully pass.
But in general modifying the argv[] is very non-portable.
(e.g. in Tru64 it seems to be limited to the size of the
original argv[0] since the argv[] are not contiguous?)
Everybody should just have setproctitle().
[ 19885]
Fix a faulty alias.
[ 19886]
Misc Pod Nits.
[ 19887]
$0 test tweaks from Andreas.
[ 19888]
$0 doc tweakage.
[ 19889]
The 'contiguous' test for argv[], envp[] was bogus
since those need not be in memory end-to-end, e.g.
in Tru64 they are aligned by eight. Loosen the test
so that 'contiguousness' is fulfilled if the elements
are within PTRSIZE alignment. This makes Tru64 to pass
the join.t, too.
[ 19890]
int is not UV.
p4raw-link: @19890 on //depot/perl: 7d8e7db38dc74a9a7ddcc48566f03f2b6af6f737
p4raw-link: @19889 on //depot/perl: 3cb9023dc910d8a9abbd8d44e501f6e492155eb5
p4raw-link: @19888 on //depot/perl: f9cbb277dec3cb2700132dedd25b05ea72cda45a
p4raw-link: @19887 on //depot/perl: ecce83c2318389c6dd5770c975354bb2411bd50f
p4raw-link: @19886 on //depot/perl: e13efe3ceea1a416bee536860751edb48e6bfcb3
p4raw-link: @19885 on //depot/perl: b9531c19967f04908d6f8236ceb2296ad6358488
p4raw-link: @19884 on //depot/perl: 54bfe034ba642318cf2c7d0b37579f30adef144a
p4raw-link: @19883 on //depot/perl: 52d59bef96c881381bce1bcb84a8c08ce48c2544
p4raw-link: @19882 on //depot/perl: d20fa10417f31b8f4d60b68adce91b91f9d3cd62
p4raw-link: @19881 on //depot/perl: c4b2e1b65d11779e63c2d42d6b840c9078181338
p4raw-link: @19880 on //depot/perl: f98d840496025d33749be7bdcdba70b97bd142b8
p4raw-link: @19879 on //depot/perl: a32c473717aa00461cd55052bb0345aa311e1123
p4raw-link: @19878 on //depot/perl: 1199dd43248b0956628341f2a63939a8378c8016
p4raw-link: @19877 on //depot/perl: 24130e51d52fd22992dd62e432895a9115f3a585
p4raw-link: @19876 on //depot/perl: d0bbed784b85a44e92a8a0e3d4046ce7f236db02
p4raw-link: @19875 on //depot/perl: f3f91eeab5d8feea9ff5606711dfaaa7851308c1
p4raw-link: @19874 on //depot/perl: 3a676441c258924612d07e12c0faa7606e5bbba2
p4raw-link: @19873 on //depot/perl: 2275acdc2a5e9bfc8338ccf52a5a82e52653b1b0
p4raw-link: @19872 on //depot/perl: ca62f0fc957407f48588d44995309a50a80e45ab
p4raw-link: @19871 on //depot/perl: 23f3589e21445e9141901c2894bc97b457493332
p4raw-link: @19870 on //depot/perl: b4ab917c3d812d8e61d365bfa48d9bf7675bc113
p4raw-link: @19869 on //depot/perl: 1d26cd9ec5ffb2d7823fb6941a001dc8e9a6d1c6
p4raw-link: @19868 on //depot/perlio: c9908cac60bbb191807f0d3fafd9567a2304b7e9
p4raw-link: @19867 on //depot/perl: 871661ef06c9321a672dd21cf8e97cec33e2c5ee
p4raw-link: @19866 on //depot/perl: 7b3f70378c41657f3e0c917f322e2cda58f33b5e
p4raw-link: @19865 on //depot/perl: bf1e01904b621fce6a1d1e1bcf187334cf1b1e04
p4raw-link: @19864 on //depot/perl: ed085813cee9c22e7ad548a324c6d8f6d7d726d2
p4raw-link: @19863 on //depot/perl: af419de789419c9e4520d33654a91564094b407a
p4raw-link: @19862 on //depot/perl: a06433151b0f1a3a12ccc4d2629feb511ea9fce6
p4raw-link: @19861 on //depot/perl: d0d2ba8fa784ab4c88f64ef679c2c1ff6203412a
p4raw-link: @19860 on //depot/perl: 3debabd9ba8d62a4b7656b07d06b582de8063b12
p4raw-link: @19859 on //depot/perl: 2191697ea9da49f0c020a5bcb1eb2a2e9d574a4e
p4raw-link: @19858 on //depot/perl: ea031e66439c986384865daf3860bb9bb815a8fa
p4raw-link: @19857 on //depot/maint-5.8/perl: 6d686c581ff926a22874273b88cec04fcb536a31
p4raw-id: //depot/maint-5.8/perl@19891
p4raw-integrated: from //depot/perl@19857 'copy in' pod/perlretut.pod
(@18299..) pod/perlhack.pod (@19211..)
ext/Encode/lib/Encode/Guess.pm (@19325..) pod/perlmod.pod
(@19425..) scope.h (@19431..) t/op/magic.t (@19452..)
ext/Encode/lib/Encode/Alias.pm (@19578..) mpeix/mpeixish.h
(@19602..) ext/threads/t/join.t (@19706..) lib/Test/Harness.pm
(@19766..) ext/Encode/Changes ext/Encode/Encode.pm (@19811..)
epoc/epocish.h plan9/plan9ish.h unixish.h vms/vmsish.h
(@19831..) malloc.c (@19834..) hints/dec_osf.sh (@19842..)
t/comp/require.t (@19851..) INSTALL pod/perlsec.pod (@19854..)
'ignore' miniperlmain.c (@19242..) 'merge in' ext/B/B.pm
(@18856..) cop.h (@19242..) global.sym (@19431..) perlvars.h
(@19499..) hv.c (@19632..) mg.c pod/perlvar.pod (@19769..)
dosish.h os2/os2ish.h (@19831..) embed.fnc embed.h proto.h
(@19843..) embedvar.h perl.c perl.h perlapi.h pod/perlrun.pod
(@19854..)
Diffstat (limited to 'ext/Encode')
-rw-r--r-- | ext/Encode/Changes | 10 | ||||
-rw-r--r-- | ext/Encode/Encode.pm | 2 | ||||
-rw-r--r-- | ext/Encode/lib/Encode/Alias.pm | 2 | ||||
-rw-r--r-- | ext/Encode/lib/Encode/Guess.pm | 124 |
4 files changed, 79 insertions, 59 deletions
diff --git a/ext/Encode/Changes b/ext/Encode/Changes index 18f5788e92..7251f5d365 100644 --- a/ext/Encode/Changes +++ b/ext/Encode/Changes @@ -3,6 +3,16 @@ # $Id: Changes,v 1.96 2003/06/18 09:29:02 dankogai Exp $ # $Revision: 1.96 $ $Date: 2003/06/18 09:29:02 $ +! lib/Encode/Guess.pm + $Encode::Guess::NoUTFAutoGuess is added so you can turn off + automatic utf(8|16|32) guessing -- originally by Autrijus + Message-Id: <20030626162731.GA2077@not.autrijus.org> +! Encode.pm + Addressed the following; + Subject: [perl #22835] FB_QUIET doesn't work with Encode::encode + Message-Id: <rt-22835-59975.6.8650775354304@rt.perl.org> + +1.96 2003/06/18 09:29:02 ! lib/Encode/JP/JP.pm t/guess.t m/(...)/ in void context then $1 is considered a Bad Thing Message-Id: <B5AB34D0-A019-11D7-AF03-000393AE4244@dan.co.jp> diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm index 57bcc2b0d2..db74b6a194 100644 --- a/ext/Encode/Encode.pm +++ b/ext/Encode/Encode.pm @@ -147,7 +147,7 @@ sub encode($$;$) Carp::croak("Unknown encoding '$name'"); } my $octets = $enc->encode($string,$check); - return undef if ($check && length($string)); + $_[1] = $string if $check; return $octets; } diff --git a/ext/Encode/lib/Encode/Alias.pm b/ext/Encode/lib/Encode/Alias.pm index d684ced9ac..70b3dd8714 100644 --- a/ext/Encode/lib/Encode/Alias.pm +++ b/ext/Encode/lib/Encode/Alias.pm @@ -204,7 +204,7 @@ sub init_aliases # CP936 doesn't have vendor-addon for GBK, so they're identical. define_alias( qr/^gbk$/i => '"cp936"'); # This fixes gb2312 vs. euc-cn confusion, practically - define_alias( qr/\bGB[-_ ]?2312(?:\D.*$|$)/i => '"euc-cn"' ); + define_alias( qr/\bGB[-_ ]?2312(?!-?raw)/i => '"euc-cn"' ); # for Encode::JP define_alias( qr/\bjis$/i => '"7bit-jis"' ); define_alias( qr/\beuc.*jp$/i => '"euc-jp"' ); diff --git a/ext/Encode/lib/Encode/Guess.pm b/ext/Encode/lib/Encode/Guess.pm index fc8d267d02..5858f819cd 100644 --- a/ext/Encode/lib/Encode/Guess.pm +++ b/ext/Encode/lib/Encode/Guess.pm @@ -18,6 +18,7 @@ sub needs_lines { 1 } sub perlio_ok { 0 } our @EXPORT = qw(guess_encoding); +our $NoUTFAutoGuess = 0; sub import { # Exporter not used so we do it on our own my $callpkg = caller; @@ -70,75 +71,80 @@ sub guess { return unless defined $octet and length $octet; # cheat 0: utf8 flag; - Encode::is_utf8($octet) and return find_encoding('utf8'); + if ( Encode::is_utf8($octet) ) { + return find_encoding('utf8') unless $NoUTFAutoGuess; + Encode::_utf8_off($octet); + } # cheat 1: BOM use Encode::Unicode; - my $BOM = unpack('n', $octet); - return find_encoding('UTF-16') - if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe)); - $BOM = unpack('N', $octet); - return find_encoding('UTF-32') - if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000)); + unless ($NoUTFAutoGuess) { + my $BOM = unpack('n', $octet); + return find_encoding('UTF-16') + if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe)); + $BOM = unpack('N', $octet); + return find_encoding('UTF-32') + if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000)); + if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE) + my $utf; + my ($be, $le) = (0, 0); + if ($octet =~ /\x00\x00/o){ # UTF-32(BE|LE) assumed + $utf = "UTF-32"; + for my $char (unpack('N*', $octet)){ + $char & 0x0000ffff and $be++; + $char & 0xffff0000 and $le++; + } + }else{ # UTF-16(BE|LE) assumed + $utf = "UTF-16"; + for my $char (unpack('n*', $octet)){ + $char & 0x00ff and $be++; + $char & 0xff00 and $le++; + } + } + $DEBUG and warn "$utf, be == $be, le == $le"; + $be == $le + and return + "Encodings ambiguous between $utf BE and LE ($be, $le)"; + $utf .= ($be > $le) ? 'BE' : 'LE'; + return find_encoding($utf); + } + } my %try = %{$obj->{Suspects}}; for my $c (@_){ my $e = find_encoding($c) or die "Unknown encoding: $c"; $try{$e->name} = $e; $DEBUG and warn "Added: ", $e->name; } - if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE) - my $utf; - my ($be, $le) = (0, 0); - if ($octet =~ /\x00\x00/o){ # UTF-32(BE|LE) assumed - $utf = "UTF-32"; - for my $char (unpack('N*', $octet)){ - $char & 0x0000ffff and $be++; - $char & 0xffff0000 and $le++; - } - }else{ # UTF-16(BE|LE) assumed - $utf = "UTF-16"; - for my $char (unpack('n*', $octet)){ - $char & 0x00ff and $be++; - $char & 0xff00 and $le++; + my $nline = 1; + for my $line (split /\r\n?|\n/, $octet){ + # cheat 2 -- \e in the string + if ($line =~ /\e/o){ + my @keys = keys %try; + delete @try{qw/utf8 ascii/}; + for my $k (@keys){ + ref($try{$k}) eq 'Encode::XS' and delete $try{$k}; } } - $DEBUG and warn "$utf, be == $be, le == $le"; - $be == $le - and return "Encodings ambiguous between $utf BE and LE ($be, $le)"; - $utf .= ($be > $le) ? 'BE' : 'LE'; - return find_encoding($utf); - }else{ - my $nline = 1; - for my $line (split /\r\n?|\n/, $octet){ - # cheat 2 -- \e in the string - if ($line =~ /\e/o){ - my @keys = keys %try; - delete @try{qw/utf8 ascii/}; - for my $k (@keys){ - ref($try{$k}) eq 'Encode::XS' and delete $try{$k}; - } - } - my %ok = %try; - # warn join(",", keys %try); - for my $k (keys %try){ - my $scratch = $line; - $try{$k}->decode($scratch, FB_QUIET); - if ($scratch eq ''){ - $DEBUG and warn sprintf("%4d:%-24s ok\n", $nline, $k); - }else{ - use bytes (); - $DEBUG and - warn sprintf("%4d:%-24s not ok; %d bytes left\n", - $nline, $k, bytes::length($scratch)); - delete $ok{$k}; - } + my %ok = %try; + # warn join(",", keys %try); + for my $k (keys %try){ + my $scratch = $line; + $try{$k}->decode($scratch, FB_QUIET); + if ($scratch eq ''){ + $DEBUG and warn sprintf("%4d:%-24s ok\n", $nline, $k); + }else{ + use bytes (); + $DEBUG and + warn sprintf("%4d:%-24s not ok; %d bytes left\n", + $nline, $k, bytes::length($scratch)); + delete $ok{$k}; } - %ok or return "No appropriate encodings found!"; - if (scalar(keys(%ok)) == 1){ - my ($retval) = values(%ok); - return $retval; - } - %try = %ok; $nline++; } + %ok or return "No appropriate encodings found!"; + if (scalar(keys(%ok)) == 1){ + my ($retval) = values(%ok); + return $retval; + } + %try = %ok; $nline++; } $try{ascii} or return "Encodings too ambiguous: ", join(" or ", keys %try); @@ -189,6 +195,10 @@ canonical names or aliases. # tries all major Japanese Encodings as well use Encode::Guess qw/euc-jp shiftjis 7bit-jis/; +If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true +value, no heuristics will be applied to UTF8/16/32, and the result +will be limited to the suspects and C<ascii>. + =over 4 =item Encode::Guess->set_suspects |