summaryrefslogtreecommitdiff
path: root/ext/Encode
diff options
context:
space:
mode:
authorJarkko Hietaniemi <jhi@iki.fi>2003-06-30 09:39:29 +0000
committerJarkko Hietaniemi <jhi@iki.fi>2003-06-30 09:39:29 +0000
commit1aa6899ff053e0baf49f50bd5410ea0984993c58 (patch)
tree4f6d84bc0c92fcb088c5ea187eaca2a655bc747c /ext/Encode
parent6d686c581ff926a22874273b88cec04fcb536a31 (diff)
downloadperl-1aa6899ff053e0baf49f50bd5410ea0984993c58.tar.gz
Integrate:
[ 19857] Regen Changes. [ 19858] perlhack update, by Steve Grazzini about macro support in gdb and gcc. [ 19859] Hash/Util.t and Encode/t/Aliases.t seem to be having random failures. To make these easier to reproduce, add a variable, PERL_HASH_SEED_DEBUG, to display the hash seed. E.g. in Debian/x86 Linux 3.0 PERL_HASH_SEED of 82972356 makes the first one to fail. [ 19860] Subject: Re: Change 19854: Bite the bullet and apply the hash randomisation patch. From: Tim Bunce <Tim.Bunce@pobox.com> Date: Thu, 26 Jun 2003 10:53:22 +0100 Message-ID: <20030626095322.GE97463@dansat.data-plan.com> [ 19861] Do not obey PERL_HASH_SEED or PERL_HASH_SEED_DEBUG if tainting-- but is this a good thing or a bad thing? (At least it makes debugging lib/Hash/Util.t harder, since it has, for no apparent good reason, -T: one must make a copy of it without the -T.) [ 19862] Make doing_taint() always available (though not part of the public API). [ 19863] Introduce (global) variable PL_earlytaint which is set very early in main(), before perl_parse() has been called and PL_tainting (or PL_taint_warn) might have been set. [ 19864] Use the PL_earlytaint. (PL_earlytaint is a global, not per-interp, since perl_construct() is not passed the argc, argv, and therefore it can't set the per-interp PL_tainting.) [ 19865] atoi() doesn't cut the mustard if the PERL_HASH_SEED is larger than INT_MAX (atoi() returns -1 in that case). [ 19866] Some warnings about the (im)proper uses of the hash randomisation. [ 19867] The two-for-loops is no more a valid way to walk through a hash (this was the reason the Hash/Util.t intermittently failed, the two-loop didn't find all the SVs of the HV). [ 19868] Integrate mainline [ 19869] Fix test count, by Abe Timmerman. [ 19870] Two debugging patches. The first allows to hold symbolic switches in $^D and more generally fixes assignment to $^D. The second one improves the information given by -Dl. Subject: [PATCH] allow $^D = "flags" From: Dave Mitchell <davem@fdgroup.com> Date: Fri, 27 Jun 2003 22:26:24 +0100 Message-ID: <20030627212624.GB12887@fdgroup.com> Subject: [PATCH] make -Dl show more scope info From: Dave Mitchell <davem@fdgroup.com> Date: Fri, 27 Jun 2003 23:00:36 +0100 Message-ID: <20030627220036.GC12887@fdgroup.com> [ 19871] Subject: [Encode] pre-1.97 patches From: Dan Kogai <dankogai@dan.co.jp> Date: Sat, 28 Jun 2003 01:20:59 +0900 Message-Id: <56D5BFEE-A8BB-11D7-9092-000393AE4244@dan.co.jp> [ 19872] Some clarification about the current semantics of CHECK and INIT blocks. See bug [perl #22826]. [ 19873] Using $1 without testing success of the regexp, bad. [ 19874] Retract #19867; the bug was really much simpler: the < max must be <= max instead. [ 19875] Duh. [ 19876] Subject: Re: your malloc patches From: Ilya Zakharevich <ilya@Math.Berkeley.EDU> Date: Fri, 27 Jun 2003 06:54:06 -0700 Message-ID: <20030627135406.GA8914@math.berkeley.edu> More malloc patches: now they seem to work even in Tru64. [ 19877] The #19842 is no more needed thanks to #19876, and the #19842 was wrong anyway (it affected only the threaded case.) [ 19878] Move the PL_earlytaint initialization to the PERL_SYS_INIT() as per suggestion from Sarathy. [ 19879] Another spot where a zero $test{$max} can make things go boom. [ 19880] argc, argv. [ 19881] More coffee... [ 19882] Perl_doing_taint must be public, for programs that embed perl [ 19883] More on the macro debugging and expansion. [ 19884] The joy of $0. Undoing the #16399 makes Andreas' tests (see [perl #22811]) pass (yes, padding with space instead of nul makes no sense, but that seems to work, maybe Linux does some deep magic in ps(1)?); moving the PL_origalen computation earlier makes also the threaded-first case fully pass. But in general modifying the argv[] is very non-portable. (e.g. in Tru64 it seems to be limited to the size of the original argv[0] since the argv[] are not contiguous?) Everybody should just have setproctitle(). [ 19885] Fix a faulty alias. [ 19886] Misc Pod Nits. [ 19887] $0 test tweaks from Andreas. [ 19888] $0 doc tweakage. [ 19889] The 'contiguous' test for argv[], envp[] was bogus since those need not be in memory end-to-end, e.g. in Tru64 they are aligned by eight. Loosen the test so that 'contiguousness' is fulfilled if the elements are within PTRSIZE alignment. This makes Tru64 to pass the join.t, too. [ 19890] int is not UV. p4raw-link: @19890 on //depot/perl: 7d8e7db38dc74a9a7ddcc48566f03f2b6af6f737 p4raw-link: @19889 on //depot/perl: 3cb9023dc910d8a9abbd8d44e501f6e492155eb5 p4raw-link: @19888 on //depot/perl: f9cbb277dec3cb2700132dedd25b05ea72cda45a p4raw-link: @19887 on //depot/perl: ecce83c2318389c6dd5770c975354bb2411bd50f p4raw-link: @19886 on //depot/perl: e13efe3ceea1a416bee536860751edb48e6bfcb3 p4raw-link: @19885 on //depot/perl: b9531c19967f04908d6f8236ceb2296ad6358488 p4raw-link: @19884 on //depot/perl: 54bfe034ba642318cf2c7d0b37579f30adef144a p4raw-link: @19883 on //depot/perl: 52d59bef96c881381bce1bcb84a8c08ce48c2544 p4raw-link: @19882 on //depot/perl: d20fa10417f31b8f4d60b68adce91b91f9d3cd62 p4raw-link: @19881 on //depot/perl: c4b2e1b65d11779e63c2d42d6b840c9078181338 p4raw-link: @19880 on //depot/perl: f98d840496025d33749be7bdcdba70b97bd142b8 p4raw-link: @19879 on //depot/perl: a32c473717aa00461cd55052bb0345aa311e1123 p4raw-link: @19878 on //depot/perl: 1199dd43248b0956628341f2a63939a8378c8016 p4raw-link: @19877 on //depot/perl: 24130e51d52fd22992dd62e432895a9115f3a585 p4raw-link: @19876 on //depot/perl: d0bbed784b85a44e92a8a0e3d4046ce7f236db02 p4raw-link: @19875 on //depot/perl: f3f91eeab5d8feea9ff5606711dfaaa7851308c1 p4raw-link: @19874 on //depot/perl: 3a676441c258924612d07e12c0faa7606e5bbba2 p4raw-link: @19873 on //depot/perl: 2275acdc2a5e9bfc8338ccf52a5a82e52653b1b0 p4raw-link: @19872 on //depot/perl: ca62f0fc957407f48588d44995309a50a80e45ab p4raw-link: @19871 on //depot/perl: 23f3589e21445e9141901c2894bc97b457493332 p4raw-link: @19870 on //depot/perl: b4ab917c3d812d8e61d365bfa48d9bf7675bc113 p4raw-link: @19869 on //depot/perl: 1d26cd9ec5ffb2d7823fb6941a001dc8e9a6d1c6 p4raw-link: @19868 on //depot/perlio: c9908cac60bbb191807f0d3fafd9567a2304b7e9 p4raw-link: @19867 on //depot/perl: 871661ef06c9321a672dd21cf8e97cec33e2c5ee p4raw-link: @19866 on //depot/perl: 7b3f70378c41657f3e0c917f322e2cda58f33b5e p4raw-link: @19865 on //depot/perl: bf1e01904b621fce6a1d1e1bcf187334cf1b1e04 p4raw-link: @19864 on //depot/perl: ed085813cee9c22e7ad548a324c6d8f6d7d726d2 p4raw-link: @19863 on //depot/perl: af419de789419c9e4520d33654a91564094b407a p4raw-link: @19862 on //depot/perl: a06433151b0f1a3a12ccc4d2629feb511ea9fce6 p4raw-link: @19861 on //depot/perl: d0d2ba8fa784ab4c88f64ef679c2c1ff6203412a p4raw-link: @19860 on //depot/perl: 3debabd9ba8d62a4b7656b07d06b582de8063b12 p4raw-link: @19859 on //depot/perl: 2191697ea9da49f0c020a5bcb1eb2a2e9d574a4e p4raw-link: @19858 on //depot/perl: ea031e66439c986384865daf3860bb9bb815a8fa p4raw-link: @19857 on //depot/maint-5.8/perl: 6d686c581ff926a22874273b88cec04fcb536a31 p4raw-id: //depot/maint-5.8/perl@19891 p4raw-integrated: from //depot/perl@19857 'copy in' pod/perlretut.pod (@18299..) pod/perlhack.pod (@19211..) ext/Encode/lib/Encode/Guess.pm (@19325..) pod/perlmod.pod (@19425..) scope.h (@19431..) t/op/magic.t (@19452..) ext/Encode/lib/Encode/Alias.pm (@19578..) mpeix/mpeixish.h (@19602..) ext/threads/t/join.t (@19706..) lib/Test/Harness.pm (@19766..) ext/Encode/Changes ext/Encode/Encode.pm (@19811..) epoc/epocish.h plan9/plan9ish.h unixish.h vms/vmsish.h (@19831..) malloc.c (@19834..) hints/dec_osf.sh (@19842..) t/comp/require.t (@19851..) INSTALL pod/perlsec.pod (@19854..) 'ignore' miniperlmain.c (@19242..) 'merge in' ext/B/B.pm (@18856..) cop.h (@19242..) global.sym (@19431..) perlvars.h (@19499..) hv.c (@19632..) mg.c pod/perlvar.pod (@19769..) dosish.h os2/os2ish.h (@19831..) embed.fnc embed.h proto.h (@19843..) embedvar.h perl.c perl.h perlapi.h pod/perlrun.pod (@19854..)
Diffstat (limited to 'ext/Encode')
-rw-r--r--ext/Encode/Changes10
-rw-r--r--ext/Encode/Encode.pm2
-rw-r--r--ext/Encode/lib/Encode/Alias.pm2
-rw-r--r--ext/Encode/lib/Encode/Guess.pm124
4 files changed, 79 insertions, 59 deletions
diff --git a/ext/Encode/Changes b/ext/Encode/Changes
index 18f5788e92..7251f5d365 100644
--- a/ext/Encode/Changes
+++ b/ext/Encode/Changes
@@ -3,6 +3,16 @@
# $Id: Changes,v 1.96 2003/06/18 09:29:02 dankogai Exp $
#
$Revision: 1.96 $ $Date: 2003/06/18 09:29:02 $
+! lib/Encode/Guess.pm
+ $Encode::Guess::NoUTFAutoGuess is added so you can turn off
+ automatic utf(8|16|32) guessing -- originally by Autrijus
+ Message-Id: <20030626162731.GA2077@not.autrijus.org>
+! Encode.pm
+ Addressed the following;
+ Subject: [perl #22835] FB_QUIET doesn't work with Encode::encode
+ Message-Id: <rt-22835-59975.6.8650775354304@rt.perl.org>
+
+1.96 2003/06/18 09:29:02
! lib/Encode/JP/JP.pm t/guess.t
m/(...)/ in void context then $1 is considered a Bad Thing
Message-Id: <B5AB34D0-A019-11D7-AF03-000393AE4244@dan.co.jp>
diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index 57bcc2b0d2..db74b6a194 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -147,7 +147,7 @@ sub encode($$;$)
Carp::croak("Unknown encoding '$name'");
}
my $octets = $enc->encode($string,$check);
- return undef if ($check && length($string));
+ $_[1] = $string if $check;
return $octets;
}
diff --git a/ext/Encode/lib/Encode/Alias.pm b/ext/Encode/lib/Encode/Alias.pm
index d684ced9ac..70b3dd8714 100644
--- a/ext/Encode/lib/Encode/Alias.pm
+++ b/ext/Encode/lib/Encode/Alias.pm
@@ -204,7 +204,7 @@ sub init_aliases
# CP936 doesn't have vendor-addon for GBK, so they're identical.
define_alias( qr/^gbk$/i => '"cp936"');
# This fixes gb2312 vs. euc-cn confusion, practically
- define_alias( qr/\bGB[-_ ]?2312(?:\D.*$|$)/i => '"euc-cn"' );
+ define_alias( qr/\bGB[-_ ]?2312(?!-?raw)/i => '"euc-cn"' );
# for Encode::JP
define_alias( qr/\bjis$/i => '"7bit-jis"' );
define_alias( qr/\beuc.*jp$/i => '"euc-jp"' );
diff --git a/ext/Encode/lib/Encode/Guess.pm b/ext/Encode/lib/Encode/Guess.pm
index fc8d267d02..5858f819cd 100644
--- a/ext/Encode/lib/Encode/Guess.pm
+++ b/ext/Encode/lib/Encode/Guess.pm
@@ -18,6 +18,7 @@ sub needs_lines { 1 }
sub perlio_ok { 0 }
our @EXPORT = qw(guess_encoding);
+our $NoUTFAutoGuess = 0;
sub import { # Exporter not used so we do it on our own
my $callpkg = caller;
@@ -70,75 +71,80 @@ sub guess {
return unless defined $octet and length $octet;
# cheat 0: utf8 flag;
- Encode::is_utf8($octet) and return find_encoding('utf8');
+ if ( Encode::is_utf8($octet) ) {
+ return find_encoding('utf8') unless $NoUTFAutoGuess;
+ Encode::_utf8_off($octet);
+ }
# cheat 1: BOM
use Encode::Unicode;
- my $BOM = unpack('n', $octet);
- return find_encoding('UTF-16')
- if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
- $BOM = unpack('N', $octet);
- return find_encoding('UTF-32')
- if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000));
+ unless ($NoUTFAutoGuess) {
+ my $BOM = unpack('n', $octet);
+ return find_encoding('UTF-16')
+ if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
+ $BOM = unpack('N', $octet);
+ return find_encoding('UTF-32')
+ if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000));
+ if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)
+ my $utf;
+ my ($be, $le) = (0, 0);
+ if ($octet =~ /\x00\x00/o){ # UTF-32(BE|LE) assumed
+ $utf = "UTF-32";
+ for my $char (unpack('N*', $octet)){
+ $char & 0x0000ffff and $be++;
+ $char & 0xffff0000 and $le++;
+ }
+ }else{ # UTF-16(BE|LE) assumed
+ $utf = "UTF-16";
+ for my $char (unpack('n*', $octet)){
+ $char & 0x00ff and $be++;
+ $char & 0xff00 and $le++;
+ }
+ }
+ $DEBUG and warn "$utf, be == $be, le == $le";
+ $be == $le
+ and return
+ "Encodings ambiguous between $utf BE and LE ($be, $le)";
+ $utf .= ($be > $le) ? 'BE' : 'LE';
+ return find_encoding($utf);
+ }
+ }
my %try = %{$obj->{Suspects}};
for my $c (@_){
my $e = find_encoding($c) or die "Unknown encoding: $c";
$try{$e->name} = $e;
$DEBUG and warn "Added: ", $e->name;
}
- if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)
- my $utf;
- my ($be, $le) = (0, 0);
- if ($octet =~ /\x00\x00/o){ # UTF-32(BE|LE) assumed
- $utf = "UTF-32";
- for my $char (unpack('N*', $octet)){
- $char & 0x0000ffff and $be++;
- $char & 0xffff0000 and $le++;
- }
- }else{ # UTF-16(BE|LE) assumed
- $utf = "UTF-16";
- for my $char (unpack('n*', $octet)){
- $char & 0x00ff and $be++;
- $char & 0xff00 and $le++;
+ my $nline = 1;
+ for my $line (split /\r\n?|\n/, $octet){
+ # cheat 2 -- \e in the string
+ if ($line =~ /\e/o){
+ my @keys = keys %try;
+ delete @try{qw/utf8 ascii/};
+ for my $k (@keys){
+ ref($try{$k}) eq 'Encode::XS' and delete $try{$k};
}
}
- $DEBUG and warn "$utf, be == $be, le == $le";
- $be == $le
- and return "Encodings ambiguous between $utf BE and LE ($be, $le)";
- $utf .= ($be > $le) ? 'BE' : 'LE';
- return find_encoding($utf);
- }else{
- my $nline = 1;
- for my $line (split /\r\n?|\n/, $octet){
- # cheat 2 -- \e in the string
- if ($line =~ /\e/o){
- my @keys = keys %try;
- delete @try{qw/utf8 ascii/};
- for my $k (@keys){
- ref($try{$k}) eq 'Encode::XS' and delete $try{$k};
- }
- }
- my %ok = %try;
- # warn join(",", keys %try);
- for my $k (keys %try){
- my $scratch = $line;
- $try{$k}->decode($scratch, FB_QUIET);
- if ($scratch eq ''){
- $DEBUG and warn sprintf("%4d:%-24s ok\n", $nline, $k);
- }else{
- use bytes ();
- $DEBUG and
- warn sprintf("%4d:%-24s not ok; %d bytes left\n",
- $nline, $k, bytes::length($scratch));
- delete $ok{$k};
- }
+ my %ok = %try;
+ # warn join(",", keys %try);
+ for my $k (keys %try){
+ my $scratch = $line;
+ $try{$k}->decode($scratch, FB_QUIET);
+ if ($scratch eq ''){
+ $DEBUG and warn sprintf("%4d:%-24s ok\n", $nline, $k);
+ }else{
+ use bytes ();
+ $DEBUG and
+ warn sprintf("%4d:%-24s not ok; %d bytes left\n",
+ $nline, $k, bytes::length($scratch));
+ delete $ok{$k};
}
- %ok or return "No appropriate encodings found!";
- if (scalar(keys(%ok)) == 1){
- my ($retval) = values(%ok);
- return $retval;
- }
- %try = %ok; $nline++;
}
+ %ok or return "No appropriate encodings found!";
+ if (scalar(keys(%ok)) == 1){
+ my ($retval) = values(%ok);
+ return $retval;
+ }
+ %try = %ok; $nline++;
}
$try{ascii} or
return "Encodings too ambiguous: ", join(" or ", keys %try);
@@ -189,6 +195,10 @@ canonical names or aliases.
# tries all major Japanese Encodings as well
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
+If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
+value, no heuristics will be applied to UTF8/16/32, and the result
+will be limited to the suspects and C<ascii>.
+
=over 4
=item Encode::Guess->set_suspects