| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
| |
This is only an issue for those few platforms without LC_ALL, as that
is initialized, and includes LC_MONETARY. This commit extends the
proper initialization to those other platforms. Perl doesn't use
LC_MONETARY itself, but it should be properly initialized for modules
that do.
|
|
|
|
|
|
|
|
|
| |
The code did not explicitly initialize LC_MESSAGES at startup, unlike
most of the other standard categories; I don't know why.
This is only an issue for those few platforms without LC_ALL, as that is
initialized, and includes LC_MESSAGES. This commit extends the proper
initialization to those other platforms.
|
|
|
|
|
|
|
|
| |
This takes one piece of code that is needlessly enclosed in braces and
removes the braces, outdenting and reflowing the comments.
Otherwise, it changes to correct indentation for the addition and
removal of braces by the previous commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If Perl encounters a problem during startup trying to initialize the
locales from the environment it has immediately reverted to the "C"
locale.
This commit generalizes that so it tries each of the applicable
environment variables in order of priority until it works, or it gives
up and uses the "C" locale. For example, if LC_ALL is set to something
that is invalid, but LANG is valid, LANG will be used. This was
motivated by trying to get the Windows system default locale used in
preference to "C" if all else fails.
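
As a rough illustration of the fallback order described above (not the
actual locale.c code), the sketch below tries candidate locale names in
priority order and reports which one setlocale() accepted. The function
name and the idea of passing the candidates as an array are assumptions
made for this example.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate locale name in priority order.
 * Returns the index of the first name setlocale() accepts, or -1 if none
 * of them worked.  NULL entries (e.g. an unset environment variable) are
 * skipped. */
static int set_first_working_locale(int category,
                                    const char *const *candidates,
                                    size_t count)
{
    size_t i;

    assert(candidates != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (candidates[i] != NULL
            && setlocale(category, candidates[i]) != NULL)
        {
            return (int) i;
        }
    }
    return -1;
}
```

A caller would typically fill the array from getenv("LC_ALL"), the
category's own variable, and getenv("LANG"), and put "C" last so that
something always succeeds.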
|
|
|
|
|
|
|
|
|
|
|
| |
This adds some more comments at the beginning of a function concerning
its API, and moves them to before any declarations.
It also moves the declaration for 'done' to the block of other
declarations, and adds a PERL_UNUSED_VAR call if the code that uses it
is #ifdef'd out. Previously it was too easy to not notice the
declaration separate from the others, and to insert code between the
two, which would not compile under C89, but only on Ultrix machines.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Locale initialization and setting on Windows haven't been as
described in perllocale for setting locales to "". This is because that
tells Windows to use the system default locale, as set through the
Control Panel, but on POSIX systems, it means to look at various
environment variables.
This commit creates a wrapper for setlocale, used only on Windows, that
looks for the appropriate environment variables when called with a ""
input locale. If none are found, it continues to use the system default
locale.
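
A minimal sketch of the lookup such a wrapper could perform, assuming the
usual POSIX precedence (LC_ALL, then the category's own variable, then
LANG) and treating an empty value as unset. Both function names are
hypothetical; the real wrapper in locale.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Return the first candidate that is set and non-empty, or NULL. */
static const char *first_nonempty(const char *lc_all,
                                  const char *lc_category,
                                  const char *lang)
{
    if (lc_all != NULL && *lc_all != '\0')
        return lc_all;
    if (lc_category != NULL && *lc_category != '\0')
        return lc_category;
    if (lang != NULL && *lang != '\0')
        return lang;
    return NULL;
}

/* Resolve a "" locale request from the environment.  category_var is the
 * name of the category's variable, e.g. "LC_CTYPE".  A NULL result means
 * nothing was found, so the caller would pass "" through to setlocale()
 * and get the system default. */
static const char *resolve_empty_locale(const char *category_var)
{
    assert(category_var != NULL);
    return first_nonempty(getenv("LC_ALL"),
                          getenv(category_var),
                          getenv("LANG"));
}
```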
|
|
|
|
|
|
|
|
|
|
|
|
| |
For regexec.c, one compiler amongst our smokers believes there is a path
where this array can be used uninitialized; it's easiest to just
initialize it, even though I think the compiler is wrong, unless it is
optimizing incorrectly, in which case, it would still be best to
initialize it.
For locale.c, this is just the well-known gcc bug that they refuse to
fix concerning a (void) cast when the function has been declared to
require not ignoring the result.
|
|
|
|
| |
This will help field debugging of locale issues.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 119ee68b changed the method to determine if a locale is a UTF-8
one to a method that was usable on more platforms, by using the C99 libc
function mbtowc(). I didn't realize that there needs to be a special
call to this function preceding the main call to make sure it is in the
initial state. This commit fixes that.
In looking at the results from several different platforms, I decided it
is best to use nl_langinfo() in preference to mbtowc() when available,
and only use mbtowc() if nl_langinfo doesn't exist on the platform or
fails to return a real result, which happens for some locales on Darwin.
This commit does that as well.
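
The order of checks can be sketched roughly as follows. This is an
illustration only, not the code in locale.c: the function name, the
SKETCH_HAS_NL_LANGINFO guard, the set of codeset spellings accepted, and
the choice of U+FFFD as the probe character are all assumptions.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#if defined(__unix__) || defined(__APPLE__)
#  include <langinfo.h>
#  define SKETCH_HAS_NL_LANGINFO 1
#endif

/* Returns 1 if the current LC_CTYPE locale looks like UTF-8, else 0. */
static int ctype_locale_is_utf8(void)
{
    /* U+FFFD REPLACEMENT CHARACTER encoded as UTF-8 (three bytes) */
    static const char probe[] = "\xEF\xBF\xBD";
    wchar_t wc;
    int len;

#ifdef SKETCH_HAS_NL_LANGINFO
    {
        /* Prefer nl_langinfo() when it gives a real answer */
        const char *codeset = nl_langinfo(CODESET);

        if (codeset != NULL && *codeset != '\0') {
            return strcmp(codeset, "UTF-8") == 0
                || strcmp(codeset, "utf-8") == 0
                || strcmp(codeset, "UTF8") == 0
                || strcmp(codeset, "utf8") == 0;
        }
    }
#endif

    /* Fall back to mbtowc().  A UTF-8 locale needs at least 3 bytes per
     * character to decode the probe at all. */
    if (MB_CUR_MAX < 3)
        return 0;

    /* The special preliminary call: reset to the initial shift state */
    (void) mbtowc(NULL, NULL, 0);

    /* Heuristic: the probe should decode as exactly one 3-byte character */
    len = mbtowc(&wc, probe, sizeof probe - 1);
    return len == 3;
}
```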
|
|
|
|
|
|
|
| |
This mostly indents and outdents based on blocks added or removed by the
previous commit. But there are a few comment changes and vertical
alignment of macro backslash continuation characters, and other
white-space changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, that the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers, had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case
insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
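
To make the edge case concrete, here is a toy fold function covering only
the three code points involved. It is a sketch for illustration, not
Perl's folding code; the type and function names are hypothetical, and
every other code point simply folds to itself here.

```c
#include <assert.h>

typedef unsigned long codepoint;   /* hypothetical type for this sketch */

/* Simple case fold restricted to the characters discussed above */
static codepoint fold_mu(codepoint cp)
{
    switch (cp) {
    case 0x00B5:            /* MICRO SIGN */
    case 0x039C:            /* GREEK CAPITAL LETTER MU */
        return 0x03BC;      /* GREEK SMALL LETTER MU */
    default:
        return cp;
    }
}

/* Two code points match case-insensitively if their folds are equal */
static int fold_equal(codepoint a, codepoint b)
{
    return fold_mu(a) == fold_mu(b);
}
```

The MICRO SIGN is below 256 but its fold is not, which is why a node
containing the CAP MU cannot safely be treated as purely above-Latin1.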
|
|
|
|
|
| |
This compiler is not smart enough to realize that overflow of a byte
can't occur here as the loop stops at 255.
|
|
|
|
| |
This indents because the previous commit added a block.
|
|
|
|
|
|
|
|
|
| |
locale.c has a function that tries to determine if the current POSIX
locale is a UTF-8 locale. Prior to this patch, it used nl_langinfo() to
determine this, falling back to heuristics if that is unavailable on the
platform. nl_langinfo() is part of POSIX.1-2001. This patch adds the
use of two functions from C99, mbtowc() and MB_CUR_MAX, that also give
reliable results.
|
|
|
|
|
|
| |
I don't believe this code was causing any problem, but it can overwrite
static storage returned by setlocale(). It's safer to create a copy
first.
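
A minimal sketch of the safer pattern, assuming a hypothetical helper
name: query the locale with a NULL argument and duplicate the result
before anything else can call setlocale() again.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of the current locale name for the given
 * category, or NULL on failure.  The caller frees the result.  The string
 * returned by setlocale() itself may live in static storage that a later
 * setlocale() call overwrites, so it is not kept. */
static char *copy_current_locale(int category)
{
    const char *current = setlocale(category, NULL);
    char *copy;
    size_t size;

    if (current == NULL)
        return NULL;

    size = strlen(current) + 1;
    copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, current, size);
    return copy;
}
```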
|
|
|
|
|
| |
This documents much of what I learned about how things work while
tracking down [perl #120723].
|
|
|
|
| |
Outdent code removed from a block by the previous commit
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function is called when a new underlying LC_NUMERIC locale has been
set. If that locale is the same as the current underlying one, some
setup is skipped. However, prior to this commit, more was skipped than
should have been. The reason is that even if the underlying locale is
the same, it could be that LC_NUMERIC has been toggled to the "C"
locale, and so the information could be inconsistent. By always setting
the information, we ensure consistency.
This commit is a portion of the fix for [perl #120723]. Tests will be
added with the final commit for it.
|
|
|
|
|
|
|
|
|
|
| |
Commit 68067e4e501e2ae1c0fb44558b6aa5c0a80a4143 inadvertently
broke regular expression /i matching under locale. The tests for this
were defective, so the breakage was not caught. A later commit will fix
the tests, but this commit restores the functionality.
It also casts the input parameter to some functions to be U8 to make
sure that optimizing compilers can omit bounds checks.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There were a few places that were doing
unsigned_var = cond ? signed_val : unsigned_val;
or similar. Fixed by suitable casts etc.
The four in utf8.c were fixed by assigning to an intermediate
unsigned var; this has the happy side-effect of collapsing
a large macro expansion, where toUPPER_LC() etc evaluate their arg
multiple times.
|
|
|
|
|
|
| |
This commit adds #if's to cause locale handling code to compile on
platforms that don't have full-featured locale handling. The commits
mentioned in the ticket did not adequately cover these situations.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a platform doesn't have nl_langinfo(), heuristics are employed
to see if a locale is UTF-8. The first heuristic is looking at the
return value of setlocale(), which generally is the locale name.
However, in actuality the return value is opaque and can't be relied on
to signify the locale. Nevertheless if it contains the string UTF-8
(ignoring case, and with the hyphen optional), it is a safe bet that the
locale is indeed UTF-8. Prior to this patch, we only looked at the end
of the name for "UTF-8". This patch makes it not have to be
right-anchored. There are UTF-8 locales on our dromedary machine with
UTF-8 in the middle of their names.
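
The unanchored name check could look roughly like this. It is a sketch
with hypothetical function names, using ASCII-only case folding so the
result does not itself depend on the current locale.

```c
#include <assert.h>
#include <stddef.h>

/* Locale-independent ASCII lowercase */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Does "utf8" or "utf-8" (any case) start at position p? */
static int utf8_token_at(const char *p)
{
    static const char letters[] = "utf";
    size_t i;

    for (i = 0; i < 3; i++) {
        /* A NUL never matches, so we stop before running off the end */
        if (ascii_lower((unsigned char) p[i]) != letters[i])
            return 0;
    }
    p += 3;
    if (*p == '-')          /* the hyphen is optional */
        p++;
    return *p == '8';
}

/* Returns 1 if the name contains the token anywhere, not just at the end */
static int name_suggests_utf8(const char *name)
{
    assert(name != NULL);
    for (; *name != '\0'; name++) {
        if (utf8_token_at(name))
            return 1;
    }
    return 0;
}
```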
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit causes the locale initialization to skip calling
setlocale(foo, "") if the environment variable PERL_SKIP_LOCALE_INIT is
set. Instead, the setup code calls setlocale(LC_ALL, NULL) (plus other
similar calls for the subcategories) in order to find out what the
current locale is.
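
In outline, the difference between the two startup paths is just the
second argument to setlocale(). The sketch below uses hypothetical
function names and handles only LC_ALL; the real code also deals with the
individual categories.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* With skip_init set, only query the locale already in effect (NULL
 * argument).  Otherwise initialize it from the environment (""). */
static const char *query_or_init_locale(int skip_init)
{
    return setlocale(LC_ALL, skip_init ? NULL : "");
}

/* Decide based on whether PERL_SKIP_LOCALE_INIT is set at all */
static const char *startup_locale_sketch(void)
{
    return query_or_init_locale(getenv("PERL_SKIP_LOCALE_INIT") != NULL);
}
```

An embedding application that has already set up its own locale would set
the variable before starting the interpreter, so that its locale is left
alone.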
The original poster for this ticket has a workaround for it which
involves using a modified copy of Perl core code. This patch defines
the C preprocessor variable HAS_SKIP_LOCALE_INIT that can be used by XS
writers to discover if the current Perl version needs the workaround or
not.
I was unable to come up with a test for this patch that did not involve
building extensive infrastructure for testing embedded Perl. That does
not seem worth it for such a trivial patch. I tested by hand.
|
|
|
|
|
|
|
|
| |
This patch causes the radix string to be examined upon a new numeric
locale being set. If the string isn't ASCII, and the new locale is
UTF-8, it turns on the UTF-8 flag in the scalar that holds the radix.
When a floating point number is formatted in Perl_sv_vcatpvfn_flags(),
and the flag is on, the result's flag will be set on too.
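
The decision about the radix can be sketched as below, assuming the
caller already knows whether the new numeric locale is UTF-8. The function
names are hypothetical, and the sketch only computes the yes/no answer; it
does not touch any Perl scalar.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Returns 1 if the string contains any byte outside the ASCII range */
static int has_non_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char) *s >= 0x80)
            return 1;
    }
    return 0;
}

/* Would the radix string of the current LC_NUMERIC locale need to be
 * marked as UTF-8?  Only if the locale is UTF-8 and the radix is not
 * plain ASCII. */
static int radix_needs_utf8_flag(int locale_is_utf8)
{
    const struct lconv *lc = localeconv();

    return locale_is_utf8
        && lc->decimal_point != NULL
        && has_non_ascii(lc->decimal_point);
}
```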
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In reality, the return value of setlocale() is documented to be opaque,
so using it to determine if a locale is UTF-8 or not may not work. It
is a char*, which we treat as a name. We can safely assume that if the
name contains UTF-8 (or slight variations thereof), that it is a UTF-8
locale. But if the name doesn't contain that, it still could be one.
In fact there are currently many locales on our dromedary machine that
fall into this category. Similarly, something containing 8859 is not
going to be UTF-8.
This commit adds another test for cases where there is no nl_langinfo(),
and the locale name isn't helpful. It looks at the currency symbol,
which typically will be in the locale's script. If that is illegal
UTF-8, we know for sure that the locale isn't UTF-8 (or is corrupted).
If it is legal UTF-8 (but not ASCII) we can be pretty sure that the
locale is UTF-8. If it is ASCII, we still don't know one way or the
other, so we err on it not being UTF-8.
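
The three-way outcome described above can be sketched as a classifier
over the bytes of the currency symbol. The enum and function names are
hypothetical, and the validity check is a simplified one written for this
example, not Perl's own UTF-8 validation.

```c
#include <assert.h>
#include <stddef.h>

enum utf8_guess {
    GUESS_NOT_UTF8,         /* illegal UTF-8: locale is not UTF-8 */
    GUESS_UNKNOWN_ASCII,    /* all ASCII: no evidence either way */
    GUESS_LIKELY_UTF8       /* legal, non-ASCII UTF-8 */
};

static enum utf8_guess classify_bytes(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;
    int saw_multibyte = 0;

    assert(str != NULL);
    while (*s != 0) {
        unsigned int extra;         /* continuation bytes expected */
        unsigned long cp, min;      /* decoded value, smallest legal value */

        if (*s < 0x80) { s++; continue; }

        if      ((*s & 0xE0) == 0xC0) { extra = 1; cp = *s & 0x1F; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { extra = 2; cp = *s & 0x0F; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { extra = 3; cp = *s & 0x07; min = 0x10000; }
        else return GUESS_NOT_UTF8;
        s++;

        while (extra-- > 0) {
            /* Also catches a premature end of string */
            if ((*s & 0xC0) != 0x80)
                return GUESS_NOT_UTF8;
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }

        /* Reject overlong forms, surrogates, and out-of-range values */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return GUESS_NOT_UTF8;
        saw_multibyte = 1;
    }
    return saw_multibyte ? GUESS_LIKELY_UTF8 : GUESS_UNKNOWN_ASCII;
}
```

A caller would pass localeconv()->currency_symbol for the locale under
test and treat GUESS_UNKNOWN_ASCII the same as GUESS_NOT_UTF8.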
Originally, I was going to use the locale's error message strings,
returned from strerror(), the source for $!, to check for this.
These are supposed to be in terms of the LC_MESSAGES locale. Chances
are vanishingly small that the locale is not UTF-8 if all the messages
pass a utf8ness test, provided that the messages aren't just ASCII.
However, on dromedary, the messages for many of the exotic locales
haven't been translated, and are still in English, which doesn't help at
all. I figure that this is quite likely to be the case generally, and
the currency symbol is much more likely to have been translated.
I left the code in though, commented out for possible future use.
Note that this test will run only on systems that don't have
nl_langinfo(). The test can also be turned off by setting a C compiler
flag -DNO_LOCALE_MONETARY, (and -DNO_LOCALE_MESSAGES for the
commented-out part), corresponding to the way the other categories can
be turned off (none of which is documented).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was buggy code to see if the start-up locale is UTF-8. This
commit extracts it into a separate function.
The bugs involved looking at the name of the locale to see if that
implies a UTF-8 name. Prior to this commit, it looked at the
beginning of the locale name, whereas in reality, it is at the end, as
in "fr_FR.UTF8".
Also, it didn't look for the documented Windows name for UTF-8 locales
on those platforms.
The function is expanded to have an input category to find the utf8ness
of. Thus it now works on any non-LC_ALL category, not just LC_CTYPE.
It is possible for categories to be in different locales, so that
LC_CTYPE is in a UTF-8 locale, and LC_NUMERIC isn't. For the purposes
of PERL_UNICODE, the most applicable category is LC_CTYPE, so that is
the one used in its currently only call.
|
|
|
|
|
|
| |
Prior to this patch, one parameter to strNE would have been through a
standardizing function, while the other had not. By standardizing both
before doing the compare, we avoid false positives.
|
|
|
|
| |
This indents some nested #if's to clarify the program structure.
|
| |
|
|
|
|
|
| |
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
|
|
|
|
|
|
|
|
|
| |
097ee67dff1c60f2 didn't need to include <locale.h> in locale.c (then
util.c) because it had been included by perl.h since 5.002 beta 1
3f270f98f9305540 missed removing the include of <unistd.h> from perl.c
or perlio.c
de8ca8af19546d49 changed perl.h to also include <sys/wait.h>, but didn't
notice that it could therefore be removed from perl.c, pp_sys.c and util.c
|
| |
|
|
|
|
|
| |
Prefix it with "panic", report the two lengths that caused the sanity test
failure, and add the message to perldiag.pod.
|
| |
|
|
|
|
|
|
|
|
|
| |
# New Ticket Created by (Peter J. Acklam)
# Please include the string: [perl #81904]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
Signed-off-by: Abigail <abigail@abigail.be>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As discussed on p5p, ibcmp has different semantics from other cmp
functions in that it is a binary instead of ternary function. It is
less confusing then to have a name that implies true/false.
There are three functions affected: ibcmp, ibcmp_locale and ibcmp_utf8.
ibcmp is actually equivalent to foldNE, but for the same reason that things
like 'unless' and 'until' are cautioned against, I changed the functions
to foldEQ, so that the existing names, like ibcmp_utf8 are defined as
macros as being the complement of foldEQ.
This patch also changes the one file where turning ibcmp into a macro
causes problems. It changes it to use the new name. It also documents
for the first time ibcmp, ibcmp_locale and their new names.
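
The relationship between the two names can be illustrated with a small
ASCII-only sketch. The names carry a _sketch suffix because these are not
the real Perl functions, which have different signatures and handle more
than ASCII.

```c
#include <assert.h>
#include <stddef.h>

/* Locale-independent ASCII lowercase */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Returns true (1) if the first len bytes are equal ignoring ASCII case.
 * The name says what a true result means. */
static int fold_eq_sketch(const char *a, const char *b, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (ascii_lower((unsigned char) a[i])
            != ascii_lower((unsigned char) b[i]))
        {
            return 0;
        }
    }
    return 1;
}

/* The old name kept as a macro: true when the strings differ */
#define ibcmp_sketch(a, b, len) (!fold_eq_sketch((a), (b), (len)))
```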
|
|
|
|
| |
util.c don't need the interpreter as well
|
|
|
|
|
|
| |
Message-ID: <25940.1225611819@chthon>
Date: Sun, 02 Nov 2008 01:43:39 -0600
p4raw-id: //depot/perl@34698
|
|
|
| |
p4raw-id: //depot/perl@34585
|
|
|
|
|
|
|
|
|
|
|
|
| |
ability to create landmines that will explode under someone in the
future when they upgrade their compiler to one with better
optimisation. We've already done this at least twice.
(Yes, some of the assertions are after code that would already have
SEGVd because it already dereferences a pointer, but they are put in
to make it easier to automate checking that each and every case is
covered.)
Add a tool, checkARGS_ASSERT.pl, to check that every case is covered.
p4raw-id: //depot/perl@33291
|
|
|
| |
p4raw-id: //depot/perl@32237
|
|
|
|
|
|
|
| |
Subject: locale.c usage of strxfrm
From: "Devin Heitmueller" <devin.heitmueller@gmail.com>
Message-ID: <412bdbff0704201520i7aac0189n74f0cef5c5213f41@mail.gmail.com>
p4raw-id: //depot/perl@31092
|
|
|
| |
p4raw-id: //depot/perl@27875
|
|
|
|
|
|
| |
Message-Id: <200604111908.k3BJ8ewn030950@kosh.hut.fi>
Date: Tue, 11 Apr 2006 22:08:40 +0300 (EEST)
p4raw-id: //depot/perl@27769
|
|
|
|
|
| |
Message-ID: <4438B854.6040301@gmail.com>
p4raw-id: //depot/perl@27750
|
|
|
|
|
| |
Message-ID: <20060221062711.GA16160@petdance.com>
p4raw-id: //depot/perl@27300
|
|
|
|
|
|
| |
Message-ID: <20060203152449.GI12591@accognoscere.homeunix.org>
Date: Fri, 3 Feb 2006 16:24:49 +0100
p4raw-id: //depot/perl@27065
|
|
|
|
|
| |
Message-ID: <20060202093849.GD12591@accognoscere.homeunix.org>
p4raw-id: //depot/perl@27054
|