path: root/locale.c
Commit message (Collapse)AuthorAgeFilesLines
...
* Make sure LC_MONETARY is initializedKarl Williamson2014-02-151-0/+13
| | | | | | | | This is only an issue for those few platforms without LC_ALL, as that is initialized, and includes LC_MONETARY. This commit extends the proper initialization to those other platforms. Perl doesn't use LC_MONETARY itself, but it should be properly initialized for modules that do.
* Initialize LC_MESSAGES at start-upKarl Williamson2014-02-151-1/+13
The code did not explicitly initialize LC_MESSAGES at startup, unlike most of the other standard categories; I don't know why. This is only an issue for those few platforms without LC_ALL, as that is initialized, and includes LC_MESSAGES. This commit extends the proper initialization to those other platforms.
* locale.c: White-space, useless brace removal onlyKarl Williamson2014-02-151-84/+82
| | | | | | | | This takes one piece of code that is needlessly enclosed in braces and removes the braces, outdenting and reflowing the comments. Otherwise, it changes to correct indentation for the addition and removal of braces by the previous commit.
* Improve fallback during locale initializationKarl Williamson2014-02-151-46/+162
| | | | | | | | | | | | | If Perl encounters a problem during startup trying to initialize the locales from the environment it has immediately reverted to the "C" locale. This commit generalizes that so it tries each of the applicable environment variables in order of priority until it works, or it gives up and uses the "C" locale. For example, if LC_ALL is set to something that is invalid, but LANG is valid, LANG will be used. This was motivated by trying to get the Windows system default locale used in preference to "C" if all else fails.
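The fallback described above can be sketched as a loop over candidate environment variables. This is a minimal illustration, not the code in locale.c: the helper name `init_category_with_fallback` is hypothetical, and the priority order shown (LC_ALL, then the category's own variable, then LANG) is the usual POSIX order, assumed here rather than taken from the commit.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not Perl's code: try the environment variables that
 * apply to `category` in (assumed) POSIX priority order, and return the name
 * of the first one whose value setlocale() accepts.  Returns "C" if none of
 * them works. */
static const char *
init_category_with_fallback(int category, const char *category_env_name)
{
    const char *env_names[3];
    size_t i;

    env_names[0] = "LC_ALL";
    env_names[1] = category_env_name;   /* e.g. "LC_CTYPE" */
    env_names[2] = "LANG";

    for (i = 0; i < 3; i++) {
        const char *value = getenv(env_names[i]);

        /* Skip unset or empty variables; keep going if the value is
         * rejected instead of giving up immediately. */
        if (value != NULL && *value != '\0'
            && setlocale(category, value) != NULL)
        {
            return env_names[i];
        }
    }

    setlocale(category, "C");           /* last resort */
    return "C";
}
```

With LC_ALL set to an invalid name and LANG set to a valid one, the loop rejects the first candidate and settles on LANG, which is the behaviour the commit message gives as its example.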
* locale.c: Add, move some comments, and a declarationKarl Williamson2014-02-151-7/+16
| | | | | | | | | | | This adds some more comments at the beginning of a function concerning its API, and moves them to before any declarations. It also moves the declaration for 'done' to the block of other declarations, and adds a PERL_UNUSED_VAR call if the code that uses it is #ifdef'd out. Previously it was too easy to not notice the declaration separate from the others, and to insert code between the two, which would not compile under C89, but only on Ultrix machines.
* Emulate POSIX locale setting on WindowsKarl Williamson2014-02-151-8/+87
| | | | | | | | | | | | | Locale initialization and setting on Windows haven't been as described in perllocale for setting locales to "". This is because that tells Windows to use the system default locale, as set through the Control Panel, but on POSIX systems, it means to look at various environment variables. This commit creates a wrapper for setlocale, used only on Windows, that looks for the appropriate environment variables when called with a "" input locale. If none are found, it continues to use the system default locale.
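A wrapper of the kind described might look roughly like the following. It is a sketch under assumptions: the names `first_set_env` and `posix_style_setlocale` are invented for illustration, and the lookup order is the conventional POSIX one rather than anything confirmed by the commit.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: return the value of the first non-empty variable among
 * LC_ALL, the category's own variable, and LANG, or NULL if none is set. */
static const char *
first_set_env(const char *category_env_name)
{
    const char *names[3];
    size_t i;

    names[0] = "LC_ALL";
    names[1] = category_env_name;
    names[2] = "LANG";

    for (i = 0; i < 3; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && *value != '\0')
            return value;
    }
    return NULL;
}

/* Hypothetical wrapper: give "" its POSIX meaning (consult the
 * environment) on a platform where "" natively means "system default". */
static char *
posix_style_setlocale(int category, const char *category_env_name,
                      const char *locale)
{
    if (locale != NULL && *locale == '\0') {
        const char *from_env = first_set_env(category_env_name);
        if (from_env != NULL)
            return setlocale(category, from_env);
        /* Nothing in the environment: fall through so that "" keeps its
         * native meaning. */
    }
    return setlocale(category, locale);
}
```

Any other argument, including NULL for a pure query, is passed straight through to `setlocale()`.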
* regexec.c, locale.c: Silence some compiler warningsKarl Williamson2014-02-121-0/+2
For regexec.c, one compiler amongst our smokers believes there is a path where this array can be used uninitialized; it's easiest to just initialize it, even though I think the compiler is wrong, unless it is optimizing incorrectly, in which case, it would still be best to initialize it. For locale.c, this is just the well-known gcc bug that they refuse to fix concerning a (void) cast when the function has been declared to require not ignoring the resul
* Add -DL option to trace setlocale callsKarl Williamson2014-02-031-0/+64
| | | | This will help field debugging of locale issues.
* locale.c: Fix failure to find UTF-8 localesKarl Williamson2014-01-291-31/+34
Commit 119ee68b changed the method to determine if a locale is a UTF-8 one to a method that was usable on more platforms, by using the C99 libc function mbtowc(). I didn't realize that there needs to be a special call to this function preceding the main call to make sure it is in the initial state. This commit fixes that. In looking at the results from several different platforms, I decided it is best to use nl_langinfo() in preference to mbtowc() when available, and only use mbtowc() if nl_langinfo doesn't exist on the platform or fails to return a real result, which happens for some locales on Darwin. This commit does that as well.
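The "special call" mentioned above is the reset form of `mbtowc()`, with a NULL string argument. The sketch below shows only the `mbtowc()` fallback, not the `nl_langinfo()` path. The function name `mbtowc_says_utf8` is hypothetical, the choice of U+FFFD as the probe character is an assumption made for this example, and the comparison of `wc` against 0xFFFD assumes the platform stores Unicode code points in `wchar_t`.

```c
#include <stdlib.h>

/* Hypothetical probe: returns 1 if mbtowc() decodes a 3-byte UTF-8
 * sequence as a single character in the current LC_CTYPE locale. */
static int
mbtowc_says_utf8(void)
{
    /* UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER (assumed probe). */
    static const char probe[] = "\xEF\xBF\xBD";
    wchar_t wc;
    int len;

    /* A locale whose longest character is under 3 bytes cannot be UTF-8
     * for the purposes of this probe. */
    if (MB_CUR_MAX < 3)
        return 0;

    /* Reset mbtowc() to its initial shift state before the real call. */
    (void) mbtowc(NULL, NULL, 0);

    len = mbtowc(&wc, probe, sizeof probe - 1);
    return len == 3 && wc == (wchar_t) 0xFFFD;
}
```

In the default "C" locale `MB_CUR_MAX` is 1, so the probe reports 0 without calling `mbtowc()` on the test bytes at all.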
* White-space, comments onlyKarl Williamson2014-01-271-9/+8
This mostly indents and outdents based on blocks added or removed by the previous commit. But there are a few comment changes and vertical alignment of macro backslash continuation characters, and other white-space changes.
* Work properly under UTF-8 LC_CTYPE localesKarl Williamson2014-01-271-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales. As a side effect it resolves [perl #56820]. The basics are easy, but there were a lot of details, and one troublesome edge case discussed below. What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale. More work had to be done for regular expressions. There are three cases. 1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work. 2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only characters above-Latin1 that match only themselves, that the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on. 
In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work. Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances. 3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list. The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case-insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that.
Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
* locale.c: Silence Win32 compiler warning.Karl Williamson2014-01-231-1/+1
| | | | | This compiler is not smart enough to realize that overflow of a byte can't occur here as the loop stops at 255.
* locale.c: White-space onlyKarl Williamson2014-01-221-13/+13
| | | | This indents because the previous commit added a block
* locale.c: Find UTF-8 locales reliably on C99 platformsKarl Williamson2014-01-221-10/+53
locale.c has a function that tries to determine if the current POSIX locale is a UTF-8 locale. Prior to this patch, it used nl_langinfo() to determine this, falling back to heuristics if that is unavailable on the platform. nl_langinfo() is part of POSIX.1-2001. This patch adds the use of two functions from C99, mbtowc() and MB_CUR_MAX, that also give reliable results.
* locale.c: Avoid writing libc static storageKarl Williamson2014-01-041-3/+3
| | | | | | I don't believe this code was causing any problem, but it can overwrite static storage returned by setlocale(). It's safer to create a copy first.
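The safer pattern is to duplicate the string before doing anything else with it, since the pointer returned by `setlocale()` refers to storage owned by libc. A minimal sketch follows; the helper name `copy_current_locale` is hypothetical and this is not the code from the commit.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc()ed copy of the current locale name
 * for `category`, or NULL on failure.  The caller owns the copy and must
 * free() it. */
static char *
copy_current_locale(int category)
{
    /* A NULL second argument queries the locale without changing it. */
    const char *current = setlocale(category, NULL);
    char *copy;
    size_t size;

    if (current == NULL)
        return NULL;

    size = strlen(current) + 1;         /* include the trailing NUL */
    copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, current, size);
    return copy;
}
```

The copy can then be edited, or kept across later `setlocale()` calls, without touching libc's buffer.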
* locale.c: Add commentsKarl Williamson2014-01-041-7/+68
| | | | | This documents much of what I learned about how things work while tracking down [perl #120723].
* locale.c: White-space onlyKarl Williamson2014-01-041-4/+4
| | | | Outdent code removed from a block by the previous commit
* locale.c: Always set state variables for a new localeKarl Williamson2014-01-041-7/+5
This function is called when a new underlying LC_NUMERIC locale has been set. If that locale is the same as the current underlying one, some setup is skipped. However, prior to this commit, more was skipped than should have been. The reason is that even if the underlying locale is the same, it could be that LC_NUMERIC has been toggled to the "C" locale, and so the information could be inconsistent. By always setting the information, we ensure consistency. This commit is a portion of the fix for [perl #120723]. Tests will be added with the final commit for it.
* Fix broken locale case-insensitive matchingKarl Williamson2013-12-031-5/+5
| | | | | | | | | | Commit 68067e4e501e2ae1c0fb44558b6aa5c0a80a4143 inadvertently broke regular expression /i matching under locale. The tests for this were defective, so the breakage was not caught. A later commit will fix the tests, but this commit restores the functionality. It also casts the input parameter to some functions to be U8 to make sure that optimizing compilers can omit bounds checks
* fix -Wsign-compare in coreDavid Mitchell2013-11-291-2/+2
| | | | | | | | | | | | | There were a few places that were doing unsigned_var = cond ? signed_val : unsigned_val; or similar. Fixed by suitable casts etc. The four in utf8.c were fixed by assigning to an intermediate unsigned var; this has the happy side-effect of collapsing a large macro expansion, where toUPPER_LC() etc evaluate their arg multiple times.
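The pattern and one of the fixes mentioned above (an intermediate unsigned variable) can be illustrated as follows. The function name `pick_length` and its parameters are invented for this example and do not come from the Perl source.

```c
#include <stddef.h>

/* Hypothetical illustration.  The problematic form would be
 *
 *     result = have_limit ? signed_limit : actual;
 *
 * where the two arms of ?: differ in signedness and gcc warns under
 * -Wsign-compare.  Converting once into an unsigned intermediate makes
 * both arms the same type. */
static size_t
pick_length(int have_limit, int signed_limit, size_t actual)
{
    /* Assumes the caller guarantees signed_limit >= 0. */
    const size_t limit = (size_t) signed_limit;

    return have_limit ? limit : actual;
}
```

When the value comes from a macro that evaluates its argument several times, assigning it to an intermediate first also means the expansion happens only once.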
* PATCH: [perl #119443] Blead won't compile on winceKarl Williamson2013-08-231-4/+13
| | | | | | This commit adds #if's to cause locale handling code to compile on platforms that don't have full-featured locale handling. The commits mentioned in the ticket did not adequately cover these situations.
* locale.c: Rmv unused variableKarl Williamson2013-08-121-1/+0
|
* Assume UTF-8 locale if that string occurs anywhere in nameKarl Williamson2013-08-121-11/+23
When a platform doesn't have nl_langinfo(), heuristics are employed to see if a locale is UTF-8. The first heuristic is looking at the return value of setlocale(), which generally is the locale name. However, in actuality the return value is opaque and can't be relied on to signify the locale. Nevertheless if it contains the string UTF-8 (ignoring case, and with the hyphen optional), it is a safe bet that the locale is indeed UTF-8. Prior to this patch, we only looked at the end of the name for "UTF-8". This patch makes it not have to be right-anchored. There are UTF-8 locales on our dromedary machine with UTF-8 in the middle of their names.
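The name heuristic amounts to an unanchored, case-insensitive search for "UTF-8" or "UTF8". A minimal sketch, with a hypothetical function name and not taken from locale.c:

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if `name` contains "utf8" or "utf-8"
 * anywhere, ignoring case; 0 otherwise. */
static int
name_mentions_utf8(const char *name)
{
    const char *p;

    for (p = name; *p != '\0'; p++) {
        const char *q = p;

        /* Each index is read only after the previous byte was found to be
         * non-NUL, so this never reads past the terminator. */
        if (tolower((unsigned char) q[0]) != 'u') continue;
        if (tolower((unsigned char) q[1]) != 't') continue;
        if (tolower((unsigned char) q[2]) != 'f') continue;
        q += 3;
        if (*q == '-')                  /* the hyphen is optional */
            q++;
        if (*q == '8')
            return 1;
    }
    return 0;
}
```

A name such as "fr_FR.UTF8" matches, and so does one where the marker is followed by a modifier such as "@euro"; "en_US.ISO8859-1" does not match.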
* locale.c: Add missing STATIC to fcn declKarl Williamson2013-07-191-1/+1
|
* PATCH: [perl #38193] embedded perl always calls setlocale(LC_ALL,"")Karl Williamson2013-07-091-8/+12
This commit causes the locale initialization to skip calling setlocale(foo, "") if the environment variable PERL_SKIP_LOCALE_INIT is set. Instead, the setup code calls setlocale(LC_ALL, NULL) (plus other similar calls for the subcategories) in order to find out what the current locale is. The original poster for this ticket has a workaround for it which involves using a modified copy of Perl core code. This patch defines the C preprocessor variable HAS_SKIP_LOCALE_INIT that can be used by XS writers to discover if the current Perl version needs the workaround or not. I was unable to come up with a test for this patch that did not involve building extensive infrastructure for testing embedded Perl. That does not seem worth it for such a trivial patch. I tested by hand.
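The difference between the two calls is that `setlocale(LC_ALL, "")` changes the process locale from the environment, while `setlocale(LC_ALL, NULL)` only reports the current one. A rough sketch of the decision, using a hypothetical function name and treating "set" as "present in the environment" (an assumption):

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical sketch: either query the locale the embedding program has
 * already established, or initialize it from the environment. */
static const char *
init_or_query_locale(void)
{
    if (getenv("PERL_SKIP_LOCALE_INIT") != NULL) {
        /* Query only: leaves the host application's locale untouched. */
        return setlocale(LC_ALL, NULL);
    }

    /* Default behaviour: adopt the locale named by the environment. */
    return setlocale(LC_ALL, "");
}
```

An embedder that has already configured its locale would set the variable before starting the interpreter; per the commit message, XS code can check `#ifdef HAS_SKIP_LOCALE_INIT` to find out whether the running Perl supports this.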
* PATCH: [perl #118197] Cope with non-ASCII decimal separatorsKarl Williamson2013-07-071-0/+6
| | | | | | | | This patch causes the radix string to be examined upon a new numeric locale being set. If the string isn't ASCII, and the new locale is UTF-8, it turns on the UTF-8 flag in the scalar that holds the radix. When a floating point number is formatted in Perl_sv_vcatpvfn_flags(), and the flag is on, the result's flag will be set on too.
* locale.c: Further checks for utf8ness of a localeKarl Williamson2013-07-051-0/+154
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In reality, the return value of setlocale() is documented to be opaque, so using it to determine if a locale is UTF-8 or not may not work. It is a char*, which we treat as a name. We can safely assume that if the name contains UTF-8 (or slight variations thereof), that it is a UTF-8 locale. But if the name doesn't contain that, it still could be one. In fact there are currently many locales on our dromedary machine that fall into this category. Similarly, something containing 8859 is not going to be UTF-8. This commit adds another test for cases where there is no nl_langinfo(), and the locale name isn't helpful. It looks at the currency symbol, which typically will be in the locale's script. If that is illegal UTF-8, we know for sure that the locale isn't UTF-8 (or is corrupted). If it is legal UTF-8 (but not ASCII) we can be pretty sure that the locale is UTF-8. If it is ASCII, we still don't know one way or the other, so we err on it not being UTF-8. Originally, I was going to use the locale's error message strings, returned from strerror(), the source for $!, to check for this. These are supposed to be in terms of the LC_MESSAGES locale. Chances are vanishingly small that the locale is not UTF-8 if all the messages pass a utf8ness test, provided that the messages aren't just ASCII. However, on dromedary, the messages for many of the exotic locales haven't been translated, and are still in English, which doesn't help at all. I figure that this is quite likely to be the case generally, and the currency symbol is much more likely to have been translated. I left the code in though, commented out for possible future use. Note that this test will run only on systems that don't have nl_langinfo(). 
The test can also be turned off by setting a C compiler flag -DNO_LOCALE_MONETARY, (and -DNO_LOCALE_MESSAGES for the commented-out part), corresponding to the way the other categories can be turned off (none of which is documented).
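The three-way outcome described above (illegal UTF-8, legal non-ASCII UTF-8, or plain ASCII) can be sketched as a small classifier applied to the currency symbol. All names here are hypothetical, and the validity check is deliberately simplified: it checks lead and continuation bytes but does not reject overlong encodings or surrogates.

```c
#include <locale.h>
#include <stddef.h>

enum utf8_guess { GUESS_ASCII_ONLY, GUESS_UTF8, GUESS_NOT_UTF8 };

/* Hypothetical classifier for a NUL-terminated byte string. */
static enum utf8_guess
classify_bytes(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    int saw_multibyte = 0;

    while (*p != '\0') {
        size_t extra, i;

        if (*p < 0x80) {                /* ASCII: tells us nothing */
            p++;
            continue;
        }
        if (*p >= 0xC2 && *p <= 0xDF)      extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF) extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4) extra = 3;
        else return GUESS_NOT_UTF8;     /* not a valid lead byte */

        /* A NUL fails this test too, so we never read past the end. */
        for (i = 1; i <= extra; i++) {
            if ((p[i] & 0xC0) != 0x80)
                return GUESS_NOT_UTF8;
        }
        p += extra + 1;
        saw_multibyte = 1;
    }
    return saw_multibyte ? GUESS_UTF8 : GUESS_ASCII_ONLY;
}

/* Apply the classifier to the current LC_MONETARY currency symbol. */
static enum utf8_guess
guess_from_currency_symbol(void)
{
    const struct lconv *lc = localeconv();
    return classify_bytes(lc->currency_symbol);
}
```

A caller following the commit's reasoning would treat `GUESS_ASCII_ONLY` as "unknown" and err on the side of the locale not being UTF-8.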
* locale.c: Extract out, fix, expand fcn to see if a locale is utf8Karl Williamson2013-07-051-40/+121
| | | | | | | | | | | | | | | | | | | | | There was buggy code to see if the start-up locale is UTF-8. This commit extracts it into a separate function. The bugs involved looking at the name of the locale to see if that implies a UTF-8 name. Prior to this commit, it looked at the beginning of the locale name, whereas in reality, it is at the end, as in "fr_FR.UTF8". Also, it didn't look for the documented Windows name for UTF-8 locales on those platforms. The function is expanded to have an input category to find the utf8ness of. Thus it now works on any non-LC_ALL category, not just LC_CTYPE. It is possible for categories to be in different locales, so that LC_CTYPE is in a UTF-8 locale, and LC_NUMERIC isn't. For the purposes of PERL_UNICODE, the most applicable category is LC_CTYPE, so that is the one used in its currently only call.
* locale.c: Compare apples to applesKarl Williamson2013-07-051-4/+9
| | | | | | Prior to this patch, one parameter to strNE would have been through a standardizing function, while the other had not. By standardizing both before doing the compare, we avoid false positives.
* perl.h, locale.c: White space onlyKarl Williamson2013-07-051-8/+8
| | | | This indents some nested #if's to clarify the program structure.
* locale.c: Add commentsKarl Williamson2013-07-051-3/+4
|
* update the editor hints for spaces, not tabsRicardo Signes2012-05-291-2/+2
| | | | | This updates the editor hints in our files for Emacs and vim to request that tabs be inserted as spaces.
* Don't #include headers already included by perl.hNicholas Clark2011-09-151-4/+0
097ee67dff1c60f2 didn't need to include <locale.h> in locale.c (then util.c) because it had been included by perl.h since 5.002 beta 1. 3f270f98f9305540 missed removing the include of <unistd.h> from perl.c or perlio.c. de8ca8af19546d49 changed perl.h to also include <sys/wait.h>, but didn't notice that it could therefore be removed from perl.c, pp_sys.c and util.c.
* When probing strxfrm, consider a consistent return value of 0 as saneNicholas Clark2011-09-091-1/+1
|
* Provide more information in the message for "strxfrm() gets absurd".Nicholas Clark2011-09-091-1/+2
| | | | | Prefix it with "panic", report the two lengths that caused the sanity test failure, and add the message to perldiag.pod.
* Convert some files from Latin-1 to UTF-8Keith Thompson2011-09-071-2/+2
|
* Fix typos (spelling errors) in Perl sources.Peter J. Acklam) (via RT2011-01-071-1/+1
| | | | | | | | | # New Ticket Created by (Peter J. Acklam) # Please include the string: [perl #81904] # in the subject line of all future correspondence about this issue. # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 > Signed-off-by: Abigail <abigail@abigail.be>
* Change name of ibcmp to foldEQKarl Williamson2010-06-051-8/+8
| | | | | | | | | | | | | | | | As discussed on p5p, ibcmp has different semantics from other cmp functions in that it is a binary instead of ternary function. It is less confusing then to have a name that implies true/false. There are three functions affected: ibcmp, ibcmp_locale and ibcmp_utf8. ibcmp is actually equivalent to foldNE, but for the same reason that things like 'unless' and 'until' are cautioned against, I changed the functions to foldEQ, so that the existing names, like ibcmp_utf8 are defined as macros as being the complement of foldEQ. This patch also changes the one file where turning ibcmp into a macro causes problems. It changes it to use the new name. It also documents for the first time ibcmp, ibcmp_locale and their new names.
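The naming argument can be shown with a toy version: a function whose name says it returns true on equality, plus the old name kept as a macro for its complement. The lowercase names `fold_eq` and `ib_cmp` are stand-ins chosen for this sketch, and the ASCII-only `tolower()` comparison is an assumption, not Perl's folding logic.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in: returns 1 if the first `len` bytes of `a` and `b`
 * are equal ignoring case, 0 otherwise.  A yes/no answer, unlike the
 * three-way result of strcmp()-style functions. */
static int
fold_eq(const char *a, const char *b, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (tolower((unsigned char) a[i]) != tolower((unsigned char) b[i]))
            return 0;
    }
    return 1;
}

/* The old-style name survives as the complement: true when they differ. */
#define ib_cmp(a, b, len) (!fold_eq((a), (b), (len)))
```

Callers reading `if (fold_eq(x, y, n))` no longer need to remember that a zero result from the older name meant "equal".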
* delimcopy(), ibcmp(), ibcmp_locale(), instr(), ninstr() and rninstr() from ↵Vincent Pit2009-08-271-8/+8
| | | | util.c don't need the interpreter as well
* PATCH: Large omnibus patch to clean up the JRRT quotesTom Christiansen2008-11-021-7/+9
| | | | | | Message-ID: <25940.1225611819@chthon> Date: Sun, 02 Nov 2008 01:43:39 -0600 p4raw-id: //depot/perl@34698
* Update copyright years.Nicholas Clark2008-10-251-2/+2
| | | p4raw-id: //depot/perl@34585
* assert() that every NN argument is not NULL. Otherwise we have theNicholas Clark2008-02-121-0/+7
ability to create landmines that will explode under someone in the future when they upgrade their compiler to one with better optimisation. We've already done this at least twice. (Yes, some of the assertions are after code that would already have SEGVd because it already dereferences a pointer, but they are put in to make it easier to automate checking that each and every case is covered.) Add a tool, checkARGS_ASSERT.pl, to check that every case is covered. p4raw-id: //depot/perl@33291
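The idea is one assertion macro per function, placed at the top of the body so that a script can check mechanically that none is missing. The macro and function names below are hypothetical and only mimic the general shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-function macro: one assert per non-NULL argument. */
#define MY_ARGS_ASSERT_COUNT_CHAR  assert(s != NULL)

/* Counts occurrences of `c` in `s`.  `s` is declared "not NULL" by
 * contract, and the macro enforces that in debugging builds. */
static size_t
count_char(const char *s, char c)
{
    size_t n = 0;

    MY_ARGS_ASSERT_COUNT_CHAR;

    for (; *s != '\0'; s++) {
        if (*s == c)
            n++;
    }
    return n;
}
```

Without the assertion, a compiler that trusts a non-NULL annotation may optimise away later NULL checks, which is the "landmine" the commit message refers to.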
* Fix up copyright years for files modified in 2007.Nicholas Clark2007-11-071-2/+2
| | | p4raw-id: //depot/perl@32237
* strxfrm() returns a size_t, not a ssize_t. See:Devin Heitmueller2007-04-261-2/+2
| | | | | | | Subject: locale.c usage of strxfrm From: "Devin Heitmueller" <devin.heitmueller@gmail.com> Message-ID: <412bdbff0704201520i7aac0189n74f0cef5c5213f41@mail.gmail.com> p4raw-id: //depot/perl@31092
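Because `strxfrm()` returns a `size_t`, there is no negative value to test for; the result is a length to compare against the buffer size. A minimal two-pass sketch, with a hypothetical helper name:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc()ed, strxfrm()-transformed copy of
 * `s` for the current LC_COLLATE locale, or NULL on failure. */
static char *
xfrm_dup(const char *s)
{
    /* First pass: with a size of 0 the destination may be NULL, and the
     * return value is the length needed, excluding the trailing NUL. */
    size_t needed = strxfrm(NULL, s, 0);
    char *buf = malloc(needed + 1);

    if (buf == NULL)
        return NULL;

    /* Second pass: a result >= the buffer size would mean truncation. */
    if (strxfrm(buf, s, needed + 1) > needed) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Two transformed strings can then be compared with plain `strcmp()` and will order the same way `strcoll()` would order the originals.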
* Turn on UTF8 cache assertions with -CaNicholas Clark2006-04-171-0/+2
| | | p4raw-id: //depot/perl@27875
* locale.c: more Safefree() (Coverity finding)Jarkko Hietaniemi2006-04-111-0/+6
| | | | | | Message-Id: <200604111908.k3BJ8ewn030950@kosh.hut.fi> Date: Tue, 11 Apr 2006 22:08:40 +0300 (EEST) p4raw-id: //depot/perl@27769
* Re: [PATCH] locale.c: Coverity findingJarkko Hietaniemi2006-04-091-0/+3
| | | | | Message-ID: <4438B854.6040301@gmail.com> p4raw-id: //depot/perl@27750
* unused context warningsAndy Lester2006-02-241-0/+1
| | | | | Message-ID: <20060221062711.GA16160@petdance.com> p4raw-id: //depot/perl@27300
* Re: [PATCH] s/Null(gv|hv|sv)/NULL/gSteven Schubiger2006-02-031-2/+2
| | | | | | Message-ID: <20060203152449.GI12591@accognoscere.homeunix.org> Date: Fri, 3 Feb 2006 16:24:49 +0100 p4raw-id: //depot/perl@27065
* Re: [PATCH] s/Null(av|ch)/NULL/gSteven Schubiger2006-02-021-6/+6
| | | | | Message-ID: <20060202093849.GD12591@accognoscere.homeunix.org> p4raw-id: //depot/perl@27054