author | Karl Williamson <public@khwilliamson.com> | 2011-10-16 09:04:15 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2011-10-17 17:04:28 -0600 |
commit | 86510fb15deab424253c3d02df84022ab415f805 (patch) | |
tree | ff56ef9c1f7332c47dbf33e6b5a75ca49d85212a /pp.c | |
parent | 16d951b76390ab0f9abe13820faa4a00c05fd647 (diff) | |

pp.c: Remove disabled code for context sensitive lc

This code was always #ifdef'd out. It would have been used to convert
to a Greek final sigma from a non-final one, depending on context. The
problem is that we can't know algorithmically if a final sigma is in
order or not. I excerpt this quote, that I find persuasive, from
correspondence from Father Chrysostomos, who knows Greek:

"I cannot see how any algorithm can know to get it right.

"The letter σ (or Σ in capitals) represents the number 200 in Greek
numerals. Those are not just ancient Greek numerals, but are used on a
regular basis even in modern Greek. In many printed books ς is used in
place of ϛ, which represents the number 6. So if casefolding should
change ͵ΑΣʹ to ͵αςʹ, or if an output layer changes ͵ασʹ similarly, it
will be changing the number (from 1200 to 1006). You can’t get around
it by checking for the Greek numeral sign (ʹ), as sometimes the tonos
(΄), oxeia (´), or even the ASCII straight quote is used. And often in
lists or chapter titles a dot is used instead of numeral sign.

"Also, σ is commonly used at the ends of abbreviations. Changing ‘βλέπε
σ. 16’ (‘see page 16’) to ‘βλέπε ς. 16’ is not acceptable.

"So, no, I don’t think a programming language should be fiddling with σ
versus ς. (A word processor is another matter.)"

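
The arithmetic in the quote can be checked with a small sketch. This is not code from pp.c. The helper names `toy_letter_value` and `toy_numeral_value` are hypothetical, and the value table covers only the letters the quote mentions. It assumes U+0375 marks the following letter as thousands, and it treats ς as 6 on the strength of the quote's remark that ς is often printed in place of ϛ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: toy table with only the letters discussed in the quote.
 * Anything else (including the numeral sign) contributes 0. */
static unsigned toy_letter_value(uint32_t cp)
{
    switch (cp) {
    case 0x03B1: return 1;    /* GREEK SMALL LETTER ALPHA */
    case 0x03DB: return 6;    /* GREEK SMALL LETTER STIGMA */
    case 0x03C2: return 6;    /* FINAL SIGMA printed in place of stigma */
    case 0x03C3: return 200;  /* GREEK SMALL LETTER SIGMA */
    default:     return 0;
    }
}

/* Sum the letters of a numeral given as code points.  The letter that
 * directly follows U+0375 GREEK LOWER NUMERAL SIGN counts as thousands. */
static unsigned toy_numeral_value(const uint32_t *cps, size_t len)
{
    unsigned total = 0;
    unsigned scale = 1;

    assert(cps != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (cps[i] == 0x0375) {
            scale = 1000;
            continue;
        }
        total += scale * toy_letter_value(cps[i]);
        scale = 1;
    }
    return total;
}
```

With these assumptions, the code points for ͵ασʹ (U+0375, U+03B1, U+03C3, U+0374) sum to 1200, while the same sequence with U+03C2 in place of U+03C3 sums to 1006, which is the change the quote warns about.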
Diffstat (limited to 'pp.c')
-rw-r--r-- | pp.c | 74 |
1 files changed, 4 insertions, 70 deletions
@@ -4132,76 +4132,12 @@ PP(pp_lc)
 const STRLEN u = UTF8SKIP(s);
 STRLEN ulen;
-#ifndef CONTEXT_DEPENDENT_CASING
 toLOWER_utf8(s, tmpbuf, &ulen);
-#else
-/* This is ifdefd out because it probably is the wrong thing to do. The right
- * thing is probably to have an I/O layer that converts final sigma to regular
- * on input and vice versa (under the correct circumstances) on output. In
- * effect, the final sigma is just a glyph variation when the regular one
- * occurs at the end of a word. And we don't really know what's going to be
- * the end of the word until it is finally output, as splitting and joining can
- * occur at any time and change what once was the word end to be in the middle,
- * and vice versa. */
-
- const UV uv = toLOWER_utf8(s, tmpbuf, &ulen);
-
- /* If the lower case is a small sigma, it may be that we need
- * to change it to a final sigma. This happens at the end of
- * a word that contains more than just this character, and only
- * when we started with a capital sigma. */
- if (uv == UNICODE_GREEK_SMALL_LETTER_SIGMA &&
- s > send - len && /* Makes sure not the first letter */
- utf8_to_uvchr(s, 0) == UNICODE_GREEK_CAPITAL_LETTER_SIGMA
- ) {
-
- /* We use the algorithm in:
- * http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf (C
- * is a CAPITAL SIGMA): If C is preceded by a sequence
- * consisting of a cased letter and a case-ignorable
- * sequence, and C is not followed by a sequence consisting
- * of a case ignorable sequence and then a cased letter,
- * then when lowercasing C, C becomes a final sigma */
-
- /* To determine if this is the end of a word, need to peek
- * ahead. Look at the next character */
- const U8 *peek = s + u;
-
- /* Skip any case ignorable characters */
- while (peek < send && is_utf8_case_ignorable(peek)) {
- peek += UTF8SKIP(peek);
- }
- /* If we reached the end of the string without finding any
- * non-case ignorable characters, or if the next such one
- * is not-cased, then we have met the conditions for it
- * being a final sigma with regards to peek ahead, and so
- * must do peek behind for the remaining conditions. (We
- * know there is stuff behind to look at since we tested
- * above that this isn't the first letter) */
- if (peek >= send || ! is_utf8_cased(peek)) {
- peek = utf8_hop(s, -1);
-
- /* Here are at the beginning of the first character
- * before the original upper case sigma. Keep backing
- * up, skipping any case ignorable characters */
- while (is_utf8_case_ignorable(peek)) {
- peek = utf8_hop(peek, -1);
- }
+ /* Here is where we would do context-sensitive actions. See
+ * the commit message for this comment for why there isn't any
+ */
- /* Here peek points to the first byte of the closest
- * non-case-ignorable character before the capital
- * sigma. If it is cased, then by the Unicode
- * algorithm, we should use a small final sigma instead
- * of what we have */
- if (is_utf8_cased(peek)) {
- STORE_UNI_TO_UTF8_TWO_BYTE(tmpbuf,
- UNICODE_GREEK_SMALL_LETTER_FINAL_SIGMA);
- }
- }
- }
- else { /* Not a context sensitive mapping */
-#endif /* End of commented out context sensitive */
 if (ulen > u && (SvLEN(dest) < (min += ulen - u))) {
 /* If the eventually required minimum size outgrows
@@ -4218,9 +4154,7 @@ PP(pp_lc)
 SvGROW(dest, min);
 d = (U8*)SvPVX(dest) + o;
 }
-#ifdef CONTEXT_DEPENDENT_CASING
- }
-#endif
+
 /* Copy the newly lowercased letter to the output buffer we're
 * building */
 Copy(tmpbuf, d, ulen, U8);
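
For readers who want to see the removed rule in isolation, here is a minimal standalone sketch of the look-behind and look-ahead test described in the deleted comments. It is not the Perl implementation. It works on an array of code points rather than UTF-8, and `toy_is_cased`, `toy_is_case_ignorable`, and `toy_lower_sigma` are hypothetical helpers whose tables cover only a handful of characters; real code would consult the full Unicode Cased and Case_Ignorable properties.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: only basic Greek letters count as cased in this toy. */
static bool toy_is_cased(uint32_t cp)
{
    return (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2)  /* capitals */
        || (cp >= 0x03B1 && cp <= 0x03C9);                 /* smalls   */
}

/* ASSUMPTION: a few case-ignorable characters named in the commit message. */
static bool toy_is_case_ignorable(uint32_t cp)
{
    return cp == 0x0374   /* GREEK NUMERAL SIGN */
        || cp == 0x0384   /* GREEK TONOS */
        || cp == 0x00B4   /* ACUTE ACCENT */
        || cp == 0x0027   /* APOSTROPHE */
        || cp == 0x002E;  /* FULL STOP */
}

/* Lowercase the CAPITAL SIGMA at cps[i]: returns U+03C2 (final sigma)
 * when the context test passes, otherwise U+03C3 (ordinary sigma). */
static uint32_t toy_lower_sigma(const uint32_t *cps, size_t len, size_t i)
{
    assert(i < len && cps[i] == 0x03A3);

    /* Look behind: skip case-ignorable characters, then require a
     * cased letter. */
    size_t j = i;
    while (j > 0 && toy_is_case_ignorable(cps[j - 1]))
        j--;
    if (j == 0 || !toy_is_cased(cps[j - 1]))
        return 0x03C3;

    /* Look ahead: skip case-ignorable characters; a cased letter after
     * them means the sigma is in the middle of a word. */
    size_t k = i + 1;
    while (k < len && toy_is_case_ignorable(cps[k]))
        k++;
    if (k < len && toy_is_cased(cps[k]))
        return 0x03C3;

    return 0x03C2;
}
```

Under these assumptions the sketch returns U+03C2 for the last letter of ΟΔΟΣ, as intended, but it also returns U+03C2 for the Σ in ͵ΑΣʹ, because the numeral sign is case-ignorable and nothing cased follows it. That second result is the numeral corruption the commit message describes.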