path: root/ext/mbstring
Commit message | Author | Age | Files | Lines
* Generate class entries from stubs for ldap, libxml, mbstring and mysqli | Máté Kocsis | 2021-02-16 | 2 | -2/+2
    Closes GH-6684
* Remove stray mentions of mbstring.func_overload | Max Semenik | 2021-02-15 | 4 | -4/+0
    This feature has been completely removed.

    Closes GH-6688.
* Deprecate passing null to non-nullable arg of internal function | Nikita Popov | 2021-02-11 | 5 | -66/+27
    This deprecates passing null to non-nullable scalar arguments of internal functions, with the eventual goal of making the behavior consistent with userland functions, where null is never accepted for non-nullable arguments.

    This change is expected to cause quite a lot of fallout. In most cases, calling code should be adjusted to avoid passing null. In some cases, PHP should be adjusted to make some function arguments nullable. I have already fixed a number of functions before landing this, but feel free to file a bug if you encounter a function that doesn't accept null, but probably should. (The rule of thumb for this to be applicable is that the function must have special behavior for 0 or "", which is distinct from the natural behavior of the parameter.)

    RFC: https://wiki.php.net/rfc/deprecate_null_to_scalar_internal_arg

    Closes GH-6475.
* Update 'East Asian Width' table to comply with Unicode 13.0 | Alex Dowad | 2021-01-19 | 4 | -54/+189
    Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium.

    Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete.

    At the same time, make a couple of spelling/grammar fixes in ucgendat.php.
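As a rough illustration of generating a width table from EastAsianWidth.txt, the sketch below parses lines in that file's `range;class # comment` layout and keeps the ranges classed as W (wide) or F (fullwidth). It is written in Python rather than as part of ucgendat.php; the function names, the embedded sample lines, and the assumption that exactly the W and F classes count as double-width are all illustrative, not taken from the commit.

```python
def parse_wide_ranges(lines):
    """Collect (first, last) codepoint ranges whose width class is W or F."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()      # drop the trailing comment
        if not data:
            continue
        codepoints, width_class = (field.strip() for field in data.split(";"))
        if width_class not in ("W", "F"):         # assumption: only W and F are double-width
            continue
        first, _, last = codepoints.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return sorted(ranges)


def display_width(text, wide_ranges):
    """Count 2 columns for characters in a wide range, 1 for everything else."""
    return sum(
        2 if any(lo <= ord(ch) <= hi for lo, hi in wide_ranges) else 1
        for ch in text
    )


# A few hand-written sample lines in the EastAsianWidth.txt layout.
SAMPLE = [
    "# EastAsianWidth sample",
    "0041..005A;Na   # Lu  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "3041..3096;W    # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE",
    "FF01..FF03;F    # Po   [3] FULLWIDTH EXCLAMATION MARK..FULLWIDTH NUMBER SIGN",
    "3000;F          # Zs       IDEOGRAPHIC SPACE",
]
```

Regenerating the table from the published data file means a Unicode upgrade only requires swapping in the new EastAsianWidth.txt, which is the maintenance benefit the commit message describes.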
* Remove useless constant MBFL_ENCTYPE_MBCS | Alex Dowad | 2021-01-15 | 30 | -49/+42
    This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.
* Remove unused macros from mbfilter_cp51932.c, mbfilter_iso2022jp_mobile.c | Alex Dowad | 2021-01-15 | 2 | -14/+0
* Remove useless mbstring encoding 'JIS-ms' | Alex Dowad | 2021-01-15 | 4 | -198/+13
    Microsoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called CP50220, CP50221, and CP50222. All three are supported by mbstring. Since these encodings are very similar, some code can be shared. Actually, conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when converting from Unicode to CP50220/1/2 that some small differences arise in how certain katakana are handled.

    The most important common code was a function called `mbfl_filt_wchar_jis_ms`. The `jis_ms` part doubtless refers to the fact that these encodings are modified versions of 'JIS' invented by 'MS'.

    mbstring also went a step further and exported 'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested 'JIS-ms' conversion, they got something like CP50220/1/2, minus their special ways of handling half-width katakana when converting from Unicode.

    But... that 'encoding' is not something which actually exists in the world outside of mbstring. CP50220/1/2 do exist in Microsoft software, but not 'JIS-ms'.

    For a text encoding conversion library, inventing new variant encodings and implementing them is not very productive. Our interest is in handling text encodings which real people actually use for... you know, storing actual text and things like that.
* Remove useless mbstring encoding 'CP50220-raw' | Alex Dowad | 2021-01-15 | 4 | -61/+8
    CP50220 is a variant of ISO-2022-JP invented by Microsoft, which handles some Unicode characters which are not representable in ISO-2022-JP by converting them to similar characters which are representable.

    What, then, is CP50220-raw? An Internet search turns up absolutely nothing. Reference works which I consulted don't say anything about it. Other text conversion libraries don't support it.

    From looking at the code: It's just the same as CP50220, but it accepts unmapped JIS X 0208 characters passed through from other Japanese encodings and silently encodes them using the usual ISO-2022-JP escape sequence and representation for JIS X 0208 characters.

    It's hard to see how this could be useful. OK, let me come out and say it: it's _not_ useful. We can confidently jettison this (mis)feature.
* CP5022{0,1,2}: treat truncated multibyte characters as error | Alex Dowad | 2021-01-15 | 2 | -3/+20
* Add test suite for CP5022{0,1,2} | Alex Dowad | 2021-01-15 | 1 | -0/+286
* Replace zend_bool uses with bool | Nikita Popov | 2021-01-15 | 2 | -20/+20
    We're starting to see a mix between uses of zend_bool and bool. Replace all usages with the standard bool type everywhere.

    Of course, zend_bool is retained as an alias.
* Print "interned" instead of fake refcount in debug_zval_dump() | Nikita Popov | 2021-01-15 | 1 | -41/+41
    debug_zval_dump() currently prints refcount 1 for interned strings and arrays, which does not really reflect the truth. These values are not refcounted, so the refcount is misleading. Instead print an "interned" tag.

    Closes GH-6598.
* CP5022{0,1,2}: treat unrecognized escapes as error | Alex Dowad | 2021-01-15 | 1 | -4/+4
* CP5022{0,1,2}: use JISX0201 for U+203E (overline) | Alex Dowad | 2021-01-15 | 1 | -11/+7
    Same issue as d497c0e96f addressed for JIS7/JIS8, but for CP5022{0,1,2} this time.
* CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly | Alex Dowad | 2021-01-15 | 1 | -31/+28
    Unicode has a range of 'private' codepoints which individual applications can use for their own purposes. When they were inventing CP932, Microsoft mapped these 'private' or 'user' codepoints to twenty new rows added to the JIS X 0208 character table. (JIS X 0208 is based on a 94x94 table; MS used rows 95-114 for private characters.)

    `mbfl_filt_conv_wchar_jis_ms` converted these private codepoints to rows 85-94 rather than 95-114. The code included a link to a document on the OpenGroup web site, dating back to 1996 [1], which proposed mapping private codepoints to these rows. However, that is not consistent with what mbstring does when converting CP5022x to Unicode.

    There seems to be a dearth of information on CP5022x on the web. However, I did find one (Japanese-language) page on CP50221, which states that it maps kuten codes 0x7F21-0x927E to the 'private' Unicode codepoints [2].

    As a side note, using rows higher than 95 does seem to defeat one purpose of using an ISO-2022-JP variant: ISO-2022-JP was specifically designed to be "7-bit clean", but once you go beyond row 95, the ku codes are 0x80 and up, so 8 bits are needed.

    [1] https://web.archive.org/web/20000229180004/http://www.opengroup.or.jp/jvc/cde/ucs-conv.html
    [2] https://www.wdic.org/w/WDIC/Microsoft%20Windows%20Codepage%20%3A%2050221
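The row arithmetic in this message can be checked with a short Python sketch. It assumes the private-use range is laid out row by row, 94 cells per row, starting at row 95, and that a kuten code is formed by adding 0x20 to each of the row and cell numbers; the function names are hypothetical and this is not mbstring's implementation.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xE757   # the 'user' area named in the commit subject
FIRST_USER_ROW = 95                    # first row beyond the 94x94 JIS X 0208 table


def pua_to_kuten(codepoint):
    """Map a private-use codepoint to a (row, cell) pair in rows 95-114."""
    if not PUA_FIRST <= codepoint <= PUA_LAST:
        raise ValueError("codepoint is outside the user-defined range")
    row, cell = divmod(codepoint - PUA_FIRST, 94)
    return FIRST_USER_ROW + row, cell + 1


def kuten_to_code(ku, ten):
    """Form the two-byte code by adding 0x20 to the row and the cell number."""
    return ((ku + 0x20) << 8) | (ten + 0x20)
```

Under these assumptions U+E000 lands on row 95, cell 1 (code 0x7F21) and U+E757 on row 114, cell 94 (code 0x927E), matching the 0x7F21-0x927E range quoted from [2]. Row 96 already yields a first byte of 0x80, which is the 7-bit problem noted in the side note.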
* CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly | Alex Dowad | 2021-01-15 | 1 | -3/+3
    Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS. As Shift-JIS and ISO-2022-JP both encode characters from the JIS X 0208 charset, CP932 and CP5022x both encode characters from JIS X 0208 _plus_ extra characters added as Microsoft vendor extensions. Among the added characters are a number of symbols which MS put in the 13th row of the 94x94 character table. (In JIS X 0208, that row is empty.)

    mbfilter_cp50220x.c had an `if` clause which was intended to handle the conversion of characters in that 13th row, but it was dead code, as the previous clause was always true in those cases. The solution is to reverse the order of those two clauses (just as they already appeared in mbfilter_cp932.c).
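The dead-code problem described here is the general pattern of a broad condition shadowing a narrower one that follows it. The Python sketch below shows that pattern only; the conditions and return labels are invented for illustration and are not the actual checks in mbfilter_cp50220x.c.

```python
def classify_shadowed(ku_byte):
    """Buggy ordering: the general clause is tested first."""
    if 0x21 <= ku_byte <= 0x7E:       # true for every valid row byte, including 0x2D
        return "jisx0208"
    elif ku_byte == 0x2D:             # dead code: can never be reached
        return "vendor-row-13"
    return "invalid"


def classify_fixed(ku_byte):
    """Fixed ordering: the specific clause is tested before the general one."""
    if ku_byte == 0x2D:               # 0x2D - 0x20 = 13, i.e. the 13th row
        return "vendor-row-13"
    elif 0x21 <= ku_byte <= 0x7E:
        return "jisx0208"
    return "invalid"
```

With the clauses swapped, the row-13 branch becomes reachable while every other input is classified exactly as before.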
* Stricter handling of erroneous input when converting CP5022{0,1,2} text encoding | Alex Dowad | 2021-01-15 | 1 | -10/+2
    Don't allow escape sequences to start in the middle of a multibyte character. Also, don't silently pass through illegal bytes which appear where the 2nd byte of a multibyte character should be.
* JIS7/JIS8 encoding: treat truncated multibyte characters as error | Alex Dowad | 2021-01-14 | 2 | -2/+24
* JIS7/JIS8 encoding: handle invalid 2nd byte for Kanji correctly | Alex Dowad | 2021-01-14 | 4 | -4/+6527
    Previously, in ISO-2022-JP/JIS7/JIS8, if an escape sequence (starting with 0x1B) appeared where the 2nd byte of a multibyte character should have been, mbstring would forget all about the truncated multibyte character and happily accept the escape sequence. However, such sequences are not legal and should be flagged as errors.

    Also, any other illegal bytes appearing where the 2nd byte of a multibyte character was expected were just passed through quietly to the output. Fix that.

    Also add a test suite for both ISO-2022-JP and JIS7/JIS8. (These are extremely similar encodings; JIS7 and JIS8 are variants of ISO-2022-JP. mbstring's 'JIS' is actually a combination of JIS7 _and_ JIS8, since the extensions which each one adds to ISO-2022-JP are disjoint.)
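A minimal Python sketch of the stricter behavior follows. It models only two modes (ASCII and two-byte JIS X 0208) and two escape sequences, and it reports problems as `("error",)` tokens; the token format, the names, and the choice to consume an illegal second byte together with the error are assumptions made for illustration, not a description of mbstring's filter.

```python
ESC = 0x1B
ASCII, JISX0208 = "ascii", "jisx0208"
ESCAPES = {b"(B": ASCII, b"$B": JISX0208}   # only two escapes, for brevity


def scan(data):
    """Tokenize ISO-2022-JP-like bytes, flagging truncated characters as errors."""
    mode, pending, tokens, i = ASCII, None, [], 0
    while i < len(data):
        byte = data[i]
        if pending is not None:
            if 0x21 <= byte <= 0x7E:
                tokens.append(("kanji", pending, byte))
                pending = None
                i += 1
                continue
            tokens.append(("error",))        # first byte had no valid partner
            pending = None
            if byte != ESC:
                i += 1                       # consume the illegal byte with the error
                continue
        if byte == ESC:
            target = ESCAPES.get(bytes(data[i + 1:i + 3]))
            if target is None:
                tokens.append(("error",))    # unrecognized escape sequence
                i += 1
            else:
                mode = target
                i += 3
        elif mode == JISX0208 and 0x21 <= byte <= 0x7E:
            pending = byte                   # wait for the second byte
            i += 1
        elif byte < 0x80:
            tokens.append(("ascii", byte))
            i += 1
        else:
            tokens.append(("error",))
            i += 1
    if pending is not None:
        tokens.append(("error",))            # input ended in the middle of a character
    return tokens
```

For example, `scan(b"\x1b$B\x30\x1b(BA")` reports one error for the character cut short by the escape, then honors the escape and yields the ASCII `A`.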
* JIS7/JIS8 encoding: use JISX0201 for U+203E (overline) | Alex Dowad | 2021-01-14 | 1 | -2/+2
    In other legacy Japanese encodings like Shift-JIS, we are now using a specific JISX 0208 character for the Unicode overline (U+203E). Previously, the single byte 0x7E was used, but an ASCII 0x7E does not represent an overline, so this was changed.

    However, JIS7/JIS8 can represent characters in the JISX 0201 character set as well. That character set also includes an overline character, which takes fewer bytes to encode than the corresponding JISX 0208 character, so we'll use it.

    This is what mbstring had been doing for a long time; but it changed as a side effect of the recent changes to how U+203E is encoded in Shift-JIS, etc. So change it back.
* JIS7/JIS8 encoding: treat unrecognized escapes as error | Alex Dowad | 2021-01-14 | 1 | -4/+4
* Add comment explaining why ISO-2022-JP-2004, etc strings end with ESC ( B | Alex Dowad | 2021-01-14 | 1 | -1/+3
    These encodings have multiple modes which can be selected via escape sequences. The default starting mode is ASCII. If a string _ends_ in a different mode, we emit a 'redundant' escape sequence to switch back to ASCII.

    If the resulting string is never concatenated with other strings, that extra escape sequence serves no purpose. But if the resulting string is concatenated with other strings of the same encoding, it ensures that the resulting string will be valid.
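The concatenation argument can be demonstrated with Python's built-in `iso2022_jp` codec, which likewise ends its output with ESC ( B after Japanese text. This uses Python's codec as a stand-in, not mbstring, and the helper names are hypothetical.

```python
ESC_TO_ASCII = b"\x1b(B"   # the escape sequence that switches back to ASCII


def encode_jis(text):
    """Encode with Python's ISO-2022-JP codec (a stand-in for mbstring here)."""
    return text.encode("iso2022_jp")


def decodes_as(data, expected):
    """Return True only if data decodes cleanly to the expected text."""
    try:
        return data.decode("iso2022_jp") == expected
    except UnicodeDecodeError:
        return False


japanese = encode_jis("日本語")
latin = encode_jis("abc")

# With the trailing escape, the two encoded strings can simply be joined.
joined = japanese + latin

# Without it, the bytes of "abc" would be read in the two-byte mode instead.
without_reset = japanese[: -len(ESC_TO_ASCII)] + latin
```

Here `joined` decodes back to the original text, while `without_reset` does not, because the decoder is still in the two-byte mode when it reaches the Latin letters.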
* ISO-2022-JP-2004 conversion: handle invalid characters correctly | Alex Dowad | 2021-01-14 | 5 | -4/+18678
* ISO-2022-JP-2004 conversion: treat unrecognized escapes as error | Alex Dowad | 2021-01-14 | 1 | -4/+4
* ISO-2022-JP-2004 conversion: represent backslash and tilde as ASCII | Alex Dowad | 2021-01-14 | 1 | -2/+8
    This issue dates back to some commits I merged recently, which made encodings like Shift-JIS-2004 use appropriate JIS X 0208 characters to represent backslashes and tildes, rather than single-byte characters which are used in those encodings with a different meaning (for example, in these encodings, 0x5C is used for a halfwidth Yen sign, rather than a backslash).

    There was an unintended side effect: ISO-2022-JP-2004 was also made to represent backslashes and tildes using JIS X 0208 characters. However, ISO-2022-JP explicitly includes ASCII as one of its selectable character sets, and ISO-2022-JP-2004 is just an extension of ISO-2022-JP. So when converting text to ISO-2022-JP-2004, we can convert Unicode backslashes and tildes to ASCII rather than using the corresponding JIS X 0208 characters.
* Make convert_to_*_ex simple aliases of convert_to_* | Nikita Popov | 2021-01-14 | 1 | -1/+1
    Historically, the _ex variants separated the zval first, if a conversion was necessary. This distinction no longer makes sense since PHP 7.

    The only difference that was still left is that _ex checked whether the type is the same first, but the usage of these macros did not actually distinguish on whether such an inlined check is valuable or not in a given context.

    Also drop the unused convert_to_explicit_type macros.
* Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants | Alex Dowad | 2020-11-25 | 7 | -3/+14
    Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron. As for Shift-JIS-2004, it has an added character (byte sequence 0x854A) which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.
* Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants | Alex Dowad | 2020-11-25 | 7 | -17/+4
    By entering this character in the JIS X 0208 conversion table, we can remove a bunch of explicit `if` clauses in different conversion filters.

    It also means that U+FF5E can be converted into SJIS-mac now; I don't know why this one SJIS variant rejected U+FF5E before, since 0x8160 means the same thing in SJIS-mac as the others.
* Convert U+203E (OVERLINE) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants | Alex Dowad | 2020-11-25 | 9 | -9/+14
    Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E represents a tilde. For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable, since 0x7E does represent an overline/macron in those encodings.

    Follow the same principle in CP51932, which is closely related to CP932.
* 0x7E is not a tilde in Shift-JIS{,-2004} | Alex Dowad | 2020-11-25 | 4 | -5/+10
* 0x5C is not a Yen sign in CP932 (or CP51932) | Alex Dowad | 2020-11-25 | 4 | -4/+9
    When Microsoft created CP932 (their version of Shift-JIS), they explicitly used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201 characters. So when converting Unicode to CP932, it is not correct to convert U+00A5 to CP932 0x5C.

    Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN character which we can use instead.

    CP51932 uses the same extended character set as CP932; while CP932 is Microsoft's extended version of Shift-JIS, CP51932 is their extended version of EUC-JP. So the same reasoning applies to CP51932.
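Both facts this message relies on can be observed with Python's built-in `cp932` codec, used here as an independent reference rather than as a model of mbstring: byte 0x5C decodes to an ASCII backslash, and the fullwidth Yen sign (U+FFE5) has its own two-byte code. The constant names are illustrative.

```python
BACKSLASH = "\\"          # U+005C
FULLWIDTH_YEN = "\uffe5"  # U+FFE5 FULLWIDTH YEN SIGN

# In CP932 the low half is plain ASCII, so 0x5C is a backslash, not a Yen sign.
decoded_5c = b"\x5c".decode("cp932")

# The fullwidth Yen sign is a separate double-byte character.
yen_bytes = FULLWIDTH_YEN.encode("cp932")
```

Since 0x5C is already taken by the backslash, the double-byte character is the only Yen sign CP932 offers, which is why the commit maps U+00A5 to it.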
* 0x5C is not a backslash in Shift-JIS-2004 | Alex Dowad | 2020-11-25 | 2 | -2/+5
    Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we can convert ASCII backslash to SJIS-2004 backslash instead.

    From time immemorial, there has been confusion around the treatment of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201 specified that 0x5C means a yen sign, and thus fonts on Japanese systems, including early versions of Windows, displayed a 0x5C byte as a yen sign. This meant that when ASCII text files were displayed on such systems, what were meant to be backslashes would appear as yen signs. Japanese C programmers could write character escapes using yen signs, and C compilers built on the assumption that the input was ASCII would interpret these escapes as desired. Likewise for shell scripts. Et cetera, et cetera...

    Therefore, if the input to `mb_convert_encoding` is (for example) a C program, and after converting to Shift-JIS-2004, the user wishes to feed the output into a C compiler, *then* perhaps ASCII 0x5C should be mapped to SJIS 0x5C. However, this scenario is ridiculous and will never happen.

    A more realistic scenario might be: an article written in SJIS-2004 has embedded Windows file paths (like 'C:\Program Files'), with yen signs used as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the path separators will be 'fixed' by the conversion.

    For general written texts, it is much better to convert backslashes to... backslashes. And yen signs, to yen signs.
* Enhance handling of CP51932 encoding | Alex Dowad | 2020-11-25 | 3 | -20/+7741
    - Don't pass 'control' characters through in the middle of a multi-byte char
    - Treat truncated multi-byte characters as an error
* Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS) | Alex Dowad | 2020-11-25 | 6 | -529/+1534
    Lots of problems here.

    - Don't pass 'control' characters through silently in the middle of a multi-byte character.
    - Treat it as an error if a multi-byte character is truncated.
    - For ESC sequences used to encode emoji on earlier Softbank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`.
    - In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
    - When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of Microsoft vendor extensions.

    Some vendor-specific emoji have been mapped to standard Unicode codepoints now, rather than 'private use area' codepoints. When the legacy code was written, these codepoints may not have existed yet in the Unicode standard which was current at that time.

    Also do a major code cleanup -- remove dead code, rearrange what is left, use some new macros and helper functions to make the code clearer...
* Combine MBFL_ENCTYPE_MWC2{BE,LE} constants | Alex Dowad | 2020-11-25 | 2 | -5/+4
    These constants indicate that a text encoding uses 2+ bytes for each character, and is either big endian or little endian (respectively). But nothing in mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and MBFL_ENCTYPE_MWC2LE.

    (Actually, nothing cares about whether these flags are set at all... maybe we should just remove them?)
* Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants | Alex Dowad | 2020-11-25 | 7 | -33/+26
    These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.
* Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through | Alex Dowad | 2020-11-25 | 12 | -89/+13
    Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when converting to another Japanese encoding.
* Don't pass invalid JIS X 0208 characters through | Alex Dowad | 2020-11-25 | 10 | -31/+2
    Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and so on encode characters from the JIS X 0208 character set. JIS X 0208 is based on the concept of a 94x94 table, with numbered rows and columns. However, more than a thousand of the cells in that table are empty; JIS X 0208 does not actually use all 94x94=8,836 possible kuten codes.

    mbstring had a dubious feature whereby, if a Japanese string contained one of these 'unmapped' kuten codes, and it was being converted to another Japanese encoding which was also based on JIS X 0208, the non-existent character would be silently passed through, and the unmapped kuten code would be re-encoded using the normal encoding method of the target text encoding.

    Again, this _only_ happened if converting the text with the funky kuten code to a Japanese encoding. If one tried converting it to Unicode, mbstring would treat that as an error.

    If somebody, somewhere, made their own private extension to JIS X 0208, and used the regular Japanese encodings like Shift JIS and EUC-JP to encode this private character set, then this feature might conceivably be useful. But how likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to encode a funky version of JIS X 0208 with extra characters added, then that should be treated as a separate text encoding.

    The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained solely for error reporting in `mbfl_filt_conv_illegal_output`.
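To make the 'normal encoding method' concrete, here is a Python sketch of the standard arithmetic that turns a kuten pair (row, cell) into ISO-2022-JP, EUC-JP, and Shift JIS bytes. The formulas are the widely documented ones for JIS X 0208; the function names are hypothetical and this is not mbstring's code. Because the formulas are pure arithmetic, they produce bytes for any cell, assigned or not, which is how an unmapped kuten code could be passed through.

```python
def kuten_to_jis(ku, ten):
    """ISO-2022-JP (7-bit) form: add 0x20 to the row and the cell."""
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_euc(ku, ten):
    """EUC-JP form: add 0xA0 to the row and the cell."""
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_sjis(ku, ten):
    """Shift JIS form: two rows share one lead byte."""
    lead = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2:                                   # odd row: lower half of trail bytes
        trail = ten + (0x3F if ten <= 63 else 0x40)   # skips 0x7F
    else:                                        # even row: upper half of trail bytes
        trail = ten + 0x9E
    return bytes([lead, trail])


# Every cell of the 94x94 table, whether or not a character is assigned to it.
ALL_CELLS = [(ku, ten) for ku in range(1, 95) for ten in range(1, 95)]
```

For kuten 4-2 (hiragana あ) these give the same bytes as Python's `euc_jp` and `shift_jis` codecs. For a cell in row 13, which the log notes is empty in JIS X 0208, the functions still return bytes, so a converter has to consult the mapping table to reject it.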
* Enhance handling of CP932 text encoding | Alex Dowad | 2020-11-25 | 3 | -9/+8116
    - Don't allow control characters to appear in the middle of a multi-byte character. (This was a strange feature of mbstring; it doesn't make much sense, and iconv doesn't allow it.)
    - Treat truncated multi-byte characters as an error.
* Bugfixes for findInvalidChars (helper for mbstring test suite) | Alex Dowad | 2020-11-25 | 1 | -2/+2
* Consolidate all single-byte encodings in one source file | Alex Dowad | 2020-11-11 | 75 | -4235/+639
    We can squeeze out a lot of duplicated code in this way.
* Enhance mbstring support for UCS-2 text | Alex Dowad | 2020-11-11 | 1 | -67/+39
    - For consistency with UTF-16, UTF-32, and UCS-4, strip leading byte order marks.
    - Treat it as an error if string is truncated (i.e. has an odd number of bytes).
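The two rules above can be sketched in a few lines of Python. The function name, the big-endian default when no byte order mark is present, and the use of `?` as the error marker are assumptions for illustration; mbstring's actual substitution behavior is configurable and is not modeled here.

```python
def decode_ucs2(data, substitute="?"):
    """Decode UCS-2 bytes, stripping a leading BOM and flagging a dangling byte."""
    big_endian = True                       # assumed default when no BOM is present
    if data[:2] == b"\xfe\xff":             # big-endian byte order mark
        data = data[2:]
    elif data[:2] == b"\xff\xfe":           # little-endian byte order mark
        big_endian, data = False, data[2:]
    order = "big" if big_endian else "little"

    whole = len(data) - len(data) % 2       # length of the complete 2-byte units
    chars = [chr(int.from_bytes(data[i:i + 2], order)) for i in range(0, whole, 2)]
    if whole != len(data):
        chars.append(substitute)            # odd byte count: the string is truncated
    return "".join(chars)
```

A byte order mark therefore never reaches the output, and a string with an odd number of bytes ends in the error marker rather than silently losing its last byte.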
* Leading BOM is stripped for UTF-32 | Alex Dowad | 2020-11-11 | 2 | -124/+45
    For consistency with UTF-16 and UCS-4.

    Also, do some code cleanup.
* Add test suite for SJIS-mac encoding | Alex Dowad | 2020-11-11 | 2 | -0/+7813
* Unicode -> SJIS-mac conversion doesn't reject valid codepoints after a bad transcoding hint | Alex Dowad | 2020-11-11 | 1 | -1/+3
    To give the background on this issue, here is an excerpt from JAPANESE.txt, from the Unicode Consortium:

        Apple has defined a block of 32 corporate characters as "transcoding hints." These are used in combination with standard Unicode characters to force them to be treated in a special way for mapping to other encodings; they have no other effect. Sixteen of these transcoding hints are "grouping hints" - they indicate that the next 2-4 Unicode characters should be treated as a single entity for transcoding. The other sixteen transcoding hints are "variant tags" - they are like combining characters, and can follow a standard Unicode (or a sequence consisting of a base character and other combining characters) to cause it to be treated in a special way for transcoding. These always terminate a combining-character sequence.

        The transcoding coding hints used in this mapping table are:
          0xF860 group next 2 characters as a single entity for transcoding
          0xF861 group next 3 characters as a single entity for transcoding
          0xF862 group next 4 characters as a single entity for transcoding
          0xF87A variant tag for "negative" (i.e. black & white reversed)
          0xF87E variant tag for vertical form
          0xF87F variant tag for other alternate form

        For example, the Apple addition character 0x85AB is Roman numeral thirteen. There is no single Unicode for this (although there are standard Unicodes for Roman numerals 1-12). Using the grouping hint 0xF862 in combination with standard Unicodes, we can map this as 0xF862+0x0058+0x0049+0x0049+0x0049 (i.e. X + I + I + I).

    Our SJIS-mac conversion code actually recognizes some special sequences which start with an Apple 'transcoding hint'. However, if a transcoding hint is misplaced and is not followed by one of the expected sequences, we can just emit one error marker for the bad transcoding hint and then process the following codepoint as normal.
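The recovery rule in the last paragraph can be sketched as follows in Python. The table holds only the one sequence quoted in the excerpt (0xF862 + X + I + I + I mapping to 0x85AB), and the token format and function name are hypothetical; real conversion of the ordinary codepoints is left out.

```python
# Grouping hints and how many following codepoints each one groups.
HINT_LENGTH = {0xF860: 2, 0xF861: 3, 0xF862: 4}

# Only the example from the excerpt: Roman numeral thirteen.
SEQUENCES = {(0xF862, 0x58, 0x49, 0x49, 0x49): 0x85AB}


def convert(codepoints):
    """Collapse known hint sequences; report a stray hint as a single error."""
    out, i = [], 0
    while i < len(codepoints):
        cp = codepoints[i]
        if cp in HINT_LENGTH:
            end = i + 1 + HINT_LENGTH[cp]
            sjis = SEQUENCES.get(tuple(codepoints[i:end]))
            if sjis is not None:
                out.append(("sjis", sjis))
                i = end
            else:
                out.append(("error", cp))    # one error marker for the bad hint
                i += 1                       # the next codepoint is handled normally
        else:
            out.append(("other", cp))        # ordinary conversion is not modeled here
            i += 1
    return out
```

A misplaced 0xF862 followed by `A` therefore yields one error token and then an ordinary token for `A`, instead of rejecting the valid codepoint that follows the hint.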
* SJIS-mac encoding conversion: Stop the carnage of innocent Unicode codepoints | Alex Dowad | 2020-11-11 | 1 | -0/+1
    When converting Unicode to MacJapanese, some special sequences of Unicode codepoints are collapsed into a single SJIS character. When the implementation sees a codepoint which *might* begin such a sequence, it is cached and examined again after the next codepoint arrives.

    If it turns out that it wasn't one of the 'special' sequences, then a 'fallback' conversion table is consulted to convert the cached codepoint. Then we re-enter the regular conversion code to convert the immediately following codepoint.

    BUT, local variables need to be reinitialized properly when doing this! Because the locals weren't reinitialized, the sad result was that some codepoints would get chopped up into bit salad and emitted as something totally bogus (which might not even be valid SJIS-mac text at all).
* Convert Unicode halfwidth Yen sign to MacJapanese halfwidth Yen sign | Alex Dowad | 2020-11-11 | 1 | -2/+4
    Since 1993, Unicode has had a specific codepoint for a fullwidth Yen sign. Likewise, MacJapanese has separate kuten codes for halfwidth and fullwidth Yen signs. But mbstring mapped _both_ Yen sign codepoints to the MacJapanese fullwidth Yen sign.

    It's probably more appropriate to map the 'ordinary' Yen sign to the MacJapanese halfwidth Yen sign. Besides, this means that the conversion between Unicode and MacJapanese is closer to being lossless and reversible.
* SJIS-mac encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly | Alex Dowad | 2020-11-11 | 1 | -7/+19
    Also, don't accept 1st bytes above 0xED, since none of the possible 2-byte sequences starting with 0xEE and above are actually mapped to any character.
* Add test suite for SJIS-2004 encoding | Alex Dowad | 2020-11-11 | 2 | -0/+11611
* Don't mangle non-Japanese chars which appear after a 'combining' kana in SJIS-2004 | Alex Dowad | 2020-11-11 | 1 | -2/+2
    Unicode has 'combining' characters which join with the preceding character. Japanese hiragana and katakana with the 'two dots' voice mark can be represented in this way, with one Unicode character for the 'base' kana and another one which adds the voice mark. In SJIS-2004, however, there are dedicated characters for voiced and unvoiced kana. So some special checks are done to identify sequences of Unicode characters which need to be 'collapsed' into a single SJIS-2004 character.

    If a kana is immediately followed by some other unrelated character, like a Cyrillic letter, then the cached kana should be output 'as is' and we proceed with encoding the unrelated character. When doing this, though, we need to re-initialize local variables, or else the unrelated character will be mangled in some cases.
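The cache-then-decide logic described here can be sketched in Python. As a stand-in for the SJIS-2004 table, the single pair below uses Unicode's own precomposed form (か followed by U+3099 becomes が); the table contents and names are illustrative, and the output is Unicode text rather than SJIS-2004 bytes.

```python
# Stand-in table: base kana + combining voiced sound mark -> one combined character.
PAIRS = {("\u304b", "\u3099"): "\u304c"}          # か + U+3099 -> が
STARTERS = {first for first, _ in PAIRS}


def collapse(text):
    """Merge known pairs; flush a cached kana unchanged when no pair follows."""
    out, cached = [], None
    for ch in text:
        if cached is not None:
            first, cached = cached, None          # reset the state before deciding
            merged = PAIRS.get((first, ch))
            if merged is not None:
                out.append(merged)
                continue
            out.append(first)                     # emit the cached kana 'as is'
        # From here on, ch is processed from a clean state, like any other character.
        if ch in STARTERS:
            cached = ch
        else:
            out.append(ch)
    if cached is not None:
        out.append(cached)                        # input ended right after a cached kana
    return "".join(out)
```

Clearing `cached` before the following character is examined plays the role of the re-initialized local variables in the commit: a Cyrillic letter after か comes out untouched, while か followed by U+3099 is still merged.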