path: root/ext/mbstring
Commit message  Author  Age  Files  Lines
...
* SJIS-2004 encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly  Alex Dowad  2020-11-11  4  -6/+18
| If the 2nd byte of a 2-byte character is invalid, then mb_substitute_character() should be respected. Instead, mbstring was 'swallowing' the first byte, then emitting the 2nd byte as if it were an ASCII character. Likewise, if the 2nd byte is missing, instead of just keeping quiet, report an illegal character as specified by mb_substitute_character().
* Fix broken binary search function in mbstring  Alex Dowad  2020-11-11  1  -28/+23
| This faulty binary search would never reject values at the very high end of the range being searched, even if they were not actually in the table. Among other things, this meant that some Unicode codepoints which do not correspond to any character in JIS X 0213 would be converted to bogus Shift-JIS-2004 values rather than being rejected.
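To illustrate the property this fix restores, here is a minimal Python sketch of a binary search that rejects any key not present in the table, including keys above the largest entry. The table contents and the name `table_lookup` are hypothetical; this is not the C code from mbstring.

```python
def table_lookup(table, key):
    """Return the index of `key` in the sorted list `table`, or -1 if absent."""
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if table[mid] == key:
            return mid
        if table[mid] < key:
            lo = mid + 1      # key can only be in the upper half
        else:
            hi = mid - 1      # key can only be in the lower half
    return -1                 # loop ended without a match: reject


# Hypothetical sorted table of codepoints that have a mapping.
CODEPOINTS = [0x00A5, 0x203E, 0x3042, 0x4E00, 0xFF5E]

found = table_lookup(CODEPOINTS, 0x3042)      # present in the table
too_high = table_lookup(CODEPOINTS, 0xFFFF)   # above the last entry
```

The point of the sketch is that a key beyond the last entry falls out of the loop and is rejected, rather than being matched to the final slot.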
* Don't redundantly flush mbstring filters multiple times  Alex Dowad  2020-11-11  2  -13/+3
| Each flush function in a chain of mbstring conversion filters always calls the next flush function in the chain. So it is not necessary to explicitly flush the second filter in a chain. (Due to this bug, in many cases, flush functions were actually being called three times.)
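The chaining described here can be modelled in a few lines of Python. The `Filter` class and its fields are hypothetical stand-ins for mbstring's C filter structs; the sketch only shows why flushing the head of the chain is enough.

```python
class Filter:
    """Toy conversion filter that forwards flush() down the chain."""

    def __init__(self, name, next_filter=None):
        self.name = name
        self.next_filter = next_filter
        self.flush_count = 0

    def flush(self):
        self.flush_count += 1
        # Each flush always calls the next flush in the chain.
        if self.next_filter is not None:
            self.next_filter.flush()


second = Filter("second")
first = Filter("first", second)

first.flush()   # flushing the head reaches every filter exactly once
```

Calling `second.flush()` afterwards would flush the second filter a second time, which is the kind of redundancy the commit removes.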
* Test EUC-JP and Shift-JIS more thoroughly  Alex Dowad  2020-11-11  4  -12/+13
| Previously, the unit tests for these text encodings covered all mappings from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy. However, we should also test the few Unicode -> legacy mappings which are not reversible.
* Remove mbstring identify filters  Alex Dowad  2020-11-09  121  -2819/+120
| mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question.
|
| One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'.
|
| Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen.
|
| Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.
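The detection strategy described above can be sketched in Python as follows. This is only an approximation under stated assumptions: Python's built-in codecs stand in for mbstring's conversion filters, and the function names `demerits` and `detect` and the one-point-per-codepoint scoring are invented for illustration.

```python
def demerits(text):
    """Count codepoints that make a candidate encoding look less likely."""
    count = 0
    for ch in text:
        cp = ord(ch)
        # Non-printable controls (ESC, form feed, bell, ...), except common whitespace.
        is_control = (cp < 0x20 and ch not in "\t\n\r") or cp == 0x7F
        # BMP private use area, plus the private use planes 15 and 16.
        is_private = (0xE000 <= cp <= 0xF8FF) or cp >= 0xF0000
        if is_control or is_private:
            count += 1
    return count


def detect(data, candidates):
    """Return the valid candidate with the fewest demerits, or None."""
    best = None
    for name in candidates:
        try:
            text = data.decode(name)   # strict decoding: invalid input raises
        except UnicodeDecodeError:
            continue                   # not valid in this encoding at all
        score = demerits(text)
        if best is None or score < best[1]:
            best = (name, score)
    return None if best is None else best[0]


jis_bytes = "日本語".encode("iso2022_jp")
guess = detect(jis_bytes, ["utf-8", "iso2022_jp"])
```

ISO-2022-JP text consists only of 7-bit bytes, so it is also valid UTF-8; it is the ESC characters in the UTF-8 reading that count against that candidate here.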
* Treat non-ASCII characters as erroneous when converting ASCII text  Alex Dowad  2020-11-09  1  -1/+6
|
* Fix mbstring support for EUC-JP text encoding  Alex Dowad  2020-11-09  3  -68/+13289
| - Don't allow control characters to appear in the middle of a multi-byte character. (A strange feature, or perhaps misfeature, of mbstring which is not present in other libraries such as iconv.)
| - When checking whether string is valid, reject kuten codes which do not map to any character, whether converting from EUC-JP to another encoding, or converting another encoding which uses JIS X 0208/0212 charsets to EUC-JP.
| - Truncated multi-byte characters are treated as an error.
* Fix mbstring support for Shift-JIS  Alex Dowad  2020-11-09  4  -600/+7233
| - Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
| - Handle truncated multi-byte characters as an error.
| - Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the Unicode Consortium, and as iconv does.
| - Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the Unicode Consortium, and as iconv does. (NOTE: This will affect PHP scripts which use an internal encoding of Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example, it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen sign to the Unicode yen sign means the yen sign will not be usable for C escapes in double-quoted strings. Japanese PHP programmers who want to write their source code in Shift-JIS for some strange reason will have to use the JIS X 0208 backslash or 'REVERSE SOLIDUS' character for their C escapes.)
| - Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
| - Immediately handle error if first Shift-JIS byte is over 0xEF, rather than waiting to see the next byte. (Previously, the value used was 0xFC, which is the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
| - Don't allow 'control characters' to appear in the middle of a multi-byte character.
|
| The test case for bug 47399 is now obsolete. That test assumed that a number of Shift-JIS byte sequences which don't map to any character were 'valid' (because the byte values were within the legal ranges).
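As a rough illustration of the first-byte rules listed above, the Python sketch below classifies a single Shift-JIS byte. The function name and the tuple return shape are hypothetical; the lead-byte ranges (0x81-0x9F, 0xE0-0xEF) and the half-width katakana range (0xA1-0xDF) are the standard JIS X 0208-based Shift-JIS layout, not values taken from mbstring's source.

```python
def classify_first_byte(b):
    """Classify the first byte of a Shift-JIS character.

    Returns ("char", codepoint) for a complete single-byte character,
    ("lead", None) for the first byte of a 2-byte character,
    or ("error", None) for a byte that can never start a character.
    """
    if b == 0x5C:
        return ("char", 0x00A5)        # yen sign, not backslash
    if b == 0x7E:
        return ("char", 0x203E)        # overline, not tilde
    if b < 0x80:
        return ("char", b)             # remaining 7-bit range maps 1:1
    if 0xA1 <= b <= 0xDF:
        return ("char", 0xFEC0 + b)    # half-width katakana, U+FF61..U+FF9F
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return ("lead", None)          # a 2nd byte must follow
    return ("error", None)             # 0x80, 0xA0, and anything over 0xEF


kind, codepoint = classify_first_byte(0x5C)
```

With this layout a byte such as 0xF0 is rejected immediately, without waiting for a second byte.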
* Remove useless byte{2,4}{be,le} encodings from mbstring  Alex Dowad  2020-11-09  13  -459/+34
| There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.
* Fix issues with mbstring encoding tests  Alex Dowad  2020-11-09  2  -8/+22
| I made some mistakes in this code, which meant that not everything which should be tested was actually being tested.
* Add test suite for ARMSCII-8 encoding  Alex Dowad  2020-11-02  2  -0/+288
|
* Fix mbstring support for ARMSCII-8  Alex Dowad  2020-11-02  3  -46/+15
| - Identify filter was completely wrong.
| - Respect `mb_substitute_character` rather than converting invalid bytes to Unicode 0xFFFD (generic replacement character).
| - Don't convert Unicode 0xFFFD to a valid ARMSCII-8 character.
| - When converting ARMSCII-8 to ARMSCII-8, don't pass invalid bytes through silently.
* Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+)  Alex Dowad  2020-11-02  1  -51/+15
| Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to determine whether a string was valid or not, it would convert the whole string into wchar (code points), which required dynamically allocating a (potentially large) buffer. Then it would turn right around and convert that big ol' buffer of code points back to the original encoding again. Finally, it would check whether any invalid bytes were detected during that long and onerous process.
|
| The thing is, mbstring _already_ has machinery for detecting whether a string is valid in a certain encoding or not, and it doesn't require copying any data around or allocating buffers. Better yet, it can fail fast when an invalid byte is found. Why not use it? It's sure a lot faster!
|
| Further, the legacy code was also badly broken. Why? Because aside from checking whether illegal characters were detected, it would also check whether the conversion to and from wchars was lossless. But, some encodings have more than one valid encoding for the same character. In such cases, it is not possible to make the conversion to and from wchars lossless for every valid character. So `mb_check_encoding` would actually reject good strings in a lot of encodings!
* Add test suite for KOI8-U encoding  Alex Dowad  2020-11-02  2  -0/+316
|
* Remove dead code from mbfilter_koi8u.c (and do general code cleanup)  Alex Dowad  2020-11-02  2  -40/+8
|
* All bytes are valid in KOI8-U encoding  Alex Dowad  2020-11-02  1  -12/+1
|
* Add test suite for KOI8-R encoding  Alex Dowad  2020-11-02  2  -0/+309
|
* Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c  Alex Dowad  2020-11-02  11  -85/+0
| ...Plus some dead code related to ISO-8859-1.
* Remove dead code from mbfilter_koi8r.c  Alex Dowad  2020-11-02  2  -40/+8
|
* All bytes are valid in KOI8-R encoding  Alex Dowad  2020-11-02  1  -12/+1
|
* Add test suite for CP850 encoding  Alex Dowad  2020-11-02  2  -0/+287
|
* Remove dead code from mbfilter_cp850.c (and do general code cleanup)  Alex Dowad  2020-11-02  2  -40/+8
| Since there are no invalid bytes in CP850, these `if` conditions will never be true.
* All bytes are valid in CP850 encoding  Alex Dowad  2020-11-02  1  -12/+1
|
* Add test suite for CP866 encoding  Alex Dowad  2020-11-02  2  -0/+288
|
* Remove dead code from mbfilter_cp866.c (and do general code cleanup)  Alex Dowad  2020-11-02  2  -40/+8
| Since there are no invalid bytes in CP866, these `if` conditions will never be true.
* All bytes are valid in CP866 encoding  Alex Dowad  2020-11-02  1  -12/+1
|
* Add test suite for CP1254 encoding  Alex Dowad  2020-11-02  2  -0/+289
|
* Fix mbstring support for CP1254 encoding  Alex Dowad  2020-11-02  3  -54/+18
| One funny thing: while the original author used Unicode 0xFFFD (generic replacement character) for invalid bytes in CP1251 and CP1252, for CP1254 they used 0xFFFE, which is not a valid Unicode codepoint at all, but is a reversed byte-order mark. Probably this was by mistake.
|
| Anyways,
| - Fixed identify filter, which was completely wrong.
| - Don't convert Unicode 0xFFFE to a random (but valid) CP1254 byte.
| - When converting CP1254 to CP1254, don't pass invalid bytes through silently.
* Add test suite for CP1251 encoding  Alex Dowad  2020-11-02  2  -0/+289
|
* Fix mbstring support for CP1251 encoding  Alex Dowad  2020-11-02  3  -45/+15
| - Identify filter was as wrong as wrong can be.
| - Invalid CP1251 byte 0x98 was converted to Unicode 0xFFFD (generic replacement character), rather than respecting `mb_substitute_character`.
| - Unicode 0xFFFD was converted to some random CP1251 byte.
| - When converting CP1251 to CP1251, don't pass invalid bytes through silently.
* Test cases for mbstring encodings are less repetitive  Alex Dowad  2020-11-02  3  -41/+48
|
* Add test suite for CP1252 encoding  Alex Dowad  2020-10-30  5  -60/+416
| Also remove a bogus test (bug62545.phpt) which wrongly assumed that all invalid characters in CP1251 and CP1252 should map to Unicode 0xFFFD (REPLACEMENT CHARACTER).
|
| mbstring has an interface to specify what invalid characters should be replaced with; it's called `mb_substitute_character`. If a user wants to see the Unicode 'replacement character', they can specify that using `mb_substitute_character`. But if they specify something else, we should follow that.
* Fix mbstring support for CP1252 encoding  Alex Dowad  2020-10-30  2  -34/+17
| It's a bit surprising how much was broken here.
|
| - Identify filter was utterly and completely wrong.
| - Instead of handling invalid CP1252 bytes as specified by `mb_substitute_character`, it would convert them to Unicode 0xFFFD (generic replacement character).
| - When converting ISO-8859-1 to CP1252, invalid ISO-8859-1 bytes would be passed through silently.
| - Unicode codepoints from 0x80-0x9F were converted to CP1252 bytes 0x80-0x9F, which is wrong.
| - Unicode codepoint 0xFFFD was converted to CP1252 0x9F, which is very wrong.
|
| Also clean up some unneeded code, and make the conversion table consistent with others by using zero as an 'invalid' marker, rather than 0xFFFD.
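A minimal Python sketch of the table convention mentioned above, where zero marks a byte with no mapping and the caller decides what to substitute. The table holds the standard CP1252 assignments for 0x80-0x9F; the function name `decode_cp1252` and its `substitute` parameter are hypothetical and only loosely mirror what `mb_substitute_character` configures in PHP.

```python
# CP1252 mappings for bytes 0x80-0x9F; 0 marks a byte with no mapping.
CP1252_80_9F = [
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
]


def decode_cp1252(data, substitute="?"):
    """Decode CP1252 bytes, replacing unmapped bytes with `substitute`."""
    out = []
    for b in data:
        if b < 0x80 or b >= 0xA0:
            out.append(chr(b))             # identical to ISO-8859-1 here
        else:
            cp = CP1252_80_9F[b - 0x80]
            # Zero is the 'invalid' marker, so the caller's choice applies.
            out.append(chr(cp) if cp else substitute)
    return "".join(out)


text = decode_cp1252(b"\x80A\x81")
```

Passing `substitute="\ufffd"` would reproduce the old hard-coded behavior, but only because the caller asked for it.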
* UTF-32 conversion treats truncated characters as illegal  Alex Dowad  2020-10-27  1  -3/+19
|
* Add identify filter for UTF-32{,BE,LE}  Alex Dowad  2020-10-27  3  -0/+153
|
* Improve error handling for UTF-16{,BE,LE}  Alex Dowad  2020-10-27  2  -94/+77
| Catch various errors such as the first part of a surrogate pair not being followed by a proper second part, the first part of a surrogate pair appearing at the end of a string, the second part of a surrogate pair appearing out of place, and so on.
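The surrogate errors listed above can be detected with a single pass over the 16-bit code units. The following Python sketch is an illustration only; the function name `utf16_errors` and the choice to return error positions are assumptions, not mbstring's interface.

```python
def utf16_errors(units):
    """Return the indexes of code units that break UTF-16 surrogate pairing."""
    errors = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # First (high) surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                 # well-formed pair, skip both units
                continue
            errors.append(i)           # unpaired, or truncated at end of string
        elif 0xDC00 <= u <= 0xDFFF:
            errors.append(i)           # second (low) surrogate out of place
        i += 1
    return errors


ok = utf16_errors([0x0041, 0xD83D, 0xDE00])    # 'A' followed by a valid pair
bad = utf16_errors([0xD83D, 0x0041])           # high surrogate without its partner
```

Each of the three cases named in the commit message (bad second part, high surrogate at end of string, stray low surrogate) ends up as one entry in the returned list.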
* UTF-16 text conversion handles truncated characters as illegal  Alex Dowad  2020-10-27  1  -3/+22
| This broke one old test (Zend/tests/multibyte_encoding_003.phpt), which used a PHP script encoded as UTF-16. The problem was that to terminate the test script, we need the text: "\n--EXPECT--". Out of that text, the terminating newline (0x0A byte) becomes part of the resulting test script; but a bare 0x0A byte with no 0x00 is not valid UTF-16.
|
| Since we now treat truncated UTF-16 characters as erroneous, an extra '?' is appended to the output as an 'illegal character' marker.
|
| Really, if we are running PHP scripts which are treated as encoded in UTF-16 or some other arbitrary text encoding (not ASCII), and the script is not actually a valid string in that encoding, inserting '?' characters into the code which the PHP interpreter runs is a bad thing to do. In such cases, the script shouldn't be treated as UTF-16 (or whatever) at all.
|
| I wonder if mbstring's encoding detection is being used in 'non-strict' mode?
* Add test suite for ISO-8859-x encoding verification and conversion  Alex Dowad  2020-10-16  17  -0/+4474
|
* Do not pass invalid ISO-8859-{3,6,7,8} characters through silently  Alex Dowad  2020-10-16  4  -12/+0
| mbstring has a bad habit of passing invalid characters through silently when converting to the same (or a "compatible") encoding. For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS, and try to convert that to EUC-JP, mbstring will just quietly re-encode the invalid code in the EUC-JP representation.
|
| At the same time, some parts of the code (like `mb_check_encoding`) assume that invalid characters will be treated as... well, invalid. Let's unbreak things by actually catching errors and reporting them, instead of swallowing them.
* Add identify filter for ISO-8859-8 (Latin/Hebrew)  Alex Dowad  2020-10-16  1  -1/+11
|
* Add identify filter for ISO-8859-7 (Latin/Greek)  Alex Dowad  2020-10-16  1  -1/+11
|
* Add identify filter for ISO-8859-6 (Latin/Arabic)  Alex Dowad  2020-10-16  1  -1/+11
| Note that some text encoding conversion libraries, such as Solaris iconv and FreeBSD iconv, map 0x30-0x39 to the Arabic script numerals rather than the 'regular' Western digits. (That is, to Unicode codepoints 0x660-0x669.)
|
| Further, Windows CP28596 adds more mappings to use the unused bytes in ISO-8859-6.
* Add identify filter for ISO-8859-3 (Latin-3)  Alex Dowad  2020-10-16  1  -1/+11
| There are some bytes in this encoding which are not mapped to any character. Notably, Microsoft added their own mappings for these 'unused' bits in their version of Latin-3, called CP28593.
* Add identify filter for ISO-8859-16 (Latin-10) encoding  Alex Dowad  2020-10-16  1  -0/+2
| Interestingly, it looks like the original author intended to add an identify filter for this encoding, but never did so. The needed struct is there, but was never added to the list of identify filters in mbfl_ident.c.
* Remove unused IS_SJIS1 and IS_SJIS2 macros  Alex Dowad  2020-10-14  1  -3/+0
|
* Merge branch 'PHP-8.0'  Nikita Popov  2020-10-13  15  -77/+71
|\
| |   * PHP-8.0: Normalize mb_ereg() return value
| * Normalize mb_ereg() return value  Nikita Popov  2020-10-13  15  -77/+71
| | mb_ereg()/mb_eregi() currently have an inconsistent return value based on whether the $matches parameter is passed or not:
| |
| | > Returns the byte length of the matched string if a match for
| | > pattern was found in string, or FALSE if no matches were found
| | > or an error occurred.
| | >
| | > If the optional parameter regs was not passed or the length of
| | > the matched string is 0, this function returns 1.
| |
| | Coupling this behavior to the $matches parameter doesn't make sense -- we know the match length either way, there is no technical reason to distinguish them. However, returning the match length is not particularly useful either, especially due to the need to convert 0-length into 1-length to satisfy "truthy" checks.
| |
| | We could always return 1, which would kind of match the behavior of preg_match() -- however, preg_match() actually returns the number of matches, which is 0 or 1 for preg_match(), while false signals an error. However, mb_ereg() returns false both for no match and for an error. This would result in an odd 1|false return value.
| |
| | The patch canonicalizes mb_ereg() to always return a boolean, where true indicates a match and false indicates no match or error. This also matches the behavior of the mb_ereg_match() and mb_ereg_search() functions.
| |
| | This fixes the default value integrity violation in PHP 8.
| |
| | Closes GH-6331.
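To make the before/after contract concrete, here is a small Python model of the two return conventions. It is a sketch only: Python's `re` module stands in for the regex engine behind mb_ereg(), byte length is measured in UTF-8, and the names `old_style_result` and `new_style_result` are hypothetical.

```python
import re


def old_style_result(pattern, subject, want_matches):
    """Model of the documented pre-patch return value."""
    m = re.search(pattern, subject)
    if m is None:
        return False                       # no match (or error)
    length = len(m.group(0).encode("utf-8"))
    if not want_matches or length == 0:
        return 1                           # forced to 1 to stay "truthy"
    return length                          # byte length of the match


def new_style_result(pattern, subject):
    """Model of the normalized return value: always a boolean."""
    return re.search(pattern, subject) is not None


with_regs = old_style_result("b+", "abbbc", want_matches=True)
without_regs = old_style_result("b+", "abbbc", want_matches=False)
normalized = new_style_result("b+", "abbbc")
```

The same successful match yields two different old-style values depending only on whether the caller asked for the match array, which is the inconsistency the patch removes.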
* | mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly  Alex Dowad  2020-10-13  1  -4/+28
| | Instead of looking the other way and letting things slide, report errors when the input does not follow the RFC.
* | Add 'mUTF-7' alias for UTF7-IMAP encoding  Alex Dowad  2020-10-13  1  -1/+3
| |
* | Add comment explaining mUTF-7 to mbfilter_utf7imap.c  Alex Dowad  2020-10-13  1  -0/+48
| |