-rw-r--r-- README.UNICODE 892
1 files changed, 436 insertions, 456 deletions
diff --git a/README.UNICODE b/README.UNICODE
index 99e933d037..d2cce26426 100644
--- a/README.UNICODE
+++ b/README.UNICODE
@@ -1,133 +1,111 @@
-Introduction
-============
-
-As successful as PHP has proven to be in the past several years, it is still
-the only remaining member of the P-trinity of scripting languages - Perl and
-Python being the other two - that remains blithely ignorant of the
-multilingual and multinational environment around it. The software
-development community has been moving towards Unicode Standard for some time
-now, and PHP can no longer afford to be outside of this movement. Surely,
-some steps have been taken recently to allow for easier processing of
-multibyte data with the mbstring extension, but it is not enabled in PHP by
-default and is not as intuitive or transparent as it could be.
-
-The basic goal of this document is to describe how PHP 6 will support the
-Unicode Standard natively. Since the full implementation of the Unicode
-Standard is very involved, the idea is to use the already existing,
-well-tested, full-featured, and freely available ICU (International
-Components for Unicode) library. This will allow us to concentrate on the
-details of PHP integration and speed up the implementation.
-
-General Remarks
-===============
-
-Backwards Compatibility
------------------------
-Throughout the design and implementation of Unicode support, backwards
-compatibility must be of paramount concern. PHP is used on an enormous number of
-sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
-that the existing data types and functions must work as they have always
-done. However, the speed of certain operations may be affected, due to
-increased complexity of the code overall.
+Audience
+========
-Unicode Encoding
-----------------
-The initial version will not support Byte Order Mark. Text processing will
-generally perform better if the characters are in Normalization Form C.
+This README describes how PHP 6 provides native support for the Unicode
+Standard. Readers of this document should be proficient with PHP and have a
+basic understanding of Unicode concepts. For more technical details about
+PHP 6 design principles and for guidelines about writing Unicode-ready PHP
+extensions, refer to README.UNICODE-UPGRADES.
+Introduction
+============
-Implementation Approach
-=======================
-
-The implementation is done in phases. This allows for more basic and
-low-level implementation issues to be ironed out and tested before
-proceeding to more advanced topics.
-
-Legend:
- - TODO
- + finished
- * in progress
-
- Phase I
- -------
- + Basic Unicode string support, including instantiation, concatenation,
- indexing
-
- + Simple output of Unicode strings via 'print' and 'echo' statements
- with appropriate output encoding conversion
-
- + Conversion of Unicode strings to/from various encodings via encode() and
- decode() functions
-
- + Determining length of Unicode strings via strlen() function, some
- simple string functions ported (substr).
-
+As successful as PHP has proven to be over the years, its support for
+multilingual and multinational environments has languished. PHP can no
+longer afford to remain outside the overall movement towards the Unicode
+standard. Although recent updates involving the mbstring extension have
+enabled easier multibyte data processing, this does not constitute native
+Unicode support.
- Phase II
- --------
- * HTTP input request decoding
+Since the full implementation of the Unicode Standard is very involved, our
+approach is to speed up implementation by using the well-tested,
+full-featured, and freely available ICU (International Components for
+Unicode) library.
- + Fixing remaining string-aware operators (assignment to [] etc)
- + Support for Unicode and binary strings in PHP streams
+General Remarks
+===============
- + Support for Unicode identifiers
+International Components for Unicode
+------------------------------------
- + Configurable handling of conversion failures
+ICU (International Components for Unicode) is a mature, widely used set of
+C/C++ and Java libraries for Unicode support, software internationalization,
+and globalization. It provides:
- + \C{} escape sequence in strings
+ - Encoding conversions
+ - Collations
+ - Unicode text processing
+ - and much more
+When building PHP 6, Unicode support is always enabled. The only
+configuration option during development should be the location of the ICU
+headers and libraries.
- Phase III
- ---------
- * Exposing ICU API
+ --with-icu-dir=<dir>
+
+where <dir> specifies the location of ICU header and library files. If you do
+not specify this option, PHP attempts to find ICU under /usr and /usr/local.
- * Porting all remaining functions to support Unicode and/or binary
- strings
+NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
+http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher.
+Backwards Compatibility
+-----------------------
+Our paramount concern for providing Unicode support is backwards compatibility.
+Because PHP is used on so many sites, existing data types and functions must
+work as they always have. However, although PHP's interfaces must remain
+backwards-compatible, the speed of certain operations might be affected due to
+internal implementation changes.
Encoding Names
-==============
-All the encoding settings discussed in this document accept any valid
-encoding name supported by ICU. See ICU online documentation for the full
-list of encodings.
+--------------
+All the encoding settings discussed in this document can accept any valid
+encoding name supported by ICU. For a full list of encodings, refer to the ICU
+online documentation.
+NOTE: References to "Unicode" in this document generally mean the UTF-16
+character encoding, unless explicitly stated otherwise.
Unicode Semantics Switch
========================
-Obviously, PHP cannot simply impose new Unicode support on everyone. There
-are many applications that do not care about Unicode and do not need it.
-Consequently, there is a switch that enables certain fundamental language
-changes related to Unicode. This switch is available only as a site-wide (per
-virtual server) INI setting.
+Because many applications do not require Unicode, PHP 6 provides a server-wide
+INI setting to enable Unicode support:
-Note that having switch turned off does not imply that PHP is unaware of Unicode
-at all and that no Unicode strings can exist. It only affects certain aspects of
-the language, and Unicode strings can always be created programmatically. All
-the functions and operators will still support Unicode strings and work
-appropriately.
+ unicode.semantics = On/Off
- unicode.semantics = On
+This switch is off by default. If your applications do not require native
+Unicode support, you may leave this switch off, and continue to use Unicode
+strings only when you need to.
+However, if your application is ready to fully support Unicode, you should
+turn this switch on. This activates various Unicode support mechanisms,
+including:
-Internal Encoding
-=================
+ * All string literals become Unicode
+ * All variables received from HTTP requests become Unicode
+ * PHP identifiers may use Unicode characters
-UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
-two bytes for any Unicode character in the Basic Multilingual Plane, which
-is where most of the current world's languages are represented. While being
-less memory efficient for basic ASCII text it simplifies the processing and
-makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
-processing as well.
+More fundamentally, your PHP environment is now a Unicode environment. Strings
+inside PHP are Unicode, and the system is responsible for converting non-Unicode
+strings on PHP's periphery (for example, in HTTP input and output, streams, and
+filesystem operations). With unicode.semantics on, you must specify binary
+strings explicitly. PHP makes no assumptions about the content of a binary
+string, so your application must handle all binary strings appropriately.
+
+Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
+String literals do not become Unicode, and files are binary strings for
+backwards compatibility. You can always create Unicode strings programmatically,
+and all functions and operators support Unicode strings transparently.
Fallback Encoding
=================
-This setting specifies the "fallback" encoding for all the other ones. So if
-a specific encoding setting is not set, PHP defaults it to the fallback
-encoding. If the fallback_encoding is not specified either, it is set to
+The fallback encoding provides a default value for all other unicode.*_encoding
+INI settings. If you do not set a particular unicode.*_encoding setting, PHP
+uses the fallback encoding. If you do not specify a fallback encoding, PHP uses
UTF-8.
unicode.fallback_encoding = "iso-8859-1"
@@ -136,114 +114,203 @@ UTF-8.
Runtime Encoding
================
-Currently PHP neither specifies nor cares what the encoding of its strings
-is. However, the Unicode implementation needs to know what this encoding is
-for several reasons, including explicit (casting) and implicit (concatenation,
-comparison, parameter passing) type coersions. This setting specifies the
-runtime encoding.
+The runtime encoding specifies the encoding PHP uses for converting binary
+strings within the PHP engine itself.
unicode.runtime_encoding = "iso-8859-1"
+This setting has no effect on I/O-related operations such as writing to
+standard out, reading from the filesystem, or decoding HTTP input variables.
+
+PHP enables you to explicitly convert strings using casting:
+
+ * (binary) -- casts to binary string type
+ * (unicode) -- casts to Unicode string type
+ * (string) -- casts to Unicode string type if unicode.semantics is on,
+ to binary otherwise
+
+For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a Unicode
+string, then
+
+ $str = (binary)$uni;
+
+creates a binary string $str in the ISO-8859-1 encoding.
+
+Implicit conversions include concatenation, comparison, and parameter passing.
+For better precision, PHP attempts to convert strings to Unicode before
+performing these sorts of operations. For example, if we concatenate our binary
+string $str with a Unicode literal, PHP converts $str to Unicode first, using
+the encoding specified by unicode.runtime_encoding.
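
The casts above are specific to the PHP 6 design. As a rough sketch of the same round trip on a released interpreter, assuming PHP 7.0 or later with the mbstring extension, a UTF-8 byte string can stand in for the Unicode type. The variable names and the use of UTF-8 as the stand-in are illustrative assumptions, not part of this README.

```php
<?php
// Assumption: PHP >= 7.0 with mbstring. Released PHP has no Unicode string
// type, so a UTF-8 byte string stands in for $uni here.
$runtimeEncoding = 'ISO-8859-1';   // plays the role of unicode.runtime_encoding

$uni = "caf\u{E9}";                // "cafe" with e-acute, UTF-8 bytes 63 61 66 C3 A9

// Rough equivalent of: $str = (binary)$uni;
$str = mb_convert_encoding($uni, $runtimeEncoding, 'UTF-8');
echo bin2hex($str), "\n";          // 636166e9

// Rough equivalent of the implicit conversion before concatenation:
// the legacy-encoded operand is upconverted first, then joined.
$joined = mb_convert_encoding($str, 'UTF-8', $runtimeEncoding) . " \u{2615}";
echo bin2hex($joined), "\n";       // 636166c3a920e29895
```

The point of the sketch is the order of operations: the less precise operand is converted up before the two strings are combined, so no character of the Unicode operand is lost.
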
Output Encoding
===============
-Automatic output encoding conversion is supported on the standard output
-stream. Therefore, commands such as 'print' and 'echo' automatically convert
-their arguments to the specified encoding. No automatic output encoding is
-performed for anything else. Therefore, when writing to files or external
-resources, the developer has to manually encode the data using functions
-provided by the unicode extension or rely on stream encoding features
-
-The existing default_charset setting so far has been used only for
-specifying the charset portion of the Content-Type MIME header. For several
-reasons, this setting is deprecated. Now it is only used when the Unicode
-semantics switch is disabled and does not affect the actual transcoding of
-the output stream. The output encoding setting takes precedence in all other
-cases. If the output encoding is set, PHP will automatically add 'charset'
-portion to the Conten-Type header.
+PHP automatically converts output for commands that write to the standard
+output stream, such as 'print' and 'echo'.
unicode.output_encoding = "utf-8"
+However, PHP does not convert binary strings. When writing to files or external
+resources, you must rely on stream encoding features or manually encode the data
+using functions provided by the unicode extension.
+
+The existing default_charset INI setting is DEPRECATED in favor of
+unicode.output_encoding. Previously, default_charset only specified the charset
+portion of the Content-Type MIME header. Now default_charset only takes effect
+when unicode.semantics is off, and it does not affect the actual transcoding of
+the output stream. Setting unicode.output_encoding causes PHP to add the
+'charset' portion to the Content-Type header, overriding any value set for
+default_charset.
+
HTTP Input Encoding
===================
-There will be no explicit input encoding setting. Instead, PHP will rely on a
-couple of heuristics to determine what encoding the incoming request might be
-in. Firstly, PHP will attempt to decode the input using the value of the
-unicode.output_encoding setting, because that is the most logical choice if we
-assume that the clients send the data back in the encoding that the page with
-the form was in. If that is unsuccessful, we could fallback on the "_charset_"
-form parameter, if present. This parameter is sent by IE (and possibly Firefox)
-along with the form data and indicates the encoding of the request. Note that
-this parameter will be present only if the form contains a hidden field named
-"_charset_".
-
-The variables that are decoded successfully will be put into the request arrays
-as Unicode strings, those that fail -- as binary strings. PHP will set a
-flag (probably in the $_SERVER array) indicating that there were problems during
-the conversion. The user will have access to the raw input in case of
-failure via the input filter extension and can to access the request parameters
-via input_get_arg() function. The input filter extension always looks in
-the raw input data and not in the request arrays, and input_get_arg() has a
-'charset' parameter that can be specified to tell PHP what charset the incoming
-data is in. This kills two birds with one stone: users have access to request
-arrays data on successful decoding as well as a standard and secure way to get
-at the data in case of failed decoding.
+The HTTP input encoding specifies the encoding of variables received via
+HTTP, such as the contents of the $_GET and $_POST arrays.
+
+This functionality is currently under development. For a discussion of the
+approach that the PHP 6 team is taking, refer to:
+
+http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2
+
+
+Filesystem Encoding
+===================
+
+The filesystem encoding specifies the encoding of file and directory names
+on the filesystem.
+
+ unicode.filename_encoding = "utf-8"
+
+Filesystem-related functions such as opendir() perform this conversion when
+accepting and returning file names. You should set the filename encoding to
+the encoding used by your filesystem.
Script Encoding
===============
-PHP scripts may be written in any encoding supported by ICU. The encoding
-of the scripts can be specified site-wide via an INI directive, or with a
-'declare' pragma at the beginning of the script. The reason for pragma is that
-an application written in Shift-JIS, for example, should be executable on a
-system where the INI directive cannot be changed by the application itself. The
-pragma setting is valid only for the script it occurs in, and does not propagate
-to the included files.
+You may write PHP scripts in any encoding supported by ICU. To specify the
+script encoding site-wide, use the INI setting:
- pragma:
- <?php declare(encoding = 'utf-8'); ?>
-
- INI setting:
unicode.script_encoding = utf-8
+If you cannot change the encoding system-wide, you can use a pragma to
+override the INI setting in a local script:
+
+ <?php declare(encoding = 'Shift-JIS'); ?>
+
+The pragma setting must be the first statement in the script. It only affects
+the script in which it occurs, and does not propagate to any included files.
+
INI Files
=========
-INI files will be presumed to contain UTF-8 encoded keys and values when the
-Unicode semantics mode is On. When the mode is off, the data is taken as-is,
+If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded
+keys and values. If unicode.semantics is off, the data is taken as-is,
similar to PHP 5. No validation occurs during parsing. Instead invalid UTF-8
sequences are caught during access by ini_*() functions.
-Conversion Semantics
-====================
+Stream I/O
+==========
+
+PHP has a streams-based I/O system for generalized filesystem access,
+networking, data compression, and other operations. Since the data on the
+other end of the stream can be in any encoding, you need to think about
+data conversion.
+
+Okay, this needs to be clarified. By "default", streams are actually
+opened in binary mode. You have to specify 't' flag or use FILE_TEXT in
+order to open it in text mode, where conversions apply. And for the text
+mode streams, the default stream encoding is UTF-8 indeed.
+
+By default, PHP opens streams in binary mode. To open a file in text mode,
+you must use the 't' flag (or the FILE_TEXT parameter -- see below). The
+default encoding for streams in text mode is UTF-8. This means that if
+'file.txt' is a UTF-8 text file, this code snippet:
+
+ $fp = fopen('file.txt', 'rt');
+ $str = fread($fp, 100);
+
+returns 100 Unicode characters, while:
-Not all characters can be converted between Unicode and legacy encodings.
-Normally, when downconverting from Unicode, the default behavior of ICU
-converters is to substitute the missing sequence with the appropriate
-substitution sequence for that codepage, such as 0x1A (Control-Z) in
-ISO-8859-1. When upconverting to Unicode, if an encoding has a character
-which cannot be converted into Unicode, that sequence is replaced by the
-Unicode substitution character (U+FFFD).
+ $fp = fopen('file.txt', 'wt');
+ fwrite($fp, $uni);
-The conversion error behavior can be customized:
+writes to a UTF-8 text file.
+
+If you mainly work with files in an encoding other than UTF-8, you can
+change the default context encoding setting:
+
+ stream_default_encoding('Shift-JIS');
+ $data = file_get_contents('file.txt', FILE_TEXT);
+ // work on $data
+ file_put_contents('file.txt', $data, FILE_TEXT);
+
+The file_get_contents() and file_put_contents() functions now accept an
+additional parameter, FILE_TEXT. If you provide FILE_TEXT for
+file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
+returns a binary string (which would be appropriate for true binary data, such
+as an image file). When writing a Unicode string with file_put_contents(), you
+must supply the FILE_TEXT parameter, or PHP generates a warning.
+
+If you need to work with multiple encodings, you can create custom contexts
+using stream_context_create() and then pass in the custom context as an
+additional parameter. For example:
+
+ $ctx = stream_context_create(NULL, array('encoding' => 'big5'));
+ $data = file_get_contents('file.txt', FILE_TEXT, $ctx);
+ // work on $data
+ file_put_contents('file.txt', $data, FILE_TEXT, $ctx);
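
On released PHP versions the 't' flag and FILE_TEXT do not transcode, so the behavior described above cannot be reproduced directly. A minimal sketch of the same idea, assuming PHP 7.0 or later with the iconv extension, uses the built-in convert.iconv.* stream filters. The temporary file and the choice of ISO-8859-1 are illustrative assumptions.

```php
<?php
// Assumption: PHP >= 7.0 with the iconv extension. The file path is a
// throwaway temporary file created only for this example.
$path = tempnam(sys_get_temp_dir(), 'enc');

// Write: the script holds UTF-8, the file should hold ISO-8859-1.
$fp = fopen($path, 'wb');
stream_filter_append($fp, 'convert.iconv.UTF-8/ISO-8859-1', STREAM_FILTER_WRITE);
fwrite($fp, "caf\u{E9}\n");
fclose($fp);                       // closing the stream flushes the filter

// The raw bytes on disk, read without any conversion.
$raw = file_get_contents($path);

// Read: convert the file's ISO-8859-1 bytes back to UTF-8 on the way in.
$fp = fopen($path, 'rb');
stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
$text = stream_get_contents($fp);
fclose($fp);
unlink($path);

echo bin2hex($raw), "\n";          // bytes as stored in the file
echo bin2hex($text), "\n";         // bytes as seen by the script
```

Attaching the filter per stream plays a role similar to the per-context 'encoding' option: each stream carries its own conversion, so several encodings can be handled in one script.
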
+
+
+Conversion Semantics and Error Handling
+=======================================
+
+PHP can convert strings explicitly (casting) and implicitly (concatenation,
+comparison, and parameter passing). For example, when concatenating a Unicode
+string and a binary string, PHP converts the binary string to Unicode for better
+precision.
+
+However, not all characters can be converted between Unicode and legacy
+encodings. The first possibility is that a string contains corrupt data or
+an illegal byte sequence. In this case, the converter simply stops with
+a message that resembles:
+
+ Warning: Could not convert binary string to Unicode string
+ (converter UTF-8 failed on bytes (0xE9) at offset 2)
+
+Conversely, if a similar error occurs when attempting to convert Unicode to
+a legacy string, the converter generates a message that resembles:
+
+ Warning: Could not convert Unicode string to binary string
+ (converter ISO-8859-1 failed on character {U+DC00} at offset 2)
+
+To customize this behavior, refer to "Creating a Custom Error Handler" below.
+
+The second possibility is that a Unicode character simply cannot be represented
+in the legacy encoding. By default, when downconverting from Unicode, the
+converter substitutes any missing sequences with the appropriate substitution
+sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When
+upconverting to Unicode, the converter replaces any byte sequence that has no
+Unicode equivalent with the Unicode substitution character (U+FFFD).
+
+You can customize the conversion error behavior to:
- stop the conversion and return an empty string
- skip any invalid characters
 - substitute invalid characters with a custom substitution character
- escape the invalid character in various formats
-The global conversion error settings can be controlled with these two functions:
+To control the global conversion error settings, use the functions:
unicode_set_error_mode(int direction, int mode)
unicode_set_subst_char(unicode char)
-Where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
+where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
constants:
U_CONV_ERROR_STOP
@@ -255,31 +322,102 @@ constants:
U_CONV_ERROR_ESCAPE_XML_DEC
U_CONV_ERROR_ESCAPE_XML_HEX
-Substitution character can be set only for FROM_UNICODE direction and has to
-exist in the target character set.
+As an example, with a runtime encoding of ISO-8859-1, the conversion:
+ $str = (binary)"< \u30AB >";
-Unicode String Type
-===================
+results in:
+
+ MODE RESULT
+ --------------------------------------
+ stop ""
+ skip "< >"
+ substitute "< ? >"
+ escape (Unicode) "< {U+30AB} >"
+ escape (ICU) "< %U30AB >"
+ escape (Java) "< \u30AB >"
+ escape (XML decimal) "< &#12459; >"
+ escape (XML hex) "< &#x30AB; >"
+
+With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:
+
+ $str = (unicode)b"< \xe9\xfe >";
+
+results in:
+
+ MODE RESULT
+ --------------------------------------
+ stop ""
+ skip ""
+ substitute ""
+ escape (Unicode) "< %XE9%XFE >"
+ escape (ICU) "< %XE9%XFE >"
+ escape (Java) "< \xE9\xFE >"
+ escape (XML decimal) "< &#233;&#254; >"
+ escape (XML hex) "< &#xE9;&#xFE; >"
+
+The substitution character can be set only for FROM_UNICODE direction and has to
+exist in the target character set. The default substitution character is (?).
+
+NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To convert
+using an alternative encoding, use the unicode_encode() and unicode_decode()
+functions. For example,
+
+ $str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);
+
+results in a binary KOI8-R encoded string.
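
The stop, skip, and substitute modes have a loose counterpart in released PHP. The following is a sketch only, assuming PHP 7.0 or later with mbstring: mb_substitute_character() is a process-wide setting that covers the substitute and skip cases for conversions away from Unicode. It does not implement the PHP 6 functions named above, and the variable names are illustrative.

```php
<?php
// Assumption: PHP >= 7.0 with mbstring. mb_substitute_character() is loosely
// comparable to unicode_set_error_mode() plus unicode_set_subst_char() for
// the FROM_UNICODE direction.
$text = "< \u{30AB} >";                 // KATAKANA LETTER KA, not in ISO-8859-1

$previous = mb_substitute_character();  // remember the current setting

mb_substitute_character(0x3F);          // substitute with '?'
$substituted = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

mb_substitute_character('none');        // skip unconvertible characters
$skipped = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

mb_substitute_character(0x2A);          // custom substitution character '*'
$custom = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

mb_substitute_character($previous);     // restore the original setting

var_dump($substituted);  // string(5) "< ? >"
var_dump($skipped);      // string(4) "<  >"
var_dump($custom);       // string(5) "< * >"
```

Because the setting is global, the sketch saves and restores it, which mirrors the caution needed with any global conversion error mode.
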
+
+Creating a Custom Error Handler
+-------------------------------
+If an error occurs during the conversion, PHP outputs a warning describing the
+problem. Instead of this default behavior, PHP can invoke a user-provided error
+handler, similar to how the current user-defined error handler works. To set
+the custom conversion error handler, call:
+
+ mixed unicode_set_error_handler(callback error_handler)
+
+The function returns the previously defined custom error handler. If no error
+handler was defined, or if an error occurs when returning the handler, this
+function returns NULL.
+
+When the custom handler is set, the standard error handler is bypassed. It is
+the responsibility of the custom handler to output or log any messages, raise
+exceptions, or die(), if necessary. However, if the custom error handler returns
+FALSE, the standard handler will be invoked afterwards.
+
+The user function specified as the error_handler must accept five parameters:
+
+ mixed error_handler($direction, $encoding, $char_or_byte, $offset,
+ $message)
+
+where:
-Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
-UTF-16 format. It is the main string type in PHP when Unicode semantics
-switch is turned on. Unicode strings can exist when the switch is off, but
-they have to be produced programmatically, via calls to functions that
-return Unicode type.
+ $direction - the direction of conversion, FROM_UNICODE/TO_UNICODE
-The operational unit when working with Unicode strings is a code point, not
-code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code
-units, each of which is a 16-bit word. Working on the code point level is
-necessary because doing otherwise would mean offloading the processing of
-surrogate pairs onto PHP users, and that is less than desirable.
+ $encoding - the name of the encoding to/from which the conversion
+ was attempted
-The repercussions are that one cannot expect code point N to be at offset N in
-the Unicode string. Instead, one has to iterate from the beginning from the
-string using U16_FWD() macro until the desired codepoint is reached. This will
-be transparent to the end user who will work only with "character" offsets.
+ $char_or_byte - either Unicode character or byte sequence (depending
+ on direction) which caused the error
-The codepoint access is one of the primary areas targeted for optimization.
+ $offset - the offset of the failed character/byte sequence in
+ the source string
+
+ $message - the error message describing the problem
+
+NOTE: If the error mode set by unicode_set_error_mode() is substitute,
+skip, or escape, the handler won't be called, since these are non-error
+causing operations. To always invoke your handler, set the error mode to
+U_CONV_ERROR_STOP.
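
The handler contract described here (a FALSE return falls through to the standard handler) matches that of PHP's general set_error_handler(). As a hedged sketch on released PHP, assuming PHP 7.0 or later with the iconv extension, a conversion failure reported by iconv() can be captured the same way. Whether iconv() fails, and with which message, depends on the underlying iconv library, so no particular output is claimed.

```php
<?php
// Assumption: PHP >= 7.0 with iconv. This uses the general error handler,
// not the conversion-specific handler proposed in this README.
$failures = [];

set_error_handler(
    function (int $errno, string $errstr) use (&$failures): bool {
        $failures[] = $errstr;  // log the message instead of printing it
        return true;            // handled; returning false would also run
                                // PHP's standard error handler
    },
    E_NOTICE | E_WARNING
);

// U+30AB has no ISO-8859-1 equivalent; many iconv builds report an error here.
$result = iconv('UTF-8', 'ISO-8859-1', "< \u{30AB} >");

restore_error_handler();

if ($result === false) {
    echo "conversion failed: ", implode('; ', $failures), "\n";
} else {
    echo "converted bytes: ", bin2hex($result), "\n";
}
```
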
+
+
+Unicode String Type
+===================
+
+The Unicode string type (IS_UNICODE) is intended to contain text data encoded in
+UTF-16. This is the main string type in PHP when the Unicode semantics switch is
+turned on. Unicode strings can exist when the switch is off, but they have to be
+produced programmatically via calls to functions that return Unicode types.
Binary String Type
@@ -294,108 +432,48 @@ true binary data such as images, PDFs, etc.
Printing binary data to the standard output passes it through as-is, independent
of the output encoding.
-
-Zval Structure Changes
-======================
-
-PHP is a type-agnostic language. Its data values are encapsulated in a zval
-(Zend value) structure that can change as necessary to accomodate various types.
-
-struct _zval_struct {
- /* Variable information */
- union {
- long lval; /* long value */
- double dval; /* double value */
- struct {
- char *val;
- int len;
- } str; /* string value */
- HashTable *ht; /* hash table value */
- zend_object_value obj; /* object value */
- } value;
- zend_uint refcount;
- zend_uchar type; /* active type */
- zend_uchar is_ref;
-};
-
-The type field determines what is stored in the union, IS_STRING being the only
-data type pertinent to this discussion. In the current version, the strings
-are binary-safe, but, for all intents and purposes, are assumed to be
-comprised of 8-bit characters. It is possible to treat the string value as
-an opaque type containing arbitrary binary data, and in fact that is how
-mbstring extension uses it, in order to store multibyte strings. However,
-many extensions and the Zend engine itself manipulate the string value
-directly without regard to its internals. Needless to say, this can lead to
-problems.
-
-For IS_UNICODE type, we need to add another structure to the union:
-
- union {
- ....
- struct {
- UChar *val; /* Unicode string value */
- int len; /* number of UChar's */
- } ustr;
- ....
- } value;
-
-This cleanly separates the two types of strings and helps preserve backwards
-compatibility.
-
-To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
-another structure:
-
- union {
- ....
- struct { /* Universal string type */
- zstr val;
- int len;
- } uni;
- ....
- } value;
-
-Where zstr ia union of char*, UChar*, and void*.
-
+For examples of specifying binary string literals, refer to the section
+"Language Modifications".
Language Modifications
======================
-If a Unicode switch is turned on, PHP string literals - single-quoted,
-double-quoted, and heredocs - become Unicode strings (IS_UNICODE type).
-They support all the same escape sequences and variable interpolations as
-previously, with the addition of some new escape sequences.
+If the Unicode semantics switch is turned on, PHP string literals -- single-quoted,
+double-quoted, and heredocs -- become Unicode strings (IS_UNICODE type). String
+literals support all the same escape sequences and variable interpolations as
+before, plus several new escape sequences.
-The contents of the strings are interpreted as follows:
+PHP interprets the contents of strings as follows:
- all non-escaped characters are interpreted as a corresponding Unicode
- codepoint based on the current script encoding, e.g. ASCII 'a' (0x51) =>
- U+0061, Shift-JIS (0x92 0x69) => U+4E2D
+ codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
+ U+0061, Shift-JIS (0x92 0x86) => U+4E2D
- existing PHP escape sequences are also interpreted as Unicode codepoints,
including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
- - two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or
+ - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4 or
6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
- U+10410
-
+ U+10410. (Having two sequences avoids the ambiguity of \u020608 --
+ is that supposed to be U+0206 followed by "08", or U+020608 ?)
+
- a new escape sequence allows specifying a character by its full
Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
-The single-quoted string is more restrictive than the other two types: so
-far the only escape sequence allowed inside of it was \', which specifies
-a literal single quote. However, single quoted strings now support the new
-Unicode character escape sequences as well.
+The single-quoted string is more restrictive than the other two types. Until
+now, the only escape sequences recognized inside it were \' (a literal single
+quote) and \\ (a literal backslash). However, single-quoted strings now support
+the new Unicode character escape sequences as well.
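
For comparison, released PHP (7.0 and later) provides a braced \u{...} escape rather than the \uXXXX and \UXXXXXX forms described here, and only in double-quoted and heredoc strings. The sketch below assumes such an interpreter with UTF-8 as the string encoding. The name lookup additionally assumes the intl extension and is skipped if it is missing.

```php
<?php
// Assumption: PHP >= 7.0, where the released syntax is \u{...} with braces
// and the resulting string is UTF-8 encoded.
echo bin2hex("\u{0221}"), "\n";    // c8a1      (U+0221 as UTF-8)
echo bin2hex("\u{10410}"), "\n";   // f0909090  (U+10410 as UTF-8)

// The braces remove the ambiguity that \u versus \U resolves in this README:
echo bin2hex("\u{0206}08"), "\n";  // c8863038  (U+0206 followed by "08")

// Unlike the PHP 6 design, single-quoted strings leave the escape untouched.
echo strlen('\u{0221}'), "\n";     // 8

// Lookup by character name, comparable to \C{...}, needs the intl extension.
if (class_exists('IntlChar')) {
    $cp = IntlChar::charFromName('THAI CHARACTER PHO SAMPHAO');
    printf("U+%04X\n", $cp);       // U+0E20
}
```
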
PHP allows variable interpolation inside the double-quoted and heredoc strings.
However, the parser separates the string into literal and variable chunks during
-compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
-literal chunks can be handled in the normal way for as far as Unicode
-support is concerned.
+compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that PHP
+can handle literal chunks in the normal way as far as Unicode support is
+concerned.
-Since all string literals become Unicode by default, one loses the ability
-to specify byte-oriented or binary strings. In order to create binary string
-literals, a new syntax is necessary: prefixing a string literal with letter
-'b' creates a binary string.
+Since all string literals become Unicode by default, PHP 6 introduces new syntax
+for creating byte-oriented or binary strings. Prefixing a string literal with
+the letter 'b' creates a binary string:
$var = b'abc\001';
$var = b"abc\001";
@@ -403,235 +481,136 @@ literals, a new syntax is necessary: prefixing a string literal with letter
abc\001
EOD;
-The binary string literals support the same escape sequences as the current
-PHP strings. If the Unicode switch is turned off, then the binary string
-literals generate normal string (IS_STRING) type internally, without any
-effect on the application.
+The content of a binary string is the literal byte sequence inside the
+delimiters, which depends on the script encoding (unicode.script_encoding).
+Binary string literals support the same escape sequences as PHP 5 strings. If
+the Unicode switch is turned off, then the binary string literals generate the
+normal string (IS_STRING) type internally without any effect on the application.
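
The last point can be checked on a released interpreter, which behaves like the "switch off" case: the b prefix is accepted by the parser and has no effect on the resulting value. A minimal sketch, assuming PHP 7.0 or later:

```php
<?php
// Assumption: PHP >= 7.0. The b prefix parses but does not change the type.
$plain  = "abc\001";
$binary = b"abc\001";

var_dump($binary === $plain);   // bool(true)
var_dump(strlen($binary));      // int(4)

// In single quotes the backslash sequence is not interpreted, with or
// without the prefix, so the literal holds seven bytes.
var_dump(bin2hex(b'abc\001'));  // string(14) "6162635c303031"
```
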
-The string operators have been changed to accomodate the new IS_UNICODE and
-IS_BINARY types. In more detail:
+The string operators now accommodate the new IS_UNICODE and IS_BINARY types:
- - The concatenation (.) operator has been changed to automatically coerce
- IS_STRING type to the more precise IS_UNICODE if its operands are of two
- different string types.
+ - The concatenation operator (.) and concatenation assignment operator (.=)
+ automatically coerce the IS_STRING type to the more precise IS_UNICODE if
+ the operands are of different string types.
- - The concatenation assignment operator (.=) has been changed similarly.
-
- - The string indexing operator [] has been changed to accomodate IS_UNICODE
- type strings and extract the specified character. Note that the index
- specifies a code point, not a byte, or a code unit, thus supporting
- supplementary characters.
-
- - Both Unicode and binary string types can be used as array keys. If the
- Unicode switch is on, the binary keys are converted to Unicode.
+ - The string indexing operator [] now accommodates IS_UNICODE type strings
+ and extracts the specified character. To support supplementary characters,
+ the index specifies a code point, not a byte or a code unit.
- Bitwise operators and increment/decrement operators do not work on
Unicode strings. They do work on binary strings.
- Two new casting operators are introduced, (unicode) and (binary). The
- (string) operator will cast to Unicode type if the Unicode semantics switch is
+ (string) operator casts to Unicode type if the Unicode semantics switch is
on, and to binary type otherwise.
- - The comparison operators when applied to Unicode strings, perform
- comparison in binary code point order. They also do appropriate coersion
- if the strings are of differing types.
+ - The comparison operators compare Unicode strings in binary code point
+ order. They also coerce strings to Unicode if the strings are of different
+ types.
- The arithmetic operators use the same semantics as today for converting
strings to numbers. A Unicode string is considered numeric if it
- represents a long or a double number in en_US_POSIX locale.
+ represents a long or a double number in the en_US_POSIX locale.
-Inline HTML
-===========
-Because inline HTML blocks are intermixed with PHP ones, they are also
-written in the script encoding. PHP transcodes the HTML blocks to the output
-encoding as needed, resulting in direct passthrough if the script encoding
-matches output encoding.
+Unicode Support in Existing Functions
+=====================================
+All functions in the PHP default distribution are undergoing analysis to
+determine which functions need to be upgraded for native Unicode support.
+You can track progress here:
-Identifiers
-===========
-Considering that scripts may be written in various encodings, we do not
-restrict identifiers to be ASCII-only. PHP allows any valid identifier based
-on the Unicode Standard Annex #31. The identifiers are case folded when
-necessary (class and function names) and converted to normalization form
-NFKC, so that two identifiers written in two compatible ways refer to the
-same thing.
-
-
-Numbers
-=======
-Unlike identifiers, we restrict numbers to consist only of ASCII digits and
-do not interpret them as written in a specific locale. The numbers are
-expected to adhere to en_US_POSIX or C locale, i.e. having no thousands
-separator and fractional separator being (.) "full stop". Numeric strings
-are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as
-a number even if the current locale's fractional separator is comma.
-
-
-Parameter Parsing API Modifications
-===================================
-
-Internal PHP functions largely uses zend_parse_parameters() API in order to
-obtain the parameters passed to them by the user. For example:
-
- char *str;
- int len;
+ http://www.php.net/~scoates/unicode/render_func_data.php
- if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
- return;
- }
+Key extensions that are fully converted include:
-This forces the input parameter to be a string, and its value and length are
-stored in the variables specified by the caller.
+ * curl
+ * dom
+ * json
+ * mysql
+ * mysqli
+ * oci8
+ * pcre
+ * reflection
+ * simplexml
+ * soap
+ * sqlite
+ * xml
+ * xmlreader/xmlwriter
+ * xsl
+ * zlib
-There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.
+NOTE: Unsafe functions might still work, since PHP performs Unicode conversions
+at runtime. However, unsafe functions might not work correctly with multibyte
+binary strings, or Unicode characters that are not representable in the
+specified unicode.runtime_encoding.
- 't' specifier
- -------------
- This specifier indicates that the caller requires the incoming parameter to be
- string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
- string value, length, and type.
- void *str;
- int len;
- zend_uchar type;
-
- if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
- return;
- }
- if (type == IS_UNICODE) {
- /* process Unicode string */
- } else {
- /* process binary string */
- }
-
- For IS_STRING type, the length represents the number of bytes, and for
- IS_UNICODE the number of UChar's. When converting other types (numbers,
- booleans, etc) to strings, the exact behavior depends on the Unicode semantics
- switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.
+Identifiers
+===========
+Since scripts may be written in various encodings, we do not restrict
+identifiers to be ASCII-only. PHP allows any valid identifier based
+on the Unicode Standard Annex #31.
- 'u' specifier
- -------------
- This specifier indicates that the caller requires the incoming parameter
- to be a Unicode encoded string. If a non-Unicode string is passed, the engine
- creates a copy of the string and automatically convert it to Unicode type before
- passing it to the internal function. No such conversion is necessary for Unicode
- strings, obviously.
- UChar *str;
- int len;
+Numbers
+=======
- if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
- return;
- }
- /* process Unicode string */
+Unlike identifiers, numbers must consist only of ASCII digits and are
+restricted to the en_US_POSIX or C locale. In other words, numbers have no
+thousands separator, and the fractional separator is (.) "full stop". Numeric
+strings adhere to the same rules, so "10,3" is not interpreted as a number even
+if the current locale's fractional separator is a comma.
-
- 'T' specifier
- -------------
- This specifier is useful when the function takes two or more strings and
- operates on them. Using 't' specifier for each one would be somewhat
- problematic if the passed-in strings are of mixed types, and multiple
- checks need to be performed in order to do anything. All parameters
- marked by the 'T' specifier are promoted to the same type.
-
- If at least one of the 'T' parameters is of Unicode type, then the rest of
- them are converted to IS_UNICODE. Otherwise all 'T' parameters are conveted to
- IS_STRING type.
+TextIterators
+=============
+To access characters in a linear fashion, use a TextIterator instead of the
+offset operator []. TextIterator is very fast and enables you to iterate over
+code points, combining sequences, characters, words, lines, and sentences, both
+forward and backward. For example:
- void *str1, *str2;
- int len1, len2;
- zend_uchar type1, type2;
+ $text = "nai\u0308ve";
+ foreach (new TextIterator($text) as $u) {
+ var_inspect($u);
+ }
- if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
- &type1, &str2, &len2, &type2) == FAILURE) {
- return;
- }
- if (type1 == IS_UNICODE) {
- /* process as Unicode, str2 is guaranteed to be Unicode as well */
- } else {
- /* process as binary string, str2 is guaranteed to be the same */
- }
+lists six code points, including the umlaut (U+0308) as a separate code point.
+Instantiating the TextIterator to iterate over characters,
+ $text = "nai\u0308ve";
+ foreach (new TextIterator($text, TextIterator::CHARACTER) as $u) {
+ var_inspect($u);
+ }
-The existing 's' specifier has been modified as well. If a Unicode string is
-passed in, it automatically copies and converts the string to the runtime
-encoding, and issues a warning. If a binary type is passed-in, no conversion
-is necessary.
-
-The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
-about the type of the passed-in parameter. If 'U' is specified and the binary
-string is passed in, the engine will issue a warning instead of doing automatic
-conversion. The converse applies to the 'S' specifier.
+lists five characters, including an "i" with an umlaut as a single character.
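
The difference between code points and characters can also be sketched without
TextIterator. The example below is only an illustration under stated
assumptions: it relies on the PCRE Unicode mode of current PHP releases and on
the "\u{...}" escape syntax (PHP 7 and later) rather than the "\u" escapes
described in this document, and the helper names code_points() and characters()
are hypothetical.

```php
<?php
// Sketch only: uses PCRE's Unicode mode, not the TextIterator class
// described in this document. Both helper names are hypothetical.

// Split a UTF-8 string into individual code points.
function code_points(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Split a UTF-8 string into user-perceived characters. \X matches one
// grapheme cluster: a base character plus any combining marks after it.
function characters(string $text): array
{
    preg_match_all('/\X/u', $text, $matches);
    return $matches[0];
}

$text = "nai\u{0308}ve";               // "i" followed by combining U+0308
echo count(code_points($text)), "\n";  // prints 6
echo count(characters($text)), "\n";   // prints 5
```

The counts match the two TextIterator examples: U+0308 is a code point of its
own, but it belongs to the same character as the preceding "i".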
+Locales
+=======
-Upgrading Existing Functions
-============================
+Unicode support in PHP relies exclusively on ICU locales, NOT the POSIX locales
+installed on the system. You can set and read the default ICU locale using:
-Upgrading functions to work with new data types will be a deliberate and
-involved process, because one needs to consider not only the mechanisms for
-processing Unicode characters, for example, but also the semantics of
-the function.
+ locale_set_default()
+ locale_get_default()
-The main tenet of the upgrade process should be that when processing Unicode
-strings, the unit of operation is a code point, not a code unit or a byte.
-For example, strlen() returns the number of code points in the string.
+ICU locale IDs have a somewhat different format from POSIX locale IDs. The ICU
+syntax is:
- strlen('abc') = 3
- strlen('ab\U010000') = 3
- strlen('ab\uD800\uDC00') = 3 /* not 4 */
+ <language>[_<script>]_<country>[_<variant>][@<keywords>]
-Function upgrade guidelines are available in a separate document.
+For example, sr_Latn_YU_REVISED@currency=USD is Serbian (Latin, Yugoslavia,
+Revised Orthography, Currency=US Dollar).
+Do not use the deprecated setlocale() function. This function interacts with the
+POSIX locale. If Unicode semantics are on, using setlocale() generates
+a deprecation warning.
Document TODO
==========================================
-- Streams support for Unicode - What stream filters will be provided?
-- User conversion error handler
-- INI files encoding - UTF-8? Do we support BOMs?
-- There are likely to be other issues which are missing from this document
-
-
-Build System
-============
-
-Unicode support in PHP is always enabled. The only configuration option
-during development should be the location of the ICU headers and libraries.
-
- --with-icu-dir=<dir> <dir> parameter specifies the location of ICU
- header and library files.
-
-After the initial development we have to repackage ICU library for our needs
-and bundle it with PHP.
-
-
-Document History
-================
- 0.6: Remove notion of native encoding string, only 2 string types are used
- now. Update conversion error behavior section and parameter parsing.
- Bring the document up-to-date with reality in general.
-
- 0.5: Updated per latest discussions. Removed tentative language in several
- places, since we have decided on everything described here already.
- Clarified details according to Phase II progress.
-
- 0.4: Updated to include all the latest discussions. Updated development
- phases.
-
- 0.3: Updated to include all the latest discussions.
-
- 0.2: Updated Phase I design proposal per discussion on unicode@php.net.
- Modified Internal Encoding section to contain only UTF-16 info..
- Expanded Script Encoding section.
- Added Binary Data Type section.
- Amended Language Modifications section to describe string literals
- behavior.
- Amended Build System section.
-
- 0.1: Phase I design proposal
+- Final review.
+- Fix the HTTP Input Encoding section, which is now obsolete.
References
@@ -665,5 +644,6 @@ References
Authors
=======
Andrei Zmievski <andrei@gravitonic.com>
+ Evan Goer <goer@yahoo-inc.com>
-vim: set et :
+vim: set et tw=80 :