Audience
========

This README describes how PHP 6 provides native support for the Unicode 
Standard. Readers of this document should be proficient with PHP and have a
basic understanding of Unicode concepts. For more technical details about
PHP 6 design principles and for guidelines about writing Unicode-ready PHP 
extensions, refer to README.UNICODE-UPGRADES.

Introduction
============

As successful as PHP has proven to be over the years, its support for
multilingual and multinational environments has languished. PHP can no
longer afford to remain outside the overall movement towards the Unicode
standard.  Although recent updates involving the mbstring extension have
enabled easier multibyte data processing, this does not constitute native
Unicode support.

Since the full implementation of the Unicode Standard is very involved, our
approach is to speed up implementation by using the well-tested,
full-featured, and freely available ICU (International Components for
Unicode) library.


General Remarks
===============

International Components for Unicode
------------------------------------

ICU (International Components for Unicode) is a mature, widely used set of
C/C++ and Java libraries for Unicode support, software internationalization
and globalization. It provides:

  - Encoding conversions
  - Collations
  - Unicode text processing
  - and much more

When building PHP 6, Unicode support is always enabled. The only
configuration option during development should be the location of the ICU
headers and libraries.

  --with-icu-dir=<dir>
  
where <dir> specifies the location of ICU header and library files. If you do
not specify this option, PHP attempts to find ICU under /usr and /usr/local.

NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher. 

Backwards Compatibility
-----------------------
Our paramount concern for providing Unicode support is backwards compatibility.
Because PHP is used on so many sites, existing data types and functions must
work as they always have. However, although PHP's interfaces must remain
backwards-compatible, the speed of certain operations might be affected due to
internal implementation changes.

Encoding Names
--------------
All the encoding settings discussed in this document can accept any valid
encoding name supported by ICU. For a full list of encodings, refer to the ICU
online documentation.

NOTE: References to "Unicode" in this document generally mean the UTF-16
character encoding, unless explicitly stated otherwise.

Unicode Semantics Switch
========================

Because many applications do not require Unicode, PHP 6 provides a server-wide
INI setting to enable Unicode support:

  unicode.semantics = On/Off

This switch is off by default. If your applications do not require native
Unicode support, you may leave this switch off, and continue to use Unicode
strings only when you need to. 

However, if your application is ready to fully support Unicode, you should 
turn this switch on. This activates various Unicode support mechanisms, 
including:

  * All string literals become Unicode
  * All variables received from HTTP requests become Unicode
  * PHP identifiers may use Unicode characters

More fundamentally, your PHP environment is now a Unicode environment.  Strings
inside PHP are Unicode, and the system is responsible for converting non-Unicode
strings on PHP's periphery (for example, in HTTP input and output, streams, and
filesystem operations). With unicode.semantics on, you must specify binary
strings explicitly. PHP makes no assumptions about the content of a binary
string, so your application must handle all binary strings appropriately.

Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
String literals do not become Unicode, and files are binary strings for
backwards compatibility. You can always create Unicode strings programmatically,
and all functions and operators support Unicode strings transparently.


Fallback Encoding
=================

The fallback encoding provides a default value for all other unicode.*_encoding
INI settings. If you do not set a particular unicode.*_encoding setting, PHP
uses the fallback encoding. If you do not specify a fallback encoding, PHP uses
UTF-8.

  unicode.fallback_encoding = "iso-8859-1"
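
The lookup order described above can be sketched as follows. This is a Python
illustration, not PHP 6 code: resolve_encoding() and the ini dictionary are
hypothetical names used only to show the order in which a specific setting, the
fallback encoding, and UTF-8 are consulted.

```python
DEFAULT_ENCODING = "utf-8"   # used when no fallback encoding is set

def resolve_encoding(ini, name):
    """Look up unicode.<name>_encoding, then the fallback, then UTF-8."""
    return (ini.get("unicode.%s_encoding" % name)
            or ini.get("unicode.fallback_encoding")
            or DEFAULT_ENCODING)

# Hypothetical settings: only the output encoding is set explicitly.
ini = {
    "unicode.fallback_encoding": "iso-8859-1",
    "unicode.output_encoding": "utf-8",
}

output = resolve_encoding(ini, "output")      # explicit setting wins
runtime = resolve_encoding(ini, "runtime")    # falls back to iso-8859-1
unset = resolve_encoding({}, "runtime")       # nothing set at all: UTF-8
```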


Runtime Encoding
================

The runtime encoding specifies the encoding PHP uses for converting binary 
strings within the PHP engine itself. 

  unicode.runtime_encoding = "iso-8859-1"

This setting has no effect on I/O-related operations such as writing to 
standard out, reading from the filesystem, or decoding HTTP input variables.

PHP enables you to explicitly convert strings using casting:

  * (binary) -- casts to binary string type
  * (unicode) -- casts to Unicode string type
  * (string) -- casts to Unicode string type if unicode.semantics is on,
    to binary otherwise

For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a Unicode
string, then

  $str = (binary)$uni;

creates a binary string $str in the ISO-8859-1 encoding.

Implicit conversions include concatenation, comparison, and parameter passing.
For better precision, PHP attempts to convert strings to Unicode before
performing these sorts of operations. For example, if we concatenate our binary
string $str with a unicode literal, PHP converts $str to Unicode first, using
the encoding specified by unicode.runtime_encoding.
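
The implicit upconversion can be pictured with a short Python analogy. This is
not PHP 6 code: Python bytes stand in for binary strings, Python str stands in
for Unicode strings, and RUNTIME_ENCODING and concat() are hypothetical names
standing in for unicode.runtime_encoding and the concatenation operator.

```python
# Hypothetical stand-in for the unicode.runtime_encoding INI setting.
RUNTIME_ENCODING = "iso-8859-1"

def concat(left, right):
    """Concatenate two strings, upconverting binary operands to Unicode first."""
    if isinstance(left, bytes):
        left = left.decode(RUNTIME_ENCODING)
    if isinstance(right, bytes):
        right = right.decode(RUNTIME_ENCODING)
    return left + right

uni = "caf\u00e9"                        # Unicode string
binary = uni.encode(RUNTIME_ENCODING)    # comparable to (binary)$uni
result = concat(binary, " au lait")      # the binary operand is decoded first
```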

Output Encoding
===============

PHP automatically converts output for commands that write to the standard 
output stream, such as 'print' and 'echo'.

  unicode.output_encoding = "utf-8"

However, PHP does not convert binary strings. When writing to files or external
resources, you must rely on stream encoding features or manually encode the data
using functions provided by the unicode extension.

The existing default_charset INI setting is DEPRECATED in favor of
unicode.output_encoding. Previously, default_charset only specified the charset
portion of the Content-Type MIME header. Now default_charset only takes effect
when unicode.semantics is off, and it does not affect the actual transcoding of
the output stream. Setting unicode.output_encoding causes PHP to add the
'charset' portion to the Content-Type header, overriding any value set for
default_charset.


HTTP Input Encoding
===================

The HTTP input encoding specifies the encoding of variables received via
HTTP, such as the contents of the $_GET and $_POST arrays.

This functionality is currently under development. For a discussion of the
approach that the PHP 6 team is taking, refer to:

http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2


Filesystem Encoding
===================

The filesystem encoding specifies the encoding of file and directory names
on the filesystem. 

  unicode.filename_encoding = "utf-8"

Filesystem-related functions such as opendir() perform this conversion when 
accepting and returning file names. You should set the filename encoding to 
the encoding used by your filesystem. 


Script Encoding
===============

You may write PHP scripts in any encoding supported by ICU. To specify the
script encoding site-wide, use the INI setting:

   unicode.script_encoding = utf-8

If you cannot change the encoding site-wide, you can use a pragma to 
override the INI setting in a local script:

   <?php declare(encoding = 'Shift-JIS'); ?>

The pragma setting must be the first statement in the script. It only affects 
the script in which it occurs, and does not propagate to any included files. 


INI Files
=========

If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded 
keys and values. If unicode.semantics is off, the data is taken as-is,
similar to PHP 5. No validation occurs during parsing. Instead, invalid UTF-8
sequences are caught during access by ini_*() functions.


Stream I/O
==========

PHP has a streams-based I/O system for generalized filesystem access, 
networking, data compression, and other operations. Since the data on the 
other end of the stream can be in any encoding, you need to think about 
data conversion. 

By default, PHP opens streams in binary mode. To open a file in text mode,
where conversions apply, you must use the 't' flag (or the FILE_TEXT
parameter, described below). The default encoding for streams in text mode
is UTF-8. This means that if 'file.txt' is a UTF-8 text file, this code
snippet:

  $fp = fopen('file.txt', 'rt');
  $str = fread($fp, 100);

returns up to 100 Unicode characters, while:

  $fp = fopen('file.txt', 'wt');
  fwrite($fp, $uni);

writes to a UTF-8 text file.
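
The difference between text mode and binary mode can be seen in a language that
already ships this behavior. The following Python sketch is an analogy only, not
PHP 6 code; it uses a temporary file, and the sample text is made up for the
illustration.

```python
import os
import tempfile

text = "na\u00efve \u4e2d\u6587"   # 8 characters, 13 bytes in UTF-8

path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Text mode with an explicit encoding: characters are encoded on write.
with open(path, "w", encoding="utf-8") as fp:
    fp.write(text)

# Text mode: read(n) counts characters and returns a Unicode string.
with open(path, "r", encoding="utf-8") as fp:
    chars = fp.read(4)

# Binary mode: read(n) counts bytes and applies no conversion.
with open(path, "rb") as fp:
    raw = fp.read(4)
```

Reading four units in text mode yields four whole characters, while binary mode
yields four bytes, which here cover only the first three characters.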

If you mainly work with files in an encoding other than UTF-8, you can
change the default context encoding setting:

  stream_default_encoding('Shift-JIS');
  $data = file_get_contents('file.txt', FILE_TEXT);
  // work on $data
  file_put_contents('file.txt', $data, FILE_TEXT);

The file_get_contents() and file_put_contents() functions now accept an
additional parameter, FILE_TEXT. If you provide FILE_TEXT for
file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
returns a binary string (which would be appropriate for true binary data, such
as an image file). When writing a Unicode string with file_put_contents(), you
must supply the FILE_TEXT parameter, or PHP generates a warning. 

If you need to work with multiple encodings, you can create custom contexts
using stream_context_create() and then pass in the custom context as an
additional parameter. For example: 

  $ctx = stream_context_create(NULL, array('encoding' => 'big5'));
  $data = file_get_contents('file.txt', FILE_TEXT, $ctx);
  // work on $data
  file_put_contents('file.txt', $data, FILE_TEXT, $ctx);


Conversion Semantics and Error Handling
=======================================

PHP can convert strings explicitly (casting) and implicitly (concatenation,
comparison, and parameter passing). For example, when concatenating a Unicode
string and a binary string, PHP converts the binary string to Unicode for better
precision.

However, not all characters can be converted between Unicode and legacy 
encodings. The first possibility is that a string contains corrupt data or
an illegal byte sequence. In this case, the converter simply stops with 
a message that resembles:

  Warning: Could not convert binary string to Unicode string
  (converter UTF-8 failed on bytes (0xE9) at offset 2)

Conversely, if a similar error occurs when attempting to convert Unicode to
a legacy string, the converter generates a message that resembles:

  Warning: Could not convert Unicode string to binary string
  (converter ISO-8859-1 failed on character {U+DC00} at offset 2)

To customize this behavior, refer to "Creating a Custom Error Handler" below.

The second possibility is that a Unicode character simply cannot be represented
in the legacy encoding. By default, when downconverting from Unicode, the
converter substitutes any missing sequences with the appropriate substitution
sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When
upconverting to Unicode, the converter replaces any byte sequence that has no
Unicode equivalent with the Unicode substitution character (U+FFFD). 

You can customize the conversion error behavior to:

  - stop the conversion and return an empty string
  - skip any invalid characters
  - substitute invalid characters with a custom substitution character
  - escape the invalid character in various formats

To control the global conversion error settings, use the functions:

  unicode_set_error_mode(int direction, int mode)
  unicode_set_subst_char(unicode char)

where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
constants:

  U_CONV_ERROR_STOP
  U_CONV_ERROR_SKIP 
  U_CONV_ERROR_SUBST
  U_CONV_ERROR_ESCAPE_UNICODE
  U_CONV_ERROR_ESCAPE_ICU
  U_CONV_ERROR_ESCAPE_JAVA
  U_CONV_ERROR_ESCAPE_XML_DEC
  U_CONV_ERROR_ESCAPE_XML_HEX

As an example, with a runtime encoding of ISO-8859-1, the conversion:

  $str = (binary)"< \u30AB >";

results in:

  MODE                    RESULT
  --------------------------------------
  stop                    ""
  skip                    "<   >"
  substitute              "< ? >"
  escape (Unicode)        "< {U+30AB} >"
  escape (ICU)            "< %U30AB >"
  escape (Java)           "< \u30AB >"
  escape (XML decimal)    "< &#12459; >"
  escape (XML hex)        "< &#x30AB; >"
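
For comparison, Python's built-in codec error handlers cover several of the same
modes. The sketch below is an analogy, not the PHP 6 API, and the handlers do
not map one-to-one onto the U_CONV_ERROR_* constants: "strict" raises an
exception rather than returning an empty string, and the escape formats differ
in letter case.

```python
text = "< \u30ab >"   # KATAKANA LETTER KA has no ISO-8859-1 equivalent

# Roughly: skip, substitute, escape (Java), and escape (XML decimal).
results = {}
for mode in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    results[mode] = text.encode("iso-8859-1", errors=mode)

# "strict" is the closest match to the stop mode, except that Python raises
# an exception instead of returning an empty string.
try:
    text.encode("iso-8859-1", errors="strict")
    stopped = False
except UnicodeEncodeError:
    stopped = True
```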

With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:

  $str = (unicode)b"< \xe9\xfe >";

results in:

  MODE                    RESULT
  --------------------------------------
  stop                    ""
  skip                    ""
  substitute              ""
  escape (Unicode)        "< %XE9%XFE >"
  escape (ICU)            "< %XE9%XFE >"
  escape (Java)           "< \xE9\xFE >"
  escape (XML decimal)    "< &#233;&#254; >"
  escape (XML hex)        "< &#xE9;&#xFE; >"
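
The same illegal byte sequence can be decoded with Python's error handlers to
see how the upconversion modes differ in practice. This is an analogy rather
than PHP 6 code, and Python's results for skip and substitute differ from the
table above: Python keeps the valid surrounding text instead of returning an
empty string.

```python
raw = b"< \xe9\xfe >"   # 0xE9 0xFE is not a valid UTF-8 sequence

skipped = raw.decode("utf-8", errors="ignore")             # drops bad bytes
substituted = raw.decode("utf-8", errors="replace")        # inserts U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")   # keeps bytes as \xNN

# "strict" stops at the first offending byte and reports its offset.
try:
    raw.decode("utf-8", errors="strict")
    stopped = False
except UnicodeDecodeError as exc:
    stopped = True
    failed_offset = exc.start
```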

The substitution character can be set only for the FROM_UNICODE direction and
has to exist in the target character set. The default substitution character is
(?).

NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To convert
using an alternative encoding, use the unicode_encode() and unicode_decode()
functions. For example,

  $str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);

results in a binary KOI8-R encoded string. 
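
As a rough analogy in Python (not the PHP 6 API), naming the codec explicitly
plays the role of unicode_encode() and unicode_decode(), and errors="replace"
plays the role of U_CONV_ERROR_SUBST. The sample text is made up for the
illustration.

```python
text = "\u0442\u0435\u043a\u0441\u0442"   # the Russian word for "text"

# Explicit codec name instead of a runtime-wide default, with substitution
# for anything the target encoding cannot represent.
koi8 = text.encode("koi8_r", errors="replace")
round_trip = koi8.decode("koi8_r")

# A character outside KOI8-R is replaced with "?" under errors="replace".
lossy = "\u30ab".encode("koi8_r", errors="replace")
```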

Creating a Custom Error Handler
-------------------------------
If an error occurs during the conversion, PHP outputs a warning describing the
problem. Instead of this default behavior, PHP can invoke a user-provided error
handler, similar to how the current user-defined error handler works.  To set
the custom conversion error handler, call:

  mixed unicode_set_error_handler(callback error_handler)

The function returns the previously defined custom error handler. If no error
handler was defined, or if an error occurs when returning the handler, this 
function returns NULL.

When the custom handler is set, the standard error handler is bypassed. It is
the responsibility of the custom handler to output or log any messages, raise
exceptions, or die(), if necessary. However, if the custom error handler returns
FALSE, the standard handler will be invoked afterwards.

The user function specified as the error_handler must accept five parameters:

  mixed error_handler($direction, $encoding, $char_or_byte, $offset, 
  $message)

where:

  $direction    - the direction of conversion, FROM_UNICODE/TO_UNICODE

  $encoding     - the name of the encoding to/from which the conversion
                  was attempted

  $char_or_byte - either the Unicode character or the byte sequence
                  (depending on direction) that caused the error

  $offset       - the offset of the failed character/byte sequence in
                  the source string

  $message      - the error message describing the problem

NOTE: If the error mode set by unicode_set_error_mode() is substitute, 
skip, or escape, the handler won't be called, since these are non-error
causing operations. To always invoke your handler, set the error mode to
U_CONV_ERROR_STOP.
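
Python offers a comparable hook through codecs.register_error(), which may help
picture how such a handler is used. The sketch below is an analogy, not the PHP
6 API: the handler name "log_and_substitute" and the log list are hypothetical,
and Python passes a single exception object instead of five parameters.

```python
import codecs

log = []   # collected (offending text, offset, reason) tuples

def logging_handler(exc):
    """Record the failure, then substitute '?' and resume after the bad span."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    log.append((exc.object[exc.start:exc.end], exc.start, exc.reason))
    return ("?" * (exc.end - exc.start), exc.end)

# "log_and_substitute" is a hypothetical handler name chosen for this sketch.
codecs.register_error("log_and_substitute", logging_handler)

encoded = "< \u30ab >".encode("iso-8859-1", errors="log_and_substitute")
```

The handler receives the failing character and its offset, much like the
$char_or_byte and $offset parameters described above, and decides how the
conversion continues.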


Unicode String Type
===================

The Unicode string type (IS_UNICODE) contains text data encoded in UTF-16. This
is the main string type in PHP when the Unicode semantics switch is turned on.
Unicode strings can exist when the switch is off, but they have to be produced
programmatically via calls to functions that return Unicode types.


Binary String Type
==================

The binary string type (IS_STRING) serves two purposes: backwards compatibility
and representing non-Unicode strings and binary data. When the Unicode semantics
switch is off, it is used for all strings in PHP, the same as in previous
versions. When the switch is on, this type is used to store text in other
encodings as well as true binary data such as images, PDFs, etc.

Printing binary data to the standard output passes it through as-is, independent
of the output encoding.

For examples of specifying binary string literals, refer to the section 
"Language Modifications".

Language Modifications
======================

If the Unicode semantics switch is turned on, PHP string literals --
single-quoted, double-quoted, and heredocs -- become Unicode strings (IS_UNICODE
type). String literals support all the same escape sequences and variable
interpolations as before, plus several new escape sequences.

PHP interprets the contents of strings as follows:

  - all non-escaped characters are interpreted as a corresponding Unicode
    codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
    U+0061, Shift-JIS (0x92 0x86) => U+4E2D
 
  - existing PHP escape sequences are also interpreted as Unicode codepoints,
    including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020

  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4- or
    6-digit hexadecimal Unicode codepoint value, e.g. \u0221 => U+0221,
    \U010410 => U+10410. (Having two sequences avoids the ambiguity of
    \u020608 -- is that supposed to be U+0206 followed by "08", or U+020608?)
    
  - a new escape sequence allows specifying a character by its full
    Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
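
Python has close analogues of these escape sequences, which makes the rules easy
to try out. The sketch below is Python, not PHP 6 syntax, and differs in two
details: Python's \U escape takes eight hex digits rather than six, and named
characters use \N{...} rather than \C{...}.

```python
four_digit = "\u0221"                        # like \u0221 in PHP 6
supplementary = "\U00010410"                 # like \U010410 in PHP 6
by_name = "\N{THAI CHARACTER PHO SAMPHAO}"   # like \C{THAI CHARACTER PHO SAMPHAO}
hex_escape = "\x20"                          # interpreted as code point U+0020

# The ambiguity mentioned above: with a fixed four-digit \u escape,
# "\u020608" is U+0206 followed by the two characters "08".
ambiguous = "\u020608"
```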

PHP allows variable interpolation inside the double-quoted and heredoc strings.
However, the parser separates the string into literal and variable chunks during
compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that PHP
can handle literal chunks in the normal way as far as Unicode support is
concerned.

Since all string literals become Unicode by default, PHP 6 introduces new syntax
for creating byte-oriented or binary strings. Prefixing a string literal with
the letter 'b' creates a binary string:

    $var = b'abc\001';
    $var = b"abc\001";
    $var = b<<<EOD
      abc\001
    EOD;

The content of a binary string is the literal byte sequence inside the
delimiters, which depends on the script encoding (unicode.script_encoding).
Binary string literals support the same escape sequences as PHP 5 strings. If
the Unicode switch is turned off, then the binary string literals generate the
normal string (IS_STRING) type internally without any effect on the application.

The string operators now accommodate the new IS_UNICODE type alongside the
binary IS_STRING type:

  - The concatenation operator (.) and concatenation assignment operator (.=)
    automatically coerce the IS_STRING type to the more precise IS_UNICODE if
    the operands are of different string types.

  - The string indexing operator [] now accommodates IS_UNICODE type strings 
    and extracts the specified character. To support supplementary characters,
    the index specifies a code point, not a byte or a code unit.

  - Bitwise operators and increment/decrement operators do not work on
    Unicode strings. They do work on binary strings.

  - Two new casting operators are introduced, (unicode) and (binary). The
    (string) operator casts to Unicode type if the Unicode semantics switch is
    on, and to binary type otherwise.

  - The comparison operators compare Unicode strings in binary code point 
    order. They also coerce strings to Unicode if the strings are of different 
    types.

  - The arithmetic operators use the same semantics as today for converting
    strings to numbers. A Unicode string is considered numeric if it
    represents a long or a double number in the en_US_POSIX locale.
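
Two of these rules, indexing by code point and comparison in binary code point
order, can be illustrated in Python, whose str type happens to behave the same
way. This is an analogy only, and the sample strings are made up for the
illustration.

```python
text = "a\U00010410b"   # U+10410 lies outside the Basic Multilingual Plane

# Indexing by code point: index 1 is the whole supplementary character.
second = text[1]

# The same text as UTF-16 code units: the supplementary character occupies
# two units (a surrogate pair), so unit indexes and code point indexes differ.
utf16 = text.encode("utf-16-le")
code_units = len(utf16) // 2

# Binary code point order: uppercase sorts before lowercase, and accented
# letters sort after all of ASCII, regardless of any locale.
ordered = sorted(["b", "\u00e9", "a", "Z"])
```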


Unicode Support in Existing Functions
=====================================

All functions in the PHP default distribution are undergoing analysis to 
determine which functions need to be upgraded for native Unicode support. 
You can track progress here:

  http://www.php.net/~scoates/unicode/render_func_data.php

Key extensions that are fully converted include:

  * curl
  * dom
  * json
  * mysql
  * mysqli
  * oci8
  * pcre
  * reflection
  * simplexml
  * soap
  * sqlite
  * xml
  * xmlreader/xmlwriter
  * xsl
  * zlib

NOTE: Functions that have not yet been upgraded ("unsafe" functions) might still
work, since PHP performs Unicode conversions at runtime. However, they might not
work correctly with multibyte binary strings, or with Unicode characters that
are not representable in the specified unicode.runtime_encoding.


Identifiers
===========

Since scripts may be written in various encodings, we do not restrict 
identifiers to be ASCII-only. PHP allows any valid identifier based
on the Unicode Standard Annex #31. 


Numbers
=======

Unlike identifiers, numbers must consist only of ASCII digits, and are
restricted to the en_US_POSIX or C locale. In other words, numbers have no
thousands separator, and the fractional separator is (.) "full stop".  Numeric
strings adhere to the same rules, so "10,3" is not interpreted as a number even
if the current locale's fractional separator is a comma.
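
A simplified check for this rule might look like the Python sketch below. The
pattern and the is_numeric() helper are assumptions made for illustration; they
do not reproduce PHP's actual numeric-string grammar (for example, leading
whitespace and hexadecimal notation are not handled).

```python
import re

# ASCII digits only, optional sign, "." as the fractional separator,
# optional exponent, and no thousands separator. re.ASCII restricts \d
# to the characters 0-9.
NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?", re.ASCII)

def is_numeric(value):
    """Return True if the whole string matches the simplified pattern."""
    return NUMERIC.fullmatch(value) is not None

accepted = [s for s in ("10.3", "-7", "1e5") if is_numeric(s)]
rejected = [s for s in ("10,3", "1,000", "\u0661\u0662") if not is_numeric(s)]
```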

TextIterators
=============

To access characters in a linear fashion, use a TextIterator rather than the
offset operator []. TextIterator is very fast and enables you to iterate over
code points, combining sequences, characters, words, lines, and sentences, both
forward and backward. For example:

  $text = "nai\u0308ve";
  foreach (new TextIterator($text) as $u) {
      var_inspect($u);
  }

lists six code points, including the combining diaeresis (umlaut, U+0308) as a
separate code point. Instantiating the TextIterator to iterate over characters,

  $text = "nai\u0308ve";
  foreach (new TextIterator($text, TextIterator::CHARACTER) as $u) {
      var_inspect($u);
  }

lists five characters, including an "i" with an umlaut as a single character.
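
The difference between the two iteration modes can be reproduced in Python as a
rough analogy. The combining_sequences() helper is hypothetical and much simpler
than the boundary rules ICU applies: it only attaches combining marks to the
preceding base character.

```python
import unicodedata

text = "nai\u0308ve"

# Iterating a Python str yields one item per code point.
code_points = list(text)

def combining_sequences(s):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in s:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

characters = combining_sequences(text)
```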

Locales
=======

Unicode support in PHP relies exclusively on ICU locales, NOT the POSIX locales
installed on the system. You can set and retrieve the default ICU locale using:

  locale_set_default()
  locale_get_default()

ICU locale IDs have a somewhat different format from POSIX locale IDs. The ICU
syntax is:

  <language>[_<script>]_<country>[_<variant>][@<keywords>]

For example, sr_Latn_YU_REVISED@currency=USD is Serbian (Latin, Yugoslavia,
Revised Orthography, Currency=US Dollar).
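
To make the syntax concrete, the following Python sketch splits a locale ID into
its parts. It is an illustration under stated assumptions, not ICU's own parser:
the pattern, the parse_locale_id() helper, and the ";" separator between
keywords are assumptions, and real ICU locale IDs allow forms this pattern
rejects.

```python
import re

# Simplified pattern for <language>[_<script>]_<country>[_<variant>][@<keywords>]
LOCALE_ID = re.compile(
    r"(?P<language>[a-z]{2,3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"_(?P<country>[A-Z]{2})"
    r"(?:_(?P<variant>[A-Z0-9]+))?"
    r"(?:@(?P<keywords>.+))?"
)

def parse_locale_id(locale_id):
    """Split a locale ID into a dict of its components."""
    match = LOCALE_ID.fullmatch(locale_id)
    if match is None:
        raise ValueError("not a recognized locale ID: %r" % locale_id)
    parts = match.groupdict()
    keywords = parts["keywords"]
    # Assumed keyword format: key=value pairs separated by ";".
    parts["keywords"] = (
        dict(pair.split("=", 1) for pair in keywords.split(";")) if keywords else {}
    )
    return parts

serbian = parse_locale_id("sr_Latn_YU_REVISED@currency=USD")
english = parse_locale_id("en_US")
```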

Do not use the deprecated setlocale() function. This function interacts with the
POSIX locale. If Unicode semantics are on, using setlocale() generates
a deprecation warning.

Document TODO
=============
- Final review.
- Fix the HTTP Input Encoding section, that's obsolete now.


References
==========

  Unicode
  http://www.unicode.org

  Unicode Glossary
  http://www.unicode.org/glossary/

  UTF-8
  http://www.utf-8.com/

  UTF-16
  http://www.ietf.org/rfc/rfc2781.txt

  ICU Homepage
  http://www.ibm.com/software/globalization/icu/

  ICU User Guide and API Reference
  http://icu.sourceforge.net/

  Unicode Annex #31
  http://www.unicode.org/reports/tr31/

  PHP Parameter Parsing API
  http://www.php.net/manual/en/zend.arguments.retrieval.php


Authors
=======
  Andrei Zmievski <andrei@gravitonic.com>
  Evan Goer <goer@yahoo-inc.com>

vim: set et tw=80 :