summaryrefslogtreecommitdiff
path: root/README.UNICODE-UPGRADES
blob: 91dae75195d56aa79475ea45741b3b9d69a224cd (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
This document attempts to describe portions of the API related to the new
Unicode functionality and the best practices for upgrading existing
functions to support Unicode.

Your first stop should be README.UNICODE: it covers the general Unicode
functionality and concepts without going into technical implementation
details.

Internal Encoding
=================

UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
two bytes for any Unicode character in the Basic Multilingual Plane, which
is where most of the current world's languages are represented. While being
less memory efficient for basic ASCII text it simplifies the processing and
makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
processing as well.


Zval Structure Changes
======================

For IS_UNICODE type, we add another structure to the union:

    union {
    ....
        struct {
            UChar *val;            /* Unicode string value */
            int len;               /* number of UChar's */
        } ustr;
    ....
    } value;

This cleanly separates the two types of strings and helps preserve backwards
compatibility.

To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
another structure:

    union {
    ....
        struct {                    /* Universal string type */
            zstr val;
            int len;
        } uni;
    ....
    } value;

Where zstr is a union of char*, UChar*, and void*.


Parameter Parsing API Modifications
===================================

There are now five new specifiers: 'u', 't', 'T', 'U', 'S', 'x' and a new '&'
modifier.

  't' specifier
  -------------
  This specifier indicates that the caller requires the incoming parameter to be
  string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
  string value, length, and type.

    void *str;
    int len;
    zend_uchar type;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
        return;
    }
    if (type == IS_UNICODE) {
       /* process Unicode string */
    } else {
       /* process binary string */
    }

  For IS_STRING type, the length represents the number of bytes, and for
  IS_UNICODE the number of UChar's. When converting other types (numbers,
  booleans, etc) to strings, the exact behavior depends on the Unicode semantics
  switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.


  'u' specifier
  -------------
  This specifier indicates that the caller requires the incoming parameter
  to be a Unicode encoded string. If a non-Unicode string is passed, the engine
  creates a copy of the string and automatically convert it to Unicode type before
  passing it to the internal function. No such conversion is necessary for Unicode
  strings, obviously.

    UChar *str;
    int len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
        return;
    }
    /* process Unicode string */


  'T' specifier
  -------------
  This specifier is useful when the function takes two or more strings and
  operates on them. Using 't' specifier for each one would be somewhat
  problematic if the passed-in strings are of mixed types, and multiple
  checks need to be performed in order to do anything. All parameters
  marked by the 'T' specifier are promoted to the same type.

  If at least one of the 'T' parameters is of Unicode type, then the rest of
  them are converted to IS_UNICODE. Otherwise all 'T' parameters are conveted to
  IS_STRING type.


    void *str1, *str2;
    int len1, len2;
    zend_uchar type1, type2;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
                             &type1, &str2, &len2, &type2) == FAILURE) {
       return;
    }
    if (type1 == IS_UNICODE) {
       /* process as Unicode, str2 is guaranteed to be Unicode as well */
    } else {
       /* process as binary string, str2 is guaranteed to be the same */
    }


   'x' specifier
   -------------
   This specifier acts as either 'u' or 's', depending on the value of the
   unicode semantics switch. If UG(unicode) is on, it behaves as 'u', and as
   's' otherwise.

The existing 's' specifier has been modified as well. If a Unicode string is
passed in, it automatically copies and converts the string to the runtime
encoding, and issues a warning. If a binary type is passed-in, no conversion
is necessary. The '&' modifier can be used after 's' specifier to force
a different converter instead.

    char *str;
    int len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s&", &str, &len, UG(utf8_conv)) == FAILURE) {
        return;
    }
    /* here str is in UTF-8, if a Unicode string was passed in */

The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
about the type of the passed-in parameter. If 'U' is specified and the binary
string is passed in, the engine will issue a warning instead of doing automatic
conversion. The converse applies to the 'S' specifier.


Working in Unicode World
========================

Strings
-------

A lot of internal functionality is controlled by the unicode.semantics
switch. Its value is found in the Unicode globals variable, UG(unicode). It
is either on or off for the entire request.

The big thing is that there is a new string types: IS_UNICODE.
This has its own storage in the value union part of
zval (value.ustr) while non-unicode (binary) strings reuse the
IS_STRING type and the value.str element of the zval.

New macros exist (parallel to Z_STRVAL/Z_STRLEN) for accessing unicode strings.

Z_USTRVAL(), Z_USTRLEN()
 - accesses the value (as a UChar*) and length (in code units) of the Unicode type string
   value.ustr.val   value.ustr.len

Z_UNIVAL(), Z_UNILEN()
 - accesses the value (as a zstr) and length (in type-appropriate units)
   value.uni.val    value.uni.len

Z_USTRCPLEN()
 - gives the number of codepoints (not units) in the Unicode type string
   This macro examines the actual string taking into account Surrogate Pairs
   and returns the number of UChar32(UTF32) codepoints which may be less than the
   number of UChar(UTF16) codeunits found in the string buffer.
   If this value will be used repeatedly, consider storing it in a local variable
   to avoid having to reexamine the string every time.


ZVAL_* macros
-------------

The 'dup' parameter to the ZVAL_STRING()/RETVAL_STRING()/RETURN_STRING() type
macros has been extended slightly.  The following defines are now encouraged instead:

#define ZSTR_DUPLICATE (1<<0)
#define ZSTR_AUTOFREE  (1<<1)

ZSTR_DUPLICATE (which has a resulting value of 1) serves the same purpose as a
truth value in old-style 'dup' flags.  The value of 1 was specifically chosen
to match the common practice of passing a 1 for this parameter.
Warning: If you find extension code which uses a truth value other than one for
the dup flag, its logic should be modified to explicitly pass ZSTR_DUPLICATE instead.

ZSTR_AUTOFREE is used with macros such as ZVAL_RT_STRING which may populate Unicode
zvals from non-unicode source strings.  When UG(unicode) is on, the source string
will be implicitly copied (to make a UChar* version).  If the original string
needed copying anyway this is fine.  However if the original string was emalloc()'d
and would have ordinarily been given to the engine (i.e. RETURN_STRING(estrdup("foo"), 0))
then it will need to be freed in UG(unicode) mode to avoid leaking.
The ZSTR_AUTOFREE flag ensures that the original string is freed in UG(unicode) mode.

ZVAL_UNICODE(pzv, str, dup), ZVAL_UNICODEL(pzv, str, str_len, dup)
 - Sets zval to hold a Unicode string. Takes the same parameters as
   Z_STRING(), Z_STRINGL().

ZVAL_U_STRING(conv, pzv, str, dup), ZVAL_U_STRINGL(conv, pzv, str, str_len, dup)
 - When UG(unicode) is off, it's equivalent to Z_STRING(), ZSTRINGL()
   and the conv parameter is ignored.
   When UG(unicode) is on, it sets zval to hold a Unicode representation of the
   passed-in string using the UConverter* specified by conv.
   Since a new string is always created in this case, passing ZSTR_DUPLICATE
   for 'dup' does not matter, but ZSTR_AUTOFREE will be used will be used to
   efree the original value

ZVAL_RT_STRING(pzv, str, dup), ZVAL_RT_STRINGL(pzv, str, str_len, dup)
 - When UG(unicode) is off, it's equivalent to Z_STRING(), Z_STRINGL().
   When UG(unicode) is on, it takes the input string, converts it to Unicode
   using the runtime_encoding converter and sets zval to it.
   Since a new string is always created in this case, passing ZSTR_DUPLICATE
   for 'dup' does not matter, but ZSTR_AUTOFREE will be used will be used to
   efree the original value

ZVAL_ASCII_STRING(pzv, str, dup), ZVAL_ASCII_STRINGL(pzv, str, str_len, dup)
 - When UG(unicode) is off, it's equivalent to Z_STRING(), Z_STRINGL().
   When UG(unicode) is on, it takes the input string, converts it to Unicode
   using an ASCII converter and sets zval to it.
   Since a new string is always created in this case, passing ZSTR_DUPLICATE
   for 'dup' does not matter, but ZSTR_AUTOFREE will be used will be used to
   efree the original value

ZVAL_UTF8_STRING(pzv, str, dup), ZVAL_UTF8_STRINGL(pzv, str, str_len, dup)
 - When UG(unicode) is off, it's equivalent to Z_STRING(), Z_STRINGL().
   When UG(unicode) is on, it takes the input string, converts it to Unicode
   using a UTF8 converter and sets zval to it.
   Since a new string is always created in this case, passing ZSTR_DUPLICATE
   for 'dup' does not matter, but ZSTR_AUTOFREE will be used will be used to
   efree the original value

ZVAL_ZSTR(pzv, type, zstr, dup), ZVAL_ZSTRL(pzv, type, zstr, zstr_len, dup)
 - This macro uses 'type' to switch between calling ZVAL_STRING(pzv, zstr.s, dup)
   and ZVAL_UNICODE(pzv, zstr.u, dup).  No conversion happens so the
   presense of absense of ZSTR_AUTOFREE is ignored.

ZVAL_TEXT(pzv, zstr, dup), ZVAL_TEXTL(pzv, zstr, zstr_len, dup)
 - This macro sets the zval to hold either a Unicode or a normal string,
   depending on the value of UG(unicode). No conversion happens, so be certain
   that the string passed in matches the type expected by UG(unicode).
   One example of its usage would be to initialize zval to hold the name
   of a user function.

ZVAL_EMPTY_UNICODE(pzv) / ZVAL_EMPTY_TEXT(pzv)
 - These macros work identically to ZVAL_EMPTY_STRING() with the UNICODE
   version always generating an IS_UNICODE zval, and the TEXT version
   generating a UG(unicode) dependent string type.

ZVAL_UCHAR32(pzv, char)
 - Converts the character provided into a UChar string (which may potentially
   be 1 or 2 characters long in the case of surrogate pairs) and dispatches
   to ZVAL_UNICODEL().


As usual, for each ZVAL_* macro, there is a matching RETVAL_* and RETURN_* macro.

Conversion Macros
-----------------

convert_to_string_with_converter(zval *op, UConverter *conv)
 - converts a zval to native string using the specified converter, if necessary.

convert_to_unicode_with_converter(zval *op, UConverter *conv)
 - converts a zval to Unicode string using the specified converter, if
   necessary.

convert_to_unicode(zval *op)
 - converts a zval to Unicode string.

convert_to_string(zval *op)
 - Behaves just as it currently does, converting to IS_STRING type

convert_to_text(zval *op)
 - converts a zval to either Unicode or native string, depending on the
   value of UG(unicode) switch

zend_ascii_to_unicode() function can be used to convert an ASCII char*
string to Unicode. This is useful especially for inline string literals, in
which case you can simply use USTR_MAKE() macro, e.g.:

   UChar* ustr;

   ustr = USTR_MAKE("main");

If you need to initialize a few such variables, it may be more efficient to
use ICU macros, which avoid the conversion, depending on the platform. See
[1] for more information.

USTR_FREE(zstr) can be used to free a UChar* string safely, since it checks for
NULL argument. USTR_LEN() takes a zstr as its argument, and
depending on the UG(unicode) value, and returns its strlen() or u_strlen().

Array Manipulation
------------------

The add_next_index_*(), add_index_*() and add_assoc_*() functions have been
significantly expanded both to allow for the unicode type as a value and to
permit various types of keys.

Values: In the following examples, {1} represents a placeholder for the keytype and
its arguments (covered later).

add_{1}_unicode(zval *arr, {1}, UChar *ustr, int dup);
add_{1}_unicodel(zval *arr, {1}, UChar *ustr, int ustr_len, int dup);
 - Works like add_{1}_string() and add_{1}_stringl() but takes a UChar* value
   and adds an IS_UNICODE type.

add_{1}_rt_string(zval *arr, {1}, char *str, int dup);
add_{1}_rt_stringl(zval *arr, {1}, char *str, int str_len, int dup);
 - Works like add_{1}_string() and add_{1}_stringl() but converts the char*
   value to Unicode using runtime encoding when UG(unicode) is on.

add_{1}_ascii_string(zval *arr, {1}, char *str, int dup);
add_{1}_ascii_stringl(zval *arr, {1}, char *str, int str_len, int dup);
 - Works like add_{1}_rt_string() and add_{1}_rt_stringl() but uses
   an ASCII converter rather than runtime encoding.

add_{1}_utf8_string(zval *arr, {1}, char *str, int dup);
add_{1}_utf8_stringl(zval *arr, {1}, char *str, int str_len, int dup);
 - Works like add_{1}_rt_string() and add_{1}_rt_stringl() but uses
   a UTF8 converter rather than runtime encoding.

add_{1}_text(zval *arr, {1}, zstr str, int dup);
add_{1}_textl(zval *arr, {1}, zstr str, int str_len, int dup);
 - Wrapper which dispatches to add_{1}_string(l)() or add_{1}_unicode(l)()
   depending on the setting of UG(unicode).

add_{1}_zstr(zval *arr, {1}, zend_uchar type, zstr str, int dup);
add_{1}_zstrl(zval *arr, {1}, zend_uchar type, zstr str, int str_len, int dup);
 - Works like add_{1}_text() and add_{1}_textl(), but dispatches based on 'type'.


Keys: In the following example, the zval* type is used for values, however
each of the value types (including those listed above) are supported.

The existing key types work as they always have:
  add_next_index_zval(zval *arr, zval *val);
  add_index_zval(zval *arr, long idx, zval *val);
  add_assoc_zval(zval *arr, char *key, zval *val);
  add_assoc_zval_ex(zval *arr, char *key, int key_len, zval *val);
   . Associative keys are considered binary (IS_STRING)
   . Remember that key_len includes the terminating NULL

The following additional methods provide unicode capable keytypes:

add_u_assoc_zval(zval *arr, zend_uchar type, zstr key, zval *val);
add_u_assoc_zval_ex(zval *arr, zend_uchar type, zstr key, int key_len, zval *val);
 . When type==IS_STRING, these behave identically to their
   add_assoc_zval() and add_assoc_zval_ex() counterparts.
   When type==IS_STRING, the key is considered to be Unicode (UChar*).

add_rt_assoc_zval(zval *arr, char *key, zval *val);
add_rt_assoc_zval_ex(zval *arr, char *key, int key_len, zval *val);
 . When UG(unicode) is off, these behave identically to their
   add_assoc_zval() and add_assoc_zval_ex() counterparts.
   When UG(unicode) is on, key is converted to Unicode using runtime encoding.

add_ascii_assoc_zval(zval *arr, char *key, zval *val);
add_ascii_assoc_zval_ex(zval *arr, char *key, int key_len, zval *val);
 . When UG(unicode) is off, these behave identically to their
   add_assoc_zval() and add_assoc_zval_ex() counterparts.
   When UG(unicode) is on, key is converted to Unicode using an ASCII converter.

add_utf8_assoc_zval(zval *arr, char *key, zval *val);
add_utf8_assoc_zval_ex(zval *arr, char *key, int key_len, zval *val);
 . When UG(unicode) is off, these behave identically to their
   add_assoc_zval() and add_assoc_zval_ex() counterparts.
   When UG(unicode) is on, key is converted to Unicode using a UTF8 converter.


Keytype and Valuetype specification may be mixed in any combination, for example:
add_utf8_assoc_ascii_stringl_ex(zval *arr, char *key, int key_len, char *val, int val_len, int dup);


Miscellaneous
-------------

UBYTES() macro can be used to obtain the number of bytes necessary to store
the given number of UChar's. The typical usage is:

    char *constant_name = colon + (UG(unicode)?UBYTES(2):2);


Code Points and Code Units
--------------------------

Unicode type strings are in the UTF-16 encoding where 1 Unicode character
may be represented by 1 or 2 UChar's. Each UChar is referred to as a "code
unit", and a full Unicode character as a "code point". Consequently, number
of code units and number of code points for the same Unicode string may be
different. This has many implications, the most important of which is that
you cannot simply index the UChar* string to  get the desired codepoint.

The zval's value.ustr.len contains the number of code units (UChar -- UTF16).
To obtain the number of code points, one can use u_countChar32() ICU API
function or Z_USTRCPLEN() macro.

ICU provides a number of macros for working with UTF-16 strings on the
codepoint level [2]. They allow you to do things like obtain a codepoint at
random code unit offset, move forward and backward over the string, etc.
There are two versions of iterator macros, *_SAFE and *_UNSAFE. It is strong
recommended to use *_SAFE version, since they handle unpaired surrogates and
check for string boundaries. Here is an example of how to move through
UChar* string and work on codepoints.

    UChar *str = ...;
    int32_t str_len = ...;
    UChar32 codepoint;
    int32_t offset = 0;

    while (offset < str_len) {
        U16_NEXT(str, offset, str_len, codepoint);
        /* now we have the Unicode character in codepoint */
    }

There is not macro to get a codepoint at a certain code point offset, but
there is a Zend API function that does it.

    inline UChar32 zend_get_codepoint_at(UChar *str, int32_t length, int32_t n);

To retrieve 3rd codepoint, you would call:

    zend_get_codepoint_at(str, str_len, 3);

If you have a UChar32 codepoint and need to put it into a UChar* string,
there is another helper function, zend_codepoint_to_uchar(). It takes
a single UChar32 and converts it to a UChar sequence (1 or 2 UChar's).

    UChar buf[8];
    UChar32 codepoint = 0x101a2;
    int8_t num_uchars;
    num_uchars = zend_codepoint_to_uchar(codepoint, buf);

The return value is the number of resulting UChar's or 0, which indicates
invalid codepoint.


Memory Allocation
-----------------

For ease of use and to reduce possible bugs, there are memory allocation
functions specific to Unicode strings. Please use them at all times when
allocating UChar's.

    eumalloc(size)
    eurealloc(ptr, size)
    eustrndup(s, length)
    eustrdup(s)

    peumalloc(size, persistent)
    peurealloc(ptr, size, persistent)

The size parameter refers to the number of UChar's, not bytes.

	zend_zstrndup(type, zstr, length)


Hashes
------

Hashes API has been upgraded to work with Unicode and binary strings. All
hash functions that worked with string keys now have their equivalent
zend_u_hash_* API. The zend_u_hash_* functions take the type of the key
string as the second argument.

When UG(unicode) switch is on, the IS_STRING keys are upconverted to
IS_UNICODE and then used in the hash lookup.

A new HASH_KEY constant has been added for differentiating key types:
 . HASH_KEY_IS_UNICODE

Note that zend_hash_get_current_key_ex() does not have a zend_u_hash_*
version. It returns the key as a char* pointer, you can can cast it
appropriately based on the key type.


Identifiers and Class Entries
-----------------------------

In Unicode mode all the identifiers are Unicode strings. This means that
while various structures such as zend_class_entry, zend_function, etc store
the identifier name as a char* pointer, it will actually point to UChar*
string. Be careful when accessing the names of classes, functions, and such
-- always check UG(unicode) before using them.


Formatted Output
----------------

Since UTF-16 strings frequently contain NULL bytes, you cannot simpley use
%s format to print them out. Towards that end, output functions such as
php_printf(), spprintf(), etc now have three different formats for use with
Unicode strings:

  %r
    This format treats the corresponding argument as a Unicode string. The
    string is automatically converted to the output encoding. If you wish to
    apply a different converter to the string, use %*r and pass the
    converter before the string argument.

    UChar *class_name = USTR_NAME("ReflectionClass");
    zend_printf("%r", class_name);
    spprintf(&utf8_buffer, 0, "%*r", UG(utf8_conv), class_name);

  %R
    This format requires at least two arguments: the first one specifies the
    type of the string to follow (IS_STRING or IS_UNICODE), and the second
    one - the string itself. If the string is of Unicode type, it is
    automatically converted to the output encoding. If you wish to apply
    a different converter to the string, use %*R and pass the converter
    before the string argument.

    zend_throw_exception_ex(U_CLASS_ENTRY(reflection_exception_ptr), 0 TSRMLS_CC,
            "Interface %R does not exist",
            Z_TYPE_P(class_name), Z_UNIVAL_P(class_name));

  %v
    This format takes only one parameter, the string, but the expected
    string type depends on the UG(unicode) value. If the string is of
    Unicode type, it is automatically converted to the output encoding. If
    you wish to apply a different converter to the string, use %*R and pass
    the converter before the string argument.

    zend_error(E_WARNING, "%v::__toString() did not return anything",
            Z_OBJCE_P(object)->name);

  %Z
    This format prints a zval's value. You can specify the minimum length
    of the string representation using "%*Z" as well as the absolute length
    using "%.*Z". The following example is taken from the engine and
    therefor uses zend_spprintf rather than spprintf. Further more clen is
    an integer that is smaller than Z_UNILEN_P(callable).

    zend_spprintf(error, 0, "class '%.*Z' not found", clen, callable);

    The function allows to output any kind of zval values, as long as a
    string (or unicode) conversion is available. Note that printing non
    string zvals outside of request time is not possible.

Since [v]spprintf() can only output native strings there are also the new
functions [v]uspprintf() and [v]zspprintf() that create unicode strings and
return the number of characters printed. That is they return the length rather
than the byte size. The second pair of functions also takes an additional type
parameter that allows to create a string of arbitrary type. The following
example illustrates the use. Assume it fetches a unicode/native string into
path, path_len and path_type inorder to create sub_name, sub_len and sub_type.

	zstr path, sub_name;
	int path_len, sub_len;
	zend_uchar path_type, sub_type;

	/* fetch */

	if (path.v) {
		sub_type = path_type;
		sub_len = zspprintf(path_type, &sub_name, 0, "%R%c%s",
			path_type, path,
			DEFAULT_SLASH,
			entry.d_name);
	}


Upgrading Functions
===================

Let's take a look at a couple of functions that have been upgraded to
support new string types.

substr()
--------

This functions returns part of a string based on offset and length
parameters.

    zstr str;
    int str_len, cp_len;
    zend_uchar str_type;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
        return;
    }

The first thing we notice is that the incoming string specifier is 't',
which means that we can accept all 3 string types. The 'str' variable is
declared as zstr, because it can point to either UChar* or char*.
The actual type of the incoming string is stored in 'str_type' variable.

    if (str_type == IS_UNICODE) {
        cp_len = u_countChar32(str.u, str_len);
    } else {
        cp_len = str_len;
    }

If the string is a Unicode one, we cannot rely on the str_len value to tell
us the number of characters in it. Instead, we call u_countChar32() to
obtain it.

The next several lines normalize start and length parameters to fit within the
string. Nothing new here. Then we locate the appropriate segment.

    if (str_type == IS_UNICODE) {
        int32_t start = 0, end = 0;
        U16_FWD_N(str.u, end, str_len, f);
        start = end;
        U16_FWD_N(str.u, end, str_len, l);
        RETURN_UNICODEL(str.u + start, end-start, ZSTR_DUPLICATE);

Since codepoint (character) #n is not necessarily at offset #n in Unicode
strings, we start at the beginning and iterate forward until we have gone
through the required number of codepoints to reach the start of the segment.
Then we save the location in 'start' and continue iterating through the number
of codepoints specified by the offset. Once that's done, we can return the
segment as a Unicode string.

    } else {
        RETURN_STRINGL(str.s + f, l, ZSTR_DUPLICATE);
    }

For native strings, we can return the segment directly.


strrev()
--------

Let's look at strrev() which requires somewhat more complicated upgrade.
While one of the guidelines for upgrades is that combining sequences are not
really taken into account during processing -- substr() can break them up,
for example -- in this case, we actually should be concerned, because
reversing combining sequence may result in a completely different string. To
illustrate:

      a    (U+0061 LATIN SMALL LETTER A)
      o    (U+006f LATIN SMALL LETTER O)
    + '    (U+0301 COMBINING ACUTE ACCENT)
    + _    (U+0320 COMBINING MINUS SIGN BELOW)
      l    (U+006C LATIN SMALL LETTER L)

Reversing this would result in:

      l    (U+006C LATIN SMALL LETTER L)
    + _    (U+0320 COMBINING MINUS SIGN BELOW)
    + '    (U+0301 COMBINING ACUTE ACCENT)
      o    (U+006f LATIN SMALL LETTER O)
      a    (U+0061 LATIN SMALL LETTER A)

All of a sudden the combining marks are being applied to 'l' instead of 'o'.
To avoid this, we need to treat combininig sequences as a unit, by checking
the combining character class of each character with u_getCombiningClass().

strrev() obtains its single argument, a string, and unless the string is of
Unicode type, processes it exactly as before, simply swapping bytes around.
For Unicode case, the magic is like this:

    int32_t i, x1, x2;
    UChar32 ch;
    UChar *u_s, *u_n, *u_p;

    u_n = eumalloc(Z_USTRLEN_PP(str)+1);
    u_p = u_n;
    u_s = Z_USTRVAL_PP(str);

    i = Z_USTRLEN_PP(str);
    while (i > 0) {
        U16_PREV(u_s, 0, i, ch);
        if (u_getCombiningClass(ch) == 0) {
            u_p += zend_codepoint_to_uchar(ch, u_p);
        } else {
            x2 = i;
            do {
                U16_PREV(u_s, 0, i, ch);
            } while (u_getCombiningClass(ch) != 0);
            x1 = i;
            while (x1 <= x2) {
                U16_NEXT(u_s, x1, Z_USTRLEN_PP(str), ch);
                u_p += zend_codepoint_to_uchar(ch, u_p);
            }
        }
    }
    *u_p = 0;

The basic idea is to walk the string backwards from the end, using
U16_PREV() macro. If the combining class of the current character is 0,
meaning it's a base character and not a combining mark, we simply append it
to the new string. Otherwise, we save the location of the index and do a run
over the characters until we get to the next one with combining class 0. At
that point we append the sequence as is, without reversing, to the new
string. Voila.

Note that the code uses zend_codepoint_to_uchar() to convert full Unicode
characters (UChar32 type) to 1 or 2 UTF-16 code units (UChar type).


realpath()
----------

Filenames use their own converter as it's not uncommon, for example,
to need to access files on a filesystem with latin1 entries while outputting
UTF8 runtime content.

The most common approach to parsing filenames can be found in realpath():

zval **ppfilename;
char *filename;
int filename_len;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "Z", &ppfilename) == FAILURE ||
	php_stream_path_param_encode(ppfilename, &filename, &filename_len, REPORT_ERRORS, FG(default_context)) == FAILURE) {
	return;
}

Here, the filename is taken first as a generic zval**, then converted (separating if necessary)
and populated into local char* and int storage.  The filename will be converted according to
unicode.filesystem_encoding unless the wrapper specified overrides this with its own conversion
function (The http:// wrapper, for example, enforces utf8 conversion).


rmdir()
-------

If the function accepts a context parameter, then this context should be used in place of FG(default_context)

zval **ppdir, *zcontext = NULL;
char *dir;
int dir_len;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "Z|r", &ppdir, &zcontext) == FAILURE) {
	return;
}

context = php_stream_context_from_zval(zcontext, 0);
if (php_stream_path_param_encode(ppdir, &dir, &dir_len, REPORT_ERRORS, context) == FAILURE) {
	return;
}


sqlite_query()
--------------

If the function's underlying library expects a particular encoding (i.e. UTF8), then the alternate form of
the string parameter may be used with zend_parse_parameters().

char *sql;
int sql_len;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s&", &sql, &sql_len, UG(utf8_conv)) == FAILURE) {
    return;
}

Converters
==========

Standard Converters
-------------------

The following converters (UConverter*) are initialized by Zend and are always available (regardless of UG(unicode) mode):
  UG(utf8_conv)
  UG(ascii_conv)
  UG(fallback_encoding_conv) - UTF8 unless overridden by INI setting unicode.fallback_encoding

Additional converters will be optionally initialized depending on INI settings:
  UG(runtime_encoding_conv) - unicode.runtime_encoding
   . Unicode output generated by a script will be encoding using this converter

  UG(script_encoding_conv) - unicode.script_encoding
   . Scripts read from disk will be decoded using this converter

  UG(filesystem_encoding_conv) - unicode.filesystem_encoding
   . Filenames and paths will be encoding using this converter


Since these additional converters may not be instatiated (because their INI value is not set), all uses of these converters must
be wrapped in ZEND_U_CONVERTER() for safety.  If the converter hasn't been instantiated, then UG(fallback_encoding_conv) will be
used instead.

For example, RETURN_RT_STRING("foo", ZSTR_DUPLICATE); expands out to:
  RETURN_U_STRING(ZEND_U_CONVERTER(UG(runtime_encoding_conv)), "foo", ZSTR_DUPLICATE);

Which uses UG(runtime_encoding_conv) if it's been set, otherwise using UG(fallback_encoding_conv).

Note that the INI setting unicode.stream_encoding does not instantiate a UConverter* automatically for use by the process/thread,
it stores the value as a string for use during fopen() style calls where a UConverter* is instantiated for that particular stream.

References
==========

[1] http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html#a1

[2] http://icu.sourceforge.net/apiref/icu4c/utf16_8h.html

vim: set et ai tw=76 fo=tron21: