From 88bcaf167c392f97d50caafc1cbf1a226e8a7d85 Mon Sep 17 00:00:00 2001
From: ph10
Date: Wed, 3 Aug 2016 09:01:02 +0000
Subject: Update pcre2test with the /utf8_input option, for generating wide characters in non-UTF 16-bit and 32-bit modes.

git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@553 6239d852-aaf2-0410-a92c-79f79f948069
---
 doc/html/pcre2test.html | 87 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 67 insertions(+), 20 deletions(-)

diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
index 17b308e..7509b37 100644
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.

As the original fairly simple PCRE library evolved, it acquired many different features, and as a result, the original pcretest program ended up with a -lot of options in a messy, arcane syntax, for testing all the features. The +lot of options in a messy, arcane syntax for testing all the features. The move to the new PCRE2 API provided an opportunity to re-implement the test program as pcre2test, with a cleaner modifier syntax. Nevertheless, there are still many obscure modifiers, some of which are specifically designed for @@ -77,32 +77,61 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or all three of these libraries may be simultaneously installed. The pcre2test program can be used to test all the libraries. However, its own input and output are always in 8-bit format. When testing the 16-bit or 32-bit -libraries, patterns and subject strings are converted to 16- or 32-bit format -before being passed to the library functions. Results are converted back to -8-bit code units for output. +libraries, patterns and subject strings are converted to 16-bit or 32-bit +format before being passed to the library functions. Results are converted back +to 8-bit code units for output.

In the rest of this document, the names of library functions and structures are given in generic form, for example, pcre_compile(). The actual names used in the libraries have a suffix _8, _16, or _32, as appropriate. -

+


INPUT ENCODING

Input to pcre2test is processed line by line, either by calling the C -library's fgets() function, or via the libreadline library (see -below). The input is processed using using C's string functions, so must not +library's fgets() function, or via the libreadline library. In some +Windows environments character 26 (hex 1A) causes an immediate end of file, and +no further data is read, so this character should be avoided unless you really +want that action. +

+

+The input is processed using using C's string functions, so must not contain binary zeroes, even though in Unix-like environments, fgets() -treats any bytes other than newline as data characters. In some Windows -environments character 26 (hex 1A) causes an immediate end of file, and no -further data is read. +treats any bytes other than newline as data characters. An error is generated +if a binary zero is encountered. Subject lines are processed for backslash +escapes, which makes it possible to include any data value in strings that are +passed to the library for matching. For patterns, there is a facility for +specifying some or all of the 8-bit input characters as hexadecimal pairs, +which makes it possible to include binary zeros. +

+
+Input for the 16-bit and 32-bit libraries +
+

+When testing the 16-bit or 32-bit libraries, there is a need to be able to +generate character code points greater than 255 in the strings that are passed +to the library. For subject lines, backslash escapes can be used. In addition, +when the utf modifier (see +"Setting compilation options" +below) is set, the pattern and any following subject lines are interpreted as +UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.

-For maximum portability, therefore, it is safest to avoid non-printing -characters in pcre2test input files. There is a facility for specifying -some or all of a pattern's characters as hexadecimal pairs, thus making it -possible to include binary zeroes in a pattern for testing purposes. Subject -lines are processed for backslash escapes, which makes it possible to include -any data value. +For non-UTF testing of wide characters, the utf8_input modifier can be +used. This is mutually exclusive with utf, and is allowed only in 16-bit +or 32-bit mode. It causes the pattern and following subject lines to be treated +as UTF-8 according to the original definition (RFC 2279), which allows for +character values up to 0x7fffffff. Each character is placed in one 16-bit or +32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error +to occur). +

+

+UTF-8 is not capable of encoding values greater than 0x7fffffff, but such +values can be handled by the 32-bit library. When testing this library in +non-UTF mode with utf8_input set, if any character is preceded by the +byte 0xff (which is an illegal byte in UTF-8) 0x80000000 is added to the +character's value. This is the only way of passing such code points in a +pattern string. For subject strings, using an escape sequence is preferable.
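The decoding rules described in the added paragraphs can be illustrated with a short Python sketch. It is not taken from the pcre2test source; the names `decode_one` and `to_code_units` are hypothetical, and the sketch assumes only what the text states: RFC 2279 byte forms of up to six bytes, one code unit per character, an error for values above 0xffff in 16-bit mode, and a 0xff prefix that adds 0x80000000 in 32-bit mode. It does not reject overlong encodings, and whether pcre2test does so is not stated here.

```python
# Lead-byte forms from RFC 2279:
# (expected high bits, mask selecting them, payload mask, continuation bytes)
_LEAD_FORMS = [
    (0xC0, 0xE0, 0x1F, 1),
    (0xE0, 0xF0, 0x0F, 2),
    (0xF0, 0xF8, 0x07, 3),
    (0xF8, 0xFC, 0x03, 4),
    (0xFC, 0xFE, 0x01, 5),
]


def decode_one(data, pos=0, allow_ff_prefix=False):
    """Decode one character at data[pos]; return (value, next_pos).

    Hypothetical helper: when allow_ff_prefix is true, a leading 0xff
    byte adds 0x80000000 to the value of the character that follows.
    """
    offset = 0
    if allow_ff_prefix and data[pos] == 0xFF:
        offset = 0x80000000
        pos += 1
        if pos >= len(data):
            raise ValueError("0xff prefix with no character after it")
    lead = data[pos]
    if lead < 0x80:
        # Single-byte (ASCII) form.
        return lead + offset, pos + 1
    for pattern, select, payload, count in _LEAD_FORMS:
        if (lead & select) == pattern:
            value = lead & payload
            tail = data[pos + 1 : pos + 1 + count]
            if len(tail) != count or any((b & 0xC0) != 0x80 for b in tail):
                raise ValueError("malformed continuation bytes")
            for b in tail:
                value = (value << 6) | (b & 0x3F)
            return value + offset, pos + 1 + count
    raise ValueError("invalid lead byte 0x%02x" % lead)


def to_code_units(data, width):
    """Turn a byte string into a list of 16-bit or 32-bit code units."""
    limit = 0xFFFF if width == 16 else 0xFFFFFFFF
    units = []
    pos = 0
    while pos < len(data):
        # The 0xff prefix is only meaningful for the 32-bit library.
        value, pos = decode_one(data, pos, allow_ff_prefix=(width == 32))
        if value > limit:
            raise ValueError("value 0x%x does not fit in one code unit" % value)
        units.append(value)
    return units
```

For instance, the six-byte sequence `fd bf bf bf bf bf` decodes to 0x7fffffff, the largest value that the original UTF-8 definition can express, which is why the 0xff prefix is needed to reach the remaining 32-bit values.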


COMMAND LINE OPTIONS

@@ -553,7 +582,9 @@ for a description of their effects. As well as turning on the PCRE2_UTF option, the utf modifier causes all non-printing characters in output strings to be printed using the \x{hh...} notation. Otherwise, those less than 0x100 are output in hex without the curly -brackets. +brackets. Setting utf in 16-bit or 32-bit mode also causes pattern and +subject strings to be translated to UTF-16 or UTF-32, respectively, before +being passed to library functions.


Setting compilation controls @@ -584,6 +615,7 @@ about the pattern: pushcopy push a copy onto the stack stackguard=<number> test the stackguard feature tables=[0|1|2] select internal tables + utf8_input treat input as UTF-8 The effects of these modifiers are described in the following sections.

@@ -684,7 +716,8 @@ nine characters, only two of which are specified in hexadecimal: /ab "literal" 32/hex Either single or double quotes may be used. There is no way of including -the delimiter within a substring. +the delimiter within a substring. The hex and expand modifiers are +mutually exclusive.

By default, pcre2test passes patterns as zero-terminated strings to @@ -693,6 +726,19 @@ patterns specified with the hex modifier, the actual length of the pattern is passed.


+Specifying wide characters in 16-bit and 32-bit modes +
+

+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and +translated to UTF-16 or UTF-32 when the utf modifier is set. For testing +the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input modifier +can be used. It is mutually exclusive with utf. Input lines are +interpreted as UTF-8 as a means of specifying wide characters. More details are +given in +"Input encoding" +above. +

+
Generating long repetitive patterns

@@ -708,7 +754,8 @@ are expanded before the pattern is passed to pcre2_compile(). For example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction cannot be nested. An initial "\[" sequence is recognized only if "]{" followed by decimal digits and "}" is found later in the pattern. If not, the characters -remain in the pattern unaltered. +remain in the pattern unaltered. The expand and hex modifiers are +mutually exclusive.
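As a rough illustration of the expansion rule in this hunk, the following Python sketch rewrites each `\[text]{count}` construct as `text` repeated `count` times and leaves an unmatched `\[` untouched. The function name `expand_pattern` is hypothetical, the regular expression is only an approximation of the behaviour described, and edge cases may differ from what pcre2test actually does.

```python
import re

# "\[" followed by the shortest non-empty text that is closed by "]{digits}".
_EXPANSION = re.compile(r"\\\[(.+?)\]\{(\d+)\}", re.DOTALL)


def expand_pattern(pattern):
    """Expand non-nested \\[text]{count} constructs in a pattern string."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)
```

With this sketch, `expand_pattern(r"\[AB]{3}")` yields `ABABAB`, while a pattern such as `a\[b`, which has no closing `]{digits}`, is returned unchanged.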

If part of an expanded pattern looks like an expansion, but is really part of @@ -1706,7 +1753,7 @@ Cambridge, England.


REVISION

-Last updated: 06 July 2016 +Last updated: 02 August 2016
Copyright © 1997-2016 University of Cambridge.