1 files changed, 182 insertions, 34 deletions
diff --git a/doc/pcre.txt b/doc/pcre.txt
index b8106e4..29cc490 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -28,6 +28,10 @@ SYNOPSIS
      int pcre_get_substring_list(const char *subject,
           int *ovector, int stringcount, const char ***listptr);
 
+     void pcre_free_substring(const char *stringptr);
+
+     void pcre_free_substring_list(const char **stringptr);
+
      const unsigned char *pcre_maketables(void);
 
      int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
@@ -48,9 +52,12 @@ DESCRIPTION
      The PCRE library is a set of functions that implement  regu-
      lar  expression  pattern  matching using the same syntax and
      semantics as Perl  5,  with  just  a  few  differences  (see
+
      below).  The  current  implementation  corresponds  to  Perl
-     5.005, with some additional features from the Perl  develop-
-     ment release.
+     5.005, with some additional features  from  later  versions.
+     This  includes  some  experimental,  incomplete  support for
+     UTF-8 encoded strings. Details of exactly what is  and  what
+     is not supported are given below.
 
      PCRE has its own native API,  which  is  described  in  this
      document.  There  is  also  a  set of wrapper functions that
@@ -67,13 +74,18 @@ DESCRIPTION
      releases.
 
      The functions pcre_compile(), pcre_study(), and  pcre_exec()
-     are  used  for  compiling  and matching regular expressions,
-     while   pcre_copy_substring(),   pcre_get_substring(),   and
-     pcre_get_substring_list()   are  convenience  functions  for
+     are used for compiling and matching regular expressions.
+
+     The functions  pcre_copy_substring(),  pcre_get_substring(),
+     and  pcre_get_substring_list() are convenience functions for
      extracting  captured  substrings  from  a  matched   subject
-     string.  The function pcre_maketables() is used (optionally)
-     to build a set of character tables in the current locale for
-     passing to pcre_compile().
+     string; pcre_free_substring() and pcre_free_substring_list()
+     are also provided, to free the  memory  used  for  extracted
+     strings.
+
+     The function pcre_maketables() is used (optionally) to build
+     a  set of character tables in the current locale for passing
+     to pcre_compile().
 
      The function pcre_fullinfo() is used to find out information
      about a compiled pattern; pcre_info() is an obsolete version
@@ -92,10 +104,19 @@ DESCRIPTION
 
 
 MULTI-THREADING
-     The PCRE functions can be used in  multi-threading  applica-
-     tions, with the proviso that the memory management functions
-     pointed to by pcre_malloc and pcre_free are  shared  by  all
-     threads.
+     The  PCRE  functions  can   be   used   in   multi-threading
+
+
+
+
+
+SunOS 5.8                 Last change:                          2
+
+
+
+     applications,  with  the  proviso that the memory management
+     functions pointed to by pcre_malloc and pcre_free are shared
+     by all threads.
 
      The compiled form of a regular  expression  is  not  altered
      during  matching, so the same compiled pattern can safely be
@@ -103,7 +124,6 @@ MULTI-THREADING
 
 
 
-
 COMPILING A PATTERN
      The function pcre_compile() is called to compile  a  pattern
      into  an internal form. The pattern is a C string terminated
@@ -235,12 +255,23 @@ COMPILING A PATTERN
      followed by "?". It is not compatible with Perl. It can also
      be set by a (?U) option setting within the pattern.
 
+       PCRE_UTF8
+
+     This option causes PCRE to regard both the pattern  and  the
+     subject  as strings of UTF-8 characters instead of just byte
+     strings. However, it is available  only  if  PCRE  has  been
+     built  to  include  UTF-8  support.  If not, the use of this
+     option provokes an error. Support for UTF-8 is new,  experi-
+     mental,  and incomplete.  Details of exactly what it entails
+     are given below.
+
 
 
 STUDYING A PATTERN
      When a pattern is going to be  used  several  times,  it  is
      worth  spending  more time analyzing it in order to speed up
      the time taken for matching. The function pcre_study() takes
+
      a  pointer  to a compiled pattern as its first argument, and
      returns a  pointer  to  a  pcre_extra  block  (another  void
      typedef)  containing  additional  information about the pat-
@@ -344,9 +375,9 @@ INFORMATION ABOUT A PATTERN
 
        PCRE_INFO_BACKREFMAX
 
-     Return the number of the highest back reference in the  pat-
-     tern.  The  fourth argument should point to an int variable.
-     Zero is returned if there are no back references.
+     Return the number of  the  highest  back  reference  in  the
+     pattern.  The  fourth  argument should point to an int vari-
+     able. Zero is returned if there are no back references.
 
        PCRE_INFO_FIRSTCHAR
 
@@ -605,6 +636,15 @@ MATCHING A PATTERN
 
 EXTRACTING CAPTURED SUBSTRINGS
      Captured substrings can be accessed directly  by  using  the
+
+
+
+
+
+SunOS 5.8                 Last change:                         12
+
+
+
      offsets returned by pcre_exec() in ovector. For convenience,
      the functions  pcre_copy_substring(),  pcre_get_substring(),
      and  pcre_get_substring_list()  are  provided for extracting
@@ -631,7 +671,7 @@ EXTRACTING CAPTURED SUBSTRINGS
      the entire pattern, while higher values extract the captured
      substrings. For pcre_copy_substring(), the string is  placed
      in  buffer,  whose  length is given by buffersize, while for
-     pcre_get_substring() a new block of store  is  obtained  via
+     pcre_get_substring() a new block of memory is  obtained  via
      pcre_malloc,  and its address is returned via stringptr. The
      yield of the function is  the  length  of  the  string,  not
      including the terminating zero, or one of
@@ -665,6 +705,16 @@ EXTRACTING CAPTURED SUBSTRINGS
      inspecting the appropriate offset in ovector, which is nega-
      tive for unset substrings.
 
+     The  two  convenience  functions  pcre_free_substring()  and
+     pcre_free_substring_list()  can  be  used to free the memory
+     returned by  a  previous  call  of  pcre_get_substring()  or
+     pcre_get_substring_list(),  respectively.  They  do  nothing
+     more than call the function pointed to by  pcre_free,  which
+     of  course  could  be called directly from a C program. How-
+     ever, PCRE is used in some situations where it is linked via
+     a  special  interface  to another programming language which
+     cannot use pcre_free directly; it is for  these  cases  that
+     the functions are provided.
 
 
 
@@ -733,6 +783,7 @@ DIFFERENCES FROM PERL
      (?p{code})  constructions. However, there is some experimen-
      tal support for recursive patterns using the  non-Perl  item
      (?R).
+
      8. There are at the time of writing some  oddities  in  Perl
      5.005_02  concerned  with  the  settings of captured strings
      when part of a pattern is repeated.  For  example,  matching
@@ -785,11 +836,17 @@ REGULAR EXPRESSION DETAILS
      The syntax and semantics of  the  regular  expressions  sup-
      ported  by PCRE are described below. Regular expressions are
      also described in the Perl documentation and in a number  of
-
      other  books,  some  of which have copious examples. Jeffrey
      Friedl's  "Mastering  Regular  Expressions",  published   by
-     O'Reilly  (ISBN  1-56592-257),  covers them in great detail.
+     O'Reilly (ISBN 1-56592-257), covers them in great detail.
+
      The description here is intended as reference documentation.
+     The basic operation of PCRE is on strings of bytes. However,
+     there are the beginnings of some support for UTF-8 character
+     strings.  To  use  this  support  you must configure PCRE to
+     include it, and then call pcre_compile() with the  PCRE_UTF8
+     option.  How  this affects the pattern matching is described
+     in the final section of this document.
 
      A regular expression is a pattern that is matched against  a
      subject string from left to right. Most characters stand for
@@ -1004,6 +1061,7 @@ CIRCUMFLEX AND DOLLAR
      Outside a character class, in the default matching mode, the
      circumflex  character  is an assertion which is true only if
      the current matching point is at the start  of  the  subject
+
      string.  If  the startoffset argument of pcre_exec() is non-
      zero, circumflex can never match. Inside a character  class,
      circumflex has an entirely different meaning (see below).
@@ -1056,6 +1114,7 @@ FULL STOP (PERIOD, DOT)
      Outside a character class, a dot in the pattern matches  any
      one character in the subject, including a non-printing char-
      acter, but not (by default)  newline.   If  the  PCRE_DOTALL
+
      option  is set, dots match newlines as well. The handling of
      dot is entirely independent of the  handling  of  circumflex
      and  dollar,  the  only  relationship  being  that they both
@@ -1517,18 +1576,19 @@ BACK REFERENCES
      A back reference that occurs inside the parentheses to which
      it  refers  fails when the subpattern is first used, so, for
      example, (a\1) never matches.  However, such references  can
-     be  useful  inside  repeated  subpatterns.  For example, the
-     pattern
+     be useful inside repeated subpatterns. For example, the pat-
+     tern
 
        (a|b\1)+
 
-     matches any number of "a"s and also "aba", "ababaa" etc.  At
+     matches any number of "a"s and also "aba", "ababbaa" etc. At
      each iteration of the subpattern, the back reference matches
-     the character string corresponding to  the  previous  itera-
-     tion.  In  order  for this to work, the pattern must be such
-     that the first iteration does not need  to  match  the  back
-     reference.  This  can  be  done using alternation, as in the
-     example above, or by a quantifier with a minimum of zero.
+     the  character  string   corresponding   to   the   previous
+     iteration.  In  order  for this to work, the pattern must be
+     such that the first iteration does not  need  to  match  the
+     back  reference.  This  can be done using alternation, as in
+     the example above, or by a  quantifier  with  a  minimum  of
+     zero.
 
 
 
@@ -1681,9 +1741,9 @@ ONCE-ONLY SUBPATTERNS
 
      This kind of parenthesis "locks up" the  part of the pattern
      it  contains once it has matched, and a failure further into
-     the pattern is prevented from backtracking  into  it.  Back-
-     tracking  past  it to previous items, however, works as nor-
-     mal.
+     the  pattern  is  prevented  from  backtracking   into   it.
+     Backtracking  past  it  to previous items, however, works as
+     normal.
 
      An alternative description is that a subpattern of this type
      matches  the  string  of  characters that an identical stan-
@@ -1941,9 +2001,9 @@ PERFORMANCE
      repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
      those  cases other than 0, the + repeats can match different
      numbers of times.) When the remainder of the pattern is such
-     that  the entire match is going to fail, PCRE has in princi-
-     ple to try every possible variation, and this  can  take  an
-     extremely long time.
+     that  the  entire  match  is  going  to  fail,  PCRE  has in
+     principle to try every possible variation, and this can take
+     an extremely long time.
 
      An optimization catches some of the more simple  cases  such
      as
@@ -1966,6 +2026,93 @@ PERFORMANCE
 
 
 
+UTF-8 SUPPORT
+     Starting at release 3.3, PCRE has some support for character
+     strings encoded in the UTF-8 format. This is incomplete, and
+     is regarded as experimental. In order to use  it,  you  must
+     configure PCRE to include UTF-8 support in the code, and, in
+     addition, you must call pcre_compile()  with  the  PCRE_UTF8
+     option flag. When you do this, both the pattern and any sub-
+     ject strings that are matched  against  it  are  treated  as
+     UTF-8  strings instead of just strings of bytes, but only in
+     the cases that are mentioned below.
+
+     If you compile PCRE with UTF-8 support, but do not use it at
+     run  time,  the  library will be a bit bigger, but the addi-
+     tional run time overhead is limited to testing the PCRE_UTF8
+     flag in several places, so should not be very large.
+
+     PCRE assumes that the strings  it  is  given  contain  valid
+     UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
+     you pass invalid UTF-8 strings  to  PCRE,  the  results  are
+     undefined.
+
+     Running with PCRE_UTF8 set causes these changes in  the  way
+     PCRE works:
+
+     1. In a pattern, the escape sequence \x{...}, where the con-
+     tents  of  the braces are a string of hexadecimal digits, is
+     interpreted as a UTF-8 character whose code  number  is  the
+     given   hexadecimal  number,  for  example:  \x{1234}.  This
+     inserts from one to six  literal  bytes  into  the  pattern,
+     using the UTF-8 encoding. If a non-hexadecimal digit appears
+     between the braces, the item is not recognized.
+
+     2. The original hexadecimal escape sequence, \xhh, generates
+     a two-byte UTF-8 character if its value is greater than 127.
+
+     3. Repeat quantifiers are NOT correctly handled if they fol-
+     low  a  multibyte character. For example, \x{100}* and \xc3+
+     do not work. If you want to repeat such characters, you must
+     enclose  them  in  non-capturing  parentheses,  for  example
+     (?:\x{100}), at present.
+
+     4. The dot metacharacter matches one UTF-8 character instead
+     of a single byte.
+
+     5. Unlike literal UTF-8 characters,  the  dot  metacharacter
+     followed  by  a  repeat quantifier does operate correctly on
+     UTF-8 characters instead of single bytes.
+
+     6. Although the \x{...} escape is permitted in  a  character
+     class,  characters  whose values are greater than 255 cannot
+     be included in a class.
+
+     7. A class is matched against a UTF-8 character  instead  of
+     just  a  single byte, but it can match only characters whose
+     values are less than 256.  Characters  with  greater  values
+     always fail to match a class.
+
+     8. Repeated classes work correctly on multiple characters.
+
+     9. Classes containing just a single character whose value is
+     greater than 127 (but less than 256), for example, [\x80] or
+     [^\x{93}], do not work because these are optimized into sin-
+     gle  byte  matches.  In the first case, of course, the class
+     brackets are just redundant.
+
+     10. Lookbehind assertions move backwards in the subject by a
+     fixed  number  of  characters  instead  of a fixed number of
+     bytes. Simple cases have been tested to work correctly,  but
+     there may be hidden gotchas herein.
+
+     11. The character types such  as  \d  and  \w  do  not  work
+     correctly  with  UTF-8  characters.  They continue to test a
+     single byte.
+
+     12. Anything not explicitly mentioned here continues to work
+     in bytes rather than in characters.
+
+     The following UTF-8 features of  Perl  5.6  are  not  imple-
+     mented:
+
+     1. The escape sequence \C to match a single byte.
+
+     2. The use of Unicode tables and properties and escapes  \p,
+     \P, and \X.
+
+
+
 AUTHOR
      Philip Hazel <ph10@cam.ac.uk>
      University Computing Service,
@@ -1973,5 +2120,6 @@ AUTHOR
      Cambridge CB2 3QG, England.
      Phone: +44 1223 334714
 
-     Last updated: 27 January 2000
+     Last updated: 28 August 2000,
+       the 250th anniversary of the death of J.S. Bach.
      Copyright (c) 1997-2000 University of Cambridge.