Load pcre-4.0 into code/trunk.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@63 2f5784b3-3f2a-0410-8824-cb99058d5e15
author: nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:40:03 +0000
committer: nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:40:03 +0000
commit: c8cb607ab7e12e185e86a8b23d413b7f9536f24c (patch)
tree: e1c3675d531d498d2a84490908e187a249456d2c /ChangeLog
parent: e27c89c9227398c6feee3ca0748827fd064154cd (diff)
download: pcre-c8cb607ab7e12e185e86a8b23d413b7f9536f24c.tar.gz
1 files changed, 428 insertions, 1 deletions
diff --git a/ChangeLog b/ChangeLog
index 6a71290..9c99cf3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,7 +1,434 @@
 ChangeLog for PCRE
 ------------------
 
-Version 3.0 02-Jan-02
+Version 4.00 17-Feb-03
+----------------------
+
+1. If a comment in an extended regex that started immediately after a meta-item
+extended to the end of string, PCRE compiled incorrect data. This could lead to
+all kinds of weird effects. Example: /#/ was bad; /()#/ was bad; /a#/ was not.
+
+2. Moved to autoconf 2.53 and libtool 1.4.2.
+
+3. Perl 5.8 no longer needs "use utf8" for doing UTF-8 things. Consequently,
+the special perltest8 script is no longer needed - all the tests can be run
+from a single perltest script.
+
+4. From 5.004, Perl has not included the VT character (0x0b) in the set defined
+by \s. It has now been removed in PCRE. This means it isn't recognized as
+whitespace in /x regexes too, which is the same as Perl. Note that the POSIX
+class [:space:] *does* include VT, thereby creating a mess.
+
+5. Added the class [:blank:] (a GNU extension from Perl 5.8) to match only
+space and tab.
+
+6. Perl 5.005 was a long time ago. It's time to amalgamate the tests that use
+its new features into the main test script, reducing the number of scripts.
+
+7. Perl 5.8 has changed the meaning of patterns like /a(?i)b/. Earlier versions
+were backward compatible, and made the (?i) apply to the whole pattern, as if
+/i were given. Now it behaves more logically, and applies the option setting
+only to what follows. PCRE has been changed to follow suit. However, if it
+finds options settings right at the start of the pattern, it extracts them into
+the global options, as before. Thus, they show up in the info data.
+
+8. Added support for the \Q...\E escape sequence. Characters in between are
+treated as literals. This is slightly different from Perl in that $ and @ are
+also handled as literals inside the quotes. In Perl, they will cause variable
+interpolation. Note the following examples:
+
+    Pattern            PCRE matches      Perl matches
+
+    \Qabc$xyz\E        abc$xyz           abc followed by the contents of $xyz
+    \Qabc\$xyz\E       abc\$xyz          abc\$xyz
+    \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
+
+For compatibility with Perl, \Q...\E sequences are recognized inside character
+classes as well as outside them.
+
+9. Re-organized 3 code statements in pcretest to avoid "overflow in
+floating-point constant arithmetic" warnings from a Microsoft compiler. Added a
+(size_t) cast to one statement in pcretest and one in pcreposix to avoid
+signed/unsigned warnings.
+
+10. SunOS4 doesn't have strtoul(). This was used only for unpicking the -o
+option for pcretest, so I've replaced it by a simple function that does just
+that job.
+
+11. pcregrep was ending with code 0 instead of 2 for the commands "pcregrep" or
+"pcregrep -".
+
+12. Added "possessive quantifiers" ?+, *+, ++, and {,}+ which come from Sun's
+Java package. This provides some syntactic sugar for simple cases of what my
+documentation calls "once-only subpatterns". A pattern such as x*+ is the same
+as (?>x*). In other words, if what is inside (?>...) is just a single repeated
+item, you can use this simplified notation. Note that only makes sense with
+greedy quantifiers. Consequently, the use of the possessive quantifier forces
+greediness, whatever the setting of the PCRE_UNGREEDY option.
+
+13. A change of greediness default within a pattern was not taking effect at
+the current level for patterns like /(b+(?U)a+)/. It did apply to parenthesized
+subpatterns that followed. Patterns like /b+(?U)a+/ worked because the option
+was abstracted outside.
+
+14. PCRE now supports the \G assertion. It is true when the current matching
+position is at the start point of the match. This differs from \A when the
+starting offset is non-zero. Used with the /g option of pcretest (or similar
+code), it works in the same way as it does for Perl's /g option. If all
+alternatives of a regex begin with \G, the expression is anchored to the start
+match position, and the "anchored" flag is set in the compiled expression.
+
+15. Some bugs concerning the handling of certain option changes within patterns
+have been fixed. These applied to options other than (?ims). For example,
+"a(?x: b c )d" did not match "XabcdY" but did match "Xa b c dY". It should have
+been the other way round. Some of this was related to change 7 above.
+
+16. PCRE now gives errors for /[.x.]/ and /[=x=]/ as unsupported POSIX
+features, as Perl does. Previously, PCRE gave the warnings only for /[[.x.]]/
+and /[[=x=]]/. PCRE now also gives an error for /[:name:]/ because it supports
+POSIX classes only within a class (e.g. /[[:alpha:]]/).
+
+17. Added support for Perl's \C escape. This matches one byte, even in UTF8
+mode. Unlike ".", it always matches newline, whatever the setting of
+PCRE_DOTALL. However, PCRE does not permit \C to appear in lookbehind
+assertions. Perl allows it, but it doesn't (in general) work because it can't
+calculate the length of the lookbehind. At least, that's the case for Perl
+5.8.0 - I've been told they are going to document that it doesn't work in
+future.
+
+18. Added an error diagnosis for escapes that PCRE does not support: these are
+\L, \l, \N, \P, \p, \U, \u, and \X.
+
+19. Although correctly diagnosing a missing ']' in a character class, PCRE was
+reading past the end of the pattern in cases such as /[abcd/.
+
+20. PCRE was getting more memory than necessary for patterns with classes that
+contained both POSIX named classes and other characters, e.g. /[[:space:]abc/.
+
+21. Added some code, conditional on #ifdef VPCOMPAT, to make life easier for
+compiling PCRE for use with Virtual Pascal.
+
+22. Small fix to the Makefile to make it work properly if the build is done
+outside the source tree.
+
+23. Added a new extension: a condition to go with recursion. If a conditional
+subpattern starts with (?(R) the "true" branch is used if recursion has
+happened, whereas the "false" branch is used only at the top level.
+
+24. When there was a very long string of literal characters (over 255 bytes
+without UTF support, over 250 bytes with UTF support), the computation of how
+much memory was required could be incorrect, leading to segfaults or other
+strange effects.
+
+25. PCRE was incorrectly assuming anchoring (either to start of subject or to
+start of line for a non-DOTALL pattern) when a pattern started with (.*) and
+there was a subsequent back reference to those brackets. This meant that, for
+example, /(.*)\d+\1/ failed to match "abc123bc". Unfortunately, it isn't
+possible to check for precisely this case. All we can do is abandon the
+optimization if .* occurs inside capturing brackets when there are any back
+references whatsoever. (See below for a better fix that came later.)
+
+26. The handling of the optimization for finding the first character of a
+non-anchored pattern, and for finding a character that is required later in the
+match were failing in some cases. This didn't break the matching; it just
+failed to optimize when it could. The way this is done has been re-implemented.
+
+27. Fixed typo in error message for invalid (?R item (it said "(?p").
+
+28. Added a new feature that provides some of the functionality that Perl
+provides with (?{...}). The facility is termed a "callout". The way it is done
+in PCRE is for the caller to provide an optional function, by setting
+pcre_callout to its entry point. Like pcre_malloc and pcre_free, this is a
+global variable. By default it is unset, which disables all calling out. To get
+the function called, the regex must include (?C) at appropriate points. This
+is, in fact, equivalent to (?C0), and any number <= 255 may be given with (?C).
+This provides a means of identifying different callout points. When PCRE
+reaches such a point in the regex, if pcre_callout has been set, the external
+function is called. It is provided with data in a structure called
+pcre_callout_block, which is defined in pcre.h. If the function returns 0,
+matching continues; if it returns a non-zero value, the match at the current
+point fails. However, backtracking will occur if possible. [This was changed
+later and other features added - see item 49 below.]
+
+29. pcretest is upgraded to test the callout functionality. It provides a
+callout function that displays information. By default, it shows the start of
+the match and the current position in the text. There are some new data escapes
+to vary what happens:
+
+    \C+         in addition, show current contents of captured substrings
+    \C-         do not supply a callout function
+    \C!n        return 1 when callout number n is reached
+    \C!n!m      return 1 when callout number n is reached for the mth time
+
+30. If pcregrep was called with the -l option and just a single file name, it
+output "<stdin>" if a match was found, instead of the file name.
+
+31. Improve the efficiency of the POSIX API to PCRE. If the number of capturing
+slots is less than POSIX_MALLOC_THRESHOLD, use a block on the stack to pass to
+pcre_exec(). This saves a malloc/free per call. The default value of
+POSIX_MALLOC_THRESHOLD is 10; it can be changed by --with-posix-malloc-threshold
+when configuring.
+
+32. The default maximum size of a compiled pattern is 64K. There have been a
+few cases of people hitting this limit. The code now uses macros to handle the
+storing of links as offsets within the compiled pattern. It defaults to 2-byte
+links, but this can be changed to 3 or 4 bytes by --with-link-size when
+configuring. Tests 2 and 5 work only with 2-byte links because they output
+debugging information about compiled patterns.
+
+33. Internal code re-arrangements:
+
+(a) Moved the debugging function for printing out a compiled regex into
+    its own source file (printint.c) and used #include to pull it into
+    pcretest.c and, when DEBUG is defined, into pcre.c, instead of having two
+    separate copies.
+
+(b) Defined the list of op-code names for debugging as a macro in
+    internal.h so that it is next to the definition of the opcodes.
+
+(c) Defined a table of op-code lengths for simpler skipping along compiled
+    code. This is again a macro in internal.h so that it is next to the
+    definition of the opcodes.
+
+34. Added support for recursive calls to individual subpatterns, along the
+lines of Robin Houston's patch (but implemented somewhat differently).
+
+35. Further mods to the Makefile to help Win32. Also, added code to pcregrep to
+allow it to read and process whole directories in Win32. This code was
+contributed by Lionel Fourquaux; it has not been tested by me.
+
+36. Added support for named subpatterns. The Python syntax (?P<name>...) is
+used to name a group. Names consist of alphanumerics and underscores, and must
+be unique. Back references use the syntax (?P=name) and recursive calls use
+(?P>name) which is a PCRE extension to the Python extension. Groups still have
+numbers. The function pcre_fullinfo() can be used after compilation to extract
+a name/number map. There are three relevant calls:
+
+  PCRE_INFO_NAMEENTRYSIZE        yields the size of each entry in the map
+  PCRE_INFO_NAMECOUNT            yields the number of entries
+  PCRE_INFO_NAMETABLE            yields a pointer to the map.
+
+The map is a vector of fixed-size entries. The size of each entry depends on
+the length of the longest name used. The first two bytes of each entry are the
+group number, most significant byte first. There follows the corresponding
+name, zero terminated. The names are in alphabetical order.
+
+37. Make the maximum literal string in the compiled code 250 for the non-UTF-8
+case instead of 255. Making it the same both with and without UTF-8 support
+means that the same test output works with both.
+
+38. There was a case of malloc(0) in the POSIX testing code in pcretest. Avoid
+calling malloc() with a zero argument.
+
+39. Change 25 above had to resort to a heavy-handed test for the .* anchoring
+optimization. I've improved things by keeping a bitmap of backreferences with
+numbers 1-31 so that if .* occurs inside capturing brackets that are not in
+fact referenced, the optimization can be applied. It is unlikely that a
+relevant occurrence of .* (i.e. one which might indicate anchoring or forcing
+the match to follow \n) will appear inside brackets with a number greater than
+31, but if it does, any back reference > 31 suppresses the optimization.
+
+40. Added a new compile-time option PCRE_NO_AUTO_CAPTURE. This has the effect
+of disabling numbered capturing parentheses. Any opening parenthesis that is
+not followed by ? behaves as if it were followed by ?: but named parentheses
+can still be used for capturing (and they will acquire numbers in the usual
+way).
+
+41. Redesigned the return codes from the match() function into yes/no/error so
+that errors can be passed back from deep inside the nested calls. A malloc
+failure while inside a recursive subpattern call now causes the
+PCRE_ERROR_NOMEMORY return instead of quietly going wrong.
+
+42. It is now possible to set a limit on the number of times the match()
+function is called in a call to pcre_exec(). This facility makes it possible to
+limit the amount of recursion and backtracking, though not in a directly
+obvious way, because the match() function is used in a number of different
+circumstances. The count starts from zero for each position in the subject
+string (for non-anchored patterns). The default limit is, for compatibility, a
+large number, namely 10 000 000. You can change this in two ways:
+
+(a) When configuring PCRE before making, you can use --with-match-limit=n
+    to set a default value for the compiled library.
+
+(b) For each call to pcre_exec(), you can pass a pcre_extra block in which
+    a different value is set. See 45 below.
+
+If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
+
+43. Added a new function pcre_config(int, void *) to enable run-time extraction
+of things that can be changed at compile time. The first argument specifies
+what is wanted and the second points to where the information is to be placed.
+The current list of available information is:
+
+  PCRE_CONFIG_UTF8
+
+The output is an integer that is set to one if UTF-8 support is available;
+otherwise it is set to zero.
+
+  PCRE_CONFIG_NEWLINE
+
+The output is an integer that it set to the value of the code that is used for
+newline. It is either LF (10) or CR (13).
+
+  PCRE_CONFIG_LINK_SIZE
+
+The output is an integer that contains the number of bytes used for internal
+linkage in compiled expressions. The value is 2, 3, or 4. See item 32 above.
+
+  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
+
+The output is an integer that contains the threshold above which the POSIX
+interface uses malloc() for output vectors. See item 31 above.
+
+  PCRE_CONFIG_MATCH_LIMIT
+
+The output is an unsigned integer that contains the default limit of the number
+of match() calls in a pcre_exec() execution. See 42 above.
+
+44. pcretest has been upgraded by the addition of the -C option. This causes it
+to extract all the available output from the new pcre_config() function, and to
+output it. The program then exits immediately.
+
+45. A need has arisen to pass over additional data with calls to pcre_exec() in
+order to support additional features. One way would have been to define
+pcre_exec2() (for example) with extra arguments, but this would not have been
+extensible, and would also have required all calls to the original function to
+be mapped to the new one. Instead, I have chosen to extend the mechanism that
+is used for passing in "extra" data from pcre_study().
+
+The pcre_extra structure is now exposed and defined in pcre.h. It currently
+contains the following fields:
+
+  flags         a bitmap indicating which of the following fields are set
+  study_data    opaque data from pcre_study()
+  match_limit   a way of specifying a limit on match() calls for a specific
+                  call to pcre_exec()
+  callout_data  data for callouts (see 49 below)
+
+The flag bits are also defined in pcre.h, and are
+
+  PCRE_EXTRA_STUDY_DATA
+  PCRE_EXTRA_MATCH_LIMIT
+  PCRE_EXTRA_CALLOUT_DATA
+
+The pcre_study() function now returns one of these new pcre_extra blocks, with
+the actual study data pointed to by the study_data field, and the
+PCRE_EXTRA_STUDY_DATA flag set. This can be passed directly to pcre_exec() as
+before. That is, this change is entirely upwards-compatible and requires no
+change to existing code.
+
+If you want to pass in additional data to pcre_exec(), you can either place it
+in a pcre_extra block provided by pcre_study(), or create your own pcre_extra
+block.
+
+46. pcretest has been extended to test the PCRE_EXTRA_MATCH_LIMIT feature. If a
+data string contains the escape sequence \M, pcretest calls pcre_exec() several
+times with different match limits, until it finds the minimum value needed for
+pcre_exec() to complete. The value is then output. This can be instructive; for
+most simple matches the number is quite small, but for pathological cases it
+gets very large very quickly.
+
+47. There's a new option for pcre_fullinfo() called PCRE_INFO_STUDYSIZE. It
+returns the size of the data block pointed to by the study_data field in a
+pcre_extra block, that is, the value that was passed as the argument to
+pcre_malloc() when PCRE was getting memory in which to place the information
+created by pcre_study(). The fourth argument should point to a size_t variable.
+pcretest has been extended so that this information is shown after a successful
+pcre_study() call when information about the compiled regex is being displayed.
+
+48. Cosmetic change to Makefile: there's no need to have / after $(DESTDIR)
+because what follows is always an absolute path. (Later: it turns out that this
+is more than cosmetic for MinGW, because it doesn't like empty path
+components.)
+
+49. Some changes have been made to the callout feature (see 28 above):
+
+(i)  A callout function now has three choices for what it returns:
+
+       0  =>  success, carry on matching
+     > 0  =>  failure at this point, but backtrack if possible
+     < 0  =>  serious error, return this value from pcre_exec()
+
+     Negative values should normally be chosen from the set of PCRE_ERROR_xxx
+     values. In particular, returning PCRE_ERROR_NOMATCH forces a standard
+     "match failed" error. The error number PCRE_ERROR_CALLOUT is reserved for
+     use by callout functions. It will never be used by PCRE itself.
+
+(ii) The pcre_extra structure (see 45 above) has a void * field called
+     callout_data, with corresponding flag bit PCRE_EXTRA_CALLOUT_DATA. The
+     pcre_callout_block structure has a field of the same name. The contents of
+     the field passed in the pcre_extra structure are passed to the callout
+     function in the corresponding field in the callout block. This makes it
+     easier to use the same callout-containing regex from multiple threads. For
+     testing, the pcretest program has a new data escape
+
+       \C*n        pass the number n (may be negative) as callout_data
+
+     If the callout function in pcretest receives a non-zero value as
+     callout_data, it returns that value.
+
+50. Makefile wasn't handling CFLAGS properly when compiling dftables. Also,
+there were some redundant $(CFLAGS) in commands that are now specified as
+$(LINK), which already includes $(CFLAGS).
+
+51. Extensions to UTF-8 support are listed below. These all apply when (a) PCRE
+has been compiled with UTF-8 support *and* pcre_compile() has been compiled
+with the PCRE_UTF8 flag. Patterns that are compiled without that flag assume
+one-byte characters throughout. Note that case-insensitive matching applies
+only to characters whose values are less than 256. PCRE doesn't support the
+notion of cases for higher-valued characters.
+
+(i)   A character class whose characters are all within 0-255 is handled as
+      a bit map, and the map is inverted for negative classes. Previously, a
+      character > 255 always failed to match such a class; however it should
+      match if the class was a negative one (e.g. [^ab]). This has been fixed.
+
+(ii)  A negated character class with a single character < 255 is coded as
+      "not this character" (OP_NOT). This wasn't working properly when the test
+      character was multibyte, either singly or repeated.
+
+(iii) Repeats of multibyte characters are now handled correctly in UTF-8
+      mode, for example: \x{100}{2,3}.
+
+(iv)  The character escapes \b, \B, \d, \D, \s, \S, \w, and \W (either
+      singly or repeated) now correctly test multibyte characters. However,
+      PCRE doesn't recognize any characters with values greater than 255 as
+      digits, spaces, or word characters. Such characters always match \D, \S,
+      and \W, and never match \d, \s, or \w.
+
+(v)   Classes may now contain characters and character ranges with values
+      greater than 255. For example: [ab\x{100}-\x{400}].
+
+(vi)  pcregrep now has a --utf-8 option (synonym -u) which makes it call
+      PCRE in UTF-8 mode.
+
+52. The info request value PCRE_INFO_FIRSTCHAR has been renamed
+PCRE_INFO_FIRSTBYTE because it is a byte value. However, the old name is
+retained for backwards compatibility. (Note that LASTLITERAL is also a byte
+value.)
+
+53. The single man page has become too large. I have therefore split it up into
+a number of separate man pages. These also give rise to individual HTML pages;
+these are now put in a separate directory, and there is an index.html page that
+lists them all. Some hyperlinking between the pages has been installed.
+
+54. Added convenience functions for handling named capturing parentheses.
+
+55. Unknown escapes inside character classes (e.g. [\M]) and escapes that
+aren't interpreted therein (e.g. [\C]) are literals in Perl. This is now also
+true in PCRE, except when the PCRE_EXTENDED option is set, in which case they
+are faulted.
+
+56. Introduced HOST_CC and HOST_CFLAGS which can be set in the environment when
+calling configure. These values are used when compiling the dftables.c program
+which is run to generate the source of the default character tables. They
+default to the values of CC and CFLAGS. If you are cross-compiling PCRE,
+you will need to set these values.
+
+57. Updated the building process for Windows DLL, as provided by Fred Cox.
+
+
+Version 3.9 02-Jan-02
 ---------------------
 
 1. A bit of extraneous text had somehow crept into the pcregrep documentation.
author	nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15>	2007-02-24 21:40:03 +0000
committer	nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15>	2007-02-24 21:40:03 +0000
commit	c8cb607ab7e12e185e86a8b23d413b7f9536f24c (patch)
tree	e1c3675d531d498d2a84490908e187a249456d2c /ChangeLog
parent	e27c89c9227398c6feee3ca0748827fd064154cd (diff)
download	pcre-c8cb607ab7e12e185e86a8b23d413b7f9536f24c.tar.gz