diff options
Diffstat (limited to 'Performance')
-rw-r--r-- | Performance | 172 |
1 files changed, 172 insertions, 0 deletions
diff --git a/Performance b/Performance new file mode 100644 index 0000000..2bb4b96 --- /dev/null +++ b/Performance @@ -0,0 +1,172 @@ +Some comparisons of PCRE with the original Henry Spencer (1986) regular +expression functions were done on a SPARCstation IPC using gcc version 2.6.3 +with -O optimization, to give some idea as to how the two libraries compare. +This is not a major statistical investigation. + + +Code size +--------- + +The code size of PCRE is a bit over twice the size of the Henry Spencer +functions (roughly 33K vs 14K bytes on a SPARCstation with gcc -O). + + +Store size for compiled expressions +----------------------------------- + +For expressions that are compatible with both libraries, PCRE uses less store +for the examples tried, except in some cases that involve the use of character +classes. Except in the special case of a negated charcter class containing only +one character (e.g. [^a]), PCRE uses a 32-byte bit map for each character +class, in order to get the maximum matching speed. By contrast the Spencer code +uses a strchr() call. + +The Spencer functions have an overhead of 92 bytes per expression, because +there is a table for up to 10 matched substrings held with every compiled +expression. In contrast, PCRE's overhead is just 9 bytes, since it requires the +caller to supply a vector to receive the offsets of the matched substrings. In +the table below, the size without the overhead is shown in brackets. + +PCRE Spencer Pattern +---- ------- ------- + + 18 (09) 109 (17) /^$/ + 25 (16) 120 (28) /^.*nter/ + 26 (17) 121 (29) /^12.34/ + 37 (28) 126 (34) /the quick brown fox/ + 50 (41) 114 (22) /^[]cde]/ + 50 (41) 114 (22) /^[^]cde]/ + 51 (42) 125 (33) /^[.^$|()*+?{,}]+/ + 52 (43) 126 (34) /^[0-9]+$/ + 56 (47) 153 (61) /^(abc)(abc)?zz/ + 57 (48) 133 (41) /^xxx[0-9]+$/ + 57 (48) 145 (53) /([0-9a-fA-F:]+)$/ + 62 (53) 171 (79) /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$$/ + 70 (61) 170 (78) /^(b+|a)(b+|a)?c/ + 74 (65) 173 (81) /^(ba|b*)(ba|b*)?bc/ + 99 (90) 235 (143) /^(a(b(c)))(d(e(f)))(h(i(j)))$/ +119 (110) 157 (65) /^.+[0-9][0-9][0-9]$/ +165 (156) 446 (354) /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ +451 (442) 605 (513) /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + + +Compilation time +---------------- + +Timing was done using the clock() function to time 2000 compilations of each +expression and then dividing by twice the number of clocks per second, to get a +value in milliseconds. The variation observed over several runs was never more +than 0.01: + +PCRE Spencer Pattern +---- ------- ------- + +0.04 0.07 /^$/ +0.06 0.12 /^.*nter/ +0.06 0.13 /^12.34/ +0.06 0.09 /^[]cde]/ +0.07 0.14 /^[0-9]+$/ +0.07 0.10 /^[^]cde]/ +0.08 0.17 /^xxx[0-9]+$/ +0.08 0.14 /the quick brown fox/ +0.09 0.14 /^[.^$|()*+?{,}]+/ +0.10 0.33 /([0-9a-fA-F:]+)$/ +0.12 0.26 /^.+[0-9][0-9][0-9]$/ +0.12 0.42 /^(abc)(abc)?zz/ +0.14 0.51 /^(b+|a)(b+|a)?c/ +0.15 0.53 /^(ba|b*)(ba|b*)?bc/ +0.19 0.51 /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ +0.34 1.59 /^(a(b(c)))(d(e(f)))(h(i(j)))$/ +0.47 1.32 /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ +0.66 1.78 /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + + +Execution time +-------------- + +Execution timing was done in a similar manner. Blank entries in the "pattern" +column below indicate the use of the same pattern as before. + +PCRE Spencer Subject Pattern +---- ------- ------- ------- + +0.03 0.02 <null string> /^$/ +0.04 0.04 enter /^.*nter/ +0.04 0.04 uponter +0.03 0.03 12\r34 /^12.34/ +0.03 0.03 0 /^[0-9]+$/ +0.04 0.03 100 +0.03 0.03 ]thing /^[]cde]/ +0.03 0.03 ething +0.03 0.03 athing /^[^]cde]/ +0.04 0.04 xxx0 /^xxx[0-9]+$/ +0.04 0.04 xxx1234 +0.04 0.07 .^\$(*+)|{?,?} /^[.^$|()*+?{,}]+/ +0.03 0.03 the quick brown fox /the quick brown fox/ +0.06 0.08 What do you know about the quick brown fox? +0.04 0.07 0abc /([0-9a-fA-F:]+)$/ +0.04 0.07 abc +0.05 0.13 5f03:12C0::932e +0.06 0.07 x123 /^.+[0-9][0-9][0-9]$/ +0.06 0.07 123456 +0.06 0.09 abczz /^(abc)(abc)?zz/ +0.06 0.12 abcabczz /^(abc)(abc)?zz/ + /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ +0.23 0.28 abc!pqr=apquxz.ixr.zzz.ac.uk +0.09 0.15 bc /^(b+|a)(b+|a)?c/ +0.09 0.15 bbc +0.08 0.15 bbbc +0.09 0.15 bac +0.09 0.15 bbac +0.07 0.14 aac +0.09 0.15 abbbbbbbbbbbc +0.09 0.15 bbbbbbbbbbbac +0.09 0.18 babc /^(ba|b*)(ba|b*)?bc/ +0.12 0.24 bbabc +0.07 0.15 bababc +0.06 0.10 a. /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ +0.13 0.34 ab-c.pq-r. +0.24 0.58 sxk.zzz.ac.uk. +0.12 0.34 x-.y-. +0.20 0.38 abcdefhij /^(a(b(c)))(d(e(f)))(h(i(j)))$/ + /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ +0.18 0.30 From abcd Mon Sep 01 12:33:02 1997 + +In general, PCRE runs faster than the Spencer function, but remember, this +is just for one particular compiler on one set of hardware and operating +system. Until comprehensive tests have been run in other environments, the most +one can plausibly say is that it is probably no worse on average for the kinds +of expression tested here. + + +Speeding up matching +-------------------- + +A character class is much more efficient than a set of bracketed alternatives. +Matching /^[abc]{12}/ against "abcabcabcabc" took 0.03 ms, whereas +/^(a|b|c){12}/ took 0.33 ms. This is because brackets and alternatives involve +recursion. + + +Serious test +------------ + +One of the tests of PCRE is the monster regular expression from "Mastering +Regular Expressions" (O'Reilly's "hip owls" book, 1997, ISBN 1-56592-257-3) +which recognizes email addresses. There are two versions, unoptimized and +optimized. For interest, here are their timings, again on a SPARCstation IPC. +The compile times were 55 ms and 94 ms, and the compiled expressions +occupied 11010 and 15426 bytes of store, respectively. The following strings +were matched in the times shown (unoptimized first): + +0.34 0.38 user@dom.ain +0.38 0.42 <user@dom.ain> +0.88 0.60 Alan Other <user@dom.ain> +1.87 0.82 "A. Other" <user.1234@dom.ain> (a comment) +1.77 1.19 A. Other <user.1234@dom.ain> (a comment) +2.21 0.42 "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + +The optimization of the expression clearly has a dramatic effect in some cases. + +Philip Hazel <ph10@cus.cam.ac.uk> +October 1997 |