Oniguruma Regular Expressions     2003/07/04

syntax: REG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape
  |       alternation
  (...)   group
  [...]   character class  


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           backspace      (0x08) (* in character class only)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char
  \xHH         hexadecimal char
  \x{7HHHHHHH} wide hexadecimal char
  \cx          control char
  \C-x         control char
  \M-x         meta  (x|0x80)  
  \M-\C-x      meta control char
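
  A few of the escapes above can be checked with a short Ruby sketch.
  It assumes a Ruby interpreter whose Regexp follows this syntax; the
  variable names and sample strings are made up for illustration.

```ruby
# Illustrative checks only; each left-hand string spells the expected char.
tab_by_hex  = "\t"   =~ /\x09/   # \xHH names a char by its hex code
escape_char = "\x1B" =~ /\e/     # \e is ESC (0x1B)
control_a   = "\x01" =~ /\cA/    # \cx is a control char
backspace   = "\x08" =~ /[\b]/   # \b means backspace only inside [...]
word_bound  = "\x08" =~ /\b/     # outside a class, \b is a word boundary

p [tab_by_hex, escape_char, control_a, backspace, word_bound]
```

  The last check finds nothing because a lone backspace contains no
  word character, so there is no word boundary to match.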


3. Character types

  .        any character (except newline)
  \w       word character (alphanumeric, "_" and multibyte char)
  \W       non-word char
  \s       whitespace char (\t, \n, \v, \f, \r, \x20)
  \S       non-whitespace char
  \d       digit char
  \D       non-digit char
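
  As a rough illustration of the character types, the Ruby sketch
  below uses made-up sample strings and variable names. It assumes
  the default options, so "." does not match a newline.

```ruby
# Sample inputs are hypothetical; only the shorthand classes matter here.
words  = "foo_1 bar-2".scan(/\w+/)    # \w covers letters, digits and "_"
digits = "tel: 555 0199".scan(/\d+/)  # runs of digit chars
fields = "a \t b\nc".split(/\s+/)     # split on any whitespace run
dot_crosses_newline = !!("a\nb" =~ /a.b/)  # "." stops at a newline

p words, digits, fields, dot_crosses_newline
```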


4. Quantifier

  greedy

  ?       1 or 0 times
  *       0 or more times
  +       1 or more times
  {n,m}   at least n but not more than m times  
  {n,}    at least n times
  {n}     n times

  reluctant

  ??      1 or 0 times
  *?      0 or more times
  +?      1 or more times
  {n,m}?  at least n but not more than m times  
  {n,}?   at least n times

  possessive (greedy, and does not backtrack once the repeat has matched)

  ?+      1 or 0 times
  *+      0 or more times
  ++      1 or more times
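
  The difference between the three kinds of quantifier can be seen
  in a small Ruby sketch. The sample strings and variable names are
  illustrative assumptions, not part of the library.

```ruby
text = "<a><b>"
greedy    = text[/<.+>/]    # takes as much as possible
reluctant = text[/<.+?>/]   # takes as little as possible

# Greedy a+ gives one "a" back so that the trailing "a" can match.
backtracks = "aaa" =~ /a+a/
# Possessive a++ keeps all three, so the trailing "a" cannot match.
possessive = "aaa" =~ /a++a/

p greedy, reluctant, backtracks, possessive
```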


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      previous end-of-match position


6. POSIX character class  ([:xxxxx:], negate [:^xxxxx:])

  alnum    alphabet or digit char
  alpha    alphabet
  ascii    code value: [0 - 127]
  blank    \t, \x20
  cntrl
  digit    0-9
  graph
  lower
  print
  punct
  space    \t, \n, \v, \f, \r, \x20
  upper
  xdigit   0-9, a-f, A-F
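
  POSIX brackets are written inside a character class, so the full
  form is [[:digit:]]. A minimal Ruby sketch, with hypothetical
  sample strings:

```ruby
sample = "abc123 XYZ"
digits     = sample[/[[:digit:]]+/]    # first run of digits
non_digits = sample[/[[:^digit:]]+/]   # negated bracket: first non-digit run
uppers     = sample[/[[:upper:]]+/]    # first run of upper-case letters
hex        = "1F!"[/[[:xdigit:]]+/]    # hex digits stop at "!"

p digits, non_digits, uppers, hex
```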


7. Operators in character class

  [...]   group (character class in character class)
  &&      intersection
         (lowest precedence operator in character class)
          
  ex. [a-w&&[^c-g]z] ==> ([a-w] and ([^c-g] or z)) ==> [abh-w]
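
  The intersection example above can be checked by listing every
  lower-case letter that the class accepts. This is only a sketch;
  the variable names are illustrative.

```ruby
klass = /[a-w&&[^c-g]z]/

# Collect the letters a..z that match the class.
accepted = ("a".."z").select { |c| c =~ klass }.join

p accepted
```

  "z" is rejected because it is outside [a-w], even though the right
  side of && allows it.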


8. Extended expressions

  (?#...)              comment
  (?imx-imx)           option on/off
                         i: ignore case
                         m: multi-line (dot (.) matches newline)
                         x: extended form
  (?imx-imx:subexp)    option on/off for subexp
  (?:subexp)           not captured
  (?=subexp)           look-ahead
  (?!subexp)           negative look-ahead
  (?<=subexp)          look-behind
  (?<!subexp)          negative look-behind

                       The subexp of a look-behind must have a fixed
                       character length. Alternatives of different
                       lengths are allowed at the top level only.
                       ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

  (?>subexp)           don't backtrack into subexp (atomic group)
  (?<name>subexp)      define named group
                       (name cannot include '>', ')', '\', or the NUL character)
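
  Some of the extended expressions in use, as a hedged Ruby sketch.
  The patterns and sample strings are made up for illustration.

```ruby
date  = /(?<year>\d{4})-(?<month>\d{2})/.match("2003-07")  # named groups
year  = date[:year]
month = date[:month]

ahead   = "foobar foobaz" =~ /foo(?=baz)/   # only the "foo" before "baz"
behind  = 'cost: $42'[/(?<=\$)\d+/]         # digits preceded by "$"
top_alt = "bcx" =~ /(?<=a|bc)x/             # top-level alternatives may differ in length

plain  = "aaab" =~ /(?:a+)ab/   # a+ can give back one "a"
atomic = "aaab" =~ /(?>a+)ab/   # (?>...) never gives it back

scoped_on  = "aBc" =~ /a(?i:b)c/   # ignore case applies to "b" only
scoped_off = "ABc" =~ /a(?i:b)c/   # "a" is still case-sensitive

p year, month, ahead, behind, top_alt, plain, atomic, scoped_on, scoped_off
```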


9. Back reference

  \n          back reference by group number (n >= 1)
  \k<name>    back reference by group name
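
  A back reference matches the same text that the group captured,
  not the group's pattern. A small Ruby sketch with hypothetical
  inputs:

```ruby
# By number: a word char followed by the same char again.
doubled = "hello"[/(\w)\1/]

# By name: the closing quote must be the same char as the opening one.
quote_re   = /(?<q>["']).*?\k<q>/
quoted     = %q(say "hi" or 'bye')[quote_re]
mismatched = %q("oops') =~ quote_re

p doubled, quoted, mismatched
```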


10. Subexp call ("Tanaka Akira special")

  \g<name>    call by group name
  \g<n>       call by group number (only if 'n' is not defined as name)
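
  Unlike a back reference, a subexp call runs the group's pattern
  again, and a group may call itself. The Ruby sketch below uses a
  hypothetical group named "paren" to match nested parentheses.

```ruby
# \g<d> re-runs \d, so a different digit is fine; \k<d> needs the same text.
call    = "1-2" =~ /(?<d>\d)-\g<d>/
backref = "1-2" =~ /(?<d>\d)-\k<d>/

# Recursive call: a parenthesized body may contain further "paren" groups.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
nested_ok  = !!("(a(b)c)" =~ balanced)
unbalanced = !!("(a(b)c"  =~ balanced)

p call, backref, nested_ok, unbalanced
```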


-----------------------------
11. Original extensions

   + named group     (?<name>...)
   + named backref   \k<name>
   + subexp call     \g<name>, \g<group-num>


12. Features lacking compared with Perl 5.8.0

   + [:word:]
   + \N{name}
   + \l,\u,\L,\U, \P, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   + \Q...\E   (* this works under REG_SYNTAX_PERL and REG_SYNTAX_JAVA)


13. Syntax-dependent options

   + REG_SYNTAX_RUBY (default)
     (?m): dot (.) matches newline

   + REG_SYNTAX_PERL, REG_SYNTAX_JAVA
     (?s): dot (.) matches newline
     (?m): ^ matches after newline, $ matches before newline
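
  Under the Ruby syntax, the m option affects only the dot, while ^
  already matches after a newline without any option. A minimal
  sketch, assuming a Ruby interpreter and a made-up sample string:

```ruby
text = "a\nb"
dot_default = text =~ /a.b/        # "." does not match the newline
dot_with_m  = text =~ /(?m:a.b)/   # with m, "." matches the newline
caret_after_newline = text =~ /^b/ # no option needed for ^

p dot_default, dot_with_m, caret_after_newline
```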


14. Differences from the Japanized GNU regex (version 0.12) of Ruby

   + look-behind is added.
     (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
     (in a negative look-behind, a capture group isn't allowed,
      but a shy group (?:) is allowed.)
   + possessive quantifiers are added. ?+, *+, ++
   + operations in character class are added. [], &&
   + named group and subexp call are added.
   + an octal or hexadecimal number sequence can be treated as
     a multibyte code char in a char-class, if a multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + the effect range of an isolated option extends to the next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + an isolated option is not transparent to the previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + an incomplete left brace is allowed as an ordinary char.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/


15. Problems

   + An invalid first byte in UTF-8 is allowed.
     (this is the same as the GNU regex of Ruby)

       /./u =~ "\xa3"

     Validation is possible, of course, but it would make
     matching slower than it is now.

   + A zero-length match inside an unbounded repeat stops the repeat,
     and the state of captured groups isn't checked as a stop condition.

       /()*\1/ =~ ""            #=> match
       /(?:()|())*\1\2/ =~ ""   #=> fail

       /(?:\1a|())*/ =~ "a"     #=> match with ""

   + The ignore-case option has no effect on a char written as an octal
     or hexadecimal number, but it does take effect if that char appears
     in a char class. This is inconsistent, but it is the same behavior
     as the GNU regex of Ruby.

       /\x61/i.match("A")     # => nil
       /[\x61]/i.match("A")   # => match

// END