Commit | Line | Data |
---|---|---|
8ac170f3 | 1 | This file contains the PCRE man page that describes the regular expressions |
64f2600a | 2 | supported by PCRE version 7.2. Note that not all of the features are relevant |
495ae4b0 PH |
3 | in the context of Exim. In particular, the version of PCRE that is compiled |
4 | with Exim does not include UTF-8 support, there is no mechanism for changing | |
5 | the options with which the PCRE functions are called, and features such as | |
6 | callout are not accessible. | |
7 | ----------------------------------------------------------------------------- | |
8 | ||
92e772ff | 9 | PCREPATTERN(3) PCREPATTERN(3) |
495ae4b0 PH |
10 | |
11 | ||
12 | NAME | |
13 | PCRE - Perl-compatible regular expressions | |
14 | ||
8ac170f3 | 15 | |
495ae4b0 PH |
16 | PCRE REGULAR EXPRESSION DETAILS |
17 | ||
18 | The syntax and semantics of the regular expressions supported by PCRE | |
19 | are described below. Regular expressions are also described in the Perl | |
20 | documentation and in a number of books, some of which have copious | |
21 | examples. Jeffrey Friedl's "Mastering Regular Expressions", published | |
22 | by O'Reilly, covers regular expressions in great detail. This descrip- | |
23 | tion of PCRE's regular expressions is intended as reference material. | |
24 | ||
25 | The original operation of PCRE was on strings of one-byte characters. | |
26 | However, there is now also support for UTF-8 character strings. To use | |
27 | this, you must build PCRE to include UTF-8 support, and then call | |
28 | pcre_compile() with the PCRE_UTF8 option. How this affects pattern | |
29 | matching is mentioned in several places below. There is also a summary | |
30 | of UTF-8 features in the section on UTF-8 support in the main pcre | |
31 | page. | |
32 | ||
8ac170f3 PH |
33 | The remainder of this document discusses the patterns that are sup- |
34 | ported by PCRE when its main matching function, pcre_exec(), is used. | |
35 | From release 6.0, PCRE offers a second matching function, | |
36 | pcre_dfa_exec(), which matches using a different algorithm that is not | |
64f2600a PH |
37 | Perl-compatible. Some of the features discussed below are not available |
38 | when pcre_dfa_exec() is used. The advantages and disadvantages of the | |
39 | alternative function, and how it differs from the normal function, are | |
40 | discussed in the pcrematching page. | |
8ac170f3 | 41 | |
6bf342e1 PH |
42 | |
43 | CHARACTERS AND METACHARACTERS | |
44 | ||
64f2600a PH |
45 | A regular expression is a pattern that is matched against a subject |
46 | string from left to right. Most characters stand for themselves in a | |
47 | pattern, and match the corresponding characters in the subject. As a | |
495ae4b0 PH |
48 | trivial example, the pattern |
49 | ||
50 | The quick brown fox | |
51 | ||
8ac170f3 | 52 | matches a portion of a subject string that is identical to itself. When |
64f2600a PH |
53 | caseless matching is specified (the PCRE_CASELESS option), letters are |
54 | matched independently of case. In UTF-8 mode, PCRE always understands | |
55 | the concept of case for characters whose values are less than 128, so | |
56 | caseless matching is always possible. For characters with higher val- | |
57 | ues, the concept of case is supported if PCRE is compiled with Unicode | |
58 | property support, but not otherwise. If you want to use caseless | |
59 | matching for characters 128 and above, you must ensure that PCRE is | |
8ac170f3 PH |
60 | compiled with Unicode property support as well as with UTF-8 support. |
61 | ||
64f2600a PH |
62 | The power of regular expressions comes from the ability to include |
63 | alternatives and repetitions in the pattern. These are encoded in the | |
8ac170f3 PH |
64 | pattern by the use of metacharacters, which do not stand for themselves |
65 | but instead are interpreted in some special way. | |
66 | ||
64f2600a PH |
67 | There are two different sets of metacharacters: those that are recog- |
68 | nized anywhere in the pattern except within square brackets, and those | |
69 | that are recognized within square brackets. Outside square brackets, | |
6bf342e1 | 70 | the metacharacters are as follows: |
495ae4b0 PH |
71 | |
72 | \ general escape character with several uses | |
73 | ^ assert start of string (or line, in multiline mode) | |
74 | $ assert end of string (or line, in multiline mode) | |
75 | . match any character except newline (by default) | |
76 | [ start character class definition | |
77 | | start of alternative branch | |
78 | ( start subpattern | |
79 | ) end subpattern | |
80 | ? extends the meaning of ( | |
81 | also 0 or 1 quantifier | |
82 | also quantifier minimizer | |
83 | * 0 or more quantifier | |
84 | + 1 or more quantifier | |
85 | also "possessive quantifier" | |
86 | { start min/max quantifier | |
87 | ||
64f2600a | 88 | Part of a pattern that is in square brackets is called a "character |
495ae4b0 PH |
89 | class". In a character class the only metacharacters are: |
90 | ||
91 | \ general escape character | |
92 | ^ negate the class, but only if the first character | |
93 | - indicates character range | |
94 | [ POSIX character class (only if followed by POSIX | |
95 | syntax) | |
96 | ] terminates the character class | |
97 | ||
64f2600a | 98 | The following sections describe the use of each of the metacharacters. |
495ae4b0 PH |
99 | |
100 | ||
101 | BACKSLASH | |
102 | ||
103 | The backslash character has several uses. Firstly, if it is followed by | |
64f2600a PH |
104 | a non-alphanumeric character, it takes away any special meaning that |
105 | character may have. This use of backslash as an escape character | |
495ae4b0 PH |
106 | applies both inside and outside character classes. |
107 | ||
64f2600a PH |
108 | For example, if you want to match a * character, you write \* in the |
109 | pattern. This escaping action applies whether or not the following | |
110 | character would otherwise be interpreted as a metacharacter, so it is | |
111 | always safe to precede a non-alphanumeric with backslash to specify | |
112 | that it stands for itself. In particular, if you want to match a back- | |
495ae4b0 PH |
113 | slash, you write \\. |
114 | ||
64f2600a PH |
115 | If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
116 | the pattern (other than in a character class) and characters between a | |
aa41d2de | 117 | # outside a character class and the next newline are ignored. An escap- |
64f2600a | 118 | ing backslash can be used to include a whitespace or # character as |
aa41d2de | 119 | part of the pattern. |
495ae4b0 | 120 | |
64f2600a PH |
121 | If you want to remove the special meaning from a sequence of charac- |
122 | ters, you can do so by putting them between \Q and \E. This is differ- | |
123 | ent from Perl in that $ and @ are handled as literals in \Q...\E | |
124 | sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- | |
495ae4b0 PH |
125 | tion. Note the following examples: |
126 | ||
127 | Pattern PCRE matches Perl matches | |
128 | ||
129 | \Qabc$xyz\E abc$xyz abc followed by the | |
130 | contents of $xyz | |
131 | \Qabc\$xyz\E abc\$xyz abc\$xyz | |
132 | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz | |
133 | ||
64f2600a | 134 | The \Q...\E sequence is recognized both inside and outside character |
495ae4b0 PH |
135 | classes. |
136 | ||
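As a rough analogue outside PCRE: Python's standard re module does not recognise \Q...\E, but its re.escape() function applies the backslash-before-non-alphanumeric rule described above to a whole string. The sketch below uses Python's re, not PCRE, and the variable names are illustrative only.

```python
import re

literal = "abc$xyz"

# re.escape() puts a backslash before characters that could be special,
# which is the "always safe" escaping rule applied to a whole string.
pattern = re.escape(literal)

# The escaped pattern matches the literal text, "$" included.
matched = re.fullmatch(pattern, "abc$xyz") is not None

# Unescaped, "$" is an end-of-string assertion, so the same text fails.
unescaped = re.fullmatch(literal, "abc$xyz") is not None

print(pattern, matched, unescaped)
```

In PCRE itself the same effect is obtained by writing \Qabc$xyz\E directly in the pattern.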
137 | Non-printing characters | |
138 | ||
139 | A second use of backslash provides a way of encoding non-printing char- | |
64f2600a PH |
140 | acters in patterns in a visible manner. There is no restriction on the |
141 | appearance of non-printing characters, apart from the binary zero that | |
142 | terminates a pattern, but when a pattern is being prepared by text | |
143 | editing, it is usually easier to use one of the following escape | |
495ae4b0 PH |
144 | sequences than the binary character it represents: |
145 | ||
146 | \a alarm, that is, the BEL character (hex 07) | |
147 | \cx "control-x", where x is any character | |
148 | \e escape (hex 1B) | |
149 | \f formfeed (hex 0C) | |
150 | \n newline (hex 0A) | |
151 | \r carriage return (hex 0D) | |
152 | \t tab (hex 09) | |
153 | \ddd character with octal code ddd, or backreference | |
154 | \xhh character with hex code hh | |
aa41d2de | 155 | \x{hhh..} character with hex code hhh.. |
495ae4b0 | 156 | |
64f2600a PH |
157 | The precise effect of \cx is as follows: if x is a lower case letter, |
158 | it is converted to upper case. Then bit 6 of the character (hex 40) is | |
159 | inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; | |
495ae4b0 PH |
160 | becomes hex 7B. |
161 | ||
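The \cx rule can be checked with a few lines of arithmetic. This is a minimal sketch of the rule as stated above, not PCRE's own code; the function name control_char is made up for illustration.

```python
def control_char(x: str) -> int:
    """Return the character code that \\cx stands for, per the rule above."""
    code = ord(x)
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 (hex 40) of the character is inverted.
    return code ^ 0x40


for ch in "z{;":
    print(ch, hex(control_char(ch)))
```

Running it on the three characters from the text reproduces hex 1A, 3B, and 7B.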
64f2600a PH |
162 | After \x, from zero to two hexadecimal digits are read (letters can be |
163 | in upper or lower case). Any number of hexadecimal digits may appear | |
164 | between \x{ and }, but the value of the character code must be less | |
aa41d2de | 165 | than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is, |
64f2600a PH |
166 | the maximum hexadecimal value is 7FFFFFFF). If characters other than |
167 | hexadecimal digits appear between \x{ and }, or if there is no termi- | |
168 | nating }, this form of escape is not recognized. Instead, the initial | |
aa41d2de PH |
169 | \x will be interpreted as a basic hexadecimal escape, with no following |
170 | digits, giving a character whose value is zero. | |
495ae4b0 PH |
171 | |
172 | Characters whose value is less than 256 can be defined by either of the | |
64f2600a | 173 | two syntaxes for \x. There is no difference in the way they are han- |
aa41d2de | 174 | dled. For example, \xdc is exactly the same as \x{dc}. |
495ae4b0 | 175 | |
64f2600a PH |
176 | After \0 up to two further octal digits are read. If there are fewer |
177 | than two digits, just those that are present are used. Thus the | |
aa41d2de | 178 | sequence \0\x\07 specifies two binary zeros followed by a BEL character |
64f2600a | 179 | (code value 7). Make sure you supply two digits after the initial zero |
aa41d2de | 180 | if the pattern character that follows is itself an octal digit. |
495ae4b0 PH |
181 | |
182 | The handling of a backslash followed by a digit other than 0 is compli- | |
183 | cated. Outside a character class, PCRE reads it and any following dig- | |
64f2600a | 184 | its as a decimal number. If the number is less than 10, or if there |
495ae4b0 | 185 | have been at least that many previous capturing left parentheses in the |
64f2600a PH |
186 | expression, the entire sequence is taken as a back reference. A |
187 | description of how this works is given later, following the discussion | |
495ae4b0 PH |
188 | of parenthesized subpatterns. |
189 | ||
64f2600a PH |
190 | Inside a character class, or if the decimal number is greater than 9 |
191 | and there have not been that many capturing subpatterns, PCRE re-reads | |
6bf342e1 | 192 | up to three octal digits following the backslash, and uses them to gen- |
64f2600a PH |
193 | erate a data character. Any subsequent digits stand for themselves. In |
194 | non-UTF-8 mode, the value of a character specified in octal must be | |
195 | less than \400. In UTF-8 mode, values up to \777 are permitted. For | |
aa41d2de | 196 | example: |
495ae4b0 PH |
197 | |
198 | \040 is another way of writing a space | |
199 | \40 is the same, provided there are fewer than 40 | |
200 | previous capturing subpatterns | |
201 | \7 is always a back reference | |
202 | \11 might be a back reference, or another way of | |
203 | writing a tab | |
204 | \011 is always a tab | |
205 | \0113 is a tab followed by the character "3" | |
206 | \113 might be a back reference, otherwise the | |
207 | character with octal code 113 | |
208 | \377 might be a back reference, otherwise | |
209 | the byte consisting entirely of 1 bits | |
210 | \81 is either a back reference, or a binary zero | |
211 | followed by the two characters "8" and "1" | |
212 | ||
64f2600a | 213 | Note that octal values of 100 or greater must not be introduced by a |
495ae4b0 PH |
214 | leading zero, because no more than three octal digits are ever read. |
215 | ||
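The decision procedure for a backslash followed by a non-zero digit, outside a character class, can be sketched as follows. This models only the rules stated above; it is not PCRE's implementation, it does not enforce the \400 limit of non-UTF-8 mode, and the function name and return shape are assumptions made for illustration.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits: str, groups_so_far: int):
    """Classify the digits that follow a backslash (first digit is not 0).

    groups_so_far is the number of capturing left parentheses seen earlier
    in the pattern. Returns (kind, value, literal_rest).
    """
    number = int(digits)
    # Less than 10, or enough previous capturing groups: a back reference.
    if number < 10 or number <= groups_so_far:
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    value = 0
    used = 0
    while used < 3 and used < len(digits) and digits[used] in OCTAL_DIGITS:
        value = value * 8 + int(digits[used])
        used += 1
    # Any remaining digits stand for themselves.
    return ("char", value, digits[used:])


print(interpret_backslash_digits("11", 0))   # octal 11, a tab
print(interpret_backslash_digits("11", 11))  # back reference 11
print(interpret_backslash_digits("81", 0))   # binary zero, then "81"
```

The \81 case falls out naturally: no octal digits can be read, so the value is zero and both digits remain as literal characters.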
aa41d2de | 216 | All the sequences that define a single character value can be used both |
64f2600a PH |
217 | inside and outside character classes. In addition, inside a character |
218 | class, the sequence \b is interpreted as the backspace character (hex | |
219 | 08), and the sequences \R and \X are interpreted as the characters "R" | |
220 | and "X", respectively. Outside a character class, these sequences have | |
6bf342e1 PH |
221 | different meanings (see below). |
222 | ||
223 | Absolute and relative back references | |
224 | ||
64f2600a PH |
225 | The sequence \g followed by a positive or negative number, optionally |
226 | enclosed in braces, is an absolute or relative back reference. A named | |
227 | back reference can be coded as \g{name}. Back references are discussed | |
228 | later, following the discussion of parenthesized subpatterns. | |
495ae4b0 PH |
229 | |
230 | Generic character types | |
231 | ||
6bf342e1 PH |
232 | Another use of backslash is for specifying generic character types. The |
233 | following are always recognized: | |
495ae4b0 PH |
234 | |
235 | \d any decimal digit | |
236 | \D any character that is not a decimal digit | |
64f2600a PH |
237 | \h any horizontal whitespace character |
238 | \H any character that is not a horizontal whitespace character | |
495ae4b0 PH |
239 | \s any whitespace character |
240 | \S any character that is not a whitespace character | |
64f2600a PH |
241 | \v any vertical whitespace character |
242 | \V any character that is not a vertical whitespace character | |
495ae4b0 PH |
243 | \w any "word" character |
244 | \W any "non-word" character | |
245 | ||
246 | Each pair of escape sequences partitions the complete set of characters | |
64f2600a | 247 | into two disjoint sets. Any given character matches one, and only one, |
495ae4b0 PH |
248 | of each pair. |
249 | ||
250 | These character type sequences can appear both inside and outside char- | |
64f2600a PH |
251 | acter classes. They each match one character of the appropriate type. |
252 | If the current matching point is at the end of the subject string, all | |
495ae4b0 PH |
253 | of them fail, since there is no character to match. |
254 | ||
64f2600a PH |
255 | For compatibility with Perl, \s does not match the VT character (code |
256 | 11). This makes it different from the POSIX "space" class. The \s | |
257 | characters are HT (9), LF (10), FF (12), CR (13), and space (32). If | |
aa41d2de | 258 | "use locale;" is included in a Perl script, \s may match the VT charac- |
64f2600a | 259 | ter. In PCRE, it never does. |
495ae4b0 | 260 | |
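The five \s characters listed above can be written out as an explicit class. The sketch below uses Python's re module rather than PCRE; note that Python's own \s does match VT, so the explicit class is what mirrors the PCRE behaviour described here. The names PCRE_S and pcre_s_matches are illustrative.

```python
import re

# HT (9), LF (10), FF (12), CR (13), and space (32); VT (11) is left out.
PCRE_S = re.compile(r"[\t\n\f\r ]")


def pcre_s_matches(ch: str) -> bool:
    """True if ch is one of the characters that PCRE's \\s is said to match."""
    return PCRE_S.fullmatch(ch) is not None


vt = "\x0b"
print(pcre_s_matches(vt))                   # VT is not in the explicit class
print(re.fullmatch(r"\s", vt) is not None)  # Python's \s, by contrast
```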
64f2600a | 261 | In UTF-8 mode, characters with values greater than 128 never match \d, |
495ae4b0 | 262 | \s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
64f2600a PH |
263 | code character property support is available. These sequences retain |
264 | their original meanings from before UTF-8 support was available, mainly | |
265 | for efficiency reasons. | |
266 | ||
267 | The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to | |
268 | the other sequences, these do match certain high-valued codepoints in | |
269 | UTF-8 mode. The horizontal space characters are: | |
270 | ||
271 | U+0009 Horizontal tab | |
272 | U+0020 Space | |
273 | U+00A0 Non-break space | |
274 | U+1680 Ogham space mark | |
275 | U+180E Mongolian vowel separator | |
276 | U+2000 En quad | |
277 | U+2001 Em quad | |
278 | U+2002 En space | |
279 | U+2003 Em space | |
280 | U+2004 Three-per-em space | |
281 | U+2005 Four-per-em space | |
282 | U+2006 Six-per-em space | |
283 | U+2007 Figure space | |
284 | U+2008 Punctuation space | |
285 | U+2009 Thin space | |
286 | U+200A Hair space | |
287 | U+202F Narrow no-break space | |
288 | U+205F Medium mathematical space | |
289 | U+3000 Ideographic space | |
290 | ||
291 | The vertical space characters are: | |
292 | ||
293 | U+000A Linefeed | |
294 | U+000B Vertical tab | |
295 | U+000C Formfeed | |
296 | U+000D Carriage return | |
297 | U+0085 Next line | |
298 | U+2028 Line separator | |
299 | U+2029 Paragraph separator | |
300 | ||
301 | A "word" character is an underscore or any character less than 256 that | |
302 | is a letter or digit. The definition of letters and digits is con- | |
303 | trolled by PCRE's low-valued character tables, and may vary if locale- | |
304 | specific matching is taking place (see "Locale support" in the pcreapi | |
305 | page). For example, in a French locale such as "fr_FR" in Unix-like | |
306 | systems, or "french" in Windows, some character codes greater than 128 | |
307 | are used for accented letters, and these are matched by \w. The use of | |
308 | locales with Unicode is discouraged. | |
495ae4b0 | 309 | |
6bf342e1 PH |
310 | Newline sequences |
311 | ||
64f2600a PH |
312 | Outside a character class, the escape sequence \R matches any Unicode |
313 | newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is | |
6bf342e1 PH |
314 | equivalent to the following: |
315 | ||
316 | (?>\r\n|\n|\x0b|\f|\r|\x85) | |
317 | ||
64f2600a | 318 | This is an example of an "atomic group", details of which are given |
6bf342e1 | 319 | below. This particular group matches either the two-character sequence |
64f2600a | 320 | CR followed by LF, or one of the single characters LF (linefeed, |
6bf342e1 PH |
321 | U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
322 | return, U+000D), or NEL (next line, U+0085). The two-character sequence | |
323 | is treated as a single unit that cannot be split. | |
324 | ||
64f2600a | 325 | In UTF-8 mode, two additional characters whose codepoints are greater |
6bf342e1 | 326 | than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
64f2600a | 327 | rator, U+2029). Unicode character property support is not needed for |
6bf342e1 PH |
328 | these characters to be recognized. |
329 | ||
330 | Inside a character class, \R matches the letter "R". | |
331 | ||
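The non-UTF-8 expansion of \R can be tried out in any engine that lacks \R. The sketch below uses Python's re module and, as an assumption for portability, a plain non-capturing group instead of the atomic group, so it does not guarantee that CRLF stays unsplit inside a larger backtracking pattern. The names NEWLINE and split_lines are illustrative.

```python
import re

# Same alternatives, in the same order, as the expansion shown above.
# CRLF is listed first so that it is consumed as one unit.
NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")


def split_lines(text: str) -> list[str]:
    """Split text at any of the newline sequences recognised above."""
    return NEWLINE.split(text)


print(split_lines("a\r\nb\nc\rd\x85e"))
```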
495ae4b0 PH |
332 | Unicode character properties |
333 | ||
334 | When PCRE is built with Unicode character property support, three addi- | |
64f2600a PH |
335 | tional escape sequences that match characters with specific properties |
336 | are available. When not in UTF-8 mode, these sequences are of course | |
337 | limited to testing characters whose codepoints are less than 256, but | |
338 | they do work in this mode. The extra escape sequences are: | |
495ae4b0 | 339 | |
aa41d2de PH |
340 | \p{xx} a character with the xx property |
341 | \P{xx} a character without the xx property | |
342 | \X an extended Unicode sequence | |
495ae4b0 | 343 | |
64f2600a | 344 | The property names represented by xx above are limited to the Unicode |
aa41d2de PH |
345 | script names, the general category properties, and "Any", which matches |
346 | any character (including newline). Other properties such as "InMusical- | |
64f2600a | 347 | Symbols" are not currently supported by PCRE. Note that \P{Any} does |
aa41d2de PH |
348 | not match any characters, so always causes a match failure. |
349 | ||
350 | Sets of Unicode characters are defined as belonging to certain scripts. | |
64f2600a | 351 | A character from one of these sets can be matched using a script name. |
aa41d2de PH |
352 | For example: |
353 | ||
354 | \p{Greek} | |
355 | \P{Han} | |
356 | ||
64f2600a | 357 | Those that are not part of an identified script are lumped together as |
aa41d2de PH |
358 | "Common". The current list of scripts is: |
359 | ||
6bf342e1 | 360 | Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
64f2600a | 361 | Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
6bf342e1 | 362 | Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
64f2600a PH |
363 | Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
364 | gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, | |
6bf342e1 | 365 | Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
64f2600a | 366 | Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
6bf342e1 PH |
367 | Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
368 | Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. | |
aa41d2de | 369 | |
64f2600a | 370 | Each character has exactly one general category property, specified by |
aa41d2de | 371 | a two-letter abbreviation. For compatibility with Perl, negation can be |
64f2600a | 372 | specified by including a circumflex between the opening brace and the |
aa41d2de PH |
373 | property name. For example, \p{^Lu} is the same as \P{Lu}. |
374 | ||
375 | If only one letter is specified with \p or \P, it includes all the gen- | |
64f2600a PH |
376 | eral category properties that start with that letter. In this case, in |
377 | the absence of negation, the curly brackets in the escape sequence are | |
aa41d2de | 378 | optional; these two examples have the same effect: |
495ae4b0 PH |
379 | |
380 | \p{L} | |
381 | \pL | |
382 | ||
aa41d2de | 383 | The following general category property codes are supported: |
495ae4b0 PH |
384 | |
385 | C Other | |
386 | Cc Control | |
387 | Cf Format | |
388 | Cn Unassigned | |
389 | Co Private use | |
390 | Cs Surrogate | |
391 | ||
392 | L Letter | |
393 | Ll Lower case letter | |
394 | Lm Modifier letter | |
395 | Lo Other letter | |
396 | Lt Title case letter | |
397 | Lu Upper case letter | |
398 | ||
399 | M Mark | |
400 | Mc Spacing mark | |
401 | Me Enclosing mark | |
402 | Mn Non-spacing mark | |
403 | ||
404 | N Number | |
405 | Nd Decimal number | |
406 | Nl Letter number | |
407 | No Other number | |
408 | ||
409 | P Punctuation | |
410 | Pc Connector punctuation | |
411 | Pd Dash punctuation | |
412 | Pe Close punctuation | |
413 | Pf Final punctuation | |
414 | Pi Initial punctuation | |
415 | Po Other punctuation | |
416 | Ps Open punctuation | |
417 | ||
418 | S Symbol | |
419 | Sc Currency symbol | |
420 | Sk Modifier symbol | |
421 | Sm Mathematical symbol | |
422 | So Other symbol | |
423 | ||
424 | Z Separator | |
425 | Zl Line separator | |
426 | Zp Paragraph separator | |
427 | Zs Space separator | |
428 | ||
64f2600a PH |
429 | The special property L& is also supported: it matches a character that |
430 | has the Lu, Ll, or Lt property, in other words, a letter that is not | |
aa41d2de PH |
431 | classified as a modifier or "other". |
432 | ||
64f2600a PH |
433 | The long synonyms for these properties that Perl supports (such as |
434 | \p{Letter}) are not supported by PCRE, nor is it permitted to prefix | |
aa41d2de PH |
435 | any of these properties with "Is". |
436 | ||
437 | No character that is in the Unicode table has the Cn (unassigned) prop- | |
438 | erty. Instead, this property is assumed for any code point that is not | |
439 | in the Unicode table. | |
495ae4b0 | 440 | |
64f2600a | 441 | Specifying caseless matching does not affect these escape sequences. |
495ae4b0 PH |
442 | For example, \p{Lu} always matches only upper case letters. |
443 | ||
64f2600a | 444 | The \X escape matches any number of Unicode characters that form an |
495ae4b0 PH |
445 | extended Unicode sequence. \X is equivalent to |
446 | ||
447 | (?>\PM\pM*) | |
448 | ||
64f2600a PH |
449 | That is, it matches a character without the "mark" property, followed |
450 | by zero or more characters with the "mark" property, and treats the | |
451 | sequence as an atomic group (see below). Characters with the "mark" | |
452 | property are typically accents that affect the preceding character. | |
453 | None of them have codepoints less than 256, so in non-UTF-8 mode \X | |
454 | matches any one character. | |
495ae4b0 | 455 | |
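The grouping that \X performs can be imitated without a regex engine. This is a minimal sketch using Python's unicodedata module, which follows Python's own Unicode tables rather than PCRE's; the function name extended_sequences is made up for illustration.

```python
import unicodedata


def extended_sequences(text: str) -> list[str]:
    """Group each non-mark character with the mark characters that follow it."""
    groups: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and groups:
            # A mark attaches to the sequence that is already open.
            groups[-1] += ch
        else:
            # A non-mark character starts a new sequence. A mark with no
            # preceding character is kept on its own here, whereas \X
            # itself would not match at that position.
            groups.append(ch)
    return groups


# "e" followed by U+0301 (combining acute accent), then a plain "a".
print(extended_sequences("e\u0301a"))
```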
64f2600a PH |
456 | Matching characters by Unicode property is not fast, because PCRE has |
457 | to search a structure that contains data for over fifteen thousand | |
495ae4b0 PH |
458 | characters. That is why the traditional escape sequences such as \d and |
459 | \w do not use Unicode properties in PCRE. | |
460 | ||
64f2600a PH |
461 | Resetting the match start |
462 | ||
463 | The escape sequence \K, which is a Perl 5.10 feature, causes any previ- | |
464 | ously matched characters not to be included in the final matched | |
465 | sequence. For example, the pattern: | |
466 | ||
467 | foo\Kbar | |
468 | ||
469 | matches "foobar", but reports that it has matched "bar". This feature | |
470 | is similar to a lookbehind assertion (described below). However, in | |
471 | this case, the part of the subject before the real match does not have | |
472 | to be of fixed length, as lookbehind assertions do. The use of \K does | |
473 | not interfere with the setting of captured substrings. For example, | |
474 | when the pattern | |
475 | ||
476 | (foo)\Kbar | |
477 | ||
478 | matches "foobar", the first substring is still set to "foo". | |
479 | ||
495ae4b0 PH |
480 | Simple assertions |
481 | ||
6bf342e1 | 482 | The final use of backslash is for certain simple assertions. An asser- |
8ac170f3 PH |
483 | tion specifies a condition that has to be met at a particular point in |
484 | a match, without consuming any characters from the subject string. The | |
485 | use of subpatterns for more complicated assertions is described below. | |
495ae4b0 PH |
486 | The backslashed assertions are: |
487 | ||
488 | \b matches at a word boundary | |
489 | \B matches when not at a word boundary | |
6bf342e1 PH |
490 | \A matches at the start of the subject |
491 | \Z matches at the end of the subject | |
492 | also matches before a newline at the end of the subject | |
493 | \z matches only at the end of the subject | |
494 | \G matches at the first matching position in the subject | |
495ae4b0 | 495 | |
8ac170f3 | 496 | These assertions may not appear in character classes (but note that \b |
495ae4b0 PH |
497 | has a different meaning, namely the backspace character, inside a char- |
498 | acter class). | |
499 | ||
8ac170f3 PH |
500 | A word boundary is a position in the subject string where the current |
501 | character and the previous character do not both match \w or \W (i.e. | |
502 | one matches \w and the other matches \W), or the start or end of the | |
495ae4b0 PH |
503 | string if the first or last character matches \w, respectively. |
504 | ||
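The word boundary definition can be written out directly. The sketch below assumes ASCII letters, digits, and underscore as the "word" characters, and cross-checks the result against the \b of Python's re module (not PCRE) in ASCII mode. The names WORD, is_word_boundary, and boundaries are illustrative.

```python
import re

WORD = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_")


def is_word_boundary(subject: str, pos: int) -> bool:
    """True if exactly one of the characters around pos is a word character.

    The start and end of the string count as non-word neighbours, which
    covers the "first or last character matches \\w" cases.
    """
    before = pos > 0 and subject[pos - 1] in WORD
    after = pos < len(subject) and subject[pos] in WORD
    return before != after


def boundaries(subject: str) -> list[int]:
    return [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]


print(boundaries("ab cd"))
print([m.start() for m in re.finditer(r"\b", "ab cd", re.ASCII)])
```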
8ac170f3 | 505 | The \A, \Z, and \z assertions differ from the traditional circumflex |
495ae4b0 | 506 | and dollar (described in the next section) in that they only ever match |
8ac170f3 PH |
507 | at the very start and end of the subject string, whatever options are |
508 | set. Thus, they are independent of multiline mode. These three asser- | |
495ae4b0 | 509 | tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
8ac170f3 PH |
510 | affect only the behaviour of the circumflex and dollar metacharacters. |
511 | However, if the startoffset argument of pcre_exec() is non-zero, indi- | |
495ae4b0 | 512 | cating that matching is to start at a point other than the beginning of |
8ac170f3 | 513 | the subject, \A can never match. The difference between \Z and \z is |
aa41d2de PH |
514 | that \Z matches before a newline at the end of the string as well as at |
515 | the very end, whereas \z matches only at the end. | |
516 | ||
517 | The \G assertion is true only when the current matching position is at | |
518 | the start point of the match, as specified by the startoffset argument | |
519 | of pcre_exec(). It differs from \A when the value of startoffset is | |
520 | non-zero. By calling pcre_exec() multiple times with appropriate argu- | |
495ae4b0 PH |
521 | ments, you can mimic Perl's /g option, and it is in this kind of imple- |
522 | mentation where \G can be useful. | |
523 | ||
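The repeated-call technique can be sketched in Python, with the caveat that this is an analogue and not the PCRE API: Python's re has no \G, but Pattern.match(subject, pos) only succeeds if the match begins exactly at pos, which plays a similar role to a leading \G with a non-zero startoffset. The token pattern and function name are assumptions for illustration.

```python
import re

# A hypothetical token pattern: a run of digits or a run of lower case letters.
TOKEN = re.compile(r"\d+|[a-z]+")


def tokens(subject: str) -> list[str]:
    """Collect consecutive tokens, each starting where the previous one ended."""
    found = []
    pos = 0
    while pos < len(subject):
        # match() is anchored at pos, so no characters can be skipped.
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        found.append(m.group())
        pos = m.end()
    return found


print(tokens("abc123def"))
print(tokens("ab 12"))  # stops at the space, which no token can start with
```

Because every match must start at the current position, the loop stops at the first character that cannot begin a token instead of skipping ahead.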
aa41d2de | 524 | Note, however, that PCRE's interpretation of \G, as the start of the |
495ae4b0 | 525 | current match, is subtly different from Perl's, which defines it as the |
aa41d2de PH |
526 | end of the previous match. In Perl, these can be different when the |
527 | previously matched string was empty. Because PCRE does just one match | |
495ae4b0 PH |
528 | at a time, it cannot reproduce this behaviour. |
529 | ||
aa41d2de | 530 | If all the alternatives of a pattern begin with \G, the expression is |
495ae4b0 PH |
531 | anchored to the starting match position, and the "anchored" flag is set |
532 | in the compiled regular expression. | |
533 | ||
534 | ||
535 | CIRCUMFLEX AND DOLLAR | |
536 | ||
537 | Outside a character class, in the default matching mode, the circumflex | |
aa41d2de PH |
538 | character is an assertion that is true only if the current matching |
539 | point is at the start of the subject string. If the startoffset argu- | |
540 | ment of pcre_exec() is non-zero, circumflex can never match if the | |
541 | PCRE_MULTILINE option is unset. Inside a character class, circumflex | |
495ae4b0 PH |
542 | has an entirely different meaning (see below). |
543 | ||
aa41d2de PH |
544 | Circumflex need not be the first character of the pattern if a number |
545 | of alternatives are involved, but it should be the first thing in each | |
546 | alternative in which it appears if the pattern is ever to match that | |
547 | branch. If all possible alternatives start with a circumflex, that is, | |
548 | if the pattern is constrained to match only at the start of the sub- | |
549 | ject, it is said to be an "anchored" pattern. (There are also other | |
495ae4b0 PH |
550 | constructs that can cause a pattern to be anchored.) |
551 | ||
aa41d2de PH |
552 | A dollar character is an assertion that is true only if the current |
553 | matching point is at the end of the subject string, or immediately | |
554 | before a newline at the end of the string (by default). Dollar need not | |
555 | be the last character of the pattern if a number of alternatives are | |
556 | involved, but it should be the last item in any branch in which it | |
557 | appears. Dollar has no special meaning in a character class. | |
495ae4b0 | 558 | |
8ac170f3 PH |
559 | The meaning of dollar can be changed so that it matches only at the |
560 | very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at | |
495ae4b0 PH |
561 | compile time. This does not affect the \Z assertion. |
562 | ||
563 | The meanings of the circumflex and dollar characters are changed if the | |
aa41d2de PH |
564 | PCRE_MULTILINE option is set. When this is the case, a circumflex |
565 | matches immediately after internal newlines as well as at the start of | |
566 | the subject string. It does not match after a newline that ends the | |
567 | string. A dollar matches before any newlines in the string, as well as | |
568 | at the very end, when PCRE_MULTILINE is set. When newline is specified | |
569 | as the two-character sequence CRLF, isolated CR and LF characters do | |
570 | not indicate newlines. | |
571 | ||
572 | For example, the pattern /^abc$/ matches the subject string "def\nabc" | |
573 | (where \n represents a newline) in multiline mode, but not otherwise. | |
574 | Consequently, patterns that are anchored in single line mode because | |
575 | all branches start with ^ are not anchored in multiline mode, and a | |
576 | match for circumflex is possible when the startoffset argument of | |
577 | pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if | |
578 | PCRE_MULTILINE is set. | |
579 | ||
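The /^abc$/ example can be reproduced with Python's re module, whose ^ and $ behave in the way described here when the newline is LF; re.MULTILINE stands in for PCRE_MULTILINE. This is an illustration in a different engine, not a call into PCRE.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: $ also matches just before a newline that ends the string.
trailing = re.search(r"abc$", "abc\n")

print(single, multi, trailing)
```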
580 | Note that the sequences \A, \Z, and \z can be used to match the start | |
581 | and end of the subject in both modes, and if all branches of a pattern | |
582 | start with \A it is always anchored, whether or not PCRE_MULTILINE is | |
583 | set. | |
495ae4b0 PH |
584 | |
585 | ||
586 | FULL STOP (PERIOD, DOT) | |
587 | ||
588 | Outside a character class, a dot in the pattern matches any one charac- | |
aa41d2de PH |
589 | ter in the subject string except (by default) a character that signi- |
590 | fies the end of a line. In UTF-8 mode, the matched character may be | |
6bf342e1 PH |
591 | more than one byte long. |
592 | ||
593 | When a line ending is defined as a single character, dot never matches | |
594 | that character; when the two-character sequence CRLF is used, dot does | |
595 | not match CR if it is immediately followed by LF, but otherwise it | |
596 | matches all characters (including isolated CRs and LFs). When any Uni- | |
597 | code line endings are being recognized, dot does not match CR or LF or | |
598 | any of the other line ending characters. | |
599 | ||
600 | The behaviour of dot with regard to newlines can be changed. If the | |
601 | PCRE_DOTALL option is set, a dot matches any one character, without | |
602 | exception. If the two-character sequence CRLF is present in the subject | |
603 | string, it takes two dots to match it. | |
604 | ||
605 | The handling of dot is entirely independent of the handling of circum- | |
606 | flex and dollar, the only relationship being that they both involve | |
aa41d2de | 607 | newlines. Dot has no special meaning in a character class. |
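
As a rough illustration of the dot rules, the sketch below uses Python's re module in place of PCRE. It assumes that re.DOTALL corresponds to PCRE_DOTALL and that LF is the only newline character, which is how Python's dot behaves.

```python
import re

lf_subject = "a\nb"
crlf_subject = "a\r\nb"

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r"a.b", lf_subject)

# With the dot-all option set, a dot matches any one character.
dotall = re.fullmatch(r"a.b", lf_subject, re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
one_dot = re.fullmatch(r"a.b", crlf_subject, re.DOTALL)
two_dots = re.fullmatch(r"a..b", crlf_subject, re.DOTALL)

print(default_dot, bool(dotall), one_dot, bool(two_dots))
```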
495ae4b0 PH |
608 | |
609 | ||
610 | MATCHING A SINGLE BYTE | |
611 | ||
612 | Outside a character class, the escape sequence \C matches any one byte, | |
6bf342e1 PH |
613 | both in and out of UTF-8 mode. Unlike a dot, it always matches any |
614 | line-ending characters. The feature is provided in Perl in order to | |
615 | match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- | |
616 | acters into individual bytes, what remains in the string may be a mal- | |
617 | formed UTF-8 string. For this reason, the \C escape sequence is best | |
618 | avoided. | |
495ae4b0 | 619 | |
8ac170f3 PH |
620 | PCRE does not allow \C to appear in lookbehind assertions (described |
621 | below), because in UTF-8 mode this would make it impossible to calcu- | |
495ae4b0 PH |
622 | late the length of the lookbehind. |
623 | ||
624 | ||
625 | SQUARE BRACKETS AND CHARACTER CLASSES | |
626 | ||
627 | An opening square bracket introduces a character class, terminated by a | |
628 | closing square bracket. A closing square bracket on its own is not spe- | |
629 | cial. If a closing square bracket is required as a member of the class, | |
8ac170f3 | 630 | it should be the first data character in the class (after an initial |
495ae4b0 PH |
631 | circumflex, if present) or escaped with a backslash. |
632 | ||
8ac170f3 PH |
633 | A character class matches a single character in the subject. In UTF-8 |
634 | mode, the character may occupy more than one byte. A matched character | |
495ae4b0 | 635 | must be in the set of characters defined by the class, unless the first |
8ac170f3 PH |
636 | character in the class definition is a circumflex, in which case the |
637 | subject character must not be in the set defined by the class. If a | |
638 | circumflex is actually required as a member of the class, ensure it is | |
495ae4b0 PH |
639 | not the first character, or escape it with a backslash. |
640 | ||
8ac170f3 PH |
641 | For example, the character class [aeiou] matches any lower case vowel, |
642 | while [^aeiou] matches any character that is not a lower case vowel. | |
495ae4b0 | 643 | Note that a circumflex is just a convenient notation for specifying the |
8ac170f3 PH |
644 | characters that are in the class by enumerating those that are not. A |
645 | class that starts with a circumflex is not an assertion: it still con- | |
646 | sumes a character from the subject string, and therefore it fails if | |
495ae4b0 PH |
647 | the current pointer is at the end of the string. |
648 | ||
8ac170f3 PH |
649 | In UTF-8 mode, characters with values greater than 255 can be included |
650 | in a class as a literal string of bytes, or by using the \x{ escaping | |
495ae4b0 PH |
651 | mechanism. |
652 | ||
8ac170f3 PH |
653 | When caseless matching is set, any letters in a class represent both |
654 | their upper case and lower case versions, so for example, a caseless | |
655 | [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not | |
656 | match "A", whereas a caseful version would. In UTF-8 mode, PCRE always | |
657 | understands the concept of case for characters whose values are less | |
658 | than 128, so caseless matching is always possible. For characters with | |
659 | higher values, the concept of case is supported if PCRE is compiled | |
660 | with Unicode property support, but not otherwise. If you want to use | |
661 | caseless matching for characters 128 and above, you must ensure that | |
662 | PCRE is compiled with Unicode property support as well as with UTF-8 | |
663 | support. | |
495ae4b0 | 664 | |
6bf342e1 PH |
665 | Characters that might indicate line breaks are never treated in any |
666 | special way when matching character classes, whatever line-ending | |
667 | sequence is in use, and whatever setting of the PCRE_DOTALL and | |
668 | PCRE_MULTILINE options is used. A class such as [^a] always matches one | |
669 | of these characters. | |
495ae4b0 PH |
670 | |
671 | The minus (hyphen) character can be used to specify a range of charac- | |
672 | ters in a character class. For example, [d-m] matches any letter | |
673 | between d and m, inclusive. If a minus character is required in a | |
674 | class, it must be escaped with a backslash or appear in a position | |
675 | where it cannot be interpreted as indicating a range, typically as the | |
676 | first or last character in the class. | |
677 | ||
678 | It is not possible to have the literal character "]" as the end charac- | |
679 | ter of a range. A pattern such as [W-]46] is interpreted as a class of | |
680 | two characters ("W" and "-") followed by a literal string "46]", so it | |
681 | would match "W46]" or "-46]". However, if the "]" is escaped with a | |
682 | backslash it is interpreted as the end of range, so [W-\]46] is inter- | |
683 | preted as a class containing a range followed by two other characters. | |
684 | The octal or hexadecimal representation of "]" can also be used to end | |
685 | a range. | |
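
The difference between the two forms can be checked with a short sketch. It uses Python's re module as a stand-in, on the assumption that Python parses these two classes the same way as PCRE does.

```python
import re

# Unescaped "]": a class of "W" and "-" followed by the literal "46]".
unescaped = re.compile(r"[W-]46]")

# Escaped "]": one class holding the range W-] plus "4" and "6".
escaped = re.compile(r"[W-\]46]")

unescaped_hits = [s for s in ["W46]", "-46]", "Z"] if unescaped.fullmatch(s)]
escaped_hits = [s for s in ["W", "Z", "]", "4", "6", "W46]"] if escaped.fullmatch(s)]

print(unescaped_hits)
print(escaped_hits)
```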
686 | ||
687 | Ranges operate in the collating sequence of character values. They can | |
688 | also be used for characters specified numerically, for example | |
689 | [\000-\037]. In UTF-8 mode, ranges can include characters whose values | |
690 | are greater than 255, for example [\x{100}-\x{2ff}]. | |
691 | ||
692 | If a range that includes letters is used when caseless matching is set, | |
693 | it matches the letters in either case. For example, [W-c] is equivalent | |
694 | to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if | |
64f2600a | 695 | character tables for a French locale are in use, [\xc8-\xcb] matches |
495ae4b0 PH |
696 | accented E characters in both cases. In UTF-8 mode, PCRE supports the |
697 | concept of case for characters with values greater than 127 only when | |
698 | it is compiled with Unicode property support. | |
699 | ||
700 | The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear | |
701 | in a character class, and add the characters that they match to the | |
702 | class. For example, [\dABCDEF] matches any hexadecimal digit. A circum- | |
703 | flex can conveniently be used with the upper case character types to | |
704 | specify a more restricted set of characters than the matching lower | |
705 | case type. For example, the class [^\W_] matches any letter or digit, | |
706 | but not underscore. | |
707 | ||
708 | The only metacharacters that are recognized in character classes are | |
709 | backslash, hyphen (only where it can be interpreted as specifying a | |
710 | range), circumflex (only at the start), opening square bracket (only | |
711 | when it can be interpreted as introducing a POSIX class name - see the | |
712 | next section), and the terminating closing square bracket. However, | |
713 | escaping other non-alphanumeric characters does no harm. | |
714 | ||
715 | ||
716 | POSIX CHARACTER CLASSES | |
717 | ||
718 | Perl supports the POSIX notation for character classes. This uses names | |
719 | enclosed by [: and :] within the enclosing square brackets. PCRE also | |
720 | supports this notation. For example, | |
721 | ||
722 | [01[:alpha:]%] | |
723 | ||
724 | matches "0", "1", any alphabetic character, or "%". The supported class | |
725 | names are | |
726 | ||
727 | alnum letters and digits | |
728 | alpha letters | |
729 | ascii character codes 0 - 127 | |
730 | blank space or tab only | |
731 | cntrl control characters | |
732 | digit decimal digits (same as \d) | |
733 | graph printing characters, excluding space | |
734 | lower lower case letters | |
735 | print printing characters, including space | |
736 | punct printing characters, excluding letters and digits | |
737 | space white space (not quite the same as \s) | |
738 | upper upper case letters | |
739 | word "word" characters (same as \w) | |
740 | xdigit hexadecimal digits | |
741 | ||
742 | The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), | |
743 | and space (32). Notice that this list includes the VT character (code | |
744 | 11). This makes "space" different to \s, which does not include VT (for | |
745 | Perl compatibility). | |
746 | ||
747 | The name "word" is a Perl extension, and "blank" is a GNU extension | |
748 | from Perl 5.8. Another Perl extension is negation, which is indicated | |
749 | by a ^ character after the colon. For example, | |
750 | ||
751 | [12[:^digit:]] | |
752 | ||
753 | matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the | |
754 | POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but | |
755 | these are not supported, and an error is given if they are encountered. | |
756 | ||
757 | In UTF-8 mode, characters with values greater than 127 do not match any | |
758 | of the POSIX character classes. | |
759 | ||
760 | ||
761 | VERTICAL BAR | |
762 | ||
763 | Vertical bar characters are used to separate alternative patterns. For | |
764 | example, the pattern | |
765 | ||
766 | gilbert|sullivan | |
767 | ||
768 | matches either "gilbert" or "sullivan". Any number of alternatives may | |
769 | appear, and an empty alternative is permitted (matching the empty | |
aa41d2de PH |
770 | string). The matching process tries each alternative in turn, from left |
771 | to right, and the first one that succeeds is used. If the alternatives | |
772 | are within a subpattern (defined below), "succeeds" means matching the | |
773 | rest of the main pattern as well as the alternative in the subpattern. | |
495ae4b0 PH |
774 | |
775 | ||
776 | INTERNAL OPTION SETTING | |
777 | ||
778 | The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and | |
779 | PCRE_EXTENDED options can be changed from within the pattern by a | |
780 | sequence of Perl option letters enclosed between "(?" and ")". The | |
781 | option letters are | |
782 | ||
783 | i for PCRE_CASELESS | |
784 | m for PCRE_MULTILINE | |
785 | s for PCRE_DOTALL | |
786 | x for PCRE_EXTENDED | |
787 | ||
788 | For example, (?im) sets caseless, multiline matching. It is also possi- | |
789 | ble to unset these options by preceding the letter with a hyphen, and a | |
790 | combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- | |
791 | LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, | |
792 | is also permitted. If a letter appears both before and after the | |
793 | hyphen, the option is unset. | |
794 | ||
795 | When an option change occurs at top level (that is, not inside subpat- | |
796 | tern parentheses), the change applies to the remainder of the pattern | |
797 | that follows. If the change is placed right at the start of a pattern, | |
798 | PCRE extracts it into the global options (and it will therefore show up | |
799 | in data extracted by the pcre_fullinfo() function). | |
800 | ||
6bf342e1 PH |
801 | An option change within a subpattern (see below for a description of |
802 | subpatterns) affects only that part of the current pattern that follows | |
803 | it, so | |
495ae4b0 PH |
804 | |
805 | (a(?i)b)c | |
806 | ||
807 | matches abc and aBc and no other strings (assuming PCRE_CASELESS is not | |
6bf342e1 PH |
808 | used). By this means, options can be made to have different settings |
809 | in different parts of the pattern. Any changes made in one alternative | |
810 | do carry on into subsequent branches within the same subpattern. For | |
495ae4b0 PH |
811 | example, |
812 | ||
813 | (a(?i)b|c) | |
814 | ||
6bf342e1 PH |
815 | matches "ab", "aB", "c", and "C", even though when matching "C" the |
816 | first branch is abandoned before the option setting. This is because | |
817 | the effects of option settings happen at compile time. There would be | |
495ae4b0 PH |
818 | some very weird behaviour otherwise. |
819 | ||
6bf342e1 PH |
820 | The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
821 | can be changed in the same way as the Perl-compatible options by using | |
aa41d2de | 822 | the characters J, U and X respectively. |
495ae4b0 PH |
823 | |
824 | ||
825 | SUBPATTERNS | |
826 | ||
827 | Subpatterns are delimited by parentheses (round brackets), which can be | |
828 | nested. Turning part of a pattern into a subpattern does two things: | |
829 | ||
830 | 1. It localizes a set of alternatives. For example, the pattern | |
831 | ||
832 | cat(aract|erpillar|) | |
833 | ||
6bf342e1 PH |
834 | matches one of the words "cat", "cataract", or "caterpillar". Without |
835 | the parentheses, it would match "cataract", "erpillar" or an empty | |
495ae4b0 PH |
836 | string. |
837 | ||
6bf342e1 PH |
838 | 2. It sets up the subpattern as a capturing subpattern. This means |
839 | that, when the whole pattern matches, that portion of the subject | |
495ae4b0 | 840 | string that matched the subpattern is passed back to the caller via the |
6bf342e1 PH |
841 | ovector argument of pcre_exec(). Opening parentheses are counted from |
842 | left to right (starting from 1) to obtain numbers for the capturing | |
495ae4b0 PH |
843 | subpatterns. |
844 | ||
6bf342e1 | 845 | For example, if the string "the red king" is matched against the pat- |
495ae4b0 PH |
846 | tern |
847 | ||
848 | the ((red|white) (king|queen)) | |
849 | ||
850 | the captured substrings are "red king", "red", and "king", and are num- | |
851 | bered 1, 2, and 3, respectively. | |
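
The numbering can be confirmed with a small sketch. Python's re module is used here as a stand-in for PCRE. It counts opening parentheses from left to right in the same way, but returns the captures through a match object rather than through an ovector.

```python
import re

pattern = re.compile(r"the ((red|white) (king|queen))")
match = pattern.fullmatch("the red king")

# Groups are numbered by the position of their opening parenthesis.
captured = match.groups()

print(captured)  # ('red king', 'red', 'king')
```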
852 | ||
6bf342e1 PH |
853 | The fact that plain parentheses fulfil two functions is not always |
854 | helpful. There are often times when a grouping subpattern is required | |
855 | without a capturing requirement. If an opening parenthesis is followed | |
856 | by a question mark and a colon, the subpattern does not do any captur- | |
857 | ing, and is not counted when computing the number of any subsequent | |
858 | capturing subpatterns. For example, if the string "the white queen" is | |
495ae4b0 PH |
859 | matched against the pattern |
860 | ||
861 | the ((?:red|white) (king|queen)) | |
862 | ||
863 | the captured substrings are "white queen" and "queen", and are numbered | |
6bf342e1 | 864 | 1 and 2. The maximum number of capturing subpatterns is 65535. |
495ae4b0 | 865 | |
6bf342e1 PH |
866 | As a convenient shorthand, if any option settings are required at the |
867 | start of a non-capturing subpattern, the option letters may appear | |
495ae4b0 PH |
868 | between the "?" and the ":". Thus the two patterns |
869 | ||
870 | (?i:saturday|sunday) | |
871 | (?:(?i)saturday|sunday) | |
872 | ||
873 | match exactly the same set of strings. Because alternative branches are | |
6bf342e1 PH |
874 | tried from left to right, and options are not reset until the end of |
875 | the subpattern is reached, an option setting in one branch does affect | |
876 | subsequent branches, so the above patterns match "SUNDAY" as well as | |
495ae4b0 PH |
877 | "Saturday". |
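
A minimal sketch of non-capturing groups and the option shorthand follows, again using Python's re module as a stand-in. Only the (?i:...) form is shown, because recent Python versions reject an option setting such as (?i) in the middle of a pattern. That is a Python restriction and not a PCRE one.

```python
import re

# (?:...) groups the alternatives without capturing, so only two
# numbered groups remain.
match = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured = match.groups()

# The option letter between "?" and ":" applies to every branch.
days = re.compile(r"(?i:saturday|sunday)")
day_hits = [s for s in ["Saturday", "SUNDAY", "monday"] if days.fullmatch(s)]

print(captured)
print(day_hits)
```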
878 | ||
879 | ||
64f2600a PH |
880 | DUPLICATE SUBPATTERN NUMBERS |
881 | ||
882 | Perl 5.10 introduced a feature whereby each alternative in a subpattern | |
883 | uses the same numbers for its capturing parentheses. Such a subpattern | |
884 | starts with (?| and is itself a non-capturing subpattern. For example, | |
885 | consider this pattern: | |
886 | ||
887 | (?|(Sat)ur|(Sun))day | |
888 | ||
889 | Because the two alternatives are inside a (?| group, both sets of cap- | |
890 | turing parentheses are numbered one. Thus, when the pattern matches, | |
891 | you can look at captured substring number one, whichever alternative | |
892 | matched. This construct is useful when you want to capture part, but | |
893 | not all, of one of a number of alternatives. Inside a (?| group, paren- | |
894 | theses are numbered as usual, but the number is reset at the start of | |
895 | each branch. The numbers of any capturing buffers that follow the sub- | |
896 | pattern start after the highest number used in any branch. The follow- | |
897 | ing example is taken from the Perl documentation. The numbers under- | |
898 | neath show in which buffer the captured content will be stored. | |
899 | ||
900 | # before ---------------branch-reset----------- after | |
901 | / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x | |
902 | # 1 2 2 3 2 3 4 | |
903 | ||
904 | A backreference or a recursive call to a numbered subpattern always | |
905 | refers to the first one in the pattern with the given number. | |
906 | ||
907 | An alternative approach to using this "branch reset" feature is to use | |
908 | duplicate named subpatterns, as described in the next section. | |
909 | ||
910 | ||
495ae4b0 PH |
911 | NAMED SUBPATTERNS |
912 | ||
6bf342e1 PH |
913 | Identifying capturing parentheses by number is simple, but it can be |
914 | very hard to keep track of the numbers in complicated regular expres- | |
915 | sions. Furthermore, if an expression is modified, the numbers may | |
916 | change. To help with this difficulty, PCRE supports the naming of sub- | |
917 | patterns. This feature was not added to Perl until release 5.10. Python | |
918 | had the feature earlier, and PCRE introduced it at release 4.0, using | |
919 | the Python syntax. PCRE now supports both the Perl and the Python syn- | |
920 | tax. | |
921 | ||
922 | In PCRE, a subpattern can be named in one of three ways: (?<name>...) | |
923 | or (?'name'...) as in Perl, or (?P<name>...) as in Python. References | |
924 | to capturing parentheses from other parts of the pattern, such as back- | |
925 | references, recursion, and conditions, can be made by name as well as | |
926 | by number. | |
927 | ||
928 | Names consist of up to 32 alphanumeric characters and underscores. | |
929 | Named capturing parentheses are still allocated numbers as well as | |
930 | names, exactly as if the names were not present. The PCRE API provides | |
931 | function calls for extracting the name-to-number translation table from | |
932 | a compiled pattern. There is also a convenience function for extracting | |
933 | a captured substring by name. | |
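
The relationship between names and numbers can be sketched as follows. The example uses Python's re module and the Python (?P<name>...) syntax. The groupindex attribute is Python's counterpart to a name-to-number table and is not part of the PCRE API. The group names "year" and "month" are made up for the illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date.fullmatch("2007-06")

# A named group can be read by name or by its ordinary number.
by_name = match.group("year")
by_number = match.group(1)

# Name-to-number translation table for the compiled pattern.
name_table = dict(date.groupindex)

print(by_name, by_number, name_table)
```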
aa41d2de PH |
934 | |
935 | By default, a name must be unique within a pattern, but it is possible | |
936 | to relax this constraint by setting the PCRE_DUPNAMES option at compile | |
937 | time. This can be useful for patterns where only one instance of the | |
938 | named parentheses can match. Suppose you want to match the name of a | |
939 | weekday, either as a 3-letter abbreviation or as the full name, and in | |
940 | both cases you want to extract the abbreviation. This pattern (ignoring | |
941 | the line breaks) does the job: | |
942 | ||
6bf342e1 PH |
943 | (?<DN>Mon|Fri|Sun)(?:day)?| |
944 | (?<DN>Tue)(?:sday)?| | |
945 | (?<DN>Wed)(?:nesday)?| | |
946 | (?<DN>Thu)(?:rsday)?| | |
947 | (?<DN>Sat)(?:urday)? | |
aa41d2de PH |
948 | |
949 | There are five capturing substrings, but only one is ever set after a | |
64f2600a PH |
950 | match. (An alternative way of solving this problem is to use a "branch |
951 | reset" subpattern, as described in the previous section.) | |
952 | ||
953 | The convenience function for extracting the data by name returns the | |
954 | substring for the first (and in this example, the only) subpattern of | |
955 | that name that matched. This saves searching to find which numbered | |
956 | subpattern it was. If you make a reference to a non-unique named sub- | |
957 | pattern from elsewhere in the pattern, the one that corresponds to the | |
958 | lowest number is used. For further details of the interfaces for han- | |
959 | dling named subpatterns, see the pcreapi documentation. | |
495ae4b0 PH |
960 | |
961 | ||
962 | REPETITION | |
963 | ||
964 | Repetition is specified by quantifiers, which can follow any of the | |
965 | following items: | |
966 | ||
967 | a literal data character | |
6bf342e1 | 968 | the dot metacharacter |
495ae4b0 PH |
969 | the \C escape sequence |
970 | the \X escape sequence (in UTF-8 mode with Unicode properties) | |
6bf342e1 | 971 | the \R escape sequence |
495ae4b0 PH |
972 | an escape such as \d that matches a single character |
973 | a character class | |
974 | a back reference (see next section) | |
975 | a parenthesized subpattern (unless it is an assertion) | |
976 | ||
977 | The general repetition quantifier specifies a minimum and maximum num- | |
978 | ber of permitted matches, by giving the two numbers in curly brackets | |
979 | (braces), separated by a comma. The numbers must be less than 65536, | |
980 | and the first must be less than or equal to the second. For example: | |
981 | ||
982 | z{2,4} | |
983 | ||
984 | matches "zz", "zzz", or "zzzz". A closing brace on its own is not a | |
985 | special character. If the second number is omitted, but the comma is | |
986 | present, there is no upper limit; if the second number and the comma | |
987 | are both omitted, the quantifier specifies an exact number of required | |
988 | matches. Thus | |
989 | ||
990 | [aeiou]{3,} | |
991 | ||
992 | matches at least 3 successive vowels, but may match many more, while | |
993 | ||
994 | \d{8} | |
995 | ||
996 | matches exactly 8 digits. An opening curly bracket that appears in a | |
997 | position where a quantifier is not allowed, or one that does not match | |
998 | the syntax of a quantifier, is taken as a literal character. For exam- | |
999 | ple, {,6} is not a quantifier, but a literal string of four characters. | |
1000 | ||
1001 | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to | |
1002 | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- | |
1003 | acters, each of which is represented by a two-byte sequence. Similarly, | |
1004 | when Unicode property support is available, \X{3} matches three Unicode | |
1005 | extended sequences, each of which may be several bytes long (and they | |
1006 | may be of different lengths). | |
1007 | ||
1008 | The quantifier {0} is permitted, causing the expression to behave as if | |
1009 | the previous item and the quantifier were not present. | |
1010 | ||
6bf342e1 PH |
1011 | For convenience, the three most common quantifiers have single-charac- |
1012 | ter abbreviations: | |
495ae4b0 PH |
1013 | |
1014 | * is equivalent to {0,} | |
1015 | + is equivalent to {1,} | |
1016 | ? is equivalent to {0,1} | |
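
The equivalences listed above can be spot-checked with a short sketch. It uses Python's re module as a stand-in for PCRE, and the subject strings are arbitrary test data.

```python
import re

pairs = [("a*", "a{0,}"), ("a+", "a{1,}"), ("a?", "a{0,1}")]
subjects = ["", "a", "aa", "aaa", "b"]

# Each abbreviation should accept exactly the same subjects as its long form.
equivalent = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(long_form, s))
    for short, long_form in pairs
    for s in subjects
)

# The general form with both limits given: between two and four "z"s.
z_hits = [s for s in ["z", "zz", "zzz", "zzzz", "zzzzz"] if re.fullmatch(r"z{2,4}", s)]

print(equivalent, z_hits)
```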
1017 | ||
1018 | It is possible to construct infinite loops by following a subpattern | |
1019 | that can match no characters with a quantifier that has no upper limit, | |
1020 | for example: | |
1021 | ||
1022 | (a?)* | |
1023 | ||
1024 | Earlier versions of Perl and PCRE used to give an error at compile time | |
1025 | for such patterns. However, because there are cases where this can be | |
1026 | useful, such patterns are now accepted, but if any repetition of the | |
1027 | subpattern does in fact match no characters, the loop is forcibly bro- | |
1028 | ken. | |
1029 | ||
1030 | By default, the quantifiers are "greedy", that is, they match as much | |
1031 | as possible (up to the maximum number of permitted times), without | |
1032 | causing the rest of the pattern to fail. The classic example of where | |
1033 | this gives problems is in trying to match comments in C programs. These | |
1034 | appear between /* and */ and within the comment, individual * and / | |
1035 | characters may appear. An attempt to match C comments by applying the | |
1036 | pattern | |
1037 | ||
1038 | /\*.*\*/ | |
1039 | ||
1040 | to the string | |
1041 | ||
1042 | /* first comment */ not comment /* second comment */ | |
1043 | ||
1044 | fails, because it matches the entire string owing to the greediness of | |
1045 | the .* item. | |
1046 | ||
1047 | However, if a quantifier is followed by a question mark, it ceases to | |
1048 | be greedy, and instead matches the minimum number of times possible, so | |
1049 | the pattern | |
1050 | ||
1051 | /\*.*?\*/ | |
1052 | ||
1053 | does the right thing with the C comments. The meaning of the various | |
1054 | quantifiers is not otherwise changed, just the preferred number of | |
1055 | matches. Do not confuse this use of question mark with its use as a | |
1056 | quantifier in its own right. Because it has two uses, it can sometimes | |
1057 | appear doubled, as in | |
1058 | ||
1059 | \d??\d | |
1060 | ||
1061 | which matches one digit by preference, but can match two if that is the | |
1062 | only way the rest of the pattern matches. | |
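
The effect of the question mark can be seen in the following sketch. It uses Python's re module, whose greedy and minimizing quantifiers are assumed to behave like PCRE's for these patterns.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole string is one match.
greedy = re.search(r"/\*.*\*/", text).group()

# Minimizing: .*? stops at the first "*/", giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "42").group()
two_digits = re.fullmatch(r"\d??\d", "42").group()

print(greedy == text, lazy, one_digit, two_digits)
```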
1063 | ||
6bf342e1 | 1064 | If the PCRE_UNGREEDY option is set (an option that is not available in |
495ae4b0 PH |
1065 | Perl), the quantifiers are not greedy by default, but individual ones |
1066 | can be made greedy by following them with a question mark. In other | |
1067 | words, it inverts the default behaviour. | |
1068 | ||
1069 | When a parenthesized subpattern is quantified with a minimum repeat | |
1070 | count that is greater than 1 or with a limited maximum, more memory is | |
1071 | required for the compiled pattern, in proportion to the size of the | |
1072 | minimum or maximum. | |
1073 | ||
1074 | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- | |
6bf342e1 PH |
1075 | alent to Perl's /s) is set, thus allowing the dot to match newlines, |
1076 | the pattern is implicitly anchored, because whatever follows will be | |
1077 | tried against every character position in the subject string, so there | |
1078 | is no point in retrying the overall match at any position after the | |
1079 | first. PCRE normally treats such a pattern as though it were preceded | |
1080 | by \A. | |
1081 | ||
1082 | In cases where it is known that the subject string contains no new- | |
1083 | lines, it is worth setting PCRE_DOTALL in order to obtain this opti- | |
495ae4b0 PH |
1084 | mization, or alternatively using ^ to indicate anchoring explicitly. |
1085 | ||
6bf342e1 PH |
1086 | However, there is one situation where the optimization cannot be used. |
1087 | When .* is inside capturing parentheses that are the subject of a | |
1088 | backreference elsewhere in the pattern, a match at the start may fail | |
1089 | where a later one succeeds. Consider, for example: | |
495ae4b0 PH |
1090 | |
1091 | (.*)abc\1 | |
1092 | ||
6bf342e1 | 1093 | If the subject is "xyz123abc123" the match point is the fourth charac- |
495ae4b0 PH |
1094 | ter. For this reason, such a pattern is not implicitly anchored. |
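
The match point can be verified with a short sketch. Python's re module stands in for PCRE here. The backreference behaviour is the same for this pattern, although Python's internal optimizations are not claimed to match those of PCRE.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text after "abc" does
# not repeat the captured prefix; offset 3 (the fourth character) works.
start = match.start()
captured = match.group(1)

print(start, captured)  # 3 123
```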
1095 | ||
1096 | When a capturing subpattern is repeated, the value captured is the sub- | |
1097 | string that matched the final iteration. For example, after | |
1098 | ||
1099 | (tweedle[dume]{3}\s*)+ | |
1100 | ||
1101 | has matched "tweedledum tweedledee" the value of the captured substring | |
6bf342e1 PH |
1102 | is "tweedledee". However, if there are nested capturing subpatterns, |
1103 | the corresponding captured values may have been set in previous itera- | |
495ae4b0 PH |
1104 | tions. For example, after |
1105 | ||
1106 | /(a|(b))+/ | |
1107 | ||
1108 | matches "aba" the value of the second captured substring is "b". | |
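
Both cases can be reproduced with the sketch below. It uses Python's re module as a stand-in, on the assumption that it keeps a nested capture from an earlier iteration in the same way as PCRE.

```python
import re

# The outer group is overwritten on each iteration; the last one wins.
tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = tweedle.group(1)

# The final iteration matches "a", but group 2 still holds the "b"
# that was set in the previous iteration.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last_iteration, outer, inner)
```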
1109 | ||
1110 | ||
1111 | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS | |
1112 | ||
6bf342e1 PH |
1113 | With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
1114 | repetition, failure of what follows normally causes the repeated item | |
1115 | to be re-evaluated to see if a different number of repeats allows the | |
1116 | rest of the pattern to match. Sometimes it is useful to prevent this, | |
1117 | either to change the nature of the match, or to cause it to fail earlier | |
1118 | than it otherwise might, when the author of the pattern knows there is | |
1119 | no point in carrying on. | |
495ae4b0 PH |
1120 | |
1121 | Consider, for example, the pattern \d+foo when applied to the subject | |
1122 | line | |
1123 | ||
1124 | 123456bar | |
1125 | ||
1126 | After matching all 6 digits and then failing to match "foo", the normal | |
1127 | action of the matcher is to try again with only 5 digits matching the | |
1128 | \d+ item, and then with 4, and so on, before ultimately failing. | |
1129 | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides | |
1130 | the means for specifying that once a subpattern has matched, it is not | |
1131 | to be re-evaluated in this way. | |
1132 | ||
6bf342e1 PH |
1133 | If we use atomic grouping for the previous example, the matcher gives |
1134 | up immediately on failing to match "foo" the first time. The notation | |
1135 | is a kind of special parenthesis, starting with (?> as in this example: | |
495ae4b0 PH |
1136 | |
1137 | (?>\d+)foo | |
1138 | ||
1139 | This kind of parenthesis "locks up" the part of the pattern it con- | |
1140 | tains once it has matched, and a failure further into the pattern is | |
1141 | prevented from backtracking into it. Backtracking past it to previous | |
1142 | items, however, works as normal. | |
1143 | ||
1144 | An alternative description is that a subpattern of this type matches | |
1145 | the string of characters that an identical standalone pattern would | |
1146 | match, if anchored at the current point in the subject string. | |
1147 | ||
1148 | Atomic grouping subpatterns are not capturing subpatterns. Simple cases | |
1149 | such as the above example can be thought of as a maximizing repeat that | |
1150 | must swallow everything it can. So, while both \d+ and \d+? are pre- | |
1151 | pared to adjust the number of digits they match in order to make the | |
1152 | rest of the pattern match, (?>\d+) can only match an entire sequence of | |
1153 | digits. | |
1154 | ||
1155 | Atomic groups in general can of course contain arbitrarily complicated | |
1156 | subpatterns, and can be nested. However, when the subpattern for an | |
1157 | atomic group is just a single repeated item, as in the example above, a | |
1158 | simpler notation, called a "possessive quantifier" can be used. This | |
1159 | consists of an additional + character following a quantifier. Using | |
1160 | this notation, the previous example can be rewritten as | |
1161 | ||
1162 | \d++foo | |
1163 | ||
1164 | Possessive quantifiers are always greedy; the setting of the | |
1165 | PCRE_UNGREEDY option is ignored. They are a convenient notation for the | |
1166 | simpler forms of atomic group. However, there is no difference in the | |
6bf342e1 PH |
1167 | meaning of a possessive quantifier and the equivalent atomic group, |
1168 | though there may be a performance difference; possessive quantifiers | |
1169 | should be slightly faster. | |
1170 | ||
1171 | The possessive quantifier syntax is an extension to the Perl 5.8 syn- | |
1172 | tax. Jeffrey Friedl originated the idea (and the name) in the first | |
1173 | edition of his book. Mike McCloskey liked it, so implemented it when he | |
1174 | built Sun's Java package, and PCRE copied it from there. It ultimately | |
1175 | found its way into Perl at release 5.10. | |
1176 | ||
1177 | PCRE has an optimization that automatically "possessifies" certain sim- | |
1178 | ple pattern constructs. For example, the sequence A+B is treated as | |
1179 | A++B because there is no point in backtracking into a sequence of A's | |
1180 | when B must follow. | |
1181 | ||
1182 | When a pattern contains an unlimited repeat inside a subpattern that | |
1183 | can itself be repeated an unlimited number of times, the use of an | |
1184 | atomic group is the only way to avoid some failing matches taking a | |
495ae4b0 PH |
1185 | very long time indeed. The pattern |
1186 | ||
1187 | (\D+|<\d+>)*[!?] | |
1188 | ||
6bf342e1 PH |
1189 | matches an unlimited number of substrings that either consist of non- |
1190 | digits, or digits enclosed in <>, followed by either ! or ?. When it | |
495ae4b0 PH |
1191 | matches, it runs quickly. However, if it is applied to |
1192 | ||
1193 | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | |
1194 | ||
6bf342e1 PH |
1195 | it takes a long time before reporting failure. This is because the |
1196 | string can be divided between the internal \D+ repeat and the external | |
1197 | * repeat in a large number of ways, and all have to be tried. (The | |
1198 | example uses [!?] rather than a single character at the end, because | |
1199 | both PCRE and Perl have an optimization that allows for fast failure | |
1200 | when a single character is used. They remember the last single charac- | |
1201 | ter that is required for a match, and fail early if it is not present | |
1202 | in the string.) If the pattern is changed so that it uses an atomic | |
495ae4b0 PH |
1203 | group, like this: |
1204 | ||
1205 | ((?>\D+)|<\d+>)*[!?] | |
1206 | ||
6bf342e1 | 1207 | sequences of non-digits cannot be broken, and failure happens quickly. |
495ae4b0 PH |
1208 | |
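To get a feel for why the failing match is slow, the following Python sketch counts the ways a run of n characters can be cut into consecutive non-empty pieces. This is roughly the set of divisions between the inner \D+ and the outer * that has to be ruled out. The function name count_splits is illustrative only, and the count is an approximation of the work involved, not a model of PCRE's matching code.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into consecutive non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) way
    # Choose the length of the first piece, then split whatever remains.
    return sum(count_splits(n - first) for first in range(1, n + 1))


# For n >= 1 the count doubles with every extra character: 2 ** (n - 1).
for n in (1, 2, 3, 10, 20):
    assert count_splits(n) == 2 ** (n - 1)
```

The doubling per character is why a subject of only a few dozen non-digits is already enough to make the unguarded pattern impractically slow.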
1209 | ||
1210 | BACK REFERENCES | |
1211 | ||
1212 | Outside a character class, a backslash followed by a digit greater than | |
1213 | 0 (and possibly further digits) is a back reference to a capturing sub- | |
6bf342e1 | 1214 | pattern earlier (that is, to its left) in the pattern, provided there |
495ae4b0 PH |
1215 | have been that many previous capturing left parentheses. |
1216 | ||
1217 | However, if the decimal number following the backslash is less than 10, | |
6bf342e1 PH |
1218 | it is always taken as a back reference, and causes an error only if |
1219 | there are not that many capturing left parentheses in the entire pat- | |
1220 | tern. In other words, the parentheses that are referenced need not be | |
1221 | to the left of the reference for numbers less than 10. A "forward back | |
1222 | reference" of this type can make sense when a repetition is involved | |
1223 | and the subpattern to the right has participated in an earlier itera- | |
aa41d2de | 1224 | tion. |
495ae4b0 | 1225 | |
6bf342e1 PH |
1226 | It is not possible to have a numerical "forward back reference" to a |
1227 | subpattern whose number is 10 or more using this syntax because a | |
1228 | sequence such as \50 is interpreted as a character defined in octal. | |
1229 | See the subsection entitled "Non-printing characters" above for further | |
1230 | details of the handling of digits following a backslash. There is no | |
1231 | such problem when named parentheses are used. A back reference to any | |
1232 | subpattern is possible using named parentheses (see below). | |
1233 | ||
1234 | Another way of avoiding the ambiguity inherent in the use of digits | |
1235 | following a backslash is to use the \g escape sequence, which is a fea- | |
1236 | ture introduced in Perl 5.10. This escape must be followed by a posi- | |
1237 | tive or a negative number, optionally enclosed in braces. These exam- | |
1238 | ples are all identical: | |
1239 | ||
1240 | (ring), \1 | |
1241 | (ring), \g1 | |
1242 | (ring), \g{1} | |
1243 | ||
1244 | A positive number specifies an absolute reference without the ambiguity | |
1245 | that is present in the older syntax. It is also useful when literal | |
1246 | digits follow the reference. A negative number is a relative reference. | |
1247 | Consider this example: | |
1248 | ||
1249 | (abc(def)ghi)\g{-1} | |
1250 | ||
1251 | The sequence \g{-1} is a reference to the most recently started captur- | |
1252 | ing subpattern before \g, that is, it is equivalent to \2. Similarly, | |
1253 | \g{-2} would be equivalent to \1. The use of relative references can be | |
1254 | helpful in long patterns, and also in patterns that are created by | |
1255 | joining together fragments that contain references within themselves. | |
aa41d2de PH |
1256 | |
1257 | A back reference matches whatever actually matched the capturing sub- | |
1258 | pattern in the current subject string, rather than anything matching | |
495ae4b0 PH |
1259 | the subpattern itself (see "Subpatterns as subroutines" below for a way |
1260 | of doing that). So the pattern | |
1261 | ||
1262 | (sens|respons)e and \1ibility | |
1263 | ||
aa41d2de PH |
1264 | matches "sense and sensibility" and "response and responsibility", but |
1265 | not "sense and responsibility". If caseful matching is in force at the | |
1266 | time of the back reference, the case of letters is relevant. For exam- | |
495ae4b0 PH |
1267 | ple, |
1268 | ||
1269 | ((?i)rah)\s+\1 | |
1270 | ||
aa41d2de | 1271 | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
495ae4b0 PH |
1272 | original capturing subpattern is matched caselessly. |
1273 | ||
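The two examples above can be tried with Python's re module, which shares the \1 syntax. This is a sketch rather than a PCRE test: Python does not accept an option setting such as (?i) in the middle of a pattern, so the scoped form (?i:...) is used instead, on the assumption that it limits caseless matching to the group in the same way.

```python
import re

# \1 must repeat the exact text captured by group 1, not re-apply the group.
sensibility = re.compile(r"(sens|respons)e and \1ibility")

# Caseless matching applies inside the group only; the back reference
# outside it is compared casefully against what was captured.
rah = re.compile(r"((?i:rah))\s+\1")


def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```

With these definitions, matches(sensibility, "sense and responsibility") is False, and matches(rah, "RAH rah") is False even though "RAH" and "rah" each satisfy the group on their own.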
64f2600a PH |
1274 | There are several different ways of writing back references to named |
1275 | subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or | |
1276 | \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's | |
1277 | unified back reference syntax, in which \g can be used for both numeric | |
1278 | and named references, is also supported. We could rewrite the above | |
1279 | example in any of the following ways: | |
495ae4b0 | 1280 | |
6bf342e1 | 1281 | (?<p1>(?i)rah)\s+\k<p1> |
64f2600a | 1282 | (?'p1'(?i)rah)\s+\k{p1} |
aa41d2de | 1283 | (?P<p1>(?i)rah)\s+(?P=p1) |
64f2600a | 1284 | (?<p1>(?i)rah)\s+\g{p1} |
aa41d2de | 1285 | |
64f2600a | 1286 | A subpattern that is referenced by name may appear in the pattern |
aa41d2de | 1287 | before or after the reference. |
495ae4b0 | 1288 | |
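The Python form of the named reference can be checked directly in Python's re module. The sketch below again uses (?i:...) in place of the embedded (?i), which Python does not allow in mid-pattern. Python's re module does not accept the other spellings (\k<name>, \k{name}, \g{name}) inside a pattern, so only the (?P=name) variant is shown.

```python
import re

# (?P<p1>...) names the group; (?P=p1) refers back to whatever it captured.
named_rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def captured(subject):
    """Return the text captured by group p1, or None if there is no match."""
    m = named_rah.fullmatch(subject)
    return None if m is None else m.group("p1")
```

Here captured("Rah Rah") returns "Rah", while captured("RAH rah") returns None because the reference is compared casefully.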
64f2600a PH |
1289 | There may be more than one back reference to the same subpattern. If a |
1290 | subpattern has not actually been used in a particular match, any back | |
495ae4b0 PH |
1291 | references to it always fail. For example, the pattern |
1292 | ||
1293 | (a|(bc))\2 | |
1294 | ||
64f2600a PH |
1295 | always fails if it starts to match "a" rather than "bc". Because there |
1296 | may be many capturing parentheses in a pattern, all digits following | |
1297 | the backslash are taken as part of a potential back reference number. | |
495ae4b0 | 1298 | If the pattern continues with a digit character, some delimiter must be |
64f2600a PH |
1299 | used to terminate the back reference. If the PCRE_EXTENDED option is |
1300 | set, this can be whitespace. Otherwise an empty comment (see "Com- | |
495ae4b0 PH |
1301 | ments" below) can be used. |
1302 | ||
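The behaviour of a reference to a group that took no part in the match can be observed with Python's re module, which treats this case the same way. The helper name try_match is illustrative only.

```python
import re

# Group 2 is set only when the "bc" alternative is taken.
pattern = re.compile(r"(a|(bc))\2")


def try_match(subject):
    """Return the captured groups if the whole subject matches, else None."""
    m = pattern.fullmatch(subject)
    return None if m is None else m.groups()
```

For "bcbc" both groups hold "bc" and the reference succeeds. For "aa" the first alternative is taken, group 2 is unset, and \2 fails.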
64f2600a PH |
1303 | A back reference that occurs inside the parentheses to which it refers |
1304 | fails when the subpattern is first used, so, for example, (a\1) never | |
1305 | matches. However, such references can be useful inside repeated sub- | |
495ae4b0 PH |
1306 | patterns. For example, the pattern |
1307 | ||
1308 | (a|b\1)+ | |
1309 | ||
1310 | matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- | |
64f2600a PH |
1311 | ation of the subpattern, the back reference matches the character |
1312 | string corresponding to the previous iteration. In order for this to | |
1313 | work, the pattern must be such that the first iteration does not need | |
1314 | to match the back reference. This can be done using alternation, as in | |
495ae4b0 PH |
1315 | the example above, or by a quantifier with a minimum of zero. |
1316 | ||
1317 | ||
1318 | ASSERTIONS | |
1319 | ||
64f2600a PH |
1320 | An assertion is a test on the characters following or preceding the |
1321 | current matching point that does not actually consume any characters. | |
1322 | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are | |
495ae4b0 PH |
1323 | described above. |
1324 | ||
64f2600a PH |
1325 | More complicated assertions are coded as subpatterns. There are two |
1326 | kinds: those that look ahead of the current position in the subject | |
1327 | string, and those that look behind it. An assertion subpattern is | |
1328 | matched in the normal way, except that it does not cause the current | |
495ae4b0 PH |
1329 | matching position to be changed. |
1330 | ||
64f2600a PH |
1331 | Assertion subpatterns are not capturing subpatterns, and may not be |
1332 | repeated, because it makes no sense to assert the same thing several | |
1333 | times. If any kind of assertion contains capturing subpatterns within | |
1334 | it, these are counted for the purposes of numbering the capturing sub- | |
495ae4b0 | 1335 | patterns in the whole pattern. However, substring capturing is carried |
64f2600a | 1336 | out only for positive assertions, because it does not make sense for |
495ae4b0 PH |
1337 | negative assertions. |
1338 | ||
1339 | Lookahead assertions | |
1340 | ||
1341 | Lookahead assertions start with (?= for positive assertions and (?! for | |
1342 | negative assertions. For example, | |
1343 | ||
1344 | \w+(?=;) | |
1345 | ||
64f2600a | 1346 | matches a word followed by a semicolon, but does not include the semi- |
495ae4b0 PH |
1347 | colon in the match, and |
1348 | ||
1349 | foo(?!bar) | |
1350 | ||
64f2600a | 1351 | matches any occurrence of "foo" that is not followed by "bar". Note |
495ae4b0 PH |
1352 | that the apparently similar pattern |
1353 | ||
1354 | (?!foo)bar | |
1355 | ||
64f2600a PH |
1356 | does not find an occurrence of "bar" that is preceded by something |
1357 | other than "foo"; it finds any occurrence of "bar" whatsoever, because | |
495ae4b0 PH |
1358 | the assertion (?!foo) is always true when the next three characters are |
1359 | "bar". A lookbehind assertion is needed to achieve the other effect. | |
1360 | ||
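The three lookahead patterns above behave the same way in Python's re module, so they can be tried out there. The subject strings are made up for illustration.

```python
import re

# The semicolon is required by the assertion but is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Only the "foo" that is not followed by "bar" is found.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" begins, so it is always true there.
bar_match = re.search(r"(?!foo)bar", "foobar")
```

The last search still finds "bar" inside "foobar", which is the pitfall described above: the assertion looks forward from the start of "bar", not backward.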
1361 | If you want to force a matching failure at some point in a pattern, the | |
64f2600a PH |
1362 | most convenient way to do it is with (?!) because an empty string |
1363 | always matches, so an assertion that requires there not to be an empty | |
495ae4b0 PH |
1364 | string must always fail. |
1365 | ||
1366 | Lookbehind assertions | |
1367 | ||
64f2600a | 1368 | Lookbehind assertions start with (?<= for positive assertions and (?<! |
495ae4b0 PH |
1369 | for negative assertions. For example, |
1370 | ||
1371 | (?<!foo)bar | |
1372 | ||
64f2600a PH |
1373 | does find an occurrence of "bar" that is not preceded by "foo". The |
1374 | contents of a lookbehind assertion are restricted such that all the | |
495ae4b0 | 1375 | strings it matches must have a fixed length. However, if there are sev- |
64f2600a | 1376 | eral top-level alternatives, they do not all have to have the same |
aa41d2de | 1377 | fixed length. Thus |
495ae4b0 PH |
1378 | |
1379 | (?<=bullock|donkey) | |
1380 | ||
1381 | is permitted, but | |
1382 | ||
1383 | (?<!dogs?|cats?) | |
1384 | ||
64f2600a PH |
1385 | causes an error at compile time. Branches that match different length |
1386 | strings are permitted only at the top level of a lookbehind assertion. | |
1387 | This is an extension compared with Perl (at least for 5.8), which | |
1388 | requires all branches to match the same length of string. An assertion | |
495ae4b0 PH |
1389 | such as |
1390 | ||
1391 | (?<=ab(c|de)) | |
1392 | ||
64f2600a PH |
1393 | is not permitted, because its single top-level branch can match two |
1394 | different lengths, but it is acceptable if rewritten to use two top- | |
495ae4b0 PH |
1395 | level branches: |
1396 | ||
1397 | (?<=abc|abde) | |
1398 | ||
64f2600a PH |
1399 | In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
1400 | instead of a lookbehind assertion; this is not restricted to a fixed | |
1401 | length. | |
1402 | ||
1403 | The implementation of lookbehind assertions is, for each alternative, | |
1404 | to temporarily move the current position back by the fixed length and | |
495ae4b0 | 1405 | then try to match. If there are insufficient characters before the cur- |
6bf342e1 | 1406 | rent position, the assertion fails. |
495ae4b0 PH |
1407 | |
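Fixed-length lookbehind can be tried in Python's re module. Note one assumption: Python is stricter than PCRE and requires a single overall width for the whole assertion, so the top-level alternatives of different lengths that PCRE permits are not shown here. The helper name compiles is illustrative only.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")
after_foo = re.compile(r"(?<=foo)bar")


def compiles(pattern):
    """Return True if the pattern is accepted at compile time."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

When fewer than three characters precede "bar", the positive form fails, and the negative form therefore succeeds. A variable-length branch such as dogs? is rejected at compile time, as described above.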
1408 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 | |
64f2600a PH |
1409 | mode) to appear in lookbehind assertions, because it makes it impossi- |
1410 | ble to calculate the length of the lookbehind. The \X and \R escapes, | |
6bf342e1 | 1411 | which can match different numbers of bytes, are also not permitted. |
495ae4b0 | 1412 | |
64f2600a PH |
1413 | Possessive quantifiers can be used in conjunction with lookbehind |
1414 | assertions to specify efficient matching at the end of the subject | |
6bf342e1 | 1415 | string. Consider a simple pattern such as |
495ae4b0 PH |
1416 | |
1417 | abcd$ | |
1418 | ||
64f2600a | 1419 | when applied to a long string that does not match. Because matching |
495ae4b0 | 1420 | proceeds from left to right, PCRE will look for each "a" in the subject |
64f2600a | 1421 | and then see if what follows matches the rest of the pattern. If the |
495ae4b0 PH |
1422 | pattern is specified as |
1423 | ||
1424 | ^.*abcd$ | |
1425 | ||
64f2600a | 1426 | the initial .* matches the entire string at first, but when this fails |
495ae4b0 | 1427 | (because there is no following "a"), it backtracks to match all but the |
64f2600a PH |
1428 | last character, then all but the last two characters, and so on. Once |
1429 | again the search for "a" covers the entire string, from right to left, | |
495ae4b0 PH |
1430 | so we are no better off. However, if the pattern is written as |
1431 | ||
495ae4b0 PH |
1432 | ^.*+(?<=abcd) |
1433 | ||
64f2600a PH |
1434 | there can be no backtracking for the .*+ item; it can match only the |
1435 | entire string. The subsequent lookbehind assertion does a single test | |
1436 | on the last four characters. If it fails, the match fails immediately. | |
1437 | For long strings, this approach makes a significant difference to the | |
495ae4b0 PH |
1438 | processing time. |
1439 | ||
1440 | Using multiple assertions | |
1441 | ||
1442 | Several assertions (of any sort) may occur in succession. For example, | |
1443 | ||
1444 | (?<=\d{3})(?<!999)foo | |
1445 | ||
64f2600a PH |
1446 | matches "foo" preceded by three digits that are not "999". Notice that |
1447 | each of the assertions is applied independently at the same point in | |
1448 | the subject string. First there is a check that the previous three | |
1449 | characters are all digits, and then there is a check that the same | |
495ae4b0 | 1450 | three characters are not "999". This pattern does not match "foo" pre- |
64f2600a PH |
1451 | ceded by six characters, the first three of which are digits and the last | |
1452 | three of which are not "999". For example, it doesn't match "123abc- | |
495ae4b0 PH |
1453 | foo". A pattern to do that is |
1454 | ||
1455 | (?<=\d{3}...)(?<!999)foo | |
1456 | ||
64f2600a | 1457 | This time the first assertion looks at the preceding six characters, |
495ae4b0 PH |
1458 | checking that the first three are digits, and then the second assertion |
1459 | checks that the preceding three characters are not "999". | |
1460 | ||
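The difference between the two patterns can be checked with Python's re module, which accepts both because each lookbehind has a fixed length. The variable names are illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion inspects six characters, the second the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the pattern matches somewhere in the subject."""
    return pattern.search(subject) is not None
```

So "123foo" satisfies the first pattern but "999foo" and "123abcfoo" do not, while "123abcfoo" satisfies the second pattern and "123999foo" does not.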
1461 | Assertions can be nested in any combination. For example, | |
1462 | ||
1463 | (?<=(?<!foo)bar)baz | |
1464 | ||
64f2600a | 1465 | matches an occurrence of "baz" that is preceded by "bar" which in turn |
495ae4b0 PH |
1466 | is not preceded by "foo", while |
1467 | ||
1468 | (?<=\d{3}(?!999)...)foo | |
1469 | ||
64f2600a | 1470 | is another pattern that matches "foo" preceded by three digits and any |
495ae4b0 PH |
1471 | three characters that are not "999". |
1472 | ||
1473 | ||
1474 | CONDITIONAL SUBPATTERNS | |
1475 | ||
64f2600a PH |
1476 | It is possible to cause the matching process to obey a subpattern con- |
1477 | ditionally or to choose between two alternative subpatterns, depending | |
1478 | on the result of an assertion, or whether a previous capturing subpat- | |
1479 | tern matched or not. The two possible forms of conditional subpattern | |
495ae4b0 PH |
1480 | are |
1481 | ||
1482 | (?(condition)yes-pattern) | |
1483 | (?(condition)yes-pattern|no-pattern) | |
1484 | ||
64f2600a PH |
1485 | If the condition is satisfied, the yes-pattern is used; otherwise the |
1486 | no-pattern (if present) is used. If there are more than two alterna- | |
495ae4b0 PH |
1487 | tives in the subpattern, a compile-time error occurs. |
1488 | ||
64f2600a | 1489 | There are four kinds of condition: references to subpatterns, refer- |
6bf342e1 PH |
1490 | ences to recursion, a pseudo-condition called DEFINE, and assertions. |
1491 | ||
1492 | Checking for a used subpattern by number | |
1493 | ||
64f2600a PH |
1494 | If the text between the parentheses consists of a sequence of digits, |
1495 | the condition is true if the capturing subpattern of that number has | |
1496 | previously matched. An alternative notation is to precede the digits | |
1497 | with a plus or minus sign. In this case, the subpattern number is rela- | |
1498 | tive rather than absolute. The most recently opened parentheses can be | |
1499 | referenced by (?(-1), the next most recent by (?(-2), and so on. In | |
1500 | looping constructs it can also make sense to refer to subsequent groups | |
1501 | with constructs such as (?(+2). | |
aa41d2de PH |
1502 | |
1503 | Consider the following pattern, which contains non-significant white | |
1504 | space to make it more readable (assume the PCRE_EXTENDED option) and to | |
1505 | divide it into three parts for ease of discussion: | |
495ae4b0 PH |
1506 | |
1507 | ( \( )? [^()]+ (?(1) \) ) | |
1508 | ||
1509 | The first part matches an optional opening parenthesis, and if that | |
1510 | character is present, sets it as the first captured substring. The sec- | |
1511 | ond part matches one or more characters that are not parentheses. The | |
1512 | third part is a conditional subpattern that tests whether the first set | |
1513 | of parentheses matched or not. If they did, that is, if the subject started | |
1514 | with an opening parenthesis, the condition is true, and so the yes-pat- | |
1515 | tern is executed and a closing parenthesis is required. Otherwise, | |
1516 | since no-pattern is not present, the subpattern matches nothing. In | |
1517 | other words, this pattern matches a sequence of non-parentheses, | |
6bf342e1 PH |
1518 | optionally enclosed in parentheses. |
1519 | ||
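Python's re module supports the same numbered condition, so the pattern can be tried there with re.VERBOSE standing in for PCRE_EXTENDED. The helper name accepts is illustrative only.

```python
import re

# Group 1 is the optional "("; the condition (?(1) ...) demands the ")"
# only when group 1 took part in the match.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None
```

Balanced input such as "(abc)" and bare input such as "abc" are accepted, while "(abc" and "abc)" are rejected when the whole subject has to match.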
64f2600a PH |
1520 | If you were embedding this pattern in a larger one, you could use a |
1521 | relative reference: | |
1522 | ||
1523 | ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... | |
1524 | ||
1525 | This makes the fragment independent of the parentheses in the larger | |
1526 | pattern. | |
1527 | ||
6bf342e1 PH |
1528 | Checking for a used subpattern by name |
1529 | ||
1530 | Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a | |
1531 | used subpattern by name. For compatibility with earlier versions of | |
1532 | PCRE, which had this facility before Perl, the syntax (?(name)...) is | |
1533 | also recognized. However, there is a possible ambiguity with this syn- | |
1534 | tax, because subpattern names may consist entirely of digits. PCRE | |
1535 | looks first for a named subpattern; if it cannot find one and the name | |
1536 | consists entirely of digits, PCRE looks for a subpattern of that num- | |
1537 | ber, which must be greater than zero. Using subpattern names that con- | |
1538 | sist entirely of digits is not recommended. | |
1539 | ||
1540 | Rewriting the above example to use a named subpattern gives this: | |
aa41d2de | 1541 | |
6bf342e1 PH |
1542 | (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
1543 | ||
1544 | ||
1545 | Checking for pattern recursion | |
495ae4b0 | 1546 | |
aa41d2de | 1547 | If the condition is the string (R), and there is no subpattern with the |
6bf342e1 PH |
1548 | name R, the condition is true if a recursive call to the whole pattern |
1549 | or any subpattern has been made. If digits or a name preceded by amper- | |
1550 | sand follow the letter R, for example: | |
1551 | ||
1552 | (?(R3)...) or (?(R&name)...) | |
1553 | ||
1554 | the condition is true if the most recent recursion is into the subpat- | |
1555 | tern whose number or name is given. This condition does not check the | |
1556 | entire recursion stack. | |
1557 | ||
1558 | At "top level", all these recursion test conditions are false. Recur- | |
1559 | sive patterns are described below. | |
1560 | ||
1561 | Defining subpatterns for use by reference only | |
1562 | ||
1563 | If the condition is the string (DEFINE), and there is no subpattern | |
1564 | with the name DEFINE, the condition is always false. In this case, | |
1565 | there may be only one alternative in the subpattern. It is always | |
1566 | skipped if control reaches this point in the pattern; the idea of | |
1567 | DEFINE is that it can be used to define "subroutines" that can be ref- | |
1568 | erenced from elsewhere. (The use of "subroutines" is described below.) | |
1569 | For example, a pattern to match an IPv4 address could be written like | |
1570 | this (ignore whitespace and line breaks): | |
1571 | ||
1572 | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | |
1573 | \b (?&byte) (\.(?&byte)){3} \b | |
1574 | ||
1575 | The first part of the pattern is a DEFINE group inside which another | |
1576 | group named "byte" is defined. This matches an individual component of | |
1577 | an IPv4 address (a number less than 256). When matching takes place, | |
1578 | this part of the pattern is skipped because DEFINE acts like a false | |
1579 | condition. | |
1580 | ||
1581 | The rest of the pattern uses references to the named group to match the | |
1582 | four dot-separated components of an IPv4 address, insisting on a word | |
1583 | boundary at each end. | |
495ae4b0 | 1584 | |
6bf342e1 PH |
1585 | Assertion conditions |
1586 | ||
1587 | If the condition is not in any of the above formats, it must be an | |
495ae4b0 PH |
1588 | assertion. This may be a positive or negative lookahead or lookbehind |
1589 | assertion. Consider this pattern, again containing non-significant | |
1590 | white space, and with the two alternatives on the second line: | |
1591 | ||
1592 | (?(?=[^a-z]*[a-z]) | |
1593 | \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) | |
1594 | ||
1595 | The condition is a positive lookahead assertion that matches an | |
1596 | optional sequence of non-letters followed by a letter. In other words, | |
1597 | it tests for the presence of at least one letter in the subject. If a | |
1598 | letter is found, the subject is matched against the first alternative; | |
1599 | otherwise it is matched against the second. This pattern matches | |
1600 | strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are | |
1601 | letters and dd are digits. | |
1602 | ||
1603 | ||
1604 | COMMENTS | |
1605 | ||
1606 | The sequence (?# marks the start of a comment that continues up to the | |
1607 | next closing parenthesis. Nested parentheses are not permitted. The | |
1608 | characters that make up a comment play no part in the pattern matching | |
1609 | at all. | |
1610 | ||
1611 | If the PCRE_EXTENDED option is set, an unescaped # character outside a | |
aa41d2de PH |
1612 | character class introduces a comment that continues to immediately |
1613 | after the next newline in the pattern. | |
495ae4b0 PH |
1614 | |
1615 | ||
1616 | RECURSIVE PATTERNS | |
1617 | ||
1618 | Consider the problem of matching a string in parentheses, allowing for | |
1619 | unlimited nested parentheses. Without the use of recursion, the best | |
1620 | that can be done is to use a pattern that matches up to some fixed | |
1621 | depth of nesting. It is not possible to handle an arbitrary nesting | |
6bf342e1 PH |
1622 | depth. |
1623 | ||
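The fixed-depth approach can be sketched in Python, whose re module has no recursion. The function below builds the pattern by wrapping one level around the previous one. The name nested_parens_pattern and the choice of Python are assumptions made for illustration.

```python
import re


def nested_parens_pattern(depth):
    """Build a pattern for parenthesized text nested at most `depth` levels deep."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Innermost level: parentheses that contain no further parentheses.
    pattern = r"\([^()]*\)"
    for _ in range(depth - 1):
        # Wrap one more level around what has been built so far.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return re.compile(pattern)


two_levels = nested_parens_pattern(2)
three_levels = nested_parens_pattern(3)
```

The pattern for two levels accepts "(ab(cd)ef)" but rejects "(a(b(c)))", which needs three. However large the depth is made, some deeper subject is always out of reach, which is the limitation that recursion removes.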
1624 | For some time, Perl has provided a facility that allows regular expres- | |
1625 | sions to recurse (amongst other things). It does this by interpolating | |
1626 | Perl code in the expression at run time, and the code can refer to the | |
1627 | expression itself. A Perl pattern using code interpolation to solve the | |
1628 | parentheses problem can be created like this: | |
495ae4b0 PH |
1629 | |
1630 | $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; | |
1631 | ||
1632 | The (?p{...}) item interpolates Perl code at run time, and in this case | |
6bf342e1 PH |
1633 | refers recursively to the pattern in which it appears. |
1634 | ||
1635 | Obviously, PCRE cannot support the interpolation of Perl code. Instead, | |
1636 | it supports special syntax for recursion of the entire pattern, and | |
1637 | also for individual subpattern recursion. After its introduction in | |
1638 | PCRE and Python, this kind of recursion was introduced into Perl at | |
1639 | release 5.10. | |
495ae4b0 | 1640 | |
6bf342e1 | 1641 | A special item that consists of (? followed by a number greater than |
495ae4b0 | 1642 | zero and a closing parenthesis is a recursive call of the subpattern of |
6bf342e1 PH |
1643 | the given number, provided that it occurs inside that subpattern. (If |
1644 | not, it is a "subroutine" call, which is described in the next sec- | |
1645 | tion.) The special item (?R) or (?0) is a recursive call of the entire | |
1646 | regular expression. | |
495ae4b0 | 1647 | |
6bf342e1 PH |
1648 | In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
1649 | always treated as an atomic group. That is, once it has matched some of | |
1650 | the subject string, it is never re-entered, even if it contains untried | |
1651 | alternatives and there is a subsequent matching failure. | |
aa41d2de | 1652 | |
6bf342e1 | 1653 | This PCRE pattern solves the nested parentheses problem (assume the |
aa41d2de | 1654 | PCRE_EXTENDED option is set so that white space is ignored): |
495ae4b0 PH |
1655 | |
1656 | \( ( (?>[^()]+) | (?R) )* \) | |
1657 | ||
6bf342e1 PH |
1658 | First it matches an opening parenthesis. Then it matches any number of |
1659 | substrings which can either be a sequence of non-parentheses, or a | |
1660 | recursive match of the pattern itself (that is, a correctly parenthe- | |
495ae4b0 PH |
1661 | sized substring). Finally there is a closing parenthesis. |
1662 | ||
6bf342e1 | 1663 | If this were part of a larger pattern, you would not want to recurse |
495ae4b0 PH |
1664 | the entire pattern, so instead you could use this: |
1665 | ||
1666 | ( \( ( (?>[^()]+) | (?1) )* \) ) | |
1667 | ||
6bf342e1 | 1668 | We have put the pattern into parentheses, and caused the recursion to |
64f2600a PH |
1669 | refer to them instead of the whole pattern. |
1670 | ||
1671 | In a larger pattern, keeping track of parenthesis numbers can be | |
1672 | tricky. This is made easier by the use of relative references. (A Perl | |
1673 | 5.10 feature.) Instead of (?1) in the pattern above you can write | |
1674 | (?-2) to refer to the second most recently opened parentheses preceding | |
1675 | the recursion. In other words, a negative number counts capturing | |
1676 | parentheses leftwards from the point at which it is encountered. | |
1677 | ||
1678 | It is also possible to refer to subsequently opened parentheses, by | |
1679 | writing references such as (?+2). However, these cannot be recursive | |
1680 | because the reference is not inside the parentheses that are refer- | |
1681 | enced. They are always "subroutine" calls, as described in the next | |
1682 | section. | |
1683 | ||
1684 | An alternative approach is to use named parentheses instead. The Perl | |
1685 | syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also | |
1686 | supported. We could rewrite the above example as follows: | |
495ae4b0 | 1687 | |
6bf342e1 | 1688 | (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
495ae4b0 | 1689 | |
64f2600a PH |
1690 | If there is more than one subpattern with the same name, the earliest |
1691 | one is used. | |
1692 | ||
1693 | This particular example pattern that we have been looking at contains | |
1694 | nested unlimited repeats, and so the use of atomic grouping for match- | |
1695 | ing strings of non-parentheses is important when applying the pattern | |
1696 | to strings that do not match. For example, when this pattern is applied | |
1697 | to | |
495ae4b0 PH |
1698 | |
1699 | (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() | |
1700 | ||
6bf342e1 PH |
1701 | it yields "no match" quickly. However, if atomic grouping is not used, |
1702 | the match runs for a very long time indeed because there are so many | |
1703 | different ways the + and * repeats can carve up the subject, and all | |
495ae4b0 PH |
1704 | have to be tested before failure can be reported. |
1705 | ||
1706 | At the end of a match, the values set for any capturing subpatterns are | |
1707 | those from the outermost level of the recursion at which the subpattern | |
6bf342e1 PH |
1708 | value is set. If you want to obtain intermediate values, a callout |
1709 | function can be used (see below and the pcrecallout documentation). If | |
1710 | the pattern above is matched against | |
495ae4b0 PH |
1711 | |
1712 | (ab(cd)ef) | |
1713 | ||
6bf342e1 PH |
1714 | the value for the capturing parentheses is "ef", which is the last |
1715 | value taken on at the top level. If additional parentheses are added, | |
495ae4b0 PH |
1716 | giving |
1717 | ||
1718 | \( ( ( (?>[^()]+) | (?R) )* ) \) | |
1719 | ^ ^ | |
1720 | ^ ^ | |
1721 | ||
6bf342e1 PH |
1722 | the string they capture is "ab(cd)ef", the contents of the top level |
1723 | parentheses. If there are more than 15 capturing parentheses in a pat- | |
495ae4b0 | 1724 | tern, PCRE has to obtain extra memory to store data during a recursion, |
6bf342e1 PH |
1725 | which it does by using pcre_malloc, freeing it via pcre_free after- |
1726 | wards. If no memory can be obtained, the match fails with the | |
495ae4b0 PH |
1727 | PCRE_ERROR_NOMEMORY error. |
1728 | ||
6bf342e1 PH |
1729 | Do not confuse the (?R) item with the condition (R), which tests for |
1730 | recursion. Consider this pattern, which matches text in angle brack- | |
1731 | ets, allowing for arbitrary nesting. Only digits are allowed in nested | |
1732 | brackets (that is, when recursing), whereas any characters are permit- | |
495ae4b0 PH |
1733 | ted at the outer level. |
1734 | ||
1735 | < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > | |
1736 | ||
6bf342e1 PH |
1737 | In this pattern, (?(R) is the start of a conditional subpattern, with |
1738 | two different alternatives for the recursive and non-recursive cases. | |
495ae4b0 PH |
1739 | The (?R) item is the actual recursive call. |
1740 | ||
1741 | ||
1742 | SUBPATTERNS AS SUBROUTINES | |
1743 | ||
1744 | If the syntax for a recursive subpattern reference (either by number or | |
6bf342e1 PH |
1745 | by name) is used outside the parentheses to which it refers, it oper- |
1746 | ates like a subroutine in a programming language. The "called" subpat- | |
64f2600a PH |
1747 | tern may be defined before or after the reference. A numbered reference |
1748 | can be absolute or relative, as in these examples: | |
1749 | ||
1750 | (...(absolute)...)...(?2)... | |
1751 | (...(relative)...)...(?-1)... | |
1752 | (...(?+1)...(relative)... | |
1753 | ||
1754 | An earlier example pointed out that the pattern | |
495ae4b0 PH |
1755 | |
1756 | (sens|respons)e and \1ibility | |
1757 | ||
1758 | matches "sense and sensibility" and "response and responsibility", but | |
1759 | not "sense and responsibility". If instead the pattern | |
1760 | ||
1761 | (sens|respons)e and (?1)ibility | |
1762 | ||
1763 | is used, it does match "sense and responsibility" as well as the other | |
6bf342e1 PH |
1764 | two strings. Another example is given in the discussion of DEFINE |
1765 | above. | |
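The difference between a back reference and a subroutine call can be approximated in Python's standard re module, which supports \1 but has no (?1) syntax. In the sketch below, the subroutine call is imitated by repeating the source text of the subpattern; the variable names are hypothetical, and this stand-in does not reproduce the atomic behaviour of a real PCRE subroutine call.

```python
import re

# Back reference: \1 must match the same text that group 1 matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the subpattern itself is tried again, so it may
# match different text from the first time. Python's re module has no
# (?1), so the group's source is simply repeated (an assumption made
# for illustration only).
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

for s in subjects:
    print(s, bool(backref.fullmatch(s)), bool(subroutine_like.fullmatch(s)))
```

Only the third subject separates the two forms: the back reference rejects it, while the repeated subpattern accepts it.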
aa41d2de PH |
1766 | |
1767 | Like recursive subpatterns, a "subroutine" call is always treated as an | |
6bf342e1 PH |
1768 | atomic group. That is, once it has matched some of the subject string, |
1769 | it is never re-entered, even if it contains untried alternatives and | |
aa41d2de | 1770 | there is a subsequent matching failure. |
495ae4b0 | 1771 | |
6bf342e1 PH |
1772 | When a subpattern is used as a subroutine, processing options such as |
1773 | case-independence are fixed when the subpattern is defined. They cannot | |
1774 | be changed for different calls. For example, consider this pattern: | |
1775 | ||
64f2600a | 1776 | (abc)(?i:(?-1)) |
6bf342e1 PH |
1777 | |
1778 | It matches "abcabc". It does not match "abcABC" because the change of | |
1779 | processing option does not affect the called subpattern. | |
1780 | ||
495ae4b0 PH |
1781 | |
1782 | CALLOUTS | |
1783 | ||
1784 | Perl has a feature whereby using the sequence (?{...}) causes arbitrary | |
1785 | Perl code to be obeyed in the middle of matching a regular expression. | |
1786 | This makes it possible, amongst other things, to extract different sub- | |
1787 | strings that match the same pair of parentheses when there is a repeti- | |
1788 | tion. | |
1789 | ||
1790 | PCRE provides a similar feature, but of course it cannot obey arbitrary | |
1791 | Perl code. The feature is called "callout". The caller of PCRE provides | |
1792 | an external function by putting its entry point in the global variable | |
1793 | pcre_callout. By default, this variable contains NULL, which disables | |
1794 | all calling out. | |
1795 | ||
1796 | Within a regular expression, (?C) indicates the points at which the | |
1797 | external function is to be called. If you want to identify different | |
1798 | callout points, you can put a number less than 256 after the letter C. | |
1799 | The default value is zero. For example, this pattern has two callout | |
1800 | points: | |
1801 | ||
1802 | (?C1)abc(?C2)def | |
1803 | ||
1804 | If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are | |
1805 | automatically installed before each item in the pattern. They are all | |
1806 | numbered 255. | |
1807 | ||
1808 | During matching, when PCRE reaches a callout point (and pcre_callout is | |
1809 | set), the external function is called. It is provided with the number | |
1810 | of the callout, the position in the pattern, and, optionally, one item | |
1811 | of data originally supplied by the caller of pcre_exec(). The callout | |
1812 | function may cause matching to proceed, to backtrack, or to fail alto- | |
1813 | gether. A complete description of the interface to the callout function | |
1814 | is given in the pcrecallout documentation. | |
1815 | ||
6bf342e1 PH |
1816 | |
1817 | SEE ALSO | |
1818 | ||
1819 | pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). | |
1820 | ||
64f2600a PH |
1821 | |
1822 | AUTHOR | |
1823 | ||
1824 | Philip Hazel | |
1825 | University Computing Service | |
1826 | Cambridge CB2 3QH, England. | |
1827 | ||
1828 | ||
1829 | REVISION | |
1830 | ||
1831 | Last updated: 19 June 2007 | |
1832 | Copyright (c) 1997-2007 University of Cambridge. |