This file contains the PCRE man page that describes the regular expressions
supported by PCRE version 5.0. Note that not all of the features are relevant
in the context of Exim. In particular, the version of PCRE that is compiled
with Exim does not include UTF-8 support, there is no mechanism for changing
the options with which the PCRE functions are called, and features such as
callout are not accessible.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions supported by PCRE
       are described below. Regular expressions are also described in the Perl
       documentation and in a number of books, some of which have copious
       examples. Jeffrey Friedl's "Mastering Regular Expressions", published
       by O'Reilly, covers regular expressions in great detail. This descrip-
       tion of PCRE's regular expressions is intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 character strings. To use
       this, you must build PCRE to include UTF-8 support, and then call
       pcre_compile() with the PCRE_UTF8 option. How this affects pattern
       matching is mentioned in several places below. There is also a summary
       of UTF-8 features in the section on UTF-8 support in the main pcre
       page.

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. The
       power of regular expressions comes from the ability to include alterna-
       tives and repetitions in the pattern. These are encoded in the pattern
       by the use of metacharacters, which do not stand for themselves but
       instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are recog-
       nized anywhere in the pattern except within square brackets, and those
       that are recognized in square brackets. Outside square brackets, the
       metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphanumeric character, it takes away any special meaning that
       character may have. This use of backslash as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a back-
       slash, you write \\.

       If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline character are ignored.
       An escaping backslash can be used to include a whitespace or # charac-
       ter as part of the pattern.

       If you want to remove the special meaning from a sequence of charac-
       ters, you can do so by putting them between \Q and \E. This is differ-
       ent from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing char-
       acters in patterns in a visible manner. There is no restriction on the
       appearance of non-printing characters, apart from the binary zero that
       terminates a pattern, but when a pattern is being prepared by text
       editing, it is usually easier to use one of the following escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        newline (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh... (UTF-8 mode only)

       The precise effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40) is
       inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
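
       The \cx rule can be checked with a few lines of arithmetic. The
       following is only a sketch of the rule as stated above, not PCRE's
       own code; the function name control_char is made up for this
       illustration, and Python is used purely as a calculator.

```python
def control_char(x: str) -> int:
    """Return the character code that the rule above assigns to backslash-c-x.

    Hypothetical helper for illustration only; it is not part of PCRE.
    """
    if len(x) != 1 or ord(x) > 127:
        raise ValueError("expected a single ASCII character")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 (hex 40) of the character is inverted.
    return ord(x) ^ 0x40


for ch in ("z", "{", ";"):
    print(ch, hex(control_char(ch)))
```

       The three characters printed are the ones used as examples in the
       paragraph above.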

       After \x, from zero to two hexadecimal digits are read (letters can be
       in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
       its may appear between \x{ and }, but the value of the character code
       must be less than 2**31 (that is, the maximum hexadecimal value is
       7FFFFFFF). If characters other than hexadecimal digits appear between
       \x{ and }, or if there is no terminating }, this form of escape is not
       recognized. Instead, the initial \x will be interpreted as a basic hex-
       adecimal escape, with no following digits, giving a character whose
       value is zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
       in the way they are handled. For example, \xdc is exactly the same as
       \x{dc}.

       After \0 up to two further octal digits are read. In both cases, if
       there are fewer than two digits, just those that are present are used.
       Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
       character (code value 7). Make sure you supply two digits after the
       initial zero if the pattern character that follows is itself an octal
       digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated. Outside a character class, PCRE reads it and any following dig-
       its as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and generates a sin-
       gle byte from the least significant 8 bits of the value. Any subsequent
       digits stand for themselves. For example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
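
       Because the digit rules are easy to misread, it may help to see them
       written out as a decision procedure. The sketch below restates the
       rules from the preceding paragraphs in Python; it is an illustration
       only, not PCRE's parser. The function name, its arguments, and the
       tuple it returns are all assumptions made for this example.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify the run of digits that follows a backslash.

    digits        -- all the decimal digits found after the backslash
    groups_so_far -- number of capturing left parentheses seen earlier
    in_class      -- True when inside a character class

    Returns (kind, value, literal_rest). Hypothetical helper, for
    illustration of the documented rules only.
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")

    # Back references are only possible outside a class, and never when
    # the first digit is 0.
    if not in_class and digits[0] != "0":
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backreference", number, "")

    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    count = 0
    while count < 3 and count < len(digits) and digits[count] in OCTAL_DIGITS:
        count += 1
    value = (int(digits[:count], 8) & 0xFF) if count else 0
    # Any remaining digits stand for themselves.
    return ("byte", value, digits[count:])


print(interpret_backslash_digits("0113", 0))   # tab, then a literal "3"
print(interpret_backslash_digits("81", 0))     # binary zero, then "81"
print(interpret_backslash_digits("11", 11))    # enough groups: back reference
```

       Passing a different value for groups_so_far shows how the same
       digits can change meaning as capturing parentheses are added to a
       pattern.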

       All the sequences that define a single byte value or a single UTF-8
       character (in UTF-8 mode) can be used both inside and outside character
       classes. In addition, inside a character class, the sequence \b is
       interpreted as the backspace character (hex 08), and the sequence \X is
       interpreted as the character "X". Outside a character class, these
       sequences have different meanings (see below).

   Generic character types

       The third use of backslash is for specifying generic character types.
       The following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into two disjoint sets. Any given character matches one, and only one,
       of each pair.

       These character type sequences can appear both inside and outside char-
       acter classes. They each match one character of the appropriate type.
       If the current matching point is at the end of the subject string, all
       of them fail, since there is no character to match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32).

       A "word" character is an underscore or any character less than 256 that
       is a letter or digit. The definition of letters and digits is con-
       trolled by PCRE's low-valued character tables, and may vary if locale-
       specific matching is taking place (see "Locale support" in the pcreapi
       page). For example, in the "fr_FR" (French) locale, some character
       codes greater than 128 are used for accented letters, and these are
       matched by \w.

       In UTF-8 mode, characters with values greater than 128 never match \d,
       \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
       code character property support is available.

   Unicode character properties

       When PCRE is built with Unicode character property support, three addi-
       tional escape sequences to match generic character types are available
       when UTF-8 mode is selected. They are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The property names represented by xx above are limited to the Unicode
       general category properties. Each character has exactly one such prop-
       erty, specified by a two-letter abbreviation. For compatibility with
       Perl, negation can be specified by including a circumflex between the
       opening brace and the property name. For example, \p{^Lu} is the same
       as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       properties that start with that letter. In this case, in the absence of
       negation, the curly brackets in the escape sequence are optional; these
       two examples have the same effect:

         \p{L}
         \pL

       The following property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       Extended properties such as "Greek" or "InMusicalSymbols" are not sup-
       ported by PCRE.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode characters that form an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" property, followed
       by zero or more characters with the "mark" property, and treats the
       sequence as an atomic group (see below). Characters with the "mark"
       property are typically accents that affect the preceding character.
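
       The grouping that \X performs can be imitated outside PCRE. The
       sketch below uses Python's unicodedata module, because Python's re
       module has no \p or \X. The function name extended_sequences is
       hypothetical. One deliberate difference is noted in the comments:
       a mark with no preceding base character is returned as its own
       item here, whereas \X would not match at such a position.

```python
import unicodedata


def extended_sequences(text):
    """Split text into runs of one non-mark character plus following marks.

    Illustration only: it approximates what repeated \\X matches would
    return, using the general category reported by unicodedata.
    """
    sequences = []
    for ch in text:
        # General categories Mc, Me and Mn all start with "M" (mark).
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            sequences[-1] += ch          # attach the mark to its base
        else:
            # A new base character. A leading mark also lands here, which
            # differs from \X (see the note above).
            sequences.append(ch)
    return sequences


# "e" followed by U+0301 (combining acute accent), then a plain "a".
parts = extended_sequences("e\u0301a")
print(len(parts), [len(p) for p in parts])
```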

       Matching characters by Unicode property is not fast, because PCRE has
       to search a structure that contains data for over fifteen thousand
       characters. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE.

   Simple assertions

       The fourth use of backslash is for certain simple assertions. An asser-
       tion specifies a condition that has to be met at a particular point in
       a match, without consuming any characters from the subject string. The
       use of subpatterns for more complicated assertions is described below.
       The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at start of subject
         \Z     matches at end of subject or before newline at end
         \z     matches at end of subject
         \G     matches at first matching position in subject

       These assertions may not appear in character classes (but note that \b
       has a different meaning, namely the backspace character, inside a char-
       acter class).

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
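
       The word boundary definition is easiest to see by listing the
       positions at which \b and \B succeed. The example uses Python's re
       module rather than PCRE; for ASCII subjects such as this one the
       two are assumed to agree on \b and \B. It needs Python 3.7 or
       later, where empty matches are reported at every position.

```python
import re

subject = "hi there"

# Offsets at which the zero-width assertions succeed.
boundaries = [m.start() for m in re.finditer(r"\b", subject)]
non_boundaries = [m.start() for m in re.finditer(r"\B", subject)]

print("\\b matches at", boundaries)
print("\\B matches at", non_boundaries)
```

       Every offset from 0 to the length of the subject appears in exactly
       one of the two lists, because \B is simply the negation of \b.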

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three asser-
       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
       affect only the behaviour of the circumflex and dollar metacharacters.
       However, if the startoffset argument of pcre_exec() is non-zero, indi-
       cating that matching is to start at a point other than the beginning of
       the subject, \A can never match. The difference between \Z and \z is
       that \Z matches before a newline that is the last character of the
       string as well as at the end of the string, whereas \z matches only at
       the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate argu-
       ments, you can mimic Perl's /g option, and it is in this kind of imple-
       mentation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.
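
       The repeated-call technique can be sketched without the C API.
       Python's re module has no \G, but the match() method of a compiled
       pattern takes a starting position and succeeds only if the match
       begins exactly there, which plays the role of \G combined with a
       non-zero startoffset. The function name tokens is an assumption of
       this example, and the loop is an analogy to repeated pcre_exec()
       calls rather than a use of them.

```python
import re


def tokens(pattern, subject):
    """Collect successive matches, each anchored where the last one ended."""
    compiled = re.compile(pattern)
    position = 0
    found = []
    while position < len(subject):
        # match() is anchored at `position`, like a pattern starting
        # with \G run with startoffset set to `position`.
        m = compiled.match(subject, position)
        if m is None or m.end() == position:
            break                      # no match here, or an empty match
        found.append(m.group())
        position = m.end()             # next call starts where this ended
    return found


print(tokens(r"\d+|[a-z]+", "12ab34"))
print(tokens(r"\d+|[a-z]+", "12ab 34"))
```

       The second call stops at the space, because no alternative can
       match at that exact position. An unanchored search would instead
       skip ahead to "34".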

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset argu-
       ment of pcre_exec() is non-zero, circumflex can never match if the
       PCRE_MULTILINE option is unset. Inside a character class, circumflex
       has an entirely different meaning (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       A dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline character that is the last character in the string (by
       default). Dollar need not be the last character of the pattern if a
       number of alternatives are involved, but it should be the last item in
       any branch in which it appears. Dollar has no special meaning in a
       character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, they match immedi-
       ately after and immediately before an internal newline character,
       respectively, in addition to matching at the start and end of the sub-
       ject string. For example, the pattern /^abc$/ matches the subject
       string "def\nabc" (where \n represents a newline character) in multi-
       line mode, but not otherwise. Consequently, patterns that are anchored
       in single line mode because all branches start with ^ are not anchored
       in multiline mode, and a match for circumflex is possible when the
       startoffset argument of pcre_exec() is non-zero. The PCRE_DOL-
       LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether PCRE_MULTILINE is set or
       not.
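
       The /^abc$/ example can be tried directly. The sketch uses Python's
       re module, whose re.MULTILINE flag is assumed here to correspond to
       PCRE_MULTILINE for circumflex and dollar. Note that Python's \Z
       differs from PCRE's, so only ^ and $ are shown.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line)
print(multi_line.start(), multi_line.group())
print(before_final_newline is not None)
```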


FULL STOP (PERIOD, DOT)

       Outside a character class, a dot in the pattern matches any one charac-
       ter in the subject, including a non-printing character, but not (by
       default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
       which might be more than one byte long, except (by default) newline. If
       the PCRE_DOTALL option is set, dots match newlines as well. The han-
       dling of dot is entirely independent of the handling of circumflex and
       dollar, the only relationship being that they both involve newline
       characters. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

       Outside a character class, the escape sequence \C matches any one byte,
       both in and out of UTF-8 mode. Unlike a dot, it can match a newline.
       The feature is provided in Perl in order to match individual bytes in
       UTF-8 mode. Because it breaks up UTF-8 characters into individual
       bytes, what remains in the string may be a malformed UTF-8 string. For
       this reason, the \C escape sequence is best avoided.

       PCRE does not allow \C to appear in lookbehind assertions (described
       below), because in UTF-8 mode this would make it impossible to calcu-
       late the length of the lookbehind.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial. If a closing square bracket is required as a member of the class,
       it should be the first data character in the class (after an initial
       circumflex, if present) or escaped with a backslash.

       A character class matches a single character in the subject. In UTF-8
       mode, the character may occupy more than one byte. A matched character
       must be in the set of characters defined by the class, unless the first
       character in the class definition is a circumflex, in which case the
       subject character must not be in the set defined by the class. If a
       circumflex is actually required as a member of the class, ensure it is
       not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion: it still con-
       sumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.

       In UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of bytes, or by using the \x{ escaping
       mechanism.

       When caseless matching is set, any letters in a class represent both
       their upper case and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. When running in UTF-8 mode,
       PCRE supports the concept of case for characters with values greater
       than 128 only when it is compiled with Unicode property support.

       The newline character is never treated in any special way in character
       classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
       options is. A class such as [^a] will always match a newline.
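
       Three of the points above about negated classes can be checked in
       a few lines: a negated class still consumes a character, it
       matches a newline, and caseless matching applies to it. The
       example uses Python's re module, which is assumed to behave like
       PCRE for these simple ASCII classes; re.IGNORECASE stands in for
       PCRE_CASELESS.

```python
import re

negated = r"[^aeiou]"

# A negated class is not an assertion: with nothing left to consume
# after "x", the match fails.
at_end = re.search(r"x[^aeiou]", "x")

# Newline gets no special treatment inside a class.
newline = re.fullmatch(negated, "\n")

# Caseful: "A" is not a lower case vowel, so it matches.
caseful = re.fullmatch(negated, "A")

# Caseless: "A" now counts as a vowel, so the negated class rejects it.
caseless = re.fullmatch(negated, "A", re.IGNORECASE)

print(at_end, newline is not None, caseful is not None, caseless)
```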

       The minus (hyphen) character can be used to specify a range of charac-
       ters in a character class. For example, [d-m] matches any letter
       between d and m, inclusive. If a minus character is required in a
       class, it must be escaped with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so it
       would match "W46]" or "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is inter-
       preted as a class containing a range followed by two other characters.
       The octal or hexadecimal representation of "]" can also be used to end
       a range.
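
       The two readings of "]" after a hyphen can be compared side by
       side. Python's re module is used here and is assumed to parse
       these two classes the same way as PCRE does.

```python
import re

# [W-] is a two-character class, followed by the literal text "46]".
unescaped = r"[W-]46]"

# With the "]" escaped, W-\] is a range (W X Y Z [ \ ]) and the class
# also contains "4" and "6".
escaped = r"[W-\]46]"

print(bool(re.fullmatch(unescaped, "W46]")))
print(bool(re.fullmatch(unescaped, "-46]")))
print([c for c in "WZ]46A-" if re.fullmatch(escaped, c)])
```

       In the last line, "A" and "-" are not reported, because neither
       falls inside the range from "W" to "]" and neither is listed in
       the class.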

       Ranges operate in the collating sequence of character values. They can
       also be used for characters specified numerically, for example
       [\000-\037]. In UTF-8 mode, ranges can include characters whose values
       are greater than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
       character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases. In UTF-8 mode, PCRE supports the
       concept of case for characters with values greater than 128 only when
       it is compiled with Unicode property support.

       The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
       in a character class, and add the characters that they match to the
       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
       flex can conveniently be used with the upper case character types to
       specify a more restricted set of characters than the matching lower
       case type. For example, the class [^\W_] matches any letter or digit,
       but not underscore.

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name - see the
       next section), and the terminating closing square bracket. However,
       escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT character (code
       11). This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater than 128 do not match any
       of the POSIX character classes.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from
       left to right, and the first one that succeeds is used. If the alterna-
       tives are within a subpattern (defined below), "succeeds" means match-
       ing the rest of the main pattern as well as the alternative in the sub-
       pattern.
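
       The left-to-right rule has a visible consequence when one
       alternative is a prefix of another. The patterns a|ab and (a|ab)c
       below are made up for this illustration, and Python's re module
       is assumed to follow the same ordering rule as PCRE.

```python
import re

# The first alternative that succeeds wins, even if a later one would
# match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern: "a" followed by "c" fails on "abc", so "ab" is tried next.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.match(r"x|", "y").group()

print(repr(first_wins), repr(with_rest), repr(empty_alt))
```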


INTERNAL OPTION SETTING

       The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
       PCRE_EXTENDED options can be changed from within the pattern by a
       sequence of Perl option letters enclosed between "(?" and ")". The
       option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
       LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
       is also permitted. If a letter appears both before and after the
       hyphen, the option is unset.

       When an option change occurs at top level (that is, not inside subpat-
       tern parentheses), the change applies to the remainder of the pattern
       that follows. If the change is placed right at the start of a pattern,
       PCRE extracts it into the global options (and it will therefore show up
       in data extracted by the pcre_fullinfo() function).

       An option change within a subpattern affects only that part of the cur-
       rent pattern that follows it, so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
       used).  By this means, options can be made to have different settings
       in different parts of the pattern. Any changes made in one alternative
       do carry on into subsequent branches within the same subpattern. For
       example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
       in the same way as the Perl-compatible options by using the characters
       U and X respectively. The (?X) flag setting is special in that it must
       always occur earlier in the pattern than any of the additional features
       it turns on, even when it is at top level. It is best to put it at the
       start.
653
654
655SUBPATTERNS
656
657 Subpatterns are delimited by parentheses (round brackets), which can be
658 nested. Turning part of a pattern into a subpattern does two things:
659
660 1. It localizes a set of alternatives. For example, the pattern
661
662 cat(aract|erpillar|)
663
664 matches one of the words "cat", "cataract", or "caterpillar". Without
665 the parentheses, it would match "cataract", "erpillar" or the empty
666 string.
667
668 2. It sets up the subpattern as a capturing subpattern. This means
669 that, when the whole pattern matches, that portion of the subject
670 string that matched the subpattern is passed back to the caller via the
671 ovector argument of pcre_exec(). Opening parentheses are counted from
672 left to right (starting from 1) to obtain numbers for the capturing
673 subpatterns.
674
675 For example, if the string "the red king" is matched against the pat-
676 tern
677
678 the ((red|white) (king|queen))
679
680 the captured substrings are "red king", "red", and "king", and are num-
681 bered 1, 2, and 3, respectively.
682
683 The fact that plain parentheses fulfil two functions is not always
684 helpful. There are often times when a grouping subpattern is required
685 without a capturing requirement. If an opening parenthesis is followed
686 by a question mark and a colon, the subpattern does not do any captur-
687 ing, and is not counted when computing the number of any subsequent
688 capturing subpatterns. For example, if the string "the white queen" is
689 matched against the pattern
690
691 the ((?:red|white) (king|queen))
692
693 the captured substrings are "white queen" and "queen", and are numbered
694 1 and 2. The maximum number of capturing subpatterns is 65535, and the
695 maximum depth of nesting of all subpatterns, both capturing and non-
696 capturing, is 200.
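
The numbering rules above can be checked with a short sketch. It uses Python's re module as a stand-in (an assumption made for illustration; Exim reaches PCRE through its C interface), because plain parentheses and (?:...) groups are numbered the same way there.

```python
import re

# Plain parentheses: three capturing groups, numbered by opening parenthesis.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
captured_all = capturing.groups()

# (?:...) groups without capturing, so the following group moves down by one.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
captured_some = non_capturing.groups()

print(captured_all)
print(captured_some)
```

The second tuple has only two entries because the (?:red|white) group is not counted.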
697
698 As a convenient shorthand, if any option settings are required at the
699 start of a non-capturing subpattern, the option letters may appear
700 between the "?" and the ":". Thus the two patterns
701
702 (?i:saturday|sunday)
703 (?:(?i)saturday|sunday)
704
705 match exactly the same set of strings. Because alternative branches are
706 tried from left to right, and options are not reset until the end of
707 the subpattern is reached, an option setting in one branch does affect
708 subsequent branches, so the above patterns match "SUNDAY" as well as
709 "Saturday".
710
711
712NAMED SUBPATTERNS
713
714 Identifying capturing parentheses by number is simple, but it can be
715 very hard to keep track of the numbers in complicated regular expres-
716 sions. Furthermore, if an expression is modified, the numbers may
717 change. To help with this difficulty, PCRE supports the naming of sub-
718 patterns, something that Perl does not provide. The Python syntax
719 (?P<name>...) is used. Names consist of alphanumeric characters and
720 underscores, and must be unique within a pattern.
721
722 Named capturing parentheses are still allocated numbers as well as
723 names. The PCRE API provides function calls for extracting the name-to-
724 number translation table from a compiled pattern. There is also a con-
725 venience function for extracting a captured substring by name. For fur-
726 ther details see the pcreapi documentation.
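
As a rough illustration of names and numbers coexisting, the following sketch uses Python's re module, which is where the (?P<name>...) syntax comes from. The pattern and the names "year" and "month" are hypothetical; the PCRE C functions themselves are described in the pcreapi documentation and are not shown here.

```python
import re

# Hypothetical pattern; the group names are arbitrary.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Name-to-number table: named groups still receive ordinary numbers.
name_to_number = dict(date_pattern.groupindex)

match = date_pattern.search("released 2004-09")
by_name = match.group("month")   # extraction by name
by_number = match.group(2)       # the same substring, by number

print(name_to_number, by_name, by_number)
```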
727
728
729REPETITION
730
731 Repetition is specified by quantifiers, which can follow any of the
732 following items:
733
734 a literal data character
735 the . metacharacter
736 the \C escape sequence
737 the \X escape sequence (in UTF-8 mode with Unicode properties)
738 an escape such as \d that matches a single character
739 a character class
740 a back reference (see "Back references" below)
741 a parenthesized subpattern (unless it is an assertion)
742
743 The general repetition quantifier specifies a minimum and maximum num-
744 ber of permitted matches, by giving the two numbers in curly brackets
745 (braces), separated by a comma. The numbers must be less than 65536,
746 and the first must be less than or equal to the second. For example:
747
748 z{2,4}
749
750 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
751 special character. If the second number is omitted, but the comma is
752 present, there is no upper limit; if the second number and the comma
753 are both omitted, the quantifier specifies an exact number of required
754 matches. Thus
755
756 [aeiou]{3,}
757
758 matches at least 3 successive vowels, but may match many more, while
759
760 \d{8}
761
762 matches exactly 8 digits. An opening curly bracket that appears in a
763 position where a quantifier is not allowed, or one that does not match
764 the syntax of a quantifier, is taken as a literal character. For exam-
765 ple, {,6} is not a quantifier, but a literal string of four characters.
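
The three quantifier forms can be tried out with the sketch below, which assumes Python's re module as a stand-in for PCRE. One known difference is that Python treats {,6} as a quantifier meaning {0,6}, unlike the PCRE behaviour described above, so that case is deliberately left out.

```python
import re

# {2,4}: between two and four repetitions.
bounded = [bool(re.fullmatch(r"z{2,4}", s)) for s in ("z", "zz", "zzzz", "zzzzz")]

# {3,}: at least three with no upper limit; being greedy, it takes every
# successive vowel it can find.
open_ended = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight repetitions.
exact = [bool(re.fullmatch(r"\d{8}", s)) for s in ("20040909", "2004090")]

print(bounded, open_ended, exact)
```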
766
767 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
768 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
769 acters, each of which is represented by a two-byte sequence. Similarly,
770 when Unicode property support is available, \X{3} matches three Unicode
771 extended sequences, each of which may be several bytes long (and they
772 may be of different lengths).
773
774 The quantifier {0} is permitted, causing the expression to behave as if
775 the previous item and the quantifier were not present.
776
777 For convenience (and historical compatibility) the three most common
778 quantifiers have single-character abbreviations:
779
780 * is equivalent to {0,}
781 + is equivalent to {1,}
782 ? is equivalent to {0,1}
783
784 It is possible to construct infinite loops by following a subpattern
785 that can match no characters with a quantifier that has no upper limit,
786 for example:
787
788 (a?)*
789
790 Earlier versions of Perl and PCRE used to give an error at compile time
791 for such patterns. However, because there are cases where this can be
792 useful, such patterns are now accepted, but if any repetition of the
793 subpattern does in fact match no characters, the loop is forcibly bro-
794 ken.
795
796 By default, the quantifiers are "greedy", that is, they match as much
797 as possible (up to the maximum number of permitted times), without
798 causing the rest of the pattern to fail. The classic example of where
799 this gives problems is in trying to match comments in C programs. These
800 appear between /* and */ and within the comment, individual * and /
801 characters may appear. An attempt to match C comments by applying the
802 pattern
803
804 /\*.*\*/
805
806 to the string
807
808 /* first comment */ not comment /* second comment */
809
810 fails, because it matches the entire string owing to the greediness of
811 the .* item.
812
813 However, if a quantifier is followed by a question mark, it ceases to
814 be greedy, and instead matches the minimum number of times possible, so
815 the pattern
816
817 /\*.*?\*/
818
819 does the right thing with the C comments. The meaning of the various
820 quantifiers is not otherwise changed, just the preferred number of
821 matches. Do not confuse this use of question mark with its use as a
822 quantifier in its own right. Because it has two uses, it can sometimes
823 appear doubled, as in
824
825 \d??\d
826
827 which matches one digit by preference, but can match two if that is the
828 only way the rest of the pattern matches.
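
The difference between greedy and minimizing quantifiers can be seen with the C comment example. This is a minimal sketch that assumes Python's re module, whose handling of ? after a quantifier matches the description above.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ after each /*.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d: the optional first digit is skipped by preference...
prefers_one = re.match(r"\d??\d", "12").group()
# ...but is used when the whole subject has to be matched.
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, prefers_one, forced_two)
```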
829
830 If the PCRE_UNGREEDY option is set (an option which is not available in
831 Perl), the quantifiers are not greedy by default, but individual ones
832 can be made greedy by following them with a question mark. In other
833 words, it inverts the default behaviour.
834
835 When a parenthesized subpattern is quantified with a minimum repeat
836 count that is greater than 1 or with a limited maximum, more memory is
837 required for the compiled pattern, in proportion to the size of the
838 minimum or maximum.
839
840 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
841 alent to Perl's /s) is set, thus allowing the . to match newlines, the
842 pattern is implicitly anchored, because whatever follows will be tried
843 against every character position in the subject string, so there is no
844 point in retrying the overall match at any position after the first.
845 PCRE normally treats such a pattern as though it were preceded by \A.
846
847 In cases where it is known that the subject string contains no new-
848 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
849 mization, or alternatively using ^ to indicate anchoring explicitly.
850
851 However, there is one situation where the optimization cannot be used.
852 When .* is inside capturing parentheses that are the subject of a
853 back reference elsewhere in the pattern, a match at the start may fail,
854 and a later one succeed. Consider, for example:
855
856 (.*)abc\1
857
858 If the subject is "xyz123abc123" the match point is the fourth charac-
859 ter. For this reason, such a pattern is not implicitly anchored.
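
The match point can be confirmed with a small sketch, again assuming Python's re module in place of PCRE. Offsets are zero-based, so the fourth character is at offset 3.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = match.start()   # zero-based offset of the overall match
captured = match.group(1)     # what (.*) ended up holding

print(match_start, captured)
```

Attempts at offsets 0, 1 and 2 fail because the text after "abc" does not repeat what (.*) captured.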
860
861 When a capturing subpattern is repeated, the value captured is the sub-
862 string that matched the final iteration. For example, after
863
864 (tweedle[dume]{3}\s*)+
865
866 has matched "tweedledum tweedledee" the value of the captured substring
867 is "tweedledee". However, if there are nested capturing subpatterns,
868 the corresponding captured values may have been set in previous itera-
869 tions. For example, after
870
871 /(a|(b))+/
872
873 matches "aba" the value of the second captured substring is "b".
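
Both cases can be reproduced in a short sketch. It assumes Python's re module, which, like PCRE as described here, keeps a nested group's value from an earlier iteration.

```python
import re

# The outer group is overwritten on every iteration, so the last one wins.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# Group 2 takes part only in the second iteration ("b"); its value
# survives the third iteration, in which only group 1 is set.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.groups()

print(last_iteration, outer, inner)
```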
874
875
876ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
877
878 With both maximizing and minimizing repetition, failure of what follows
879 normally causes the repeated item to be re-evaluated to see if a dif-
880 ferent number of repeats allows the rest of the pattern to match. Some-
881 times it is useful to prevent this, either to change the nature of the
882 match, or to cause it to fail earlier than it otherwise might, when the
883 author of the pattern knows there is no point in carrying on.
884
885 Consider, for example, the pattern \d+foo when applied to the subject
886 line
887
888 123456bar
889
890 After matching all 6 digits and then failing to match "foo", the normal
891 action of the matcher is to try again with only 5 digits matching the
892 \d+ item, and then with 4, and so on, before ultimately failing.
893 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
894 the means for specifying that once a subpattern has matched, it is not
895 to be re-evaluated in this way.
896
897 If we use atomic grouping for the previous example, the matcher would
898 give up immediately on failing to match "foo" the first time. The nota-
899 tion is a kind of special parenthesis, starting with (?> as in this
900 example:
901
902 (?>\d+)foo
903
904 This kind of parenthesis "locks up" the part of the pattern it con-
905 tains once it has matched, and a failure further into the pattern is
906 prevented from backtracking into it. Backtracking past it to previous
907 items, however, works as normal.
908
909 An alternative description is that a subpattern of this type matches
910 the string of characters that an identical standalone pattern would
911 match, if anchored at the current point in the subject string.
912
913 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
914 such as the above example can be thought of as a maximizing repeat that
915 must swallow everything it can. So, while both \d+ and \d+? are pre-
916 pared to adjust the number of digits they match in order to make the
917 rest of the pattern match, (?>\d+) can only match an entire sequence of
918 digits.
919
920 Atomic groups in general can of course contain arbitrarily complicated
921 subpatterns, and can be nested. However, when the subpattern for an
922 atomic group is just a single repeated item, as in the example above, a
923 simpler notation, called a "possessive quantifier" can be used. This
924 consists of an additional + character following a quantifier. Using
925 this notation, the previous example can be rewritten as
926
927 \d++foo
928
929 Possessive quantifiers are always greedy; the setting of the
930 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
931 simpler forms of atomic group. However, there is no difference in the
932 meaning or processing of a possessive quantifier and the equivalent
933 atomic group.
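
The effect of "locking up" a repeat can be sketched as follows, assuming Python's re module. The (?>...) and ++ notations are accepted by Python only from version 3.11, so the sketch also shows a commonly used emulation, a lookahead that captures followed by a back reference; this is a substitute technique for illustration, not how PCRE implements atomic groups.

```python
import re
import sys

subject = "123456"

# Ordinary greedy repeat: \d+ gives back the final digit so that "6" can match.
backtracking = re.search(r"\d+6", subject)

# Emulated atomic group: the lookahead captures the longest run of digits and
# \1 then consumes exactly that text, so no digit can be given back.
# (?:6) keeps the literal 6 from being read as part of the reference number.
emulated_atomic = re.search(r"(?=(\d+))\1(?:6)", subject)

# Native notations, tried only where this Python version accepts them.
if sys.version_info >= (3, 11):
    native_atomic = re.search(r"(?>\d+)6", subject)
    native_possessive = re.search(r"\d++6", subject)
else:
    native_atomic = native_possessive = None

print(backtracking, emulated_atomic, native_atomic, native_possessive)
```

Only the ordinary repeat finds a match; every atomic variant swallows all the digits and leaves nothing for the final "6".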
934
935 The possessive quantifier syntax is an extension to the Perl syntax. It
936 originates in Sun's Java package.
937
938 When a pattern contains an unlimited repeat inside a subpattern that
939 can itself be repeated an unlimited number of times, the use of an
940 atomic group is the only way to avoid some failing matches taking a
941 very long time indeed. The pattern
942
943 (\D+|<\d+>)*[!?]
944
945 matches an unlimited number of substrings that either consist of non-
946 digits, or digits enclosed in <>, followed by either ! or ?. When it
947 matches, it runs quickly. However, if it is applied to
948
949 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
950
951 it takes a long time before reporting failure. This is because the
952 string can be divided between the internal \D+ repeat and the external
953 * repeat in a large number of ways, and all have to be tried. (The
954 example uses [!?] rather than a single character at the end, because
955 both PCRE and Perl have an optimization that allows for fast failure
956 when a single character is used. They remember the last single charac-
957 ter that is required for a match, and fail early if it is not present
958 in the string.) If the pattern is changed so that it uses an atomic
959 group, like this:
960
961 ((?>\D+)|<\d+>)*[!?]
962
963 sequences of non-digits cannot be broken, and failure happens quickly.
964
965
966BACK REFERENCES
967
968 Outside a character class, a backslash followed by a digit greater than
969 0 (and possibly further digits) is a back reference to a capturing sub-
970 pattern earlier (that is, to its left) in the pattern, provided there
971 have been that many previous capturing left parentheses.
972
973 However, if the decimal number following the backslash is less than 10,
974 it is always taken as a back reference, and causes an error only if
975 there are not that many capturing left parentheses in the entire pat-
976 tern. In other words, the parentheses that are referenced need not be
977 to the left of the reference for numbers less than 10. See the subsec-
978 tion entitled "Non-printing characters" above for further details of
979 the handling of digits following a backslash.
980
981 A back reference matches whatever actually matched the capturing sub-
982 pattern in the current subject string, rather than anything matching
983 the subpattern itself (see "Subpatterns as subroutines" below for a way
984 of doing that). So the pattern
985
986 (sens|respons)e and \1ibility
987
988 matches "sense and sensibility" and "response and responsibility", but
989 not "sense and responsibility". If caseful matching is in force at the
990 time of the back reference, the case of letters is relevant. For exam-
991 ple,
992
993 ((?i)rah)\s+\1
994
995 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
996 original capturing subpattern is matched caselessly.
997
998 Back references to named subpatterns use the Python syntax (?P=name).
999 We could rewrite the above example as follows:
1000
1001 (?P<p1>(?i)rah)\s+(?P=p1)
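
Numbered and named back references can be compared side by side in the following sketch. It assumes Python's re module, which shares the (?P<name>...) and (?P=name) syntax; the name "stem" is made up for the example.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

phrases = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",   # the reference must repeat the same text
]

numbered_results = [bool(numbered.fullmatch(p)) for p in phrases]
named_results = [bool(named.fullmatch(p)) for p in phrases]

print(numbered_results, named_results)
```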
1002
1003 There may be more than one back reference to the same subpattern. If a
1004 subpattern has not actually been used in a particular match, any back
1005 references to it always fail. For example, the pattern
1006
1007 (a|(bc))\2
1008
1009 always fails if it starts to match "a" rather than "bc". Because there
1010 may be many capturing parentheses in a pattern, all digits following
1011 the backslash are taken as part of a potential back reference number.
1012 If the pattern continues with a digit character, some delimiter must be
1013 used to terminate the back reference. If the PCRE_EXTENDED option is
1014 set, this can be whitespace. Otherwise an empty comment (see "Com-
1015 ments" below) can be used.
1016
1017 A back reference that occurs inside the parentheses to which it refers
1018 fails when the subpattern is first used, so, for example, (a\1) never
1019 matches. However, such references can be useful inside repeated sub-
1020 patterns. For example, the pattern
1021
1022 (a|b\1)+
1023
1024 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1025 ation of the subpattern, the back reference matches the character
1026 string corresponding to the previous iteration. In order for this to
1027 work, the pattern must be such that the first iteration does not need
1028 to match the back reference. This can be done using alternation, as in
1029 the example above, or by a quantifier with a minimum of zero.
1030
1031
1032ASSERTIONS
1033
1034 An assertion is a test on the characters following or preceding the
1035 current matching point that does not actually consume any characters.
1036 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1037 described above.
1038
1039 More complicated assertions are coded as subpatterns. There are two
1040 kinds: those that look ahead of the current position in the subject
1041 string, and those that look behind it. An assertion subpattern is
1042 matched in the normal way, except that it does not cause the current
1043 matching position to be changed.
1044
1045 Assertion subpatterns are not capturing subpatterns, and may not be
1046 repeated, because it makes no sense to assert the same thing several
1047 times. If any kind of assertion contains capturing subpatterns within
1048 it, these are counted for the purposes of numbering the capturing sub-
1049 patterns in the whole pattern. However, substring capturing is carried
1050 out only for positive assertions, because it does not make sense for
1051 negative assertions.
1052
1053 Lookahead assertions
1054
1055 Lookahead assertions start with (?= for positive assertions and (?! for
1056 negative assertions. For example,
1057
1058 \w+(?=;)
1059
1060 matches a word followed by a semicolon, but does not include the semi-
1061 colon in the match, and
1062
1063 foo(?!bar)
1064
1065 matches any occurrence of "foo" that is not followed by "bar". Note
1066 that the apparently similar pattern
1067
1068 (?!foo)bar
1069
1070 does not find an occurrence of "bar" that is preceded by something
1071 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1072 the assertion (?!foo) is always true when the next three characters are
1073 "bar". A lookbehind assertion is needed to achieve the other effect.
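
The three lookahead patterns above behave as follows in a minimal sketch that assumes Python's re module, whose lookahead syntax is the same. The subject strings are invented for the example.

```python
import re

# The semicolon is required but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "int count;").group()

# Only the "foo" that is not followed by "bar" is reported.
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": at offset 3 the next three
# characters are "bar", not "foo", so the assertion is true.
bar_after_foo = re.search(r"(?!foo)bar", "foobar")

print(word_before_semicolon, foo_positions, bar_after_foo.start())
```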
1074
1075 If you want to force a matching failure at some point in a pattern, the
1076 most convenient way to do it is with (?!) because an empty string
1077 always matches, so an assertion that requires there not to be an empty
1078 string must always fail.
1079
1080 Lookbehind assertions
1081
1082 Lookbehind assertions start with (?<= for positive assertions and (?<!
1083 for negative assertions. For example,
1084
1085 (?<!foo)bar
1086
1087 does find an occurrence of "bar" that is not preceded by "foo". The
1088 contents of a lookbehind assertion are restricted such that all the
1089 strings it matches must have a fixed length. However, if there are sev-
1090 eral alternatives, they do not all have to have the same fixed length.
1091 Thus
1092
1093 (?<=bullock|donkey)
1094
1095 is permitted, but
1096
1097 (?<!dogs?|cats?)
1098
1099 causes an error at compile time. Branches that match different length
1100 strings are permitted only at the top level of a lookbehind assertion.
1101 This is an extension compared with Perl (at least for 5.8), which
1102 requires all branches to match the same length of string. An assertion
1103 such as
1104
1105 (?<=ab(c|de))
1106
1107 is not permitted, because its single top-level branch can match two
1108 different lengths, but it is acceptable if rewritten to use two top-
1109 level branches:
1110
1111 (?<=abc|abde)
1112
1113 The implementation of lookbehind assertions is, for each alternative,
1114 to temporarily move the current position back by the fixed width and
1115 then try to match. If there are insufficient characters before the cur-
1116 rent position, the match is deemed to fail.
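
A minimal sketch of lookbehind behaviour follows, assuming Python's re module. Python is stricter than PCRE here: it requires every alternative in a lookbehind to have the same length, so the bullock|donkey example is not reproduced. The sketch also shows what happens when there are too few characters before the current position.

```python
import re

negative = re.compile(r"(?<!foo)bar")
after_foo = negative.search("foobar")   # "bar" is preceded by "foo"
after_baz = negative.search("bazbar")   # "bar" is preceded by "baz"

# At the very start of the subject there are not three characters to look
# back over, so the lookbehind itself fails; the negative assertion
# therefore succeeds and the positive one does not.
negative_at_start = negative.search("bar")
positive_at_start = re.search(r"(?<=foo)bar", "bar")

print(after_foo, after_baz, negative_at_start, positive_at_start)
```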
1117
1118 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1119 mode) to appear in lookbehind assertions, because it makes it impossi-
1120 ble to calculate the length of the lookbehind. The \X escape, which can
1121 match different numbers of bytes, is also not permitted.
1122
1123 Atomic groups can be used in conjunction with lookbehind assertions to
1124 specify efficient matching at the end of the subject string. Consider a
1125 simple pattern such as
1126
1127 abcd$
1128
1129 when applied to a long string that does not match. Because matching
1130 proceeds from left to right, PCRE will look for each "a" in the subject
1131 and then see if what follows matches the rest of the pattern. If the
1132 pattern is specified as
1133
1134 ^.*abcd$
1135
1136 the initial .* matches the entire string at first, but when this fails
1137 (because there is no following "a"), it backtracks to match all but the
1138 last character, then all but the last two characters, and so on. Once
1139 again the search for "a" covers the entire string, from right to left,
1140 so we are no better off. However, if the pattern is written as
1141
1142 ^(?>.*)(?<=abcd)
1143
1144 or, equivalently, using the possessive quantifier syntax,
1145
1146 ^.*+(?<=abcd)
1147
1148 there can be no backtracking for the .* item; it can match only the
1149 entire string. The subsequent lookbehind assertion does a single test
1150 on the last four characters. If it fails, the match fails immediately.
1151 For long strings, this approach makes a significant difference to the
1152 processing time.
1153
1154 Using multiple assertions
1155
1156 Several assertions (of any sort) may occur in succession. For example,
1157
1158 (?<=\d{3})(?<!999)foo
1159
1160 matches "foo" preceded by three digits that are not "999". Notice that
1161 each of the assertions is applied independently at the same point in
1162 the subject string. First there is a check that the previous three
1163 characters are all digits, and then there is a check that the same
1164 three characters are not "999". This pattern does not match "foo"
1165 preceded by six characters, the first three of which are digits and the
1166 last three of which are not "999". For example, it does not match
1167 "123abcfoo". A pattern to do that is
1168
1169 (?<=\d{3}...)(?<!999)foo
1170
1171 This time the first assertion looks at the preceding six characters,
1172 checking that the first three are digits, and then the second assertion
1173 checks that the preceding three characters are not "999".
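
The two patterns can be compared on a few subjects. This sketch assumes Python's re module, which accepts both lookbehinds because each has a single fixed length; the subject strings are chosen for illustration.

```python
import re

adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# Both assertions in "adjacent" inspect the same three characters.
adjacent_results = {
    s: bool(adjacent.search(s)) for s in ("123foo", "999foo", "123abcfoo")
}

# In "spaced" the first assertion looks back six characters, the second three.
spaced_results = {
    s: bool(spaced.search(s)) for s in ("123abcfoo", "123999foo", "abc123foo")
}

print(adjacent_results)
print(spaced_results)
```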
1174
1175 Assertions can be nested in any combination. For example,
1176
1177 (?<=(?<!foo)bar)baz
1178
1179 matches an occurrence of "baz" that is preceded by "bar" which in turn
1180 is not preceded by "foo", while
1181
1182 (?<=\d{3}(?!999)...)foo
1183
1184 is another pattern that matches "foo" preceded by three digits and any
1185 three characters that are not "999".
1186
1187
1188CONDITIONAL SUBPATTERNS
1189
1190 It is possible to cause the matching process to obey a subpattern con-
1191 ditionally or to choose between two alternative subpatterns, depending
1192 on the result of an assertion, or whether a previous capturing subpat-
1193 tern matched or not. The two possible forms of conditional subpattern
1194 are
1195
1196 (?(condition)yes-pattern)
1197 (?(condition)yes-pattern|no-pattern)
1198
1199 If the condition is satisfied, the yes-pattern is used; otherwise the
1200 no-pattern (if present) is used. If there are more than two alterna-
1201 tives in the subpattern, a compile-time error occurs.
1202
1203 There are three kinds of condition. If the text between the parentheses
1204 consists of a sequence of digits, the condition is satisfied if the
1205 capturing subpattern of that number has previously matched. The number
1206 must be greater than zero. Consider the following pattern, which con-
1207 tains non-significant white space to make it more readable (assume the
1208 PCRE_EXTENDED option) and to divide it into three parts for ease of
1209 discussion:
1210
1211 ( \( )? [^()]+ (?(1) \) )
1212
1213 The first part matches an optional opening parenthesis, and if that
1214 character is present, sets it as the first captured substring. The sec-
1215 ond part matches one or more characters that are not parentheses. The
1216 third part is a conditional subpattern that tests whether the first set
1217 of parentheses matched or not. If they did, that is, if the subject started
1218 with an opening parenthesis, the condition is true, and so the yes-pat-
1219 tern is executed and a closing parenthesis is required. Otherwise,
1220 since no-pattern is not present, the subpattern matches nothing. In
1221 other words, this pattern matches a sequence of non-parentheses,
1222 optionally enclosed in parentheses.
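
The numbered-group condition can be tried with the sketch below. It assumes Python's re module, which supports the (?(1)...) form, with re.VERBOSE playing the role of PCRE_EXTENDED. The whole subject is required to match so that unbalanced input is rejected.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

results = {
    s: bool(optional_parens.fullmatch(s))
    for s in ("abc", "(abc)", "(abc", "abc)")
}

print(results)
```

The last two subjects fail because a closing parenthesis is demanded exactly when an opening one was captured.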
1223
1224 If the condition is the string (R), it is satisfied if a recursive call
1225 to the pattern or subpattern has been made. At "top level", the
1226 condition is false. This is a PCRE extension. Recursion is described
1227 in the section entitled "Recursive patterns" below.
1228
1229 If the condition is not a sequence of digits or (R), it must be an
1230 assertion. This may be a positive or negative lookahead or lookbehind
1231 assertion. Consider this pattern, again containing non-significant
1232 white space, and with the two alternatives on the second line:
1233
1234 (?(?=[^a-z]*[a-z])
1235 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1236
1237 The condition is a positive lookahead assertion that matches an
1238 optional sequence of non-letters followed by a letter. In other words,
1239 it tests for the presence of at least one letter in the subject. If a
1240 letter is found, the subject is matched against the first alternative;
1241 otherwise it is matched against the second. This pattern matches
1242 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1243 letters and dd are digits.
1244
1245
1246COMMENTS
1247
1248 The sequence (?# marks the start of a comment that continues up to the
1249 next closing parenthesis. Nested parentheses are not permitted. The
1250 characters that make up a comment play no part in the pattern matching
1251 at all.
1252
1253 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1254 character class introduces a comment that continues up to the next new-
1255 line character in the pattern.
1256
1257
1258RECURSIVE PATTERNS
1259
1260 Consider the problem of matching a string in parentheses, allowing for
1261 unlimited nested parentheses. Without the use of recursion, the best
1262 that can be done is to use a pattern that matches up to some fixed
1263 depth of nesting. It is not possible to handle an arbitrary nesting
1264 depth. Perl provides a facility that allows regular expressions to
1265 recurse (amongst other things). It does this by interpolating Perl code
1266 in the expression at run time, and the code can refer to the expression
1267 itself. A Perl pattern to solve the parentheses problem can be created
1268 like this:
1269
1270 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1271
1272 The (?p{...}) item interpolates Perl code at run time, and in this case
1273 refers recursively to the pattern in which it appears. Obviously, PCRE
1274 cannot support the interpolation of Perl code. Instead, it supports
1275 some special syntax for recursion of the entire pattern, and also for
1276 individual subpattern recursion.
1277
1278 The special item that consists of (? followed by a number greater than
1279 zero and a closing parenthesis is a recursive call of the subpattern of
1280 the given number, provided that it occurs inside that subpattern. (If
1281 not, it is a "subroutine" call, which is described in the next sec-
1282 tion.) The special item (?R) is a recursive call of the entire regular
1283 expression.
1284
1285 For example, this PCRE pattern solves the nested parentheses problem
1286 (assume the PCRE_EXTENDED option is set so that white space is
1287 ignored):
1288
1289 \( ( (?>[^()]+) | (?R) )* \)
1290
1291 First it matches an opening parenthesis. Then it matches any number of
1292 substrings which can either be a sequence of non-parentheses, or a
1293 recursive match of the pattern itself (that is, a correctly
1294 parenthesized substring). Finally there is a closing parenthesis.
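
Python's re module has no (?R), so the recursive pattern cannot be run there directly. As an aid to reading it, the following hand-written function mirrors its three parts; the function name is hypothetical, it is anchored at a given offset rather than searching, and it is an illustration of the logic only, not of how PCRE executes the pattern.

```python
def match_parenthesized(subject, start=0):
    """Return the offset just past a balanced parenthesized substring
    that begins at 'start', or None if there is none."""
    # \(  : an opening parenthesis is required first.
    if start >= len(subject) or subject[start] != "(":
        return None
    pos = start + 1
    # ( (?>[^()]+) | (?R) )*  : any number of plain runs or nested groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1          # \)  : the closing parenthesis
        if ch == "(":
            pos = match_parenthesized(subject, pos)   # the (?R) step
            if pos is None:
                return None
        else:
            pos += 1                # one more character of [^()]+
    return None                     # ran out of subject before \)


print(match_parenthesized("(ab(cd)ef)"))
print(match_parenthesized("(aaaa()"))
```

Because a character is either a parenthesis or not, the two alternatives never overlap and the function needs no backtracking.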
1295
1296 If this were part of a larger pattern, you would not want to recurse
1297 the entire pattern, so instead you could use this:
1298
1299 ( \( ( (?>[^()]+) | (?1) )* \) )
1300
1301 We have put the pattern into parentheses, and caused the recursion to
1302 refer to them instead of the whole pattern. In a larger pattern, keep-
1303 ing track of parenthesis numbers can be tricky. It may be more conve-
1304 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1305 which is an extension to the Python syntax that PCRE uses for named
1306 parentheses (Perl does not provide named parentheses). We could rewrite
1307 the above example as follows:
1308
1309 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1310
1311 This particular example pattern contains nested unlimited repeats, and
1312 so the use of atomic grouping for matching strings of non-parentheses
1313 is important when applying the pattern to strings that do not match.
1314 For example, when this pattern is applied to
1315
1316 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1317
1318 it yields "no match" quickly. However, if atomic grouping is not used,
1319 the match runs for a very long time indeed because there are so many
1320 different ways the + and * repeats can carve up the subject, and all
1321 have to be tested before failure can be reported.
1322
1323 At the end of a match, the values set for any capturing subpatterns are
1324 those from the outermost level of the recursion at which the subpattern
1325 value is set. If you want to obtain intermediate values, a callout
1326 function can be used (see "Callouts" below and the pcrecallout
1327 documentation). If the pattern above is matched against
1328
1329 (ab(cd)ef)
1330
1331 the value for the capturing parentheses is "ef", which is the last
1332 value taken on at the top level. If additional parentheses are added,
1333 giving
1334
1335 \( ( ( (?>[^()]+) | (?R) )* ) \)
1336    ^                        ^
1337    ^                        ^
1338
1339 the string they capture is "ab(cd)ef", the contents of the top level
1340 parentheses. If there are more than 15 capturing parentheses in a pat-
1341 tern, PCRE has to obtain extra memory to store data during a recursion,
1342 which it does by using pcre_malloc, freeing it via pcre_free after-
1343 wards. If no memory can be obtained, the match fails with the
1344 PCRE_ERROR_NOMEMORY error.
1345
1346 Do not confuse the (?R) item with the condition (R), which tests for
1347 recursion. Consider this pattern, which matches text in angle brack-
1348 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1349 brackets (that is, when recursing), whereas any characters are permit-
1350 ted at the outer level.
1351
1352 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1353
1354 In this pattern, (?(R) is the start of a conditional subpattern, with
1355 two different alternatives for the recursive and non-recursive cases.
1356 The (?R) item is the actual recursive call.
1357
1358
1359SUBPATTERNS AS SUBROUTINES
1360
1361 If the syntax for a recursive subpattern reference (either by number or
1362 by name) is used outside the parentheses to which it refers, it oper-
1363 ates like a subroutine in a programming language. An earlier example
1364 pointed out that the pattern
1365
1366 (sens|respons)e and \1ibility
1367
1368 matches "sense and sensibility" and "response and responsibility", but
1369 not "sense and responsibility". If instead the pattern
1370
1371 (sens|respons)e and (?1)ibility
1372
1373 is used, it does match "sense and responsibility" as well as the other
1374 two strings. Such references must, however, follow the subpattern to
1375 which they refer.
1376
1377
1378CALLOUTS
1379
1380 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1381 Perl code to be obeyed in the middle of matching a regular expression.
1382 This makes it possible, amongst other things, to extract different sub-
1383 strings that match the same pair of parentheses when there is a repeti-
1384 tion.
1385
1386 PCRE provides a similar feature, but of course it cannot obey arbitrary
1387 Perl code. The feature is called "callout". The caller of PCRE provides
1388 an external function by putting its entry point in the global variable
1389 pcre_callout. By default, this variable contains NULL, which disables
1390 all calling out.
1391
1392 Within a regular expression, (?C) indicates the points at which the
1393 external function is to be called. If you want to identify different
1394 callout points, you can put a number less than 256 after the letter C.
1395 The default value is zero. For example, this pattern has two callout
1396 points:
1397
1398 (?C1)abc(?C2)def
1399
1400 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1401 automatically installed before each item in the pattern. They are all
1402 numbered 255.
1403
1404 During matching, when PCRE reaches a callout point (and pcre_callout is
1405 set), the external function is called. It is provided with the number
1406 of the callout, the position in the pattern, and, optionally, one item
1407 of data originally supplied by the caller of pcre_exec(). The callout
1408 function may cause matching to proceed, to backtrack, or to fail alto-
1409 gether. A complete description of the interface to the callout function
1410 is given in the pcrecallout documentation.
1411
1412Last updated: 09 September 2004
1413Copyright (c) 1997-2004 University of Cambridge.