This file contains the PCRE man page that describes the regular expressions
supported by PCRE version 6.2. Note that not all of the features are relevant
in the context of Exim. In particular, the version of PCRE that is compiled
with Exim does not include UTF-8 support, there is no mechanism for changing
the options with which the PCRE functions are called, and features such as
callout are not accessible.
-----------------------------------------------------------------------------

PCREPATTERN(3)                                                  PCREPATTERN(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions supported by PCRE
       are described below. Regular expressions are also described in the Perl
       documentation and in a number of books, some of which have copious
       examples. Jeffrey Friedl's "Mastering Regular Expressions", published
       by O'Reilly, covers regular expressions in great detail. This descrip-
       tion of PCRE's regular expressions is intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 character strings. To use
       this, you must build PCRE to include UTF-8 support, and then call
       pcre_compile() with the PCRE_UTF8 option. How this affects pattern
       matching is mentioned in several places below. There is also a summary
       of UTF-8 features in the section on UTF-8 support in the main pcre
       page.

       The remainder of this document discusses the patterns that are sup-
       ported by PCRE when its main matching function, pcre_exec(), is used.
       From release 6.0, PCRE offers a second matching function,
       pcre_dfa_exec(), which matches using a different algorithm that is not
       Perl-compatible. The advantages and disadvantages of the alternative
       function, and how it differs from the normal function, are discussed in
       the pcrematching page.

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE_CASELESS option), letters are
       matched independently of case. In UTF-8 mode, PCRE always understands
       the concept of case for characters whose values are less than 128, so
       caseless matching is always possible. For characters with higher val-
       ues, the concept of case is supported if PCRE is compiled with Unicode
       property support, but not otherwise. If you want to use caseless
       matching for characters 128 and above, you must ensure that PCRE is
       compiled with Unicode property support as well as with UTF-8 support.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are recog-
       nized anywhere in the pattern except within square brackets, and those
       that are recognized in square brackets. Outside square brackets, the
       metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphanumeric character, it takes away any special meaning that
       character may have. This use of backslash as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a back-
       slash, you write \\.

       If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline character are ignored.
       An escaping backslash can be used to include a whitespace or # charac-
       ter as part of the pattern.
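
       The effect of extended mode can be tried out with a short sketch. It
       uses Python's re module and its re.VERBOSE flag as a stand-in for
       PCRE_EXTENDED; this is an assumption made for illustration, since
       Python's engine is not PCRE, although the two agree on the cases shown
       here. The helper name matches() is hypothetical.

```python
import re

# Assumption: re.VERBOSE plays the role of PCRE_EXTENDED in this sketch.
plain = re.compile(r"a b")                                # the space is significant
extended = re.compile(r"a b   # trailing comment", re.VERBOSE)
escaped = re.compile(r"a\ b \# c", re.VERBOSE)            # escaped space and # stay literal


def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```

       With these definitions, matches(extended, "ab") holds while
       matches(extended, "a b") does not, and the escaped form matches the
       subject "a b#c".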

       If you want to remove the special meaning from a sequence of charac-
       ters, you can do so by putting them between \Q and \E. This is differ-
       ent from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing char-
       acters in patterns in a visible manner. There is no restriction on the
       appearance of non-printing characters, apart from the binary zero that
       terminates a pattern, but when a pattern is being prepared by text
       editing, it is usually easier to use one of the following escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        newline (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh... (UTF-8 mode only)

       The precise effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40) is
       inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
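
       The \cx rule is plain arithmetic, so it can be checked with a few
       lines of Python. This is only a sketch of the rule as stated above,
       not PCRE's own code, and the function name control_char_code() is
       hypothetical.

```python
def control_char_code(x: str) -> int:
    """Return the character code that the rule above assigns to \\cx."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    code = ord(x)
    if ord("a") <= code <= ord("z"):
        code -= 0x20          # convert a lower case letter to upper case
    return code ^ 0x40        # invert bit 6 (hex 40)
```

       Calling it with "z", "{" and ";" reproduces the three values quoted in
       the text: hex 1A, hex 3B and hex 7B.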

       After \x, from zero to two hexadecimal digits are read (letters can be
       in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
       its may appear between \x{ and }, but the value of the character code
       must be less than 2**31 (that is, the maximum hexadecimal value is
       7FFFFFFF). If characters other than hexadecimal digits appear between
       \x{ and }, or if there is no terminating }, this form of escape is not
       recognized. Instead, the initial \x will be interpreted as a basic
       hexadecimal escape, with no following digits, giving a character whose
       value is zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
       in the way they are handled. For example, \xdc is exactly the same as
       \x{dc}.

       After \0 up to two further octal digits are read. In both cases, if
       there are fewer than two digits, just those that are present are used.
       Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
       character (code value 7). Make sure you supply two digits after the
       initial zero if the pattern character that follows is itself an octal
       digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated. Outside a character class, PCRE reads it and any following dig-
       its as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and generates a sin-
       gle byte from the least significant 8 bits of the value. Any subsequent
       digits stand for themselves. For example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
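
       Because the decision depends on several conditions at once, it may
       help to see it written out as a small decision procedure. The Python
       sketch below models only the rules stated in this section; it is not
       taken from the PCRE sources, and the function name, its parameters
       and its return shape are all hypothetical.

```python
def interpret_backslash_digits(digits: str, captures: int, in_class: bool = False):
    """Model how a backslash followed by `digits` is read.

    `captures` is the number of previous capturing left parentheses.
    Returns ("backref", n, rest) or ("byte", value, rest); `rest` holds the
    left-over digits, which stand for themselves.
    """
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    if digits[0] != "0" and not in_class:
        number = int(digits)                  # read all digits as decimal
        if number < 10 or captures >= number:
            return ("backref", number, "")
    # Octal reading: at most three octal digits are ever consumed, whether
    # the sequence starts with 0 or is being re-read after the test above.
    taken = 0
    while taken < 3 and taken < len(digits) and digits[taken] in "01234567":
        taken += 1
    value = int(digits[:taken], 8) if taken else 0
    return ("byte", value & 0xFF, digits[taken:])
```

       For instance, interpret_backslash_digits("81", 0) yields a zero byte
       with "81" left over, and interpret_backslash_digits("11", 11) yields a
       back reference, in line with the table above.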

       All the sequences that define a single byte value or a single UTF-8
       character (in UTF-8 mode) can be used both inside and outside character
       classes. In addition, inside a character class, the sequence \b is
       interpreted as the backspace character (hex 08), and the sequence \X is
       interpreted as the character "X". Outside a character class, these
       sequences have different meanings (see below).

   Generic character types

       The third use of backslash is for specifying generic character types.
       The following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into two disjoint sets. Any given character matches one, and only one,
       of each pair.

       These character type sequences can appear both inside and outside char-
       acter classes. They each match one character of the appropriate type.
       If the current matching point is at the end of the subject string, all
       of them fail, since there is no character to match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32).

       A "word" character is an underscore or any character less than 256 that
       is a letter or digit. The definition of letters and digits is con-
       trolled by PCRE's low-valued character tables, and may vary if locale-
       specific matching is taking place (see "Locale support" in the pcreapi
       page). For example, in the "fr_FR" (French) locale, some character
       codes greater than 128 are used for accented letters, and these are
       matched by \w.

       In UTF-8 mode, characters with values greater than 128 never match \d,
       \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
       code character property support is available.

   Unicode character properties

       When PCRE is built with Unicode character property support, three addi-
       tional escape sequences to match generic character types are available
       when UTF-8 mode is selected. They are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The property names represented by xx above are limited to the Unicode
       general category properties. Each character has exactly one such prop-
       erty, specified by a two-letter abbreviation. For compatibility with
       Perl, negation can be specified by including a circumflex between the
       opening brace and the property name. For example, \p{^Lu} is the same
       as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       properties that start with that letter. In this case, in the absence of
       negation, the curly brackets in the escape sequence are optional; these
       two examples have the same effect:

         \p{L}
         \pL

       The following property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       Extended properties such as "Greek" or "InMusicalSymbols" are not sup-
       ported by PCRE.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode characters that form an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" property, followed
       by zero or more characters with the "mark" property, and treats the
       sequence as an atomic group (see below). Characters with the "mark"
       property are typically accents that affect the preceding character.
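
       The shape of such a sequence can be modelled without a regular
       expression engine. The Python sketch below uses the standard
       unicodedata module to test for the "mark" categories (Mn, Mc, Me); it
       illustrates the definition above and is not how PCRE implements \X.
       The function names are hypothetical.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch has a general category beginning with M (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def extended_sequence_end(subject: str, pos: int = 0):
    """Return the index just past the sequence starting at pos, or None."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None           # one non-mark character is required first
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1              # then every following mark is taken
    return end
```

       For the subject "e" followed by U+0301 (combining acute accent) and
       "x", the sequence starting at position 0 ends at index 2, so the
       accent stays attached to its base letter.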

       Matching characters by Unicode property is not fast, because PCRE has
       to search a structure that contains data for over fifteen thousand
       characters. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE.

   Simple assertions

       The fourth use of backslash is for certain simple assertions. An asser-
       tion specifies a condition that has to be met at a particular point in
       a match, without consuming any characters from the subject string. The
       use of subpatterns for more complicated assertions is described below.
       The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at start of subject
         \Z     matches at end of subject or before newline at end
         \z     matches at end of subject
         \G     matches at first matching position in subject

       These assertions may not appear in character classes (but note that \b
       has a different meaning, namely the backspace character, inside a char-
       acter class).

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
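
       The definition of a word boundary translates directly into code. The
       Python sketch below assumes an ASCII-only notion of a "word" character
       (letter, digit or underscore), which is a simplification of PCRE's
       table-driven definition; the function names are hypothetical.

```python
def is_word_char(ch: str) -> bool:
    """ASCII-only stand-in for \\w: a letter, a digit, or an underscore."""
    return ch == "_" or (ch.isascii() and ch.isalnum())


def is_word_boundary(subject: str, pos: int) -> bool:
    """True if position pos (0..len(subject)) is a word boundary."""
    # Positions outside the string count as non-word, which covers the
    # start-of-string and end-of-string cases in the definition.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after


def boundaries(subject: str):
    """List every word-boundary position in subject."""
    return [i for i in range(len(subject) + 1) if is_word_boundary(subject, i)]
```

       For the subject "ab cd" this reports boundaries at positions 0, 2, 3
       and 5, which are the places where \b would succeed.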

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three asser-
       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
       affect only the behaviour of the circumflex and dollar metacharacters.
       However, if the startoffset argument of pcre_exec() is non-zero, indi-
       cating that matching is to start at a point other than the beginning of
       the subject, \A can never match. The difference between \Z and \z is
       that \Z matches before a newline that is the last character of the
       string as well as at the end of the string, whereas \z matches only at
       the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate argu-
       ments, you can mimic Perl's /g option, and it is in this kind of imple-
       mentation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset argu-
       ment of pcre_exec() is non-zero, circumflex can never match if the
       PCRE_MULTILINE option is unset. Inside a character class, circumflex
       has an entirely different meaning (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       A dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline character that is the last character in the string (by
       default). Dollar need not be the last character of the pattern if a
       number of alternatives are involved, but it should be the last item in
       any branch in which it appears. Dollar has no special meaning in a
       character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, they match immedi-
       ately after and immediately before an internal newline character,
       respectively, in addition to matching at the start and end of the sub-
       ject string. For example, the pattern /^abc$/ matches the subject
       string "def\nabc" (where \n represents a newline character) in multi-
       line mode, but not otherwise. Consequently, patterns that are anchored
       in single line mode because all branches start with ^ are not anchored
       in multiline mode, and a match for circumflex is possible when the
       startoffset argument of pcre_exec() is non-zero. The PCRE_DOL-
       LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
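
       The /^abc$/ example can be reproduced with a short sketch. It uses
       Python's re module, with re.MULTILINE standing in for PCRE_MULTILINE;
       that substitution is an assumption made for illustration, since
       Python's engine is not PCRE, but the two behave alike for these cases.

```python
import re

# Assumption: re.MULTILINE plays the role of PCRE_MULTILINE in this sketch.
subject = "def\nabc"

default_match = re.search(r"^abc$", subject)
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")
```

       Here default_match is None, multiline_match starts at offset 4 (just
       after the internal newline), and before_final_newline succeeds even
       though a newline follows "abc".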

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether PCRE_MULTILINE is set or
       not.


FULL STOP (PERIOD, DOT)

       Outside a character class, a dot in the pattern matches any one charac-
       ter in the subject, including a non-printing character, but not (by
       default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
       which might be more than one byte long, except (by default) newline. If
       the PCRE_DOTALL option is set, dots match newlines as well. The han-
       dling of dot is entirely independent of the handling of circumflex and
       dollar, the only relationship being that they both involve newline
       characters. Dot has no special meaning in a character class.
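
       A minimal sketch of the dot behaviour follows, again using Python's re
       module as a stand-in (an assumption for illustration), with re.DOTALL
       in the role of PCRE_DOTALL. The helper name dot_matches() is
       hypothetical.

```python
import re


def dot_matches(subject: str, dotall: bool = False) -> bool:
    """Return True if the pattern a.b matches the whole subject."""
    flags = re.DOTALL if dotall else 0   # re.DOTALL stands in for PCRE_DOTALL
    return re.fullmatch(r"a.b", subject, flags) is not None
```

       A tab or a binary zero between "a" and "b" is accepted in either mode,
       whereas a newline in that position is accepted only when dotall is
       true.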


MATCHING A SINGLE BYTE

       Outside a character class, the escape sequence \C matches any one byte,
       both in and out of UTF-8 mode. Unlike a dot, it can match a newline.
       The feature is provided in Perl in order to match individual bytes in
       UTF-8 mode. Because it breaks up UTF-8 characters into individual
       bytes, what remains in the string may be a malformed UTF-8 string. For
       this reason, the \C escape sequence is best avoided.

       PCRE does not allow \C to appear in lookbehind assertions (described
       below), because in UTF-8 mode this would make it impossible to calcu-
       late the length of the lookbehind.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial. If a closing square bracket is required as a member of the class,
       it should be the first data character in the class (after an initial
       circumflex, if present) or escaped with a backslash.

       A character class matches a single character in the subject. In UTF-8
       mode, the character may occupy more than one byte. A matched character
       must be in the set of characters defined by the class, unless the first
       character in the class definition is a circumflex, in which case the
       subject character must not be in the set defined by the class. If a
       circumflex is actually required as a member of the class, ensure it is
       not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion: it still con-
       sumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.

       In UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of bytes, or by using the \x{ escaping
       mechanism.

       When caseless matching is set, any letters in a class represent both
       their upper case and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
       understands the concept of case for characters whose values are less
       than 128, so caseless matching is always possible. For characters with
       higher values, the concept of case is supported if PCRE is compiled
       with Unicode property support, but not otherwise. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF-8
       support.

       The newline character is never treated in any special way in character
       classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
       options is. A class such as [^a] will always match a newline.
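
       The class behaviour described so far can be exercised with a small
       sketch. It relies on Python's re module as a stand-in for PCRE (an
       assumption for illustration), with re.IGNORECASE in the role of
       caseless matching. The helper name class_matches() is hypothetical.

```python
import re


def class_matches(char_class: str, ch: str, caseless: bool = False) -> bool:
    """Return True if the single character ch is matched by char_class."""
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(char_class, ch, flags) is not None


# A negated class still consumes a character, so it fails on an empty subject.
negated_on_empty = re.search(r"[^a]", "")
```

       For example, class_matches("[^aeiou]", "A") is true, but it becomes
       false when caseless=True is passed, and class_matches("[^a]", "\n") is
       true regardless of any dot or multiline setting.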

       The minus (hyphen) character can be used to specify a range of charac-
       ters in a character class. For example, [d-m] matches any letter
       between d and m, inclusive. If a minus character is required in a
       class, it must be escaped with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so it
       would match "W46]" or "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is inter-
       preted as a class containing a range followed by two other characters.
       The octal or hexadecimal representation of "]" can also be used to end
       a range.

       Ranges operate in the collating sequence of character values. They can
       also be used for characters specified numerically, for example
       [\000-\037]. In UTF-8 mode, ranges can include characters whose values
       are greater than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
       character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases. In UTF-8 mode, PCRE supports the
       concept of case for characters with values greater than 128 only when
       it is compiled with Unicode property support.

       The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
       in a character class, and add the characters that they match to the
       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
       flex can conveniently be used with the upper case character types to
       specify a more restricted set of characters than the matching lower
       case type. For example, the class [^\W_] matches any letter or digit,
       but not underscore.
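
       The range and character-type examples above can be checked with the
       following sketch. It uses Python's re module as a stand-in for PCRE,
       which is an assumption for illustration; re.ASCII is set so that \d
       and \W cover only ASCII, roughly as in PCRE's default tables. The
       constant names are hypothetical.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
UNESCAPED = re.compile(r"[W-]46]")
# Escaping "]" makes it the end of the range W..], followed by "4" and "6".
ESCAPED = re.compile(r"[W-\]46]")

# Character types inside classes (ASCII-only, by assumption).
HEX_UPPER = re.compile(r"[\dABCDEF]", re.ASCII)
ALNUM = re.compile(r"[^\W_]", re.ASCII)
```

       UNESCAPED matches "W46]" and "-46]" in full, ESCAPED matches single
       characters such as "X", "]" or "4", HEX_UPPER accepts "7" and "F" but
       not "f", and ALNUM accepts letters and digits while rejecting "_".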

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name - see the
       next section), and the terminating closing square bracket. However,
       escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT character (code
       11). This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater than 128 do not match any
       of the POSIX character classes.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from
       left to right, and the first one that succeeds is used. If the alterna-
       tives are within a subpattern (defined below), "succeeds" means match-
       ing the rest of the main pattern as well as the alternative in the sub-
       pattern.
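
       The left-to-right rule matters when one alternative is a prefix of
       another. The sketch below shows this using Python's re module, whose
       alternation is also tried left to right; treating it as a stand-in for
       PCRE is an assumption made for illustration. The helper name
       first_alternative() is hypothetical.

```python
import re


def first_alternative(pattern: str, subject: str):
    """Return the text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None
```

       With the subject "ab", the pattern a|ab yields "a" while ab|a yields
       "ab". Inside a subpattern, as in (a|ab)c against "abc", the first
       alternative is abandoned because the rest of the pattern cannot match
       after it, and the result is "abc".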


INTERNAL OPTION SETTING

       The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
       PCRE_EXTENDED options can be changed from within the pattern by a
       sequence of Perl option letters enclosed between "(?" and ")". The
       option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
       LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
       is also permitted. If a letter appears both before and after the
       hyphen, the option is unset.

       When an option change occurs at top level (that is, not inside subpat-
       tern parentheses), the change applies to the remainder of the pattern
       that follows. If the change is placed right at the start of a pattern,
       PCRE extracts it into the global options (and it will therefore show up
       in data extracted by the pcre_fullinfo() function).

       An option change within a subpattern affects only that part of the cur-
       rent pattern that follows it, so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
       used). By this means, options can be made to have different settings
       in different parts of the pattern. Any changes made in one alternative
       do carry on into subsequent branches within the same subpattern. For
       example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
       in the same way as the Perl-compatible options by using the characters
       U and X respectively. The (?X) flag setting is special in that it must
       always occur earlier in the pattern than any of the additional features
       it turns on, even when it is at top level. It is best to put it at the
       start.
675
676
677SUBPATTERNS
678
679 Subpatterns are delimited by parentheses (round brackets), which can be
680 nested. Turning part of a pattern into a subpattern does two things:
681
682 1. It localizes a set of alternatives. For example, the pattern
683
684 cat(aract|erpillar|)
685
686 matches one of the words "cat", "cataract", or "caterpillar". Without
687 the parentheses, it would match "cataract", "erpillar" or the empty
688 string.
689
690 2. It sets up the subpattern as a capturing subpattern. This means
691 that, when the whole pattern matches, that portion of the subject
692 string that matched the subpattern is passed back to the caller via the
693 ovector argument of pcre_exec(). Opening parentheses are counted from
694 left to right (starting from 1) to obtain numbers for the capturing
695 subpatterns.
696
697 For example, if the string "the red king" is matched against the pat-
698 tern
699
700 the ((red|white) (king|queen))
701
702 the captured substrings are "red king", "red", and "king", and are num-
703 bered 1, 2, and 3, respectively.
704
705 The fact that plain parentheses fulfil two functions is not always
706 helpful. There are often times when a grouping subpattern is required
707 without a capturing requirement. If an opening parenthesis is followed
708 by a question mark and a colon, the subpattern does not do any captur-
709 ing, and is not counted when computing the number of any subsequent
710 capturing subpatterns. For example, if the string "the white queen" is
711 matched against the pattern
712
713 the ((?:red|white) (king|queen))
714
715 the captured substrings are "white queen" and "queen", and are numbered
716 1 and 2. The maximum number of capturing subpatterns is 65535, and the
717 maximum depth of nesting of all subpatterns, both capturing and non-
718 capturing, is 200.
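
The numbering rule can be seen in a minimal sketch, again assuming
Python's re module as a stand-in for PCRE. The groups() call returns the
captured substrings in order of their opening parentheses.

```python
import re

# Three capturing subpatterns, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
groups_all = capturing.groups()

# (?:...) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
groups_nc = non_capturing.groups()
```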
719
720 As a convenient shorthand, if any option settings are required at the
721 start of a non-capturing subpattern, the option letters may appear
722 between the "?" and the ":". Thus the two patterns
723
724 (?i:saturday|sunday)
725 (?:(?i)saturday|sunday)
726
727 match exactly the same set of strings. Because alternative branches are
728 tried from left to right, and options are not reset until the end of
729 the subpattern is reached, an option setting in one branch does affect
730 subsequent branches, so the above patterns match "SUNDAY" as well as
731 "Saturday".
732
733
734NAMED SUBPATTERNS
735
736 Identifying capturing parentheses by number is simple, but it can be
737 very hard to keep track of the numbers in complicated regular expres-
738 sions. Furthermore, if an expression is modified, the numbers may
739 change. To help with this difficulty, PCRE supports the naming of sub-
740 patterns, something that Perl does not provide. The Python syntax
741 (?P<name>...) is used. Names consist of alphanumeric characters and
742 underscores, and must be unique within a pattern.
743
744 Named capturing parentheses are still allocated numbers as well as
745 names. The PCRE API provides function calls for extracting the name-to-
746 number translation table from a compiled pattern. There is also a con-
747 venience function for extracting a captured substring by name. For fur-
748 ther details see the pcreapi documentation.
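
Since PCRE borrows this syntax from Python, a sketch in Python's own re
module shows the idea. The groupindex attribute is Python's counterpart
of the name-to-number table; the PCRE functions themselves are described
in the pcreapi documentation and are not shown here.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.fullmatch("2005-08-01")

# A named subpattern can be read by name or by its allocated number.
by_name = m.group("month")
by_number = m.group(2)

# Name-to-number translation table for the compiled pattern.
name_table = dict(date.groupindex)
```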
749
750
751REPETITION
752
753 Repetition is specified by quantifiers, which can follow any of the
754 following items:
755
756 a literal data character
757 the . metacharacter
758 the \C escape sequence
759 the \X escape sequence (in UTF-8 mode with Unicode properties)
760 an escape such as \d that matches a single character
761 a character class
762 a back reference (see next section)
763 a parenthesized subpattern (unless it is an assertion)
764
765 The general repetition quantifier specifies a minimum and maximum num-
766 ber of permitted matches, by giving the two numbers in curly brackets
767 (braces), separated by a comma. The numbers must be less than 65536,
768 and the first must be less than or equal to the second. For example:
769
770 z{2,4}
771
772 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
773 special character. If the second number is omitted, but the comma is
774 present, there is no upper limit; if the second number and the comma
775 are both omitted, the quantifier specifies an exact number of required
776 matches. Thus
777
778 [aeiou]{3,}
779
780 matches at least 3 successive vowels, but may match many more, while
781
782 \d{8}
783
784 matches exactly 8 digits. An opening curly bracket that appears in a
785 position where a quantifier is not allowed, or one that does not match
786 the syntax of a quantifier, is taken as a literal character. For exam-
787 ple, {,6} is not a quantifier, but a literal string of four characters.
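
The three quantifier forms can be compared in a short sketch, assuming
Python's re module as a stand-in for PCRE. The helper fully_matches is a
name invented for this example.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# {2,4}: between two and four repeats.
z_runs = fully_matches(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])

# {3,}: at least three repeats, no upper limit.
vowels = fully_matches(r"[aeiou]{3,}", ["ae", "aei", "aeiouaeiou"])

# {8}: exactly eight repeats.
digits = fully_matches(r"\d{8}", ["1234567", "12345678", "123456789"])

# A closing brace on its own is an ordinary character.
lone_brace = bool(re.fullmatch(r"a}", "a}"))
```

One known difference: Python reads {,6} as a quantifier with a minimum
of zero, so the literal-string behaviour described above for PCRE cannot
be reproduced with this module.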
788
789 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
790 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
791 acters, each of which is represented by a two-byte sequence. Similarly,
792 when Unicode property support is available, \X{3} matches three Unicode
793 extended sequences, each of which may be several bytes long (and they
794 may be of different lengths).
795
796 The quantifier {0} is permitted, causing the expression to behave as if
797 the previous item and the quantifier were not present.
798
799 For convenience (and historical compatibility) the three most common
800 quantifiers have single-character abbreviations:
801
802 * is equivalent to {0,}
803 + is equivalent to {1,}
804 ? is equivalent to {0,1}
805
806 It is possible to construct infinite loops by following a subpattern
807 that can match no characters with a quantifier that has no upper limit,
808 for example:
809
810 (a?)*
811
812 Earlier versions of Perl and PCRE used to give an error at compile time
813 for such patterns. However, because there are cases where this can be
814 useful, such patterns are now accepted, but if any repetition of the
815 subpattern does in fact match no characters, the loop is forcibly bro-
816 ken.
817
818 By default, the quantifiers are "greedy", that is, they match as much
819 as possible (up to the maximum number of permitted times), without
820 causing the rest of the pattern to fail. The classic example of where
821 this gives problems is in trying to match comments in C programs. These
822 appear between /* and */ and within the comment, individual * and /
823 characters may appear. An attempt to match C comments by applying the
824 pattern
825
826 /\*.*\*/
827
828 to the string
829
830 /* first comment */ not comment /* second comment */
831
832 fails, because it matches the entire string owing to the greediness of
833 the .* item.
834
835 However, if a quantifier is followed by a question mark, it ceases to
836 be greedy, and instead matches the minimum number of times possible, so
837 the pattern
838
839 /\*.*?\*/
840
841 does the right thing with the C comments. The meaning of the various
842 quantifiers is not otherwise changed, just the preferred number of
843 matches. Do not confuse this use of question mark with its use as a
844 quantifier in its own right. Because it has two uses, it can sometimes
845 appear doubled, as in
846
847 \d??\d
848
849 which matches one digit by preference, but can match two if that is the
850 only way the rest of the pattern matches.
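
The difference between greedy and minimizing repeats can be tried with
the C comment example. This is a sketch that assumes Python's re module
as a stand-in for PCRE; both treat the ? suffix in the same way.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", swallowing the whole string.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing .*? stops at the first "*/" each time.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the
# match requires it (fullmatch forces the whole string to be used).
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```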
851
852 If the PCRE_UNGREEDY option is set (an option which is not available in
853 Perl), the quantifiers are not greedy by default, but individual ones
854 can be made greedy by following them with a question mark. In other
855 words, it inverts the default behaviour.
856
857 When a parenthesized subpattern is quantified with a minimum repeat
858 count that is greater than 1 or with a limited maximum, more memory is
859 required for the compiled pattern, in proportion to the size of the
860 minimum or maximum.
861
862 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
863 alent to Perl's /s) is set, thus allowing the . to match newlines, the
864 pattern is implicitly anchored, because whatever follows will be tried
865 against every character position in the subject string, so there is no
866 point in retrying the overall match at any position after the first.
867 PCRE normally treats such a pattern as though it were preceded by \A.
868
869 In cases where it is known that the subject string contains no new-
870 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
871 mization, or alternatively using ^ to indicate anchoring explicitly.
872
873 However, there is one situation where the optimization cannot be used.
874 When .* is inside capturing parentheses that are the subject of a
875 backreference elsewhere in the pattern, a match at the start may fail,
876 and a later one succeed. Consider, for example:
877
878 (.*)abc\1
879
880 If the subject is "xyz123abc123" the match point is the fourth charac-
881 ter. For this reason, such a pattern is not implicitly anchored.
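
The match point can be confirmed with a brief sketch, assuming Python's
re module as a stand-in for PCRE. Offsets are counted from zero, so the
fourth character is at offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# No match is possible at offsets 0 to 2, because "xyz123", "yz123"
# and "z123" are not repeated after "abc".
start = m.start()
captured = m.group(1)
```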
882
883 When a capturing subpattern is repeated, the value captured is the sub-
884 string that matched the final iteration. For example, after
885
886 (tweedle[dume]{3}\s*)+
887
888 has matched "tweedledum tweedledee" the value of the captured substring
889 is "tweedledee". However, if there are nested capturing subpatterns,
890 the corresponding captured values may have been set in previous itera-
891 tions. For example, after
892
893 /(a|(b))+/
894
895 matches "aba" the value of the second captured substring is "b".
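
Both cases can be reproduced in a short sketch, assuming Python's re
module as a stand-in for PCRE; it keeps values from earlier iterations
in the same way for these two patterns.

```python
import re

# The captured value comes from the final iteration of the repeat.
last_iteration = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                              "tweedledum tweedledee").group(1)

# The nested group was last set in the second iteration ("b") and
# keeps that value although the final iteration matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)
```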
896
897
898ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
899
900 With both maximizing and minimizing repetition, failure of what follows
901 normally causes the repeated item to be re-evaluated to see if a dif-
902 ferent number of repeats allows the rest of the pattern to match. Some-
903 times it is useful to prevent this, either to change the nature of the
904 match, or to cause it to fail earlier than it otherwise might, when the
905 author of the pattern knows there is no point in carrying on.
906
907 Consider, for example, the pattern \d+foo when applied to the subject
908 line
909
910 123456bar
911
912 After matching all 6 digits and then failing to match "foo", the normal
913 action of the matcher is to try again with only 5 digits matching the
914 \d+ item, and then with 4, and so on, before ultimately failing.
915 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
916 the means for specifying that once a subpattern has matched, it is not
917 to be re-evaluated in this way.
918
919 If we use atomic grouping for the previous example, the matcher would
920 give up immediately on failing to match "foo" the first time. The nota-
921 tion is a kind of special parenthesis, starting with (?> as in this
922 example:
923
924 (?>\d+)foo
925
926 This kind of parenthesis "locks up" the part of the pattern it con-
927 tains once it has matched, and a failure further into the pattern is
928 prevented from backtracking into it. Backtracking past it to previous
929 items, however, works as normal.
930
931 An alternative description is that a subpattern of this type matches
932 the string of characters that an identical standalone pattern would
933 match, if anchored at the current point in the subject string.
934
935 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
936 such as the above example can be thought of as a maximizing repeat that
937 must swallow everything it can. So, while both \d+ and \d+? are pre-
938 pared to adjust the number of digits they match in order to make the
939 rest of the pattern match, (?>\d+) can only match an entire sequence of
940 digits.
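
The effect of locking up a repeat can be sketched in Python, with one
caveat: Python's re module accepts (?>...) only from version 3.11, so
this example swaps in a well-known emulation instead. A lookahead
captures the run of digits and a back reference then consumes it, which
leaves nothing to be given back. This is an assumption of the sketch,
not how PCRE implements atomic groups.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ gives back the final digit so that the
# trailing \d can match, and the overall match succeeds.
backtracking = re.search(r"\d+\d", subject)

# Emulated (?>\d+)\d: the lookahead captures the whole run of digits,
# \1 consumes exactly that run, and no digit is left for the final \d.
atomic_like = re.search(r"(?=(\d+))\1\d", subject)

# When something else follows the digits, both forms agree.
with_foo = re.search(r"(?=(\d+))\1foo", "123456foo").group()
```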
941
942 Atomic groups in general can of course contain arbitrarily complicated
943 subpatterns, and can be nested. However, when the subpattern for an
944 atomic group is just a single repeated item, as in the example above, a
945 simpler notation, called a "possessive quantifier" can be used. This
946 consists of an additional + character following a quantifier. Using
947 this notation, the previous example can be rewritten as
948
949 \d++foo
950
951 Possessive quantifiers are always greedy; the setting of the
952 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
953 simpler forms of atomic group. However, there is no difference in the
954 meaning or processing of a possessive quantifier and the equivalent
955 atomic group.
956
957 The possessive quantifier syntax is an extension to the Perl syntax. It
958 originates in Sun's Java package.
959
960 When a pattern contains an unlimited repeat inside a subpattern that
961 can itself be repeated an unlimited number of times, the use of an
962 atomic group is the only way to avoid some failing matches taking a
963 very long time indeed. The pattern
964
965 (\D+|<\d+>)*[!?]
966
967 matches an unlimited number of substrings that either consist of non-
968 digits, or digits enclosed in <>, followed by either ! or ?. When it
969 matches, it runs quickly. However, if it is applied to
970
971 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
972
973 it takes a long time before reporting failure. This is because the
974 string can be divided between the internal \D+ repeat and the external
975 * repeat in a large number of ways, and all have to be tried. (The
976 example uses [!?] rather than a single character at the end, because
977 both PCRE and Perl have an optimization that allows for fast failure
978 when a single character is used. They remember the last single charac-
979 ter that is required for a match, and fail early if it is not present
980 in the string.) If the pattern is changed so that it uses an atomic
981 group, like this:
982
983 ((?>\D+)|<\d+>)*[!?]
984
985 sequences of non-digits cannot be broken, and failure happens quickly.
986
987
988BACK REFERENCES
989
990 Outside a character class, a backslash followed by a digit greater than
991 0 (and possibly further digits) is a back reference to a capturing sub-
992 pattern earlier (that is, to its left) in the pattern, provided there
993 have been that many previous capturing left parentheses.
994
995 However, if the decimal number following the backslash is less than 10,
996 it is always taken as a back reference, and causes an error only if
997 there are not that many capturing left parentheses in the entire pat-
998 tern. In other words, the parentheses that are referenced need not be
999 to the left of the reference for numbers less than 10. See the subsec-
1000 tion entitled "Non-printing characters" above for further details of
1001 the handling of digits following a backslash.
1002
1003 A back reference matches whatever actually matched the capturing sub-
1004 pattern in the current subject string, rather than anything matching
1005 the subpattern itself (see "Subpatterns as subroutines" below for a way
1006 of doing that). So the pattern
1007
1008 (sens|respons)e and \1ibility
1009
1010 matches "sense and sensibility" and "response and responsibility", but
1011 not "sense and responsibility". If caseful matching is in force at the
1012 time of the back reference, the case of letters is relevant. For exam-
1013 ple,
1014
1015 ((?i)rah)\s+\1
1016
1017 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1018 original capturing subpattern is matched caselessly.
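
A sketch of both examples, assuming Python's re module (3.6 or later) as
a stand-in for PCRE. Because recent Python versions do not accept (?i)
in the middle of a pattern, the second pattern uses the scoped form
(?i:rah) inside the capturing parentheses instead.

```python
import re

# \1 must repeat the text that the subpattern actually matched.
sense = re.compile(r"(sens|respons)e and \1ibility")
outcomes = {text: bool(sense.fullmatch(text)) for text in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)}

# The capture is caseless, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_outcomes = {text: bool(rah.fullmatch(text))
                for text in ("rah rah", "RAH RAH", "RAH rah")}
```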
1019
1020 Back references to named subpatterns use the Python syntax (?P=name).
1021 We could rewrite the above example as follows:
1022
1023 (?P<p1>(?i)rah)\s+(?P=p1)
1024
1025 There may be more than one back reference to the same subpattern. If a
1026 subpattern has not actually been used in a particular match, any back
1027 references to it always fail. For example, the pattern
1028
1029 (a|(bc))\2
1030
1031 always fails if it starts to match "a" rather than "bc". Because there
1032 may be many capturing parentheses in a pattern, all digits following
1033 the backslash are taken as part of a potential back reference number.
1034 If the pattern continues with a digit character, some delimiter must be
1035 used to terminate the back reference. If the PCRE_EXTENDED option is
1036 set, this can be whitespace. Otherwise an empty comment (see "Com-
1037 ments" below) can be used.
1038
1039 A back reference that occurs inside the parentheses to which it refers
1040 fails when the subpattern is first used, so, for example, (a\1) never
1041 matches. However, such references can be useful inside repeated sub-
1042 patterns. For example, the pattern
1043
1044 (a|b\1)+
1045
1046 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1047 ation of the subpattern, the back reference matches the character
1048 string corresponding to the previous iteration. In order for this to
1049 work, the pattern must be such that the first iteration does not need
1050 to match the back reference. This can be done using alternation, as in
1051 the example above, or by a quantifier with a minimum of zero.
1052
1053
1054ASSERTIONS
1055
1056 An assertion is a test on the characters following or preceding the
1057 current matching point that does not actually consume any characters.
1058 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1059 described above.
1060
1061 More complicated assertions are coded as subpatterns. There are two
1062 kinds: those that look ahead of the current position in the subject
1063 string, and those that look behind it. An assertion subpattern is
1064 matched in the normal way, except that it does not cause the current
1065 matching position to be changed.
1066
1067 Assertion subpatterns are not capturing subpatterns, and may not be
1068 repeated, because it makes no sense to assert the same thing several
1069 times. If any kind of assertion contains capturing subpatterns within
1070 it, these are counted for the purposes of numbering the capturing sub-
1071 patterns in the whole pattern. However, substring capturing is carried
1072 out only for positive assertions, because it does not make sense for
1073 negative assertions.
1074
1075 Lookahead assertions
1076
1077 Lookahead assertions start with (?= for positive assertions and (?! for
1078 negative assertions. For example,
1079
1080 \w+(?=;)
1081
1082 matches a word followed by a semicolon, but does not include the semi-
1083 colon in the match, and
1084
1085 foo(?!bar)
1086
1087 matches any occurrence of "foo" that is not followed by "bar". Note
1088 that the apparently similar pattern
1089
1090 (?!foo)bar
1091
1092 does not find an occurrence of "bar" that is preceded by something
1093 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1094 the assertion (?!foo) is always true when the next three characters are
1095 "bar". A lookbehind assertion is needed to achieve the other effect.
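
The three lookahead patterns can be compared in a short sketch, assuming
Python's re module as a stand-in for PCRE. Offsets are counted from
zero.

```python
import re

# The semicolon is tested but not included in the match.
word = re.search(r"\w+(?=;)", "x = total;").group()

# Only the "foo" that is not followed by "bar" is found (offset 7).
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", including the one preceded by "foo",
# because the next three characters are "bar" and so cannot be "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]
```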
1096
1097 If you want to force a matching failure at some point in a pattern, the
1098 most convenient way to do it is with (?!) because an empty string
1099 always matches, so an assertion that requires there not to be an empty
1100 string must always fail.
1101
1102 Lookbehind assertions
1103
1104 Lookbehind assertions start with (?<= for positive assertions and (?<!
1105 for negative assertions. For example,
1106
1107 (?<!foo)bar
1108
1109 does find an occurrence of "bar" that is not preceded by "foo". The
1110 contents of a lookbehind assertion are restricted such that all the
1111 strings it matches must have a fixed length. However, if there are sev-
1112 eral alternatives, they do not all have to have the same fixed length.
1113 Thus
1114
1115 (?<=bullock|donkey)
1116
1117 is permitted, but
1118
1119 (?<!dogs?|cats?)
1120
1121 causes an error at compile time. Branches that match different length
1122 strings are permitted only at the top level of a lookbehind assertion.
1123 This is an extension compared with Perl (at least for 5.8), which
1124 requires all branches to match the same length of string. An assertion
1125 such as
1126
1127 (?<=ab(c|de))
1128
1129 is not permitted, because its single top-level branch can match two
1130 different lengths, but it is acceptable if rewritten to use two top-
1131 level branches:
1132
1133 (?<=abc|abde)
1134
1135 The implementation of lookbehind assertions is, for each alternative,
1136 to temporarily move the current position back by the fixed width and
1137 then try to match. If there are insufficient characters before the cur-
1138 rent position, the match is deemed to fail.
1139
1140 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1141 mode) to appear in lookbehind assertions, because it makes it impossi-
1142 ble to calculate the length of the lookbehind. The \X escape, which can
1143 match different numbers of bytes, is also not permitted.
1144
1145 Atomic groups can be used in conjunction with lookbehind assertions to
1146 specify efficient matching at the end of the subject string. Consider a
1147 simple pattern such as
1148
1149 abcd$
1150
1151 when applied to a long string that does not match. Because matching
1152 proceeds from left to right, PCRE will look for each "a" in the subject
1153 and then see if what follows matches the rest of the pattern. If the
1154 pattern is specified as
1155
1156 ^.*abcd$
1157
1158 the initial .* matches the entire string at first, but when this fails
1159 (because there is no following "a"), it backtracks to match all but the
1160 last character, then all but the last two characters, and so on. Once
1161 again the search for "a" covers the entire string, from right to left,
1162 so we are no better off. However, if the pattern is written as
1163
1164 ^(?>.*)(?<=abcd)
1165
1166 or, equivalently, using the possessive quantifier syntax,
1167
1168 ^.*+(?<=abcd)
1169
1170 there can be no backtracking for the .* item; it can match only the
1171 entire string. The subsequent lookbehind assertion does a single test
1172 on the last four characters. If it fails, the match fails immediately.
1173 For long strings, this approach makes a significant difference to the
1174 processing time.
1175
1176 Using multiple assertions
1177
1178 Several assertions (of any sort) may occur in succession. For example,
1179
1180 (?<=\d{3})(?<!999)foo
1181
1182 matches "foo" preceded by three digits that are not "999". Notice that
1183 each of the assertions is applied independently at the same point in
1184 the subject string. First there is a check that the previous three
1185 characters are all digits, and then there is a check that the same
1186 three characters are not "999". This pattern does not match "foo"
1187 preceded by six characters, the first three of which are digits and the last
1188 three of which are not "999". For example, it doesn't match "123abc-
1189 foo". A pattern to do that is
1190
1191 (?<=\d{3}...)(?<!999)foo
1192
1193 This time the first assertion looks at the preceding six characters,
1194 checking that the first three are digits, and then the second assertion
1195 checks that the preceding three characters are not "999".
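
The two patterns can be compared side by side in a sketch that assumes
Python's re module as a stand-in for PCRE. Both lookbehinds have a fixed
width, which Python also requires.

```python
import re

# Both assertions inspect the three characters just before "foo".
simple = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

simple_hits = [s for s in ("123foo", "999foo", "123abcfoo")
               if simple.search(s)]
six_hits = [s for s in ("123foo", "123abcfoo", "123999foo")
            if six_back.search(s)]
```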
1196
1197 Assertions can be nested in any combination. For example,
1198
1199 (?<=(?<!foo)bar)baz
1200
1201 matches an occurrence of "baz" that is preceded by "bar" which in turn
1202 is not preceded by "foo", while
1203
1204 (?<=\d{3}(?!999)...)foo
1205
1206 is another pattern that matches "foo" preceded by three digits and any
1207 three characters that are not "999".
1208
1209
1210CONDITIONAL SUBPATTERNS
1211
1212 It is possible to cause the matching process to obey a subpattern con-
1213 ditionally or to choose between two alternative subpatterns, depending
1214 on the result of an assertion, or whether a previous capturing subpat-
1215 tern matched or not. The two possible forms of conditional subpattern
1216 are
1217
1218 (?(condition)yes-pattern)
1219 (?(condition)yes-pattern|no-pattern)
1220
1221 If the condition is satisfied, the yes-pattern is used; otherwise the
1222 no-pattern (if present) is used. If there are more than two alterna-
1223 tives in the subpattern, a compile-time error occurs.
1224
1225 There are three kinds of condition. If the text between the parentheses
1226 consists of a sequence of digits, the condition is satisfied if the
1227 capturing subpattern of that number has previously matched. The number
1228 must be greater than zero. Consider the following pattern, which con-
1229 tains non-significant white space to make it more readable (assume the
1230 PCRE_EXTENDED option) and to divide it into three parts for ease of
1231 discussion:
1232
1233 ( \( )? [^()]+ (?(1) \) )
1234
1235 The first part matches an optional opening parenthesis, and if that
1236 character is present, sets it as the first captured substring. The sec-
1237 ond part matches one or more characters that are not parentheses. The
1238 third part is a conditional subpattern that tests whether the first set
1239 of parentheses matched or not. If they did, that is, if the subject started
1240 with an opening parenthesis, the condition is true, and so the yes-pat-
1241 tern is executed and a closing parenthesis is required. Otherwise,
1242 since no-pattern is not present, the subpattern matches nothing. In
1243 other words, this pattern matches a sequence of non-parentheses,
1244 optionally enclosed in parentheses.
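
The numbered condition can be tried in a sketch that assumes Python's re
module as a stand-in for PCRE, with re.VERBOSE playing the part of
PCRE_EXTENDED. Python supports the (?(1)...) form but not assertion or
(R) conditions.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# A closing parenthesis is required exactly when an opening one matched.
verdicts = {s: bool(optional_parens.fullmatch(s))
            for s in ("abc", "(abc)", "(abc", "abc)")}
```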
1245
1246 If the condition is the string (R), it is satisfied if a recursive call
1247 to the pattern or subpattern has been made. At "top level", the condi-
1248 tion is false. This is a PCRE extension. Recursive patterns are
1249 described in the next section.
1250
1251 If the condition is not a sequence of digits or (R), it must be an
1252 assertion. This may be a positive or negative lookahead or lookbehind
1253 assertion. Consider this pattern, again containing non-significant
1254 white space, and with the two alternatives on the second line:
1255
1256 (?(?=[^a-z]*[a-z])
1257 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1258
1259 The condition is a positive lookahead assertion that matches an
1260 optional sequence of non-letters followed by a letter. In other words,
1261 it tests for the presence of at least one letter in the subject. If a
1262 letter is found, the subject is matched against the first alternative;
1263 otherwise it is matched against the second. This pattern matches
1264 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1265 letters and dd are digits.
1266
1267
1268COMMENTS
1269
1270 The sequence (?# marks the start of a comment that continues up to the
1271 next closing parenthesis. Nested parentheses are not permitted. The
1272 characters that make up a comment play no part in the pattern matching
1273 at all.
1274
1275 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1276 character class introduces a comment that continues up to the next new-
1277 line character in the pattern.
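
Both kinds of comment can be seen in a sketch that assumes Python's re
module as a stand-in for PCRE, with re.VERBOSE in place of
PCRE_EXTENDED. The telephone-style strings are made-up sample data.

```python
import re

# An inline (?#...) comment plays no part in the matching.
inline = re.fullmatch(r"\d{3}(?#prefix)-\d{4}", "555-0199")

# With the extended option, "#" starts a comment that runs to the newline.
extended = re.compile(r"""
    \d{3}   # prefix
    -       # separator
    \d{4}   # line number
""", re.VERBOSE)
same = extended.fullmatch("555-0199")

# An empty comment can separate a back reference from a following digit.
delimited = re.fullmatch(r"(a)\1(?#)1", "aa1")
```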
1278
1279
1280RECURSIVE PATTERNS
1281
1282 Consider the problem of matching a string in parentheses, allowing for
1283 unlimited nested parentheses. Without the use of recursion, the best
1284 that can be done is to use a pattern that matches up to some fixed
1285 depth of nesting. It is not possible to handle an arbitrary nesting
1286 depth. Perl provides a facility that allows regular expressions to
1287 recurse (amongst other things). It does this by interpolating Perl code
1288 in the expression at run time, and the code can refer to the expression
1289 itself. A Perl pattern to solve the parentheses problem can be created
1290 like this:
1291
1292 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1293
1294 The (?p{...}) item interpolates Perl code at run time, and in this case
1295 refers recursively to the pattern in which it appears. Obviously, PCRE
1296 cannot support the interpolation of Perl code. Instead, it supports
1297 some special syntax for recursion of the entire pattern, and also for
1298 individual subpattern recursion.
1299
1300 The special item that consists of (? followed by a number greater than
1301 zero and a closing parenthesis is a recursive call of the subpattern of
1302 the given number, provided that it occurs inside that subpattern. (If
1303 not, it is a "subroutine" call, which is described in the next sec-
1304 tion.) The special item (?R) is a recursive call of the entire regular
1305 expression.
1306
1307 For example, this PCRE pattern solves the nested parentheses problem
1308 (assume the PCRE_EXTENDED option is set so that white space is
1309 ignored):
1310
1311 \( ( (?>[^()]+) | (?R) )* \)
1312
1313 First it matches an opening parenthesis. Then it matches any number of
1314 substrings which can either be a sequence of non-parentheses, or a
1315 recursive match of the pattern itself (that is, a correctly
1316 parenthesized substring). Finally there is a closing parenthesis.
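
Python's re module has no recursion syntax, so the sketch below is not a
regular expression at all. It is a hand-written function (match_parens
is a name invented here) that follows the same three steps as the
pattern, to show what the recursive call does. It says nothing about how
PCRE implements recursion.

```python
def match_parens(s, pos=0):
    """Return the offset just past a balanced group starting at pos, or None."""
    # Step 1: an opening parenthesis.
    if pos >= len(s) or s[pos] != "(":
        return None
    pos += 1
    # Step 2: any number of non-parenthesis runs or nested groups.
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            end = match_parens(s, pos)  # plays the role of (?R)
            if end is None:
                return None
            pos = end
        else:
            # Plays the role of (?>[^()]+): swallow the whole run.
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    # Step 3: a closing parenthesis.
    if pos < len(s):
        return pos + 1
    return None

balanced_end = match_parens("(ab(cd)ef)")
unbalanced = match_parens("(aaaa()")
```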
1317
1318 If this were part of a larger pattern, you would not want to recurse
1319 the entire pattern, so instead you could use this:
1320
1321 ( \( ( (?>[^()]+) | (?1) )* \) )
1322
1323 We have put the pattern into parentheses, and caused the recursion to
1324 refer to them instead of the whole pattern. In a larger pattern, keep-
1325 ing track of parenthesis numbers can be tricky. It may be more conve-
1326 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1327 which is an extension to the Python syntax that PCRE uses for named
1328 parentheses (Perl does not provide named parentheses). We could rewrite
1329 the above example as follows:
1330
1331 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1332
1333 This particular example pattern contains nested unlimited repeats, and
1334 so the use of atomic grouping for matching strings of non-parentheses
1335 is important when applying the pattern to strings that do not match.
1336 For example, when this pattern is applied to
1337
1338 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1339
1340 it yields "no match" quickly. However, if atomic grouping is not used,
1341 the match runs for a very long time indeed because there are so many
1342 different ways the + and * repeats can carve up the subject, and all
1343 have to be tested before failure can be reported.
1344
1345 At the end of a match, the values set for any capturing subpatterns are
1346 those from the outermost level of the recursion at which the subpattern
1347 value is set. If you want to obtain intermediate values, a callout
1348 function can be used (see the next section and the pcrecallout documen-
1349 tation). If the pattern above is matched against
1350
1351 (ab(cd)ef)
1352
1353 the value for the capturing parentheses is "ef", which is the last
1354 value taken on at the top level. If additional parentheses are added,
1355 giving
1356
1357 \( ( ( (?>[^()]+) | (?R) )* ) \)
1358    ^                        ^
1359    ^                        ^
1360
1361 the string they capture is "ab(cd)ef", the contents of the top level
1362 parentheses. If there are more than 15 capturing parentheses in a pat-
1363 tern, PCRE has to obtain extra memory to store data during a recursion,
1364 which it does by using pcre_malloc, freeing it via pcre_free after-
1365 wards. If no memory can be obtained, the match fails with the
1366 PCRE_ERROR_NOMEMORY error.
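
The capture values described above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than PCRE's implementation: match_group and the dictionary keys "inner" and "outer" are hypothetical names, the match is anchored at the start of the subject, and captures are recorded only at recursion depth 0 to mimic the rule that values come from the outermost level.

```python
def match_group(subject, pos, depth, captures):
    # Model of  \( ( ( (?>[^()]+) | (?R) )* ) \)  anchored at pos.
    # captures["inner"] plays the role of the repeated inner parentheses,
    # captures["outer"] the role of the added outer parentheses.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    body_start = pos
    while pos < len(subject) and subject[pos] != ")":
        item_start = pos
        if subject[pos] == "(":
            # Recursive call; captures made inside it are not kept.
            pos = match_group(subject, pos, depth + 1, captures)
            if pos is None:
                return None
        else:
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
        if depth == 0:
            # Each top-level repetition overwrites the previous value.
            captures["inner"] = subject[item_start:pos]
    if pos >= len(subject):
        return None  # no closing parenthesis
    if depth == 0:
        captures["outer"] = subject[body_start:pos]
    return pos + 1


captures = {}
end = match_group("(ab(cd)ef)", 0, 0, captures)
print(end, captures["inner"], captures["outer"])  # 10 ef ab(cd)ef
```

At the top level the inner group takes the values "ab", "(cd)" and "ef" in turn, so "ef" is what remains at the end of the match.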
1367
1368 Do not confuse the (?R) item with the condition (R), which tests for
1369 recursion. Consider this pattern, which matches text in angle brack-
1370 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1371 brackets (that is, when recursing), whereas any characters are permit-
1372 ted at the outer level.
1373
1374 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1375
1376 In this pattern, (?(R) is the start of a conditional subpattern, with
1377 two different alternatives for the recursive and non-recursive cases.
1378 The (?R) item is the actual recursive call.
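
The difference between the two cases can be illustrated with a hand-written Python matcher. It is only a sketch of the language the pattern accepts, not of how PCRE evaluates it: match_angle and its recursing flag are hypothetical names, the flag stands in for the (R) condition, and the match is assumed to be anchored at the start of the subject.

```python
DIGITS = "0123456789"


def match_angle(subject, pos=0, recursing=False):
    # Model of  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >  anchored at pos.
    # Returns the offset just past the closing bracket, or None.
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # The recursive call (?R): everything inside is "recursing".
            pos = match_angle(subject, pos, recursing=True)
            if pos is None:
                return None
        elif recursing and ch not in DIGITS:
            # Condition (R) is true here, so only digits are allowed.
            return None
        else:
            pos += 1
    return None  # no closing bracket


print(match_angle("<abc<123>def>"))  # 13
print(match_angle("<abc<12x>def>"))  # None
```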
1379
1380
1381SUBPATTERNS AS SUBROUTINES
1382
1383 If the syntax for a recursive subpattern reference (either by number or
1384 by name) is used outside the parentheses to which it refers, it oper-
1385 ates like a subroutine in a programming language. An earlier example
1386 pointed out that the pattern
1387
1388 (sens|respons)e and \1ibility
1389
1390 matches "sense and sensibility" and "response and responsibility", but
1391 not "sense and responsibility". If instead the pattern
1392
1393 (sens|respons)e and (?1)ibility
1394
1395 is used, it does match "sense and responsibility" as well as the other
1396 two strings. Such references must, however, follow the subpattern to
1397 which they refer.
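
The contrast between a back reference and a subroutine call can be tried out in Python. Python's standard re module supports \1 but has no (?1) syntax, so the second pattern below only emulates the call by repeating the text of the subpattern; the names SUB, backref and subroutine_like are chosen for this sketch.

```python
import re

# The alternatives inside the first pair of parentheses.
SUB = r"sens|respons"

# \1 must match the same text that group 1 matched.
backref = re.compile(rf"({SUB})e and \1ibility")

# Emulation of (?1): the subpattern itself is applied again, so it may
# match different text from the first time.
subroutine_like = re.compile(rf"({SUB})e and (?:{SUB})ibility")

for text in ("sense and sensibility",
             "response and responsibility",
             "sense and responsibility"):
    print(text,
          bool(backref.fullmatch(text)),
          bool(subroutine_like.fullmatch(text)))
```

Only the last subject distinguishes the two: the back reference rejects it, whereas the emulated subroutine call accepts it.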
1398
1399
1400CALLOUTS
1401
1402 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1403 Perl code to be obeyed in the middle of matching a regular expression.
1404 This makes it possible, amongst other things, to extract different sub-
1405 strings that match the same pair of parentheses when there is a repeti-
1406 tion.
1407
1408 PCRE provides a similar feature, but of course it cannot obey arbitrary
1409 Perl code. The feature is called "callout". The caller of PCRE provides
1410 an external function by putting its entry point in the global variable
1411 pcre_callout. By default, this variable contains NULL, which disables
1412 all calling out.
1413
1414 Within a regular expression, (?C) indicates the points at which the
1415 external function is to be called. If you want to identify different
1416 callout points, you can put a number less than 256 after the letter C.
1417 The default value is zero. For example, this pattern has two callout
1418 points:
1419
1420 (?C1)abc(?C2)def
1421
1422 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1423 automatically installed before each item in the pattern. They are all
1424 numbered 255.
1425
1426 During matching, when PCRE reaches a callout point (and pcre_callout is
1427 set), the external function is called. It is provided with the number
1428 of the callout, the position in the pattern, and, optionally, one item
1429 of data originally supplied by the caller of pcre_exec(). The callout
1430 function may cause matching to proceed, to backtrack, or to fail alto-
1431 gether. A complete description of the interface to the callout function
1432 is given in the pcrecallout documentation.
1433
1434Last updated: 28 February 2005
1435Copyright (c) 1997-2005 University of Cambridge.