1This file contains the PCRE man page that describes the regular expressions
2supported by PCRE version 6.0. Note that not all of the features are relevant
3in the context of Exim. In particular, the version of PCRE that is compiled
4with Exim does not include UTF-8 support, there is no mechanism for changing
5the options with which the PCRE functions are called, and features such as
6callout are not accessible.
7-----------------------------------------------------------------------------
8
9
10
11NAME
12 PCRE - Perl-compatible regular expressions
13
14
15PCRE REGULAR EXPRESSION DETAILS
16
17 The syntax and semantics of the regular expressions supported by PCRE
18 are described below. Regular expressions are also described in the Perl
19 documentation and in a number of books, some of which have copious
20 examples. Jeffrey Friedl's "Mastering Regular Expressions", published
21 by O'Reilly, covers regular expressions in great detail. This descrip-
22 tion of PCRE's regular expressions is intended as reference material.
23
24 The original operation of PCRE was on strings of one-byte characters.
25 However, there is now also support for UTF-8 character strings. To use
26 this, you must build PCRE to include UTF-8 support, and then call
27 pcre_compile() with the PCRE_UTF8 option. How this affects pattern
28 matching is mentioned in several places below. There is also a summary
29 of UTF-8 features in the section on UTF-8 support in the main pcre
30 page.
31
32 The remainder of this document discusses the patterns that are sup-
33 ported by PCRE when its main matching function, pcre_exec(), is used.
34 From release 6.0, PCRE offers a second matching function,
35 pcre_dfa_exec(), which matches using a different algorithm that is not
36 Perl-compatible. The advantages and disadvantages of the alternative
37 function, and how it differs from the normal function, are discussed in
38 the pcrematching page.
39
40 A regular expression is a pattern that is matched against a subject
41 string from left to right. Most characters stand for themselves in a
42 pattern, and match the corresponding characters in the subject. As a
43 trivial example, the pattern
44
45 The quick brown fox
46
47 matches a portion of a subject string that is identical to itself. When
48 caseless matching is specified (the PCRE_CASELESS option), letters are
49 matched independently of case. In UTF-8 mode, PCRE always understands
50 the concept of case for characters whose values are less than 128, so
51 caseless matching is always possible. For characters with higher val-
52 ues, the concept of case is supported if PCRE is compiled with Unicode
53 property support, but not otherwise. If you want to use caseless
54 matching for characters 128 and above, you must ensure that PCRE is
55 compiled with Unicode property support as well as with UTF-8 support.
56
57 The power of regular expressions comes from the ability to include
58 alternatives and repetitions in the pattern. These are encoded in the
59 pattern by the use of metacharacters, which do not stand for themselves
60 but instead are interpreted in some special way.
61
62 There are two different sets of metacharacters: those that are recog-
63 nized anywhere in the pattern except within square brackets, and those
64 that are recognized in square brackets. Outside square brackets, the
65 metacharacters are as follows:
66
67 \ general escape character with several uses
68 ^ assert start of string (or line, in multiline mode)
69 $ assert end of string (or line, in multiline mode)
70 . match any character except newline (by default)
71 [ start character class definition
72 | start of alternative branch
73 ( start subpattern
74 ) end subpattern
75 ? extends the meaning of (
76 also 0 or 1 quantifier
77 also quantifier minimizer
78 * 0 or more quantifier
79 + 1 or more quantifier
80 also "possessive quantifier"
81 { start min/max quantifier
82
83 Part of a pattern that is in square brackets is called a "character
84 class". In a character class the only metacharacters are:
85
86 \ general escape character
87 ^ negate the class, but only if the first character
88 - indicates character range
89 [ POSIX character class (only if followed by POSIX
90 syntax)
91 ] terminates the character class
92
93 The following sections describe the use of each of the metacharacters.
94
95
96BACKSLASH
97
98 The backslash character has several uses. Firstly, if it is followed by
99 a non-alphanumeric character, it takes away any special meaning that
100 character may have. This use of backslash as an escape character
101 applies both inside and outside character classes.
102
103 For example, if you want to match a * character, you write \* in the
104 pattern. This escaping action applies whether or not the following
105 character would otherwise be interpreted as a metacharacter, so it is
106 always safe to precede a non-alphanumeric with backslash to specify
107 that it stands for itself. In particular, if you want to match a back-
108 slash, you write \\.
109
110 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
111 the pattern (other than in a character class) and characters between a
112 # outside a character class and the next newline character are ignored.
113 An escaping backslash can be used to include a whitespace or # charac-
114 ter as part of the pattern.
115
116 If you want to remove the special meaning from a sequence of charac-
117 ters, you can do so by putting them between \Q and \E. This is differ-
118 ent from Perl in that $ and @ are handled as literals in \Q...\E
119 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
120 tion. Note the following examples:
121
122 Pattern PCRE matches Perl matches
123
124 \Qabc$xyz\E abc$xyz abc followed by the
125 contents of $xyz
126 \Qabc\$xyz\E abc\$xyz abc\$xyz
127 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
128
129 The \Q...\E sequence is recognized both inside and outside character
130 classes.
131
132 Non-printing characters
133
134 A second use of backslash provides a way of encoding non-printing char-
135 acters in patterns in a visible manner. There is no restriction on the
136 appearance of non-printing characters, apart from the binary zero that
137 terminates a pattern, but when a pattern is being prepared by text
138 editing, it is usually easier to use one of the following escape
139 sequences than the binary character it represents:
140
141 \a alarm, that is, the BEL character (hex 07)
142 \cx "control-x", where x is any character
143 \e escape (hex 1B)
144 \f formfeed (hex 0C)
145 \n newline (hex 0A)
146 \r carriage return (hex 0D)
147 \t tab (hex 09)
148 \ddd character with octal code ddd, or backreference
149 \xhh character with hex code hh
150 \x{hhh..} character with hex code hhh... (UTF-8 mode only)
151
152 The precise effect of \cx is as follows: if x is a lower case letter,
153 it is converted to upper case. Then bit 6 of the character (hex 40) is
154 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
155 becomes hex 7B.
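
The \cx rule above is plain arithmetic, so it can be checked by hand. The following Python sketch is illustrative only: control_escape is a hypothetical helper written for this note, not a PCRE function.

```python
def control_escape(x):
    """Return the character that \\cx stands for, following the rule above."""
    if "a" <= x <= "z":
        x = x.upper()          # a lower case letter is converted to upper case
    return chr(ord(x) ^ 0x40)  # then bit 6 (hex 40) is inverted

for x in "z{;":
    print("\\c%s -> hex %02X" % (x, ord(control_escape(x))))
```

Because the second step is an exclusive-or, applying it twice returns the original character, which is why \c{ and \c; map onto each other.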
156
157 After \x, from zero to two hexadecimal digits are read (letters can be
158 in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
159 its may appear between \x{ and }, but the value of the character code
160 must be less than 2**31 (that is, the maximum hexadecimal value is
161 7FFFFFFF). If characters other than hexadecimal digits appear between
162 \x{ and }, or if there is no terminating }, this form of escape is not
163 recognized. Instead, the initial \x will be interpreted as a basic
164 hexadecimal escape, with no following digits, giving a character whose
165 value is zero.
166
167 Characters whose value is less than 256 can be defined by either of the
168 two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
169 in the way they are handled. For example, \xdc is exactly the same as
170 \x{dc}.
171
172 After \0 up to two further octal digits are read. In both cases, if
173 there are fewer than two digits, just those that are present are used.
174 Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
175 character (code value 7). Make sure you supply two digits after the
176 initial zero if the pattern character that follows is itself an octal
177 digit.
178
179 The handling of a backslash followed by a digit other than 0 is compli-
180 cated. Outside a character class, PCRE reads it and any following dig-
181 its as a decimal number. If the number is less than 10, or if there
182 have been at least that many previous capturing left parentheses in the
183 expression, the entire sequence is taken as a back reference. A
184 description of how this works is given later, following the discussion
185 of parenthesized subpatterns.
186
187 Inside a character class, or if the decimal number is greater than 9
188 and there have not been that many capturing subpatterns, PCRE re-reads
189 up to three octal digits following the backslash, and generates a sin-
190 gle byte from the least significant 8 bits of the value. Any subsequent
191 digits stand for themselves. For example:
192
193 \040 is another way of writing a space
194 \40 is the same, provided there are fewer than 40
195 previous capturing subpatterns
196 \7 is always a back reference
197 \11 might be a back reference, or another way of
198 writing a tab
199 \011 is always a tab
200 \0113 is a tab followed by the character "3"
201 \113 might be a back reference, otherwise the
202 character with octal code 113
203 \377 might be a back reference, otherwise
204 the byte consisting entirely of 1 bits
205 \81 is either a back reference, or a binary zero
206 followed by the two characters "8" and "1"
207
208 Note that octal values of 100 or greater must not be introduced by a
209 leading zero, because no more than three octal digits are ever read.
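
The octal reading described above (used when the sequence is not taken as a back reference) can be sketched in a few lines of Python. This models only the digit-reading step; read_octal_escape is a hypothetical name, and the back reference decision, which depends on the number of capturing subpatterns, is deliberately left out.

```python
def read_octal_escape(text):
    """Read up to three octal digits from the text that follows a backslash.

    Returns (byte value, unread remainder of the text).
    """
    n = 0
    while n < 3 and n < len(text) and text[n] in "01234567":
        n += 1
    value = int(text[:n], 8) if n else 0   # no octal digits gives a binary zero
    return value & 0xFF, text[n:]          # keep the least significant 8 bits

for digits in ("040", "0113", "377", "81"):
    print(digits, read_octal_escape(digits))
```

The unread remainder corresponds to the "subsequent digits" that stand for themselves, as in the \0113 and \81 rows of the table.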
210
211 All the sequences that define a single byte value or a single UTF-8
212 character (in UTF-8 mode) can be used both inside and outside character
213 classes. In addition, inside a character class, the sequence \b is
214 interpreted as the backspace character (hex 08), and the sequence \X is
215 interpreted as the character "X". Outside a character class, these
216 sequences have different meanings (see below).
217
218 Generic character types
219
220 The third use of backslash is for specifying generic character types.
221 The following are always recognized:
222
223 \d any decimal digit
224 \D any character that is not a decimal digit
225 \s any whitespace character
226 \S any character that is not a whitespace character
227 \w any "word" character
228 \W any "non-word" character
229
230 Each pair of escape sequences partitions the complete set of characters
231 into two disjoint sets. Any given character matches one, and only one,
232 of each pair.
233
234 These character type sequences can appear both inside and outside char-
235 acter classes. They each match one character of the appropriate type.
236 If the current matching point is at the end of the subject string, all
237 of them fail, since there is no character to match.
238
239 For compatibility with Perl, \s does not match the VT character (code
240 11). This makes it different from the POSIX "space" class. The \s
241 characters are HT (9), LF (10), FF (12), CR (13), and space (32).
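
Other regular expression engines may define \s differently, so a pattern that relies on the five-character set above is not automatically portable. As a hedged illustration, the sketch below uses Python's re module (which is not PCRE) and spells the set out as an explicit class; the name PCRE_S is an assumption made for this example.

```python
import re

# The five characters listed above: HT, LF, FF, CR and space.
PCRE_S = re.compile(r"[\t\n\f\r ]")

VT = "\x0b"  # vertical tab, code 11

print(bool(PCRE_S.fullmatch(VT)))      # the explicit class leaves VT out
print(bool(re.fullmatch(r"\s", VT)))   # Python's own \s accepts VT
```

Writing the class explicitly is one way to get the same set of characters regardless of which engine runs the pattern.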
242
243 A "word" character is an underscore or any character less than 256 that
244 is a letter or digit. The definition of letters and digits is con-
245 trolled by PCRE's low-valued character tables, and may vary if locale-
246 specific matching is taking place (see "Locale support" in the pcreapi
247 page). For example, in the "fr_FR" (French) locale, some character
248 codes greater than 128 are used for accented letters, and these are
249 matched by \w.
250
251 In UTF-8 mode, characters with values greater than 128 never match \d,
252 \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
253 code character property support is available.
254
255 Unicode character properties
256
257 When PCRE is built with Unicode character property support, three addi-
258 tional escape sequences to match generic character types are available
259 when UTF-8 mode is selected. They are:
260
261 \p{xx} a character with the xx property
262 \P{xx} a character without the xx property
263 \X an extended Unicode sequence
264
265 The property names represented by xx above are limited to the Unicode
266 general category properties. Each character has exactly one such prop-
267 erty, specified by a two-letter abbreviation. For compatibility with
268 Perl, negation can be specified by including a circumflex between the
269 opening brace and the property name. For example, \p{^Lu} is the same
270 as \P{Lu}.
271
272 If only one letter is specified with \p or \P, it includes all the
273 properties that start with that letter. In this case, in the absence of
274 negation, the curly brackets in the escape sequence are optional; these
275 two examples have the same effect:
276
277 \p{L}
278 \pL
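
The relationship between one-letter and two-letter property names can be modelled with Python's unicodedata module, which reports the same two-letter general category abbreviations. This is a sketch under assumptions: has_property is a hypothetical helper, and the Unicode data shipped with Python may be a different version from the tables built into PCRE.

```python
import unicodedata

def has_property(ch, prop):
    """Approximate \\p{prop} for a single character."""
    category = unicodedata.category(ch)    # for example 'Lu' for 'A'
    if len(prop) == 1:
        return category.startswith(prop)   # one letter covers the whole group
    return category == prop                # two letters name one category

print(has_property("A", "L"), has_property("A", "Lu"), has_property("a", "Lu"))
```

Negation, as in \P{xx} or \p{^xx}, is simply the logical opposite of the value returned here.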
279
280 The following property codes are supported:
281
282 C Other
283 Cc Control
284 Cf Format
285 Cn Unassigned
286 Co Private use
287 Cs Surrogate
288
289 L Letter
290 Ll Lower case letter
291 Lm Modifier letter
292 Lo Other letter
293 Lt Title case letter
294 Lu Upper case letter
295
296 M Mark
297 Mc Spacing mark
298 Me Enclosing mark
299 Mn Non-spacing mark
300
301 N Number
302 Nd Decimal number
303 Nl Letter number
304 No Other number
305
306 P Punctuation
307 Pc Connector punctuation
308 Pd Dash punctuation
309 Pe Close punctuation
310 Pf Final punctuation
311 Pi Initial punctuation
312 Po Other punctuation
313 Ps Open punctuation
314
315 S Symbol
316 Sc Currency symbol
317 Sk Modifier symbol
318 Sm Mathematical symbol
319 So Other symbol
320
321 Z Separator
322 Zl Line separator
323 Zp Paragraph separator
324 Zs Space separator
325
326 Extended properties such as "Greek" or "InMusicalSymbols" are not sup-
327 ported by PCRE.
328
329 Specifying caseless matching does not affect these escape sequences.
330 For example, \p{Lu} always matches only upper case letters.
331
332 The \X escape matches any number of Unicode characters that form an
333 extended Unicode sequence. \X is equivalent to
334
335 (?>\PM\pM*)
336
337 That is, it matches a character without the "mark" property, followed
338 by zero or more characters with the "mark" property, and treats the
339 sequence as an atomic group (see below). Characters with the "mark"
340 property are typically accents that affect the preceding character.
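
To make the \PM\pM* definition concrete, here is a small Python model of it. It is not PCRE code: match_extended_sequence and is_mark are hypothetical helpers, and the "mark" test relies on Python's unicodedata categories. The loop takes every following mark and never gives one back, which is the behaviour the atomic group provides.

```python
import unicodedata

def is_mark(ch):
    """True for characters whose general category starts with M."""
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, pos=0):
    """Return the end index of a \\PM\\pM* match at pos, or None."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                   # \PM needs one character that is not a mark
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                      # \pM* takes every mark that follows
    return end

subject = "e\u0301x"                  # "e", a combining acute accent, then "x"
print(match_extended_sequence(subject, 0), match_extended_sequence(subject, 2))
```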
341
342 Matching characters by Unicode property is not fast, because PCRE has
343 to search a structure that contains data for over fifteen thousand
344 characters. That is why the traditional escape sequences such as \d and
345 \w do not use Unicode properties in PCRE.
346
347 Simple assertions
348
349 The fourth use of backslash is for certain simple assertions. An asser-
350 tion specifies a condition that has to be met at a particular point in
351 a match, without consuming any characters from the subject string. The
352 use of subpatterns for more complicated assertions is described below.
353 The backslashed assertions are:
354
355 \b matches at a word boundary
356 \B matches when not at a word boundary
357 \A matches at start of subject
358 \Z matches at end of subject or before newline at end
359 \z matches at end of subject
360 \G matches at first matching position in subject
361
362 These assertions may not appear in character classes (but note that \b
363 has a different meaning, namely the backspace character, inside a char-
364 acter class).
365
366 A word boundary is a position in the subject string where the current
367 character and the previous character do not both match \w or \W (i.e.
368 one matches \w and the other matches \W), or the start or end of the
369 string if the first or last character matches \w, respectively.
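
The word boundary definition above translates directly into a position test. The Python sketch below is an illustration, not PCRE's implementation: is_word_char assumes the default tables (underscore plus ASCII letters and digits), and both function names are hypothetical. The final line cross-checks the result against \b in Python's re module with the re.ASCII flag.

```python
import re

def is_word_char(ch):
    """Assumed \\w: underscore or an ASCII letter or digit."""
    return ch == "_" or (ch.isascii() and ch.isalnum())

def is_word_boundary(subject, pos):
    """True if exactly one of the characters around pos is a word character."""
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

subject = "ab cd"
boundaries = [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]
print(boundaries)
print([m.start() for m in re.finditer(r"\b", subject, re.ASCII)])
```

Treating "no character" at either end of the string as a non-word character covers the start and end cases without special handling.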
370
371 The \A, \Z, and \z assertions differ from the traditional circumflex
372 and dollar (described in the next section) in that they only ever match
373 at the very start and end of the subject string, whatever options are
374 set. Thus, they are independent of multiline mode. These three asser-
375 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
376 affect only the behaviour of the circumflex and dollar metacharacters.
377 However, if the startoffset argument of pcre_exec() is non-zero, indi-
378 cating that matching is to start at a point other than the beginning of
379 the subject, \A can never match. The difference between \Z and \z is
380 that \Z matches before a newline that is the last character of the
381 string as well as at the end of the string, whereas \z matches only at
382 the end.
383
384 The \G assertion is true only when the current matching position is at
385 the start point of the match, as specified by the startoffset argument
386 of pcre_exec(). It differs from \A when the value of startoffset is
387 non-zero. By calling pcre_exec() multiple times with appropriate argu-
388 ments, you can mimic Perl's /g option, and it is in this kind of imple-
389 mentation where \G can be useful.
390
391 Note, however, that PCRE's interpretation of \G, as the start of the
392 current match, is subtly different from Perl's, which defines it as the
393 end of the previous match. In Perl, these can be different when the
394 previously matched string was empty. Because PCRE does just one match
395 at a time, it cannot reproduce this behaviour.
396
397 If all the alternatives of a pattern begin with \G, the expression is
398 anchored to the starting match position, and the "anchored" flag is set
399 in the compiled regular expression.
400
401
402CIRCUMFLEX AND DOLLAR
403
404 Outside a character class, in the default matching mode, the circumflex
405 character is an assertion that is true only if the current matching
406 point is at the start of the subject string. If the startoffset argu-
407 ment of pcre_exec() is non-zero, circumflex can never match if the
408 PCRE_MULTILINE option is unset. Inside a character class, circumflex
409 has an entirely different meaning (see below).
410
411 Circumflex need not be the first character of the pattern if a number
412 of alternatives are involved, but it should be the first thing in each
413 alternative in which it appears if the pattern is ever to match that
414 branch. If all possible alternatives start with a circumflex, that is,
415 if the pattern is constrained to match only at the start of the sub-
416 ject, it is said to be an "anchored" pattern. (There are also other
417 constructs that can cause a pattern to be anchored.)
418
419 A dollar character is an assertion that is true only if the current
420 matching point is at the end of the subject string, or immediately
421 before a newline character that is the last character in the string (by
422 default). Dollar need not be the last character of the pattern if a
423 number of alternatives are involved, but it should be the last item in
424 any branch in which it appears. Dollar has no special meaning in a
425 character class.
426
427 The meaning of dollar can be changed so that it matches only at the
428 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
429 compile time. This does not affect the \Z assertion.
430
431 The meanings of the circumflex and dollar characters are changed if the
432 PCRE_MULTILINE option is set. When this is the case, they match immedi-
433 ately after and immediately before an internal newline character,
434 respectively, in addition to matching at the start and end of the sub-
435 ject string. For example, the pattern /^abc$/ matches the subject
436 string "def\nabc" (where \n represents a newline character) in multi-
437 line mode, but not otherwise. Consequently, patterns that are anchored
438 in single line mode because all branches start with ^ are not anchored
439 in multiline mode, and a match for circumflex is possible when the
440 startoffset argument of pcre_exec() is non-zero. The PCRE_DOL-
441 LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
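
The /^abc$/ example can be tried in any engine that follows the same circumflex and dollar conventions. The sketch below uses Python's re module, where re.MULTILINE corresponds to the multiline behaviour described here; it is an illustration of the shared semantics, not a test of PCRE itself.

```python
import re

subject = "def\nabc"

default_mode = re.search(r"^abc$", subject)                  # no match
multiline_mode = re.search(r"^abc$", subject, re.MULTILINE)  # matches "abc"

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(default_mode, multiline_mode, before_final_newline)
```

Only circumflex and dollar are compared here; the \Z and \z assertions are not spelled the same way in every engine and are left out.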
442
443 Note that the sequences \A, \Z, and \z can be used to match the start
444 and end of the subject in both modes, and if all branches of a pattern
445 start with \A it is always anchored, whether PCRE_MULTILINE is set or
446 not.
447
448
449FULL STOP (PERIOD, DOT)
450
451 Outside a character class, a dot in the pattern matches any one charac-
452 ter in the subject, including a non-printing character, but not (by
453 default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
454 which might be more than one byte long, except (by default) newline. If
455 the PCRE_DOTALL option is set, dots match newlines as well. The han-
456 dling of dot is entirely independent of the handling of circumflex and
457 dollar, the only relationship being that they both involve newline
458 characters. Dot has no special meaning in a character class.
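
A short check of the dot rules, again using Python's re module as a stand-in (its re.DOTALL flag plays the part of PCRE_DOTALL). The variable names are chosen only for this example.

```python
import re

no_newline = re.search(r"a.c", "a\nc")                # dot refuses the newline
with_dotall = re.search(r"a.c", "a\nc", re.DOTALL)    # now the newline matches
non_printing = re.search(r"a.c", "a\x07c")            # a BEL character is fine
in_class = re.fullmatch(r"[.]", "x")                  # dot in a class is literal

print(no_newline, with_dotall, non_printing, in_class)
```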
459
460
461MATCHING A SINGLE BYTE
462
463 Outside a character class, the escape sequence \C matches any one byte,
464 both in and out of UTF-8 mode. Unlike a dot, it can match a newline.
465 The feature is provided in Perl in order to match individual bytes in
466 UTF-8 mode. Because it breaks up UTF-8 characters into individual
467 bytes, what remains in the string may be a malformed UTF-8 string. For
468 this reason, the \C escape sequence is best avoided.
469
470 PCRE does not allow \C to appear in lookbehind assertions (described
471 below), because in UTF-8 mode this would make it impossible to calcu-
472 late the length of the lookbehind.
473
474
475SQUARE BRACKETS AND CHARACTER CLASSES
476
477 An opening square bracket introduces a character class, terminated by a
478 closing square bracket. A closing square bracket on its own is not spe-
479 cial. If a closing square bracket is required as a member of the class,
480 it should be the first data character in the class (after an initial
481 circumflex, if present) or escaped with a backslash.
482
483 A character class matches a single character in the subject. In UTF-8
484 mode, the character may occupy more than one byte. A matched character
485 must be in the set of characters defined by the class, unless the first
486 character in the class definition is a circumflex, in which case the
487 subject character must not be in the set defined by the class. If a
488 circumflex is actually required as a member of the class, ensure it is
489 not the first character, or escape it with a backslash.
490
491 For example, the character class [aeiou] matches any lower case vowel,
492 while [^aeiou] matches any character that is not a lower case vowel.
493 Note that a circumflex is just a convenient notation for specifying the
494 characters that are in the class by enumerating those that are not. A
495 class that starts with a circumflex is not an assertion: it still con-
496 sumes a character from the subject string, and therefore it fails if
497 the current pointer is at the end of the string.
498
499 In UTF-8 mode, characters with values greater than 255 can be included
500 in a class as a literal string of bytes, or by using the \x{ escaping
501 mechanism.
502
503 When caseless matching is set, any letters in a class represent both
504 their upper case and lower case versions, so for example, a caseless
505 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
506 match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
507 understands the concept of case for characters whose values are less
508 than 128, so caseless matching is always possible. For characters with
509 higher values, the concept of case is supported if PCRE is compiled
510 with Unicode property support, but not otherwise. If you want to use
511 caseless matching for characters 128 and above, you must ensure that
512 PCRE is compiled with Unicode property support as well as with UTF-8
513 support.
514
515 The newline character is never treated in any special way in character
516 classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
517 options is. A class such as [^a] will always match a newline.
518
519 The minus (hyphen) character can be used to specify a range of charac-
520 ters in a character class. For example, [d-m] matches any letter
521 between d and m, inclusive. If a minus character is required in a
522 class, it must be escaped with a backslash or appear in a position
523 where it cannot be interpreted as indicating a range, typically as the
524 first or last character in the class.
525
526 It is not possible to have the literal character "]" as the end charac-
527 ter of a range. A pattern such as [W-]46] is interpreted as a class of
528 two characters ("W" and "-") followed by a literal string "46]", so it
529 would match "W46]" or "-46]". However, if the "]" is escaped with a
530 backslash it is interpreted as the end of range, so [W-\]46] is inter-
531 preted as a class containing a range followed by two other characters.
532 The octal or hexadecimal representation of "]" can also be used to end
533 a range.
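
The two readings of "]" next to a hyphen are easy to confuse, so a quick experiment helps. Python's re module parses these two classes the same way as described above and is used here purely as an illustration; the variable names are hypothetical.

```python
import re

literal_tail = re.compile(r"[W-]46]")   # class of "W" and "-", then literal "46]"
escaped_end = re.compile(r"[W-\]46]")   # range W to ], plus "4" and "6"

print([s for s in ("W46]", "-46]", "X46]") if literal_tail.fullmatch(s)])
print([s for s in ("W", "X", "]", "4", "6", "-") if escaped_end.fullmatch(s)])
```

In the second pattern the hyphen is consumed as the range operator, so "-" itself is no longer a member of the class.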
534
535 Ranges operate in the collating sequence of character values. They can
536 also be used for characters specified numerically, for example
537 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
538 are greater than 255, for example [\x{100}-\x{2ff}].
539
540 If a range that includes letters is used when caseless matching is set,
541 it matches the letters in either case. For example, [W-c] is equivalent
542 to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
543 character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
544 accented E characters in both cases. In UTF-8 mode, PCRE supports the
545 concept of case for characters with values greater than 128 only when
546 it is compiled with Unicode property support.
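
The [W-c] equivalence follows from listing the character codes in the range. The Python sketch below enumerates them and then checks the caseless behaviour with re.IGNORECASE; it assumes ASCII character values and uses Python's re module rather than PCRE.

```python
import re

# Every character whose code lies between "W" and "c" inclusive.
members = "".join(chr(code) for code in range(ord("W"), ord("c") + 1))
print(members)

# Caseless matching also admits the other case of each letter in the range.
caseless = re.compile(r"[W-c]", re.IGNORECASE)
extra = [ch for ch in "wxyzABC" if caseless.fullmatch(ch)]
print(extra)
```

The range picks up six punctuation characters that sit between the upper case and lower case letters, which is why they appear in the equivalent class.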
547
548 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
549 in a character class, and add the characters that they match to the
550 class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
551 flex can conveniently be used with the upper case character types to
552 specify a more restricted set of characters than the matching lower
553 case type. For example, the class [^\W_] matches any letter or digit,
554 but not underscore.
555
556 The only metacharacters that are recognized in character classes are
557 backslash, hyphen (only where it can be interpreted as specifying a
558 range), circumflex (only at the start), opening square bracket (only
559 when it can be interpreted as introducing a POSIX class name - see the
560 next section), and the terminating closing square bracket. However,
561 escaping other non-alphanumeric characters does no harm.
562
563
564POSIX CHARACTER CLASSES
565
566 Perl supports the POSIX notation for character classes. This uses names
567 enclosed by [: and :] within the enclosing square brackets. PCRE also
568 supports this notation. For example,
569
570 [01[:alpha:]%]
571
572 matches "0", "1", any alphabetic character, or "%". The supported class
573 names are
574
575 alnum letters and digits
576 alpha letters
577 ascii character codes 0 - 127
578 blank space or tab only
579 cntrl control characters
580 digit decimal digits (same as \d)
581 graph printing characters, excluding space
582 lower lower case letters
583 print printing characters, including space
584 punct printing characters, excluding letters and digits
585 space white space (not quite the same as \s)
586 upper upper case letters
587 word "word" characters (same as \w)
588 xdigit hexadecimal digits
589
590 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
591 and space (32). Notice that this list includes the VT character (code
592 11). This makes "space" different to \s, which does not include VT (for
593 Perl compatibility).
594
595 The name "word" is a Perl extension, and "blank" is a GNU extension
596 from Perl 5.8. Another Perl extension is negation, which is indicated
597 by a ^ character after the colon. For example,
598
599 [12[:^digit:]]
600
601 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
602 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
603 these are not supported, and an error is given if they are encountered.
604
605 In UTF-8 mode, characters with values greater than 128 do not match any
606 of the POSIX character classes.
607
608
609VERTICAL BAR
610
611 Vertical bar characters are used to separate alternative patterns. For
612 example, the pattern
613
614 gilbert|sullivan
615
616 matches either "gilbert" or "sullivan". Any number of alternatives may
617 appear, and an empty alternative is permitted (matching the empty
618 string). The matching process tries each alternative in turn, from
619 left to right, and the first one that succeeds is used. If the alterna-
620 tives are within a subpattern (defined below), "succeeds" means match-
621 ing the rest of the main pattern as well as the alternative in the sub-
622 pattern.
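
Two small cases show the ordering rule and the meaning of "succeeds" inside a subpattern. They are run with Python's re module, which tries alternatives from left to right in the same way; this is an illustration, not PCRE output.

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern the rest of the pattern must match too, so "a" is
# given up in favour of "ab" when the following "c" cannot otherwise match.
needs_rest = re.fullmatch(r"(a|ab)c", "abc").group(1)

print(first_wins, needs_rest)
```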
623
624
625INTERNAL OPTION SETTING
626
627 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
628 PCRE_EXTENDED options can be changed from within the pattern by a
629 sequence of Perl option letters enclosed between "(?" and ")". The
630 option letters are
631
632 i for PCRE_CASELESS
633 m for PCRE_MULTILINE
634 s for PCRE_DOTALL
635 x for PCRE_EXTENDED
636
637 For example, (?im) sets caseless, multiline matching. It is also possi-
638 ble to unset these options by preceding the letter with a hyphen, and a
639 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
640 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
641 is also permitted. If a letter appears both before and after the
642 hyphen, the option is unset.
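
The set and unset rule can be written down as a small parser. This Python sketch only models the four option letters listed above and is not how PCRE processes them internally; parse_option_letters and OPTION_NAMES are hypothetical names.

```python
OPTION_NAMES = {
    "i": "PCRE_CASELESS",
    "m": "PCRE_MULTILINE",
    "s": "PCRE_DOTALL",
    "x": "PCRE_EXTENDED",
}

def parse_option_letters(letters):
    """Split the text between "(?" and ")" into (options set, options unset)."""
    before, _, after = letters.partition("-")
    unset = {OPTION_NAMES[c] for c in after}
    # A letter that appears on both sides of the hyphen ends up unset.
    enabled = {OPTION_NAMES[c] for c in before} - unset
    return enabled, unset

print(parse_option_letters("im-sx"))
```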
643
644 When an option change occurs at top level (that is, not inside subpat-
645 tern parentheses), the change applies to the remainder of the pattern
646 that follows. If the change is placed right at the start of a pattern,
647 PCRE extracts it into the global options (and it will therefore show up
648 in data extracted by the pcre_fullinfo() function).
649
650 An option change within a subpattern affects only that part of the cur-
651 rent pattern that follows it, so
652
653 (a(?i)b)c
654
655 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
656 used). By this means, options can be made to have different settings
657 in different parts of the pattern. Any changes made in one alternative
658 do carry on into subsequent branches within the same subpattern. For
659 example,
660
661 (a(?i)b|c)
662
663 matches "ab", "aB", "c", and "C", even though when matching "C" the
664 first branch is abandoned before the option setting. This is because
665 the effects of option settings happen at compile time. There would be
666 some very weird behaviour otherwise.
667
668 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
669 in the same way as the Perl-compatible options by using the characters
670 U and X respectively. The (?X) flag setting is special in that it must
671 always occur earlier in the pattern than any of the additional features
672 it turns on, even when it is at top level. It is best to put it at the
673 start.
674
675
676SUBPATTERNS
677
678 Subpatterns are delimited by parentheses (round brackets), which can be
679 nested. Turning part of a pattern into a subpattern does two things:
680
681 1. It localizes a set of alternatives. For example, the pattern
682
683 cat(aract|erpillar|)
684
685 matches one of the words "cat", "cataract", or "caterpillar". Without
686 the parentheses, it would match "cataract", "erpillar" or the empty
687 string.
688
689 2. It sets up the subpattern as a capturing subpattern. This means
690 that, when the whole pattern matches, that portion of the subject
691 string that matched the subpattern is passed back to the caller via the
692 ovector argument of pcre_exec(). Opening parentheses are counted from
693 left to right (starting from 1) to obtain numbers for the capturing
694 subpatterns.
695
696 For example, if the string "the red king" is matched against the pat-
697 tern
698
699 the ((red|white) (king|queen))
700
701 the captured substrings are "red king", "red", and "king", and are num-
702 bered 1, 2, and 3, respectively.
703
704 The fact that plain parentheses fulfil two functions is not always
705 helpful. There are often times when a grouping subpattern is required
706 without a capturing requirement. If an opening parenthesis is followed
707 by a question mark and a colon, the subpattern does not do any captur-
708 ing, and is not counted when computing the number of any subsequent
709 capturing subpatterns. For example, if the string "the white queen" is
710 matched against the pattern
711
712 the ((?:red|white) (king|queen))
713
714 the captured substrings are "white queen" and "queen", and are numbered
715 1 and 2. The maximum number of capturing subpatterns is 65535, and the
716 maximum depth of nesting of all subpatterns, both capturing and non-
717 capturing, is 200.
718
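The numbering of capturing subpatterns can be checked with a short sketch. It uses Python's re module rather than pcre_exec() and its ovector argument, on the assumption that the two number plain and (?: parentheses identically.

```python
import re

# Three sets of capturing parentheses, numbered by opening parenthesis.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
groups_all = capturing.groups()

# (?: groups without capturing and is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
groups_nc = non_capturing.groups()
```

Here groups_all holds "red king", "red", and "king", while groups_nc holds only "white queen" and "queen".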
719 As a convenient shorthand, if any option settings are required at the
720 start of a non-capturing subpattern, the option letters may appear
721 between the "?" and the ":". Thus the two patterns
722
723 (?i:saturday|sunday)
724 (?:(?i)saturday|sunday)
725
726 match exactly the same set of strings. Because alternative branches are
727 tried from left to right, and options are not reset until the end of
728 the subpattern is reached, an option setting in one branch does affect
729 subsequent branches, so the above patterns match "SUNDAY" as well as
730 "Saturday".
731
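The shorthand form can be tried in Python's re module, which accepts the same (?i: notation. This is a sketch under the assumption that the scoping matches PCRE's; the second pattern is an illustrative addition showing that the option ends with the subpattern.

```python
import re

# The option applies to every branch inside the non-capturing subpattern.
days = re.compile(r"(?i:saturday|sunday)")
day_results = [bool(days.fullmatch(s)) for s in ("Saturday", "SUNDAY", "monday")]

# Outside the subpattern, matching is caseful again.
partial = re.compile(r"(?i:sat)urday")
partial_results = [bool(partial.fullmatch(s)) for s in ("SATurday", "SATURDAY")]
```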
732
733NAMED SUBPATTERNS
734
735 Identifying capturing parentheses by number is simple, but it can be
736 very hard to keep track of the numbers in complicated regular expres-
737 sions. Furthermore, if an expression is modified, the numbers may
738 change. To help with this difficulty, PCRE supports the naming of sub-
739 patterns, something that Perl does not provide. The Python syntax
740 (?P<name>...) is used. Names consist of alphanumeric characters and
741 underscores, and must be unique within a pattern.
742
743 Named capturing parentheses are still allocated numbers as well as
744 names. The PCRE API provides function calls for extracting the name-to-
745 number translation table from a compiled pattern. There is also a con-
746 venience function for extracting a captured substring by name. For fur-
747 ther details see the pcreapi documentation.
748
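Because the (?P<name>...) syntax comes from Python, it can be demonstrated directly there. The sketch below uses Python's re module, not the PCRE API; groupindex plays the role of the name-to-number translation table, and the pattern and names are illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2005-06")

# Named parentheses are still allocated numbers as well as names.
name_to_number = dict(date.groupindex)

# The same substring can be extracted by name or by number.
by_name = m.group("month")
by_number = m.group(2)
```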
749
750REPETITION
751
752 Repetition is specified by quantifiers, which can follow any of the
753 following items:
754
755 a literal data character
756 the . metacharacter
757 the \C escape sequence
758 the \X escape sequence (in UTF-8 mode with Unicode properties)
759 an escape such as \d that matches a single character
760 a character class
761 a back reference (see "Back references" below)
762 a parenthesized subpattern (unless it is an assertion)
763
764 The general repetition quantifier specifies a minimum and maximum num-
765 ber of permitted matches, by giving the two numbers in curly brackets
766 (braces), separated by a comma. The numbers must be less than 65536,
767 and the first must be less than or equal to the second. For example:
768
769 z{2,4}
770
771 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
772 special character. If the second number is omitted, but the comma is
773 present, there is no upper limit; if the second number and the comma
774 are both omitted, the quantifier specifies an exact number of required
775 matches. Thus
776
777 [aeiou]{3,}
778
779 matches at least 3 successive vowels, but may match many more, while
780
781 \d{8}
782
783 matches exactly 8 digits. An opening curly bracket that appears in a
784 position where a quantifier is not allowed, or one that does not match
785 the syntax of a quantifier, is taken as a literal character. For exam-
786 ple, {,6} is not a quantifier, but a literal string of four characters.
787
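The three quantifier examples above can be exercised as follows. The sketch uses Python's re module as a stand-in for PCRE. It deliberately omits the {,6} case, because Python does treat that form as a quantifier, unlike PCRE.

```python
import re

# z{2,4} accepts two, three, or four repetitions only.
z = re.compile(r"z{2,4}")
z_results = [bool(z.fullmatch("z" * n)) for n in range(1, 6)]

# {3,} has no upper limit, so it takes all five successive vowels.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()

# {8} requires exactly eight digits.
eight = bool(re.fullmatch(r"\d{8}", "20050607"))
seven = bool(re.fullmatch(r"\d{8}", "2005060"))
```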
788 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
789 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
790 acters, each of which is represented by a two-byte sequence. Similarly,
791 when Unicode property support is available, \X{3} matches three Unicode
792 extended sequences, each of which may be several bytes long (and they
793 may be of different lengths).
794
795 The quantifier {0} is permitted, causing the expression to behave as if
796 the previous item and the quantifier were not present.
797
798 For convenience (and historical compatibility) the three most common
799 quantifiers have single-character abbreviations:
800
801 * is equivalent to {0,}
802 + is equivalent to {1,}
803 ? is equivalent to {0,1}
804
805 It is possible to construct infinite loops by following a subpattern
806 that can match no characters with a quantifier that has no upper limit,
807 for example:
808
809 (a?)*
810
811 Earlier versions of Perl and PCRE used to give an error at compile time
812 for such patterns. However, because there are cases where this can be
813 useful, such patterns are now accepted, but if any repetition of the
814 subpattern does in fact match no characters, the loop is forcibly bro-
815 ken.
816
817 By default, the quantifiers are "greedy", that is, they match as much
818 as possible (up to the maximum number of permitted times), without
819 causing the rest of the pattern to fail. The classic example of where
820 this gives problems is in trying to match comments in C programs. These
821 appear between /* and */ and within the comment, individual * and /
822 characters may appear. An attempt to match C comments by applying the
823 pattern
824
825 /\*.*\*/
826
827 to the string
828
829 /* first comment */ not comment /* second comment */
830
831 fails, because it matches the entire string owing to the greediness of
832 the .* item.
833
834 However, if a quantifier is followed by a question mark, it ceases to
835 be greedy, and instead matches the minimum number of times possible, so
836 the pattern
837
838 /\*.*?\*/
839
840 does the right thing with the C comments. The meaning of the various
841 quantifiers is not otherwise changed, just the preferred number of
842 matches. Do not confuse this use of question mark with its use as a
843 quantifier in its own right. Because it has two uses, it can sometimes
844 appear doubled, as in
845
846 \d??\d
847
848 which matches one digit by preference, but can match two if that is the
849 only way the rest of the pattern matches.
850
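The difference between greedy and minimizing repetition can be seen by running both comment patterns over the subject string given earlier. This is a sketch using Python's re module, which is assumed to agree with PCRE on these constructs.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ and swallows the whole string.
greedy = re.search(r"/\*.*\*/", text).group()

# Minimizing .*? stops at the first */ each time.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two if the rest of the match needs it.
prefer_one = re.match(r"\d??\d", "42").group()
forced_two = re.fullmatch(r"\d??\d", "42").group()
```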
851 If the PCRE_UNGREEDY option is set (an option which is not available in
852 Perl), the quantifiers are not greedy by default, but individual ones
853 can be made greedy by following them with a question mark. In other
854 words, it inverts the default behaviour.
855
856 When a parenthesized subpattern is quantified with a minimum repeat
857 count that is greater than 1 or with a limited maximum, more memory is
858 required for the compiled pattern, in proportion to the size of the
859 minimum or maximum.
860
861 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
862 alent to Perl's /s) is set, thus allowing the . to match newlines, the
863 pattern is implicitly anchored, because whatever follows will be tried
864 against every character position in the subject string, so there is no
865 point in retrying the overall match at any position after the first.
866 PCRE normally treats such a pattern as though it were preceded by \A.
867
868 In cases where it is known that the subject string contains no new-
869 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
870 mization, or alternatively using ^ to indicate anchoring explicitly.
871
872 However, there is one situation where the optimization cannot be used.
873 When .* is inside capturing parentheses that are the subject of a
874 backreference elsewhere in the pattern, a match at the start may fail,
875 and a later one succeed. Consider, for example:
876
877 (.*)abc\1
878
879 If the subject is "xyz123abc123" the match point is the fourth charac-
880 ter. For this reason, such a pattern is not implicitly anchored.
881
882 When a capturing subpattern is repeated, the value captured is the sub-
883 string that matched the final iteration. For example, after
884
885 (tweedle[dume]{3}\s*)+
886
887 has matched "tweedledum tweedledee" the value of the captured substring
888 is "tweedledee". However, if there are nested capturing subpatterns,
889 the corresponding captured values may have been set in previous itera-
890 tions. For example, after
891
892 /(a|(b))+/
893
894 matches "aba" the value of the second captured substring is "b".
895
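Both examples of repeated capturing subpatterns can be reproduced in Python's re module, which appears to behave in the same way as PCRE for these patterns. The sketch is illustrative only.

```python
import re

# The captured value is the substring matched by the final iteration.
m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group keeps the value set in an earlier iteration.
nested = re.match(r"(a|(b))+", "aba").groups()
```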
896
897ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
898
899 With both maximizing and minimizing repetition, failure of what follows
900 normally causes the repeated item to be re-evaluated to see if a dif-
901 ferent number of repeats allows the rest of the pattern to match. Some-
902 times it is useful to prevent this, either to change the nature of the
903 match, or to cause it to fail earlier than it otherwise might, when the
904 author of the pattern knows there is no point in carrying on.
905
906 Consider, for example, the pattern \d+foo when applied to the subject
907 line
908
909 123456bar
910
911 After matching all 6 digits and then failing to match "foo", the normal
912 action of the matcher is to try again with only 5 digits matching the
913 \d+ item, and then with 4, and so on, before ultimately failing.
914 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
915 the means for specifying that once a subpattern has matched, it is not
916 to be re-evaluated in this way.
917
918 If we use atomic grouping for the previous example, the matcher would
919 give up immediately on failing to match "foo" the first time. The nota-
920 tion is a kind of special parenthesis, starting with (?> as in this
921 example:
922
923 (?>\d+)foo
924
925 This kind of parenthesis "locks up" the part of the pattern it con-
926 tains once it has matched, and a failure further into the pattern is
927 prevented from backtracking into it. Backtracking past it to previous
928 items, however, works as normal.
929
930 An alternative description is that a subpattern of this type matches
931 the string of characters that an identical standalone pattern would
932 match, if anchored at the current point in the subject string.
933
934 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
935 such as the above example can be thought of as a maximizing repeat that
936 must swallow everything it can. So, while both \d+ and \d+? are pre-
937 pared to adjust the number of digits they match in order to make the
938 rest of the pattern match, (?>\d+) can only match an entire sequence of
939 digits.
940
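The effect of refusing to give characters back can be imitated in engines that lack the (?> notation. The sketch below uses Python's re module and a different technique from the one described here: a lookahead captures the digits and a back reference then consumes them, which relies on the assumption that a lookahead is not re-entered once it has succeeded. It is an emulation, not PCRE's atomic group syntax.

```python
import re

# Ordinary \d+ gives back the final "5" so that the rest can match.
backtracking = re.fullmatch(r"\d+5", "12345")

# Emulated atomic group: the digits are captured as a whole, consumed
# by \1, and never shortened, so no "5" is left over to match.
atomic_like = re.fullmatch(r"(?=(\d+))\1(?:5)", "12345")
```

The first call succeeds and the second returns None, mirroring the statement that (?>\d+) can only match an entire sequence of digits.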
941 Atomic groups in general can of course contain arbitrarily complicated
942 subpatterns, and can be nested. However, when the subpattern for an
943 atomic group is just a single repeated item, as in the example above, a
944 simpler notation, called a "possessive quantifier" can be used. This
945 consists of an additional + character following a quantifier. Using
946 this notation, the previous example can be rewritten as
947
948 \d++foo
949
950 Possessive quantifiers are always greedy; the setting of the
951 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
952 simpler forms of atomic group. However, there is no difference in the
953 meaning or processing of a possessive quantifier and the equivalent
954 atomic group.
955
956 The possessive quantifier syntax is an extension to the Perl syntax. It
957 originates in Sun's Java package.
958
959 When a pattern contains an unlimited repeat inside a subpattern that
960 can itself be repeated an unlimited number of times, the use of an
961 atomic group is the only way to avoid some failing matches taking a
962 very long time indeed. The pattern
963
964 (\D+|<\d+>)*[!?]
965
966 matches an unlimited number of substrings that either consist of non-
967 digits, or digits enclosed in <>, followed by either ! or ?. When it
968 matches, it runs quickly. However, if it is applied to
969
970 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
971
972 it takes a long time before reporting failure. This is because the
973 string can be divided between the internal \D+ repeat and the external
974 * repeat in a large number of ways, and all have to be tried. (The
975 example uses [!?] rather than a single character at the end, because
976 both PCRE and Perl have an optimization that allows for fast failure
977 when a single character is used. They remember the last single charac-
978 ter that is required for a match, and fail early if it is not present
979 in the string.) If the pattern is changed so that it uses an atomic
980 group, like this:
981
982 ((?>\D+)|<\d+>)*[!?]
983
984 sequences of non-digits cannot be broken, and failure happens quickly.
985
986
987BACK REFERENCES
988
989 Outside a character class, a backslash followed by a digit greater than
990 0 (and possibly further digits) is a back reference to a capturing sub-
991 pattern earlier (that is, to its left) in the pattern, provided there
992 have been that many previous capturing left parentheses.
993
994 However, if the decimal number following the backslash is less than 10,
995 it is always taken as a back reference, and causes an error only if
996 there are not that many capturing left parentheses in the entire pat-
997 tern. In other words, the parentheses that are referenced need not be
998 to the left of the reference for numbers less than 10. See the subsec-
999 tion entitled "Non-printing characters" above for further details of
1000 the handling of digits following a backslash.
1001
1002 A back reference matches whatever actually matched the capturing sub-
1003 pattern in the current subject string, rather than anything matching
1004 the subpattern itself (see "Subpatterns as subroutines" below for a way
1005 of doing that). So the pattern
1006
1007 (sens|respons)e and \1ibility
1008
1009 matches "sense and sensibility" and "response and responsibility", but
1010 not "sense and responsibility". If caseful matching is in force at the
1011 time of the back reference, the case of letters is relevant. For exam-
1012 ple,
1013
1014 ((?i)rah)\s+\1
1015
1016 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1017 original capturing subpattern is matched caselessly.
1018
1019 Back references to named subpatterns use the Python syntax (?P=name).
1020 We could rewrite the above example as follows:
1021
1022 (?P<p1>(?i)rah)\s+(?P=p1)
1023
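The two examples above can be checked with Python's re module, from which the named syntax is taken. This is a sketch rather than PCRE itself. The option is written in the scoped form (?i:rah), an assumption made because recent Python versions do not accept (?i) in the middle of a pattern.

```python
import re

subjects = ["sense and sensibility",
            "response and responsibility",
            "sense and responsibility"]

# \1 matches what the subpattern actually matched, not the subpattern.
pair = re.compile(r"(sens|respons)e and \1ibility")
pair_results = [bool(pair.fullmatch(s)) for s in subjects]

# The capture is caseless, but the back reference itself is caseful.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
named_results = [bool(named.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]
```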
1024 There may be more than one back reference to the same subpattern. If a
1025 subpattern has not actually been used in a particular match, any back
1026 references to it always fail. For example, the pattern
1027
1028 (a|(bc))\2
1029
1030 always fails if it starts to match "a" rather than "bc". Because there
1031 may be many capturing parentheses in a pattern, all digits following
1032 the backslash are taken as part of a potential back reference number.
1033 If the pattern continues with a digit character, some delimiter must be
1034 used to terminate the back reference. If the PCRE_EXTENDED option is
1035 set, this can be whitespace. Otherwise an empty comment (see "Com-
1036 ments" below) can be used.
1037
1038 A back reference that occurs inside the parentheses to which it refers
1039 fails when the subpattern is first used, so, for example, (a\1) never
1040 matches. However, such references can be useful inside repeated sub-
1041 patterns. For example, the pattern
1042
1043 (a|b\1)+
1044
1045 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1046 ation of the subpattern, the back reference matches the character
1047 string corresponding to the previous iteration. In order for this to
1048 work, the pattern must be such that the first iteration does not need
1049 to match the back reference. This can be done using alternation, as in
1050 the example above, or by a quantifier with a minimum of zero.
1051
1052
1053ASSERTIONS
1054
1055 An assertion is a test on the characters following or preceding the
1056 current matching point that does not actually consume any characters.
1057 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1058 described above.
1059
1060 More complicated assertions are coded as subpatterns. There are two
1061 kinds: those that look ahead of the current position in the subject
1062 string, and those that look behind it. An assertion subpattern is
1063 matched in the normal way, except that it does not cause the current
1064 matching position to be changed.
1065
1066 Assertion subpatterns are not capturing subpatterns, and may not be
1067 repeated, because it makes no sense to assert the same thing several
1068 times. If any kind of assertion contains capturing subpatterns within
1069 it, these are counted for the purposes of numbering the capturing sub-
1070 patterns in the whole pattern. However, substring capturing is carried
1071 out only for positive assertions, because it does not make sense for
1072 negative assertions.
1073
1074 Lookahead assertions
1075
1076 Lookahead assertions start with (?= for positive assertions and (?! for
1077 negative assertions. For example,
1078
1079 \w+(?=;)
1080
1081 matches a word followed by a semicolon, but does not include the semi-
1082 colon in the match, and
1083
1084 foo(?!bar)
1085
1086 matches any occurrence of "foo" that is not followed by "bar". Note
1087 that the apparently similar pattern
1088
1089 (?!foo)bar
1090
1091 does not find an occurrence of "bar" that is preceded by something
1092 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1093 the assertion (?!foo) is always true when the next three characters are
1094 "bar". A lookbehind assertion is needed to achieve the other effect.
1095
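The three lookahead patterns can be compared in a short sketch. It uses Python's re module, which is assumed to share PCRE's behaviour for these assertions; the subject strings are illustrative.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "int count;").group()

# Only the "foo" that is not followed by "bar" is found.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" directly after "foo", because the
# assertion looks at the characters "bar", which are never "foo".
misleading = re.search(r"(?!foo)bar", "foobar").start()
```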
1096 If you want to force a matching failure at some point in a pattern, the
1097 most convenient way to do it is with (?!) because an empty string
1098 always matches, so an assertion that requires there not to be an empty
1099 string must always fail.
1100
1101 Lookbehind assertions
1102
1103 Lookbehind assertions start with (?<= for positive assertions and (?<!
1104 for negative assertions. For example,
1105
1106 (?<!foo)bar
1107
1108 does find an occurrence of "bar" that is not preceded by "foo". The
1109 contents of a lookbehind assertion are restricted such that all the
1110 strings it matches must have a fixed length. However, if there are sev-
1111 eral alternatives, they do not all have to have the same fixed length.
1112 Thus
1113
1114 (?<=bullock|donkey)
1115
1116 is permitted, but
1117
1118 (?<!dogs?|cats?)
1119
1120 causes an error at compile time. Branches that match different length
1121 strings are permitted only at the top level of a lookbehind assertion.
1122 This is an extension compared with Perl (at least for 5.8), which
1123 requires all branches to match the same length of string. An assertion
1124 such as
1125
1126 (?<=ab(c|de))
1127
1128 is not permitted, because its single top-level branch can match two
1129 different lengths, but it is acceptable if rewritten to use two top-
1130 level branches:
1131
1132 (?<=abc|abde)
1133
1134 The implementation of lookbehind assertions is, for each alternative,
1135 to temporarily move the current position back by the fixed width and
1136 then try to match. If there are insufficient characters before the cur-
1137 rent position, the match is deemed to fail.
1138
1139 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1140 mode) to appear in lookbehind assertions, because it makes it impossi-
1141 ble to calculate the length of the lookbehind. The \X escape, which can
1142 match different numbers of bytes, is also not permitted.
1143
1144 Atomic groups can be used in conjunction with lookbehind assertions to
1145 specify efficient matching at the end of the subject string. Consider a
1146 simple pattern such as
1147
1148 abcd$
1149
1150 when applied to a long string that does not match. Because matching
1151 proceeds from left to right, PCRE will look for each "a" in the subject
1152 and then see if what follows matches the rest of the pattern. If the
1153 pattern is specified as
1154
1155 ^.*abcd$
1156
1157 the initial .* matches the entire string at first, but when this fails
1158 (because there is no following "a"), it backtracks to match all but the
1159 last character, then all but the last two characters, and so on. Once
1160 again the search for "a" covers the entire string, from right to left,
1161 so we are no better off. However, if the pattern is written as
1162
1163 ^(?>.*)(?<=abcd)
1164
1165 or, equivalently, using the possessive quantifier syntax,
1166
1167 ^.*+(?<=abcd)
1168
1169 there can be no backtracking for the .* item; it can match only the
1170 entire string. The subsequent lookbehind assertion does a single test
1171 on the last four characters. If it fails, the match fails immediately.
1172 For long strings, this approach makes a significant difference to the
1173 processing time.
1174
1175 Using multiple assertions
1176
1177 Several assertions (of any sort) may occur in succession. For example,
1178
1179 (?<=\d{3})(?<!999)foo
1180
1181 matches "foo" preceded by three digits that are not "999". Notice that
1182 each of the assertions is applied independently at the same point in
1183 the subject string. First there is a check that the previous three
1184 characters are all digits, and then there is a check that the same
1185 three characters are not "999". This pattern does not match "foo"
1186 preceded by six characters, the first three of which are digits and the
1187 last three of which are not "999". For example, it does not match
1188 "123abcfoo". A pattern to do that is
1189
1190 (?<=\d{3}...)(?<!999)foo
1191
1192 This time the first assertion looks at the preceding six characters,
1193 checking that the first three are digits, and then the second assertion
1194 checks that the preceding three characters are not "999".
1195
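The lookbehind examples can be tried as follows. The sketch uses Python's re module and only fixed-width assertions, because Python, unlike PCRE, does not accept top-level branches of different lengths in a lookbehind.

```python
import re

# Only the "bar" that is not preceded by "foo" is found.
bar_starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

# Both assertions are applied at the same point, to the same three characters.
three = re.compile(r"(?<=\d{3})(?<!999)foo")
three_results = [bool(three.search(s)) for s in ("123foo", "999foo", "123abcfoo")]

# Here the first assertion looks back over six characters instead.
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")
six_results = [bool(six.search(s)) for s in ("123abcfoo", "123999foo", "123foo")]
```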
1196 Assertions can be nested in any combination. For example,
1197
1198 (?<=(?<!foo)bar)baz
1199
1200 matches an occurrence of "baz" that is preceded by "bar" which in turn
1201 is not preceded by "foo", while
1202
1203 (?<=\d{3}(?!999)...)foo
1204
1205 is another pattern that matches "foo" preceded by three digits and any
1206 three characters that are not "999".
1207
1208
1209CONDITIONAL SUBPATTERNS
1210
1211 It is possible to cause the matching process to obey a subpattern con-
1212 ditionally or to choose between two alternative subpatterns, depending
1213 on the result of an assertion, or whether a previous capturing subpat-
1214 tern matched or not. The two possible forms of conditional subpattern
1215 are
1216
1217 (?(condition)yes-pattern)
1218 (?(condition)yes-pattern|no-pattern)
1219
1220 If the condition is satisfied, the yes-pattern is used; otherwise the
1221 no-pattern (if present) is used. If there are more than two alterna-
1222 tives in the subpattern, a compile-time error occurs.
1223
1224 There are three kinds of condition. If the text between the parentheses
1225 consists of a sequence of digits, the condition is satisfied if the
1226 capturing subpattern of that number has previously matched. The number
1227 must be greater than zero. Consider the following pattern, which con-
1228 tains non-significant white space to make it more readable (assume the
1229 PCRE_EXTENDED option) and to divide it into three parts for ease of
1230 discussion:
1231
1232 ( \( )? [^()]+ (?(1) \) )
1233
1234 The first part matches an optional opening parenthesis, and if that
1235 character is present, sets it as the first captured substring. The sec-
1236 ond part matches one or more characters that are not parentheses. The
1237 third part is a conditional subpattern that tests whether the first set
1238 of parentheses matched or not. If they did, that is, if the subject started
1239 with an opening parenthesis, the condition is true, and so the yes-pat-
1240 tern is executed and a closing parenthesis is required. Otherwise,
1241 since no-pattern is not present, the subpattern matches nothing. In
1242 other words, this pattern matches a sequence of non-parentheses,
1243 optionally enclosed in parentheses.
1244
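The conditional pattern above can be run unchanged in Python's re module, with re.VERBOSE standing in for PCRE_EXTENDED. This is a sketch. Python supports only the numeric form of condition, so the assertion form described below is not shown.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]
results = [bool(optional_parens.fullmatch(s)) for s in subjects]
```

Only the first two subjects match in full: the parentheses must be either both present or both absent.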
1245 If the condition is the string (R), it is satisfied if a recursive call
1246 to the pattern or subpattern has been made. At "top level", the
1247 condition is false. This is a PCRE extension. Recursive patterns are
1248 described in the section entitled "Recursive patterns" below.
1249
1250 If the condition is not a sequence of digits or (R), it must be an
1251 assertion. This may be a positive or negative lookahead or lookbehind
1252 assertion. Consider this pattern, again containing non-significant
1253 white space, and with the two alternatives on the second line:
1254
1255 (?(?=[^a-z]*[a-z])
1256 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1257
1258 The condition is a positive lookahead assertion that matches an
1259 optional sequence of non-letters followed by a letter. In other words,
1260 it tests for the presence of at least one letter in the subject. If a
1261 letter is found, the subject is matched against the first alternative;
1262 otherwise it is matched against the second. This pattern matches
1263 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1264 letters and dd are digits.
1265
1266
1267COMMENTS
1268
1269 The sequence (?# marks the start of a comment that continues up to the
1270 next closing parenthesis. Nested parentheses are not permitted. The
1271 characters that make up a comment play no part in the pattern matching
1272 at all.
1273
1274 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1275 character class introduces a comment that continues up to the next new-
1276 line character in the pattern.
1277
1278
1279RECURSIVE PATTERNS
1280
1281 Consider the problem of matching a string in parentheses, allowing for
1282 unlimited nested parentheses. Without the use of recursion, the best
1283 that can be done is to use a pattern that matches up to some fixed
1284 depth of nesting. It is not possible to handle an arbitrary nesting
1285 depth. Perl provides a facility that allows regular expressions to
1286 recurse (amongst other things). It does this by interpolating Perl code
1287 in the expression at run time, and the code can refer to the expression
1288 itself. A Perl pattern to solve the parentheses problem can be created
1289 like this:
1290
1291 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1292
1293 The (?p{...}) item interpolates Perl code at run time, and in this case
1294 refers recursively to the pattern in which it appears. Obviously, PCRE
1295 cannot support the interpolation of Perl code. Instead, it supports
1296 some special syntax for recursion of the entire pattern, and also for
1297 individual subpattern recursion.
1298
1299 The special item that consists of (? followed by a number greater than
1300 zero and a closing parenthesis is a recursive call of the subpattern of
1301 the given number, provided that it occurs inside that subpattern. (If
1302 not, it is a "subroutine" call, which is described in the next sec-
1303 tion.) The special item (?R) is a recursive call of the entire regular
1304 expression.
1305
1306 For example, this PCRE pattern solves the nested parentheses problem
1307 (assume the PCRE_EXTENDED option is set so that white space is
1308 ignored):
1309
1310 \( ( (?>[^()]+) | (?R) )* \)
1311
1312 First it matches an opening parenthesis. Then it matches any number of
1313 substrings which can either be a sequence of non-parentheses, or a
1314 recursive match of the pattern itself (that is, a correctly
1315 parenthesized substring). Finally there is a closing parenthesis.
1316
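The structure of the recursive pattern can be mirrored by an ordinary recursive function. Python's re module has no (?R) item, so the sketch below is a hand-written matcher, not a regular expression. The function name match_parens is hypothetical, and it models only a match anchored at a given position.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # First an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Finally there is a closing parenthesis.
            return pos + 1
        if ch == "(":
            # Recursive match of a correctly parenthesized substring.
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # Part of a sequence of non-parentheses.
            pos += 1
    return None  # ran out of subject before the closing parenthesis


balanced_end = match_parens("(ab(cd)ef)")
unbalanced = match_parens("(aaaa()")
```

Because each character is examined once, this function fails quickly on unbalanced input without needing anything like an atomic group.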
1317 If this were part of a larger pattern, you would not want to recurse
1318 the entire pattern, so instead you could use this:
1319
1320 ( \( ( (?>[^()]+) | (?1) )* \) )
1321
1322 We have put the pattern into parentheses, and caused the recursion to
1323 refer to them instead of the whole pattern. In a larger pattern, keep-
1324 ing track of parenthesis numbers can be tricky. It may be more conve-
1325 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1326 which is an extension to the Python syntax that PCRE uses for named
1327 parentheses (Perl does not provide named parentheses). We could rewrite
1328 the above example as follows:
1329
1330 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1331
1332 This particular example pattern contains nested unlimited repeats, and
1333 so the use of atomic grouping for matching strings of non-parentheses
1334 is important when applying the pattern to strings that do not match.
1335 For example, when this pattern is applied to
1336
1337 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1338
1339 it yields "no match" quickly. However, if atomic grouping is not used,
1340 the match runs for a very long time indeed because there are so many
1341 different ways the + and * repeats can carve up the subject, and all
1342 have to be tested before failure can be reported.
1343
1344 At the end of a match, the values set for any capturing subpatterns are
1345 those from the outermost level of the recursion at which the subpattern
1346 value is set. If you want to obtain intermediate values, a callout
1347 function can be used (see the section entitled "Callouts" below and the
1348 pcrecallout documentation). If the pattern above is matched against
1349
1350 (ab(cd)ef)
1351
1352 the value for the capturing parentheses is "ef", which is the last
1353 value taken on at the top level. If additional parentheses are added,
1354 giving
1355
1356 \( ( ( (?>[^()]+) | (?R) )* ) \)
1357    ^                        ^
1358    ^                        ^
1359
1360 the string they capture is "ab(cd)ef", the contents of the top level
1361 parentheses. If there are more than 15 capturing parentheses in a pat-
1362 tern, PCRE has to obtain extra memory to store data during a recursion,
1363 which it does by using pcre_malloc, freeing it via pcre_free after-
1364 wards. If no memory can be obtained, the match fails with the
1365 PCRE_ERROR_NOMEMORY error.
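The capture values described above can be reproduced with a hand-written recursive matcher. The Python sketch below is not PCRE code and does not call PCRE. It assumes a match anchored at a given position, and the name match_group and its return shape are invented for this example. The value last_item plays the part of the original capturing parentheses, and contents plays the part of the added pair.

```python
# Hand-written equivalent of the pattern
#     \( ( ( (?>[^()]+) | (?R) )* ) \)
# anchored at `pos`. Hypothetical helper, for illustration only.
def match_group(subject, pos=0):
    """Return (end, last_item, contents), or None if there is no match."""
    if pos >= len(subject) or subject[pos] != "(":
        return None
    start = i = pos + 1
    last_item = None
    while i < len(subject) and subject[i] != ")":
        if subject[i] == "(":
            nested = match_group(subject, i)  # plays the role of (?R)
            if nested is None:
                return None
            item_end = nested[0]
        else:
            item_end = i
            while item_end < len(subject) and subject[item_end] not in "()":
                item_end += 1  # (?>[^()]+): take the whole run at once
        # Values set inside the recursion are discarded; only the value
        # taken at this level is remembered.
        last_item = subject[i:item_end]
        i = item_end
    if i >= len(subject):
        return None  # ran out of subject before the closing parenthesis
    return i + 1, last_item, subject[start:i]


result = match_group("(ab(cd)ef)")
print(result)
```

At the top level the repeated group takes the values "ab", "(cd)" and "ef" in turn, so "ef" is what remains at the end of the match.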
1366
1367 Do not confuse the (?R) item with the condition (R), which tests for
1368 recursion. Consider this pattern, which matches text in angle brack-
1369 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1370 brackets (that is, when recursing), whereas any characters are permit-
1371 ted at the outer level.
1372
1373 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1374
1375 In this pattern, (?(R) is the start of a conditional subpattern, with
1376 two different alternatives for the recursive and non-recursive cases.
1377 The (?R) item is the actual recursive call.
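The difference between the two cases can be sketched as a recursive function that carries a flag saying whether it is inside a recursion. This is a Python illustration under stated assumptions, not PCRE code: it matches anchored at a given position rather than searching, it treats only the ASCII digits 0-9 as \d, and the name match_angle is invented for this example.

```python
# Hand-written equivalent of the pattern
#     < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
# anchored at `pos`. Hypothetical helper, for illustration only.
DIGITS = "0123456789"


def match_angle(subject, pos=0, recursing=False):
    """Return the offset just past the closing '>', or None on failure."""
    if pos >= len(subject) or subject[pos] != "<":
        return None
    i = pos + 1
    while i < len(subject) and subject[i] != ">":
        if subject[i] == "<":
            # (?R): the actual recursive call.
            i = match_angle(subject, i, recursing=True)
            if i is None:
                return None
        elif recursing:
            # (?(R) ... ) is true: only digits are allowed here.
            if subject[i] not in DIGITS:
                return None
            while i < len(subject) and subject[i] in DIGITS:
                i += 1
        else:
            # (?(R) ... ) is false: anything except angle brackets.
            while i < len(subject) and subject[i] not in "<>":
                i += 1
    return i + 1 if i < len(subject) else None


print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12a>def>"))
```

The second subject is rejected because the letter "a" appears inside a nested pair of brackets, where the recursive alternative applies.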
1378
1379
1380SUBPATTERNS AS SUBROUTINES
1381
1382 If the syntax for a recursive subpattern reference (either by number or
1383 by name) is used outside the parentheses to which it refers, it oper-
1384 ates like a subroutine in a programming language. An earlier example
1385 pointed out that the pattern
1386
1387 (sens|respons)e and \1ibility
1388
1389 matches "sense and sensibility" and "response and responsibility", but
1390 not "sense and responsibility". If instead the pattern
1391
1392 (sens|respons)e and (?1)ibility
1393
1394 is used, it does match "sense and responsibility" as well as the other
1395 two strings. Such references must, however, follow the subpattern to
1396 which they refer.
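The contrast between a back reference and a subroutine call can be tried with Python's standard re module, with one caveat: re supports \1 but has no (?1) syntax. The sketch below therefore emulates the subroutine call by repeating the text of the subpattern, which has the same effect for this example. The variable names are invented for the illustration.

```python
import re

# Back reference: \1 must match the same text that group 1 matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine call: (?1) re-runs the subpattern itself. Python's re has no
# (?1), so the subpattern text is repeated here as an emulation.
sub = "sens|respons"
subroutine = re.compile(rf"({sub})e and (?:{sub})ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# Map each subject to (matches with back reference, matches with subroutine).
results = {
    s: (bool(backref.fullmatch(s)), bool(subroutine.fullmatch(s)))
    for s in subjects
}
for subject, outcome in results.items():
    print(subject, outcome)
```

Only the third subject distinguishes the two forms: the back reference demands a second "sens", whereas the subroutine call is free to choose "respons".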
1397
1398
1399CALLOUTS
1400
1401 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1402 Perl code to be obeyed in the middle of matching a regular expression.
1403 This makes it possible, amongst other things, to extract different sub-
1404 strings that match the same pair of parentheses when there is a repeti-
1405 tion.
1406
1407 PCRE provides a similar feature, but of course it cannot obey arbitrary
1408 Perl code. The feature is called "callout". The caller of PCRE provides
1409 an external function by putting its entry point in the global variable
1410 pcre_callout. By default, this variable contains NULL, which disables
1411 all calling out.
1412
1413 Within a regular expression, (?C) indicates the points at which the
1414 external function is to be called. If you want to identify different
1415 callout points, you can put a number less than 256 after the letter C.
1416 The default value is zero. For example, this pattern has two callout
1417 points:
1418
1419 (?C1)abc(?C2)def
1420
1421 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1422 automatically installed before each item in the pattern. They are all
1423 numbered 255.
1424
1425 During matching, when PCRE reaches a callout point (and pcre_callout is
1426 set), the external function is called. It is provided with the number
1427 of the callout, the position in the pattern, and, optionally, one item
1428 of data originally supplied by the caller of pcre_exec(). The callout
1429 function may cause matching to proceed, to backtrack, or to fail alto-
1430 gether. A complete description of the interface to the callout function
1431 is given in the pcrecallout documentation.
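The flow of control can be pictured with a toy matcher written in Python. This is an emulation, not the PCRE interface: it handles only literal characters and (?C) items, and the names compile_toy and toy_exec, the argument list of the callout function, and the pattern offset passed to it are all assumptions made for this sketch. The return convention (zero to proceed, a positive value to fail at the current point, a negative value to abandon the match) follows the pcrecallout documentation as far as this sketch is aware, so consult that page for the authoritative details.

```python
import re

CALLOUT = re.compile(r"\(\?C(\d*)\)")


def compile_toy(pattern):
    """Split a literal pattern containing (?C) or (?Cn) items into a list."""
    items, pos = [], 0
    for m in CALLOUT.finditer(pattern):
        items += [("lit", ch) for ch in pattern[pos:m.start()]]
        number = int(m.group(1) or 0)  # the default callout number is zero
        if number > 255:
            raise ValueError("callout number must be less than 256")
        items.append(("callout", number, m.end()))
        pos = m.end()
    items += [("lit", ch) for ch in pattern[pos:]]
    return items


def toy_exec(items, subject, callout=None, data=None):
    """Try each start offset in turn; return (start, end) or None."""
    for start in range(len(subject) + 1):
        i = start
        for item in items:
            if item[0] == "callout":
                if callout is None:
                    continue  # no function set: callouts are disabled
                rc = callout(item[1], item[2], i, data)
                if rc < 0:
                    return None  # fail altogether
                if rc > 0:
                    break  # fail at this point; try the next start offset
            elif i < len(subject) and subject[i] == item[1]:
                i += 1
            else:
                break
        else:
            return start, i
    return None


seen = []


def record(number, pattern_pos, subject_pos, data):
    """Example callout function: note where it was called, then proceed."""
    seen.append((number, subject_pos))
    return 0


items = compile_toy("(?C1)abc(?C2)def")
result = toy_exec(items, "xxabcdef", callout=record)
print(result, seen)
```

The toy calls out at every start offset it tries. A real PCRE match may skip some of these because of its start-of-match optimizations, so the exact sequence of calls seen here should not be read as a description of PCRE's behaviour.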
1432
1433Last updated: 28 February 2005
1434Copyright (c) 1997-2005 University of Cambridge.