Install PCRE 6.7 in place of 6.2.
[exim.git] / doc / doc-txt / pcrepattern.txt
This file contains the PCRE man page that describes the regular expressions
supported by PCRE version 6.7. Note that not all of the features are relevant
in the context of Exim. In particular, the version of PCRE that is compiled
with Exim does not include UTF-8 support, there is no mechanism for changing
the options with which the PCRE functions are called, and features such as
callout are not accessible.
-----------------------------------------------------------------------------

PCREPATTERN(3)                                                  PCREPATTERN(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions supported by PCRE
       are described below. Regular expressions are also described in the Perl
       documentation and in a number of books, some of which have copious
       examples. Jeffrey Friedl's "Mastering Regular Expressions", published
       by O'Reilly, covers regular expressions in great detail. This descrip-
       tion of PCRE's regular expressions is intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 character strings. To use
       this, you must build PCRE to include UTF-8 support, and then call
       pcre_compile() with the PCRE_UTF8 option. How this affects pattern
       matching is mentioned in several places below. There is also a summary
       of UTF-8 features in the section on UTF-8 support in the main pcre
       page.

       The remainder of this document discusses the patterns that are sup-
       ported by PCRE when its main matching function, pcre_exec(), is used.
       From release 6.0, PCRE offers a second matching function,
       pcre_dfa_exec(), which matches using a different algorithm that is not
       Perl-compatible. The advantages and disadvantages of the alternative
       function, and how it differs from the normal function, are discussed in
       the pcrematching page.

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE_CASELESS option), letters are
       matched independently of case. In UTF-8 mode, PCRE always understands
       the concept of case for characters whose values are less than 128, so
       caseless matching is always possible. For characters with higher val-
       ues, the concept of case is supported if PCRE is compiled with Unicode
       property support, but not otherwise. If you want to use caseless
       matching for characters 128 and above, you must ensure that PCRE is
       compiled with Unicode property support as well as with UTF-8 support.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are recog-
       nized anywhere in the pattern except within square brackets, and those
       that are recognized in square brackets. Outside square brackets, the
       metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphanumeric character, it takes away any special meaning that
       character may have. This use of backslash as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a back-
       slash, you write \\.

       If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline are ignored. An escap-
       ing backslash can be used to include a whitespace or # character as
       part of the pattern.

       If you want to remove the special meaning from a sequence of charac-
       ters, you can do so by putting them between \Q and \E. This is differ-
       ent from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing char-
       acters in patterns in a visible manner. There is no restriction on the
       appearance of non-printing characters, apart from the binary zero that
       terminates a pattern, but when a pattern is being prepared by text
       editing, it is usually easier to use one of the following escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        newline (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh..

       The precise effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40) is
       inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
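
       The \cx rule can be checked with a few lines of arithmetic. The
       following Python sketch is an illustration only; it does not call PCRE,
       and the function name control_char_code is a hypothetical helper
       introduced here.

```python
def control_char_code(x: str) -> int:
    """Return the character code denoted by \\cx, following the rule above."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return ord(x) ^ 0x40


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_char_code(ch):02X}")
```

       The three characters tried are the ones used as examples in the
       paragraph above.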

       After \x, from zero to two hexadecimal digits are read (letters can be
       in upper or lower case). Any number of hexadecimal digits may appear
       between \x{ and }, but the value of the character code must be less
       than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
       the maximum hexadecimal value is 7FFFFFFF). If characters other than
       hexadecimal digits appear between \x{ and }, or if there is no termi-
       nating }, this form of escape is not recognized. Instead, the initial
       \x will be interpreted as a basic hexadecimal escape, with no following
       digits, giving a character whose value is zero.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x. There is no difference in the way they are han-
       dled. For example, \xdc is exactly the same as \x{dc}.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\07 specifies two binary zeros followed by a BEL character
       (code value 7). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated. Outside a character class, PCRE reads it and any following dig-
       its as a decimal number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and uses them to gen-
       erate a data character. Any subsequent digits stand for themselves. In
       non-UTF-8 mode, the value of a character specified in octal must be
       less than \400. In UTF-8 mode, values up to \777 are permitted. For
       example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
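
       The decision rule for a backslash followed by a non-zero digit, outside
       a character class, can be sketched as follows. This is a Python
       illustration of the rule as described above, not PCRE's own code; the
       function classify_digit_escape and its return format are assumptions
       made for the example, and the \400 and \777 range limits are not
       checked.

```python
def classify_digit_escape(digits: str, group_count: int):
    """Interpret the digit run that follows a backslash (first digit non-zero).

    group_count is the number of capturing left parentheses seen so far.
    Returns (kind, value, literal_tail).
    """
    number = int(digits)
    # Less than 10, or at least that many previous capturing groups:
    # the entire sequence is a back reference.
    if number < 10 or number <= group_count:
        return ("backreference", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) if octal else 0
    # Any subsequent digits stand for themselves.
    return ("character", value, digits[len(octal):])


print(classify_digit_escape("11", 0))   # tab when there are no groups
print(classify_digit_escape("81", 0))   # binary zero, then "8" and "1"
```

       With no octal digit available, as in \81, the value is zero and both
       digits are left as literals, which is the last row of the table above.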

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, the sequence \b is interpreted as the backspace character (hex
       08), and the sequence \X is interpreted as the character "X". Outside a
       character class, these sequences have different meanings (see below).

   Generic character types

       The third use of backslash is for specifying generic character types.
       The following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into two disjoint sets. Any given character matches one, and only one,
       of each pair.

       These character type sequences can appear both inside and outside char-
       acter classes. They each match one character of the appropriate type.
       If the current matching point is at the end of the subject string, all
       of them fail, since there is no character to match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If
       "use locale;" is included in a Perl script, \s may match the VT charac-
       ter. In PCRE, it never does.)

       A "word" character is an underscore or any character less than 256 that
       is a letter or digit. The definition of letters and digits is con-
       trolled by PCRE's low-valued character tables, and may vary if locale-
       specific matching is taking place (see "Locale support" in the pcreapi
       page). For example, in the "fr_FR" (French) locale, some character
       codes greater than 128 are used for accented letters, and these are
       matched by \w.

       In UTF-8 mode, characters with values greater than 128 never match \d,
       \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
       code character property support is available. The use of locales with
       Unicode is discouraged.

   Unicode character properties

       When PCRE is built with Unicode character property support, three addi-
       tional escape sequences to match character properties are available
       when UTF-8 mode is selected. They are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, and "Any", which matches
       any character (including newline). Other properties such as "InMusical-
       Symbols" are not currently supported by PCRE. Note that \P{Any} does
       not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common". The current list of scripts is:

       Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid, Cana-
       dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret,
       Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati,
       Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada,
       Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
       Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
       Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
       banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
       Ugaritic, Yi.

       Each character has exactly one general category property, specified by
       a two-letter abbreviation. For compatibility with Perl, negation can be
       specified by including a circumflex between the opening brace and the
       property name. For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties that start with that letter. In this case, in
       the absence of negation, the curly brackets in the escape sequence are
       optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".

       The long synonyms for these properties that Perl supports (such as
       \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty. Instead, this property is assumed for any code point that is not
       in the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       The \X escape matches any number of Unicode characters that form an
       extended Unicode sequence. \X is equivalent to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" property, followed
       by zero or more characters with the "mark" property, and treats the
       sequence as an atomic group (see below). Characters with the "mark"
       property are typically accents that affect the preceding character.

       Matching characters by Unicode property is not fast, because PCRE has
       to search a structure that contains data for over fifteen thousand
       characters. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE.

   Simple assertions

       The fourth use of backslash is for certain simple assertions. An asser-
       tion specifies a condition that has to be met at a particular point in
       a match, without consuming any characters from the subject string. The
       use of subpatterns for more complicated assertions is described below.
       The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at start of subject
         \Z     matches at end of subject or before newline at end
         \z     matches at end of subject
         \G     matches at first matching position in subject

       These assertions may not appear in character classes (but note that \b
       has a different meaning, namely the backspace character, inside a char-
       acter class).

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
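
       The word boundary definition can be written out directly. The sketch
       below is in Python and assumes an ASCII-only notion of a "word"
       character; is_word_char and is_word_boundary are hypothetical helpers.
       Python's re module is used only as a cross-check, since it is a
       different engine from PCRE.

```python
import re


def is_word_char(ch: str) -> bool:
    # ASCII-only approximation of \w: a letter, a digit or an underscore.
    return ch == "_" or (ch.isascii() and ch.isalnum())


def is_word_boundary(subject: str, pos: int) -> bool:
    # Outside the string there is no character, which counts as "non-word".
    prev_is_word = pos > 0 and is_word_char(subject[pos - 1])
    curr_is_word = pos < len(subject) and is_word_char(subject[pos])
    # A boundary is where exactly one of the two neighbours matches \w.
    return prev_is_word != curr_is_word


subject = "ab cd"
boundaries = [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]

# Cross-check against Python's own \b, restricted to ASCII word characters.
ascii_b = re.compile(r"\b", re.ASCII)
from_re = [p for p in range(len(subject) + 1) if ascii_b.match(subject, p)]
print(boundaries, from_re)
```

       Treating "no character" as non-word covers the start and end cases in
       the definition without special handling.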

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three asser-
       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
       affect only the behaviour of the circumflex and dollar metacharacters.
       However, if the startoffset argument of pcre_exec() is non-zero, indi-
       cating that matching is to start at a point other than the beginning of
       the subject, \A can never match. The difference between \Z and \z is
       that \Z matches before a newline at the end of the string as well as at
       the very end, whereas \z matches only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate argu-
       ments, you can mimic Perl's /g option, and it is in this kind of imple-
       mentation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset argu-
       ment of pcre_exec() is non-zero, circumflex can never match if the
       PCRE_MULTILINE option is unset. Inside a character class, circumflex
       has an entirely different meaning (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       A dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default). Dollar need not
       be the last character of the pattern if a number of alternatives are
       involved, but it should be the last item in any branch in which it
       appears. Dollar has no special meaning in a character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, a circumflex
       matches immediately after internal newlines as well as at the start of
       the subject string. It does not match after a newline that ends the
       string. A dollar matches before any newlines in the string, as well as
       at the very end, when PCRE_MULTILINE is set. When newline is specified
       as the two-character sequence CRLF, isolated CR and LF characters do
       not indicate newlines.

       For example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but not otherwise.
       Consequently, patterns that are anchored in single line mode because
       all branches start with ^ are not anchored in multiline mode, and a
       match for circumflex is possible when the startoffset argument of
       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
       PCRE_MULTILINE is set.
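
       The /^abc$/ example can be tried in any engine with a multiline flag.
       The sketch below uses Python's re module as a stand-in, on the
       assumption that re.MULTILINE plays the role of PCRE_MULTILINE for this
       particular subject. The two engines are not identical in general (for
       instance, in how circumflex behaves after a newline that ends the
       string), so this shows only the one case described above.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start, so "abc" on the second
# line is not found.
default_match = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)

print(default_match)
print(multiline_match.span())
```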

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether or not PCRE_MULTILINE is
       set.


FULL STOP (PERIOD, DOT)

       Outside a character class, a dot in the pattern matches any one charac-
       ter in the subject string except (by default) a character that signi-
       fies the end of a line. In UTF-8 mode, the matched character may be
       more than one byte long. When a line ending is defined as a single
       character (CR or LF), dot never matches that character; when the two-
       character sequence CRLF is used, dot does not match CR if it is immedi-
       ately followed by LF, but otherwise it matches all characters (includ-
       ing isolated CRs and LFs).

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE_DOTALL option is set, a dot matches any one character, without
       exception. If newline is defined as the two-character sequence CRLF, it
       takes two dots to match it.
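
       A short illustration of the default and "dotall" behaviour, using
       Python's re module as a stand-in for PCRE. It assumes that re.DOTALL
       corresponds to PCRE_DOTALL; Python always treats LF as the newline, so
       the CR and CRLF cases described above are not reproduced here.

```python
import re

subject = "a\nb"

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r"a.b", subject)

# With the "dotall" option a dot matches any one character.
dotall_dot = re.fullmatch(r"a.b", subject, re.DOTALL)

print(default_dot)
print(dotall_dot is not None)
```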

       The handling of dot is entirely independent of the handling of circum-
       flex and dollar, the only relationship being that they both involve
       newlines. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

       Outside a character class, the escape sequence \C matches any one byte,
       both in and out of UTF-8 mode. Unlike a dot, it always matches CR and
       LF. The feature is provided in Perl in order to match individual bytes
       in UTF-8 mode. Because it breaks up UTF-8 characters into individual
       bytes, what remains in the string may be a malformed UTF-8 string. For
       this reason, the \C escape sequence is best avoided.

       PCRE does not allow \C to appear in lookbehind assertions (described
       below), because in UTF-8 mode this would make it impossible to calcu-
       late the length of the lookbehind.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial. If a closing square bracket is required as a member of the class,
       it should be the first data character in the class (after an initial
       circumflex, if present) or escaped with a backslash.

       A character class matches a single character in the subject. In UTF-8
       mode, the character may occupy more than one byte. A matched character
       must be in the set of characters defined by the class, unless the first
       character in the class definition is a circumflex, in which case the
       subject character must not be in the set defined by the class. If a
       circumflex is actually required as a member of the class, ensure it is
       not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion: it still con-
       sumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.

       In UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of bytes, or by using the \x{ escaping
       mechanism.

       When caseless matching is set, any letters in a class represent both
       their upper case and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
       understands the concept of case for characters whose values are less
       than 128, so caseless matching is always possible. For characters with
       higher values, the concept of case is supported if PCRE is compiled
       with Unicode property support, but not otherwise. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF-8
       support.

       Characters that might indicate line breaks (CR and LF) are never
       treated in any special way when matching character classes, whatever
       line-ending sequence is in use, and whatever setting of the PCRE_DOTALL
       and PCRE_MULTILINE options is used. A class such as [^a] always matches
       one of these characters.

       The minus (hyphen) character can be used to specify a range of charac-
       ters in a character class. For example, [d-m] matches any letter
       between d and m, inclusive. If a minus character is required in a
       class, it must be escaped with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so it
       would match "W46]" or "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is inter-
       preted as a class containing a range followed by two other characters.
       The octal or hexadecimal representation of "]" can also be used to end
       a range.
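
       The two readings of the bracket patterns above can be observed in
       practice. The sketch below uses Python's re module as a stand-in; for
       these two patterns it parses the class in the way described here, but
       it is a different engine, so treat the result as an illustration
       rather than a statement about PCRE itself.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
literal_tail = re.compile(r"[W-]46]")

# Escaped "]" ends a range: class of W..] plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

print([s for s in ("W46]", "-46]", "X46]") if literal_tail.fullmatch(s)])
print([c for c in ("W", "X", "]", "4", "6", "-") if range_class.fullmatch(c)])
```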

       Ranges operate in the collating sequence of character values. They can
       also be used for characters specified numerically, for example
       [\000-\037]. In UTF-8 mode, ranges can include characters whose values
       are greater than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
       character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases. In UTF-8 mode, PCRE supports the
       concept of case for characters with values greater than 128 only when
       it is compiled with Unicode property support.
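
       The stated equivalence between the caseless range [W-c] and the
       explicit list can be checked by brute force over the ASCII characters.
       The sketch below uses Python's re module with its ASCII and IGNORECASE
       flags as a stand-in for caseless PCRE matching; the variable names are
       illustrative only.

```python
import re

flags = re.IGNORECASE | re.ASCII
range_class = re.compile(r"[W-c]", flags)
listed_class = re.compile(r"[][\\^_`wxyzabc]", flags)

ascii_chars = [chr(code) for code in range(128)]
matched_by_range = {ch for ch in ascii_chars if range_class.fullmatch(ch)}
matched_by_list = {ch for ch in ascii_chars if listed_class.fullmatch(ch)}

print(sorted(matched_by_range) == sorted(matched_by_list))
print("".join(sorted(matched_by_range)))
```

       The range covers W to Z, six punctuation characters and a to c; caseless
       matching then adds w to z and A to C.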

       The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
       in a character class, and add the characters that they match to the
       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
       flex can conveniently be used with the upper case character types to
       specify a more restricted set of characters than the matching lower
       case type. For example, the class [^\W_] matches any letter or digit,
       but not underscore.
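
       Both example classes can be exercised as written. The sketch below
       again uses Python's re module, with the ASCII flag so that \d and \W
       cover only ASCII characters; this is an assumption made to keep the
       result close to PCRE's default tables, not a claim that the engines are
       the same.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

print([c for c in "09AFG" if hex_digit.fullmatch(c)])
print([c for c in "a7_-" if letter_or_digit.fullmatch(c)])
```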

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name - see the
       next section), and the terminating closing square bracket. However,
       escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT character (code
       11). This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater than 128 do not match any
       of the POSIX character classes.
657
658
659VERTICAL BAR
660
661 Vertical bar characters are used to separate alternative patterns. For
662 example, the pattern
663
664 gilbert|sullivan
665
666 matches either "gilbert" or "sullivan". Any number of alternatives may
667 appear, and an empty alternative is permitted (matching the empty
668 string). The matching process tries each alternative in turn, from left
669 to right, and the first one that succeeds is used. If the alternatives
670 are within a subpattern (defined below), "succeeds" means matching the
671 rest of the main pattern as well as the alternative in the subpattern.
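
The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can be illustrated with a short sketch. It uses Python's re module as a stand-in for PCRE (an assumption), and the second and third patterns are illustrative additions, not examples from this page.

```python
import re

# Either alternative can match.
names = re.findall(r'gilbert|sullivan', 'gilbert and sullivan')

# Alternatives are tried left to right: "a" succeeds first,
# so the longer alternative "ab" is never tried.
first_wins = re.match(r'a|ab', 'ab').group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" cannot be followed by "c" here, so the second alternative is used.
in_group = re.match(r'(a|ab)c', 'abc').group(1)

print(names, first_wins, in_group)
```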
672
673
674INTERNAL OPTION SETTING
675
676 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
677 PCRE_EXTENDED options can be changed from within the pattern by a
678 sequence of Perl option letters enclosed between "(?" and ")". The
679 option letters are
680
681 i for PCRE_CASELESS
682 m for PCRE_MULTILINE
683 s for PCRE_DOTALL
684 x for PCRE_EXTENDED
685
686 For example, (?im) sets caseless, multiline matching. It is also possi-
687 ble to unset these options by preceding the letter with a hyphen, and a
688 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
689 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
690 is also permitted. If a letter appears both before and after the
691 hyphen, the option is unset.
692
693 When an option change occurs at top level (that is, not inside subpat-
694 tern parentheses), the change applies to the remainder of the pattern
695 that follows. If the change is placed right at the start of a pattern,
696 PCRE extracts it into the global options (and it will therefore show up
697 in data extracted by the pcre_fullinfo() function).
698
699 An option change within a subpattern affects only that part of the cur-
700 rent pattern that follows it, so
701
702 (a(?i)b)c
703
704 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
705 used). By this means, options can be made to have different settings
706 in different parts of the pattern. Any changes made in one alternative
707 do carry on into subsequent branches within the same subpattern. For
708 example,
709
710 (a(?i)b|c)
711
712 matches "ab", "aB", "c", and "C", even though when matching "C" the
713 first branch is abandoned before the option setting. This is because
714 the effects of option settings happen at compile time. There would be
715 some very weird behaviour otherwise.
716
717 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
718 can be changed in the same way as the Perl-compatible options by using
719 the characters J, U and X respectively.
720
721
722SUBPATTERNS
723
724 Subpatterns are delimited by parentheses (round brackets), which can be
725 nested. Turning part of a pattern into a subpattern does two things:
726
727 1. It localizes a set of alternatives. For example, the pattern
728
729 cat(aract|erpillar|)
730
731 matches one of the words "cat", "cataract", or "caterpillar". Without
732 the parentheses, it would match "cataract", "erpillar" or the empty
733 string.
734
735 2. It sets up the subpattern as a capturing subpattern. This means
736 that, when the whole pattern matches, that portion of the subject
737 string that matched the subpattern is passed back to the caller via the
738 ovector argument of pcre_exec(). Opening parentheses are counted from
739 left to right (starting from 1) to obtain numbers for the capturing
740 subpatterns.
741
742 For example, if the string "the red king" is matched against the pat-
743 tern
744
745 the ((red|white) (king|queen))
746
747 the captured substrings are "red king", "red", and "king", and are num-
748 bered 1, 2, and 3, respectively.
749
750 The fact that plain parentheses fulfil two functions is not always
751 helpful. There are often times when a grouping subpattern is required
752 without a capturing requirement. If an opening parenthesis is followed
753 by a question mark and a colon, the subpattern does not do any captur-
754 ing, and is not counted when computing the number of any subsequent
755 capturing subpatterns. For example, if the string "the white queen" is
756 matched against the pattern
757
758 the ((?:red|white) (king|queen))
759
760 the captured substrings are "white queen" and "queen", and are numbered
761 1 and 2. The maximum number of capturing subpatterns is 65535, and the
762 maximum depth of nesting of all subpatterns, both capturing and non-
763 capturing, is 200.
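
The numbering of capturing and non-capturing parentheses described above can be checked with a small sketch. Python's re module is used here as a stand-in for PCRE (an assumption); its groups() method plays the role of the ovector contents, and the variable names are illustrative.

```python
import re

# Three capturing subpatterns, numbered by their opening parentheses.
capturing = re.search(r'the ((red|white) (king|queen))', 'the red king')

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.search(r'the ((?:red|white) (king|queen))',
                          'the white queen')

print(capturing.groups())
print(non_capturing.groups())
```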
764
765 As a convenient shorthand, if any option settings are required at the
766 start of a non-capturing subpattern, the option letters may appear
767 between the "?" and the ":". Thus the two patterns
768
769 (?i:saturday|sunday)
770 (?:(?i)saturday|sunday)
771
772 match exactly the same set of strings. Because alternative branches are
773 tried from left to right, and options are not reset until the end of
774 the subpattern is reached, an option setting in one branch does affect
775 subsequent branches, so the above patterns match "SUNDAY" as well as
776 "Saturday".
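
As a rough illustration of the shorthand, the sketch below uses Python's re module as a stand-in for PCRE (an assumption). Only the (?i: ... ) form is shown, because the second form is not portable to Python's re. The pattern "(?i:sat)urday" is an illustrative addition that shows the option ending with the subpattern.

```python
import re

# The option letter between "?" and ":" applies to every branch of the group.
weekend = re.compile(r'(?i:saturday|sunday)')
accepted = [s for s in ('Saturday', 'SUNDAY', 'monday')
            if weekend.fullmatch(s)]

# The option is reset at the end of the subpattern.
local = re.compile(r'(?i:sat)urday')
inside_only = local.fullmatch('SATurday') is not None
outside_caseful = local.fullmatch('SATURDAY') is None

print(accepted, inside_only, outside_caseful)
```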
777
778
779NAMED SUBPATTERNS
780
781 Identifying capturing parentheses by number is simple, but it can be
782 very hard to keep track of the numbers in complicated regular expres-
783 sions. Furthermore, if an expression is modified, the numbers may
784 change. To help with this difficulty, PCRE supports the naming of sub-
785 patterns, something that Perl does not provide. The Python syntax
786 (?P<name>...) is used. References to capturing parentheses from other
787 parts of the pattern, such as backreferences, recursion, and condi-
788 tions, can be made by name as well as by number.
789
790 Names consist of up to 32 alphanumeric characters and underscores.
791 Named capturing parentheses are still allocated numbers as well as
792 names. The PCRE API provides function calls for extracting the name-to-
793 number translation table from a compiled pattern. There is also a con-
794 venience function for extracting a captured substring by name.
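
The (?P<name>...) syntax comes from Python, so Python's re module gives a convenient way to sketch the idea. This is an analogy only: groupindex and group('name') below are Python facilities, not the PCRE API functions, and the date pattern is a hypothetical example.

```python
import re

date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')
m = date.match('2006-07')

# Named groups still receive numbers; this is the name-to-number table.
name_table = dict(date.groupindex)

# A captured substring can be fetched by name or by number.
by_name = m.group('year')
by_number = m.group(1)

print(name_table, by_name, by_number)
```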
795
796 By default, a name must be unique within a pattern, but it is possible
797 to relax this constraint by setting the PCRE_DUPNAMES option at compile
798 time. This can be useful for patterns where only one instance of the
799 named parentheses can match. Suppose you want to match the name of a
800 weekday, either as a 3-letter abbreviation or as the full name, and in
801 both cases you want to extract the abbreviation. This pattern (ignoring
802 the line breaks) does the job:
803
804 (?P<DN>Mon|Fri|Sun)(?:day)?|
805 (?P<DN>Tue)(?:sday)?|
806 (?P<DN>Wed)(?:nesday)?|
807 (?P<DN>Thu)(?:rsday)?|
808 (?P<DN>Sat)(?:urday)?
809
810 There are five capturing substrings, but only one is ever set after a
811 match. The convenience function for extracting the data by name
812 returns the substring for the first, and in this example, the only,
813 subpattern of that name that matched. This saves searching to find
814 which numbered subpattern it was. If you make a reference to a non-
815 unique named subpattern from elsewhere in the pattern, the one that
816 corresponds to the lowest number is used. For further details of the
817 interfaces for handling named subpatterns, see the pcreapi documenta-
818 tion.
819
820
821REPETITION
822
823 Repetition is specified by quantifiers, which can follow any of the
824 following items:
825
826 a literal data character
827 the . metacharacter
828 the \C escape sequence
829 the \X escape sequence (in UTF-8 mode with Unicode properties)
830 an escape such as \d that matches a single character
831 a character class
832 a back reference (see next section)
833 a parenthesized subpattern (unless it is an assertion)
834
835 The general repetition quantifier specifies a minimum and maximum num-
836 ber of permitted matches, by giving the two numbers in curly brackets
837 (braces), separated by a comma. The numbers must be less than 65536,
838 and the first must be less than or equal to the second. For example:
839
840 z{2,4}
841
842 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
843 special character. If the second number is omitted, but the comma is
844 present, there is no upper limit; if the second number and the comma
845 are both omitted, the quantifier specifies an exact number of required
846 matches. Thus
847
848 [aeiou]{3,}
849
850 matches at least 3 successive vowels, but may match many more, while
851
852 \d{8}
853
854 matches exactly 8 digits. An opening curly bracket that appears in a
855 position where a quantifier is not allowed, or one that does not match
856 the syntax of a quantifier, is taken as a literal character. For exam-
857 ple, {,6} is not a quantifier, but a literal string of four characters.
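
The three quantifier examples above can be exercised as follows. The sketch uses Python's re module as a stand-in for PCRE (an assumption). Note that Python treats {,6} as a quantifier meaning {0,6}, unlike the behaviour described here, so that case is deliberately left out.

```python
import re

# z{2,4}: between two and four repetitions.
z_runs = [s for s in ('z', 'zz', 'zzz', 'zzzz', 'zzzzz')
          if re.fullmatch(r'z{2,4}', s)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
many_vowels = re.fullmatch(r'[aeiou]{3,}', 'aeiouaeiou') is not None

# \d{8}: exactly eight digits.
eight_digits = re.fullmatch(r'\d{8}', '20060704') is not None
seven_digits = re.fullmatch(r'\d{8}', '2006070') is not None

print(z_runs, many_vowels, eight_digits, seven_digits)
```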
858
859 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
860 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
861 acters, each of which is represented by a two-byte sequence. Similarly,
862 when Unicode property support is available, \X{3} matches three Unicode
863 extended sequences, each of which may be several bytes long (and they
864 may be of different lengths).
865
866 The quantifier {0} is permitted, causing the expression to behave as if
867 the previous item and the quantifier were not present.
868
869 For convenience (and historical compatibility) the three most common
870 quantifiers have single-character abbreviations:
871
872 * is equivalent to {0,}
873 + is equivalent to {1,}
874 ? is equivalent to {0,1}
875
876 It is possible to construct infinite loops by following a subpattern
877 that can match no characters with a quantifier that has no upper limit,
878 for example:
879
880 (a?)*
881
882 Earlier versions of Perl and PCRE used to give an error at compile time
883 for such patterns. However, because there are cases where this can be
884 useful, such patterns are now accepted, but if any repetition of the
885 subpattern does in fact match no characters, the loop is forcibly bro-
886 ken.
887
888 By default, the quantifiers are "greedy", that is, they match as much
889 as possible (up to the maximum number of permitted times), without
890 causing the rest of the pattern to fail. The classic example of where
891 this gives problems is in trying to match comments in C programs. These
892 appear between /* and */ and within the comment, individual * and /
893 characters may appear. An attempt to match C comments by applying the
894 pattern
895
896 /\*.*\*/
897
898 to the string
899
900 /* first comment */ not comment /* second comment */
901
902 fails, because it matches the entire string owing to the greediness of
903 the .* item.
904
905 However, if a quantifier is followed by a question mark, it ceases to
906 be greedy, and instead matches the minimum number of times possible, so
907 the pattern
908
909 /\*.*?\*/
910
911 does the right thing with the C comments. The meaning of the various
912 quantifiers is not otherwise changed, just the preferred number of
913 matches. Do not confuse this use of question mark with its use as a
914 quantifier in its own right. Because it has two uses, it can sometimes
915 appear doubled, as in
916
917 \d??\d
918
919 which matches one digit by preference, but can match two if that is the
920 only way the rest of the pattern matches.
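
The difference between greedy and minimizing quantifiers can be seen by running both comment patterns over the subject string given above. This sketch uses Python's re module as a stand-in for PCRE (an assumption); the variable names are illustrative.

```python
import re

subject = '/* first comment */ not comment /* second comment */'

# Greedy .* runs to the last "*/" in the subject.
greedy = re.findall(r'/\*.*\*/', subject)

# Minimizing .*? stops at the first "*/" it can reach.
lazy = re.findall(r'/\*.*?\*/', subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
prefers_one = re.match(r'\d??\d', '12').group()
forced_two = re.fullmatch(r'\d??\d', '12').group()

print(greedy, lazy, prefers_one, forced_two)
```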
921
922 If the PCRE_UNGREEDY option is set (an option which is not available in
923 Perl), the quantifiers are not greedy by default, but individual ones
924 can be made greedy by following them with a question mark. In other
925 words, it inverts the default behaviour.
926
927 When a parenthesized subpattern is quantified with a minimum repeat
928 count that is greater than 1 or with a limited maximum, more memory is
929 required for the compiled pattern, in proportion to the size of the
930 minimum or maximum.
931
932 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
933 alent to Perl's /s) is set, thus allowing the . to match newlines, the
934 pattern is implicitly anchored, because whatever follows will be tried
935 against every character position in the subject string, so there is no
936 point in retrying the overall match at any position after the first.
937 PCRE normally treats such a pattern as though it were preceded by \A.
938
939 In cases where it is known that the subject string contains no new-
940 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
941 mization, or alternatively using ^ to indicate anchoring explicitly.
942
943 However, there is one situation where the optimization cannot be used.
944 When .* is inside capturing parentheses that are the subject of a
945 backreference elsewhere in the pattern, a match at the start may fail,
946 and a later one succeed. Consider, for example:
947
948 (.*)abc\1
949
950 If the subject is "xyz123abc123" the match point is the fourth charac-
951 ter. For this reason, such a pattern is not implicitly anchored.
952
953 When a capturing subpattern is repeated, the value captured is the sub-
954 string that matched the final iteration. For example, after
955
956 (tweedle[dume]{3}\s*)+
957
958 has matched "tweedledum tweedledee" the value of the captured substring
959 is "tweedledee". However, if there are nested capturing subpatterns,
960 the corresponding captured values may have been set in previous itera-
961 tions. For example, after
962
963 /(a|(b))+/
964
965 matches "aba" the value of the second captured substring is "b".
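
Both capture examples can be reproduced in a few lines. The sketch uses Python's re module as a stand-in for PCRE (an assumption); the variable names are illustrative.

```python
import re

# The captured value is the substring from the final iteration.
m = re.match(r'(tweedle[dume]{3}\s*)+', 'tweedledum tweedledee')
last_iteration = m.group(1)

# The inner group keeps the value set in an earlier iteration.
nested = re.match(r'(a|(b))+', 'aba')
outer_value = nested.group(1)
inner_value = nested.group(2)

print(last_iteration, outer_value, inner_value)
```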
966
967
968ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
969
970 With both maximizing and minimizing repetition, failure of what follows
971 normally causes the repeated item to be re-evaluated to see if a dif-
972 ferent number of repeats allows the rest of the pattern to match. Some-
973 times it is useful to prevent this, either to change the nature of the
974 match, or to cause it to fail earlier than it otherwise might, when the
975 author of the pattern knows there is no point in carrying on.
976
977 Consider, for example, the pattern \d+foo when applied to the subject
978 line
979
980 123456bar
981
982 After matching all 6 digits and then failing to match "foo", the normal
983 action of the matcher is to try again with only 5 digits matching the
984 \d+ item, and then with 4, and so on, before ultimately failing.
985 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
986 the means for specifying that once a subpattern has matched, it is not
987 to be re-evaluated in this way.
988
989 If we use atomic grouping for the previous example, the matcher would
990 give up immediately on failing to match "foo" the first time. The nota-
991 tion is a kind of special parenthesis, starting with (?> as in this
992 example:
993
994 (?>\d+)foo
995
996 This kind of parenthesis "locks up" the part of the pattern it con-
997 tains once it has matched, and a failure further into the pattern is
998 prevented from backtracking into it. Backtracking past it to previous
999 items, however, works as normal.
1000
1001 An alternative description is that a subpattern of this type matches
1002 the string of characters that an identical standalone pattern would
1003 match, if anchored at the current point in the subject string.
1004
1005 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1006 such as the above example can be thought of as a maximizing repeat that
1007 must swallow everything it can. So, while both \d+ and \d+? are pre-
1008 pared to adjust the number of digits they match in order to make the
1009 rest of the pattern match, (?>\d+) can only match an entire sequence of
1010 digits.
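
The effect of "locking up" a repeat can be sketched even in an engine without the (?> syntax. The example below uses Python's re module as a stand-in for PCRE (an assumption) and emulates the atomic group with a different technique: a lookahead that captures the digit run, followed by a back reference that consumes it. This emulation is a substitute for illustration, not the notation described here. Python 3.11 and later also accept (?>\d+) directly.

```python
import re

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
backtracking = re.fullmatch(r'\d+\d', '123456')

# Emulated atomic group: the lookahead captures the whole digit run and
# \1 consumes it. The lookahead is not re-entered on failure, so no digit
# is given back and the final \d cannot match.
atomic = re.fullmatch(r'(?=(\d+))\1\d', '123456')

# When the rest of the pattern fits, the emulated group matches normally.
still_matches = re.match(r'(?=(\d+))\1foo', '123456foo')

print(backtracking is not None, atomic is None, still_matches.group())
```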
1011
1012 Atomic groups in general can of course contain arbitrarily complicated
1013 subpatterns, and can be nested. However, when the subpattern for an
1014 atomic group is just a single repeated item, as in the example above, a
1015 simpler notation, called a "possessive quantifier" can be used. This
1016 consists of an additional + character following a quantifier. Using
1017 this notation, the previous example can be rewritten as
1018
1019 \d++foo
1020
1021 Possessive quantifiers are always greedy; the setting of the
1022 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1023 simpler forms of atomic group. However, there is no difference in the
1024 meaning or processing of a possessive quantifier and the equivalent
1025 atomic group.
1026
1027 The possessive quantifier syntax is an extension to the Perl syntax.
1028 Jeffrey Friedl originated the idea (and the name) in the first edition
1029 of his book. Mike McCloskey liked it, so implemented it when he built
1030 Sun's Java package, and PCRE copied it from there.
1031
1032 When a pattern contains an unlimited repeat inside a subpattern that
1033 can itself be repeated an unlimited number of times, the use of an
1034 atomic group is the only way to avoid some failing matches taking a
1035 very long time indeed. The pattern
1036
1037 (\D+|<\d+>)*[!?]
1038
1039 matches an unlimited number of substrings that either consist of non-
1040 digits, or digits enclosed in <>, followed by either ! or ?. When it
1041 matches, it runs quickly. However, if it is applied to
1042
1043 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1044
1045 it takes a long time before reporting failure. This is because the
1046 string can be divided between the internal \D+ repeat and the external
1047 * repeat in a large number of ways, and all have to be tried. (The
1048 example uses [!?] rather than a single character at the end, because
1049 both PCRE and Perl have an optimization that allows for fast failure
1050 when a single character is used. They remember the last single charac-
1051 ter that is required for a match, and fail early if it is not present
1052 in the string.) If the pattern is changed so that it uses an atomic
1053 group, like this:
1054
1055 ((?>\D+)|<\d+>)*[!?]
1056
1057 sequences of non-digits cannot be broken, and failure happens quickly.
1058
1059
1060BACK REFERENCES
1061
1062 Outside a character class, a backslash followed by a digit greater than
1063 0 (and possibly further digits) is a back reference to a capturing sub-
1064 pattern earlier (that is, to its left) in the pattern, provided there
1065 have been that many previous capturing left parentheses.
1066
1067 However, if the decimal number following the backslash is less than 10,
1068 it is always taken as a back reference, and causes an error only if
1069 there are not that many capturing left parentheses in the entire pat-
1070 tern. In other words, the parentheses that are referenced need not be
1071 to the left of the reference for numbers less than 10. A "forward back
1072 reference" of this type can make sense when a repetition is involved
1073 and the subpattern to the right has participated in an earlier itera-
1074 tion.
1075
1076 It is not possible to have a numerical "forward back reference" to a
1077 subpattern whose number is 10 or more. However, a back reference to any
1078 subpattern is possible using named parentheses (see below). See also
1079 the subsection entitled "Non-printing characters" above for further
1080 details of the handling of digits following a backslash.
1081
1082 A back reference matches whatever actually matched the capturing sub-
1083 pattern in the current subject string, rather than anything matching
1084 the subpattern itself (see "Subpatterns as subroutines" below for a way
1085 of doing that). So the pattern
1086
1087 (sens|respons)e and \1ibility
1088
1089 matches "sense and sensibility" and "response and responsibility", but
1090 not "sense and responsibility". If caseful matching is in force at the
1091 time of the back reference, the case of letters is relevant. For exam-
1092 ple,
1093
1094 ((?i)rah)\s+\1
1095
1096 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1097 original capturing subpattern is matched caselessly.
1098
1099 Back references to named subpatterns use the Python syntax (?P=name).
1100 We could rewrite the above example as follows:
1101
1102 (?P<p1>(?i)rah)\s+(?P=p1)
1103
1104 A subpattern that is referenced by name may appear in the pattern
1105 before or after the reference.
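
Numbered and named back references can be compared side by side. The sketch uses Python's re module as a stand-in for PCRE (an assumption). The group name "stem" is hypothetical, and Python is stricter than the behaviour described here in that a named group must be defined before it is referenced.

```python
import re

numbered = re.compile(r'(sens|respons)e and \1ibility')
named = re.compile(r'(?P<stem>sens|respons)e and (?P=stem)ibility')

subjects = [
    'sense and sensibility',
    'response and responsibility',
    'sense and responsibility',   # the back reference must repeat "sens"
]

numbered_hits = [s for s in subjects if numbered.fullmatch(s)]
named_hits = [s for s in subjects if named.fullmatch(s)]

print(numbered_hits)
print(named_hits)
```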
1106
1107 There may be more than one back reference to the same subpattern. If a
1108 subpattern has not actually been used in a particular match, any back
1109 references to it always fail. For example, the pattern
1110
1111 (a|(bc))\2
1112
1113 always fails if it starts to match "a" rather than "bc". Because there
1114 may be many capturing parentheses in a pattern, all digits following
1115 the backslash are taken as part of a potential back reference number.
1116 If the pattern continues with a digit character, some delimiter must be
1117 used to terminate the back reference. If the PCRE_EXTENDED option is
1118 set, this can be whitespace. Otherwise an empty comment (see "Com-
1119 ments" below) can be used.
1120
1121 A back reference that occurs inside the parentheses to which it refers
1122 fails when the subpattern is first used, so, for example, (a\1) never
1123 matches. However, such references can be useful inside repeated sub-
1124 patterns. For example, the pattern
1125
1126 (a|b\1)+
1127
1128 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1129 ation of the subpattern, the back reference matches the character
1130 string corresponding to the previous iteration. In order for this to
1131 work, the pattern must be such that the first iteration does not need
1132 to match the back reference. This can be done using alternation, as in
1133 the example above, or by a quantifier with a minimum of zero.
1134
1135
1136ASSERTIONS
1137
1138 An assertion is a test on the characters following or preceding the
1139 current matching point that does not actually consume any characters.
1140 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1141 described above.
1142
1143 More complicated assertions are coded as subpatterns. There are two
1144 kinds: those that look ahead of the current position in the subject
1145 string, and those that look behind it. An assertion subpattern is
1146 matched in the normal way, except that it does not cause the current
1147 matching position to be changed.
1148
1149 Assertion subpatterns are not capturing subpatterns, and may not be
1150 repeated, because it makes no sense to assert the same thing several
1151 times. If any kind of assertion contains capturing subpatterns within
1152 it, these are counted for the purposes of numbering the capturing sub-
1153 patterns in the whole pattern. However, substring capturing is carried
1154 out only for positive assertions, because it does not make sense for
1155 negative assertions.
1156
1157 Lookahead assertions
1158
1159 Lookahead assertions start with (?= for positive assertions and (?! for
1160 negative assertions. For example,
1161
1162 \w+(?=;)
1163
1164 matches a word followed by a semicolon, but does not include the semi-
1165 colon in the match, and
1166
1167 foo(?!bar)
1168
1169 matches any occurrence of "foo" that is not followed by "bar". Note
1170 that the apparently similar pattern
1171
1172 (?!foo)bar
1173
1174 does not find an occurrence of "bar" that is preceded by something
1175 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1176 the assertion (?!foo) is always true when the next three characters are
1177 "bar". A lookbehind assertion is needed to achieve the other effect.
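
The three lookahead patterns above behave as follows on small subjects. The sketch uses Python's re module as a stand-in for PCRE (an assumption); the subject strings are illustrative.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
before_semicolon = re.findall(r'\w+(?=;)', 'int x; y')

# "foo" not followed by "bar": only the second "foo" qualifies.
foo_positions = [m.start()
                 for m in re.finditer(r'foo(?!bar)', 'foobar foobaz')]

# (?!foo)bar still finds "bar" after "foo", because the assertion
# looks at the characters "bar" themselves, which are never "foo".
bar_position = re.search(r'(?!foo)bar', 'foobar').start()

print(before_semicolon, foo_positions, bar_position)
```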
1178
1179 If you want to force a matching failure at some point in a pattern, the
1180 most convenient way to do it is with (?!) because an empty string
1181 always matches, so an assertion that requires there not to be an empty
1182 string must always fail.
1183
1184 Lookbehind assertions
1185
1186 Lookbehind assertions start with (?<= for positive assertions and (?<!
1187 for negative assertions. For example,
1188
1189 (?<!foo)bar
1190
1191 does find an occurrence of "bar" that is not preceded by "foo". The
1192 contents of a lookbehind assertion are restricted such that all the
1193 strings it matches must have a fixed length. However, if there are sev-
1194 eral top-level alternatives, they do not all have to have the same
1195 fixed length. Thus
1196
1197 (?<=bullock|donkey)
1198
1199 is permitted, but
1200
1201 (?<!dogs?|cats?)
1202
1203 causes an error at compile time. Branches that match different length
1204 strings are permitted only at the top level of a lookbehind assertion.
1205 This is an extension compared with Perl (at least for 5.8), which
1206 requires all branches to match the same length of string. An assertion
1207 such as
1208
1209 (?<=ab(c|de))
1210
1211 is not permitted, because its single top-level branch can match two
1212 different lengths, but it is acceptable if rewritten to use two top-
1213 level branches:
1214
1215 (?<=abc|abde)
1216
1217 The implementation of lookbehind assertions is, for each alternative,
1218 to temporarily move the current position back by the fixed width and
1219 then try to match. If there are insufficient characters before the cur-
1220 rent position, the match is deemed to fail.
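
A short sketch of lookbehind behaviour, using Python's re module as a stand-in for PCRE (an assumption). Python is stricter than the rules described here: it requires every alternative in a lookbehind to have the same length, so only equal-length alternatives are shown. The subject strings are illustrative.

```python
import re

# "bar" not preceded by "foo": only the second "bar" qualifies.
bar_positions = [m.start()
                 for m in re.finditer(r'(?<!foo)bar', 'foobar bazbar')]

# Too few characters before the current position: the assertion fails.
too_short = re.search(r'(?<=abc)d', 'bcd')

# Alternatives at the top level of the lookbehind (equal lengths here).
after_either = re.search(r'(?<=abc|xyz)d', 'xyzd').start()

print(bar_positions, too_short, after_either)
```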
1221
1222 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1223 mode) to appear in lookbehind assertions, because it makes it impossi-
1224 ble to calculate the length of the lookbehind. The \X escape, which can
1225 match different numbers of bytes, is also not permitted.
1226
1227 Atomic groups can be used in conjunction with lookbehind assertions to
1228 specify efficient matching at the end of the subject string. Consider a
1229 simple pattern such as
1230
1231 abcd$
1232
1233 when applied to a long string that does not match. Because matching
1234 proceeds from left to right, PCRE will look for each "a" in the subject
1235 and then see if what follows matches the rest of the pattern. If the
1236 pattern is specified as
1237
1238 ^.*abcd$
1239
1240 the initial .* matches the entire string at first, but when this fails
1241 (because there is no following "a"), it backtracks to match all but the
1242 last character, then all but the last two characters, and so on. Once
1243 again the search for "a" covers the entire string, from right to left,
1244 so we are no better off. However, if the pattern is written as
1245
1246 ^(?>.*)(?<=abcd)
1247
1248 or, equivalently, using the possessive quantifier syntax,
1249
1250 ^.*+(?<=abcd)
1251
1252 there can be no backtracking for the .* item; it can match only the
1253 entire string. The subsequent lookbehind assertion does a single test
1254 on the last four characters. If it fails, the match fails immediately.
1255 For long strings, this approach makes a significant difference to the
1256 processing time.
1257
1258 Using multiple assertions
1259
1260 Several assertions (of any sort) may occur in succession. For example,
1261
1262 (?<=\d{3})(?<!999)foo
1263
1264 matches "foo" preceded by three digits that are not "999". Notice that
1265 each of the assertions is applied independently at the same point in
1266 the subject string. First there is a check that the previous three
1267 characters are all digits, and then there is a check that the same
1268 three characters are not "999". This pattern does not match "foo" pre-
1269 ceded by six characters, the first three of which are digits and the last
1270 three of which are not "999". For example, it doesn't match "123abc-
1271 foo". A pattern to do that is
1272
1273 (?<=\d{3}...)(?<!999)foo
1274
1275 This time the first assertion looks at the preceding six characters,
1276 checking that the first three are digits, and then the second assertion
1277 checks that the preceding three characters are not "999".
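
The difference between the two patterns is easiest to see on concrete subjects. The sketch uses Python's re module as a stand-in for PCRE (an assumption); the subject strings other than "123abcfoo" are illustrative.

```python
import re

# Both assertions inspect the same three preceding characters.
same_three = re.compile(r'(?<=\d{3})(?<!999)foo')
# The first assertion looks back six characters, the second only three.
six_back = re.compile(r'(?<=\d{3}...)(?<!999)foo')

subjects = ['123foo', '999foo', '123abcfoo', '123999foo']
same_three_hits = [s for s in subjects if same_three.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]

print(same_three_hits)
print(six_back_hits)
```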
1278
1279 Assertions can be nested in any combination. For example,
1280
1281 (?<=(?<!foo)bar)baz
1282
1283 matches an occurrence of "baz" that is preceded by "bar" which in turn
1284 is not preceded by "foo", while
1285
1286 (?<=\d{3}(?!999)...)foo
1287
1288 is another pattern that matches "foo" preceded by three digits and any
1289 three characters that are not "999".
1290
1291
1292CONDITIONAL SUBPATTERNS
1293
1294 It is possible to cause the matching process to obey a subpattern con-
1295 ditionally or to choose between two alternative subpatterns, depending
1296 on the result of an assertion, or whether a previous capturing subpat-
1297 tern matched or not. The two possible forms of conditional subpattern
1298 are
1299
1300 (?(condition)yes-pattern)
1301 (?(condition)yes-pattern|no-pattern)
1302
1303 If the condition is satisfied, the yes-pattern is used; otherwise the
1304 no-pattern (if present) is used. If there are more than two alterna-
1305 tives in the subpattern, a compile-time error occurs.
1306
1307 There are three kinds of condition. If the text between the parentheses
1308 consists of a sequence of digits, or a sequence of alphanumeric charac-
1309 ters and underscores, the condition is satisfied if the capturing sub-
1310 pattern of that number or name has previously matched. There is a pos-
1311 sible ambiguity here, because subpattern names may consist entirely of
1312 digits. PCRE looks first for a named subpattern; if it cannot find one
1313 and the text consists entirely of digits, it looks for a subpattern of
1314 that number, which must be greater than zero. Using subpattern names
1315 that consist entirely of digits is not recommended.
1316
1317 Consider the following pattern, which contains non-significant white
1318 space to make it more readable (assume the PCRE_EXTENDED option) and to
1319 divide it into three parts for ease of discussion:
1320
1321 ( \( )? [^()]+ (?(1) \) )
1322
1323 The first part matches an optional opening parenthesis, and if that
1324 character is present, sets it as the first captured substring. The sec-
1325 ond part matches one or more characters that are not parentheses. The
1326 third part is a conditional subpattern that tests whether the first set
1327 of parentheses matched or not. If they did, that is, if the subject started
1328 with an opening parenthesis, the condition is true, and so the yes-pat-
1329 tern is executed and a closing parenthesis is required. Otherwise,
1330 since no-pattern is not present, the subpattern matches nothing. In
1331 other words, this pattern matches a sequence of non-parentheses,
1332 optionally enclosed in parentheses. Rewriting it to use a named subpat-
1333 tern gives this:
1334
1335 (?P<OPEN> \( )? [^()]+ (?(OPEN) \) )
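
Both forms of this conditional pattern can be tried as written. The sketch uses Python's re module as a stand-in for PCRE, with re.VERBOSE standing in for PCRE_EXTENDED (both are assumptions). The subject strings are illustrative.

```python
import re

# White space in the patterns is ignored because of re.VERBOSE.
by_number = re.compile(r'( \( )? [^()]+ (?(1) \) )', re.VERBOSE)
by_name = re.compile(r'(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )', re.VERBOSE)

subjects = ['abc', '(abc)', '(abc', 'abc)']

# A closing parenthesis is required only if an opening one was matched.
accepted_by_number = [s for s in subjects if by_number.fullmatch(s)]
accepted_by_name = [s for s in subjects if by_name.fullmatch(s)]

print(accepted_by_number)
print(accepted_by_name)
```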
1336
1337 If the condition is the string (R), and there is no subpattern with the
1338 name R, the condition is satisfied if a recursive call to the pattern
1339 or subpattern has been made. At "top level", the condition is false.
1340 This is a PCRE extension. Recursive patterns are described in the next
1341 section.
1342
1343 If the condition is not a subpattern number or name, or (R), it must be an
1344 assertion. This may be a positive or negative lookahead or lookbehind
1345 assertion. Consider this pattern, again containing non-significant
1346 white space, and with the two alternatives on the second line:
1347
1348 (?(?=[^a-z]*[a-z])
1349 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1350
1351 The condition is a positive lookahead assertion that matches an
1352 optional sequence of non-letters followed by a letter. In other words,
1353 it tests for the presence of at least one letter in the subject. If a
1354 letter is found, the subject is matched against the first alternative;
1355 otherwise it is matched against the second. This pattern matches
1356 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1357 letters and dd are digits.
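
The following sketch shows the same two-way choice in Python. It is an emulation, not the PCRE syntax: Python's re module (assumed here only for illustration) does not accept an assertion as a condition, so "if the lookahead succeeds use A, otherwise use B" is written as (?=cond)A|(?!cond)B. The name DATE is hypothetical.

```python
import re

# Emulation of the conditional: the positive lookahead guards the first
# alternative and the matching negative lookahead guards the second.
DATE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead: dd-aaa-dd
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter ahead: dd-dd-dd
    )
    """,
    re.VERBOSE,
)

for subject in ["12-abc-34", "12-34-56", "12-ab-34"]:
    print(subject, DATE.fullmatch(subject) is not None)
```

The third subject contains a letter, so only the dd-aaa-dd form is tried for it, and that form fails because there are only two letters.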
1358
1359
1360COMMENTS
1361
1362 The sequence (?# marks the start of a comment that continues up to the
1363 next closing parenthesis. Nested parentheses are not permitted. The
1364 characters that make up a comment play no part in the pattern matching
1365 at all.
1366
1367 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1368 character class introduces a comment that continues to immediately
1369 after the next newline in the pattern.
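
Both comment styles can be tried in Python's re module, which is assumed here as a stand-in for PCRE because it accepts the same (?#...) syntax and offers re.VERBOSE as the counterpart of PCRE_EXTENDED. The names INLINE and EXTENDED are hypothetical.

```python
import re

# An inline (?# ... ) comment takes no part in the matching.
INLINE = re.compile(r"abc(?#this text is ignored)def")

# In extended mode an unescaped # outside a character class starts a
# comment that runs to the end of the line.
EXTENDED = re.compile(
    r"""
    \d+      # one or more digits
    [#]      # a literal hash sign, because it is inside a character class
    \d+      # more digits
    """,
    re.VERBOSE,
)

print(INLINE.fullmatch("abcdef") is not None)
print(EXTENDED.fullmatch("12#34") is not None)
```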
1370
1371
1372RECURSIVE PATTERNS
1373
1374 Consider the problem of matching a string in parentheses, allowing for
1375 unlimited nested parentheses. Without the use of recursion, the best
1376 that can be done is to use a pattern that matches up to some fixed
1377 depth of nesting. It is not possible to handle an arbitrary nesting
1378 depth. Perl provides a facility that allows regular expressions to
1379 recurse (amongst other things). It does this by interpolating Perl code
1380 in the expression at run time, and the code can refer to the expression
1381 itself. A Perl pattern to solve the parentheses problem can be created
1382 like this:
1383
1384 $re = qr{\( (?: (?>[^()]+) | (??{$re}) )* \)}x;
1385
1386 The (??{...}) item interpolates Perl code at run time, and in this case
1387 refers recursively to the pattern in which it appears. Obviously, PCRE
1388 cannot support the interpolation of Perl code. Instead, it supports
1389 some special syntax for recursion of the entire pattern, and also for
1390 individual subpattern recursion.
1391
1392 The special item that consists of (? followed by a number greater than
1393 zero and a closing parenthesis is a recursive call of the subpattern of
1394 the given number, provided that it occurs inside that subpattern. (If
1395 not, it is a "subroutine" call, which is described in the next sec-
1396 tion.) The special item (?R) is a recursive call of the entire regular
1397 expression.
1398
1399 A recursive subpattern call is always treated as an atomic group. That
1400 is, once it has matched some of the subject string, it is never re-
1401 entered, even if it contains untried alternatives and there is a subse-
1402 quent matching failure.
1403
1404 This PCRE pattern solves the nested parentheses problem (assume the
1405 PCRE_EXTENDED option is set so that white space is ignored):
1406
1407 \( ( (?>[^()]+) | (?R) )* \)
1408
1409 First it matches an opening parenthesis. Then it matches any number of
1410 substrings, each of which can be either a sequence of non-parentheses or a
1411 recursive match of the pattern itself (that is, a correctly parenthe-
1412 sized substring). Finally there is a closing parenthesis.
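
The three steps can be mirrored by a small hand-written recursive function. This is only a sketch of what the pattern accepts, written in plain Python because Python's re module has no (?R); it is not how PCRE implements recursion, and the name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    # Hand-written analogue of the recursive pattern. Returns the index just
    # past the closing parenthesis, or None if no match starts at pos.
    if pos >= len(subject) or subject[pos] != "(":
        return None                              # \( : opening parenthesis required
    pos += 1
    while pos < len(subject):
        if subject[pos] == "(":
            end = match_parens(subject, pos)     # (?R) : recurse on the whole pattern
            if end is None:
                return None
            pos = end
        elif subject[pos] == ")":
            return pos + 1                       # \) : closing parenthesis
        else:
            # (?>[^()]+) : swallow a whole run of non-parentheses at once
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None                                  # input ended before the closing ')'


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```

The first call consumes the whole ten-character string; the second returns None because the outer parenthesis is never closed.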
1413
1414 If this were part of a larger pattern, you would not want to recurse
1415 the entire pattern, so instead you could use this:
1416
1417 ( \( ( (?>[^()]+) | (?1) )* \) )
1418
1419 We have put the pattern into parentheses, and caused the recursion to
1420 refer to them instead of the whole pattern. In a larger pattern, keep-
1421 ing track of parenthesis numbers can be tricky. It may be more conve-
1422 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1423 which is an extension to the Python syntax that PCRE uses for named
1424 parentheses (Perl does not provide named parentheses). We could rewrite
1425 the above example as follows:
1426
1427 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1428
1429 This particular example pattern contains nested unlimited repeats, and
1430 so the use of atomic grouping for matching strings of non-parentheses
1431 is important when applying the pattern to strings that do not match.
1432 For example, when this pattern is applied to
1433
1434 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1435
1436 it yields "no match" quickly. However, if atomic grouping is not used,
1437 the match runs for a very long time indeed because there are so many
1438 different ways the + and * repeats can carve up the subject, and all
1439 have to be tested before failure can be reported.
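
To give a feel for the numbers involved, the sketch below counts the ways a run of n non-parenthesis characters can be divided into consecutive non-empty pieces, one piece per iteration of the outer repeat. It is a simplified model for illustration only, not PCRE's actual step count, and the name count_splits is hypothetical. The count doubles with every extra character.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    # Number of ways to cut a run of n characters into consecutive,
    # non-empty pieces.
    if n == 0:
        return 1          # nothing left to cut: exactly one way
    # Pick the length of the first piece, then cut up the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


for n in (1, 2, 3, 10, 30):
    print(n, count_splits(n), 2 ** (n - 1))
```

For each n the brute-force count agrees with the closed form 2^(n-1). With atomic grouping the run of letters is taken as a single piece and the other divisions are never tried.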
1440
1441 At the end of a match, the values set for any capturing subpatterns are
1442 those from the outermost level of the recursion at which the subpattern
1443 value is set. If you want to obtain intermediate values, a callout
1444 function can be used (see the section on callouts below and the pcrecallout
1445 documentation). If the pattern above is matched against
1446
1447 (ab(cd)ef)
1448
1449 the value for the capturing parentheses is "ef", which is the last
1450 value taken on at the top level. If additional parentheses are added,
1451 giving
1452
1453 \( ( ( (?>[^()]+) | (?R) )* ) \)
1454    ^                        ^
1455    ^                        ^
1456
1457 the string they capture is "ab(cd)ef", the contents of the top level
1458 parentheses. If there are more than 15 capturing parentheses in a pat-
1459 tern, PCRE has to obtain extra memory to store data during a recursion,
1460 which it does by using pcre_malloc, freeing it via pcre_free after-
1461 wards. If no memory can be obtained, the match fails with the
1462 PCRE_ERROR_NOMEMORY error.
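
The capture behaviour described above can be modelled with a hand-written matcher that records what the capturing group matched, but only for iterations at the top level. This is a deliberately simplified sketch in plain Python, not PCRE: it ignores the case where a group is set only at a deeper level, and the names match_nested and capture_demo are hypothetical.

```python
def match_nested(subject, pos, top_level, captures):
    # Analogue of  \( ( (?>[^()]+) | (?R) )* \)  that remembers the text of
    # the most recent top-level iteration of the capturing group.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        start = pos
        if subject[pos] == "(":
            pos = match_nested(subject, pos, False, captures)   # (?R)
            if pos is None:
                return None
        else:
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
        if top_level:
            captures["group1"] = subject[start:pos]   # overwritten each time round
    if pos >= len(subject):
        return None                                   # no closing parenthesis
    return pos + 1


def capture_demo(subject):
    captures = {}
    end = match_nested(subject, 0, True, captures)
    if end is None:
        return None
    # Extra parentheses around the whole repeat would capture everything
    # between the outermost '(' and ')'.
    captures["outer"] = subject[1:end - 1]
    return captures


print(capture_demo("(ab(cd)ef)"))
```

For the subject (ab(cd)ef) the model reports "ef" for the original group and "ab(cd)ef" for the added outer group, in line with the description above.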
1463
1464 Do not confuse the (?R) item with the condition (R), which tests for
1465 recursion. Consider this pattern, which matches text in angle brack-
1466 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1467 brackets (that is, when recursing), whereas any characters are permit-
1468 ted at the outer level.
1469
1470 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1471
1472 In this pattern, (?(R) is the start of a conditional subpattern, with
1473 two different alternatives for the recursive and non-recursive cases.
1474 The (?R) item is the actual recursive call.
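
The difference between the two items can be made concrete with a hand-written analogue in which a flag plays the part of the (R) condition and the function calling itself plays the part of (?R). This is a sketch in plain Python under that assumption, not PCRE's implementation, and the name match_angle is hypothetical.

```python
DIGITS = "0123456789"


def match_angle(subject, pos=0, recursing=False):
    # Analogue of  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
    # Returns the index just past the closing '>' or None.
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while pos < len(subject):
        start = pos
        if recursing:
            # (?(R) \d++ : inside a recursion only digits are allowed
            while pos < len(subject) and subject[pos] in DIGITS:
                pos += 1
        else:
            # | [^<>]*+ : at the outer level, anything except angle brackets
            while pos < len(subject) and subject[pos] not in "<>":
                pos += 1
        if pos < len(subject) and subject[pos] == "<":
            end = match_angle(subject, pos, True)    # (?R) : the recursive call
            if end is None:
                break
            pos = end
        if pos == start:
            break                                    # no progress: leave the repeat
    if pos < len(subject) and subject[pos] == ">":
        return pos + 1
    return None


print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12a>def>"))
```

The first subject is accepted in full. The second is rejected because the letter "a" appears inside a nested pair of brackets, where only digits are permitted.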
1475
1476
1477SUBPATTERNS AS SUBROUTINES
1478
1479 If the syntax for a recursive subpattern reference (either by number or
1480 by name) is used outside the parentheses to which it refers, it oper-
1481 ates like a subroutine in a programming language. An earlier example
1482 pointed out that the pattern
1483
1484 (sens|respons)e and \1ibility
1485
1486 matches "sense and sensibility" and "response and responsibility", but
1487 not "sense and responsibility". If instead the pattern
1488
1489 (sens|respons)e and (?1)ibility
1490
1491 is used, it does match "sense and responsibility" as well as the other
1492 two strings. Such references, if given numerically, must follow the
1493 subpattern to which they refer. However, named references can refer to
1494 later subpatterns.
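
The contrast between a back reference and a subroutine call can be tried in Python's re module, with one caveat: re has no (?1), so the subroutine call is emulated here by repeating the text of the subpattern. That emulation is an assumption made for illustration; it does not reproduce the atomic-group behaviour of a real subroutine call. The names STEM, BACKREF and REUSED are hypothetical.

```python
import re

STEM = r"sens|respons"

# Back reference: \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Subroutine-style reuse, emulated by repeating the subpattern itself:
# the second copy is free to match a different alternative.
REUSED = re.compile(rf"({STEM})e and (?:{STEM})ibility")

for subject in [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]:
    print(subject, BACKREF.fullmatch(subject) is not None,
          REUSED.fullmatch(subject) is not None)
```

Only the third subject separates the two patterns: the back reference rejects it, while the reused subpattern accepts it.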
1495
1496 Like recursive subpatterns, a "subroutine" call is always treated as an
1497 atomic group. That is, once it has matched some of the subject string,
1498 it is never re-entered, even if it contains untried alternatives and
1499 there is a subsequent matching failure.
1500
1501
1502CALLOUTS
1503
1504 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1505 Perl code to be obeyed in the middle of matching a regular expression.
1506 This makes it possible, amongst other things, to extract different sub-
1507 strings that match the same pair of parentheses when there is a repeti-
1508 tion.
1509
1510 PCRE provides a similar feature, but of course it cannot obey arbitrary
1511 Perl code. The feature is called "callout". The caller of PCRE provides
1512 an external function by putting its entry point in the global variable
1513 pcre_callout. By default, this variable contains NULL, which disables
1514 all calling out.
1515
1516 Within a regular expression, (?C) indicates the points at which the
1517 external function is to be called. If you want to identify different
1518 callout points, you can put a number less than 256 after the letter C.
1519 The default value is zero. For example, this pattern has two callout
1520 points:
1521
1522 (?C1)abc(?C2)def
1523
1524 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1525 automatically installed before each item in the pattern. They are all
1526 numbered 255.
1527
1528 During matching, when PCRE reaches a callout point (and pcre_callout is
1529 set), the external function is called. It is provided with the number
1530 of the callout, the position in the pattern, and, optionally, one item
1531 of data originally supplied by the caller of pcre_exec(). The callout
1532 function may cause matching to proceed, to backtrack, or to fail alto-
1533 gether. A complete description of the interface to the callout function
1534 is given in the pcrecallout documentation.
1535
1536Last updated: 06 June 2006
1537Copyright (c) 1997-2006 University of Cambridge.