Commit | Line | Data |
---|---|---|
8ac170f3 | 1 | This file contains the PCRE man page that describes the regular expressions |
92e772ff | 2 | supported by PCRE version 6.2. Note that not all of the features are relevant |
495ae4b0 PH |
3 | in the context of Exim. In particular, the version of PCRE that is compiled |
4 | with Exim does not include UTF-8 support, there is no mechanism for changing | |
5 | the options with which the PCRE functions are called, and features such as | |
6 | callout are not accessible. | |
7 | ----------------------------------------------------------------------------- | |
8 | ||
92e772ff | 9 | PCREPATTERN(3) PCREPATTERN(3) |
495ae4b0 PH |
10 | |
11 | ||
12 | NAME | |
13 | PCRE - Perl-compatible regular expressions | |
14 | ||
8ac170f3 | 15 | |
495ae4b0 PH |
16 | PCRE REGULAR EXPRESSION DETAILS |
17 | ||
18 | The syntax and semantics of the regular expressions supported by PCRE | |
19 | are described below. Regular expressions are also described in the Perl | |
20 | documentation and in a number of books, some of which have copious | |
21 | examples. Jeffrey Friedl's "Mastering Regular Expressions", published | |
22 | by O'Reilly, covers regular expressions in great detail. This descrip- | |
23 | tion of PCRE's regular expressions is intended as reference material. | |
24 | ||
25 | The original operation of PCRE was on strings of one-byte characters. | |
26 | However, there is now also support for UTF-8 character strings. To use | |
27 | this, you must build PCRE to include UTF-8 support, and then call | |
28 | pcre_compile() with the PCRE_UTF8 option. How this affects pattern | |
29 | matching is mentioned in several places below. There is also a summary | |
30 | of UTF-8 features in the section on UTF-8 support in the main pcre | |
31 | page. | |
32 | ||
8ac170f3 PH |
33 | The remainder of this document discusses the patterns that are sup- |
34 | ported by PCRE when its main matching function, pcre_exec(), is used. | |
35 | From release 6.0, PCRE offers a second matching function, | |
36 | pcre_dfa_exec(), which matches using a different algorithm that is not | |
37 | Perl-compatible. The advantages and disadvantages of the alternative | |
38 | function, and how it differs from the normal function, are discussed in | |
39 | the pcrematching page. | |
40 | ||
495ae4b0 PH |
41 | A regular expression is a pattern that is matched against a subject |
42 | string from left to right. Most characters stand for themselves in a | |
43 | pattern, and match the corresponding characters in the subject. As a | |
44 | trivial example, the pattern | |
45 | ||
46 | The quick brown fox | |
47 | ||
8ac170f3 PH |
48 | matches a portion of a subject string that is identical to itself. When |
49 | caseless matching is specified (the PCRE_CASELESS option), letters are | |
50 | matched independently of case. In UTF-8 mode, PCRE always understands | |
51 | the concept of case for characters whose values are less than 128, so | |
52 | caseless matching is always possible. For characters with higher val- | |
53 | ues, the concept of case is supported if PCRE is compiled with Unicode | |
54 | property support, but not otherwise. If you want to use caseless | |
55 | matching for characters 128 and above, you must ensure that PCRE is | |
56 | compiled with Unicode property support as well as with UTF-8 support. | |
57 | ||
58 | The power of regular expressions comes from the ability to include | |
59 | alternatives and repetitions in the pattern. These are encoded in the | |
60 | pattern by the use of metacharacters, which do not stand for themselves | |
61 | but instead are interpreted in some special way. | |
62 | ||
63 | There are two different sets of metacharacters: those that are recog- | |
64 | nized anywhere in the pattern except within square brackets, and those | |
65 | that are recognized in square brackets. Outside square brackets, the | |
495ae4b0 PH |
66 | metacharacters are as follows: |
67 | ||
68 | \ general escape character with several uses | |
69 | ^ assert start of string (or line, in multiline mode) | |
70 | $ assert end of string (or line, in multiline mode) | |
71 | . match any character except newline (by default) | |
72 | [ start character class definition | |
73 | | start of alternative branch | |
74 | ( start subpattern | |
75 | ) end subpattern | |
76 | ? extends the meaning of ( | |
77 | also 0 or 1 quantifier | |
78 | also quantifier minimizer | |
79 | * 0 or more quantifier | |
80 | + 1 or more quantifier | |
81 | also "possessive quantifier" | |
82 | { start min/max quantifier | |
83 | ||
8ac170f3 | 84 | Part of a pattern that is in square brackets is called a "character |
495ae4b0 PH |
85 | class". In a character class the only metacharacters are: |
86 | ||
87 | \ general escape character | |
88 | ^ negate the class, but only if the first character | |
89 | - indicates character range | |
90 | [ POSIX character class (only if followed by POSIX | |
91 | syntax) | |
92 | ] terminates the character class | |
93 | ||
8ac170f3 | 94 | The following sections describe the use of each of the metacharacters. |
495ae4b0 PH |
95 | |
96 | ||
97 | BACKSLASH | |
98 | ||
99 | The backslash character has several uses. Firstly, if it is followed by | |
8ac170f3 PH |
100 | a non-alphanumeric character, it takes away any special meaning that |
101 | character may have. This use of backslash as an escape character | |
495ae4b0 PH |
102 | applies both inside and outside character classes. |
103 | ||
8ac170f3 PH |
104 | For example, if you want to match a * character, you write \* in the |
105 | pattern. This escaping action applies whether or not the following | |
106 | character would otherwise be interpreted as a metacharacter, so it is | |
107 | always safe to precede a non-alphanumeric with backslash to specify | |
108 | that it stands for itself. In particular, if you want to match a back- | |
495ae4b0 PH |
109 | slash, you write \\. |
110 | ||
8ac170f3 PH |
111 | If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
112 | the pattern (other than in a character class) and characters between a | |
495ae4b0 | 113 | # outside a character class and the next newline character are ignored. |
8ac170f3 | 114 | An escaping backslash can be used to include a whitespace or # charac- |
495ae4b0 PH |
115 | ter as part of the pattern. |
116 | ||
8ac170f3 PH |
117 | If you want to remove the special meaning from a sequence of charac- |
118 | ters, you can do so by putting them between \Q and \E. This is differ- | |
119 | ent from Perl in that $ and @ are handled as literals in \Q...\E | |
120 | sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- | |
495ae4b0 PH |
121 | tion. Note the following examples: |
122 | ||
123 | Pattern PCRE matches Perl matches | |
124 | ||
125 | \Qabc$xyz\E abc$xyz abc followed by the | |
126 | contents of $xyz | |
127 | \Qabc\$xyz\E abc\$xyz abc\$xyz | |
128 | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz | |
129 | ||
8ac170f3 | 130 | The \Q...\E sequence is recognized both inside and outside character |
495ae4b0 PH |
131 | classes. |
132 | ||
133 | Non-printing characters | |
134 | ||
135 | A second use of backslash provides a way of encoding non-printing char- | |
8ac170f3 PH |
136 | acters in patterns in a visible manner. There is no restriction on the |
137 | appearance of non-printing characters, apart from the binary zero that | |
138 | terminates a pattern, but when a pattern is being prepared by text | |
139 | editing, it is usually easier to use one of the following escape | |
495ae4b0 PH |
140 | sequences than the binary character it represents: |
141 | ||
142 | \a alarm, that is, the BEL character (hex 07) | |
143 | \cx "control-x", where x is any character | |
144 | \e escape (hex 1B) | |
145 | \f formfeed (hex 0C) | |
146 | \n newline (hex 0A) | |
147 | \r carriage return (hex 0D) | |
148 | \t tab (hex 09) | |
149 | \ddd character with octal code ddd, or backreference | |
150 | \xhh character with hex code hh | |
151 | \x{hhh..} character with hex code hhh... (UTF-8 mode only) | |
152 | ||
8ac170f3 PH |
153 | The precise effect of \cx is as follows: if x is a lower case letter, |
154 | it is converted to upper case. Then bit 6 of the character (hex 40) is | |
155 | inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; | |
495ae4b0 PH |
156 | becomes hex 7B. |
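
The \cx rule above can be sketched in a few lines of Python. This is only a model of the two documented steps (upper-casing, then inverting bit 6); the helper name control_escape is hypothetical and is not part of PCRE.

```python
def control_escape(x: str) -> int:
    # Hypothetical helper: models the documented \cx rule, not a PCRE API.
    code = ord(x)
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


for ch in ("z", "{", ";"):
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

The three calls reproduce the values quoted in the text: hex 1A, hex 3B, and hex 7B.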
157 | ||
8ac170f3 PH |
158 | After \x, from zero to two hexadecimal digits are read (letters can be |
159 | in upper or lower case). In UTF-8 mode, any number of hexadecimal dig- | |
160 | its may appear between \x{ and }, but the value of the character code | |
161 | must be less than 2**31 (that is, the maximum hexadecimal value is | |
162 | 7FFFFFFF). If characters other than hexadecimal digits appear between | |
163 | \x{ and }, or if there is no terminating }, this form of escape is not | |
164 | recognized. Instead, the initial \x will be interpreted as a basic | |
165 | hexadecimal escape, with no following digits, giving a character whose | |
495ae4b0 PH |
166 | value is zero. |
167 | ||
168 | Characters whose value is less than 256 can be defined by either of the | |
8ac170f3 PH |
169 | two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference |
170 | in the way they are handled. For example, \xdc is exactly the same as | |
495ae4b0 PH |
171 | \x{dc}. |
172 | ||
8ac170f3 PH |
173 | After \0 up to two further octal digits are read. In both cases, if |
174 | there are fewer than two digits, just those that are present are used. | |
175 | Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL | |
176 | character (code value 7). Make sure you supply two digits after the | |
177 | initial zero if the pattern character that follows is itself an octal | |
495ae4b0 PH |
178 | digit. |
179 | ||
180 | The handling of a backslash followed by a digit other than 0 is compli- | |
181 | cated. Outside a character class, PCRE reads it and any following dig- | |
8ac170f3 | 182 | its as a decimal number. If the number is less than 10, or if there |
495ae4b0 | 183 | have been at least that many previous capturing left parentheses in the |
8ac170f3 PH |
184 | expression, the entire sequence is taken as a back reference. A |
185 | description of how this works is given later, following the discussion | |
495ae4b0 PH |
186 | of parenthesized subpatterns. |
187 | ||
8ac170f3 PH |
188 | Inside a character class, or if the decimal number is greater than 9 |
189 | and there have not been that many capturing subpatterns, PCRE re-reads | |
190 | up to three octal digits following the backslash, and generates a sin- | |
495ae4b0 PH |
191 | gle byte from the least significant 8 bits of the value. Any subsequent |
192 | digits stand for themselves. For example: | |
193 | ||
194 | \040 is another way of writing a space | |
195 | \40 is the same, provided there are fewer than 40 | |
196 | previous capturing subpatterns | |
197 | \7 is always a back reference | |
198 | \11 might be a back reference, or another way of | |
199 | writing a tab | |
200 | \011 is always a tab | |
201 | \0113 is a tab followed by the character "3" | |
202 | \113 might be a back reference, otherwise the | |
203 | character with octal code 113 | |
204 | \377 might be a back reference, otherwise | |
205 | the byte consisting entirely of 1 bits | |
206 | \81 is either a back reference, or a binary zero | |
207 | followed by the two characters "8" and "1" | |
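
The decision procedure behind the examples above can be modelled as follows. This is a sketch of the rule as this page states it, not PCRE's actual parser; the function name, its parameters, and the tuple it returns are assumptions made for illustration. It covers only escapes whose first digit is not 0.

```python
def classify_digit_escape(digits, groups_so_far, in_class=False):
    # Hypothetical model of the documented rule. `digits` is the run of
    # digits after the backslash (first digit assumed non-zero) and
    # `groups_so_far` is the number of previous capturing parentheses.
    number = int(digits)  # the digits are first read as a decimal number
    if not in_class and (number < 10 or number <= groups_so_far):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    value = 0
    used = 0
    while used < 3 and used < len(digits) and digits[used] in "01234567":
        value = value * 8 + int(digits[used])
        used += 1
    # Any digits that were not consumed stand for themselves.
    return ("byte", value & 0xFF, digits[used:])


for digits, groups in (("40", 0), ("7", 0), ("11", 0), ("11", 11), ("81", 0)):
    print(digits, groups, classify_digit_escape(digits, groups))
```

With no previous capturing subpatterns, "40" yields the space character (32) and "81" yields a binary zero followed by the literal digits "81", as in the table above.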
208 | ||
8ac170f3 | 209 | Note that octal values of 100 or greater must not be introduced by a |
495ae4b0 PH |
210 | leading zero, because no more than three octal digits are ever read. |
211 | ||
8ac170f3 | 212 | All the sequences that define a single byte value or a single UTF-8 |
495ae4b0 | 213 | character (in UTF-8 mode) can be used both inside and outside character |
8ac170f3 | 214 | classes. In addition, inside a character class, the sequence \b is |
495ae4b0 | 215 | interpreted as the backspace character (hex 08), and the sequence \X is |
8ac170f3 | 216 | interpreted as the character "X". Outside a character class, these |
495ae4b0 PH |
217 | sequences have different meanings (see below). |
218 | ||
219 | Generic character types | |
220 | ||
8ac170f3 | 221 | The third use of backslash is for specifying generic character types. |
495ae4b0 PH |
222 | The following are always recognized: |
223 | ||
224 | \d any decimal digit | |
225 | \D any character that is not a decimal digit | |
226 | \s any whitespace character | |
227 | \S any character that is not a whitespace character | |
228 | \w any "word" character | |
229 | \W any "non-word" character | |
230 | ||
231 | Each pair of escape sequences partitions the complete set of characters | |
8ac170f3 | 232 | into two disjoint sets. Any given character matches one, and only one, |
495ae4b0 PH |
233 | of each pair. |
234 | ||
235 | These character type sequences can appear both inside and outside char- | |
8ac170f3 PH |
236 | acter classes. They each match one character of the appropriate type. |
237 | If the current matching point is at the end of the subject string, all | |
495ae4b0 PH |
238 | of them fail, since there is no character to match. |
239 | ||
8ac170f3 PH |
240 | For compatibility with Perl, \s does not match the VT character (code |
241 | 11). This makes it different from the POSIX "space" class. The \s | |
495ae4b0 PH |
242 | characters are HT (9), LF (10), FF (12), CR (13), and space (32). |
243 | ||
244 | A "word" character is an underscore or any character less than 256 that | |
8ac170f3 PH |
245 | is a letter or digit. The definition of letters and digits is con- |
246 | trolled by PCRE's low-valued character tables, and may vary if locale- | |
247 | specific matching is taking place (see "Locale support" in the pcreapi | |
248 | page). For example, in the "fr_FR" (French) locale, some character | |
249 | codes greater than 128 are used for accented letters, and these are | |
495ae4b0 PH |
250 | matched by \w. |
251 | ||
8ac170f3 | 252 | In UTF-8 mode, characters with values greater than 128 never match \d, |
495ae4b0 PH |
253 | \s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
254 | code character property support is available. | |
255 | ||
256 | Unicode character properties | |
257 | ||
258 | When PCRE is built with Unicode character property support, three addi- | |
8ac170f3 | 259 | tional escape sequences to match generic character types are available |
495ae4b0 PH |
260 | when UTF-8 mode is selected. They are: |
261 | ||
262 | \p{xx} a character with the xx property | |
263 | \P{xx} a character without the xx property | |
264 | \X an extended Unicode sequence | |
265 | ||
8ac170f3 PH |
266 | The property names represented by xx above are limited to the Unicode |
267 | general category properties. Each character has exactly one such prop- | |
268 | erty, specified by a two-letter abbreviation. For compatibility with | |
269 | Perl, negation can be specified by including a circumflex between the | |
270 | opening brace and the property name. For example, \p{^Lu} is the same | |
495ae4b0 PH |
271 | as \P{Lu}. |
272 | ||
8ac170f3 | 273 | If only one letter is specified with \p or \P, it includes all the |
495ae4b0 PH |
274 | properties that start with that letter. In this case, in the absence of |
275 | negation, the curly brackets in the escape sequence are optional; these | |
276 | two examples have the same effect: | |
277 | ||
278 | \p{L} | |
279 | \pL | |
280 | ||
281 | The following property codes are supported: | |
282 | ||
283 | C Other | |
284 | Cc Control | |
285 | Cf Format | |
286 | Cn Unassigned | |
287 | Co Private use | |
288 | Cs Surrogate | |
289 | ||
290 | L Letter | |
291 | Ll Lower case letter | |
292 | Lm Modifier letter | |
293 | Lo Other letter | |
294 | Lt Title case letter | |
295 | Lu Upper case letter | |
296 | ||
297 | M Mark | |
298 | Mc Spacing mark | |
299 | Me Enclosing mark | |
300 | Mn Non-spacing mark | |
301 | ||
302 | N Number | |
303 | Nd Decimal number | |
304 | Nl Letter number | |
305 | No Other number | |
306 | ||
307 | P Punctuation | |
308 | Pc Connector punctuation | |
309 | Pd Dash punctuation | |
310 | Pe Close punctuation | |
311 | Pf Final punctuation | |
312 | Pi Initial punctuation | |
313 | Po Other punctuation | |
314 | Ps Open punctuation | |
315 | ||
316 | S Symbol | |
317 | Sc Currency symbol | |
318 | Sk Modifier symbol | |
319 | Sm Mathematical symbol | |
320 | So Other symbol | |
321 | ||
322 | Z Separator | |
323 | Zl Line separator | |
324 | Zp Paragraph separator | |
325 | Zs Space separator | |
326 | ||
8ac170f3 | 327 | Extended properties such as "Greek" or "InMusicalSymbols" are not sup- |
495ae4b0 PH |
328 | ported by PCRE. |
329 | ||
8ac170f3 | 330 | Specifying caseless matching does not affect these escape sequences. |
495ae4b0 PH |
331 | For example, \p{Lu} always matches only upper case letters. |
332 | ||
8ac170f3 | 333 | The \X escape matches any number of Unicode characters that form an |
495ae4b0 PH |
334 | extended Unicode sequence. \X is equivalent to |
335 | ||
336 | (?>\PM\pM*) | |
337 | ||
8ac170f3 PH |
338 | That is, it matches a character without the "mark" property, followed |
339 | by zero or more characters with the "mark" property, and treats the | |
340 | sequence as an atomic group (see below). Characters with the "mark" | |
495ae4b0 PH |
341 | property are typically accents that affect the preceding character. |
342 | ||
8ac170f3 PH |
343 | Matching characters by Unicode property is not fast, because PCRE has |
344 | to search a structure that contains data for over fifteen thousand | |
495ae4b0 PH |
345 | characters. That is why the traditional escape sequences such as \d and |
346 | \w do not use Unicode properties in PCRE. | |
347 | ||
348 | Simple assertions | |
349 | ||
350 | The fourth use of backslash is for certain simple assertions. An asser- | |
8ac170f3 PH |
351 | tion specifies a condition that has to be met at a particular point in |
352 | a match, without consuming any characters from the subject string. The | |
353 | use of subpatterns for more complicated assertions is described below. | |
495ae4b0 PH |
354 | The backslashed assertions are: |
355 | ||
356 | \b matches at a word boundary | |
357 | \B matches when not at a word boundary | |
358 | \A matches at start of subject | |
359 | \Z matches at end of subject or before newline at end | |
360 | \z matches at end of subject | |
361 | \G matches at first matching position in subject | |
362 | ||
8ac170f3 | 363 | These assertions may not appear in character classes (but note that \b |
495ae4b0 PH |
364 | has a different meaning, namely the backspace character, inside a char- |
365 | acter class). | |
366 | ||
8ac170f3 PH |
367 | A word boundary is a position in the subject string where the current |
368 | character and the previous character do not both match \w or \W (i.e. | |
369 | one matches \w and the other matches \W), or the start or end of the | |
495ae4b0 PH |
370 | string if the first or last character matches \w, respectively. |
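
The word boundary definition above can be checked with a small model. The sketch below assumes an ASCII-only notion of \w (letters, digits, and underscore) and compares the result with the \b of Python's re module, which is used here only as a stand-in and is not PCRE. The helper names are hypothetical.

```python
import re


def is_word(ch):
    # ASCII approximation of \w: a letter, a digit, or an underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")


def word_boundaries(subject):
    # A position is a boundary when exactly one of its two neighbours is a
    # word character; a missing neighbour (start or end of the string)
    # counts as a non-word character.
    positions = []
    for i in range(len(subject) + 1):
        before = i > 0 and is_word(subject[i - 1])
        after = i < len(subject) and is_word(subject[i])
        if before != after:
            positions.append(i)
    return positions


subject = "ab cd"
print(word_boundaries(subject))
print([m.start() for m in re.finditer(r"\b", subject, re.ASCII)])
```

Both lines print the same list of offsets for this subject, which suggests that the model and the stand-in engine agree on the definition.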
371 | ||
8ac170f3 | 372 | The \A, \Z, and \z assertions differ from the traditional circumflex |
495ae4b0 | 373 | and dollar (described in the next section) in that they only ever match |
8ac170f3 PH |
374 | at the very start and end of the subject string, whatever options are |
375 | set. Thus, they are independent of multiline mode. These three asser- | |
495ae4b0 | 376 | tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
8ac170f3 PH |
377 | affect only the behaviour of the circumflex and dollar metacharacters. |
378 | However, if the startoffset argument of pcre_exec() is non-zero, indi- | |
495ae4b0 | 379 | cating that matching is to start at a point other than the beginning of |
8ac170f3 PH |
380 | the subject, \A can never match. The difference between \Z and \z is |
381 | that \Z matches before a newline that is the last character of the | |
382 | string as well as at the end of the string, whereas \z matches only at | |
495ae4b0 PH |
383 | the end. |
384 | ||
8ac170f3 PH |
385 | The \G assertion is true only when the current matching position is at |
386 | the start point of the match, as specified by the startoffset argument | |
387 | of pcre_exec(). It differs from \A when the value of startoffset is | |
388 | non-zero. By calling pcre_exec() multiple times with appropriate argu- | |
495ae4b0 PH |
389 | ments, you can mimic Perl's /g option, and it is in this kind of imple- |
390 | mentation where \G can be useful. | |
391 | ||
8ac170f3 | 392 | Note, however, that PCRE's interpretation of \G, as the start of the |
495ae4b0 | 393 | current match, is subtly different from Perl's, which defines it as the |
8ac170f3 PH |
394 | end of the previous match. In Perl, these can be different when the |
395 | previously matched string was empty. Because PCRE does just one match | |
495ae4b0 PH |
396 | at a time, it cannot reproduce this behaviour. |
397 | ||
8ac170f3 | 398 | If all the alternatives of a pattern begin with \G, the expression is |
495ae4b0 PH |
399 | anchored to the starting match position, and the "anchored" flag is set |
400 | in the compiled regular expression. | |
401 | ||
402 | ||
403 | CIRCUMFLEX AND DOLLAR | |
404 | ||
405 | Outside a character class, in the default matching mode, the circumflex | |
8ac170f3 PH |
406 | character is an assertion that is true only if the current matching |
407 | point is at the start of the subject string. If the startoffset argu- | |
408 | ment of pcre_exec() is non-zero, circumflex can never match if the | |
409 | PCRE_MULTILINE option is unset. Inside a character class, circumflex | |
495ae4b0 PH |
410 | has an entirely different meaning (see below). |
411 | ||
8ac170f3 PH |
412 | Circumflex need not be the first character of the pattern if a number |
413 | of alternatives are involved, but it should be the first thing in each | |
414 | alternative in which it appears if the pattern is ever to match that | |
415 | branch. If all possible alternatives start with a circumflex, that is, | |
416 | if the pattern is constrained to match only at the start of the sub- | |
417 | ject, it is said to be an "anchored" pattern. (There are also other | |
495ae4b0 PH |
418 | constructs that can cause a pattern to be anchored.) |
419 | ||
8ac170f3 PH |
420 | A dollar character is an assertion that is true only if the current |
421 | matching point is at the end of the subject string, or immediately | |
495ae4b0 | 422 | before a newline character that is the last character in the string (by |
8ac170f3 PH |
423 | default). Dollar need not be the last character of the pattern if a |
424 | number of alternatives are involved, but it should be the last item in | |
425 | any branch in which it appears. Dollar has no special meaning in a | |
495ae4b0 PH |
426 | character class. |
427 | ||
8ac170f3 PH |
428 | The meaning of dollar can be changed so that it matches only at the |
429 | very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at | |
495ae4b0 PH |
430 | compile time. This does not affect the \Z assertion. |
431 | ||
432 | The meanings of the circumflex and dollar characters are changed if the | |
433 | PCRE_MULTILINE option is set. When this is the case, they match immedi- | |
8ac170f3 PH |
434 | ately after and immediately before an internal newline character, |
435 | respectively, in addition to matching at the start and end of the sub- | |
436 | ject string. For example, the pattern /^abc$/ matches the subject | |
437 | string "def\nabc" (where \n represents a newline character) in multi- | |
495ae4b0 | 438 | line mode, but not otherwise. Consequently, patterns that are anchored |
8ac170f3 PH |
439 | in single line mode because all branches start with ^ are not anchored |
440 | in multiline mode, and a match for circumflex is possible when the | |
441 | startoffset argument of pcre_exec() is non-zero. The PCRE_DOL- | |
495ae4b0 PH |
442 | LAR_ENDONLY option is ignored if PCRE_MULTILINE is set. |
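
The effect of multiline mode on circumflex and dollar can be illustrated with the example from the text. The sketch uses Python's re module as a stand-in, on the assumption that its re.MULTILINE flag behaves like PCRE_MULTILINE for this pattern; it does not call PCRE itself.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: $ also matches just before a newline that ends the subject.
trailing = re.search(r"abc$", "abc\n")

print(single)
print(multi.span())
print(trailing.span())
```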
443 | ||
8ac170f3 PH |
444 | Note that the sequences \A, \Z, and \z can be used to match the start |
445 | and end of the subject in both modes, and if all branches of a pattern | |
446 | start with \A it is always anchored, whether PCRE_MULTILINE is set or | |
495ae4b0 PH |
447 | not. |
448 | ||
449 | ||
450 | FULL STOP (PERIOD, DOT) | |
451 | ||
452 | Outside a character class, a dot in the pattern matches any one charac- | |
8ac170f3 PH |
453 | ter in the subject, including a non-printing character, but not (by |
454 | default) newline. In UTF-8 mode, a dot matches any UTF-8 character, | |
495ae4b0 | 455 | which might be more than one byte long, except (by default) newline. If |
8ac170f3 PH |
456 | the PCRE_DOTALL option is set, dots match newlines as well. The han- |
457 | dling of dot is entirely independent of the handling of circumflex and | |
458 | dollar, the only relationship being that they both involve newline | |
495ae4b0 PH |
459 | characters. Dot has no special meaning in a character class. |
460 | ||
461 | ||
462 | MATCHING A SINGLE BYTE | |
463 | ||
464 | Outside a character class, the escape sequence \C matches any one byte, | |
8ac170f3 PH |
465 | both in and out of UTF-8 mode. Unlike a dot, it can match a newline. |
466 | The feature is provided in Perl in order to match individual bytes in | |
467 | UTF-8 mode. Because it breaks up UTF-8 characters into individual | |
468 | bytes, what remains in the string may be a malformed UTF-8 string. For | |
495ae4b0 PH |
469 | this reason, the \C escape sequence is best avoided. |
470 | ||
8ac170f3 PH |
471 | PCRE does not allow \C to appear in lookbehind assertions (described |
472 | below), because in UTF-8 mode this would make it impossible to calcu- | |
495ae4b0 PH |
473 | late the length of the lookbehind. |
474 | ||
475 | ||
476 | SQUARE BRACKETS AND CHARACTER CLASSES | |
477 | ||
478 | An opening square bracket introduces a character class, terminated by a | |
479 | closing square bracket. A closing square bracket on its own is not spe- | |
480 | cial. If a closing square bracket is required as a member of the class, | |
8ac170f3 | 481 | it should be the first data character in the class (after an initial |
495ae4b0 PH |
482 | circumflex, if present) or escaped with a backslash. |
483 | ||
8ac170f3 PH |
484 | A character class matches a single character in the subject. In UTF-8 |
485 | mode, the character may occupy more than one byte. A matched character | |
495ae4b0 | 486 | must be in the set of characters defined by the class, unless the first |
8ac170f3 PH |
487 | character in the class definition is a circumflex, in which case the |
488 | subject character must not be in the set defined by the class. If a | |
489 | circumflex is actually required as a member of the class, ensure it is | |
495ae4b0 PH |
490 | not the first character, or escape it with a backslash. |
491 | ||
8ac170f3 PH |
492 | For example, the character class [aeiou] matches any lower case vowel, |
493 | while [^aeiou] matches any character that is not a lower case vowel. | |
495ae4b0 | 494 | Note that a circumflex is just a convenient notation for specifying the |
8ac170f3 PH |
495 | characters that are in the class by enumerating those that are not. A |
496 | class that starts with a circumflex is not an assertion: it still con- | |
497 | sumes a character from the subject string, and therefore it fails if | |
495ae4b0 PH |
498 | the current pointer is at the end of the string. |
499 | ||
8ac170f3 PH |
500 | In UTF-8 mode, characters with values greater than 255 can be included |
501 | in a class as a literal string of bytes, or by using the \x{ escaping | |
495ae4b0 PH |
502 | mechanism. |
503 | ||
8ac170f3 PH |
504 | When caseless matching is set, any letters in a class represent both |
505 | their upper case and lower case versions, so for example, a caseless | |
506 | [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not | |
507 | match "A", whereas a caseful version would. In UTF-8 mode, PCRE always | |
508 | understands the concept of case for characters whose values are less | |
509 | than 128, so caseless matching is always possible. For characters with | |
510 | higher values, the concept of case is supported if PCRE is compiled | |
511 | with Unicode property support, but not otherwise. If you want to use | |
512 | caseless matching for characters 128 and above, you must ensure that | |
513 | PCRE is compiled with Unicode property support as well as with UTF-8 | |
514 | support. | |
495ae4b0 PH |
515 | |
516 | The newline character is never treated in any special way in character | |
517 | classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE | |
518 | options is. A class such as [^a] will always match a newline. | |
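
A short check of the points above: a negated class matches a newline, a dot does not unless the "dot all" option is set, and a class always needs a character to consume. Python's re module is used as a stand-in, with re.DOTALL assumed to correspond to PCRE_DOTALL.

```python
import re

# A negated class treats newline like any other character.
class_matches_newline = re.fullmatch(r"[^a]", "\n") is not None

# A dot does not match newline by default, but does with DOTALL.
dot_default = re.fullmatch(r".", "\n") is not None
dot_dotall = re.fullmatch(r".", "\n", re.DOTALL) is not None

# A negated class is not an assertion: it fails at the end of the string.
class_at_end = re.search(r"x[^a]", "x") is not None

print(class_matches_newline, dot_default, dot_dotall, class_at_end)
```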
519 | ||
520 | The minus (hyphen) character can be used to specify a range of charac- | |
521 | ters in a character class. For example, [d-m] matches any letter | |
522 | between d and m, inclusive. If a minus character is required in a | |
523 | class, it must be escaped with a backslash or appear in a position | |
524 | where it cannot be interpreted as indicating a range, typically as the | |
525 | first or last character in the class. | |
526 | ||
527 | It is not possible to have the literal character "]" as the end charac- | |
528 | ter of a range. A pattern such as [W-]46] is interpreted as a class of | |
529 | two characters ("W" and "-") followed by a literal string "46]", so it | |
530 | would match "W46]" or "-46]". However, if the "]" is escaped with a | |
531 | backslash it is interpreted as the end of range, so [W-\]46] is inter- | |
532 | preted as a class containing a range followed by two other characters. | |
533 | The octal or hexadecimal representation of "]" can also be used to end | |
534 | a range. | |
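
The two readings of "]" described above can be tried out as follows. Python's re module is used as a stand-in here; it happens to parse these two patterns the same way as described, but it is not PCRE.

```python
import re

# Unescaped "]": the class is just "W" and "-", followed by literal "46]".
literal_tail = re.compile(r"[W-]46]")

# Escaped "]": the class holds the range W to ], plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

print([s for s in ("W46]", "-46]", "X46]") if literal_tail.fullmatch(s)])
print([s for s in ("W", "X", "]", "4", "6", "-") if range_class.fullmatch(s)])
```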
535 | ||
536 | Ranges operate in the collating sequence of character values. They can | |
537 | also be used for characters specified numerically, for example | |
538 | [\000-\037]. In UTF-8 mode, ranges can include characters whose values | |
539 | are greater than 255, for example [\x{100}-\x{2ff}]. | |
540 | ||
541 | If a range that includes letters is used when caseless matching is set, | |
542 | it matches the letters in either case. For example, [W-c] is equivalent | |
543 | to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if | |
544 | character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches | |
545 | accented E characters in both cases. In UTF-8 mode, PCRE supports the | |
546 | concept of case for characters with values greater than 128 only when | |
547 | it is compiled with Unicode property support. | |
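
The caseless range [W-c] from the paragraph above can be checked character by character. The sketch uses Python's re module with re.IGNORECASE and re.ASCII as a stand-in for a caseless, non-UTF-8 PCRE match; that correspondence is an assumption.

```python
import re

caseless = re.compile(r"[W-c]", re.IGNORECASE | re.ASCII)

# Every character of the equivalent class given in the text, in both cases.
expected = "][\\^_`wxyzabc" + "WXYZABC"
matched = [ch for ch in expected if caseless.fullmatch(ch)]

# Letters outside the range should be rejected in either case.
rejected = [ch for ch in "dDvV" if caseless.fullmatch(ch)]

print(len(matched) == len(expected), rejected)
```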
548 | ||
549 | The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear | |
550 | in a character class, and add the characters that they match to the | |
551 | class. For example, [\dABCDEF] matches any hexadecimal digit. A circum- | |
552 | flex can conveniently be used with the upper case character types to | |
553 | specify a more restricted set of characters than the matching lower | |
554 | case type. For example, the class [^\W_] matches any letter or digit, | |
555 | but not underscore. | |
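
The two classes mentioned above behave as follows in a quick test. Python's re module with the re.ASCII flag stands in for PCRE here, on the assumption that \d and \W cover the same ASCII characters in both engines.

```python
import re

# A character type adds its characters to the class.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

# Negating \W together with "_" leaves letters and digits only.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

print([c for c in "7CGc" if hex_digit.fullmatch(c)])
print([c for c in "a5_-" if letter_or_digit.fullmatch(c)])
```

Note that [\dABCDEF] is case-sensitive, so a lower case "c" is rejected unless caseless matching is set.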
556 | ||
557 | The only metacharacters that are recognized in character classes are | |
558 | backslash, hyphen (only where it can be interpreted as specifying a | |
559 | range), circumflex (only at the start), opening square bracket (only | |
560 | when it can be interpreted as introducing a POSIX class name - see the | |
561 | next section), and the terminating closing square bracket. However, | |
562 | escaping other non-alphanumeric characters does no harm. | |
563 | ||
564 | ||
565 | POSIX CHARACTER CLASSES | |
566 | ||
567 | Perl supports the POSIX notation for character classes. This uses names | |
568 | enclosed by [: and :] within the enclosing square brackets. PCRE also | |
569 | supports this notation. For example, | |
570 | ||
571 | [01[:alpha:]%] | |
572 | ||
573 | matches "0", "1", any alphabetic character, or "%". The supported class | |
574 | names are | |
575 | ||
576 | alnum letters and digits | |
577 | alpha letters | |
578 | ascii character codes 0 - 127 | |
579 | blank space or tab only | |
580 | cntrl control characters | |
581 | digit decimal digits (same as \d) | |
582 | graph printing characters, excluding space | |
583 | lower lower case letters | |
584 | print printing characters, including space | |
585 | punct printing characters, excluding letters and digits | |
586 | space white space (not quite the same as \s) | |
587 | upper upper case letters | |
588 | word "word" characters (same as \w) | |
589 | xdigit hexadecimal digits | |
590 | ||
591 | The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), | |
592 | and space (32). Notice that this list includes the VT character (code | |
593 | 11). This makes "space" different to \s, which does not include VT (for | |
594 | Perl compatibility). | |
595 | ||
596 | The name "word" is a Perl extension, and "blank" is a GNU extension | |
597 | from Perl 5.8. Another Perl extension is negation, which is indicated | |
598 | by a ^ character after the colon. For example, | |
599 | ||
600 | [12[:^digit:]] | |
601 | ||
602 | matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the | |
603 | POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but | |
604 | these are not supported, and an error is given if they are encountered. | |
605 | ||
606 | In UTF-8 mode, characters with values greater than 128 do not match any | |
607 | of the POSIX character classes. | |
608 | ||
609 | ||
610 | VERTICAL BAR | |
611 | ||
612 | Vertical bar characters are used to separate alternative patterns. For | |
613 | example, the pattern | |
614 | ||
615 | gilbert|sullivan | |
616 | ||
617 | matches either "gilbert" or "sullivan". Any number of alternatives may | |
618 | appear, and an empty alternative is permitted (matching the empty | |
619 | string). The matching process tries each alternative in turn, from | |
620 | left to right, and the first one that succeeds is used. If the alterna- | |
621 | tives are within a subpattern (defined below), "succeeds" means match- | |
622 | ing the rest of the main pattern as well as the alternative in the sub- | |
623 | pattern. | |
624 | ||
625 | ||
626 | INTERNAL OPTION SETTING | |
627 | ||
628 | The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and | |
629 | PCRE_EXTENDED options can be changed from within the pattern by a | |
630 | sequence of Perl option letters enclosed between "(?" and ")". The | |
631 | option letters are | |
632 | ||
633 | i for PCRE_CASELESS | |
634 | m for PCRE_MULTILINE | |
635 | s for PCRE_DOTALL | |
636 | x for PCRE_EXTENDED | |
637 | ||
638 | For example, (?im) sets caseless, multiline matching. It is also possi- | |
639 | ble to unset these options by preceding the letter with a hyphen, and a | |
640 | combined setting and unsetting such as (?im-sx), which sets | |
641 | PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and | |
642 | PCRE_EXTENDED, is also permitted. If a letter appears both before and | |
643 | after the hyphen, the option is unset. | |
644 | ||
645 | When an option change occurs at top level (that is, not inside subpat- | |
646 | tern parentheses), the change applies to the remainder of the pattern | |
647 | that follows. If the change is placed right at the start of a pattern, | |
648 | PCRE extracts it into the global options (and it will therefore show up | |
649 | in data extracted by the pcre_fullinfo() function). | |
650 | ||
651 | An option change within a subpattern affects only that part of the cur- | |
652 | rent pattern that follows it, so | |
653 | ||
654 | (a(?i)b)c | |
655 | ||
656 | matches abc and aBc and no other strings (assuming PCRE_CASELESS is not | |
657 | used). By this means, options can be made to have different settings | |
658 | in different parts of the pattern. Any changes made in one alternative | |
659 | do carry on into subsequent branches within the same subpattern. For | |
660 | example, | |
661 | ||
662 | (a(?i)b|c) | |
663 | ||
664 | matches "ab", "aB", "c", and "C", even though when matching "C" the | |
665 | first branch is abandoned before the option setting. This is because | |
666 | the effects of option settings happen at compile time. There would be | |
667 | some very weird behaviour otherwise. | |
668 | ||
669 | The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed | |
670 | in the same way as the Perl-compatible options by using the characters | |
671 | U and X respectively. The (?X) flag setting is special in that it must | |
672 | always occur earlier in the pattern than any of the additional features | |
673 | it turns on, even when it is at top level. It is best to put it at the | |
674 | start. | |
675 | ||
676 | ||
677 | SUBPATTERNS | |
678 | ||
679 | Subpatterns are delimited by parentheses (round brackets), which can be | |
680 | nested. Turning part of a pattern into a subpattern does two things: | |
681 | ||
682 | 1. It localizes a set of alternatives. For example, the pattern | |
683 | ||
684 | cat(aract|erpillar|) | |
685 | ||
686 | matches one of the words "cat", "cataract", or "caterpillar". Without | |
687 | the parentheses, it would match "cataract", "erpillar" or the empty | |
688 | string. | |
689 | ||
690 | 2. It sets up the subpattern as a capturing subpattern. This means | |
691 | that, when the whole pattern matches, that portion of the subject | |
692 | string that matched the subpattern is passed back to the caller via the | |
693 | ovector argument of pcre_exec(). Opening parentheses are counted from | |
694 | left to right (starting from 1) to obtain numbers for the capturing | |
695 | subpatterns. | |
696 | ||
697 | For example, if the string "the red king" is matched against the pat- | |
698 | tern | |
699 | ||
700 | the ((red|white) (king|queen)) | |
701 | ||
702 | the captured substrings are "red king", "red", and "king", and are num- | |
703 | bered 1, 2, and 3, respectively. | |
704 | ||
705 | The fact that plain parentheses fulfil two functions is not always | |
706 | helpful. There are often times when a grouping subpattern is required | |
707 | without a capturing requirement. If an opening parenthesis is followed | |
708 | by a question mark and a colon, the subpattern does not do any captur- | |
709 | ing, and is not counted when computing the number of any subsequent | |
710 | capturing subpatterns. For example, if the string "the white queen" is | |
711 | matched against the pattern | |
712 | ||
713 | the ((?:red|white) (king|queen)) | |
714 | ||
715 | the captured substrings are "white queen" and "queen", and are numbered | |
716 | 1 and 2. The maximum number of capturing subpatterns is 65535, and the | |
717 | maximum depth of nesting of all subpatterns, both capturing and non- | |
718 | capturing, is 200. | |
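
The numbering rules above can be checked with a short sketch. It uses Python's re module rather than PCRE itself, on the assumption that the two engines agree for plain and non-capturing parentheses; the variable names are illustrative only.

```python
import re

# Plain parentheses both group and capture; groups are numbered by the
# position of their opening parenthesis, counting from 1.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())   # ('red king', 'red', 'king')

# (?: ... ) groups without capturing, so it is skipped when numbering.
grouping = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
print(grouping.groups())    # ('white queen', 'queen')
```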
719 | ||
720 | As a convenient shorthand, if any option settings are required at the | |
721 | start of a non-capturing subpattern, the option letters may appear | |
722 | between the "?" and the ":". Thus the two patterns | |
723 | ||
724 | (?i:saturday|sunday) | |
725 | (?:(?i)saturday|sunday) | |
726 | ||
727 | match exactly the same set of strings. Because alternative branches are | |
728 | tried from left to right, and options are not reset until the end of | |
729 | the subpattern is reached, an option setting in one branch does affect | |
730 | subsequent branches, so the above patterns match "SUNDAY" as well as | |
731 | "Saturday". | |
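
A minimal sketch of the shorthand form, using Python's re module as a stand-in for PCRE (an assumption). Only the (?i:...) spelling is shown, because recent Python versions do not accept an option setting such as (?i) in the middle of a pattern.

```python
import re

# The option letter between "?" and ":" applies to every branch of the group.
scoped = re.compile(r"(?i:saturday|sunday)")

results = {word: scoped.fullmatch(word) is not None
           for word in ["Saturday", "SUNDAY", "monday"]}
print(results)   # {'Saturday': True, 'SUNDAY': True, 'monday': False}
```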
732 | ||
733 | ||
734 | NAMED SUBPATTERNS | |
735 | ||
736 | Identifying capturing parentheses by number is simple, but it can be | |
737 | very hard to keep track of the numbers in complicated regular expres- | |
738 | sions. Furthermore, if an expression is modified, the numbers may | |
739 | change. To help with this difficulty, PCRE supports the naming of sub- | |
740 | patterns, something that Perl does not provide. The Python syntax | |
741 | (?P<name>...) is used. Names consist of alphanumeric characters and | |
742 | underscores, and must be unique within a pattern. | |
743 | ||
744 | Named capturing parentheses are still allocated numbers as well as | |
745 | names. The PCRE API provides function calls for extracting the name-to- | |
746 | number translation table from a compiled pattern. There is also a con- | |
747 | venience function for extracting a captured substring by name. For fur- | |
748 | ther details see the pcreapi documentation. | |
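
Since the (?P<name>...) syntax comes from Python, Python's re module gives a convenient way to try it out. The sketch below is not the PCRE API; the pattern and the group names "year" and "month" are made up for illustration.

```python
import re

# Each named group still receives a number as well as a name.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
found = date.search("released 2005-08")

print(found.group("year"), found.group(1))   # 2005 2005
print(dict(date.groupindex))                 # {'year': 1, 'month': 2}
```

The groupindex mapping plays the role of the name-to-number translation table mentioned above.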
749 | ||
750 | ||
751 | REPETITION | |
752 | ||
753 | Repetition is specified by quantifiers, which can follow any of the | |
754 | following items: | |
755 | ||
756 | a literal data character | |
757 | the . metacharacter | |
758 | the \C escape sequence | |
759 | the \X escape sequence (in UTF-8 mode with Unicode properties) | |
760 | an escape such as \d that matches a single character | |
761 | a character class | |
762 | a back reference (see next section) | |
763 | a parenthesized subpattern (unless it is an assertion) | |
764 | ||
765 | The general repetition quantifier specifies a minimum and maximum num- | |
766 | ber of permitted matches, by giving the two numbers in curly brackets | |
767 | (braces), separated by a comma. The numbers must be less than 65536, | |
768 | and the first must be less than or equal to the second. For example: | |
769 | ||
770 | z{2,4} | |
771 | ||
772 | matches "zz", "zzz", or "zzzz". A closing brace on its own is not a | |
773 | special character. If the second number is omitted, but the comma is | |
774 | present, there is no upper limit; if the second number and the comma | |
775 | are both omitted, the quantifier specifies an exact number of required | |
776 | matches. Thus | |
777 | ||
778 | [aeiou]{3,} | |
779 | ||
780 | matches at least 3 successive vowels, but may match many more, while | |
781 | ||
782 | \d{8} | |
783 | ||
784 | matches exactly 8 digits. An opening curly bracket that appears in a | |
785 | position where a quantifier is not allowed, or one that does not match | |
786 | the syntax of a quantifier, is taken as a literal character. For exam- | |
787 | ple, {,6} is not a quantifier, but a literal string of four characters. | |
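
The three quantifier forms can be exercised as follows. This is a sketch in Python's re module, assumed here to agree with PCRE for these cases; the helper name "full" is hypothetical. The {,6} case is left out because Python reads an omitted minimum as zero, which differs from the behaviour described above.

```python
import re

def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# {min,max}: between 2 and 4 repeats.
print(full(r"z{2,4}", "zzz"), full(r"z{2,4}", "z"), full(r"z{2,4}", "zzzzz"))
# {min,}: no upper limit.
print(full(r"[aeiou]{3,}", "aeiouaeiou"), full(r"[aeiou]{3,}", "ae"))
# {n}: an exact count.
print(full(r"\d{8}", "20050801"), full(r"\d{8}", "2005080"))
```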
788 | ||
789 | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to | |
790 | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- | |
791 | acters, each of which is represented by a two-byte sequence. Similarly, | |
792 | when Unicode property support is available, \X{3} matches three Unicode | |
793 | extended sequences, each of which may be several bytes long (and they | |
794 | may be of different lengths). | |
795 | ||
796 | The quantifier {0} is permitted, causing the expression to behave as if | |
797 | the previous item and the quantifier were not present. | |
798 | ||
799 | For convenience (and historical compatibility) the three most common | |
800 | quantifiers have single-character abbreviations: | |
801 | ||
802 | * is equivalent to {0,} | |
803 | + is equivalent to {1,} | |
804 | ? is equivalent to {0,1} | |
805 | ||
806 | It is possible to construct infinite loops by following a subpattern | |
807 | that can match no characters with a quantifier that has no upper limit, | |
808 | for example: | |
809 | ||
810 | (a?)* | |
811 | ||
812 | Earlier versions of Perl and PCRE used to give an error at compile time | |
813 | for such patterns. However, because there are cases where this can be | |
814 | useful, such patterns are now accepted, but if any repetition of the | |
815 | subpattern does in fact match no characters, the loop is forcibly bro- | |
816 | ken. | |
817 | ||
818 | By default, the quantifiers are "greedy", that is, they match as much | |
819 | as possible (up to the maximum number of permitted times), without | |
820 | causing the rest of the pattern to fail. The classic example of where | |
821 | this gives problems is in trying to match comments in C programs. These | |
822 | appear between /* and */ and within the comment, individual * and / | |
823 | characters may appear. An attempt to match C comments by applying the | |
824 | pattern | |
825 | ||
826 | /\*.*\*/ | |
827 | ||
828 | to the string | |
829 | ||
830 | /* first comment */ not comment /* second comment */ | |
831 | ||
832 | fails, because it matches the entire string owing to the greediness of | |
833 | the .* item. | |
834 | ||
835 | However, if a quantifier is followed by a question mark, it ceases to | |
836 | be greedy, and instead matches the minimum number of times possible, so | |
837 | the pattern | |
838 | ||
839 | /\*.*?\*/ | |
840 | ||
841 | does the right thing with the C comments. The meaning of the various | |
842 | quantifiers is not otherwise changed, just the preferred number of | |
843 | matches. Do not confuse this use of question mark with its use as a | |
844 | quantifier in its own right. Because it has two uses, it can sometimes | |
845 | appear doubled, as in | |
846 | ||
847 | \d??\d | |
848 | ||
849 | which matches one digit by preference, but can match two if that is the | |
850 | only way the rest of the pattern matches. | |
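
The difference between greedy and minimizing quantifiers can be seen with the C comment example. The sketch uses Python's re module, on the assumption that it behaves like PCRE for these quantifiers.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()
# Minimizing .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()
print(greedy == subject)   # True
print(lazy)                # /* first comment */

# \d??\d prefers one digit, but takes two when the rest requires it.
one = re.match(r"\d??\d", "12").group()
two = re.fullmatch(r"\d??\d", "12").group()
print(one, two)            # 1 12
```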
851 | ||
852 | If the PCRE_UNGREEDY option is set (an option which is not available in | |
853 | Perl), the quantifiers are not greedy by default, but individual ones | |
854 | can be made greedy by following them with a question mark. In other | |
855 | words, it inverts the default behaviour. | |
856 | ||
857 | When a parenthesized subpattern is quantified with a minimum repeat | |
858 | count that is greater than 1 or with a limited maximum, more memory is | |
859 | required for the compiled pattern, in proportion to the size of the | |
860 | minimum or maximum. | |
861 | ||
862 | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- | |
863 | alent to Perl's /s) is set, thus allowing the . to match newlines, the | |
864 | pattern is implicitly anchored, because whatever follows will be tried | |
865 | against every character position in the subject string, so there is no | |
866 | point in retrying the overall match at any position after the first. | |
867 | PCRE normally treats such a pattern as though it were preceded by \A. | |
868 | ||
869 | In cases where it is known that the subject string contains no new- | |
870 | lines, it is worth setting PCRE_DOTALL in order to obtain this opti- | |
871 | mization, or alternatively using ^ to indicate anchoring explicitly. | |
872 | ||
873 | However, there is one situation where the optimization cannot be used. | |
874 | When .* is inside capturing parentheses that are the subject of a | |
875 | backreference elsewhere in the pattern, a match at the start may fail, | |
876 | and a later one succeed. Consider, for example: | |
877 | ||
878 | (.*)abc\1 | |
879 | ||
880 | If the subject is "xyz123abc123" the match point is the fourth charac- | |
881 | ter. For this reason, such a pattern is not implicitly anchored. | |
882 | ||
883 | When a capturing subpattern is repeated, the value captured is the sub- | |
884 | string that matched the final iteration. For example, after | |
885 | ||
886 | (tweedle[dume]{3}\s*)+ | |
887 | ||
888 | has matched "tweedledum tweedledee" the value of the captured substring | |
889 | is "tweedledee". However, if there are nested capturing subpatterns, | |
890 | the corresponding captured values may have been set in previous itera- | |
891 | tions. For example, after | |
892 | ||
893 | /(a|(b))+/ | |
894 | ||
895 | matches "aba" the value of the second captured substring is "b". | |
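
Both examples can be reproduced in Python's re module, which is assumed here to keep captured values across iterations in the same way as PCRE.

```python
import re

# A repeated group reports only what its final iteration matched.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(repeated.group(1))   # tweedledee

# The inner group keeps the value set in an earlier iteration, even
# though the final iteration took the "a" branch.
nested = re.match(r"(a|(b))+", "aba")
print(nested.group(1), nested.group(2))   # a b
```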
896 | ||
897 | ||
898 | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS | |
899 | ||
900 | With both maximizing and minimizing repetition, failure of what follows | |
901 | normally causes the repeated item to be re-evaluated to see if a dif- | |
902 | ferent number of repeats allows the rest of the pattern to match. Some- | |
903 | times it is useful to prevent this, either to change the nature of the | |
904 | match, or to cause it to fail earlier than it otherwise might, when the | |
905 | author of the pattern knows there is no point in carrying on. | |
906 | ||
907 | Consider, for example, the pattern \d+foo when applied to the subject | |
908 | line | |
909 | ||
910 | 123456bar | |
911 | ||
912 | After matching all 6 digits and then failing to match "foo", the normal | |
913 | action of the matcher is to try again with only 5 digits matching the | |
914 | \d+ item, and then with 4, and so on, before ultimately failing. | |
915 | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides | |
916 | the means for specifying that once a subpattern has matched, it is not | |
917 | to be re-evaluated in this way. | |
918 | ||
919 | If we use atomic grouping for the previous example, the matcher would | |
920 | give up immediately on failing to match "foo" the first time. The nota- | |
921 | tion is a kind of special parenthesis, starting with (?> as in this | |
922 | example: | |
923 | ||
924 | (?>\d+)foo | |
925 | ||
926 | This kind of parenthesis "locks up" the part of the pattern it con- | |
927 | tains once it has matched, and a failure further into the pattern is | |
928 | prevented from backtracking into it. Backtracking past it to previous | |
929 | items, however, works as normal. | |
930 | ||
931 | An alternative description is that a subpattern of this type matches | |
932 | the string of characters that an identical standalone pattern would | |
933 | match, if anchored at the current point in the subject string. | |
934 | ||
935 | Atomic grouping subpatterns are not capturing subpatterns. Simple cases | |
936 | such as the above example can be thought of as a maximizing repeat that | |
937 | must swallow everything it can. So, while both \d+ and \d+? are pre- | |
938 | pared to adjust the number of digits they match in order to make the | |
939 | rest of the pattern match, (?>\d+) can only match an entire sequence of | |
940 | digits. | |
941 | ||
942 | Atomic groups in general can of course contain arbitrarily complicated | |
943 | subpatterns, and can be nested. However, when the subpattern for an | |
944 | atomic group is just a single repeated item, as in the example above, a | |
945 | simpler notation, called a "possessive quantifier" can be used. This | |
946 | consists of an additional + character following a quantifier. Using | |
947 | this notation, the previous example can be rewritten as | |
948 | ||
949 | \d++foo | |
950 | ||
951 | Possessive quantifiers are always greedy; the setting of the | |
952 | PCRE_UNGREEDY option is ignored. They are a convenient notation for the | |
953 | simpler forms of atomic group. However, there is no difference in the | |
954 | meaning or processing of a possessive quantifier and the equivalent | |
955 | atomic group. | |
956 | ||
957 | The possessive quantifier syntax is an extension to the Perl syntax. It | |
958 | originates in Sun's Java package. | |
959 | ||
960 | When a pattern contains an unlimited repeat inside a subpattern that | |
961 | can itself be repeated an unlimited number of times, the use of an | |
962 | atomic group is the only way to avoid some failing matches taking a | |
963 | very long time indeed. The pattern | |
964 | ||
965 | (\D+|<\d+>)*[!?] | |
966 | ||
967 | matches an unlimited number of substrings that either consist of non- | |
968 | digits, or digits enclosed in <>, followed by either ! or ?. When it | |
969 | matches, it runs quickly. However, if it is applied to | |
970 | ||
971 | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | |
972 | ||
973 | it takes a long time before reporting failure. This is because the | |
974 | string can be divided between the internal \D+ repeat and the external | |
975 | * repeat in a large number of ways, and all have to be tried. (The | |
976 | example uses [!?] rather than a single character at the end, because | |
977 | both PCRE and Perl have an optimization that allows for fast failure | |
978 | when a single character is used. They remember the last single charac- | |
979 | ter that is required for a match, and fail early if it is not present | |
980 | in the string.) If the pattern is changed so that it uses an atomic | |
981 | group, like this: | |
982 | ||
983 | ((?>\D+)|<\d+>)*[!?] | |
984 | ||
985 | sequences of non-digits cannot be broken, and failure happens quickly. | |
986 | ||
987 | ||
988 | BACK REFERENCES | |
989 | ||
990 | Outside a character class, a backslash followed by a digit greater than | |
991 | 0 (and possibly further digits) is a back reference to a capturing sub- | |
992 | pattern earlier (that is, to its left) in the pattern, provided there | |
993 | have been that many previous capturing left parentheses. | |
994 | ||
995 | However, if the decimal number following the backslash is less than 10, | |
996 | it is always taken as a back reference, and causes an error only if | |
997 | there are not that many capturing left parentheses in the entire pat- | |
998 | tern. In other words, the parentheses that are referenced need not be | |
999 | to the left of the reference for numbers less than 10. See the subsec- | |
1000 | tion entitled "Non-printing characters" above for further details of | |
1001 | the handling of digits following a backslash. | |
1002 | ||
1003 | A back reference matches whatever actually matched the capturing sub- | |
1004 | pattern in the current subject string, rather than anything matching | |
1005 | the subpattern itself (see "Subpatterns as subroutines" below for a way | |
1006 | of doing that). So the pattern | |
1007 | ||
1008 | (sens|respons)e and \1ibility | |
1009 | ||
1010 | matches "sense and sensibility" and "response and responsibility", but | |
1011 | not "sense and responsibility". If caseful matching is in force at the | |
1012 | time of the back reference, the case of letters is relevant. For exam- | |
1013 | ple, | |
1014 | ||
1015 | ((?i)rah)\s+\1 | |
1016 | ||
1017 | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the | |
1018 | original capturing subpattern is matched caselessly. | |
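
A short sketch of the first example, using Python's re module as an assumed stand-in for PCRE. The caseless example is not reproduced, because recent Python versions reject an option setting that is not at the start of the pattern.

```python
import re

# \1 must repeat the text that group 1 actually matched, not merely
# something that the group's pattern could match.
pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = ["sense and sensibility",
            "response and responsibility",
            "sense and responsibility"]
outcomes = [pattern.fullmatch(s) is not None for s in subjects]
print(outcomes)   # [True, True, False]
```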
1019 | ||
1020 | Back references to named subpatterns use the Python syntax (?P=name). | |
1021 | We could rewrite the above example as follows: | |
1022 | ||
1023 | (?P<p1>(?i)rah)\s+(?P=p1) | |
1024 | ||
1025 | There may be more than one back reference to the same subpattern. If a | |
1026 | subpattern has not actually been used in a particular match, any back | |
1027 | references to it always fail. For example, the pattern | |
1028 | ||
1029 | (a|(bc))\2 | |
1030 | ||
1031 | always fails if it starts to match "a" rather than "bc". Because there | |
1032 | may be many capturing parentheses in a pattern, all digits following | |
1033 | the backslash are taken as part of a potential back reference number. | |
1034 | If the pattern continues with a digit character, some delimiter must be | |
1035 | used to terminate the back reference. If the PCRE_EXTENDED option is | |
1036 | set, this can be whitespace. Otherwise an empty comment (see | |
1037 | "Comments" below) can be used. | |
1038 | ||
1039 | A back reference that occurs inside the parentheses to which it refers | |
1040 | fails when the subpattern is first used, so, for example, (a\1) never | |
1041 | matches. However, such references can be useful inside repeated sub- | |
1042 | patterns. For example, the pattern | |
1043 | ||
1044 | (a|b\1)+ | |
1045 | ||
1046 | matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- | |
1047 | ation of the subpattern, the back reference matches the character | |
1048 | string corresponding to the previous iteration. In order for this to | |
1049 | work, the pattern must be such that the first iteration does not need | |
1050 | to match the back reference. This can be done using alternation, as in | |
1051 | the example above, or by a quantifier with a minimum of zero. | |
1052 | ||
1053 | ||
1054 | ASSERTIONS | |
1055 | ||
1056 | An assertion is a test on the characters following or preceding the | |
1057 | current matching point that does not actually consume any characters. | |
1058 | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are | |
1059 | described above. | |
1060 | ||
1061 | More complicated assertions are coded as subpatterns. There are two | |
1062 | kinds: those that look ahead of the current position in the subject | |
1063 | string, and those that look behind it. An assertion subpattern is | |
1064 | matched in the normal way, except that it does not cause the current | |
1065 | matching position to be changed. | |
1066 | ||
1067 | Assertion subpatterns are not capturing subpatterns, and may not be | |
1068 | repeated, because it makes no sense to assert the same thing several | |
1069 | times. If any kind of assertion contains capturing subpatterns within | |
1070 | it, these are counted for the purposes of numbering the capturing sub- | |
1071 | patterns in the whole pattern. However, substring capturing is carried | |
1072 | out only for positive assertions, because it does not make sense for | |
1073 | negative assertions. | |
1074 | ||
1075 | Lookahead assertions | |
1076 | ||
1077 | Lookahead assertions start with (?= for positive assertions and (?! for | |
1078 | negative assertions. For example, | |
1079 | ||
1080 | \w+(?=;) | |
1081 | ||
1082 | matches a word followed by a semicolon, but does not include the semi- | |
1083 | colon in the match, and | |
1084 | ||
1085 | foo(?!bar) | |
1086 | ||
1087 | matches any occurrence of "foo" that is not followed by "bar". Note | |
1088 | that the apparently similar pattern | |
1089 | ||
1090 | (?!foo)bar | |
1091 | ||
1092 | does not find an occurrence of "bar" that is preceded by something | |
1093 | other than "foo"; it finds any occurrence of "bar" whatsoever, because | |
1094 | the assertion (?!foo) is always true when the next three characters are | |
1095 | "bar". A lookbehind assertion is needed to achieve the other effect. | |
1096 | ||
1097 | If you want to force a matching failure at some point in a pattern, the | |
1098 | most convenient way to do it is with (?!) because an empty string | |
1099 | always matches, so an assertion that requires there not to be an empty | |
1100 | string must always fail. | |
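
The three lookahead patterns above behave as follows in Python's re module, which is assumed to match PCRE for these assertions. The subject strings are made up for illustration.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
word = re.search(r"\w+(?=;)", "total;").group()
print(word)            # total

# Negative lookahead: the first "foo" is skipped because "bar" follows it.
foo_at = re.search(r"foo(?!bar)", "foobar foobaz").start()
print(foo_at)          # 7

# (?!foo)bar still finds the "bar" in "foobar", because at that point
# the next three characters are "bar", not "foo".
bar_at = re.search(r"(?!foo)bar", "foobar").start()
print(bar_at)          # 3
```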
1101 | ||
1102 | Lookbehind assertions | |
1103 | ||
1104 | Lookbehind assertions start with (?<= for positive assertions and (?<! | |
1105 | for negative assertions. For example, | |
1106 | ||
1107 | (?<!foo)bar | |
1108 | ||
1109 | does find an occurrence of "bar" that is not preceded by "foo". The | |
1110 | contents of a lookbehind assertion are restricted such that all the | |
1111 | strings it matches must have a fixed length. However, if there are sev- | |
1112 | eral alternatives, they do not all have to have the same fixed length. | |
1113 | Thus | |
1114 | ||
1115 | (?<=bullock|donkey) | |
1116 | ||
1117 | is permitted, but | |
1118 | ||
1119 | (?<!dogs?|cats?) | |
1120 | ||
1121 | causes an error at compile time. Branches that match different length | |
1122 | strings are permitted only at the top level of a lookbehind assertion. | |
1123 | This is an extension compared with Perl (at least for 5.8), which | |
1124 | requires all branches to match the same length of string. An assertion | |
1125 | such as | |
1126 | ||
1127 | (?<=ab(c|de)) | |
1128 | ||
1129 | is not permitted, because its single top-level branch can match two | |
1130 | different lengths, but it is acceptable if rewritten to use two top- | |
1131 | level branches: | |
1132 | ||
1133 | (?<=abc|abde) | |
1134 | ||
1135 | The implementation of lookbehind assertions is, for each alternative, | |
1136 | to temporarily move the current position back by the fixed width and | |
1137 | then try to match. If there are insufficient characters before the cur- | |
1138 | rent position, the match is deemed to fail. | |
1139 | ||
1140 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 | |
1141 | mode) to appear in lookbehind assertions, because it makes it impossi- | |
1142 | ble to calculate the length of the lookbehind. The \X escape, which can | |
1143 | match different numbers of bytes, is also not permitted. | |
1144 | ||
1145 | Atomic groups can be used in conjunction with lookbehind assertions to | |
1146 | specify efficient matching at the end of the subject string. Consider a | |
1147 | simple pattern such as | |
1148 | ||
1149 | abcd$ | |
1150 | ||
1151 | when applied to a long string that does not match. Because matching | |
1152 | proceeds from left to right, PCRE will look for each "a" in the subject | |
1153 | and then see if what follows matches the rest of the pattern. If the | |
1154 | pattern is specified as | |
1155 | ||
1156 | ^.*abcd$ | |
1157 | ||
1158 | the initial .* matches the entire string at first, but when this fails | |
1159 | (because there is no following "a"), it backtracks to match all but the | |
1160 | last character, then all but the last two characters, and so on. Once | |
1161 | again the search for "a" covers the entire string, from right to left, | |
1162 | so we are no better off. However, if the pattern is written as | |
1163 | ||
1164 | ^(?>.*)(?<=abcd) | |
1165 | ||
1166 | or, equivalently, using the possessive quantifier syntax, | |
1167 | ||
1168 | ^.*+(?<=abcd) | |
1169 | ||
1170 | there can be no backtracking for the .* item; it can match only the | |
1171 | entire string. The subsequent lookbehind assertion does a single test | |
1172 | on the last four characters. If it fails, the match fails immediately. | |
1173 | For long strings, this approach makes a significant difference to the | |
1174 | processing time. | |
1175 | ||
1176 | Using multiple assertions | |
1177 | ||
1178 | Several assertions (of any sort) may occur in succession. For example, | |
1179 | ||
1180 | (?<=\d{3})(?<!999)foo | |
1181 | ||
1182 | matches "foo" preceded by three digits that are not "999". Notice that | |
1183 | each of the assertions is applied independently at the same point in | |
1184 | the subject string. First there is a check that the previous three | |
1185 | characters are all digits, and then there is a check that the same | |
1186 | three characters are not "999". This pattern does not match "foo" | |
1187 | preceded by six characters, the first three of which are digits and the | |
1188 | last three of which are not "999". For example, it doesn't match | |
1189 | "123abcfoo". A pattern to do that is | |
1190 | ||
1191 | (?<=\d{3}...)(?<!999)foo | |
1192 | ||
1193 | This time the first assertion looks at the preceding six characters, | |
1194 | checking that the first three are digits, and then the second assertion | |
1195 | checks that the preceding three characters are not "999". | |
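
The two patterns can be compared side by side. This sketch uses Python's re module, assumed to agree with PCRE here because every lookbehind has a single fixed length; the variable names are illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion inspects six characters, the second the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

print(bool(same_three.search("123foo")))      # True
print(bool(same_three.search("999foo")))      # False
print(bool(same_three.search("123abcfoo")))   # False
print(bool(six_back.search("123abcfoo")))     # True
print(bool(six_back.search("123999foo")))     # False
```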
1196 | ||
1197 | Assertions can be nested in any combination. For example, | |
1198 | ||
1199 | (?<=(?<!foo)bar)baz | |
1200 | ||
1201 | matches an occurrence of "baz" that is preceded by "bar" which in turn | |
1202 | is not preceded by "foo", while | |
1203 | ||
1204 | (?<=\d{3}(?!999)...)foo | |
1205 | ||
1206 | is another pattern that matches "foo" preceded by three digits and any | |
1207 | three characters that are not "999". | |
1208 | ||
1209 | ||
1210 | CONDITIONAL SUBPATTERNS | |
1211 | ||
1212 | It is possible to cause the matching process to obey a subpattern con- | |
1213 | ditionally or to choose between two alternative subpatterns, depending | |
1214 | on the result of an assertion, or whether a previous capturing subpat- | |
1215 | tern matched or not. The two possible forms of conditional subpattern | |
1216 | are | |
1217 | ||
1218 | (?(condition)yes-pattern) | |
1219 | (?(condition)yes-pattern|no-pattern) | |
1220 | ||
1221 | If the condition is satisfied, the yes-pattern is used; otherwise the | |
1222 | no-pattern (if present) is used. If there are more than two alterna- | |
1223 | tives in the subpattern, a compile-time error occurs. | |
1224 | ||
1225 | There are three kinds of condition. If the text between the parentheses | |
1226 | consists of a sequence of digits, the condition is satisfied if the | |
1227 | capturing subpattern of that number has previously matched. The number | |
1228 | must be greater than zero. Consider the following pattern, which con- | |
1229 | tains non-significant white space to make it more readable (assume the | |
1230 | PCRE_EXTENDED option) and to divide it into three parts for ease of | |
1231 | discussion: | |
1232 | ||
1233 | ( \( )? [^()]+ (?(1) \) ) | |
1234 | ||
1235 | The first part matches an optional opening parenthesis, and if that | |
1236 | character is present, sets it as the first captured substring. The sec- | |
1237 | ond part matches one or more characters that are not parentheses. The | |
1238 | third part is a conditional subpattern that tests whether the first set | |
1239 | of parentheses matched or not. If they did, that is, if the subject started | |
1240 | with an opening parenthesis, the condition is true, and so the yes-pat- | |
1241 | tern is executed and a closing parenthesis is required. Otherwise, | |
1242 | since no-pattern is not present, the subpattern matches nothing. In | |
1243 | other words, this pattern matches a sequence of non-parentheses, | |
1244 | optionally enclosed in parentheses. | |
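
Python's re module accepts the same numbered-condition syntax, so the
pattern above can be exercised with it as a rough check. In this sketch
re.VERBOSE stands in for PCRE_EXTENDED, and the names optional_parens
and whole_match are invented for the example. The whole subject is
required to match, so that an unbalanced parenthesis cannot simply be
skipped.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: white space is ignored.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None
```

Here "abc" and "(abc)" are accepted in full, while "(abc" and "abc)"
are not.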
1245 | ||
1246 | If the condition is the string (R), it is satisfied if a recursive call | |
1247 | to the pattern or subpattern has been made. At "top level", the condi- | |
1248 | tion is false. This is a PCRE extension. Recursive patterns are | |
1249 | described in the RECURSIVE PATTERNS section below. | |
1250 | ||
1251 | If the condition is not a sequence of digits or (R), it must be an | |
1252 | assertion. This may be a positive or negative lookahead or lookbehind | |
1253 | assertion. Consider this pattern, again containing non-significant | |
1254 | white space, and with the two alternatives on the second line: | |
1255 | ||
1256 | (?(?=[^a-z]*[a-z]) | |
1257 | \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) | |
1258 | ||
1259 | The condition is a positive lookahead assertion that matches an | |
1260 | optional sequence of non-letters followed by a letter. In other words, | |
1261 | it tests for the presence of at least one letter in the subject. If a | |
1262 | letter is found, the subject is matched against the first alternative; | |
1263 | otherwise it is matched against the second. This pattern matches | |
1264 | strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are | |
1265 | letters and dd are digits. | |
1266 | ||
1267 | ||
1268 | COMMENTS | |
1269 | ||
1270 | The sequence (?# marks the start of a comment that continues up to the | |
1271 | next closing parenthesis. Nested parentheses are not permitted. The | |
1272 | characters that make up a comment play no part in the pattern matching | |
1273 | at all. | |
1274 | ||
1275 | If the PCRE_EXTENDED option is set, an unescaped # character outside a | |
1276 | character class introduces a comment that continues up to the next new- | |
1277 | line character in the pattern. | |
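
Both comment forms can be illustrated with Python's re module, used
here as a stand-in for PCRE, with re.VERBOSE assumed to correspond to
PCRE_EXTENDED. The patterns and the names inline_comment and
extended_comment are made up for this sketch.

```python
import re

# (?# ... ) comments are ignored wherever they appear in the pattern.
inline_comment = re.compile(r"\d{3}(?#three digits)-\d{4}(?#four digits)")

# With re.VERBOSE, an unescaped # outside a character class starts a
# comment that runs to the end of the line.
extended_comment = re.compile(
    r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
    [#]     # inside a character class, # is an ordinary character
    """,
    re.VERBOSE,
)
```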
1278 | ||
1279 | ||
1280 | RECURSIVE PATTERNS | |
1281 | ||
1282 | Consider the problem of matching a string in parentheses, allowing for | |
1283 | unlimited nested parentheses. Without the use of recursion, the best | |
1284 | that can be done is to use a pattern that matches up to some fixed | |
1285 | depth of nesting. It is not possible to handle an arbitrary nesting | |
1286 | depth. Perl provides a facility that allows regular expressions to | |
1287 | recurse (amongst other things). It does this by interpolating Perl code | |
1288 | in the expression at run time, and the code can refer to the expression | |
1289 | itself. A Perl pattern to solve the parentheses problem can be created | |
1290 | like this: | |
1291 | ||
1292 | $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; | |
1293 | ||
1294 | The (?p{...}) item interpolates Perl code at run time, and in this case | |
1295 | refers recursively to the pattern in which it appears. Obviously, PCRE | |
1296 | cannot support the interpolation of Perl code. Instead, it supports | |
1297 | some special syntax for recursion of the entire pattern, and also for | |
1298 | individual subpattern recursion. | |
1299 | ||
1300 | The special item that consists of (? followed by a number greater than | |
1301 | zero and a closing parenthesis is a recursive call of the subpattern of | |
1302 | the given number, provided that it occurs inside that subpattern. (If | |
1303 | not, it is a "subroutine" call, which is described in the next sec- | |
1304 | tion.) The special item (?R) is a recursive call of the entire regular | |
1305 | expression. | |
1306 | ||
1307 | For example, this PCRE pattern solves the nested parentheses problem | |
1308 | (assume the PCRE_EXTENDED option is set so that white space is | |
1309 | ignored): | |
1310 | ||
1311 | \( ( (?>[^()]+) | (?R) )* \) | |
1312 | ||
1313 | First it matches an opening parenthesis. Then it matches any number of | |
1314 | substrings which can either be a sequence of non-parentheses, or a | |
1315 | recursive match of the pattern itself (that is, a correctly parenthe- | |
1316 | sized substring). Finally there is a closing parenthesis. | |
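
Python's standard re module has no (?R) item, so the following sketch
does not run the pattern itself. Instead it is a hand-written recursive
function whose steps mirror the three parts of the pattern, offered
only as an aid to reading it. The function name match_parens and its
return convention are assumptions of this example, not part of PCRE.

```python
def match_parens(subject, pos=0):
    """Match a parenthesized group starting at pos.

    Returns the index just past the closing parenthesis, or None if
    there is no correctly parenthesized substring at pos.
    """
    # \(  : an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    # ( ... | ... )*  : any number of the two alternatives.
    while pos < len(subject):
        if subject[pos] == ")":
            # \)  : the closing parenthesis ends the group.
            return pos + 1
        if subject[pos] == "(":
            # Plays the role of (?R): recurse into a nested group.
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # Plays the role of (?>[^()]+): take the whole run of
            # non-parentheses and never give any of it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None
```

For example, match_parens("(ab(cd)ef)") returns 10, the length of the
subject, while match_parens("(aaa()") returns None.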
1317 | ||
1318 | If this were part of a larger pattern, you would not want to recurse | |
1319 | the entire pattern, so instead you could use this: | |
1320 | ||
1321 | ( \( ( (?>[^()]+) | (?1) )* \) ) | |
1322 | ||
1323 | We have put the pattern into parentheses, and caused the recursion to | |
1324 | refer to them instead of the whole pattern. In a larger pattern, keep- | |
1325 | ing track of parenthesis numbers can be tricky. It may be more conve- | |
1326 | nient to use named parentheses instead. For this, PCRE uses (?P>name), | |
1327 | which is an extension to the Python syntax that PCRE uses for named | |
1328 | parentheses (Perl does not provide named parentheses). We could rewrite | |
1329 | the above example as follows: | |
1330 | ||
1331 | (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) ) | |
1332 | ||
1333 | This particular example pattern contains nested unlimited repeats, and | |
1334 | so the use of atomic grouping for matching strings of non-parentheses | |
1335 | is important when applying the pattern to strings that do not match. | |
1336 | For example, when this pattern is applied to | |
1337 | ||
1338 | (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() | |
1339 | ||
1340 | it yields "no match" quickly. However, if atomic grouping is not used, | |
1341 | the match runs for a very long time indeed because there are so many | |
1342 | different ways the + and * repeats can carve up the subject, and all | |
1343 | have to be tested before failure can be reported. | |
1344 | ||
1345 | At the end of a match, the values set for any capturing subpatterns are | |
1346 | those from the outermost level of the recursion at which the subpattern | |
1347 | value is set. If you want to obtain intermediate values, a callout | |
1348 | function can be used (see the CALLOUTS section below and the pcrecallout | |
1349 | documentation). If the pattern above is matched against | |
1350 | ||
1351 | (ab(cd)ef) | |
1352 | ||
1353 | the value for the capturing parentheses is "ef", which is the last | |
1354 | value taken on at the top level. If additional parentheses are added, | |
1355 | giving | |
1356 | ||
1357 | \( ( ( (?>[^()]+) | (?R) )* ) \) | |
1358 | ^ ^ | |
1359 | ^ ^ | |
1360 | ||
1361 | the string they capture is "ab(cd)ef", the contents of the top level | |
1362 | parentheses. If there are more than 15 capturing parentheses in a pat- | |
1363 | tern, PCRE has to obtain extra memory to store data during a recursion, | |
1364 | which it does by using pcre_malloc, freeing it via pcre_free after- | |
1365 | wards. If no memory can be obtained, the match fails with the | |
1366 | PCRE_ERROR_NOMEMORY error. | |
1367 | ||
1368 | Do not confuse the (?R) item with the condition (R), which tests for | |
1369 | recursion. Consider this pattern, which matches text in angle brack- | |
1370 | ets, allowing for arbitrary nesting. Only digits are allowed in nested | |
1371 | brackets (that is, when recursing), whereas any characters are permit- | |
1372 | ted at the outer level. | |
1373 | ||
1374 | < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > | |
1375 | ||
1376 | In this pattern, (?(R) is the start of a conditional subpattern, with | |
1377 | two different alternatives for the recursive and non-recursive cases. | |
1378 | The (?R) item is the actual recursive call. | |
1379 | ||
1380 | ||
1381 | SUBPATTERNS AS SUBROUTINES | |
1382 | ||
1383 | If the syntax for a recursive subpattern reference (either by number or | |
1384 | by name) is used outside the parentheses to which it refers, it oper- | |
1385 | ates like a subroutine in a programming language. An earlier example | |
1386 | pointed out that the pattern | |
1387 | ||
1388 | (sens|respons)e and \1ibility | |
1389 | ||
1390 | matches "sense and sensibility" and "response and responsibility", but | |
1391 | not "sense and responsibility". If instead the pattern | |
1392 | ||
1393 | (sens|respons)e and (?1)ibility | |
1394 | ||
1395 | is used, it does match "sense and responsibility" as well as the other | |
1396 | two strings. Such references must, however, follow the subpattern to | |
1397 | which they refer. | |
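
The contrast between a back reference and a subroutine call can be
approximated in Python, whose standard re module supports \1 but not
(?1). In this sketch the subroutine call is imitated by writing the
subpattern out a second time, which is assumed to be equivalent for
this simple case. The names back_reference and subroutine_like are
hypothetical.

```python
import re

# \1 must repeat the text that group 1 actually matched.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Assumption: repeating the subpattern approximates (?1), which
# re-applies the pattern of group 1 rather than its matched text.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")
```

Only subroutine_like accepts "sense and responsibility"; both accept
"sense and sensibility" and "response and responsibility".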
1398 | ||
1399 | ||
1400 | CALLOUTS | |
1401 | ||
1402 | Perl has a feature whereby using the sequence (?{...}) causes arbitrary | |
1403 | Perl code to be obeyed in the middle of matching a regular expression. | |
1404 | This makes it possible, amongst other things, to extract different sub- | |
1405 | strings that match the same pair of parentheses when there is a repeti- | |
1406 | tion. | |
1407 | ||
1408 | PCRE provides a similar feature, but of course it cannot obey arbitrary | |
1409 | Perl code. The feature is called "callout". The caller of PCRE provides | |
1410 | an external function by putting its entry point in the global variable | |
1411 | pcre_callout. By default, this variable contains NULL, which disables | |
1412 | all calling out. | |
1413 | ||
1414 | Within a regular expression, (?C) indicates the points at which the | |
1415 | external function is to be called. If you want to identify different | |
1416 | callout points, you can put a number less than 256 after the letter C. | |
1417 | The default value is zero. For example, this pattern has two callout | |
1418 | points: | |
1419 | ||
1420 | (?C1)abc(?C2)def | |
1421 | ||
1422 | If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are | |
1423 | automatically installed before each item in the pattern. They are all | |
1424 | numbered 255. | |
1425 | ||
1426 | During matching, when PCRE reaches a callout point (and pcre_callout is | |
1427 | set), the external function is called. It is provided with the number | |
1428 | of the callout, the position in the pattern, and, optionally, one item | |
1429 | of data originally supplied by the caller of pcre_exec(). The callout | |
1430 | function may cause matching to proceed, to backtrack, or to fail alto- | |
1431 | gether. A complete description of the interface to the callout function | |
1432 | is given in the pcrecallout documentation. | |
1433 | ||
8ac170f3 PH |
1434 | Last updated: 28 February 2005 |
1435 | Copyright (c) 1997-2005 University of Cambridge. |