doc/doc-txt/pcrepattern.txt

   1 This file contains the PCRE man page that describes the regular expressions
   2 supported by PCRE version 7.0. Note that not all of the features are relevant
   3 in the context of Exim. In particular, the version of PCRE that is compiled
   4 with Exim does not include UTF-8 support, there is no mechanism for changing
   5 the options with which the PCRE functions are called, and features such as
   6 callout are not accessible.
   7 -----------------------------------------------------------------------------
   8
   9 PCREPATTERN(3)                                                  PCREPATTERN(3)
  10
  11
  12 NAME
  13        PCRE - Perl-compatible regular expressions
  14
  15
  16 PCRE REGULAR EXPRESSION DETAILS
  17
  18        The  syntax  and semantics of the regular expressions supported by PCRE
  19        are described below. Regular expressions are also described in the Perl
  20        documentation  and  in  a  number  of books, some of which have copious
  21        examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published
  22        by  O'Reilly, covers regular expressions in great detail. This descrip-
  23        tion of PCRE's regular expressions is intended as reference material.
  24
  25        The original operation of PCRE was on strings of  one-byte  characters.
  26        However,  there is now also support for UTF-8 character strings. To use
  27        this, you must build PCRE to  include  UTF-8  support,  and  then  call
  28        pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
  29        matching is mentioned in several places below. There is also a  summary
  30        of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
  31        page.
  32
  33        The remainder of this document discusses the  patterns  that  are  sup-
  34        ported  by  PCRE when its main matching function, pcre_exec(), is used.
  35        From  release  6.0,   PCRE   offers   a   second   matching   function,
  36        pcre_dfa_exec(),  which matches using a different algorithm that is not
  37        Perl-compatible. The advantages and disadvantages  of  the  alternative
  38        function, and how it differs from the normal function, are discussed in
  39        the pcrematching page.
  40
  41
  42 CHARACTERS AND METACHARACTERS
  43
  44        A regular expression is a pattern that is  matched  against  a  subject
  45        string  from  left  to right. Most characters stand for themselves in a
  46        pattern, and match the corresponding characters in the  subject.  As  a
  47        trivial example, the pattern
  48
  49          The quick brown fox
  50
  51        matches a portion of a subject string that is identical to itself. When
  52        caseless matching is specified (the PCRE_CASELESS option), letters  are
  53        matched  independently  of case. In UTF-8 mode, PCRE always understands
  54        the concept of case for characters whose values are less than  128,  so
  55        caseless  matching  is always possible. For characters with higher val-
  56        ues, the concept of case is supported if PCRE is compiled with  Unicode
  57        property  support,  but  not  otherwise.   If  you want to use caseless
  58        matching for characters 128 and above, you must  ensure  that  PCRE  is
  59        compiled with Unicode property support as well as with UTF-8 support.
  60
  61        The  power  of  regular  expressions  comes from the ability to include
  62        alternatives and repetitions in the pattern. These are encoded  in  the
  63        pattern by the use of metacharacters, which do not stand for themselves
  64        but instead are interpreted in some special way.
  65
  66        There are two different sets of metacharacters: those that  are  recog-
  67        nized  anywhere in the pattern except within square brackets, and those
  68        that are recognized within square brackets.  Outside  square  brackets,
  69        the metacharacters are as follows:
  70
  71          \      general escape character with several uses
  72          ^      assert start of string (or line, in multiline mode)
  73          $      assert end of string (or line, in multiline mode)
  74          .      match any character except newline (by default)
  75          [      start character class definition
  76          |      start of alternative branch
  77          (      start subpattern
  78          )      end subpattern
  79          ?      extends the meaning of (
  80                 also 0 or 1 quantifier
  81                 also quantifier minimizer
  82          *      0 or more quantifier
  83          +      1 or more quantifier
  84                 also "possessive quantifier"
  85          {      start min/max quantifier
  86
  87        Part  of  a  pattern  that is in square brackets is called a "character
  88        class". In a character class the only metacharacters are:
  89
  90          \      general escape character
   91          ^      negate the class, but only if it is the first character
  92          -      indicates character range
  93          [      POSIX character class (only if followed by POSIX
  94                   syntax)
  95          ]      terminates the character class
  96
  97        The following sections describe the use of each of the  metacharacters.
  98
  99
 100 BACKSLASH
 101
 102        The backslash character has several uses. Firstly, if it is followed by
 103        a non-alphanumeric character, it takes away any  special  meaning  that
 104        character  may  have.  This  use  of  backslash  as an escape character
 105        applies both inside and outside character classes.
 106
 107        For example, if you want to match a * character, you write  \*  in  the
 108        pattern.   This  escaping  action  applies whether or not the following
 109        character would otherwise be interpreted as a metacharacter, so  it  is
 110        always  safe  to  precede  a non-alphanumeric with backslash to specify
 111        that it stands for itself. In particular, if you want to match a  back-
 112        slash, you write \\.
 113
 114        If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
 115        the pattern (other than in a character class) and characters between  a
 116        # outside a character class and the next newline are ignored. An escap-
 117        ing backslash can be used to include a whitespace  or  #  character  as
 118        part of the pattern.
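
       As a rough illustration only: Python's re module is not PCRE, but its
       re.VERBOSE flag behaves like PCRE_EXTENDED for the points made above.
       The pattern and subject strings below are invented for the example.

```python
import re

# re.VERBOSE is used here as a stand-in for PCRE_EXTENDED (an assumption:
# this is Python's engine, not PCRE).
pattern = re.compile(r"""
    \d+     # one or more digits
    \ \#\   # an escaped space, an escaped '#', and another escaped space
    \w+     # one or more word characters
""", re.VERBOSE)

# Escaped whitespace and '#' are part of the pattern.
matched = pattern.fullmatch("42 # answer") is not None

# Unescaped whitespace is ignored, so "a b" behaves like "ab".
unescaped_space_ignored = re.fullmatch(r"a b", "ab", re.VERBOSE) is not None
```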
 119
 120        If  you  want  to remove the special meaning from a sequence of charac-
 121        ters, you can do so by putting them between \Q and \E. This is  differ-
 122        ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
 123        sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
 124        tion. Note the following examples:
 125
 126          Pattern            PCRE matches   Perl matches
 127
 128          \Qabc$xyz\E        abc$xyz        abc followed by the
 129                                              contents of $xyz
 130          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
 131          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 132
 133        The  \Q...\E  sequence  is recognized both inside and outside character
 134        classes.
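
       The PCRE column of the table can be approximated in Python, which has
       no \Q...\E of its own. The helper below is a hypothetical sketch: it
       rewrites each \Q...\E span with re.escape() and does not try to
       handle a \Q that is itself preceded by an escaping backslash.

```python
import re

def expand_quoted(pattern):
    """Replace each \\Q...\\E span with an escaped literal (sketch only)."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])          # no more quoted spans
            break
        out.append(pattern[pos:start])         # text before \Q is kept as-is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            out.append(re.escape(pattern[start + 2:]))   # unterminated \Q
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

def pcre_style_match(pattern, subject):
    return re.fullmatch(expand_quoted(pattern), subject) is not None
```

       With this helper, the three patterns in the table match the strings
       "abc$xyz", "abc\$xyz", and "abc$xyz" respectively.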
 135
 136    Non-printing characters
 137
 138        A second use of backslash provides a way of encoding non-printing char-
 139        acters  in patterns in a visible manner. There is no restriction on the
 140        appearance of non-printing characters, apart from the binary zero  that
 141        terminates  a  pattern,  but  when  a pattern is being prepared by text
 142        editing, it is usually easier  to  use  one  of  the  following  escape
 143        sequences than the binary character it represents:
 144
 145          \a        alarm, that is, the BEL character (hex 07)
 146          \cx       "control-x", where x is any character
 147          \e        escape (hex 1B)
 148          \f        formfeed (hex 0C)
 149          \n        newline (hex 0A)
 150          \r        carriage return (hex 0D)
 151          \t        tab (hex 09)
 152          \ddd      character with octal code ddd, or backreference
 153          \xhh      character with hex code hh
 154          \x{hhh..} character with hex code hhh..
 155
 156        The  precise  effect of \cx is as follows: if x is a lower case letter,
 157        it is converted to upper case. Then bit 6 of the character (hex 40)  is
 158        inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
 159        becomes hex 7B.
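
       The \cx rule is plain arithmetic, so it can be checked with a few
       lines of Python. The function name is hypothetical; only the rule
       itself comes from the text above.

```python
def control_escape(x):
    """Character code produced by \\cx, following the rule described above."""
    if "a" <= x <= "z":
        x = x.upper()          # lower case letters are converted first
    return ord(x) ^ 0x40       # then bit 6 (hex 40) is inverted

codes = {ch: hex(control_escape(ch)) for ch in "z{;"}
```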
 160
 161        After \x, from zero to two hexadecimal digits are read (letters can  be
 162        in  upper  or  lower case). Any number of hexadecimal digits may appear
 163        between \x{ and }, but the value of the character  code  must  be  less
 164        than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
 165        the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than
 166        hexadecimal  digits  appear between \x{ and }, or if there is no termi-
 167        nating }, this form of escape is not recognized.  Instead, the  initial
 168        \x will be interpreted as a basic hexadecimal escape, with no following
 169        digits, giving a character whose value is zero.
 170
 171        Characters whose value is less than 256 can be defined by either of the
 172        two  syntaxes  for  \x. There is no difference in the way they are han-
 173        dled. For example, \xdc is exactly the same as \x{dc}.
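
       The two \x forms and the fallback rule can be modelled as below. This
       is a sketch of the rules as stated here, not of PCRE's source; the
       function name and the utf8 parameter are hypothetical, and treating
       an empty \x{} as unrecognized is an assumption.

```python
import re

def parse_x_escape(rest, utf8=False):
    """Interpret the text following '\\x'.

    Returns (character code, number of characters consumed from rest).
    """
    braced = re.match(r"\{([0-9A-Fa-f]+)\}", rest)
    if braced:
        code = int(braced.group(1), 16)
        limit = 2**31 if utf8 else 256
        if code >= limit:
            raise ValueError("character value too large")
        return code, braced.end()
    # Basic form: from zero to two hexadecimal digits.
    digits = re.match(r"[0-9A-Fa-f]{0,2}", rest).group()
    return (int(digits, 16) if digits else 0), len(digits)
```

       For instance, both "dc" and "{dc}" yield the code 0xDC, while "{zz}"
       falls back to the basic form and yields a zero with nothing consumed.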
 174
 175        After \0 up to two further octal digits are read. If  there  are  fewer
 176        than  two  digits,  just  those  that  are  present  are used. Thus the
 177        sequence \0\x\07 specifies two binary zeros followed by a BEL character
 178        (code  value 7). Make sure you supply two digits after the initial zero
 179        if the pattern character that follows is itself an octal digit.
 180
 181        The handling of a backslash followed by a digit other than 0 is compli-
 182        cated.  Outside a character class, PCRE reads it and any following dig-
 183        its as a decimal number. If the number is less than  10,  or  if  there
 184        have been at least that many previous capturing left parentheses in the
 185        expression, the entire  sequence  is  taken  as  a  back  reference.  A
 186        description  of how this works is given later, following the discussion
 187        of parenthesized subpatterns.
 188
 189        Inside a character class, or if the decimal number is  greater  than  9
 190        and  there have not been that many capturing subpatterns, PCRE re-reads
 191        up to three octal digits following the backslash, and uses them to gen-
 192        erate  a data character. Any subsequent digits stand for themselves. In
 193        non-UTF-8 mode, the value of a character specified  in  octal  must  be
 194        less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
 195        example:
 196
 197          \040   is another way of writing a space
 198          \40    is the same, provided there are fewer than 40
 199                    previous capturing subpatterns
 200          \7     is always a back reference
 201          \11    might be a back reference, or another way of
 202                    writing a tab
 203          \011   is always a tab
 204          \0113  is a tab followed by the character "3"
 205          \113   might be a back reference, otherwise the
 206                    character with octal code 113
 207          \377   might be a back reference, otherwise
 208                    the byte consisting entirely of 1 bits
 209          \81    is either a back reference, or a binary zero
 210                    followed by the two characters "8" and "1"
 211
 212        Note that octal values of 100 or greater must not be  introduced  by  a
 213        leading zero, because no more than three octal digits are ever read.
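
       The decision procedure for a backslash followed by digits can be
       sketched as follows. The function and its parameters are
       hypothetical, and the limits on the resulting value (\400 or \777)
       are deliberately left out; only the rules stated above are modelled.

```python
def interpret_digit_escape(digits, groups_so_far, in_class=False):
    """Classify the digits that follow a backslash.

    Returns ('backref', n) or ('char', code, trailing_literal_digits).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)               # read all digits as decimal
        if number < 10 or number <= groups_so_far:
            return ("backref", number)
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    code = int(octal, 8) if octal else 0   # no octal digits: binary zero
    return ("char", code, digits[len(octal):])
```

       Under these assumptions, interpret_digit_escape("0113", 0) gives a
       tab followed by the literal "3", and interpret_digit_escape("81", 0)
       gives a binary zero followed by the literals "8" and "1", in line
       with the table.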
 214
 215        All the sequences that define a single character value can be used both
 216        inside and outside character classes. In addition, inside  a  character
 217        class,  the  sequence \b is interpreted as the backspace character (hex
 218        08), and the sequences \R and \X are interpreted as the characters  "R"
 219        and  "X", respectively. Outside a character class, these sequences have
 220        different meanings (see below).
 221
 222    Absolute and relative back references
 223
 224        The sequence \g followed by a positive or negative  number,  optionally
 225        enclosed  in  braces,  is  an absolute or relative back reference. Back
 226        references are discussed later, following the discussion  of  parenthe-
 227        sized subpatterns.
 228
 229    Generic character types
 230
 231        Another use of backslash is for specifying generic character types. The
 232        following are always recognized:
 233
 234          \d     any decimal digit
 235          \D     any character that is not a decimal digit
 236          \s     any whitespace character
 237          \S     any character that is not a whitespace character
 238          \w     any "word" character
 239          \W     any "non-word" character
 240
 241        Each pair of escape sequences partitions the complete set of characters
 242        into  two disjoint sets. Any given character matches one, and only one,
 243        of each pair.
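
       The partition property can be checked mechanically. The sketch below
       uses Python's re module with re.ASCII as a stand-in; its character
       tables are not PCRE's, so only the "exactly one of each pair"
       behaviour is being illustrated.

```python
import re

PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def matches_exactly_one(ch):
    """True if ch matches exactly one member of every escape pair."""
    return all(
        (re.fullmatch(a, ch, re.ASCII) is None)
        != (re.fullmatch(b, ch, re.ASCII) is None)
        for a, b in PAIRS
    )

all_partitioned = all(matches_exactly_one(chr(c)) for c in range(256))
```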
 244
 245        These character type sequences can appear both inside and outside char-
 246        acter  classes.  They each match one character of the appropriate type.
 247        If the current matching point is at the end of the subject string,  all
 248        of them fail, since there is no character to match.
 249
 250        For  compatibility  with Perl, \s does not match the VT character (code
  251        11).  This makes it different from the POSIX "space" class. The \s
 252        characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If
 253        "use locale;" is included in a Perl script, \s may match the VT charac-
 254        ter. In PCRE, it never does.)
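
       The difference matters when porting patterns between engines. As an
       example under stated assumptions: Python's \s does match VT, so a
       pattern that needs the \s set described here can spell it out as an
       explicit class. The name PCRE_SPACE is hypothetical.

```python
import re

# The five characters listed above: HT, LF, FF, CR, and space.
PCRE_SPACE = r"[\t\n\f\r ]"

vt = "\x0b"
python_s_matches_vt = re.fullmatch(r"\s", vt) is not None        # Python's own \s
explicit_class_matches_vt = re.fullmatch(PCRE_SPACE, vt) is not None

codes = sorted(c for c in range(128) if re.fullmatch(PCRE_SPACE, chr(c)))
```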
 255
 256        A "word" character is an underscore or any character less than 256 that
 257        is a letter or digit. The definition of  letters  and  digits  is  con-
 258        trolled  by PCRE's low-valued character tables, and may vary if locale-
 259        specific matching is taking place (see "Locale support" in the  pcreapi
 260        page).  For  example,  in  the  "fr_FR" (French) locale, some character
 261        codes greater than 128 are used for accented  letters,  and  these  are
 262        matched by \w.
 263
 264        In  UTF-8 mode, characters with values greater than 128 never match \d,
 265        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
 266        code  character  property support is available. The use of locales with
 267        Unicode is discouraged.
 268
 269    Newline sequences
 270
 271        Outside a character class, the escape sequence \R matches  any  Unicode
 272        newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is
 273        equivalent to the following:
 274
 275          (?>\r\n|\n|\x0b|\f|\r|\x85)
 276
 277        This is an example of an "atomic group", details  of  which  are  given
 278        below.  This particular group matches either the two-character sequence
 279        CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
 280        U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
 281        return, U+000D), or NEL (next line, U+0085). The two-character sequence
 282        is treated as a single unit that cannot be split.
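
       Python's re module has no \R, but the non-UTF-8 equivalent can be
       written out by hand. The sketch below uses a plain non-capturing
       group instead of an atomic group (Python only supports (?>...) from
       version 3.11), which gives the same result for simple splitting. The
       subject string is invented.

```python
import re

# CRLF is listed first so that it is consumed as one unit.
NEWLINE = r"(?:\r\n|\n|\x0b|\f|\r|\x85)"

lines = re.split(NEWLINE, "a\r\nb\nc\x85d\re")
```

       Here the CRLF pair produces one split rather than two, so no empty
       string appears between "a" and "b".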
 283
 284        In  UTF-8  mode, two additional characters whose codepoints are greater
 285        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
 286        rator,  U+2029).   Unicode character property support is not needed for
 287        these characters to be recognized.
 288
 289        Inside a character class, \R matches the letter "R".
 290
 291    Unicode character properties
 292
 293        When PCRE is built with Unicode character property support, three addi-
 294        tional  escape  sequences  to  match character properties are available
 295        when UTF-8 mode is selected. They are:
 296
 297          \p{xx}   a character with the xx property
 298          \P{xx}   a character without the xx property
 299          \X       an extended Unicode sequence
 300
 301        The property names represented by xx above are limited to  the  Unicode
 302        script names, the general category properties, and "Any", which matches
 303        any character (including newline). Other properties such as "InMusical-
 304        Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does
 305        not match any characters, so always causes a match failure.
 306
 307        Sets of Unicode characters are defined as belonging to certain scripts.
 308        A  character from one of these sets can be matched using a script name.
 309        For example:
 310
 311          \p{Greek}
 312          \P{Han}
 313
 314        Those that are not part of an identified script are lumped together  as
 315        "Common". The current list of scripts is:
 316
 317        Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
 318        Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,
 319        Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
 320        Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-
 321        gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,
 322        Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
 323        Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
 324        Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
 325        Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
 326
 327        Each  character has exactly one general category property, specified by
 328        a two-letter abbreviation. For compatibility with Perl, negation can be
 329        specified  by  including a circumflex between the opening brace and the
 330        property name. For example, \p{^Lu} is the same as \P{Lu}.
 331
 332        If only one letter is specified with \p or \P, it includes all the gen-
 333        eral  category properties that start with that letter. In this case, in
 334        the absence of negation, the curly brackets in the escape sequence  are
 335        optional; these two examples have the same effect:
 336
 337          \p{L}
 338          \pL
 339
 340        The following general category property codes are supported:
 341
 342          C     Other
 343          Cc    Control
 344          Cf    Format
 345          Cn    Unassigned
 346          Co    Private use
 347          Cs    Surrogate
 348
 349          L     Letter
 350          Ll    Lower case letter
 351          Lm    Modifier letter
 352          Lo    Other letter
 353          Lt    Title case letter
 354          Lu    Upper case letter
 355
 356          M     Mark
 357          Mc    Spacing mark
 358          Me    Enclosing mark
 359          Mn    Non-spacing mark
 360
 361          N     Number
 362          Nd    Decimal number
 363          Nl    Letter number
 364          No    Other number
 365
 366          P     Punctuation
 367          Pc    Connector punctuation
 368          Pd    Dash punctuation
 369          Pe    Close punctuation
 370          Pf    Final punctuation
 371          Pi    Initial punctuation
 372          Po    Other punctuation
 373          Ps    Open punctuation
 374
 375          S     Symbol
 376          Sc    Currency symbol
 377          Sk    Modifier symbol
 378          Sm    Mathematical symbol
 379          So    Other symbol
 380
 381          Z     Separator
 382          Zl    Line separator
 383          Zp    Paragraph separator
 384          Zs    Space separator
 385
 386        The  special property L& is also supported: it matches a character that
 387        has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
 388        classified as a modifier or "other".
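
       The way one-letter and two-letter codes, negation, and L& relate can
       be sketched with Python's unicodedata module. This is an
       illustration only: the function name is hypothetical, script names
       are not handled, and Python's Unicode tables are newer than those in
       this PCRE release.

```python
import unicodedata

def matches_property(ch, prop):
    """Approximate \\p{prop} for general category codes, 'Any', and 'L&'."""
    negate = prop.startswith("^")          # \p{^Lu} is the same as \P{Lu}
    name = prop[1:] if negate else prop
    category = unicodedata.category(ch)    # always a two-letter code
    if name == "Any":
        hit = True
    elif name == "L&":
        hit = category in ("Lu", "Ll", "Lt")
    else:
        hit = category.startswith(name)    # 'L' covers Ll, Lm, Lo, Lt, Lu
    return hit != negate
```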
 389
 390        The  long  synonyms  for  these  properties that Perl supports (such as
 391        \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
 392        any of these properties with "Is".
 393
 394        No character that is in the Unicode table has the Cn (unassigned) prop-
 395        erty.  Instead, this property is assumed for any code point that is not
 396        in the Unicode table.
 397
 398        Specifying  caseless  matching  does not affect these escape sequences.
 399        For example, \p{Lu} always matches only upper case letters.
 400
 401        The \X escape matches any number of Unicode  characters  that  form  an
 402        extended Unicode sequence. \X is equivalent to
 403
 404          (?>\PM\pM*)
 405
 406        That  is,  it matches a character without the "mark" property, followed
 407        by zero or more characters with the "mark"  property,  and  treats  the
 408        sequence  as  an  atomic group (see below).  Characters with the "mark"
 409        property are typically accents that affect the preceding character.
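
       The \PM\pM* structure can be imitated without a regex engine. The
       sketch below groups a string into base-plus-marks units using
       unicodedata; the function name is hypothetical. Unlike \X, it places
       a mark at the very start of the string in a unit of its own rather
       than failing to match there.

```python
import unicodedata

def extended_sequences(text):
    """Split text into units of one non-mark character plus following marks."""
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and units:
            units[-1] += ch        # a mark attaches to the preceding unit
        else:
            units.append(ch)       # a non-mark character starts a new unit
    return units

units = extended_sequences("e\u0301a")   # 'e' + combining acute accent, then 'a'
```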
 410
 411        Matching characters by Unicode property is not fast, because  PCRE  has
 412        to  search  a  structure  that  contains data for over fifteen thousand
 413        characters. That is why the traditional escape sequences such as \d and
 414        \w do not use Unicode properties in PCRE.
 415
 416    Simple assertions
 417
 418        The  final use of backslash is for certain simple assertions. An asser-
 419        tion specifies a condition that has to be met at a particular point  in
 420        a  match, without consuming any characters from the subject string. The
 421        use of subpatterns for more complicated assertions is described  below.
 422        The backslashed assertions are:
 423
 424          \b     matches at a word boundary
 425          \B     matches when not at a word boundary
 426          \A     matches at the start of the subject
 427          \Z     matches at the end of the subject
 428                  also matches before a newline at the end of the subject
 429          \z     matches only at the end of the subject
 430          \G     matches at the first matching position in the subject
 431
 432        These  assertions may not appear in character classes (but note that \b
 433        has a different meaning, namely the backspace character, inside a char-
 434        acter class).
 435
 436        A  word  boundary is a position in the subject string where the current
 437        character and the previous character do not both match \w or  \W  (i.e.
 438        one  matches  \w  and the other matches \W), or the start or end of the
 439        string if the first or last character matches \w, respectively.
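
       For ASCII subjects, Python's \b and \B follow the same definition,
       so the positions involved can be listed directly. This uses Python's
       re module, not PCRE; the subject string is invented.

```python
import re

subject = "ab cd"

# Offsets at which \b and \B succeed; neither consumes any characters.
boundaries = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
non_boundaries = [m.start() for m in re.finditer(r"\B", subject, re.ASCII)]
```

       Offsets 0 and 5 are boundaries only because the first and last
       characters of the subject are word characters.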
 440
 441        The \A, \Z, and \z assertions differ from  the  traditional  circumflex
 442        and dollar (described in the next section) in that they only ever match
 443        at the very start and end of the subject string, whatever  options  are
 444        set.  Thus,  they are independent of multiline mode. These three asser-
 445        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
 446        affect  only the behaviour of the circumflex and dollar metacharacters.
 447        However, if the startoffset argument of pcre_exec() is non-zero,  indi-
 448        cating that matching is to start at a point other than the beginning of
 449        the subject, \A can never match. The difference between \Z  and  \z  is
 450        that \Z matches before a newline at the end of the string as well as at
 451        the very end, whereas \z matches only at the end.
 452
 453        The \G assertion is true only when the current matching position is  at
 454        the  start point of the match, as specified by the startoffset argument
 455        of pcre_exec(). It differs from \A when the  value  of  startoffset  is
  456        non-zero.  By calling pcre_exec() multiple times with appropriate argu-
  457        ments, you can mimic Perl's /g option, and it is in this kind of imple-
  458        mentation that \G can be useful.
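
       The loop that \G supports can be sketched in Python, which has no \G
       but whose Pattern.match(string, pos) anchors the match at pos. That
       call is used here as an analogue of pcre_exec() with a startoffset
       and a pattern beginning with \G. The token pattern and function name
       are hypothetical.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+")

def tokens(subject):
    """Collect consecutive tokens, stopping at the first gap."""
    found = []
    pos = 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # must match exactly at pos, like \G
        if m is None:
            break                       # an unanchored search would skip ahead
        found.append(m.group())
        pos = m.end()                   # the next attempt starts where this ended
    return found
```

       For the subject "abc123def!x" this yields "abc", "123", and "def",
       and stops at "!" instead of skipping ahead to "x".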
 459
 460        Note,  however,  that  PCRE's interpretation of \G, as the start of the
 461        current match, is subtly different from Perl's, which defines it as the
 462        end  of  the  previous  match. In Perl, these can be different when the
 463        previously matched string was empty. Because PCRE does just  one  match
 464        at a time, it cannot reproduce this behaviour.
 465
 466        If  all  the alternatives of a pattern begin with \G, the expression is
 467        anchored to the starting match position, and the "anchored" flag is set
 468        in the compiled regular expression.
 469
 470
 471 CIRCUMFLEX AND DOLLAR
 472
 473        Outside a character class, in the default matching mode, the circumflex
 474        character is an assertion that is true only  if  the  current  matching
 475        point  is  at the start of the subject string. If the startoffset argu-
 476        ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
 477        PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
 478        has an entirely different meaning (see below).
 479
 480        Circumflex need not be the first character of the pattern if  a  number
 481        of  alternatives are involved, but it should be the first thing in each
 482        alternative in which it appears if the pattern is ever  to  match  that
 483        branch.  If all possible alternatives start with a circumflex, that is,
 484        if the pattern is constrained to match only at the start  of  the  sub-
 485        ject,  it  is  said  to be an "anchored" pattern. (There are also other
 486        constructs that can cause a pattern to be anchored.)
 487
 488        A dollar character is an assertion that is true  only  if  the  current
 489        matching  point  is  at  the  end of the subject string, or immediately
 490        before a newline at the end of the string (by default). Dollar need not
 491        be  the  last  character of the pattern if a number of alternatives are
 492        involved, but it should be the last item in  any  branch  in  which  it
 493        appears. Dollar has no special meaning in a character class.
 494
 495        The  meaning  of  dollar  can be changed so that it matches only at the
 496        very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
 497        compile time. This does not affect the \Z assertion.
 498
 499        The meanings of the circumflex and dollar characters are changed if the
 500        PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
 501        matches  immediately after internal newlines as well as at the start of
 502        the subject string. It does not match after a  newline  that  ends  the
 503        string.  A dollar matches before any newlines in the string, as well as
 504        at the very end, when PCRE_MULTILINE is set. When newline is  specified
 505        as  the  two-character  sequence CRLF, isolated CR and LF characters do
 506        not indicate newlines.
 507
 508        For example, the pattern /^abc$/ matches the subject string  "def\nabc"
 509        (where  \n  represents a newline) in multiline mode, but not otherwise.
 510        Consequently, patterns that are anchored in single  line  mode  because
 511        all  branches  start  with  ^ are not anchored in multiline mode, and a
 512        match for circumflex is  possible  when  the  startoffset  argument  of
 513        pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
 514        PCRE_MULTILINE is set.
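
       The example can be reproduced with Python's re module, whose
       re.MULTILINE flag and pos argument behave like PCRE_MULTILINE and
       startoffset for these cases. This is an analogy, not a PCRE call.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)                 # no match
multi_line = re.search(r"^abc$", subject, re.MULTILINE)    # matches at offset 4

# With a non-zero start position, ^ cannot match in single line mode...
caret_single = re.compile(r"^b").search("ab", 1)
# ...but in multiline mode it can still match just after a newline.
caret_multi = re.compile(r"^b", re.MULTILINE).search("a\nb", 2)

# By default, dollar also matches before a newline that ends the string.
dollar_before_final_newline = re.search(r"abc$", "abc\n") is not None
```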
 515
 516        Note that the sequences \A, \Z, and \z can be used to match  the  start
 517        and  end of the subject in both modes, and if all branches of a pattern
 518        start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
 519        set.
 520
 521
 522 FULL STOP (PERIOD, DOT)
 523
 524        Outside a character class, a dot in the pattern matches any one charac-
 525        ter in the subject string except (by default) a character  that  signi-
 526        fies  the  end  of  a line. In UTF-8 mode, the matched character may be
 527        more than one byte long.
 528
 529        When a line ending is defined as a single character, dot never  matches
 530        that  character; when the two-character sequence CRLF is used, dot does
 531        not match CR if it is immediately followed  by  LF,  but  otherwise  it
 532        matches  all characters (including isolated CRs and LFs). When any Uni-
 533        code line endings are being recognized, dot does not match CR or LF  or
 534        any of the other line ending characters.
 535
 536        The  behaviour  of  dot  with regard to newlines can be changed. If the
 537        PCRE_DOTALL option is set, a dot matches  any  one  character,  without
 538        exception. If the two-character sequence CRLF is present in the subject
 539        string, it takes two dots to match it.
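
       Python's re.DOTALL corresponds to PCRE_DOTALL and shows the same
       behaviour. One difference to bear in mind: Python always treats LF
       alone as the line ending.

```python
import re

dot_default = re.fullmatch(r"a.b", "a\nb")               # dot refuses the newline
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL)        # dot accepts any character

# CRLF is two characters, so one dot is not enough even with DOTALL.
one_dot_crlf = re.fullmatch(r".", "\r\n", re.DOTALL)
two_dots_crlf = re.fullmatch(r"..", "\r\n", re.DOTALL)
```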
 540
 541        The handling of dot is entirely independent of the handling of  circum-
 542        flex  and  dollar,  the  only relationship being that they both involve
 543        newlines. Dot has no special meaning in a character class.
 544
 545
 546 MATCHING A SINGLE BYTE
 547
 548        Outside a character class, the escape sequence \C matches any one byte,
 549        both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any
 550        line-ending characters. The feature is provided in  Perl  in  order  to
 551        match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
 552        acters into individual bytes, what remains in the string may be a  mal-
 553        formed  UTF-8  string.  For this reason, the \C escape sequence is best
 554        avoided.
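
       The hazard can be demonstrated without \C itself. In the Python
       sketch below, a bytes pattern matches one byte at a time, which
       approximates what \C does in UTF-8 mode; the subject character is
       chosen arbitrarily.

```python
import re

data = "\u00e9".encode("utf-8")      # e-acute is two bytes in UTF-8: C3 A9

# A bytes pattern matches single bytes, much as \C would.
first_byte = re.match(rb".", data, re.DOTALL).group()

try:
    first_byte.decode("utf-8")
    malformed = False
except UnicodeDecodeError:
    malformed = True                 # half of a two-byte character is not valid UTF-8
```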
 555
 556        PCRE does not allow \C to appear in  lookbehind  assertions  (described
 557        below),  because  in UTF-8 mode this would make it impossible to calcu-
 558        late the length of the lookbehind.
 559
 560
 561 SQUARE BRACKETS AND CHARACTER CLASSES
 562
 563        An opening square bracket introduces a character class, terminated by a
 564        closing square bracket. A closing square bracket on its own is not spe-
 565        cial. If a closing square bracket is required as a member of the class,
 566        it  should  be  the first data character in the class (after an initial
 567        circumflex, if present) or escaped with a backslash.
 568
 569        A character class matches a single character in the subject.  In  UTF-8
 570        mode,  the character may occupy more than one byte. A matched character
 571        must be in the set of characters defined by the class, unless the first
 572        character  in  the  class definition is a circumflex, in which case the
 573        subject character must not be in the set defined by  the  class.  If  a
 574        circumflex  is actually required as a member of the class, ensure it is
 575        not the first character, or escape it with a backslash.
 576
 577        For example, the character class [aeiou] matches any lower case  vowel,
 578        while  [^aeiou]  matches  any character that is not a lower case vowel.
 579        Note that a circumflex is just a convenient notation for specifying the
 580        characters  that  are in the class by enumerating those that are not. A
 581        class that starts with a circumflex is not an assertion: it still  con-
 582        sumes  a  character  from the subject string, and therefore it fails if
 583        the current pointer is at the end of the string.
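
            As a minimal sketch of the class rules above, the following uses
            Python's re module as a stand-in for PCRE (an assumption: the two
            engines agree on these simple classes). The helper name
            matches_one is hypothetical.

```python
import re

def matches_one(pattern: str, subject: str) -> bool:
    """Return True if the pattern matches the whole subject."""
    return re.fullmatch(pattern, subject) is not None

# [aeiou] accepts a lower case vowel; [^aeiou] accepts any other character.
vowel = matches_one(r"[aeiou]", "e")
not_vowel = matches_one(r"[^aeiou]", "x")

# A negated class still consumes a character, so it cannot match when
# there is nothing left in the subject (here, an empty subject).
at_end = matches_one(r"[^aeiou]", "")

# A circumflex that is not first, or that is escaped, is an ordinary member.
literal_caret = matches_one(r"[a^]", "^") and matches_one(r"[\^a]", "^")
```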
 584
 585        In UTF-8 mode, characters with values greater than 255 can be  included
 586        in  a  class as a literal string of bytes, or by using the \x{ escaping
 587        mechanism.
 588
 589        When caseless matching is set, any letters in a  class  represent  both
 590        their  upper  case  and lower case versions, so for example, a caseless
 591        [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
 592        match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
 593        understands the concept of case for characters whose  values  are  less
 594        than  128, so caseless matching is always possible. For characters with
 595        higher values, the concept of case is supported  if  PCRE  is  compiled
 596        with  Unicode  property support, but not otherwise.  If you want to use
 597        caseless matching for characters 128 and above, you  must  ensure  that
 598        PCRE  is  compiled  with Unicode property support as well as with UTF-8
 599        support.
 600
 601        Characters that might indicate line breaks are  never  treated  in  any
 602        special  way  when  matching  character  classes,  whatever line-ending
 603        sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
 604        PCRE_MULTILINE options is used. A class such as [^a] always matches one
 605        of these characters.
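
            A small illustration, again assuming that Python's re module
            behaves like PCRE here: a negated class matches a newline with no
            options set, whereas an unadorned dot does not.

```python
import re

# A negated class matches a newline without any DOTALL-style option.
negated_class_hits_newline = re.fullmatch(r"[^a]", "\n") is not None

# By default the dot does not match a newline.
dot_hits_newline = re.fullmatch(r".", "\n") is not None
```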
 606
 607        The minus (hyphen) character can be used to specify a range of  charac-
 608        ters  in  a  character  class.  For  example,  [d-m] matches any letter
 609        between d and m, inclusive. If a  minus  character  is  required  in  a
 610        class,  it  must  be  escaped  with a backslash or appear in a position
 611        where it cannot be interpreted as indicating a range, typically as  the
 612        first or last character in the class.
 613
 614        It is not possible to have the literal character "]" as the end charac-
 615        ter of a range. A pattern such as [W-]46] is interpreted as a class  of
 616        two  characters ("W" and "-") followed by a literal string "46]", so it
 617        would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
 618        backslash  it is interpreted as the end of range, so [W-\]46] is inter-
 619        preted as a class containing a range followed by two other  characters.
 620        The  octal or hexadecimal representation of "]" can also be used to end
 621        a range.
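
            The range rules in the two paragraphs above can be sketched as
            follows. Python's re module is used as a stand-in for PCRE, on the
            assumption that both parse these classes in the same way.

```python
import re

# [d-m] is a range; a hyphen placed last in the class is a literal.
in_range = [c for c in "cdmn" if re.fullmatch(r"[d-m]", c)]
literal_hyphen = re.fullmatch(r"[a-]", "-") is not None

# [W-]46] is the class [W-] followed by the literal string "46]".
first = [s for s in ("W46]", "-46]", "X46]") if re.fullmatch(r"[W-]46]", s)]

# With the bracket escaped, W-\] is a range (W to ]) plus "4" and "6".
second = [c for c in "WZ]46-" if re.fullmatch(r"[W-\]46]", c)]
```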
 622
 623        Ranges operate in the collating sequence of character values. They  can
 624        also   be  used  for  characters  specified  numerically,  for  example
 625        [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
 626        are greater than 255, for example [\x{100}-\x{2ff}].
 627
 628        If a range that includes letters is used when caseless matching is set,
 629        it matches the letters in either case. For example, [W-c] is equivalent
 630        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
 631        character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
 632        accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
 633        concept of case for characters with values greater than 127  only  when
 634        it is compiled with Unicode property support.
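
            To check the [W-c] equivalence stated above, this sketch collects
            every ASCII character that a caseless [W-c] matches. It assumes
            that Python's re.IGNORECASE treats ASCII ranges as PCRE's caseless
            matching does; the variable names are illustrative only.

```python
import re

# Collect the ASCII characters matched by [W-c] with caseless matching.
caseless = re.compile(r"[W-c]", re.IGNORECASE)
matched = {chr(n) for n in range(128) if caseless.fullmatch(chr(n))}

# The characters listed in the text, with both cases of each letter.
expected = set("][\\^_`") | set("wxyzabc") | set("WXYZABC")
```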
 635
 636        The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
 637        in a character class, and add the characters that  they  match  to  the
 638        class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
 639        flex can conveniently be used with the upper case  character  types  to
 640        specify  a  more  restricted  set of characters than the matching lower
 641        case type. For example, the class [^\W_] matches any letter  or  digit,
 642        but not underscore.
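
            A brief sketch of character types inside classes, using Python's
            re module as a stand-in for PCRE. The re.ASCII flag is an
            assumption made here so that \W follows ASCII rules, roughly as
            with PCRE's default one-byte character tables.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]")
word_no_underscore = re.compile(r"[^\W_]", re.ASCII)

# Upper case hexadecimal digits only: "G" and "a" are rejected.
hex_hits = [c for c in "7AFGa" if hex_digit.fullmatch(c)]

# Letters and digits match; underscore and hyphen do not.
word_hits = [c for c in "a9_-" if word_no_underscore.fullmatch(c)]
```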
 643
 644        The  only  metacharacters  that are recognized in character classes are
 645        backslash, hyphen (only where it can be  interpreted  as  specifying  a
 646        range),  circumflex  (only  at the start), opening square bracket (only
 647        when it can be interpreted as introducing a POSIX class name - see  the
 648        next  section),  and  the  terminating closing square bracket. However,
 649        escaping other non-alphanumeric characters does no harm.
 650
 651
 652 POSIX CHARACTER CLASSES
 653
 654        Perl supports the POSIX notation for character classes. This uses names
 655        enclosed  by  [: and :] within the enclosing square brackets. PCRE also
 656        supports this notation. For example,
 657
 658          [01[:alpha:]%]
 659
 660        matches "0", "1", any alphabetic character, or "%". The supported class
 661        names are
 662
 663          alnum    letters and digits
 664          alpha    letters
 665          ascii    character codes 0 - 127
 666          blank    space or tab only
 667          cntrl    control characters
 668          digit    decimal digits (same as \d)
 669          graph    printing characters, excluding space
 670          lower    lower case letters
 671          print    printing characters, including space
 672          punct    printing characters, excluding letters and digits
 673          space    white space (not quite the same as \s)
 674          upper    upper case letters
 675          word     "word" characters (same as \w)
 676          xdigit   hexadecimal digits
 677
 678        The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
 679        and space (32). Notice that this list includes the VT  character  (code
 680        11). This makes "space" different to \s, which does not include VT (for
 681        Perl compatibility).
 682
 683        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
 684        from  Perl  5.8. Another Perl extension is negation, which is indicated
 685        by a ^ character after the colon. For example,
 686
 687          [12[:^digit:]]
 688
 689        matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
 690        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
 691        these are not supported, and an error is given if they are encountered.
 692
 693        In UTF-8 mode, characters with values greater than 127 do not match any
 694        of the POSIX character classes.
 695
 696
 697 VERTICAL BAR
 698
 699        Vertical bar characters are used to separate alternative patterns.  For
 700        example, the pattern
 701
 702          gilbert|sullivan
 703
 704        matches  either "gilbert" or "sullivan". Any number of alternatives may
 705        appear, and an empty  alternative  is  permitted  (matching  the  empty
 706        string). The matching process tries each alternative in turn, from left
 707        to right, and the first one that succeeds is used. If the  alternatives
 708        are  within a subpattern (defined below), "succeeds" means matching the
 709        rest of the main pattern as well as the alternative in the  subpattern.
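
            The left-to-right rule can be seen in a short sketch. Python's re
            module stands in for PCRE here, on the assumption that both try
            alternatives in the same order.

```python
import re

either = [w for w in ("gilbert", "sullivan", "savoy")
          if re.fullmatch(r"gilbert|sullivan", w)]

# Alternatives are tried left to right; the first that succeeds is used,
# even when a later one would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"cat|", "") is not None
```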
 710
 711
 712 INTERNAL OPTION SETTING
 713
 714        The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
 715        PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
 716        sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
 717        option letters are
 718
 719          i  for PCRE_CASELESS
 720          m  for PCRE_MULTILINE
 721          s  for PCRE_DOTALL
 722          x  for PCRE_EXTENDED
 723
 724        For example, (?im) sets caseless, multiline matching. It is also possi-
 725        ble to unset these options by preceding the letter with a hyphen, and a
 726        combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
 727        LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
 728        is also permitted. If a  letter  appears  both  before  and  after  the
 729        hyphen, the option is unset.
 730
 731        When  an option change occurs at top level (that is, not inside subpat-
 732        tern parentheses), the change applies to the remainder of  the  pattern
 733        that follows.  If the change is placed right at the start of a pattern,
 734        PCRE extracts it into the global options (and it will therefore show up
 735        in data extracted by the pcre_fullinfo() function).
 736
 737        An  option  change  within a subpattern (see below for a description of
 738        subpatterns) affects only that part of the current pattern that follows
 739        it, so
 740
 741          (a(?i)b)c
 742
 743        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
 744        used).  By this means, options can be made to have  different  settings
 745        in  different parts of the pattern. Any changes made in one alternative
 746        do carry on into subsequent branches within the  same  subpattern.  For
 747        example,
 748
 749          (a(?i)b|c)
 750
 751        matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
 752        first branch is abandoned before the option setting.  This  is  because
 753        the effects of option settings happen at compile time, not during
 754        matching; otherwise results would depend on which branches were tried.
 755
 756        The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
 757        can  be changed in the same way as the Perl-compatible options by using
 758        the characters J, U and X respectively.
 759
 760
 761 SUBPATTERNS
 762
 763        Subpatterns are delimited by parentheses (round brackets), which can be
 764        nested.  Turning part of a pattern into a subpattern does two things:
 765
 766        1. It localizes a set of alternatives. For example, the pattern
 767
 768          cat(aract|erpillar|)
 769
 770        matches  one  of the words "cat", "cataract", or "caterpillar". Without
 771        the parentheses, it would match  "cataract",  "erpillar"  or  an  empty
 772        string.
 773
 774        2.  It  sets  up  the  subpattern as a capturing subpattern. This means
 775        that, when the whole pattern  matches,  that  portion  of  the  subject
 776        string that matched the subpattern is passed back to the caller via the
 777        ovector argument of pcre_exec(). Opening parentheses are  counted  from
 778        left  to  right  (starting  from 1) to obtain numbers for the capturing
 779        subpatterns.
 780
 781        For example, if the string "the red king" is matched against  the  pat-
 782        tern
 783
 784          the ((red|white) (king|queen))
 785
 786        the captured substrings are "red king", "red", and "king", and are num-
 787        bered 1, 2, and 3, respectively.
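
            Both effects of parentheses can be sketched briefly. Python's re
            module is used as a stand-in for PCRE; its groups() method plays
            the part of the ovector results mentioned above (an analogy, not
            the PCRE API).

```python
import re

words = ("cat", "cataract", "caterpillar", "erpillar")

# With parentheses, the alternatives are local to the group.
grouped = [w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Without them, "cat" belongs only to the first alternative.
ungrouped = [w for w in words if re.fullmatch(r"cataract|erpillar|", w)]

# Groups are numbered by the position of their opening parenthesis.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captures = m.groups()
```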
 788
 789        The fact that plain parentheses fulfil  two  functions  is  not  always
 790        helpful.   There are often times when a grouping subpattern is required
 791        without a capturing requirement. If an opening parenthesis is  followed
 792        by  a question mark and a colon, the subpattern does not do any captur-
 793        ing, and is not counted when computing the  number  of  any  subsequent
 794        capturing  subpatterns. For example, if the string "the white queen" is
 795        matched against the pattern
 796
 797          the ((?:red|white) (king|queen))
 798
 799        the captured substrings are "white queen" and "queen", and are numbered
 800        1 and 2. The maximum number of capturing subpatterns is 65535.
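
            For comparison, a sketch of the non-capturing form, again with
            Python's re module standing in for PCRE:

```python
import re

m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# The (?:...) group is not counted, so only two substrings are captured.
captures = m.groups()
group_count = m.re.groups
```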
 801
 802        As  a  convenient shorthand, if any option settings are required at the
 803        start of a non-capturing subpattern,  the  option  letters  may  appear
 804        between the "?" and the ":". Thus the two patterns
 805
 806          (?i:saturday|sunday)
 807          (?:(?i)saturday|sunday)
 808
 809        match exactly the same set of strings. Because alternative branches are
 810        tried from left to right, and options are not reset until  the  end  of
 811        the  subpattern is reached, an option setting in one branch does affect
 812        subsequent branches, so the above patterns match "SUNDAY"  as  well  as
 813        "Saturday".
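
            The shorthand form can be tried in Python's re module, which
            accepts the same (?i:...) syntax (Python 3.6 or later is assumed).
            Recent Python versions reject an option setting that is not at the
            very start of a pattern, so only the first of the two patterns is
            shown.

```python
import re

days = ("saturday", "Saturday", "SUNDAY", "monday")

# The caseless setting covers both alternatives inside the group.
shorthand = [d for d in days if re.fullmatch(r"(?i:saturday|sunday)", d)]
```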
 814
 815
 816 NAMED SUBPATTERNS
 817
 818        Identifying  capturing  parentheses  by number is simple, but it can be
 819        very hard to keep track of the numbers in complicated  regular  expres-
 820        sions.  Furthermore,  if  an  expression  is  modified, the numbers may
 821        change. To help with this difficulty, PCRE supports the naming of  sub-
 822        patterns. This feature was not added to Perl until release 5.10. Python
 823        had the feature earlier, and PCRE introduced it at release  4.0,  using
 824        the  Python syntax. PCRE now supports both the Perl and the Python syn-
 825        tax.
 826
 827        In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
 828        or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
 829        to capturing parentheses from other parts of the pattern, such as back-
 830        references,  recursion,  and conditions, can be made by name as well as
 831        by number.
 832
 833        Names consist of up to  32  alphanumeric  characters  and  underscores.
 834        Named  capturing  parentheses  are  still  allocated numbers as well as
 835        names, exactly as if the names were not present. The PCRE API  provides
 836        function calls for extracting the name-to-number translation table from
 837        a compiled pattern. There is also a convenience function for extracting
 838        a captured substring by name.
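
            As an illustration of names coexisting with numbers, this sketch
            uses the Python syntax in Python's own re module. The groupindex
            attribute is Python's analogue of the name-to-number table; it is
            not the PCRE API, and the group names are hypothetical.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = pattern.fullmatch("2006-12")

# Named groups are still numbered as if the names were absent.
name_to_number = dict(pattern.groupindex)
by_name = m.group("year")
by_number = m.group(1)
```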
 839
 840        By  default, a name must be unique within a pattern, but it is possible
 841        to relax this constraint by setting the PCRE_DUPNAMES option at compile
 842        time.  This  can  be useful for patterns where only one instance of the
 843        named parentheses can match. Suppose you want to match the  name  of  a
 844        weekday,  either as a 3-letter abbreviation or as the full name, and in
 845        both cases you want to extract the abbreviation. This pattern (ignoring
 846        the line breaks) does the job:
 847
 848          (?<DN>Mon|Fri|Sun)(?:day)?|
 849          (?<DN>Tue)(?:sday)?|
 850          (?<DN>Wed)(?:nesday)?|
 851          (?<DN>Thu)(?:rsday)?|
 852          (?<DN>Sat)(?:urday)?
 853
 854        There  are  five capturing substrings, but only one is ever set after a
 855        match.  The convenience  function  for  extracting  the  data  by  name
 856        returns  the  substring  for  the first (and in this example, the only)
 857        subpattern of that name that matched.  This  saves  searching  to  find
 858        which  numbered  subpattern  it  was. If you make a reference to a non-
 859        unique named subpattern from elsewhere in the  pattern,  the  one  that
 860        corresponds  to  the  lowest number is used. For further details of the
 861        interfaces for handling named subpatterns, see the  pcreapi  documenta-
 862        tion.
 863
 864
 865 REPETITION
 866
 867        Repetition  is  specified  by  quantifiers, which can follow any of the
 868        following items:
 869
 870          a literal data character
 871          the dot metacharacter
 872          the \C escape sequence
 873          the \X escape sequence (in UTF-8 mode with Unicode properties)
 874          the \R escape sequence
 875          an escape such as \d that matches a single character
 876          a character class
 877          a back reference (see next section)
 878          a parenthesized subpattern (unless it is an assertion)
 879
 880        The general repetition quantifier specifies a minimum and maximum  num-
 881        ber  of  permitted matches, by giving the two numbers in curly brackets
 882        (braces), separated by a comma. The numbers must be  less  than  65536,
 883        and the first must be less than or equal to the second. For example:
 884
 885          z{2,4}
 886
 887        matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
 888        special character. If the second number is omitted, but  the  comma  is
 889        present,  there  is  no upper limit; if the second number and the comma
 890        are both omitted, the quantifier specifies an exact number of  required
 891        matches. Thus
 892
 893          [aeiou]{3,}
 894
 895        matches at least 3 successive vowels, but may match many more, while
 896
 897          \d{8}
 898
 899        matches  exactly  8  digits. An opening curly bracket that appears in a
 900        position where a quantifier is not allowed, or one that does not  match
 901        the  syntax of a quantifier, is taken as a literal character. For exam-
 902        ple, {,6} is not a quantifier, but a literal string of four characters.
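
            The three quantifier forms can be sketched with Python's re module
            as a stand-in for PCRE. The {,6} case is not shown, because Python
            treats {,6} as a quantifier meaning {0,6}, which differs from the
            behaviour described above. The helper name accepted is
            hypothetical.

```python
import re

def accepted(pattern, subjects):
    """Return the subjects that the pattern matches in full."""
    return [s for s in subjects if re.fullmatch(pattern, s)]

two_to_four = accepted(r"z{2,4}", ["z", "zz", "zzzz", "zzzzz"])
at_least_three = accepted(r"[aeiou]{3,}", ["ae", "aei", "aeiou"])
exactly_eight = accepted(r"\d{8}", ["1234567", "12345678", "123456789"])
```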
 903
 904        In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
 905        individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
 906        acters, each of which is represented by a two-byte sequence. Similarly,
 907        when Unicode property support is available, \X{3} matches three Unicode
 908        extended  sequences,  each of which may be several bytes long (and they
 909        may be of different lengths).
 910
 911        The quantifier {0} is permitted, causing the expression to behave as if
 912        the previous item and the quantifier were not present.
 913
 914        For  convenience, the three most common quantifiers have single-charac-
 915        ter abbreviations:
 916
 917          *    is equivalent to {0,}
 918          +    is equivalent to {1,}
 919          ?    is equivalent to {0,1}
 920
 921        It is possible to construct infinite loops by  following  a  subpattern
 922        that can match no characters with a quantifier that has no upper limit,
 923        for example:
 924
 925          (a?)*
 926
 927        Earlier versions of Perl and PCRE used to give an error at compile time
 928        for  such  patterns. However, because there are cases where this can be
 929        useful, such patterns are now accepted, but if any  repetition  of  the
 930        subpattern  does in fact match no characters, the loop is forcibly bro-
 931        ken.
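
            Python's re module also accepts such patterns and terminates, so
            it can serve as a rough illustration (the loop-breaking details
            inside PCRE are not shown and may differ):

```python
import re

# The group can match the empty string, yet the match still terminates.
m = re.fullmatch(r"(a?)*", "aaa")
whole = m.group(0)
empty_subject = re.fullmatch(r"(a?)*", "") is not None
```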
 932
 933        By default, the quantifiers are "greedy", that is, they match  as  much
 934        as  possible  (up  to  the  maximum number of permitted times), without
 935        causing the rest of the pattern to fail. The classic example  of  where
 936        this gives problems is in trying to match comments in C programs. These
 937        appear between /* and */ and within the comment,  individual  *  and  /
 938        characters  may  appear. An attempt to match C comments by applying the
 939        pattern
 940
 941          /\*.*\*/
 942
 943        to the string
 944
 945          /* first comment */  not comment  /* second comment */
 946
 947        fails, because it matches the entire string owing to the greediness  of
 948        the .*  item.
 949
 950        However,  if  a quantifier is followed by a question mark, it ceases to
 951        be greedy, and instead matches the minimum number of times possible, so
 952        the pattern
 953
 954          /\*.*?\*/
 955
 956        does  the  right  thing with the C comments. The meaning of the various
 957        quantifiers is not otherwise changed,  just  the  preferred  number  of
 958        matches.   Do  not  confuse this use of question mark with its use as a
 959        quantifier in its own right. Because it has two uses, it can  sometimes
 960        appear doubled, as in
 961
 962          \d??\d
 963
 964        which matches one digit by preference, but can match two if that is the
 965        only way the rest of the pattern matches.
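
            The greedy and minimizing forms can be compared directly. This
            sketch uses Python's re module, assuming it agrees with PCRE on
            these quantifiers.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first */ that allows a match.
lazy = re.search(r"/\*.*?\*/", subject).group()
all_comments = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()
```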
 966
 967        If the PCRE_UNGREEDY option is set (an option that is not available  in
 968        Perl),  the  quantifiers are not greedy by default, but individual ones
 969        can be made greedy by following them with a  question  mark.  In  other
 970        words, it inverts the default behaviour.
 971
 972        When  a  parenthesized  subpattern  is quantified with a minimum repeat
 973        count that is greater than 1 or with a limited maximum, more memory  is
 974        required  for  the  compiled  pattern, in proportion to the size of the
 975        minimum or maximum.
 976
 977        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
 978        alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
 979        the pattern is implicitly anchored, because whatever  follows  will  be
 980        tried  against every character position in the subject string, so there
 981        is no point in retrying the overall match at  any  position  after  the
 982        first.  PCRE  normally treats such a pattern as though it were preceded
 983        by \A.
 984
 985        In cases where it is known that the subject  string  contains  no  new-
 986        lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
 987        mization, or alternatively using ^ to indicate anchoring explicitly.
 988
 989        However, there is one situation where the optimization cannot be  used.
 990        When  .*   is  inside  capturing  parentheses that are the subject of a
 991        backreference elsewhere in the pattern, a match at the start  may  fail
 992        where a later one succeeds. Consider, for example:
 993
 994          (.*)abc\1
 995
 996        If  the subject is "xyz123abc123" the match point is the fourth charac-
 997        ter. For this reason, such a pattern is not implicitly anchored.
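
            The match point in this example can be confirmed with a short
            sketch. Python's re module stands in for PCRE; offsets are counted
            from zero, so the fourth character is at offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")
start_offset = m.start()
captured = m.group(1)

# Forcing the match to begin at the start of the subject fails.
anchored = re.match(r"(.*)abc\1", "xyz123abc123")
```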
 998
 999        When a capturing subpattern is repeated, the value captured is the sub-
1000        string that matched the final iteration. For example, after
1001
1002          (tweedle[dume]{3}\s*)+
1003
1004        has matched "tweedledum tweedledee" the value of the captured substring
1005        is "tweedledee". However, if there are  nested  capturing  subpatterns,
1006        the  corresponding captured values may have been set in previous itera-
1007        tions. For example, after
1008
1009          (a|(b))+
1010
1011        matches "aba" the value of the second captured substring is "b".
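
            Both capture rules can be observed in Python's re module, which is
            assumed here to keep nested captures from earlier iterations in
            the same way as PCRE:

```python
import re

m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The inner group was last set on the second iteration and keeps "b",
# although the final iteration of the outer group matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.groups()
```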
1012
1013
1014 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
1015
1016        With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
1017        repetition,  failure  of what follows normally causes the repeated item
1018        to be re-evaluated to see if a different number of repeats  allows  the
1019        rest  of  the pattern to match. Sometimes it is useful to prevent this,
1020        either to change the nature of the match, or to cause it to fail earlier
1021        than  it otherwise might, when the author of the pattern knows there is
1022        no point in carrying on.
1023
1024        Consider, for example, the pattern \d+foo when applied to  the  subject
1025        line
1026
1027          123456bar
1028
1029        After matching all 6 digits and then failing to match "foo", the normal
1030        action of the matcher is to try again with only 5 digits  matching  the
1031        \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
1032        "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
1033        the  means for specifying that once a subpattern has matched, it is not
1034        to be re-evaluated in this way.
1035
1036        If we use atomic grouping for the previous example, the  matcher  gives
1037        up  immediately  on failing to match "foo" the first time. The notation
1038        is a kind of special parenthesis, starting with (?> as in this example:
1039
1040          (?>\d+)foo
1041
1042        This  kind  of  parenthesis "locks up" the  part of the pattern it con-
1043        tains once it has matched, and a failure further into  the  pattern  is
1044        prevented  from  backtracking into it. Backtracking past it to previous
1045        items, however, works as normal.
1046
1047        An alternative description is that a subpattern of  this  type  matches
1048        the  string  of  characters  that an identical standalone pattern would
1049        match, if anchored at the current point in the subject string.
1050
1051        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1052        such as the above example can be thought of as a maximizing repeat that
1053        must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
1054        pared  to  adjust  the number of digits they match in order to make the
1055        rest of the pattern match, (?>\d+) can only match an entire sequence of
1056        digits.
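
            The difference in meaning can be sketched in Python. Because
            Python's re module only gained the (?> syntax in version 3.11,
            this example swaps in a well-known emulation instead: a lookahead
            captures the longest run, and a back reference then consumes
            exactly that run. It relies on lookaheads not being re-entered on
            backtracking. The group name run is hypothetical.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "6" so the rest matches.
backtracking = re.match(r"\d+6", "123456")

# Emulation of (?>\d+)6: the run of digits is fixed once captured, so
# nothing is left over for the literal "6" and the match fails.
atomic_like = re.match(r"(?=(?P<run>\d+))(?P=run)6", "123456")

# When the text after the digits does match, both forms succeed.
atomic_foo = re.match(r"(?=(?P<run>\d+))(?P=run)foo", "123456foo")
```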
1057
1058        Atomic  groups in general can of course contain arbitrarily complicated
1059        subpatterns, and can be nested. However, when  the  subpattern  for  an
1060        atomic group is just a single repeated item, as in the example above, a
1061        simpler notation, called a "possessive quantifier" can  be  used.  This
1062        consists  of  an  additional  + character following a quantifier. Using
1063        this notation, the previous example can be rewritten as
1064
1065          \d++foo
1066
1067        Possessive  quantifiers  are  always  greedy;  the   setting   of   the
1068        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1069        simpler forms of atomic group. However, there is no difference  in  the
1070        meaning  of  a  possessive  quantifier and the equivalent atomic group,
1071        though there may be a performance  difference;  possessive  quantifiers
1072        should be slightly faster.
1073
1074        The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
1075        tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
1076        edition of his book. Mike McCloskey liked it, so implemented it when he
1077        built Sun's Java package, and PCRE copied it from there. It  ultimately
1078        found its way into Perl at release 5.10.
1079
1080        PCRE has an optimization that automatically "possessifies" certain sim-
1081        ple pattern constructs. For example, the sequence  A+B  is  treated  as
1082        A++B  because  there is no point in backtracking into a sequence of A's
1083        when B must follow.
1084
1085        When a pattern contains an unlimited repeat inside  a  subpattern  that
1086        can  itself  be  repeated  an  unlimited number of times, the use of an
1087        atomic group is the only way to avoid some  failing  matches  taking  a
1088        very long time indeed. The pattern
1089
1090          (\D+|<\d+>)*[!?]
1091
1092        matches  an  unlimited number of substrings that either consist of non-
1093        digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1094        matches, it runs quickly. However, if it is applied to
1095
1096          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1097
1098        it  takes  a  long  time  before reporting failure. This is because the
1099        string can be divided between the internal \D+ repeat and the  external
1100        *  repeat  in  a  large  number of ways, and all have to be tried. (The
1101        example uses [!?] rather than a single character at  the  end,  because
1102        both  PCRE  and  Perl have an optimization that allows for fast failure
1103        when a single character is used. They remember the last single  charac-
1104        ter  that  is required for a match, and fail early if it is not present
1105        in the string.) If the pattern is changed so that  it  uses  an  atomic
1106        group, like this:
1107
1108          ((?>\D+)|<\d+>)*[!?]
1109
1110        sequences  of non-digits cannot be broken, and failure happens quickly.
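
            A sketch of the protected pattern follows, written for Python's re
            module. It does not use the (?> syntax itself (available in Python
            only from 3.11) but the lookahead-plus-back-reference emulation,
            and the outer group is made non-capturing; both are substitutions
            made for this example. The unprotected pattern is deliberately not
            run, since on this subject it may take an impractically long time.

```python
import re

subject = "a" * 52

# Emulated form of ((?>\D+)|<\d+>)*[!?]; "run" is a hypothetical group name.
atomic_like = re.compile(r"(?:(?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

# The run of non-digits cannot be split up, so failure is reported quickly.
fails_fast = atomic_like.match(subject) is None

# A subject that the pattern does match.
still_matches = atomic_like.fullmatch("<12>!") is not None
```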
1111
1112
1113 BACK REFERENCES
1114
1115        Outside a character class, a backslash followed by a digit greater than
1116        0 (and possibly further digits) is a back reference to a capturing sub-
1117        pattern earlier (that is, to its left) in the pattern,  provided  there
1118        have been that many previous capturing left parentheses.
1119
1120        However, if the decimal number following the backslash is less than 10,
1121        it is always taken as a back reference, and causes  an  error  only  if
1122        there  are  not that many capturing left parentheses in the entire pat-
1123        tern. In other words, the parentheses that are referenced need  not  be
1124        to  the left of the reference for numbers less than 10. A "forward back
1125        reference" of this type can make sense when a  repetition  is  involved
1126        and  the  subpattern to the right has participated in an earlier itera-
1127        tion.
1128
1129        It is not possible to have a numerical "forward back  reference"  to  a
1130        subpattern  whose  number  is  10  or  more using this syntax because a
1131        sequence such as \50 is interpreted as a character  defined  in  octal.
1132        See the subsection entitled "Non-printing characters" above for further
1133        details of the handling of digits following a backslash.  There  is  no
1134        such  problem  when named parentheses are used. A back reference to any
1135        subpattern is possible using named parentheses (see below).
1136
1137        Another way of avoiding the ambiguity inherent in  the  use  of  digits
1138        following a backslash is to use the \g escape sequence, which is a fea-
1139        ture introduced in Perl 5.10. This escape must be followed by  a  posi-
1140        tive  or  a negative number, optionally enclosed in braces. These exam-
1141        ples are all identical:
1142
1143          (ring), \1
1144          (ring), \g1
1145          (ring), \g{1}
1146
1147        A positive number specifies an absolute reference without the ambiguity
1148        that  is  present  in  the older syntax. It is also useful when literal
1149        digits follow the reference. A negative number is a relative reference.
1150        Consider this example:
1151
1152          (abc(def)ghi)\g{-1}
1153
1154        The sequence \g{-1} is a reference to the most recently started captur-
1155        ing subpattern before \g, that is, it is equivalent to  \2.  Similarly,
1156        \g{-2} would be equivalent to \1. The use of relative references can be
1157        helpful in long patterns, and also in  patterns  that  are  created  by
1158        joining together fragments that contain references within themselves.
1159
1160        A  back  reference matches whatever actually matched the capturing sub-
1161        pattern in the current subject string, rather  than  anything  matching
1162        the subpattern itself (see "Subpatterns as subroutines" below for a way
1163        of doing that). So the pattern
1164
1165          (sens|respons)e and \1ibility
1166
1167        matches "sense and sensibility" and "response and responsibility",  but
1168        not  "sense and responsibility". If caseful matching is in force at the
1169        time of the back reference, the case of letters is relevant. For  exam-
1170        ple,
1171
1172          ((?i)rah)\s+\1
1173
1174        matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
1175        original capturing subpattern is matched caselessly.
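
            The two examples above can be tried with any engine that shares this
            back reference syntax. The following is a minimal sketch that uses
            Python's re module rather than PCRE itself (an assumption, made so
            that the example runs without extra libraries). The scoped form
            (?i:rah) stands in for (?i)rah, because recent Python versions
            reject an option setting in the middle of a pattern. The helper name
            matches() is hypothetical.

```python
import re

# \1 must repeat the text that group 1 actually matched.
same_stem = re.compile(r"(sens|respons)e and \1ibility")

# Group 1 is matched caselessly, but the back reference itself is caseful.
# (?i:rah) is Python's scoped stand-in for PCRE's (?i)rah.
rah = re.compile(r"((?i:rah))\s+\1")


def matches(regex, subject):
    """Return True if the whole subject matches the compiled regex."""
    return regex.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ("sense and sensibility", "sense and responsibility"):
        print(s, matches(same_stem, s))
    for s in ("rah rah", "RAH RAH", "RAH rah"):
        print(s, matches(rah, s))
```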
1176
1177        Back references to named subpatterns use the Perl  syntax  \k<name>  or
1178        \k'name'  or  the  Python  syntax (?P=name). We could rewrite the above
1179        example in either of the following ways:
1180
1181          (?<p1>(?i)rah)\s+\k<p1>
1182          (?P<p1>(?i)rah)\s+(?P=p1)
1183
1184        A subpattern that is referenced by  name  may  appear  in  the  pattern
1185        before or after the reference.
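
            As a rough illustration of the Python form (?P=name), here is a
            sketch using Python's re module, not PCRE (an assumption made for
            runnability). Python's re does not accept the \k<name> form, and
            (?i:rah) again stands in for (?i)rah.

```python
import re

# Named group p1, referenced later by name instead of by number.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

if __name__ == "__main__":
    m = named.fullmatch("RAH RAH")
    print(m.group("p1") if m else None)
```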
1186
1187        There  may be more than one back reference to the same subpattern. If a
1188        subpattern has not actually been used in a particular match,  any  back
1189        references to it always fail. For example, the pattern
1190
1191          (a|(bc))\2
1192
1193        always  fails if it starts to match "a" rather than "bc". Because there
1194        may be many capturing parentheses in a pattern,  all  digits  following
1195        the  backslash  are taken as part of a potential back reference number.
1196        If the pattern continues with a digit character, some delimiter must be
1197        used  to  terminate  the back reference. If the PCRE_EXTENDED option is
1198        set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com-
1199        ments" below) can be used.
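
            Both points in the paragraph above can be sketched with Python's re
            module, which behaves the same way for these two cases (Python is an
            assumption here; it is not the PCRE library). The second pattern
            uses an empty comment to end the back reference before a literal
            digit.

```python
import re

# \2 refers to the inner group, which is unset when the "a" branch is taken.
unset_reference = re.compile(r"(a|(bc))\2")

# The empty comment (?#) terminates \1 so that the following 1 is a literal.
delimited = re.compile(r"(a)\1(?#)1")

if __name__ == "__main__":
    print(unset_reference.fullmatch("bcbc") is not None)
    print(unset_reference.fullmatch("aa") is not None)
    print(delimited.fullmatch("aa1") is not None)
```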
1200
1201        A  back reference that occurs inside the parentheses to which it refers
1202        fails when the subpattern is first used, so, for example,  (a\1)  never
1203        matches.   However,  such references can be useful inside repeated sub-
1204        patterns. For example, the pattern
1205
1206          (a|b\1)+
1207
1208        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1209        ation  of  the  subpattern,  the  back  reference matches the character
1210        string corresponding to the previous iteration. In order  for  this  to
1211        work,  the  pattern must be such that the first iteration does not need
1212        to match the back reference. This can be done using alternation, as  in
1213        the example above, or by a quantifier with a minimum of zero.
1214
1215
1216 ASSERTIONS
1217
1218        An  assertion  is  a  test on the characters following or preceding the
1219        current matching point that does not actually consume  any  characters.
1220        The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
1221        described above.
1222
1223        More complicated assertions are coded as  subpatterns.  There  are  two
1224        kinds:  those  that  look  ahead of the current position in the subject
1225        string, and those that look  behind  it.  An  assertion  subpattern  is
1226        matched  in  the  normal way, except that it does not cause the current
1227        matching position to be changed.
1228
1229        Assertion subpatterns are not capturing subpatterns,  and  may  not  be
1230        repeated,  because  it  makes no sense to assert the same thing several
1231        times. If any kind of assertion contains capturing  subpatterns  within
1232        it,  these are counted for the purposes of numbering the capturing sub-
1233        patterns in the whole pattern.  However, substring capturing is carried
1234        out  only  for  positive assertions, because it does not make sense for
1235        negative assertions.
1236
1237    Lookahead assertions
1238
1239        Lookahead assertions start with (?= for positive assertions and (?! for
1240        negative assertions. For example,
1241
1242          \w+(?=;)
1243
1244        matches  a word followed by a semicolon, but does not include the semi-
1245        colon in the match, and
1246
1247          foo(?!bar)
1248
1249        matches any occurrence of "foo" that is not  followed  by  "bar".  Note
1250        that the apparently similar pattern
1251
1252          (?!foo)bar
1253
1254        does  not  find  an  occurrence  of "bar" that is preceded by something
1255        other than "foo"; it finds any occurrence of "bar" whatsoever,  because
1256        the assertion (?!foo) is always true when the next three characters are
1257        "bar". A lookbehind assertion is needed to achieve the other effect.
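
            The three lookahead patterns above behave the same way in Python's
            re module, so a short sketch can show the difference between them.
            Python is used here only as a convenient stand-in for PCRE, and the
            helper first_match() is a hypothetical name.

```python
import re

word_before_semicolon = re.compile(r"\w+(?=;)")
foo_not_bar = re.compile(r"foo(?!bar)")
misleading = re.compile(r"(?!foo)bar")   # finds any "bar" at all


def first_match(regex, subject):
    """Return the text of the first match in subject, or None."""
    m = regex.search(subject)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_match(word_before_semicolon, "total; done"))
    print(first_match(foo_not_bar, "foobar"))
    print(first_match(misleading, "foobar"))
```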
1258
1259        If you want to force a matching failure at some point in a pattern, the
1260        most  convenient  way  to  do  it  is with (?!) because an empty string
1261        always matches, so an assertion that requires there not to be an  empty
1262        string must always fail.
1263
1264    Lookbehind assertions
1265
1266        Lookbehind  assertions start with (?<= for positive assertions and (?<!
1267        for negative assertions. For example,
1268
1269          (?<!foo)bar
1270
1271        does find an occurrence of "bar" that is not  preceded  by  "foo".  The
1272        contents  of  a  lookbehind  assertion are restricted such that all the
1273        strings it matches must have a fixed length. However, if there are sev-
1274        eral  top-level  alternatives,  they  do  not all have to have the same
1275        fixed length. Thus
1276
1277          (?<=bullock|donkey)
1278
1279        is permitted, but
1280
1281          (?<!dogs?|cats?)
1282
1283        causes an error at compile time. Branches that match  different  length
1284        strings  are permitted only at the top level of a lookbehind assertion.
1285        This is an extension compared with  Perl  (at  least  for  5.8),  which
1286        requires  all branches to match the same length of string. An assertion
1287        such as
1288
1289          (?<=ab(c|de))
1290
1291        is not permitted, because its single top-level  branch  can  match  two
1292        different  lengths,  but  it is acceptable if rewritten to use two top-
1293        level branches:
1294
1295          (?<=abc|abde)
1296
1297        The implementation of lookbehind assertions is, for  each  alternative,
1298        to  temporarily  move the current position back by the fixed length and
1299        then try to match. If there are insufficient characters before the cur-
1300        rent position, the assertion fails.
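
            A minimal sketch of the negative lookbehind above, using Python's re
            module as a stand-in for PCRE (an assumption). Python is stricter
            than PCRE: it requires the whole lookbehind to have one fixed
            length, so the examples with top-level alternatives of different
            lengths are not reproduced here. The helper find_starts() is a
            hypothetical name.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")


def find_starts(regex, subject):
    """Return the start offsets of all matches of regex in subject."""
    return [m.start() for m in regex.finditer(subject)]


if __name__ == "__main__":
    # The "bar" at offset 3 is preceded by "foo" and is therefore skipped.
    print(find_starts(not_after_foo, "foobar bazbar bar"))
    # With fewer than three preceding characters, "foo" cannot precede "bar".
    print(find_starts(not_after_foo, "bar"))
```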
1301
1302        PCRE does not allow the \C escape (which matches a single byte in UTF-8
1303        mode) to appear in lookbehind assertions, because it makes it  impossi-
1304        ble  to  calculate the length of the lookbehind. The \X and \R escapes,
1305        which can match different numbers of bytes, are also not permitted.
1306
1307        Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
1308        assertions  to  specify  efficient  matching  at the end of the subject
1309        string. Consider a simple pattern such as
1310
1311          abcd$
1312
1313        when applied to a long string that does  not  match.  Because  matching
1314        proceeds from left to right, PCRE will look for each "a" in the subject
1315        and then see if what follows matches the rest of the  pattern.  If  the
1316        pattern is specified as
1317
1318          ^.*abcd$
1319
1320        the  initial .* matches the entire string at first, but when this fails
1321        (because there is no following "a"), it backtracks to match all but the
1322        last  character,  then all but the last two characters, and so on. Once
1323        again the search for "a" covers the entire string, from right to  left,
1324        so we are no better off. However, if the pattern is written as
1325
1326          ^.*+(?<=abcd)
1327
1328        there  can  be  no backtracking for the .*+ item; it can match only the
1329        entire string. The subsequent lookbehind assertion does a  single  test
1330        on  the last four characters. If it fails, the match fails immediately.
1331        For long strings, this approach makes a significant difference  to  the
1332        processing time.
1333
1334    Using multiple assertions
1335
1336        Several assertions (of any sort) may occur in succession. For example,
1337
1338          (?<=\d{3})(?<!999)foo
1339
1340        matches  "foo" preceded by three digits that are not "999". Notice that
1341        each of the assertions is applied independently at the  same  point  in
1342        the  subject  string.  First  there  is a check that the previous three
1343        characters are all digits, and then there is  a  check  that  the  same
1344        three characters are not "999".  This pattern does not match "foo" pre-
1345        ceded by six characters, the first three of which are digits and the last
1346        three  of  which  are not "999". For example, it doesn't match "123abc-
1347        foo". A pattern to do that is
1348
1349          (?<=\d{3}...)(?<!999)foo
1350
1351        This time the first assertion looks at the  preceding  six  characters,
1352        checking that the first three are digits, and then the second assertion
1353        checks that the preceding three characters are not "999".
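
            The difference between the two patterns can be checked directly.
            This sketch uses Python's re module, which accepts both patterns
            unchanged; it is an illustration under that assumption, not a test
            of PCRE itself.

```python
import re

# Both assertions inspect the same three characters before "foo".
three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(subject,
              three_digits.search(subject) is not None,
              six_back.search(subject) is not None)
```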
1354
1355        Assertions can be nested in any combination. For example,
1356
1357          (?<=(?<!foo)bar)baz
1358
1359        matches an occurrence of "baz" that is preceded by "bar" which in  turn
1360        is not preceded by "foo", while
1361
1362          (?<=\d{3}(?!999)...)foo
1363
1364        is  another pattern that matches "foo" preceded by three digits and any
1365        three characters that are not "999".
1366
1367
1368 CONDITIONAL SUBPATTERNS
1369
1370        It is possible to cause the matching process to obey a subpattern  con-
1371        ditionally  or to choose between two alternative subpatterns, depending
1372        on the result of an assertion, or whether a previous capturing  subpat-
1373        tern  matched  or not. The two possible forms of conditional subpattern
1374        are
1375
1376          (?(condition)yes-pattern)
1377          (?(condition)yes-pattern|no-pattern)
1378
1379        If the condition is satisfied, the yes-pattern is used;  otherwise  the
1380        no-pattern  (if  present)  is used. If there are more than two alterna-
1381        tives in the subpattern, a compile-time error occurs.
1382
1383        There are four kinds of condition: references  to  subpatterns,  refer-
1384        ences to recursion, a pseudo-condition called DEFINE, and assertions.
1385
1386    Checking for a used subpattern by number
1387
1388        If  the  text between the parentheses consists of a sequence of digits,
1389        the condition is true if the capturing subpattern of  that  number  has
1390        previously matched.
1391
1392        Consider  the  following  pattern, which contains non-significant white
1393        space to make it more readable (assume the PCRE_EXTENDED option) and to
1394        divide it into three parts for ease of discussion:
1395
1396          ( \( )?    [^()]+    (?(1) \) )
1397
1398        The  first  part  matches  an optional opening parenthesis, and if that
1399        character is present, sets it as the first captured substring. The sec-
1400        ond  part  matches one or more characters that are not parentheses. The
1401        third part is a conditional subpattern that tests whether the first set
1402        of parentheses matched or not. If they did, that is, if the subject started
1403        with an opening parenthesis, the condition is true, and so the yes-pat-
1404        tern  is  executed  and  a  closing parenthesis is required. Otherwise,
1405        since no-pattern is not present, the  subpattern  matches  nothing.  In
1406        other  words,  this  pattern  matches  a  sequence  of non-parentheses,
1407        optionally enclosed in parentheses.
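
            The conditional pattern above can be exercised as follows. This is a
            sketch in Python's re module, which supports the same (?(1)...)
            syntax; re.VERBOSE plays the role of PCRE_EXTENDED here. Using
            Python instead of PCRE is an assumption made for runnability.

```python
import re

# Group 1 is the optional "("; the conditional demands ")" only if it matched.
optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

if __name__ == "__main__":
    for subject in ("(abc)", "abc", "(abc", "abc)"):
        print(subject, optional_parens.fullmatch(subject) is not None)
```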
1408
1409    Checking for a used subpattern by name
1410
1411        Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
1412        used  subpattern  by  name.  For compatibility with earlier versions of
1413        PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
1414        also  recognized. However, there is a possible ambiguity with this syn-
1415        tax, because subpattern names may  consist  entirely  of  digits.  PCRE
1416        looks  first for a named subpattern; if it cannot find one and the name
1417        consists entirely of digits, PCRE looks for a subpattern of  that  num-
1418        ber,  which must be greater than zero. Using subpattern names that con-
1419        sist entirely of digits is not recommended.
1420
1421        Rewriting the above example to use a named subpattern gives this:
1422
1423          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
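
            For comparison, a sketch of the named form in Python's re module
            (again an assumption; this is not PCRE). Python accepts only the
            bare (?(name)...) spelling of the condition and the (?P<name>...)
            spelling of the group.

```python
import re

named_condition = re.compile(
    r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)

if __name__ == "__main__":
    m = named_condition.fullmatch("(abc)")
    print(m.group("OPEN") if m else None)
```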
1424
1425
1426    Checking for pattern recursion
1427
1428        If the condition is the string (R), and there is no subpattern with the
1429        name  R, the condition is true if a recursive call to the whole pattern
1430        or any subpattern has been made. If digits or a name preceded by amper-
1431        sand follow the letter R, for example:
1432
1433          (?(R3)...) or (?(R&name)...)
1434
1435        the  condition is true if the most recent recursion is into the subpat-
1436        tern whose number or name is given. This condition does not  check  the
1437        entire recursion stack.
1438
1439        At  "top  level", all these recursion test conditions are false. Recur-
1440        sive patterns are described below.
1441
1442    Defining subpatterns for use by reference only
1443
1444        If the condition is the string (DEFINE), and  there  is  no  subpattern
1445        with  the  name  DEFINE,  the  condition is always false. In this case,
1446        there may be only one alternative  in  the  subpattern.  It  is  always
1447        skipped  if  control  reaches  this  point  in the pattern; the idea of
1448        DEFINE is that it can be used to define "subroutines" that can be  ref-
1449        erenced  from elsewhere. (The use of "subroutines" is described below.)
1450        For example, a pattern to match an IPv4 address could be  written  like
1451        this (ignore whitespace and line breaks):
1452
1453          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
1454          \b (?&byte) (\.(?&byte)){3} \b
1455
1456        The first part of the pattern is a DEFINE group inside which  another
1457        group named "byte" is defined. This matches an individual component  of
1458        an  IPv4  address  (a number less than 256). When matching takes place,
1459        this part of the pattern is skipped because DEFINE acts  like  a  false
1460        condition.
1461
1462        The rest of the pattern uses references to the named group to match the
1463        four dot-separated components of an IPv4 address, insisting on  a  word
1464        boundary at each end.
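
            Engines without DEFINE can approximate the same idea by composing
            the pattern from a string fragment. The sketch below does this in
            Python, whose re module supports neither (?(DEFINE)...) nor
            (?&name). It is a substitute technique, not the PCRE mechanism, and
            the names BYTE and ipv4 are hypothetical.

```python
import re

# The fragment that the DEFINE group names "byte" in the pattern above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One component followed by three dot-separated components, with a word
# boundary at each end.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

if __name__ == "__main__":
    m = ipv4.search("addr 10.0.0.255 ok")
    print(m.group(0) if m else None)
```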
1465
1466    Assertion conditions
1467
1468        If  the  condition  is  not  in any of the above formats, it must be an
1469        assertion.  This may be a positive or negative lookahead or  lookbehind
1470        assertion.  Consider  this  pattern,  again  containing non-significant
1471        white space, and with the two alternatives on the second line:
1472
1473          (?(?=[^a-z]*[a-z])
1474          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
1475
1476        The condition  is  a  positive  lookahead  assertion  that  matches  an
1477        optional  sequence of non-letters followed by a letter. In other words,
1478        it tests for the presence of at least one letter in the subject.  If  a
1479        letter  is found, the subject is matched against the first alternative;
1480        otherwise it is  matched  against  the  second.  This  pattern  matches
1481        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1482        letters and dd are digits.
1483
1484
1485 COMMENTS
1486
1487        The sequence (?# marks the start of a comment that continues up to  the
1488        next  closing  parenthesis.  Nested  parentheses are not permitted. The
1489        characters that make up a comment play no part in the pattern  matching
1490        at all.
1491
1492        If  the PCRE_EXTENDED option is set, an unescaped # character outside a
1493        character class introduces a  comment  that  continues  to  immediately
1494        after the next newline in the pattern.
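
            Both comment styles can be sketched with Python's re module, where
            re.VERBOSE corresponds to PCRE_EXTENDED. Python is used here as an
            assumption; the pattern text itself is ordinary.

```python
import re

# An inline comment takes no part in the match.
inline = re.compile(r"abc(?#this text is ignored)def")

# In extended mode, an unescaped # starts a comment that runs to end of line.
extended = re.compile(r"""
    \d{2}   # two digits
    -       # a literal hyphen
    \d{2}   # two more digits
""", re.VERBOSE)

if __name__ == "__main__":
    print(inline.fullmatch("abcdef") is not None)
    print(extended.fullmatch("12-34") is not None)
```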
1495
1496
1497 RECURSIVE PATTERNS
1498
1499        Consider  the problem of matching a string in parentheses, allowing for
1500        unlimited nested parentheses. Without the use of  recursion,  the  best
1501        that  can  be  done  is  to use a pattern that matches up to some fixed
1502        depth of nesting. It is not possible to  handle  an  arbitrary  nesting
1503        depth.
1504
1505        For some time, Perl has provided a facility that allows regular expres-
1506        sions to recurse (amongst other things). It does this by  interpolating
1507        Perl  code in the expression at run time, and the code can refer to the
1508        expression itself. A Perl pattern using code interpolation to solve the
1509        parentheses problem can be created like this:
1510
1511          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1512
1513        The (?p{...}) item interpolates Perl code at run time, and in this case
1514        refers recursively to the pattern in which it appears.
1515
1516        Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1517        it  supports  special  syntax  for recursion of the entire pattern, and
1518        also for individual subpattern recursion.  After  its  introduction  in
1519        PCRE  and  Python,  this  kind of recursion was introduced into Perl at
1520        release 5.10.
1521
1522        A special item that consists of (? followed by a  number  greater  than
1523        zero and a closing parenthesis is a recursive call of the subpattern of
1524        the given number, provided that it occurs inside that  subpattern.  (If
1525        not,  it  is  a  "subroutine" call, which is described in the next sec-
1526        tion.) The special item (?R) or (?0) is a recursive call of the  entire
1527        regular expression.
1528
1529        In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
1530        always treated as an atomic group. That is, once it has matched some of
1531        the subject string, it is never re-entered, even if it contains untried
1532        alternatives and there is a subsequent matching failure.
1533
1534        This PCRE pattern solves the nested  parentheses  problem  (assume  the
1535        PCRE_EXTENDED option is set so that white space is ignored):
1536
1537          \( ( (?>[^()]+) | (?R) )* \)
1538
1539        First  it matches an opening parenthesis. Then it matches any number of
1540        substrings which can either be a  sequence  of  non-parentheses,  or  a
1541        recursive  match  of the pattern itself (that is, a correctly parenthe-
1542        sized substring).  Finally there is a closing parenthesis.
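
            To make the control flow of this pattern concrete, here is a
            hand-written Python function that follows the same three steps. It
            is an illustration only: Python's re module has no (?R), so this is
            ordinary recursive code rather than a regular expression, and the
            name match_balanced() is hypothetical.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced group at pos, or -1 if none."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1                                # \(  opening parenthesis
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                       # \)  closing parenthesis
        if ch == "(":
            pos = match_balanced(subject, pos)   # (?R)  recursive call
            if pos < 0:
                return -1
        else:
            # (?>[^()]+)  take a run of non-parentheses and never give it back
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1                                    # ran out of subject


if __name__ == "__main__":
    print(match_balanced("(ab(cd)ef)"))
    print(match_balanced("(a(b)"))
```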
1543
1544        If this were part of a larger pattern, you would not  want  to  recurse
1545        the entire pattern, so instead you could use this:
1546
1547          ( \( ( (?>[^()]+) | (?1) )* \) )
1548
1549        We  have  put the pattern into parentheses, and caused the recursion to
1550        refer to them instead of the whole pattern. In a larger pattern,  keep-
1551        ing  track  of parenthesis numbers can be tricky. It may be more conve-
1552        nient to use named parentheses instead. The Perl  syntax  for  this  is
1553        (?&name);  PCRE's  earlier syntax (?P>name) is also supported. We could
1554        rewrite the above example as follows:
1555
1556          (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
1557
1558        If there is more than one subpattern with the same name,  the  earliest
1559        one  is used. This particular example pattern contains nested unlimited
1560        repeats, and so the use of atomic grouping for matching strings of non-
1561        parentheses  is  important when applying the pattern to strings that do
1562        not match. For example, when this pattern is applied to
1563
1564          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1565
1566        it yields "no match" quickly. However, if atomic grouping is not  used,
1567        the  match  runs  for a very long time indeed because there are so many
1568        different ways the + and * repeats can carve up the  subject,  and  all
1569        have to be tested before failure can be reported.
1570
1571        At the end of a match, the values set for any capturing subpatterns are
1572        those from the outermost level of the recursion at which the subpattern
1573        value  is  set.   If  you want to obtain intermediate values, a callout
1574        function can be used (see below and the pcrecallout documentation).  If
1575        the pattern above is matched against
1576
1577          (ab(cd)ef)
1578
1579        the  value  for  the  capturing  parentheses is "ef", which is the last
1580        value taken on at the top level. If additional parentheses  are  added,
1581        giving
1582
1583          \( ( ( (?>[^()]+) | (?R) )* ) \)
1584             ^                        ^
1585             ^                        ^
1586
1587        the  string  they  capture is "ab(cd)ef", the contents of the top level
1588        parentheses. If there are more than 15 capturing parentheses in a  pat-
1589        tern, PCRE has to obtain extra memory to store data during a recursion,
1590        which it does by using pcre_malloc, freeing  it  via  pcre_free  after-
1591        wards.  If  no  memory  can  be  obtained,  the  match  fails  with the
1592        PCRE_ERROR_NOMEMORY error.
1593
1594        Do not confuse the (?R) item with the condition (R),  which  tests  for
1595        recursion.   Consider  this pattern, which matches text in angle brack-
1596        ets, allowing for arbitrary nesting. Only digits are allowed in  nested
1597        brackets  (that is, when recursing), whereas any characters are permit-
1598        ted at the outer level.
1599
1600          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
1601
1602        In this pattern, (?(R) is the start of a conditional  subpattern,  with
1603        two  different  alternatives for the recursive and non-recursive cases.
1604        The (?R) item is the actual recursive call.
1605
1606
1607 SUBPATTERNS AS SUBROUTINES
1608
1609        If the syntax for a recursive subpattern reference (either by number or
1610        by  name)  is used outside the parentheses to which it refers, it oper-
1611        ates like a subroutine in a programming language. The "called"  subpat-
1612        tern  may  be defined before or after the reference. An earlier example
1613        pointed out that the pattern
1614
1615          (sens|respons)e and \1ibility
1616
1617        matches "sense and sensibility" and "response and responsibility",  but
1618        not "sense and responsibility". If instead the pattern
1619
1620          (sens|respons)e and (?1)ibility
1621
1622        is  used, it does match "sense and responsibility" as well as the other
1623        two strings. Another example is  given  in  the  discussion  of  DEFINE
1624        above.
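
            The contrast between \1 and (?1) can be approximated in engines that
            lack subroutine calls by repeating the subpattern text. The sketch
            below does so in Python, whose re module does not support (?1). It
            is a substitute technique under that assumption, and the name STEM
            is hypothetical.

```python
import re

STEM = r"(?:sens|respons)"

# Back reference: the second stem must equal the text of the first.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine-like reuse: the second stem is matched afresh by the same
# subpattern, so it may differ from the first.
subroutine_like = re.compile(rf"({STEM})e and {STEM}ibility")

if __name__ == "__main__":
    subject = "sense and responsibility"
    print(back_reference.fullmatch(subject) is not None)
    print(subroutine_like.fullmatch(subject) is not None)
```

            Repeating the text does not reproduce the atomic behaviour of a
            PCRE subroutine call; it only shows which strings are accepted.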
1625
1626        Like recursive subpatterns, a "subroutine" call is always treated as an
1627        atomic group. That is, once it has matched some of the subject  string,
1628        it  is  never  re-entered, even if it contains untried alternatives and
1629        there is a subsequent matching failure.
1630
1631        When a subpattern is used as a subroutine, processing options  such  as
1632        case-independence are fixed when the subpattern is defined. They cannot
1633        be changed for different calls. For example, consider this pattern:
1634
1635          (abc)(?i:(?1))
1636
1637        It matches "abcabc". It does not match "abcABC" because the  change  of
1638        processing option does not affect the called subpattern.
1639
1640
1641 CALLOUTS
1642
1643        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1644        Perl code to be obeyed in the middle of matching a regular  expression.
1645        This makes it possible, amongst other things, to extract different sub-
1646        strings that match the same pair of parentheses when there is a repeti-
1647        tion.
1648
1649        PCRE provides a similar feature, but of course it cannot obey arbitrary
1650        Perl code. The feature is called "callout". The caller of PCRE provides
1651        an  external function by putting its entry point in the global variable
1652        pcre_callout.  By default, this variable contains NULL, which  disables
1653        all calling out.
1654
1655        Within  a  regular  expression,  (?C) indicates the points at which the
1656        external function is to be called. If you want  to  identify  different
1657        callout  points, you can put a number less than 256 after the letter C.
1658        The default value is zero.  For example, this pattern has  two  callout
1659        points:
1660
1661          (?C1)abc(?C2)def
1662
1663        If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1664        automatically installed before each item in the pattern. They  are  all
1665        numbered 255.
1666
1667        During matching, when PCRE reaches a callout point (and pcre_callout is
1668        set), the external function is called. It is provided with  the  number
1669        of  the callout, the position in the pattern, and, optionally, one item
1670        of data originally supplied by the caller of pcre_exec().  The  callout
1671        function  may cause matching to proceed, to backtrack, or to fail alto-
1672        gether. A complete description of the interface to the callout function
1673        is given in the pcrecallout documentation.
1674
1675
1676 SEE ALSO
1677
1678        pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
1679
1680 Last updated: 06 December 2006
1681 Copyright (c) 1997-2006 University of Cambridge.