PCRE documentation changes. Fixes #657

diff --git a/doc/doc-txt/pcretest.txt b/doc/doc-txt/pcretest.txt

index 695741daea34d69e737dee03f5fa0be99b186300..d93ec26d1d5c8b76ec73383217277173132d26b0 100644
--- a/doc/doc-txt/pcretest.txt
+++ b/doc/doc-txt/pcretest.txt
@@ -12,8 +12,7 @@ NAME
  
  SYNOPSIS
  
-       pcretest [-C] [-d] [-dfa] [-i] [-m] [-o osize] [-p] [-t] [source]
-            [destination]
+       pcretest [options] [source] [destination]
  
         pcretest  was written as a test program for the PCRE regular expression
         library itself, but it can also be used for experimenting with  regular
@@ -25,18 +24,24 @@ SYNOPSIS
  
  OPTIONS
  
+       -b        Behave as if each regex has the /B (show bytecode)  modifier;
+                 the internal form is output after compilation.
+
         -C        Output the version number of the PCRE library, and all avail-
-                 able   information  about  the  optional  features  that  are
+                 able  information  about  the  optional  features  that   are
                   included, and then exit.
  
-       -d        Behave as if each regex has  the  /D  (debug)  modifier;  the
-                 internal form is output after compilation.
+       -d        Behave  as  if  each  regex  has the /D (debug) modifier; the
+                 internal form and information about the compiled pattern  are
+                 output after compilation; -d is equivalent to -b -i.
  
         -dfa      Behave  as if each data line contains the \D escape sequence;
                   this    causes    the    alternative    matching    function,
                   pcre_dfa_exec(),   to   be   used  instead  of  the  standard
                   pcre_exec() function (more detail is given below).
  
+       -help     Output a brief summary of these options and then exit.
+
         -i        Behave as if each regex  has  the  /I  modifier;  information
                   about the compiled pattern is given after compilation.
  
@@ -46,20 +51,34 @@ OPTIONS
                   pcretest, -s is a synonym for -m.
  
         -o osize  Set the number of elements in the output vector that is  used
-                 when  calling  pcre_exec()  to be osize. The default value is
-                 45, which is enough for 14 capturing subexpressions. The vec-
-                 tor  size  can  be  changed  for individual matching calls by
-                 including \O in the data line (see below).
+                 when  calling pcre_exec() or pcre_dfa_exec() to be osize. The
+                 default value is 45, which is enough for 14 capturing  subex-
+                 pressions   for  pcre_exec()  or  22  different  matches  for
+                 pcre_dfa_exec(). The vector size can be changed for  individ-
+                 ual  matching  calls  by  including  \O in the data line (see
+                 below).
  
         -p        Behave as if each regex has the /P modifier; the POSIX  wrap-
                   per  API  is used to call PCRE. None of the other options has
                   any effect when -p is set.
  
+       -q        Do not output the version number of pcretest at the start  of
+                 execution.
+
+       -S size   On  Unix-like  systems,  set the size of the runtime stack to
+                 size megabytes.
+
         -t        Run each compile, study, and match many times with  a  timer,
                   and  output resulting time per compile or match (in millisec-
                   onds). Do not set -m with -t, because you will then  get  the
                   size  output  a  zillion  times,  and the timing will be dis-
-                 torted.
+                 torted. You can control the number  of  iterations  that  are
+                 used  for timing by following -t with a number (as a separate
+                 item on the command line). For example, "-t 1000" would iter-
+                 ate 1000 times. The default is to iterate 500000 times.
+
+       -tm       This is like -t except that it times only the matching phase,
+                 not the compile or study phases.
  
  
  DESCRIPTION
@@ -76,13 +95,15 @@ DESCRIPTION
         ber of data lines to be matched against the pattern.
  
         Each  data line is matched separately and independently. If you want to
-       do multiple-line matches, you have to use the \n escape sequence  in  a
-       single  line  of  input  to  encode the newline characters. The maximum
-       length of data line is 30,000 characters.
+       do multi-line matches, you have to use the \n escape sequence (or \r or
+       \r\n, etc., depending on the newline setting) in a single line of input
+       to encode the newline sequences. There is no limit  on  the  length  of
+       data  lines;  the  input  buffer is automatically extended if it is too
+       small.
  
         An empty line signals the end of the data lines, at which point  a  new
         regular  expression is read. The regular expressions are given enclosed
-       in any non-alphanumeric delimiters other than backslash, for example
+       in any non-alphanumeric delimiters other than backslash, for example:
  
           /(a|bc)x+yz/
  
@@ -130,38 +151,64 @@ PATTERN MODIFIERS
         The following table shows additional modifiers for setting PCRE options
         that do not correspond to anything in Perl:
  
-         /A    PCRE_ANCHORED
-         /C    PCRE_AUTO_CALLOUT
-         /E    PCRE_DOLLAR_ENDONLY
-         /f    PCRE_FIRSTLINE
-         /N    PCRE_NO_AUTO_CAPTURE
-         /U    PCRE_UNGREEDY
-         /X    PCRE_EXTRA
+         /A          PCRE_ANCHORED
+         /C          PCRE_AUTO_CALLOUT
+         /E          PCRE_DOLLAR_ENDONLY
+         /f          PCRE_FIRSTLINE
+         /J          PCRE_DUPNAMES
+         /N          PCRE_NO_AUTO_CAPTURE
+         /U          PCRE_UNGREEDY
+         /X          PCRE_EXTRA
+         /<cr>       PCRE_NEWLINE_CR
+         /<lf>       PCRE_NEWLINE_LF
+         /<crlf>     PCRE_NEWLINE_CRLF
+         /<anycrlf>  PCRE_NEWLINE_ANYCRLF
+         /<any>      PCRE_NEWLINE_ANY
+
+       Those  specifying  line  ending sequences are literal strings as shown.
+       This example sets multiline matching  with  CRLF  as  the  line  ending
+       sequence:
+
+         /^abc/m<crlf>
+
+       Details  of the meanings of these PCRE options are given in the pcreapi
+       documentation.
+
+   Finding all matches in a string
  
-       Searching  for  all  possible matches within each subject string can be
-       requested by the /g or /G modifier. After  finding  a  match,  PCRE  is
+       Searching for all possible matches within each subject  string  can  be
+       requested  by  the  /g  or  /G modifier. After finding a match, PCRE is
         called again to search the remainder of the subject string. The differ-
         ence between /g and /G is that the former uses the startoffset argument
-       to  pcre_exec()  to  start  searching  at a new point within the entire
-       string (which is in effect what Perl does), whereas the  latter  passes
-       over  a  shortened  substring.  This makes a difference to the matching
+       to pcre_exec() to start searching at a  new  point  within  the  entire
+       string  (which  is in effect what Perl does), whereas the latter passes
+       over a shortened substring. This makes a  difference  to  the  matching
         process if the pattern begins with a lookbehind assertion (including \b
         or \B).
  
-       If  any  call  to  pcre_exec()  in a /g or /G sequence matches an empty
-       string, the next call is done with the PCRE_NOTEMPTY and  PCRE_ANCHORED
-       flags  set in order to search for another, non-empty, match at the same
-       point.  If this second match fails, the start  offset  is  advanced  by
-       one,  and  the normal match is retried. This imitates the way Perl han-
+       If any call to pcre_exec() in a /g or  /G  sequence  matches  an  empty
+       string,  the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
+       flags set in order to search for another, non-empty, match at the  same
+       point.   If  this  second  match fails, the start offset is advanced by
+       one, and the normal match is retried. This imitates the way  Perl  han-
         dles such cases when using the /g modifier or the split() function.
  
+   Other modifiers
+
         There are yet more modifiers for controlling the way pcretest operates.
  
-       The  /+ modifier requests that as well as outputting the substring that
-       matched the entire pattern, pcretest  should  in  addition  output  the
-       remainder  of  the  subject  string. This is useful for tests where the
+       The /+ modifier requests that as well as outputting the substring  that
+       matched  the  entire  pattern,  pcretest  should in addition output the
+       remainder of the subject string. This is useful  for  tests  where  the
         subject contains multiple copies of the same substring.
  
+       The  /B modifier is a debugging feature. It requests that pcretest out-
+       put a representation of the compiled byte code after compilation.  Nor-
+       mally  this  information contains length and offset values; however, if
+       /Z is also present, this data is replaced by spaces. This is a  special
+       feature for use in the automatic test scripts; it ensures that the same
+       output is generated for different internal link sizes.
+
         The /L modifier must be followed directly by the name of a locale,  for
         example,
  
@@ -180,10 +227,8 @@ PATTERN MODIFIERS
         pattern.  If  the pattern is studied, the results of that are also out-
         put.
  
-       The /D modifier is a PCRE debugging feature, which also assumes /I.  It
-       causes  the  internal form of compiled regular expressions to be output
-       after compilation. If the pattern was studied, the information returned
-       is also output.
+       The /D modifier is a PCRE debugging feature, and is equivalent to  /BI,
+       that is, both the /B and the /I modifiers.
  
         The /F modifier causes pcretest to flip the byte order of the fields in
         the compiled pattern that  contain  2-byte  and  4-byte  numbers.  This
@@ -225,20 +270,24 @@ DATA LINES
         nary"  regular  expressions,  you probably don't need any of these. The
         following escapes are recognized:
  
-         \a         alarm (= BEL)
-         \b         backspace
-         \e         escape
-         \f         formfeed
-         \n         newline
-         \r         carriage return
-         \t         tab
-         \v         vertical tab
+         \a         alarm (BEL, \x07)
+         \b         backspace (\x08)
+         \e         escape (\x1b)
+         \f         formfeed (\x0c)
+         \n         newline (\x0a)
+         \qdd       set the PCRE_MATCH_LIMIT limit to dd
+                      (any number of digits)
+         \r         carriage return (\x0d)
+         \t         tab (\x09)
+         \v         vertical tab (\x0b)
           \nnn       octal character (up to 3 octal digits)
           \xhh       hexadecimal character (up to 2 hex digits)
           \x{hh...}  hexadecimal character, any number of digits
                        in UTF-8 mode
           \A         pass the PCRE_ANCHORED option to pcre_exec()
+                      or pcre_dfa_exec()
           \B         pass the PCRE_NOTBOL option to pcre_exec()
+                      or pcre_dfa_exec()
           \Cdd       call pcre_copy_substring() for substring dd
                        after a successful match (number less than 32)
           \Cname     call pcre_copy_named_substring() for substring
@@ -262,19 +311,39 @@ DATA LINES
                        ated by next non-alphanumeric character)
           \L         call pcre_get_substringlist() after a
                        successful match
-         \M         discover the minimum MATCH_LIMIT setting
+         \M         discover the minimum MATCH_LIMIT and
+                      MATCH_LIMIT_RECURSION settings
           \N         pass the PCRE_NOTEMPTY option to pcre_exec()
+                      or pcre_dfa_exec()
           \Odd       set the size of the output vector passed to
                        pcre_exec() to dd (any number of digits)
           \P         pass the PCRE_PARTIAL option to pcre_exec()
                        or pcre_dfa_exec()
+         \Qdd       set the PCRE_MATCH_LIMIT_RECURSION limit to dd
+                      (any number of digits)
           \R         pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
           \S         output details of memory get/free calls during matching
           \Z         pass the PCRE_NOTEOL option to pcre_exec()
+                      or pcre_dfa_exec()
           \?         pass the PCRE_NO_UTF8_CHECK option to
-                      pcre_exec()
+                      pcre_exec() or pcre_dfa_exec()
           \>dd       start the match at offset dd (any number of digits);
                        this sets the startoffset argument for pcre_exec()
+                      or pcre_dfa_exec()
+         \<cr>      pass the PCRE_NEWLINE_CR option to pcre_exec()
+                      or pcre_dfa_exec()
+         \<lf>      pass the PCRE_NEWLINE_LF option to pcre_exec()
+                      or pcre_dfa_exec()
+         \<crlf>    pass the PCRE_NEWLINE_CRLF option to pcre_exec()
+                      or pcre_dfa_exec()
+         \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre_exec()
+                      or pcre_dfa_exec()
+         \<any>     pass the PCRE_NEWLINE_ANY option to pcre_exec()
+                      or pcre_dfa_exec()
+
+       The escapes that specify line ending  sequences  are  literal  strings,
+       exactly as shown. No more than one newline setting should be present in
+       any data line.
  
         A backslash followed by anything else just escapes the  anything  else.
         If  the very last character is a backslash, it is ignored. This gives a
@@ -282,21 +351,25 @@ DATA LINES
         nates the data input.
  
         If  \M  is present, pcretest calls pcre_exec() several times, with dif-
-       ferent values in the match_limit field of the  pcre_extra  data  struc-
-       ture,  until it finds the minimum number that is needed for pcre_exec()
-       to complete. This number is a measure of the amount  of  recursion  and
-       backtracking  that takes place, and checking it out can be instructive.
-       For most simple matches, the number is quite small,  but  for  patterns
-       with  very large numbers of matching possibilities, it can become large
-       very quickly with increasing length of subject string.
-
-       When \O is used, the value specified may be higher or  lower  than  the
+       ferent values in the match_limit and  match_limit_recursion  fields  of
+       the  pcre_extra  data structure, until it finds the minimum numbers for
+       each parameter that allow pcre_exec() to complete. The match_limit num-
+       ber  is  a  measure of the amount of backtracking that takes place, and
+       checking it out can be instructive. For most simple matches, the number
+       is  quite  small,  but for patterns with very large numbers of matching
+       possibilities, it can become large very quickly with increasing  length
+       of subject string. The match_limit_recursion number is a measure of how
+       much stack (or, if PCRE is compiled with  NO_RECURSE,  how  much  heap)
+       memory is needed to complete the match attempt.
+
+       When  \O  is  used, the value specified may be higher or lower than the
         size set by the -O command line option (or defaulted to 45); \O applies
         only to the call of pcre_exec() for the line in which it appears.
  
-       If the /P modifier was present on the pattern, causing the POSIX  wrap-
-       per  API to be used, only \B and \Z have any effect, causing REG_NOTBOL
-       and REG_NOTEOL to be passed to regexec() respectively.
+       If  the /P modifier was present on the pattern, causing the POSIX wrap-
+       per API to be used, the only option-setting  sequences  that  have  any
+       effect  are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively,
+       to be passed to regexec().
  
         The use of \x{hh...} to represent UTF-8 characters is not dependent  on
         the  use  of  the  /8 modifier on the pattern. It is recognized always.
@@ -332,7 +405,7 @@ DEFAULT OUTPUT FROM PCRETEST
         is an example of an interactive pcretest run.
  
           $ pcretest
-         PCRE version 5.00 07-Sep-2004
+         PCRE version 7.0 30-Nov-2006
  
             re> /^abc(\d+)/
           data> abc123
@@ -343,16 +416,17 @@ DEFAULT OUTPUT FROM PCRETEST
  
         If the strings contain any non-printing characters, they are output  as
         \0x  escapes,  or  as \x{...} escapes if the /8 modifier was present on
-       the pattern. If the pattern has the /+ modifier, the  output  for  sub-
-       string  0 is followed by the the rest of the subject string, identified
-       by "0+" like this:
+       the pattern. See below for the definition of  non-printing  characters.
+       If  the pattern has the /+ modifier, the output for substring 0 is fol-
+       lowed by the rest of the subject  string,  identified  by  "0+"  like
+       this:
  
             re> /cat/+
           data> cataract
            0: cat
            0+ aract
  
-       If the pattern has the /g or /G modifier,  the  results  of  successive
+       If  the  pattern  has  the /g or /G modifier, the results of successive
         matching attempts are output in sequence, like this:
  
             re> /\Bi(\w\w)/g
@@ -366,16 +440,17 @@ DEFAULT OUTPUT FROM PCRETEST
  
         "No match" is output only if the first match attempt fails.
  
-       If  any  of the sequences \C, \G, or \L are present in a data line that
-       is successfully matched, the substrings extracted  by  the  convenience
+       If any of the sequences \C, \G, or \L are present in a data  line  that
+       is  successfully  matched,  the substrings extracted by the convenience
         functions are output with C, G, or L after the string number instead of
         a colon. This is in addition to the normal full list. The string length
-       (that  is,  the return from the extraction function) is given in paren-
+       (that is, the return from the extraction function) is given  in  paren-
         theses after each string for \C and \G.
  
-       Note that while patterns can be continued over several lines  (a  plain
+       Note that whereas patterns can be continued over several lines (a plain
         ">" prompt is used for continuations), data lines may not. However new-
-       lines can be included in data by means of the \n escape.
+       lines  can  be included in data by means of the \n escape (or \r, \r\n,
+       etc., depending on the newline sequence setting).
  
  
  OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
@@ -394,8 +469,8 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
         (Using  the  normal  matching function on this data finds only "tang".)
         The longest matching string is always given first (and numbered  zero).
  
-       If  /gP  is  present  on  the  pattern,  the search for further matches
-       resumes at the end of the longest match. For example:
+       If /g is present on the pattern, the search for further matches resumes
+       at the end of the longest match. For example:
  
             re> /(tang|tangerine|tan)/g
           data> yellow tangerine and tangy sultana\D
@@ -418,7 +493,7 @@ RESTARTING AFTER A PARTIAL MATCH
         can restart the match with additional subject data by means of  the  \R
         escape sequence. For example:
  
-           re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/
+           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
           data> 23ja\P\D
           Partial match: 23ja
           data> n05\R\D
@@ -468,6 +543,18 @@ CALLOUTS
         the pcrecallout documentation.
  
  
+NON-PRINTING CHARACTERS
+
+       When  pcretest is outputting text in the compiled version of a pattern,
+       bytes other than 32-126 are always treated as  non-printing  characters
+       and are therefore shown as hex escapes.
+
+       When  pcretest  is  outputting text that is a matched part of a subject
+       string, it behaves in the same way, unless a different locale has  been
+       set for the pattern (using the /L modifier). In this case, the isprint()
+       function is used to distinguish printing and non-printing characters.
+
+
  SAVING AND RELOADING COMPILED PATTERNS
  
         The  facilities  described  in  this section are not available when the
@@ -524,11 +611,20 @@ SAVING AND RELOADING COMPILED PATTERNS
         a file that is not in the correct format, the result is undefined.
  
  
+SEE ALSO
+
+       pcre(3), pcreapi(3), pcrecallout(3),  pcrematching(3),  pcrepartial(3),
+       pcrepattern(3), pcreprecompile(3).
+
+
  AUTHOR
  
         Philip Hazel
-       University Computing Service,
-       Cambridge CB2 3QG, England.
+       University Computing Service
+       Cambridge CB2 3QH, England.
+
+
+REVISION
  
-Last updated: 28 February 2005
-Copyright (c) 1997-2005 University of Cambridge.
+       Last updated: 24 April 2007
+       Copyright (c) 1997-2007 University of Cambridge.
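
Note on the -o change above: the new text says the default vector of 45 elements is enough for 14 capturing subexpressions with pcre_exec() or 22 matches with pcre_dfa_exec(). The arithmetic behind those two figures can be sketched as follows. This is only an illustration, not code from pcretest; the function names are hypothetical, and it assumes the behaviour described in pcreapi, where pcre_exec() returns offset pairs in the first two-thirds of the vector and keeps the final third as workspace, while pcre_dfa_exec() can fill the whole vector with offset pairs.

```python
# Illustrative arithmetic only; these helper names are hypothetical and
# are not part of PCRE or pcretest.

def max_captures_pcre_exec(osize):
    """Capturing subexpressions that fit in a pcre_exec() output vector.

    Assumption (from pcreapi): only the first two-thirds of the vector
    holds offset pairs; the last third is workspace. One pair is taken
    by the overall match (substring 0).
    """
    usable = (osize // 3) * 2      # elements available for offset pairs
    pairs = usable // 2            # two integers per substring
    return max(pairs - 1, 0)       # exclude substring 0


def max_matches_pcre_dfa_exec(osize):
    """Matches that fit in a pcre_dfa_exec() output vector.

    Assumption (from pcreapi): the whole vector is available for
    offset pairs, one pair per match found.
    """
    return osize // 2


if __name__ == "__main__":
    osize = 45  # pcretest default, per the -o description
    print(max_captures_pcre_exec(osize), max_matches_pcre_dfa_exec(osize))
```

With osize set to 45 the two helpers give 14 and 22, matching the figures in the revised -o paragraph.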