Allow user to specify own escape-chars
[frak.git] / README.md
CommitLineData
fef2a3db
JH
1# frak
2
3frak transforms collections of strings into regular expressions for
adbcb1dc
JH
4matching those strings. The primary goal of this library is to
5generate regular expressions from a known set of inputs which avoid
c80cdf1e
JH
6backtracking as much as possible. It is available as a [command line
7utility](#command-line-usage) and for the [browser](#browser-usage)
8as a JavaScript library.
fef2a3db 9
84e74fb3
JH
10## "Installation"
11
12Add frak as a dependency to your `project.clj` file.
13
14```clojure
20ec8e3e 15[frak "0.1.5"]
84e74fb3
JH
16```
17
c80cdf1e 18## Clojure(Script) usage
fef2a3db
JH
19
20```clojure
21user> (require 'frak)
22nil
23user> (frak/pattern ["foo" "bar" "baz" "quux"])
6f6c103f 24#"(?:ba[rz]|foo|quux)"
aa0a57a0
JH
25user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
26#"Cloj(?:ure(?:Script)?|ars)"
c80cdf1e
JH
27user> (frak/pattern ["skill" "skills" "skull" "skulls"])
28#"sk(?:[ui]lls?)"
fef2a3db 29```
adbcb1dc 30
c80cdf1e 31## Command line usage
adbcb1dc 32
c80cdf1e 33frak can be used from the command line with either Leiningen or NodeJS.
adbcb1dc 34
c80cdf1e 35### With Leiningen
adbcb1dc 36
c80cdf1e
JH
37Use the `lein run` command:
38
39```shell
40$ lein run -e foo bar baz quux
41^(?:ba[rz]|foo|quux)$
42```
43
44### With NodeJS
45
46Compile the NodeJS version
47
48```shell
49$ lein do cljx once, cljsbuild once node
50$ chmod +x bin/frak
51$ bin/frak -e foo bar baz quux
52^(?:ba[rz]|foo|quux)$
adbcb1dc
JH
53```
54
c80cdf1e
JH
55## Browser usage
56
57To use frak as a standalone library in the browser with JavaScript
58compile the browser version:
59
60```shell
61$ lein do cljx once, cljsbuild once browser
62$ mv ./target/js/frak.min.js <destination>
63```
64
65Try it using this HTML:
66
67```html
68<!DOCTYPE html>
69<html>
70<head>
71</head>
72<body>
73 <pre>Input: <span id="input"></span></pre>
74 <pre>Output: <span id="output"></span></pre>
75 <script src="http://code.jquery.com/jquery-2.0.3.min.js"></script>
76 <script src="frak.min.js"></script>
77 <script>
78 var strings = ["foo", "bar", "baz", "quux"];
79 // It's a good idea to use the `"exact?"` option.
80 var pattern = frak.pattern(strings, {"exact?": true})
81 jQuery("#input").text(strings.join(" "));
82 jQuery("#output").text(pattern);
83 </script>
84</body>
85</html>
86```
87
88For even more fun try it with [AngularJS](http://angularjs.org/)!
89
90## How?
91
92A frak pattern is constructed from a trie of characters and a
93renderer which processes it. As characters are added to the trie, data
0e9dc48f 94such as such as which characters are terminal are stored in it's
c80cdf1e
JH
95branches.
96
97During the rendering process frak analyzes each branch and attempts to
98emit the most concise regular expression possible. Additional post
0e9dc48f
JH
99operations are applied after rendering to improve the expression where
100possible.
c80cdf1e 101
adbcb1dc
JH
102## Why?
103
104[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
105why. Also because.
106
a7041588
JH
107## And now for something completely different
108
adbcb1dc 109Let's build a regular expression for matching any word in
a7041588 110`/usr/share/dict/words`.
adbcb1dc
JH
111
112```clojure
113user> (require '[clojure.java.io :as io])
114nil
115user> (def words
116 (-> (io/file "/usr/share/dict/words")
117 io/reader
118 line-seq))
119#'user/words
120user> (def word-re (frak/pattern words))
121#'user/word-re
edb98134 122user> (every? #(re-matches word-re %) words)
adbcb1dc
JH
123true
124```
125
0e9dc48f
JH
126The last two operations will take a moment since there are over
127235,000 words to consider.
45176aef 128
adbcb1dc
JH
129You can view the full expression
130[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
131(it's approximately `1.5M`!).
265c997f
JH
132
133## Benchmarks
134
135```clojure
136(use 'criterium.core)
137
138(def words
139 (-> (io/file "/usr/share/dict/words")
140 io/reader
141 line-seq))
142
143(defn naive-pattern
144 "Create a naive regular expression pattern for matching every string
145 in strs."
146 [strs]
147 (->> strs
148 (clojure.string/join "|")
d35e9dc9
JH
149 (format "(?:%s)")
150 re-pattern))
265c997f
JH
151
152;; Shuffle 10000 words and build a naive and frak pattern from them.
153(def ws (shuffle (take 10000 words)))
0e9dc48f 154
265c997f
JH
155(def n-pat (naive-pattern ws))
156(def f-pat (frak/pattern ws))
157
158;; Verify the naive pattern matches everything it was constructed from.
159(every? #(re-matches n-pat %) ws)
160;; => true
161
162;; Shuffle the words again since the naive pattern is built in the
163;; same order as it's inputs.
164(def ws' (shuffle ws))
165
166;;;; Benchmarks
167
edb98134 168;; Naive pattern
265c997f
JH
169
170(bench (doseq [w ws'] (re-matches n-pat w)))
265c997f
JH
171;; Execution time mean : 1.499489 sec
172;; Execution time std-deviation : 181.365166 ms
173;; Execution time lower quantile : 1.337817 sec ( 2.5%)
174;; Execution time upper quantile : 1.828733 sec (97.5%)
175
176;; frak pattern
177
178(bench (doseq [w ws'] (re-matches f-pat w)))
179;; Execution time mean : 155.515855 ms
180;; Execution time std-deviation : 5.663346 ms
181;; Execution time lower quantile : 148.168855 ms ( 2.5%)
182;; Execution time upper quantile : 164.164294 ms (97.5%)
183```