Version bump to v0.1.9
[frak.git] / README.md
1 # frak
2
3 frak transforms collections of strings into regular expressions for
4 matching those strings. The primary goal of this library is to
5 generate regular expressions from a known set of inputs which avoid
6 backtracking as much as possible. It is available as a [command line
7 utility](#command-line-usage) and for the [browser](#browser-usage)
8 as a JavaScript library.
9
10 ## "Installation"
11
12 Add frak as a dependency to your `project.clj` file.
13
14 [![Clojars Project](https://img.shields.io/clojars/v/frak.svg)](https://clojars.org/frak)
15
16
17 ## Clojure(Script) usage
18
19 ```clojure
20 user> (require 'frak)
21 nil
22 user> (frak/pattern ["foo" "bar" "baz" "quux"])
23 #"(?:ba[rz]|foo|quux)"
24 user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
25 #"Cloj(?:ure(?:Script)?|ars)"
26 user> (frak/pattern ["skill" "skills" "skull" "skulls"])
27 #"sk(?:[ui]lls?)"
28 ```
29
30 ## Options
31 Frak's `pattern` function can accept an options map as its second argument, the available options are:
32
33 * `:capture?` - boolean (default false), whether rendered regex should create capture groups for each match
34 * `:escape-chars` - vector (default see [`frak/metacharacters`](https://github.com/noprompt/frak/blob/3c22d9b33646ca4bc681a5746ac0d11fd06e579b/src/cljx/frak.cljx#L54)), characters to escape when rendering a regular expression.
35 * `:exact?` - boolean (default false), whether the rendered regex should only produces matches when the entire input string matches.
36 * `:whole-words?` - boolean (default false), whether the rendered regex should match only whole words (word boundary at both ends of the match) in the input string.
37
38 ## Command line usage
39
40 frak can be used from the command line with either Leiningen or NodeJS.
41
42 ### With Leiningen
43
44 Use the `lein run` command:
45
46 ```shell
47 $ lein run -e foo bar baz quux
48 ^(?:ba[rz]|foo|quux)$
49 ```
50
51 ### With NodeJS
52
53 Compile the NodeJS version
54
55 ```shell
56 $ lein do cljx once, cljsbuild once node
57 $ chmod +x bin/frak
58 $ bin/frak -e foo bar baz quux
59 ^(?:ba[rz]|foo|quux)$
60 ```
61
62 ## Browser usage
63
64 To use frak as a standalone library in the browser with JavaScript
65 compile the browser version:
66
67 ```shell
68 $ lein do cljx once, cljsbuild once browser
69 $ mv ./target/js/frak.min.js <destination>
70 ```
71
72 Try it using this HTML:
73
74 ```html
75 <!DOCTYPE html>
76 <html>
77 <head>
78 </head>
79 <body>
80 <pre>Input: <span id="input"></span></pre>
81 <pre>Output: <span id="output"></span></pre>
82 <script src="http://code.jquery.com/jquery-2.0.3.min.js"></script>
83 <script src="frak.min.js"></script>
84 <script>
85 var strings = ["foo", "bar", "baz", "quux"];
86 // It's a good idea to use the `"exact?"` option.
87 var pattern = frak.pattern(strings, {"exact?": true})
88 jQuery("#input").text(strings.join(" "));
89 jQuery("#output").text(pattern);
90 </script>
91 </body>
92 </html>
93 ```
94
95 For even more fun try it with [AngularJS](http://angularjs.org/)!
96
97 ## How?
98
99 A frak pattern is constructed from a trie of characters and a
100 renderer which processes it. As characters are added to the trie, data
101 such as such as which characters are terminal are stored in it's
102 branches.
103
104 During the rendering process frak analyzes each branch and attempts to
105 emit the most concise regular expression possible. Additional post
106 operations are applied after rendering to improve the expression where
107 possible.
108
109 ## Why?
110
111 [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
112 why. Also because.
113
114 ## And now for something completely different
115
116 Let's build a regular expression for matching any word in
117 `/usr/share/dict/words`.
118
119 ```clojure
120 user> (require '[clojure.java.io :as io])
121 nil
122 user> (def words
123 (-> (io/file "/usr/share/dict/words")
124 io/reader
125 line-seq))
126 #'user/words
127 user> (def word-re (frak/pattern words))
128 #'user/word-re
129 user> (every? #(re-matches word-re %) words)
130 true
131 ```
132
133 The last two operations will take a moment since there are over
134 235,000 words to consider.
135
136 You can view the full expression
137 [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
138 (it's approximately `1.5M`!).
139
140 ## Benchmarks
141
142 ```clojure
143 (use 'criterium.core)
144
145 (def words
146 (-> (io/file "/usr/share/dict/words")
147 io/reader
148 line-seq))
149
150 (defn naive-pattern
151 "Create a naive regular expression pattern for matching every string
152 in strs."
153 [strs]
154 (->> strs
155 (clojure.string/join "|")
156 (format "(?:%s)")
157 re-pattern))
158
159 ;; Shuffle 10000 words and build a naive and frak pattern from them.
160 (def ws (shuffle (take 10000 words)))
161
162 (def n-pat (naive-pattern ws))
163 (def f-pat (frak/pattern ws))
164
165 ;; Verify the naive pattern matches everything it was constructed from.
166 (every? #(re-matches n-pat %) ws)
167 ;; => true
168
169 ;; Shuffle the words again since the naive pattern is built in the
170 ;; same order as it's inputs.
171 (def ws' (shuffle ws))
172
173 ;;;; Benchmarks
174
175 ;; Naive pattern
176
177 (bench (doseq [w ws'] (re-matches n-pat w)))
178 ;; Execution time mean : 1.499489 sec
179 ;; Execution time std-deviation : 181.365166 ms
180 ;; Execution time lower quantile : 1.337817 sec ( 2.5%)
181 ;; Execution time upper quantile : 1.828733 sec (97.5%)
182
183 ;; frak pattern
184
185 (bench (doseq [w ws'] (re-matches f-pat w)))
186 ;; Execution time mean : 155.515855 ms
187 ;; Execution time std-deviation : 5.663346 ms
188 ;; Execution time lower quantile : 148.168855 ms ( 2.5%)
189 ;; Execution time upper quantile : 164.164294 ms (97.5%)
190 ```