Commit | Line | Data |
---|---|---|
fef2a3db JH |
1 | # frak |
2 | ||
3 | frak transforms collections of strings into regular expressions for | |
adbcb1dc JH |
4 | matching those strings. The primary goal of this library is to |
5 | generate regular expressions from a known set of inputs which avoid | |
c80cdf1e JH |
6 | backtracking as much as possible. It is available as a [command line |
7 | utility](#command-line-usage) and for the [browser](#browser-usage) | |
8 | as a JavaScript library. | |
fef2a3db | 9 | |
84e74fb3 JH |
10 | ## "Installation" |
11 | ||
12 | Add frak as a dependency to your `project.clj` file. | |
13 | ||
9aeb1f15 JH |
14 | [![Clojars Project](https://img.shields.io/clojars/v/frak.svg)](https://clojars.org/frak) |
15 | ||
84e74fb3 | 16 | |
c80cdf1e | 17 | ## Clojure(Script) usage |
fef2a3db JH |
18 | |
19 | ```clojure | |
20 | user> (require 'frak) | |
21 | nil | |
22 | user> (frak/pattern ["foo" "bar" "baz" "quux"]) | |
6f6c103f | 23 | #"(?:ba[rz]|foo|quux)" |
aa0a57a0 JH |
24 | user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) |
25 | #"Cloj(?:ure(?:Script)?|ars)" | |
c80cdf1e JH |
26 | user> (frak/pattern ["skill" "skills" "skull" "skulls"]) |
27 | #"sk(?:[ui]lls?)" | |
fef2a3db | 28 | ``` |
adbcb1dc | 29 | |
df8061d7 JL |
30 | ## Options |
31 | Frak's `pattern` function can accept an options map as its second argument, the available options are: | |
32 | ||
33 | * `:capture?` - boolean (default false), whether rendered regex should create capture groups for each match | |
34 | * `:escape-chars` - vector (default see [`frak/metacharacters`](https://github.com/noprompt/frak/blob/3c22d9b33646ca4bc681a5746ac0d11fd06e579b/src/cljx/frak.cljx#L54)), characters to escape when rendering a regular expression. | |
35 | * `:exact?` - boolean (default false), whether the rendered regex should only produces matches when the entire input string matches. | |
36 | * `:whole-words?` - boolean (default false), whether the rendered regex should match only whole words (word boundary at both ends of the match) in the input string. | |
37 | ||
c80cdf1e | 38 | ## Command line usage |
adbcb1dc | 39 | |
c80cdf1e | 40 | frak can be used from the command line with either Leiningen or NodeJS. |
adbcb1dc | 41 | |
c80cdf1e | 42 | ### With Leiningen |
adbcb1dc | 43 | |
c80cdf1e JH |
44 | Use the `lein run` command: |
45 | ||
46 | ```shell | |
47 | $ lein run -e foo bar baz quux | |
48 | ^(?:ba[rz]|foo|quux)$ | |
49 | ``` | |
50 | ||
51 | ### With NodeJS | |
52 | ||
53 | Compile the NodeJS version | |
54 | ||
55 | ```shell | |
56 | $ lein do cljx once, cljsbuild once node | |
57 | $ chmod +x bin/frak | |
58 | $ bin/frak -e foo bar baz quux | |
59 | ^(?:ba[rz]|foo|quux)$ | |
adbcb1dc JH |
60 | ``` |
61 | ||
c80cdf1e JH |
62 | ## Browser usage |
63 | ||
64 | To use frak as a standalone library in the browser with JavaScript | |
65 | compile the browser version: | |
66 | ||
67 | ```shell | |
68 | $ lein do cljx once, cljsbuild once browser | |
69 | $ mv ./target/js/frak.min.js <destination> | |
70 | ``` | |
71 | ||
72 | Try it using this HTML: | |
73 | ||
74 | ```html | |
75 | <!DOCTYPE html> | |
76 | <html> | |
77 | <head> | |
78 | </head> | |
79 | <body> | |
80 | <pre>Input: <span id="input"></span></pre> | |
81 | <pre>Output: <span id="output"></span></pre> | |
82 | <script src="http://code.jquery.com/jquery-2.0.3.min.js"></script> | |
83 | <script src="frak.min.js"></script> | |
84 | <script> | |
85 | var strings = ["foo", "bar", "baz", "quux"]; | |
86 | // It's a good idea to use the `"exact?"` option. | |
87 | var pattern = frak.pattern(strings, {"exact?": true}) | |
88 | jQuery("#input").text(strings.join(" ")); | |
89 | jQuery("#output").text(pattern); | |
90 | </script> | |
91 | </body> | |
92 | </html> | |
93 | ``` | |
94 | ||
95 | For even more fun try it with [AngularJS](http://angularjs.org/)! | |
96 | ||
97 | ## How? | |
98 | ||
99 | A frak pattern is constructed from a trie of characters and a | |
100 | renderer which processes it. As characters are added to the trie, data | |
0e9dc48f | 101 | such as such as which characters are terminal are stored in it's |
c80cdf1e JH |
102 | branches. |
103 | ||
104 | During the rendering process frak analyzes each branch and attempts to | |
105 | emit the most concise regular expression possible. Additional post | |
0e9dc48f JH |
106 | operations are applied after rendering to improve the expression where |
107 | possible. | |
c80cdf1e | 108 | |
adbcb1dc JH |
109 | ## Why? |
110 | ||
111 | [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) | |
112 | why. Also because. | |
113 | ||
a7041588 JH |
114 | ## And now for something completely different |
115 | ||
adbcb1dc | 116 | Let's build a regular expression for matching any word in |
a7041588 | 117 | `/usr/share/dict/words`. |
adbcb1dc JH |
118 | |
119 | ```clojure | |
120 | user> (require '[clojure.java.io :as io]) | |
121 | nil | |
122 | user> (def words | |
123 | (-> (io/file "/usr/share/dict/words") | |
124 | io/reader | |
125 | line-seq)) | |
126 | #'user/words | |
127 | user> (def word-re (frak/pattern words)) | |
128 | #'user/word-re | |
edb98134 | 129 | user> (every? #(re-matches word-re %) words) |
adbcb1dc JH |
130 | true |
131 | ``` | |
132 | ||
0e9dc48f JH |
133 | The last two operations will take a moment since there are over |
134 | 235,000 words to consider. | |
45176aef | 135 | |
adbcb1dc JH |
136 | You can view the full expression |
137 | [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) | |
138 | (it's approximately `1.5M`!). | |
265c997f JH |
139 | |
140 | ## Benchmarks | |
141 | ||
142 | ```clojure | |
143 | (use 'criterium.core) | |
144 | ||
145 | (def words | |
146 | (-> (io/file "/usr/share/dict/words") | |
147 | io/reader | |
148 | line-seq)) | |
149 | ||
150 | (defn naive-pattern | |
151 | "Create a naive regular expression pattern for matching every string | |
152 | in strs." | |
153 | [strs] | |
154 | (->> strs | |
155 | (clojure.string/join "|") | |
d35e9dc9 JH |
156 | (format "(?:%s)") |
157 | re-pattern)) | |
265c997f JH |
158 | |
159 | ;; Shuffle 10000 words and build a naive and frak pattern from them. | |
160 | (def ws (shuffle (take 10000 words))) | |
0e9dc48f | 161 | |
265c997f JH |
162 | (def n-pat (naive-pattern ws)) |
163 | (def f-pat (frak/pattern ws)) | |
164 | ||
165 | ;; Verify the naive pattern matches everything it was constructed from. | |
166 | (every? #(re-matches n-pat %) ws) | |
167 | ;; => true | |
168 | ||
169 | ;; Shuffle the words again since the naive pattern is built in the | |
170 | ;; same order as it's inputs. | |
171 | (def ws' (shuffle ws)) | |
172 | ||
173 | ;;;; Benchmarks | |
174 | ||
edb98134 | 175 | ;; Naive pattern |
265c997f JH |
176 | |
177 | (bench (doseq [w ws'] (re-matches n-pat w))) | |
265c997f JH |
178 | ;; Execution time mean : 1.499489 sec |
179 | ;; Execution time std-deviation : 181.365166 ms | |
180 | ;; Execution time lower quantile : 1.337817 sec ( 2.5%) | |
181 | ;; Execution time upper quantile : 1.828733 sec (97.5%) | |
182 | ||
183 | ;; frak pattern | |
184 | ||
185 | (bench (doseq [w ws'] (re-matches f-pat w))) | |
186 | ;; Execution time mean : 155.515855 ms | |
187 | ;; Execution time std-deviation : 5.663346 ms | |
188 | ;; Execution time lower quantile : 148.168855 ms ( 2.5%) | |
189 | ;; Execution time upper quantile : 164.164294 ms (97.5%) | |
190 | ``` |