Commit | Line | Data |
---|---|---|
fef2a3db JH |
1 | # frak |
2 | ||
3 | frak transforms collections of strings into regular expressions for | |
adbcb1dc JH |
4 | matching those strings. The primary goal of this library is to |
5 | generate regular expressions from a known set of inputs which avoid | |
c80cdf1e JH |
6 | backtracking as much as possible. It is available as a [command line |
7 | utility](#command-line-usage) and for the [browser](#browser-usage) | |
8 | as a JavaScript library. | |
fef2a3db | 9 | |
84e74fb3 JH |
10 | ## "Installation" |
11 | ||
12 | Add frak as a dependency to your `project.clj` file. | |
13 | ||
14 | ```clojure | |
20ec8e3e | 15 | [frak "0.1.5"] |
84e74fb3 JH |
16 | ``` |
17 | ||
c80cdf1e | 18 | ## Clojure(Script) usage |
fef2a3db JH |
19 | |
20 | ```clojure | |
21 | user> (require 'frak) | |
22 | nil | |
23 | user> (frak/pattern ["foo" "bar" "baz" "quux"]) | |
6f6c103f | 24 | #"(?:ba[rz]|foo|quux)" |
aa0a57a0 JH |
25 | user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) |
26 | #"Cloj(?:ure(?:Script)?|ars)" | |
c80cdf1e JH |
27 | user> (frak/pattern ["skill" "skills" "skull" "skulls"]) |
28 | #"sk(?:[ui]lls?)" | |
fef2a3db | 29 | ``` |
adbcb1dc | 30 | |
c80cdf1e | 31 | ## Command line usage |
adbcb1dc | 32 | |
c80cdf1e | 33 | frak can be used from the command line with either Leiningen or NodeJS. |
adbcb1dc | 34 | |
c80cdf1e | 35 | ### With Leiningen |
adbcb1dc | 36 | |
c80cdf1e JH |
37 | Use the `lein run` command: |
38 | ||
39 | ```shell | |
40 | $ lein run -e foo bar baz quux | |
41 | ^(?:ba[rz]|foo|quux)$ | |
42 | ``` | |
43 | ||
44 | ### With NodeJS | |
45 | ||
46 | Compile the NodeJS version | |
47 | ||
48 | ```shell | |
49 | $ lein do cljx once, cljsbuild once node | |
50 | $ chmod +x bin/frak | |
51 | $ bin/frak -e foo bar baz quux | |
52 | ^(?:ba[rz]|foo|quux)$ | |
adbcb1dc JH |
53 | ``` |
54 | ||
c80cdf1e JH |
55 | ## Browser usage |
56 | ||
57 | To use frak as a standalone library in the browser with JavaScript | |
58 | compile the browser version: | |
59 | ||
60 | ```shell | |
61 | $ lein do cljx once, cljsbuild once browser | |
62 | $ mv ./target/js/frak.min.js <destination> | |
63 | ``` | |
64 | ||
65 | Try it using this HTML: | |
66 | ||
67 | ```html | |
68 | <!DOCTYPE html> | |
69 | <html> | |
70 | <head> | |
71 | </head> | |
72 | <body> | |
73 | <pre>Input: <span id="input"></span></pre> | |
74 | <pre>Output: <span id="output"></span></pre> | |
75 | <script src="http://code.jquery.com/jquery-2.0.3.min.js"></script> | |
76 | <script src="frak.min.js"></script> | |
77 | <script> | |
78 | var strings = ["foo", "bar", "baz", "quux"]; | |
79 | // It's a good idea to use the `"exact?"` option. | |
80 | var pattern = frak.pattern(strings, {"exact?": true}) | |
81 | jQuery("#input").text(strings.join(" ")); | |
82 | jQuery("#output").text(pattern); | |
83 | </script> | |
84 | </body> | |
85 | </html> | |
86 | ``` | |
87 | ||
88 | For even more fun try it with [AngularJS](http://angularjs.org/)! | |
89 | ||
90 | ## How? | |
91 | ||
92 | A frak pattern is constructed from a trie of characters and a | |
93 | renderer which processes it. As characters are added to the trie, data | |
0e9dc48f | 94 | such as such as which characters are terminal are stored in it's |
c80cdf1e JH |
95 | branches. |
96 | ||
97 | During the rendering process frak analyzes each branch and attempts to | |
98 | emit the most concise regular expression possible. Additional post | |
0e9dc48f JH |
99 | operations are applied after rendering to improve the expression where |
100 | possible. | |
c80cdf1e | 101 | |
adbcb1dc JH |
102 | ## Why? |
103 | ||
104 | [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) | |
105 | why. Also because. | |
106 | ||
a7041588 JH |
107 | ## And now for something completely different |
108 | ||
adbcb1dc | 109 | Let's build a regular expression for matching any word in |
a7041588 | 110 | `/usr/share/dict/words`. |
adbcb1dc JH |
111 | |
112 | ```clojure | |
113 | user> (require '[clojure.java.io :as io]) | |
114 | nil | |
115 | user> (def words | |
116 | (-> (io/file "/usr/share/dict/words") | |
117 | io/reader | |
118 | line-seq)) | |
119 | #'user/words | |
120 | user> (def word-re (frak/pattern words)) | |
121 | #'user/word-re | |
edb98134 | 122 | user> (every? #(re-matches word-re %) words) |
adbcb1dc JH |
123 | true |
124 | ``` | |
125 | ||
0e9dc48f JH |
126 | The last two operations will take a moment since there are over |
127 | 235,000 words to consider. | |
45176aef | 128 | |
adbcb1dc JH |
129 | You can view the full expression |
130 | [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) | |
131 | (it's approximately `1.5M`!). | |
265c997f JH |
132 | |
133 | ## Benchmarks | |
134 | ||
135 | ```clojure | |
136 | (use 'criterium.core) | |
137 | ||
138 | (def words | |
139 | (-> (io/file "/usr/share/dict/words") | |
140 | io/reader | |
141 | line-seq)) | |
142 | ||
143 | (defn naive-pattern | |
144 | "Create a naive regular expression pattern for matching every string | |
145 | in strs." | |
146 | [strs] | |
147 | (->> strs | |
148 | (clojure.string/join "|") | |
d35e9dc9 JH |
149 | (format "(?:%s)") |
150 | re-pattern)) | |
265c997f JH |
151 | |
152 | ;; Shuffle 10000 words and build a naive and frak pattern from them. | |
153 | (def ws (shuffle (take 10000 words))) | |
0e9dc48f | 154 | |
265c997f JH |
155 | (def n-pat (naive-pattern ws)) |
156 | (def f-pat (frak/pattern ws)) | |
157 | ||
158 | ;; Verify the naive pattern matches everything it was constructed from. | |
159 | (every? #(re-matches n-pat %) ws) | |
160 | ;; => true | |
161 | ||
162 | ;; Shuffle the words again since the naive pattern is built in the | |
163 | ;; same order as it's inputs. | |
164 | (def ws' (shuffle ws)) | |
165 | ||
166 | ;;;; Benchmarks | |
167 | ||
edb98134 | 168 | ;; Naive pattern |
265c997f JH |
169 | |
170 | (bench (doseq [w ws'] (re-matches n-pat w))) | |
265c997f JH |
171 | ;; Execution time mean : 1.499489 sec |
172 | ;; Execution time std-deviation : 181.365166 ms | |
173 | ;; Execution time lower quantile : 1.337817 sec ( 2.5%) | |
174 | ;; Execution time upper quantile : 1.828733 sec (97.5%) | |
175 | ||
176 | ;; frak pattern | |
177 | ||
178 | (bench (doseq [w ws'] (re-matches f-pat w))) | |
179 | ;; Execution time mean : 155.515855 ms | |
180 | ;; Execution time std-deviation : 5.663346 ms | |
181 | ;; Execution time lower quantile : 148.168855 ms ( 2.5%) | |
182 | ;; Execution time upper quantile : 164.164294 ms (97.5%) | |
183 | ``` |