Commit | Line | Data |
---|---|---|
fef2a3db JH |
1 | # frak |
2 | ||
3 | frak transforms collections of strings into regular expressions for | |
adbcb1dc JH |
4 | matching those strings. The primary goal of this library is to |
5 | generate regular expressions from a known set of inputs which avoid | |
6 | backtracking as much as possible. | |
fef2a3db | 7 | |
84e74fb3 JH |
8 | ## "Installation" |
9 | ||
10 | Add frak as a dependency to your `project.clj` file. | |
11 | ||
12 | ```clojure | |
ddfeefe3 | 13 | [frak "0.1.1"] |
84e74fb3 JH |
14 | ``` |
15 | ||
fef2a3db JH |
16 | ## Usage |
17 | ||
18 | ```clojure | |
19 | user> (require 'frak) | |
20 | nil | |
21 | user> (frak/pattern ["foo" "bar" "baz" "quux"]) | |
22 | #"(?:ba(?:r|z)|foo|quux)" | |
aa0a57a0 JH |
23 | user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) |
24 | #"Cloj(?:ure(?:Script)?|ars)" | |
fef2a3db | 25 | ``` |
adbcb1dc JH |
26 | |
27 | ## How? | |
28 | ||
24a60b46 JH |
29 | A frak pattern is constructed from a trie of characters. As characters |
30 | are added to it, meta data is stored in it's branches containing | |
31 | information such as which branches are terminal and a record of | |
32 | characters which have "visited" the branch. | |
adbcb1dc JH |
33 | |
34 | During the rendering process frak will prefer branch characters that | |
35 | have "visited" the most. In the example above, you will notice the | |
36 | `ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was | |
37 | the first to enter the trie. This is because the character `\b` has | |
38 | frequented the branch more than `\f` and `\q`. The example below | |
39 | illustrates this behavior on the second character of each input. | |
40 | ||
41 | ```clojure | |
42 | user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"]) | |
43 | #"b(?:a(?:t|n|r)|o(?:t|x)|it)" | |
44 | ``` | |
45 | ||
46 | ## Why? | |
47 | ||
48 | [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) | |
49 | why. Also because. | |
50 | ||
24a60b46 JH |
51 | ## Next |
52 | ||
53 | While the patterns currently generated by frak are correct, there is | |
54 | potential for improvement; the word trie could be converted to a | |
55 | directed acyclic word graph (DAWG). | |
56 | ||
57 | By using a DAWG it might be possible to produce more efficient | |
917b57ee JH |
58 | patterns which consider common prefixes and suffixes. This could |
59 | reduce backtracking and overall pattern size. | |
24a60b46 | 60 | |
a7041588 JH |
61 | ## And now for something completely different |
62 | ||
adbcb1dc | 63 | Let's build a regular expression for matching any word in |
a7041588 | 64 | `/usr/share/dict/words`. |
adbcb1dc JH |
65 | |
66 | ```clojure | |
67 | user> (require '[clojure.java.io :as io]) | |
68 | nil | |
69 | user> (def words | |
70 | (-> (io/file "/usr/share/dict/words") | |
71 | io/reader | |
72 | line-seq)) | |
73 | #'user/words | |
74 | user> (def word-re (frak/pattern words)) | |
75 | #'user/word-re | |
76 | user> (every? #(re-matches word-re %) words) | |
77 | true | |
78 | ``` | |
79 | ||
80 | You can view the full expression | |
81 | [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) | |
82 | (it's approximately `1.5M`!). | |
265c997f JH |
83 | |
84 | ## Benchmarks | |
85 | ||
86 | ```clojure | |
87 | (use 'criterium.core) | |
88 | ||
89 | (def words | |
90 | (-> (io/file "/usr/share/dict/words") | |
91 | io/reader | |
92 | line-seq)) | |
93 | ||
94 | (defn naive-pattern | |
95 | "Create a naive regular expression pattern for matching every string | |
96 | in strs." | |
97 | [strs] | |
98 | (->> strs | |
99 | (clojure.string/join "|") | |
d35e9dc9 JH |
100 | (format "(?:%s)") |
101 | re-pattern)) | |
265c997f JH |
102 | |
103 | ;; Shuffle 10000 words and build a naive and frak pattern from them. | |
104 | (def ws (shuffle (take 10000 words))) | |
105 | (def n-pat (naive-pattern ws)) | |
106 | (def f-pat (frak/pattern ws)) | |
107 | ||
108 | ;; Verify the naive pattern matches everything it was constructed from. | |
109 | (every? #(re-matches n-pat %) ws) | |
110 | ;; => true | |
111 | ||
112 | ;; Shuffle the words again since the naive pattern is built in the | |
113 | ;; same order as it's inputs. | |
114 | (def ws' (shuffle ws)) | |
115 | ||
116 | ;;;; Benchmarks | |
117 | ||
118 | ;; Naive pattern | |
119 | ||
120 | (bench (doseq [w ws'] (re-matches n-pat w))) | |
265c997f JH |
121 | ;; Execution time mean : 1.499489 sec |
122 | ;; Execution time std-deviation : 181.365166 ms | |
123 | ;; Execution time lower quantile : 1.337817 sec ( 2.5%) | |
124 | ;; Execution time upper quantile : 1.828733 sec (97.5%) | |
125 | ||
126 | ;; frak pattern | |
127 | ||
128 | (bench (doseq [w ws'] (re-matches f-pat w))) | |
129 | ;; Execution time mean : 155.515855 ms | |
130 | ;; Execution time std-deviation : 5.663346 ms | |
131 | ;; Execution time lower quantile : 148.168855 ms ( 2.5%) | |
132 | ;; Execution time upper quantile : 164.164294 ms (97.5%) | |
133 | ``` |