Commit | Line | Data |
---|---|---|
fef2a3db JH |
1 | # frak |
2 | ||
3 | frak transforms collections of strings into regular expressions for | |
adbcb1dc JH |
4 | matching those strings. The primary goal of this library is to |
5 | generate regular expressions from a known set of inputs which avoid | |
6 | backtracking as much as possible. | |
fef2a3db | 7 | |
84e74fb3 JH |
8 | ## "Installation" |
9 | ||
10 | Add frak as a dependency to your `project.clj` file. | |
11 | ||
12 | ```clojure | |
ddfeefe3 | 13 | [frak "0.1.1"] |
84e74fb3 JH |
14 | ``` |
15 | ||
fef2a3db JH |
16 | ## Usage |
17 | ||
18 | ```clojure | |
19 | user> (require 'frak) | |
20 | nil | |
21 | user> (frak/pattern ["foo" "bar" "baz" "quux"]) | |
22 | #"(?:ba(?:r|z)|foo|quux)" | |
aa0a57a0 JH |
23 | user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) |
24 | #"Cloj(?:ure(?:Script)?|ars)" | |
fef2a3db | 25 | ``` |
adbcb1dc JH |
26 | |
27 | ## How? | |
28 | ||
29 | A frak pattern is constructed from a very stupid trie of characters. | |
3b2ac58d JH |
30 | As characters are added to it, meta data is stored in it's branches. |
31 | The meta data contains information such as which branches are terminal | |
32 | and a record of characters which have "visited" the branch. | |
adbcb1dc JH |
33 | |
34 | During the rendering process frak will prefer branch characters that | |
35 | have "visited" the most. In the example above, you will notice the | |
36 | `ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was | |
37 | the first to enter the trie. This is because the character `\b` has | |
38 | frequented the branch more than `\f` and `\q`. The example below | |
39 | illustrates this behavior on the second character of each input. | |
40 | ||
41 | ```clojure | |
42 | user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"]) | |
43 | #"b(?:a(?:t|n|r)|o(?:t|x)|it)" | |
44 | ``` | |
45 | ||
46 | ## Why? | |
47 | ||
48 | [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) | |
49 | why. Also because. | |
50 | ||
a7041588 JH |
51 | ## And now for something completely different |
52 | ||
adbcb1dc | 53 | Let's build a regular expression for matching any word in |
a7041588 | 54 | `/usr/share/dict/words`. |
adbcb1dc JH |
55 | |
56 | ```clojure | |
57 | user> (require '[clojure.java.io :as io]) | |
58 | nil | |
59 | user> (def words | |
60 | (-> (io/file "/usr/share/dict/words") | |
61 | io/reader | |
62 | line-seq)) | |
63 | #'user/words | |
64 | user> (def word-re (frak/pattern words)) | |
65 | #'user/word-re | |
66 | user> (every? #(re-matches word-re %) words) | |
67 | true | |
68 | ``` | |
69 | ||
70 | You can view the full expression | |
71 | [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) | |
72 | (it's approximately `1.5M`!). | |
265c997f JH |
73 | |
74 | ## Benchmarks | |
75 | ||
76 | ```clojure | |
77 | (use 'criterium.core) | |
78 | ||
79 | (def words | |
80 | (-> (io/file "/usr/share/dict/words") | |
81 | io/reader | |
82 | line-seq)) | |
83 | ||
84 | (defn naive-pattern | |
85 | "Create a naive regular expression pattern for matching every string | |
86 | in strs." | |
87 | [strs] | |
88 | (->> strs | |
89 | (clojure.string/join "|") | |
90 | (format "(?:%s)"))) | |
91 | ||
92 | ;; Shuffle 10000 words and build a naive and frak pattern from them. | |
93 | (def ws (shuffle (take 10000 words))) | |
94 | (def n-pat (naive-pattern ws)) | |
95 | (def f-pat (frak/pattern ws)) | |
96 | ||
97 | ;; Verify the naive pattern matches everything it was constructed from. | |
98 | (every? #(re-matches n-pat %) ws) | |
99 | ;; => true | |
100 | ||
101 | ;; Shuffle the words again since the naive pattern is built in the | |
102 | ;; same order as it's inputs. | |
103 | (def ws' (shuffle ws)) | |
104 | ||
105 | ;;;; Benchmarks | |
106 | ||
107 | ;; Naive pattern | |
108 | ||
109 | (bench (doseq [w ws'] (re-matches n-pat w))) | |
110 | ||
111 | ;; Execution time mean : 1.499489 sec | |
112 | ;; Execution time std-deviation : 181.365166 ms | |
113 | ;; Execution time lower quantile : 1.337817 sec ( 2.5%) | |
114 | ;; Execution time upper quantile : 1.828733 sec (97.5%) | |
115 | ||
116 | ;; frak pattern | |
117 | ||
118 | (bench (doseq [w ws'] (re-matches f-pat w))) | |
119 | ;; Execution time mean : 155.515855 ms | |
120 | ;; Execution time std-deviation : 5.663346 ms | |
121 | ;; Execution time lower quantile : 148.168855 ms ( 2.5%) | |
122 | ;; Execution time upper quantile : 164.164294 ms (97.5%) | |
123 | ``` |