Commit | Line | Data |
---|---|---|
fef2a3db JH |
1 | # frak |
2 | ||
3 | frak transforms collections of strings into regular expressions for | |
adbcb1dc JH |
4 | matching those strings. The primary goal of this library is to |
5 | generate regular expressions from a known set of inputs which avoid | |
6 | backtracking as much as possible. | |
fef2a3db | 7 | |
84e74fb3 JH |
8 | ## "Installation" |
9 | ||
10 | Add frak as a dependency to your `project.clj` file. | |
11 | ||
12 | ```clojure | |
6f6c103f | 13 | [frak "0.1.2"] |
84e74fb3 JH |
14 | ``` |
15 | ||
fef2a3db JH |
16 | ## Usage |
17 | ||
18 | ```clojure | |
19 | user> (require 'frak) | |
20 | nil | |
21 | user> (frak/pattern ["foo" "bar" "baz" "quux"]) | |
6f6c103f | 22 | #"(?:ba[rz]|foo|quux)" |
aa0a57a0 JH |
23 | user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) |
24 | #"Cloj(?:ure(?:Script)?|ars)" | |
fef2a3db | 25 | ``` |
adbcb1dc JH |
26 | |
27 | ## How? | |
28 | ||
24a60b46 JH |
29 | A frak pattern is constructed from a trie of characters. As characters |
30 | are added to it, meta data is stored in it's branches containing | |
31 | information such as which branches are terminal and a record of | |
32 | characters which have "visited" the branch. | |
adbcb1dc JH |
33 | |
34 | During the rendering process frak will prefer branch characters that | |
35 | have "visited" the most. In the example above, you will notice the | |
edb98134 | 36 | `ba[rz]` branch takes precedence over `foo` even though `"foo"` was |
adbcb1dc JH |
37 | the first to enter the trie. This is because the character `\b` has |
38 | frequented the branch more than `\f` and `\q`. The example below | |
39 | illustrates this behavior on the second character of each input. | |
40 | ||
41 | ```clojure | |
42 | user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"]) | |
6f6c103f | 43 | #"b(?:a[tnr]|o[tx]|it)" |
adbcb1dc JH |
44 | ``` |
45 | ||
46 | ## Why? | |
47 | ||
48 | [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) | |
49 | why. Also because. | |
50 | ||
a7041588 JH |
51 | ## And now for something completely different |
52 | ||
adbcb1dc | 53 | Let's build a regular expression for matching any word in |
a7041588 | 54 | `/usr/share/dict/words`. |
adbcb1dc JH |
55 | |
56 | ```clojure | |
57 | user> (require '[clojure.java.io :as io]) | |
58 | nil | |
59 | user> (def words | |
60 | (-> (io/file "/usr/share/dict/words") | |
61 | io/reader | |
62 | line-seq)) | |
63 | #'user/words | |
64 | user> (def word-re (frak/pattern words)) | |
65 | #'user/word-re | |
edb98134 | 66 | user> (every? #(re-matches word-re %) words) |
adbcb1dc JH |
67 | true |
68 | ``` | |
69 | ||
45176aef JH |
70 | The last two operations will take a moment since there are about |
71 | 235,886 words to consider. | |
72 | ||
adbcb1dc JH |
73 | You can view the full expression |
74 | [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) | |
75 | (it's approximately `1.5M`!). | |
265c997f JH |
76 | |
77 | ## Benchmarks | |
78 | ||
79 | ```clojure | |
80 | (use 'criterium.core) | |
81 | ||
82 | (def words | |
83 | (-> (io/file "/usr/share/dict/words") | |
84 | io/reader | |
85 | line-seq)) | |
86 | ||
87 | (defn naive-pattern | |
88 | "Create a naive regular expression pattern for matching every string | |
89 | in strs." | |
90 | [strs] | |
91 | (->> strs | |
92 | (clojure.string/join "|") | |
d35e9dc9 JH |
93 | (format "(?:%s)") |
94 | re-pattern)) | |
265c997f JH |
95 | |
96 | ;; Shuffle 10000 words and build a naive and frak pattern from them. | |
97 | (def ws (shuffle (take 10000 words))) | |
98 | (def n-pat (naive-pattern ws)) | |
99 | (def f-pat (frak/pattern ws)) | |
100 | ||
101 | ;; Verify the naive pattern matches everything it was constructed from. | |
102 | (every? #(re-matches n-pat %) ws) | |
103 | ;; => true | |
104 | ||
105 | ;; Shuffle the words again since the naive pattern is built in the | |
106 | ;; same order as it's inputs. | |
107 | (def ws' (shuffle ws)) | |
108 | ||
109 | ;;;; Benchmarks | |
110 | ||
edb98134 | 111 | ;; Naive pattern |
265c997f JH |
112 | |
113 | (bench (doseq [w ws'] (re-matches n-pat w))) | |
265c997f JH |
114 | ;; Execution time mean : 1.499489 sec |
115 | ;; Execution time std-deviation : 181.365166 ms | |
116 | ;; Execution time lower quantile : 1.337817 sec ( 2.5%) | |
117 | ;; Execution time upper quantile : 1.828733 sec (97.5%) | |
118 | ||
119 | ;; frak pattern | |
120 | ||
121 | (bench (doseq [w ws'] (re-matches f-pat w))) | |
122 | ;; Execution time mean : 155.515855 ms | |
123 | ;; Execution time std-deviation : 5.663346 ms | |
124 | ;; Execution time lower quantile : 148.168855 ms ( 2.5%) | |
125 | ;; Execution time upper quantile : 164.164294 ms (97.5%) | |
126 | ``` |