Merge branch 'master' of https://github.com/noprompt/frak
[frak.git] / README.md
CommitLineData
fef2a3db
JH
1# frak
2
3frak transforms collections of strings into regular expressions for
adbcb1dc
JH
4matching those strings. The primary goal of this library is to
5generate regular expressions from a known set of inputs which avoid
6backtracking as much as possible.
fef2a3db 7
84e74fb3
JH
8## "Installation"
9
10Add frak as a dependency to your `project.clj` file.
11
12```clojure
ddfeefe3 13[frak "0.1.1"]
84e74fb3
JH
14```
15
fef2a3db
JH
16## Usage
17
18```clojure
19user> (require 'frak)
20nil
21user> (frak/pattern ["foo" "bar" "baz" "quux"])
22#"(?:ba(?:r|z)|foo|quux)"
aa0a57a0
JH
23user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
24#"Cloj(?:ure(?:Script)?|ars)"
fef2a3db 25```
adbcb1dc
JH
26
27## How?
28
24a60b46
JH
29A frak pattern is constructed from a trie of characters. As characters
30are added to it, meta data is stored in it's branches containing
31information such as which branches are terminal and a record of
32characters which have "visited" the branch.
adbcb1dc
JH
33
34During the rendering process frak will prefer branch characters that
35have "visited" the most. In the example above, you will notice the
36`ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was
37the first to enter the trie. This is because the character `\b` has
38frequented the branch more than `\f` and `\q`. The example below
39illustrates this behavior on the second character of each input.
40
41```clojure
42user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
43#"b(?:a(?:t|n|r)|o(?:t|x)|it)"
44```
45
46## Why?
47
48[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
49why. Also because.
50
24a60b46
JH
51## Next
52
53While the patterns currently generated by frak are correct, there is
54potential for improvement; the word trie could be converted to a
55directed acyclic word graph (DAWG).
56
57By using a DAWG it might be possible to produce more efficient
917b57ee
JH
58patterns which consider common prefixes and suffixes. This could
59reduce backtracking and overall pattern size.
24a60b46 60
a7041588
JH
61## And now for something completely different
62
adbcb1dc 63Let's build a regular expression for matching any word in
a7041588 64`/usr/share/dict/words`.
adbcb1dc
JH
65
66```clojure
67user> (require '[clojure.java.io :as io])
68nil
69user> (def words
70 (-> (io/file "/usr/share/dict/words")
71 io/reader
72 line-seq))
73#'user/words
74user> (def word-re (frak/pattern words))
75#'user/word-re
76user> (every? #(re-matches word-re %) words)
77true
78```
79
80You can view the full expression
81[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
82(it's approximately `1.5M`!).
265c997f
JH
83
84## Benchmarks
85
86```clojure
87(use 'criterium.core)
88
89(def words
90 (-> (io/file "/usr/share/dict/words")
91 io/reader
92 line-seq))
93
94(defn naive-pattern
95 "Create a naive regular expression pattern for matching every string
96 in strs."
97 [strs]
98 (->> strs
99 (clojure.string/join "|")
d35e9dc9
JH
100 (format "(?:%s)")
101 re-pattern))
265c997f
JH
102
103;; Shuffle 10000 words and build a naive and frak pattern from them.
104(def ws (shuffle (take 10000 words)))
105(def n-pat (naive-pattern ws))
106(def f-pat (frak/pattern ws))
107
108;; Verify the naive pattern matches everything it was constructed from.
109(every? #(re-matches n-pat %) ws)
110;; => true
111
112;; Shuffle the words again since the naive pattern is built in the
113;; same order as it's inputs.
114(def ws' (shuffle ws))
115
116;;;; Benchmarks
117
118;; Naive pattern
119
120(bench (doseq [w ws'] (re-matches n-pat w)))
265c997f
JH
121;; Execution time mean : 1.499489 sec
122;; Execution time std-deviation : 181.365166 ms
123;; Execution time lower quantile : 1.337817 sec ( 2.5%)
124;; Execution time upper quantile : 1.828733 sec (97.5%)
125
126;; frak pattern
127
128(bench (doseq [w ws'] (re-matches f-pat w)))
129;; Execution time mean : 155.515855 ms
130;; Execution time std-deviation : 5.663346 ms
131;; Execution time lower quantile : 148.168855 ms ( 2.5%)
132;; Execution time upper quantile : 164.164294 ms (97.5%)
133```