Add benchmarks to README
[frak.git] / README.md
CommitLineData
fef2a3db
JH
1# frak
2
3frak transforms collections of strings into regular expressions for
adbcb1dc
JH
4matching those strings. The primary goal of this library is to
5generate regular expressions from a known set of inputs which avoid
6backtracking as much as possible.
fef2a3db 7
84e74fb3
JH
8## "Installation"
9
10Add frak as a dependency to your `project.clj` file.
11
12```clojure
ddfeefe3 13[frak "0.1.1"]
84e74fb3
JH
14```
15
fef2a3db
JH
16## Usage
17
18```clojure
19user> (require 'frak)
20nil
21user> (frak/pattern ["foo" "bar" "baz" "quux"])
22#"(?:ba(?:r|z)|foo|quux)"
aa0a57a0
JH
23user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
24#"Cloj(?:ure(?:Script)?|ars)"
fef2a3db 25```
adbcb1dc
JH
26
27## How?
28
29A frak pattern is constructed from a very stupid trie of characters.
3b2ac58d
JH
30As characters are added to it, meta data is stored in it's branches.
31The meta data contains information such as which branches are terminal
32and a record of characters which have "visited" the branch.
adbcb1dc
JH
33
34During the rendering process frak will prefer branch characters that
35have "visited" the most. In the example above, you will notice the
36`ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was
37the first to enter the trie. This is because the character `\b` has
38frequented the branch more than `\f` and `\q`. The example below
39illustrates this behavior on the second character of each input.
40
41```clojure
42user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
43#"b(?:a(?:t|n|r)|o(?:t|x)|it)"
44```
45
46## Why?
47
48[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
49why. Also because.
50
a7041588
JH
51## And now for something completely different
52
adbcb1dc 53Let's build a regular expression for matching any word in
a7041588 54`/usr/share/dict/words`.
adbcb1dc
JH
55
56```clojure
57user> (require '[clojure.java.io :as io])
58nil
59user> (def words
60 (-> (io/file "/usr/share/dict/words")
61 io/reader
62 line-seq))
63#'user/words
64user> (def word-re (frak/pattern words))
65#'user/word-re
66user> (every? #(re-matches word-re %) words)
67true
68```
69
70You can view the full expression
71[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
72(it's approximately `1.5M`!).
265c997f
JH
73
74## Benchmarks
75
76```clojure
77(use 'criterium.core)
78
79(def words
80 (-> (io/file "/usr/share/dict/words")
81 io/reader
82 line-seq))
83
84(defn naive-pattern
85 "Create a naive regular expression pattern for matching every string
86 in strs."
87 [strs]
88 (->> strs
89 (clojure.string/join "|")
90 (format "(?:%s)")))
91
92;; Shuffle 10000 words and build a naive and frak pattern from them.
93(def ws (shuffle (take 10000 words)))
94(def n-pat (naive-pattern ws))
95(def f-pat (frak/pattern ws))
96
97;; Verify the naive pattern matches everything it was constructed from.
98(every? #(re-matches n-pat %) ws)
99;; => true
100
101;; Shuffle the words again since the naive pattern is built in the
102;; same order as it's inputs.
103(def ws' (shuffle ws))
104
105;;;; Benchmarks
106
107;; Naive pattern
108
109(bench (doseq [w ws'] (re-matches n-pat w)))
110
111;; Execution time mean : 1.499489 sec
112;; Execution time std-deviation : 181.365166 ms
113;; Execution time lower quantile : 1.337817 sec ( 2.5%)
114;; Execution time upper quantile : 1.828733 sec (97.5%)
115
116;; frak pattern
117
118(bench (doseq [w ws'] (re-matches f-pat w)))
119;; Execution time mean : 155.515855 ms
120;; Execution time std-deviation : 5.663346 ms
121;; Execution time lower quantile : 148.168855 ms ( 2.5%)
122;; Execution time upper quantile : 164.164294 ms (97.5%)
123```