Move src/frak/cli.clj -> src/clj/frak/cli.clj
[frak.git] / README.md
1 # frak
2
3 frak transforms collections of strings into regular expressions for
4 matching those strings. The primary goal of this library is to
5 generate regular expressions from a known set of inputs which avoid
6 backtracking as much as possible.
7
8 ## "Installation"
9
10 Add frak as a dependency to your `project.clj` file.
11
12 ```clojure
13 [frak "0.1.2"]
14 ```
15
16 ## Usage
17
18 ```clojure
19 user> (require 'frak)
20 nil
21 user> (frak/pattern ["foo" "bar" "baz" "quux"])
22 #"(?:ba[rz]|foo|quux)"
23 user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
24 #"Cloj(?:ure(?:Script)?|ars)"
25 ```
26
27 ## How?
28
29 A frak pattern is constructed from a trie of characters. As characters
30 are added to it, meta data is stored in it's branches containing
31 information such as which branches are terminal and a record of
32 characters which have "visited" the branch.
33
34 During the rendering process frak will prefer branch characters that
35 have "visited" the most. In the example above, you will notice the
36 `ba[rz]` branch takes precedence over `foo` even though `"foo"` was
37 the first to enter the trie. This is because the character `\b` has
38 frequented the branch more than `\f` and `\q`. The example below
39 illustrates this behavior on the second character of each input.
40
41 ```clojure
42 user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
43 #"b(?:a[tnr]|o[tx]|it)"
44 ```
45
46 ## Why?
47
48 [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
49 why. Also because.
50
51 ## Next
52
53 While the patterns currently generated by frak are correct, there is
54 potential for improvement; the word trie could be converted to a
55 directed acyclic word graph (DAWG).
56
57 By using a DAWG it might be possible to produce more efficient
58 patterns which consider common prefixes and suffixes. This could
59 reduce backtracking and overall pattern size.
60
61 ## And now for something completely different
62
63 Let's build a regular expression for matching any word in
64 `/usr/share/dict/words`.
65
66 ```clojure
67 user> (require '[clojure.java.io :as io])
68 nil
69 user> (def words
70 (-> (io/file "/usr/share/dict/words")
71 io/reader
72 line-seq))
73 #'user/words
74 user> (def word-re (frak/pattern words))
75 #'user/word-re
76 user> (every? #(re-matches word-re %) words)
77 true
78 ```
79
80 The last two operations will take a moment since there are about
81 235,886 words to consider.
82
83 You can view the full expression
84 [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
85 (it's approximately `1.5M`!).
86
87 ## Benchmarks
88
89 ```clojure
90 (use 'criterium.core)
91
92 (def words
93 (-> (io/file "/usr/share/dict/words")
94 io/reader
95 line-seq))
96
97 (defn naive-pattern
98 "Create a naive regular expression pattern for matching every string
99 in strs."
100 [strs]
101 (->> strs
102 (clojure.string/join "|")
103 (format "(?:%s)")
104 re-pattern))
105
106 ;; Shuffle 10000 words and build a naive and frak pattern from them.
107 (def ws (shuffle (take 10000 words)))
108 (def n-pat (naive-pattern ws))
109 (def f-pat (frak/pattern ws))
110
111 ;; Verify the naive pattern matches everything it was constructed from.
112 (every? #(re-matches n-pat %) ws)
113 ;; => true
114
115 ;; Shuffle the words again since the naive pattern is built in the
116 ;; same order as it's inputs.
117 (def ws' (shuffle ws))
118
119 ;;;; Benchmarks
120
121 ;; Naive pattern
122
123 (bench (doseq [w ws'] (re-matches n-pat w)))
124 ;; Execution time mean : 1.499489 sec
125 ;; Execution time std-deviation : 181.365166 ms
126 ;; Execution time lower quantile : 1.337817 sec ( 2.5%)
127 ;; Execution time upper quantile : 1.828733 sec (97.5%)
128
129 ;; frak pattern
130
131 (bench (doseq [w ws'] (re-matches f-pat w)))
132 ;; Execution time mean : 155.515855 ms
133 ;; Execution time std-deviation : 5.663346 ms
134 ;; Execution time lower quantile : 148.168855 ms ( 2.5%)
135 ;; Execution time upper quantile : 164.164294 ms (97.5%)
136 ```