Export string-pattern as stringPattern for JS compatibility
[frak.git] / README.md
CommitLineData
fef2a3db
JH
1# frak
2
3frak transforms collections of strings into regular expressions for
adbcb1dc
JH
4matching those strings. The primary goal of this library is to
5generate regular expressions from a known set of inputs which avoid
6backtracking as much as possible.
fef2a3db 7
84e74fb3
JH
8## "Installation"
9
10Add frak as a dependency to your `project.clj` file.
11
12```clojure
6f6c103f 13[frak "0.1.2"]
84e74fb3
JH
14```
15
fef2a3db
JH
16## Usage
17
18```clojure
19user> (require 'frak)
20nil
21user> (frak/pattern ["foo" "bar" "baz" "quux"])
6f6c103f 22#"(?:ba[rz]|foo|quux)"
aa0a57a0
JH
23user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
24#"Cloj(?:ure(?:Script)?|ars)"
fef2a3db 25```
adbcb1dc
JH
26
27## How?
28
24a60b46
JH
29A frak pattern is constructed from a trie of characters. As characters
30are added to it, meta data is stored in it's branches containing
31information such as which branches are terminal and a record of
32characters which have "visited" the branch.
adbcb1dc
JH
33
34During the rendering process frak will prefer branch characters that
35have "visited" the most. In the example above, you will notice the
edb98134 36`ba[rz]` branch takes precedence over `foo` even though `"foo"` was
adbcb1dc
JH
37the first to enter the trie. This is because the character `\b` has
38frequented the branch more than `\f` and `\q`. The example below
39illustrates this behavior on the second character of each input.
40
41```clojure
42user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
6f6c103f 43#"b(?:a[tnr]|o[tx]|it)"
adbcb1dc
JH
44```
45
46## Why?
47
48[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
49why. Also because.
50
a7041588
JH
51## And now for something completely different
52
adbcb1dc 53Let's build a regular expression for matching any word in
a7041588 54`/usr/share/dict/words`.
adbcb1dc
JH
55
56```clojure
57user> (require '[clojure.java.io :as io])
58nil
59user> (def words
60 (-> (io/file "/usr/share/dict/words")
61 io/reader
62 line-seq))
63#'user/words
64user> (def word-re (frak/pattern words))
65#'user/word-re
edb98134 66user> (every? #(re-matches word-re %) words)
adbcb1dc
JH
67true
68```
69
45176aef
JH
70The last two operations will take a moment since there are about
71235,886 words to consider.
72
adbcb1dc
JH
73You can view the full expression
74[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
75(it's approximately `1.5M`!).
265c997f
JH
76
77## Benchmarks
78
79```clojure
80(use 'criterium.core)
81
82(def words
83 (-> (io/file "/usr/share/dict/words")
84 io/reader
85 line-seq))
86
87(defn naive-pattern
88 "Create a naive regular expression pattern for matching every string
89 in strs."
90 [strs]
91 (->> strs
92 (clojure.string/join "|")
d35e9dc9
JH
93 (format "(?:%s)")
94 re-pattern))
265c997f
JH
95
96;; Shuffle 10000 words and build a naive and frak pattern from them.
97(def ws (shuffle (take 10000 words)))
98(def n-pat (naive-pattern ws))
99(def f-pat (frak/pattern ws))
100
101;; Verify the naive pattern matches everything it was constructed from.
102(every? #(re-matches n-pat %) ws)
103;; => true
104
105;; Shuffle the words again since the naive pattern is built in the
106;; same order as it's inputs.
107(def ws' (shuffle ws))
108
109;;;; Benchmarks
110
edb98134 111;; Naive pattern
265c997f
JH
112
113(bench (doseq [w ws'] (re-matches n-pat w)))
265c997f
JH
114;; Execution time mean : 1.499489 sec
115;; Execution time std-deviation : 181.365166 ms
116;; Execution time lower quantile : 1.337817 sec ( 2.5%)
117;; Execution time upper quantile : 1.828733 sec (97.5%)
118
119;; frak pattern
120
121(bench (doseq [w ws'] (re-matches f-pat w)))
122;; Execution time mean : 155.515855 ms
123;; Execution time std-deviation : 5.663346 ms
124;; Execution time lower quantile : 148.168855 ms ( 2.5%)
125;; Execution time upper quantile : 164.164294 ms (97.5%)
126```