Version bump to v0.1.9
[frak.git] / README.md
CommitLineData
fef2a3db
JH
1# frak
2
3frak transforms collections of strings into regular expressions for
adbcb1dc
JH
4matching those strings. The primary goal of this library is to
5generate regular expressions from a known set of inputs which avoid
c80cdf1e
JH
6backtracking as much as possible. It is available as a [command line
7utility](#command-line-usage) and for the [browser](#browser-usage)
8as a JavaScript library.
fef2a3db 9
84e74fb3
JH
10## "Installation"
11
12Add frak as a dependency to your `project.clj` file.
13
9aeb1f15
JH
14[![Clojars Project](https://img.shields.io/clojars/v/frak.svg)](https://clojars.org/frak)
15
84e74fb3 16
c80cdf1e 17## Clojure(Script) usage
fef2a3db
JH
18
19```clojure
20user> (require 'frak)
21nil
22user> (frak/pattern ["foo" "bar" "baz" "quux"])
6f6c103f 23#"(?:ba[rz]|foo|quux)"
aa0a57a0
JH
24user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
25#"Cloj(?:ure(?:Script)?|ars)"
c80cdf1e
JH
26user> (frak/pattern ["skill" "skills" "skull" "skulls"])
27#"sk(?:[ui]lls?)"
fef2a3db 28```
adbcb1dc 29
df8061d7
JL
30## Options
31Frak's `pattern` function can accept an options map as its second argument, the available options are:
32
33* `:capture?` - boolean (default false), whether rendered regex should create capture groups for each match
34* `:escape-chars` - vector (default see [`frak/metacharacters`](https://github.com/noprompt/frak/blob/3c22d9b33646ca4bc681a5746ac0d11fd06e579b/src/cljx/frak.cljx#L54)), characters to escape when rendering a regular expression.
35* `:exact?` - boolean (default false), whether the rendered regex should only produces matches when the entire input string matches.
36* `:whole-words?` - boolean (default false), whether the rendered regex should match only whole words (word boundary at both ends of the match) in the input string.
37
c80cdf1e 38## Command line usage
adbcb1dc 39
c80cdf1e 40frak can be used from the command line with either Leiningen or NodeJS.
adbcb1dc 41
c80cdf1e 42### With Leiningen
adbcb1dc 43
c80cdf1e
JH
44Use the `lein run` command:
45
46```shell
47$ lein run -e foo bar baz quux
48^(?:ba[rz]|foo|quux)$
49```
50
51### With NodeJS
52
53Compile the NodeJS version
54
55```shell
56$ lein do cljx once, cljsbuild once node
57$ chmod +x bin/frak
58$ bin/frak -e foo bar baz quux
59^(?:ba[rz]|foo|quux)$
adbcb1dc
JH
60```
61
c80cdf1e
JH
62## Browser usage
63
64To use frak as a standalone library in the browser with JavaScript
65compile the browser version:
66
67```shell
68$ lein do cljx once, cljsbuild once browser
69$ mv ./target/js/frak.min.js <destination>
70```
71
72Try it using this HTML:
73
74```html
75<!DOCTYPE html>
76<html>
77<head>
78</head>
79<body>
80 <pre>Input: <span id="input"></span></pre>
81 <pre>Output: <span id="output"></span></pre>
82 <script src="http://code.jquery.com/jquery-2.0.3.min.js"></script>
83 <script src="frak.min.js"></script>
84 <script>
85 var strings = ["foo", "bar", "baz", "quux"];
86 // It's a good idea to use the `"exact?"` option.
87 var pattern = frak.pattern(strings, {"exact?": true})
88 jQuery("#input").text(strings.join(" "));
89 jQuery("#output").text(pattern);
90 </script>
91</body>
92</html>
93```
94
95For even more fun try it with [AngularJS](http://angularjs.org/)!
96
97## How?
98
99A frak pattern is constructed from a trie of characters and a
100renderer which processes it. As characters are added to the trie, data
0e9dc48f 101such as such as which characters are terminal are stored in it's
c80cdf1e
JH
102branches.
103
104During the rendering process frak analyzes each branch and attempts to
105emit the most concise regular expression possible. Additional post
0e9dc48f
JH
106operations are applied after rendering to improve the expression where
107possible.
c80cdf1e 108
adbcb1dc
JH
109## Why?
110
111[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
112why. Also because.
113
a7041588
JH
114## And now for something completely different
115
adbcb1dc 116Let's build a regular expression for matching any word in
a7041588 117`/usr/share/dict/words`.
adbcb1dc
JH
118
119```clojure
120user> (require '[clojure.java.io :as io])
121nil
122user> (def words
123 (-> (io/file "/usr/share/dict/words")
124 io/reader
125 line-seq))
126#'user/words
127user> (def word-re (frak/pattern words))
128#'user/word-re
edb98134 129user> (every? #(re-matches word-re %) words)
adbcb1dc
JH
130true
131```
132
0e9dc48f
JH
133The last two operations will take a moment since there are over
134235,000 words to consider.
45176aef 135
adbcb1dc
JH
136You can view the full expression
137[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
138(it's approximately `1.5M`!).
265c997f
JH
139
140## Benchmarks
141
142```clojure
143(use 'criterium.core)
144
145(def words
146 (-> (io/file "/usr/share/dict/words")
147 io/reader
148 line-seq))
149
150(defn naive-pattern
151 "Create a naive regular expression pattern for matching every string
152 in strs."
153 [strs]
154 (->> strs
155 (clojure.string/join "|")
d35e9dc9
JH
156 (format "(?:%s)")
157 re-pattern))
265c997f
JH
158
159;; Shuffle 10000 words and build a naive and frak pattern from them.
160(def ws (shuffle (take 10000 words)))
0e9dc48f 161
265c997f
JH
162(def n-pat (naive-pattern ws))
163(def f-pat (frak/pattern ws))
164
165;; Verify the naive pattern matches everything it was constructed from.
166(every? #(re-matches n-pat %) ws)
167;; => true
168
169;; Shuffle the words again since the naive pattern is built in the
170;; same order as it's inputs.
171(def ws' (shuffle ws))
172
173;;;; Benchmarks
174
edb98134 175;; Naive pattern
265c997f
JH
176
177(bench (doseq [w ws'] (re-matches n-pat w)))
265c997f
JH
178;; Execution time mean : 1.499489 sec
179;; Execution time std-deviation : 181.365166 ms
180;; Execution time lower quantile : 1.337817 sec ( 2.5%)
181;; Execution time upper quantile : 1.828733 sec (97.5%)
182
183;; frak pattern
184
185(bench (doseq [w ws'] (re-matches f-pat w)))
186;; Execution time mean : 155.515855 ms
187;; Execution time std-deviation : 5.663346 ms
188;; Execution time lower quantile : 148.168855 ms ( 2.5%)
189;; Execution time upper quantile : 164.164294 ms (97.5%)
190```