Add example with terminal branches
[frak.git] / README.md
1 # frak
2
3 frak transforms collections of strings into regular expressions for
4 matching those strings. The primary goal of this library is to
5 generate regular expressions from a known set of inputs which avoid
6 backtracking as much as possible.
7
8 ## Usage
9
10 ```clojure
11 user> (require 'frak)
12 nil
13 user> (frak/pattern ["foo" "bar" "baz" "quux"])
14 #"(?:ba(?:r|z)|foo|quux)"
15 user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
16 #"Cloj(?:ure(?:Script)?|ars)"
17 ```
18
19 ## How?
20
21 A frak pattern is constructed from a very stupid trie of characters.
22 As words are added to it, meta data is stored in it's branches. The
23 meta data contains information such as which branches are terminal and
24 a record of characters which have "visited" the branch.
25
26 During the rendering process frak will prefer branch characters that
27 have "visited" the most. In the example above, you will notice the
28 `ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was
29 the first to enter the trie. This is because the character `\b` has
30 frequented the branch more than `\f` and `\q`. The example below
31 illustrates this behavior on the second character of each input.
32
33 ```clojure
34 user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
35 #"b(?:a(?:t|n|r)|o(?:t|x)|it)"
36 ```
37
38 ## Why?
39
40 [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
41 why. Also because.
42
43 Let's build a regular expression for matching any word in
44 `/usr/share/dict/words` (hint: no).
45
46 ```clojure
47 user> (require '[clojure.java.io :as io])
48 nil
49 user> (def words
50 (-> (io/file "/usr/share/dict/words")
51 io/reader
52 line-seq))
53 #'user/words
54 user> (def word-re (frak/pattern words))
55 #'user/word-re
56 user> (every? #(re-matches word-re %) words)
57 true
58 ```
59
60 You can view the full expression
61 [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
62 (it's approximately `1.5M`!).