Commit | Line | Data |
---|---|---|
fef2a3db JH |
1 | # frak |
2 | ||
3 | frak transforms collections of strings into regular expressions for | |
adbcb1dc JH |
4 | matching those strings. The primary goal of this library is to |
5 | generate regular expressions from a known set of inputs which avoid | |
6 | backtracking as much as possible. | |
fef2a3db | 7 | |
84e74fb3 JH |
8 | ## "Installation" |
9 | ||
10 | Add frak as a dependency to your `project.clj` file. | |
11 | ||
12 | ```clojure | |
ddfeefe3 | 13 | [frak "0.1.1"] |
84e74fb3 JH |
14 | ``` |
15 | ||
fef2a3db JH |
16 | ## Usage |
17 | ||
18 | ```clojure | |
19 | user> (require 'frak) | |
20 | nil | |
21 | user> (frak/pattern ["foo" "bar" "baz" "quux"]) | |
22 | #"(?:ba(?:r|z)|foo|quux)" | |
aa0a57a0 JH |
23 | user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) |
24 | #"Cloj(?:ure(?:Script)?|ars)" | |
fef2a3db | 25 | ``` |
adbcb1dc JH |
26 | |
27 | ## How? | |
28 | ||
24a60b46 JH |
29 | A frak pattern is constructed from a trie of characters. As characters |
30 | are added to it, meta data is stored in it's branches containing | |
31 | information such as which branches are terminal and a record of | |
32 | characters which have "visited" the branch. | |
adbcb1dc JH |
33 | |
34 | During the rendering process frak will prefer branch characters that | |
35 | have "visited" the most. In the example above, you will notice the | |
36 | `ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was | |
37 | the first to enter the trie. This is because the character `\b` has | |
38 | frequented the branch more than `\f` and `\q`. The example below | |
39 | illustrates this behavior on the second character of each input. | |
40 | ||
41 | ```clojure | |
42 | user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"]) | |
43 | #"b(?:a(?:t|n|r)|o(?:t|x)|it)" | |
44 | ``` | |
45 | ||
46 | ## Why? | |
47 | ||
48 | [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) | |
49 | why. Also because. | |
50 | ||
24a60b46 JH |
51 | ## Next |
52 | ||
53 | While the patterns currently generated by frak are correct, there is | |
54 | potential for improvement; the word trie could be converted to a | |
55 | directed acyclic word graph (DAWG). | |
56 | ||
57 | By using a DAWG it might be possible to produce more efficient | |
58 | patterns which consider both common prefixes and suffixes. This might | |
59 | reduce both backtracking and overall pattern size. | |
60 | ||
a7041588 JH |
61 | ## And now for something completely different |
62 | ||
adbcb1dc | 63 | Let's build a regular expression for matching any word in |
a7041588 | 64 | `/usr/share/dict/words`. |
adbcb1dc JH |
65 | |
66 | ```clojure | |
67 | user> (require '[clojure.java.io :as io]) | |
68 | nil | |
69 | user> (def words | |
70 | (-> (io/file "/usr/share/dict/words") | |
71 | io/reader | |
72 | line-seq)) | |
73 | #'user/words | |
74 | user> (def word-re (frak/pattern words)) | |
75 | #'user/word-re | |
76 | user> (every? #(re-matches word-re %) words) | |
77 | true | |
78 | ``` | |
79 | ||
80 | You can view the full expression | |
81 | [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) | |
82 | (it's approximately `1.5M`!). |