# frak frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible. ## "Installation" Add frak as a dependency to your `project.clj` file. ```clojure [frak "0.1.1"] ``` ## Usage ```clojure user> (require 'frak) nil user> (frak/pattern ["foo" "bar" "baz" "quux"]) #"(?:ba(?:r|z)|foo|quux)" user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"]) #"Cloj(?:ure(?:Script)?|ars)" ``` ## How? A frak pattern is constructed from a trie of characters. As characters are added to it, meta data is stored in it's branches containing information such as which branches are terminal and a record of characters which have "visited" the branch. During the rendering process frak will prefer branch characters that have "visited" the most. In the example above, you will notice the `ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was the first to enter the trie. This is because the character `\b` has frequented the branch more than `\f` and `\q`. The example below illustrates this behavior on the second character of each input. ```clojure user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"]) #"b(?:a(?:t|n|r)|o(?:t|x)|it)" ``` ## Why? [Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92) why. Also because. ## Next While the patterns currently generated by frak are correct, there is potential for improvement; the word trie could be converted to a directed acyclic word graph (DAWG). By using a DAWG it might be possible to produce more efficient patterns which consider common prefixes and suffixes. This could reduce backtracking and overall pattern size. ## And now for something completely different Let's build a regular expression for matching any word in `/usr/share/dict/words`. ```clojure user> (require '[clojure.java.io :as io]) nil user> (def words (-> (io/file "/usr/share/dict/words") io/reader line-seq)) #'user/words user> (def word-re (frak/pattern words)) #'user/word-re user> (every? #(re-matches word-re %) words) true ``` You can view the full expression [here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj) (it's approximately `1.5M`!). ## Benchmarks ```clojure (use 'criterium.core) (def words (-> (io/file "/usr/share/dict/words") io/reader line-seq)) (defn naive-pattern "Create a naive regular expression pattern for matching every string in strs." [strs] (->> strs (clojure.string/join "|") (format "(?:%s)") re-pattern)) ;; Shuffle 10000 words and build a naive and frak pattern from them. (def ws (shuffle (take 10000 words))) (def n-pat (naive-pattern ws)) (def f-pat (frak/pattern ws)) ;; Verify the naive pattern matches everything it was constructed from. (every? #(re-matches n-pat %) ws) ;; => true ;; Shuffle the words again since the naive pattern is built in the ;; same order as it's inputs. (def ws' (shuffle ws)) ;;;; Benchmarks ;; Naive pattern (bench (doseq [w ws'] (re-matches n-pat w))) ;; Execution time mean : 1.499489 sec ;; Execution time std-deviation : 181.365166 ms ;; Execution time lower quantile : 1.337817 sec ( 2.5%) ;; Execution time upper quantile : 1.828733 sec (97.5%) ;; frak pattern (bench (doseq [w ws'] (re-matches f-pat w))) ;; Execution time mean : 155.515855 ms ;; Execution time std-deviation : 5.663346 ms ;; Execution time lower quantile : 148.168855 ms ( 2.5%) ;; Execution time upper quantile : 164.164294 ms (97.5%) ```