[frak.git] / README.md

# frak

frak transforms collections of strings into regular expressions for
matching those strings. The primary goal of this library is to
generate regular expressions from a known set of inputs which avoid
backtracking as much as possible.

## "Installation"

Add frak as a dependency to your `project.clj` file.

```clojure
[frak "0.1.1"]
```

## Usage

```clojure
user> (require 'frak)
nil
user> (frak/pattern ["foo" "bar" "baz" "quux"])
#"(?:ba(?:r|z)|foo|quux)"
user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
#"Cloj(?:ure(?:Script)?|ars)"
```

## How?

A frak pattern is constructed from a very stupid trie of characters.
As characters are added to it, meta data is stored in it's branches.
The meta data contains information such as which branches are terminal
and a record of characters which have "visited" the branch.

During the rendering process frak will prefer branch characters that
have "visited" the most. In the example above, you will notice the
`ba(?:r|z)` branch takes precedence over `foo` even though `"foo"` was
the first to enter the trie. This is because the character `\b` has
frequented the branch more than `\f` and `\q`. The example below
illustrates this behavior on the second character of each input.

```clojure
user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
#"b(?:a(?:t|n|r)|o(?:t|x)|it)"
```

## Why?

[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
why. Also because.

## And now for something completely different

Let's build a regular expression for matching any word in
`/usr/share/dict/words`.

```clojure
user> (require '[clojure.java.io :as io])
nil
user> (def words
           (-> (io/file "/usr/share/dict/words")
               io/reader
               line-seq))
#'user/words
user> (def word-re (frak/pattern words))
#'user/word-re
user> (every? #(re-matches word-re %) words) 
true
```

You can view the full expression
[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
(it's approximately `1.5M`!).

## Benchmarks

```clojure
(use 'criterium.core)

(def words
  (-> (io/file "/usr/share/dict/words")
      io/reader
      line-seq))

(defn naive-pattern
  "Create a naive regular expression pattern for matching every string
   in strs."
  [strs]
  (->> strs
       (clojure.string/join "|")
       (format "(?:%s)")))

;; Shuffle 10000 words and build a naive and frak pattern from them.
(def ws (shuffle (take 10000 words)))
(def n-pat (naive-pattern ws))
(def f-pat (frak/pattern ws))

;; Verify the naive pattern matches everything it was constructed from.
(every? #(re-matches n-pat %) ws)
;; => true

;; Shuffle the words again since the naive pattern is built in the
;; same order as it's inputs.
(def ws' (shuffle ws))

;;;; Benchmarks

;; Naive pattern 

(bench (doseq [w ws'] (re-matches n-pat w)))

;;             Execution time mean : 1.499489 sec
;;    Execution time std-deviation : 181.365166 ms
;;   Execution time lower quantile : 1.337817 sec ( 2.5%)
;;   Execution time upper quantile : 1.828733 sec (97.5%)

;; frak pattern

(bench (doseq [w ws'] (re-matches f-pat w)))
;;             Execution time mean : 155.515855 ms
;;    Execution time std-deviation : 5.663346 ms
;;   Execution time lower quantile : 148.168855 ms ( 2.5%)
;;   Execution time upper quantile : 164.164294 ms (97.5%)
```
Commit	Line	Data
fef2a3db JH	1	# frak
	2
	3	frak transforms collections of strings into regular expressions for
adbcb1dc JH	4	matching those strings. The primary goal of this library is to
	5	generate regular expressions from a known set of inputs which avoid
	6	backtracking as much as possible.
fef2a3db	7
84e74fb3 JH	8	## "Installation"
	9
	10	Add frak as a dependency to your `project.clj` file.
	11
	12	```clojure
ddfeefe3	13	[frak "0.1.1"]
84e74fb3 JH	14	```
84e74fb3 JH	15
fef2a3db JH	16	## Usage
	17
	18	```clojure
	19	user> (require 'frak)
	20	nil
	21	user> (frak/pattern ["foo" "bar" "baz" "quux"])
	22	#"(?:ba(?:r\|z)\|foo\|quux)"
aa0a57a0 JH	23	user> (frak/pattern ["Clojure" "Clojars" "ClojureScript"])
aa0a57a0 JH	24	#"Cloj(?:ure(?:Script)?\|ars)"
fef2a3db	25	```
adbcb1dc JH	26
	27	## How?
	28
	29	A frak pattern is constructed from a very stupid trie of characters.
3b2ac58d JH	30	As characters are added to it, meta data is stored in it's branches.
	31	The meta data contains information such as which branches are terminal
	32	and a record of characters which have "visited" the branch.
adbcb1dc JH	33
	34	During the rendering process frak will prefer branch characters that
	35	have "visited" the most. In the example above, you will notice the
	36	`ba(?:r\|z)` branch takes precedence over `foo` even though `"foo"` was
	37	the first to enter the trie. This is because the character `\b` has
	38	frequented the branch more than `\f` and `\q`. The example below
	39	illustrates this behavior on the second character of each input.
	40
	41	```clojure
	42	user> (frak/pattern ["bit" "bat" "ban" "bot" "bar" "box"])
	43	#"b(?:a(?:t\|n\|r)\|o(?:t\|x)\|it)"
	44	```
	45
	46	## Why?
	47
	48	[Here's](https://github.com/guns/vim-clojure-static/blob/249328ee659190babe2b14cd119f972b21b80538/syntax/clojure.vim#L91-L92)
	49	why. Also because.
	50
a7041588 JH	51	## And now for something completely different
a7041588 JH	52
adbcb1dc	53	Let's build a regular expression for matching any word in
a7041588	54	`/usr/share/dict/words`.
adbcb1dc JH	55
	56	```clojure
	57	user> (require '[clojure.java.io :as io])
	58	nil
	59	user> (def words
	60	(-> (io/file "/usr/share/dict/words")
	61	io/reader
	62	line-seq))
	63	#'user/words
	64	user> (def word-re (frak/pattern words))
	65	#'user/word-re
	66	user> (every? #(re-matches word-re %) words)
	67	true
	68	```
	69
	70	You can view the full expression
	71	[here](https://gist.github.com/noprompt/6106573/raw/fcb683834bb2e171618ca91bf0b234014b5b957d/word-re.clj)
	72	(it's approximately `1.5M`!).
265c997f JH	73
	74	## Benchmarks
	75
	76	```clojure
	77	(use 'criterium.core)
	78
	79	(def words
	80	(-> (io/file "/usr/share/dict/words")
	81	io/reader
	82	line-seq))
	83
	84	(defn naive-pattern
	85	"Create a naive regular expression pattern for matching every string
	86	in strs."
	87	[strs]
	88	(->> strs
	89	(clojure.string/join "\|")
	90	(format "(?:%s)")))
	91
	92	;; Shuffle 10000 words and build a naive and frak pattern from them.
	93	(def ws (shuffle (take 10000 words)))
	94	(def n-pat (naive-pattern ws))
	95	(def f-pat (frak/pattern ws))
	96
	97	;; Verify the naive pattern matches everything it was constructed from.
	98	(every? #(re-matches n-pat %) ws)
	99	;; => true
	100
	101	;; Shuffle the words again since the naive pattern is built in the
	102	;; same order as it's inputs.
	103	(def ws' (shuffle ws))
	104
	105	;;;; Benchmarks
	106
	107	;; Naive pattern
	108
	109	(bench (doseq [w ws'] (re-matches n-pat w)))
	110
	111	;; Execution time mean : 1.499489 sec
	112	;; Execution time std-deviation : 181.365166 ms
	113	;; Execution time lower quantile : 1.337817 sec ( 2.5%)
	114	;; Execution time upper quantile : 1.828733 sec (97.5%)
	115
	116	;; frak pattern
	117
	118	(bench (doseq [w ws'] (re-matches f-pat w)))
	119	;; Execution time mean : 155.515855 ms
	120	;; Execution time std-deviation : 5.663346 ms
	121	;; Execution time lower quantile : 148.168855 ms ( 2.5%)
	122	;; Execution time upper quantile : 164.164294 ms (97.5%)
	123	```