performance improvement for large emails
author    Andrew Engelbrecht <sudoman@ninthfloor.org>
          Thu, 21 Jan 2016 15:11:27 +0000 (10:11 -0500)
committer Andrew Engelbrecht <sudoman@ninthfloor.org>
          Thu, 21 Jan 2016 15:11:27 +0000 (10:11 -0500)
very large emails (around 4 MB) were slowing down edward because it was
using a complex regex. parsing a 4 MB email was taking 4 days.

this fix removes the group matching and slices the string around the
match span instead, so 4 MB files are parsed in a matter of seconds.
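a minimal sketch of the idea (function and variable names here are
illustrative, not edward's actual code): rather than wrapping the
pattern in `(?P<beginning>.*?)(?P<match>…)(?P<rest>.*)` and pulling out
named groups — which forces the regex engine to match the lazy `.*?`
and trailing `.*` over the whole multi-megabyte string — search for the
bare pattern and recover the three pieces from the match span:

```python
import re

def split_on_match(text, pattern):
    """Split text into (beginning, match, rest) around the first match,
    or return None if the pattern is not found."""
    flags = re.DOTALL | re.MULTILINE
    m = re.search(pattern, text, flags=flags)
    if m is None:
        return None
    beginning = text[:m.start()]               # everything before the match
    match = text[m.start():m.end()]            # same as m.group(0)
    rest = text[m.end():]                      # everything after the match
    return beginning, match, rest

print(split_on_match("aaXbb", "X"))
```

slicing with `m.start()` and `m.end()` is O(n) copying at worst,
whereas the grouped pattern made the engine's backtracking cost blow up
on large inputs.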

edward

diff --git a/edward b/edward
index 49bdb65151a553a76013782390d2e16bd360cc59..b13ac7a4f0571dade15f212757c4020a2abae172 100755 (executable)
--- a/edward
+++ b/edward
@@ -406,8 +406,7 @@ def scan_and_split (payload_piece, match_name, pattern):
         return [payload_piece]
 
     flags = re.DOTALL | re.MULTILINE
-    matches = re.search("(?P<beginning>.*?)(?P<match>" + pattern +
-                        ")(?P<rest>.*)", payload_piece.string, flags=flags)
+    matches = re.search(pattern, payload_piece.string, flags=flags)
 
     if matches == None:
         pieces = [payload_piece]
@@ -415,15 +414,15 @@ def scan_and_split (payload_piece, match_name, pattern):
     else:
 
         beginning               = PayloadPiece()
-        beginning.string        = matches.group('beginning')
+        beginning.string        = payload_piece.string[:matches.start()]
         beginning.piece_type    = payload_piece.piece_type
 
         match                   = PayloadPiece()
-        match.string            = matches.group('match')
+        match.string            = payload_piece.string[matches.start():matches.end()]
         match.piece_type        = match_name
 
         rest                    = PayloadPiece()
-        rest.string             = matches.group('rest')
+        rest.string             = payload_piece.string[matches.end():]
         rest.piece_type         = payload_piece.piece_type
 
         more_pieces = scan_and_split(rest, match_name, pattern)