Finding substrings very very quickly

Let’s say you’ve got a large document of text, and someone gives you a very large set of words to find in this very large document of text.

What’s the dumbest way to find those words in the document?

Dumbest way:
for each word, scan the document for it

If there are S words in the set and D words in the document, then there would be S*D comparisons. That’s a lot. Maybe you’re thinking about optimizing this by scanning the document once and indexing the D words into a hashtable and then do D lookups in it. That would be better, but still slow if the set of words is very very large. Also, if the things you were looking for didn’t end on word boundaries, you won’t be able to index the document in this way.

Much better way:
The Aho-Corasick string matching algorithm was invented just for this job! Instead of creating an efficient lookup data structure on the document, it creates an efficient lookup data structure on the set of lookup words. The basic idea is that it creates a trie-like data structure for every character of the lookup words. With this, it can match all the lookup words simultaneously with one pass of the document! The complexity is linear to the number of characters in the lookup words, the length of the document and the number of matches. And Linear is good.

Here’s the dry definition from wikipedia

Here’s a neat animation of the algorithm in action.

The best part, this is such a standard data structure and algorithm that somebody else has already written it for you. Here’s a java version of it with a really intuitive API.

AhoCorasick tree = new AhoCorasick();
tree.add(".", ".".toCharArray());
tree.add("...", "...".toCharArray());

tree.prepare();
Iterator<SearchResult> it = tree.search("Questa prova è molto cattiva..".toCharArray());
while(it.hasNext()) {
    SearchResult sr = it.next();
    System.out.println(sr.getLastIndex());

    Set<Object> outputs = sr.getOutputs();

    for(Object s : outputs)
        System.out.println(new String((char[]) s));
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: