hkucuk

Aho-Corasick Algorithm: Multiple Keyword Search Technique

September 11, 2023 • ☕️ 5 min read • 🏷 computer, software, algorithm, search

Translated by author into: English


In the disciplines of text processing and text mining, the need to quickly and effectively detect certain keywords in text is a frequently encountered necessity. In this context, Aho-Corasick algorithm stands out as a powerful algorithm developed for such text processing tasks.

Text mining is a data mining discipline that aims to perform a detailed analysis of text content and detect certain patterns, words or terms, especially in large collections of text. In this context, detecting specific keywords on large text sets is an important component of text mining applications. For such tasks, the Aho-Corasick algorithm offers an effective approach to searching for and finding multiple keywords in the text at the same time.

Trie Data Structure

The basic structure of the Aho-Corasick algorithm is based on the Trie (tree) data structure. Trie is a data structure used effectively in text processing. Designed to represent words or strings of characters, Trie is used to quickly identify these words in the text. Each node in the trie structure symbolizes a character or part of a word and marks the end of a word. Thanks to this feature, the Trie structure organizes the words in the text hierarchically and optimizes the search process.

How Does the Aho-Corasick Algorithm Work?

The Aho-Corasick algorithm uses the Trie data structure when performing the task of finding keywords in the text. This algorithm creates the paths of the keywords in the Trie and finds the keywords in the text by navigating these paths. It also optimizes search time by calculating failed transitions.

1. Creating the Trie Structure

The first step is to add the keywords to be searched in the text to the Trie data structure. Each keyword is represented as a path in the Trie. To create an example Trie structure, we follow these steps:

  1. Create the root node of the trie data structure.
  2. While adding the keywords to the Trie one by one, we create a node for each character and connect this node to the corresponding node in the Trie.

2. Calculation of Failed Passes

Fail transitions allow you to quickly navigate the Trie structure of the keywords to be searched in the text. Each node has a “fail_state” link, which routes failed migrations in the Trie. When calculating failed passes, follow these steps:

  1. Fail_state connections for all children of the root node are initially directed to the root node.
  2. Using the BFS (breadth-first search) algorithm, the Trie structure is browsed and fail_state connections are calculated for each node. These connections ensure that failed migrations are routed correctly.

3. Searching for Keywords in the Text

We use the Trie structure to search for keywords in the text. As we navigate through the text, we follow these steps for each character:

  1. We initialize the current state as the root node in the Trie structure.
  2. We take the current character and use this character to find the next state in the Trie.
  3. If there is no next state or it is not the end of the keyword, we follow the fail_state links.
  4. If the end of the keyword is found, we record the position of this word.

Below is a simple pseudocode example of the Aho-Corasick algorithm:

function AhoCorasick(text, keywords):
    create Trie from keywords
    set fail transitions in Trie
    result = empty dictionary
    
    state = root of Trie
    for i = 0 to len(text) - 1:
        while state is not found in Trie and state is not root:
            state = state.fail_state
        
        state = find_next_state(state, text[i])
        
        if state is None:
            state = root of Trie
        
        for output in state.outputs:
            if output not in result:
                result[output] = empty list
            result[output].append(i - len(output) + 1)
    
    return result

This pseudocode shows the main logic and operation of the Aho-Corasick algorithm. The algorithm uses the Trie structure to quickly find keywords within the text and optimizes its performance by calculating failed transitions.

Aho-Corasick implementation in GoLang:

package main

import (
	"fmt"
)

type TrieNode struct {
	children    map[rune]*TrieNode
	fail        *TrieNode
	isEndOfWord bool
	keyword     string
}

type AhoCorasick struct {
	root *TrieNode
}

func newTrieNode() *TrieNode {
	return &TrieNode{
		children: make(map[rune]*TrieNode),
		fail:     nil,
	}
}

func NewAhoCorasick(keywords []string) *AhoCorasick {
	root := newTrieNode()
	ac := &AhoCorasick{root: root}

	for _, keyword := range keywords {
		ac.insert(keyword)
	}

	ac.setFailTransitions()

	return ac
}

func (ac *AhoCorasick) insert(keyword string) {
	node := ac.root
	for _, char := range keyword {
		if _, found := node.children[char]; !found {
			node.children[char] = newTrieNode()
		}
		node = node.children[char]
	}
	node.isEndOfWord = true
	node.keyword = keyword
}

func (ac *AhoCorasick) setFailTransitions() {
	queue := make([]*TrieNode, 0)
	root := ac.root

	for _, child := range root.children {
		child.fail = root
		queue = append(queue, child)
	}

	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]

		for char, child := range node.children {
			failNode := node.fail

			for failNode != root && failNode.children[char] == nil {
				failNode = failNode.fail
			}

			if nextNode, found := failNode.children[char]; found {
				child.fail = nextNode
			} else {
				child.fail = root
			}

			queue = append(queue, child)
		}
	}
}

func (ac *AhoCorasick) Search(text string) map[string][]int {
	result := make(map[string][]int)
	node := ac.root

	for i, char := range text {
		for node != ac.root && node.children[char] == nil {
			node = node.fail
		}

		if nextNode, found := node.children[char]; found {
			node = nextNode
		} else {
			node = ac.root
		}

		if node.isEndOfWord {
			if _, exists := result[node.keyword]; !exists {
				result[node.keyword] = []int{}
			}
			result[node.keyword] = append(result[node.keyword], i-len(node.keyword)+1)
		}
	}

	return result
}

func main() {
	keywords := []string{"what", "hat", "ver", "er"}
	text := "whatever, err ... , wherever"

	ac := NewAhoCorasick(keywords)
	results := ac.Search(text)

	fmt.Println(results)
}

When the program is run, its output will be as follows.

map[er:[10 22] ver:[5 25] what:[0]]

The working version of the program can be accessed here.


Resources