Aho-Corasick Algorithm: Multiple Keyword Search Technique
September 11, 2023 • ☕️ 5 min read • 🏷 computer, software, algorithm, search
Translated by author into: English
In the disciplines of text processing and text mining, the need to quickly and effectively detect certain keywords in text is a frequently encountered necessity. In this context, Aho-Corasick algorithm stands out as a powerful algorithm developed for such text processing tasks.
Text mining is a data mining discipline that aims to perform a detailed analysis of text content and detect certain patterns, words or terms, especially in large collections of text. In this context, detecting specific keywords on large text sets is an important component of text mining applications. For such tasks, the Aho-Corasick algorithm offers an effective approach to searching for and finding multiple keywords in the text at the same time.
Trie Data Structure
The basic structure of the Aho-Corasick algorithm is based on the Trie (tree) data structure. Trie is a data structure used effectively in text processing. Designed to represent words or strings of characters, Trie is used to quickly identify these words in the text. Each node in the trie structure symbolizes a character or part of a word and marks the end of a word. Thanks to this feature, the Trie structure organizes the words in the text hierarchically and optimizes the search process.
How Does the Aho-Corasick Algorithm Work?
The Aho-Corasick algorithm uses the Trie data structure when performing the task of finding keywords in the text. This algorithm creates the paths of the keywords in the Trie and finds the keywords in the text by navigating these paths. It also optimizes search time by calculating failed transitions.
1. Creating the Trie Structure
The first step is to add the keywords to be searched in the text to the Trie data structure. Each keyword is represented as a path in the Trie. To create an example Trie structure, we follow these steps:
- Create the root node of the trie data structure.
- While adding the keywords to the Trie one by one, we create a node for each character and connect this node to the corresponding node in the Trie.
2. Calculation of Failed Passes
Fail transitions allow you to quickly navigate the Trie structure of the keywords to be searched in the text. Each node has a “fail_state” link, which routes failed migrations in the Trie. When calculating failed passes, follow these steps:
- Fail_state connections for all children of the root node are initially directed to the root node.
- Using the BFS (breadth-first search) algorithm, the Trie structure is browsed and fail_state connections are calculated for each node. These connections ensure that failed migrations are routed correctly.
3. Searching for Keywords in the Text
We use the Trie structure to search for keywords in the text. As we navigate through the text, we follow these steps for each character:
- We initialize the current state as the root node in the Trie structure.
- We take the current character and use this character to find the next state in the Trie.
- If there is no next state or it is not the end of the keyword, we follow the fail_state links.
- If the end of the keyword is found, we record the position of this word.
Below is a simple pseudocode example of the Aho-Corasick algorithm:
function AhoCorasick(text, keywords):
create Trie from keywords
set fail transitions in Trie
result = empty dictionary
state = root of Trie
for i = 0 to len(text) - 1:
while state is not found in Trie and state is not root:
state = state.fail_state
state = find_next_state(state, text[i])
if state is None:
state = root of Trie
for output in state.outputs:
if output not in result:
result[output] = empty list
result[output].append(i - len(output) + 1)
return resultThis pseudocode shows the main logic and operation of the Aho-Corasick algorithm. The algorithm uses the Trie structure to quickly find keywords within the text and optimizes its performance by calculating failed transitions.
Aho-Corasick implementation in GoLang:
package main
import (
"fmt"
)
type TrieNode struct {
children map[rune]*TrieNode
fail *TrieNode
isEndOfWord bool
keyword string
}
type AhoCorasick struct {
root *TrieNode
}
func newTrieNode() *TrieNode {
return &TrieNode{
children: make(map[rune]*TrieNode),
fail: nil,
}
}
func NewAhoCorasick(keywords []string) *AhoCorasick {
root := newTrieNode()
ac := &AhoCorasick{root: root}
for _, keyword := range keywords {
ac.insert(keyword)
}
ac.setFailTransitions()
return ac
}
func (ac *AhoCorasick) insert(keyword string) {
node := ac.root
for _, char := range keyword {
if _, found := node.children[char]; !found {
node.children[char] = newTrieNode()
}
node = node.children[char]
}
node.isEndOfWord = true
node.keyword = keyword
}
func (ac *AhoCorasick) setFailTransitions() {
queue := make([]*TrieNode, 0)
root := ac.root
for _, child := range root.children {
child.fail = root
queue = append(queue, child)
}
for len(queue) > 0 {
node := queue[0]
queue = queue[1:]
for char, child := range node.children {
failNode := node.fail
for failNode != root && failNode.children[char] == nil {
failNode = failNode.fail
}
if nextNode, found := failNode.children[char]; found {
child.fail = nextNode
} else {
child.fail = root
}
queue = append(queue, child)
}
}
}
func (ac *AhoCorasick) Search(text string) map[string][]int {
result := make(map[string][]int)
node := ac.root
for i, char := range text {
for node != ac.root && node.children[char] == nil {
node = node.fail
}
if nextNode, found := node.children[char]; found {
node = nextNode
} else {
node = ac.root
}
if node.isEndOfWord {
if _, exists := result[node.keyword]; !exists {
result[node.keyword] = []int{}
}
result[node.keyword] = append(result[node.keyword], i-len(node.keyword)+1)
}
}
return result
}
func main() {
keywords := []string{"what", "hat", "ver", "er"}
text := "whatever, err ... , wherever"
ac := NewAhoCorasick(keywords)
results := ac.Search(text)
fmt.Println(results)
}When the program is run, its output will be as follows.
map[er:[10 22] ver:[5 25] what:[0]]The working version of the program can be accessed here.
Resources
- https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
- https://cp-algorithms.com/string/aho_corasick.html
- http://web.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/02/Small02.pdf