hkucuk

What is Levenshtein Distance, How is it Calculated?

September 2, 2023 • ☕️ 5 min read • 🏷 computer, software, algorithm

Translated by author into: English


Levenshtein Distance is a fundamental method of measuring similarity or dissimilarity between text or strings and is commonly used in everyday life as well as in many different fields such as computer science, natural language processing and genetic analysis. is also widely used. This metric quantifies how similar or different these strings are by determining how many times characters between two strings need to be replaced, added, or removed. This is why Levenshtein Distance plays an important role in data mining, autocorrection systems, text comparisons and even genetic sequence analysis. In the following sections, we’ll take a detailed look at how the Levenshtein Distance is calculated and provide examples to help you understand this metric.

What is Levenshtein Distance?

Levenshtein Distance is a metric or algorithm used to determine a measure of similarity or difference between text or strings. This metric quantifies how similar the two strings are to each other by determining how the characters between two strings are changed, added or removed. It is very useful in measuring the difference or similarity between two strings and shows itself in many application areas.

Levenshtein Distance was first introduced in 1965 by Soviet computer scientist Vladimir Levenshtein. That’s why this metric is named after him. Levenshtein developed this metric to analyze typo corrections, but it later discovered the potential for use in many different fields.

When calculating this distance, it is essential to observe how the string of characters between two strings is changed, added or subtracted. The more editing required, the greater the Levenshtein Distance when trying to convert individual strings to each other. That is, the distance between the two strings represents the number of these editing operations.

The main purpose of Levenshtein Distance is to quantify the difference between two strings and to offer opportunities for use in text similarity analysis, spell checking, data matching, genetic sequence comparisons and many more. It is frequently used in natural language processing applications and data analysis projects. This algorithm is an essential tool in the fields of text mining and data processing, and offers numerous usage scenarios. In the following sections, we will examine in more detail how the Levenshtein Distance is calculated and examples of its application.

How to Calculate Levenshtein Distance?

Levenshtein Distance is calculated by dynamic programming method. The basic idea is to construct a matrix by comparing two strings and tracking the match status of each character.

The calculation steps are:

Step 1: Creating the Initial Matrix

First, we create a matrix to compare two strings. The rows of the matrix represent the first string and the columns represent the second string. The dimensions of the matrix are determined by the length of both strings. The first row and first column of the matrix start with zero values. This matrix forms the basis of the Levenshtein Distance calculation.

Example:

  |   | M | I | N | E
-----------------------
  | 0 | 1 | 2 | 3 | 4
-----------------------
H | 1 |   |   |   |  
-----------------------
O | 2 |   |   |   |  
-----------------------
U | 3 |   |   |   |  
-----------------------
S | 4 |   |   |   |  

Step 2: Filling the Matrix

After the matrix is created, we start calculating the value of each cell. Each cell gets a value based on the match status of the two compared characters. If the two characters are the same, the value of that cell is the value of the cell in the upper left corner. If the characters do not match, the value of that cell is determined by adding one to the smallest of the cells in the left, top, and top left corners.

Mathematically, the distance (D) between two characters is calculated as:

D(i, j) = 0, eğer string1[i] = string2[j]
D(i, j) = min(D(i-1, j), D(i, j-1), D(i-1, j-1)) + 1, eğer string1[i] ≠ string2[j]

Here, i and j represent indices used to compare two strings.

Example:

  |   | M | I | N | E
-----------------------
  | 0 | 1 | 2 | 3 | 4
-----------------------
H | 1 | 1 | 2 | 3 | 4
-----------------------
O | 2 | 2 | 2 | 3 | 4
-----------------------
U | 3 | 3 | 3 | 3 | 4
-----------------------
S | 4 | 4 | 4 | 4 | 4

Step 3: Finding the Levenshtein Distance

The value in the lower right corner of the matrix represents the Levenshtein Distance. This value represents the minimum number of editing operations between two strings. You can use this value when you want to measure the similarity or difference between two strings.

Mathematically, the Levenshtein Distance (L) is calculated as:

L = D(m, n)

Here, m and n represent the length of the first and second string, respectively.

Example:

Levenshtein Distance = 4

The process of calculating the Levenshtein Distance follows these steps and ultimately quantifies the similarity or difference between the two strings.

The Levenshtein Distance calculation process provides a powerful way to compare character strings and determine editing operations. Therefore, it is widely used in many different application areas such as spell checking, text similarity analysis, data matching and genetic sequence comparisons.

Levenshtein Distance app on GoLang:

package main

import (
	"fmt"
)

func Min(a, b, c int) int {
	if a <= b && a <= c {
		return a
	} else if b <= a && b <= c {
		return b
	} else {
		return c
	}
}

func LevenshteinDistance(str1, str2 string) int {
	m := len(str1)
	n := len(str2)

	matrix := make([][]int, m+1)
	for i := range matrix {
		matrix[i] = make([]int, n+1)
	}

	for i := 0; i <= m; i++ {
		matrix[i][0] = i
	}
	for j := 0; j <= n; j++ {
		matrix[0][j] = j
	}

	for i := 1; i <= m; i++ {
		for j := 1; j <= n; j++ {
			cost := 0
			if str1[i-1] != str2[j-1] {
				cost = 1
			}
			matrix[i][j] = Min(
				matrix[i-1][j]+1,
				matrix[i][j-1]+1,
				matrix[i-1][j-1]+cost,
			)
		}
	}

	return matrix[m][n]
}

func main() {
	str1 := "hors"
	str2 := "rose"

	distance := LevenshteinDistance(str1, str2)

	fmt.Printf("Levenshtein Distance between '%s' and '%s': %d\n", str1, str2, distance)
}

When the program is run, the output will be as follows.

Levenshtein Distance between 'hors' and 'rose': 3

The output shows the Levenshtein Distance between the strings “hors” and “rose” and indicates that the result is 2. This means that at least 2 editing operations are required between the two strings.

The running version of the program can be accessed from here.


Resources