Mar 4, 2025
8 min
Have you ever wondered how tools like autocorrect can guess what word you meant to type or how Google manages to correct your search queries when you make a typo? While it may seem like magic, these processes are powered by simple yet effective algorithms. One popular variation of such an algorithm is known as the Levenshtein distance algorithm which measures how similar 2 strings are to each other. In this article, weâre going to be exploring how to approach string similarity in an intuitive manner and also how to implement Levenshtein distance from said intuition.
The best way to start off thinking about string similarity is with an example. Consider the word âaegn.â Chances are, in an application, the user likely made a typo and intended to type âbeginâ rather than a completely unrelated word like âweaponized.â Why do we think this? Well the word âaegnâ doesnât exist, so it must be some sort of typo, and the word âbeginâ is visually much more similar to âaegnâ than âweaponized.â While humans are very easily able to discern such differences from a glance, the key question in this situation is how a computer compute how similar 2 strings are to each other.
One simple approach is to simply compare how long the 2 strings are-the closer the lengths, the more similar the words might be. However, this approach quickly breaks down when comparing words of the same length, but with entirely different letters. For instance, using only length as a measure, the word âtestâ would appear just as similar to âfairâ as the word âfairâ is to itself which clearly isnât correct.
A better approach involves going a step deeper and looking at the individual characters for the string itself. Specifically, seeing how many steps it takes to change one string to another. There are always going to be 2 major operations you need when changing a string, deleting a character and inserting a character. With these 2 operations you can change any string to another string. Take the words âfarmâ and âfairâ for example. You can change âfarmâ to âfairâ by deleting âmâ and inserting âiâ in between the âaâ and ârâ which would result in 2 total steps. An additional common operation to consider is substitution, where you insert and delete the character in the same position which for Levenshtein distance is going to count as 1 operation.
To find how similar each string is to an array of strings, you can simply calculate the minimum number of steps you would need to transform the string into every string in the array and then just return the string that takes the lowest number of steps. We can quickly verify this with an example. Take the string âaegnâ and compare the similarity to the two strings âbeginâ and âweaponized.â To save you the time, the number of steps required to change âaegnâ into âbeginâ is 2 while the number of steps required to change âaegnâ into âweaponizedâ is 8. Intuitively we know that âbeginâ is much more similar and algorithmically weâve just verified this.
Now that we have an intuitive basis for understanding how to calculate similarity, the next step is creating an algorithm to implement it. Fortunately, some very smart people have already developed the Levenshtein distance algorithm, which does exactly what we need. As such, Iâm going to skip over the process for developing the algorithm needed dive straight into implementation and understanding how Levenshtein distance works.
Levenshtein distance employs a technique called dynamic programmingâa fancy term for an algorithm that stores and reuses previously computed results. There are several ways to implement this algorithm, but weâll use an iterative approach with a 2D matrix. The idea is to construct a matrix where each cell represents the number of operations required to convert one substring into another.
Consider the words âfarmâ and âfair.â The initial matrix setup is going to look like the following:
f | a | i | r | ||
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | |
f | 1 | ||||
a | 2 | ||||
r | 3 | ||||
m | 4 |
Next, we fill in the matrix based on the following rule:
Filling in the matrix, we get:
f | a | i | r | ||
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | |
f | 1 | 0 | 1 | 2 | 3 |
a | 2 | 1 | 0 | 1 | 2 |
r | 3 | 2 | 1 | 1 | 1 |
m | 4 | 3 | 2 | 2 | 2 |
The bottom-right cell (value 2) represents the minimum number of edits needed to transform âfarmâ into âfair.â
Now that we understand the algorithm, implementing it is trivial. Hereâs a simple C++ implementation of the Levenshtein distance in less than 50 lines of code:
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char* argv[]) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <string> <string>\n";
return 1;
}
// initializing variables and 2d matrix
std::string s1 = argv[1], s2 = argv[2];
int m = s1.length(), n = s2.length();
std::vector<std::vector<int>> dp(m+1, std::vector<int>(n+1, 0));
for (int i = 0; i <= m; i++) {
dp[i][0] = i;
}
for (int i = 0; i <= n; i++) {
dp[0][i] = i;
}
// filling the matrix table
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (s1[i-1] == s2[j-1]) {
dp[i][j] = dp[i-1][j-1];
} else {
dp[i][j] = 1 + std::min({
dp[i-1][j-1],
dp[i-1][j],
dp[i][j-1]
});
}
}
}
// outputting results
std::cout << "Similarity Score: " << dp[m][n] << "\n";
return 0;
}
Now that weâve covered how the Levenshtein distance algorithm works, there are a ton of ways you can use it in your applications:
One popular tool that utilizes a similar fuzzy search approach that I highly recommend checking out is fzf. It uses some slightly different algorithms, but is nevertheless an incredibly useful tool that I use daily. You can also build your own variation of it using your own algorithms which you can find on my GitHub here (shameless plug LMAO).
Thanks for reading. Cya đ.