Please note that this blog has been moved. Now it has its own domain: mynixworld.info🙂 If you want to read the latest version of this article (recommended) please click here and I open the page for you. |

One problem that I have encountered these days is related to “how to measure if two words are similar”.

The problem is not new (in fact its solution is older than any person I know alive) and one solution could be a hash function that will calculate a score for each word then you compare the both scores and if they are “close enough” then you can decide whether or not they are alike. It sounds like soundex function, which in fact is a hashing system for English words.

There are also other alternatives which seems to be even better. One is Levenshtein edit distance, a method that measure how many edit operations (such as insert/delete/substitute of a single char) should be done to transform one word into another. Another is the Needleman-Wunsch algorithm which tries to find the optimal score when aligning two sequences (words). For two words with the same length one could use even the Hamming distance (which counts how many chars are different, so it measures only the necessary substitutions that are required to change one word to another).

Even if the Needleman-Wunsch algorithm is helping us because it provides the optimal alignment of the two sequences, for my problem the Levenshtein distance seems to be 1% better.

### Levenshtein distance

In the example above we have tried to find the similarity between the word “LEVENSHTEIN” and “LEVENSHMIT”. So using the Levenshtein disntace algorithm you will get a number (bottom-right in matrix) that will show you how much it will cost you if you want to change first word into the second one:

- if we choose the first path/way then you have to delete the “T” letter then to substitute the “E” letter with a new one “M” (which does not exists in our source word) then to come up with another one “N” (which replaces the last “T”); so 3 edit operations
- if we choose the second path/way then you have to substitute the “T” with “M”, delete the “E” and insert the “T” at the end of the word; also 3 edit operations

### Needleman-Wunsch algorithm

Needleman-Wunsch algorithm helps us to align two different words in order to find the optimal alignment between them. Would you like to see what am I talking about? [Y/N]y Enter the 1st word : LEVENSHTEIN Enter the 2nd word : LEVENSHMIT We assume the following params: - insert score = 0 - delete score = 0 - substitute score = 0 - match score = 1 - similarity factor = 0.5 (i.e the words should be >= 50% alike) Testing Needleman-Wunsch algorithm - word 1 = LEVENSHTEIN - word 2 = LEVENSHMIT | L | E | V | E | N | S | H | T | E | I | N | +++++++++++++++++++++++++++++++++++++++++++++++++ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | L | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | E | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | V | 0 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | E | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | N | 0 | 1 | 2 | 3 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | H | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 7 | 7 | M | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 7 | 7 | I | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 8 | 8 | T | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 8 | 8 | 8 | +++++++++++++++++++++++++++++++++++++++++++++++++ dist(Needleman-Wunsch)[10][11]=8 The strings's alignment is: LEVENSH+MIT ||||||| | LEVENSHTEIN ---------- mmmmmmmisms +########################################################+

The code above is an capture from the stdout of my Java implementation of the Levenshtein and Needleman-Wunsch algorithms. I made it available for the public and can be downloaded from:

https://docs.google.com/folder/d/0B95k2kr1bG9fRHN6VGZPakc2SjQ/edit

Besides Wikipedia I’ve found other useful information about this subject at: