How to measure the similarity between two words


Please note that this blog has been moved.

Now it has its own domain: mynixworld.info🙂

If you want to read the latest version of this article (recommended) please click here and I open the page for you.

One problem that I have encountered these days is related to “how to measure if two words are similar”.

The problem is not new (in fact its solution is older than any person I know alive) and one solution could be a hash function that will calculate a score for each word then you compare the both scores and if they are “close enough” then you can decide whether or not they are alike. It sounds like soundex function, which in fact is a hashing system for English words.

There are also other alternatives which seems to be even better. One is Levenshtein edit distance, a method that measure how many edit operations (such as insert/delete/substitute of a single char) should be done to transform one word into another. Another is the Needleman-Wunsch algorithm which tries to find the optimal score when aligning two sequences (words). For two words with the same length one could use even the Hamming distance (which counts how many chars are different, so it measures only the necessary substitutions that are required to change one word to another).

Even if the Needleman-Wunsch algorithm is helping us because it provides the optimal alignment of the two sequences, for my problem the Levenshtein distance seems to be 1% better.

Levenshtein distance

In the example above we have tried to find the similarity between the word “LEVENSHTEIN” and “LEVENSHMIT”. So using the Levenshtein disntace algorithm you will get a number (bottom-right in matrix) that will show you how much it will cost you if you want to change first word into the second one:

  • if we choose the first path/way then you have to delete the “T” letter then to substitute the “E” letter with a new one “M” (which does not exists in our source word) then to come up with another one “N” (which replaces the last “T”); so 3 edit operations
  • if we choose the second path/way then you have to substitute the “T” with “M”, delete the “E” and insert the “T” at the end of the word; also 3 edit operations

Needleman-Wunsch algorithm

Needleman-Wunsch algorithm helps us to align two different words
in order to find the optimal alignment between them.

Would you like to see what am I talking about? [Y/N]y
Enter the 1st word : LEVENSHTEIN
Enter the 2nd word : LEVENSHMIT
We assume the following params:
 - insert score = 0
 - delete score = 0
 - substitute score = 0
 - match score = 1
 - similarity factor = 0.5 (i.e the words should be >= 50% alike)

Testing Needleman-Wunsch algorithm
 - word 1 = LEVENSHTEIN
 - word 2 = LEVENSHMIT

      | L | E | V | E | N | S | H | T | E | I | N |
  +++++++++++++++++++++++++++++++++++++++++++++++++
  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
L | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
E | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
V | 0 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
E | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
N | 0 | 1 | 2 | 3 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 6 | 6 | 6 | 6 | 6 |
H | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 7 | 7 |
M | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 7 | 7 |
I | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 8 | 8 |
T | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 8 | 8 | 8 |
  +++++++++++++++++++++++++++++++++++++++++++++++++
  dist(Needleman-Wunsch)[10][11]=8

  The strings's alignment is:

   LEVENSH+MIT
   |||||||  | 
   LEVENSHTEIN
   ----------
   mmmmmmmisms
+########################################################+

The code above is an capture from the stdout of my Java implementation of the Levenshtein and Needleman-Wunsch algorithms. I made it available for the public and can be downloaded from:

https://docs.google.com/folder/d/0B95k2kr1bG9fRHN6VGZPakc2SjQ/edit

Besides Wikipedia I’ve found other useful information about this subject at:

About Eugen Mihailescu

Always looking to learn more about *nix world, about the fundamental concepts of arithmetic, algebra and geometry. I am also passionate about programming, database and systems administration.
This entry was posted in Eclipse, Java and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s