In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:
import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

for row in title_list:
    sr = row.lower().split("\t")
    # columns 3 and 4 hold the two titles being compared
    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], sr[4])
    jac = 1 - distance.jaccard(sr[3], sr[4])
    print(diffl, lev, sor, jac)
I then plotted the results with R:
[plot: Difflib vs. Levenshtein similarity for the ~2.3M title pairs]
Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:
library(ggplot2)
library(GGally)
difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
ggpairs(difflib)
Result:
[plot: ggpairs matrix of the four similarity measures]
The Difflib / Levenshtein similarity really is quite interesting.
2018 edit: If you're working on identifying similar strings, you could also check out minhashing; there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
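For a sense of what that looks like in practice, here's a minimal sketch using the datasketch library (my choice for illustration; the overview linked above and the intertext repo use their own implementations):

from datasketch import MinHash

def title_minhash(text, num_perm=128):
    # hash each token; the MinHash signature approximates the token set
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

m1 = title_minhash("NEW YORK METS")
m2 = title_minhash("NEW YORK MEATS")
print(m1.jaccard(m2))  # estimated Jaccard similarity; the true token-set Jaccard here is 2/4 = 0.5

The linear-time payoff comes when you feed these signatures into an LSH index (datasketch's MinHashLSH, for example), which buckets similar signatures so you never compare all pairs.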
difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes the doubled number of matching characters divided by the total number of characters in the two strings.
Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits (insertions, deletions, or substitutions) needed to transform one string into the other.
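A minimal sketch of the difference on a toy pair (the strings are arbitrary; note that python-Levenshtein's ratio weights substitutions as two edits when it normalizes):

import difflib, Levenshtein

a, b = "kitten", "sitting"

# Ratcliff/Obershelp: 2 * (matched characters) / (total characters) = 2*4/13
print(difflib.SequenceMatcher(None, a, b).ratio())  # ~0.615

# Levenshtein distance: 2 substitutions + 1 insertion = 3 edits
print(Levenshtein.distance(a, b))  # 3
# normalized similarity derived from the weighted edit distance
print(Levenshtein.ratio(a, b))  # ~0.615 -- close to, but not always equal to, difflib's ratio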
Complexity
SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common (from the difflib documentation).
Levenshtein is O(m*n), where m and n are the lengths of the two input strings.
Performance
According to the source code of the Levenshtein module: Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it's much faster.
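If you want to sanity-check the speed claim yourself, a quick timeit sketch (the strings are just placeholders; absolute numbers will vary by machine):

import timeit

setup = "import difflib, Levenshtein; a, b = 'NEW YORK METS', 'NEW YORK MEATS'"
# time 100k comparisons with each implementation
print(timeit.timeit("difflib.SequenceMatcher(None, a, b).ratio()", setup, number=100_000))
print(timeit.timeit("Levenshtein.ratio(a, b)", setup, number=100_000))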
FuzzyWuzzy's fuzz.ratio, when python-Levenshtein is installed, doesn't return the raw Levenshtein distance, but rather the Levenshtein ratio, which is (a + b - LevenshteinScore) / (a + b), where a and b are the lengths of the two strings being compared and the score weights substitutions as two edits.
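To see the formula in action on the "NEW YORK METS" example from the FuzzyWuzzy article quoted below: a = 13, b = 14, and the only edit is inserting "A", so the score is (27 - 1) / 27 ≈ 0.963, which fuzz.ratio reports as 96:

from fuzzywuzzy import fuzz

# (a + b - LevenshteinScore) / (a + b) = (13 + 14 - 1) / 27 ≈ 0.963 -> 96
print(fuzz.ratio("NEW YORK METS", "NEW YORK MEATS"))  # 96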
If you don't have python-Levenshtein installed, fuzzywuzzy doesn't use Levenshtein at all. Fuzzywuzzy's home page is misleading in this regard, though it does recommend installing python-Levenshtein.
python-Levenshtein can be tricky to install; I used the second response to this stackoverflow question to solve it.
If you don't have python-Levenshtein installed, FuzzyWuzzy uses difflib instead, which gives the same result for many input values, but not all. The developers recommend using python-Levenshtein. See this issue on fuzzywuzzy's git, which includes an example case where the results differ with the package installed compared to without it. This probably shouldn't happen, or at least the documentation should make it clear, but FuzzyWuzzy's developers seem content with the current behavior.
Found an excellent article from the creator of FuzzyWuzzy here.
String Similarity
The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar:
NEW YORK METS
NEW YORK MEATS
Looks like a harmless misspelling. Can we quantify it? Using python’s difflib, that’s pretty easy
from difflib import SequenceMatcher
m = SequenceMatcher(None, "NEW YORK METS", "NEW YORK MEATS")
m.ratio() ⇒ 0.962962962963
So it looks like these two strings are about 96% the same. Pretty good! We use this pattern so frequently, we wrote a helper method to encapsulate it
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
» pip install python-Levenshtein
I realize it's not the same thing, but this is close enough:
>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095
You can wrap this up as a function:
def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9
>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and open-source implementations of them. Looks like many of them should be easy to adapt into Python; a sketch of one follows the list below.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here's a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunch distance or Sellers Algorithm
- and many more...
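As a taste of how easily these adapt, a minimal Hamming distance in Python (defined only for strings of equal length; the example pair is the classic textbook one):

def hamming(s1, s2):
    # count positions where the corresponding characters differ
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("karolin", "kathrin"))  # 3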