You forgot the first parameter to SequenceMatcher.
>>> import difflib
>>>
>>> a='abcd'
>>> b='ab123'
>>> seq=difflib.SequenceMatcher(None, a,b)
>>> d=seq.ratio()*100
>>> print(d)
44.44444444444444
http://docs.python.org/library/difflib.html
Answer from Lennart Regebro on Stack Overflow
From the docs:
The SequenceMatcher class has this constructor:
class difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True)
The problem in your code is that by doing
seq=difflib.SequenceMatcher(a,b)
you are passing a as value for isjunk and b as value for a, leaving the default '' value for b. This results in a ratio of 0.0.
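To see the misbinding concretely, here is a small sketch in Python 3 syntax that inspects the `a` and `b` attributes the matcher stores. The names `wrong` and `right` are only for illustration.

```python
import difflib

a = 'abcd'
b = 'ab123'

# Two positional arguments: 'abcd' lands in the isjunk slot, 'ab123' in a,
# and b keeps its default of ''.
wrong = difflib.SequenceMatcher(a, b)
print(repr(wrong.a), repr(wrong.b))  # 'ab123' ''
print(wrong.ratio())                 # 0.0

# With None in the isjunk slot, both strings land where intended.
right = difflib.SequenceMatcher(None, a, b)
print(repr(right.a), repr(right.b))  # 'abcd' 'ab123'
```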
One way to fix this (already mentioned by Lennart) is to pass None explicitly as the first argument, so that the two strings are bound to a and b rather than to isjunk and a.
However, I just found another solution worth mentioning: it doesn't touch the isjunk argument at all, and instead uses the set_seqs() method to supply the two sequences.
>>> import difflib
>>> a = 'abcd'
>>> b = 'ab123'
>>> seq = difflib.SequenceMatcher()
>>> seq.set_seqs(a.lower(), b.lower())
>>> d = seq.ratio()*100
>>> print(d)
44.44444444444444
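A further option, sketched here in Python 3 syntax, is to name the parameters so that argument order no longer matters. The keyword names `a` and `b` are taken from the constructor signature quoted above.

```python
import difflib

a = 'abcd'
b = 'ab123'

# Naming the parameters leaves isjunk at its default of None.
seq = difflib.SequenceMatcher(a=a, b=b)
d = seq.ratio() * 100
print(d)
```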
Hey guys, I’m trying to write Python code that takes two strings, checks how similar they are, and outputs a similarity score. I’ve read about regular expressions but can’t find a function that works for this. Any help/insight will be appreciated.
» pip install cdifflib
I am currently using SequenceMatcher.ratio() in a program I am working on, and while the function itself is exactly what I need, the runtime is an issue. On the 2 files I'm testing with, 500x2000 lines, it takes about 1 minute. On the actual target documents, 20000x20000, it will take around 4000 minutes, or roughly 3 days, as best as I can figure.
I can't use quick_ratio() or real_quick_ratio() because the accuracy of the comparisons matters, and per the documentation both are "always at least as large as ratio()". In other words, they can report words as more similar than ratio() does.
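One observation, offered as a sketch rather than a fix for this exact workload: because the quick variants are upper bounds, they can rule pairs out without changing any result, as long as only pairs at or above some cutoff matter. The cutoff and the helper name `ratio_if_at_least` are assumptions not stated in the post. `difflib.get_close_matches` applies the same cascade internally.

```python
import difflib

def ratio_if_at_least(x, y, cutoff):
    """Return the exact ratio() if it is >= cutoff, else None.

    Hypothetical helper: cutoff is an assumed threshold, not something
    given in the post.
    """
    s = difflib.SequenceMatcher(None, x, y)
    # Both quick variants are upper bounds on ratio(), so failing either
    # one proves that ratio() is below the cutoff as well.
    if s.real_quick_ratio() < cutoff or s.quick_ratio() < cutoff:
        return None
    r = s.ratio()
    return r if r >= cutoff else None

print(ratio_if_at_least('abcd', 'ab123', 0.4))
print(ratio_if_at_least('abcd', 'ab123', 0.9))
```

Pairs that survive both cheap checks still get the full ratio(), so the scores that are returned are identical to calling ratio() directly.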
If anyone knows any similar functions or other ways of approaching this issue (comparing how similar two words are relatively quickly) I could really use the help. The only alternative I or my boss have at the moment is multiprocessing or pushing it into a distributed environment and just brute forcing the slow version I have at the moment.
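Before reaching for more hardware, it may be worth checking how the matcher objects are created. The difflib documentation notes that SequenceMatcher caches detailed information about the second sequence, and recommends calling set_seq2() once and set_seq1() repeatedly when comparing one sequence against many. A minimal sketch follows, with the hypothetical helper name `ratios_against`. Whether this helps for this particular workload is untested.

```python
import difflib

def ratios_against(target, candidates):
    """Compare one target against many candidates, reusing one matcher.

    Hypothetical helper for illustration only.
    """
    s = difflib.SequenceMatcher(None)
    s.set_seq2(target)      # analysed once; the matcher caches this work
    scores = []
    for c in candidates:
        s.set_seq1(c)       # only the first sequence changes per iteration
        scores.append(s.ratio())
    return scores

print(ratios_against('ab123', ['abcd', 'ab123', 'xyz']))
```

In a 20000x20000 comparison, this would mean one set_seq2() call per line of one file, with the inner loop only swapping the first sequence.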