You forgot the first parameter to SequenceMatcher, isjunk. Pass None for it when you have no junk filter:

>>> import difflib
>>>
>>> a = 'abcd'
>>> b = 'ab123'
>>> seq = difflib.SequenceMatcher(None, a, b)
>>> d = seq.ratio() * 100
>>> print(round(d, 2))
44.44
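
To see why the missing argument matters, here is a minimal sketch of what happens when only two positional arguments are passed. The variable names `wrong`, `right_positional` and `right_keyword` are illustrative only; the constructor signature `SequenceMatcher(isjunk=None, a='', b='')` is the documented one.

```python
from difflib import SequenceMatcher

a = "abcd"
b = "ab123"

# With two positional arguments, "abcd" is bound to isjunk and "ab123" to a,
# while b keeps its default of ''.
wrong = SequenceMatcher(a, b)
print(repr(wrong.a), repr(wrong.b))  # 'ab123' ''
print(wrong.ratio())                 # 0.0, because nothing can match an empty b

# Either pass None explicitly for isjunk, or use keyword arguments.
right_positional = SequenceMatcher(None, a, b).ratio()
right_keyword = SequenceMatcher(a=a, b=b).ratio()
print(right_positional == right_keyword)  # True
```

The keyword form avoids the problem entirely, since isjunk then falls back to its default of None.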

http://docs.python.org/library/difflib.html

Answer from Lennart Regebro on Stack Overflow
Top answer
1 of 2
27

SequenceMatcher.ratio internally uses SequenceMatcher.get_matching_blocks to calculate the ratio. I will walk you through the steps to see how that happens:

SequenceMatcher.get_matching_blocks

Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n != i' or j+n != j'; in other words, adjacent triples always describe non-adjacent equal blocks.
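
As a small illustration of the description above, the triples can be printed for a short pair of strings. The strings "abcd" and "bcde" are the same ones used later in this answer; the variable names are just for this sketch.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
blocks = SequenceMatcher(None, a, b).get_matching_blocks()

# Each triple (i, j, n) satisfies a[i:i+n] == b[j:j+n].
for i, j, n in blocks:
    print((i, j, n), repr(a[i:i + n]), repr(b[j:j + n]))

# The trailing dummy triple has n == 0, so it adds nothing to the sum.
total = sum(n for _, _, n in blocks)
print(total)  # 3, the length of the shared block "bcd"
```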

ratio takes the result of SequenceMatcher.get_matching_blocks and sums the sizes of all the matched blocks it returns. This is the exact source code from difflib.py:

matches = sum(triple[-1] for triple in self.get_matching_blocks())

The above line is critical, because the result of the above expression is used to compute the ratio. We'll see that shortly and how it impacts the calculation of the ratio.


>>> m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
>>> m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")

>>> matches1 = sum(triple[-1] for triple in m1.get_matching_blocks())
>>> matches1
7
>>> matches2 = sum(triple[-1] for triple in m2.get_matching_blocks())
>>> matches2
6

As you can see, we get 7 and 6. These are simply the sums of the sizes of the matched blocks returned by get_matching_blocks. Why does this matter? Because the ratio is computed in the following way (this is from the difflib source code):

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

length is len(a) + len(b), where a is the first sequence and b is the second.

Okay, enough talk, we need actions:

>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo") 
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length)  == m1.ratio()
True

Similarly for m2:

>>> 2.0 * matches2 / length
0.5217391304347826 
>>> (2.0 * matches2 / length) == m2.ratio()
True
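
The steps above can be folded into one helper. This is a sketch, not difflib's own code: `manual_ratio` is a hypothetical name, and it only re-derives the ratio from the public get_matching_blocks method.

```python
from difflib import SequenceMatcher


def manual_ratio(a, b):
    """Recompute ratio() as 2 * matches / (len(a) + len(b))."""
    sm = SequenceMatcher(None, a, b)
    # Sum the sizes of all matching blocks (the dummy block adds 0).
    matches = sum(block.size for block in sm.get_matching_blocks())
    length = len(a) + len(b)
    # Two empty sequences are treated as identical.
    return 2.0 * matches / length if length else 1.0


x, y = "Ebojfm Mzpm", "Ebfo ef Mfpo"
for first, second in ((x, y), (y, x)):
    # The two printed values on each line should agree.
    print(manual_ratio(first, second), SequenceMatcher(None, first, second).ratio())
```

Because length is the same in both argument orders, any difference between the two ratios has to come from the matches sum.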

Note: SequenceMatcher(None, a, b).ratio() == SequenceMatcher(None, b, a).ratio() is not always False; sometimes it is True:

>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True

In case you're wondering why, this is because

sum(triple[-1] for triple in self.get_matching_blocks())

is the same for both SequenceMatcher(None, "abcd", "bcde") and SequenceMatcher(None, "bcde", "abcd"): it is 3 in each case.

2 of 2
11

My answer does not provide exact details of the observed problem, but contains a general explanation of why such things may happen with loosely defined diffing methods.

Essentially everything boils down to the fact that, in the general case,

  1. more than one common subsequence of the same length can be extracted from a given pair of strings, and

  2. a longer common subsequence may appear less natural to a human expert than a shorter one.

Since you are puzzled by this particular case, let's analyze common subsequence identification on the following pair of strings:

  • my stackoverflow mysteries
  • mystery

To me, the natural match is "MYSTER", as follows:

my stackoverflow MYSTERies
.................MYSTERy..

However, the longest match fully covers the shorter of the two strings as follows:

MY STackovERflow mYsteries
MY.ST.....ER......Y.......

The drawback of such a match is that it introduces multiple matching sub-blocks whereas the (shorter) natural match is contiguous.
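
The difference between the two matches can be checked numerically. In this sketch, `lcs_length` is a hypothetical helper implementing the textbook dynamic-programming longest common subsequence; it is not part of difflib and is only here for comparison.

```python
from difflib import SequenceMatcher


def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


long_s = "my stackoverflow mysteries"
short_s = "mystery"

sm = SequenceMatcher(None, long_s, short_s)
matched = sum(block.size for block in sm.get_matching_blocks())

print(lcs_length(long_s, short_s))  # 7: all of "mystery", scattered across long_s
print(matched)                      # 6: only the contiguous block "myster"
```

SequenceMatcher settles for the contiguous "myster" block, so it counts one character fewer than the scattered longest common subsequence.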

Therefore, diffing algorithms are tweaked so that their outputs are more pleasing to the final user. As a result, they are not 100% mathematically elegant and therefore don't possess properties that you would expect from a purely academic (rather than practical) tool.

The documentation of SequenceMatcher contains a corresponding note:

class difflib.SequenceMatcher

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
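
The recursive idea in that note can be sketched in a few lines. This is a deliberately naive illustration, not difflib's implementation: it ignores junk and the autojunk heuristic, uses a brute-force scan, and its tie-breaking between equally long blocks may differ from difflib's. The names `longest_common_block` and `matched_chars` are hypothetical.

```python
def longest_common_block(a, b):
    """Return (i, j, n) for a longest run with a[i:i+n] == b[j:j+n]."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            if n > best[2]:  # keep the earliest block among ties
                best = (i, j, n)
    return best


def matched_chars(a, b):
    """Count matched elements: longest block, then recurse left and right."""
    i, j, n = longest_common_block(a, b)
    if n == 0:
        return 0
    left = matched_chars(a[:i], b[:j])
    right = matched_chars(a[i + n:], b[j + n:])
    return left + n + right


print(matched_chars("abcd", "bcde"))                           # 3
print(matched_chars("my stackoverflow mysteries", "mystery"))  # 6
```

Once "myster" is chosen as the anchor, the pieces to its left and right share nothing, which is why the total stays at 6.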
