You've got the first case right. In the second case, only one a from aabc matches, so M = 1. In the third example, both as match so M = 2.
[P.S.: you're referring to the ancient Python 2.4 source code. The current source code is at hg.python.org.]
Answer from Fred Foo on Stack OverflowYou've got the first case right. In the second case, only one a from aabc matches, so M = 1. In the third example, both as match so M = 2.
[P.S.: you're referring to the ancient Python 2.4 source code. The current source code is at hg.python.org.]
never too late...
from difflib import SequenceMatcher
texto1 = 'BRASILIA~DISTRITO FEDERAL, DF'
texto2 = 'BRASILIA-DISTRITO FEDERAL, '
tamanho_texto1 = len(texto1)
tamanho_texto2 = len(texto2)
tamanho_tot = tamanho_texto1 + tamanho_texto2
tot = 0
if texto1 <= texto2:
for x in range(len(texto1)):
y = texto1[x]
if y in texto2:
tot += 1
else:
for x in range(len(texto2)):
y = texto2[x]
if y in texto1:
tot += 1
print('sequenceM = ',SequenceMatcher(None, texto1, texto2).ratio())
print('Total calculado = ',2*tot/tamanho_tot)
sequenceM = 0.9285714285714286
Total calculado = 0.9285714285714286
Videos
You forgot the first parameter to SequenceMatcher.
>>> import difflib
>>>
>>> a='abcd'
>>> b='ab123'
>>> seq=difflib.SequenceMatcher(None, a,b)
>>> d=seq.ratio()*100
>>> print d
44.4444444444
http://docs.python.org/library/difflib.html
From the docs:
The SequenceMatcher class has this constructor:
class difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True)
The problem in your code is that by doing
seq=difflib.SequenceMatcher(a,b)
you are passing a as value for isjunk and b as value for a, leaving the default '' value for b. This results in a ratio of 0.0.
One way to overcome this (already mentioned by Lennart) is to explicitly pass None as extra first parameter so all the keyword arguments get assigned the correct values.
However I just found, and wanted to mention another solution, that doesn't touch the isjunk argument but uses the set_seqs() method to specify the different sequences.
>>> import difflib
>>> a = 'abcd'
>>> b = 'ab123'
>>> seq = difflib.SequenceMatcher()
>>> seq.set_seqs(a.lower(), b.lower())
>>> d = seq.ratio()*100
>>> print d
44.44444444444444
This gives some ideas of how matching works.
>>> import difflib
>>>
>>> def print_matches(a, b):
... s = difflib.SequenceMatcher(None, a, b)
... for block in s.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
... print s.ratio()
...
>>> print_matches('01017', '14260')
a[0] and b[4] match for 1 elements
a[5] and b[5] match for 0 elements
0.2
>>> print_matches('14260', '01017')
a[0] and b[1] match for 1 elements
a[4] and b[2] match for 1 elements
a[5] and b[5] match for 0 elements
0.4
It looks as if it matches as much as it can on the first sequence against the second and continues from the matches. In this case ('01017', '14260'), the righthand match is on the 0, the last character, so no further matches on the right are possible. In this case ('14260', '01017'), the 1s match and the 0 still is available to match on the right, so two matches are found.
I think the matching algorithm is commutative against sorted sequences.
I was working with difflib lately, and though this answer is late, I thought it might add a little spice to the answer provided by hughdbrown as it shows what's happening visually.
Before I go to the code snippet, let me quote the documentation
The idea is to find the longest contiguous matching subsequence that contains no "junk" elements; these "junk" elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
I think comparing the first string against the second one and then finding matches looks right enough to people. This is explained nicely in the answer by hughdbrown.
Now try and run this code snippet:
def show_matching_blocks(a, b):
s = SequenceMatcher(None, a, b)
m = s.get_matching_blocks()
seqs = [a, b]
new_seqs = []
for select, seq in enumerate(seqs):
i, n = 0, 0
new_seq = ''
while i < len(seq):
if i == m[n][select]:
new_seq += '{' + seq[m[n][select]:m[n][select] + m[n].size] + '}'
i += m[n].size
n += 1
elif i < m[n][select]:
new_seq += seq[i:m[n][select]]
i = m[n][select]
new_seqs.append(new_seq)
for seq, n in zip(seqs, new_seqs):
print('{} --> {}'.format(seq, n))
print('')
a, b = '10101789', '11426089'
show_matching_blocks(a, b)
show_matching_blocks(b, a)
Output:
10101789 --> {1}{0}1017{89}
11426089 --> {1}1426{0}{89}
11426089 --> {1}{1}426{0}{89}
10101789 --> {1}0{1}{0}17{89}
The parts inside braces ({}) are the matching parts. I just used SequenceMatcher.get_matching_blocks() to put the matching blocks within braces for better visibility. You can clearly see the difference when the order is reversed. With the first order, there are 4 matches, so the ratio is 2*4/16=0.5. But when the order is reversed, there are now 5 matches, so the ratio becomes 2*5/16=0.625. The ratio is calculated as given here in the documentation