Adapted from the cpython source:
https://github.com/python/cpython/blob/01fd68752e2d2d0a5f90ae8944ca35df0a5ddeaa/Lib/unittest/case.py#L1091
import difflib
import pprint
def compare_dicts(d1, d2):
    return ('\n' + '\n'.join(difflib.ndiff(
                   pprint.pformat(d1).splitlines(),
                   pprint.pformat(d2).splitlines())))
Answer from user2733517 on Stack Overflow
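For small dicts, pprint.pformat puts everything on one line, so the ndiff output is a single changed line. Here is a rough sketch of a variant that forces one key per line. The name compare_dicts_by_key and the width parameter are my own additions, not part of the cpython code, and the sample dicts are made up.

```python
import difflib
import pprint

def compare_dicts_by_key(d1, d2, width=1):
    # width=1 is an assumption made here: it forces pformat to break the
    # dict onto one "key: value" line per entry, so ndiff compares entries.
    lines1 = pprint.pformat(d1, width=width).splitlines()
    lines2 = pprint.pformat(d2, width=width).splitlines()
    return '\n'.join(difflib.ndiff(lines1, lines2))

# Hypothetical sample data.
old = {'a': 1, 'b': 2}
new = {'a': 1, 'b': 3}
report = compare_dicts_by_key(old, new)
print(report)
```

Lines starting with "- " come only from the first dict, lines starting with "+ " only from the second, and lines starting with "? " are ndiff's hints pointing at the characters that changed.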
You can use difflib, but using the unittest method seems more appropriate to me. If you do want to use difflib, though, let's say the following are the two dicts.
In [50]: dict1
Out[50]: {1: True, 2: False}
In [51]: dict2
Out[51]: {1: False, 2: True}
You may need to convert them to strings (or lists of strings) and then use difflib as usual.
In [43]: a = '\n'.join(['%s:%s' % (key, value) for (key, value) in sorted(dict1.items())])
In [44]: b = '\n'.join(['%s:%s' % (key, value) for (key, value) in sorted(dict2.items())])
In [45]: print(a)
1:True
2:False
In [46]: print(b)
1:False
2:True
In [47]: for diffs in difflib.unified_diff(a.splitlines(), b.splitlines(), fromfile='dict1', tofile='dict2', lineterm=''):
    ...:     print(diffs)
The output would be:
--- dict1
+++ dict2
@@ -1,2 +1,2 @@
-1:True
-2:False
+1:False
+2:True
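The same steps can be wrapped in a small function. This is only a sketch: the name unified_dict_diff is made up, and it assumes the keys of each dict can be sorted against each other.

```python
import difflib

def unified_dict_diff(d1, d2, fromfile='dict1', tofile='dict2'):
    # One "key:value" line per entry, sorted by key so both sides line up.
    a = ['%s:%s' % (key, value) for key, value in sorted(d1.items())]
    b = ['%s:%s' % (key, value) for key, value in sorted(d2.items())]
    # lineterm='' because the input lines carry no trailing newlines.
    return list(difflib.unified_diff(a, b, fromfile=fromfile,
                                     tofile=tofile, lineterm=''))

dict1 = {1: True, 2: False}
dict2 = {1: False, 2: True}
diff_lines = unified_dict_diff(dict1, dict2)
print('\n'.join(diff_lines))
```

Returning a list rather than printing inside the function makes it easier to log the diff or attach it to an assertion message.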
Your code seems legit. I did a couple of little tweaks that would shave off a couple of microseconds per loop:
- No need for the two sorted calls, because difflib can calculate an order-indifferent comparison with quick_ratio (check the documentation for the difference between ratio, quick_ratio, and real_quick_ratio).
- No need for the enumerate to access mat by i and j.
- Removed the access of the list through index (first_dict[index] and second_dict[index]).
import difflib
import numpy as np

def naive_ratio_comparison(first_dict, second_dict):
    mat = []
    for second in second_dict.values():
        for first in first_dict.values():
            sm = difflib.SequenceMatcher(None, first, second)
            mat.append(sm.quick_ratio())
    # One row per entry of second_dict, one column per entry of first_dict.
    result = np.resize(mat, (len(second_dict), len(first_dict)))
    return result
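To see what dropping the sorted calls relies on, here is a quick check on made-up lists. Note that the docs only promise that quick_ratio() is an upper bound on ratio(); the order-insensitivity is what CPython's implementation happens to do.

```python
import difflib

# Hypothetical sample lists: same elements, reversed order.
a = ['x', 'y', 'z']
b = ['z', 'y', 'x']

sm = difflib.SequenceMatcher(None, a, b)
order_sensitive = sm.ratio()          # depends on element order
order_insensitive = sm.quick_ratio()  # compares element counts only

# Sorting both lists first makes ratio() see identical sequences.
sorted_ratio = difflib.SequenceMatcher(None, sorted(a), sorted(b)).ratio()

print(order_sensitive, order_insensitive, sorted_ratio)
```

On these inputs ratio() on the unsorted lists comes out below 1.0, while quick_ratio() and the sorted ratio() both report a perfect match.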
If one dict has M entries and the other N, then you're going to have to do M*N .ratio() calls. There's no way around that, and it's going to be costly.
However, you can easily arrange to do only M+N sorts instead of (as shown) M*N sorts.
For computing .ratio(), the most valuable hint is in the docs:
SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.
Putting that all together:
# mat is assumed to be a preallocated 2-D array of shape (len(second_dict), len(first_dict))
firsts = list(map(sorted, first_dict.values())) # sort these only once
sm = difflib.SequenceMatcher(None)
for i, second in enumerate(second_dict.values()):
    sm.set_seq2(sorted(second))
    for j, first in enumerate(firsts):
        sm.set_seq1(first)
        mat[i, j] = sm.ratio()
That should deliver exactly the same results. To minimize the number of expensive .set_seq2() calls, it would - of course - be best to arrange for the shorter dict to be called "second_dict".
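For anyone who wants to try this without numpy, here is a self-contained sketch that returns a list of rows instead of filling an array. The function name ratio_matrix and the sample dicts are made up for illustration; the values are assumed to be lists of strings.

```python
import difflib

def ratio_matrix(first_dict, second_dict):
    # Sort each list of first_dict once, up front (M sorts).
    firsts = [sorted(values) for values in first_dict.values()]
    sm = difflib.SequenceMatcher(None)
    rows = []
    for second in second_dict.values():
        # Expensive step: SequenceMatcher caches details about seq2,
        # so set it once per row (N sorts, N set_seq2 calls).
        sm.set_seq2(sorted(second))
        row = []
        for first in firsts:
            sm.set_seq1(first)  # cheap compared with set_seq2
            row.append(sm.ratio())
        rows.append(row)
    return rows

# Hypothetical sample data.
first_dict = {'k1': ['b', 'a', 'c'], 'k2': ['d', 'e']}
second_dict = {'q1': ['c', 'a', 'b'], 'q2': ['e', 'x'], 'q3': ['z']}
matrix = ratio_matrix(first_dict, second_dict)
print(matrix)
```

Each row corresponds to one entry of second_dict and each column to one entry of first_dict, matching the mat[i, j] layout above.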
Alternative
It's worth asking whether you actually want difflib at all here. What are you really trying to accomplish? Nothing here looks at the contents of the strings at all, beyond noting whether or not two strings are equal.
Perhaps what you really want is a different measure of "similarity". For example, one based on how many strings two lists have in common. If so, here's a way that doesn't use difflib:
from collections import Counter
cfirst = [(Counter(v), len(v)) for v in first_dict.values()]
csecond = [(Counter(v), len(v)) for v in second_dict.values()]
for i, (second, n2) in enumerate(csecond):
    for j, (first, n1) in enumerate(cfirst):
        mat[i, j] = sum((first & second).values()) * 2 / (n1 + n2)
That gives the same results on the specific example you gave, but is significantly cheaper to compute. The "ratio" computed here is twice the number of strings the two lists have in common (counting duplicates), divided by the total number of strings in the two lists. That's easy to compute using Counters directly.
@Bilal Qandeel's answer suggested using difflib's .quick_ratio() instead, which happens to compute something similar under the covers. But that .quick_ratio() is order-independent is an undocumented implementation detail, and it's quicker to leave difflib out of it entirely if that is good enough.
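As a sanity check, here is a small sketch comparing the Counter formula with .quick_ratio() on one made-up pair of lists. The helper name counter_similarity is mine, and agreement on this input does not prove the two always agree.

```python
import difflib
from collections import Counter

def counter_similarity(first, second):
    # Multiset intersection: each string counts as often as it appears
    # in both lists. Raises ZeroDivisionError if both lists are empty.
    common = sum((Counter(first) & Counter(second)).values())
    return 2 * common / (len(first) + len(second))

# Hypothetical sample lists with duplicates.
first = ['a', 'b', 'b', 'c']
second = ['b', 'c', 'c', 'd']

by_counter = counter_similarity(first, second)
by_difflib = difflib.SequenceMatcher(None, first, second).quick_ratio()
print(by_counter, by_difflib)
```

Here the lists share one 'b' and one 'c', so both measures come out to 2 * 2 / 8.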
NOTE: starting with Python 3.10,
mat[i, j] = sum((first & second).values()) * 2 / (n1 + n2)
can be replaced by
mat[i, j] = (first & second).total() * 2 / (n1 + n2)