import difflib
text1 = open('sample1.txt').readlines()
text2 = open('sample2.txt').readlines()
for line in difflib.unified_diff(text1, text2):
    print(line)
Hi everyone, and thanks for reading.
This usually returns the entire content of both files, with indicators (-, +, ?) showing which lines are unique or shared. However, I only want the content that isn't present in both files. Also, can this code be put into a function? I get no error when I do, but it just returns nothing. Thanks in advance.
Just parse the output of ndiff like this (change '- ' to '+ ' if needed):
#!/usr/bin/env python
# difflib_test
import difflib
with open('/home/saad/Code/test/new_tweets', 'r') as file1, open('/home/saad/PTITVProgs', 'r') as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())
    # keep only the lines unique to the first file, minus the two-character prefix
    delta = ''.join(x[2:] for x in diff if x.startswith('- '))
print(delta)
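On the question of putting this into a function: one possible reason a function "returns nothing" is that it only loops or prints and never reaches a return statement, so the caller gets None. The sketch below is one way to wrap the same ndiff filtering; the name unique_lines and the in-memory sample lists are made up for illustration.

```python
import difflib

def unique_lines(lines1, lines2):
    """Return (only_in_first, only_in_second) for two lists of lines.

    Hypothetical helper, not part of difflib.
    """
    only_in_first, only_in_second = [], []
    for entry in difflib.ndiff(lines1, lines2):
        if entry.startswith('- '):
            only_in_first.append(entry[2:])   # strip the two-character code
        elif entry.startswith('+ '):
            only_in_second.append(entry[2:])
        # entries starting with '  ' (common) or '? ' (hints) are skipped
    return only_in_first, only_in_second      # without this, the caller gets None

removed, added = unique_lines(['1\n', '2\n', '3\n'], ['1\n', '3\n', '4\n'])
print(removed)
print(added)
```

With real files, the two arguments would be the results of readlines() on each file.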
difflib offers several diff styles, each with its own function: unified_diff, ndiff and context_diff.
If you don't want the line number summaries, the ndiff function gives a Differ-style delta:
import difflib
f1 = '''1
2
3
4
5'''
f2 = '''1
3
4
5
6'''
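diff = difflib.ndiff(f1.splitlines(), f2.splitlines())  # split into lines; plain strings would be compared character by character
for l in diff:
    print(l)
Output:
  1
- 2
  3
  4
  5
+ 6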
EDIT:
You could also parse the diff to extract only the changes if that's what you want:
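>>> # ndiff returns a generator, so build it again (or store it in a list) before filtering
>>> diff = difflib.ndiff(f1.splitlines(), f2.splitlines())
>>> changes = [l for l in diff if l.startswith('+ ') or l.startswith('- ')]
>>> for c in changes:
...     print(c)
...
- 2
+ 6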
I'm also still trying to figure out why many difflib functions return a generator instead of a list. What's the advantage there?
Well, think about it for a second: if you compare files, those files can in theory (and will in practice) be quite large. Returning the delta as a list, for example, means holding the complete delta in memory at once, which is not a smart thing to do.
As for only returning the difference, well, there is another advantage in using a generator - just iterate over the delta and keep whatever lines you are interested in.
If you read the difflib documentation for Differ-style deltas, you will see a paragraph that reads:
Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence
So, if you only want the differences, you can easily pick them out by using str.startswith.
You can also use difflib.context_diff to obtain a compact delta which shows only the changes plus a few lines of surrounding context (three by default, controlled by the n argument).
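As a rough illustration of context_diff, the sketch below passes n=0 so that no surrounding context lines are emitted, then keeps only the entries that carry a change. The sample lists and the 'old'/'new' labels are made up for the example.

```python
import difflib

old = ['1\n', '2\n', '3\n', '4\n', '5\n']
new = ['1\n', '3\n', '4\n', '5\n', '6\n']

# n=0 asks for zero lines of context around each change
delta = list(difflib.context_diff(old, new, fromfile='old', tofile='new', n=0))

# changed lines start with '- ' (removed), '+ ' (added) or '! ' (modified);
# the remaining entries are file and hunk headers
changed = [l for l in delta if l.startswith(('- ', '+ ', '! '))]
print(''.join(changed))
```

The header lines are still present in delta, so printing it unfiltered shows where each change sits in the two inputs.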
Diffs must contain enough information to make it possible to patch a version into another, so yes, for your experiment of a single-line change to a very small document, storing the whole documents could be cheaper.
Library functions return iterators to make it easier on clients that are tight on memory or only need to look at part of the resulting sequence. It's ok in Python because every iterator can be converted to a list with a very short list(an_iterator) expression.
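A small sketch of both uses mentioned above: looking at only part of the delta, and converting the rest to a list. The sample data is invented; itertools.islice is the standard-library way to take the first few items from any iterator.

```python
import difflib
from itertools import islice

a = ['line %d\n' % i for i in range(10)]
b = ['line %d\n' % i for i in range(1, 11)]

delta = difflib.ndiff(a, b)            # a generator; nothing is computed yet

first_three = list(islice(delta, 3))   # consume only the first three entries
rest = list(delta)                     # the remaining entries, as a list
leftover = list(delta)                 # the generator is now exhausted

print(first_three)
print(len(rest), len(leftover))
```

Note that a generator can be walked only once, so keep the list if the delta is needed again.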
Most differencing is done on lines of text, but it is possible to go down to the character-by-character level, and difflib does it. Take a look at the Differ class in difflib.
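Character-level comparison can be tried directly with difflib.SequenceMatcher, the class difflib uses for its comparisons, since a string is just a sequence of characters. This is only a sketch; the two sample strings are borrowed from the changed line in the patch example further down.

```python
import difflib

old, new = 'c= 0', 'c= 1'

# None means "no junk function": every character takes part in the match
matcher = difflib.SequenceMatcher(None, old, new)

# get_opcodes() describes how to turn old into new; drop the unchanged runs
changes = [(tag, old[i1:i2], new[j1:j2])
           for tag, i1, i2, j1, j2 in matcher.get_opcodes()
           if tag != 'equal']
print(changes)
```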
The examples all over the place use human-friendly output, but the diffs are managed internally in a much more compact, computer-friendly way. Also, diffs usually contain redundant information (like the text of a line to delete) to make patching and merging changes safe. The redundancy can be removed by your own code, if you feel comfortable with that.
I just read that difflib favors least surprise over optimality, which is something I won't argue against. There are well-known algorithms that are fast at producing a minimum set of changes.
I once coded a generic diffing engine along with one of the optimum algorithms in about 1250 lines of Java (JRCS). It works for any sequence of elements that can be compared for equality. If you want to build your own solution, I think that a translation/reimplementation of JRCS should take no more than 300 lines of Python.
Processing the output produced by difflib to make it more compact is also an option. This is an example from a small file with three changes (an addition, a change, and a deletion):
---
+++
@@ -7,0 +7,1 @@
+aaaaa
@@ -9,1 +10,1 @@
-c= 0
+c= 1
@@ -15,1 +16,0 @@
- m = re.match(code_re, text)
What the patch says can be easily condensed to:
+7,1
aaaaa
-9,1
+10,1
c= 1
-15,1
For your own example the condensed output would be:
-8,1
+9,1
print "The end"
For safety, leaving in a leading marker ('>') for lines that must be inserted might be a good idea.
-8,1
+9,1
>print "The end"
Is that closer to what you need?
This is a simple function to do the compacting. You'll have to write your own code to apply the patch in that format, but it should be straightforward.
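def compact_a_unidiff(s):
    # s: lines of a unified diff generated without context lines
    result = []
    for l in s:
        if l.startswith('+++'):
            continue  # skip the '+++' file header
        elif l.startswith('+'):
            result.append('>' + l[1:])  # line to insert, marked with '>'
        elif l.startswith('@@'):
            # hunk header such as '@@ -9,1 +10,1 @@'
            del_cmd, add_cmd = l.split()[1:3]
            # a count of 0 means nothing is deleted/added on that side;
            # newer Python versions omit a count of 1 ('-9' instead of '-9,1')
            if not del_cmd.endswith(',0'):
                result.append(del_cmd)
            if not add_cmd.endswith(',0'):
                result.append(add_cmd)
    return result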
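Applying a patch in this compact format is left to the reader above. The following is a minimal sketch of one way to do it, under stated assumptions: apply_compact_patch is a hypothetical helper (not part of difflib), '-start,count' is taken to use 1-based line numbers of the original, '+start,count' 1-based line numbers of the result, a missing count is read as 1, and every '>' line is text to insert.

```python
def apply_compact_patch(original, patch):
    """Apply a compact patch (a list of strings) to a list of lines.

    Hypothetical helper; see the assumptions in the text above.
    """
    out = []
    src = 0  # index of the next unread line in `original`
    for cmd in patch:
        if cmd.startswith('>'):
            out.append(cmd[1:])                 # inserted text
            continue
        start, _, count = cmd[1:].strip().partition(',')
        start, count = int(start), int(count or 1)
        if cmd.startswith('-'):
            # copy untouched lines up to the deletion, then skip the deleted ones
            out.extend(original[src:start - 1])
            src = max(src, start - 1) + count
        elif cmd.startswith('+'):
            # copy untouched lines until the result reaches the insertion point
            missing = max(0, (start - 1) - len(out))
            out.extend(original[src:src + missing])
            src += missing
    out.extend(original[src:])                  # whatever follows the last change
    return out

original = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']
patch = ['+2,1', '>X\n', '-3,1', '+4,1', '>C\n', '-5,1']
patched = apply_compact_patch(original, patch)
print(patched)
```

The sample patch inserts a line, changes one, and deletes one, mirroring the three kinds of change in the example diff. No validation is done, so a malformed patch will produce wrong output rather than an error.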