Use the xmldiff library to do exactly this.
main.py
from xmldiff import main
diff = main.diff_files("file1.xml", "file2.xml")
print(diff)
output
[DeleteNode(node='/ngs_sample/results/gastro_prelim_st/type[2]')]
Answer from Victor 'Chris' Cabral on Stack Overflow
$ pip install xmldiff
You can switch to the XMLFormatter and manually filter out the results:
...
# Change formatter:
formatter = formatting.XMLFormatter(normalize=formatting.WS_BOTH)
...
# after `out` has been retrieved:
import re
for i in out.splitlines():
    if re.search(r'\bdiff:\w+', i):
        print(i)
# Result:
# <type st="9999" diff:delete=""/>
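The filtering step can also be wrapped in a small helper. This is a minimal sketch that assumes `out` is the string produced by the XMLFormatter; the name `changed_lines` and the sample string are made up for illustration and are not part of xmldiff.

```python
import re

# Matches change markers such as diff:delete or diff:insert.
DIFF_MARKER = re.compile(r'\bdiff:\w+')


def changed_lines(out):
    """Return the stripped lines of `out` that carry a diff: marker."""
    return [line.strip() for line in out.splitlines() if DIFF_MARKER.search(line)]


# Hypothetical formatter output, modelled on the result shown above.
sample = """<results>
  <type st="1234"/>
  <type st="9999" diff:delete=""/>
</results>"""

for line in changed_lines(sample):
    print(line)
```

Returning a list instead of printing makes it easy to count the changes or feed them to another step.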
This is actually a reasonably challenging problem, because what counts as a "difference" is often in the eye of the beholder: there will be semantically "equivalent" information that you probably don't want marked as a difference.
You could try using xmldiff, which is based on work in the paper Change Detection in Hierarchically Structured Information.
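To make the "equivalent information" point concrete, here is a small sketch using only the standard library. The two sample documents and the helper name `same_element` are illustrative assumptions; the strings differ in attribute order and whitespace, yet describe the same data.

```python
import xml.etree.ElementTree as ET

# Same data, different surface form: attribute order and whitespace differ.
a = '<item id="1" name="x"><v>1</v></item>'
b = '<item name="x" id="1">\n  <v>1</v>\n</item>'


def same_element(e1, e2):
    """Compare tag, attributes, stripped text and children recursively."""
    if e1.tag != e2.tag or e1.attrib != e2.attrib:
        return False
    if (e1.text or '').strip() != (e2.text or '').strip():
        return False
    if len(e1) != len(e2):
        return False
    return all(same_element(c1, c2) for c1, c2 in zip(e1, e2))


print(a == b)                                            # False
print(same_element(ET.fromstring(a), ET.fromstring(b)))  # True
```

A plain text diff would report every line here as changed, which is why a tree-aware comparison is usually what you want.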
My approach to the problem was to parse each XML document into an xml.etree.ElementTree and walk both trees level by level. I also added the option to ignore a list of attributes during the comparison.
The first block of code holds the class used:
import xml.etree.ElementTree as ET
import logging


class XmlTree():

    def __init__(self):
        # Send mismatch details to xml-comparison.log at DEBUG level
        self.logger = logging.getLogger('xml-comparison')
        self.logger.setLevel(logging.DEBUG)
        if not self.logger.handlers:
            hdlr = logging.FileHandler('xml-comparison.log')
            hdlr.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
            self.logger.addHandler(hdlr)

    @staticmethod
    def convert_string_to_tree(xmlString):
        return ET.fromstring(xmlString)

    def xml_compare(self, x1, x2, excludes=()):
        """
        Compares two xml etrees
        :param x1: the first tree
        :param x2: the second tree
        :param excludes: attribute names and child tags to exclude from comparison
        :return:
            True if both files match
        """
        if x1.tag != x2.tag:
            self.logger.debug('Tags do not match: %s and %s' % (x1.tag, x2.tag))
            return False
        for name, value in x1.attrib.items():
            if name not in excludes:
                if x2.attrib.get(name) != value:
                    self.logger.debug('Attributes do not match: %s=%r, %s=%r'
                                      % (name, value, name, x2.attrib.get(name)))
                    return False
        for name in x2.attrib.keys():
            if name not in excludes:
                if name not in x1.attrib:
                    self.logger.debug('x2 has an attribute x1 is missing: %s'
                                      % name)
                    return False
        if not self.text_compare(x1.text, x2.text):
            self.logger.debug('text: %r != %r' % (x1.text, x2.text))
            return False
        if not self.text_compare(x1.tail, x2.tail):
            self.logger.debug('tail: %r != %r' % (x1.tail, x2.tail))
            return False
        # getchildren() was removed in Python 3.9; list(element) is equivalent
        cl1 = list(x1)
        cl2 = list(x2)
        if len(cl1) != len(cl2):
            self.logger.debug('children length differs, %i != %i'
                              % (len(cl1), len(cl2)))
            return False
        i = 0
        for c1, c2 in zip(cl1, cl2):
            i += 1
            if c1.tag not in excludes:
                if not self.xml_compare(c1, c2, excludes):
                    self.logger.debug('children %i do not match: %s'
                                      % (i, c1.tag))
                    return False
        return True

    def text_compare(self, t1, t2):
        """
        Compare two text strings
        :param t1: text one
        :param t2: text two
        :return:
            True if a match
        """
        if not t1 and not t2:
            return True
        # '*' acts as a wildcard that matches any text
        if t1 == '*' or t2 == '*':
            return True
        return (t1 or '').strip() == (t2 or '').strip()
The second block of code holds a couple of XML examples and their comparison:
xml1 = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
xml2 = "<note><to>Tove</to><from>Daniel</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
tree1 = XmlTree.convert_string_to_tree(xml1)
tree2 = XmlTree.convert_string_to_tree(xml2)
comparator = XmlTree()
if comparator.xml_compare(tree1, tree2, ["from"]):
    print("XMLs match")
else:
    print("XMLs don't match")
Most of the credit for this code must be given to syawar.
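The comparison above stops at the first mismatch and only returns True or False. If you would rather see every difference, one possible variation is a generator that yields a message per mismatch. This is a sketch, not the original author's code: `iter_differences` is a hypothetical name, and for brevity it skips the tail comparison and the '*' wildcard.

```python
import xml.etree.ElementTree as ET


def iter_differences(x1, x2, excludes=(), path=''):
    """Yield a message for every mismatch between two elements."""
    here = '%s/%s' % (path, x1.tag)
    if x1.tag != x2.tag:
        yield '%s: tag %r != %r' % (here, x1.tag, x2.tag)
        return
    # Compare the union of attribute names, skipping excluded ones.
    for name in sorted(set(x1.attrib) | set(x2.attrib)):
        if name not in excludes and x1.attrib.get(name) != x2.attrib.get(name):
            yield '%s: attribute %s %r != %r' % (
                here, name, x1.attrib.get(name), x2.attrib.get(name))
    if (x1.text or '').strip() != (x2.text or '').strip():
        yield '%s: text %r != %r' % (here, x1.text, x2.text)
    if len(x1) != len(x2):
        yield '%s: child count %d != %d' % (here, len(x1), len(x2))
        return
    for c1, c2 in zip(x1, x2):
        if c1.tag not in excludes:
            yield from iter_differences(c1, c2, excludes, here)


xml1 = "<note><to>Tove</to><from>Jani</from><body>Hi</body></note>"
xml2 = "<note><to>Tove</to><from>Daniel</from><body>Hello</body></note>"
for message in iter_differences(ET.fromstring(xml1), ET.fromstring(xml2)):
    print(message)
```

Passing excludes=["from"] drops the first message, mirroring how the class above skips excluded tags.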
You can achieve what you want with the help of a small Python script (you'll need Python installed, as well as the lxml toolkit).
tagsort.py:
#!/usr/bin/env python3
import sys
from lxml import etree

filename, tag = sys.argv[1:]
doc = etree.parse(filename, etree.XMLParser(remove_blank_text=True))
root = doc.getroot()
# Sort the root's children by the text of their <tag> child
root[:] = sorted(root, key=lambda el: el.findtext(tag) or '')
print(etree.tostring(doc, pretty_print=True, encoding='unicode'))
This script sorts the first-level elements under the XML document root by the content of a second-level element, sending the result to stdout. It's called like this:
$ python tagsort.py filename tag
Once you've got that, you can use process substitution to get a diff based on its output (I've added one element and changed another in your example files to show a non-empty result):
$ diff <(python tagsort.py file1 Id) <(python tagsort.py file2 Id)
4a5
> <AddedTag>Something</AddedTag>
17c18
< <Role>X</Role>
---
> <Role>S</Role>
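If lxml or process substitution is not available, the same sort-then-diff idea can be sketched with the standard library alone. The helper name `sorted_lines` and the two sample documents are assumptions made for illustration (the original files are not shown here), and `ET.indent` requires Python 3.9 or later.

```python
import difflib
import xml.etree.ElementTree as ET


def sorted_lines(xml_text, tag):
    """Sort the root's children by the text of their `tag` child and
    return the pretty-printed document as a list of lines."""
    root = ET.fromstring(xml_text)
    root[:] = sorted(root, key=lambda el: el.findtext(tag) or '')
    ET.indent(root)  # Python 3.9+
    return ET.tostring(root, encoding='unicode').splitlines()


# Hypothetical inputs: same records in a different order, one Role changed.
file1 = ("<Users><User><Id>2</Id><Role>X</Role></User>"
         "<User><Id>1</Id><Role>A</Role></User></Users>")
file2 = ("<Users><User><Id>1</Id><Role>A</Role></User>"
         "<User><Id>2</Id><Role>S</Role></User></Users>")

for line in difflib.unified_diff(sorted_lines(file1, 'Id'),
                                 sorted_lines(file2, 'Id'),
                                 'file1', 'file2', lineterm=''):
    print(line)
```

Note that the sort key is the element text compared as a string, so numeric Ids of different lengths would need an int() conversion in the key.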
I had a similar problem and I eventually found: https://superuser.com/questions/79920/how-can-i-diff-two-xml-files
That post suggests converting both files to canonical XML and then running diff on the results. The following should work for you on Linux or macOS, or on Windows if you have something like Cygwin installed:
$ xmllint --c14n File1.xml > 1.xml
$ xmllint --c14n File2.xml > 2.xml
$ diff 1.xml 2.xml
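A rough Python counterpart of this pipeline, for systems without xmllint, might look like the sketch below. It assumes Python 3.8 or later for `ET.canonicalize`, which implements C14N 2.0, so its output is not guaranteed to be byte-identical to what xmllint produces. The helper name `c14n_lines` and the sample documents are illustrative.

```python
import difflib
import xml.etree.ElementTree as ET


def c14n_lines(xml_text):
    """Canonicalize an XML string and split the result into lines."""
    return ET.canonicalize(xml_text).splitlines()


# Hypothetical inputs: attribute order and quoting differ, one value changed.
doc1 = "<root>\n<item b='2' a='1'>old</item>\n</root>"
doc2 = '<root>\n<item a="1" b="2">new</item>\n</root>'

for line in difflib.unified_diff(c14n_lines(doc1), c14n_lines(doc2),
                                 '1.xml', '2.xml', lineterm=''):
    print(line)
```

Canonicalization normalizes attribute order and quoting, so only the changed text shows up in the diff. It does not reorder elements, so documents whose elements appear in a different order will still differ.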