I wrote a simple python tool for this called xmldiffs:
Compare two XML files, ignoring element and attribute order.
Usage:
xmldiffs [OPTION] FILE1 FILE2Any extra options are passed to the
diffcommand.
Get it at https://github.com/joh/xmldiffs
Answer from joh on Stack OverflowI wrote a simple python tool for this called xmldiffs:
Compare two XML files, ignoring element and attribute order.
Usage:
xmldiffs [OPTION] FILE1 FILE2Any extra options are passed to the
diffcommand.
Get it at https://github.com/joh/xmldiffs
With Beyond Compare you can use in the File Formats-Settings the XML Sort Conversion. With this option the XML children will be sorted before the diff.
A trial / portable version of Beyond Compare is available.
Two approaches that I use are (a) to canonicalize both XML files and then compare their serializations, and (b) to use the XPath 2.0 deep-equal() function. Both approaches are OK for telling you whether the files are the same, but not very good at telling you where they differ.
A commercial tool that specializes in this problem is DeltaXML.
If you have things that you consider equivalent, but which aren't equivalent at the XML level - for example, elements in a different order - then you may have to be prepared to do a transformation to normalize the documents before comparison.
Good answer here:
Question: How can I diff two XML files? | Super User
Answer: How can I diff two XML files? | Super User
$ xmllint --format --exc-c14n one.xml > 1.xml
$ xmllint --format --exc-c14n two.xml > 2.xml
$ diff 1.xml 2.xml
Apologies for any failure to adhere to serverfault conventions ... I'm sure someone will let me know and I will amend appropriately.
This is actually a reasonably challenging problem (due to what "difference" means often being in the eye of the beholder here, as there will be semantically "equivalent" information that you probably don't want marked as differences).
You could try using xmldiff, which is based on work in the paper Change Detection in Hierarchically Structured Information.
My approach to the problem was transforming each XML into a xml.etree.ElementTree and iterating through each of the layers. I also included the functionality to ignore a list of attributes while doing the comparison.
The first block of code holds the class used:
import xml.etree.ElementTree as ET
import logging
class XmlTree():
def __init__(self):
self.hdlr = logging.FileHandler('xml-comparison.log')
self.formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
@staticmethod
def convert_string_to_tree( xmlString):
return ET.fromstring(xmlString)
def xml_compare(self, x1, x2, excludes=[]):
"""
Compares two xml etrees
:param x1: the first tree
:param x2: the second tree
:param excludes: list of string of attributes to exclude from comparison
:return:
True if both files match
"""
if x1.tag != x2.tag:
self.logger.debug('Tags do not match: %s and %s' % (x1.tag, x2.tag))
return False
for name, value in x1.attrib.items():
if not name in excludes:
if x2.attrib.get(name) != value:
self.logger.debug('Attributes do not match: %s=%r, %s=%r'
% (name, value, name, x2.attrib.get(name)))
return False
for name in x2.attrib.keys():
if not name in excludes:
if name not in x1.attrib:
self.logger.debug('x2 has an attribute x1 is missing: %s'
% name)
return False
if not self.text_compare(x1.text, x2.text):
self.logger.debug('text: %r != %r' % (x1.text, x2.text))
return False
if not self.text_compare(x1.tail, x2.tail):
self.logger.debug('tail: %r != %r' % (x1.tail, x2.tail))
return False
cl1 = x1.getchildren()
cl2 = x2.getchildren()
if len(cl1) != len(cl2):
self.logger.debug('children length differs, %i != %i'
% (len(cl1), len(cl2)))
return False
i = 0
for c1, c2 in zip(cl1, cl2):
i += 1
if not c1.tag in excludes:
if not self.xml_compare(c1, c2, excludes):
self.logger.debug('children %i do not match: %s'
% (i, c1.tag))
return False
return True
def text_compare(self, t1, t2):
"""
Compare two text strings
:param t1: text one
:param t2: text two
:return:
True if a match
"""
if not t1 and not t2:
return True
if t1 == '*' or t2 == '*':
return True
return (t1 or '').strip() == (t2 or '').strip()
The second block of code holds a couple of XML examples and their comparison:
xml1 = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
xml2 = "<note><to>Tove</to><from>Daniel</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
tree1 = XmlTree.convert_string_to_tree(xml1)
tree2 = XmlTree.convert_string_to_tree(xml2)
comparator = XmlTree()
if comparator.xml_compare(tree1, tree2, ["from"]):
print "XMLs match"
else:
print "XMLs don't match"
Most of the credit for this code must be given to syawar
One approach would be to first turn both XML files into Canonical XML, and compare the results using diff. For example, xmllint can be used to canonicalize XML.
$ xmllint --c14n one.xml > 1.xml
$ xmllint --c14n two.xml > 2.xml
$ diff 1.xml 2.xml
Or as a one-liner.
$ diff <(xmllint --c14n one.xml) <(xmllint --c14n two.xml)
Jukka's answer did not work for me, but it did point to Canonical XML. Neither --c14n nor --c14n11 sorted the attributes, but i did find the --exc-c14n switch did sort the attributes. --exc-c14n is not listed in the man page, but described on the command line as "W3C exclusive canonical format".
$ xmllint --exc-c14n one.xml > 1.xml
$ xmllint --exc-c14n two.xml > 2.xml
$ diff 1.xml 2.xml
$ xmllint | grep c14
--c14n : save in W3C canonical format v1.0 (with comments)
--c14n11 : save in W3C canonical format v1.1 (with comments)
--exc-c14n : save in W3C exclusive canonical format (with comments)
$ rpm -qf /usr/bin/xmllint
libxml2-2.7.6-14.el6.x86_64
libxml2-2.7.6-14.el6.i686
$ cat /etc/system-release
CentOS release 6.5 (Final)
Warning --exc-c14n strips out the xml header whereas the --c14n prepends the xml header if not there.
» pip install xmldiff
You can achieve what you want with the help of a small Python script (you'll need Python installed, as well as the lxml toolkit).
tagsort.py:
#!/usr/bin/python
import sys
from lxml import etree
filename, tag = sys.argv[1:]
doc = etree.parse(filename, etree.XMLParser(remove_blank_text=True))
root = doc.getroot()
root[:] = sorted(root, key=lambda el: el.findtext(tag))
print etree.tostring(doc, pretty_print=True)
This script sorts the first-level elements under the XML document root by the content of a second-level element, sending the result to stdout. It's called like this:
$ python tagsort.py filename tag
Once you've got that, you can use process substitution to get a diff based on its output (I've added one element and changed another in your example files to show a non-empty result):
$ diff <(python tagsort.py file1 Id) <(python tagsort.py file2 Id)
4a5
> <AddedTag>Something</AddedTag>
17c18
< <Role>X</Role>
---
> <Role>S</Role>
I had a similar problem and I eventually found: https://superuser.com/questions/79920/how-can-i-diff-two-xml-files
That post suggests doing a canonical xml sort then doing a diff. The following should work for you if you are on linux, mac, or if you have windows something like cygwin installed:
$ xmllint --c14n File1.xml > 1.xml
$ xmllint --c14n File2.xml > 2.xml
$ diff 1.xml 2.xml
You can use formencode.doctest_xml_compare -- the xml_compare function compares two ElementTree or lxml trees.
The order of the elements can be significant in XML, this may be why most other methods suggested will compare unequal if the order is different... even if the elements have same attributes and text content.
But I also wanted an order-insensitive comparison, so I came up with this:
from lxml import etree
import xmltodict # pip install xmltodict
def normalise_dict(d):
"""
Recursively convert dict-like object (eg OrderedDict) into plain dict.
Sorts list values.
"""
out = {}
for k, v in dict(d).iteritems():
if hasattr(v, 'iteritems'):
out[k] = normalise_dict(v)
elif isinstance(v, list):
out[k] = []
for item in sorted(v):
if hasattr(item, 'iteritems'):
out[k].append(normalise_dict(item))
else:
out[k].append(item)
else:
out[k] = v
return out
def xml_compare(a, b):
"""
Compares two XML documents (as string or etree)
Does not care about element order
"""
if not isinstance(a, basestring):
a = etree.tostring(a)
if not isinstance(b, basestring):
b = etree.tostring(b)
a = normalise_dict(xmltodict.parse(a))
b = normalise_dict(xmltodict.parse(b))
return a == b