You can achieve what you want with the help of a small Python script (you'll need Python installed, as well as the lxml toolkit).
tagsort.py:
#!/usr/bin/python
import sys
from lxml import etree
filename, tag = sys.argv[1:]
doc = etree.parse(filename, etree.XMLParser(remove_blank_text=True))
root = doc.getroot()
root[:] = sorted(root, key=lambda el: el.findtext(tag))
print etree.tostring(doc, pretty_print=True)
This script sorts the first-level elements under the XML document root by the content of a second-level element, sending the result to stdout. It's called like this:
$ python tagsort.py filename tag
Once you've got that, you can use process substitution to get a diff based on its output (I've added one element and changed another in your example files to show a non-empty result):
$ diff <(python tagsort.py file1 Id) <(python tagsort.py file2 Id)
4a5
> <AddedTag>Something</AddedTag>
17c18
< <Role>X</Role>
---
> <Role>S</Role>
Answer from user27282 on Stack ExchangeYou can achieve what you want with the help of a small Python script (you'll need Python installed, as well as the lxml toolkit).
tagsort.py:
#!/usr/bin/python
import sys
from lxml import etree
filename, tag = sys.argv[1:]
doc = etree.parse(filename, etree.XMLParser(remove_blank_text=True))
root = doc.getroot()
root[:] = sorted(root, key=lambda el: el.findtext(tag))
print etree.tostring(doc, pretty_print=True)
This script sorts the first-level elements under the XML document root by the content of a second-level element, sending the result to stdout. It's called like this:
$ python tagsort.py filename tag
Once you've got that, you can use process substitution to get a diff based on its output (I've added one element and changed another in your example files to show a non-empty result):
$ diff <(python tagsort.py file1 Id) <(python tagsort.py file2 Id)
4a5
> <AddedTag>Something</AddedTag>
17c18
< <Role>X</Role>
---
> <Role>S</Role>
I had a similar problem and I eventually found: https://superuser.com/questions/79920/how-can-i-diff-two-xml-files
That post suggests doing a canonical xml sort then doing a diff. The following should work for you if you are on linux, mac, or if you have windows something like cygwin installed:
$ xmllint --c14n File1.xml > 1.xml
$ xmllint --c14n File2.xml > 2.xml
$ diff 1.xml 2.xml
diff - How to compare XML files - Stack Overflow
XML files comparison / diff?
Best way to compare 2 XML documents in Java - Stack Overflow
Comparing xml files
I had a similar problem and I eventually found: http://superuser.com/questions/79920/how-can-i-diff-two-xml-files
That post suggests doing a canonical XML sort then doing a diff. The following should work for you if you are on Linux, Mac, or if you have Windows with something like Cygwin installed:
$ xmllint --c14n FileA.xml > 1.xml
$ xmllint --c14n FileB.xml > 2.xml
$ diff 1.xml 2.xml
For what it's worth, I have created a java tool (or kotlin actually) for effecient and configurable canonicalization of xml files.
It will always:
- Sort nodes and attributes by name.
- Remove namespaces (yes - it could - hypothetically - be a problem).
- Prettyprint the result.
In addition you can tell it to:
- Remove a given list of node names - maybe you do not want to know that the value of a piece of metadata - say
<RequestReceivedTimestamp>has changed. - Sort a given list of collections in the context of the parent - maybe you do not care that the order of
<Contact>entries in<ListOfFavourites>has changed.
It uses XSLT and does all the above efficiently using chaining.
Limitations
It does support sorting nested lists - sorting innermost lists before outer. But it cannot reliably sort arbitrary levels of recursively nested lists.
If you have such needs you can - after having used this tool - compare the sorted byte arrays of the results. they will be equal if only list sorting issues remain.
Where to get it
You can get it here: XMLNormalize
One approach would be to first turn both XML files into Canonical XML, and compare the results using diff. For example, xmllint can be used to canonicalize XML.
$ xmllint --c14n one.xml > 1.xml
$ xmllint --c14n two.xml > 2.xml
$ diff 1.xml 2.xml
Or as a one-liner.
$ diff <(xmllint --c14n one.xml) <(xmllint --c14n two.xml)
Jukka's answer did not work for me, but it did point to Canonical XML. Neither --c14n nor --c14n11 sorted the attributes, but i did find the --exc-c14n switch did sort the attributes. --exc-c14n is not listed in the man page, but described on the command line as "W3C exclusive canonical format".
$ xmllint --exc-c14n one.xml > 1.xml
$ xmllint --exc-c14n two.xml > 2.xml
$ diff 1.xml 2.xml
$ xmllint | grep c14
--c14n : save in W3C canonical format v1.0 (with comments)
--c14n11 : save in W3C canonical format v1.1 (with comments)
--exc-c14n : save in W3C exclusive canonical format (with comments)
$ rpm -qf /usr/bin/xmllint
libxml2-2.7.6-14.el6.x86_64
libxml2-2.7.6-14.el6.i686
$ cat /etc/system-release
CentOS release 6.5 (Final)
Warning --exc-c14n strips out the xml header whereas the --c14n prepends the xml header if not there.
I've got a few XML files that are a couple thousand lines long I need to compare and see which values in file A are missing from file B.
Does anyone have an efficient way to diff these? They're not formatted the same at all, so normal text diff programs won't work. I just need to compare which "<value203>...</value203>" aren't in the one file.
Two approaches that I use are (a) to canonicalize both XML files and then compare their serializations, and (b) to use the XPath 2.0 deep-equal() function. Both approaches are OK for telling you whether the files are the same, but not very good at telling you where they differ.
A commercial tool that specializes in this problem is DeltaXML.
If you have things that you consider equivalent, but which aren't equivalent at the XML level - for example, elements in a different order - then you may have to be prepared to do a transformation to normalize the documents before comparison.
Good answer here:
Question: How can I diff two XML files? | Super User
Answer: How can I diff two XML files? | Super User
$ xmllint --format --exc-c14n one.xml > 1.xml
$ xmllint --format --exc-c14n two.xml > 2.xml
$ diff 1.xml 2.xml
Apologies for any failure to adhere to serverfault conventions ... I'm sure someone will let me know and I will amend appropriately.
Sounds like a job for XMLUnit
- http://www.xmlunit.org/
- https://github.com/xmlunit
Example:
public class SomeTest extends XMLTestCase {
@Test
public void test() {
String xml1 = ...
String xml2 = ...
XMLUnit.setIgnoreWhitespace(true); // ignore whitespace differences
// can also compare xml Documents, InputSources, Readers, Diffs
assertXMLEqual(xml1, xml2); // assertXMLEquals comes from XMLTestCase
}
}
The following will check if the documents are equal using standard JDK libraries.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc1 = db.parse(new File("file1.xml"));
doc1.normalizeDocument();
Document doc2 = db.parse(new File("file2.xml"));
doc2.normalizeDocument();
Assert.assertTrue(doc1.isEqualNode(doc2));
normalize() is there to make sure there are no cycles (there technically wouldn't be any)
The above code will require the white spaces to be the same within the elements though, because it preserves and evaluates it. The standard XML parser that comes with Java does not allow you to set a feature to provide a canonical version or understand xml:space if that is going to be a problem then you may need a replacement XML parser such as xerces or use JDOM.