To extract RelatedTerms, first select the top-level Term elements with btree.select('Terms > Term'). Then loop over them and pull the Term elements inside each RelatedTerms with term.select('RelatedTerms > Term').
import json
from bs4 import BeautifulSoup

xml_file = './xml.xml'
# The "xml" parser requires lxml to be installed.
with open(xml_file, 'r') as f:
    btree = BeautifulSoup(f, "xml")

# Top-level Term elements only (direct children of Terms).
Terms = btree.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    # Term elements nested inside this term's RelatedTerms.
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

print(json.dumps(jsonObj, indent=4))
Answer from ewwink on Stack Overflow
xmltodict (full disclosure: I wrote it) can help you convert your XML to a dict+list+string structure, following this "standard". It is Expat-based, so it's very fast and doesn't need to load the whole XML tree in memory.
Once you have that data structure, you can serialize it to JSON:
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
There is no "one-to-one" mapping between XML and JSON, so converting one to the other necessarily requires some understanding of what you want to do with the results.
That being said, Python's standard library has several modules for parsing XML (including DOM, SAX, and ElementTree). As of Python 2.6, support for converting Python data structures to and from JSON is included in the json module.
So the infrastructure is there.
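As a rough illustration of that infrastructure, here is a minimal standard-library-only sketch using ElementTree and json. The helper names element_to_dict and xml_to_json are made up for this example, and the mapping it uses (repeated tags become lists, attributes get an "@" prefix, mixed text goes under "#text") is just one possible convention, borrowed loosely from xmltodict. It ignores tail text and namespaces, so treat it as a starting point rather than a general converter.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an Element into plain dicts, lists and strings (hypothetical helper)."""
    children = list(elem)
    if not children and not elem.attrib:
        # Leaf element with no attributes: keep just its text.
        return (elem.text or "").strip()

    result = {}
    # Assumption: attributes are stored under "@name" keys.
    for name, value in elem.attrib.items():
        result["@" + name] = value

    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: collect the values in a list.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value

    # Assumption: text next to children or attributes goes under "#text".
    text = (elem.text or "").strip()
    if text:
        result["#text"] = text
    return result


def xml_to_json(xml_string):
    """Parse an XML string and serialize it as JSON, keeping the root tag."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)})


print(xml_to_json('<e> <a>text</a> <a>text</a> </e>'))
```

Because it only uses modules that ship with Python, something along these lines may also be an option where third-party packages are not allowed.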
I'm dealing with this problem at work, and it's the last step I need to implement for a process that will be automated.
We're government, so if there's anything in the official RedHat repos that can do this conversion it'd make my life easier, but as far as I know there isn't.
The reason I can't use something like yq, or Python modules like xmltodict, untangle, pandas, or beautifulsoup, is because they aren't approved.
I know an easy answer is Apache Daffodil, but the documentation on that is WAY over my head. Anyone have suggestions?
I've got some legacy APIs that simply cannot do anything but XML, but I need to work with them and I don't like XML.
What's the best method these days for converting XML to JSON, and JSON to XML?
Idk if there's a standard method, or everybody goes by preference, or what have you
» pip install xmljson
Soviut's advice for lxml objectify is good. With a specially subclassed simplejson, you can turn an lxml objectify result into json.
import simplejson as json
import lxml.objectify

class objectJSONEncoder(json.JSONEncoder):
    """A specialized JSON encoder that can handle simple lxml objectify types
    >>> from lxml import objectify
    >>> obj = objectify.fromstring("<Book><price>1.50</price><author>W. Shakespeare</author></Book>")
    >>> objectJSONEncoder().encode(obj)
    '{"price": 1.5, "author": "W. Shakespeare"}'
    """

    def default(self, o):
        if isinstance(o, lxml.objectify.IntElement):
            return int(o)
        if isinstance(o, (lxml.objectify.NumberElement, lxml.objectify.FloatElement)):
            return float(o)
        if isinstance(o, lxml.objectify.ObjectifiedDataElement):
            return str(o)
        if hasattr(o, '__dict__'):
            # For objects with a __dict__, return the encoding of the __dict__
            return o.__dict__
        return json.JSONEncoder.default(self, o)
See the docstring for an example of usage: essentially, you pass the result of lxml objectify to the encode method of an instance of objectJSONEncoder.
Note that Koen's point is very valid here: the solution above only works for simply nested XML and doesn't include the name of the root element. This could be fixed.
I've included this class in a gist here: http://gist.github.com/345559