If parsing speed is a key factor for you, consider using cElementTree or lxml.
There are definitely more options out there, see these threads:
- What is the fastest way to parse large XML docs in Python?
- How do I parse XML in Python?
- High-performance XML parsing in Python with lxml
Which are the best libraries out there to manage XML files? There's a prticular one that you usually use? Why?
https://lxml.de/
In the standard library, ElementTree works fine if you just want to extract data from XML.
If you want to edit and save XML use LXML. The standard library implementations are incomplete and can't replicate some constructs. For example they move all namespace tags to the root node and will save XML that contains forbidden other characters.
Also consider defusedxml which monkeypatches the standard library against malformed XML.
What is a good XML stream parser for Python? - Stack Overflow
beautifulsoup - For web scraping and xml parsing, which is best library to learn - Stack Overflow
Best python libraries to manage XML files
https://lxml.de/
More on reddit.comA Roadmap to XML Parsers in Python โ Real Python
Videos
Here's good answer about xml.etree.ElementTree.iterparse practice on huge XML files. lxml has the method as well. The key to stream parsing with iterparse is manual clearing and removing already processed nodes, because otherwise you will end up running out of memory.
Another option is using xml.sax. The official manual is too formal to me, and lacks examples so it needs clarification along with the question. Default parser module, xml.sax.expatreader, implement incremental parsing interface xml.sax.xmlreader.IncrementalParser. That is to say xml.sax.make_parser() provides suitable stream parser.
For instance, given a XML stream like:
<?xml version="1.0" encoding="utf-8"?>
<root>
<entry><a>value 0</a><b foo='bar' /></entry>
<entry><a>value 1</a><b foo='baz' /></entry>
<entry><a>value 2</a><b foo='quz' /></entry>
...
</root>
Can be handled in the following way.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.sax
class StreamHandler(xml.sax.handler.ContentHandler):
lastEntry = None
lastName = None
def startElement(self, name, attrs):
self.lastName = name
if name == 'entry':
self.lastEntry = {}
elif name != 'root':
self.lastEntry[name] = {'attrs': attrs, 'content': ''}
def endElement(self, name):
if name == 'entry':
print({
'a' : self.lastEntry['a']['content'],
'b' : self.lastEntry['b']['attrs'].getValue('foo')
})
self.lastEntry = None
elif name == 'root':
raise StopIteration
def characters(self, content):
if self.lastEntry:
self.lastEntry[self.lastName]['content'] += content
if __name__ == '__main__':
# use default ``xml.sax.expatreader``
parser = xml.sax.make_parser()
parser.setContentHandler(StreamHandler())
# feed the parser with small chunks to simulate
with open('data.xml') as f:
while True:
buffer = f.read(16)
if buffer:
try:
parser.feed(buffer)
except StopIteration:
break
# if you can provide a file-like object it's as simple as
with open('data.xml') as f:
parser.parse(f)
Are you looking for xml.sax? It's right in the standard library.