There are various technologies for streamed processing of XML. One of them is XSLT 3.0, where you would write
<xsl:mode streamable="yes"/>
<xsl:output method="text"/>
<xsl:template match="row">
<xsl:value-of select="@Id, @UserId, @Name, @Class, @TagBased"
separator=","/>
<xsl:text>
</xsl:text>
</xsl:template>
(The above is an answer from Michael Kay on Stack Overflow.)
I tried MySQL: I imported the XML data set files into the database, then exported them to CSV format, and it processed 82.2 GB of files in just 3 hours.

One possibility is a streaming XSLT 3.0 processor, which, given your constraints, means in practice Saxon/C Enterprise Edition (this has a Python language binding).
There is actually a CSV-to-XML stylesheet published as a worked example in the XSLT 3.0 specification, but sadly no counterpart to do the reverse. However, you can see the principle in some of the answers here:
https://stackoverflow.com/questions/365312/xml-to-csv-using-xslt
or here:
https://stackoverflow.com/questions/15226194/xml-to-csv-using-xslt
To make the code streamable, the key constraint is that any template rule or for-each instruction that processes a particular element can only make one traversal of the element's children. That means you can't, for example, do one pass of the source XML to discover the field names and then another pass to process the values.
Note: Saxon-EE is a commercial product and I have a commercial interest in it.
The XML Utilities library is worth a try, assuming a valid and flat XML structure; it even comes with a command-line xml2csv utility.
It specifically states:
xmlutils.py is a set of Python utilities for processing xml files serially for converting them to various formats (SQL, CSV, JSON). The scripts use ElementTree.iterparse() to iterate through nodes in an XML document, thus not needing to load the entire DOM into memory. The scripts can be used to churn through large XML files (albeit taking long :P) without memory hiccups.
I would recommend pandas' read_xml() and to_csv() functions; it's a 3-liner.
Compare the documentation: read_xml, to_csv
import pandas as pd
df = pd.read_xml('employee.xml')
df.to_csv('out.csv', index=False)
Output (CSV file):
id,name,age,salary,division
303,varma,20,120000,3
304,Cyril,20,900000,3
305,Yojith,20,900000,3
I recommend just using libraries because they're usually very optimised; I'll talk about that later. For now, here's a way that utilises the xml.dom.minidom module, which is part of the Python standard library, so no additional libraries are required.
Edit: rewrote the last part using the standard CSV library instead of manually writing the file, as suggested by a comment. That makes for 2 Python built-in modules, not 1. The original code for the CSV writing will be at the end of the reply, if you're interested.
from xml.dom import minidom
from csv import DictWriter

# Step 1: Read and parse the XML file
# Write it as a string, or open the file and read it
xml_file = open('employees.xml', 'r')
xml_data = xml_file.read()
dom = minidom.parseString(xml_data)
employees = dom.getElementsByTagName('employee')
xml_file.close()

# Step 2: Extract the required information
data = []
for employee in employees:
    emp_data = {}
    for child in employee.childNodes:
        if child.nodeType == minidom.Node.ELEMENT_NODE:
            emp_data[child.tagName] = child.firstChild.data
    data.append(emp_data)

# Step 3: Write the extracted information to a CSV file
with open('output.csv', 'w', newline='') as csv_file:
    fieldnames = ['id', 'name', 'age', 'salary', 'division']
    writer = DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for emp_data in data:
        writer.writerow(emp_data)
Don't reinvent the wheel, just realign it.
– Anthony J. D'Angelo, I think
I recommend NOT using this code. You should really just use lxml. It's extremely simple and easy to use and can handle complex XML structures with nested elements and attributes. Let me know how everything goes!
Original CSV write code without CSV library
# Step 3: Write the extracted information to a CSV file
with open('output.csv', 'w') as f:
    f.write('id,name,age,salary,division\n')
    for emp_data in data:
        f.write(f"{emp_data['id']},{emp_data['name']},{emp_data['age']},{emp_data['salary']},{emp_data['division']}\n")
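One concrete reason to prefer the csv module over manual f.write, beyond brevity: it quotes fields that contain commas, quotes, or newlines, which the manual version would silently corrupt. A quick illustration (the sample values are made up):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# A name containing quotes and a division containing a comma:
writer.writerow(["304", 'Cyril "C" Smith', "R&D, East"])
print(buf.getvalue())
# -> 304,"Cyril ""C"" Smith","R&D, East"
```

The embedded comma and quotes survive a round-trip through any CSV reader; the hand-rolled f-string version would produce a row with the wrong number of columns.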