I would recommend pandas' read_xml() and to_csv() functions, a 3-liner:
Compare the documentation: to_csv, read_xml
import pandas as pd
df = pd.read_xml('employee.xml')
df.to_csv('out.csv', index=False)
Output -> (CSV-file):
id,name,age,salary,division
303,varma,20,120000,3
304,Cyril,20,900000,3
305,Yojith,20,900000,3
Answer from Hermann12 on Stack Overflow
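If the records sit deeper in the tree than the root's children, read_xml's xpath parameter can select them. A minimal sketch with an inline document (the nesting and tag names here are hypothetical, not from the question):

```python
from io import StringIO

import pandas as pd

# Hypothetical document where <employee> records are nested under <staff>
xml = """<company>
  <staff>
    <employee><id>303</id><name>varma</name><age>20</age></employee>
    <employee><id>304</id><name>Cyril</name><age>20</age></employee>
  </staff>
</company>"""

# xpath selects the repeating record elements; parser="etree" sticks to the
# standard-library backend (the default "lxml" backend works the same way)
df = pd.read_xml(StringIO(xml), xpath=".//employee", parser="etree")
df.to_csv("out.csv", index=False)
```

Each matched `<employee>` becomes one row, with its child tags as columns.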
I recommend just using libraries because they're usually very optimised; I'll talk about that later. For now, here's a way that uses the xml.dom.minidom module, which is part of the Python standard library, so no additional libraries are required.
Edit: rewrote the last part using the standard csv library instead of manually writing the file, as suggested by a comment. That makes two Python built-in modules, not one. The original CSV-writing code is at the end of the reply, if you're interested.
from xml.dom import minidom
from csv import DictWriter
# Step 1: Read and parse the XML file
# Write it as a string, or open the file and read it
with open('employees.xml', 'r') as xml_file:
    xml_data = xml_file.read()
dom = minidom.parseString(xml_data)
employees = dom.getElementsByTagName('employee')

# Step 2: Extract the required information
data = []
for employee in employees:
    emp_data = {}
    for child in employee.childNodes:
        if child.nodeType == minidom.Node.ELEMENT_NODE:
            emp_data[child.tagName] = child.firstChild.data
    data.append(emp_data)

# Step 3: Write the extracted information to a CSV file
with open('output.csv', 'w', newline='') as csv_file:
    fieldnames = ['id', 'name', 'age', 'salary', 'division']
    writer = DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for emp_data in data:
        writer.writerow(emp_data)
Don't reinvent the wheel, just realign it.
— Anthony J. D'Angelo, I think
I recommend NOT using this code in practice, though. You should really just use lxml: it's extremely simple and easy to use, and it can handle complex XML structures with nested elements and attributes. Let me know how everything goes!
Original CSV write code without CSV library
# Step 3: Write the extracted information to a CSV file
with open('output.csv', 'w') as f:
    f.write('id,name,age,salary,division\n')
    for emp_data in data:
        f.write(f"{emp_data['id']},{emp_data['name']},{emp_data['age']},{emp_data['salary']},{emp_data['division']}\n")
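To illustrate the lxml suggestion above: the stdlib xml.etree.ElementTree API largely mirrors lxml's etree, so the same extraction can be sketched as below (swap the import for `from lxml import etree as ET` to use lxml; the inline string stands in for employees.xml):

```python
import csv
import xml.etree.ElementTree as ET

# Inline stand-in for employees.xml, with the layout assumed above
xml = """<employees>
  <employee><id>303</id><name>varma</name><age>20</age><salary>120000</salary><division>3</division></employee>
  <employee><id>304</id><name>Cyril</name><age>20</age><salary>900000</salary><division>3</division></employee>
</employees>"""

root = ET.fromstring(xml)
# One dict per <employee>, keyed by child tag name
data = [{child.tag: child.text for child in emp} for emp in root.iter("employee")]

with open("output.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["id", "name", "age", "salary", "division"])
    writer.writeheader()
    writer.writerows(data)
```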
Using pandas and BeautifulSoup you can achieve your expected output easily:
Code:
import pandas as pd
import itertools
from bs4 import BeautifulSoup as b
with open("file.xml", "r") as f:  # opening xml file
    content = f.read()

soup = b(content, "lxml")
pkgeid = [values.text for values in soup.findAll("pkgeid")]
pkgname = [values.text for values in soup.findAll("pkgname")]
time = [values.text for values in soup.findAll("time")]
oper = [values.text for values in soup.findAll("oper")]

# For Python 3.x use the `zip_longest` method
# For Python 2.x use the `izip_longest` method
data = [item for item in itertools.zip_longest(time, oper, pkgeid, pkgname)]
df = pd.DataFrame(data=data)
df.to_csv("sample.csv", index=False, header=None)
Output in the `sample.csv` file will be as follows:
2015-09-16T04:13:20Z,Create_Product,10,BBCWRL
2015-09-16T04:13:20Z,Create_Product,18,CNNINT
2018-04-01T03:30:28Z,Deactivate_Dhct,,
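The zip_longest call above pads the shorter lists with None, which is why the last CSV row above ends in empty fields. A quick illustration:

```python
from itertools import zip_longest

time = ["t1", "t2", "t3"]
oper = ["Create", "Create"]  # one element shorter than time

rows = list(zip_longest(time, oper))
print(rows)  # [('t1', 'Create'), ('t2', 'Create'), ('t3', None)]
```

pandas then writes those None entries as empty strings when saving with to_csv.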
Using pandas, parsing all XML fields:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
root = tree.getroot()
get_range = lambda col: range(len(col))
l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]
df = pd.DataFrame.from_dict(l)
df.to_csv('file.csv')
This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.
The namespaces used in the document are defined at the top:
xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"
So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which doesn't have a short form after the xmlns, is called the default namespace, and it applies to all elements in the document that don't explicitly use a namespace - so it applies to your <trkpt /> elements as well.
Therefore you would need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements.
In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon). Also, because the time and ele elements also use the default namespace you'll have to use the {namespace}element syntax again to select those nodes.
So you could use something like this:
import csv
import lxml.etree

NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    root = lxml.etree.fromstring(x)  # x is the GPX document as a string/bytes
    for trkpt in root.iter('{%s}trkpt' % NS):
        lat = trkpt.get('lat')
        lon = trkpt.get('lon')
        ele = trkpt.find('{%s}ele' % NS).text
        time = trkpt.find('{%s}time' % NS).text
        row = lat, lon, ele, time
        writer.writerow(row)
For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces. Also see GPS eXchange Format for some details on the .gpx format.
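If you use the stdlib xml.etree.ElementTree instead, find() and findall() accept a namespaces mapping, so you can write a prefix rather than spelling out the full URI each time. A sketch with a minimal GPX-like fragment (trimmed to a single track point):

```python
import xml.etree.ElementTree as ET

gpx = """<gpx xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885">
      <ele>0.0</ele><time>2013-12-03T21:08:56Z</time>
    </trkpt>
  </trkseg></trk>
</gpx>"""

# Map a prefix of our choosing to the document's default namespace URI
ns = {"gpx": "http://www.topografix.com/GPX/1/1"}
root = ET.fromstring(gpx)
points = [(pt.get("lat"), pt.get("lon"), pt.find("gpx:ele", ns).text)
          for pt in root.findall(".//gpx:trkpt", ns)]
print(points)  # [('45.4852855', '-122.6347885', '0.0')]
```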
Apologies for using already-made tools here, but this did the job with your data:
- Convert XML to JSON : http://convertjson.com/xml-to-json.htm
- Take that JSON and convert JSON to CSV : https://konklone.io/json/
It worked like a charm with your data.
ele,time,_lat,_lon
0.0000000,2013-12-03T21:08:56Z,45.4852855,-122.6347885
0.0000000,2013-12-03T21:09:00Z,45.4852961,-122.6347926
0.2000000,2013-12-03T21:09:01Z,45.4852982,-122.6347897
So for coding, I reckon XML > JSON > CSV may be a good approach. You may find the relevant scripts pointed to in those links.
Use csv.DictWriter, get values from the node.attrib dictionary.
Your elements named TrdCaptRpt have attributes; for such a node, its node.attrib holds a dictionary with a key/value pair for each attribute.
csv.DictWriter allows writing data taken from a dictionary.
First some imports (I always use lxml as it is very fast and provides extra features):
from lxml import etree
import csv
Configure file names and fields to use in each record:
xml_fname = "data.xml"
csv_fname = "data.csv"
fields = [
    "RptID", "TrdTyp", "TrdSubTyp", "ExecID", "TrdDt", "BizDt", "MLegRptTyp",
    "MtchStat", "MsgEvtSrc", "TrdID", "LastQty", "LastPx", "TxnTm", "SettlCcy",
    "SettlDt", "PxSubTyp", "VenueTyp", "VenuTyp", "OfstInst"]
Read the XML:
xml = etree.parse(xml_fname)
Iterate over elements "TrdCapRpt", write attribute values to CSV file:
with open(csv_fname, "w", newline="") as f:
    writer = csv.DictWriter(f, fields, delimiter=";", extrasaction="ignore")
    writer.writeheader()
    for node in xml.iter("TrdCaptRpt"):
        writer.writerow(node.attrib)
If you prefer using the stdlib xml.etree.ElementTree, you should manage easily as you do now, because node.attrib is present there too.
Reading from multiple element names
In your comments you noted that you want to export attributes from more element names. This is also possible. To do this, I will modify the example to use XPath (which will probably work only with lxml) and add an extra column "elm_name" to track which element each record was created from:
fields = [
    "elm_name",
    "RptID", "TrdTyp", "TrdSubTyp", "ExecID", "TrdDt", "BizDt", "MLegRptTyp",
    "MtchStat", "MsgEvtSrc", "TrdID", "LastQty", "LastPx", "TxnTm", "SettlCcy",
    "SettlDt", "PxSubTyp", "VenueTyp", "VenuTyp", "OfstInst",
    "Typ", "Amt", "Ccy"
]
xml = etree.parse(xml_fname)
with open(csv_fname, "w", newline="") as f:
    writer = csv.DictWriter(f, fields, delimiter=";", extrasaction="ignore")
    writer.writeheader()
    for node in xml.xpath("//*[self::TrdCaptRpt or self::PosRpt or self::Amt]"):
        atts = dict(node.attrib)
        atts["elm_name"] = node.tag
        writer.writerow(atts)
The modifications are:
- fields got an extra "elm_name" field, plus fields from the other elements (feel free to remove the ones you are not interested in).
- Elements are iterated using xml.xpath. The XPath expression is more complex, so I am not sure whether stdlib ElementTree supports it.
- Before writing each record, the name of the element is added to the atts dictionary.
Warning: the element Amt is nested inside PosRpt, and this tree structure cannot be represented in CSV. The records are written, but they do not carry information about where they come from (apart from following the record for the parent element).
You should first append each row, with all your tags, to a list.
for node in tree.iter('TrdCaptRpt'):
    .....
    my_list.append([RptID, TrdTyp, TrdSubTyp, TrdDt, BizDt,
                    MLegRptTyp, MtchStat, MsgEvtSrc, TrdID,
                    LastQty, LastPx, TxnTm, SettlCcy, SettlDt,
                    PxSubTyp, VenueTyp, VenuTyp, OfstInst])
Then write each line to the file:
with open('/Users/anantsangar/Desktop/output.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for row in my_list:
        spamwriter.writerow(row)
You probably don't need to go through ElementTree; you can feed the XML directly to pandas. If I understand you correctly, this should do it:
df = pd.read_xml(path_to_file, "//*[local-name()='MainVIP']")
df = df.iloc[:, :4]
df
Output from your xml above:
Date RegisteredDate Type TypeDescription
0 20210616 20210216 YMBA TYPE OF ENQUIRY
Without any external library, the code below generates a CSV file.
The idea is to collect the required element data from each MainVIP node and store it in a list of dicts, then loop over the list and write the data into a file.
import xml.etree.ElementTree as ET
xml = ''' <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Level3>
<ResponseStatus>Success</ResponseStatus>
<ErrorMessage/>
<Message>20 alert(s) generated for this period</Message>
<ProcessingTimeSecs>0.88217689999999993</ProcessingTimeSecs>
<Something1>1</Something1>
<Something2/>
<Something3/>
<Something4/>
<VIP>
<MainVIP>
<Date>20210616</Date>
<RegisteredDate>20210216</RegisteredDate>
<Type>YMBA</Type>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<ITNumber>987654321</ITNumber>
<RegistrationNumber>123456789</RegistrationNumber>
<SubscriberNumber>55889977</SubscriberNumber>
<SubscriberReference/>
<TicketNumber>1122336655</TicketNumber>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
<CompletedDate>20210615</CompletedDate>
</MainVIP>
</VIP>
<Something5/>
<Something6/>
<Something7/>
<Something8/>
<Something9/>
<PrincipalSomething10/>
<PrincipalSomething11/>
<PrincipalSomething12/>
<PrincipalSomething13/>
<Something14/>
<Something15/>
<Something16/>
<Something17/>
<Something18/>
<PrincipalSomething19/>
<PrincipalSomething20/>
</Level3>
</Level2>
</soap:Body>
</soap:Envelope>'''
cols = ['Date', 'RegisteredDate', 'Type', 'TypeDescription']
rows = []
NS = '{https://xxxxxxxxxx/xxxxxxx}'
root = ET.fromstring(xml)
for vip in root.findall(f'.//{NS}MainVIP'):
    rows.append({c: vip.find(NS + c).text for c in cols})

with open('out.csv', 'w') as f:
    f.write(','.join(cols) + '\n')
    for row in rows:
        f.write(','.join(row[c] for c in cols) + '\n')
out.csv
Date,RegisteredDate,Type,TypeDescription
20210616,20210216,YMBA,TYPE OF ENQUIRY
The lxml library is capable of very powerful XML parsing, and can be used to iterate over an XML tree to search for specific elements.
from lxml import etree

with open(r'path/to/xml', 'r') as xml:
    text = xml.read()

tree = etree.fromstring(text)
row = ['', '']
for item in tree.iter('hw', 'def'):
    if item.tag == 'hw':
        row[0] = item.text
    elif item.tag == 'def':
        row[1] = item.text
        line = ','.join(row)
        with open(r'path/to/csv', 'a') as csv:
            csv.write(line + '\n')
How you build the CSV file is largely based upon preference, but I have provided a trivial example above. If there are multiple <dps-data> tags, you could extract those elements first (which can be done with the same tree.iter method shown above), and then apply the above logic to each of them.
EDIT: I should point out that this particular implementation reads the entire XML file into memory. If you are working with a single 150 MB file at a time, this should not be a problem, but it's just something to be aware of.
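For files too large to hold in memory, both lxml and the stdlib provide iterparse, which yields elements as they finish parsing and lets you discard them immediately. A sketch with xml.etree.ElementTree, using an in-memory stand-in for the big file (the <entry>/<hw>/<def> layout here is assumed, not taken from the question):

```python
import io
import xml.etree.ElementTree as ET

xml = b"""<entries>
  <entry><hw>alpha</hw><def>first letter</def></entry>
  <entry><hw>beta</hw><def>second letter</def></entry>
</entries>"""  # stand-in for a large file opened in binary mode

rows = []
# "end" events fire once an element (and all its children) is fully parsed
for event, elem in ET.iterparse(io.BytesIO(xml), events=("end",)):
    if elem.tag == "entry":
        rows.append((elem.findtext("hw"), elem.findtext("def")))
        elem.clear()  # free the element's children to keep memory flat
```

lxml's etree.iterparse behaves the same way, with the same event names.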
How about this:
from xml.dom import minidom
xmldoc = minidom.parse('your.xml')
hw_lst = xmldoc.getElementsByTagName('hw')
defu_lst = xmldoc.getElementsByTagName('def')
with open('your.csv', 'a') as out_file:
    for i in range(len(hw_lst)):
        out_file.write('{0}, {1}\n'.format(hw_lst[i].firstChild.data, defu_lst[i].firstChild.data))
There are various technologies for streamed processing of XML. One of them is XSLT 3.0, where you would write
<xsl:mode streamable="yes"/>
<xsl:output method="text"/>
<xsl:template match="row">
<xsl:value-of select="@Id, @UserId, @Name, @Class, @TagBased"
separator=","/>
<xsl:text>
</xsl:text>
</xsl:template>
I tried MySQL: I imported the XML data-set files into the database, then exported them to CSV format, and processed 82.2 GB of files in just 3 hours.

Use the code provided by Nk03 to convert the XML you're loading to a Python dictionary.
import xmltodict
d = xmltodict.parse("""
<D1>
<RECORD>
<ELEC>EL-13</ELEC>
<VAL>10</VAL>
<POWER>Max</POWER>
<WIRING>2.3</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
<RECORD>
<ELEC>EL-14</ELEC>
<VAL>30</VAL>
<POWER>Max</POWER>
<WIRING>1.1</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
</D1>
""")
From there, you can generate a list of keys to use as the column names for the DataFrame:
cols = []
for key in d['D1']['RECORD'][0].keys():  # keys of one record: ELEC, VAL, ...
    cols.append(key)
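Because xmltodict turns the repeated RECORD elements into a list of dicts, that list can also go straight into a DataFrame. A sketch where the literal dict stands in for xmltodict.parse's output on the document above:

```python
import pandas as pd

# Shape that xmltodict.parse produces for the <D1>/<RECORD> document above
d = {"D1": {"RECORD": [
    {"ELEC": "EL-13", "VAL": "10", "POWER": "Max", "WIRING": "2.3", "ENABLED": "Yes"},
    {"ELEC": "EL-14", "VAL": "30", "POWER": "Max", "WIRING": "1.1", "ENABLED": "Yes"},
]}}

df = pd.DataFrame(d["D1"]["RECORD"])  # one row per RECORD, columns from its keys
df.to_csv("out.csv", index=False)
```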
Here’s one way:
import xmltodict
d = xmltodict.parse("""
<D1>
<RECORD>
<ELEC>EL-13</ELEC>
<VAL>10</VAL>
<POWER>Max</POWER>
<WIRING>2.3</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
<RECORD>
<ELEC>EL-14</ELEC>
<VAL>30</VAL>
<POWER>Max</POWER>
<WIRING>1.1</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
</D1>
""")
pd.DataFrame(d).iloc[:,0].explode().apply(pd.Series).reset_index(drop=True).to_csv('out.csv')
# Alternative:
pd.json_normalize(d).stack().explode().apply(pd.Series)
Explanation:
- Convert the XML to a dict.
- Load the result into a DataFrame.
- Use explode to extract the values from the list of dicts into multiple rows.
- Apply pd.Series to generate the required columns from the dict.
- Save the output to CSV.
Updated Answer:
df1 = pd.json_normalize(d).stack().explode().apply(pd.Series)
pd.concat([df1.pop('DATA').apply(pd.Series), df1], axis=1)