Use the code provided by Nk03 to convert the XML you're loading to a Python dictionary:
import xmltodict
d = xmltodict.parse("""
<D1>
<RECORD>
<ELEC>EL-13</ELEC>
<VAL>10</VAL>
<POWER>Max</POWER>
<WIRING>2.3</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
<RECORD>
<ELEC>EL-14</ELEC>
<VAL>30</VAL>
<POWER>Max</POWER>
<WIRING>1.1</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
</D1>
""")
From there, you can generate a list of keys to use as the column names for the DataFrame:
# The records sit under the root element, so take the keys from the first record
cols = []
for key in d["D1"]["RECORD"][0].keys():
    cols.append(key)
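As a minimal sketch of that step, assuming the parsed result has the shape xmltodict gives for the sample above (a `D1` dict holding a list of `RECORD` dicts), the column names can be collected from the records themselves and handed to `csv.DictWriter`. The literal `parsed` dict and the `column_names` helper are illustrative stand-ins, not part of the original answer.

```python
import csv
import io

# Assumed shape of the parsed XML: one dict per <RECORD> element.
parsed = {"D1": {"RECORD": [
    {"ELEC": "EL-13", "VAL": "10", "POWER": "Max", "WIRING": "2.3", "ENABLED": "Yes"},
    {"ELEC": "EL-14", "VAL": "30", "POWER": "Max", "WIRING": "1.1", "ENABLED": "Yes"},
]}}


def column_names(records):
    """Collect keys from every record, keeping first-seen order."""
    cols = []
    for record in records:
        for key in record:
            if key not in cols:
                cols.append(key)
    return cols


records = parsed["D1"]["RECORD"]
cols = column_names(records)

# Write to an in-memory buffer here; a real file opened with newline='' works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=cols)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Scanning every record rather than only the first one means a field that appears only in a later record still gets a column.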
Answer from Jordan Renaud on Stack Overflow
Here’s one way:
import pandas as pd
import xmltodict
d = xmltodict.parse("""
<D1>
<RECORD>
<ELEC>EL-13</ELEC>
<VAL>10</VAL>
<POWER>Max</POWER>
<WIRING>2.3</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
<RECORD>
<ELEC>EL-14</ELEC>
<VAL>30</VAL>
<POWER>Max</POWER>
<WIRING>1.1</WIRING>
<ENABLED>Yes</ENABLED>
</RECORD>
</D1>
""")
pd.DataFrame(d).iloc[:,0].explode().apply(pd.Series).reset_index(drop=True).to_csv('out.csv')
# Alternative:
pd.json_normalize(d).stack().explode().apply(pd.Series)
Explanation:
- Convert the XML to a dict.
- Load the result into a DataFrame.
- Use explode to turn the list of dicts into multiple rows.
- Apply pd.Series to generate the required columns from each dict.
- Save the output to CSV.
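The steps listed above can be traced one at a time. This is only a sketch: the literal dict stands in for the xmltodict output of the sample XML (an assumption about its shape), and the intermediate names `frame`, `exploded`, and `table` are illustrative.

```python
import pandas as pd

# Assumed shape of the xmltodict result for the sample XML.
d = {"D1": {"RECORD": [
    {"ELEC": "EL-13", "VAL": "10", "POWER": "Max", "WIRING": "2.3", "ENABLED": "Yes"},
    {"ELEC": "EL-14", "VAL": "30", "POWER": "Max", "WIRING": "1.1", "ENABLED": "Yes"},
]}}

# Load the dict: one column "D1", one row "RECORD" whose cell holds the list.
frame = pd.DataFrame(d)

# explode: one row per element of that list (each element is still a dict).
exploded = frame.iloc[:, 0].explode()

# apply(pd.Series): expand each dict into columns, then renumber the rows.
table = exploded.apply(pd.Series).reset_index(drop=True)

# to_csv without a path returns the CSV text instead of writing a file.
csv_text = table.to_csv(index=False)
```

Because the records are already a list of dicts, `pd.DataFrame(d["D1"]["RECORD"])` should build a comparable table in one step; the chained version mainly helps when the list is buried inside a larger frame.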
Updated Answer:
df1 = pd.json_normalize(d).stack().explode().apply(pd.Series)
# Pass axis by keyword; recent pandas versions no longer accept it positionally
pd.concat([df1.pop('DATA').apply(pd.Series), df1], axis=1)
I would recommend the pandas read_xml() and to_csv() functions, a 3-liner:
Compare the documentation: to_csv, read_xml
import pandas as pd
df = pd.read_xml('employee.xml')
df.to_csv('out.csv', index=False)
Output -> (CSV-file):
id,name,age,salary,division
303,varma,20,120000,3
304,Cyril,20,900000,3
305,Yojith,20,900000,3
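For a version that can be tried without any file on disk, here is a sketch that feeds `read_xml` an in-memory document. The XML layout is an assumption reconstructed from the CSV output above (the root name `employees` in particular is a guess), and `parser="etree"` is chosen only so that lxml does not need to be installed.

```python
import io

import pandas as pd

# Assumed layout of employee.xml, reconstructed from the CSV output above.
xml_text = """<employees>
  <employee><id>303</id><name>varma</name><age>20</age><salary>120000</salary><division>3</division></employee>
  <employee><id>304</id><name>Cyril</name><age>20</age><salary>900000</salary><division>3</division></employee>
</employees>"""

# Each child of the root becomes one row; its child elements become columns.
df = pd.read_xml(io.StringIO(xml_text), parser="etree")

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
```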
I recommend just using libraries because they're usually very optimised. I'll talk about that later. For now, here's a way that utilises the xml.dom.minidom module, which is a part of the Python standard library, so no additional libraries are required.
Edit: rewrote the last part using the standard CSV library instead of manually writing the file, as suggested by a comment. That makes for 2 Python built-in modules, not 1. The original code for the CSV writing will be at the end of the reply, if you're interested.
from xml.dom import minidom
from csv import DictWriter
# Step 1: Read and parse the XML file
# Write it as a string, or open the file and read it
with open('employees.xml', 'r') as xml_file:
    xml_data = xml_file.read()
dom = minidom.parseString(xml_data)
employees = dom.getElementsByTagName('employee')
# Step 2: Extract the required information
data = []
for employee in employees:
    emp_data = {}
    for child in employee.childNodes:
        # Skip whitespace text nodes between the elements
        if child.nodeType == minidom.Node.ELEMENT_NODE:
            # An empty element has no text node, so fall back to ''
            emp_data[child.tagName] = child.firstChild.data if child.firstChild else ''
    data.append(emp_data)
# Step 3: Write the extracted information to a CSV file
with open('output.csv', 'w', newline='') as csv_file:
    fieldnames = ['id', 'name', 'age', 'salary', 'division']
    writer = DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for emp_data in data:
        writer.writerow(emp_data)
Don't reinvent the wheel, just realign it.
— Anthony J. D'Angelo, I think
I recommend NOT using this code. You should really just use lxml. It's extremely simple and easy to use and can handle complex XML structures with nested elements and attributes. Let me know how everything goes!
Original CSV write code without the CSV library:
# Step 3: Write the extracted information to a CSV file
with open('output.csv', 'w') as f:
    f.write('id,name,age,salary,division\n')
    for emp_data in data:
        f.write(f"{emp_data['id']},{emp_data['name']},{emp_data['age']},{emp_data['salary']},{emp_data['division']}\n")
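One reason the `csv` module is the safer choice shows up as soon as a value contains a comma. The sketch below uses a hypothetical row (the name "Varma, Jr." is made up for illustration) and compares a manual join with `csv.DictWriter`.

```python
import csv
import io

# Hypothetical row whose name contains a comma.
row = {"id": "303", "name": "Varma, Jr.", "age": "20"}
fields = ["id", "name", "age"]

# Manual join: the embedded comma is written as-is.
naive_line = ",".join(row[f] for f in fields)

# csv.DictWriter quotes the value so it stays a single field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writerow(row)
quoted_line = buffer.getvalue().strip()

# Reading both lines back shows the difference in field count.
naive_fields = next(csv.reader([naive_line]))
quoted_fields = next(csv.reader([quoted_line]))
print(len(naive_fields), len(quoted_fields))
```

The manually joined line reads back as four fields, while the line written by `DictWriter` reads back as the original three.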
Using pandas and BeautifulSoup you can achieve your expected output easily:
#Code:
import pandas as pd
import itertools
from bs4 import BeautifulSoup as b
with open("file.xml", "r") as f: # opening xml file
    content = f.read()
soup = b(content, "lxml")
pkgeid = [ values.text for values in soup.findAll("pkgeid")]
pkgname = [ values.text for values in soup.findAll("pkgname")]
time = [ values.text for values in soup.findAll("time")]
oper = [ values.text for values in soup.findAll("oper")]
# For Python 3.x use the `zip_longest` function
# For Python 2.x use the `izip_longest` function
data = [item for item in itertools.zip_longest(time, oper, pkgeid, pkgname)]
df = pd.DataFrame(data=data)
df.to_csv("sample.csv",index=False, header=None)
#output in `sample.csv` file will be as follows:
2015-09-16T04:13:20Z,Create_Product,10,BBCWRL
2015-09-16T04:13:20Z,Create_Product,18,CNNINT
2018-04-01T03:30:28Z,Deactivate_Dhct,,
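The empty fields in the last row come from `zip_longest` padding the shorter lists with `None`. A small sketch with made-up values shows the padding, and also a caveat: because each tag is collected into its own list, the rows only line up when the records with missing tags come last.

```python
from itertools import zip_longest

# Hypothetical per-tag lists: three records, but only two carry a package.
times = ["t1", "t2", "t3"]
opers = ["Create_Product", "Create_Product", "Deactivate_Dhct"]
pkgeids = ["10", "18"]
pkgnames = ["BBCWRL", "CNNINT"]

# Shorter lists are padded with None, which pandas writes as empty CSV fields.
rows = list(zip_longest(times, opers, pkgeids, pkgnames))

# If the record without a package comes first, the package values shift up
# and end up paired with the wrong record.
shifted = list(zip_longest(
    ["t1", "t2"],
    ["Deactivate_Dhct", "Create_Product"],
    ["10"],
    ["BBCWRL"],
))
```

If records can lack tags anywhere in the file, iterating record by record and looking up each tag inside the record avoids this misalignment.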
Using Pandas, parsing all xml fields.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("file.xml")
root = tree.getroot()
get_range = lambda col: range(len(col))
l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]
df = pd.DataFrame.from_dict(l)
df.to_csv('file.csv')
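To see what the list comprehension builds, here is a sketch on a small in-memory document. The XML content and the element names are hypothetical; the only assumption about the structure is that each child of the root is one record whose own children are the fields. The CSV is written with the standard library here, so pandas is not needed for the illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: each child of the root is one record.
xml_text = """<packages>
  <package><pkgeid>10</pkgeid><pkgname>BBCWRL</pkgname></package>
  <package><pkgeid>18</pkgeid><pkgname>CNNINT</pkgname></package>
</packages>"""

root = ET.fromstring(xml_text)

# One dict per record: tag name -> text of each direct child element.
rows = [{field.tag: field.text for field in record} for record in root]

# The keys of the first record serve as the header in this sketch.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Iterating an element directly yields its child elements, which is why the index-based `range(len(...))` lookup is not strictly needed.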
Try the following
from bs4 import BeautifulSoup as bs
data = list()
with open("data.xml") as xml:
    data_xml = bs(xml, "html.parser")
for record in data_xml.find_all("record"):
    for ts in record.find_all("ts"):
        id_, date, time, value = record.get("id"), ts.get("date"), ts.get("time"), ts.text
        data.append(", ".join([id_, date, time, value]) + "\n")
with open("data.csv", "w") as csv:
    csv.write("ID, date, time, value\n")
    csv.writelines(data)
To use lxml, you can simply pass the string to etree.HTML(). With the XPath //record/ts (starting with a double forward slash), you can fetch all of your ts elements. The parent's id can be accessed by calling .getparent() on each ts node and then reading its attribute.
To convert XML to CSV, I would recommend using Python's built-in csv module. You could use a plain file writer, but the csv writer handles quoting and escaping for you, and it's cleaner.
In general, you have one method that handles everything. I would recommend splitting the logic into functions; think Single Responsibility. In the solution below I've converted the XML nodes into a namedtuple and then written the namedtuples to CSV. It's a lot easier to maintain and read (i.e. there's one place that sets the header text and one place that populates the data).
from lxml import etree
import csv  # part of the standard library, no installation needed
from collections import namedtuple
Record = namedtuple('Record', ['id', 'date', 'time', 'value']) # Model to store records.
def CreateCsvFile(results):
    with open('results.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(Record._fields)) # use the namedtuple fields for the headers
        writer.writeheader()
        writer.writerows([x._asdict() for x in results]) # To use DictWriter, the namedtuple has to be converted to dictionary
def FormatRecord(xmlNode):
    return Record(xmlNode.getparent().attrib['id'], xmlNode.attrib["date"], xmlNode.attrib["time"], xmlNode.text)
def Main(html):
    xmlTree = etree.HTML(html)
    results = [FormatRecord(xmlNode) for xmlNode in xmlTree.xpath('//record/ts')] # the leading double slash matches record/ts nodes anywhere in the document
    CreateCsvFile(results)
if __name__ == '__main__':
    Main("""<record id="idOne">
<ts date="2019-07-03" time="15:28:41.720440">5</ts>
<ts date="2019-07-03" time="15:28:42.629959">10</ts>
<ts date="2019-07-03" time="15:28:43.552677">15</ts>
<ts date="2019-07-03" time="15:28:43.855345">20</ts>
</record>
<record id="idTwo">
<ts date="2019-07-03" time="15:28:45.072922">30</ts>
<ts date="2019-07-03" time="15:28:45.377087">35</ts>
<ts date="2019-07-03" time="15:28:46.316321">40</ts>
<ts date="2019-07-03" time="15:28:47.527960">45</ts>
</record>""")


