Assume we have a dataframe similar to your example:
import pandas as pd
df = pd.DataFrame.from_dict({'FILE_CREATION_DATE': ['2017-09-06'], 'FILE_DATA': ['''<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>-
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>''']})
df
FILE_CREATION_DATE FILE_DATA
0 2017-09-06 <?xml version="1.0" encoding="utf-8" ?>\n<REPO...
Let's get the possible column names from your XML. We'll just take the first row and assume the rest have the same structure.
import xml.etree.ElementTree as ET
root = ET.fromstring(df['FILE_DATA'][0])
# we need to get rid of the XML namespace, therefore the split by }
columns = [c.tag.split('}', 1)[-1] for c in root]
# convert each XML document into a dictionary and assign it to the columns
df[columns] = df['FILE_DATA'].apply(lambda x: pd.Series({c.tag.split('}', 1)[-1]:c.text for c in ET.fromstring(x)}))
df.drop('FILE_DATA', axis=1, inplace=True)
df
FILE_CREATION_DATE CRSREPORTTIMESTAMP AGENCYIDENTIFIER AGENCYNAME
0 2017-09-06 2020-10-08... MILWAUKEE Milwaukee Police Department
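The snippet above reads the column names from the first row only. If some rows might be missing a tag, a variant along the following lines avoids that assumption by building one dictionary per row and letting pandas take the union of the keys. This is only a sketch: the helper name xml_to_dict, the example.com namespace and the shortened second report are made up for illustration.

```python
import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_dict(xml_text):
    """Map each child of the root element to {local tag name: text}."""
    root = ET.fromstring(xml_text)
    # split('}', 1)[-1] strips a leading '{namespace}' prefix if there is one
    return {child.tag.split('}', 1)[-1]: child.text for child in root}

# Hypothetical data: the second report has no AGENCYNAME element.
df = pd.DataFrame({
    'FILE_CREATION_DATE': ['2017-09-06', '2017-09-07'],
    'FILE_DATA': [
        '<REPORT xmlns="http://example.com/ns">'
        '<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>'
        '<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>',
        '<REPORT xmlns="http://example.com/ns">'
        '<AGENCYIDENTIFIER>CHICAGO</AGENCYIDENTIFIER></REPORT>',
    ],
})

# One dict per row; tags missing from a row become NaN in that row.
parsed = pd.DataFrame(df['FILE_DATA'].map(xml_to_dict).tolist(), index=df.index)
result = df.drop(columns='FILE_DATA').join(parsed)
print(result)
```

Building the parsed frame separately and joining it on the index keeps the original dataframe untouched until the last step.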
Answer from Maximilian Peters on Stack Overflow
I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML my brain goes blank and I can never remember all the goofy intricacies of dealing with it.
The file looks roughly like this:
<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
<!--Build 18.0.1.69-->
<columns>
<column friendlyName="time" name="time" />
<column friendlyName="Direction" name="Direction" />
<column friendlyName="SQL" name="SQL" />
<column friendlyName="ProcessID" name="ProcessID" />
<column friendlyName="ThreadID" name="ThreadID" />
<column friendlyName="TimeSpan" name="TimeSpan" />
<column friendlyName="User" name="User" />
<column friendlyName="HTTPSessionID" name="HTTPSessionID" />
<column friendlyName="HTTPForward" name="HTTPForward" />
<column friendlyName="SessionID" name="SessionID" />
<column friendlyName="SessionGUID" name="SessionGUID" />
<column friendlyName="Datasource" name="Datasource" />
<column friendlyName="Sequence" name="Sequence" />
<column friendlyName="LocalSequence" name="LocalSequence" />
<column friendlyName="Message" name="Message" />
<column friendlyName="AppPoolName" name="AppPoolName" />
</columns>
<rows>
<row>
<col name="time">11/14/2022 23:31:12</col>
<col name="TimeSpan">0 ms</col>
<col name="ThreadID">0x00000025</col>
<col name="User">USERNAME</col>
<col name="HTTPSessionID"></col>
<col name="HTTPForward">20.186.0.0</col>
<col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
<col name="SessionID">6096783</col>
<col name="Datasource">datasourceName</col>
<col name="AppPoolName">C 1801AppServer Ext</col>
<col name="Direction">Out</col>
<col name="sql">UPDATE SET </col>
<col name="Sequence">236419</col>
<col name="LocalSequence">103825</col>
</row>
<row>
<col name="time">11/14/2022 23:31:12</col>
<col name="TimeSpan">N/A</col>
<col name="ThreadID">0x00000025</col>
<col name="User">USERNAME</col>
<col name="HTTPSessionID"></col>
<col name="HTTPForward">20.186.0.0</col>
<col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
<col name="SessionID">6096783</col>
<col name="Datasource">datasourceName</col>
<col name="AppPoolName">C 1801AppServer Ext</col>
<col name="Direction">In</col>
<col name="sql">UPDATE SET</col>
<col name="Sequence">236420</col>
<col name="LocalSequence">103826</col>
</row>
</rows>
</diagnosticsLog>
I want to convert that into a DataFrame where the column names become the columns and each <row> element becomes a row. I'm at a loss on how to do this.
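One possible approach to the question above, sketched with the standard library's ElementTree: read the declared names from the column elements, build one dictionary per row from its col elements, and hand the list to pandas. The XML here is a shortened stand-in for the log (three columns only), and the case-insensitive name matching is an assumption made because the sample declares "SQL" but the rows use "sql".

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened, hypothetical version of the log shown above.
xml_text = """<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
    </row>
  </rows>
</diagnosticsLog>"""

root = ET.fromstring(xml_text)

# Declared column names, in document order.
declared = [c.get("name") for c in root.findall("./columns/column")]
# Map lower-cased names back to the declared spelling ("sql" -> "SQL").
canonical = {name.lower(): name for name in declared}

records = []
for row in root.findall("./rows/row"):
    record = {}
    for col in row.findall("col"):
        name = col.get("name")
        record[canonical.get(name.lower(), name)] = col.text
    records.append(record)

# columns=declared fixes the column order; columns a row lacks become NaN.
df = pd.DataFrame(records, columns=declared)
print(df)
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring(xml_text). Note that passing columns=declared also drops any col whose name is not declared in the columns element.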
A solution somewhat similar to Maximilian Peters's, but using lxml, XPath (taking namespaces into account), an additional report from Chicago, and map():
import pandas as pd
from lxml import etree
data = [["2017-09-06",'<?xml version="1.0" encoding="utf-8" ?><REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"><CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-<AGENCYIDENTIFIER>MILWAULLKEE</AGENCYIDENTIFIER>-<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>'],
["2017-09-07", '<?xml version="1.0" encoding="utf-8" ?><REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"><CRSREPORTTIMESTAMP>2021-11-08T06:49:31.813812</CRSREPORTTIMESTAMP>-<AGENCYIDENTIFIER>CHICAGO</AGENCYIDENTIFIER>-<AGENCYNAME>Chicago Police Department</AGENCYNAME></REPORT>']]
columns = ["FILE_CREATION_DATE","FILE_DATA"]
# build the dataframe that the rest of the snippet works on
police_df = pd.DataFrame(data, columns=columns)
def xpath_extract(my_str):
    # encode to bytes: lxml rejects str input that carries an encoding declaration
    doc = etree.XML(my_str.encode())
    # local-name() matches the tag regardless of its namespace
    a,b,c = [elem.text for elem in doc.xpath('//*[local-name()="REPORT"]//*')]
    return a,b,c
police_df['TIME'], police_df['AGENCY_ID'], police_df['AGENCY_NAME'] = \
    zip(*police_df['FILE_DATA'].map(xpath_extract))
police_df.drop('FILE_DATA', axis=1)
Output:
FILE_CREATION_DATE TIME AGENCY_ID AGENCY_NAME
0 2017-09-06 2020-10-08T06:49:31.813812 MILWAULLKEE Milwaukee Police Department
1 2017-09-07 2021-11-08T06:49:31.813812 CHICAGO Chicago Police Department
You can use the xml package (from the Python standard library) to convert XML to a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or a file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io
def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict
xml_data = io.StringIO(u'''YOUR XML STRING HERE''')
etree = ET.parse(xml_data) #create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
Have a look at the ElementTree tutorial provided in the xml library documentation.
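To see the two generators working together, here is a small self-contained run. The XML is invented for illustration (an authors root, a name attribute on each author and an id attribute on each document are all assumptions); the original input is not shown in this answer.

```python
import io
import xml.etree.ElementTree as ET
import pandas as pd

def iter_docs(author):
    """Yield one dict per <document>, merged with the author's attributes."""
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

def iter_author(tree):
    """Walk every <author> in the tree and yield its document rows."""
    for author in tree.iter('author'):
        yield from iter_docs(author)

# Hypothetical input with two authors under a common root.
xml_data = io.StringIO('''<authors>
  <author name="Ann">
    <document id="1">first text</document>
    <document id="2">second text</document>
  </author>
  <author name="Bob">
    <document id="3">third text</document>
  </author>
</authors>''')

tree = ET.parse(xml_data)  # ElementTree object
doc_df = pd.DataFrame(list(iter_author(tree)))
print(doc_df)
```

Each row carries the author's attributes, the document's attributes and the document text, so an attribute that exists on both elements would be overwritten by the document's value.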
As of v1.3, you can simply use:
pandas.read_xml(path_or_file)
Use [] to filter and reorganize columns:
cols = ['Application_ID', 'Product_Type', 'Product_ID']
df = pd.read_xml('product.xml')[cols]
print(df)
# Output:
Application_ID Product_Type Product_ID
0 BBC#:1010 1 32
1 NBA#:1111 2 22
2 BBC#:1212 1 63
3 NBA#:2210 2 22
If you want to replace '_' from your column names by ' ':
df.columns = df.columns.str.replace('_', ' ')
print(df)
# Output:
Application ID Product Type Product ID
0 BBC#:1010 1 32
1 NBA#:1111 2 22
2 BBC#:1212 1 63
3 NBA#:2210 2 22
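The same steps can be tried without a file by wrapping an XML string in io.StringIO. The sketch below uses a made-up two-row document (the data and row element names and the Notes column are assumptions, not the original product.xml) and passes parser="etree" so that lxml is not required.

```python
import io
import pandas as pd

# Hypothetical stand-in for product.xml: each child of the root is one record.
xml_text = """<data>
  <row>
    <Product_ID>32</Product_ID>
    <Notes>first</Notes>
    <Application_ID>BBC#:1010</Application_ID>
    <Product_Type>1</Product_Type>
  </row>
  <row>
    <Product_ID>22</Product_ID>
    <Notes>second</Notes>
    <Application_ID>NBA#:1111</Application_ID>
    <Product_Type>2</Product_Type>
  </row>
</data>"""

cols = ['Application_ID', 'Product_Type', 'Product_ID']

# parser="etree" uses the standard library parser instead of lxml.
df = pd.read_xml(io.StringIO(xml_text), parser="etree")[cols]

# Replace underscores in the column names with spaces.
df.columns = df.columns.str.replace('_', ' ')
print(df)
```

Selecting with [cols] both drops the unwanted Notes column and puts the remaining columns in the requested order.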
As of pandas 1.3.0 there is a read_xml() function that makes reading XML data into pandas much easier.
Once you upgrade to pandas 1.3.0 or later, you can simply use:
df = pd.read_xml("___XML_FILEPATH___")
print(df)
(Note that in the XML sample above the <Rowset> tag needs to be closed)