You can use xml.etree.ElementTree (from the Python standard library) to convert the XML to a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

xml_data = io.StringIO(u'''YOUR XML STRING HERE''')
etree = ET.parse(xml_data)  # create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
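To make the multi-author case concrete, here is a minimal sketch that runs on a small made-up document. The sample XML, its attribute names (name, year) and the combined helper iter_rows are assumptions for illustration only; they are not taken from the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two authors, three documents in total.
SAMPLE_XML = """<authors>
  <author name="A. Writer">
    <document year="2001">first text</document>
    <document year="2002">second text</document>
  </author>
  <author name="B. Writer">
    <document year="2010">third text</document>
  </author>
</authors>"""


def iter_rows(root):
    """Yield one flat dict per <document>, merged with its author's attributes."""
    for author in root.iter("author"):
        for doc in author.iter("document"):
            row = dict(author.attrib)  # copy, so the element itself is not mutated
            row.update(doc.attrib)     # document attributes win on name clashes
            row["data"] = doc.text
            yield row


root = ET.fromstring(SAMPLE_XML)
rows = list(iter_rows(root))
```

Each entry in rows is a plain dict, so passing rows to pd.DataFrame should give one DataFrame row per document, with the author attributes repeated on every row.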
Have a look at the ElementTree tutorial provided in the xml library documentation.
As of pandas v1.3, you can simply use:
pandas.read_xml(path_or_file)
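As a rough illustration of that call, the sketch below reads a small made-up document from an in-memory buffer. The sample XML is an assumption, and parser="etree" is chosen here only so the example does not depend on lxml being installed (lxml is the default parser).

```python
import io

import pandas as pd

# Hypothetical sample data: each child of the root becomes one row,
# and each grandchild element becomes a column.
SAMPLE_XML = """<items>
  <item><id>1</id><title>first</title></item>
  <item><id>2</id><title>second</title></item>
</items>"""

# The default xpath selects the direct children of the root element.
df = pd.read_xml(io.StringIO(SAMPLE_XML), parser="etree")
```

For documents that do not follow this flat row/column layout, read_xml also takes an xpath argument to pick the repeating element, although deeply nested files may still need manual parsing.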
I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.
The file looks roughly like this:
<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <!--Build 18.0.1.69-->
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="ProcessID" name="ProcessID" />
    <column friendlyName="ThreadID" name="ThreadID" />
    <column friendlyName="TimeSpan" name="TimeSpan" />
    <column friendlyName="User" name="User" />
    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />
    <column friendlyName="HTTPForward" name="HTTPForward" />
    <column friendlyName="SessionID" name="SessionID" />
    <column friendlyName="SessionGUID" name="SessionGUID" />
    <column friendlyName="Datasource" name="Datasource" />
    <column friendlyName="Sequence" name="Sequence" />
    <column friendlyName="LocalSequence" name="LocalSequence" />
    <column friendlyName="Message" name="Message" />
    <column friendlyName="AppPoolName" name="AppPoolName" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">0 ms</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
      <col name="Sequence">236419</col>
      <col name="LocalSequence">103825</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">N/A</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
      <col name="Sequence">236420</col>
      <col name="LocalSequence">103826</col>
    </row>
  </rows>
</diagnosticsLog>

I want to convert that into a DataFrame where the declared column names become the columns and each <row> element becomes one row. I'm at a loss on how to do this.
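One possible approach, sketched with only the standard library: read the declared names from columns/column, then turn every rows/row into a dict keyed by each col's name attribute. The function name parse_diagnostics_log and the trimmed sample below are hypothetical. The case-insensitive name matching is also an assumption, added because the rows use name="sql" while the column list declares "SQL".

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample in the same shape as the file above.
SAMPLE_LOG = """<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">In</col>
    </row>
  </rows>
</diagnosticsLog>"""


def parse_diagnostics_log(xml_text):
    """Return (column_names, records) for a diagnosticsLog document."""
    root = ET.fromstring(xml_text)
    # Declared column order comes from <columns>/<column name="...">.
    names = [c.get("name") for c in root.findall("./columns/column")]
    # Assumption: match names case-insensitively so "sql" lands under "SQL".
    canonical = {n.lower(): n for n in names}
    records = []
    for row in root.findall("./rows/row"):
        record = {}
        for col in row.findall("col"):
            key = col.get("name")
            # An empty <col></col> has text None; it is kept as None here.
            record[canonical.get(key.lower(), key)] = col.text
        records.append(record)
    return names, records


names, records = parse_diagnostics_log(SAMPLE_LOG)
```

From there, pd.DataFrame(records, columns=names) should give one row per row element, with NaN in any declared column that a given row does not contain. For a file on disk, ET.parse(path).getroot() can replace ET.fromstring.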