pandas parse xml column

stackoverflow.com › questions › 64704629 › parse-xml-in-a-dataframe-column

Assume we have a dataframe similar to your example:

import pandas as pd
df = pd.DataFrame.from_dict({'FILE_CREATION_DATE': ['2017-09-06'], 'FILE_DATA': ['''<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>''']})

df

  FILE_CREATION_DATE                                          FILE_DATA
0         2017-09-06  <?xml version="1.0" encoding="utf-8" ?>\n<REPO...

Let's get the possible column names from your XML. We'll just take the first row and assume the remaining rows contain the same elements.

import xml.etree.ElementTree as ET

root = ET.fromstring(df['FILE_DATA'][0])
# tags look like '{namespace-uri}TAG'; splitting on '}' drops the namespace part
columns = [c.tag.split('}', 1)[-1] for c in root]

# convert each XML document into a dictionary and assign the values to the new columns
df[columns] = df['FILE_DATA'].apply(lambda x: pd.Series({c.tag.split('}', 1)[-1]: c.text for c in ET.fromstring(x)}))
# the raw XML column is no longer needed
df = df.drop(columns='FILE_DATA')
df


  FILE_CREATION_DATE CRSREPORTTIMESTAMP AGENCYIDENTIFIER                   AGENCYNAME
0         2017-09-06      2020-10-08...        MILWAUKEE  Milwaukee Police Department

Answer from Maximilian Peters on Stack Overflow
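The answer above assumes every row carries the same XML elements as the first one. As a minimal sketch of one way to relax that assumption (not part of the original answer), each document can be turned into a dictionary and the dictionaries handed to the DataFrame constructor, which fills missing elements with NaN. The helper name xml_to_dict, the sample frame, and the urn:example namespace are made up for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_dict(xml_text):
    """Map each direct child of the root element to {local tag name: text}."""
    root = ET.fromstring(xml_text)
    # '{namespace-uri}TAG' -> 'TAG'
    return {child.tag.split('}', 1)[-1]: child.text for child in root}


# Hypothetical sample data: the second document has an element the first lacks.
sample = pd.DataFrame({
    'FILE_CREATION_DATE': ['2017-09-06', '2017-09-07'],
    'FILE_DATA': [
        '<REPORT xmlns="urn:example">'
        '<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER></REPORT>',
        '<REPORT xmlns="urn:example">'
        '<AGENCYIDENTIFIER>MADISON</AGENCYIDENTIFIER>'
        '<AGENCYNAME>Madison Police Department</AGENCYNAME></REPORT>',
    ],
})

# One dict per row; the union of all keys becomes the set of columns.
parsed = pd.DataFrame(
    [xml_to_dict(x) for x in sample['FILE_DATA']],
    index=sample.index,
)

# Replace the raw XML column with the parsed columns.
result = sample.drop(columns='FILE_DATA').join(parsed)
print(result)
```

Because the columns come from all rows rather than only the first, a document with extra or missing elements does not break the assignment.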

Pandas

pandas.pydata.org › docs › reference › api › pandas.read_xml.html

pandas.read_xml - pandas documentation

converters: Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

parse_dates: bool or list of int or names or list of lists or dict, default False
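A small sketch of how these two parameters might be used with pandas.read_xml. It assumes pandas 1.5 or later (where read_xml accepts converters and parse_dates) and uses parser='etree' so that lxml is not required. The sample XML and its element names (when, code, count) are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document: each <row> becomes one DataFrame row.
xml = """<data>
  <row><when>2022-11-14</when><code>007</code><count>3</count></row>
  <row><when>2022-11-15</when><code>042</code><count>5</count></row>
</data>"""

events = pd.read_xml(
    StringIO(xml),
    parser='etree',            # standard-library parser instead of lxml
    converters={'code': str},  # keep 'code' as text so leading zeros survive
    parse_dates=['when'],      # parse 'when' into datetime values
)

print(events.dtypes)
print(events)
```

Without the converter, a column such as code would typically be inferred as an integer column.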

reddit.com › r/learnpython › parsing xml into a pandas dataframe

r/learnpython on Reddit: Parsing XML into a Pandas dataframe

December 9, 2022

I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML my brain goes blank and I can never remember all the goofy intricacies of dealing with it.

The file looks roughly like this:

<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <!--Build 18.0.1.69-->
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="ProcessID" name="ProcessID" />
    <column friendlyName="ThreadID" name="ThreadID" />
    <column friendlyName="TimeSpan" name="TimeSpan" />
    <column friendlyName="User" name="User" />
    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />
    <column friendlyName="HTTPForward" name="HTTPForward" />
    <column friendlyName="SessionID" name="SessionID" />
    <column friendlyName="SessionGUID" name="SessionGUID" />
    <column friendlyName="Datasource" name="Datasource" />
    <column friendlyName="Sequence" name="Sequence" />
    <column friendlyName="LocalSequence" name="LocalSequence" />
    <column friendlyName="Message" name="Message" />
    <column friendlyName="AppPoolName" name="AppPoolName" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">0 ms</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
      <col name="Sequence">236419</col>
      <col name="LocalSequence">103825</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">N/A</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
      <col name="Sequence">236420</col>
      <col name="LocalSequence">103826</col>
    </row>
  </rows>
</diagnosticsLog>

I want to convert this into a DataFrame where the declared column names become the columns and each <row> element becomes a row. I'm at a loss on how to do this.
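One possible approach, sketched with the standard-library ElementTree parser: build one dictionary per <row>, keyed by the name attribute of each <col>, and pass the list of dictionaries to the DataFrame constructor. The XML below is a trimmed-down stand-in for the log in the question; for a real file, ET.parse(path).getroot() would replace ET.fromstring. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Trimmed-down sample with the same shape as the log in the question.
xml_text = """<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="Sequence" name="Sequence" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Sequence">236419</col>
      <col name="Direction">Out</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Sequence">236420</col>
      <col name="Direction">In</col>
    </row>
  </rows>
</diagnosticsLog>"""

root = ET.fromstring(xml_text)

# Column order as declared in the <columns> block.
declared = [c.get('name') for c in root.findall('./columns/column')]

# One dict per <row>, keyed by each <col>'s name attribute.
records = [
    {col.get('name'): col.text for col in row.findall('col')}
    for row in root.findall('./rows/row')
]

log_df = pd.DataFrame(records)

# Declared columns first (when present in the data), then any undeclared extras.
ordered = [c for c in declared if c in log_df.columns]
ordered += [c for c in log_df.columns if c not in declared]
log_df = log_df[ordered]

print(log_df)
```

Two details of the sample log are worth keeping in mind with this sketch. Names are compared case-sensitively, so the declared SQL column and the sql values in the rows would not line up, and the sql column would land among the extras. Empty elements such as HTTPSessionID have no text, so they come through as None.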
