Assume we have a dataframe similar to your example:

import pandas as pd
df = pd.DataFrame.from_dict({'FILE_CREATION_DATE': ['2017-09-06'], 'FILE_DATA': ['''<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>-
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>''']})

df

FILE_CREATION_DATE  FILE_DATA
0   2017-09-06      <?xml version="1.0" encoding="utf-8" ?>\n<REPO...

Let's get the possible column names from your XML. We'll just take the first row and assume the rest are identical.

import xml.etree.ElementTree as ET

root = ET.fromstring(df['FILE_DATA'][0])
# strip the XML namespace from each tag, hence the split on '}'
columns = [c.tag.split('}', 1)[-1] for c in root]

# convert each XML document into a Series keyed by tag name and assign it to the new columns
df[columns] = df['FILE_DATA'].apply(lambda x: pd.Series({c.tag.split('}', 1)[-1]: c.text for c in ET.fromstring(x)}))
df.drop('FILE_DATA', axis=1, inplace=True)
df


   FILE_CREATION_DATE  CRSREPORTTIMESTAMP  AGENCYIDENTIFIER  AGENCYNAME
0  2017-09-06          2020-10-08...       MILWAUKEE         Milwaukee Police Department
Answer from Maximilian Peters on Stack Overflow
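The answer above takes its column list from the first row only. If some rows might carry extra or missing tags, a variant that builds one dict per row and lets pandas align the keys may be more forgiving. This is only a sketch: the `xml_to_dict` helper, the shortened `urn:example` namespace, and the two sample reports are made up for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_dict(xml_text):
    """Map each direct child of the root to {tag without namespace: text}."""
    root = ET.fromstring(xml_text)
    return {child.tag.split('}', 1)[-1]: child.text for child in root}


# Hypothetical data: the second report deliberately lacks AGENCYNAME.
df = pd.DataFrame({
    'FILE_CREATION_DATE': ['2017-09-06', '2017-09-07'],
    'FILE_DATA': [
        '<REPORT xmlns="urn:example">'
        '<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>'
        '<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>'
        '</REPORT>',
        '<REPORT xmlns="urn:example">'
        '<AGENCYIDENTIFIER>CHICAGO</AGENCYIDENTIFIER>'
        '</REPORT>',
    ],
})

# One dict per row; pandas fills tags that a row does not have with NaN.
parsed = pd.DataFrame(df['FILE_DATA'].map(xml_to_dict).tolist(), index=df.index)
result = df.drop(columns='FILE_DATA').join(parsed)
print(result)
```

Because the columns come from the union of all keys, a tag that appears only in later rows still gets its own column.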
🌐
Pandas
pandas.pydata.org › docs › reference › api › pandas.read_xml.html
pandas.read_xml — pandas documentation - PyData
Dict of functions for converting values in certain columns. Keys can either be integers or column labels. parse_dates: bool or list of int or names or list of lists or dict, default False
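As a rough illustration of those two options, the sketch below passes `converters` and `parse_dates` to `pd.read_xml`. It assumes pandas 1.5 or later (where, as far as I know, `read_xml` gained these parameters) and uses `parser="etree"` so that lxml is not required. The sample XML is invented.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: each <row> under the root becomes one DataFrame row.
xml = """<?xml version="1.0" encoding="utf-8"?>
<rows>
  <row><time>2022-11-14 23:31:12</time><Direction>Out</Direction><Sequence>236419</Sequence></row>
  <row><time>2022-11-14 23:31:13</time><Direction>In</Direction><Sequence>236420</Sequence></row>
</rows>"""

df = pd.read_xml(
    StringIO(xml),
    parser="etree",                        # standard-library parser, no lxml needed
    parse_dates=["time"],                  # parse this column as datetimes
    converters={"Direction": str.lower},   # apply a function to every value in this column
)
print(df.dtypes)
```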
🌐
Reddit
reddit.com › r/learnpython › parsing xml into a pandas dataframe
r/learnpython on Reddit: Parsing XML into a Pandas dataframe
December 9, 2022 -

I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.

The file looks roughly like this

<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <!--Build 18.0.1.69-->
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="ProcessID" name="ProcessID" />
    <column friendlyName="ThreadID" name="ThreadID" />
    <column friendlyName="TimeSpan" name="TimeSpan" />
    <column friendlyName="User" name="User" />
    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />
    <column friendlyName="HTTPForward" name="HTTPForward" />
    <column friendlyName="SessionID" name="SessionID" />
    <column friendlyName="SessionGUID" name="SessionGUID" />
    <column friendlyName="Datasource" name="Datasource" />
    <column friendlyName="Sequence" name="Sequence" />
    <column friendlyName="LocalSequence" name="LocalSequence" />
    <column friendlyName="Message" name="Message" />
    <column friendlyName="AppPoolName" name="AppPoolName" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">0 ms</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
      <col name="Sequence">236419</col>
      <col name="LocalSequence">103825</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">N/A</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
      <col name="Sequence">236420</col>
      <col name="LocalSequence">103826</col>
    </row>
  </rows>
</diagnosticsLog>

I want to convert that into a DataFrame where the column names become the columns and each <row> becomes a row. I'm at a loss on how to do this.
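For the layout in the question, one possible approach is to read the declared names from <columns> and build one dict per <row>, keyed by each <col>'s name attribute. A minimal sketch with the standard-library ElementTree follows. The `log_to_frame` helper is a made-up name, and the XML is a shortened copy of the sample with only a few columns kept.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened, illustrative copy of the log from the question.
xml_text = """<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="Sequence" name="Sequence" />
    <column friendlyName="Message" name="Message" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">Out</col>
      <col name="Sequence">236419</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">In</col>
      <col name="Sequence">236420</col>
    </row>
  </rows>
</diagnosticsLog>"""


def log_to_frame(text):
    root = ET.fromstring(text.encode("utf-8"))
    # Column order as declared in <columns>.
    declared = [c.get("name") for c in root.findall("./columns/column")]
    # One dict per <row>: {name attribute: element text}.
    records = [
        {col.get("name"): col.text for col in row.findall("col")}
        for row in root.findall("./rows/row")
    ]
    df = pd.DataFrame(records)
    # Keep names that appear in rows but were never declared, after the declared ones.
    extra = [c for c in df.columns if c not in declared]
    # reindex puts declared columns first and adds missing ones as NaN.
    return df.reindex(columns=declared + extra)


df = log_to_frame(xml_text)
print(df)
```

Note that the sample declares the column as SQL while the rows use sql. With this sketch the lowercase name would end up as an extra column after the declared ones, so normalizing the case first may be worthwhile.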

2 of 2
2

A somewhat similar solution to @Maximilian Peters's, but using lxml, XPath (taking namespaces into account), an additional report from Chicago, and map():

import pandas as pd
from lxml import etree

data = [["2017-09-06",'<?xml version="1.0" encoding="utf-8" ?><REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"><CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-<AGENCYIDENTIFIER>MILWAULLKEE</AGENCYIDENTIFIER>-<AGENCYNAME>Milwaukee Police Department</AGENCYNAME></REPORT>'],
        ["2017-09-07", '<?xml version="1.0" encoding="utf-8" ?><REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"><CRSREPORTTIMESTAMP>2021-11-08T06:49:31.813812</CRSREPORTTIMESTAMP>-<AGENCYIDENTIFIER>CHICAGO</AGENCYIDENTIFIER>-<AGENCYNAME>Chicago Police Department</AGENCYNAME></REPORT>']]
columns = ["FILE_CREATION_DATE","FILE_DATA"]
# build the dataframe that the rest of the snippet works on
police_df = pd.DataFrame(data, columns=columns)

def xpath_extract(my_str):
    # lxml wants bytes when the string carries an encoding declaration
    doc = etree.XML(my_str.encode())
    # local-name() sidesteps the default namespace; exactly three child elements are expected
    a,b,c = [elem.text for elem in doc.xpath('//*[local-name()="REPORT"]//*')]
    return a,b,c

police_df['TIME'], police_df['AGENCY_ID'], police_df['AGENCY_NAME'] = \
     zip(*police_df['FILE_DATA'].map(xpath_extract))

police_df.drop('FILE_DATA', axis=1)

Output:

   FILE_CREATION_DATE  TIME                        AGENCY_ID    AGENCY_NAME
0  2017-09-06          2020-10-08T06:49:31.813812  MILWAULLKEE  Milwaukee Police Department
1  2017-09-07          2021-11-08T06:49:31.813812  CHICAGO      Chicago Police Department
🌐
Medium
medium.com › @robertopreste › from-xml-to-pandas-dataframes-9292980b1c1c
From XML to Pandas dataframes. How to parse XML files to obtain proper… | by Roberto Preste | Medium
August 25, 2019 - The downside to this approach is ... to hard-code column names accordingly. We can try to convert this code to a more useful and versatile function, without having to hard-code any values: import pandas as pd import xml.etree.ElementTree as et def parse_XML(xml_file, df_cols): ...
🌐
Pandas
pandas.pydata.org › pandas-docs › version › 1.4 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 1.4.4 documentation
Note: if XML document uses default namespace denoted as xmlns=’<URI>’ without a prefix, you must assign any temporary namespace prefix such as ‘doc’ to the URI in order to parse underlying nodes and/or attributes. For example, ... Parse only the child elements at the specified xpath. By default, all child elements and non-empty text nodes are returned. ... Parse only the attributes at the specified xpath. By default, all attributes are returned. ... Column ...
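To make the note about default namespaces concrete, here is a small sketch that binds a temporary `doc` prefix to the namespace URI and uses it in `xpath`. The wrapping <REPORTS> element and the two reports are hypothetical. `parser="etree"` is chosen only to avoid the lxml dependency, which is why the path is written in the relative `.//` form.

```python
from io import StringIO

import pandas as pd

NS_URI = "http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"

# Hypothetical document: several <REPORT> nodes under one root, all in a default namespace.
xml = f"""<?xml version="1.0" encoding="utf-8"?>
<REPORTS xmlns="{NS_URI}">
  <REPORT>
    <AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
    <AGENCYNAME>Milwaukee Police Department</AGENCYNAME>
  </REPORT>
  <REPORT>
    <AGENCYIDENTIFIER>CHICAGO</AGENCYIDENTIFIER>
    <AGENCYNAME>Chicago Police Department</AGENCYNAME>
  </REPORT>
</REPORTS>"""

df = pd.read_xml(
    StringIO(xml),
    xpath=".//doc:REPORT",        # 'doc' is an arbitrary temporary prefix
    namespaces={"doc": NS_URI},   # bound to the document's default namespace URI
    parser="etree",
)
print(df)
```

Without the prefix, a plain `.//REPORT` would not match, since every element in the document lives in the default namespace.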
🌐
PyPI
pypi.org › project › pandas-read-xml
pandas-read-xml
🌐
Stack Abuse
stackabuse.com › reading-and-writing-xml-files-in-python-with-pandas
Reading and Writing XML Files in Python with Pandas
August 21, 2024 - Note: When reading data from XML, we have to transpose the DataFrame, as the data list's sub-elements are written in columns. We need to write them as rows in the DataFrame. Let's look at the code to demonstrate use of xml.etree.ElementTree:

import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('properties.xml', 'r').read()  # Read file
root = ET.XML(xml_data)  # Parse XML

data = []
cols = []
for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)

df = pd.DataFrame(data).T  # Write in DF and transpose it
df.columns = cols  # Update column names
print(df)
🌐
TutorialsPoint
tutorialspoint.com › python_pandas › python_pandas_parsing_xml_file.htm
Python Pandas - Parsing XML File
This example shows how to parse a nested XML structure representing a bookstore. Each <book> node has child elements like title, author, year, and price. By using the xpath parameter we can easily locate and extract these <book> nodes and their contents into a DataFrame. import pandas as pd from io import StringIO # Create a String representing XML data xml = """<?xml version="1.0" encoding="UTF-8"?> <bookstore> <book category="cooking"> <title lang="en">Everyday Italian</title> <author>Giada De Laurentiis</author> <year>2005</year> <price>30.00</price> </book> <book category="children"> <title lang="en">Harry Potter</title> <author>J K.
🌐
Python Forum
python-forum.io › thread-22948.html
Parse XML String in Pandas Dataframe
December 4, 2019 - Here is my situation: I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code ...
🌐
Like Geeks
likegeeks.com › home › python › pandas › parsing xml files into dataframes using pandas read_xml
Parsing XML Files into DataFrames using Pandas read_xml
October 16, 2023 - The iterparse argument is used to specify the structure of the XML and how it should be parsed. To compare the performance of reading a large XML file with and without iterparse, here's the plan: Generate a large XML file. Measure the time taken to read the XML file without using iterparse. Measure the time taken to read the XML file using iterparse. import pandas as pd import random import time from io import BytesIO # Step 1: Generate a large XML file num_entries = 1000000 shapes = ["triangle", "square", "pentagon", "hexagon"] xml_data = '' for _ in range(num_entries):
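For reference, a minimal sketch of what an `iterparse` call can look like. It assumes pandas 1.5 or later and a file on local disk, since as far as I know `iterparse` does not accept in-memory strings or buffers. The tiny shapes file below is invented and far too small to show any speed difference.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, written to disk because iterparse expects a local file path.
xml = """<?xml version="1.0" encoding="utf-8"?>
<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4</sides></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3</sides></row>
</data>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shapes.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml)

    # Key: the repeating element; value: the descendant elements/attributes to collect.
    df = pd.read_xml(
        path,
        parser="etree",
        iterparse={"row": ["shape", "degrees", "sides"]},
    )

print(df)
```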
🌐
Towards Data Science
towardsdatascience.com › home › latest › parsing xml data in python
Parsing XML Data in Python | Towards Data Science
January 19, 2025 - To summarize, in this post we discussed how to parse XML data using the ‘xml’ library in python. We showed how to use the ‘iterfind()’ method to define a generator object that we can iterate over in a ‘for-loop’. We also showed how to access element tag information using the ‘findtext()’ method. We then stored the XML information in lists which we used to define a Pandas data frame.
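The pattern described in that summary might look roughly like the sketch below: `iterfind()` yields matching elements lazily, `findtext()` pulls the text of a child, and the collected lists feed a DataFrame. The bookstore data and column names are invented for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical document; the second book has no <author> element.
xml_text = """<bookstore>
  <book><title>Book A</title><author>Author A</author><year>2005</year></book>
  <book><title>Book B</title><year>2010</year></book>
</bookstore>"""

root = ET.fromstring(xml_text)

titles, authors, years = [], [], []
for book in root.iterfind("book"):           # generator over direct <book> children
    titles.append(book.findtext("title"))
    # findtext returns the default when the child element is missing
    authors.append(book.findtext("author", default="unknown"))
    years.append(int(book.findtext("year")))

df = pd.DataFrame({"title": titles, "author": authors, "year": years})
print(df)
```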
🌐
GeeksforGeeks
geeksforgeeks.org › how-to-create-pandas-dataframe-from-nested-xml
How to create Pandas DataFrame from nested XML? | GeeksforGeeks
April 28, 2021 - In this article, we will learn how to create Pandas DataFrame from nested XML. We will use the xml.etree.ElementTree module, which is a built-in module in Python for parsing or reading information from the XML file.
🌐
Pandas
pandas.pydata.org › docs › reference › api › pandas.DataFrame.to_xml.html
pandas.DataFrame.to_xml — pandas 3.0.1 documentation
DataFrame.to_xml(path_or_buffer=None, *, index=True, root_name='data', row_name='row', na_rep=None, attr_cols=None, elem_cols=None, namespaces=None, prefix=None, encoding='utf-8', xml_declaration=True, pretty_print=True, parser='lxml', stylesheet=None, compression='infer', storage_options=None)[source]#