Assuming all three needed nodes (aixm:designator, aixm:type, and gml:pos) are always present, consider parsing the parent nodes, aixm:DesignatedPointTimeSlice and aixm:Point, separately and then joining them. Finally, select the three columns needed.

import pandas as pd

ab = {
    'aixm':'http://www.aixm.aero/schema/5.1.1', 
    'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR',
    'gml':'http://www.opengis.net/gml/3.2'
}

time_slice_df = pd.read_xml(
    'file.xml', xpath=".//aixm:DesignatedPointTimeSlice", namespaces=ab
).add_prefix("time_slice_")

point_df  = pd.read_xml(
    'file.xml', xpath=".//aixm:Point", namespaces=ab
).add_prefix("point_")

time_slice_df = (
    time_slice_df.join(point_df)
    .reindex(
        ["time_slice_designator", "time_slice_type", "point_pos"], 
        axis="columns"
    )
)

As of pandas 1.5, read_xml also supports iterparse, which allows retrieval of descendant nodes without being limited to XPath expressions:

time_slice_df = pd.read_xml(
    'file.xml',
    namespaces=ab,
    iterparse={
        "aixm:DesignatedPointTimeSlice": ["aixm:designator", "aixm:type", "aixm:Point"]
    }
)
Answer from Parfait on Stack Overflow
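
For readers who want to see the streaming idea without pandas, here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse. It walks the document once and pulls the three descendants out of each repeating element. The sample document is made up and only mimics the element names from the answer, and the helper name `time_slice_rows` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

NS = {
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Made-up sample that only mimics the element names used in the answer.
SAMPLE = """<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>ABCDE</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location>
      <aixm:Point><gml:pos>50.1 8.5</gml:pos></aixm:Point>
    </aixm:location>
  </aixm:DesignatedPointTimeSlice>
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>FGHIJ</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location>
      <aixm:Point><gml:pos>48.3 11.7</gml:pos></aixm:Point>
    </aixm:location>
  </aixm:DesignatedPointTimeSlice>
</root>"""


def time_slice_rows(source):
    """Yield one dict per DesignatedPointTimeSlice found in `source`."""
    row_tag = "{%s}DesignatedPointTimeSlice" % NS["aixm"]
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != row_tag:
            continue
        yield {
            "designator": elem.findtext("aixm:designator", namespaces=NS),
            "type": elem.findtext("aixm:type", namespaces=NS),
            "pos": elem.findtext(".//gml:pos", namespaces=NS),
        }
        elem.clear()  # release the subtree once the row is extracted


rows = list(time_slice_rows(io.BytesIO(SAMPLE.encode("utf-8"))))
```

If a DataFrame is needed, the resulting list of dicts can be passed to pd.DataFrame(rows).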
🌐
Pandas
pandas.pydata.org › docs › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 3.0.1 documentation - PyData |
pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, names=None, dtype=None, converters=None, parse_dates=None, encoding='utf-8', parser='lxml', stylesheet=None, iterparse=None, compression='infer', storage_options=None, dtype_backend=<no_default>)[source]#
Parsing XML into a Pandas dataframe
To parse an XML file into a Pandas DataFrame, first use the ElementTree module to parse the file and extract the relevant data, then build the DataFrame from the resulting list of dictionaries. Here is an example of how this can be done:

import xml.etree.ElementTree as ET
import pandas as pd

# Parse the XML file using ElementTree
tree = ET.parse('my_file.xml')
root = tree.getroot()

# Extract the column names from the 'columns' element
columns = [col.attrib['friendlyName'] for col in root.find('columns')]

# Collect one dictionary per 'row' element
data = []
for row in root.find('rows'):
    row_data = {}
    for col in row:
        # Add the data for each column to the dictionary
        row_data[col.attrib['name']] = col.text
    # Add the dictionary for this row to the list
    data.append(row_data)

# Create a DataFrame using the column names and data.
# DataFrame.from_dict rejects `columns` with its default orient, so the constructor is used instead.
df = pd.DataFrame(data, columns=columns)

This code parses the XML file and stores each row's data in a dictionary keyed by column name. The list of dictionaries is then used to create a DataFrame with the column names as columns and each row of data as a row. More on reddit.com
🌐 r/learnpython
8
3
December 9, 2022
🌐
Pandas
pandas.pydata.org › pandas-docs › version › 1.5 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 1.5.2 documentation
Note: If this option is used, it will replace xpath parsing and unlike xpath, descendants do not need to relate to each other but can exist anywhere in document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example, iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
🌐
Like Geeks
likegeeks.com › home › python › pandas › parsing xml files into dataframes using pandas read_xml
Parsing XML Files into DataFrames using Pandas read_xml
October 16, 2023 - Measure the time taken to read the XML file using iterparse.

import pandas as pd
import random
import time
from io import BytesIO

# Step 1: Generate a large XML file
num_entries = 1000000
shapes = ["triangle", "square", "pentagon", "hexagon"]
...
🌐
Pandas
pandas.pydata.org › docs › dev › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 3.0.0.dev0+2687.g00a7c41157 documentation
The xpath must reference nodes of transformed XML document generated after XSLT transformation and not the original XML document. Only XSLT 1.0 scripts, and not later versions, are currently supported. ... The nodes or attributes to retrieve in iterparsing of XML document as a dict with key being the name of repeating element and value being list of elements or attribute names that are descendants of the repeated element.
🌐
GitHub
github.com › pandas-dev › pandas › issues › 47343
BUG: iterparse on read_xml overwrites with attributes on child elements · Issue #47343 · pandas-dev/pandas
February 5, 2023 -

from tempfile import NamedTemporaryFile
import pandas as pd

XML = '''
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
'''.encode('utf-8')

with NamedTemporaryFile() as tempfile:
    tempfile.write(XML)
    tempfile.flush()
    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type']})
    issue_type = df.iloc[0]['type']
    print(f"issue_type: expecting BUG, got {issue_type}")

The parsing code parses attributes on all child elements, rather than just on the main element.
Author   bailsman
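
To make the reported symptom easier to picture, here is a simplified illustration of how such a name collision can arise. This is not pandas' actual parsing code. It is a stdlib-only sketch in which the hypothetical helper `naive_row` files element text and attributes under the same flat name, while `careful_row` reads attributes only from the row element itself.

```python
import xml.etree.ElementTree as ET

XML = """<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>"""


def naive_row(row_elem, wanted):
    """Collect text AND attributes of every descendant under one flat name."""
    row = {}
    for child in row_elem.iter():
        if child.tag in wanted and child.text and child.text.strip():
            row[child.tag] = child.text.strip()
        for attr_name, attr_value in child.attrib.items():
            if attr_name in wanted:
                row[attr_name] = attr_value  # may clobber an element value
    return row


def careful_row(row_elem, wanted):
    """Take attributes only from the row element, text only from descendants."""
    row = {k: v for k, v in row_elem.attrib.items() if k in wanted}
    for child in row_elem.iter():
        if child is row_elem:
            continue
        if child.tag in wanted and child.text and child.text.strip():
            row.setdefault(child.tag, child.text.strip())
    return row


issue = ET.fromstring(XML)
naive = naive_row(issue, {"type"})      # the reporter's attribute wins
careful = careful_row(issue, {"type"})  # the <type> element wins
```

In the naive version the `type` attribute on `<reporter>` overwrites the text of `<type>`, which matches the behaviour described in the issue.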
🌐
GitHub
github.com › pandas-dev › pandas › issues › 51183
BUG: iterparse on read_xml overwrites nested child elements · Issue #51183 · pandas-dev/pandas
February 5, 2023 -

import pandas as pd

XML = '''
<values>
  <guidedSaleKey>
    <code>9023000918982</code>
    <externalReference>0102350511</externalReference>
  </guidedSaleKey>
  <store>
    <code>02300</code>
    <externalReference>1543</externalReference>
    <currency>EUR</currency>
  </store>
</values>
'''

df = pd.read_xml(XML, iterparse={"values": ["code", "code"]}, names=["guided_code", "store_code"])
print(df)

The DataFrame does not return the values of both code elements (one under guidedSaleKey, one under store). Instead, it returns:

     guided_code     store_code
0  9023000918982  9023000918982
Author   bama-chi
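
One possible way around same-named nested children, sketched here with the standard library rather than read_xml, is to resolve each output column by an explicit path relative to the row element. The `COLUMN_PATHS` mapping and the helper `rows_by_path` are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

XML = """<values>
  <guidedSaleKey>
    <code>9023000918982</code>
    <externalReference>0102350511</externalReference>
  </guidedSaleKey>
  <store>
    <code>02300</code>
    <externalReference>1543</externalReference>
    <currency>EUR</currency>
  </store>
</values>"""

# Hypothetical mapping of output column -> path relative to the row element.
COLUMN_PATHS = {
    "guided_code": "guidedSaleKey/code",
    "store_code": "store/code",
}


def rows_by_path(root, row_tag, column_paths):
    """Build one dict per row element, resolving each column by its path."""
    rows = []
    # Element.iter() also yields the root itself when its tag matches.
    for row_elem in root.iter(row_tag):
        rows.append(
            {col: row_elem.findtext(path) for col, path in column_paths.items()}
        )
    return rows


rows = rows_by_path(ET.fromstring(XML), "values", COLUMN_PATHS)
```

Because each column carries its parent in the path, the two `code` elements no longer compete for the same name.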
🌐
GitHub
github.com › pandas-dev › pandas › issues › 50641
BUG: pd.read_xml does not support file like object when iterparse is used · Issue #50641 · pandas-dev/pandas
January 9, 2023 - I have confirmed this bug exists ... iterparse={root: list(elements)} ) the method read_xml with iterparse as parms is used to read large xml file, but it's restricted to read only files on local disk....
Author   bama-chi
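
Given the restriction described in this issue, one workaround is to copy the file-like object to a temporary file and hand the resulting path to the parser. The sketch below uses only the standard library, and the helper name `as_local_file` is hypothetical. Whether later pandas releases lifted the restriction is not stated here, so treat this as an assumption-laden sketch.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_file(buffer, suffix=".xml"):
    """Copy a binary file-like object to a temporary file and yield its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            shutil.copyfileobj(buffer, tmp)  # streams in chunks
        yield tmp.name
    finally:
        os.remove(tmp.name)  # clean up once the caller is done


payload = io.BytesIO(b"<rows><row><a>1</a></row></rows>")
with as_local_file(payload) as path:
    # With pandas installed, the path could be passed on here, e.g.:
    # df = pd.read_xml(path, iterparse={"row": ["a"]})
    with open(path, "rb") as fh:
        copied = fh.read()
    existed = os.path.exists(path)
removed = not os.path.exists(path)
```

The copy costs extra disk I/O, but the parser then only ever sees an ordinary local path.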
🌐
Reddit
reddit.com › r/learnpython › parsing xml into a pandas dataframe
r/learnpython on Reddit: Parsing XML into a Pandas dataframe
December 9, 2022 -

I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.

The file looks roughly like this

<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <!--Build 18.0.1.69-->
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="ProcessID" name="ProcessID" />
    <column friendlyName="ThreadID" name="ThreadID" />
    <column friendlyName="TimeSpan" name="TimeSpan" />
    <column friendlyName="User" name="User" />
    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />
    <column friendlyName="HTTPForward" name="HTTPForward" />
    <column friendlyName="SessionID" name="SessionID" />
    <column friendlyName="SessionGUID" name="SessionGUID" />
    <column friendlyName="Datasource" name="Datasource" />
    <column friendlyName="Sequence" name="Sequence" />
    <column friendlyName="LocalSequence" name="LocalSequence" />
    <column friendlyName="Message" name="Message" />
    <column friendlyName="AppPoolName" name="AppPoolName" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">0 ms</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
      <col name="Sequence">236419</col>
      <col name="LocalSequence">103825</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">N/A</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
      <col name="Sequence">236420</col>
      <col name="LocalSequence">103826</col>
    </row>
  </rows>
</diagnosticsLog>

I want to convert that to the column names being the columns and each row being a row. I'm at a loss on how to do this.
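
One way to approach this, sketched with only the standard library: read the declared column names, then turn each row into a dict keyed by those names. The sample below is a shortened, made-up stand-in for the file above, and `log_to_records` is a hypothetical helper name. Because the file spells one column "SQL" in the columns list but "sql" in the rows, the sketch assumes names should be matched case-insensitively.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in in the shape of the file above (values are placeholders).
SAMPLE = """<diagnosticsLog type="db-profile">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="Message" name="Message" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="sql">UPDATE SET</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:13</col>
      <col name="sql">SELECT</col>
      <col name="Message">ok</col>
    </row>
  </rows>
</diagnosticsLog>"""


def log_to_records(root):
    """Return (column_names, records) with one dict per <row>."""
    columns = [c.get("name") for c in root.findall("./columns/column")]
    # Map lower-cased names back to the declared spelling ("sql" -> "SQL").
    canonical = {name.lower(): name for name in columns}
    records = []
    for row in root.findall("./rows/row"):
        record = dict.fromkeys(columns)  # cells missing from a row stay None
        for col in row.findall("col"):
            key = canonical.get(col.get("name", "").lower())
            if key is not None:
                record[key] = col.text
        records.append(record)
    return columns, records


columns, records = log_to_records(ET.fromstring(SAMPLE))
```

With pandas available, pd.DataFrame(records, columns=columns) would then give one DataFrame row per `<row>` element.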

🌐
DataScientYst
datascientyst.com › read-xml-file-python-pandas
How to Read XML File with Python and Pandas
October 13, 2022 - In this quick tutorial, we'll cover how to read or convert an XML file to a Pandas DataFrame or Python data structure. Since version 1.3, Pandas offers an elegant solution for reading XML files: pd.read_xml(). The short solution is: df = pd.read_xml('sitemap.xml') With the single line
🌐
GeeksforGeeks
geeksforgeeks.org › how-to-create-pandas-dataframe-from-nested-xml
How to create Pandas DataFrame from nested XML? | GeeksforGeeks
April 28, 2021 - Parse or read the XML file using ElementTree.parse( ) function and get the root element. Iterate through the root node to get the child nodes attributes 'SL NO' (here) and extract the text values of each attribute (here foodItem, price, quantity, ...
🌐
TutorialsPoint
tutorialspoint.com › python_pandas › python_pandas_read_xml_method.htm
Pandas DataFrame read_xml() Method
January 2, 2025 - The read_xml() method returns a Pandas DataFrame containing the parsed data from the XML document.
Top answer

PERFORMANCE: How do you explain the slower iterparse often recommended for larger files as file is iteratively parsed? Is it partly due to the if logic checks?

I would assume that more python code would make it slower, as the python code is evaluated every time. Have you tried a JIT compiler like pypy?

If I remove i and use first_tag only, it seems to be quite a bit faster, so yes it is partly due to the if logic checks:

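# Assumes `et` (an ElementTree-compatible module) and `pd` (pandas) are imported as in the question.
def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag

        # A repeat of the first tag marks the start of the next row
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text

    df = pd.DataFrame(data)
    return df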

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop

I wasn't sure I understood the purpose of the last if check, but I'm also not sure why you would want to lose whitespace-only elements. Removing the last if consistently shaves off a little bit of time:

def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag

        # A repeat of the first tag marks the start of the next row
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}

        # No whitespace check: every element's text is kept
        inner[el.tag] = el.text

    df = pd.DataFrame(data)
    return df

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop

Now, with or without those performance improvements, your iterparse version seems to produce an extra-large dataframe. Here seems to be a working, fast version:

def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # /ending parents trigger a new row, and in our case .text is \n followed by spaces.  it would be more reliable to pass 'topusers' to our read_xml_iterparse5 as the .tag to check
        if el.text and el.text[0] == '\n':
            # ignore /stackoverflow
            if inner:
                data.append(inner)
                inner = {}
        else:
            inner[el.tag] = el.text

    return pd.DataFrame(data)    

print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop
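
The comment in read_xml_iterparse5 notes that passing the row tag explicitly would be more reliable than checking whether the text starts with a newline. A minimal sketch of that variant follows. The sample document is made up (only the stackoverflow/topusers nesting comes from the comment), and `iterparse_rows` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; only the <stackoverflow>/<topusers> nesting follows the comment.
SAMPLE = b"""<stackoverflow>
  <topusers>
    <user>ann</user>
    <reputation>100</reputation>
  </topusers>
  <topusers>
    <user>bob</user>
    <reputation>90</reputation>
  </topusers>
</stackoverflow>"""


def iterparse_rows(source, row_tag):
    """Collect leaf tag/text pairs and emit a row whenever `row_tag` closes."""
    data = []
    inner = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            data.append(inner)
            inner = {}
            elem.clear()  # free the finished row's children
        elif len(elem) == 0:
            inner[elem.tag] = elem.text  # leaf element: keep its text
    return data


rows = iterparse_rows(io.BytesIO(SAMPLE), "topusers")
```

Keying on the closing row tag avoids depending on how the file happens to be indented, and the last row is emitted without any special handling after the loop.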

MEMORY: Do CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents as entire file must be read in memory to be parsed.

I'm not totally sure what you mean by "I/O calls" but if your document is small enough to fit in cache, then everything will be much faster as it won't evict many other items from the cache.

STRATEGY: Is list of dictionaries an optimal strategy for Dataframe() call? See these interesting answers: generator version and a iterwalk user-defined version. Both upcast lists to dataframe.

The lists use less memory, so depending on how many columns you have, it could make a noticeable difference. Of course, this then requires your XML tags to be in a consistent order, which they do appear to be. The DataFrame() call would also need to do less work, as it doesn't have to look up keys in the dict on every row to figure out which column each value belongs to.
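
The two row representations can be compared side by side in a small stdlib-only sketch. The sample XML and both helper names are made up for illustration. The list variant assumes, as noted above, that every row carries the same children in the same order.

```python
import xml.etree.ElementTree as ET

# Made-up sample with a consistent child order in every row.
XML = """<users>
  <row><name>ann</name><score>10</score></row>
  <row><name>bob</name><score>7</score></row>
</users>"""


def rows_as_dicts(root):
    """One dict per row: flexible, but the keys are stored again for every row."""
    return [{child.tag: child.text for child in row} for row in root.iter("row")]


def rows_as_lists(root):
    """One list per row plus a single shared header taken from the first row."""
    rows = list(root.iter("row"))
    columns = [child.tag for child in rows[0]] if rows else []
    return columns, [[child.text for child in row] for row in rows]


root = ET.fromstring(XML)
dict_rows = rows_as_dicts(root)
columns, list_rows = rows_as_lists(root)
```

Either result can feed a DataFrame: pd.DataFrame(dict_rows) for the dicts, or pd.DataFrame(list_rows, columns=columns) for the lists.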

🌐
Stack Overflow
stackoverflow.com › questions › 74043409 › unable-to-parse-xml-data-using-pandas-method-read-xml
python - Unable to parse xml data using pandas method read.xml() - Stack Overflow
October 12, 2022 -

df = pd.concat(
    [
        pd.read_xml("example.xml", iterparse={name: ["value"]}, names=[name])
        for name in ["shape", "degrees", "sides"]
    ],
    axis=1,
)