pandas read xml iterparse

stackoverflow.com › questions › 72583155 › parse-large-xml-in-python

Assuming all three needed nodes (aixm:designator, aixm:type, and gml:pos) are always present, consider parsing the parent nodes, aixm:DesignatedPointTimeSlice and aixm:Point, and then joining them. Finally, select the three columns needed.

import pandas as pd

ab = {
    'aixm':'http://www.aixm.aero/schema/5.1.1', 
    'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR',
    'gml':'http://www.opengis.net/gml/3.2'
}

time_slice_df = pd.read_xml(
    'file.xml', xpath=".//aixm:DesignatedPointTimeSlice", namespaces=ab
).add_prefix("time_slice_")

point_df  = pd.read_xml(
    'file.xml', xpath=".//aixm:Point", namespaces=ab
).add_prefix("point_")

time_slice_df = (
    time_slice_df.join(point_df)
    .reindex(
        ["time_slice_designator", "time_slice_type", "point_pos"], 
        axis="columns"
    )
)

Since pandas 1.5, read_xml also supports iterparse, which retrieves descendant nodes without being limited to XPath expressions:

time_slice_df = pd.read_xml(
    'file.xml', 
    namespaces = ab, 
    iterparse = {"aixm:DesignatedPointTimeSlice": 
        ["aixm:designator", "aixm:type", "aixm:Point"]
    }
)

Answer from Parfait on Stack Overflow
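
The join in the answer pairs the two frames by row position, so it lines up only when every time slice contains exactly one point. As a hedged alternative sketch, using only the standard library rather than pandas, each aixm:DesignatedPointTimeSlice can be walked on its own so that a missing node becomes None instead of shifting later rows. The sample document is invented for illustration (real AIXM files are nested more deeply), and extract_points is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace map taken from the answer above (only the prefixes used here).
NS = {
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Invented sample, much simpler than a real AIXM file.
SAMPLE = """\
<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>ABCDE</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location>
      <aixm:Point><gml:pos>50.0 8.0</gml:pos></aixm:Point>
    </aixm:location>
  </aixm:DesignatedPointTimeSlice>
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>FGHIJ</aixm:designator>
    <aixm:type>ICAO</aixm:type>
  </aixm:DesignatedPointTimeSlice>
</root>
"""


def extract_points(xml_text):
    """Return one dict per time slice; missing nodes become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for ts in root.iterfind(".//aixm:DesignatedPointTimeSlice", NS):
        rows.append({
            "designator": ts.findtext("aixm:designator", default=None, namespaces=NS),
            "type": ts.findtext("aixm:type", default=None, namespaces=NS),
            "pos": ts.findtext(".//aixm:Point/gml:pos", default=None, namespaces=NS),
        })
    return rows


rows = extract_points(SAMPLE)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) to get the same three columns.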

Pandas

pandas.pydata.org › docs › reference › api › pandas.read_xml.html

pandas.read_xml — pandas 3.0.1 documentation - PyData |

pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, names=None, dtype=None, converters=None, parse_dates=None, encoding='utf-8', parser='lxml', stylesheet=None, iterparse=None, compression='infer', storage_options=None, dtype_backend=<no_default>)

Stack Overflow

stackoverflow.com › questions › 72583155 › parse-large-xml-in-python

pandas - parse large xml in python - Stack Overflow

Discussions

Parsing XML into a Pandas dataframe

To parse an XML file into a pandas DataFrame, first use the ElementTree module to parse the file and extract the relevant data, then pass the collected rows to the DataFrame constructor. Here is an example of how this can be done:

import xml.etree.ElementTree as ET
import pandas as pd

# Parse the XML file using ElementTree
tree = ET.parse('my_file.xml')
root = tree.getroot()

# Extract the column names from the 'columns' element
columns = [col.attrib['friendlyName'] for col in root.find('columns')]

# Collect one dictionary per 'row' element
data = []
for row in root.find('rows'):
    row_data = {}
    for col in row:
        # Key each value by the 'name' attribute of its 'col' element
        row_data[col.attrib['name']] = col.text
    data.append(row_data)

# Build the DataFrame from the list of dictionaries.
# pd.DataFrame.from_dict raises ValueError when columns is combined with its default orient.
df = pd.DataFrame(data, columns=columns)

This code parses the XML file and stores the data for each row in a dictionary. The list of dictionaries is then used to create a DataFrame whose columns are the column names and whose rows are the 'row' elements. Row keys that do not match a column name exactly (the question's sample file uses name="sql" while the column is "SQL") come out as missing values. More on reddit.com

r/learnpython

8

3

December 9, 2022

Pandas

pandas.pydata.org › pandas-docs › version › 1.5 › reference › api › pandas.read_xml.html

pandas.read_xml — pandas 1.5.2 documentation

Note: If this option is used, it will replace xpath parsing and unlike xpath, descendants do not need to relate to each other but can exist anywhere in document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example, iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
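
To illustrate the streaming idea behind this option, here is a minimal sketch that uses the standard library's xml.etree.ElementTree.iterparse rather than pandas itself. It is not pandas' implementation; the element names (row, shape, sides) and the helper stream_rows are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a very large file; element names are hypothetical.
SAMPLE = b"""<data>
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>triangle</shape><sides>3</sides></row>
</data>"""


def stream_rows(source, row_tag, wanted):
    """Yield one dict per row_tag element as soon as it has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            row = {name: None for name in wanted}
            # Descendants can sit at any depth under the repeating element.
            for child in elem.iter():
                if child.tag in row:
                    row[child.tag] = child.text
            yield row
            # Drop the processed subtree so its children can be freed.
            elem.clear()


rows = list(stream_rows(io.BytesIO(SAMPLE), "row", ["shape", "sides"]))
```

Because rows are yielded one at a time, a caller could also consume them in batches instead of building the full list.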

Like Geeks

likegeeks.com › home › python › pandas › parsing xml files into dataframes using pandas read_xml

Parsing XML Files into DataFrames using Pandas read_xml

October 16, 2023 - Measure the time taken to read the XML file using iterparse.

import pandas as pd
import random
import time
from io import BytesIO

# Step 1: Generate a large XML file
num_entries = 1000000
shapes = ["triangle", "square", "pentagon", "hexagon"]
...

Pandas

pandas.pydata.org › docs › dev › reference › api › pandas.read_xml.html

pandas.read_xml — pandas 3.0.0.dev0+2687.g00a7c41157 documentation

The xpath must reference nodes of the transformed XML document generated after XSLT transformation and not the original XML document. Only XSLT 1.0 scripts, and not later versions, are currently supported. ... The nodes or attributes to retrieve in iterparsing of XML document as a dict with key being the name of repeating element and value being list of elements or attribute names that are descendants of the repeated element.

Pandas

pandas.pydata.org › pandas-docs › stable › reference › api › pandas.read_xml.html

pandas.read_xml — pandas 2.2.2 documentation - PyData |

Note: If this option is used, it will replace xpath parsing and unlike xpath, descendants do not need to relate to each other but can exist anywhere in document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example, iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}

PyPI

pypi.org › project › pandas-read-xml

pandas-read-xml

GitHub

github.com › pandas-dev › pandas › issues › 47343

BUG: iterparse on read_xml overwrites with attributes on child elements · Issue #47343 · pandas-dev/pandas

February 5, 2023

from tempfile import NamedTemporaryFile
import pandas as pd

XML = '''
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
'''.encode('utf-8')

with NamedTemporaryFile() as tempfile:
    tempfile.write(XML)
    tempfile.flush()
    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type']})

issue_type = df.iloc[0]['type']
print(f"issue_type: expecting BUG, got {issue_type}")

The parsing code parses attributes on all child elements, rather than just on the main element.
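
For comparison, a small standard-library sketch (not a pandas fix) shows that the element named type and the attribute named type on reporter can be read separately, which is the distinction the report says iterparse loses. It reuses the XML from the issue; the variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

# XML copied from the issue report.
XML = """
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
"""

issue = ET.fromstring(XML)

# Text of the <type> child element.
issue_type = issue.findtext("type")
# The unrelated "type" attribute carried by <reporter>.
reporter_type = issue.find("reporter").get("type")
```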

Author bailsman

GitHub

github.com › pandas-dev › pandas › issues › 51183

BUG: iterparse on read_xml overwrites nested child elements · Issue #51183 · pandas-dev/pandas

February 5, 2023

import pandas as pd

XML = '''
<values>
    <guidedSaleKey>
        <code>9023000918982</code>
        <externalReference>0102350511</externalReference>
    </guidedSaleKey>
    <store>
        <code>02300</code>
        <externalReference>1543</externalReference>
        <currency>EUR</currency>
    </store>
</values>
'''

df = pd.read_xml(XML, iterparse={"values": ["code", "code"]}, names=["guided_code", "store_code"])
print(df)

The DataFrame does not return the values of both code elements (the one under guidedSaleKey and the one under store). Instead it returns:

     guided_code     store_code
0  9023000918982  9023000918982
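
One possible way around the name clash, sketched here with the standard library only and not taken from the issue thread, is to address each code element by its path under the row element. The PATHS mapping is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET

# XML copied from the issue report.
XML = """
<values>
  <guidedSaleKey>
    <code>9023000918982</code>
    <externalReference>0102350511</externalReference>
  </guidedSaleKey>
  <store>
    <code>02300</code>
    <externalReference>1543</externalReference>
    <currency>EUR</currency>
  </store>
</values>
"""

# Map each output column to the path of its element under the row element.
PATHS = {"guided_code": "guidedSaleKey/code", "store_code": "store/code"}

root = ET.fromstring(XML)
# <values> is the root here; in a larger file it would repeat under a parent.
rows = [{col: root.findtext(path) for col, path in PATHS.items()}]
```

Passing rows to pd.DataFrame(rows) would give one column per path, with the two code values kept apart.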

Author bama-chi

Pandas

pandas.pydata.org › pandas-docs › version › 1.5.0 › reference › api › pandas.read_xml.html

pandas.read_xml — pandas 1.5.0 documentation

July 23, 2025 - Note: If this option is used, it will replace xpath parsing and unlike xpath, descendants do not need to relate to each other but can exist anywhere in document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example, iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}

GitHub

github.com › pandas-dev › pandas › issues › 50641

BUG: pd.read_xml does not support file like object when iterparse is used · Issue #50641 · pandas-dev/pandas

January 9, 2023 - I have confirmed this bug exists ... iterparse={root: list(elements)} ) The read_xml method with the iterparse parameter is used to read large XML files, but it is restricted to reading files on local disk....

Author bama-chi

Pandas

pandas.pydata.org › pandas-docs › version › 2.0 › reference › api › pandas.read_xml.html

pandas.read_xml — pandas 2.0.3 documentation

Note: If this option is used, it will replace xpath parsing and unlike xpath, descendants do not need to relate to each other but can exist anywhere in document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example, iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}

reddit.com › r/learnpython › parsing xml into a pandas dataframe

r/learnpython on Reddit: Parsing XML into a Pandas dataframe

December 9, 2022

I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.

The file looks roughly like this

<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <!--Build 18.0.1.69-->
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="ProcessID" name="ProcessID" />
    <column friendlyName="ThreadID" name="ThreadID" />
    <column friendlyName="TimeSpan" name="TimeSpan" />
    <column friendlyName="User" name="User" />
    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />
    <column friendlyName="HTTPForward" name="HTTPForward" />
    <column friendlyName="SessionID" name="SessionID" />
    <column friendlyName="SessionGUID" name="SessionGUID" />
    <column friendlyName="Datasource" name="Datasource" />
    <column friendlyName="Sequence" name="Sequence" />
    <column friendlyName="LocalSequence" name="LocalSequence" />
    <column friendlyName="Message" name="Message" />
    <column friendlyName="AppPoolName" name="AppPoolName" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">0 ms</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
      <col name="Sequence">236419</col>
      <col name="LocalSequence">103825</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="TimeSpan">N/A</col>
      <col name="ThreadID">0x00000025</col>
      <col name="User">USERNAME</col>
      <col name="HTTPSessionID"></col>
      <col name="HTTPForward">20.186.0.0</col>
      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
      <col name="SessionID">6096783</col>
      <col name="Datasource">datasourceName</col>
      <col name="AppPoolName">C 1801AppServer Ext</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
      <col name="Sequence">236420</col>
      <col name="LocalSequence">103826</col>
    </row>
  </rows>
</diagnosticsLog>

I want to convert that to the column names being the columns and each row being a row. I'm at a loss on how to do this.
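
One way to approach this, sketched with only the standard library, is to read the column names from the columns element and then build one dict per row keyed by those names. The sample is a shortened copy of the file above, parse_log is a hypothetical helper name, and matching names case-insensitively is an assumption made because the rows use "sql" while the column is declared as "SQL".

```python
import xml.etree.ElementTree as ET

# Shortened copy of the file from the question (most columns dropped, one row).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
    </row>
  </rows>
</diagnosticsLog>"""


def parse_log(xml_bytes):
    """Return (column names, list of row dicts) for a diagnosticsLog document."""
    root = ET.fromstring(xml_bytes)
    column_elems = list(root.iterfind("columns/column"))
    columns = [c.get("friendlyName") for c in column_elems]
    # Assumption: match names case-insensitively ("sql" in rows vs "SQL").
    by_lower = {c.get("name").lower(): c.get("friendlyName") for c in column_elems}
    rows = []
    for row in root.iterfind("rows/row"):
        # Start every record with all columns set to None.
        record = dict.fromkeys(columns)
        for col in row.iterfind("col"):
            key = by_lower.get(col.get("name").lower())
            if key is not None:
                record[key] = col.text
        rows.append(record)
    return columns, rows


columns, rows = parse_log(SAMPLE)
```

With pandas installed, pd.DataFrame(rows, columns=columns) would then give one DataFrame row per row element, with None where a row has no value for a column.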