pandas read xml to dataframe

How to convert an XML file to nice pandas dataframe?

stackoverflow.com › questions › 28259301 › how-to-convert-an-xml-file-to-nice-pandas-dataframe

You can use xml.etree.ElementTree (from the Python standard library) to build a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or a file object):

import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    # Yield one dict per <document>, merging author and document attributes
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

xml_data = io.StringIO('''YOUR XML STRING HERE''')

etree = ET.parse(xml_data)  # create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:

def iter_author(etree):
    for author in etree.iter('author'):
        yield from iter_docs(author)

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))

Have a look at the ElementTree tutorial provided in the xml library documentation.

Answer from JaminSore on Stack Overflow
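
As a rough, self-contained illustration of the generator approach in the answer above, here is a sketch that runs end to end. The XML layout, the attribute names (name, country, index), and the helper name iter_rows are all invented for this example and are not taken from the original question.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample data: element and attribute names are assumptions.
xml_text = """<authors>
  <author name="Ann" country="US">
    <document index="1">first text</document>
    <document index="2">second text</document>
  </author>
  <author name="Bob" country="UK">
    <document index="1">third text</document>
  </author>
</authors>"""


def iter_rows(root):
    """Yield one flat dict per <document>, including its author's attributes."""
    for author in root.iter("author"):
        for doc in author.iter("document"):
            row = dict(author.attrib)   # copy so the element is not mutated
            row.update(doc.attrib)      # document attributes win on name clashes
            row["data"] = doc.text
            yield row


root = ET.parse(io.StringIO(xml_text)).getroot()
doc_df = pd.DataFrame(list(iter_rows(root)))
print(doc_df)
```

Each author attribute is repeated on every row for that author, which is what turns the nested XML into a flat table.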

Pandas

pandas.pydata.org › docs › reference › api › pandas.read_xml.html

pandas.read_xml - pandas documentation - PyData

String path, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file. ... The XPath to parse required set of nodes for migration to DataFrame. XPath should return a collection of elements and not a single element. Note: The etree parser supports limited XPath expressions. For more complex XPath, use lxml which requires installation. ... The namespaces defined in XML document as dicts with key being namespace prefix and value the URI.
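
The xpath and namespaces parameters described in this excerpt can be sketched as follows. The sample document, the doc prefix, and the example.com URI are made up for illustration; parser="etree" is chosen here only so the sketch does not depend on lxml being installed.

```python
import io

import pandas as pd

# Hypothetical namespaced document; prefix and URI are assumptions.
xml_text = """<doc:data xmlns:doc="https://example.com">
  <doc:row><doc:shape>square</doc:shape><doc:sides>4</doc:sides></doc:row>
  <doc:row><doc:shape>triangle</doc:shape><doc:sides>3</doc:sides></doc:row>
</doc:data>"""

df = pd.read_xml(
    io.StringIO(xml_text),
    xpath=".//doc:row",                         # must select a collection of elements
    namespaces={"doc": "https://example.com"},  # prefix -> URI mapping
    parser="etree",                             # stdlib parser, limited XPath support
)
print(df)
```

The expression starts with ".//" rather than "//" because the etree parser only supports a limited XPath subset.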

Stack Overflow

python - How to convert an XML file to nice pandas dataframe? - Stack Overflow

Top answer

1 of 5

61

2 of 5

33

As of pandas v1.3, you can simply use:

pandas.read_xml(path_or_file)
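
A minimal sketch of that call, assuming a small made-up document held in memory instead of a file on disk. The element names (library, book, title, year) are hypothetical, and parser="etree" is passed only to avoid requiring lxml.

```python
import io

import pandas as pd

# Hypothetical sample data; with a real file, pass its path instead.
xml_text = """<library>
  <book><title>A</title><year>2001</year></book>
  <book><title>B</title><year>2005</year></book>
</library>"""

# Each <book> becomes a row and each child element becomes a column.
df = pd.read_xml(io.StringIO(xml_text), xpath=".//book", parser="etree")
print(df)
```

This works best on shallow, table-like XML; deeply nested documents may still need the ElementTree approach from the first answer.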

Discussions

Parsing XML into a Pandas dataframe

To parse an XML file into a pandas DataFrame, first use the ElementTree module to parse the file and extract the relevant data, then build the DataFrame from a list of per-row dictionaries. Here is an example of how this can be done:

import xml.etree.ElementTree as ET
import pandas as pd

# Parse the XML file using ElementTree
tree = ET.parse('my_file.xml')
root = tree.getroot()

# Extract the column names from the 'columns' element
columns = [col.attrib['friendlyName'] for col in root.find('columns')]

# Collect one dictionary per 'row' element
data = []
for row in root.find('rows'):
    row_data = {}
    for col in row:
        # Add the data for each column to the dictionary
        row_data[col.attrib['name']] = col.text
    # Add the dictionary for this row to the list
    data.append(row_data)

# Create a DataFrame from the row dictionaries.
# columns= selects and orders keys, so this assumes each cell's 'name'
# matches a 'friendlyName'; unmatched columns are filled with NaN.
df = pd.DataFrame(data, columns=columns)

This code parses the XML file and extracts the data for each row and column, storing each row in a dictionary. The list of dictionaries is passed to the DataFrame constructor instead of DataFrame.from_dict, because from_dict rejects columns= under its default orient='columns'. The resulting DataFrame has the column names as its columns and one row per 'row' element. More on reddit.com

r/learnpython

8

3

December 9, 2022

python - How to read XML file into Pandas Dataframe - Stack Overflow

I have a xml file: 'product.xml' that I want to read using pandas, here is an example of the sample file: 32... More on stackoverflow.com

stackoverflow.com

elementtree - Read XML file to Pandas DataFrame - Stack Overflow

Can someone please help convert the following XML file to Pandas dataframe: More on stackoverflow.com

stackoverflow.com

October 24, 2018

How to load 85.6 GB of XML data into a dataframe

That's very large for a single dataset. I doubt your PC has that much RAM. Most of us have more like 16 GB or 32 GB, so even if you were able to load the data it would exceed your RAM capacity, and your program would be extremely slow as it constantly swaps to disk. Also, this definitely won't work unless you're running a 64-bit version of Python: a 32-bit Python (still pretty common) can't load anything larger than around 2 GB. But even with 64-bit Python, that's unreasonably large to try to read into RAM. The typical solution in these cases is to preprocess your data and split it into smaller files, each of a reasonable size. Machine learning packages like TensorFlow are designed to work this way: rather than loading all of your training data into RAM at once, you load a chunk at a time and train on that, then unload it and load more. You might want to first find a tool to split your XML file into smaller chunks. More on reddit.com
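
The "load a chunk at a time" idea can be sketched with the standard library's ET.iterparse, which the answer above does not name; it is swapped in here as one possible way to stream a large file. The helper iter_chunks, the row tag, and the tiny in-memory sample are all hypothetical.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd


def iter_chunks(source, row_tag, chunk_size):
    """Yield DataFrames of at most chunk_size rows while streaming the XML."""
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            # Flatten the row's direct children into one dict.
            rows.append({child.tag: child.text for child in elem})
            # Drop the row's children to release memory; the emptied element
            # itself stays attached to the root, so memory still grows slowly.
            elem.clear()
            if len(rows) == chunk_size:
                yield pd.DataFrame(rows)
                rows = []
    if rows:
        yield pd.DataFrame(rows)  # final partial chunk


# Tiny stand-in for a huge file: five <row> elements.
body = b"".join(b"<row><id>%d</id><v>x</v></row>" % i for i in range(5))
sample = io.BytesIO(b"<rows>" + body + b"</rows>")

chunks = list(iter_chunks(sample, "row", chunk_size=2))
print([len(chunk) for chunk in chunks])
```

With a real file, each chunk would be processed or written out (for example to Parquet or CSV) before the next one is read, so that the full dataset is never held in memory at once.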

r/learnprogramming

7

6

September 26, 2021
