While XML as a data format can take many forms, from flat to deeply nested, data frames must adhere to a single two-dimensional structure: row by column. Hence, as noted in the docs, pandas.read_xml is a convenience method best suited for flatter, shallow XML files. You can use the xpath argument to traverse different areas of the document, not just the default /*.
However, you can use XSLT 1.0 (a special-purpose language designed to transform XML files) with the default parser, lxml, to transform any XML into the flat format a data frame requires. The stylesheet below restyles the <slike> node into comma-separated text of its children <slika>:
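As a quick illustration of the xpath argument, here is a minimal sketch (the <library>/<shelf>/<book> element names are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical nested document; xpath targets the <book> nodes
# instead of the default /* (the children of the root).
xml = io.StringIO("""<?xml version="1.0"?>
<library>
  <shelf>
    <book><title>A</title><pages>100</pages></book>
    <book><title>B</title><pages>200</pages></book>
  </shelf>
</library>""")

df = pd.read_xml(xml, xpath=".//book")
print(df)
```

Each matched <book> becomes a row and its child elements become columns.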
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="slike">
        <xsl:copy>
            <xsl:for-each select="*">
                <xsl:value-of select="text()"/>
                <xsl:if test="position() != last()">
                    <xsl:text>,</xsl:text>
                </xsl:if>
            </xsl:for-each>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
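To preview what the transform does outside of pandas, lxml can apply the stylesheet directly. A sketch: the <slike>/<slika> names follow the stylesheet above, while the surrounding <artikli>/<artikal> element names are invented sample data.

```python
from lxml import etree

# Hypothetical sample document with a nested <slike> list
xml = etree.XML(
    "<artikli>"
    "<artikal><id>1</id>"
    "<slike><slika>a.jpg</slika><slika>b.jpg</slika></slike>"
    "</artikal>"
    "</artikli>"
)

# Condensed copy of the stylesheet above
xsl = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
  </xsl:template>
  <xsl:template match="slike">
    <xsl:copy>
      <xsl:for-each select="*">
        <xsl:value-of select="text()"/>
        <xsl:if test="position() != last()"><xsl:text>,</xsl:text></xsl:if>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>""")

result = etree.XSLT(xsl)(xml)
print(etree.tostring(result).decode())
```

The <slike> node now carries the flat text "a.jpg,b.jpg", which read_xml can place in a single column.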
Python
artikal_df = pd.read_xml("my_filename.xml", stylesheet="my_style.xsl")
# CONVERT COMMA-SEPARATED VALUES TO EMBEDDED LISTS
artikal_df["slike"] = artikal_df["slike"].str.split(',')
# PREFIX PARENT NODE NAME
artikal_df = artikal_df.add_prefix('artikal_')
artikal_df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 2 entries, 0 to 1
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 artikal_id 2 non-null int64
# 1 artikal_sifra 2 non-null int64
# 2 artikal_barKod 2 non-null int64
# 3 artikal_naziv 2 non-null object
# 4 artikal_kategorija1 2 non-null object
# 5 artikal_kategorija2 2 non-null object
# 6 artikal_kategorija3 2 non-null object
# 7 artikal_vpCena 2 non-null float64
# 8 artikal_mpCena 2 non-null float64
# 9 artikal_dostupan 2 non-null int64
# 10 artikal_opis 0 non-null float64
# 11 artikal_slike 2 non-null object
# dtypes: float64(3), int64(4), object(5)
# memory usage: 320.0+ bytes
Answer from Parfait on Stack Overflow
You start by reading the XML file and also making a placeholder file to write the output to in CSV format (or any other text format; you may have to tweak the code a bit).
Then you specify the names of the columns in your final dataframe (after you have parsed the XML file). This information is already in your XML file anyway, so you just need to make sure you understand the contents.
Lastly, loop over the entries and find the keywords (column names) to read and write to the CSV.
Once done, you can read the CSV using pd.read_csv('output.csv').
import xml.etree.ElementTree as ET
import csv
# Load and parse the XML file
tree = ET.parse('your_xml_file.xml')
root = tree.getroot()
# Define the CSV file and writer
csv_file = open('output.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)
# Write header row
header = ['column1', 'column2', 'column3', 'column4', 'column5']
csv_writer.writerow(header)
# Extract data and write to CSV
for entry in root.findall('.//main_identifier'):
    column1_text = entry.find('.//column1').text if entry.find('.//column1') is not None else ''
    column2_text = entry.find('.//column2').text if entry.find('.//column2') is not None else ''
    column3_text = entry.find('.//column3').text if entry.find('.//column3') is not None else ''
    column4_text = entry.find('.//column4').text if entry.find('.//column4') is not None else ''
    column5_text = entry.find('.//column5').text if entry.find('.//column5') is not None else ''
    # Write data to CSV
    csv_writer.writerow([column1_text, column2_text, column3_text, column4_text, column5_text])
# Close the CSV file
csv_file.close()
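Reading the result back in is then a one-liner. A sketch, with an in-memory buffer standing in for the output.csv produced above:

```python
import io
import pandas as pd

# Stand-in for the output.csv written by the loop above
buf = io.StringIO("column1,column2,column3,column4,column5\na,b,c,d,e\n")
df = pd.read_csv(buf)
print(df)
```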
I've made a package for a similar use case. It could work here too.
pip install pandas_read_xml
You can then do something like:
import pandas_read_xml as pdx
df = pdx.read_xml('filename.xml', ['data'])
To flatten, you could
df = pdx.flatten(df)
or
df = pdx.fully_flatten(df)
You'll need a recursive function to flatten rows, and a mechanism for dealing with duplicate data.
This is messy and depending on the data and nesting, you may end up with rather strange dataframes.
import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd
def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """
    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text

    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value

    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))


def dictify_key_pairs(pairs, key_sep='-'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}

    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)

    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value
    return out
# Parse XML with etree
tree = et.XML("""<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
        <neighbor2 name="Italy" direction="S"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
        <cities>
            <city name="Chargin" population="1234" />
            <city name="Firin" population="4567" />
        </cities>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
""")
# Generate flat rows out of the root nodes in the tree
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)
print(df)
outputs
name rank year gdppc neighbor-name-1 neighbor-name-2 neighbor-direction-1 neighbor-direction-2 neighbor2-name neighbor2-direction neighbor-name neighbor-direction cities-city-name-1 cities-city-name-2 cities-city-population-1 cities-city-population-2
0 Liechtenstein 1 2008 141100 Austria Switzerland E W Italy S NaN NaN NaN NaN NaN NaN
1 Singapore 4 2011 59900 NaN NaN NaN NaN NaN NaN Malaysia N Chargin Firin 1234 4567
2 Panama 68 2011 13600 Costa Rica Colombia W E NaN NaN NaN NaN NaN NaN NaN NaN
You want to append the text values from the ItemNr elements (which are under the shop element) to the items list, not the XML Element Python objects, which is what you were doing.
The following code was working for me:
items.append([item_nr_element.text for item_nr_element in node])
(Note: node.getchildren() was removed in Python 3.9; iterating over the node directly yields its children.)
I hope this is the expected output:
import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    items = []
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        items.append(anitem.text)
    shops_items = [shop_Nr, items]
    all_shops_items.append(shops_items)

df = pd.DataFrame(all_shops_items, columns=['SHOP_NUMBER', 'ITEM_NUMBER'])
print(df)
Output:
SHOP_NUMBER ITEM_NUMBER
0 01 [1001, 1002, 1003, 1004, 1010]
1 02 [1002, 1006, 1005]
2 03 [1009, 1006, 1005, 1002]
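As an alternative to re-looping, the list column from the first version can be expanded into one row per item with DataFrame.explode. A sketch, with inline sample data standing in for the parsed result:

```python
import pandas as pd

# Stand-in for the DataFrame with a list-valued ITEM_NUMBER column
df = pd.DataFrame({
    "SHOP_NUMBER": ["01", "02"],
    "ITEM_NUMBER": [["1001", "1002"], ["1002", "1006"]],
})

# explode turns each list element into its own row,
# repeating the other column values
long_df = df.explode("ITEM_NUMBER", ignore_index=True)
print(long_df)
```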
If you want shops with individual items :
import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        item_Nr = anitem.text
        shops_items = [shop_Nr, item_Nr]
        all_shops_items.append(shops_items)

df = pd.DataFrame(all_shops_items, columns=['SHOP_NUMBER', 'ITEM_NUMBER'])
print(df)
output:
SHOP_NUMBER ITEM_NUMBER
0 01 1001
1 01 1002
2 01 1003
3 01 1004
4 01 1010
5 02 1002
6 02 1006
7 02 1005
8 03 1009
9 03 1006
10 03 1005
11 03 1002
I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.
The file looks roughly like this
<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
    <!--Build 18.0.1.69-->
    <columns>
        <column friendlyName="time" name="time" />
        <column friendlyName="Direction" name="Direction" />
        <column friendlyName="SQL" name="SQL" />
        <column friendlyName="ProcessID" name="ProcessID" />
        <column friendlyName="ThreadID" name="ThreadID" />
        <column friendlyName="TimeSpan" name="TimeSpan" />
        <column friendlyName="User" name="User" />
        <column friendlyName="HTTPSessionID" name="HTTPSessionID" />
        <column friendlyName="HTTPForward" name="HTTPForward" />
        <column friendlyName="SessionID" name="SessionID" />
        <column friendlyName="SessionGUID" name="SessionGUID" />
        <column friendlyName="Datasource" name="Datasource" />
        <column friendlyName="Sequence" name="Sequence" />
        <column friendlyName="LocalSequence" name="LocalSequence" />
        <column friendlyName="Message" name="Message" />
        <column friendlyName="AppPoolName" name="AppPoolName" />
    </columns>
    <rows>
        <row>
            <col name="time">11/14/2022 23:31:12</col>
            <col name="TimeSpan">0 ms</col>
            <col name="ThreadID">0x00000025</col>
            <col name="User">USERNAME</col>
            <col name="HTTPSessionID"></col>
            <col name="HTTPForward">20.186.0.0</col>
            <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
            <col name="SessionID">6096783</col>
            <col name="Datasource">datasourceName</col>
            <col name="AppPoolName">C 1801AppServer Ext</col>
            <col name="Direction">Out</col>
            <col name="sql">UPDATE SET </col>
            <col name="Sequence">236419</col>
            <col name="LocalSequence">103825</col>
        </row>
        <row>
            <col name="time">11/14/2022 23:31:12</col>
            <col name="TimeSpan">N/A</col>
            <col name="ThreadID">0x00000025</col>
            <col name="User">USERNAME</col>
            <col name="HTTPSessionID"></col>
            <col name="HTTPForward">20.186.0.0</col>
            <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
            <col name="SessionID">6096783</col>
            <col name="Datasource">datasourceName</col>
            <col name="AppPoolName">C 1801AppServer Ext</col>
            <col name="Direction">In</col>
            <col name="sql">UPDATE SET</col>
            <col name="Sequence">236420</col>
            <col name="LocalSequence">103826</col>
        </row>
    </rows>
</diagnosticsLog>

I want to convert that so the column names become the DataFrame columns and each <row> becomes a row. I'm at a loss on how to do this.
You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here's what I would do (when reading from a file replace xml_data with the name of your file or file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io
def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict
xml_data = io.StringIO(u'''YOUR XML STRING HERE''')
etree = ET.parse(xml_data) #create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
Have a look at the ElementTree tutorial provided in the xml library documentation.
As of pandas v1.3, you can simply use:
pandas.read_xml(path_or_file)
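A minimal sketch (the <data>/<row> element names are invented sample data):

```python
import io
import pandas as pd

# By default, each child of the root becomes a DataFrame row
# and its child elements become columns.
xml = io.StringIO("""<?xml version="1.0"?>
<data>
  <row><a>1</a><b>x</b></row>
  <row><a>2</a><b>y</b></row>
</data>""")

df = pd.read_xml(xml)
print(df)
```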