While XML as a data format can take many forms, from flat to deeply nested, data frames must adhere to a single two-dimensional structure: row by column. Hence, as noted in the docs, pandas.read_xml is a convenience method best suited to flatter, shallow XML files. You can use the xpath argument to traverse different areas of the document, not just the default ./*.
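As a rough illustration of the xpath argument, the sketch below points read_xml at a nested set of rows instead of the root's direct children. The document layout (katalog, info, artikli) and its values are invented for the example, and parser="etree" is chosen only so the snippet does not depend on lxml; the etree parser supports just a limited XPath subset.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: the rows of interest sit two levels below the root
xml_text = """<katalog>
  <info><datum>2024-01-01</datum></info>
  <artikli>
    <artikal><id>1</id><naziv>A</naziv></artikal>
    <artikal><id>2</id><naziv>B</naziv></artikal>
  </artikli>
</katalog>"""

# Select every <artikal> element as one row; its children become columns
artikal_df = pd.read_xml(StringIO(xml_text), xpath=".//artikal", parser="etree")
print(artikal_df)
```

With the default ./* the frame would instead be built from the <info> and <artikli> elements, which is rarely what is wanted for a document shaped like this.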
However, you can use XSLT 1.0 (a special-purpose language designed to transform XML files) with the default parser, lxml, to transform any XML into the flat format a data frame needs. The stylesheet below restyles the <slike> node as comma-separated text of its <slika> children:
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="slike">
        <xsl:copy>
            <xsl:for-each select="*">
                <xsl:value-of select="text()"/>
                <xsl:if test="position() != last()">
                    <xsl:text>,</xsl:text>
                </xsl:if>
            </xsl:for-each>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
Python
import pandas as pd

artikal_df = pd.read_xml("my_filename.xml", stylesheet="my_style.xsl")
# CONVERT COMMA-SEPARATED VALUES TO EMBEDDED LISTS
artikal_df["slike"] = artikal_df["slike"].str.split(',')
# PREFIX PARENT NODE NAME
artikal_df = artikal_df.add_prefix('artikal_')
artikal_df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 2 entries, 0 to 1
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 artikal_id 2 non-null int64
# 1 artikal_sifra 2 non-null int64
# 2 artikal_barKod 2 non-null int64
# 3 artikal_naziv 2 non-null object
# 4 artikal_kategorija1 2 non-null object
# 5 artikal_kategorija2 2 non-null object
# 6 artikal_kategorija3 2 non-null object
# 7 artikal_vpCena 2 non-null float64
# 8 artikal_mpCena 2 non-null float64
# 9 artikal_dostupan 2 non-null int64
# 10 artikal_opis 0 non-null float64
# 11 artikal_slike 2 non-null object
# dtypes: float64(3), int64(4), object(5)
# memory usage: 320.0+ bytes
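For comparison, the same flattening can be sketched with the standard library alone, which may help when lxml is unavailable. This is not the method used in the answer: the sample input is invented (only the element names slike and slika come from the answer), and flatten_slike is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: element names follow the answer, values are invented
SAMPLE = """<artikli>
  <artikal>
    <id>1</id>
    <naziv>Stolica</naziv>
    <slike><slika>a.jpg</slika><slika>b.jpg</slika></slike>
  </artikal>
  <artikal>
    <id>2</id>
    <naziv>Sto</naziv>
    <slike><slika>c.jpg</slika></slike>
  </artikal>
</artikli>"""


def flatten_slike(xml_text):
    """Replace the children of each <slike> with their comma-joined text."""
    root = ET.fromstring(xml_text)
    # Materialize the matches first because the tree is modified in the loop
    for slike in list(root.iter("slike")):
        joined = ",".join((child.text or "") for child in slike)
        for child in list(slike):
            slike.remove(child)
        slike.text = joined
    return root


flat_root = flatten_slike(SAMPLE)
# One dict per <artikal>; every value is now a plain string
rows = [{child.tag: child.text for child in artikal} for artikal in flat_root]
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows), after which the slike column can be split on commas as in the answer's code.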
Answer from Parfait on Stack Overflow
Start by parsing the XML file and creating an output file to which the data will be written in CSV format (or any other text format, though you might have to tweak the code a bit).
Then specify the column names for your final dataframe. This information is already in your XML file, so you just need to make sure you understand its contents.
Lastly, loop over the entries, look up each keyword (column name), and write the values to the CSV.
Once done, you can read the CSV using pd.read_csv('output.csv').
import csv
import xml.etree.ElementTree as ET

# Load and parse the XML file
tree = ET.parse('your_xml_file.xml')
root = tree.getroot()

# Column names double as the child tag names to look up
header = ['column1', 'column2', 'column3', 'column4', 'column5']

# Open the CSV file; the with block closes it automatically
with open('output.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    # Write header row
    csv_writer.writerow(header)
    # Extract data and write one row per record element
    for record in root.findall('.//main_identifier'):
        # findtext returns the default ('') when the tag is missing
        row = [record.findtext(f'.//{name}', default='') for name in header]
        csv_writer.writerow(row)
Pandas dataframe to nested xml
Each week I get a spreadsheet of price changes from a supplier. I have been using Excel to format and calculate the required columns, then exporting to XML for import into our stock management system.
I have written a script using pandas to import and process the sheet, but I am stuck on how to export it to xml.
The xml needs to follow the following format:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Items xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Item>
        <Descriptors>
            <Barcode>9770307017919</Barcode>
            <SupplierCode>030701791</SupplierCode>
            <Description>Daily Express (Mon)</Description>
            <CommodityGroup>1</CommodityGroup>
        </Descriptors>
        <Pricing>
            <PackCost>0.5625</PackCost>
            <CostPricePerUnit>0.5625</CostPricePerUnit>
            <RetailPrice>0.75</RetailPrice>
            <ValidFrom>44193</ValidFrom>
        </Pricing>
        <Sizing>
            <PackSize>1</PackSize>
        </Sizing>
        <Flags/>
    </Item>
</Items>
I have the columns of my dataframe titled as Parent.Field, i.e.:
["Descriptors.Barcode", "Descriptors.SupplierCode", "Descriptors.Description", "Descriptors.CommodityGroup", "Pricing.PackCost", "Pricing.CostPricePerUnit", "Pricing.RetailPrice", "Sizing.Packsize"]
Pretty much the only relevant thing I could find online was this:
https://stackoverflow.com/questions/18574108/how-do-convert-a-pandas-dataframe-to-xml
but I'm unsure how best to use it to export with the necessary nested data structure.
Does anyone have any tips as to how I can achieve this?
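One possible approach, sketched here under stated assumptions, is to split each Parent.Field column name on the dot and build the nesting with the standard library's ElementTree. The helper rows_to_nested_xml is hypothetical; it takes plain row dicts, which a dataframe can supply through df.to_dict("records"). The sample row uses values from the XML above.

```python
import xml.etree.ElementTree as ET


def rows_to_nested_xml(rows, root_tag="Items", item_tag="Item"):
    """Build <Items><Item><Parent><Field>... from 'Parent.Field' keys."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, item_tag)
        parents = {}  # parent elements already created for this item
        for column, value in row.items():
            parent_name, field_name = column.split(".", 1)
            parent = parents.get(parent_name)
            if parent is None:
                parent = ET.SubElement(item, parent_name)
                parents[parent_name] = parent
            ET.SubElement(parent, field_name).text = str(value)
    return root


# Hypothetical rows; with pandas these would come from df.to_dict("records")
rows = [{
    "Descriptors.Barcode": "9770307017919",
    "Descriptors.SupplierCode": "030701791",
    "Pricing.PackCost": 0.5625,
    "Sizing.PackSize": 1,
}]

root = rows_to_nested_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The sketch leaves out the xmlns:xsi attribute, the empty <Flags/> element, and the XML declaration. ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True) adds a declaration, though without the standalone="yes" part. Element names are case-sensitive, so the column names need to match the target tags exactly (PackSize rather than Packsize).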