See below

import requests
import xml.etree.ElementTree as ET
import pandas as pd

r = requests.get('https://raw.githubusercontent.com/dgs2021/golfdeals/main/35386_3864840_mp_delta.xml')
attrb_fields =  {'manufacturer_name': 'manufacturer','name':'name','part_number':'part_number'}
sub_elements = {'retail':'retail','product':'product'}

root = ET.fromstring(r.content)

data = []
for p in root.findall('product'):
  entry = {v:p.attrib.get(k,'NA') for k,v in attrb_fields.items()}
  for k,v in sub_elements.items():
    e = p.find(f'.//{v}')
    entry[k] = e.text if e is not None else 'NA'
  data.append(entry)
columns = list(attrb_fields.values()) + list(sub_elements.values())
df = pd.DataFrame(data,columns= columns)
print(df)

output

          manufacturer  ...                                            product
0           Champ Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
1         Stinger Tees  ...  https://click.linksynergy.com/link?id=83wh4zNK...
2           Vegas Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
3        Ray Cook Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4     Rock Bottom Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
...                ...  ...                                                ...
4100     Callaway Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4101        Cobra Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4102      Odyssey Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4103   TaylorMade Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4104     Titleist Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...

[4105 rows x 5 columns]
Answer from balderman on Stack Overflow
Pandas
pandas.pydata.org › docs › reference › api › pandas.read_xml.html
pandas.read_xml - pandas 3.0.1 documentation - PyData
The XPath to parse required set of nodes for migration to DataFrame. ``XPath`` should return a collection of elements and not a single element. Note: The etree parser supports limited XPath expressions. For more complex XPath, use lxml which requires installation. ... The namespaces defined in XML document as dicts with key being namespace prefix and value the URI.
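As a rough illustration of the xpath parameter described above, the sketch below reads a small XML string with pandas.read_xml. The <catalog>/<book> document and its field names are made up for illustration; parser="etree" is chosen so that lxml is not required.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document; not taken from the pandas docs.
xml_text = """<catalog>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="data"><title>Pandas Basics</title><price>29.50</price></book>
</catalog>"""

# xpath must select a collection of row-like elements (here every <book>).
# parser="etree" uses the standard-library parser, which supports only a
# limited XPath subset.
df = pd.read_xml(StringIO(xml_text), xpath=".//book", parser="etree")
print(df)
```

Each <book> becomes one row; its attributes and the text of its immediate children become columns. Anything nested deeper than that is not picked up, which is why the answers below fall back to manual parsing.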
GeeksforGeeks
geeksforgeeks.org › how-to-create-pandas-dataframe-from-nested-xml
How to create Pandas DataFrame from nested XML? | GeeksforGeeks
April 28, 2021 - In this article, we will learn how to create Pandas DataFrame from nested XML. We will use the xml.etree.ElementTree module, which is a built-in module in Python for parsing or reading information from the XML file.
Discussions

Parsing deeply nested XML into dataframe with python - struggling with deeper elements - Stack Overflow
I'm attempting to parse out a fairly nested XML file. I've spent the last few hours trying to find a solution with no luck. I'm not sure if the issue is with namespaces, or needing to findall within the loop. I am able to extract the higher level elements but the deeper nested elements are not being extracted. I am looking to export Part_number, manufacturer_name, name, Product and Retail to a df... More on stackoverflow.com
stackoverflow.com
October 15, 2021
How to read nested xml file with python pandas? - Stack Overflow
While XML as a data format can ... deeply nested, data frames must adhere to a single structure of two dimensions: row by column. Hence, as noted in docs, pandas.read_xml, is a convenience method best for flatter, shallow XML files. You can use xpath to traverse different areas of the document, not just the default /*. However, you can use XSLT 1.0 (special purpose language designed to transform XML files) with the default parser, lxml, to ... More on stackoverflow.com
stackoverflow.com
python - How to convert an XML file to nice pandas dataframe? - Stack Overflow
@CristianCiupitu I see the question is tagged python-2.7 ---u prefix has been added. 2018-07-25T02:59:45.58Z+00:00 ... Actually, from this specific post, OP needs to adjust XPath to look one level deeper from root: pandas.read_xml(path_or_file, xpath="/Author/document") 2021-05-19T16:39:32.857Z+00:00 ... Here is another way of converting a xml to pandas data frame. For example i have parsing ... More on stackoverflow.com
stackoverflow.com
python - Nested XML to Pandas dataframe - Stack Overflow
I'm trying to create a script to convert nested XML files to a Pandas dataframe. I've found this article https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c, which does a g... More on stackoverflow.com
stackoverflow.com
Stack Exchange
datascience.stackexchange.com › questions › 113782 › nested-xml-to-dataframe
nested xml to dataframe - Data Science Stack Exchange
August 23, 2022 -

resulting_df = pd.DataFrame()
file_path = "Downloads/dataexch.xml"
df_cols = ["Filename", "Label", "xmin", "ymin", "xmax", "ymax"]
rows = []
tree = ET.parse(file_path)
root = tree.getroot()
for node in root:
    # files = {}
    Filename = root.find('filename').text
    Label = root.find('object').find('name').text
    xmin = root.find('object').find('bndbox').find('xmin').text
    ymin = root.find('object').find('bndbox').find('ymin').text
    xmax = root.find('object').find('bndbox').find('xmax').text
    ymax = root.find('object').find('bndbox').find('ymax').text
    rows.append({"Filename": Filename, "Label": Label, "xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax})
# print(rows)
out_df = pd.DataFrame(rows, columns=df_cols).drop_duplicates()
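The snippet above calls root.find(...) inside the loop, so every iteration reads the same first <object>, which is presumably why drop_duplicates() is needed. Below is a minimal sketch that emits one row per <object>, assuming a Pascal VOC style annotation file. The sample XML and the voc_rows helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the layout the snippet above appears to assume.
sample = """<annotation>
  <filename>img_001.jpg</filename>
  <object>
    <name>cat</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>5</xmin><ymin>15</ymin><xmax>50</xmax><ymax>60</ymax></bndbox>
  </object>
</annotation>"""


def voc_rows(root):
    """Return one dict per <object>, repeating the file name on each row."""
    filename = root.findtext("filename")
    rows = []
    for obj in root.iter("object"):
        row = {"Filename": filename, "Label": obj.findtext("name")}
        for field in ("xmin", "ymin", "xmax", "ymax"):
            # findtext returns None if the element is missing.
            value = obj.findtext(f"bndbox/{field}")
            row[field] = int(value) if value is not None else None
        rows.append(row)
    return rows


rows = voc_rows(ET.fromstring(sample))
for row in rows:
    print(row)
```

The resulting list can be passed to pd.DataFrame(rows, columns=df_cols) as in the original snippet.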
Top answer
1 of 2
1

2 of 2
1

This assumes the XML structure is constant, so that the XPath expression returns the attributes and elements of every product in the same order:

from lxml import etree
import pandas as pd

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []
tree = etree.parse('/home/luis/tmp/tmp.xml')
root = tree.getroot()
steps = tree.xpath('//product/attribute::*[name()="name" or name()="part_number" or name()="manufacturer_name"] | //product/URL/product/text() | //product/price/retail/text()')
# The union XPath returns values in document order; for this feed each
# product yields: name, manufacturer_name, part_number, URL/product,
# price/retail.
i = 0
d = dict()
for s in steps:

    if i == 0:
        d[df_cols[2]] = s  # name
    if i == 1:
        d[df_cols[1]] = s  # manufacturer
    if i == 2:
        d[df_cols[0]] = s  # part_number
    if i == 3:
        d[df_cols[4]] = s  # product (URL)
    if i == 4:
        d[df_cols[3]] = s  # retail (price); last value of the group
        rows.append(d)
        i = 0
        d = dict()
        continue
    i += 1


out_df = pd.DataFrame(rows, columns=df_cols)

print(out_df.head())

Result:

                part_number   manufacturer                                               name retail                                            product
0  19CHPSPWRCH1111111111101     Champ Golf                   Champ Golf- Max Pro Spike Wrench   9.99  https://click.linksynergy.com/link?id=83wh4zNK...
1  19STGTEEMID3CO1111111101   Stinger Tees  Stinger Tees- 3" Stinger Pro XL Competition Ca...   7.99  https://click.linksynergy.com/link?id=83wh4zNK...
2  19VEGORIGIN1111111111101     Vegas Golf                          Vegas Golf- Original Game  14.99  https://click.linksynergy.com/link?id=83wh4zNK...
3  19RAYBALRET1111111111201  Ray Cook Golf      Ray Cook Golf- 12' Compact Cup Ball Retriever  19.99  https://click.linksynergy.com/link?id=83wh4zNK...
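Because the union XPath returns values in document order, the positional mapping above breaks as soon as a product lacks one of the fields. Here is a sketch that looks up each field per <product> instead, using only the standard library. The element paths URL/product and price/retail come from the XPath above; the sample data and the <merchandiser> root name are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the second product deliberately has no <price>.
sample = """<merchandiser>
  <product name="Spike Wrench" manufacturer_name="Champ Golf" part_number="A1">
    <URL><product>https://example.com/a1</product></URL>
    <price><retail>9.99</retail></price>
  </product>
  <product name="Pro Tees" manufacturer_name="Stinger Tees" part_number="B2">
    <URL><product>https://example.com/b2</product></URL>
  </product>
</merchandiser>"""

root = ET.fromstring(sample)
rows = []
# findall("product") matches direct children only, so the nested
# <URL><product> elements are not mistaken for rows.
for product in root.findall("product"):
    rows.append({
        "part_number": product.get("part_number"),
        "manufacturer": product.get("manufacturer_name"),
        "name": product.get("name"),
        "retail": product.findtext("price/retail"),   # None if missing
        "product": product.findtext("URL/product"),   # None if missing
    })

for row in rows:
    print(row)
```

Each value is tied to its column by name, so a missing element produces None in that cell rather than shifting the remaining values.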
TutorialsPoint
tutorialspoint.com › python_pandas › python_pandas_parsing_xml_file.htm
Python Pandas - Parsing XML File
For deeply nested or complex XML files, you can use the xpath and namespaces parameters to extract specific nodes. This example shows how to parse a nested XML structure representing a bookstore.
Top answer
1 of 2
5

While XML as a data format can take many forms, from flat to deeply nested, data frames must adhere to a single two-dimensional structure: row by column. Hence, as noted in the docs, pandas.read_xml is a convenience method best suited to flatter, shallow XML files. You can use xpath to traverse different areas of the document, not just the default /*.

However, you can use XSLT 1.0 (a special-purpose language designed to transform XML files) with the default parser, lxml, to transform any XML into the flat format a data frame needs. The stylesheet below rewrites the <slike> node as comma-separated text of its <slika> children:

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
    </xsl:template>
    
    <xsl:template match="slike">
     <xsl:copy>
       <xsl:for-each select="*">
         <xsl:value-of select="text()"/>
         <xsl:if test="position() != last()">
            <xsl:text>,</xsl:text>
         </xsl:if>
       </xsl:for-each>
     </xsl:copy>
    </xsl:template>  
</xsl:stylesheet>


Python

import pandas as pd

artikal_df = pd.read_xml("my_filename.xml", stylesheet="my_style.xsl")

# CONVERT COMMA-SEPARATED VALUES TO EMBEDDED LISTS
artikal_df["slike"] = artikal_df["slike"].str.split(',')

# PREFIX PARENT NODE NAME
artikal_df = artikal_df.add_prefix('artikal_')

artikal_df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 2 entries, 0 to 1
# Data columns (total 12 columns):
#  #   Column               Non-Null Count  Dtype  
# ---  ------               --------------  -----  
#  0   artikal_id           2 non-null      int64  
#  1   artikal_sifra        2 non-null      int64  
#  2   artikal_barKod       2 non-null      int64  
#  3   artikal_naziv        2 non-null      object 
#  4   artikal_kategorija1  2 non-null      object 
#  5   artikal_kategorija2  2 non-null      object 
#  6   artikal_kategorija3  2 non-null      object 
#  7   artikal_vpCena       2 non-null      float64
#  8   artikal_mpCena       2 non-null      float64
#  9   artikal_dostupan     2 non-null      int64  
#  10  artikal_opis         0 non-null      float64
#  11  artikal_slike        2 non-null      object 
# dtypes: float64(3), int64(4), object(5)
# memory usage: 320.0+ bytes
2 of 2
0

Start by reading the XML file and creating a placeholder file to hold the output in CSV format (or any other text format, though you might have to tweak the code a bit).

Then specify the column names for your final dataframe. This information is already in your XML file, so you just need to make sure you understand its contents.

Lastly, loop over the entries, look up each keyword (column name), and write the values to the CSV.

Once done, you can read the CSV using pd.read_csv('output.csv').

import xml.etree.ElementTree as ET
import csv

# Load and parse the XML file
tree = ET.parse('your_xml_file.xml')
root = tree.getroot()

# Define the CSV file and writer
csv_file = open('output.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)

# Write header row
header = ['column1', 'column2', 'column3', 'column4', 'column5']
csv_writer.writerow(header)

# Extract data and write to CSV
for record in root.findall('.//main_identifier'):
    # Fall back to an empty string when an element is missing
    column1_text = record.find('column1').text if record.find('column1') is not None else ''
    column2_text = record.find('.//column2').text if record.find('.//column2') is not None else ''
    column3_text = record.find('.//column3').text if record.find('.//column3') is not None else ''
    column4_text = record.find('.//column4').text if record.find('.//column4') is not None else ''
    column5_text = record.find('.//column5').text if record.find('.//column5') is not None else ''
    
    # Write data to CSV
    csv_writer.writerow([column1_text, column2_text, column3_text, column4_text, column5_text])

# Close the CSV file
csv_file.close()
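A more compact variant of the same idea, as a sketch: csv.DictWriter writes the rows, findtext supplies the empty-string fallback, and a with block closes the output. The tag names are the placeholders from the answer above, and an in-memory buffer stands in for output.csv so the example runs on its own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder XML using the same hypothetical tag names as above.
sample = """<root>
  <main_identifier><column1>a</column1><column2>b</column2></main_identifier>
  <main_identifier><column1>c</column1><column3>d</column3></main_identifier>
</root>"""

header = ["column1", "column2", "column3"]
root = ET.fromstring(sample)

# io.StringIO stands in for open('output.csv', 'w', newline='', encoding='utf-8').
with io.StringIO(newline="") as buffer:
    writer = csv.DictWriter(buffer, fieldnames=header)
    writer.writeheader()
    for record in root.findall(".//main_identifier"):
        # findtext returns the default when the element is absent.
        writer.writerow({col: record.findtext(f".//{col}", default="") for col in header})
    csv_text = buffer.getvalue()

print(csv_text)
```

Driving both the header and the lookups from one list of names avoids the kind of copy-paste slip that is easy to make when each column has its own hand-written line.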
Find elsewhere
Medium
medium.com › @robertopreste › from-xml-to-pandas-dataframes-9292980b1c1c
From XML to Pandas dataframes. How to parse XML files to obtain proper… | by Roberto Preste | Medium
August 25, 2019 - import pandas as pd import xml.etree.ElementTree as et def parse_XML(xml_file, df_cols): """Parse the input XML file and store the result in a pandas DataFrame with the given columns.
YouTube
youtube.com › watch
Transforming Nested XML to Pandas DataFrame - YouTube
Hello and welcome to this tutorial. In this tutorial, you will learn how to transform XML documents to pandas data frames using Python and the element tree l...
Published October 21, 2023
PyPI
pypi.org › project › xml-to-df
xml-to-df
Top answer
1 of 2
10

I've made a package for a similar use case; it could work here too.

pip install pandas_read_xml

Then you can do something like:

import pandas_read_xml as pdx

df = pdx.read_xml('filename.xml', ['data'])

To flatten, you could

df = pdx.flatten(df)

or

df = pdx.fully_flatten(df)
2 of 2
6

You'll need a recursive function to flatten rows, and a mechanism for dealing with duplicate data.

This is messy and depending on the data and nesting, you may end up with rather strange dataframes.

import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd


def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """

    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text

    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value

    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))


def dictify_key_pairs(pairs, key_sep='-'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}

    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)

    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value

    return out


# Parse XML with etree
tree = et.XML("""<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
        <neighbor2 name="Italy" direction="S"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
        <cities>
            <city name="Chargin" population="1234" />
            <city name="Firin" population="4567" />
        </cities>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
""")

# Generate flat rows out of the root nodes in the tree
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)
print(df)

outputs

            name rank  year   gdppc neighbor-name-1 neighbor-name-2 neighbor-direction-1 neighbor-direction-2 neighbor2-name neighbor2-direction neighbor-name neighbor-direction cities-city-name-1 cities-city-name-2 cities-city-population-1 cities-city-population-2
0  Liechtenstein    1  2008  141100         Austria     Switzerland                    E                    W          Italy                   S           NaN                NaN                NaN                NaN                      NaN                      NaN
1      Singapore    4  2011   59900             NaN             NaN                  NaN                  NaN            NaN                 NaN      Malaysia                  N            Chargin              Firin                     1234                     4567
2         Panama   68  2011   13600      Costa Rica        Colombia                    W                    E            NaN                 NaN           NaN                NaN                NaN                NaN                      NaN                      NaN
Top answer
1 of 2
2

You want to append the text values of the ItemNr elements under the shop element to the items list, not the XML Element objects themselves, which is what you were doing.

The following code worked for me (iterating over node directly, since Element.getchildren() was removed from xml.etree.ElementTree in Python 3.9):

items.append([item_nr_element.text for item_nr_element in node])
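To illustrate the difference, here is a small sketch with a made-up <shop> element: iterating over an element yields child Element objects, and .text gives their string content.

```python
import xml.etree.ElementTree as ET

# Made-up element for illustration.
node = ET.fromstring("<shop shopNr='01'><ItemNr>1001</ItemNr><ItemNr>1002</ItemNr></shop>")

elements = list(node)                   # Element objects, not strings
texts = [child.text for child in node]  # the string values to store

print(elements)
print(texts)
```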
2 of 2
2

I hope this is the expected output:

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    items = []
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        items.append(anitem.text)
    shops_items = [shop_Nr,items]
    all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])        
print(df)

Output:

  SHOP_NUMBER                     ITEM_NUMBER
0          01  [1001, 1002, 1003, 1004, 1010]
1          02              [1002, 1006, 1005]
2          03        [1009, 1006, 1005, 1002]

If you want shops with individual items :

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        item_Nr = anitem.text
        shops_items = [shop_Nr,item_Nr]
        all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])        
print(df)

output:

   SHOP_NUMBER ITEM_NUMBER
0           01        1001
1           01        1002
2           01        1003
3           01        1004
4           01        1010
5           02        1002
6           02        1006
7           02        1005
8           03        1009
9           03        1006
10          03        1005
11          03        1002
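If the list-per-shop frame from the first variant already exists, DataFrame.explode gives the one-row-per-item layout without a second parsing loop. Here is a sketch, with the values from the first output above typed in by hand:

```python
import pandas as pd

# Values copied from the first output above.
df = pd.DataFrame(
    [
        ["01", ["1001", "1002", "1003", "1004", "1010"]],
        ["02", ["1002", "1006", "1005"]],
        ["03", ["1009", "1006", "1005", "1002"]],
    ],
    columns=["SHOP_NUMBER", "ITEM_NUMBER"],
)

# explode() turns each list element into its own row and repeats SHOP_NUMBER.
long_df = df.explode("ITEM_NUMBER").reset_index(drop=True)
print(long_df)
```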
Reddit
reddit.com › r/learnpython › parsing xml into a pandas dataframe
r/learnpython on Reddit: Parsing XML into a Pandas dataframe
December 9, 2022 -

I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.

The file looks roughly like this

<?xml version="1.0" encoding="utf-8"?>

<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">

  <!--Build 18.0.1.69-->

  <columns>

    <column friendlyName="time" name="time" />

    <column friendlyName="Direction" name="Direction" />

    <column friendlyName="SQL" name="SQL" />

    <column friendlyName="ProcessID" name="ProcessID" />

    <column friendlyName="ThreadID" name="ThreadID" />


    <column friendlyName="TimeSpan" name="TimeSpan" />

    <column friendlyName="User" name="User" />

    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />

    <column friendlyName="HTTPForward" name="HTTPForward" />

    <column friendlyName="SessionID" name="SessionID" />


    <column friendlyName="SessionGUID" name="SessionGUID" />

    <column friendlyName="Datasource" name="Datasource" />

    <column friendlyName="Sequence" name="Sequence" />

    <column friendlyName="LocalSequence" name="LocalSequence" />

    <column friendlyName="Message" name="Message" />

    <column friendlyName="AppPoolName" name="AppPoolName" />

  </columns>

  <rows>

    <row>

      <col name="time">11/14/2022 23:31:12</col>

      <col name="TimeSpan">0 ms</col>

      <col name="ThreadID">0x00000025</col>

      <col name="User">USERNAME</col>

      <col name="HTTPSessionID"></col>

      <col name="HTTPForward">20.186.0.0</col>

      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>

      <col name="SessionID">6096783</col>

      <col name="Datasource">datasourceName</col>

      <col name="AppPoolName">C 1801AppServer Ext</col>

      <col name="Direction">Out</col>

      <col name="sql">UPDATE SET </col>

      <col name="Sequence">236419</col>

      <col name="LocalSequence">103825</col>

    </row>

    <row>

      <col name="time">11/14/2022 23:31:12</col>

      <col name="TimeSpan">N/A</col>

      <col name="ThreadID">0x00000025</col>

      <col name="User">USERNAME</col>

      <col name="HTTPSessionID"></col>

      <col name="HTTPForward">20.186.0.0</col>

      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>

      <col name="SessionID">6096783</col>

      <col name="Datasource">datasourceName</col>

      <col name="AppPoolName">C 1801AppServer Ext</col>

      <col name="Direction">In</col>

      <col name="sql">UPDATE SET</col>

      <col name="Sequence">236420</col>

      <col name="LocalSequence">103826</col>

    </row>

  </rows>

</diagnosticsLog>

I want to convert that to the column names being the columns and each row being a row. I'm at a loss on how to do this.
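One possible approach, sketched with the standard library, is to treat each <row> as a dict keyed by the name attribute of its <col> children. The XML below is an abbreviated copy of the file above. Note that the rows use name="sql" while <columns> declares name="SQL", so the sketch matches names case-insensitively, which is an assumption about what is wanted.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the file above (fewer columns and fields).
sample = """<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <columns>
    <column friendlyName="time" name="time" />
    <column friendlyName="Direction" name="Direction" />
    <column friendlyName="SQL" name="SQL" />
    <column friendlyName="Message" name="Message" />
  </columns>
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">Out</col>
      <col name="sql">UPDATE SET </col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">In</col>
      <col name="sql">UPDATE SET</col>
    </row>
  </rows>
</diagnosticsLog>"""

root = ET.fromstring(sample)

# Declared column names, in file order.
columns = [c.get("name") for c in root.findall("columns/column")]
# Map lower-cased names to the declared spelling ("sql" -> "SQL").
canonical = {name.lower(): name for name in columns}

rows = []
for row in root.findall("rows/row"):
    record = dict.fromkeys(columns)  # columns absent from a row stay None
    for col in row.findall("col"):
        name = col.get("name")
        record[canonical.get(name.lower(), name)] = col.text
    rows.append(record)

for record in rows:
    print(record)
```

Passing the result to pd.DataFrame(rows, columns=columns) then gives one column per declared name and one line per <row>.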

CopyProgramming
copyprogramming.com › howto › flatten-xml-data-as-a-pandas-dataframe
Python: Transform XML data into a Pandas dataframe
March 26, 2023 - 201 ''' data = [] root = ... = {c.tag if c.tag != 'ID' else 'ID_inner':c.text for c in sub} entry.update(temp) data.append(entry) df = pd.DataFrame(data) print(df) ......
LinkedIn
linkedin.com › pulse › processing-dynamically-nested-xml-files-using-python-library-weber
Processing dynamically nested XML files using the Python library ElementTree
March 27, 2022 -

import xml.etree.ElementTree as ET
tree = ET.parse('/dbfs/mnt/xmlfiles/NestedXML.xml')
root = tree.getroot()
...
import pandas as pd
element_list = list()
parent_list = list()
tag_list = list()
text_list = list()
for element in all_descendants:
    element_list.append(str(element))
    parent_list.append(str(parent_map.get(element)) if element != root else "None")
    tag_list.append(element.tag)
    text_list.append(element.text)
data = {'Element': element_list, 'Parent': parent_list, 'Tag': tag_list, 'Text': text_list}
df = pd.DataFrame(data)
Plain English
python.plainenglish.io › parsing-xml-into-pandas-dataframes-661882abd8e5
Parsing XML into pandas DataFrames | by Florian Kromer | Python in Plain English
August 5, 2023 - import pandas as pd import xml.etree.ElementTree as et def parse_XML(xml_file, df_cols): """Parse the input XML file and store the result in a pandas DataFrame with the given columns.
Top answer
1 of 2
2

Consider building a list of dictionaries with comma-collapsed text values, then pass the list to the pandas.DataFrame constructor:

dicts = []
for node in root:
    orgs = ", ".join([org.text for org in node.findall('.//{http://something.org/schema/s/program}orgUnitId')])
    desc = ", ".join([desc.text for desc in node.findall('.//{http://something.org/schema/s/program}programDescriptionText')])
    lvls = ", ".join([lvl.text for lvl in node.findall('.//{http://something.org/schema/s/program}requiredLevel')])
    wrds = ", ".join([wrd.text for wrd in node.findall('.//{http://something.org/schema/s/program}searchword')])

    dicts.append({'organization': orgs, 'description': desc, 'level': lvls, 'keyword': wrds})

final_df = pd.DataFrame(dicts, columns=['organization','description','level','keyword'])

Output

print(final_df)
#      organization                                        description                                         level                                            keyword
# 0  Organization 1                       Here is some text; blablabla            academic bachelor, academic master                                       Scrum master
# 1  Organization 2   Text from another organization about some stuff.  bachelor, academic master, academic bachelor                                          Excutives
# 2  Organization 3  Also another huge text description from anothe...                                                Negotiating, Effective leadership, negotiating...
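The repeated {http://something.org/schema/s/program} prefix can be avoided by passing a prefix map to findall. This is a sketch under the assumption that the document looks roughly like the made-up sample below. The prefix p is arbitrary and the collapse helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix map: "p" is an arbitrary label for the namespace URI used above.
NS = {"p": "http://something.org/schema/s/program"}

# Made-up sample using the namespace URI from the answer above.
sample = """<programs xmlns="http://something.org/schema/s/program">
  <program>
    <orgUnitId>Organization 1</orgUnitId>
    <requiredLevel>academic bachelor</requiredLevel>
    <requiredLevel>academic master</requiredLevel>
  </program>
  <program>
    <orgUnitId>Organization 2</orgUnitId>
  </program>
</programs>"""


def collapse(node, tag):
    """Join the text of all matching descendants with ', ' (empty if none)."""
    return ", ".join(el.text or "" for el in node.findall(f".//p:{tag}", NS))


root = ET.fromstring(sample)
dicts = [
    {"organization": collapse(node, "orgUnitId"), "level": collapse(node, "requiredLevel")}
    for node in root
]
print(dicts)
```

The el.text or "" guard keeps the join from failing on empty elements, whose text is None.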
2 of 2
1

A lightweight xml_to_dict converter can be found here. It can be improved by this to handle namespaces.

import io
from xml.etree import ElementTree


def xml_to_dict(xml='', remove_namespace=True):
    """Converts an XML string into a dict

    Args:
        xml: The XML as string
        remove_namespace: True (default) if namespaces are to be removed

    Returns:
        The XML string as dict

    Examples:
        >>> xml_to_dict('<text><para>hello world</para></text>')
        {'text': {'para': 'hello world'}}

    """
    def _xml_remove_namespace(buf):
        # Reference: https://stackoverflow.com/a/25920989/1498199
        it = ElementTree.iterparse(buf)
        for _, el in it:
            if '}' in el.tag:
                el.tag = el.tag.split('}', 1)[1]
        return it.root

    def _xml_to_dict(t):
        # Reference: https://stackoverflow.com/a/10077069/1498199
        from collections import defaultdict

        d = {t.tag: {} if t.attrib else None}
        children = list(t)
        if children:
            dd = defaultdict(list)
            for dc in map(_xml_to_dict, children):
                for k, v in dc.items():
                    dd[k].append(v)
            d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in dd.items()}}

        if t.attrib:
            d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())

        if t.text:
            text = t.text.strip()
            if children or t.attrib:
                if text:
                    d[t.tag]['#text'] = text
            else:
                d[t.tag] = text

        return d

    buffer = io.StringIO(xml.strip())
    if remove_namespace:
        root = _xml_remove_namespace(buffer)
    else:
        root = ElementTree.parse(buffer).getroot()

    return _xml_to_dict(root)

So let s be the string which holds your xml. We can convert it to a dict via

d = xml_to_dict(s, remove_namespace=True)

Now the solution is straightforward:

rows = []
for program in d['programs']['program']:
    cols = []
    cols.append(program['orgUnitId'])
    cols.append(program['programDescriptionText']['#text'])
    try:
        cols.append(','.join(program['requiredLevel']))
    except KeyError:
        cols.append('')

    try:
        searchwords = program['searchword']['#text']
    except TypeError:
        searchwords = []
        for searchword in program['searchword']:
            searchwords.append(searchword['#text'])
        searchwords = ','.join(searchwords)
    cols.append(searchwords)

    rows.append(cols)

df = pd.DataFrame(rows, columns=['organization', 'description', 'level', 'keyword'])
YouTube
youtube.com › watch
Parsing Deeply Nested XML with Python: Exporting to DataFrame Made Easy - YouTube
Struggling to parse deeply nested XML files in Python? Learn how to extract vital information into a DataFrame with simple tricks!---This video is based on t...
Published April 4, 2025
Views 32
Stack Abuse
stackabuse.com › reading-and-writing-xml-files-in-python-with-pandas
Reading and Writing XML Files in Python with Pandas
August 21, 2024 - In this approach, we read the file content into a variable and use ET.XML() to parse the XML document from the string constant. We loop over each child and sub-child, maintaining a list of the data they contain, while using the child tags as the DataFrame column names.