This should be possible by using xmltodict to convert your XML file to a dictionary and then flattening that dictionary with the flatten_dict function from the Code Review answer credited in the code below.

Example:

# Original code: https://codereview.stackexchange.com/a/21035
from collections import OrderedDict

def flatten_dict(d):
    def items():
        for key, value in d.items():
            if isinstance(value, dict):
                for subkey, subvalue in flatten_dict(value).items():
                    yield key + "." + subkey, subvalue
            else:
                yield key, value

    return OrderedDict(items())

import xmltodict

# Convert to dict
with open('test.xml', 'rb') as f:
    xml_content = xmltodict.parse(f)

# Flatten dict
flattened_xml = flatten_dict(xml_content)

# Print in desired format
for k,v in flattened_xml.items():
    print('{} = {}'.format(k,v))

Output:

A.B.ConnectionType = a
A.B.StartTime = 00:00:00
A.B.EndTime = 00:00:00
A.B.UseDataDictionary = N
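
The flatten_dict function above only descends into nested dicts. xmltodict represents repeated sibling elements as a list, and such a list would be left as one unflattened value. Below is a minimal sketch of one way to extend the idea, assuming list items should get a numeric index in the key. The name flatten_with_lists, the index scheme and the sample data are illustrative and not part of the original answer.

```python
def flatten_with_lists(value, prefix=""):
    """Flatten nested dicts and lists into a {dotted.key: leaf} dict."""
    flat = {}
    if isinstance(value, dict):
        for key, sub in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            flat.update(flatten_with_lists(sub, child_prefix))
    elif isinstance(value, list):
        # Repeated elements: add the position to the key (assumed scheme).
        for index, sub in enumerate(value):
            flat.update(flatten_with_lists(sub, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value
    return flat


# Hand-written stand-in for a parsed document in which <B>
# contains two <Item> elements (hypothetical data).
sample = {"A": {"B": {"ConnectionType": "a", "Item": ["x", "y"]}}}
flat = flatten_with_lists(sample)
for key, val in flat.items():
    print(f"{key} = {val}")
```

The two Item values end up under A.B.Item[0] and A.B.Item[1] instead of sharing one key.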
Answer from DocZerø on Stack Overflow
2 of 3
2

This is not a complete implementation, but you could take advantage of lxml's getelementpath:

xml = """<A>
            <B>
               <ConnectionType>a</ConnectionType>
               <StartTime>00:00:00</StartTime>
               <EndTime>00:00:00</EndTime>
               <UseDataDictionary>N
               <UseDataDictionary2>G</UseDataDictionary2>
               </UseDataDictionary>
            </B>
       </A>"""


from lxml import etree
from io import StringIO
tree = etree.parse(StringIO(xml))

root = tree.getroot().tag
for node in tree.iter():
    for child in node:
        # Test the text itself: an lxml element without children is falsy.
        if child.text and child.text.strip():
            print("{}.{} = {}".format(root, ".".join(tree.getelementpath(child).split("/")), child.text.strip()))

Which gives you:

A.B.ConnectionType = a
A.B.StartTime = 00:00:00
A.B.EndTime = 00:00:00
A.B.UseDataDictionary = N
A.B.UseDataDictionary.UseDataDictionary2 = G
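
If lxml is not available, the same dotted output can be approximated with the standard library. This is a sketch only, assuming a small recursive generator (the name dotted_paths is made up here) that builds the path itself instead of calling getelementpath. The XML is a shortened copy of the sample above.

```python
import xml.etree.ElementTree as ET

XML = """<A>
  <B>
    <ConnectionType>a</ConnectionType>
    <StartTime>00:00:00</StartTime>
    <UseDataDictionary>N
      <UseDataDictionary2>G</UseDataDictionary2>
    </UseDataDictionary>
  </B>
</A>"""


def dotted_paths(element, prefix=""):
    """Yield (dotted.path, text) for every element with non-blank text."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    text = (element.text or "").strip()
    if text:
        yield path, text
    for child in element:
        yield from dotted_paths(child, path)


lines = [f"{path} = {text}" for path, text in dotted_paths(ET.fromstring(XML))]
print("\n".join(lines))
```

Unlike getelementpath, this sketch does not add an index to repeated sibling tags, so two siblings with the same name produce the same path.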
r/Python on Reddit: Reading and writing out an XML file as flat file?
January 26, 2013

I have an xml file with the following layout

<records>
  <header>
    n name value pairs
  </header>
  <rec1>
    <nested1>
    <nested2>
    <nested_n>
  </rec1>
  <rec2>
   ......
  </rec2>
</records>

I want to write each record out as one row: the parser should descend to the end of rec1 before starting a new line, and that line is record 1.
The nested* elements are higher-level nodes with further subnodes or elements. The number of nested elements can vary from one record to another, so I would like pipe-delimited entries with a space or 0, depending on the type.

All the examples I have seen either search for a specific node or element, or hard-code one or two levels using xml.etree.ElementTree or lxml.

How do I recursively descend and write everything out as one row until I hit </rec1>?

EDIT: I got so far as

from xml.etree import ElementTree as et
with open("GC.xml", "r") as fh:
    xm = et.parse(fh)
for e in xm.iter():  # getiterator() was removed in Python 3.9
    print(e.tag, repr(e.text))

How do I query the node depth to spit out a newline at the appropriate place?
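
One possible approach that avoids tracking depth: treat every direct child of the root as a record and join the text of all its descendants, so the newline falls out of the outer loop. This is a rough sketch under assumptions. The sample XML is invented to match the layout described, the header is skipped by tag name, and record_to_row is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML = """<records>
  <header><name>demo</name></header>
  <rec1><nested1><a>1</a><b>2</b></nested1><nested2>3</nested2></rec1>
  <rec2><nested1><a>4</a></nested1></rec2>
</records>"""


def record_to_row(record, delimiter="|"):
    """Join the non-blank text of one record and all its descendants."""
    values = []
    for element in record.iter():  # depth-first, in document order
        text = (element.text or "").strip()
        if text:
            values.append(text)
    return delimiter.join(values)


root = ET.fromstring(XML)
# Each direct child of <records> other than the header is one output row.
rows = [record_to_row(child) for child in root if child.tag != "header"]
print("\n".join(rows))
```

This does not pad records that have fewer nested elements. Filling gaps with a space or 0 would need a known list of columns to align against.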

Discussions

python - XML schema parsing and XML creation from flat files - Code Review Stack Exchange (July 22, 2011)
I am new to Python and had to create a schema parser to pull information on attributes and complex types, etc. and then convert data from flat files into the proper XML format. We are processing a ...

Python : Flatten xml to csv with nested child tags - Stack Overflow (July 4, 2021)
There are multiple XML files that I would like to flatten, I am looking for a generic function or logic to convert the xml to a flat file. Most of the answers include hard-coded tags. Closest one being Python : Flatten xml to csv with parent tag repeated in child but still has hard-coded solution.

Python Parsing nested XML and flattening the data - Stack Overflow (May 22, 2017)
I am trying to flatten the following XML data into CSV type table data. I could get the data in the Sal element and its attributes but I couldn't flatten SalC data to the parent sailing attributes...

python - Flatten XML data as a pandas dataframe - Stack Overflow
As a special purpose language written ... to flatter format for migration to data frame. Specifically, each stylesheet drills down to the most granular node and then by the ancestor axis pulls higher level information as sibling columns. Mentions (save as .xsl, a special .xml file or embed as string in Python...
GitHub - seflorentino/py-xml-flatten: Python script for flattening simple XML files
Simple Python script for transforming big XML files to flat CSV format.
Author   seflorentino
Top answer
1 of 1
3

Normally the xml nodes that hold a value should be the corresponding columns. As I see in your xml example "child", "child2", "childid", and so on, should be columns.

Based on the above xml I've made this piece of code that should be sufficiently generic to accommodate similar examples.

import pandas as pd
import tabulate
import xml.etree.ElementTree as Xet

def getData(root, rows, columns, rowcount, name=None):
    if name is not None:
        # Prefix each column with its parent path so that nodes with the same tag
        # under different parents (for example "name" under group and secondgroup)
        # do not end up in the same column.
        name = "{0}{1}{2}".format(name, "|", root.tag)
    else:
        name = root.tag

    for item in root:
        if len(item) == 0:  # leaf node: it holds a value
            colName = "{0}{1}{2}".format(name, "|", item.tag)
            # colName = item.tag # uncomment to use the short tag instead of the full path; ex: root|group|grouplist|groupzone|groupsize
            value = (item.text or "").strip() # empty elements have text None
            if colName not in columns:
                columns.append(colName) # save the column to a list
                rowcount.append(0) # save the row on which we add the value for this column
                rows[0].update({colName: value}) # the first occurrence always goes on row 0
            else:
                repeatPosition = columns.index(colName) # get the column position for the repeated item
                rowcount[repeatPosition] = rowcount[repeatPosition] + 1 # increase row count
                if len(rows) <= max(rowcount):
                    rows.append({}) # add a new row based on row count
                rows[rowcount[repeatPosition]].update({colName: value}) # add the value on the new row

        getData(item, rows, columns, rowcount, name) # recursive call to walk through each list of elements


xmlParse = Xet.parse('example.xml')
root = xmlParse.getroot()

rows = [{}] # adding at least one row from the start and will add additional rows as we go along
columns = [] # holds the names of the columns
rowcount = [] # holds the rows on which we add each element value; ex: 
getData(root, rows, columns, rowcount)

df = pd.DataFrame(rows, columns=columns)
print(df)
df.to_csv('parse.csv')

The end result after running this code looks like this: [image: csv result]

And this is the plain csv:

,root|child,root|child2,root|anotherchild|childid,root|anotherchild|childname,root|group|groupid,root|group|grouplist|groupzone|groupname,root|group|grouplist|groupzone|groupsize,root|secondgroup|secondgroupid,root|secondgroup|secondgrouptitle,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsub,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsubid,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupname,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsize,root|child3
0,child-val,child2-val2,another child 45,another child name,groupid-123,first,4,secondgroupid-42,second group title,v1,12,third,4,val3
1,,,,,,second,6,,,v2,1,fourth,6,
2,,,,,,third,8,,,v3,45,tenth,10,

Hopefully this should get you started in the right direction.
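
If pandas is not wanted for the last step, a list of sparse row dicts like the one getData builds can also be written with the standard csv module. This is a sketch only. The two rows and the column list are cut-down, hand-written stand-ins for the real rows and columns, and the output goes to an in-memory buffer instead of parse.csv.

```python
import csv
import io

# Stand-ins for the `rows` and `columns` lists built by getData.
rows = [
    {"root|child": "child-val", "root|group|groupname": "first"},
    {"root|group|groupname": "second"},  # later rows only hold repeated values
]
columns = ["root|child", "root|group|groupname"]

buffer = io.StringIO()
# restval fills in the columns that a row does not have.
writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Unlike df.to_csv, this writes no index column.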

Top answer
1 of 2
2

This can easily be solved with XSLT alone, without introducing Python into your workflow. If you have to use Python, however, lxml.etree provides the class lxml.etree.XSLT, which you can use to apply the stylesheet.

Assuming your XML data is in a file named xmlfile.xml the code below should work.

xsltfile.xsl

<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="text" />
        <xsl:template match="SalC">
                <xsl:value-of select="concat(../@col1,',', ../@col2,',',../@col3,',',../@col4,',',../@col5,',',../@col6,',',@col7,',',@col8,',',@col9,',',@col10)" />
        </xsl:template>
</xsl:stylesheet>

Example Code

from lxml import etree

xsltfile = etree.XSLT(etree.parse('xsltfile.xsl'))
xmlfile = etree.parse('xmlfile.xml')
output = xsltfile(xmlfile)
print(output)
2 of 2
0

sal.attrib is dict-like:

row = dict(sal.attrib)

salc.attrib is also dict-like. To "flatten" -- or rather, join -- the two dicts together, you could use dict.update:

row.update(salc.attrib)

Assuming each SalC element has col7, col8, col9 and col10 attributes, you can just call row.update(salc.attrib) for each salc in sal:


import lxml.etree as ET
import csv

text = '''\
<root>
<Sal col1="a1" col2="C" col3="12/5/2012" col4="a" col5="8" col6="True">
    <SalC col7="A" col8="1" col9="2" col10="True"/>
...
    <SalC col7="D" col8="1" col9="2" col10="True"/>
    <SalC col7="E" col8="1" col9="2" col10="False"/>
</Sal>
</root>'''

fieldnames = ('col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col6', 'col7', 'col8', 
              'col9', 'col10')

with open('/tmp/output.csv', 'w', newline='') as f:  # Python 3: the csv module needs a text-mode file
    writer = csv.DictWriter(f, fieldnames, delimiter = ',', lineterminator = '\n', )
    writer.writeheader()
    root = ET.fromstring(text)
    for sal in root.xpath('//Sal'):
        row = dict(sal.attrib)
        for salc in sal:
            row.update(salc.attrib)
            writer.writerow(row)

yields

col1,col2,col3,col4,col5,col6,col6,col7,col8,col9,col10
a1,C,12/5/2012,a,8,True,True,A,1,2,True
a1,C,12/5/2012,a,8,True,True,A1,1,2,False
a1,C,12/5/2012,a,8,True,True,B,1,2,False
...
a3,C,12/9/2012,d,8,True,True,B,1,2,False
a3,C,12/9/2012,d,8,True,True,C,1,2,False
a3,C,12/9/2012,d,8,True,True,D,1,2,True
a3,C,12/9/2012,d,8,True,True,E,1,2,False
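
The same parent/child join also works with the standard library alone. The sketch below is an assumption-laden variant, not the answer's code: it uses xml.etree.ElementTree instead of lxml, a shortened made-up sample with only four attributes, and an in-memory buffer instead of /tmp/output.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEXT = """<root>
<Sal col1="a1" col2="C">
    <SalC col7="A" col8="1"/>
    <SalC col7="B" col8="2"/>
</Sal>
</root>"""

fieldnames = ["col1", "col2", "col7", "col8"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames, lineterminator="\n")
writer.writeheader()
for sal in ET.fromstring(TEXT).iter("Sal"):
    for salc in sal.iter("SalC"):
        # Merge into a fresh dict so one child's values never leak into the next row.
        writer.writerow({**sal.attrib, **salc.attrib})
csv_text = buffer.getvalue()
print(csv_text)
```

Building a new dict per child gives the same rows as row.update when every SalC carries the same attributes, and it avoids stale values when one of them is missing an attribute.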
Flatten XML file from Spark. Let's see how we can flatten the XML… | by Hitek | Medium
November 26, 2024 - Process XML using pyspark: Processing an XML file in PySpark involves reading the XML data into a DataFrame, potentially flattening nested structures, and then performing any necessary transformations or analyses. Below is a step-by-step guide:

Install spark-xml library in the cluster. ...

# Define the path to the XML file
xml_file_path = "path/to/your/xmlfile.xml"

# Load the XML file
df = spark.read.format("xml") \
    .option("rowTag", "your_row_tag") \
    .load(xml_file_path)
Flatten XML to XPath syntax lines « Python recipes « ActiveState Code
January 18, 2011 - This script acts like xml2. It transforms a XML file into a flat text output, with XPath-like syntax, one line per XML node or attribute. This format is more suitable for working with standard unix CLI utils (sed, grep, ...
Top answer
1 of 2
1

Since the URL really contains two data sections under each <Tour>, specifically <Mentions> (which appear to be aggregate vote data) and <Candidats> (which are granular person-level data) (pardon my French), consider building two separate data frames using the new IO method, pandas.read_xml, which supports XSLT 1.0 (via the third-party lxml package). No migration to dictionaries for JSON handling.

As a special purpose language written in XML, XSLT can transform your nested structure to flatter format for migration to data frame. Specifically, each stylesheet drills down to the most granular node and then by the ancestor axis pulls higher level information as sibling columns.

Mentions (save as .xsl, a special .xml file or embed as string in Python)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/">
    <Tours>
      <xsl:apply-templates select="descendant::Tour/Mentions"/>
    </Tours>
  </xsl:template>
  
  <xsl:template match="Mentions/*">
    <Mention>
      <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
      <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
      <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
      <xsl:copy-of select="ancestor::Tour/NumTour"/>
      <Mention><xsl:value-of select="name()"/></Mention>
      <xsl:copy-of select="*"/>
    </Mention>
  </xsl:template>
  
</xsl:stylesheet>

Python (read directly from URL)

import pandas as pd

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/"
    "PR2022/resultatsT1/027/058/058com.xml"
)

# mentions_xsl holds the Mentions stylesheet shown above
mentions_df = pd.read_xml(url, stylesheet=mentions_xsl)

Output

                Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour      Mention  Nombre RapportInscrit RapportVotant
0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1     Inscrits     105           None          None
1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1  Abstentions      24          22,86          None
2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1      Votants      81          77,14          None
3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1       Blancs       2           1,90          2,47
4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1         Nuls       0           0,00          0,00
             ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...          ...     ...            ...           ...
1849  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1  Abstentions      13          14,94          None
1850  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1      Votants      74          85,06          None
1851  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1       Blancs       1           1,15          1,35
1852  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1         Nuls       0           0,00          0,00
1853  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1     Exprimes      73          83,91         98,65

[1854 rows x 16 columns]

Candidats (save as .xsl, a special .xml file or embed as string in Python)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/">
    <Candidats>
      <xsl:apply-templates select="descendant::Tour/Resultats/Candidats"/>
    </Candidats>
  </xsl:template>
  
  <xsl:template match="Candidat">
    <xsl:copy>
      <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
      <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
      <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
      <xsl:copy-of select="ancestor::Tour/NumTour"/>
      <xsl:copy-of select="*"/>
    </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>

Python (read directly from URL)

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/"
    "PR2022/resultatsT1/027/058/058com.xml"
)

# candidats_xsl holds the Candidats stylesheet shown above
candidats_df = pd.read_xml(url, stylesheet=candidats_xsl)

Output

                Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour  NumPanneauCand         NomPsn PrenomPsn CivilitePsn  NbVoix RapportExprime RapportInscrit
0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               1        ARTHAUD  Nathalie         Mme       0           0,00           0,00
1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               2        ROUSSEL    Fabien          M.       3           3,80           2,86
2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               3         MACRON  Emmanuel          M.      14          17,72          13,33
3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               4       LASSALLE      Jean          M.       2           2,53           1,90
4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               5         LE PEN    Marine         Mme      28          35,44          26,67
             ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...             ...            ...       ...         ...     ...            ...            ...
3703  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               8        HIDALGO      Anne         Mme       0           0,00           0,00
3704  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               9          JADOT   Yannick          M.       4           5,48           4,60
3705  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              10       PÉCRESSE   Valérie         Mme       6           8,22           6,90
3706  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              11         POUTOU  Philippe          M.       1           1,37           1,15
3707  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              12  DUPONT-AIGNAN   Nicolas          M.       4           5,48           4,60

[3708 rows x 19 columns]

You can join resulting data frames using their shared Communes nodes: <CodSubCom> and <LibSubCom> but may have to pivot_table on the aggregate data for a one-to-many merge. Below demonstrates with Nombre aggregate:

mentions_candidats_df = (
    candidats_df.merge(
        mentions_df.pivot_table(
            index=["CodSubCom", "LibSubCom"],
            columns="Mention",
            values="Nombre",
            aggfunc="max"
        ).reset_index(),
        on=["CodSubCom", "LibSubCom"]
    )
)
mentions_candidats_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3708 entries, 0 to 3707
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Type            3708 non-null   object
 1   Annee           3708 non-null   int64 
 2   CodReg          3708 non-null   int64 
 3   CodReg3Car      3708 non-null   int64 
 4   LibReg          3708 non-null   object
 5   CodDpt          3708 non-null   int64 
 6   CodMinDpt       3708 non-null   int64 
 7   CodDpt3Car      3708 non-null   int64 
 8   LibDpt          3708 non-null   object
 9   CodSubCom       3708 non-null   int64 
 10  LibSubCom       3708 non-null   object
 11  NumTour         3708 non-null   int64 
 12  NumPanneauCand  3708 non-null   int64 
 13  NomPsn          3708 non-null   object
 14  PrenomPsn       3708 non-null   object
 15  CivilitePsn     3708 non-null   object
 16  NbVoix          3708 non-null   int64 
 17  RapportExprime  3708 non-null   object
 18  RapportInscrit  3708 non-null   object
 19  Abstentions     3708 non-null   int64 
 20  Blancs          3708 non-null   int64 
 21  Exprimes        3708 non-null   int64 
 22  Inscrits        3708 non-null   int64 
 23  Nuls            3708 non-null   int64 
 24  Votants         3708 non-null   int64 
dtypes: int64(16), object(9)
memory usage: 753.2+ KB

In forthcoming pandas 1.5, read_xml will support dtypes to allow conversion after XSLT transformation in this case.
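
Until then, the Rapport* columns arrive as strings with a decimal comma (object dtype in the info() output above). A minimal sketch of the conversion in plain Python follows. The helper name parse_french_decimal is made up for illustration, and the sample values are copied from the output tables.

```python
def parse_french_decimal(value):
    """Convert a string such as '22,86' to a float; pass None through."""
    if value is None:
        return None
    return float(value.replace(",", "."))


raw = ["22,86", "0,00", None, "98,65"]
converted = [parse_french_decimal(v) for v in raw]
print(converted)
```

On a data frame, a function like this could be applied to one column at a time, for example with Series.map.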

2 of 2
1

I tried this:

import pandas as pd
import xmltodict

rawdata = '058com.xml'

with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

df = pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])


col_length_df = len(df.columns)
all_columns = list(df.columns[:-1]) + list(df.iloc[0, len(df.columns)-1][0].keys())

new_df = df.reindex(columns = all_columns)

# astype returns a new frame, so the result has to be assigned back
new_df = new_df.astype({"RapportExprime": str, "RapportInscrit": str})

for index, rows in new_df.iterrows():
    new_df.iloc[index, col_length_df-1:] = list(df.iloc[index, len(df.columns)-1][0].values())

Since the last column of df holds, for each row, a list of ordered dictionaries, the code uses the keys of the first dictionary to add empty columns to new_df alongside the original columns of df. Finally, it loops over the rows and fills those columns with the values of that first dictionary.

The above code gives us:

    CodSubCom           LibSubCom Tours.Tour.NumTour Tours.Tour.Mentions.Inscrits.Nombre Tours.Tour.Mentions.Abstentions.Nombre  ... PrenomPsn CivilitePsn NbVoix RapportExprime RapportInscrit
0         001               Achun                  1                                 105                                     24  ...  Nathalie         Mme      0           0,00           0,00
1         002       Alligny-Cosne                  1                                 696                                    133  ...  Nathalie         Mme      3           0,54           0,43
2         003   Alligny-en-Morvan                  1                                 533                                    123  ...  Nathalie         Mme      5           1,25           0,94
3         004               Alluy                  1                                 263                                     48  ...  Nathalie         Mme      1           0,48           0,38
4         005               Amazy                  1                                 188                                     51  ...  Nathalie         Mme      2           1,53           1,06
..        ...                 ...                ...                                 ...                                    ...  ...       ...         ...    ...            ...            ...
304       309        Villapourçon                  1                                 327                                     70  ...  Nathalie         Mme      1           0,40           0,31
305       310     Villiers-le-Sec                  1                                  34                                      4  ...  Nathalie         Mme      0           0,00           0,00
306       311         Ville-Langy                  1                                 203                                     46  ...  Nathalie         Mme      1           0,64           0,49
307       312  Villiers-sur-Yonne                  1                                 263                                     60  ...  Nathalie         Mme      0           0,00           0,00
308       313         Vitry-Laché                  1                                  87                                     13  ...  Nathalie         Mme      1           1,37           1,15

Finally, new_df.columns is:

Index(['CodSubCom', 'LibSubCom', 'Tours.Tour.NumTour',
       'Tours.Tour.Mentions.Inscrits.Nombre',
       'Tours.Tour.Mentions.Abstentions.Nombre',
       'Tours.Tour.Mentions.Abstentions.RapportInscrit',
       'Tours.Tour.Mentions.Votants.Nombre',
       'Tours.Tour.Mentions.Votants.RapportInscrit',
       'Tours.Tour.Mentions.Blancs.Nombre',
       'Tours.Tour.Mentions.Blancs.RapportInscrit',
       'Tours.Tour.Mentions.Blancs.RapportVotant',
       'Tours.Tour.Mentions.Nuls.Nombre',
       'Tours.Tour.Mentions.Nuls.RapportInscrit',
       'Tours.Tour.Mentions.Nuls.RapportVotant',
       'Tours.Tour.Mentions.Exprimes.Nombre',
       'Tours.Tour.Mentions.Exprimes.RapportInscrit',
       'Tours.Tour.Mentions.Exprimes.RapportVotant', 'NumPanneauCand',
       'NomPsn', 'PrenomPsn', 'CivilitePsn', 'NbVoix', 'RapportExprime',
       'RapportInscrit'],
      dtype='object')

Total number of columns in new_df: 24
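
Note that every row in the table above shows the same PrenomPsn: the code only reads the first entry ([0]) of each nested list. A sketch of one way to get one row per nested entry instead, in plain Python. The helper explode_records and the key name "Candidats" are illustrative assumptions about the parsed structure, and the sample values are taken from the Achun rows shown earlier.

```python
def explode_records(records, list_key):
    """Return one flat dict per item of record[list_key], repeating the parent fields."""
    rows = []
    for record in records:
        parent = {k: v for k, v in record.items() if k != list_key}
        for item in record.get(list_key, []):
            rows.append({**parent, **item})
    return rows


# Hand-written stand-in for one parsed commune with two candidates.
communes = [
    {
        "LibSubCom": "Achun",
        "Candidats": [
            {"NomPsn": "ARTHAUD", "NbVoix": "0"},
            {"NomPsn": "ROUSSEL", "NbVoix": "3"},
        ],
    },
]
rows = explode_records(communes, "Candidats")
for row in rows:
    print(row)
```

pandas.json_normalize offers record_path and meta arguments for the same kind of expansion, which may be worth trying on the real document.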

Best way to flatten a huge complex xml file and extract all data | Community
May 31, 2019 - This works fine enough and I get the desired output however as soon as my .xml file gets larger than 50Mb, this "xml process" takes a long time and becomes the bottleneck in my process. Most of my .xml files are much larger. ... I want to pursue the PythonCaller (xml.etree.Elementtree) option and parse the whole file to its constituent elements and then use that output in the rest of the process.
Flatten XML via JSON - ryder.dev
January 2, 2021 - If you run into ModuleNotFoundErrors, simply type the following into your Python terminal: ...

import xmltodict, json
xml_filepath = input('Drag and drop an XML file here:').strip()
with open(xml_filepath) as xml_file_input:
    xml_data_stream = xml_file_input.read()
data_dict = xmltodict.parse(xml_data_stream)

I came across a nice, succinct, and effective piece of code to accomplish what I needed. Thanks Amir Ziai!

import json
def flatten_json(y): # or use `pip install flatten_json`
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is li
GitHub - knadh/xmlutils.py: Python scripts for processing XML documents and converting to SQL, CSV, and JSON [UNMAINTAINED]
Convert an XML document to a CSV file.

xml2csv --input "samples/fruits.xml" --output "samples/fruits.csv" --tag "item"

--input Input XML document's filename*
--output Output CSV file's filename*
--tag The tag of the node that represents a single record (Eg: item, record)*
--delimiter Delimiter for separating items in a row.
Flat File - Python XML Processing - Python Studio
October 14, 2010 - To accomplish its goal of taking flat text and organizing it into an XML document, the FlatfileParser uses the DOM implementation to create a DOM structure to hold the various pieces of text that the FlatfileParser extracts: # ... The class FlatfileParser has one method, named parseFile.
Sample script to convert an xml file into a csv file · GitHub
November 29, 2012 - Sample script to convert an xml file into a csv file - main.py
How to convert XML to a flat table (Python) | by Erik Yan | TDS Archive | Medium
October 16, 2021 - Simply copy and paste the script into a Python file or notebook and execute the script: NOTE: Be sure to replace the XML file path with the path of your XML file path. The script will prompt you for your username and password you created prior. If the login credentials are valid, it will automatically compile the REST API request, encode your XML data, send it to the interpreter, and then return your flat ...
db2 - Converting XML to a flat file - Stack Overflow
In fact, each row in the flat file will have the segment name as well (at the beginning). I just removed it from here as I didn't want to complicate it here. Thanks! ... ....you're just complicating things. Adding an extra step adds an extra point of failure. Assuming the original code was reasonably written (...which might not be true...), you'd probably do as much work trying to write the XML-to-flatfile as just converting the original flatfile process.
Top answer
1 of 1
4

Try the following code as a starter:

#!python3

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>Longitude
                       |Latitude
                       |date&time
                       |gsm\s+cell\s+id
                     )
                     \s*:?\s*
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('cell.txt') as f:
    celldata = ET.SubElement(root, 'celldata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'celldata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('cell.xml', encoding='utf-8', xml_declaration=True)

For your example data, it displays:

<root>
<celldata>
<latitude>23.1100348</latitude>
<longitude>72.5364922</longitude>
<datetime>30:August:2014 05:04:31 PM</datetime>
<gsmcellid>4993</gsmcellid>
</celldata>

<celldata>
<latitude>23.1120549</latitude>
<longitude>72.5397988</longitude>
<datetime>30:August:2014 05:04:34 PM</datetime>
<gsmcellid>4993</gsmcellid>
</celldata>

</root>

Update for the wanted neighbour list:

#!python3

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>Longitude
                       |Latitude
                       |date&time
                       |gsm\s+cell\s+id
                       |Neighboring\s+List-\s+Lac\s+:\s+Cid\s+:\s+RSSI
                     )
                     \s*:?\s*
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('cell.txt') as f:
    celldata = ET.SubElement(root, 'celldata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'celldata')
            celldata.text = '\n'
            celldata.tail = '\n\n'
        else:
            # If the line contains the wanted data, process it.
            m = rex.search(line)
            if m:
                # Fix some problems with the title as it will be used
                # as the tag name.
                title = m.group('title')
                title = title.replace('&', '')
                title = title.replace(' ', '')

                if line.startswith('Neighboring'):
                    neighbours = ET.SubElement(celldata, 'neighbours')
                    neighbours.text = '\n'
                    neighbours.tail = '\n'
                else:
                    e = ET.SubElement(celldata, title.lower())
                    e.text = m.group('value')
                    e.tail = '\n'
            else:
                # This is the neighbour item. Split it by colon,
                # and set the attributes of the item element.
                item = ET.SubElement(neighbours, 'item')
                item.tail = '\n'

                lac, cid, rssi = (a.strip() for a in line.split(':'))
                item.attrib['lac'] = lac
                item.attrib['cid'] = cid
                item.attrib['rssi'] = rssi.split()[0] # dBm removed

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('cell.xml', encoding='utf-8', xml_declaration=True)

Update that accepts an empty line before the neighbours; it is also a better implementation for general purposes:

#!python3

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>Longitude
                       |Latitude
                       |date&time
                       |gsm\s+cell\s+id
                       |Neighboring\s+List-\s+Lac\s+:\s+Cid\s+:\s+RSSI
                     )
                     \s*:?\s*
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('cell.txt') as f:
    celldata = ET.SubElement(root, 'celldata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    status = 0              # init status of the finite automaton
    for line in f:
        if status == 0:     # lines of the heading expected
            # If the line contains the wanted data, process it.
            m = rex.search(line)
            if m:
                # Fix some problems with the title as it will be used
                # as the tag name.
                title = m.group('title')
                title = title.replace('&', '')
                title = title.replace(' ', '')

                if line.startswith('Neighboring'):
                    neighbours = ET.SubElement(celldata, 'neighbours')
                    neighbours.text = '\n'
                    neighbours.tail = '\n'
                    status = 1  # empty line and then list of neighbours expected
                else:
                    e = ET.SubElement(celldata, title.lower())
                    e.text = m.group('value')
                    e.tail = '\n'
                    # keep the same status

        elif status == 1:   # empty line expected
            if line.isspace():
                status = 2  # list of neighbours must follow
            else:
                # Raising stops processing here; no separate error status is needed.
                raise RuntimeError('Empty line expected. (status == {})'.format(status))

        elif status == 2:   # neighbour or the empty line as final separator

            if line.isspace():
                celldata = ET.SubElement(root, 'celldata')
                celldata.text = '\n'
                celldata.tail = '\n\n'
                status = 0  # go to the initial status
            else:
                # This is the neighbour item. Split it by colon,
                # and set the attributes of the item element.
                item = ET.SubElement(neighbours, 'item')
                item.tail = '\n'

                lac, cid, rssi = (a.strip() for a in line.split(':'))
                item.attrib['lac'] = lac
                item.attrib['cid'] = cid
                item.attrib['rssi'] = rssi.split()[0] # dBm removed
                # keep the same status

        elif status == 999: # error status -- break the loop
            break

        else:
            # LogicError is not a built-in exception; use RuntimeError instead.
            raise RuntimeError('Unexpected status {}.'.format(status))

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('cell.xml', encoding='utf-8', xml_declaration=True)

The code implements a so-called finite automaton, where the status variable represents its current state. You can visualize it with pencil and paper: draw small circles with the status numbers inside (called nodes in graph theory). In each status, only certain kinds of input (lines) are allowed. When the input is recognized, draw an arrow (an oriented edge in graph theory) to another status (possibly to the same status, as a loop returning to the same node). The arrow is annotated `condition | action`.

The result may look complex at first; however, it is easy in the sense that you can always focus only on the part of the code that belongs to a certain status. The code can also be modified easily. Finite automata have limited power, but they are perfect for this kind of problem.
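
To make the automaton idea easier to see in isolation, here is a minimal sketch of the same three statuses using named states instead of numbers. It is not the code above: the names State, HEADER_RE and parse_records are hypothetical, it only recognizes the Latitude and Longitude headings, and it collects plain dicts instead of building XML elements. The input layout (a blank line after the "Neighboring List" title and another one after the neighbour rows) is assumed from the answer above.

```python
import re
from enum import Enum, auto


class State(Enum):
    HEADER = auto()        # heading lines such as "Latitude : ..." expected
    EXPECT_BLANK = auto()  # blank line after the "Neighboring List" title
    NEIGHBOURS = auto()    # neighbour rows, or a blank line ending the record


# Simplified pattern (assumption): only two of the headings are handled.
HEADER_RE = re.compile(r'^(Longitude|Latitude)\s*:\s*(.*)$')


def parse_records(lines):
    """Return a list of dicts, one per record (hypothetical helper)."""
    records = []
    current = {'neighbours': []}
    state = State.HEADER
    for line in lines:
        line = line.strip()
        if state is State.HEADER:
            m = HEADER_RE.match(line)
            if m:
                current[m.group(1).lower()] = m.group(2)
            elif line.startswith('Neighboring'):
                state = State.EXPECT_BLANK
            # any other line is ignored; the state stays the same
        elif state is State.EXPECT_BLANK:
            if line:
                raise ValueError('Empty line expected, got {!r}'.format(line))
            state = State.NEIGHBOURS
        elif state is State.NEIGHBOURS:
            if not line:
                # a blank line closes the record
                records.append(current)
                current = {'neighbours': []}
                state = State.HEADER
            else:
                lac, cid, rssi = (part.strip() for part in line.split(':'))
                # keep only the number, dropping the "dBm" unit
                current['neighbours'].append((lac, cid, rssi.split()[0]))
    # keep a trailing record that was not closed by a blank line
    if len(current) > 1 or current['neighbours']:
        records.append(current)
    return records


sample = [
    'Latitude : 23.1100348',
    'Longitude : 72.5364922',
    'Neighboring List- Lac : Cid : RSSI',
    '',
    '1 : 2 : -80 dBm',
    '',
]
records = parse_records(sample)
print(records)
```

Each branch of the if/elif chain corresponds to one circle in the pencil-and-paper drawing, and each assignment to state corresponds to one arrow, so a new kind of line can be supported by touching only the branch for the status where it may appear.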

Python
docs.python.org › 3 › library › xml.etree.elementtree.html
xml.etree.ElementTree — The ElementTree XML API
January 29, 2026 - This function takes an XML data string (xml_data) or a file path or file-like object (from_file) as input, converts it to the canonical form, and writes it out using the out file(-like) object, if provided, or returns it as a text string if not. The output file receives text, not bytes.
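
The snippet above describes xml.etree.ElementTree.canonicalize(), which is available from Python 3.8. As a small illustration, assuming a made-up input string, the call below passes the XML as xml_data and omits out, so the canonical form is returned as text. The strip_text=True option is used here only to drop the whitespace-only text in the sample.

```python
import xml.etree.ElementTree as ET

# Made-up sample: attributes out of order, single quotes, a self-closing tag.
xml_data = "<root b='2' a='1'><item/>  </root>"

# No `out` argument is given, so the canonical form comes back as a string.
canonical = ET.canonicalize(xml_data, strip_text=True)
print(canonical)
```

Canonical form sorts the attributes, uses double quotes, and writes empty elements as explicit start and end tags, which makes two XML documents easier to compare line by line before flattening them.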