By using xmltodict to transform your XML file to a dictionary, in combination with this answer to flatten a dict, this should be possible.
Example:
# Original code: https://codereview.stackexchange.com/a/21035
from collections import OrderedDict

def flatten_dict(d):
    def items():
        for key, value in d.items():
            if isinstance(value, dict):
                for subkey, subvalue in flatten_dict(value).items():
                    yield key + "." + subkey, subvalue
            else:
                yield key, value
    return OrderedDict(items())

import xmltodict

# Convert to dict
with open('test.xml', 'rb') as f:
    xml_content = xmltodict.parse(f)

# Flatten dict
flattened_xml = flatten_dict(xml_content)

# Print in desired format
for k, v in flattened_xml.items():
    print('{} = {}'.format(k, v))
Output:
A.B.ConnectionType = a
A.B.StartTime = 00:00:00
A.B.EndTime = 00:00:00
A.B.UseDataDictionary = N
Answer from DocZerø on Stack Overflow.
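One caveat: xmltodict represents repeated sibling elements as lists, which the flatten_dict above passes through unflattened. A hedged sketch of a list-aware variant (the sample input is made up):

```python
from collections import OrderedDict

def flatten_dict(d):
    """Flatten nested dicts, indexing list items so repeated
    XML elements (parsed as lists by xmltodict) stay distinct."""
    def items():
        for key, value in d.items():
            if isinstance(value, dict):
                for subkey, subvalue in flatten_dict(value).items():
                    yield key + "." + subkey, subvalue
            elif isinstance(value, list):
                # Repeated elements: embed the list index in the path
                for i, item in enumerate(value):
                    if isinstance(item, dict):
                        for subkey, subvalue in flatten_dict(item).items():
                            yield "{}.{}.{}".format(key, i, subkey), subvalue
                    else:
                        yield "{}.{}".format(key, i), item
            else:
                yield key, value
    return OrderedDict(items())

nested = {"A": {"B": [{"x": 1}, {"x": 2}]}}
print(flatten_dict(nested))
# OrderedDict([('A.B.0.x', 1), ('A.B.1.x', 2)])
```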
This is not a complete implementation, but you could take advantage of lxml's getpath:
xml = """<A>
<B>
<ConnectionType>a</ConnectionType>
<StartTime>00:00:00</StartTime>
<EndTime>00:00:00</EndTime>
<UseDataDictionary>N
<UseDataDictionary2>G</UseDataDictionary2>
</UseDataDictionary>
</B>
</A>"""
from lxml import etree
from io import StringIO
tree = etree.parse(StringIO(xml))
root = tree.getroot().tag
for node in tree.iter():
for child in node.getchildren():
if child and child.text.strip():
print("{}.{} = {}".format(root, ".".join(tree.getelementpath(child).split("/")), child.text.strip()))
Which gives you:
A.B.ConnectionType = a
A.B.StartTime = 00:00:00
A.B.EndTime = 00:00:00
A.B.UseDataDictionary = N
A.B.UseDataDictionary.UseDataDictionary2 = G
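If lxml is not available, the same dotted-path flattening can be sketched with the stdlib ElementTree and plain recursion (tag names here are from the example document above):

```python
import xml.etree.ElementTree as ET

xml = """<A>
    <B>
        <ConnectionType>a</ConnectionType>
        <StartTime>00:00:00</StartTime>
    </B>
</A>"""

def flatten(elem, path=()):
    """Walk the tree, emitting 'dotted.path = text' for every
    element whose text is non-whitespace."""
    path = path + (elem.tag,)
    rows = []
    if elem.text and elem.text.strip():
        rows.append("{} = {}".format(".".join(path), elem.text.strip()))
    for child in elem:
        rows.extend(flatten(child, path))
    return rows

for line in flatten(ET.fromstring(xml)):
    print(line)
# A.B.ConnectionType = a
# A.B.StartTime = 00:00:00
```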
I have an XML file with the following layout:
<records>
<header>
n name value pairs
</header>
<rec1>
<nested1>
<nested2>
<nested_n>
</rec1>
<rec2>
......
</rec2>
</records>
I want to write it out as one row, with the parser descending down to the end of rec1 before writing a newline; that row becomes record 1.
The nested* elements are further high-level nodes with more subnodes or elements. The number of nested elements can vary from one record to another, so I would like to get pipe-delimited entries with space/0 depending on the type.
All the examples I see seem to either give a search example to find a specific node element, or explicitly hard-code up to 1 or 2 levels using xml.etree's ElementTree or lxml.
How do I recursively descend and write it all out as one row till I hit </rec1>?
EDIT: I got this far:
from xml.etree import ElementTree as et

fh = open("GC.xml", "r")
xm = et.parse(fh)
for e in xm.iter():  # getiterator() is deprecated
    print(e.tag, repr(e.text))
How do I query the node depth to spit out a newline at the appropriate place?
You probably want to write a SAX parser. Look at xml.sax in the stdlib.
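A minimal sketch of that SAX approach, assuming the <rec1> layout from the question (the n1/n2 tag names in the sample document are placeholders):

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Collect text under each <rec1> and emit one pipe-delimited
    row when </rec1> closes; extend to rec2, rec3, ... as needed."""
    def __init__(self):
        super().__init__()
        self.in_rec = False
        self.fields = []
        self.rows = []

    def startElement(self, name, attrs):
        if name == "rec1":
            self.in_rec = True

    def characters(self, content):
        # Accumulate non-whitespace text while inside a record
        if self.in_rec and content.strip():
            self.fields.append(content.strip())

    def endElement(self, name):
        if name == "rec1":
            self.rows.append("|".join(self.fields))
            self.fields = []
            self.in_rec = False

handler = RowHandler()
xml.sax.parseString(b"<records><rec1><n1>a</n1><n2>b</n2></rec1></records>", handler)
print(handler.rows)  # ['a|b']
```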
This question really belongs in r/learnpython.
I'd just use BeautifulSoup for simple XML parsing tasks. Something like:
soup = BeautifulSoup(xml_str, "xml")
rec1_tag = soup.find("rec1")
rec1_str = "|".join(
    child.string for child in rec1_tag.children
    if child.name and child.name.startswith("nested"))
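A runnable sketch of that idea, with a made-up <rec1> document; the "html.parser" backend is used here to stay dependency-free (the "xml" backend needs lxml but preserves tag case):

```python
import re
from bs4 import BeautifulSoup

xml_str = """<records>
  <rec1>
    <nested1>a</nested1>
    <nested2>b</nested2>
    <nested3>c</nested3>
  </rec1>
</records>"""

soup = BeautifulSoup(xml_str, "html.parser")
rec1_tag = soup.find("rec1")
# find_all with a regex skips the whitespace text nodes between tags
row = "|".join(
    tag.string for tag in rec1_tag.find_all(re.compile(r"^nested"))
)
print(row)  # a|b|c
```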
This can easily be solved using XSLT without introducing Python into your workflow at all. If you do have to use Python, lxml.etree conveniently provides the class lxml.etree.XSLT, which you can exploit to your advantage.
Assuming your XML data is in a file named xmlfile.xml the code below should work.
xsltfile.xsl
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="SalC">
<xsl:value-of select="concat(../@col1,',', ../@col2,',',../@col3,',',../@col4,',',../@col5,',',../@col6,',',@col7,',',@col8,',',@col9,',',@col10)" />
</xsl:template>
</xsl:stylesheet>
Example Code
from lxml import etree
xsltfile = etree.XSLT(etree.parse('xsltfile.xsl'))
xmlfile = etree.parse('xmlfile.xml')
output = xsltfile(xmlfile)
print(output)
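Lacking the actual xmlfile.xml, here is a self-contained sketch of the same pattern with a trimmed stylesheet and a hypothetical input embedded as strings (the col1/col2/col7/col8 attribute names are assumptions taken from the attribute naming used later in this thread):

```python
from lxml import etree

# Trimmed stylesheet: pull parent attributes via ../@ and own attributes via @
XSL = b"""<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:template match="SalC">
    <xsl:value-of select="concat(../@col1,',',../@col2,',',@col7,',',@col8)" />
  </xsl:template>
</xsl:stylesheet>"""

# Hypothetical input with the attributes the stylesheet expects
XML = """<root>
  <Sal col1="a1" col2="C">
    <SalC col7="A" col8="1"/>
  </Sal>
</root>"""

transform = etree.XSLT(etree.fromstring(XSL))
result = transform(etree.fromstring(XML))
print(str(result))  # contains "a1,C,A,1"
```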
sal.attrib is dict-like:
row = dict(sal.attrib)
salc.attrib is also dict-like. To "flatten" -- or rather, join -- the two dicts together, you could use dict.update:
row.update(salc.attrib)
Assuming each SalC element has col7, col8, col9 and col10 attributes, you can just call row.update(salc.attrib) for each salc in sal:
import lxml.etree as ET
import csv
text = '''\
<root>
<Sal col1="a1" col2="C" col3="12/5/2012" col4="a" col5="8" col6="True">
<SalC col7="A" col8="1" col9="2" col10="True"/>
...
<SalC col7="D" col8="1" col9="2" col10="True"/>
<SalC col7="E" col8="1" col9="2" col10="False"/>
</Sal>
</root>'''
fieldnames = ('col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8',
              'col9', 'col10')

with open('/tmp/output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames, delimiter=',', lineterminator='\n')
    writer.writeheader()
    root = ET.fromstring(text)
    for sal in root.xpath('//Sal'):
        row = dict(sal.attrib)
        for salc in sal:
            row.update(salc.attrib)
            writer.writerow(row)
yields
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
a1,C,12/5/2012,a,8,True,A,1,2,True
a1,C,12/5/2012,a,8,True,A1,1,2,False
a1,C,12/5/2012,a,8,True,B,1,2,False
...
a3,C,12/9/2012,d,8,True,B,1,2,False
a3,C,12/9/2012,d,8,True,C,1,2,False
a3,C,12/9/2012,d,8,True,D,1,2,True
a3,C,12/9/2012,d,8,True,E,1,2,False
Since the URL really contains two data sections under each <Tour>, specifically <Mentions> (which appear to be aggregate vote data) and <Candidats> (which are granular person-level data) (pardon my French), consider building two separate data frames using the newer IO method pandas.read_xml, which supports XSLT 1.0 via the third-party lxml package. No detour through dictionaries or JSON handling is needed.
As a special-purpose language written in XML, XSLT can transform your nested structure to a flatter format for migration to a data frame. Specifically, each stylesheet drills down to the most granular node and then, via the ancestor axis, pulls higher-level information in as sibling columns.
Mentions (save as an .xsl file, a special XML file, or embed as a string in Python)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<Tours>
<xsl:apply-templates select="descendant::Tour/Mentions"/>
</Tours>
</xsl:template>
<xsl:template match="Mentions/*">
<Mention>
<xsl:copy-of select="ancestor::Election/Scrutin/*"/>
<xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
<xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
<xsl:copy-of select="ancestor::Tour/NumTour"/>
<Mention><xsl:value-of select="name()"/></Mention>
<xsl:copy-of select="*"/>
</Mention>
</xsl:template>
</xsl:stylesheet>
Python (read directly from URL)
import pandas as pd

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/"
    "PR2022/resultatsT1/027/058/058com.xml"
)
mentions_xsl = "mentions.xsl"  # path to the stylesheet above, saved to disk (or the XSL as a string)
mentions_df = pd.read_xml(url, stylesheet=mentions_xsl)
Output
Type Annee CodReg CodReg3Car LibReg CodDpt CodMinDpt CodDpt3Car LibDpt CodSubCom LibSubCom NumTour Mention Nombre RapportInscrit RapportVotant
0 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 Inscrits 105 None None
1 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 Abstentions 24 22,86 None
2 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 Votants 81 77,14 None
3 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 Blancs 2 1,90 2,47
4 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 Nuls 0 0,00 0,00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1849 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 Abstentions 13 14,94 None
1850 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 Votants 74 85,06 None
1851 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 Blancs 1 1,15 1,35
1852 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 Nuls 0 0,00 0,00
1853 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 Exprimes 73 83,91 98,65
[1854 rows x 16 columns]
Candidats (save as an .xsl file, a special XML file, or embed as a string in Python)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<Candidats>
<xsl:apply-templates select="descendant::Tour/Resultats/Candidats"/>
</Candidats>
</xsl:template>
<xsl:template match="Candidat">
<xsl:copy>
<xsl:copy-of select="ancestor::Election/Scrutin/*"/>
<xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
<xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
<xsl:copy-of select="ancestor::Tour/NumTour"/>
<xsl:copy-of select="*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python (read directly from URL)
import pandas as pd

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/"
    "PR2022/resultatsT1/027/058/058com.xml"
)
candidats_xsl = "candidats.xsl"  # path to the stylesheet above, saved to disk (or the XSL as a string)
candidats_df = pd.read_xml(url, stylesheet=candidats_xsl)
Output
Type Annee CodReg CodReg3Car LibReg CodDpt CodMinDpt CodDpt3Car LibDpt CodSubCom LibSubCom NumTour NumPanneauCand NomPsn PrenomPsn CivilitePsn NbVoix RapportExprime RapportInscrit
0 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 1 ARTHAUD Nathalie Mme 0 0,00 0,00
1 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 2 ROUSSEL Fabien M. 3 3,80 2,86
2 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 3 MACRON Emmanuel M. 14 17,72 13,33
3 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 4 LASSALLE Jean M. 2 2,53 1,90
4 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 1 Achun 1 5 LE PEN Marine Mme 28 35,44 26,67
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3703 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 8 HIDALGO Anne Mme 0 0,00 0,00
3704 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 9 JADOT Yannick M. 4 5,48 4,60
3705 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 10 PÉCRESSE Valérie Mme 6 8,22 6,90
3706 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 11 POUTOU Philippe M. 1 1,37 1,15
3707 Présidentielle 2022 27 27 Bourgogne-Franche-Comté 58 58 58 Nièvre 313 Vitry-Laché 1 12 DUPONT-AIGNAN Nicolas M. 4 5,48 4,60
[3708 rows x 19 columns]
You can join the resulting data frames on their shared Communes fields, <CodSubCom> and <LibSubCom>, but you may have to pivot_table the aggregate data for a one-to-many merge. Below demonstrates with the Nombre aggregate:
mentions_candidats_df = (
candidats_df.merge(
mentions_df.pivot_table(
index=["CodSubCom", "LibSubCom"],
columns="Mention",
values="Nombre",
aggfunc="max"
).reset_index(),
on=["CodSubCom", "LibSubCom"]
)
)
mentions_candidats_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3708 entries, 0 to 3707
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Type 3708 non-null object
1 Annee 3708 non-null int64
2 CodReg 3708 non-null int64
3 CodReg3Car 3708 non-null int64
4 LibReg 3708 non-null object
5 CodDpt 3708 non-null int64
6 CodMinDpt 3708 non-null int64
7 CodDpt3Car 3708 non-null int64
8 LibDpt 3708 non-null object
9 CodSubCom 3708 non-null int64
10 LibSubCom 3708 non-null object
11 NumTour 3708 non-null int64
12 NumPanneauCand 3708 non-null int64
13 NomPsn 3708 non-null object
14 PrenomPsn 3708 non-null object
15 CivilitePsn 3708 non-null object
16 NbVoix 3708 non-null int64
17 RapportExprime 3708 non-null object
18 RapportInscrit 3708 non-null object
19 Abstentions 3708 non-null int64
20 Blancs 3708 non-null int64
21 Exprimes 3708 non-null int64
22 Inscrits 3708 non-null int64
23 Nuls 3708 non-null int64
24 Votants 3708 non-null int64
dtypes: int64(16), object(9)
memory usage: 753.2+ KB
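The pivot-then-merge step can be seen on a synthetic miniature of the two frames (all values here are made up, only the column names follow the data above):

```python
import pandas as pd

# One candidate row per commune/candidate (many rows per commune)
candidats = pd.DataFrame({
    "CodSubCom": [1, 1, 2],
    "NomPsn": ["A", "B", "A"],
    "NbVoix": [10, 5, 7],
})

# One row per commune/mention: long format, to be pivoted wide
mentions = pd.DataFrame({
    "CodSubCom": [1, 1, 2, 2],
    "Mention": ["Inscrits", "Votants", "Inscrits", "Votants"],
    "Nombre": [100, 80, 50, 40],
})

# Pivot mentions to one row per commune, then a one-to-many merge
wide = mentions.pivot_table(
    index="CodSubCom", columns="Mention", values="Nombre", aggfunc="max"
).reset_index()
merged = candidats.merge(wide, on="CodSubCom")
print(merged)
```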
In forthcoming pandas 1.5, read_xml will support dtypes to allow conversion after XSLT transformation in this case.
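A sketch of that dtype conversion, assuming pandas >= 1.5 (element names borrowed from the data above; parser="etree" avoids the lxml dependency for this small example):

```python
from io import StringIO

import pandas as pd

xml = """<rows>
  <row><NbVoix>14</NbVoix><NomPsn>MACRON</NomPsn></row>
  <row><NbVoix>28</NbVoix><NomPsn>LE PEN</NomPsn></row>
</rows>"""

# dtype= was added to read_xml in pandas 1.5
df = pd.read_xml(StringIO(xml), parser="etree", dtype={"NbVoix": "float64"})
print(df.dtypes)
```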
I tried this:
import pandas as pd
import xmltodict

rawdata = '058com.xml'
with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

df = pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])
col_length_df = len(df.columns)
all_columns = list(df.columns[:-1]) + list(df.iloc[0, len(df.columns)-1][0].keys())
new_df = df.reindex(columns=all_columns)
new_df = new_df.astype({"RapportExprime": str, "RapportInscrit": str})  # astype returns a copy; assign it
for index, rows in new_df.iterrows():
    new_df.iloc[index, col_length_df-1:] = list(df.iloc[index, len(df.columns)-1][0].values())
Since the last column of df holds a list with an ordered dictionary, the code uses its keys to add empty columns, along with the original columns of df, to new_df. Finally, it loops over the rows of df and new_df to fill the empty columns of new_df.
The above code gives us:
CodSubCom LibSubCom Tours.Tour.NumTour Tours.Tour.Mentions.Inscrits.Nombre Tours.Tour.Mentions.Abstentions.Nombre ... PrenomPsn CivilitePsn NbVoix RapportExprime RapportInscrit
0 001 Achun 1 105 24 ... Nathalie Mme 0 0,00 0,00
1 002 Alligny-Cosne 1 696 133 ... Nathalie Mme 3 0,54 0,43
2 003 Alligny-en-Morvan 1 533 123 ... Nathalie Mme 5 1,25 0,94
3 004 Alluy 1 263 48 ... Nathalie Mme 1 0,48 0,38
4 005 Amazy 1 188 51 ... Nathalie Mme 2 1,53 1,06
.. ... ... ... ... ... ... ... ... ... ... ...
304 309 Villapourçon 1 327 70 ... Nathalie Mme 1 0,40 0,31
305 310 Villiers-le-Sec 1 34 4 ... Nathalie Mme 0 0,00 0,00
306 311 Ville-Langy 1 203 46 ... Nathalie Mme 1 0,64 0,49
307 312 Villiers-sur-Yonne 1 263 60 ... Nathalie Mme 0 0,00 0,00
308 313 Vitry-Laché 1 87 13 ... Nathalie Mme 1 1,37 1,15
Finally, new_df.columns is:
Index(['CodSubCom', 'LibSubCom', 'Tours.Tour.NumTour',
'Tours.Tour.Mentions.Inscrits.Nombre',
'Tours.Tour.Mentions.Abstentions.Nombre',
'Tours.Tour.Mentions.Abstentions.RapportInscrit',
'Tours.Tour.Mentions.Votants.Nombre',
'Tours.Tour.Mentions.Votants.RapportInscrit',
'Tours.Tour.Mentions.Blancs.Nombre',
'Tours.Tour.Mentions.Blancs.RapportInscrit',
'Tours.Tour.Mentions.Blancs.RapportVotant',
'Tours.Tour.Mentions.Nuls.Nombre',
'Tours.Tour.Mentions.Nuls.RapportInscrit',
'Tours.Tour.Mentions.Nuls.RapportVotant',
'Tours.Tour.Mentions.Exprimes.Nombre',
'Tours.Tour.Mentions.Exprimes.RapportInscrit',
'Tours.Tour.Mentions.Exprimes.RapportVotant', 'NumPanneauCand',
'NomPsn', 'PrenomPsn', 'CivilitePsn', 'NbVoix', 'RapportExprime',
'RapportInscrit'],
dtype='object')
Total number of columns in new_df: 24
For Python, here is a comprehensive list of available XML libs/modules:
http://wiki.python.org/moin/PythonXml
If you are looking for something simpler than XSLT, XMLStarlet is a set of command-line tools which may be of interest to you:
http://xmlstar.sourceforge.net/
Like any command-line tool, it is not specifically made for Python, but it can easily be integrated into a Python script.
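For instance, a hedged sketch that only builds the xmlstarlet command line (the tool must be installed separately; the file name and XPath below are placeholders):

```python
import subprocess

def xmlstarlet_values(xml_path, xpath):
    """Build the argv for `xmlstarlet sel` extracting the text at
    the given XPath, one value per line."""
    return ["xmlstarlet", "sel", "-t", "-v", xpath, "-n", xml_path]

cmd = xmlstarlet_values("test.xml", "/A/B/ConnectionType")
print(" ".join(cmd))
# With xmlstarlet installed, run it with:
# subprocess.run(cmd, capture_output=True, text=True)
```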
Although it’s only useful for writing XML, XMLwitch is freakin’ amazing. For doing non-XML to XML transformations, I highly recommend it!