I got the desired result with the following script.
XML File:
<?xml version="1.0" encoding="UTF-8"?>
<base>
<element1>element 1</element1>
<element2>element 2</element2>
<element3>
<subElement3>subElement 3</subElement3>
</element3>
</base>
Python code:
import pandas as pd
from lxml import etree
data = "C:/Path/test.xml"
tree = etree.parse(data)
lstKey = []
lstValue = []
for p in tree.iter():
    # Turn each element's XPath (e.g. /base/element1) into a dotted key
    lstKey.append(tree.getpath(p).replace("/", ".")[1:])
    lstValue.append(p.text)
df = pd.DataFrame({'key': lstKey, 'value': lstValue})
df = df.sort_values('key')  # sort_values returns a new DataFrame
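Result: a DataFrame with two columns, key (the dotted element path, e.g. base.element3.subElement3) and value (the element's text), with one row per element.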
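For comparison, here is a minimal sketch of the same flattening using only the standard library, in case lxml is not installed. It is not the script above: `flatten` is a hypothetical helper, whitespace-only text is turned into `None`, and repeated sibling tags are not disambiguated the way lxml's `getpath` does with `[n]` indices.

```python
import xml.etree.ElementTree as ET

# Same sample document as above, without the XML declaration
XML = """\
<base>
<element1>element 1</element1>
<element2>element 2</element2>
<element3>
<subElement3>subElement 3</subElement3>
</element3>
</base>
"""

def flatten(elem, prefix=""):
    """Yield (dotted_path, text) for elem and all of its descendants."""
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    # Container elements only hold whitespace; report that as None
    text = (elem.text or "").strip() or None
    yield path, text
    for child in elem:
        yield from flatten(child, path)

rows = sorted(flatten(ET.fromstring(XML)), key=lambda row: row[0])
for key, value in rows:
    print(key, value)
```

The list of tuples can be handed to `pd.DataFrame(rows, columns=['key', 'value'])` to get the same two-column layout.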

Given the two levels of nodes that cover the Coluna attributes, consider XSLT, the special-purpose language designed to transform XML files. Python's lxml can run XSLT 1.0 scripts and, because lxml is the default parser for pandas.read_xml, you can pass a stylesheet to read_xml to transform the raw XML into a flatter version before it is parsed into a DataFrame.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:pace='http://www.ms.com/pace'>
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>
    <!-- REDESIGN XML TO ONLY RETURN AnaliseDiaria NODES -->
    <xsl:template match="/*">
        <xsl:copy>
            <xsl:apply-templates select="descendant::pace:AnaliseDiaria"/>
        </xsl:copy>
    </xsl:template>
    <!-- REDESIGN AnaliseDiaria NODES -->
    <xsl:template match="pace:AnaliseDiaria">
        <xsl:copy>
            <!-- BRING DOWN Produto ATTRIBUTES WITH CURRENT ATTRIBUTES -->
            <xsl:copy-of select="ancestor::pace:Produto/@*|@*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
Python
import pandas as pd
analise_diaria_df = pd.read_xml("input.xml", stylesheet="style.xsl")
analise_diaria_df
# Coluna1 Coluna2 Coluna3 ... Coluna14 Coluna15 Coluna16
# 0 21-851611 CAMIO VO NaN ... NaN NaN NaN
# 1 21-3667984 SCA4X2 -1.0 ... NaN NaN NaN
# 2 21-3667994 SCA963 -1.0 ... NaN NaN NaN
# 3 21-3676543 SCA713 -1.0 ... NaN NaN NaN
# 4 21-3676601 SCA97 -1.0 ... NaN NaN NaN
# 5 21-3814014 CAMIX2 NaN ... NaN NaN NaN
# 6 21-3814087 SCA56 NaN ... NaN NaN NaN
# 7 21-3814087 SCA56 NaN ... 195.000,00 NF9 10203910A
# 8 21-3814087 SCA56 NaN ... 195.090,00 NaN NaN
# 9 21-3814087 SCA56 NaN ... 195.270,00 NaN NaN
# 10 21-3814087 SCA56 NaN ... 195.482,60 NaN NaN
# 11 21-3814087 SCA56 NaN ... 195.627,80 NaN NaN
# 12 21-3814087 SCA56 NaN ... 204.529,82 NaN NaN
# 13 21-3814087 SCA56 NaN ... NaN NaN 158PES
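If you would rather not maintain a stylesheet, the same "bring the Produto attributes down onto each AnaliseDiaria" idea can be sketched with the standard library. This is only an illustration: the sample document, its `Relatorio` root, and the `flatten_rows` helper are hypothetical, and only the namespace URI and the Linha/Produto/AnaliseDiaria/Coluna names come from the stylesheet and output above.

```python
import xml.etree.ElementTree as ET

NS = {"pace": "http://www.ms.com/pace"}

# Hypothetical sample shaped like the structure the stylesheet targets
XML = """\
<pace:Relatorio xmlns:pace="http://www.ms.com/pace">
  <pace:Linha>
    <pace:Produto Coluna1="21-0001" Coluna2="AAA">
      <pace:AnaliseDiaria Coluna14="100,00"/>
      <pace:AnaliseDiaria Coluna14="200,00" Coluna15="NF1"/>
    </pace:Produto>
    <pace:Produto Coluna1="21-0002" Coluna2="BBB">
      <pace:AnaliseDiaria Coluna16="X1"/>
    </pace:Produto>
  </pace:Linha>
</pace:Relatorio>
"""

def flatten_rows(root):
    """Yield one dict per AnaliseDiaria, merged with its Produto's attributes."""
    for produto in root.iterfind(".//pace:Produto", NS):
        for analise in produto.iterfind(".//pace:AnaliseDiaria", NS):
            row = dict(produto.attrib)   # parent attributes first
            row.update(analise.attrib)   # the node's own attributes win on clashes
            yield row

rows = list(flatten_rows(ET.fromstring(XML)))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` gives one row per AnaliseDiaria, with NaN wherever an attribute is missing.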
Fortunately, for the XML in your question, you can use the pandas read_xml() function directly, although you'll have to work around the namespace issue:
import pandas as pd
pd.read_xml("file.xml", xpath='//*[local-name()="Linha"]//*[local-name()="Produto"]')
Output:
Coluna1 Coluna2 Coluna3 Coluna4 Coluna5 {http://www.ms.com/pace}AnaliseDiaria
0 21-851611 CAMIO VO NaN NaN NaN NaN
1 21-3667984 SCA4X2 -1.0 NaN NaN NaN
2 21-3667994 SCA963 -1.0 NaN NaN NaN
etc. If you are not interested in one column or another, you can simply drop() it.
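An alternative to `local-name()` is to bind the namespace URI to a prefix and use that prefix in the path. The sketch below shows the idea with the standard library on a hypothetical sample document (the `Relatorio` root and the attribute values are made up; the prefix `p` is arbitrary).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample using the same default namespace as the question's file
XML = """\
<Relatorio xmlns="http://www.ms.com/pace">
  <Linha>
    <Produto Coluna1="21-0001" Coluna2="AAA"/>
    <Produto Coluna1="21-0002" Coluna2="BBB" Coluna3="-1"/>
  </Linha>
</Relatorio>
"""

# Map a prefix of our choosing to the namespace URI
NS = {"p": "http://www.ms.com/pace"}

root = ET.fromstring(XML)
rows = [dict(el.attrib) for el in root.findall("./p:Linha/p:Produto", NS)]
print(rows)
```

pandas.read_xml takes the same kind of mapping through its `namespaces` argument, so `pd.read_xml("file.xml", xpath="//p:Linha//p:Produto", namespaces={"p": "http://www.ms.com/pace"})` should select the same nodes as the `local-name()` version.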
One way to achieve this is an XSLT transformation. Most programming languages, including Python, support converting an XML document into another document (e.g. HTML) when supplied with an XSL stylesheet.
A good tutorial on XSLT Transformation can be found here
Use of Python to achieve transformation (once an XSL is prepared) is described here
There are several things wrong with your XHTML source. First, xmlns is not a correct attribute for the xml declaration; it should be put on the root element instead. And the root element for XHTML is <html>, not <xhtml>. So the valid XHTML input in this particular case would be
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
</body></html>
That said, I'm not sure if xml.etree.ElementTree accepts that, having no experience with it.
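A quick way to check is to feed the corrected document to xml.etree.ElementTree and inspect what comes back. This is a small sketch, not a full validation of the XHTML: it only shows that the parser accepts the input and that element tags come back qualified with the namespace URI.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The corrected input from above, with xmlns on the <html> root element
source = (
    '<?xml version="1.0"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title></title></head>\n'
    '<body>\n'
    '</body></html>'
)

root = ET.fromstring(source)
print(root.tag)

# Tags are reported as {namespace-uri}localname, so lookups need the URI too
title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
child_tags = [child.tag for child in root]
print(child_tags)
```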
You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here's what I would do (when reading from a file replace xml_data with the name of your file or file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io
def iter_docs(author):
    # Start every row from the author's attributes, then add the document's own
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict
xml_data = io.StringIO(u'''YOUR XML STRING HERE''')
etree = ET.parse(xml_data)  # create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
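To make the row shape concrete, here is a small end-to-end sketch of the two generators on a hypothetical document with two authors. The `authors` root and the `name`, `language`, and `id` attributes are invented for illustration; only the `author` and `document` tag names come from the code above. It stops at a list of dicts so it runs without pandas.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: the root is not an author, so iter_author is needed
XML = """\
<authors>
  <author name="Ann" language="EN">
    <document id="1">first text</document>
    <document id="2">second text</document>
  </author>
  <author name="Bob" language="DE">
    <document id="3">third text</document>
  </author>
</authors>
"""

def iter_docs(author):
    """Yield one dict per <document>, including the author's attributes."""
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

def iter_author(tree):
    """Yield the rows of every <author> found anywhere in the tree."""
    for author in tree.iter('author'):
        yield from iter_docs(author)

tree = ET.parse(io.StringIO(XML))
rows = list(iter_author(tree))
for row in rows:
    print(row)
```

Each dict becomes one DataFrame row when passed to `pd.DataFrame(rows)`, with the author attributes repeated on every document of that author.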
Have a look at the ElementTree tutorial provided in the xml library documentation.
As of pandas v1.3, you can simply use:
pandas.read_xml(path_or_file)
