Parsing XML into a Pandas dataframe
I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I can never remember all the goofy intricacies of dealing with it.
The file looks roughly like this:
<?xml version="1.0" encoding="utf-8"?>
<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
<!--Build 18.0.1.69-->
<columns>
<column friendlyName="time" name="time" />
<column friendlyName="Direction" name="Direction" />
<column friendlyName="SQL" name="SQL" />
<column friendlyName="ProcessID" name="ProcessID" />
<column friendlyName="ThreadID" name="ThreadID" />
<column friendlyName="TimeSpan" name="TimeSpan" />
<column friendlyName="User" name="User" />
<column friendlyName="HTTPSessionID" name="HTTPSessionID" />
<column friendlyName="HTTPForward" name="HTTPForward" />
<column friendlyName="SessionID" name="SessionID" />
<column friendlyName="SessionGUID" name="SessionGUID" />
<column friendlyName="Datasource" name="Datasource" />
<column friendlyName="Sequence" name="Sequence" />
<column friendlyName="LocalSequence" name="LocalSequence" />
<column friendlyName="Message" name="Message" />
<column friendlyName="AppPoolName" name="AppPoolName" />
</columns>
<rows>
<row>
<col name="time">11/14/2022 23:31:12</col>
<col name="TimeSpan">0 ms</col>
<col name="ThreadID">0x00000025</col>
<col name="User">USERNAME</col>
<col name="HTTPSessionID"></col>
<col name="HTTPForward">20.186.0.0</col>
<col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>
<col name="SessionID">6096783</col>
<col name="Datasource">datasourceName</col>
<col name="AppPoolName">C 1801AppServer Ext</col>
<col name="Direction">Out</col>
<col name="sql">UPDATE SET </col>
<col name="Sequence">236419</col>
<col name="LocalSequence">103825</col>
</row>
<row>
<col name="time">11/14/2022 23:31:12</col>
<col name="TimeSpan">N/A</col>
<col name="ThreadID">0x00000025</col>
<col name="User">USERNAME</col>
<col name="HTTPSessionID"></col>
<col name="HTTPForward">20.186.0.0</col>
<col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>
<col name="SessionID">6096783</col>
<col name="Datasource">datasourceName</col>
<col name="AppPoolName">C 1801AppServer Ext</col>
<col name="Direction">In</col>
<col name="sql">UPDATE SET</col>
<col name="Sequence">236420</col>
<col name="LocalSequence">103826</col>
</row>
</rows>
</diagnosticsLog>
I want to convert that so the column names become the DataFrame's columns and each <row> element becomes a row. I'm at a loss on how to do this.
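Since every <col> element carries its column name in a name attribute, one minimal sketch using only the standard library plus pandas would be to build one dict per <row> and hand the list to the DataFrame constructor (the sample below is trimmed to three columns for brevity; for the real file, use ET.parse(path).getroot() instead of the inline string):

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Trimmed version of the log above; swap in ET.parse(path).getroot() for the real file
xml = """<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">
  <rows>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">Out</col>
      <col name="Sequence">236419</col>
    </row>
    <row>
      <col name="time">11/14/2022 23:31:12</col>
      <col name="Direction">In</col>
      <col name="Sequence">236420</col>
    </row>
  </rows>
</diagnosticsLog>"""

root = ET.fromstring(xml)
# One dict per <row>; the "name" attribute of each <col> becomes the key
records = [{col.get("name"): col.text for col in row.findall("col")}
           for row in root.iter("row")]
df = pd.DataFrame(records)
print(df)
```

In the full file, any <col> absent from a given row shows up as NaN, and empty elements like HTTPSessionID come through as None; all values arrive as strings, so convert dtypes afterwards as needed.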
Indeed, in the forthcoming Pandas 1.3, read_xml will allow you to migrate parsed nodes into data frames. However, because XML can have many dimensions beyond the 2D of rows by columns, as the docs note:
This method is best designed to import shallow XML documents
Therefore, nested elements are not immediately picked up, as shown here with only about 20 columns. Notice the required use of namespaces due to the default namespace in the document.
Pandas 1.3+
import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... fairValLevel securityLending assetCat debtSec
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... 3.0 NaN None NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 2.0 NaN ABS-CBDO NaN
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... 3.0 NaN EP NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN NaN None NaN
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN NaN None NaN
# ..                                                  ...  ...                                                ...        ...  ...           ...             ...      ...     ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN NaN None NaN
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN NaN None NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN NaN None NaN
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN NaN None NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN NaN None NaN
# [163 rows x 20 columns]
url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... invCountry isRestrictedSec fairValLevel securityLending
# 0 Salient Private Access Master Fund, L.P. NaN Salient Private Access Master Fund, L.P. 999999999 ... US Y NaN NaN
# [1 rows x 18 columns]
Fortunately, read_xml supports XSLT (a special-purpose language designed to transform XML documents) with the default lxml parser. With XSLT, you can flatten the needed nodes before migration to retrieve all 32 columns.
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
# name lei title cusip ... annualizedRt isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... NaN None None None
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 0.0624 N N N
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... NaN None None None
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN None None None
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN None None None
# ..                                                  ...  ...                                                ...        ...  ...           ...             ...      ...     ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN None None None
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN None None None
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN None None None
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN None None None
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN None None None
# [163 rows x 32 columns]
Pandas < 1.3
To achieve the same result on earlier versions via the XPath approach requires more steps, since you have to handle the URL request, the XML parsing, and the data frame build yourself. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. Below uses the same XSLT and XPath with namespace as above.
import lxml.etree as lx
import pandas as pd
import urllib.request as rq
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
content = rq.urlopen(url)
# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)
# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)
# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]: node.text for node in inv.xpath("*")}
        for inv in result.xpath("//edgar:invstOrSec",
                                namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]
# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)
print(df)
# name lei title ... isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. N/A Tastemade Inc. ... NaN NaN NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... N/A Regatta XV Funding Ltd., Subordinated Note, Pr... ... N N N
# 2 Hired, Inc., Series C Preferred Stock N/A Hired, Inc., Series C Preferred Stock ... NaN NaN NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP N/A WESTVIEW CAPITAL PARTNERS II LP ... NaN NaN NaN
# 4 VOYAGER CAPITAL FUND III, L.P. N/A VOYAGER CAPITAL FUND III, L.P. ... NaN NaN NaN
# .. ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. N/A ARCLIGHT ENERGY PARTNERS FUND V, L.P. ... NaN NaN NaN
# 159 ALLOY MERCHANT PARTNERS L.P. N/A ALLOY MERCHANT PARTNERS L.P. ... NaN NaN NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... ... NaN NaN NaN
# 161 ABRY ADVANCED SECURITIES FUND LP N/A ABRY ADVANCED SECURITIES FUND LP ... NaN NaN NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... ... NaN NaN NaN
# [163 rows x 32 columns]
First of all, thanks for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be pleased to know that there is a dev version of pandas read_xml which should be coming soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)
As for your current conundrum, this is a result of (and one of my many dislikes about) the structure of XML. Unlike JSON, where a single element can still be returned within a list, an XML structure with just one tag is interpreted as a single value rather than a list.
Essentially, if there is only one "row" tag, then the "column" tags are treated as row labels... I'm not making much sense, am I? Let me explain with your examples.
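The quirk is easy to see with xmltodict (the parser pandas-read-xml builds on, as mentioned further down): a repeated tag parses to a list, while a single occurrence parses to a plain dict.

```python
import xmltodict

one = xmltodict.parse("<rows><row><a>1</a></row></rows>")
many = xmltodict.parse("<rows><row><a>1</a></row><row><a>2</a></row></rows>")

# A single <row> comes back as a dict of its children...
print(type(one["rows"]["row"]))
# ...while repeated <row> tags come back as a list of dicts
print(type(many["rows"]["row"]))
```

So downstream code that expects a list of rows breaks on the single-row document, which is exactly the situation described above.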
Here is how I suggest you use it:
# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec']).pipe(fully_flatten)
# Example 2
url_2 = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df_2 = pdx.read_xml(url_2,['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2
What is the difference?
In Example 1, you already expect multiple invstOrSec tags within the invstOrSecs tag. So, passing root_tag_list=['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] returns a list under the hood. The fully_flatten step then first explodes that list into rows.
In Example 2, if you use the same root_tag_list, pandas is not reading in a list. Rather, it is reading in a dictionary that corresponds to the single row. In effect, it treats the tags intended as columns as rows. Instead, I would pass the tag one level above as the root tag, then transpose, then fully_flatten.
Yes... I know... it is a bit of a workaround. But then again, I didn't create pandas-read-xml hoping to solve all the problems. It was always meant to be an interim solution until pandas natively supports reading XML (which it looks like is coming soon).
Let me know how it goes!
EDIT:
Regarding how to make it so that the XML to pandas DataFrame conversion can switch depending on whether the XML has only one "row" tag or multiple, I have the following two options.
In the many-row case, the result is a DataFrame with an integer index (row numbers), whereas in the single-row case, the indices will be strings that were meant to be columns. So one strategy is to detect that and re-read accordingly. (You could probably avoid downloading twice with a smarter approach.)
# Import package
import pandas as pd
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 3
dfs = []
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
    if 0 not in temp.index:
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
    else:
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    dfs.append(temp)
df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)
df
Another option is to use the underlying tools. There is no magic behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to dicts, convert those to pandas, and then flatten. The only downside is that because the name of the tag "invstOrSec" is retained, it becomes a prefix for the column names. You should be able to remove those easily.
# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten
# Example 4
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xmldicts = []
for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)
df
Hope that helps!
EDIT:
So, I've updated the package (now version 0.2.0). pandas_read_xml now treats the root tag as rows in the resulting pandas dataframe by default, so there is no need to distinguish XMLs that sometimes have a single "row" and sometimes have multiple rows.
Should this be an issue in other cases, there is a new argument root_is_rows that is True by default but can be set to False.
I think that instead of:
<xsl:template match="*[@ml:dessert]">
<xsl:copy>
<xsl:value-of select="@ml:name"/>
</xsl:copy>
</xsl:template>
you want:
<xsl:template match="ml:dessert">
<xsl:copy>
<xsl:value-of select="ml:name"/>
</xsl:copy>
</xsl:template>
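To sanity-check the corrected template, here is a minimal sketch that wraps it in an identity-transform stylesheet and runs it with lxml (the stylesheet skeleton and sample document are assumptions patterned on the snippets in this thread):

```python
import lxml.etree as lx

xml = b"""<ml:Meals xmlns:ml="http://www.food.com">
  <ml:Meal>
    <ml:type>lunch</ml:type>
    <ml:main_course>turkey sandwich</ml:main_course>
    <ml:dessert><ml:name>Cookie</ml:name></ml:dessert>
  </ml:Meal>
</ml:Meals>"""

xsl = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ml="http://www.food.com">
  <!-- identity transform: copy everything as-is by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- match the element ml:dessert (not an attribute @ml:dessert)
       and replace its content with the text of its ml:name child -->
  <xsl:template match="ml:dessert">
    <xsl:copy><xsl:value-of select="ml:name"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""

result = lx.XSLT(lx.fromstring(xsl))(lx.fromstring(xml))
print(result)
```

After the transform, the <ml:dessert> element holds the text "Cookie" directly, with the nested <ml:name> element gone.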
You can reach this result with pandas read_xml():
import pandas as pd
from io import StringIO
xmlstr = """<ml:Meals xmlns:ml="http://www.food.com">
<ml:Meal>
<ml:type>lunch</ml:type>
<ml:main_course>turkey sandwich</ml:main_course>
<ml:dessert>
<ml:name>Cookie</ml:name>
</ml:dessert>
</ml:Meal>
</ml:Meals>"""
f = StringIO(xmlstr)
df = pd.read_xml(f, xpath = './/ml:Meal/*', namespaces={'ml':'http://www.food.com'})
df1 = df.ffill().dropna()
print(df1.to_string(index=False))
Output:
type main_course name
lunch turkey sandwich Cookie
Option with xml.etree.ElementTree:
import xml.etree.ElementTree as ET
import pandas as pd
from io import StringIO
xmlstr = """<ml:Meals xmlns:ml="http://www.food.com">
<ml:Meal>
<ml:type>lunch</ml:type>
<ml:main_course>turkey sandwich</ml:main_course>
<ml:dessert>
<ml:name>Cookie</ml:name>
</ml:dessert>
</ml:Meal>
</ml:Meals>"""
f = StringIO(xmlstr)
root = ET.parse(f).getroot()
ET.register_namespace('ml','http://www.food.com')
name = root.find('.//{*}name')
dessert = root.find('.//{*}dessert')
meal = root.find('.//{*}Meal')
meal.remove(dessert)
meal.append(name)
#ET.dump(root)
df = pd.read_xml(ET.tostring(root))
print(df.to_string(index=False))