🌐
Pandas
pandas.pydata.org › docs › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 3.0.1 documentation - PyData |
A URL, file-like object, or a string path containing an XSLT script. This stylesheet should flatten complex, deeply nested XML documents for easier parsing. To use this feature you must have lxml module installed and specify ‘lxml’ as parser. The xpath must reference nodes of transformed ...
🌐
Pandas
pandas.pydata.org › pandas-docs › version › 1.4 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 1.4.4 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
Discussions

Parsing XML into a Pandas dataframe
To parse an XML file into a Pandas DataFrame, you can use the from_dict method of the DataFrame class. First, you will need to use the ElementTree module to parse the XML file and extract the relevant data. Here is an example of how this can be done: import xml.etree.ElementTree as ET import pandas as pd Parse the XML file using ElementTree tree = ET.parse('my_file.xml') root = tree.getroot() Extract the column names from the 'columns' element columns = [col.attrib['friendlyName'] for col in root.find('columns')] Create an empty list to store the data for each row data = [] Iterate over the 'row' elements and extract the data for each one for row in root.find('rows'): row_data = {} for col in row: # Add the data for each column to the dictionary row_data[col.attrib['name']] = col.text # Add the dictionary for this row to the list data.append(row_data) Create a DataFrame using the column names and data df = pd.DataFrame.from_dict(data, columns=columns) This code will parse the XML file and extract the data for each row and column, storing it in a dictionary. The dictionary is then used to create a DataFrame using the from_dict method. This DataFrame will have the column names as the columns and each row of data as a row in the DataFrame. More on reddit.com
🌐 r/learnpython
8
3
December 9, 2022
python - Pandas read xml not working properly for single tag xml - Stack Overflow
Fortunately, read_xml supports XSLT (special-purpose language designed to transform XML documents) with default lxml parser. More on stackoverflow.com
🌐 stackoverflow.com
python - XSLT Pandas - How to Pull the Grandchild value to a Dataframe - Stack Overflow
I'm trying to flatten an XML structure into a Pandas (Python) Dataframe using a custom XSLT with the read_xml function. Given the following XML Structure: More on stackoverflow.com
🌐 stackoverflow.com
python - nested xml file to pandas dataframe - Stack Overflow
Because your XML is pretty complex ... designed to transform XML files especially complex to simpler ones. Python's third-party module, lxml, can run XSLT 1.0 even XPath 1.0 to parse through the transformed result for migration to a pandas dataframe.... More on stackoverflow.com
🌐 stackoverflow.com
March 22, 2018
🌐
Pandas
pandas.pydata.org › pandas-docs › stable › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 2.2.2 documentation - PyData |
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
Pandas
pandas.pydata.org › pandas-docs › version › 1.5 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 1.5.2 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
Pandas
pandas.pydata.org › docs › reference › api › pandas.DataFrame.to_xml.html
pandas.DataFrame.to_xml — pandas documentation - PyData |
Whether to include the XML declaration at start of document. ... Whether output should be pretty printed with indentation and line breaks. parser{{‘lxml’,’etree’}}, default ‘lxml’ · Parser module to use for building of tree. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’, the ability to use XSLT stylesheet is supported.
🌐
pandas
pandas.pydata.org › pandas-docs › dev › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 3.0.0.dev0+2339.g1af0a95e94 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
Like Geeks
likegeeks.com › home › python › pandas › parsing xml files into dataframes using pandas read_xml
Parsing XML Files into DataFrames using Pandas read_xml
October 16, 2023 - The read_xml function in Pandas is used to read XML (eXtensible Markup Language) files and convert them into DataFrames.
🌐
pandas
pandas.pydata.org › pandas-docs › dev › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 3.0.0rc1+103.gaf9e3f0ca6 documentation
A URL, file-like object, or a string path containing an XSLT script. This stylesheet should flatten complex, deeply nested XML documents for easier parsing. To use this feature you must have lxml module installed and specify ‘lxml’ as parser. The xpath must reference nodes of transformed ...
Find elsewhere
🌐
Pandas
pandas.pydata.org › pandas-docs › version › 2.0 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 2.0.3 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
Pandas
pandas.pydata.org › docs › dev › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 3.0.0.dev0+2687.g00a7c41157 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
Pandas
pandas.pydata.org › pandas-docs › version › 1.3 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 1.3.5 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
Reddit
reddit.com › r/learnpython › parsing xml into a pandas dataframe
r/learnpython on Reddit: Parsing XML into a Pandas dataframe
December 9, 2022 -

I am trying to parse an XML file into a Pandas DataFrame. It's a nicely formatted file that's not very deep, but whenever I work with XML it's like my brain goes blank and I never can remember all the goofy intricacies of dealing with it.

The file looks roughly like this

<?xml version="1.0" encoding="utf-8"?>

<diagnosticsLog type="db-profile" startDate="11/14/2022 23:31:12">

  <!--Build 18.0.1.69-->

  <columns>

    <column friendlyName="time" name="time" />

    <column friendlyName="Direction" name="Direction" />

    <column friendlyName="SQL" name="SQL" />

    <column friendlyName="ProcessID" name="ProcessID" />

    <column friendlyName="ThreadID" name="ThreadID" />


    <column friendlyName="TimeSpan" name="TimeSpan" />

    <column friendlyName="User" name="User" />

    <column friendlyName="HTTPSessionID" name="HTTPSessionID" />

    <column friendlyName="HTTPForward" name="HTTPForward" />

    <column friendlyName="SessionID" name="SessionID" />


    <column friendlyName="SessionGUID" name="SessionGUID" />

    <column friendlyName="Datasource" name="Datasource" />

    <column friendlyName="Sequence" name="Sequence" />

    <column friendlyName="LocalSequence" name="LocalSequence" />

    <column friendlyName="Message" name="Message" />

    <column friendlyName="AppPoolName" name="AppPoolName" />

  </columns>

  <rows>

    <row>

      <col name="time">11/14/2022 23:31:12</col>

      <col name="TimeSpan">0 ms</col>

      <col name="ThreadID">0x00000025</col>

      <col name="User">USERNAME</col>

      <col name="HTTPSessionID"></col>

      <col name="HTTPForward">20.186.0.0</col>

      <col name="SessionGUID">e4e51b-a64d-4b7b-9bfe-9612dd22b6cc</col>

      <col name="SessionID">6096783</col>

      <col name="Datasource">datasourceName</col>

      <col name="AppPoolName">C 1801AppServer Ext</col>

      <col name="Direction">Out</col>

      <col name="sql">UPDATE SET </col>

      <col name="Sequence">236419</col>

      <col name="LocalSequence">103825</col>

    </row>

    <row>

      <col name="time">11/14/2022 23:31:12</col>

      <col name="TimeSpan">N/A</col>

      <col name="ThreadID">0x00000025</col>

      <col name="User">USERNAME</col>

      <col name="HTTPSessionID"></col>

      <col name="HTTPForward">20.186.0.0</col>

      <col name="SessionGUID">e491b-a64d-4b7b-9bfe-9612dd22b6cc</col>

      <col name="SessionID">6096783</col>

      <col name="Datasource">datasourceName</col>

      <col name="AppPoolName">C 1801AppServer Ext</col>

      <col name="Direction">In</col>

      <col name="sql">UPDATE SET</col>

      <col name="Sequence">236420</col>

      <col name="LocalSequence">103826</col>

    </row>

  </rows>

</diagnosticsLog>

I want to convert that to the column names being the columns and each row being a row. I'm at a loss on how to do this.

🌐
Pandas
pandas.pydata.org › pandas-docs › version › 1.4.3 › reference › api › pandas.read_xml.html
pandas.read_xml — pandas 1.4.3 documentation
Encoding of XML document. parser{‘lxml’,’etree’}, default ‘lxml’ · Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.
🌐
GitHub
github.com › pandas-dev › pandas › blob › main › pandas › io › xml.py
pandas/pandas/io/xml.py at main · pandas-dev/pandas
Internal class to parse XML into :class:`~pandas.DataFrame` with third-party · full-featured XML library, ``lxml``, that supports · ``XPath`` 1.0 and XSLT 1.0. """ · def parse_data(self) -> list[dict[str, str | None]]: """ Parse xml data.
Author   pandas-dev
Top answer
1 of 2
8

Indeed, in forthcoming Pandas 1.3, read_xml will allow you to migrate parsed nodes into data frames. However, because XML can have many dimensions beyond the 2D of rows by columns, as noted:

This method is best designed to import shallow XML documents

Therefore, any nested elements are not immediately picked up as shown here with about 20 columns. Notice the required use of namespaces due to the default namespace in document.

Pandas 1.3+

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                                   name  lei                                              title      cusip  ...  fairValLevel  securityLending  assetCat debtSec
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           3.0              NaN      None     NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...           2.0              NaN  ABS-CBDO     NaN
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           3.0              NaN        EP     NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN              NaN      None     NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN              NaN      None     NaN
..                                                 ...  ...                                                ...        ...  ...           ...              ...       ...     ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN              NaN      None     NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN              NaN      None     NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN              NaN      None     NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN              NaN      None     NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN              NaN      None     NaN

# [163 rows x 20 columns]


url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                        name  lei                                     title      cusip  ...  invCountry  isRestrictedSec fairValLevel securityLending
# 0  Salient Private Access Master Fund, L.P.  NaN  Salient Private Access Master Fund, L.P.  999999999  ...          US                Y          NaN             NaN

# [1 rows x 18 columns]

Fortunately, read_xml supports XSLT (special-purpose language designed to transform XML documents) with default lxml parser. With XSLT, you can then flatten needed nodes for migration to retrieve the 32 columns.

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
#                                                   name  lei                                              title      cusip  ...  annualizedRt  isDefault  areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           NaN       None                  None        None
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...        0.0624          N                     N           N
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           NaN       None                  None        None
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN       None                  None        None
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN       None                  None        None
..                                                 ...  ...                                                ...        ...  ...           ...        ...                   ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN       None                  None        None
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN       None                  None        None
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN       None                  None        None
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN       None                  None        None
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN       None                  None        None

# [163 rows x 32 columns]

Pandas < 1.3

To achieve same result via XPath approach requires more steps where you will have to handle URL request and XML parsing to data frame build. Specifically, create a list of dictionaries from transformed, parsed XML and pass into DataFrame constructor. Below uses same XSLT and XPath with namespace as above.

import lxml.etree as lx
import pandas as pd
import urllib.request as rq

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

content = rq.urlopen(url)

# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)

# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)

# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
        } for inv in result.xpath("//edgar:invstOrSec", 
                                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]

# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)

print(df)
#                                                   name  lei                                              title  ... isDefault areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  N/A                                     Tastemade Inc.  ...       NaN                  NaN         NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  N/A  Regatta XV Funding Ltd., Subordinated Note, Pr...  ...         N                    N           N
# 2                Hired, Inc., Series C Preferred Stock  N/A              Hired, Inc., Series C Preferred Stock  ...       NaN                  NaN         NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  N/A                    WESTVIEW CAPITAL PARTNERS II LP  ...       NaN                  NaN         NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  N/A                     VOYAGER CAPITAL FUND III, L.P.  ...       NaN                  NaN         NaN
# ..                                                 ...  ...                                                ...  ...       ...                  ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  N/A              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  ...       NaN                  NaN         NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  N/A                       ALLOY MERCHANT PARTNERS L.P.  ...       NaN                  NaN         NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  ...       NaN                  NaN         NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  N/A                   ABRY ADVANCED SECURITIES FUND LP  ...       NaN                  NaN         NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  ...       NaN                  NaN         NaN

# [163 rows x 32 columns]

2 of 2
4

First of all, thanks for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be pleased to know that there is a dev version of pandas read_xml which should be coming soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)

As for you current conundrum, this is a result (and one of my many dislikes towards) of the structure of XML. Unlike JSON, where single elements can be returned within a list, the XML structure just has one XML tag, which is interpreted as a single value rather than a list.

Essentially, if there is only one "row" tag, then the "column" tags is now treated as column tags... I'm not making much sense am I? Let me explain with your examples.

Here is how I suggest you use it:

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 =  pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec']).pipe(fully_flatten)

# Example 2
url_2 = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df_2  = pdx.read_xml(url_2,['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2

What is the difference?

In Example 1, you already expect multiple within tag. So, passing the root_tag_list=['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'] returns a list under the hood. The fully_flatten process would first explode the list into rows.

In Example 2, if you use the same root_tag_list, pandas is not reading in a list. Rather, it is reading in a dictionary that corresponds to the single row. In effect, it treats the tags intended as columns to be rows. Instead, I would pass one tag above it as the root tag, then transpose it, then fully_flatten.

Yes... I know... it is bit of a workaround. But... then again, I didn't create pandas-read-xml hoping to solve all the problems. It was always meant to be a interim solution until pandas natively supports reading XML (which it looks like it is coming soon).

Let me know how it goes!

EDIT:

Regarding how to make it so that the XML to pandas DataFrame conversion can switch depending on whether the XML has only one "row" tag or multiple, I have the following two options.

In the many row case, the DataFrame will result in a DataFrame with integer index (row numbers), whereas in the single row case, the DataFrame indices will be "Strings" that were meant to be columns. So one strategy would be to detect that and re-do accordingly. (you could probably avoid double downloading with a smarter approach)

# Import package
import pandas as pd
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 3

dfs = []
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
    if 0 not in temp.index:
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
    else:
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    dfs.append(temp)

df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)

df

Another option is to use the underlying tools. There is no magic behind pandas_read_xml, it uses a package called xmltodict. Read the XML, convert to dicts, then convert to pandas, and then flatten. The only downside is that because the name of the tag "invstOrSec" is retained, they become prefixes for the column names. You should be able to remove those easily.

# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten

# Example 4

url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xmldicts = []

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
    
df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)

df

Hope that helps!

EDIT:

So, I've updated the package (now version 0.2.0). Now the pandas_read_xml should treat the root tag as rows in the resulting pandas dataframe as default, so no need to distinguish XMLs that sometimes have single "row" and sometimes having multiple rows.

Should this be an issue in other cases, then there is a new argument root_is_rows that is True by default, but can be made False.

🌐
PyPI
pypi.org › project › pandas-read-xml
pandas-read-xml
JavaScript is disabled in your browser. Please enable JavaScript to proceed · A required part of this site couldn’t load. This may be due to a browser extension, network issues, or browser settings. Please check your connection, disable any ad blockers, or try using a different browser
🌐
Pandas
pandas.pydata.org › pandas-docs › stable › reference › api › pandas.DataFrame.to_xml.html
pandas.DataFrame.to_xml — pandas 3.0.1 documentation
Whether to include the XML declaration at start of document. ... Whether output should be pretty printed with indentation and line breaks. parser{{‘lxml’,’etree’}}, default ‘lxml’ · Parser module to use for building of tree. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’, the ability to use XSLT stylesheet is supported.
Top answer
1 of 1
3

Because your XML is pretty complex with text values spilling across nodes, consider XSLT, the special-purpose language designed to transform XML files especially complex to simpler ones.

Python's third-party module, lxml, can run XSLT 1.0 even XPath 1.0 to parse through the transformed result for migration to a pandas dataframe. Additionally, you can use external XSLT processors that Python can call with subprocess.

Specifically, below XSLT extracts necessary attributes from both defendant and victim and entire paragraph text value by using XPath's descendant::* from the root, assuming <p> is a child to it.

XSLT (save as a .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>
  
  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>
      
      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>
      
      <trialText><xsl:value-of select="normalize-space(/p)"/></trialText>
    </data>
  </xsl:template>       
 
</xsl:stylesheet>

Python

import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text
        
    data.append(inner)
    
trial_df = pd.DataFrame(data)

print(trial_df)

For the 1,000 similar XML files, loop through this process and append each one-row trial_df dataframes in a list to be stacked with pd.concat.

XML Output

<?xml version="1.0"?>
<data>
  <defendantName>Alice Jones</defendantName>
  <defendantGender>female</defendantGender>
  <offenceCategory>theft</offenceCategory>
  <offenceSubCategory>shoplifting</offenceSubCategory>
  <victimName>Edward Hillior</victimName>
  <victimGender>male</victimGender>
  <verdictCategory>guilty</verdictCategory>
  <verdictSubCategory>theftunder1s</verdictSubCategory>
  <punishmentCategory>transport</punishmentCategory>
  <trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>

Dataframe Output

#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting   

#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...   

#   verdictCategory verdictSubCategory victimGender      victimName  
# 0          guilty       theftunder1s         male  Edward Hillior  
🌐
TutorialsPoint
tutorialspoint.com › python_pandas › python_pandas_read_xml_method.htm
Pandas DataFrame read_xml() Method
January 2, 2025 - Options include 'infer', 'gzip', 'bz2', 'zip', etc. storage_options: Additional options for remote storage connections. The read_xml() method returns a Pandas DataFrame containing the parsed data from the XML document.