I hope I can help you or at least point you in the right direction.

With nested structures you have to unwrap the layers step by step. When you create your DataFrame, inspect its schema (e.g. with df.printSchema()) to see the nesting.

Here it is important to distinguish between arrays and structs: structs can be unpacked with a "select" expression, arrays with the "explode" function.

df_categories = df.select("categories.*")

With this select you explicitly pick the "categories" column and expand all of its fields. Note, however, that this drops all other columns; if you want to keep them, you have to list them in the select as well.

The result is a DataFrame with one column per field of "categories". This unwraps the top layer, but it is still not flat enough: the fields underneath are themselves nested and have to be unpacked in turn.

Now we have an array at the highest level, which we have to explode. For this, the "explode" function must be imported first.

Now we have reached the lowest level. All that remains is to pivot the rows into columns.

I hope I was able to help you.

EDIT:

You said that you want the values from "_name" as columns, but I don't know how useful that is in this context. You could use the following code to pivot:

from pyspark.sql import functions as F

# Add a column with a unique identifier for each row
df_attributes = df_attributes.withColumn("row_id", F.monotonically_increasing_id())

# Use the "pivot" function to turn the entries in "_name" into columns
pivot_df = df_attributes.groupBy("row_id").pivot("_name").agg(F.first("_VALUE"))

# Optional: fill missing values with 0
pivot_df = pivot_df.fillna(0)

pivot_df.display()

In my opinion, it would make more sense to access the values directly from the "_name" column.

I hope I was able to help you.

Answer from DanielP on Stack Overflow

Spark By {Examples}
Spark Read XML file using Databricks API - Spark By {Examples}
March 27, 2024 - Databricks Spark-XML package allows us to read simple or nested XML files into DataFrame, once DataFrame is created, we can leverage its APIs to perform transformations and actions like any other DataFrame.
Top answer
1 of 3

"hierarchy" should be the rootTag and "att" should be the rowTag:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

You can find more information in the databricks spark-xml documentation.

2 of 3

Databricks has released a new version of spark-xml to read XML into a Spark DataFrame:

<dependency>
     <groupId>com.databricks</groupId>
     <artifactId>spark-xml_2.12</artifactId>
     <version>0.6.0</version>
 </dependency>

The input XML file I used in this example is available in the GitHub repository.

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .xml("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+

Note that the Spark XML API has some limitations, which are discussed in Spark-XML API Limitations.

Hope this helps !!

Stack Overflow
scala - Fetch dataframe for nested XML schema - Stack Overflow
April 7, 2019 - I just care about the book tag details at this point because I need to append some nested tags inside details, but the final output file must have the bookdata data as well while writing the DF to XML. How should I work this out?

val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .schema(bookschema)
  .load(filePath)
df.show()
/* df has the books data:
+-----+--------------------+----+---+
| cost|             details|name|num|
+-----+--------------------+----+---+
|200.0|[[1, X,,], [5, A,,]]|   A| 11|
|300.0|          [[2, Y,,]]|   B| 12|
+-----+--------------------+----+---+
*/

val df2 = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "bookdata")
  .schema(bookdataschema)
  .load(filePath)
df2.show()
/* df2 has the bookdata data:
+-----+-------+
|count|   lang|
+-----+-------+
|    4|English|
+-----+-------+
*/
GitHub
Flattening Nested XMLs to DataFrame · Issue #91 · databricks/spark-xml
February 17, 2016 - Thanks for the very helpful module. I have the following XML structure that gets converted to Row of POP with the sequence inside. Is there any way to map attribute with NAME and PVAL as value to Columns in dataframe? ...
Author: logisticDigressionSplitter
Scala
Scaladex - databricks / spark-xml
It supports only simple, complex ... a DataFrame, spark-xml can also parse XML in a string-valued column in an existing DataFrame with from_xml, in order to add it as a new column with parsed results as a struct....
Microsoft Learn
Read and write XML data using the spark-xml library - Azure Databricks | Microsoft Learn
December 16, 2024 - It supports only simple, complex ... Although primarily used to convert an XML file into a DataFrame, you can also use the from_xml method to parse XML in a string-valued column in an existing DataFrame and add it as a new ...
Stack Overflow
apache spark - How to read the nested elements from the xml in pyspark? - Stack Overflow
October 23, 2021 - %%pyspark df = spark.read \ ... True) \ .load('file.xml') display(df) Step 2: Convert the nested tab to a JSON using to_json function before we read the nested tab....
GitHub
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
This package allows reading XML files in local or distributed filesystem as Spark DataFrames.
Databricks Documentation
Read and write XML data using the spark-xml library | Databricks on AWS
... import com.databricks.spar... Although primarily used to convert an XML file into a DataFrame, you can also use the from_xml method to parse XML in a string-valued column in an existing DataFrame and add it as a new ...
Medium
Use Databrick’s spark-xml to parse nested xml and create csv files. | by Tenny Susanto | Medium
February 10, 2017 - Also note how to query an xml attribute vs an xml element (look at OperatorID in the query below).

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Transaction").load("/user/tsusanto/POSLog-201409300635-21.xml")
val flattened = df.withColumn("LineItem", explode($"RetailTransaction.LineItem"))
val selectedData = flattened.select($"RetailStoreID", $"WorkstationID", $"OperatorID._OperatorName" as "OperatorName", $"OperatorID._VALUE" as "OperatorID", $"CurrencyCode", $"RetailTransaction.ReceiptDateTime", $"RetailTransaction.TransactionCount", $"LineItem.SequenceNumber", $"LineItem.Tax.TaxableAmount")
selectedData.show(3, false)
selectedData.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save("POSLog-201409300635-21_lines")
Stack Overflow
How to parse nested XML inside textfile using Spark RDD? - Stack Overflow
If you have XML alone in RDD[String] format, you can convert it to DataFrame with Databricks utility class: ... Yes like I mentioned in the question that can be done easily. But I have XML data embedded inside text data in textFile. I have extracted XML data into another RDD but I am not able to extract key and value attribute values from all the 'ab' tags(nested tags).
Top answer
1 of 1

You can use xpath queries without using UDFs:

df = spark.createDataFrame([['<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>']], ['visitors'])

df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|visitors                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


df2 = df.selectExpr(
    "xpath(visitors, './visitors/visitor/@id') id",
    "xpath(visitors, './visitors/visitor/@age') age",
    "xpath(visitors, './visitors/visitor/@sex') sex"
).selectExpr(
    "explode(arrays_zip(id, age, sex)) visitors"
).select('visitors.*')

df2.show(truncate=False)
+----+---+---+
|id  |age|sex|
+----+---+---+
|9615|68 |F  |
|1882|34 |M  |
|5987|23 |M  |
+----+---+---+

If you insist on using UDFs:

import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))
    
df2 = df.select(
    F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')

df2.show()
+----+---+---+
|  id|age|sex|
+----+---+---+
|9615| 68|  F|
|1882| 34|  M|
|5987| 23|  M|
+----+---+---+
Blogger
Hadoop and Spark by Leela Prasad: XML Parsing
January 6, 2019 - Here, coldata is the column which contains XML in GZIP format, xmldf is the dataframe, and xmlcolumn is the new column into which we would like to extract the XML from the above data as a DF.

val xmlmodified = data.map(x => x.toString)
val reader = new XmlReader()
val xml_parsed = reader.withRowTag("Object").xmlrdd(spark.SqlContext, xmlmodified).select($"object")
Stack Overflow
How to read xml data from a spark dataframe column - Stack Overflow
dfx = spark.read.load('books.xml', format='xml', rowTag='bks:books', valueTag="_ele_value")
dfx.schema

Trying to get a similar dataframe output when reading it from the value column (this is coming from Kafka). My XML has a deeply nested structure; this is just an example of a books XML with 2 levels of nesting.
Sonra
How to Parse XML in Spark and Databricks (Guide) - Sonra
June 17, 2025 - The spark-xml library handles the initial flattening by reading the XML and converting it into a DataFrame based on the row tag. But if your XML is more complex and has multiple levels of nesting (which, let’s be honest, it probably does), ...
Dataink
Ingesting and Flattening XML Files in Databricks
August 17, 2024 - Spark provides native support for reading XML files. By leveraging the rowTag option, we can accurately parse the XML data into a DataFrame. The rowTag specifies the element in the XML that represents the root of the records we want to extract.
Stack Overflow
scala - Parsing nested XML in Databricks - Stack Overflow
April 18, 2021 - I am trying to read the XML into a data frame and trying to flatten the data using explode as below.

val df = spark.read.format("xml").option("rowTag","on").option("inferschema","true").load("filepath")
val parsxml = df.withColumn("exploded_element", explode(("prgSvc.element")))
Stack Overflow
scala - Reading XML with nested tags into a Spark RDD, and transforming to JSON - Stack Overflow
August 13, 2017 - My understanding is it's not reading XML nested tags, and also it is just working with DataFrames, and I would also prefer RDD. Is there a way of achieving the functionality? I tried just reading it as a text file and using the tag item as delimiter, but it did not work; it still treats line breaks as delimiters:

val spark = SparkSession
  .builder()
  .config("textinputformat.record.delimiter", "</item>")
  .master("local[*]")
  .getOrCreate()
val documents = spark.sparkContext.textFile("/home/myuser/test-data/Records.xml")