I hope I can help you or at least point you in the right direction.

With nested structures you have to unwrap the layers step by step. When you create your DataFrame, inspect its schema (e.g. with df.printSchema()) to see the nesting.

Here it is important to distinguish between arrays and structs: structs can be unpacked with a "select" expression, arrays with the "explode" function.

df_categories = df.select("categories.*")

With this select you explicitly pick the "categories" column and expand all of its fields. Note, however, that this drops all other columns; if you want to keep them, you have to list them in the select as well.

The result is a DataFrame with one column per field of "categories". This unwraps the top layer, but it is still not flat enough: the fields underneath are themselves nested and have to be unpacked in turn.

Now we have an array at the highest level, which we have to explode. For this, the "explode" function must be imported first.

Now we have reached the lowest level. All that remains is to pivot the rows into columns.

I hope I was able to help you.

EDIT:

You said that you want the values from "_name" as columns, but I don't know how useful that is in this context. You could use the following code to pivot:

from pyspark.sql import functions as F

# Add a column with a unique identifier for each row
df_attributes = df_attributes.withColumn("row_id", F.monotonically_increasing_id())

# Use the "pivot" function to turn the entries in "_name" into columns
pivot_df = df_attributes.groupBy("row_id").pivot("_name").agg(F.first("_VALUE"))

# Optional: fill missing values with 0
pivot_df = pivot_df.fillna(0)

pivot_df.display()

In my opinion, it would make more sense to access the values directly from the "_name" column.

I hope I was able to help you.

Answer from DanielP on Stack Overflow

Spark By {Examples}
Spark Read XML file using Databricks API - Spark By {Examples}
March 27, 2024 - Databricks Spark-XML package allows us to read simple or nested XML files into DataFrame, once DataFrame is created, we can leverage its APIs to perform transformations and actions like any other DataFrame.
Top answer
1 of 3

"hierarchy" should be the rootTag and "att" should be the rowTag:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

You can find more information in the databricks spark-xml documentation.

2 of 3

Databricks has released a new version of spark-xml to read XML into a Spark DataFrame:

<dependency>
     <groupId>com.databricks</groupId>
     <artifactId>spark-xml_2.12</artifactId>
     <version>0.6.0</version>
 </dependency>

The input XML file I used in this example is available in the GitHub repository.

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .xml("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+

Note that the Spark XML API has some limitations, which are discussed in Spark-XML API Limitations.

Hope this helps !!

Stack Overflow
scala - Fetch dataframe for nested XML schema - Stack Overflow
April 7, 2019 - I just care about the book tag details at this point because I need to append some nested tags inside details, but the final output file must have the bookdata data as well while writing the DF to XML. How should I work this out?

val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .schema(bookschema)
  .load(filePath)
df.show()
/* df has the books data:
+-----+--------------------+----+---+
| cost|             details|name|num|
+-----+--------------------+----+---+
|200.0|[[1, X,,], [5, A,,]]|   A| 11|
|300.0|          [[2, Y,,]]|   B| 12|
+-----+--------------------+----+---+
*/

val df2 = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "bookdata")
  .schema(bookdataschema)
  .load(filePath)
df2.show()
/* df2 has the bookdata data:
+-----+-------+
|count|   lang|
+-----+-------+
|    4|English|
+-----+-------+
*/
GitHub
Flattening Nested XMLs to DataFrame · Issue #91 · databricks/spark-xml
February 17, 2016 - Thanks for the very helpful module. I have the following XML structure that gets converted to Row of POP with the sequence inside. Is there any way to map attribute with NAME and PVAL as value to Columns in dataframe? ...
Author: logisticDigressionSplitter
Scala
Scaladex - databricks / spark-xml
It supports only simple, complex ... a DataFrame, spark-xml can also parse XML in a string-valued column in an existing DataFrame with from_xml, in order to add it as a new column with parsed results as a struct....
Microsoft Learn
Read and write XML data using the spark-xml library - Azure Databricks | Microsoft Learn
December 16, 2024 - It supports only simple, complex ... Although primarily used to convert an XML file into a DataFrame, you can also use the from_xml method to parse XML in a string-valued column in an existing DataFrame and add it as a new ...
Stack Overflow
apache spark - How to read the nested elements from the xml in pyspark? - Stack Overflow
October 23, 2021 - %%pyspark df = spark.read \ ... True) \ .load('file.xml') display(df) Step 2: Convert the nested tab to a JSON using to_json function before we read the nested tab....
GitHub
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
This package allows reading XML files in local or distributed filesystem as Spark DataFrames.
Databricks Documentation
Read and write XML data using the spark-xml library | Databricks on AWS
... import com.databricks.spar... Although primarily used to convert an XML file into a DataFrame, you can also use the from_xml method to parse XML in a string-valued column in an existing DataFrame and add it as a new ...
Medium
Use Databrick’s spark-xml to parse nested xml and create csv files. | by Tenny Susanto | Medium
February 10, 2017 - Also note how to query an xml attribute vs an xml element (look at OperatorID in the query below).

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Transaction").load("/user/tsusanto/POSLog-201409300635-21.xml")
val flattened = df.withColumn("LineItem", explode($"RetailTransaction.LineItem"))
val selectedData = flattened.select($"RetailStoreID", $"WorkstationID", $"OperatorID._OperatorName" as "OperatorName", $"OperatorID._VALUE" as "OperatorID", $"CurrencyCode", $"RetailTransaction.ReceiptDateTime", $"RetailTransaction.TransactionCount", $"LineItem.SequenceNumber", $"LineItem.Tax.TaxableAmount")
selectedData.show(3, false)
selectedData.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save("POSLog-201409300635-21_lines")
Stack Overflow
How to parse nested XML inside textfile using Spark RDD? - Stack Overflow
If you have XML alone in RDD[String] format, you can convert it to DataFrame with Databricks utility class: ... Yes like I mentioned in the question that can be done easily. But I have XML data embedded inside text data in textFile. I have extracted XML data into another RDD but I am not able to extract key and value attribute values from all the 'ab' tags(nested tags).
Top answer
1 of 1

You can use xpath queries without using UDFs:

df = spark.createDataFrame([['<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>']], ['visitors'])

df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|visitors                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


df2 = df.selectExpr(
    "xpath(visitors, './visitors/visitor/@id') id",
    "xpath(visitors, './visitors/visitor/@age') age",
    "xpath(visitors, './visitors/visitor/@sex') sex"
).selectExpr(
    "explode(arrays_zip(id, age, sex)) visitors"
).select('visitors.*')

df2.show(truncate=False)
+----+---+---+
|id  |age|sex|
+----+---+---+
|9615|68 |F  |
|1882|34 |M  |
|5987|23 |M  |
+----+---+---+

If you insist on using UDFs:

import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))
    
df2 = df.select(
    F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')

df2.show()
+----+---+---+
|  id|age|sex|
+----+---+---+
|9615| 68|  F|
|1882| 34|  M|
|5987| 23|  M|
+----+---+---+
Blogger
Hadoop and Spark by Leela Prasad: XML Parsing
January 6, 2019 - Here, coldata is the column which contains XML in GZIP format, xmldf is the dataframe, and xmlcolumn is the new column into which we would like to extract the XML from the above data as a DF.

val xmlmodified = data.map(x => x.toString)
val reader = new XmlReader()
val xml_parsed = reader.withRowTag("Object").xmlrdd(spark.SqlContext, xmlmodified).select($"object")
Stack Overflow
How to read xml data from a spark dataframe column - Stack Overflow
dfx = spark.read.load('books.xml', format='xml', rowTag='bks:books', valueTag="_ele_value")
dfx.schema

Trying to get a similar dataframe output when reading it from the value column (this is coming from Kafka). My XML has a deeply nested structure; this is just an example of a books XML with 2 levels of nesting.
Sonra
How to Parse XML in Spark and Databricks (Guide) - Sonra
June 17, 2025 - The spark-xml library handles the initial flattening by reading the XML and converting it into a DataFrame based on the row tag. But if your XML is more complex and has multiple levels of nesting (which, let’s be honest, it probably does), ...
Dataink
Ingesting and Flattening XML Files in Databricks
August 17, 2024 - Spark provides native support for reading XML files. By leveraging the rowTag option, we can accurately parse the XML data into a DataFrame. The rowTag specifies the element in the XML that represents the root of the records we want to extract.
Stack Overflow
scala - Parsing nested XML in Databricks - Stack Overflow
April 18, 2021 - I am trying to read the XML into a data frame and trying to flatten the data using explode as below.

val df = spark.read.format("xml").option("rowTag","on").option("inferschema","true").load("filepath")
val parsxml = df.withColumn("exploded_element", explode(("prgSvc.element")))
Stack Overflow
scala - Reading XML with nested tags into a Spark RDD, and transforming to JSON - Stack Overflow
August 13, 2017 - My understanding is it's not reading XML nested tags, and also it is just working with DataFrames, and I would also prefer RDD. Is there a way of achieving the functionality? I tried just reading it as a text file and using the tag item as delimiter, but it did not work; it still treats line breaks as delimiters:

val spark = SparkSession
  .builder()
  .config("textinputformat.record.delimiter", "</item>")
  .master("local[*]")
  .getOrCreate()
val documents = spark.sparkContext.textFile("/home/myuser/test-data/Records.xml")