hierarchy should be the rootTag and att should be the rowTag:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

You can find more information in the databricks/spark-xml documentation.
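For intuition about what rootTag and rowTag select, the mapping can be sketched with the Python standard library alone. The test.xml below is a hypothetical reconstruction inferred from the output table and schema above (the question's actual file is not shown, so the element layout is an assumption):

```python
import xml.etree.ElementTree as ET

# Hypothetical test.xml reconstructed from the output above (structure assumed).
xml_doc = """
<hierarchy>
  <att>
    <Order>1</Order><attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att>
    <Order>2</Order><attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>
"""

root = ET.fromstring(xml_doc)   # rootTag="hierarchy" is the wrapper element
rows = root.findall("att")      # rowTag="att": each direct <att> becomes one row
for r in rows:
    children = [(c.findtext("Order"), c.findtext("attval"))
                for c in r.find("children").findall("att")]
    print(r.findtext("Order"), r.findtext("attval"), children)
```

Each top-level att maps to one DataFrame row, and the nested children.att elements become the array-of-struct column in the schema shown above.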

Answer from Anahcolus on Stack Overflow
Answer 2 of 3 (score 3)

Databricks has released a new version of the spark-xml library for reading XML into a Spark DataFrame:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>

The input XML file used in this example is available in the GitHub repository.

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Output:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
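The salary column shows how spark-xml handles an element that carries both text content and an attribute: with the default valueTag and attributePrefix options, the text becomes a _VALUE field and each attribute becomes an underscore-prefixed field. A minimal standard-library sketch of that mapping (the input element is reconstructed from the schema above):

```python
import xml.etree.ElementTree as ET

# <salary currency="Euro">10000</salary> carries both text and an attribute.
# spark-xml represents it as a struct: text -> _VALUE, attribute -> _currency.
elem = ET.fromstring('<salary currency="Euro">10000</salary>')
salary = {"_VALUE": int(elem.text), "_currency": elem.get("currency")}
print(salary)  # {'_VALUE': 10000, '_currency': 'Euro'}
```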

Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.

Hope this helps!

Top answer — 1 of 3 (score 4)

@jxc's answer in the comments to the question is the best solution:

df = spark.read.format("com.databricks.spark.xml")\
               .option("rowTag", "head")\
               .load(','.join(s3_paths))

Here is an example using a toy dataset:

fnames = ['books_part1.xml','books_part2.xml'] # part1 -> ids bk101-bk106, part2 -> ids bk107-bk112

df = spark.read.format('xml') \
              .option('rowTag','book')\
              .load(','.join(fnames))

df.show()

# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |  _id|              author|         description|          genre|price|publish_date|               title|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |bk101|Gambardella, Matthew|An in-depth look ...|       Computer|44.95|  2000-10-01|XML Developer's G...|
# |bk102|          Ralls, Kim|A former architec...|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
# |bk103|         Corets, Eva|After the collaps...|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
# |bk104|         Corets, Eva|In post-apocalyps...|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
# |bk105|         Corets, Eva|The two daughters...|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
# |bk106|    Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|         Lover Birds|
# |bk107|      Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|       Splish Splash|
# |bk108|       Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
# |bk109|        Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
# |bk110|        O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
# |bk111|        O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
# |bk112|         Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
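Passing a comma-joined string simply tells the reader to load every listed file; the resulting DataFrame is the concatenation of the rows extracted from each one. A rough standard-library sketch of that merge, using hypothetical in-memory fragments in place of the two book files (note how the id attribute surfaces as _id, matching the output above):

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments standing in for books_part1.xml / books_part2.xml.
parts = [
    '<catalog><book id="bk101"><title>XML Developer\'s Guide</title></book></catalog>',
    '<catalog><book id="bk107"><title>Splish Splash</title></book></catalog>',
]

# rowTag="book": collect every <book> element from every file into one row set.
rows = [
    {"_id": b.get("id"), "title": b.findtext("title")}
    for part in parts
    for b in ET.fromstring(part).findall("book")
]
print([r["_id"] for r in rows])  # ['bk101', 'bk107']
```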
Answer 2 of 3 (score 1)

You can check the following GitHub repo:

  • https://github.com/databricks/spark-xml