hierarchy should be the rootTag and att should be the rowTag, as in:
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
and you should get
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and schema
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
You can find more information in the Databricks spark-xml documentation.
Answer from Anahcolus on Stack Overflow
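To make the rowTag mapping in the answer above more concrete, here is a small sketch using only the Python standard library, not spark-xml. The XML text is an assumption: it is reconstructed from the output table, since the original test.xml is not shown. The helper `top_level_rows` is a hypothetical name; it only mimics the idea that each `att` directly under `hierarchy` becomes one row and the nested `children/att` elements become an array.

```python
import xml.etree.ElementTree as ET

# Hypothetical input, reconstructed from the output table above.
SAMPLE_XML = """<hierarchy>
  <att>
    <Order>1</Order>
    <attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att>
    <Order>2</Order>
    <attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>
"""


def top_level_rows(xml_text, row_tag="att"):
    """Return one dict per direct <row_tag> child of the root element."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.findall(row_tag):  # direct children of the root only
        nested = [
            (int(child.findtext("Order")), child.findtext("attval"))
            for child in row.findall(f"children/{row_tag}")
        ]
        rows.append({
            "Order": int(row.findtext("Order")),
            "attval": row.findtext("attval"),
            "children": nested,
        })
    return rows


rows = top_level_rows(SAMPLE_XML)
for r in rows:
    print(r)
```

This is only an illustration of the row/array shape; the real parsing and type inference are done by spark-xml itself.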
Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.12</artifactId>
<version>0.6.0</version>
</dependency>
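For PySpark users who are not building with Maven, the same artifact can usually be referenced by its coordinate instead. The sketch below assumes a standalone PySpark install rather than a managed cluster where the session already exists; `maven_coordinate` and `build_session` are hypothetical helper names, and the Scala suffix (`_2.12`) has to match the Scala version your Spark build uses.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Format a Maven dependency as the group:artifact:version string Spark expects."""
    return f"{group_id}:{artifact_id}:{version}"


# Values copied from the <dependency> block above.
SPARK_XML = maven_coordinate("com.databricks", "spark-xml_2.12", "0.6.0")


def build_session(app_name="xml-demo", packages=SPARK_XML):
    """Create a SparkSession that asks Spark to fetch the given packages at startup."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )


print(SPARK_XML)
```

The same coordinate string can also be passed on the command line, for example `pyspark --packages com.databricks:spark-xml_2.12:0.6.0`.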
The input XML file I used in this example is available in the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.load("persons.xml") // load() works without importing com.databricks.spark.xml._
Schema
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Outputs:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
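In the schema above, `_id` and `_currency` come from XML attributes and `_VALUE` holds the element's text; as far as I know these names follow spark-xml's default attribute prefix and value tag. If the nested `salary` struct is awkward to work with, one option is to pull its fields up into ordinary columns. The following is a minimal PySpark-side sketch; `flatten_struct_exprs` and `flatten_salary` are hypothetical helpers, not part of spark-xml.

```python
def flatten_struct_exprs(struct_col, fields):
    """Build SQL expressions that lift struct fields into top-level columns."""
    exprs = []
    for field in fields:
        # "_VALUE" -> "salary_value", "_currency" -> "salary_currency"
        alias = f"{struct_col}_{field.strip('_').lower()}"
        # Backticks keep names with leading underscores unambiguous in Spark SQL.
        exprs.append(f"`{struct_col}`.`{field}` AS `{alias}`")
    return exprs


def flatten_salary(df):
    """Return df with salary._VALUE and salary._currency as plain columns."""
    exprs = flatten_struct_exprs("salary", ["_VALUE", "_currency"])
    return df.selectExpr("*", *exprs).drop("salary")


print(flatten_struct_exprs("salary", ["_VALUE", "_currency"]))
```

With the DataFrame from the example, `flatten_salary(df)` would be expected to replace the `salary` column with `salary_value` and `salary_currency`.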
Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.
Hope this helps !!
@jxc's answer in the comments to the question is the best solution:
df = spark.read.format("com.databricks.spark.xml")\
.option("rowTag", "head")\
.load(','.join(s3_paths))
Here is an example using a toy dataset:
fnames = ['books_part1.xml','books_part2.xml'] # part1 -> ids bk101-bk106, part2 -> ids bk107-bk112
df = spark.read.format('xml') \
.option('rowTag','book')\
.load(','.join(fnames))
df.show()
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# | _id| author| description| genre|price|publish_date| title|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
# |bk101|Gambardella, Matthew|An in-depth look ...| Computer|44.95| 2000-10-01|XML Developer's G...|
# |bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16| Midnight Rain|
# |bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17| Maeve Ascendant|
# |bk104| Corets, Eva|In post-apocalyps...| Fantasy| 5.95| 2001-03-10| Oberon's Legacy|
# |bk105| Corets, Eva|The two daughters...| Fantasy| 5.95| 2001-09-10| The Sundered Grail|
# |bk106| Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| Lover Birds|
# |bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| Splish Splash|
# |bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| Creepy Crawlies|
# |bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| Paradox Lost|
# |bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09|Microsoft .NET: T...|
# |bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01|MSXML3: A Compreh...|
# |bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16|Visual Studio 7: ...|
# +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
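The comma-joined path trick above can be wrapped in a small helper so that empty entries or stray commas are caught before Spark sees them. This is only a sketch: `join_paths` and `read_xml_files` are hypothetical names, and the reader call simply repeats the `format('xml')` / `rowTag` / `load` pattern from the example, so it assumes spark-xml is already attached to the session.

```python
def join_paths(paths):
    """Join file paths into the single comma-separated string passed to load()."""
    cleaned = [p.strip() for p in paths if p and p.strip()]
    if not cleaned:
        raise ValueError("at least one path is required")
    for p in cleaned:
        if "," in p:
            # A comma inside a path would be read as a separator.
            raise ValueError(f"path contains a comma: {p!r}")
    return ",".join(cleaned)


def read_xml_files(spark, paths, row_tag):
    """Read several XML files into one DataFrame, one row per <row_tag> element."""
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)
        .load(join_paths(paths))
    )


joined = join_paths(["books_part1.xml", "books_part2.xml"])
print(joined)
```

Usage would then look like `df = read_xml_files(spark, fnames, "book")`.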
You can check the following GitHub repo:
- https://github.com/databricks/spark-xml
One way is to use the Databricks spark-xml library:
- Import the spark-xml library into your workspace: https://docs.databricks.com/user-guide/libraries.html#create-a-library (search for spark-xml in the Maven/Spark package section and import it).
- Attach the library to your cluster: https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
- Use the following code in your notebook to read the XML file, where "note" is the root element of my XML file. It is passed as rowTag because spark-xml uses rowTag to pick rows when reading; rootTag only applies when writing.
xmldata = spark.read.format('xml').option("rowTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
I found this notebook really helpful: https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
The author also has a YouTube video that walks through the steps.
In summary, there are 2 approaches:
- Install the library on your Databricks cluster from the 'Libraries' tab.
- Install it by launching spark-shell from the notebook itself.