A ClassNotFoundException means the class is not on Spark's classpath at runtime, so you need a fat jar: add the spark-xml package to your build.sbt and build the jar with `sbt assembly`. Give that a try. If it does not work, add the jar to `$SPARK_HOME/jars` and try again.
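The fat-jar approach above can be sketched as a build.sbt fragment. This is a hypothetical illustration, not the asker's actual build file: the project name, Scala version, and artifact versions are assumptions you should match to your own Spark installation, and it assumes the sbt-assembly plugin is enabled in project/plugins.sbt.

```scala
// Hypothetical build.sbt sketch for bundling spark-xml into a fat jar.
// Assumes project/plugins.sbt contains something like:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
name := "my-spark-xml-job"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided": the Spark runtime already supplies spark-sql on the cluster,
  // so it should not be packed into the assembly jar
  "org.apache.spark" %% "spark-sql" % "3.2.3" % "provided",
  // spark-xml is NOT provided by Spark, so it must be bundled
  "com.databricks" %% "spark-xml" % "0.12.0"
)
```

Running `sbt assembly` then produces a single jar under `target/scala-2.12/` that you can pass to `spark-submit`, so the driver and executors both see the spark-xml classes.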

Answer from shengshan zhang on Stack Overflow
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.10 › 0.2.0
Maven Repository: com.databricks » spark-xml_2.10 » 0.2.0
Home » com.databricks » spark-xml_2.10 » 0.2.0 · spark-xml · Note: There is a new version for this artifact · Maven · Gradle · SBT · Mill · Ivy · Grape · Leiningen · Buildr
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml
Maven Repository: com.databricks » spark-xml
🌐
GitHub
github.com › databricks › spark-xml › releases
Releases · databricks/spark-xml
XML data source for Spark SQL and DataFrames. Contribute to databricks/spark-xml development by creating an account on GitHub.
Author   databricks
🌐
GitHub
github.com › databricks › spark-xml
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
The structure and test tools are mostly copied from CSV Data Source for Spark. This package supports processing format-free XML files in a distributed way, unlike the JSON data source in Spark, which is restricted to in-line JSON format.
Starred by 512 users
Forked by 225 users
Languages   Scala 97.8% | Java 1.5% | Shell 0.7%
🌐
LinkedIn
linkedin.com › pulse › pyspark-xml-handling-using-maven-spark-xml212-jar-harish-dhanraj
PySpark XML Handling using spark-xml_2.12 Jar
April 11, 2023 -

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace, split

    jar = r"C:\Program Files\spark-3.2.3-bin-hadoop3.2\jars\spark-xml_2.12-0.12.0.jar"
    spark = SparkSession.builder \
        .config("spark.jars", jar) \
        .config("spark.driver.extraClassPath", jar) \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    df = spark.read.option("rowTag", "book").option("rootTag", "catalog") \
        .format("xml").load("C:/tmp/sample.xml")
    df.show(truncate=False)
    df.persist()

    # To remove \n and whitespace, use regexp_replace()
    df1 = df.withColumn('description', regexp_replace('description', '\n', ' '))
    df2 = df1.withColumn('description', regexp_replace('description', "\\s+", " "))
    df3 = df2.withColumn('FirstName', split('author', ",").getItem(0)) \
        .withColumn('LastName', split('author', ",").getItem(1))
    df3.show(truncate=False)
🌐
Databricks Community
community.databricks.com › t5 › data-engineering › how-to-load-xml-files-with-spark-xml › td-p › 57093
Solved: How to load xml files with spark-xml ? - Databricks Community - 57093
February 1, 2024 - If anybody faces this problem, I'll be grateful for sharing experience about reading xml files in databricks. ... Hi @leaw , The option I suggested should have downloaded the jar directly from maven but it seems like due to some issue it is unable to download. ... Anyway, glad to know that you were able to find an alternate solution. ... Installed spark-xml_2.13-0.17.0.jar on runtime 14.2 scala 2.12 and also receiving the error when attempting to read XML.
🌐
Jar-download
jar-download.com
Download spark-xml JAR files with all dependencies
Download JAR files for spark-xml ✓ With dependencies ✓ Documentation ✓ Source code
🌐
Maven Central
central.sonatype.com › artifact › com.databricks › spark-xml_2.13 › 0.14.0
com.databricks:spark-xml_2.13:0.14.0 - Maven Central
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.13</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.14.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
🌐
Spark-packages
spark-packages.org › package › HyukjinKwon › spark-xml
spark-xml
Version: 0.1.1-s_2.11 ( 43adcd | zip | jar ) / Date: 2015-11-19 / License: Apache-2.0 / Scala version: 2.11 · Spark Scala/Java API compatibility: - 26% , - 100% , - 79% , - 92% Version: 0.1-s_2.11 ( 8ab44a | zip ) / Date: 2015-11-19 / License: Apache-2.0 · Version: spark-xml:0.1-s_2.11 ( ...
🌐
Maven Central Repository
search.maven.org › artifact › com.databricks › spark-xml_2.11 › 0.11.0 › jar
Maven
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.11</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.11.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
Top answer
1 of 3
12

hierarchy should be the rootTag and att should be the rowTag:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

Find more information in the databricks spark-xml documentation.
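To see why the options are mapped this way, here is a minimal stdlib sketch, with no Spark required: rootTag names the single top-level element, and each rowTag element underneath it becomes one row. The sample XML is a hypothetical reconstruction of the question's input, reduced to the two top-level fields shown in the output table above.

```python
# Stdlib sketch of spark-xml's rootTag/rowTag semantics (hypothetical input XML).
import xml.etree.ElementTree as ET

xml_doc = """<hierarchy>
  <att><Order>1</Order><attval>Data</attval></att>
  <att><Order>2</Order><attval>Info</attval></att>
</hierarchy>"""

root = ET.fromstring(xml_doc)           # corresponds to rootTag="hierarchy"
rows = [
    {"Order": int(att.findtext("Order")), "attval": att.findtext("attval")}
    for att in root.findall("att")      # corresponds to rowTag="att"
]
print(rows)
# [{'Order': 1, 'attval': 'Data'}, {'Order': 2, 'attval': 'Info'}]
```

Each `<att>` element produces one dictionary here, just as spark-xml turns each rowTag element into one DataFrame row.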

2 of 3
3

Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>

Input XML file I used on this example is available at GitHub repository.

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .xml("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+

Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations

Hope this helps !!
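The `salary` struct in the schema above shows spark-xml's naming convention: an XML attribute becomes a struct field prefixed with `_`, and the element's text content becomes the `_VALUE` field (these are the library's default `attributePrefix` and `valueTag` settings). A minimal stdlib sketch of that mapping, using a hypothetical one-person fragment of the example file:

```python
# Stdlib sketch of spark-xml's attribute convention: attributes -> "_"-prefixed
# fields, element text -> "_VALUE". Input fragment is hypothetical.
import xml.etree.ElementTree as ET

person = ET.fromstring('<person><salary currency="Euro">10000</salary></person>')
sal = person.find("salary")
row = {"_VALUE": int(sal.text), "_currency": sal.get("currency")}
print(row)
# {'_VALUE': 10000, '_currency': 'Euro'}
```

This matches the `[10000, Euro]` struct value shown in the output table above.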

🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.11 › 0.3.1
Maven Repository: com.databricks » spark-xml_2.11 » 0.3.1
January 18, 2016 - Home » com.databricks » spark-xml_2.11 » 0.3.1 · spark-xml · Note: There is a new version for this artifact · Maven · Gradle · Gradle (Short) Gradle (Kotlin) SBT · Ivy · Grape · Leiningen · Buildr · Include comment with link to declaration · Central · Atlassian ·
🌐
Databricks Community
community.databricks.com › t5 › data-engineering › spark-xml-not-working-with-databricks-connect-and-pyspark › td-p › 13802
spark-xml not working with Databricks Connect and ... - Databricks Community - 13802
October 10, 2021 - Are you adding spark-xml as a dependency 'locally'? you're doing it right, and the name of the data source doesn't matter. Both are correct. You do not need to install JARs manually.
🌐
Spark-packages
spark-packages.org › package › elsevierlabs-os › spark-xml-utils
spark-xml-utils
Version: 1.4.0 ( 777824 | zip | jar ) / Date: 2017-02-06 / License: Apache-2.0
🌐
Maven Central
central.sonatype.com › artifact › com.databricks › spark-xml_2.12
spark-xml_2.12 - com.databricks - Maven Central - Sonatype
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.12</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.18.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
🌐
Databricks
forums.databricks.com › questions › 12381 › cant-get-spark-xml-package-to-work-in-pyspark.html
can't get spark-xml package to work in pyspark - Databricks Community Forum
September 13, 2017 - $SPARK_HOME/jars --packages com.databricks:spark-xml_2.12:0.5.0 · Comment · Add comment · Share · 0 · Answer by DheerajAwale · Jul 04 at 07:27 AM · I gave up with this issue and installed anaconda with Jupyter notebook. It works without having to spend weeks on setting up the machine ·