A ClassNotFoundException means that you need a fat jar: include the package in your build.sbt and build the jar with sbt assembly. Give that a try; if it does not work, add the jar to $SPARK_HOME/jars and try again.
(Answer from shengshan zhang on Stack Overflow.)
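For the fat-jar route, here is a minimal sketch of what the build might look like (the project name, Spark version, and sbt-assembly version below are assumptions; match them to your own cluster):

// project/plugins.sbt -- pulls in the sbt-assembly plugin (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt -- minimal sketch; adjust the Scala/Spark versions to your setup
name := "spark-xml-example"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark itself is "provided" so it is not bundled into the fat jar
  "org.apache.spark" %% "spark-sql" % "2.4.8" % "provided",
  // spark-xml is bundled so the executors can find com.databricks.spark.xml
  "com.databricks" %% "spark-xml" % "0.4.1"
)

Running sbt assembly then produces a single jar under target/ that you can submit with spark-submit, or drop into $SPARK_HOME/jars as described above.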
Alternatively, you can add the jar file to your spark shell. Download the spark-xml_2.10-0.2.0.jar file, copy it into Spark's classpath, and add it to your spark shell using the :cp command:
:cp spark-xml_2.10-0.2.0.jar
/*
The jar file is now loaded into the spark shell,
so you can use it anywhere in your code inside the shell.
*/
val rd = spark.read.format("com.databricks.spark.xml").load("C:/Users/kumar/Desktop/d.xml")
No other external jars are required apart from the Databricks spark-xml package. The dependency below is for Spark 2.0+; if you are using an older Spark version, you need to use the corresponding older artifact.
You need to use
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
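If you would rather not manage jars by hand, the same coordinates can also be pulled in when launching the shell or submitting a job with --packages (a sketch; adjust the version to your setup, and your-app.jar is a placeholder for your own application jar):

spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 your-app.jar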
Match the Scala version to that of Spark. Starting with version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users need to download the Spark source package and build it with Scala 2.10 support. This may help:
Compatibility issue with Scala and Spark for compiled jars
spark-xml
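To check which Scala version your Spark build uses, you can ask the spark-shell REPL itself; a quick sketch (the printed version is just an example):

scala> util.Properties.versionString
res0: String = version 2.11.12

The same information is printed in the spark-shell welcome banner; the _2.11 / _2.10 suffix of the spark-xml artifact must match it.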
hierarchy should be the rootTag and att should be the rowTag:
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
and you should get
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and schema
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
Find more information on databricks spark-xml.
Databricks has released a new version for reading XML into a Spark DataFrame:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.12</artifactId>
<version>0.6.0</version>
</dependency>
The input XML file I used in this example is available in the GitHub repository.
import com.databricks.spark.xml._   // provides the .xml() method on DataFrameReader

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "person")
  .xml("persons.xml")
Schema
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Outputs:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
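As a small follow-up sketch (assuming the df from the read above), you can flatten the nested salary struct into plain columns and write a DataFrame back out as XML; the column aliases and output path here are illustrative:

import spark.implicits._

// _VALUE holds the element text and _currency the XML attribute (spark-xml defaults)
df.select($"firstname", $"salary._VALUE".as("salary"), $"salary._currency".as("currency"))
  .show()

// writing back out: rootTag/rowTag control the generated element names
df.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "persons")
  .option("rowTag", "person")
  .save("persons-out")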
Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations
Hope this helps!