🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml
Maven Repository: com.databricks » spark-xml
April 10, 2024 - Group: Databricks (com.databricks) · Related Categories: XML Processing, HTML Parsers
🌐
GitHub
github.com › databricks › spark-xml
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
XML data source for Spark SQL and DataFrames. Contribute to databricks/spark-xml development by creating an account on GitHub.
Starred by 512 users
Forked by 225 users
Languages   Scala 97.8% | Java 1.5% | Shell 0.7%
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.10 › 0.2.0
Maven Repository: com.databricks » spark-xml_2.10 » 0.2.0
🌐
GitHub
github.com › databricks › spark-xml › releases
Releases · databricks/spark-xml
XML data source for Spark SQL and DataFrames. Contribute to databricks/spark-xml development by creating an account on GitHub.
Author   databricks
🌐
Jar-download
jar-download.com
Download spark-xml JAR files with all dependencies
Download JAR files for spark-xml ✓ With dependencies ✓ Documentation ✓ Source code
🌐
LinkedIn
linkedin.com › pulse › pyspark-xml-handling-using-maven-spark-xml212-jar-harish-dhanraj
PySpark XML Handling using spark-xml_2.12 Jar
April 11, 2023 - The following snapshot gives step-by-step instructions for handling XML datasets in PySpark: Download the spark-xml jar from the Maven Repository, making sure the jar version matches your Scala version. Move the downloaded jar to spark-3.
🌐
Databricks Community
community.databricks.com › t5 › data-engineering › how-to-load-xml-files-with-spark-xml › td-p › 57093
Solved: How to load xml files with spark-xml ? - Databricks Community - 57093
February 1, 2024 - If anybody faces this problem, ... reading xml files in databricks. ... Hi @leaw , The option I suggested should have downloaded the jar directly from maven but it seems like due to some issue it is unable to download. ... Anyway, glad to know that you were able to find an alternate solution. ... Installed spark-xml_2.13-...
🌐
Spark-packages
spark-packages.org › package › HyukjinKwon › spark-xml
spark-xml
Version: 0.1.1-s_2.11 ( 43adcd | zip | jar ) / Date: 2015-11-19 / License: Apache-2.0 / Scala version: 2.11 · Spark Scala/Java API compatibility: - 26% , - 100% , - 79% , - 92% Version: 0.1-s_2.11 ( 8ab44a | zip ) / Date: 2015-11-19 / License: Apache-2.0 · Version: spark-xml:0.1-s_2.11 ( ...
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.11 › 0.3.1
Maven Repository: com.databricks » spark-xml_2.11 » 0.3.1
January 18, 2016
🌐
Maven Central
central.sonatype.com › artifact › com.databricks › spark-xml_2.13 › 0.14.0
com.databricks:spark-xml_2.13:0.14.0 - Maven Central
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.13</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.14.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
🌐
Maven Central Repository
search.maven.org › artifact › com.databricks › spark-xml_2.11 › 0.11.0 › jar
Maven
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.11</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.11.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
🌐
Jar-download
jar-download.com › artifact-search › spark-xml_2.12
Download spark-xml_2.12 JAR file with all dependencies
January 5, 2023 - Download spark-xml_2.12 JAR file ✓ With dependencies ✓ Documentation ✓ Source code
🌐
Maven Central
central.sonatype.com › artifact › com.databricks › spark-xml_2.12
spark-xml_2.12 - com.databricks - Maven Central - Sonatype
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.12</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.18.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
Top answer
1 of 3
12

hierarchy should be the rootTag and att should be the rowTag, as in:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and the schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

You can find more information in the Databricks spark-xml documentation.

2 of 3
3

Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>
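For PySpark users, the same dependency is usually given as a single Maven coordinate rather than a POM entry. The sketch below only builds that string; the helper name spark_xml_coordinate is hypothetical, and the Scala suffix and version are taken from the dependency above, so adjust both to match your own Spark build.

```python
def spark_xml_coordinate(scala_version: str, version: str) -> str:
    """Build the Maven coordinate (group:artifact:version) for spark-xml.

    Hypothetical helper for illustration only. The artifact ID carries the
    Scala binary version as a suffix, e.g. spark-xml_2.12.
    """
    return f"com.databricks:spark-xml_{scala_version}:{version}"


coordinate = spark_xml_coordinate("2.12", "0.6.0")
print(coordinate)

# The coordinate can be passed to spark-submit or pyspark with --packages,
# or set through the spark.jars.packages configuration property.
print(f"pyspark --packages {coordinate}")
```

The Scala suffix has to match the Scala version your Spark distribution was built with; a mismatch typically shows up as class-loading errors at runtime.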

The input XML file I used in this example is available in the GitHub repository.

// rowTag names the XML element that becomes one DataFrame row
val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
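The _id, _VALUE, and _currency names in the schema come from spark-xml's default naming: attributes get a leading underscore, and the text of an element that also has attributes goes into _VALUE. The following is a rough standard-library sketch of that mapping, not spark-xml's implementation. The input XML is a hypothetical reconstruction of the first output row, and the functions element_to_value and rows are made up for illustration. It does no type inference (all values stay strings) and does not collect repeated child tags into arrays.

```python
import xml.etree.ElementTree as ET

# Hypothetical input, reconstructed from the first row of the output above.
XML = """<persons>
  <person id="1">
    <firstname>James</firstname>
    <lastname>Smith</lastname>
    <salary currency="Euro">10000</salary>
  </person>
</persons>"""


def element_to_value(elem):
    """Return a string for a plain leaf, otherwise a dict of fields."""
    # Attributes become fields prefixed with an underscore.
    record = {f"_{name}": value for name, value in elem.attrib.items()}
    # Child elements become fields named after their tag.
    for child in elem:
        record[child.tag] = element_to_value(child)
    text = (elem.text or "").strip()
    if not record:
        return text  # no attributes, no children: a plain leaf value
    if text:
        record["_VALUE"] = text  # text next to attributes goes into _VALUE
    return record


def rows(xml_text, row_tag):
    """Treat every element named row_tag as one record."""
    root = ET.fromstring(xml_text)
    return [element_to_value(elem) for elem in root.iter(row_tag)]


people = rows(XML, "person")
print(people)
```

This is why salary appears as a struct with _VALUE and _currency rather than as a plain number.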

Note that the Spark XML API has some limitations, which are discussed in Spark-XML API Limitations.

Hope this helps!

🌐
Databricks Community
community.databricks.com › t5 › data-engineering › spark-xml-not-working-with-databricks-connect-and-pyspark › td-p › 13802
spark-xml not working with Databricks Connect and ... - Databricks Community - 13802
October 10, 2021 - Are you adding spark-xml as a dependency 'locally'? You're doing it right, and the name of the data source doesn't matter. Both are correct. You do not need to install JARs manually.
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.11 › 0.5.0
Maven Repository: com.databricks » spark-xml_2.11 » 0.5.0
December 30, 2018 - HomePage: https://github.com/databricks/spark-xml · Date: Dec 30, 2018 · Files: pom (2 KB), jar (221 KB) · Repositories: Central · Ranking: #11825 in MvnRepository, #52 in XML Processing · Used By: 42 artifacts · Scala Target: Scala 2.11