🌐
GitHub
github.com › databricks › spark-xml
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
XML data source for Spark SQL and DataFrames. Contribute to databricks/spark-xml development by creating an account on GitHub.
Starred by 512 users
Forked by 225 users
Languages   Scala 97.8% | Java 1.5% | Shell 0.7%
🌐
GitHub
github.com › elsevierlabs-os › spark-xml-utils
GitHub - elsevierlabs-os/spark-xml-utils
This site offers some background information on how to utilize the capabilities provided by the spark-xml-utils library within an Apache Spark application. Some scala examples (leveraging XPath, XSLT, and XQuery) within the Apache Spark framework are provided.
Starred by 61 users
Forked by 11 users
Languages   Java 100.0%
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml
Maven Repository: com.databricks » spark-xml
April 10, 2024
🌐
GitHub
github.com › databricks › spark-xml › releases
Releases · databricks/spark-xml
XML data source for Spark SQL and DataFrames. Contribute to databricks/spark-xml development by creating an account on GitHub.
Author   databricks
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.10 › 0.2.0
Maven Repository: com.databricks » spark-xml_2.10 » 0.2.0
🌐
GitHub
github.com › apache › spark › blob › master › examples › pom.xml
spark/examples/pom.xml at master · apache/spark
<relativePath>../pom.xml</relativePath> </parent> · <artifactId>spark-examples_2.13</artifactId> <packaging>jar</packaging> <name>Spark Project Examples</name> <url>https://spark.apache.org/</url> ·
Author   apache
🌐
Maven Central Repository
search.maven.org › artifact › com.databricks › spark-xml_2.11 › 0.11.0 › jar
Maven
<?xml version='1.0' encoding='UTF-8'?> ..._2.11</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.11.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> ...
🌐
GitHub
github.com › apache › spark › blob › master › pom.xml
spark/pom.xml at master · apache/spark
Apache Spark - A unified analytics engine for large-scale data processing - spark/pom.xml at master · apache/spark
Author   apache
🌐
Maven Central
central.sonatype.com › artifact › com.databricks › spark-xml_2.13 › 0.14.0
com.databricks:spark-xml_2.13:0.14.0 - Maven Central
<?xml version='1.0' encoding='UTF-8'?> ..._2.13</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.14.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> ...
🌐
GitHub
github.com › apache › spark › blob › master › core › pom.xml
spark/core/pom.xml at master · apache/spark
<relativePath>../pom.xml</relativePath> </parent> · <artifactId>spark-core_2.13</artifactId> <packaging>jar</packaging> <name>Spark Project Core</name> <url>https://spark.apache.org/</url> · <properties> <sbt.project.name>core</sbt.project.name> </properties> ·
Author   apache
🌐
GitHub
github.com › apache › spark › blob › master › assembly › pom.xml
spark/assembly/pom.xml at master · apache/spark
<relativePath>../pom.xml</relativePath> </parent> · <artifactId>spark-assembly_2.13</artifactId> <name>Spark Project Assembly</name> <url>https://spark.apache.org/</url> <packaging>pom</packaging> · <properties> <sbt.project.name>assembly</sbt.project.name> <build.testJarPhase>none</build.testJarPhase> <build.copyDependenciesPhase>package</build.copyDependenciesPhase> </properties> · <dependencies> <!-- Prevent our dummy JAR from being included in Spark distributions or uploaded to YARN --> <dependency> <groupId>org.spark-project.spark</groupId>
Author   apache
🌐
GitHub
github.com › databricks › spark-xml › issues › 299
Import spark-xml in Jupyter Notebook · Issue #299 · databricks/spark-xml
April 30, 2018 - My Jupyter Notebook can start a Spark session successfully and parse, for example, .json files with SparkSession.read.json(). So far, however, I have no clue how to incorporate spark-xml in the kernel.
Author   OXPHOS
🌐
GitHub
github.com › databricks › spark-xml › issues › 209
Using jar "spark-xml_2.11-0.4.1" with Python Error - org.apache.spark.sql.types.DecimalType$.Unlimited( · Issue #209 · databricks/spark-xml
November 22, 2016 - Closed issue. This repository was archived by the owner on Mar 24, 2025. It is now read-only.
Author   ivaradagit
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.11 › 0.3.1
Maven Repository: com.databricks » spark-xml_2.11 » 0.3.1
January 18, 2016
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.12 › 0.5.0
Maven Repository: com.databricks » spark-xml_2.12 » 0.5.0
December 30, 2018
🌐
GitHub
github.com › elsevierlabs-os › spark-xml-utils › tree › bd24eb29ed32679073ba4b37b4dcee63f3a861e2
GitHub - elsevierlabs-os/spark-xml-utils at bd24eb29ed32679073ba4b37b4dcee63f3a861e2
This site offers some background information on how to utilize the capabilities provided by the spark-xml-utils library within an Apache Spark application. Some scala examples (leveraging XPath, XSLT, and XQuery) within the Apache Spark framework are provided.
Starred by 61 users
Forked by 11 users
Languages   Java 100.0%
Top answer
1 of 3
12

hierarchy should be the rootTag and att should be the rowTag, as in

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

You can find more information in the databricks spark-xml documentation.
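
To make the rowTag idea concrete, here is a minimal standard-library sketch (not spark-xml itself) that treats each top-level att element as one row. The SAMPLE document is hypothetical: it is reconstructed from the output table above, since the original test.xml is not shown, and att_to_row is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical input, reconstructed from the output table in the answer above.
SAMPLE = """
<hierarchy>
  <att>
    <Order>1</Order>
    <attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att>
    <Order>2</Order>
    <attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>
""".strip()


def att_to_row(att):
    """Turn one <att> element into a dict shaped like a row of the table."""
    row = {
        "Order": int(att.findtext("Order")),
        "attval": att.findtext("attval"),
    }
    children = att.find("children")
    if children is not None:
        # Nested <att> elements become a list of (Order, attval) pairs.
        row["children"] = [
            (int(child.findtext("Order")), child.findtext("attval"))
            for child in children.findall("att")
        ]
    return row


root = ET.fromstring(SAMPLE)           # plays the role of rootTag
rows = [att_to_row(att) for att in root.findall("att")]  # direct children only

for row in rows:
    print(row)
```

Each dict corresponds to one DataFrame row, and the nested list corresponds to the children.att array in the schema. How spark-xml itself handles a rowTag that also appears nested inside a row may differ from this sketch.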

2 of 3
3

Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:

<dependency>
     <groupId>com.databricks</groupId>
     <artifactId>spark-xml_2.12</artifactId>
     <version>0.6.0</version>
 </dependency>
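
For PySpark or notebook use, the same dependency is usually given as a single groupId:artifactId:version coordinate rather than a POM fragment. The sketch below only builds the strings; maven_coordinate is a hypothetical helper, and passing the result through the PYSPARK_SUBMIT_ARGS environment variable is one common approach, assumed here rather than taken from the answer.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Join the three POM fields into the groupId:artifactId:version form."""
    return f"{group_id}:{artifact_id}:{version}"


# Values copied from the <dependency> block above.
coordinate = maven_coordinate("com.databricks", "spark-xml_2.12", "0.6.0")

# Assumption: the string is exported as PYSPARK_SUBMIT_ARGS before the
# SparkSession is created, so that Spark resolves the package at startup.
submit_args = f"--packages {coordinate} pyspark-shell"

print(coordinate)
print(submit_args)
```

The _2.12 suffix in the artifactId is the Scala version, so it should match the Scala build of the Spark installation in use.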

The input XML file I used in this example is available in the GitHub repository.

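// .load works with the explicit format; the .xml shorthand would
// additionally require `import com.databricks.spark.xml._`
val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("persons.xml")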

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
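
The _id, _currency and _VALUE names in the schema come from spark-xml's default naming: attributes get a "_" prefix, and the text of an element that also has attributes goes into a "_VALUE" field. Below is a rough standard-library sketch of that mapping, not the library's actual implementation. The PERSON record is hypothetical, reconstructed from the first output row, and values stay as strings because no type inference is attempted.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_PREFIX = "_"   # default of spark-xml's attributePrefix option
VALUE_TAG = "_VALUE"     # default of spark-xml's valueTag option


def element_to_value(elem):
    """Map an element to a plain string or to a dict of fields."""
    text = (elem.text or "").strip()
    if not elem.attrib and len(elem) == 0:
        # Leaf element without attributes: just its text (or None if empty).
        return text or None
    record = {ATTRIBUTE_PREFIX + name: value for name, value in elem.attrib.items()}
    for child in elem:
        record[child.tag] = element_to_value(child)
    if text:
        # Text next to attributes is stored under the value tag.
        record[VALUE_TAG] = text
    return record


# Hypothetical record, reconstructed from the first row of the output above.
PERSON = (
    '<person id="1"><firstname>James</firstname><lastname>Smith</lastname>'
    '<salary currency="Euro">10000</salary></person>'
)

row = element_to_value(ET.fromstring(PERSON))
print(row)
```

The salary field comes out as a small struct with _VALUE and _currency, which matches the shape of the salary column in the schema.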

Note that the Spark XML API has some limitations, which are discussed in "Spark-XML API Limitations".

Hope this helps !!

🌐
GitHub
github.com › rohankumardubey › spark-xml
GitHub - rohankumardubey/spark-xml
XML Data Source for Apache Spark · Linking · Using with Spark shell · Features · XSD Support · Parsing Nested XML · Pyspark notes · Structure Conversion · Conversion from XML to DataFrame · Conversion from DataFrame to XML · Examples · SQL API · Scala API · Java API · Python API · R API · Hadoop InputFormat · Building From Source · Acknowledgements
Author   rohankumardubey
🌐
Maven Central
central.sonatype.com › artifact › com.databricks › spark-xml_2.12
spark-xml_2.12 - com.databricks - Maven Central - Sonatype
<?xml version='1.0' encoding='UTF-8'?> <project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"> <modelVersion>4.0.0</modelVersion> <groupId>com.databricks</groupId> <artifactId>spark-xml_2.12</artifactId> <packaging>jar</packaging> <description>spark-xml</description> <version>0.18.0</version> <name>spark-xml</name> <organization> <name>com.databricks</name> </organization> <url>https://github.com/databricks/spark-xml</url> <licenses> <license> <na
🌐
Maven Repository
mvnrepository.com › artifact › com.databricks › spark-xml_2.12 › 0.18.0
Maven Repository: com.databricks » spark-xml_2.12 » 0.18.0
April 10, 2024 - HomePage: https://github.com/databricks/spark-xml · Date: Apr 10, 2024 · Files: pom (3 KB), jar (150 KB) · Repositories: Central · Ranking: #8579 in MvnRepository, #25 in XML Processing · Scala Target: Scala 2.12
Published   Apr 10, 2024
Version   0.18.0