Cloudera Community
community.cloudera.com › t5 › Support-Questions › XML-processing-using-Spark › m-p › 242452
XML processing using Spark - Cloudera Community - 242452
September 26, 2020 - Help me understand how to process an XML file using Spark without using the Databricks spark-xml package. Is there any standard way which we can use in real-time, live projects? ... AFAIK, yes: by using the Databricks spark-xml package we can parse the XML file and create a DataFrame on top of the XML data.
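One way to avoid the spark-xml package is to read files whole and parse each record with a standard XML parser, then build a DataFrame from the resulting rows. A minimal sketch using Python's standard library (the person row tag and fields are made up for illustration; in Spark the same function could run inside a flatMap over sc.wholeTextFiles before spark.createDataFrame):

```python
import xml.etree.ElementTree as ET

xml_doc = """
<people>
  <person><name>James</name><age>35</age></person>
  <person><name>Rose</name><age>29</age></person>
</people>
"""

def parse_rows(doc, row_tag="person"):
    """Turn each <person> element into a plain dict (one row)."""
    root = ET.fromstring(doc)
    return [{child.tag: child.text for child in row}
            for row in root.iter(row_tag)]

rows = parse_rows(xml_doc)
print(rows)  # [{'name': 'James', 'age': '35'}, {'name': 'Rose', 'age': '29'}]
```

In a Spark job, passing `rows` to spark.createDataFrame would then give a DataFrame without any third-party XML dependency, at the cost of hand-rolling the row extraction.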
GitHub
github.com › databricks › spark-xml
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames.
Starred by 512 users
Forked by 225 users
Languages   Scala 97.8% | Java 1.5% | Shell 0.7%
Discussions

xml - Databricks Community Forum
A community forum to discuss working with Databricks Cloud and Spark
forums.databricks.com
Spark XML parsing - Stack Overflow
Isn't there a way to pass XML attributes in spark.xml? If I'm creating a custom schema, how do I create one to pass XML attributes? ... Hi FaigB, thanks for the answer; I too read through the Databricks documentation but I can't figure out a way to pass attributes of XML tags.
stackoverflow.com
May 23, 2017
How can I read a XML file Azure Databricks Spark - Stack Overflow
I was looking for some info on the MSDN forums but couldn't find a good forum. While reading on the Spark site I got the hint that here I would have better chances. So bottom line, I want to read a ...
stackoverflow.com
Read XML in spark - Stack Overflow
Note that the Spark XML API has some limitations, discussed here: Spark-XML API Limitations · Hope this helps !! ... You can use the Databricks jar to parse the XML to a dataframe.
stackoverflow.com
Apache Spark
spark.apache.org › docs › latest › sql-data-sources-xml.html
XML Files - Spark 4.1.1 Documentation
// The path can be either a single xml file or more xml files
String path = "examples/src/main/resources/people.xml";
Dataset<Row> peopleDF = spark.read().option("rowTag", "person").xml(path);
// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagerNamesDF = spark.sql("SELECT name FROM people …
Spark By {Examples}
sparkbyexamples.com › home › apache hadoop › spark read xml file using databricks api
Spark Read XML file using Databricks API - Spark By {Examples}
March 27, 2024 - Processing XML files in Apache Spark is enabled by adding the Databricks spark-xml dependency below to the Maven pom.xml file.
Databricks
forums.databricks.com › topics › xml.html
xml - Databricks Community Forum
Stack Overflow
stackoverflow.com › questions › 42416211 › spark-xml-parsing
Spark XML parsing - Stack Overflow
May 23, 2017 - Try using the _ symbol before an XML attribute name in your schema. If that does not work, try the @ symbol. See the example, but note it was written for an old Spark version.
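As background for the _-prefix advice above: spark-xml distinguishes XML attributes from child elements by prefixing attribute names, with _ by default (configurable, as far as I know, via the attributePrefix option). A standard-library sketch of the same convention, using a made-up person record:

```python
import xml.etree.ElementTree as ET

record = ET.fromstring('<person id="1"><name>James</name></person>')

# Mimic spark-xml's layout: attributes get an underscore prefix,
# child elements keep their tag names.
row = {f"_{k}": v for k, v in record.attrib.items()}
row.update({child.tag: child.text for child in record})
print(row)  # {'_id': '1', 'name': 'James'}
```

In a custom schema the attribute field would therefore be declared as _id (or @id in very old versions, as the answer notes), while name stays unprefixed.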
CloudxLab
cloudxlab.com › assessment › displayslide › 613 › spark-sql-loading-xml
Spark SQL - Loading XML | Automated hands-on| CloudxLab
Now, we can also use the spark.read.format object with xml as an argument, specify options using the .option method, and then load the data from HDFS. We can also use the fully qualified format name com.databricks.spark.xml instead of simply xml.
Top answer (1 of 3) · 12 votes

hierarchy should be the rootTag and att should be the rowTag, as in

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

Find more information in the Databricks spark-xml documentation.

Answer 2 of 3 · 3 votes

Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>

The input XML file used in this example is available in the GitHub repository.

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .xml("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+

Note that the Spark XML API has some limitations, discussed here: Spark-XML API Limitations

Hope this helps !!
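The salary struct in the schema above illustrates another spark-xml convention: when an element carries both text and attributes, the text goes into a _VALUE field and each attribute into a _-prefixed field. A standard-library sketch of that mapping (element and attribute names taken from the example above):

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<salary currency="Euro">10000</salary>')

# spark-xml-style struct: element text -> _VALUE, attributes -> _name
salary = {"_VALUE": int(elem.text)}
salary.update({f"_{k}": v for k, v in elem.attrib.items()})
print(salary)  # {'_VALUE': 10000, '_currency': 'Euro'}
```

That is why the output table shows salary as a two-field struct like [10000, Euro] rather than a plain number.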

Medium
medium.com › analytics-vidhya › xml-parsing-with-pyspark-4a16fbd53ddb
XML Parsing with Pyspark. This is one of my stories in spark deep… | by somanath sankaran | Analytics Vidhya | Medium
March 26, 2020 - An XSD is the schema file for XML, generally provided by the source application; it is the source of truth for validating the XML in the consuming application. In Spark, too, we can provide the XSD via the rowValidationXSDPath option, after adding the XSD file with SparkContext.addFile, since a local XSD file will not be available on all executors.
Microsoft Learn
learn.microsoft.com › en-us › azure › databricks › archive › connectors › spark-xml-library
Read and write XML data using the spark-xml library - Azure Databricks | Microsoft Learn
December 16, 2024 - The products, services, or technologies mentioned in this content are not officially endorsed or tested by Databricks. As an alternative, native XML file format support is available in Public Preview. See Read and write XML files. This article describes how to read and write an XML file as an Apache Spark data source.
Databricks
community.databricks.com › s › question › 0D53f00001rSJlKCAW › databricks-spark-xml-parser-support-for-namespace-declared-at-the-ancestor-level
Solved: Databricks Spark XML parser : support for namespac... - Databricks Community - 22863
March 7, 2023 - The problem is that I need to validate my "row" against an XSD using rowValidationXSDPath, which does not support prefixes at row level with a namespace declaration at ancestor level. ... Hey @Ben Ben, so Spark-XML is not a package maintained by Databricks.
Sonra
sonra.io › home › xml › how to parse xml in spark and databricks (guide)
How to Parse XML in Spark and Databricks (Guide) - Sonra
June 17, 2025 - So, while the following workflow focuses specifically on using Databricks, my local testing suggests that converting XML to Delta with Spark 4.0 (outside of Databricks) follows similar steps. Parsing XML in modern data platforms should be easy.
GitHub
github.com › databricks › spark-xml › issues › 331
Extract xml data from Dataframe and process the xml in to a separate Dataframe · Issue #331 · databricks/spark-xml
September 18, 2018 - In the xmldata column there are XML tags inside; I need to parse it into structured data in a separate dataframe. Previously I had the XML alone in a text file, and loaded it into a Spark dataframe using "com.databricks.spark.xml"
Author   rakiuday
Szczeles
szczeles.github.io › Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark
Reading JSON, CSV and XML files efficiently in Apache Spark
November 6, 2017 - With Apache Spark you can easily read semi-structured files like JSON and CSV using the standard library, and XML files with the spark-xml package. Sadly, the process of loading files may be long, as Spark needs to infer the schema of the underlying records by reading them. That's why I'm going to explain possible ...
Databricks Documentation
docs.databricks.com › data engineering › lakeflow connect › data formats › xml file
Read and write XML files | Databricks on AWS
July 5, 2023 - You can optionally validate each row-level XML record by an XML Schema Definition (XSD). The XSD file is specified in the rowValidationXSDPath option. The XSD does not otherwise affect the schema provided or inferred. A record that fails the validation is marked as “corrupted” and handled based on the corrupt record handling mode option described in the option section. You can use XSDToSchema to extract a Spark DataFrame schema from a XSD file.
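Since an XSD is itself an XML document, the core idea behind XSDToSchema can be sketched with a standard XML parser: walk the xs:element declarations and map each XSD type to a Spark SQL type. The tiny XSD and the type table below are simplified illustrations, not the real implementation:

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:string"/>
  <xs:element name="age" type="xs:long"/>
</xs:schema>
"""

# Simplified xs:* -> Spark SQL type mapping, for illustration only.
TYPE_MAP = {"xs:string": "string", "xs:long": "long"}

root = ET.fromstring(xsd)
fields = {el.get("name"): TYPE_MAP[el.get("type")]
          for el in root.findall(f"{XS}element")}
print(fields)  # {'name': 'string', 'age': 'long'}
```

The real XSDToSchema handles nested complex types, optionality, and the full XSD type system; this sketch only shows why an XSD can serve double duty as both a validator and a schema source.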
Cloudera Community
community.cloudera.com › t5 › Support-Questions › com-databricks-spark-xml-parsing-xml-takes-a-very-long-time › td-p › 130447
com.databricks.spark.xml parsing xml takes a very ... - Cloudera Community - 130447
April 13, 2017 - Hello All, I need to import and parse XML files in Hadoop. I have an old Pig 'REGEX_EXTRACT' script parser that works fine but takes some time to run, around 10-15 mins. In the last 6 months, I have started to use Spark, with large success in improving run time.