hierarchy should be the rootTag and att should be the rowTag:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)
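
The question's test.xml is not reproduced here, but a minimal input consistent with the options, output, and schema above might look like the hypothetical file below; parsing it with plain Python shows the nesting that spark-xml flattens into the children.att array of structs:

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of test.xml, consistent with the schema above:
# each top-level <att> becomes a row; the nested <children><att> elements
# become the children.att array of structs.
xml_text = """
<hierarchy>
  <att>
    <Order>1</Order>
    <attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att>
    <Order>2</Order>
    <attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>
"""

root = ET.fromstring(xml_text)
rows = []
for att in root.findall("att"):  # rowTag = "att" (direct children of the root only)
    children = [(c.findtext("Order"), c.findtext("attval"))
                for c in att.find("children").findall("att")]
    rows.append((att.findtext("Order"), att.findtext("attval"), children))

print(rows)
```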

You can find more information in the Databricks spark-xml documentation.

Answer from Anahcolus on Stack Overflow

Answer 2 of 3 (score 3):

Databricks has released a new version of the library for reading XML into a Spark DataFrame:

<dependency>
     <groupId>com.databricks</groupId>
     <artifactId>spark-xml_2.12</artifactId>
     <version>0.6.0</version>
 </dependency>

The input XML file I used in this example is available in the GitHub repository.

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .xml("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
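
The salary column above illustrates spark-xml's convention for elements that carry both text and an attribute: the element text goes into a _VALUE field and attribute names get a leading underscore. A plain-Python sketch of that mapping (the sample element is an assumption inferred from the output above):

```python
import xml.etree.ElementTree as ET

# An element such as <salary currency="Euro">10000</salary> has both text and
# an attribute; spark-xml maps the text to _VALUE and prefixes attribute names
# with "_", which is why the schema shows salary._VALUE and salary._currency.
elem = ET.fromstring('<salary currency="Euro">10000</salary>')

salary_struct = {"_VALUE": int(elem.text)}
for attr, value in elem.attrib.items():
    salary_struct["_" + attr] = value

print(salary_struct)  # {'_VALUE': 10000, '_currency': 'Euro'}
```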

Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.

Hope this helps!

Reddit
reddit.com › r/apachespark › spark-xml: how to access the value inside a tag?
r/apachespark on Reddit: spark-xml: How to access the value inside a tag?
March 6, 2024 -

Based on this

https://learn.microsoft.com/en-us/azure/databricks/query/formats/xml

I've simplified the XML mentioned in the guide above. All I did was omit the additional XML elements for author and title.

xmlString = '''
  <books>
    <book id="bk103">
      Corets, Eva
    </book>
    <book id="bk104">
      Moretti, Sabrina
    </book>
  </books>'''

xmlPath = "dbfs:/tmp/books.xml"
dbutils.fs.put(xmlPath, xmlString)

However, I couldn't figure out a way to access the names inside the two XML tags.

df = spark.read\
      .format("com.databricks.spark.xml")\
      .option("rowTag", "book")\
      .load(xmlPath)

# Show the DataFrame
df.printSchema()
df.show(truncate=False)

I'm getting the ids but not the names

+-----+
|_id  |
+-----+
|bk103|
|bk104|
+-----+
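
(Editorial note, to be verified against the spark-xml README: the library normally surfaces an element's own text in a column named by the valueTag option, default _VALUE, when the element also carries attributes, so the expected frame here would have _id and _VALUE columns. A plain-Python sketch of the intended extraction from the sample above:)

```python
import xml.etree.ElementTree as ET

# spark-xml stores an element's own text under the valueTag column (default
# "_VALUE") when the element also has attributes, so the names should appear
# alongside _id. Extracting both fields from the sample with plain Python:
xml_string = """
<books>
  <book id="bk103">
    Corets, Eva
  </book>
  <book id="bk104">
    Moretti, Sabrina
  </book>
</books>
"""

root = ET.fromstring(xml_string)
rows = [{"_id": book.get("id"), "_VALUE": book.text.strip()}
        for book in root.findall("book")]
print(rows)
```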

ps:

Using my local linux box, I run my script as follows:

spark-submit --packages com.databricks:spark-xml_2.12:0.17.0 ./example.py

Top answer (1 of 4, score 23):

Yes, it is possible, but the details will differ depending on the approach you take.

  • If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, as in local mode.
  • For larger files you can use Hadoop input formats.
    • If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input is not XML, but it should give you an idea of how to proceed.
    • Otherwise, Mahout provides XmlInputFormat.
  • Finally, it is possible to read the file using SparkContext.textFile and adjust later for records spanning partition boundaries. Conceptually this means something similar to creating a sliding window or partitioning records into groups of fixed size:

    • use the first mapPartitionsWithIndex to identify records broken between partitions and collect the broken records
    • use a second mapPartitionsWithIndex to repair the broken records
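
The two-pass repair in the last bullet can be sketched with ordinary Python lists standing in for Spark partitions. This is a toy example: it assumes each partition after the first begins with exactly one broken head fragment.

```python
# Pass 1 stands in for the first mapPartitionsWithIndex: pull the broken head
# fragment off each partition after the first. Pass 2 stands in for the second:
# glue each collected fragment onto the tail of the preceding partition.
partitions = [
    ["<rec>a</rec>", "<rec>b</r"],  # record b is split across the boundary
    ["ec>", "<rec>c</rec>"],
]

# Pass 1: collect the leading fragment of every partition after the first.
heads = {i: part[0] for i, part in enumerate(partitions) if i > 0}

# Pass 2: drop each collected head locally and append it to the previous
# partition's last record.
repaired = []
for i, part in enumerate(partitions):
    body = part[1:] if i in heads else part[:]
    if i + 1 in heads:
        body[-1] = body[-1] + heads[i + 1]
    repaired.append(body)

records = [r for part in repaired for r in part]
print(records)  # ['<rec>a</rec>', '<rec>b</rec>', '<rec>c</rec>']
```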

Edit:

There is also a relatively new spark-xml package which allows you to extract specific records by tag:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")

Answer 2 of 4 (score 8):

Here's how to do it using Hadoop input formats to read XML data in Spark, as explained by @zero323.

Input data (note the non-standard <\...> closing tags, which the extractField function below normalizes to </...>):

<root>
    <users>
        <user>
            <account>1234<\account>
            <name>name_1<\name>
            <number>34233<\number>
        <\user>
        <user>
            <account>58789<\account>
            <name>name_2<\name>
            <number>54697<\number>
        <\user>
    <\users>
<\root>

Code for reading XML Input:

You can get the required jars at this link.

Imports:

//---------------spark_import
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

//----------------xml_loader_import
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ LongWritable, Text }
import com.cloudera.datascience.common.XmlInputFormat

Code:

object Tester_loader {
  case class User(account: String, name: String, number: String)
  def main(args: Array[String]): Unit = {

    val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
    val sparkMasterUrl = "spark://SYSTEMX:7077"

    val jars = new Array[String](2)

    jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
    jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"

    val conf = new SparkConf().setAppName("XML Reading")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setSparkHome(sparkHome)
      .set("spark.executor.memory", "512m")
      .set("spark.default.deployCores", "12")
      .set("spark.cores.max", "12")
      .setJars(jars)

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // ---- loading user from XML

    // calling function 1.1
    val pages = readFile("src/input_data", "<user>", "<\\user>", sc) 

    val xmlUserDF = pages.map { tuple =>
      {
        val account = extractField(tuple, "account")
        val name = extractField(tuple, "name")
        val number = extractField(tuple, "number")

        User(account, name, number)
      }
    }.toDF()
    println(xmlUserDF.count())
    xmlUserDF.show()
  }

Functions:

  def readFile(path: String, start_tag: String, end_tag: String,
      sc: SparkContext) = {

    val conf = new Configuration()
    conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
    conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
    val rawXmls = sc.newAPIHadoopFile(
        path, classOf[XmlInputFormat], classOf[LongWritable],
        classOf[Text], conf)

    rawXmls.map(p => p._2.toString)
  }

  def extractField(tuple: String, tag: String) = {
    var value = tuple.replaceAll("\n", " ").replace("<\\", "</")

    if (value.contains("<" + tag + ">") &&
        value.contains("</" + tag + ">")) {
      value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
    }
    value
  }

}
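
The extractField helper above can be sanity-checked outside Spark. Here is the same tag-splitting idea in plain Python, applied to a hypothetical record string with the same <\...>-style closing tags:

```python
# The same extractField idea in plain Python: normalize the non-standard
# "<\...>" closing tags to "</...>", then take the text between <tag> and </tag>.
def extract_field(record: str, tag: str) -> str:
    value = record.replace("\n", " ").replace("<\\", "</")
    open_t, close_t = f"<{tag}>", f"</{tag}>"
    if open_t in value and close_t in value:
        value = value.split(open_t)[1].split(close_t)[0]
    return value

# Hypothetical one-line record in the input format shown above.
record = "<user><account>1234<\\account><name>name_1<\\name><number>34233<\\number><\\user>"
print(extract_field(record, "account"))  # 1234
print(extract_field(record, "name"))     # name_1
```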

Output:

+-------+------+------+
|account|  name|number|
+-------+------+------+
|   1234|name_1| 34233|
|  58789|name_2| 54697|
+-------+------+------+

The result is a DataFrame; you can convert it to an RDD as needed, like this:

val xmlUserRDD = xmlUserDF.rdd.map { x =>
    (x.get(0).toString(), x.get(1).toString(), x.get(2).toString()) }

Please try it and see if it helps.