hierarchy should be the rootTag and att should be the rowTag, as:
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
and you should get
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and schema
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
You can find more information in the Databricks spark-xml documentation.
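If you then need the nested attributes as top-level rows, you can explode the children.att array. A minimal PySpark sketch, assuming the schema above (the parent_/child_ aliases are just illustrative):
from pyspark.sql import functions as F

# One output row per nested attribute, keeping the parent columns alongside.
flat = df.select(
    F.col("Order").alias("parent_order"),
    F.col("attval").alias("parent_attval"),
    F.explode("children.att").alias("child"),
).select(
    "parent_order",
    "parent_attval",
    F.col("child.Order").alias("child_order"),
    F.col("child.attval").alias("child_attval"),
)
flat.show(truncate=False)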
Databricks has released a new version of the spark-xml package for reading XML into a Spark DataFrame:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>
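If you are launching PySpark jobs rather than building with Maven, the same package can be attached at submit time with --packages, mirroring the spark-submit invocation shown later in this thread (example.py is a placeholder script name):
spark-submit --packages com.databricks:spark-xml_2.12:0.6.0 example.py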
The input XML file used in this example is available in the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("persons.xml")
Schema
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Outputs:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.
Hope this helps!
Based on this guide:
https://learn.microsoft.com/en-us/azure/databricks/query/formats/xml
I've simplified the XML mentioned there. All I did was omit the additional XML elements for author and title.
xmlString = '''
<books>
<book id="bk103">
Corets, Eva
</book>
<book id="bk104">
Moretti, Sabrina
</book>
</books>'''
xmlPath = "dbfs:/tmp/books.xml"
dbutils.fs.put(xmlPath, xmlString)
However, I couldn't figure out a way to access the names inside the two XML tags:
df = spark.read\
.format("com.databricks.spark.xml")\
.option("rowTag", "book")\
.load(xmlPath)
# Show the DataFrame
df.printSchema()
df.show(truncate=False)
I'm getting the ids but not the names:
+-----+
|_id  |
+-----+
|bk103|
|bk104|
+-----+
PS: on my local Linux box, I run my script as follows:
spark-submit --packages com.databricks:spark-xml_2.12:0.17.0 ./example.py
But I don't want to use Databricks.
Okay, then you need to implement your own Spark data format reader for XML, since that's not a built-in option.
Otherwise, write your parser elsewhere, then reformat your data into something Spark can work with out of the box. For example, read the complete file as a string, then use Python's lxml or xml.etree modules to build a DataFrame with some schema.
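A minimal sketch of that approach, using the standard library's xml.etree against the books.xml content shown in the question (the local path books.xml is an assumption; adjust it to wherever the file lives):
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse the file locally, then hand plain tuples to Spark with an explicit schema.
root = ET.parse("books.xml").getroot()
rows = [(book.get("id"), (book.text or "").strip()) for book in root.findall("book")]

df = spark.createDataFrame(rows, schema=["_id", "name"])
df.show(truncate=False)
This sidesteps spark-xml entirely, which is fine as long as each file fits comfortably in driver memory.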
You can try reading the content of the XML file as a string into a Spark DataFrame, and then use the Spark SQL xpath family of functions to process it.
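A sketch of that approach on the same books.xml (xpath and arrays_zip are built-in Spark SQL functions; arrays_zip requires Spark 2.4+):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# wholetext=True reads the entire file into a single row's `value` column.
raw = spark.read.text("dbfs:/tmp/books.xml", wholetext=True)

# xpath() returns an array of matches, one entry per <book> element.
books = raw.selectExpr(
    "xpath(value, '/books/book/@id') as ids",
    "xpath(value, '/books/book/text()') as names",
)

# Pair the two arrays up and emit one row per book.
pairs = books.select(F.explode(F.arrays_zip("ids", "names")).alias("b"))
pairs.select(F.col("b.ids").alias("_id"), F.col("b.names").alias("name")).show(truncate=False)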
Yes, it is possible, but the details will differ depending on the approach you take.
- If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, as you would in local mode (see the sketch after this list).
- For larger files you can use Hadoop input formats.
  - If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed.
  - Otherwise, Mahout provides XmlInputFormat.
- Finally, it is possible to read the file using SparkContext.textFile and adjust later for records that span partition boundaries. Conceptually this means something similar to creating a sliding window, or partitioning records into groups of fixed size:
  - use mapPartitionsWithIndex to identify records broken between partitions and collect the broken records
  - use a second mapPartitionsWithIndex to repair the broken records
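As a sketch of the wholeTextFiles route mentioned above (the glob path is hypothetical, and this assumes well-formed <user> records with standard </...> closing tags):
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Parse one file's content into (account, name, number) tuples.
def parse_users(xml_string):
    root = ET.fromstring(xml_string)
    for user in root.iter("user"):
        yield (user.findtext("account"), user.findtext("name"), user.findtext("number"))

# wholeTextFiles yields (path, content) pairs; parse each file's content locally.
files = sc.wholeTextFiles("data/xml/*.xml")
users = files.flatMap(lambda pair: parse_users(pair[1]))
df = users.toDF(["account", "name", "number"])
df.show()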
Edit:
There is also a relatively new spark-xml package which allows you to extract specific records by tag:
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "foo")
.load("bar.xml")
Here's how to do it using Hadoop input formats to read XML data in Spark, as explained by @zero323.
Input data (note the non-standard <\tag> closers; the extractField function below rewrites them to </tag>):
<root>
  <users>
    <user>
      <account>1234<\account>
      <name>name_1<\name>
      <number>34233<\number>
    <\user>
    <user>
      <account>58789<\account>
      <name>name_2<\name>
      <number>54697<\number>
    <\user>
  <\users>
<\root>
Code for reading XML Input:
You can get the required jars from this link.
Imports:
//---------------spark imports
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
//----------------xml loader imports
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import com.cloudera.datascience.common.XmlInputFormat
Code:
object Tester_loader {
  case class User(account: String, name: String, number: String)

  def main(args: Array[String]): Unit = {
    val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
    val sparkMasterUrl = "spark://SYSTEMX:7077"

    // Jars to ship to the executors
    val jars = new Array[String](2)
    jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
    jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"

    val conf = new SparkConf().setAppName("XML Reading")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setSparkHome(sparkHome)
      .set("spark.executor.memory", "512m")
      .set("spark.default.deployCores", "12")
      .set("spark.cores.max", "12")
      .setJars(jars)

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // ---- load users from XML, calling function 1.1 below
    val pages = readFile("src/input_data", "<user>", "<\\user>", sc)

    val xmlUserDF = pages.map { tuple =>
      val account = extractField(tuple, "account")
      val name = extractField(tuple, "name")
      val number = extractField(tuple, "number")
      User(account, name, number)
    }.toDF()

    println(xmlUserDF.count())
    xmlUserDF.show()
  }
Functions:
  // Function 1.1: read records delimited by start_tag/end_tag via XmlInputFormat
  def readFile(path: String, start_tag: String, end_tag: String,
               sc: SparkContext) = {
    val conf = new Configuration()
    conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
    conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
    val rawXmls = sc.newAPIHadoopFile(
      path, classOf[XmlInputFormat], classOf[LongWritable],
      classOf[Text], conf)
    rawXmls.map(p => p._2.toString)
  }

  // Pull the text between <tag> and </tag> out of one record string
  def extractField(tuple: String, tag: String) = {
    var value = tuple.replaceAll("\n", " ").replace("<\\", "</")
    if (value.contains("<" + tag + ">") &&
        value.contains("</" + tag + ">")) {
      value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
    }
    value
  }
}
Output:
+-------+------+------+
|account| name|number|
+-------+------+------+
| 1234|name_1| 34233|
| 58789|name_2| 54697|
+-------+------+------+
The result is a DataFrame; you can convert it to an RDD as per your requirement, like this:
val xmlUserRDD = xmlUserDF.toJavaRDD.rdd.map { x =>
  (x.get(0).toString(), x.get(1).toString(), x.get(2).toString())
}
Please try it and see if it helps in your case.