I hope I can help you or at least point you in the right direction.
With nested structures you have to unwrap the layers step by step. When you create your DataFrame, you can inspect the nested layout with printSchema().

Here it is important to distinguish between arrays and structs. Structs can be unwrapped with a "select" expression, and arrays with the "explode" function.
df_categories = df.select("categories.*")
With this select you explicitly pick the "categories" column and expand all of its fields. Note, however, that this drops all other columns; if you want to keep them, you have to list them in the select as well.
The result would look like this:

This flattens things somewhat, but it is still not enough. If we also unwrap the underlying struct, the data becomes much easier to work with.

Now we have an array at the highest level, which we have to explode. For this, the "explode" function must be imported first.

Now we have reached the lowest level. All you have to do now is to pivot the rows into columns.
I hope I was able to help you.
EDIT:
You said that you want the values from "_name" as columns, but I don't know how useful that is in this context. You could use the following code to pivot:
from pyspark.sql import functions as F

# Add a unique identifier for each row so the pivot has a grouping key
df_attributes = df_attributes.withColumn("row_id", F.monotonically_increasing_id())

# Use the "pivot" function to turn the entries in "_name" into columns
pivot_df = df_attributes.groupBy("row_id").pivot("_name").agg(F.first("_VALUE"))

# Optional: fill missing values with 0
pivot_df = pivot_df.fillna(0)

pivot_df.display()
In my opinion, it would make more sense to access the values directly from the "_name" column.
I hope I was able to help you.
Answer from DanielP on Stack Overflow
"hierarchy" should be the rootTag and "att" should be the rowTag, as in:
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
and you should get
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and schema
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
You can find more information in the Databricks spark-xml documentation.
Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.12</artifactId>
<version>0.6.0</version>
</dependency>
The input XML file used in this example is available in the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("persons.xml")
Schema
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Outputs:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
Note that the Spark XML API has some limitations, which are discussed in the Spark-XML API Limitations article.
Hope this helps !!