I just published one using Python.
https://github.com/blackrock/xml_to_parquet
Convert one or more XML files into Apache Parquet format. Only requires an XSD and an XML file to get started.
The XSD schema file drives the conversion: everything in your XML file is written to an equivalent Parquet file, with nested data structures that match the XML paths.
Convert a small XML file to a Parquet file:
python xml_to_parquet.py -x PurchaseOrder.xsd PurchaseOrder.xml
INFO - 2021-01-21 12:32:38 - Parsing XML Files..
INFO - 2021-01-21 12:32:38 - Processing 1 files
DEBUG - 2021-01-21 12:32:38 - Generating schema from PurchaseOrder.xsd
DEBUG - 2021-01-21 12:32:38 - Parsing PurchaseOrder.xml
DEBUG - 2021-01-21 12:32:38 - Saving to file PurchaseOrder.xml.parquet
DEBUG - 2021-01-21 12:32:38 - Completed PurchaseOrder.xml
Answer from David Lee on Stack Overflow
GitHub
github.com › blackrock › xml_to_parquet
GitHub - blackrock/xml_to_parquet: Convert one or more XML files into Apache Parquet format. Only requires a XSD and XML file to get started.
This repository contains code for the XML to Parquet Converter. This converter is written in Python and will convert one or more XML files into Parquet files
Starred by 38 users
Forked by 26 users
Languages Python 100.0%
Py-forge-cli
py-forge-cli.github.io › PyForge-CLI › converters › xml-to-parquet
XML to Parquet - PyForge CLI Documentation
Convert XML files to efficient Parquet format with intelligent structure analysis and configurable flattening strategies for analytics use cases.
Databricks Community
community.databricks.com › t5 › data-engineering › xml-to-parquet-files › td-p › 82457
XML to Parquet files - Databricks Community - 82457
August 9, 2024 - The custom Python function goes over all the columns of the input DataFrame; if a column's type is complex, i.e. struct or array, it keeps flattening it (explode if array, dot (.) operator if struct) until all the columns are simple types. Something like:
df = spark.read.format('xml').load(path)
flattened_df = flatten_func(df)
flattened_df.write.format('parquet').save(destinationpath)
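The flatten_func in that snippet is not shown. Outside of Spark, the same idea (structs become dotted column names, arrays are exploded into extra rows) can be sketched with plain Python dictionaries; the function and record shape below are hypothetical stand-ins, and a real Spark version would use explode and the dot operator on DataFrame columns instead:

```python
def flatten(record, prefix=""):
    """Flatten a nested dict: dict values become dotted column names,
    list values are exploded into multiple output rows."""
    rows = [{}]
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # struct: recurse with a dotted prefix
            sub_rows = flatten(value, prefix=f"{name}.")
            rows = [dict(r, **s) for r in rows for s in sub_rows]
        elif isinstance(value, list):
            # array: one output row per element (an "explode")
            exploded = []
            for item in value:
                if isinstance(item, dict):
                    exploded.extend(flatten(item, prefix=f"{name}."))
                else:
                    exploded.append({name: item})
            rows = [dict(r, **e) for r in rows for e in exploded]
        else:
            # simple type: copy through unchanged
            rows = [dict(r, **{name: value}) for r in rows]
    return rows

order = {"id": 1, "items": [{"sku": "a"}, {"sku": "b"}], "ship": {"city": "NY"}}
print(flatten(order))
```

The one order with two items becomes two flat rows, each repeating the id and shipping city, which is exactly the shape a flat Parquet writer expects.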
Informatica Knowledge
knowledge.informatica.com › s › article › 577970
Convert XML to Parquet using Intelligent Structure ... - HOW TO
May 19, 2022
GitHub
github.com › blackrock › xml_to_parquet › blob › master › xml_to_parquet.py
xml_to_parquet/xml_to_parquet.py at master · blackrock/xml_to_parquet
April 11, 2023 - Convert one or more XML files into Apache Parquet format. Only requires a XSD and XML file to get started. - xml_to_parquet/xml_to_parquet.py at master · blackrock/xml_to_parquet
Author blackrock
GitHub
github.com › apicrafter › pyiterable
GitHub - apicrafter/pyiterable: Python library to read, write and convert data files with formats BSON, JSON, NDJSON, Parquet, ORC, XLS, XLSX and XML
Python library to read, write and convert data files with formats BSON, JSON, NDJSON, Parquet, ORC, XLS, XLSX and XML - apicrafter/pyiterable
Author apicrafter
Jason Feng's blog
q15928.github.io › 2019 › 07 › 14 › parse-xml
Parsing XML files made simple by PySpark - Jason Feng's blog
July 14, 2019 - Then we use the flatMap function, so each input item (the content of one XML file) can be mapped to multiple items through the function parse_xml. flatMap is one of the functions that made me go "WoW" when I first used Spark a few years ago. We then convert the transformed RDDs to a DataFrame with the pre-defined schema. The DataFrame looks like below. Finally we can save the results as CSV files. Spark provides a rich set of destination formats, i.e. we can write to JSON, Parquet, Avro, or even to a table in a database.
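The parse_xml function itself is not quoted in the snippet; a minimal stand-in using only the standard library might look like the following. The tag names ("record", "name", "value") are illustrative, not taken from the blog post, and in Spark this function would be passed to flatMap rather than called in a loop:

```python
import xml.etree.ElementTree as ET

def parse_xml(xml_content):
    """Map one XML document (a whole file's content) to a list of
    records -- the shape flatMap expects, since one file can yield
    many rows."""
    root = ET.fromstring(xml_content)
    return [
        {"name": rec.findtext("name"), "value": rec.findtext("value")}
        for rec in root.iter("record")
    ]

doc = ("<data><record><name>a</name><value>1</value></record>"
       "<record><name>b</name><value>2</value></record></data>")
print(parse_xml(doc))
```

In plain Python the flatMap step over many documents is just a nested comprehension: rows = [r for d in docs for r in parse_xml(d)].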
Stack Overflow
stackoverflow.com › questions › 75968903 › glue-python-script-to-read-xml-from-soap-and-write-as-parquet-to-s3
pandas - Glue Python script to read xml (from SOAP) and write as Parquet to S3 - Stack Overflow
October 21, 2024 - Now, for the later part, my intention was to create a Pandas dataframe by using read_xml() method with the returned XML string and then use the df.to_parquet() to store the xml in parquet format on S3. Unfortunately, I am unable to parse the xml with read_xml(). Here is the code that I had created - import requests import xmltodict import pandas as pd import pyarrow url = "https://www.w3schools.com/xml/tempconvert.asmx" payload = """<?xml version="1.0" encoding="utf-8"?> <soap12:Envelope xmlns:xsi="http://w3.org/2002/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12
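pandas.read_xml tends to choke here because the payload is wrapped in a SOAP envelope; one common workaround is to unwrap the envelope with the standard library first and only then build the DataFrame from the extracted values. A sketch against a trimmed, hypothetical SOAP 1.2 response (the element names and the service namespace are illustrative, not taken from the real tempconvert service):

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed SOAP 1.2 response; a real one comes back
# from requests.post(url, data=payload, headers=...).text
soap_response = """<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <CelsiusToFahrenheitResponse xmlns="https://example.com/tempconvert/">
      <CelsiusToFahrenheitResult>212</CelsiusToFahrenheitResult>
    </CelsiusToFahrenheitResponse>
  </soap12:Body>
</soap12:Envelope>"""

# Map prefixes to the namespaces used in the response
ns = {
    "soap12": "http://www.w3.org/2003/05/soap-envelope",
    "tc": "https://example.com/tempconvert/",
}
root = ET.fromstring(soap_response)
result = root.findtext(".//tc:CelsiusToFahrenheitResult", namespaces=ns)
print(result)
```

Once the value (or a list of values) is extracted this way, pd.DataFrame followed by df.to_parquet(...) covers the S3 write that the question describes.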
Informatica Knowledge
knowledge.informatica.com › s › article › 509208
HOW TO: Convert Parquet File Data to XML using ... - Search
May 19, 2022
Apache Arrow
arrow.apache.org › docs › python › parquet.html
Reading and Writing the Apache Parquet Format — Apache Arrow v23.0.1
You can also use the convenience function read_table exposed by pyarrow.parquet that avoids the need for an additional Dataset object creation step. ... Note: the partition columns in the original table will have their types converted to Arrow dictionary types (pandas categorical) on load.
DataConverter
dataconverter.io › convert › xml-to-parquet
Convert XML to Parquet Online - DataConverter.io
Use our free online tool to convert your XML data to Apache Parquet quickly
GitHub
github.com › databricks › spark-xml
GitHub - databricks/spark-xml: XML data source for Spark SQL and DataFrames · GitHub
This package allows reading XML files in local or distributed filesystem as Spark DataFrames. When reading files the API accepts several options: path: Location of files. Similar to Spark can accept standard Hadoop globbing expressions.
Starred by 512 users
Forked by 225 users
Languages Scala 97.8% | Java 1.5% | Shell 0.7%
Stack Overflow
stackoverflow.com › questions › 72447656 › converting-xml-to-parquet-nested-objects-and-lists
amazon web services - Converting XML to Parquet - Nested objects and Lists - Stack Overflow
May 31, 2022 - I have a requirement of converting XML Data into Parquet to be used in S3. It sounded like a simple problem at first and hence I hand coded the converter myself. But slowly as the data is getting