Many Python XML libraries support parsing XML subelements incrementally, e.g. xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. These are usually called "XML stream parsers".
The xmltodict library you used also has a streaming mode; I think it may solve your problem:
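For instance, here's a minimal iterparse sketch (using a small in-memory document in place of a real file, so it's self-contained):

```python
import io
import xml.etree.ElementTree as ET

# iterparse yields each element as its end tag is read, so you can
# process it and free it immediately instead of loading the whole tree.
xml_doc = io.StringIO("<root><item>a</item><item>b</item></root>")
texts = []
for event, elem in ET.iterparse(xml_doc, events=("end",)):
    if elem.tag == "item":
        texts.append(elem.text)
        elem.clear()  # release the element's memory once processed
print(texts)  # ['a', 'b']
```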
https://github.com/martinblech/xmltodict#streaming-mode
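A sketch of that streaming API, assuming your repeating elements sit two levels deep (adjust item_depth to match your document; the sample XML here is made up):

```python
import xmltodict

SAMPLE = """<root>
  <item><name>a</name><value>1</value></item>
  <item><name>b</name><value>2</value></item>
</root>"""

items = []

def handle_item(path, item):
    # path: list of (tag, attrs) pairs from the root down to this element;
    # item: the parsed dict for one element at item_depth.
    items.append(item)
    return True  # returning False aborts the parse

# item_depth=2 fires the callback for each element two levels deep,
# so each <item> is handled and discarded instead of accumulated.
xmltodict.parse(SAMPLE, item_depth=2, item_callback=handle_item)
print(items)
```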
Instead of trying to read the file in one go and then process it, you want to read it in chunks and process each chunk as it's loaded. This is a fairly common situation when processing large XML files and is covered by the Simple API for XML (SAX) standard, which specifies a callback API for parsing XML streams. It's available in the Python standard library as xml.sax.parse, and xml.etree.ElementTree.iterparse offers a similar incremental approach, as mentioned above.
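To illustrate the SAX callback style, here's a minimal sketch with a toy handler (the tag name "item" and the in-memory document are just placeholders):

```python
import io
import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements without ever building a tree."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag as the stream is read.
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parse(io.StringIO("<root><item/><item/></root>"), handler)
print(handler.count)  # 2
```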
Here's a quick XML to JSON converter:
from collections import defaultdict
import json
import xml.etree.ElementTree as ET


def parse_xml(file_name):
    events = ("start", "end")
    context = ET.iterparse(file_name, events=events)
    return pt(context)


def pt(context, cur_elem=None):
    items = defaultdict(list)
    if cur_elem:
        items.update(cur_elem.attrib)
    text = ""
    for action, elem in context:
        # print("{0:>6} : {1:20} {2:20} '{3}'".format(action, elem.tag, elem.attrib, str(elem.text).strip()))
        if action == "start":
            items[elem.tag].append(pt(context, elem))
        elif action == "end":
            text = elem.text.strip() if elem.text else ""
            elem.clear()
            break
    if len(items) == 0:
        return text
    return {k: v[0] if len(v) == 1 else v for k, v in items.items()}


if __name__ == "__main__":
    json_data = parse_xml("large.xml")
    print(json.dumps(json_data, indent=2))
If you're doing a lot of XML processing, check out the lxml library; it has a ton of useful features over and above the standard modules, while also being easier to use.
http://lxml.de/tutorial.html
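As a sketch (assuming lxml is installed), its iterparse mirrors the standard library's but adds conveniences such as filtering by tag directly:

```python
import io
from lxml import etree

# lxml's iterparse takes a bytes stream; tag="item" makes it yield
# only the elements we care about, skipping everything else.
xml_doc = io.BytesIO(b"<root><item>a</item><item>b</item></root>")
texts = []
for event, elem in etree.iterparse(xml_doc, events=("end",), tag="item"):
    texts.append(elem.text)
    elem.clear()  # release the element once processed
print(texts)  # ['a', 'b']
```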
Try the recommended Microsoft technique: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/how-to-perform-streaming-transform-of-large-xml-documents
So, for example, the core of the streaming loop looks like this:
while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.EndElement)
        break;
    if (reader.NodeType == XmlNodeType.Element
        && reader.Name == "Item")
    {
        item = XElement.ReadFrom(reader) as XElement;
        if (item != null)
        {
            // here you can get data from your array object
            // and put it into your JSON stream
        }
    }
}
If you want to determine the kind of element, you can check whether it has children: How to check if XElement has any child nodes?
It should work well in tandem with JSON streaming. For more info about streaming JSON, see: Writing JSON to a stream without buffering the string in memory
Huge files always require using XmlReader. I use a combination of XmlReader and XML LINQ in the code below:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication120
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            List<Dictionary<string, string>> items = new List<Dictionary<string, string>>();

            XmlReader reader = XmlReader.Create(FILENAME);
            reader.ReadToFollowing("hugeArray");

            while (!reader.EOF)
            {
                if (reader.Name != "item")
                {
                    reader.ReadToFollowing("item");
                }
                if (!reader.EOF)
                {
                    XElement item = (XElement)XElement.ReadFrom(reader);
                    Dictionary<string, string> dict = item.Elements()
                        .GroupBy(x => x.Name.LocalName, y => (string)y)
                        .ToDictionary(x => x.Key, y => y.FirstOrDefault());
                    items.Add(dict);
                }
            }
        }
    }
}