You can convert this Excel XML file programmatically. Requirements: only Python and pandas.
import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://www.oreilly.com/library/view/python-cookbook-2nd/0596007973/ch12s08.html
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = []
        self.cells = []
        self.rows = []
        self.tables = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        self.chars.append(content)

    def startElement(self, name, atts):
        # Reset the matching buffer whenever a new Cell, Row or Table opens.
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []

    def endElement(self, name):
        # Push the finished item into its parent container.
        if name == "Cell":
            self.cells.append(''.join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
# Row 3 of the first table holds the column names; the data starts at row 4.
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
Answer from jrovegno on Stack Overflow
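To see what the handler collects, here is a minimal sketch that runs the same kind of SAX handler on a small in-memory document instead of a file. The sample workbook below is hypothetical (two columns, one data row) and only assumes the Table/Row/Cell nesting that the answer relies on.

```python
from xml.sax import ContentHandler, parseString

class ExcelHandler(ContentHandler):
    """Collect the text of every Cell, grouped by Row and by Table."""
    def __init__(self):
        super().__init__()
        self.chars, self.cells, self.rows, self.tables = [], [], [], []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []

    def endElement(self, name):
        if name == "Cell":
            self.cells.append("".join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)

# Hypothetical sample data, not taken from the original file.
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet>
  <Table>
   <Row><Cell><Data>Year</Data></Cell><Cell><Data>Tons</Data></Cell></Row>
   <Row><Cell><Data>2012</Data></Cell><Cell><Data>100</Data></Cell></Row>
  </Table>
 </Worksheet>
</Workbook>"""

handler = ExcelHandler()
parseString(SAMPLE, handler)
# First row is the header here; the remaining rows are the data.
header, *body = handler.tables[0]
```

From there, pd.DataFrame(body, columns=header) would build the frame. In the answer above the header sits at index 3 rather than 0, presumably because that particular file starts with a few title rows.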
The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, which does not seem to be supported in Python. Your best bet is to open it in Excel and save a copy as either a proper Excel file or a CSV.
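If you are not sure which kind of file you have, a rough check is to look at its first bytes. This is only a sketch: the function name and the returned labels are made up for illustration, and it will not recognise XML saved as UTF-16.

```python
def sniff_spreadsheet(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    # Legacy binary .xls files are OLE2 compound documents.
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"
    # .xlsx files are zip archives.
    if head.startswith(b"PK\x03\x04"):
        return "xlsx"
    # Skip an optional UTF-8 byte order mark and whitespace, then look
    # for an XML declaration.
    if head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<?xml"):
        return "xml"
    return "unknown"

def sniff_file(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return sniff_spreadsheet(f.read(64))
```

A file named .xls that comes back as "xml" is the situation described above: the extension says Excel, but the content is an XML document.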
Create a CSV file, which is an Excel-friendly format.
import xml.etree.ElementTree as ET
from os import listdir

# Every file in the current directory whose name starts with 'xml'
xml_lst = [name for name in listdir() if name.startswith('xml')]
fields = ['RecordID','I_25Hz_1s','I_75Hz_2s'] # TODO - add rest of the fields
with open('out.csv','w') as out:
    out.write(','.join(fields) + '\n')
    for xml_file in xml_lst:
        root = ET.parse(xml_file)
        # First element with each tag name, anywhere in the document
        values = [root.find(f'.//{field}').text for field in fields]
        out.write(','.join(values) + '\n')
Output:
RecordID,I_25Hz_1s,I_75Hz_2s
Madird01,56.40,0.36
London01,56.40,0.36
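Joining values with ',' breaks as soon as a value contains a comma, and .text fails if a tag is missing from a file. A possible variation, sketched here with the csv module and findtext, avoids both. The two documents are hypothetical stand-ins for files on disk, and to_csv is just an illustrative helper name.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["RecordID", "I_25Hz_1s", "I_75Hz_2s"]

# Hypothetical documents; the second has a comma in a value and lacks one tag.
DOCS = [
    "<rec><RecordID>Site A</RecordID><I_25Hz_1s>56.40</I_25Hz_1s>"
    "<I_75Hz_2s>0.36</I_75Hz_2s></rec>",
    "<rec><RecordID>Site B, north</RecordID><I_25Hz_1s>56.40</I_25Hz_1s></rec>",
]

def to_csv(docs, fields):
    """Return CSV text with one row per XML document."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for doc in docs:
        root = ET.fromstring(doc)
        # findtext returns the default instead of raising when a tag is absent
        writer.writerow([root.findtext(f".//{name}", default="") for name in fields])
    return out.getvalue()

csv_text = to_csv(DOCS, FIELDS)
```

The writer quotes the value that contains a comma, and the missing tag becomes an empty cell, so Excel still sees three columns in every row.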
When you need to iterate over files with similar names in a folder, one option is to build a pattern and use glob. To make sure that a returned path is a file, you can use isfile().
Regarding the XML, it looks like you basically need to write the value of every terminal tag into a column named after that tag. Since you have several files, you can build a tag-value dictionary from each file and store them all in a ChainMap. After all files are processed, you can use DictWriter to write the data into the final CSV file.
This method is much safer and more flexible than using static column names. The program first collects every tag (column) name that occurs in any file, so if an XML file lacks a tag or has some extra tags, no exception is thrown and all data is saved.
Code:
import xml.etree.ElementTree as ET
from glob import iglob
from os.path import isfile, join
from csv import DictWriter
from collections import ChainMap

xml_root = r"C:\data\Desktop\Blue\XML-files"
pattern = "xmlfile_*"
data = ChainMap()
for filename in iglob(join(xml_root, pattern)):
    if isfile(filename):
        tree = ET.parse(filename)
        root = tree.getroot()
        # Keep only terminal tags (no children). len() is used because
        # truth-testing an Element is deprecated.
        temp = {node.tag: node.text for node in root.iter() if len(node) == 0}
        data = data.new_child(temp)

with open(join(xml_root, "data.csv"), "w", newline="") as f:
    # Iterating the ChainMap yields every tag name seen in any file
    writer = DictWriter(f, data)
    writer.writeheader()
    writer.writerows(data.maps[:-1])  # last is empty dict
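To make the ChainMap part less magical, here is a small sketch with two hand-written dictionaries in place of parsed files (the values are made up). It shows that iterating the ChainMap gives the union of all keys, and that maps keeps one dictionary per file.

```python
from collections import ChainMap

data = ChainMap()
for row in ({"RecordID": "A", "I_25Hz_1s": "56.40"},
            {"RecordID": "B", "Extra": "x"}):
    data = data.new_child(row)

# Iterating a ChainMap yields each distinct key once, across all maps.
columns = sorted(data)

# data.maps holds the newest dict first and the initial empty dict last.
rows = [{key: m.get(key, "") for key in columns} for m in data.maps[:-1]]
```

Note that new_child puts the newest dictionary at the front, so rows come out in reverse processing order. If the order matters, iterating over reversed(data.maps[:-1]) should restore it.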
Update: if you want the xlsx format instead of CSV, you have to use a third-party library (e.g. openpyxl). Example usage:
from openpyxl import Workbook
...
wb = Workbook(write_only=True)
ws = wb.create_sheet()
ws.append(list(data))  # write header
for row in data.maps[:-1]:
    # Fill missing tags with an empty string so columns stay aligned
    ws.append([row.get(key, "") for key in data])
wb.save(join(xml_root, "data.xlsx"))
You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here's what I would do (when reading from a file replace xml_data with the name of your file or file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    # One dict per document: the author's attributes plus the document's own
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

xml_data = io.StringIO(u'''YOUR XML STRING HERE''')
etree = ET.parse(xml_data)  # create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
Have a look at the ElementTree tutorial provided in the xml library documentation.
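Since the answer leaves the XML string as a placeholder, here is a sketch of the two generators on a made-up document. The element and attribute names (library, author, document, name, born, year) are assumptions chosen for illustration; only the standard library is used, and the resulting list of dicts is what would be passed to pd.DataFrame.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the structure is assumed, not taken from the question.
XML = """<library>
  <author name="Ann" born="1970">
    <document year="2001">First</document>
    <document year="2005">Second</document>
  </author>
  <author name="Bob" born="1980">
    <document year="2010">Third</document>
  </author>
</library>"""

def iter_docs(author):
    """Yield one flat dict per document under the given author."""
    for doc in author.iter("document"):
        row = dict(author.attrib)   # author attributes first
        row.update(doc.attrib)      # then the document's own attributes
        row["data"] = doc.text
        yield row

def iter_author(root):
    """Walk every author in the tree and flatten its documents."""
    for author in root.iter("author"):
        yield from iter_docs(author)

records = list(iter_author(ET.fromstring(XML)))
```

Each record repeats the author's attributes next to the document's, which gives one row per document once the list is turned into a DataFrame.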
As of pandas v1.3, you can simply use:
pandas.read_xml(path_or_file)
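A minimal sketch of that call, assuming pandas 1.3 or newer. The sample document and its tag names are made up. The default parser is "lxml", which needs the lxml package installed, so this example passes parser="etree" to stay on the standard library parser.

```python
import io
import pandas as pd

# Hypothetical input with one <record> element per row.
XML = """<records>
  <record><RecordID>A</RecordID><I_25Hz_1s>56.40</I_25Hz_1s></record>
  <record><RecordID>B</RecordID><I_25Hz_1s>57.10</I_25Hz_1s></record>
</records>"""

# xpath selects the elements that become rows; their children become columns.
df = pd.read_xml(io.StringIO(XML), xpath=".//record", parser="etree")
```

This works well when the XML is shallow, with one repeated element per row. For deeply nested documents, flattening by hand with ElementTree as in the answer above may still be the easier route.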