Use ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('Config.xml')
root = tree.getroot()
print(root.findall('.//Log'))
Output:
pawel@pawel-XPS-15-9570:~/test$ python parse_xml.py
[<Element 'Log' at 0x7fb3f2eee9f
Answer from pawelbylina on Stack Overflow
The example below parses the XML from a string and prints the text of each Log element:
import xml.etree.ElementTree as ET
# Raw string so the backslashes in the paths are kept literally
xml = r'''<?xml version="1.0" encoding="UTF-8"?>
<Automation_Config>
<Path>
<Log>.\SERVER.log</Log>
<Flag_Path>.\Flag</Flag_Path>
<files>.\PO</files>
</Path>
</Automation_Config>'''
root = ET.fromstring(xml)
for idx, log_element in enumerate(root.findall('.//Log')):
    print('{}) Log value: {}'.format(idx, log_element.text))
Output:
0) Log value: .\SERVER.log
Here's an lxml snippet that extracts an attribute as well as element text (your question was a little ambiguous about which one you needed, so I'm including both):
from lxml import etree
doc = etree.parse(filename)
memoryElem = doc.find('memory')
print(memoryElem.text)         # element text
print(memoryElem.get('unit'))  # attribute
You asked (in a comment on Ali Afshar's answer) whether minidom (2.x, 3.x) is a good alternative. Here's the equivalent code using minidom; judge for yourself which is nicer:
import xml.dom.minidom as minidom
doc = minidom.parse(filename)
memoryElem = doc.getElementsByTagName('memory')[0]
print(''.join(node.data for node in memoryElem.childNodes))
print(memoryElem.getAttribute('unit'))
lxml seems like the winner to me.
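For comparison, the same two lookups can be sketched with the standard library's xml.etree.ElementTree. The document below is an assumed stand-in (the file from the question is not shown), and the element and attribute values are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file from the question is not shown.
xml_text = '<domain><memory unit="KiB">524288</memory></domain>'

root = ET.fromstring(xml_text)
memory_elem = root.find('memory')   # first direct child named <memory>
print(memory_elem.text)             # element text
print(memory_elem.get('unit'))      # attribute value
```

The calls are the same ones the lxml snippet uses, since lxml.etree follows the ElementTree API for find, text and get.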
XML
<data>
<items>
<item name="item1">item1</item>
<item name="item2">item2</item>
<item name="item3">item3</item>
<item name="item4">item4</item>
</items>
</data>
Python :
from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print("Len : ", len(itemlist))
print("Attribute Name : ", itemlist[0].attributes['name'].value)
print("Text : ", itemlist[0].firstChild.nodeValue)
for s in itemlist:
    print("Attribute Name : ", s.attributes['name'].value)
    print("Text : ", s.firstChild.nodeValue)
You need to iterate over each TExportCarcass tag and then use find to access BodyNum.
Example:
from lxml import etree
doc = etree.parse('file.xml')
for elem in doc.findall('TExportCarcass'):
    print(elem.find("BodyNum").text)
Output:
6168
6169
or
print([i.text for i in doc.findall('TExportCarcass/BodyNum')]) #-->['6168', '6169']
When you pass a plain tag name to find, it only searches the direct children of the element you call it on (the root element, when called on the parsed document). You can instead pass an XPath expression such as .//BodyNum to search for an element anywhere in the doc:
- To get the first element only:
from lxml import etree
doc = etree.parse('file.xml')
memoryElem = doc.find('.//BodyNum')
memoryElem.text
# 6168
- To get all elements:
[ b.text for b in doc.iterfind('.//BodyNum') ]
# ['6168', '6169']
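The difference between a plain tag name and a .// path can be sketched with the standard library's ElementTree, which supports the same path syntax. The document below is a hypothetical one shaped like the data discussed above, with an extra wrapper level added so that BodyNum is not a direct child of the root:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the wrapper elements are assumptions for illustration.
xml_text = """
<Export>
  <Carcasses>
    <TExportCarcass><BodyNum>6168</BodyNum></TExportCarcass>
    <TExportCarcass><BodyNum>6169</BodyNum></TExportCarcass>
  </Carcasses>
</Export>
"""
root = ET.fromstring(xml_text)

direct = root.find('BodyNum')        # only looks at direct children of root
first = root.find('.//BodyNum')      # searches all descendants
all_nums = [b.text for b in root.iterfind('.//BodyNum')]

print(direct)        # None: BodyNum is not a direct child of <Export>
print(first.text)    # 6168
print(all_nums)      # ['6168', '6169']
```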
I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease of programming part depends on the API, which ElementTree defines.
First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:
import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()
Or any of the many other ways shown at ElementTree. Then do something like:
for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)
Output:
1
2
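To try this without a file, here is a self-contained sketch that builds the root with ET.fromstring. The XML is an assumed input that matches the bar/type path and the foobar attribute used above, since the question's document is not reproduced here:

```python
import xml.etree.ElementTree as ET

# Assumed input: shaped to match the 'bar/type' path and 'foobar' attribute.
xml_text = """
<foo>
  <bar>
    <type foobar="1"/>
    <type foobar="2"/>
  </bar>
</foo>
"""
root = ET.fromstring(xml_text)

# 'bar/type' means: <type> children of <bar> children of the root element.
values = [type_tag.get('foobar') for type_tag in root.findall('bar/type')]
print(values)   # ['1', '2']
```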
minidom is the quickest and is pretty straightforward.
XML:
<data>
<items>
<item name="item1"></item>
<item name="item2"></item>
<item name="item3"></item>
<item name="item4"></item>
</items>
</data>
Python:
from xml.dom import minidom
dom = minidom.parse('items.xml')
elements = dom.getElementsByTagName('item')
print(f"There are {len(elements)} items:")
for element in elements:
    print(element.attributes['name'].value)
Output:
There are 4 items:
item1
item2
item3
item4
This has nothing to do with the XML file format; it is about the encoding your file is saved in. If Python is decoding the file as UTF-8 but the file was created on Windows, it is probably in windows-1252. You should use:
f = open("text.txt", "r", encoding="cp1252")
This should do the job.
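A minimal sketch of both options follows: naming the encoding when reading the file as text, and letting the XML parser read the raw bytes so that it honors the encoding in the XML declaration. The file name and its content are hypothetical, and the sketch assumes the file declares its encoding:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical file: a small XML document saved in windows-1252.
text = '<?xml version="1.0" encoding="windows-1252"?><note>caf\u00e9</note>'
path = os.path.join(tempfile.mkdtemp(), 'note.xml')
with open(path, 'w', encoding='cp1252') as f:
    f.write(text)

# Option 1: read as text and name the encoding explicitly.
with open(path, 'r', encoding='cp1252') as f:
    content = f.read()

# Option 2: give the parser the path; it reads bytes and uses the
# encoding stated in the XML declaration.
root = ET.parse(path).getroot()
print(root.text)
```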
# Read the whole file into a string; the with block closes the file
with open('reboot.xml', 'r') as f:
    a = f.read()
print(a)
Use [] to filter and reorganize columns:
import pandas as pd
cols = ['Application_ID', 'Product_Type', 'Product_ID']
df = pd.read_xml('product.xml')[cols]
print(df)
# Output:
Application_ID Product_Type Product_ID
0 BBC#:1010 1 32
1 NBA#:1111 2 22
2 BBC#:1212 1 63
3 NBA#:2210 2 22
If you want to replace '_' from your column names by ' ':
df.columns = df.columns.str.replace('_', ' ')
print(df)
# Output:
Application ID Product Type Product ID
0 BBC#:1010 1 32
1 NBA#:1111 2 22
2 BBC#:1212 1 63
3 NBA#:2210 2 22
As of Pandas 1.3.0 there is a read_xml() function that makes working with reading/writing XML data in/out of pandas much easier.
Once you upgrade to Pandas >=1.3.0 you can simply use:
df = pd.read_xml("___XML_FILEPATH___")
print(df)
(Note that in the XML sample above the <Rowset> tag needs to be closed)
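You can use BeautifulSoup (bs4) with the lxml parser library to scrape XML data. Note that the "lxml" parser treats the input as HTML and lowercases tag and attribute names, which is why the lookups below use administrativegendercode and displayname.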
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
Output:
Summary
fname
lname
Female
Install library dependency:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3
Or you can simply use lxml. Here is the tutorial that I used: https://lxml.de/tutorial.html. The code should be similar to:
from lxml import etree
root = etree.Element("patient")
print(root.find("given"))
print(root.find("family"))
print(root.find("give"))
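Since the ClinicalDocument sample declares a default namespace (urn:hl7-org:v3), plain tag names will not match when it is parsed as XML. The following is a sketch using the standard library's ElementTree rather than lxml; the document is a trimmed, well-formed stand-in (the sample above is cut off before its closing tags) and the hl7 prefix is an arbitrary choice made here:

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the ClinicalDocument sample (an assumption).
xml_data = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Summary</title>
  <recordTarget><patientRole><patient>
    <name><given>fname</given><family>lname</family></name>
    <administrativeGenderCode code="F" displayName="Female"/>
  </patient></patientRole></recordTarget>
</ClinicalDocument>"""

ns = {'hl7': 'urn:hl7-org:v3'}   # prefix chosen here; any prefix works
root = ET.fromstring(xml_data)

patient = root.find('.//hl7:patient', ns)
given = patient.find('hl7:name/hl7:given', ns).text
family = patient.find('hl7:name/hl7:family', ns).text
gender = patient.find('hl7:administrativeGenderCode', ns).get('displayName')
missing = patient.find('hl7:give', ns)   # no such element, so find returns None

print(given, family, gender, missing)
```

Unlike the BeautifulSoup version, tag and attribute names keep their original case here.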
It looks to me as if you do not need any DOM capabilities from your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur.
Note, however, Fredrik's advice on using the cElementTree iterparse function:
to parse large files, you can get rid of elements as soon as you've processed them:
for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
# get an iterable
context = iterparse(source, events=("start", "end"))
# turn it into an iterator
context = iter(context)
# get the root element
event, root = context.next()  # Python 2 spelling; on Python 3 use next(context)
for event, elem in context:
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
The lxml.iterparse() does not allow this.
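The previous snippet does not work on Python 3 (for example 3.7), because iterators no longer have a .next() method; call next(context) instead, or consider the following way to get the root element: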
import xml.etree.ElementTree as ET
# Get an iterable.
context = ET.iterparse(source, events=("start", "end"))
for index, (event, elem) in enumerate(context):
    # Get the root element.
    if index == 0:
        root = elem
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
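Here is a runnable sketch of the same pattern on Python 3, using next() to grab the root. The in-memory source and the record contents are hypothetical; a real use would pass a file path or file object instead:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory source standing in for a large file.
source = io.BytesIO(
    b"<records>"
    b"<record id='1'>a</record>"
    b"<record id='2'>b</record>"
    b"<record id='3'>c</record>"
    b"</records>"
)

seen = []
context = ET.iterparse(source, events=("start", "end"))
_, root = next(context)          # first event is the start of the root element
for event, elem in context:
    if event == "end" and elem.tag == "record":
        seen.append((elem.get("id"), elem.text))   # process the record
        root.clear()             # drop processed children to free memory

print(seen)        # [('1', 'a'), ('2', 'b'), ('3', 'c')]
print(len(root))   # 0: no children are left attached to the root
```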
Have you tried the cElementTree module?
cElementTree is included with Python 2.5 and later, as xml.etree.cElementTree. Refer to the benchmarks.
Note that since Python 3.3 the accelerated C implementation is used by xml.etree.ElementTree by default, so this change is not needed on Python 3.3+; the xml.etree.cElementTree alias was removed in Python 3.9.
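If the same script has to run on both old and new interpreters, one common pattern is to try the C module first and fall back to the plain module. This is a sketch of that pattern, not something the answer above prescribes, and the sample XML is made up:

```python
try:
    import xml.etree.cElementTree as ET   # Python 2.5 up to 3.8
except ImportError:
    import xml.etree.ElementTree as ET    # newer versions, where the alias is gone

# Either import exposes the same API.
root = ET.fromstring('<data><item name="item1"/></data>')
print(root.find('item').get('name'))   # item1
```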
lxml has been mentioned. You might also check out lxml.objectify for some really simple manipulation.
>>> from lxml import objectify
>>> tree = objectify.fromstring(your_xml)
>>> tree.weather.attrib["module_id"]
'0'
>>> tree.weather.forecast_information.city.attrib["data"]
'Mountain View, CA'
>>> tree.weather.forecast_information.postal_code.attrib["data"]
'94043'
You want a thin veneer? That's easy to cook up. Try the following trivial wrapper around ElementTree as a start:
# geetree.py
import xml.etree.ElementTree as ET
class GeeElem(object):
    """Wrapper around an ElementTree element. a['foo'] gets the
    attribute foo, a.foo gets the first subelement foo."""
    def __init__(self, elem):
        self.etElem = elem
    def __getitem__(self, name):
        res = self._getattr(name)
        if res is None:
            raise AttributeError("No attribute named '%s'" % name)
        return res
    def __getattr__(self, name):
        res = self._getelem(name)
        if res is None:
            raise IndexError("No element named '%s'" % name)
        return res
    def _getelem(self, name):
        res = self.etElem.find(name)
        if res is None:
            return None
        return GeeElem(res)
    def _getattr(self, name):
        return self.etElem.get(name)
class GeeTree(object):
    "Wrapper around an ElementTree."
    def __init__(self, fname):
        self.doc = ET.parse(fname)
    def __getattr__(self, name):
        if self.doc.getroot().tag != name:
            raise IndexError("No element named '%s'" % name)
        return GeeElem(self.doc.getroot())
    def getroot(self):
        return self.doc.getroot()
You invoke it so:
>>> import geetree
>>> t = geetree.GeeTree('foo.xml')
>>> t.xml_api_reply.weather.forecast_information.city['data']
'Mountain View, CA'
>>> t.xml_api_reply.weather.current_conditions.temp_f['data']
'68'
Given the two levels of nodes that cover the Coluna attributes, consider XSLT, the special-purpose language designed to transform or style original XML files. Python's lxml can run XSLT 1.0 scripts and, being the default parser for pandas.read_xml, can transform your raw XML into a flatter version to parse into a DataFrame.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:pace='http://www.ms.com/pace'>
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- REDESIGN XML TO ONLY RETURN AnaliseDiaria NODES -->
<xsl:template match="/*">
<xsl:copy>
<xsl:apply-templates select="descendant::pace:AnaliseDiaria"/>
</xsl:copy>
</xsl:template>
<!-- REDESIGN AnaliseDiaria NODES -->
<xsl:template match="pace:AnaliseDiaria">
<xsl:copy>
<!-- BRING DOWN Produto ATTRIBUTES WITH CURRENT ATTRIBUTES -->
<xsl:copy-of select="ancestor::pace:Produto/@*|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python
import pandas as pd
analise_diaria_df = pd.read_xml("input.xml", stylesheet="style.xsl")
analise_diaria_df
# Coluna1 Coluna2 Coluna3 ... Coluna14 Coluna15 Coluna16
# 0 21-851611 CAMIO VO NaN ... NaN NaN NaN
# 1 21-3667984 SCA4X2 -1.0 ... NaN NaN NaN
# 2 21-3667994 SCA963 -1.0 ... NaN NaN NaN
# 3 21-3676543 SCA713 -1.0 ... NaN NaN NaN
# 4 21-3676601 SCA97 -1.0 ... NaN NaN NaN
# 5 21-3814014 CAMIX2 NaN ... NaN NaN NaN
# 6 21-3814087 SCA56 NaN ... NaN NaN NaN
# 7 21-3814087 SCA56 NaN ... 195.000,00 NF9 10203910A
# 8 21-3814087 SCA56 NaN ... 195.090,00 NaN NaN
# 9 21-3814087 SCA56 NaN ... 195.270,00 NaN NaN
# 10 21-3814087 SCA56 NaN ... 195.482,60 NaN NaN
# 11 21-3814087 SCA56 NaN ... 195.627,80 NaN NaN
# 12 21-3814087 SCA56 NaN ... 204.529,82 NaN NaN
# 13 21-3814087 SCA56 NaN ... NaN NaN 158PES
Fortunately, in the case of your xml in the question, you can use the pandas read_xml() method, although you'll have to skirt around the namespaces issue:
import pandas as pd
pd.read_xml('file.xml', xpath='//*[local-name()="Linha"]//*[local-name()="Produto"]')
Output:
Coluna1 Coluna2 Coluna3 Coluna4 Coluna5 {http://www.ms.com/pace}AnaliseDiaria
0 21-851611 CAMIO VO NaN NaN NaN NaN
1 21-3667984 SCA4X2 -1.0 NaN NaN NaN
2 21-3667994 SCA963 -1.0 NaN NaN NaN
etc. If you are not interested in one column or another, you can simply drop() it.
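If pandas is not available, a similar namespace-agnostic lookup can be sketched with the standard library: since Python 3.8, ElementTree paths accept {*} as a wildcard for any namespace. The document below is a hypothetical, heavily trimmed one that only borrows the namespace URI and the Linha/Produto/Coluna names from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed document using the namespace from the question.
xml_text = """<pace:Root xmlns:pace="http://www.ms.com/pace">
  <pace:Linha>
    <pace:Produto Coluna1="21-851611" Coluna2="CAMIO VO"/>
    <pace:Produto Coluna1="21-3667984" Coluna2="SCA4X2" Coluna3="-1"/>
  </pace:Linha>
</pace:Root>"""

root = ET.fromstring(xml_text)

# {*} matches any namespace (requires Python 3.8 or later).
rows = [dict(p.attrib) for p in root.findall('.//{*}Linha/{*}Produto')]
print(rows)
```

Each row is a plain dict of the Produto attributes, which can be passed to pd.DataFrame(rows) if a DataFrame is needed.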