Use ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('Config.xml')
root = tree.getroot()
print(root.findall('.//Log'))
Output:
pawel@pawel-XPS-15-9570:~/test$ python parse_xml.py
[<Element 'Log' at 0x7fb3f2eee9f
Answer from pawelbylina on Stack OverflowUse ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('Config.xml')
root = tree.getroot()
print(root.findall('.//Log'))
Output:
pawel@pawel-XPS-15-9570:~/test$ python parse_xml.py
[<Element 'Log' at 0x7fb3f2eee9f
Below:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Automation_Config>
<Path>
<Log>.\SERVER.log</Log>
<Flag_Path>.\Flag</Flag_Path>
<files>.\PO</files>
</Path>
</Automation_Config>'''
root = ET.fromstring(xml)
for idx,log_element in enumerate(root.findall('.//Log')):
print('{}) Log value: {}'.format(idx,log_element.text))
output
0) Log value: .\SERVER.log
Sorry for using an answer as a question, but formatting this inside a comment is painful. Does the code below solve your problem?
import xml.etree.ElementTree as ET
myParser = ET.XMLParser(encoding="utf-8")
tree = ET.parse('VOAPoints_2010_M25.jml',parser=myParser)
root = tree.getroot()
Since the pound sign was the issue, you can escape it with the character entity £. Python can even automate the replace in XML file by iteratively reading each line and replacing it conditionally on the pound symbol:
import xml.etree.ElementTree as ET
oldfile = "VOAPoints_2010_M25.jml"
newfile = "VOAPoints_2010_M25_new.jml"
with open(oldfile, 'r') as otxt:
for rline in otxt:
if "£" in rline:
rline = rline.replace("£", "£")
with open(newfile, 'a') as ntxt:
ntxt.write(rline)
tree = ET.parse(newfile)
root = tree.getroot()
Videos
You need to iterate each TExportCarcass tag and then use find to access BodyNum
Ex:
from lxml import etree
doc = etree.parse('file.xml')
for elem in doc.findall('TExportCarcass'):
print(elem.find("BodyNum").text)
Output:
6168
6169
or
print([i.text for i in doc.findall('TExportCarcass/BodyNum')]) #-->['6168', '6169']
When you run find on a text string, it will only search for elements at the root level. You can instead use xpath queries within find to search for any element within the doc:
- To get the first element only:
from lxml import etree
doc = etree.parse('file.xml')
memoryElem = doc.find('.//BodyNum')
memoryElem.text
# 6168
- To get all elements:
[ b.text for b in doc.iterfind('.//BodyNum') ]
# ['6168', '6169']
Here's an lxml snippet that extracts an attribute as well as element text (your question was a little ambiguous about which one you needed, so I'm including both):
from lxml import etree
doc = etree.parse(filename)
memoryElem = doc.find('memory')
print memoryElem.text # element text
print memoryElem.get('unit') # attribute
You asked (in a comment on Ali Afshar's answer) whether minidom (2.x, 3.x) is a good alternative. Here's the equivalent code using minidom; judge for yourself which is nicer:
import xml.dom.minidom as minidom
doc = minidom.parse(filename)
memoryElem = doc.getElementsByTagName('memory')[0]
print ''.join( [node.data for node in memoryElem.childNodes] )
print memoryElem.getAttribute('unit')
lxml seems like the winner to me.
XML
<data>
<items>
<item name="item1">item1</item>
<item name="item2">item2</item>
<item name="item3">item3</item>
<item name="item4">item4</item>
</items>
</data>
Python :
from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print "Len : ", len(itemlist)
print "Attribute Name : ", itemlist[0].attributes['name'].value
print "Text : ", itemlist[0].firstChild.nodeValue
for s in itemlist :
print "Attribute Name : ", s.attributes['name'].value
print "Text : ", s.firstChild.nodeValue
My beloved SD Chargers hat is off to you if you think a regex is easier than this:
#!/usr/bin/env python
import xml.etree.cElementTree as et
sxml="""
<encspot>
<file>
<Name>some filename.mp3</Name>
<Encoder>Gogo (after 3.0)</Encoder>
<Bitrate>131</Bitrate>
</file>
<file>
<Name>another filename.mp3</Name>
<Encoder>iTunes</Encoder>
<Bitrate>128</Bitrate>
</file>
</encspot>
"""
tree=et.fromstring(sxml)
for el in tree.findall('file'):
print '-------------------'
for ch in el.getchildren():
print '{:>15}: {:<30}'.format(ch.tag, ch.text)
print "\nan alternate way:"
el=tree.find('file[2]/Name') # xpath
print '{:>15}: {:<30}'.format(el.tag, el.text)
Output:
-------------------
Name: some filename.mp3
Encoder: Gogo (after 3.0)
Bitrate: 131
-------------------
Name: another filename.mp3
Encoder: iTunes
Bitrate: 128
an alternate way:
Name: another filename.mp3
If your attraction to a regex is being terse, here is an equally incomprehensible bit of list comprehension to create a data structure:
[(ch.tag,ch.text) for e in tree.findall('file') for ch in e.getchildren()]
Which creates a list of tuples of the XML children of <file> in document order:
[('Name', 'some filename.mp3'),
('Encoder', 'Gogo (after 3.0)'),
('Bitrate', '131'),
('Name', 'another filename.mp3'),
('Encoder', 'iTunes'),
('Bitrate', '128')]
With a few more lines and a little more thought, obviously, you can create any data structure that you want from XML with ElementTree. It is part of the Python distribution.
Edit
Code golf is on!
[{item.tag: item.text for item in ch} for ch in tree.findall('file')]
[ {'Bitrate': '131',
'Name': 'some filename.mp3',
'Encoder': 'Gogo (after 3.0)'},
{'Bitrate': '128',
'Name': 'another filename.mp3',
'Encoder': 'iTunes'}]
If your XML only has the file section, you can choose your golf. If your XML has other tags, other sections, you need to account for the section the children are in and you will need to use findall
There is a tutorial on ElementTree at Effbot.org
Use ElementTree. You don't need/want to muck about with a parse-only gadget like pyexpat ... you'd only end up re-inventing ElementTree partially and poorly.
Another possibility is lxml which is a third-party package which implements the ElementTree interface plus more.
Update Someone started playing code-golf; here's my entry, which actually creates the data structure you asked for:
# xs = """<encspot> etc etc </encspot"""
>>> import xml.etree.cElementTree as et
>>> from pprint import pprint as pp
>>> pp([dict((attr.tag, attr.text) for attr in el) for el in et.fromstring(xs)])
[{'Bitrate': '131',
'Encoder': 'Gogo (after 3.0)',
'Frame': 'no',
'Frames': '6255',
'Freq.': '44100',
'Length': '00:02:43',
'Mode': 'joint stereo',
'Name': 'some filename.mp3',
'Quality': 'good',
'Size': '5,236,644'},
{'Bitrate': '0', 'Name': 'foo.mp3'}]
>>>
You'd probably want to have a dict mapping "attribute" names to conversion functions:
converters = {
'Frames': int,
'Size': lambda x: int(x.replace(',', '')),
# etc
}
I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease of programming part depends on the API, which ElementTree defines.
First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:
import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()
Or any of the many other ways shown at ElementTree. Then do something like:
for type_tag in root.findall('bar/type'):
value = type_tag.get('foobar')
print(value)
Output:
1
2
minidom is the quickest and pretty straight forward.
XML:
<data>
<items>
<item name="item1"></item>
<item name="item2"></item>
<item name="item3"></item>
<item name="item4"></item>
</items>
</data>
Python:
from xml.dom import minidom
dom = minidom.parse('items.xml')
elements = dom.getElementsByTagName('item')
print(f"There are {len(elements)} items:")
for element in elements:
print(element.attributes['name'].value)
Output:
There are 4 items:
item1
item2
item3
item4
lxml has been mentioned. You might also check out lxml.objectify for some really simple manipulation.
>>> from lxml import objectify
>>> tree = objectify.fromstring(your_xml)
>>> tree.weather.attrib["module_id"]
'0'
>>> tree.weather.forecast_information.city.attrib["data"]
'Mountain View, CA'
>>> tree.weather.forecast_information.postal_code.attrib["data"]
'94043'
You want a thin veneer? That's easy to cook up. Try the following trivial wrapper around ElementTree as a start:
# geetree.py
import xml.etree.ElementTree as ET
class GeeElem(object):
"""Wrapper around an ElementTree element. a['foo'] gets the
attribute foo, a.foo gets the first subelement foo."""
def __init__(self, elem):
self.etElem = elem
def __getitem__(self, name):
res = self._getattr(name)
if res is None:
raise AttributeError, "No attribute named '%s'" % name
return res
def __getattr__(self, name):
res = self._getelem(name)
if res is None:
raise IndexError, "No element named '%s'" % name
return res
def _getelem(self, name):
res = self.etElem.find(name)
if res is None:
return None
return GeeElem(res)
def _getattr(self, name):
return self.etElem.get(name)
class GeeTree(object):
"Wrapper around an ElementTree."
def __init__(self, fname):
self.doc = ET.parse(fname)
def __getattr__(self, name):
if self.doc.getroot().tag != name:
raise IndexError, "No element named '%s'" % name
return GeeElem(self.doc.getroot())
def getroot(self):
return self.doc.getroot()
You invoke it so:
>>> import geetree
>>> t = geetree.GeeTree('foo.xml')
>>> t.xml_api_reply.weather.forecast_information.city['data']
'Mountain View, CA'
>>> t.xml_api_reply.weather.current_conditions.temp_f['data']
'68'
Use [] to filter and reorganize columns:
cols = ['Application_ID', 'Product_Type', 'Product_ID']
df = pd.read_xml('product.xml')[cols]
print(df)
# Output:
Application_ID Product_Type Product_ID
0 BBC#:1010 1 32
1 NBA#:1111 2 22
2 BBC#:1212 1 63
3 NBA#:2210 2 22
If you want to replace '_' from your column names by ' ':
df.columns = df.columns.str.replace('_', ' ')
print(df)
# Output:
Application ID Product Type Product ID
0 BBC#:1010 1 32
1 NBA#:1111 2 22
2 BBC#:1212 1 63
3 NBA#:2210 2 22
As of Pandas 1.3.0 there is a read_xml() function that makes working with reading/writing XML data in/out of pandas much easier.
Once you upgrade to Pandas >1.3.0 you can simply use:
df = pd.read_xml("___XML_FILEPATH___")
print(df)
(Note that in the XML sample above the <Rowset> tag needs to be closed)