You can use a combination of ElementTree's fromstring() method and the requests module's requests.get() to accomplish this.
https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml
fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree.
Install the requests module:
pip install requests
Use requests.get() to fetch the XML from the URL as a string, then pass that string to fromstring().
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import requests
tree = ET.fromstring(requests.get('http://synd.cricbuzz.com/j2me/1.0/livematches.xml').text)
for child in tree:
    print("%s - %s" % (child.get('srs'), child.get('mchDesc')))
Results:
None - None
India tour of Sri Lanka, 2015 - Cricbuzz Cup - SL vs IND
Australia tour of Ireland, 2015 - IRE vs AUS
New Zealand tour of South Africa, 2015 - RSA vs NZ
Royal London One-Day Cup, 2015 - SUR vs KENT
Royal London One-Day Cup, 2015 - ESS vs YORKS
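The None - None line in the results most likely comes from a child element that carries neither attribute, since Element.get() returns None when an attribute is missing. A minimal offline sketch, assuming a made-up sample document (the element names mchdata, links and match and all attribute values are illustrative, not taken from the live feed):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's structure may differ.
SAMPLE = """<mchdata>
  <links fUrl="http://example.invalid/"/>
  <match srs="Example Series, 2015" mchDesc="AAA vs BBB"/>
</mchdata>"""


def describe_matches(xml_text):
    """Return 'series - description' strings for <match> elements only."""
    root = ET.fromstring(xml_text)
    # iter("match") skips siblings such as <links> that lack these attributes.
    return ["%s - %s" % (m.get("srs"), m.get("mchDesc")) for m in root.iter("match")]


root = ET.fromstring(SAMPLE)
first_child_series = root[0].get("srs")  # <links> has no srs attribute
lines = describe_matches(SAMPLE)
```

Filtering by tag name instead of looping over every child is one way to avoid the stray None - None row.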
Answer from Joe Young on Stack Overflow
From ElementTree docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
and later in the same page, 20.5.1.4. Finding interesting elements:
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)
Which translates to:
import xml.etree.ElementTree as ET
root = ET.fromstring("""
<root>
<H D="14/11/2017">
<FC>
<F LV="0">The quick</F>
<F LV="1">brown</F>
<F LV="2">fox</F>
</FC>
</H>
<H D="14/11/2017">
<FC>
<F LV="0">The lazy</F>
<F LV="1">fox</F>
</FC>
</H>
</root>""")
# root = tree.getroot()
for h in root.iter("H"):
    print(h.attrib["D"])
for f in root.iter("F"):
    print(f.attrib, f.text)
output:
14/11/2017
14/11/2017
{'LV': '0'} The quick
{'LV': '1'} brown
{'LV': '2'} fox
{'LV': '0'} The lazy
{'LV': '1'} fox
You did not specify exactly what you want to use, so I recommend lxml for Python. There are several ways to get the values you want:
With a loop:
from lxml import etree
tree = etree.parse('XmlTest.xml')
root = tree.getroot()
text = []
for element in root:
    text.append(element.get('D', None))
    for child in element:
        for grandchild in child:
            text.append(grandchild.text)
print(text)
Output: ['14/11/2017', 'The quick', 'brown', 'fox', '14/11/2017', 'The lazy', 'fox']
With xpath:
from lxml import etree
tree = etree.parse('XmlTest.xml')
root = tree.getroot()
D = root.xpath("./H")
F = root.xpath(".//F")
for each in D:
    print(each.get('D', None))
for each in F:
    print(each.text)
Output: 14/11/2017 14/11/2017 The quick brown fox The lazy fox
Both have their own advantages, and either gives you a good starting point. I recommend XPath, since it gives you more freedom when values are missing.
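To illustrate the point about missing values, here is a small sketch. It swaps in the standard library's ElementTree, which supports only a subset of XPath, so that it runs without lxml. The sample document is invented: the second H has no D attribute and no F children.

```python
import xml.etree.ElementTree as ET

# Invented sample: the second <H> lacks the D attribute and has an empty <FC>.
doc = ET.fromstring(
    '<root>'
    '<H D="14/11/2017"><FC><F LV="0">The quick</F></FC></H>'
    '<H><FC/></H>'
    '</root>'
)

# get() with a default keeps a missing attribute from raising or yielding None.
dates = [h.get("D", "unknown") for h in doc.findall("./H")]

# A descendant search simply returns fewer matches when elements are absent.
words = [f.text for f in doc.findall(".//F")]
```

The path-based searches never index into a fixed position, so an H without children does not break the loop the way nested positional access could.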
So I have ElementTree 1.2.6 on my box now, and ran the following code against the XML chunk you posted:
import elementtree.ElementTree as ET
tree = ET.parse("test.xml")
doc = tree.getroot()
thingy = doc.find('timeSeries')
print(thingy.attrib)
and got the following back:
{'name': 'NWIS Time Series Instantaneous Values'}
It appears to have found the timeSeries element without needing to use numerical indices.
What would be useful now is knowing what you mean when you say "it doesn't work." Since it works for me given the same input, it is unlikely that ElementTree is broken in some obvious way. Update your question with any error messages, backtraces, or anything you can provide to help us help you.
If I understand your question correctly:
for elem in doc.findall('timeSeries/values/value'):
    print(elem.get('dateTime'), elem.text)
or, if you prefer (and if there is only one occurrence of timeSeries/values):
values = doc.find('timeSeries/values')
for value in values:
    print(value.get('dateTime'), value.text)
The findall() method returns a list of all matching elements, whereas find() returns only the first matching element. The first example loops over all the found elements, the second loops over the child elements of the values element, in this case leading to the same result.
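The difference between find() and findall() can be checked on a small inline document. This is a sketch using the standard library's xml.etree.ElementTree; the element names follow the question, while the attribute values and readings are made up.

```python
import xml.etree.ElementTree as ET

# Structure modelled on the question; all values are invented.
doc = ET.fromstring("""
<timeSeriesResponse>
  <timeSeries name="example series">
    <values>
      <value dateTime="2009-01-01T00:00:00">1.0</value>
      <value dateTime="2009-01-01T00:15:00">2.0</value>
    </values>
  </timeSeries>
</timeSeriesResponse>""")

all_values = doc.findall("timeSeries/values/value")   # list of every match
first_value = doc.find("timeSeries/values/value")     # first match only
missing = doc.find("noSuchElement")                   # None when nothing matches

readings = [(v.get("dateTime"), v.text) for v in all_values]
```

Because find() returns None on a miss, checking the result before using it avoids an AttributeError further down.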
I don't see where the problem with not finding timeSeries comes from, however. Maybe you just forgot the getroot() call? (Note that you don't really need it, because you can work from the ElementTree itself too if you change the path expression to, for example, /timeSeriesResponse/timeSeries/values or //timeSeries/values.)
Before I try to answer, a tip. Your exception handler covers up the nature of the problem. Just let the original exception rise up and you'll have more information to share with people who are interested in helping you.
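As a rough illustration of that tip, the sketch below leaves the parsing function without a broad try/except, so the original error keeps its details. It assumes the standard library's xml.etree.ElementTree; parse_feed is a hypothetical helper name, and the malformed input is invented.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text):
    # No blanket exception handler here: a ParseError from the parser
    # carries the line and column where parsing failed.
    return ET.fromstring(xml_text)


try:
    parse_feed("<feed><entry></feed>")  # deliberately malformed
except ET.ParseError as exc:
    error_line, error_column = exc.position  # (line, column) tuple
    message = str(exc)
```

Catching only ParseError at the outermost level, and reporting its message unchanged, keeps the information other people need to help.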
I like to use feedparser to parse Atom feeds. It does indeed give you dict-like objects. I submitted a patch to feedparser 4.1 to parse the GeoRSS elements into GeoJSON style dicts. See https://code.google.com/p/feedparser/issues/detail?id=62 and blog post at http://sgillies.net/blog/566/georss-patch-for-universal-feedparser/. You'd use it like this:
>>> import feedparser
>>> feed = feedparser.parse("http://earthquake.usgs.gov/earthquakes/catalogs/1hour-M1.xml")
>>> feed.entries[0]['where']
{'type': 'Point', 'coordinates': (-122.8282, 38.844700000000003)}
My patched version of 4.1 is in my Dropbox and you can get it using pip.
$ pip install http://dl.dropbox.com/u/10325831/feedparser-4.1-georss.tar.gz
Or just download and install with "python setup.py install".
It's more comfortable to use lxml for XML processing. Here is an example that fetches the feed and prints earthquake titles and coordinates:
import lxml.etree

feed_url = 'http://earthquake.usgs.gov/earthquakes/catalogs/1hour-M1.xml'
ns = {
    'atom': 'http://www.w3.org/2005/Atom',
    'georss': 'http://www.georss.org/georss',
}

def main():
    doc = lxml.etree.parse(feed_url)
    for entry in doc.xpath('//atom:entry', namespaces=ns):
        # Each entry is expected to have exactly one title and one point
        [title] = entry.xpath('./atom:title', namespaces=ns)
        [point] = entry.xpath('./georss:point', namespaces=ns)
        print(point.text, title.text)

if __name__ == '__main__':
    main()
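The same namespace-prefix mapping also works with the standard library, if installing lxml is not an option. The following is a sketch rather than a drop-in replacement: it parses an invented inline feed instead of the live URL, and entry_points is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Invented sample feed with one entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <entry>
    <title>M 1.2, Example Region</title>
    <georss:point>38.8447 -122.8282</georss:point>
  </entry>
</feed>"""


def entry_points(feed_xml):
    """Return (title, latitude, longitude) for each entry that has a point."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall("atom:entry", NS):
        title = entry.findtext("atom:title", namespaces=NS)
        point = entry.findtext("georss:point", namespaces=NS)
        if point is None:
            continue  # skip entries without coordinates
        lat, lon = (float(part) for part in point.split())
        results.append((title, lat, lon))
    return results


points = entry_points(SAMPLE_FEED)
```

Unlike the list-unpacking form above, this version skips entries with no georss:point instead of raising.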
You can parse the text as a string, which creates an Element, and create an ElementTree using that Element.
import xml.etree.ElementTree as ET
tree = ET.ElementTree(ET.fromstring(xmlstring))
I just came across this issue and the documentation, while complete, is not very straightforward on the difference in usage between the parse() and fromstring() methods.
Where you would use xml.etree.ElementTree.parse to read from a file, you can use xml.etree.ElementTree.fromstring to parse a string; it returns the root Element of the document directly. Often you don't actually need an ElementTree.
See xml.etree.ElementTree
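The difference in return types can be seen side by side. This is a small sketch in which io.StringIO stands in for a file on disk and the XML content is invented:

```python
import io
import xml.etree.ElementTree as ET

xmlstring = "<data><item>1</item></data>"

# fromstring() returns the root Element directly.
root = ET.fromstring(xmlstring)

# parse() takes a filename or file object and returns an ElementTree wrapper.
tree = ET.parse(io.StringIO(xmlstring))

# Wrapping an Element gives an ElementTree when one is needed, e.g. for write().
wrapped = ET.ElementTree(root)
```

Both routes lead to the same root element; only the wrapper differs.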
I am trying to parse XML from a URL with Python 3, but I always end up with:
xml.etree.ElementTree.ParseError: not well-formed (invalid token):
The code looks like this:
import requests
import urllib
from urllib.request import urlopen
import xml.etree.ElementTree as etree
response = urllib.request.urlopen("http://regnskaber.virk.dk/32673592/eGJybHN0b3JlOi8vWC1GNzY5MUY0Ny0yMDE0MDMyOV8xMzQxNThfMTc5L3hicmw.xml")
tree = etree.parse(response)
root = tree.getroot()
What am I missing?
The XML is gzip-compressed. The requests library handles this automatically, so you could use it instead of urllib.
response = requests.get(url)
tree = etree.fromstring(response.content)
http://stackoverflow.com/a/26435241 discusses solutions for doing it with urllib
As mentioned, the XML is gzip-compressed. You could switch to Requests, but you can also stay with urllib. Note that urllib does not decompress gzip-encoded responses for you, and etree.parse() expects a filename or file object rather than raw bytes, so etree.parse(response.read()) will not work. Instead, read the body with response.read(), decompress it with gzip.decompress(), and pass the result to etree.fromstring(). No .decode('utf-8') is needed, because fromstring() accepts bytes as well as str.
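A minimal sketch of the urllib route, kept offline so it can be run as is. parse_possibly_gzipped is a hypothetical helper name, and checking the two gzip magic bytes is an assumption made here for illustration; inspecting the Content-Encoding response header is another option.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def parse_possibly_gzipped(raw):
    """Parse XML bytes, decompressing first if they look gzip-compressed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # fromstring() accepts bytes, so no explicit decode step is needed.
    return ET.fromstring(raw)


# Simulate a compressed response body instead of hitting the network.
compressed_body = gzip.compress(b"<report><value>42</value></report>")
root = parse_possibly_gzipped(compressed_body)
plain_root = parse_possibly_gzipped(b"<report/>")
```

With a real response, the call would look like parse_possibly_gzipped(urllib.request.urlopen(url).read()).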