Assuming you have a file called file.xml, containing:
<annotation>
    <folder>all_images</folder>
    <filename>0.jpg</filename>
    <path>/home/vishnu/Documents/all_images/0.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>4250</width>
        <height>5500</height>
        <depth>1</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>word</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>308</xmin>
            <ymin>45</ymin>
            <xmax>502</xmax>
            <ymax>162</ymax>
        </bndbox>
    </object>
</annotation>
Then the following Python script in the same folder gives you an idea how to use the Standard Library ElementTree API to parse the file:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
print(root.find("./folder").text)
print(root.find("./object/name").text)
print(root.find("./object/bndbox/xmin").text)
You will need to work out how to write the values to your own text files, but that should be straightforward. There are lots of resources such as this one.
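As a rough starting point for the "write the values to your own text files" step, here is a minimal sketch. The output layout (one space-separated line per object), the helper name annotation_to_lines and the file name boxes.txt are all assumptions of mine, not something the question specified; the XML is a trimmed copy of the annotation above so the snippet runs on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Trimmed copy of the annotation above, inlined so the sketch is self-contained.
ANNOTATION = """<annotation>
  <filename>0.jpg</filename>
  <object>
    <name>word</name>
    <bndbox>
      <xmin>308</xmin>
      <ymin>45</ymin>
      <xmax>502</xmax>
      <ymax>162</ymax>
    </bndbox>
  </object>
</annotation>"""


def annotation_to_lines(xml_text):
    """Return one 'filename name xmin ymin xmax ymax' line per <object>.

    Hypothetical helper: the line layout is an assumption, adapt it as needed.
    """
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename", default="")
    lines = []
    # findall() returns every <object> child, so multi-object files work too.
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = [box.findtext(tag) for tag in ("xmin", "ymin", "xmax", "ymax")]
        name = obj.findtext("name", default="")
        lines.append(" ".join([filename, name] + coords))
    return lines


lines = annotation_to_lines(ANNOTATION)

# Hypothetical output location; a temporary directory keeps the sketch tidy.
out_path = os.path.join(tempfile.mkdtemp(), "boxes.txt")
with open(out_path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(lines) + "\n")

print(lines)
```

For a real file you would swap ET.fromstring(xml_text) for ET.parse("file.xml").getroot() and pick your own output path.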
Answer from rjmurt on Stack Overflow
There is already a built-in XML library, notably ElementTree. For example:
>>> from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
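One thing to watch with find(...).text is that find() returns None when the child is missing, so the attribute access raises AttributeError. A small sketch of a more forgiving variant, using findtext() with a default; the helper name read_pages and the sample with a missing <content> element are my own assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Sample input; the second page deliberately has no <content> element.
XML_TEXT = """
<root>
  <page>
    <title>Chapter 1</title>
    <content>Welcome to Chapter 1</content>
  </page>
  <page>
    <title>Chapter 2</title>
  </page>
</root>
"""


def read_pages(xml_text):
    """Return a list of {'title': ..., 'content': ...} dicts, one per <page>."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.findall("page"):
        pages.append({
            # findtext() falls back to the default instead of failing on None.
            "title": page.findtext("title", default=""),
            "content": page.findtext("content", default=""),
        })
    return pages


pages = read_pages(XML_TEXT)
print(pages)
```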
You can also try this code to extract texts:
from bs4 import BeautifulSoup

data = """<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>"""
# html.parser is lenient, so the missing single root element is not a problem here
soup = BeautifulSoup(data, "html.parser")

# Collect the text of every <title> element
titles = [tag.get_text() for tag in soup.find_all("title")]

# Collect the text of every <content> element
contents = [tag.get_text() for tag in soup.find_all("content")]

# Pair each title with its content and print the pairs
for pair in zip(titles, contents):
    print(pair)
Output:
('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
This should work:
import xml.etree.ElementTree as ET
# root is assumed to be an existing Element
xmlstr = ET.tostring(root, encoding='utf8', method='xml')
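To make it clearer what that call hands back on Python 3, here is a small sketch with a throwaway tree (the element names are made up for the example). As far as I can tell, with encoding='utf8' the result is a bytes object that begins with an XML declaration, so it needs a decode() before it can be treated as text.

```python
import xml.etree.ElementTree as ET

# Throwaway tree, just for illustration.
root = ET.Element("root")
ET.SubElement(root, "page").text = "Hello"

# On Python 3 this returns bytes, not str.
xml_bytes = ET.tostring(root, encoding="utf8", method="xml")

# Decode with the same encoding to get a text string.
xml_text = xml_bytes.decode("utf8")

print(type(xml_bytes).__name__)
print(xml_text)
```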
How do I convert ElementTree.Element to a String?
For Python 3:
xml_str = ElementTree.tostring(xml, encoding='unicode')
For Python 2:
xml_str = ElementTree.tostring(xml, encoding='utf-8')
For compatibility with both Python 2 & 3:
xml_str = ElementTree.tostring(xml).decode()
Example usage
from xml.etree import ElementTree
xml = ElementTree.Element("Person", Name="John")
xml_str = ElementTree.tostring(xml).decode()
print(xml_str)
Output:
<Person Name="John" />
Explanation
Despite what the name implies, ElementTree.tostring() returns a bytestring by default in Python 2 & 3. This is an issue in Python 3, which uses Unicode for strings.
In Python 2 you could use the str type for both text and binary data. Unfortunately this confluence of two different concepts could lead to brittle code which sometimes worked for either kind of data, sometimes not. [...] To make the distinction between text and binary data clearer and more pronounced, [Python 3] made text and binary data distinct types that cannot blindly be mixed together.
Source: Porting Python 2 Code to Python 3
If we know which version of Python is being used, we can specify the encoding as unicode (Python 3) or utf-8 (Python 2). Otherwise, if we need compatibility with both Python 2 & 3, we can call decode() on the result to get the native string type.
For reference, I've included a comparison of .tostring() results between Python 2 and Python 3.
ElementTree.tostring(xml)
# Python 3: b'<Person Name="John" />'
# Python 2: <Person Name="John" />
ElementTree.tostring(xml, encoding='unicode')
# Python 3: <Person Name="John" />
# Python 2: LookupError: unknown encoding: unicode
ElementTree.tostring(xml, encoding='utf-8')
# Python 3: b'<Person Name="John" />'
# Python 2: <Person Name="John" />
ElementTree.tostring(xml).decode()
# Python 3: <Person Name="John" />
# Python 2: <Person Name="John" />
Thanks to Martijn Pieters for pointing out that the str datatype changed between Python 2 and 3.
Why not use str()?
In most scenarios, using str() would be the "canonical" way to convert an object to a string. Unfortunately, calling it on an Element returns the tag name and the object's location in memory as a hex string, rather than a string representation of the element's data.
from xml.etree import ElementTree
xml = ElementTree.Element("Person", Name="John")
print(str(xml)) # <Element 'Person' at 0x00497A80>
One way to achieve this is to use an XSLT transformation. Most programming languages can convert an XML document into another document (e.g. HTML) when supplied with an XSL stylesheet. In Python this needs a third-party library such as lxml, since the standard library does not include an XSLT processor.
A good tutorial on XSLT Transformation can be found here
Use of Python to achieve transformation (once an XSL is prepared) is described here
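Where pulling in an XSLT processor is not an option, a similar result can be had by building the HTML by hand with ElementTree. To be clear, this is a swapped-in technique and not XSLT. The input document, the helper name pages_to_html and the choice of <h1>/<p> tags are assumptions for the sake of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; replace with your own XML.
SOURCE = (
    "<root>"
    "<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>"
    "</root>"
)


def pages_to_html(xml_text):
    """Build a small HTML document from <page> elements (hand-written, no XSLT)."""
    root = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for page in root.findall("page"):
        # Each page becomes a heading followed by a paragraph.
        ET.SubElement(body, "h1").text = page.findtext("title", default="")
        ET.SubElement(body, "p").text = page.findtext("content", default="")
    # method="html" serializes using HTML rules; encoding="unicode" returns str.
    return ET.tostring(html, encoding="unicode", method="html")


html_text = pages_to_html(SOURCE)
print(html_text)
```

For anything beyond a simple mapping like this, a real stylesheet is likely to be easier to maintain.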
There are several things wrong with your XHTML source. First, xmlns is not a correct attribute for the xml declaration; it should be put on the root element instead. And the root element for XHTML is <html>, not <xhtml>. So the valid XHTML input in this particular case would be
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
</body></html>
That said, I'm not sure if xml.etree.ElementTree accepts that, having no experience with it.
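For what it's worth, a quick check along the following lines suggests that the standard library parser does accept the corrected document. Note that ElementTree folds the default namespace into every tag name, so lookups need a namespace mapping; the "xhtml" prefix below is just a label I picked for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The corrected XHTML input from the answer above.
DOCUMENT = (
    '<?xml version="1.0"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "<head><title></title></head>\n"
    "<body>\n"
    "</body></html>"
)

root = ET.fromstring(DOCUMENT)

# Tags are stored as "{namespace}localname", so map a prefix for searching.
namespaces = {"xhtml": XHTML_NS}
title = root.find("xhtml:head/xhtml:title", namespaces)

print(root.tag)
print(title is not None)
```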