Assuming you have a file called file.xml, containing:
<annotation>
    <folder>all_images</folder>
    <filename>0.jpg</filename>
    <path>/home/vishnu/Documents/all_images/0.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>4250</width>
        <height>5500</height>
        <depth>1</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>word</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>308</xmin>
            <ymin>45</ymin>
            <xmax>502</xmax>
            <ymax>162</ymax>
        </bndbox>
    </object>
</annotation>
Then the following Python script, saved in the same folder, shows how to use the standard library's ElementTree API to parse the file:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
print(root.find("./folder").text)
print(root.find("./object/name").text)
print(root.find("./object/bndbox/xmin").text)
You will need to work out how to write the values to your own text files, but that should be straightforward. There are lots of resources such as this one.
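As a rough sketch of that last step, the snippet below collects the bounding box of every <object> element and writes one line per box. The helper names read_boxes and write_boxes and the space-separated line format are assumptions made for illustration, not something the question specifies.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the annotation above (only the fields used here).
SAMPLE = """<annotation>
  <filename>0.jpg</filename>
  <object>
    <name>word</name>
    <bndbox>
      <xmin>308</xmin>
      <ymin>45</ymin>
      <xmax>502</xmax>
      <ymax>162</ymax>
    </bndbox>
  </object>
</annotation>"""


def read_boxes(root):
    """Return one (name, xmin, ymin, xmax, ymax) tuple per <object> element."""
    rows = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        rows.append((name, *coords))
    return rows


def write_boxes(rows, out_path):
    """Write each row as one space-separated line (an assumed output format)."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(" ".join(str(value) for value in row) + "\n")


rows = read_boxes(ET.fromstring(SAMPLE))
print(rows)  # [('word', 308, 45, 502, 162)]
```

For the file on disk, read_boxes(ET.parse("file.xml").getroot()) followed by write_boxes(rows, "boxes.txt") would do the same thing; boxes.txt is only a placeholder name.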
Answer from rjmurt on Stack Overflow
There is already a built-in XML library, notably ElementTree. For example:
>>> from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
You can also try this code to extract the text:
from bs4 import BeautifulSoup

data = """<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>"""
soup = BeautifulSoup(data, "html.parser")

# Collect the text of every <title> element
titles = [tag.get_text() for tag in soup.find_all("title")]

# Collect the text of every <content> element
contents = [tag.get_text() for tag in soup.find_all("content")]

# Pair each title with its content and print the pairs
for pair in zip(titles, contents):
    print(pair)
Output:
('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
One way to achieve this is with XSLT. Most programming languages can convert an XML document into another document (e.g. HTML) when supplied with an XSL stylesheet; in Python this requires a third-party library such as lxml, since the standard library has no XSLT processor.
A good tutorial on XSLT Transformation can be found here
Use of Python to achieve transformation (once an XSL is prepared) is described here
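To give a feel for what such a transformation looks like, here is a minimal sketch using the third-party lxml package (pip install lxml). The sample XML, the stylesheet, and the helper name to_html are all made up for illustration; a real stylesheet would be written against your own document structure.

```python
from lxml import etree  # third-party package, not part of the standard library

# Illustrative input document (an assumption, not taken from the question).
XML = """<root>
  <page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
  <page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>
</root>"""

# Illustrative XSLT 1.0 stylesheet: one <h1> and one <p> per <page>.
XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/root">
    <html><body>
      <xsl:for-each select="page">
        <h1><xsl:value-of select="title"/></h1>
        <p><xsl:value-of select="content"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>"""


def to_html(xml_text, xsl_text):
    """Apply the stylesheet to the document and return the result as a string."""
    transform = etree.XSLT(etree.fromstring(xsl_text))
    result = transform(etree.fromstring(xml_text))
    return str(result)


html = to_html(XML, XSL)
print(html)
```

The stylesheet does all the work here; the Python side only loads both documents and applies one to the other.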
There are several things wrong with your XHTML source. First, xmlns is not a valid attribute for the XML declaration; it belongs on the root element instead. Second, the root element for XHTML is <html>, not <xhtml>. So the valid XHTML input in this particular case would be:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
</body></html>
That said, I'm not sure if xml.etree.ElementTree accepts that, having no experience with it.
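A quick way to check is to feed the corrected document to the standard library parser. In the sketch below, the prefix xhtml in the NS mapping is an arbitrary name chosen for illustration; what matters is the namespace URI. The thing to watch for is that every tag comes back namespace-qualified, so an unqualified lookup such as find("body") finds nothing.

```python
import xml.etree.ElementTree as ET

XHTML = """<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
</body></html>"""

# Prefix-to-URI mapping used only for lookups; the prefix name is arbitrary.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# The default namespace becomes part of every tag name.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# A namespace-aware lookup finds the element ...
body = root.find("xhtml:body", NS)
print(body is not None)  # True

# ... while an unqualified lookup does not.
plain_body = root.find("body")
print(plain_body)  # None
```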