Something like this?
>>> from xml.sax.saxutils import escape
>>> escape("< & >")
'< & >'
Answer from mbarkhau on Stack OverflowSomething like this?
>>> from xml.sax.saxutils import escape
>>> escape("< & >")
'< & >'
xml.sax.saxutils does not escape quotation characters (")
So here is another one:
def escape( str_xml: str ):
str_xml = str_xml.replace("&", "&")
str_xml = str_xml.replace("<", "<")
str_xml = str_xml.replace(">", ">")
str_xml = str_xml.replace("\"", """)
str_xml = str_xml.replace("'", "'")
return str_xml
if you look it up then xml.sax.saxutils only does string replace
One thing that I tried, that worked for me is to open the xml file as a file object , then use ElementTree.fromstring() passing in the complete contents of the file.
Example -
>>> import xml.etree.ElementTree as ET
>>> ef = ET.parse('a.xml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
tree.parse(source, parser)
File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
ValueError: multi-byte encodings are not supported
>>> with open('a.xml','r') as f:
... ef = ET.fromstring(f.read())
...
>>> ef
<Element 'productMeta' at 0x028DF180>
You can also, create an XMLParser with the required encoding, and this should enable you to be able to parse strings from that encoding, Example -
import xml.etree.ElementTree as ET
xmlp = ET.XMLParser(encoding="utf-8")
f = ET.parse('a.xml',parser=xmlp)
ET.parse('a.xml', parser=ET.XMLParser(encoding='iso-8859-5'))
solved my problem when dealed with xml excel in python
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
up to Python 3.4:
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('© 2010') # u'\xa9 2010'
h.unescape('© 2010') # u'\xa9 2010'
Python 3.4+:
import html
html.unescape('© 2010') # u'\xa9 2010'
html.unescape('© 2010') # u'\xa9 2010'
Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:
import re, htmlentitydefs
##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
def unescape(text):
def fixup(m):
text = m.group(0)
if text[:2] == "&#":
# character reference
try:
if text[:3] == "&#x":
return unichr(int(text[3:-1], 16))
else:
return unichr(int(text[2:-1]))
except ValueError:
pass
else:
# named entity
try:
text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
except KeyError:
pass
return text # leave as is
return re.sub("&#?\w+;", fixup, text)