xml - Java use replaceAll with Escape Characters String - Stack Overflow
Best way to encode text data for XML in Java? - Stack Overflow
Escaping special character when generating an XML in Java - Stack Overflow
removing invalid XML characters from a string in java - Stack Overflow
Don't use replaceAll(), which does a regex search.
Instead use replace(), which uses plain-text search.
getXML = getXML.replace(headerXMLString, "");
Note that despite the unfortunate name difference, replace() still replaces all occurrences found.
A better approach would be to use regex to match the XML header no matter what it contains:
getXML = getXML.replaceAll("^<\\?xml.*?\\?>", "");
This would also do nothing if there was no header.
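To make the difference concrete, here is a minimal sketch of both approaches side by side. The class and method names (XmlHeaderStrip, stripLiteral, stripAnyHeader) are made up for illustration and are not from the question.

```java
import java.util.regex.Pattern;

class XmlHeaderStrip {
    // Literal removal: replace() treats the header as plain text.
    static String stripLiteral(String xml, String header) {
        return xml.replace(header, "");
    }

    // Regex removal: matches any XML declaration at the very start of the input.
    // The '?' characters are escaped so they are matched literally.
    static String stripAnyHeader(String xml) {
        return xml.replaceAll("^<\\?xml.*?\\?>", "");
    }

    public static void main(String[] args) {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
        String xml = header + "<root/>";

        // Used as a regex, the unescaped '?' in the header is a quantifier,
        // so the pattern does not match the header text and nothing is removed.
        System.out.println(xml.replaceAll(header, "").equals(xml));

        System.out.println(stripLiteral(xml, header));
        System.out.println(stripAnyHeader(xml));

        // If a regex method must be used with literal text, quote it first.
        System.out.println(xml.replaceAll(Pattern.quote(header), ""));
    }
}
```

Pattern.quote() is the escape hatch when the text has to go through a regex API anyway; for a one-off removal, replace() is simpler.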
You can use replace() instead of replaceAll(). The following works for me:
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
String s2 = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
s2 = s2.replace(s, "");
System.out.println(s2);
Output:
<blank>
EDIT:
How about the following?
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
Scanner sc = new Scanner(new File("D:\\temp.txt"));
String s2 = sc.nextLine();
System.out.println("b4 "+s2);
s2 = s2.replace(s, ""); // replace() matches the header literally; replaceAll() would treat '?' and '.' as regex syntax
System.out.println("aftr "+s2);
File Content :
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.
Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.
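As a rough illustration of the "use an XML library" advice, the sketch below uses the StAX writer that ships with the JDK (javax.xml.stream). The class name XmlWriterDemo, the method writeProperty, and the element and attribute names are assumptions made for this example only.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class XmlWriterDemo {
    // Writes <property name="...">value</property> and lets the writer do the escaping.
    static String writeProperty(String name, String value) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("property");
            writer.writeAttribute("name", name);   // attribute value is escaped by the writer
            writer.writeCharacters(value);         // text content is escaped by the writer
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to handle a checked exception in this sketch.
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw '&' and '<' below come out as &amp; and &lt; in the printed XML.
        System.out.println(writeProperty("a & b", "1 < 2"));
    }
}
```

No hand-written escaping is involved here, which is the point: the caller passes raw text and the library decides what needs escaping in each context.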
You can use the Apache Commons Text library to escape a string.
org.apache.commons.text.StringEscapeUtils
// For XML 1.0
String escapedXml10 = StringEscapeUtils.escapeXml10("the data might contain & or ! or % or ' or # etc");
// For XML 1.1
String escapedXml11 = StringEscapeUtils.escapeXml11("the data might contain & or ! or % or ' or # etc");
But what you are looking for is a way to convert any string into a valid XML tag name. Restricted to ASCII characters, an XML tag name must begin with one of _:a-zA-Z, followed by any number of characters from _:a-zA-Z0-9.-
I believe there is no library that does this for you, so you have to implement your own function that converts any string to match this pattern, or alternatively make the string the value of an attribute:
<property name="no more need to be encoded, it should be handled by XML library">0.0</property>
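If you do go the "own function" route, a minimal sketch could look like the one below. It only covers the ASCII subset of the name rules described above, and the names XmlNames and toAsciiXmlName are hypothetical. The mapping is lossy (different inputs can produce the same name), which is one more reason the attribute approach is usually the safer choice.

```java
class XmlNames {
    // Hypothetical helper: maps an arbitrary string to an ASCII-only XML name.
    static String toAsciiXmlName(String raw) {
        // Replace every character outside _ : A-Z a-z 0-9 . - with an underscore.
        String name = raw.replaceAll("[^_:A-Za-z0-9.-]", "_");
        // A name must not be empty and must start with one of _ : A-Z a-z.
        if (name.isEmpty() || !name.substring(0, 1).matches("[_:A-Za-z]")) {
            name = "_" + name;
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(toAsciiXmlName("no more need")); // spaces become underscores
        System.out.println(toAsciiXmlName("0.0"));          // gets a leading underscore
    }
}
```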
public class RssParser {
int length;
URL url;
URLConnection urlConn;
NodeList nodeList;
Document doc;
Node node;
Element firstEle;
NodeList titleList;
Element ele;
NodeList txtEleList;
String retVal, urlStrToParse, rootNodeName;
public RssParser(String urlStrToParse, String rootNodeName){
this.urlStrToParse = urlStrToParse;
this.rootNodeName = rootNodeName;
url=null;
urlConn=null;
nodeList=null;
doc=null;
node=null;
firstEle=null;
titleList=null;
ele=null;
txtEleList=null;
retVal=null;
doc = null;
try {
url = new URL(this.urlStrToParse);
// this is the URL that we will parse
urlConn = url.openConnection();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
String s = isToString(urlConn.getInputStream());
s = s.replace("&", "&amp;");
StringBuilder sb =
new StringBuilder
("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
sb.append("\n"+s);
System.out.println("STR: \n"+sb.toString());
s = sb.toString();
doc = db.parse(urlConn.getInputStream());
nodeList = doc.getElementsByTagName(this.rootNodeName);
// this is the first node, which
// contains the other inner element nodes
length =nodeList.getLength();
firstEle=doc.getDocumentElement();
}
catch (ParserConfigurationException pce) {
System.out.println("Could not Parse XML: " + pce.getMessage());
}
catch (SAXException se) {
System.out.println("Could not Parse XML: " + se.getMessage());
}
catch (IOException ioe) {
System.out.println("Invalid XML: " + ioe.getMessage());
}
catch(Exception e){
System.out.println("Error: "+e.toString());
}
}
public String isToString(InputStream in) throws IOException {
StringBuffer out = new StringBuffer();
byte[] b = new byte[512];
for (int i; (i = in.read(b)) != -1;) {
out.append(new String(b, 0, i));
}
return out.toString();
}
public String getVal(int i, String param){
node =nodeList.item(i);
if(node.getNodeType() == Node.ELEMENT_NODE)
{
System.out.println("Param: "+param);
titleList = firstEle.getElementsByTagName(param);
if(firstEle.hasAttribute("id"))
System.out.println("hasAttrib----------------");
else System.out.println("Has NOTNOT NOT");
System.out.println("titleList: "+titleList.toString());
ele = (Element)titleList.item(i);
System.out.println("ele: "+ele);
txtEleList = ele.getChildNodes();
retVal=(((Node)txtEleList.item(0)).getNodeValue()).toString();
if (retVal == null)
return null;
System.out.println("retVal: "+retVal);
}
return retVal;
}
}
Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars, or, even easier, use \x{...} to specify any valid code point (written as \\x{...} inside a Java string literal).
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\\x{10000}-\\x{10FFFF}"
+ "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
+ "\u0001-\uD7FF"
+ "\uE000-\uFFFD"
+ "\\x{10000}-\\x{10FFFF}"
+ "]+";
You will need to use String.replaceAll(...) and not String.replace(...).
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
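Put together as a runnable sketch, the XML 1.0 filter might look like this. The class and method names (XmlCharFilter, stripInvalidXml10) are invented for the example, and every code point is written with the \x{...} form rather than mixing it with \u escapes, which is a stylistic choice and not a requirement.

```java
import java.util.regex.Pattern;

class XmlCharFilter {
    // Anything NOT in the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}"
            + "\\x{20}-\\x{D7FF}"
            + "\\x{E000}-\\x{FFFD}"
            + "\\x{10000}-\\x{10FFFF}]");

    // Removes every character that an XML 1.0 document is not allowed to contain.
    static String stripInvalidXml10(String text) {
        return INVALID_XML10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String illegal = "Hello, World!\0";
        // The trailing NUL character is removed; the rest is left untouched.
        System.out.println(stripInvalidXml10(illegal));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll(...) would otherwise do.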
All these answers so far only replace the characters themselves. But sometimes an XML document will contain invalid XML entity sequences, which also result in errors. For example, if you have &#2; in your XML, a Java XML parser will throw "Illegal character entity: expansion character (code 0x2 at ...)".
Here is a simple java program that can replace those invalid entity sequences.
public final Pattern XML_ENTITY_PATTERN = Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");
/**
* Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
*/
String getCleanedXml(String xmlString) {
Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
Set<String> replaceSet = new HashSet<>();
while (m.find()) {
String group = m.group(1);
int val;
if (group != null) {
val = Integer.parseInt(group, 16);
if (isInvalidXmlChar(val)) {
replaceSet.add("&#x" + group + ";");
}
} else if ((group = m.group(2)) != null) {
val = Integer.parseInt(group);
if (isInvalidXmlChar(val)) {
replaceSet.add("&#" + group + ";");
}
}
}
String cleanedXmlString = xmlString;
for (String replacer : replaceSet) {
cleanedXmlString = cleanedXmlString.replace(replacer, ""); // plain-text replacement; no regex needed here
}
return cleanedXmlString;
}
private boolean isInvalidXmlChar(int val) {
if (val == 0x9 || val == 0xA || val == 0xD ||
val >= 0x20 && val <= 0xD7FF ||
val >= 0xE000 && val <= 0xFFFD ||
val >= 0x10000 && val <= 0x10FFFF) {
return false;
}
return true;
}
To match, you can use this regular expression:
(?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>)
(?:<) : Match but don't capture <.
(?<=<) : Positive lookbehind for <.
(\/?\w*) : Capture the tag name: an optional / followed by word characters.
(?=.*(?<=<\/html)) : Positive lookahead, then positive lookbehind for the closing </html tag.
(?:>) : Match but don't capture >.
To replace you can use:
&lt;$1&gt;
Where $1 is the result of the capture group in the regular expression.
You can test the regular expression interactively here.
Using the following Java code:
public static void main(String []args){
String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
String newxml = replaceChars(xml);
System.out.println(newxml);
}
private static String replaceChars(String xml)
{
xml = xml.replaceAll("(?:<)(?<=<)(\\/?\\w*)(?=.*(?<=<\\/html))(?:>)", "&lt;$1&gt;");
return xml;
}
The output is:
"<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;/html&gt;" Multi="false"> <Meta source="system" dest="any"></Meta></Message>"
Please do not use regular expressions to escape special characters in XML.
Can you guarantee that this will work for all possible html input with all of HTML and XML quirks (very extensive specs!!!) ?
Just use one of many utilities out there to escape XML strings.
Apache Commons is quite popular - please see this example