Is there a native .NET method for escaping special characters in text? Sure, there are only five special characters, and five Replace() calls would probably do the trick, but I'm sure there's got to be something built-in.
Example: converting "&" to "&amp;"
Much to my relief, I've discovered a native method, hidden away in the bowels of the SecurityElement class. Yes, that's right: SecurityElement.Escape(string s) will escape your string and make it XML-safe.
This is important: if we are copying or writing data to InfoPath text fields, the data needs to be escaped first, so that special characters like "&" become their entity forms (here "&amp;").
Invalid XML character → Replaced with
"<"  → "&lt;"
">"  → "&gt;"
"\"" → "&quot;"
"'"  → "&apos;"
"&"  → "&amp;"
The namespace is System.Security. Refer to: http://msdn2.microsoft.com/en-us/library/system.security.securityelement.escape(VS.80).aspx
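For illustration, a minimal sketch of calling it (the expected output is shown in the comment):

using System;
using System.Security;

class EscapeDemo
{
    static void Main()
    {
        // SecurityElement.Escape replaces the five special characters
        // with their predefined XML entities (and returns null for null input).
        string safe = SecurityElement.Escape("Fish & Chips <for> \"two\"");
        Console.WriteLine(safe);
        // Fish &amp; Chips &lt;for&gt; &quot;two&quot;
    }
}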
The other option is to write the escaping code yourself as extension methods:
public static string EscapeXml( this string s )
{
    string toxml = s;
    if ( !string.IsNullOrEmpty( toxml ) )
    {
        // replace literal values with entities;
        // "&" must be replaced first, or the entities
        // added below would themselves be re-escaped
        toxml = toxml.Replace( "&", "&amp;" );
        toxml = toxml.Replace( "'", "&apos;" );
        toxml = toxml.Replace( "\"", "&quot;" );
        toxml = toxml.Replace( ">", "&gt;" );
        toxml = toxml.Replace( "<", "&lt;" );
    }
    return toxml;
}
public static string UnescapeXml( this string s )
{
    string unxml = s;
    if ( !string.IsNullOrEmpty( unxml ) )
    {
        // replace entities with literal values;
        // "&amp;" must be replaced last, mirroring EscapeXml
        unxml = unxml.Replace( "&apos;", "'" );
        unxml = unxml.Replace( "&quot;", "\"" );
        unxml = unxml.Replace( "&gt;", ">" );
        unxml = unxml.Replace( "&lt;", "<" );
        unxml = unxml.Replace( "&amp;", "&" );
    }
    return unxml;
}
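A quick usage sketch (the methods must live in a static class to be usable as extension methods):

string escaped = "Fish & Chips".EscapeXml();   // "Fish &amp; Chips"
string roundTrip = escaped.UnescapeXml();      // "Fish & Chips"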
You can use HttpUtility.HtmlDecode, or with .NET 4.0+ you can also use WebUtility.HtmlDecode.
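For example (HttpUtility lives in System.Web, WebUtility in System.Net; note that both decode the full set of HTML entities, which is a superset of the five XML ones):

using System;
using System.Net;

class DecodeDemo
{
    static void Main()
    {
        // Turns entity references back into literal characters.
        string decoded = WebUtility.HtmlDecode("Fish &amp; Chips &lt;for&gt; &quot;two&quot;");
        Console.WriteLine(decoded); // Fish & Chips <for> "two"
    }
}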
Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.
Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.
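A minimal sketch of that XmlWriter behaviour (XmlWriterSettings.CheckCharacters defaults to true; setting it to false disables the validation):

using System;
using System.Text;
using System.Xml;

class WriterDemo
{
    static void Main()
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { CheckCharacters = true }; // the default
        try
        {
            using (var writer = XmlWriter.Create(output, settings))
            {
                writer.WriteStartElement("root");
                // U+0000 is never a legal XML character, so this throws.
                writer.WriteString("bad \u0000 char");
                writer.WriteEndElement();
            }
        }
        catch (ArgumentException e)
        {
            Console.WriteLine("Rejected: " + e.Message);
        }
    }
}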
Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
/// <summary>
/// Encodes data so that it can be safely embedded as text in XML documents.
/// </summary>
public class XmlTextEncoder : TextReader {
public static string Encode(string s) {
using (var stream = new StringReader(s))
using (var encoder = new XmlTextEncoder(stream)) {
return encoder.ReadToEnd();
}
}
/// <param name="source">The data to be encoded in UTF-16 format.</param>
/// <param name="filterIllegalChars">It is illegal to encode certain
/// characters in XML. If true, silently omit these characters from the
/// output; if false, throw an error when encountered.</param>
public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
_source = source;
_filterIllegalChars = filterIllegalChars;
}
readonly Queue<char> _buf = new Queue<char>();
readonly bool _filterIllegalChars;
readonly TextReader _source;
public override int Peek() {
PopulateBuffer();
if (_buf.Count == 0) return -1;
return _buf.Peek();
}
public override int Read() {
PopulateBuffer();
if (_buf.Count == 0) return -1;
return _buf.Dequeue();
}
void PopulateBuffer() {
const int endSentinel = -1;
while (_buf.Count == 0 && _source.Peek() != endSentinel) {
// Strings in .NET are assumed to be UTF-16 encoded [1].
var c = (char) _source.Read();
if (Entities.ContainsKey(c)) {
// Encode all entities defined in the XML spec [2].
foreach (var i in Entities[c]) _buf.Enqueue(i);
} else if (!(0x0 <= c && c <= 0x8) &&
!new[] { 0xB, 0xC }.Contains(c) &&
!(0xE <= c && c <= 0x1F) &&
!(0x7F <= c && c <= 0x84) &&
!(0x86 <= c && c <= 0x9F) &&
!(0xD800 <= c && c <= 0xDFFF) &&
!new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
// Allow if the Unicode codepoint is legal in XML [3].
_buf.Enqueue(c);
} else if (char.IsHighSurrogate(c) &&
_source.Peek() != endSentinel &&
char.IsLowSurrogate((char) _source.Peek())) {
// Allow well-formed surrogate pairs [1].
_buf.Enqueue(c);
_buf.Enqueue((char) _source.Read());
} else if (!_filterIllegalChars) {
// Note that we cannot encode illegal characters as entity
// references due to the "Legal Character" constraint of
// XML [4]. Nor are they allowed in CDATA sections [5].
throw new ArgumentException(
String.Format("Illegal character: '{0:X}'", (int) c));
}
}
}
static readonly Dictionary<char,string> Entities =
new Dictionary<char,string> {
{ '"', """ }, { '&', "&"}, { '\'', "'" },
{ '<', "<" }, { '>', ">" },
};
// References:
// [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
// [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
// [3] http://www.w3.org/TR/xml11/#charsets
// [4] http://www.w3.org/TR/xml11/#sec-references
// [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
}
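A quick usage sketch of the class above:

class XmlTextEncoderDemo {
    static void Main() {
        // Special characters become entity references; the illegal
        // control character U+0001 is silently dropped because
        // filterIllegalChars defaults to true.
        Console.WriteLine(XmlTextEncoder.Encode("a < b & c\u0001"));
        // prints: a &lt; b &amp; c
    }
}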
Unit tests and full code can be found here.
SecurityElement.Escape, documented here.
A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, this is necessary to represent the Unicode characters that make up an XML document.
UTF-8 is chosen as the default, because it has several advantages:
- it is compatible with ASCII in that all valid ASCII encoded text is also valid UTF-8 encoded (but not necessarily the other way around!)
- it uses only 1 byte per character for "common" letters (those that also exist in ASCII)
- it can represent all existing Unicode characters
Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.
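To make the size advantage concrete, here is a small sketch using System.Text.Encoding (byte counts in the comments):

using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));  // 1 byte: ASCII letter
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));  // 2 bytes: accented Latin letter
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));  // 3 bytes: euro sign
        Console.WriteLine(Encoding.UTF8.GetByteCount("𝄞")); // 4 bytes: musical symbol outside the BMP
    }
}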
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is an article that gives a good overview of the topic.
When computers were first created, they mostly only worked with characters found in the English language, leading to the 7-bit US-ASCII standard.
However, there are a lot of different written languages in the world, and ways had to be found to use them on computers.
The first way works fine if you restrict yourself to a certain language: use a culture-specific encoding, such as ISO-8859-1, which can represent Latin European characters in 8 bits, or GB2312 for Chinese characters.
The second way is a bit more complicated, but theoretically allows every character in the world to be represented: the Unicode standard, in which every character of every language has a specific code. However, given the high number of existing characters (109,000 in Unicode 5), Unicode characters are normally represented using a three-byte representation (one byte for the Unicode plane, and two bytes for the character code).
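As an illustration of the difference between the two approaches, a small sketch in C# (ISO-8859-1 is available via Encoding.GetEncoding; characters outside its repertoire are lost, replaced by default with '?'):

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        var latin1 = Encoding.GetEncoding("ISO-8859-1");
        var utf8 = Encoding.UTF8;

        // "é" exists in ISO-8859-1: one byte (0xE9); UTF-8 needs two.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes("é"))); // E9
        Console.WriteLine(BitConverter.ToString(utf8.GetBytes("é")));   // C3-A9

        // A Chinese character is outside ISO-8859-1's repertoire, so it
        // is replaced with '?' (0x3F); UTF-8 represents it in three bytes.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes("中"))); // 3F
        Console.WriteLine(BitConverter.ToString(utf8.GetBytes("中")));   // E4-B8-AD
    }
}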
In order to maximize compatibility with existing code (some of which still uses ASCII text), the UTF-8 encoding was devised as a way to store Unicode characters using only the minimal amount of space, as described in Joachim Sauer's answer.
So, it's common to see files encoded with specific charsets such as ISO-8859-1 if the file is meant to be edited or read only by software (and people) understanding only those languages, and UTF-8 when there's a need to be highly interoperable and culture-independent. The current tendency is for UTF-8 to replace other charsets, even though it requires work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.