🌐
GitHub
github.com › melody26613 › xml-encode-by-C
GitHub - melody26613/xml-encode-by-C: Encode XML special characters by using simple C code · GitHub
./sample "&'\"<>"
result
input str: &'"<>
encode: &amp;&apos;&quot;&lt;&gt;
decode: &'"<>

./sample "&&&''\"<>>><"
result
input str: &&&''"<>>><
encode: &amp;&amp;&amp;&apos;&apos;&quot;&lt;&gt;&gt;&gt;&lt;
decode: &&&''"<>>><

./sample "&amp;&amp;&\"<&apos;>>><"
result:
input str: &amp;&amp;&"<&apos;>>><
encode: &amp;amp;&amp;amp;&amp;&quot;&lt;&amp;apos;&gt;&gt;&gt;&lt;
decode: &amp;&amp;&"<&apos;>>><

Please refer to the code under test/ and test/README.md.
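The same round trip can be sketched in Python for comparison. This is not the repository's C code: it assumes the standard-library helpers `xml.sax.saxutils.escape` and `unescape`, and the wrapper names `xml_encode` and `xml_decode` are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# saxutils handles &, < and > by default; the two quote characters
# are supplied as extra mappings.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}


def xml_encode(text: str) -> str:
    """Replace the five XML special characters with predefined entities."""
    return escape(text, QUOTES)


def xml_decode(text: str) -> str:
    """Reverse xml_encode, turning the five entities back into characters."""
    return unescape(text, UNQUOTES)


if __name__ == "__main__":
    for sample in ["&'\"<>", "&amp;&amp;&\"<&apos;>>><"]:
        encoded = xml_encode(sample)
        print("input str:", sample)
        print("encode:", encoded)
        print("decode:", xml_decode(encoded))
```

Input that already contains entities is encoded a second time (`&amp;` becomes `&amp;amp;`), which is what lets decoding restore the original string exactly.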
Author   melody26613
🌐
Obj-sys
obj-sys.com › docs › xbv30 › CCppUsersGuide › ch12s01.html
XML C Encode Functions
The XML C low-level encode functions handle the XML encoding of simple XML schema data types. Calls to these functions are assembled in the C source code generated by the XBinder compiler to accomplish the encoding of complex structures.
🌐
GitHub
github.com › tmpest127 › xml_encode
GitHub - tmpest127/xml_encode: Encode shellcode as XML-looking data. Single-header C library with a two-stage PIC loader example.
Encodes arbitrary binary data as XML-looking data. Includes a single-header C library (xml_encoder.h) and a CLI tool.
Starred by 12 users
Forked by 4 users
Languages   C 63.3% | Go 20.2% | Python 8.8% | Just 6.9% | Assembly 0.8%
🌐
TutorialsPoint
tutorialspoint.com › xml › xml_encoding.htm
XML - Encoding
Hence, we need to specify the type of encoding in the XML declaration. ... UTF stands for UCS Transformation Format, and UCS itself means Universal Character Set. The number 8 or 16 refers to the size in bits of the code unit used to represent a character: UTF-8 uses 8-bit units (1 to 4 bytes per character) and UTF-16 uses 16-bit units (2 or 4 bytes per character).
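As a quick illustration of those byte counts, the following Python sketch encodes a few sample characters and reports their sizes. The helper name `encoded_sizes` and the choice of sample characters are assumptions made for this example.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8 = ch.encode("utf-8")
    # "utf-16-be" is used so that no byte-order mark is added to the count.
    utf16 = ch.encode("utf-16-be")
    return len(utf8), len(utf16)


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        u8, u16 = encoded_sizes(ch)
        print(f"U+{ord(ch):04X}: UTF-8 {u8} byte(s), UTF-16 {u16} byte(s)")
```

Characters outside the Basic Multilingual Plane, such as U+1F600, take four bytes in both encodings.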
🌐
Online-domain-tools
xml-encoding.online-domain-tools.com
XML Encoding – Easily encode or decode strings or files online
If the data you want to encode or decode are in the form of a short string we recommend using the text string input. On the other hand for larger input data we recommend you to use a file as an input. On the output you are given the result in the form of a text or a hex dump, depending on the contents of the output, as well as in the form of a file that you can download.
🌐
EDUCBA
educba.com › home › software development › software development tutorials › xml tutorial › xml encoding
XML Encoding | Types of Encoding in XML with Examples
July 12, 2021 - XML encoding is the process of converting Unicode characters into a binary format. When an XML processor reads a document, it must interpret the bytes according to the declared encoding; the character encodings are specified ...
🌐
Liquid Technologies
liquid-technologies.com › Reference › Glossary › XML_Encoding.html
XML Encoding
There are many multibyte encoding schemes, the most common one being UTF-8. The UTF-8 Wikipedia entry describes the nuts and bolts of performing the encoding in all its gory detail. When the XML parser reads an XML file or stream, it's reading binary data: bytes, not characters.
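A small Python sketch of the "bytes, not characters" point: the same text is serialized with two different declared encodings, yielding different byte sequences that parse back to the same characters. It assumes the standard-library `xml.etree.ElementTree` parser; the element name `word` and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET


def make_doc(text: str, xml_name: str, codec: str) -> bytes:
    """Build a tiny XML document as bytes, with a matching encoding declaration."""
    doc = f'<?xml version="1.0" encoding="{xml_name}"?><word>{text}</word>'
    return doc.encode(codec)


def parse_word(doc: bytes) -> str:
    """Parse the bytes; the parser decodes them using the declared encoding."""
    return ET.fromstring(doc).text


latin1_doc = make_doc("caf\u00e9", "ISO-8859-1", "iso-8859-1")
utf8_doc = make_doc("caf\u00e9", "UTF-8", "utf-8")

if __name__ == "__main__":
    print(latin1_doc)
    print(utf8_doc)
    print(parse_word(latin1_doc) == parse_word(utf8_doc))
```

If the declaration and the actual bytes disagree, the parser either fails or produces the wrong characters, which is why the declaration matters.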
🌐
Zoho
zoho.com › deluge › help › string › xml-encode.html
XML Encode | Help - Zoho Deluge
The size of the supplied input text and the returned output text can individually be up to 300KB. This function supports all XML tags until the latest version. ... The following script encodes the input XML text such that all the special characters will be transformed into their equivalent ...
🌐
Wikipedia
en.wikipedia.org › wiki › List_of_XML_and_HTML_character_entity_references
List of XML and HTML character entity references - Wikipedia
3 weeks ago - Most entities are predefined in XML and HTML to reference just one character in the UCS, but there are no predefined entities for isolated combining characters, variation selectors, or characters for private use assignments; however the list includes some predefined entities for character sequences of two characters containing some of them. Since HTML 5.0 (and MathML 3.0 which shares the same set of entities), all entities are encoded in Unicode normalization forms C and KC (this was not the case with older versions of HTML and MathML, so older entities that were initially defined with characters for private use assignments, CJK compatibility forms, or in non-NFC forms were modified).
🌐
IBM
ibm.com › docs › en › cobol-zos › 6.3.0
The encoding of XML documents
XML documents must be encoded in a supported code page.
Top answer
1 of 9
26

You can use a native .NET method for escaping special characters in text. Sure, there are only five special characters, and five Replace() calls would probably do the trick, but I was sure there had to be something built in.

Example of converting "&" to "&amp;"

To my relief, I discovered a native method, hidden away in the bowels of the SecurityElement class. Yes, that's right: SecurityElement.Escape(string s) will escape your string and make it XML safe.

This is important because data copied or written to InfoPath text fields must first be escaped, so that a special character such as "&" becomes the entity "&amp;".

Invalid XML characters and their replacements:

"<" to "&lt;"

">" to "&gt;"

"\"" to "&quot;"

"'" to "&apos;"

"&" to "&amp;"

The namespace is System.Security. See: http://msdn2.microsoft.com/en-us/library/system.security.securityelement.escape(VS.80).aspx

The other option is to write custom extension methods:

// Extension methods must be declared in a static class;
// the class name here is illustrative.
public static class XmlStringExtensions
{
  public static string EscapeXml( this string s )
  {
    string toxml = s;
    if ( !string.IsNullOrEmpty( toxml ) )
    {
      // replace literal values with entities;
      // "&" goes first so the entities added below are not escaped again
      toxml = toxml.Replace( "&", "&amp;" );
      toxml = toxml.Replace( "'", "&apos;" );
      toxml = toxml.Replace( "\"", "&quot;" );
      toxml = toxml.Replace( ">", "&gt;" );
      toxml = toxml.Replace( "<", "&lt;" );
    }
    return toxml;
  }

  public static string UnescapeXml( this string s )
  {
    string unxml = s;
    if ( !string.IsNullOrEmpty( unxml ) )
    {
      // replace entities with literal values;
      // "&amp;" goes last so that "&amp;lt;" becomes "&lt;", not "<"
      unxml = unxml.Replace( "&apos;", "'" );
      unxml = unxml.Replace( "&quot;", "\"" );
      unxml = unxml.Replace( "&gt;", ">" );
      unxml = unxml.Replace( "&lt;", "<" );
      unxml = unxml.Replace( "&amp;", "&" );
    }
    return unxml;
  }
}
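The order of the replacements in the code above matters. The Python sketch below shows why the ampersand has to be handled last when unescaping; both function names are hypothetical, and the second one is deliberately wrong to show the failure.

```python
ENTITIES = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&apos;", "'")]


def unescape_amp_last(s: str) -> str:
    """Correct order: the four other entities first, '&amp;' last."""
    for entity, char in ENTITIES:
        s = s.replace(entity, char)
    return s.replace("&amp;", "&")


def unescape_amp_first(s: str) -> str:
    """Deliberately wrong order: '&amp;' first, which unescapes twice."""
    s = s.replace("&amp;", "&")
    for entity, char in ENTITIES:
        s = s.replace(entity, char)
    return s


if __name__ == "__main__":
    # "&amp;lt;" is the escaped form of the literal text "&lt;".
    escaped = "&amp;lt;"
    print(unescape_amp_last(escaped))
    print(unescape_amp_first(escaped))
```

The mirror rule applies when escaping: the ampersand must be replaced first, otherwise the ampersands of freshly inserted entities get escaped again.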
2 of 9
20

You can use HttpUtility.HtmlDecode or, with .NET 4.0+, WebUtility.HtmlDecode.
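For readers outside .NET, a rough analog is sketched below in Python. It assumes the standard-library `html.unescape` function, which decodes HTML named and numeric character references; it is not a port of the .NET methods, and the wrapper name `decode_entities` is made up.

```python
import html


def decode_entities(text: str) -> str:
    """Decode HTML/XML character references using the standard library."""
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("&lt;a&gt; &amp; &quot;b&quot;"))
    print(decode_entities("caf&#233;"))
```

Because an HTML decoder recognizes many more named entities than the five predefined in XML, it can decode references that a strict XML parser would reject as undefined.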

Top answer
1 of 13
80

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.

Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

/// <summary>
/// Encodes data so that it can be safely embedded as text in XML documents.
/// </summary>
public class XmlTextEncoder : TextReader {
    public static string Encode(string s) {
        using (var stream = new StringReader(s))
        using (var encoder = new XmlTextEncoder(stream)) {
            return encoder.ReadToEnd();
        }
    }

    /// <param name="source">The data to be encoded in UTF-16 format.</param>
    /// <param name="filterIllegalChars">It is illegal to encode certain
    /// characters in XML. If true, silently omit these characters from the
    /// output; if false, throw an error when encountered.</param>
    public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
        _source = source;
        _filterIllegalChars = filterIllegalChars;
    }

    readonly Queue<char> _buf = new Queue<char>();
    readonly bool _filterIllegalChars;
    readonly TextReader _source;

    public override int Peek() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Peek();
    }

    public override int Read() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Dequeue();
    }

    void PopulateBuffer() {
        const int endSentinel = -1;
        while (_buf.Count == 0 && _source.Peek() != endSentinel) {
            // Strings in .NET are assumed to be UTF-16 encoded [1].
            var c = (char) _source.Read();
            if (Entities.ContainsKey(c)) {
                // Encode all entities defined in the XML spec [2].
                foreach (var i in Entities[c]) _buf.Enqueue(i);
            } else if (!(0x0 <= c && c <= 0x8) &&
                       !new[] { 0xB, 0xC }.Contains(c) &&
                       !(0xE <= c && c <= 0x1F) &&
                       !(0x7F <= c && c <= 0x84) &&
                       !(0x86 <= c && c <= 0x9F) &&
                       !(0xD800 <= c && c <= 0xDFFF) &&
                       !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                // Allow if the Unicode codepoint is legal in XML [3].
                _buf.Enqueue(c);
            } else if (char.IsHighSurrogate(c) &&
                       _source.Peek() != endSentinel &&
                       char.IsLowSurrogate((char) _source.Peek())) {
                // Allow well-formed surrogate pairs [1].
                _buf.Enqueue(c);
                _buf.Enqueue((char) _source.Read());
            } else if (!_filterIllegalChars) {
                // Note that we cannot encode illegal characters as entity
                // references due to the "Legal Character" constraint of
                // XML [4]. Nor are they allowed in CDATA sections [5].
                throw new ArgumentException(
                    String.Format("Illegal character: '{0:X}'", (int) c));
            }
        }
    }

    static readonly Dictionary<char,string> Entities =
        new Dictionary<char,string> {
            { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
            { '<', "&lt;" }, { '>', "&gt;" },
        };

    // References:
    // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
    // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
    // [3] http://www.w3.org/TR/xml11/#charsets
    // [4] http://www.w3.org/TR/xml11/#sec-references
    // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
}

Unit tests and full code can be found here.
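The legality check can also be sketched compactly in Python. Note the assumption: this version follows the `Char` production of XML 1.0, whereas the C# code above cites XML 1.1 and additionally drops most C1 control characters (0x7F to 0x9F), which XML 1.0 permits. The function names are hypothetical.

```python
def is_xml10_char(ch: str) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def strip_illegal_xml_chars(text: str) -> str:
    """Silently drop characters that may not appear in XML 1.0 text."""
    return "".join(ch for ch in text if is_xml10_char(ch))


if __name__ == "__main__":
    print(repr(strip_illegal_xml_chars("a\x00b\x0bc\td")))
```

Python strings are sequences of code points rather than UTF-16 code units, so no surrogate-pair handling is needed here, unlike in the C# version.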

2 of 13
35

SecurityElement.Escape

documented here

🌐
My Tec Bits
mytecbits.com › tools › encoders › xml-encoder
XML Encoder Tool | My Tec Bits.
July 23, 2021 - XML Encoder is a tool to convert special characters in character data to XML's predefined entities.
🌐
GitHub
github.com › ConradIrwin › libxml2 › blob › master › encoding.c
libxml2/encoding.c at master · ConradIrwin/libxml2
encoding.c: implements the encoding conversion functions needed for XML. Related specs: rfc2044 (UTF-8 and UTF-16), F. Yergeau, Alis Technologies; rfc2781 UTF-16, an encoding of ISO 10646, P. Hoffman, F. Yergeau ...
Author   ConradIrwin
🌐
Go Packages
pkg.go.dev › encoding › xml
xml package - encoding/xml - Go Packages
March 6, 2026 - When a project reaches major version v1 it is considered stable. ... Package xml implements a simple XML 1.0 parser that understands XML name spaces. ... package main import ( "encoding/xml" "fmt" "log" "strings" ) type Animal int const ( Unknown Animal = iota Gopher Zebra ) func (a *Animal) ...
🌐
DocsAllOver
docsallover.com › tools › xml-encoder-decoder
DocsAllOver | XML Encoder/Decoder
An XML Encoder Decoder is a tool that allows you to encode special characters in your XML code into XML entities or decode XML entities back into their original characters.
Top answer
1 of 4
9

A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, this is necessary to represent the Unicode characters that make up an XML document.

UTF-8 is chosen as the default, because it has several advantages:

  • it is compatible with ASCII in that all valid ASCII encoded text is also valid UTF-8 encoded (but not necessarily the other way around!)
  • it uses only 1 byte per character for "common" letters (those that also exist in ASCII)
  • it can represent all existing Unicode characters
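The ASCII-compatibility point in the list above can be checked with a short Python sketch. The helper names are invented for the example.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """True if pure-ASCII text encodes to identical bytes in both encodings."""
    return text.encode("ascii") == text.encode("utf-8")


def decodes_as_ascii(data: bytes) -> bool:
    """True if the byte string is valid ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(same_bytes_in_ascii_and_utf8("<tag>plain</tag>"))
    # UTF-8 bytes for a non-ASCII character are not valid ASCII.
    print(decodes_as_ascii("\u00e9".encode("utf-8")))
```

This is the "not necessarily the other way around" caveat: every ASCII byte sequence is valid UTF-8, but UTF-8 output containing non-ASCII characters uses bytes above 0x7F.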

Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is a good article that gives a solid overview of the topic.

2 of 4
4

When computers were first created, they mostly worked only with characters found in the English language, leading to the 7-bit US-ASCII standard.

However, there are a lot of different written languages in the world, and ways had to be found to be able to use them in computers.

The first way works fine if you restrict yourself to a certain language: use a culture-specific encoding, such as ISO-8859-1, which can represent Latin-based European language characters in 8 bits, or GB2312 for Chinese characters.

The second way is a bit more complicated, but theoretically allows every character in the world to be represented: the Unicode standard, in which every character from every language has a specific code. However, given the high number of existing characters (109,000 in Unicode 5), Unicode characters are normally represented using a three-byte representation (one byte for the Unicode plane, and two bytes for the character code).
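The plane-plus-code split can be made concrete with a little arithmetic. In this Python sketch (the function name is hypothetical), a code point is divided by 0x10000: the quotient is the plane number and the remainder is the two-byte code within that plane.

```python
def plane_and_offset(ch: str) -> tuple[int, int]:
    """Split a character's code point into (plane, 16-bit code within plane)."""
    return divmod(ord(ch), 0x10000)


if __name__ == "__main__":
    for ch in ["A", "\u20ac", "\U0001F600", "\U0010FFFF"]:
        plane, offset = plane_and_offset(ch)
        print(f"U+{ord(ch):06X} -> plane {plane}, code 0x{offset:04X}")
```

Since the highest code point is U+10FFFF, the plane number never exceeds 16 and fits comfortably in one byte.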

In order to maximize compatibility with existing code (some is still using text in ASCII), the UTF-8 standard encoding was devised as a way to store Unicode characters, only using the minimal amount of space, as described in Joachim Sauer's answer.

So, it's common to see files encoded with specific charsets such as ISO-8859-1 if the file is meant to be edited or read only by software (and people) understanding only these languages, and UTF-8 when there's a need to be highly interoperable and culture-independent. The current tendency is for UTF-8 to replace other charsets, even though this requires work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.