There are 128 basic ASCII characters, mapped to the values 0 (the NUL byte) through 127 (the DEL character).

The word 'character' must be used carefully, because its definition is subtle. For example, is è one character, or two (e plus a combining grave accent)? It depends.

Secondly, a sequence of characters is completely independent of its encoding. For simplicity, I assume here that each byte is interpreted as one character.

To determine whether a byte can be parsed as an ASCII character, you can simply do this:

byte[] bytes = "Bj��rk����oacute�".getBytes();
for (byte b : bytes) {
    // What's happening here? A byte in the range 0 to 127 is a valid
    // ASCII value; anything else is not. A byte in Java is signed, so
    // its full range is -128 to 127, and every negative value is non-ASCII.
    if (b >= 0) {
        System.out.println("Valid ASCII");
    }
    else {
        System.out.println("Invalid ASCII");
    }
}
Answer from MC Emperor on Stack Overflow
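As a complementary sketch (my addition, not part of the answer above): the same check can be done at the string level with the standard CharsetEncoder API, which reports whether every character fits in US-ASCII.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    // True if every character of s maps to a code point in 0-127.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Bjork"));  // plain ASCII
        System.out.println(isAscii("Björk")); // ö is outside ASCII
    }
}
```

This avoids looping over bytes by hand, at the cost of answering only a yes/no question for the whole string.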
Some background

As Java was invented, an important design decision was that text in Java would be Unicode: a numbering system for all graphemes in the world. Hence char is two bytes (a UTF-16 code unit; UTF-16 is one of the Unicode "universal character set transformation formats"), and byte is a distinct type for binary data.

Unicode numbers all symbols as so-called code points, like ♫ as U+266B. These numbers reach into three-byte integers; hence code points in Java are represented as int.

ASCII is a 7-bit subset of Unicode, covering code points 0 through 127.

UTF-8 is a multi-byte Unicode encoding in which ASCII is a valid one-byte subset, and higher code points are encoded as sequences of two to four bytes.
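A quick sketch (my addition) showing those byte lengths in UTF-8: an ASCII letter takes one byte, é two, and ♫ (U+266B) three.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        int a = "A".getBytes(StandardCharsets.UTF_8).length; // ASCII: 1 byte
        int e = "é".getBytes(StandardCharsets.UTF_8).length; // U+00E9: 2 bytes
        int n = "♫".getBytes(StandardCharsets.UTF_8).length; // U+266B: 3 bytes
        System.out.println(a + " " + e + " " + n);
    }
}
```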

Validity

You were asked to identify "invalid" characters, that is, wrongly produced code points. You could also identify the places in the code that produce invalid characters, which is easier.

The � above is a placeholder character (like ?) that substitutes a code point not representable in the current character set. If the code produced a plain ? as placeholder, one could not tell whether substitution took place. For some West European languages the encoding is Windows-1252 (Cp1252, MS Windows Latin-1). You can check whether a code point from a String can be converted to that Charset.
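That Charset check can be sketched with CharsetEncoder.canEncode (my addition; "windows-1252" is a standard charset name on the JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Cp1252Check {
    public static void main(String[] args) {
        CharsetEncoder cp1252 = Charset.forName("windows-1252").newEncoder();
        System.out.println(cp1252.canEncode('é')); // representable in Cp1252
        System.out.println(cp1252.canEncode('→')); // U+2192, not in Cp1252
    }
}
```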

False positives then remain: wrong characters that nevertheless exist in Cp1252. These can arise when a multi-byte UTF-8 sequence is interpreted as several Windows-1252 characters. So an acceptable non-ASCII char adjacent to an unacceptable non-ASCII char is suspect too. That means you need to list the special characters of your language, plus extras: special quotes, and English borrowings like ç and ñ.

For MS-Windows Latin-1 (an altered ISO Latin-1), something like this:

boolean isSuspect(char ch) {
    if (ch < 32) {
        // Control characters other than \f, \n, \r and \t are suspect.
        return "\f\n\r\t".indexOf(ch) == -1;
    } else if (ch < 127) {
        // Printable ASCII is never suspect.
        return false;
    } else {
        return suspects.get(ch); // Better use a positive list.
    }
}

static BitSet suspects = new BitSet(256);
static {
    ...
}
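A self-contained variant (my addition) using a positive list, as the comment above suggests: everything not explicitly accepted counts as suspect. The accepted set here is a hypothetical example; a real one would list the special characters of your language.

```java
import java.util.BitSet;

public class SuspectScan {
    static final BitSet accepted = new BitSet();
    static {
        accepted.set(' ', 127);                 // printable ASCII, 0x20-0x7E
        for (char c : "\f\n\r\täöüßéèêçñ".toCharArray()) {
            accepted.set(c);                    // whitespace + sample Latin-1 letters
        }
    }

    static boolean isSuspect(char ch) {
        return !accepted.get(ch);
    }

    public static void main(String[] args) {
        String s = "Bj→rk";                     // U+2192 is not in the accepted set
        for (int i = 0; i < s.length(); i++) {
            if (isSuspect(s.charAt(i))) {
                System.out.println("suspect at index " + i
                        + ": U+" + Integer.toHexString(s.charAt(i)));
            }
        }
    }
}
```

A BitSet grows automatically, so code points above 255 simply read as "not accepted", which is the safe default here.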
Top answer
1 of 7
13

There are historical reasons for this guidance, mainly related to the lack of a uniform encoding standard.

Encoding issues

Unicode dates back to the 1990s. Before it became mainstream, there was no globally standard way to encode λ, é, or ä. Western character sets used to be limited to a single byte (an octet, to be precise), which leaves room for only 128 characters on top of the ASCII ones. But there are many more accented local characters and Greek letters than the 128 slots available.

This is why the ISO 8859 character sets were standardized in several variants, and people used country-dependent settings. Suppose a German developer worked in ISO 8859-1 and had an ß in an identifier; it was encoded as 0xDF in the source file. If a Greek developer opened the same source using an ISO 8859-7 setting, the identifier would be displayed with an ί instead of the ß. If the Greek developer then typed a β (which looks similar), it would end up as 0xE2 in the source file, which a French developer would read as â. A total mess was guaranteed. And when a US colleague looked at it, it just showed a dot, a question mark, or a semi-graphic character (after 1981, when extended ASCII became popular).

Tool support

The lack of uniformity also caused practical annoyances in text editors. I remember, for example, that word-by-word movement treated non-ASCII characters as word separators, so navigation was not as smooth as with ASCII identifiers.

More seriously, tool support was lacking. I remember that the first linker on MS-DOS had constraints on the length of symbol identifiers and was limited to ASCII character sets. This is, by the way, not such an old story, according to Wikipedia on the GNU Compiler Collection (see also this SO question):

Although the C++ language requires support for non-ASCII Unicode characters in identifiers, the feature has only been supported since GCC 10. As with the existing handling of string literals, the source file is assumed to be encoded in UTF-8. The feature is optional in C, but has been made available too since this change.

Momentum, prejudice and internationalization

To prevent all these nasty issues, a lot of coding standards pragmatically recommended the use of ASCII characters in identifiers. This created some momentum.

Moreover, there is a broad consensus for using English in identifiers in international projects, or in open-source projects looking for a broad community. Having this kind of expectation creates some prejudice against opening up to Unicode characters.

But not all projects have an international audience. There are lots of teams out there working in a local context and using their native language in comments, in Git commits, and even in identifiers. A study of some 1.1 million non-English Git repositories demonstrates that this is a large-scale reality. Here Unicode is relevant (after all, there must be a reason for newer languages such as Swift, C#, and others to accept Unicode identifiers). Ironically, due to the historical problems, many keep using transliteration (e.g. in German, ae instead of ä, or, more ambiguously, in French, e instead of é).

So, still today, a significant number of people remember that there was once a problem, and simply continue to promote ASCII, although Unicode identifiers are now widely supported.

Edit: not all Unicode chars are equal!

Accepting Unicode characters in identifiers does not ensure that all Unicode characters are treated equally in every language that claims to support them.

Unicode characters are categorized into classes, for example spacing, punctuation, letters (i.e. writing alphabets), and others. Some quick experimentation on my Mac:

  • Swift, C++ and C# interpret spacing characters properly as token separators, whereas Python considers them an error.
  • Emojis are not letters. C++ and Swift accept them in an identifier, but Python and C# don't (example: an identifier with an emoji between t and w).
  • Characters of class "letter" are generally accepted in identifiers in all my tests in the four languages. But some Egyptian hieroglyphs are not recognized as letters by C# (examples: t𓀉w, t丳w, téw, the first being the problematic one).
  • And good news for the mathematicians among us: π is a valid identifier in all these languages ;-)
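For what it's worth, the same holds in Java (my addition; the answer's tests covered Swift, C++, C#, and Python): letter-class code points such as π are accepted in identifiers.

```java
public class PiIdentifier {
    public static void main(String[] args) {
        double π = Math.PI;        // letter-class code point: a legal Java identifier
        double r = 2.0;
        System.out.println(π * r * r); // area of a circle of radius 2
    }
}
```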
2 of 7
8

The big problem is looking at two identifiers and determining, just by sight, whether they are the same. Capital Latin, Cyrillic, and Greek characters, for example, often look identical. Chinese characters I simply couldn't tell apart. And I think Swift allows certain whitespace characters in identifiers; other languages might as well. Imagine an identifier that is literally invisible.
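The look-alike problem is easy to demonstrate (a sketch of my own): Latin a (U+0061) and Cyrillic а (U+0430) render identically in most fonts, yet compare as different.

```java
public class Homoglyphs {
    public static void main(String[] args) {
        String latin = "a";         // U+0061 LATIN SMALL LETTER A
        String cyrillic = "\u0430"; // U+0430 CYRILLIC SMALL LETTER A, renders as "а"
        System.out.println(latin.equals(cyrillic)); // false: different code points
    }
}
```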

Having to use a different keyboard would be insane. On Mac keyboards, there are a few dozen non-ASCII characters easily available (I can type them without thinking), so anything in this set would be fine with me, and anything outside it wouldn't. In one case I defined a ≈≈ operator (== didn't work for some reason). Other OSs have different easily accessible characters.

So I’m not totally against it, but be reasonable.

@Greg, compilers have absolutely no problem with Unicode characters; it's the programmer who does. The compiler can easily distinguish between three different whitespace characters, or between an operator defined as an em-dash or an en-dash rather than a hyphen. It's me who has the problem.

PS. Unfortunately, Swift doesn't support an operator named ² or ³, so you can't write y = x³ or y = (a+b)². Well, you can write the first one, because x³ is a valid identifier, so you can write:

let x = 5
let x² = x*x
let x³ = x²*x