There are 128 valid basic ASCII characters, mapped to the values 0 (the NUL byte) to 127 (the DEL character).
The word 'character' must be used carefully, because its definition is a subtle one. For example, is è one character, or is it two characters (an e followed by a combining grave accent `)? It depends.
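To make that concrete, here is a small Java sketch (my own illustration, not part of the quoted answer; the class and variable names are mine) comparing the precomposed è (U+00E8) with the decomposed form e + combining grave accent (U+0300):

import java.text.Normalizer;

public class CharacterCount {
    public static void main(String[] args) {
        String composed = "\u00E8";    // è as a single precomposed code point
        String decomposed = "e\u0300"; // e followed by a combining grave accent

        // Both render as "è", yet they contain a different number of chars.
        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2

        // NFC normalization collapses the decomposed form into one code point.
        String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(normalized.equals(composed)); // true
    }
}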
Secondly, a sequence of characters is completely independent of its encoding. For simplicity, I assume here that each byte is interpreted as one character.
To determine whether a byte can be parsed as an ASCII character, you can simply do this:
byte[] bytes = "Bj��rk����oacute�".getBytes();
for (byte b : bytes) {
    // A byte in the range 0 to 127 is valid ASCII; any other value is not.
    // Since a byte in Java is signed (its range is -128 to 127), checking
    // b >= 0 is sufficient.
    if (b >= 0) {
        System.out.println("Valid ASCII");
    } else {
        System.out.println("Invalid ASCII");
    }
}
Answer from MC Emperor on Stack Overflow: "Valid/invalid non-ascii and invalid ascii characters".
Some background
When Java was invented, an important design decision was that text in Java would be Unicode: a numbering of all graphemes in the world. Hence char is two bytes (a UTF-16 code unit, UTF-16 being one of the Unicode "UCS Transformation Formats"), and byte is a distinct type for binary data.
Unicode numbers all symbols as so-called code points; ♫, for example, is U+266B. Those numbers reach into three-byte integers (up to U+10FFFF), which no longer fit in a two-byte char; hence code points in Java are represented as int.
ASCII is a 7-bit subset of Unicode and of UTF-8, covering 0 to 127.
UTF-8 is a multi-byte Unicode format, in which ASCII is a valid subset and higher code points are encoded as sequences of two to four bytes.
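A short Java sketch (my own addition; the class name and sample symbols are arbitrary) showing the difference between chars (UTF-16 code units) and code points, and why an int is needed:

public class CodePoints {
    public static void main(String[] args) {
        String note = "\u266B";                                // ♫ fits in a single char
        String clef = new String(Character.toChars(0x1D11E));  // 𝄞 needs a surrogate pair

        System.out.println(note.length());                         // 1 UTF-16 code unit
        System.out.println(clef.length());                         // 2 UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length())); // but only 1 code point

        // A code point above U+FFFF does not fit in a char, hence int.
        int cp = clef.codePointAt(0);
        System.out.printf("U+%X%n", cp);                        // U+1D11E
    }
}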
Validity
You were asked to identify "invalid" characters = wrongly produced code points. You could also identify code parts that produce invalid characters. (Easier.)
In the above, � is a placeholder character (like ?) that substitutes a code point that is not representable in the current character set. If the code produced a ? as placeholder, one could not even guess whether a substitution took place. For many West European languages the relevant encoding is Windows-1252 (Cp1252, MS Windows Latin-1), which covers somewhat more characters than ISO Latin-1. You can check whether a code point from a String can be converted to that Charset.
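One way to perform that check in Java is with a CharsetEncoder; this is a sketch of mine (not code from the answer), assuming the text has already been decoded into a String:

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Cp1252Check {
    public static void main(String[] args) {
        CharsetEncoder cp1252 = Charset.forName("windows-1252").newEncoder();

        System.out.println(cp1252.canEncode('\u00E9')); // é: true, exists in Cp1252
        System.out.println(cp1252.canEncode('\u20AC')); // €: true, 0x80 in Cp1252 (not in ISO Latin-1)
        System.out.println(cp1252.canEncode('\u03BB')); // λ: false, Greek is not in Cp1252
        System.out.println(cp1252.canEncode('\uFFFD')); // �: false, the replacement character itself
    }
}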
There remain false positives: wrong characters that nevertheless exist in Cp1252. These typically arise when a multi-byte UTF-8 sequence is interpreted as several Windows-1252 characters. So an acceptable non-ASCII char adjacent to an unacceptable non-ASCII char is suspect too. That means you need to list the special characters of your language, plus extras such as typographic quotes and, in English, letters from borrowed words like ç or ñ.
For MS-Windows Latin-1 (an altered ISO Latin-1) something like:
boolean isSuspect(char ch) {
    if (ch < 32) {
        // Control characters other than form feed, newline, carriage return and tab are suspect.
        return "\f\n\r\t".indexOf(ch) == -1;
    } else if (ch < 127) {
        // Printable ASCII is never suspect.
        return false;
    } else {
        return suspects.get(ch); // Better use a positive list.
    }
}

static BitSet suspects = new BitSet(256);
static {
    ...
}
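The static block above is left open in the original answer. As a hedged sketch of how it might be filled in and used (the whitelist, the class name and the test string are invented for illustration), one could mark the whole high range as suspect and then clear the characters that are legitimate in the target language:

import java.util.BitSet;

public class SuspectScan {
    static final BitSet suspects = new BitSet(256);
    static {
        suspects.set(127, 256); // start by treating every non-ASCII Cp1252 position as suspect
        // ...then clear a (made-up) whitelist of characters acceptable in the target language.
        for (char ok : "äöüßéèêàçñ".toCharArray()) {
            suspects.clear(ok);
        }
    }

    static boolean isSuspect(char ch) {
        if (ch < 32) {
            return "\f\n\r\t".indexOf(ch) == -1; // control chars other than these are suspect
        } else if (ch < 127) {
            return false;                        // printable ASCII is fine
        } else {
            return ch > 255 || suspects.get(ch); // outside Cp1252, or not on the whitelist
        }
    }

    public static void main(String[] args) {
        // "naïve" mis-decoded into two Cp1252 chars: Ã (U+00C3) and ¯ (U+00AF), two suspects in a row.
        String text = "Björk caf\u00E9 na\u00C3\u00AFve";
        for (int i = 0; i < text.length(); i++) {
            if (isSuspect(text.charAt(i))) {
                System.out.printf("Suspect char U+%04X at index %d%n", (int) text.charAt(i), i);
            }
        }
    }
}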
Came across this later; it uses pure base regex and is quite simple:
grepl("[^ -~]", x)
## [1] TRUE FALSE TRUE FALSE
More here: http://www.catonmat.net/blog/my-favorite-regex/
Another possible way is to convert your string to ASCII and then detect the non-printable control characters that the conversion substitutes for everything it could not convert:
grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1] TRUE FALSE TRUE FALSE
Though it seems stringi has a built-in function for this type of thing too:
stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII" "latin1" "ASCII"
How can I type a "tilde-n" (the Spanish character, like the one in "espanol" if I could type it)? The help I've seen mentions "AltGr" (I don't have such a key) or Alt + the numeric pad (doesn't work), or other suggestions that haven't worked for me so far.
My p16v's keyboard has these keys:
| Row | Keys (italics -> not-an-ASCII-character) |
|---|---|
| row 5 | Esc/FnLock; Function keys, Home, End, Insert, Delete |
| row 4 | TILDE/backquote, digits, PLUS/equals, backspace |
| row 3 | Tab, ASCII chars, PIPE/backslash |
| row 2 | CapsLock; ASCII chars; Enter |
| row 1 | left-shift; ASCII chars; right-shift |
| row 0 (bottom) | Fn, left-Ctrl, "super/Windows logo", left-Alt; Space; right-Alt, PrtSc, right-Ctrl; "PgUp/arrow keys/PgDn" |
...plus a numeric keypad.
Bonus question: What's the point of the "super/Windows logo" key? It didn't exist in the Windows versions I've used to any extent.
There are historical reasons for this guidance, mainly related to the lack of a uniform encoding standard.
Encoding issues
Unicode dates back to the '90s. Before it became mainstream, there was no globally standard way to encode λ, é, or ä. Western character sets used to be limited to a single byte (an octet, to be more precise), which leaves room for only 128 characters on top of the ASCII ones. But there are many more accented local characters and Greek letters than the 128 slots that were available.
This is why the ISO 8859 character sets were standardized in several variants, and people used country-dependent settings. Suppose a German developer worked in ISO 8859-1 and had a ß in an identifier; it was encoded as 0xDF in the source file. If a Greek developer opened the same source using an ISO 8859-7 setting, the identifier would be displayed with an ί instead of the ß. If the Greek developer typed a β, it would end up as 0xE2 in the source file, which a French developer would then read as â. A total mess was guaranteed. And when a US colleague looked at it, it just showed a dot, or a question mark, or a semi-graphic character (after 1981, when extended ASCII became popular).
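As a small illustration of that confusion (my own Java sketch, not part of the answer; it assumes your JDK ships the optional ISO-8859-7 charset, which desktop JDKs normally do), the very same bytes decode to different letters depending on the charset:

import java.nio.charset.Charset;

public class EncodingConfusion {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0xDF, (byte) 0xE2 };

        String westEuropean = new String(raw, Charset.forName("ISO-8859-1"));
        String greek        = new String(raw, Charset.forName("ISO-8859-7"));

        System.out.println(westEuropean); // ßâ - the German/French reading
        System.out.println(greek);        // ίβ - the Greek reading of the same bytes
    }
}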
Tool support
The lack of uniformity triggered practical annoyances in text editors. I remember, for example, that word-by-word cursor movement treated non-ASCII characters as word separators, so navigation was not as smooth as with ASCII identifiers.
More seriously, tool support was lacking. I remember that the first linker I used on MS-DOS had constraints on the length of symbol identifiers and was limited to the ASCII character set. This is, by the way, not such an old story, according to the Wikipedia article about the GNU Compiler Collection (see also this SO question):
Although the C++ language requires support for non-ASCII Unicode characters in identifiers, the feature has only been supported since GCC 10. As with the existing handling of string literals, the source file is assumed to be encoded in UTF-8. The feature is optional in C, but has been made available too since this change.
Momentum, prejudice and internationalization
To prevent all these nasty issues, a lot of coding standards pragmatically recommended the use of ASCII characters for identifiers. This created some momentum.
Moreover, there is a broad consensus for using the English language in identifiers in international projects or open-source projects looking for a broad community. Expectations like these create some prejudice against opening up to Unicode characters.
But not all projects have an international audience. There are lots of teams out there working in a local context and using their native language in comments, in git commits and even in identifiers. A study of some 1.1 million non-English git repos demonstrates that this is a large-scale reality. Here Unicode is relevant (after all, there must be a reason why newer languages such as Swift, C# and others accept Unicode identifiers). Ironically, due to the historical problems, many keep using transliteration (e.g. in German ae instead of ä or, more ambiguously, in French e instead of é).
So, even nowadays, a significant number of people remember that there once was a problem and just continue to promote ASCII, although Unicode identifiers are now largely supported.
Edit: Still, not all Unicode chars are equal!
Accepting Unicode characters in an identifier does not ensure that all Unicode characters are treated equally in every language that claims to support them.
Unicode characters are categorized into classes, for example spacing, punctuation, letters (i.e. the writing alphabets) and others. Some quick experimentation on my Mac:
- Swift, C++ and C# interpret spacing characters properly as token separators, whereas Python considers them an error.
- Emojis are not letters. C++ and Swift accept them in an identifier, but Python and C# don't (example: an identifier like tw with an emoji between the two letters).
- Characters of class "letter" are generally accepted in an identifier in all my tests in the four languages. But some Egyptian hieroglyphs are not recognized as letters by C# (examples: t𓀉w, t丳w, téw, the first being the problematic one).
- And good news for the mathematicians among us: π is a valid identifier in all these languages ;-)
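As an aside, these character classes can be queried programmatically; the following Java sketch (my own, with arbitrary sample code points) shows how differently a letter, a musical symbol, an emoji and a spacing character are classified:

public class IdentifierClasses {
    public static void main(String[] args) {
        int[] samples = {
            0x03C0,  // π  GREEK SMALL LETTER PI
            0x266B,  // ♫  BEAMED EIGHTH NOTES
            0x1F600, // 😀 GRINNING FACE (an emoji)
            0x2003   //    EM SPACE (a spacing character)
        };
        for (int cp : samples) {
            System.out.printf("U+%05X %-30s letter=%-5b identifierPart=%b%n",
                    cp,
                    Character.getName(cp),
                    Character.isLetter(cp),
                    Character.isUnicodeIdentifierPart(cp));
        }
    }
}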
The big problem is seeing two identifiers and determining, just by looking, whether they are the same. Capital Latin, Cyrillic and Greek characters, for example, often look the same, and Chinese characters I just couldn't recognise. And I think Swift allows certain white-space characters in identifiers; other languages might as well. Imagine having an identifier that is literally invisible.
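A minimal Java demonstration of that look-alike problem (my own sketch; the surrounding discussion is about Swift, but the issue is language-independent):

public class Homoglyphs {
    public static void main(String[] args) {
        String latin    = "AB";      // LATIN CAPITAL LETTER A, then B
        String cyrillic = "\u0410B"; // CYRILLIC CAPITAL LETTER A, then B

        // The two strings print identically in most fonts...
        System.out.println(latin + " vs " + cyrillic);

        // ...but to a compiler they would be two distinct identifiers.
        System.out.println(latin.equals(cyrillic)); // false
        System.out.printf("U+%04X vs U+%04X%n",
                (int) latin.charAt(0), (int) cyrillic.charAt(0)); // U+0041 vs U+0410
    }
}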
Having to use a different keyboard would be insane. On Mac keyboards there are a few dozen non-ASCII characters easily available (I can type them without thinking), so anything in this set I would be fine with, but anything outside it I wouldn't. In one case I defined a ≈≈ operator (== didn't work for some reason). Other OSs have different easily accessible characters.
So I’m not totally against it, but be reasonable.
@Greg, the compilers have absolutely no problem with Unicode characters; it's the programmer who does. The compiler can easily distinguish between three different white-space characters, or between a hyphen and an em-dash or en-dash defined as an operator. It's me who has the problem.
PS. Unfortunately Swift doesn't support an operator named ² or ³, so you can't write y = x³ or y = (a+b)². Well, you can write the first one because x³ is a valid identifier, so you can write
let x = 5
let x² = x*x
let x³ = x²*x