You can use a Blob to get the string size in bytes.
Examples:
console.info(
  new Blob(['😂']).size, // 4
  new Blob(['👍']).size, // 4
  new Blob(['😂👍']).size, // 8
  new Blob(['👍😂']).size, // 8
  new Blob(['I\'m a string']).size, // 12
  // from Premasagar's correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size, // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
Answer from P Roitto on Stack Overflow
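This function returns the size in bytes of the UTF-8 encoding of any string you pass to it.
function byteCount(s) {
  // Each %XX escape or literal character counts as one byte.
  // encodeURI throws a URIError if s contains a lone surrogate.
  return encodeURI(s).split(/%..|./).length - 1;
}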
Source
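To see why the split-based count works, it can help to look at what encodeURI produces for a few inputs. The sketch below is only an illustration: the helper name showEncoding is made up here, and it assumes well-formed strings.

```javascript
// Same one-liner as above: every "%XX" escape or single literal character
// is consumed as a separator, so the number of pieces minus one is the byte count.
function byteCount(s) {
  return encodeURI(s).split(/%..|./).length - 1;
}

// Hypothetical helper, for illustration only.
function showEncoding(s) {
  return { encoded: encodeURI(s), bytes: byteCount(s) };
}

console.log(showEncoding('a'));       // encoded: 'a',         bytes: 1
console.log(showEncoding('\u00e9'));  // encoded: '%C3%A9',    bytes: 2 (U+00E9)
console.log(showEncoding('\u20ac'));  // encoded: '%E2%82%AC', bytes: 3 (U+20AC)
```

Characters that encodeURI leaves unescaped are all ASCII, so counting each of them as one byte is safe.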
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
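One way to see this is to inspect a character outside the Basic Multilingual Plane. A small sketch, using U+1F602 as an arbitrary example (my choice, not taken from the quoted source):

```javascript
// U+1F602 lies above U+FFFF, so it is stored as two 16-bit code units.
const s = '\u{1F602}';

console.log(s.length);                      // 2 (code units, not characters)
console.log(s.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(s.charCodeAt(1).toString(16));  // de02 (low surrogate)
console.log(s.codePointAt(0).toString(16)); // 1f602 (the full code point)
console.log(Array.from(s).length);          // 1 (iteration is by code point)
```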
Years have passed, and nowadays you can do it natively:
const textEncoder = new TextEncoder();
console.log(textEncoder.encode('foo').length); // 3
Note that it's not supported by IE (you may use a polyfill for that).
MDN documentation
Standard specifications
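As a rough sketch of how this could be wrapped up, the helper below reuses a single encoder. The name utf8ByteLength is mine, not part of any API, and the code assumes an environment where TextEncoder is a global, such as current browsers or a recent Node.js.

```javascript
// TextEncoder always encodes to UTF-8; one instance can be reused.
const encoder = new TextEncoder();

// Hypothetical helper name, for illustration only.
function utf8ByteLength(value) {
  // encode() returns a Uint8Array; its length is the byte count.
  return encoder.encode(String(value)).length;
}

console.log(utf8ByteLength('foo'));        // 3
console.log(utf8ByteLength('\u00e9'));     // 2 (U+00E9)
console.log(utf8ByteLength('\u{1F602}'));  // 4 (U+1F602, a surrogate pair in UTF-16)
console.log(utf8ByteLength('\ud903'));     // 3 (lone surrogate, encoded as U+FFFD)
```

The last line shows that an unpaired surrogate is replaced with U+FFFD before encoding, which is why it counts as 3 bytes.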
There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)
The rest of this answer is kept for historical reference, or for environments where the TextEncoder API is still unavailable.
If you know the character encoding, though, you can calculate it yourself.
encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do:
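function lengthInUtf8Bytes(str) {
  var encoded = encodeURIComponent(str);
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encoded.match(/%[89ABab]/g);
  // A code point above U+FFFF counts twice in str.length (surrogate pair) but has
  // only one lead byte (F0 to F4), so subtract one per 4-byte lead byte.
  var astral = encoded.match(/%[Ff][0-4]/g);
  return str.length + (m ? m.length : 0) - (astral ? astral.length : 0);
}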
This should work because of the way UTF-8 encodes multi-byte sequences. The first encoded byte either has a high bit of zero (a single-byte sequence) or has a first hex digit of C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.
The table on Wikipedia makes it clearer:
Bits  Last code point  Byte 1    Byte 2    Byte 3
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
...
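The lead/continuation split can also be checked directly on the encoded bytes. The following sketch assumes TextEncoder is available; classifyUtf8Bytes is an illustrative name, not an existing function.

```javascript
// Hypothetical helper: counts lead bytes and continuation bytes in the
// UTF-8 encoding of a string.
function classifyUtf8Bytes(str) {
  const bytes = new TextEncoder().encode(str);
  let lead = 0;
  let continuation = 0;
  for (const b of bytes) {
    // Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF).
    if ((b & 0xc0) === 0x80) {
      continuation++;
    } else {
      lead++;
    }
  }
  return { lead, continuation, total: bytes.length };
}

console.log(classifyUtf8Bytes('a'));       // { lead: 1, continuation: 0, total: 1 }
console.log(classifyUtf8Bytes('\u00e9'));  // { lead: 1, continuation: 1, total: 2 }
console.log(classifyUtf8Bytes('\u20ac'));  // { lead: 1, continuation: 2, total: 3 }
```

For characters up to U+FFFF, each character contributes exactly one lead byte, so the character count plus the continuation bytes gives the UTF-8 length.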
If instead you need to understand the page encoding, you can use this trick:
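function lengthInPageEncoding(s) {
  var a = document.createElement('A');
  // Let the browser percent-encode the fragment when it resolves the URL.
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Escapes may use uppercase hex digits, so match case-insensitively.
  var m = sEncoded.match(/%[0-9a-f]{2}/gi);
  // Each %XX escape is three characters for one byte: subtract two per escape.
  return sEncoded.length - (m ? m.length * 2 : 0);
}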
It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().
/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param {string} normal_val
 * @return {number}
 */
function getByteLen(normal_val) {
  // Force string type
  normal_val = String(normal_val);
  var byteLen = 0;
  for (var i = 0; i < normal_val.length; i++) {
    var c = normal_val.charCodeAt(i);
    byteLen += (c & 0xf800) === 0xd800 ? 2 : // Code point is half of a surrogate pair
               c < (1 << 7) ? 1 :
               c < (1 << 11) ? 2 : 3;
  }
  return byteLen;
}
JavaScript implementations may use either UCS-2 or UTF-16 to represent strings.
UCS-2 only supports Unicode code points up to U+FFFF, and such Unicode characters occupy 1, 2, or 3 bytes in their UTF-8 representation. This is not too tricky to handle.
However, as @Mac points out, UTF-16 surrogate pairs are a tricky special case. UTF-16 extends UCS-2 by adding support for code points U+10000 to U+10FFFF, which UTF-16 encodes using a pair of 16-bit code units. The first unit of such a pair (called the "high surrogate") is in the range D800 to DBFF; it should always be followed by another unit (called the "low surrogate") in the range DC00 to DFFF. Observe that the UTF-8 representation of any character in the range U+10000 to U+10FFFF takes 4 bytes. Therefore, any surrogate pair in UTF-16 translates to a 4-byte UTF-8 representation. Put another way, whenever we encounter half of a surrogate pair (i.e., a code unit in the range D800 to DFFF), we just add two bytes to the UTF-8 length.
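The same rule can be written by walking code points instead of code units. This is a sketch under the assumption that ES2015 string iteration is available; utf8LengthByCodePoint is a name made up for this example.

```javascript
// Hypothetical helper: sums the UTF-8 size of each code point.
function utf8LengthByCodePoint(str) {
  let total = 0;
  // for...of iterates by code point, so a valid surrogate pair arrives
  // as one two-unit string.
  for (const ch of String(str)) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) total += 1;          // U+0000 to U+007F
    else if (cp < 0x800) total += 2;    // U+0080 to U+07FF
    else if (cp < 0x10000) total += 3;  // U+0800 to U+FFFF
    else total += 4;                    // U+10000 to U+10FFFF
  }
  return total;
}

console.log(utf8LengthByCodePoint('abc'));            // 3
console.log(utf8LengthByCodePoint('\u{1F602}'));      // 4
console.log(utf8LengthByCodePoint('a\u00e9\u20ac'));  // 6
```

An unpaired surrogate falls into the 3-byte branch here, which matches the size Blob reports for a lone surrogate, whereas adding 2 per surrogate half yields 2 for it.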
My 2 cents
- Please do not abbreviate words; choose short words or acronyms ( Len -> Length )
- Please use lower camel case ( normal_val -> normalValue )
- Consider using spartan conventions ( s -> generic string )
- new Array() is considered old skool, consider var byte_pieces = []
- You are using byte_pieces to track the bytes just to get the length; you could have just kept track of the length, which would be more efficient
- I am not sure what abnormal pieces would be here:
  if(normal_pieces[i] && normal_pieces[i] != '')
- You check again for these here, probably not needed:
  if(encoded_pieces[i] && encoded_pieces[i] != '')
- You could just do return byte_pieces.length instead of
  // Array length is the number of bytes in string
  var byte_length = byte_pieces.length;
  return byte_length;
All that together, I would counter-propose something like this:
function getByteCount( s )
{
  var count = 0, chars, i;
  // Coerce to a string first, then split by code point so that
  // surrogate pairs stay together (encodeURI rejects half a pair)
  chars = Array.from( String( s || "" ) );
  for( i = 0 ; i < chars.length ; i++ )
  {
    var partCount = encodeURI( chars[i] ).split("%").length;
    count += partCount==1?1:partCount-1;
  }
  return count;
}
getByteCount("i js");
getByteCount("abc def");
You could get the sum by using .reduce(); I leave that as an exercise for the reader.
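One possible shape for the .reduce() version is sketched below. It is not the answer's own code: the name getByteCountReduce is hypothetical, and it splits the string with Array.from so that surrogate pairs stay together (encodeURI still throws a URIError on an unpaired surrogate).

```javascript
// Hypothetical reduce-based variant, for illustration only.
function getByteCountReduce(s) {
  // Array.from splits by code point, keeping surrogate pairs intact.
  return Array.from(String(s || '')).reduce(function (count, ch) {
    // Each "%XX" escape is one byte; an unescaped character is one byte too.
    var partCount = encodeURI(ch).split('%').length;
    return count + (partCount === 1 ? 1 : partCount - 1);
  }, 0);
}

console.log(getByteCountReduce('abc def'));    // 7
console.log(getByteCountReduce('\u2665'));     // 3 (U+2665)
console.log(getByteCountReduce('\u{1F602}'));  // 4 (U+1F602)
```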
Finally, if you are truly concerned about performance, there are some very fancy performant js libraries out there.
npm install string-byte-length