I’m definitely missing something about the TextEncoder and TextDecoder behavior. It seems to me like the following code should round-trip, but it doesn’t seem to:
new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);
Since I’m just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?
Answer
Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.
Also, according to MDN:
A USVString is a sequence of Unicode scalar values. This definition differs from that of DOMString or the JavaScript String type in that it always represents a valid sequence suitable for text processing, while the latter can contain surrogate code points.
So, my guess is that your String argument to encode() is getting converted to a USVString (either implicitly or within encode()). Based on this page, it looks like the conversion from String to USVString first converts it to a DOMString, and then follows this procedure, which includes replacing all surrogates with U+FFFD, the "Replacement Character". That is the code point you see: 65533.
The reason String.fromCharCode(55296).charCodeAt(0) works, I believe, is that it doesn't need to do this String -> USVString conversion.
As to why TextEncoder.encode() was designed this way, I don't understand the Unicode details well enough to attempt an explanation, but I suspect it's to simplify implementation, since the only output encoding it supports seems to be UTF-8, in a Uint8Array. I'm guessing that requiring a USVString argument without surrogates (instead of a native UTF-16 String possibly containing them) simplifies the encoding to UTF-8, or maybe makes some encoding/decoding use cases simpler.