TextEncoder / TextDecoder not round tripping

I’m definitely missing something about how TextEncoder and TextDecoder behave. It seems to me that the following code should round-trip, but it doesn’t:

new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);

Since I’m just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?

Answer

Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.

Also, according to MDN:

A USVString is a sequence of Unicode scalar values. This definition differs from that of DOMString or the JavaScript String type in that it always represents a valid sequence suitable for text processing, while the latter can contain surrogate code points.

So my guess is that the String argument to encode() is converted to a USVString (either implicitly or within encode()). Under the Web IDL conversion rules, a JavaScript String is first converted to a DOMString, and then every unpaired surrogate in it is replaced with U+FFFD, the “Replacement Character”. A properly paired high and low surrogate is left alone, since together they represent a single scalar value. Your string holds one lone high surrogate, so it becomes U+FFFD, which is the 65533 you see.
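To make the replacement behavior concrete, here is a minimal sketch that round-trips a lone surrogate and a valid surrogate pair. The helper name roundTripCodeUnits is made up for this illustration; it only uses the standard TextEncoder and TextDecoder globals.

```javascript
// Hypothetical helper: encode to UTF-8, decode back, and list the
// UTF-16 code units of the decoded string.
function roundTripCodeUnits(str) {
  const bytes = new TextEncoder().encode(str); // String -> USVString -> UTF-8
  const decoded = new TextDecoder().decode(bytes);
  const units = [];
  for (let i = 0; i < decoded.length; i++) {
    units.push(decoded.charCodeAt(i));
  }
  return units;
}

// A lone high surrogate (0xD800 === 55296) is replaced by U+FFFD (65533).
const loneResult = roundTripCodeUnits(String.fromCharCode(0xD800));

// A valid pair (U+1F600 is 0xD83D followed by 0xDE00) survives unchanged.
const pairResult = roundTripCodeUnits(String.fromCharCode(0xD83D, 0xDE00));

console.log(loneResult); // [65533]
console.log(pairResult); // [55357, 56832]
```

So the round trip only fails for strings that contain a surrogate without its partner.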

I believe String.fromCharCode(55296).charCodeAt(0) works because it never goes through this String -> USVString conversion: a plain JavaScript String can hold any sequence of 16-bit code units, including lone surrogates.
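If you want to know ahead of time whether a string will survive encode() unchanged, you can scan it for unpaired surrogates yourself. The sketch below is one way to do that; hasLoneSurrogate is a hypothetical helper written for this answer, not a built-in.

```javascript
// Hypothetical helper: returns true if the string contains a surrogate
// code unit that is not part of a valid high+low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of the pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// The plain String keeps the lone surrogate intact...
const kept = String.fromCharCode(55296).charCodeAt(0);
console.log(kept); // 55296

// ...but the scan flags it as something encode() would replace.
console.log(hasLoneSurrogate(String.fromCharCode(55296))); // true
console.log(hasLoneSurrogate("\u{1F600}")); // false
```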

As to why TextEncoder.encode() was designed this way, I don’t understand the Unicode details well enough to explain it with confidence, but I suspect it simplifies the implementation, since the only output encoding it supports seems to be UTF-8, in a Uint8Array. Well-formed UTF-8 has no representation for a surrogate code point, so I’m guessing that requiring a USVString argument without lone surrogates (instead of a native UTF-16 String that might contain them) makes the encoding to UTF-8 straightforward, or maybe makes some encoding/decoding use cases simpler.
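You can see what actually ends up in the Uint8Array by printing the bytes. This is just an illustrative sketch (toHex is a made-up helper), and it shows the output rather than proving anything about the design rationale.

```javascript
// Hypothetical helper: format a Uint8Array as space-separated hex bytes.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

const encoder = new TextEncoder();

// The lone surrogate is encoded as the UTF-8 bytes of U+FFFD.
const loneHex = toHex(encoder.encode(String.fromCharCode(0xD800)));

// Encoding U+FFFD directly produces exactly the same bytes.
const replacementHex = toHex(encoder.encode("\uFFFD"));

// A valid surrogate pair becomes one 4-byte UTF-8 sequence (U+1F600).
const pairHex = toHex(encoder.encode(String.fromCharCode(0xD83D, 0xDE00)));

console.log(loneHex);        // ef bf bd
console.log(replacementHex); // ef bf bd
console.log(pairHex);        // f0 9f 98 80
```

Once the bytes are ef bf bd, the original 55296 is gone, so no decoder can recover it.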

User contributions licensed under: CC BY-SA