I’m definitely missing something about the TextEncoder and TextDecoder behavior. It seems to me like the following code should round-trip, but it doesn’t seem to:
new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);
Since I’m just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?
Answer
Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.
Also, according to MDN:
A USVString is a sequence of Unicode scalar values. This definition differs from that of DOMString or the JavaScript String type in that it always represents a valid sequence suitable for text processing, while the latter can contain surrogate code points.
So, my guess is that your String argument to encode() is getting converted to a USVString (either implicitly or within encode()). Based on this page, it looks like the conversion from String to USVString first converts it to a DOMString, and then follows this procedure, which includes replacing all surrogates with U+FFFD, the "Replacement Character". That is the code point you see: 65533.
The reason String.fromCharCode(55296).charCodeAt(0) works, I believe, is that it doesn't need to do this String -> USVString conversion.
As to why TextEncoder.encode() was designed this way, I don't understand the Unicode details well enough to attempt an explanation, but I suspect it's to simplify implementation, since the only output encoding it supports seems to be UTF-8, in a Uint8Array. I'm guessing that requiring a USVString argument without surrogates (instead of a native UTF-16 String possibly containing them) simplifies the encoding to UTF-8, or maybe makes some encoding/decoding use cases simpler.