Converting from a Uint8Array to a string and back

I’m having an issue converting a particular Uint8Array to a string and back. I’m working in the browser (Chrome), which natively supports the TextEncoder/TextDecoder APIs.

If I start with a simple case, everything seems to work well:

const uintArray = new TextEncoder().encode('silly face demons');
// Uint8Array(17) [115, 105, 108, 108, 121, 32, 102, 97, 99, 101, 32, 100, 101, 109, 111, 110, 115]
new TextDecoder().decode(uintArray);
// silly face demons

But the following case is not giving me the results I expect. Without getting into too much of the details (it’s cryptography related), let’s start with the fact that I’m provided with the following Uint8Array:

Uint8Array(24) [58, 226, 7, 102, 202, 238, 58, 234, 217, 17, 189, 208, 46, 34, 254, 4, 76, 249, 169, 101, 112, 102, 140, 208]

and what I want to do is convert that to a string and then later convert the string back to the original array, but I get this:

const uintArray = new Uint8Array([58, 226, 7, 102, 202, 238, 58, 234, 217, 17, 189, 208, 46, 34, 254, 4, 76, 249, 169, 101, 112, 102, 140, 208]);
new TextDecoder().decode(uintArray);
// :�f��:����."�L��epf��
new TextEncoder().encode(':�f��:����."�L��epf��');

…which results in: Uint8Array(48) [58, 239, 191, 189, 7, 102, 239, 191, 189, 239, 191, 189, 58, 239, 191, 189, 239, 191, 189, 17, 239, 191, 189, 239, 191, 189, 46, 34, 239, 191, 189, 4, 76, 239, 191, 189, 239, 191, 189, 101, 112, 102, 239, 191, 189, 239, 191, 189]

The array has doubled in length. Encoding is a bit out of my wheelhouse. Can anyone tell me why the array has doubled (I’m assuming it’s an alternate representation of the original array…?). Also, and more importantly, is there a way I could get back to the original array (i.e. undouble the one I’m getting)?

Answer

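The array contains bytes that are not valid UTF-8. Any byte >= 128 is only valid as part of a multi-byte sequence: some values are lead bytes that must be followed by the right number of continuation bytes (128 to 191), and some, like 254, are never allowed at all. TextDecoder replaces each invalid sequence with U+FFFD (�), and TextEncoder then writes every U+FFFD as three bytes, 239, 191, 189. That is why the array grew from 24 to 48 bytes: 12 of the original bytes were each replaced by three. The original values are lost at that point, so the result cannot be "undoubled". If you want to convert back and forth you will need to make sure you are starting from valid UTF-8. The codepage layout here might be useful: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout as might the description of illegal byte sequences: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences.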
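To see the substitution directly, here is a minimal sketch (not part of the original answer) using only the standard TextEncoder/TextDecoder APIs. It assumes an environment where the decoder's fatal option is supported, which is the case in current browsers and Node.js; the variable names are just for illustration.

```javascript
// U+FFFD is the replacement character TextDecoder emits for invalid input.
// Encoded as UTF-8 it takes three bytes: 239, 191, 189.
const replacementBytes = new TextEncoder().encode('\uFFFD');

// 254 can never appear in valid UTF-8, so the default (lenient) decoder
// silently substitutes a single U+FFFD for it.
const lenient = new TextDecoder().decode(new Uint8Array([254]));

// With { fatal: true } the decoder throws a TypeError instead of
// substituting, which makes invalid input easy to detect up front.
let threw = false;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(new Uint8Array([254]));
} catch (e) {
  threw = e instanceof TypeError;
}
```

Using the fatal decoder is one way to check whether a given byte array can survive a decode/encode round trip before relying on it.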

As a concrete example, this:

let arr = new TextDecoder().decode(new Uint8Array([194, 169])) // => '©'
let res = new TextEncoder().encode(arr) // => [194, 169]

works because [194, 169] is valid utf-8 for © but:

let arr = new TextDecoder().decode(new Uint8Array([194, 27]))
let res = new TextEncoder().encode(arr) // => [239, 191, 189, 27]

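doesn’t, because it’s not a valid sequence: 27 is not a continuation byte, so the lead byte 194 is replaced with U+FFFD, which encodes as 239, 191, 189.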
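If the goal is simply to carry arbitrary bytes (such as cryptographic output) around as a string and get the same bytes back, one common alternative is to use a binary-to-text encoding such as hex or Base64 instead of UTF-8 decoding. The sketch below uses hex; bytesToHex and hexToBytes are hypothetical helper names introduced here, not something from the original answer or a built-in API.

```javascript
// Hypothetical helper: render each byte as two lowercase hex digits.
function bytesToHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Hypothetical helper: parse a hex string back into the bytes it encodes.
function hexToBytes(hex) {
  if (hex.length % 2 !== 0) {
    throw new Error('hex string must have an even length');
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// The 24-byte array from the question.
const original = new Uint8Array([58, 226, 7, 102, 202, 238, 58, 234, 217, 17,
  189, 208, 46, 34, 254, 4, 76, 249, 169, 101, 112, 102, 140, 208]);

const asString = bytesToHex(original); // a plain ASCII string, safe to store or send
const restored = hexToBytes(asString); // the same 24 bytes as `original`
```

Hex doubles the length of the data as text; Base64 is more compact if size matters, but the idea is the same: every byte value maps to printable characters, so nothing gets replaced along the way.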

User contributions licensed under: CC BY-SA