
Replace (fix) non UTF-8 character in string

When I parse a string into a JSON object, some special characters are replaced with the black question mark (�). I believe this is an issue with the character encoding. Is there any way to replace the question mark with the correct character (é), or is it lost?


Answer

From the current version of the Unicode Standard:

The Replacement Character U+FFFD replacement character is the general substitute character in the Unicode Standard. It can be substituted for any “unknown” character in another encoding that cannot be mapped in terms of known Unicode characters

One of the processes that produced this string probably encountered an error and substituted the replacement character:

If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character
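Python's byte decoders offer a comparable choice for bytes that are invalid in the target encoding, which is a related but distinct case from the noncharacters the quote describes. A minimal sketch, assuming the text arrived as Latin-1 bytes and was decoded as UTF-8 (the sample bytes and the helper name `decode_both_ways` are made up for illustration):

```python
from typing import Optional, Tuple


def decode_both_ways(raw: bytes) -> Tuple[Optional[str], str]:
    """Decode raw as UTF-8 twice: strictly, then with replacement."""
    try:
        strict = raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        strict = None  # the decoder signalled an error instead of guessing
    # errors="replace" substitutes U+FFFD for each undecodable sequence
    replaced = raw.decode("utf-8", errors="replace")
    return strict, replaced


# Hypothetical input: "café" encoded as Latin-1, i.e. b'caf\xe9'
sample = "café".encode("latin-1")
strict_result, replaced_result = decode_both_ways(sample)
```

Here `strict_result` is `None` because the strict decoder raised, while `replaced_result` ends in U+FFFD. If your pipeline uses the second mode anywhere, that is a likely source of the question mark.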

That means that the text you get is modified:

If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text.

The resulting Unicode string carries no further error information, so the original byte sequence cannot be recovered from it alone: the offending bytes have already been mapped to U+FFFD.
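To see why the information is gone, here is a small Python sketch. It assumes, purely for illustration, that the source data was Latin-1 and was wrongly decoded as UTF-8; your actual encodings may differ.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def lossy_utf8(raw: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


e_acute = "é".encode("latin-1")  # b'\xe9'
e_grave = "è".encode("latin-1")  # b'\xe8'

# Two different source bytes collapse into the same character,
# so nothing in the string says which one it used to be.
after_acute = lossy_utf8(e_acute)
after_grave = lossy_utf8(e_grave)

# Re-encoding gives the UTF-8 bytes of U+FFFD itself, not the original byte.
round_trip = after_acute.encode("utf-8")

# The repair has to happen earlier: decode the ORIGINAL bytes
# with the encoding they were actually written in.
recovered = e_acute.decode("latin-1")
```

So the practical fix is to go back to wherever the bytes are first turned into text (file read, HTTP response, database driver) and specify the correct encoding there, rather than trying to patch the string afterwards.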

If you are only displaying the text (not parsing it), the interface may instead be rendering a character it does not know with the replacement glyph:

Options for rendering such unknown code points include printing the code point as four to six hexadecimal digits, printing a black or white box, or another substitute glyph, such as that commonly shown for U+FFFD
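One way to tell the two situations apart is to look at the code points instead of the glyphs. The following Python sketch does that; the helper names `code_points` and `contains_replacement` are illustrative, not part of any library.

```python
def code_points(text: str) -> list:
    """Return each character as a U+XXXX label (four to six hex digits)."""
    return [f"U+{ord(ch):04X}" for ch in text]


def contains_replacement(text: str) -> bool:
    """True if the string really contains U+FFFD."""
    return "\ufffd" in text


intact = "café"
damaged = "caf\ufffd"

intact_points = code_points(intact)    # last entry is U+00E9
damaged_points = code_points(damaged)  # last entry is U+FFFD
```

If the last code point is U+00E9, the data is fine and only the display (font or terminal encoding) is at fault. If it is U+FFFD, the character was already replaced upstream.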
