Revision e2ec3f97

ID e2ec3f976803b360c70d9ae2ba13852fa5d11665

Added by Markus Armbruster about 11 years ago

qjson: to_json() case QTYPE_QSTRING is buggy, rewrite

Known bugs in to_json():

  • A start byte for a three-byte sequence followed by fewer than two
    continuation bytes is split into one-byte sequences.
  • Start bytes for sequences longer than three bytes get misinterpreted
    as start bytes for three-byte sequences. Continuation bytes beyond
    byte three become one-byte sequences.

    This means all characters outside the BMP are decoded incorrectly.

  • One-byte sequences with the MSB set are put into the JSON string
    verbatim when char is unsigned, producing invalid UTF-8. When char
    is signed, they're replaced by "\\uFFFF" instead.

    This includes \xFE, \xFF, and stray continuation bytes.

  • Overlong sequences are happily accepted, unless screwed up by the
    bugs above.
  • Likewise, sequences encoding surrogate code points or noncharacters.
  • Unlike other control characters, ASCII DEL is not escaped. Except
    in overlong encodings.

My rewrite fixes them as follows:

  • Malformed UTF-8 sequences are replaced.

    Exception: the overlong encoding \xC0\x80 of U+0000 is still
    accepted. This permits embedding NUL characters in C strings. The
    trick is known as "Modified UTF-8".

  • Sequences encoding code points beyond Unicode range are replaced.
  • Sequences encoding code points beyond the BMP produce a surrogate
    pair.
  • Sequences encoding surrogate code points are replaced.
  • Sequences encoding noncharacters are replaced.
  • ASCII DEL is now always escaped.

The replacement character is U+FFFD.
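
The rules above can be illustrated with a minimal sketch. This is not the code from this commit: the names decode_utf8_sketch() and json_escape_sketch() are hypothetical, short escapes such as "\n" are left out, and lead bytes \xF8..\xFF are simply treated as single invalid bytes, so the number of replacement characters emitted for a malformed sequence may differ from what the real to_json() produces.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: decode one sequence from NUL-terminated input at *p
 * and advance *p.  Returns the code point, or -1 if it must be replaced.
 */
static int32_t decode_utf8_sketch(const unsigned char **p)
{
    const unsigned char *s = *p;
    int32_t cp, min;
    int len, i;

    if (s[0] < 0x80) {
        cp = s[0]; len = 1; min = 0;
    } else if ((s[0] & 0xE0) == 0xC0) {
        cp = s[0] & 0x1F; len = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        cp = s[0] & 0x0F; len = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        cp = s[0] & 0x07; len = 4; min = 0x10000;
    } else {
        *p = s + 1;                 /* stray continuation byte or \xF8..\xFF */
        return -1;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            *p = s + i;             /* truncated: resume at the offending byte */
            return -1;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *p = s + len;

    if (cp == 0 && len == 2) {
        return 0;                   /* \xC0\x80, the "Modified UTF-8" NUL */
    }
    if (cp < min || cp > 0x10FFFF) {
        return -1;                  /* overlong, or beyond Unicode range */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return -1;                  /* surrogate code point */
    }
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE) {
        return -1;                  /* noncharacter */
    }
    return cp;
}

/*
 * Hypothetical helper: escape IN into OUT (OUTSZ > 0 bytes), stopping early
 * when OUT is nearly full.  Returns the number of bytes written.
 */
static size_t json_escape_sketch(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    out[0] = '\0';
    while (*p && n + 13 <= outsz) { /* longest output: 12 bytes plus NUL */
        int32_t cp = decode_utf8_sketch(&p);

        if (cp < 0) {
            cp = 0xFFFD;            /* the replacement character */
        }
        if (cp == '"' || cp == '\\') {
            n += (size_t)snprintf(out + n, outsz - n, "\\%c", (char)cp);
        } else if (cp >= 0x20 && cp < 0x7F) {
            out[n++] = (char)cp;    /* printable ASCII; DEL is excluded */
            out[n] = '\0';
        } else if (cp <= 0xFFFF) {
            n += (size_t)snprintf(out + n, outsz - n, "\\u%04X", (unsigned)cp);
        } else {
            cp -= 0x10000;          /* beyond the BMP: surrogate pair */
            n += (size_t)snprintf(out + n, outsz - n, "\\u%04X\\u%04X",
                                  (unsigned)(0xD800 + (cp >> 10)),
                                  (unsigned)(0xDC00 + (cp & 0x3FF)));
        }
    }
    return n;
}
```

With this sketch, the input "\xF0\x9F\x98\x80" (U+1F600) comes out as the surrogate pair \uD83D\uDE00, while "\xED\xA0\x80" (an encoded surrogate) and a lone "\xFF" each come out as \uFFFD.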

Signed-off-by: Markus Armbruster
Reviewed-by: Laszlo Ersek
Signed-off-by: Blue Swirl
