from Hacker News

Ask HN: Super-summary to go from "grapheme" to "bytes"?

by zepearl on 4/30/24, 9:53 PM with 4 comments

Is this super-summary correct, to understand how what's shown on the user's screen is expanded into single bytes?

1) A user sees some character on his/her screen => that's a "grapheme", which is a collection of...

2) ...1 to N "Unicode code points", where a single "Unicode code point" can use...

3) ...1 to 6 "UTF-8" bytes.

Is that right (in the case of UTF-8 storage)?

(I feel like I'm missing an intermediate step...)
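For illustration, the three steps above can be sketched in Python. This is only a minimal sketch: the sample string is an assumption chosen because it is known to be a single grapheme, and Python's standard library does not segment arbitrary text into graphemes for you.

```python
# Step 1: one user-perceived character ("e" with an acute accent), written here
# as two code points: U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT.
grapheme = "e\u0301"

# Step 2: a Python str is a sequence of Unicode code points.
code_points = [f"U+{ord(ch):04X}" for ch in grapheme]

# Step 3: each code point is encoded as one or more UTF-8 bytes.
bytes_per_code_point = [ch.encode("utf-8") for ch in grapheme]

print(code_points)                      # ['U+0065', 'U+0301']
print(bytes_per_code_point)             # [b'e', b'\xcc\x81']
print(len(grapheme.encode("utf-8")))    # 3
```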

(indirectly related to "You can't just assume UTF-8" https://news.ycombinator.com/item?id=40195009 , comment https://news.ycombinator.com/item?id=40206149 , link mentioned in that comment being https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ )

Thx :o)

  • by nuc1e0n on 5/1/24, 9:04 PM

    Code points can only take 1 to 4 UTF-8 bytes. UTF-8's bit pattern can extend up to 6 bytes, but Unicode code points stop at U+10FFFF (1,114,111 in decimal), and U+10FFFF takes 4 bytes to encode in UTF-8 in non-overlong form. I guess you could encode it overlong, but valid UTF-8 always uses the shortest form, so anything else should be treated as invalid and potentially harmful.
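    A quick way to check this is with Python's built-in UTF-8 codec. This is a rough sketch: the helper names `utf8_len` and `is_valid_utf8` are made up for illustration, and the sample values are just the boundaries between the 1, 2, 3 and 4 byte ranges.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes Python's UTF-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Boundary code points for each encoded length, up to the maximum U+10FFFF.
samples = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_len(cp) for cp in samples]
print(lengths)  # [1, 2, 2, 3, 3, 4, 4]

# "/" is U+002F and must be the single byte 0x2F; the two-byte overlong
# form 0xC0 0xAF is rejected by the decoder.
print(is_valid_utf8(b"/"))                  # True
print(is_valid_utf8(b"\xc0\xaf"))           # False

# A 4-byte pattern that would decode to U+110000 is past the limit and also rejected.
print(is_valid_utf8(b"\xf4\x90\x80\x80"))   # False
```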
  • by nuc1e0n on 5/3/24, 3:41 PM

    Also, I think the step you feel you are missing is the one where code points are shaped into glyphs (including ligatures) and the text is laid out on screen. IIRC Google Chrome uses HarfBuzz for shaping and Skia for drawing; Pango does a similar job in GTK applications. https://en.wikipedia.org/wiki/Complex_text_layout