Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
On 7/10/2012 3:45 PM, lbrt chx _ gemale wrote:
On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
How can you get the number of bytes you "get()"?
Well, UTF-8 always encodes the same char to the same (number of) bytes,
doesn't it?
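Right, and you can check this directly: the number of bytes is a pure function of the code point's value (1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, 4 beyond). A quick sketch in Python:

```python
# UTF-8 encodes each code point to a fixed byte sequence, so the
# byte length depends only on the code point's numeric value.
for ch in ["A", "\u00e9", "\u20ac", "\U00010348"]:  # A, é, €, 𐍈
    print("U+%04X -> %d byte(s)" % (ord(ch), len(ch.encode("utf-8"))))
# A (U+0041)  -> 1 byte
# é (U+00E9)  -> 2 bytes
# € (U+20AC)  -> 3 bytes
# 𐍈 (U+10348) -> 4 bytes
```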
~
What about files whose authors claim they are UTF-8 encoded but which aren't, and/or which somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages
~
Sometimes you must double-check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
~
I don't see how knowing the char -> length mapping is going to help you
in this case. If your input is a blob of bytes which someone claims is
UTF-8 but isn't, you can set up decoders to throw an error, or at least
to substitute the replacement char (U+FFFD), which makes it detectable
that someone screwed up.
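For instance, in Python both behaviors are one error-handler away (the byte string below is just a made-up invalid-UTF-8 sample):

```python
data = b"caf\xe9"  # Latin-1 bytes for "café"; 0xE9 is not valid UTF-8 here

# Strict decoding (the default) raises, so corruption surfaces immediately.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e.reason)

# errors="replace" substitutes U+FFFD, leaving a detectable marker instead.
text = data.decode("utf-8", errors="replace")
print(text, "\ufffd" in text)  # caf<U+FFFD> True
```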
The other problem is: if it's not UTF-8, what is it then? The heuristics
for this kind of stuff are incredibly squirrelly, and it more or less turns
out that the most reliable way to fix it is to know the default charset
of the computer spitting data out at you. Even then, there's still a
possibility that its input was screwed up in a similar fashion: I've
seen one message undergo the standard I-thought-your-UTF-8-was-ISO-8859-1
mangling twice, so that every non-ASCII character ended up as 4 gibberish
characters.
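That double-mangling is easy to reproduce: decode UTF-8 bytes as ISO-8859-1, re-encode, and repeat. Each round doubles the byte count of what was originally one 2-byte character:

```python
s = "\u00e9"  # é: 2 bytes in UTF-8

# Round 1: UTF-8 bytes misread as ISO-8859-1 -> 2 gibberish chars ("Ã©")
once = s.encode("utf-8").decode("iso-8859-1")

# Round 2: the gibberish re-encoded and misread again -> 4 gibberish chars
twice = once.encode("utf-8").decode("iso-8859-1")

print(repr(once), len(once))    # 2 characters
print(repr(twice), len(twice))  # 4 characters
```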
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth