Re: isspace
On Jan 29, 8:09 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
[...]
Ok, well, suppose I want to use UTF-8 encoding, how do I specify it
With UTF-8 one is using char, not wchar_t. Note that if char
is a signed type, then one must take care to cast char to
unsigned char in places where a non-negative value is
expected.
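A minimal sketch of that cast (the function name countSpaces
is mine, not anything standard):

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Counts whitespace bytes in a narrow string. The cast to
    // unsigned char keeps isspace() from receiving a negative
    // value (undefined behavior) when plain char is signed.
    std::size_t countSpaces(std::string const& s)
    {
        std::size_t n = 0;
        for (std::string::const_iterator it = s.begin();
                it != s.end(); ++it) {
            if (std::isspace(static_cast<unsigned char>(*it)))
                ++n;
        }
        return n;
    }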
He didn't make clear whether he meant internal or external
encoding. One can use UTF-8 externally (and probably should for
any new projects), and still use wchar_t and UTF-16 or UTF-32
internally.
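For instance, a minimal decoder from external UTF-8 to an
internal UTF-32 sequence might look like this (validation is
mostly skipped, and the function name is mine):

    #include <string>
    #include <vector>

    // Decodes external UTF-8 into internal UTF-32 code points.
    // Malformed input is skipped rather than diagnosed.
    std::vector<unsigned long> utf8ToUtf32(std::string const& in)
    {
        std::vector<unsigned long> out;
        std::string::size_type i = 0;
        while (i < in.size()) {
            unsigned char b = static_cast<unsigned char>(in[i]);
            unsigned long cp;
            int extra;
            if (b < 0x80)      { cp = b;        extra = 0; }
            else if (b < 0xC0) { ++i; continue; } // stray byte
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else               { cp = b & 0x07; extra = 3; }
            ++i;
            for (int k = 0; k < extra && i < in.size(); ++k, ++i)
                cp = (cp << 6)
                    | (static_cast<unsigned char>(in[i]) & 0x3F);
            out.push_back(cp);
        }
        return out;
    }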
For historical reasons, the locale and encoding stuff has been
mixed up.
The reasons aren't just historical. Functions like isalpha have
to know the encoding if they are to work. Logically, of course,
locale and encoding are, or should be, two completely separate
concepts, but practically, at the technical level, that would
mean specifying both a locale and an encoding for things like
isalpha. (Note that the design of <locale> leaves a bit to be
desired here, since it links isalpha purely to the ctype facet;
logically, it should depend on both ctype and codecvt.
Practically, however, I'll admit that I wouldn't like to
implement a design that handled this correctly.)
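To illustrate: with the <locale> version of isalpha, the
encoding is implicit in the locale's name, and only the ctype
facet is consulted (the locale name here is an assumption and
may not be installed on a given system):

    #include <iostream>
    #include <locale>
    #include <stdexcept>

    int main()
    {
        try {
            std::locale loc("fr_FR.ISO8859-1");
            // std::isalpha consults only the ctype facet of
            // loc; the codecvt facet plays no role, even
            // though the answer depends on the encoding.
            std::cout << std::isalpha('\xE9', loc) << '\n';
        } catch (std::runtime_error const& e) {
            std::cout << "no such locale: " << e.what() << '\n';
        }
        return 0;
    }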
Are you more interested in locales or in encodings? Locales
affect such things as the character used to represent the
decimal point in numbers, the format of dates, whether V and
W are sorted together or separately, and whether Cyrillic
characters are considered alphabetic characters or not.
Encoding is a completely different business, specifying for
example how those Cyrillic characters are encoded in the
binary data, if at all.
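The decimal point, for example, comes from the numpunct facet
(the locale name de_DE is an assumption; availability is
platform-specific):

    #include <iostream>
    #include <locale>
    #include <stdexcept>

    int main()
    {
        try {
            std::locale de("de_DE.UTF-8");
            std::cout << "decimal point: "
                << std::use_facet<std::numpunct<char> >(de)
                      .decimal_point()
                << '\n';    // ',' in a German locale
        } catch (std::runtime_error const&) {
            std::cout << "locale not installed\n";
        }
        return 0;
    }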
The character encoding does affect whether isalpha(0xE9) should
return true (ISO 8859-1) or false (UTF-8).
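A short demonstration with the C isalpha (the locale names are
glibc-style assumptions and may differ or be absent elsewhere):

    #include <cctype>
    #include <clocale>
    #include <cstdio>

    int main()
    {
        if (std::setlocale(LC_CTYPE, "en_US.iso88591"))
            std::printf("Latin-1: %d\n",
                std::isalpha(0xE9) != 0);  // 0xE9 is e-acute: 1
        if (std::setlocale(LC_CTYPE, "en_US.utf8"))
            std::printf("UTF-8:   %d\n",
                std::isalpha(0xE9) != 0);  // 0xE9 is a lead byte: 0
        return 0;
    }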
If you just want to translate between different encodings,
then you do not need any locale stuff at all. When a web page
comes in, you do not know whether the decimal point used in
the numbers therein is a dot or a comma, for example, so
strictly speaking you cannot set the correct locale for
processing the page. What you can do is look at BOM markers
and charset declarations, and translate the file from its
charset to the encoding you are using internally. For that,
again, no locales are needed; instead one needs some kind of
system-specific code or a library like iconv.
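A sketch of such a translation with POSIX iconv (error
handling is minimal, the helper name is mine, and whether
iconv takes char** or const char** for its input varies by
platform):

    #include <iconv.h>
    #include <cerrno>
    #include <stdexcept>
    #include <string>

    // Translates in from fromCharset to UTF-8. The charset
    // names are whatever the system's iconv accepts.
    std::string toUtf8(std::string const& in,
                       char const* fromCharset)
    {
        iconv_t cd = iconv_open("UTF-8", fromCharset);
        if (cd == (iconv_t)-1)
            throw std::runtime_error("unsupported charset");
        std::string out;
        // const_cast because glibc's iconv takes char**, even
        // though it does not modify the input.
        char* inPtr = const_cast<char*>(in.data());
        size_t inLeft = in.size();
        char buf[4096];
        while (inLeft > 0) {
            char* outPtr = buf;
            size_t outLeft = sizeof buf;
            if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft)
                    == (size_t)(-1) && errno != E2BIG) {
                iconv_close(cd);
                throw std::runtime_error("conversion error");
            }
            out.append(buf, outPtr - buf);
        }
        iconv_close(cd);
        return out;
    }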
Strictly speaking, when a web page comes in, you don't even
know how comma or dot are encoded in it. In practice, all of
the codesets used in web pages have the first 128 values in
common. And the header should be written using just those
values up to the point where it specifies the encoding. (Also
in practice, a lot of headers don't bother to specify the
encoding, so it's worthwhile to develop some pragmatic
heuristics to guess it. If the data starts with a BOM, then
it's Unicode, and the BOM will allow you to determine the
format. If the data contains 0's in the first four bytes, it's
almost certainly some format of UTF-16 or UTF-32, and you can
determine which by the number and position of the zeros.
Otherwise, I'd treat it as undetermined but ASCII-based until
I encountered a byte value larger than 127: if that byte value
started a legal UTF-8 sequence, I'd shift to UTF-8, otherwise
to ISO-8859-1, but that's really just a guess.)
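A sketch of that heuristic (the function name and return
strings are mine; as said above, it is only a guess):

    #include <cstddef>
    #include <string>

    std::string guessEncoding(unsigned char const* p,
                              std::size_t n)
    {
        if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB
                && p[2] == 0xBF)
            return "UTF-8";             // UTF-8 BOM
        if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE
                && p[2] == 0 && p[3] == 0)
            return "UTF-32LE";          // BOM FF FE 00 00
        if (n >= 4 && p[0] == 0 && p[1] == 0
                && p[2] == 0xFE && p[3] == 0xFF)
            return "UTF-32BE";          // BOM 00 00 FE FF
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
            return "UTF-16LE";          // BOM FF FE
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
            return "UTF-16BE";          // BOM FE FF
        // Zeros near the start suggest UTF-16/UTF-32 without
        // a BOM; their positions indicate width and order.
        if (n >= 4 && (p[0] == 0 || p[1] == 0
                || p[2] == 0 || p[3] == 0))
            return "UTF-16 or UTF-32 (check zero positions)";
        // Otherwise scan: ASCII until proven otherwise; the
        // first non-ASCII byte decides UTF-8 vs. ISO-8859-1.
        for (std::size_t i = 0; i < n; ++i) {
            if (p[i] > 0x7F) {
                // Legal UTF-8? Lead byte 0xC2..0xF4 followed
                // by a continuation byte 0x80..0xBF.
                if (p[i] >= 0xC2 && p[i] <= 0xF4
                        && i + 1 < n
                        && (p[i + 1] & 0xC0) == 0x80)
                    return "UTF-8 (probably)";
                return "ISO-8859-1 (probably)";
            }
        }
        return "ASCII so far";
    }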
using locale? And where can I find a list of the possible
locale/encoding configurations (e.g. if I wanted to correctly
decode a web page by just parsing the first bytes looking for
'charset')?
http://www.iana.org/assignments/character-sets
But that doesn't tell you what the name of the locale on your
system might be.
--
James Kanze