Re: portable Unicode programming.

From: "James Kanze" <james.kanze@gmail.com>
Newsgroups: comp.lang.c++.moderated
Date: Mon, 29 Jan 2007 15:22:09 CST
Message-ID: <1169984687.894747.174460@q2g2000cwa.googlegroups.com>

Lance Diduck wrote:

[...]

5. OK let's say that the C++ community did agree (ha!) on a common
encoding UTFx, and then did agree on a normal form for collation (for
example, last time I looked JavaScript uses UTF16LE using Normal Form
3 for comparisons) and string was changed to accommodate it. Then can
you imagine all the howling "The Spirit of C++ as Uttered by Dr
Stroustrup Himself Is That You Don't Have To Pay For What You Don't
Use." I predict few takers, esp since the vast majority of C++
programs never seem to use anything beyond the printable set of
ASCII,

While I found the rest of your article very good, I can't let
this by. I've never written a program that didn't use
characters not in the printable set of ASCII, and the same is
true for all of my colleagues. That's a very parochial point of
view: English is not the only language in the world, and most
other languages do need more than just ASCII. Also, the cost
would only occur if you used the class; if you don't need it,
you just continue with the current std::string.

The problem is more that we still lack sufficient experience in
the domain to be able to define exactly what is needed.

[...]

If you don't care about comparing strings using the proper semantics,
or would use collate::compare instead of string::operator< and ==,
then you may be in business. One thing is for certain -- you get
almost no abstraction beyond "I am a contiguous sequence of bytes."
There are a few standard library and locale installations that will
at least let you get some rudimentary functionality out of
ctype::is_space, ctype::is_digit, etc passing in Unicode. Don't ask me
if passing in Unicode roman numerals qualifies as a digit!
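To make the distinction between string::operator< and collate::compare
concrete, here is a minimal sketch (C++11 syntax). The helper name
collate_less is an assumption made up for this illustration; only
std::collate<char>::compare and std::use_facet are standard. What the
comparison actually does for non-ASCII text depends entirely on the
locale passed in and on how the implementation supports that locale.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: true if a sorts before b according to the
// collate<char> facet of loc. With std::locale::classic() this is a
// plain lexicographic comparison; a named locale may order differently.
bool collate_less(const std::string& a,
                  const std::string& b,
                  const std::locale& loc)
{
    const std::collate<char>& coll =
        std::use_facet<std::collate<char>>(loc);
    // compare() takes two [begin, end) ranges and returns -1, 0 or 1.
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}
```

Since std::locale also provides an operator() that performs the same
comparison, a locale object can be passed directly as the comparator to
std::sort, e.g. std::sort(v.begin(), v.end(), loc).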

The problem is that for many applications, the most natural
format of Unicode is UTF-8, for which ctype is completely
useless.
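A small sketch of why this is so (C++11 syntax; the function names
utf8_length and any_byte_is_alpha are made up for the illustration).
The ctype<char> facet classifies one char at a time through
ctype::is with a ctype_base mask, but in UTF-8 any character outside
ASCII occupies two to four bytes, so the facet only ever sees
fragments of it. The sketch assumes valid UTF-8 input and uses the
classic "C" locale.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes
// (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// True if any single byte of s is classified as alphabetic by the
// ctype<char> facet of the classic "C" locale.
bool any_byte_is_alpha(const std::string& s)
{
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char>>(std::locale::classic());
    for (char c : s) {
        if (ct.is(std::ctype_base::alpha, c)) {
            return true;
        }
    }
    return false;
}
```

For the two-byte sequence "\xC3\xA9" (U+00E9, e with acute accent),
utf8_length reports one code point while std::string::size reports
two bytes, and neither byte on its own is alphabetic as far as the
classic locale is concerned.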

There are Unicode C++ libraries. There is the C++ version of ICU,
however, it really just looks like someone turned on a Java-to-C++
machine translator, and placed the flotsam in a tar for download. (Last
I checked a couple years ago, ICU UnicodeString has "bogus semantics" --
if it is loaded with an invalid byte sequence, you check this by
calling isBogus(). Honest. Memory allocation is configured by
recompiling the ENTIRE library. Hopefully someone upgraded it out of
its misery). A little more sane offering, that will cost you, is the
one by RogueWave. That one really does look like a Unicode String
written by a C++ developer.

I doubt that the C++ community would go as far as Javascript in
specifying how Unicode would be used. However, it would be nice to see
a standard Unicode utilities library in C++. I think that is something
that would be useful. As it stands, C++ can do Unicode, just not in a
portable way. Sun and Linux want to use UTF32 (and of course Sun uses
LE and Linux I imagine BE), and Microsoft and IBM use UTF16 (again two
different flavors).

As long as you're in the program itself, that shouldn't cause
too much of a problem. And UTF-8 makes a good compromise for
external format (and is often the most reasonable choice in the
program itself).
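As a sketch of what using UTF-8 as the external format can look like
(C++11 syntax): whatever the program uses internally, each code point
is written out as one to four bytes, and since UTF-8 is a byte
sequence there is no LE/BE variant to agree on. The function name
append_utf8 is an assumption for the example, not an existing library
call.

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to
// out. Returns false and appends nothing for surrogates
// (U+D800..U+DFFF) and for values above U+10FFFF.
bool append_utf8(char32_t cp, std::string& out)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return false;
    }
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Calling it once per code point of an internal UTF-32 string (or per
decoded code point of a UTF-16 one) yields the same bytes on every
platform, which is the property that matters for files and network
data.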

--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
