XMLEventWriter and numeric character references
Hi
I have some text, well-formed XML, US-ASCII encoded, which contains
some numeric character references outside the BMP. I process this text
with the StAX event iterator API shipped with the latest (as of this
writing) Oracle JDK. All is well with the process except for a couple of
hiccups: an XML declaration is prepended, and each such character
reference is printed as a surrogate pair of character references rather
than the single numeric character reference I desire.
I cannot find a way to change this behavior with the StAX event
iterator API and am hoping that there is a way, and that you can help
me find it. I would not like to have to change APIs just to eliminate
the minor irritation of the prepended declaration and the annoying
hindrance of the pair of character references.
A code sample follows. After it, I describe my motivation for those who
might be interested.
    import java.io.StringReader;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;

    public class HiccupingProcessor {
        public static void main(String[] args)
                throws XMLStreamException {
            XMLEventReader eventReader =
                XMLInputFactory
                    .newInstance()
                    .createXMLEventReader(
                        new StringReader("<example>&#x1D43A;</example>"));
            XMLEventWriter eventWriter =
                XMLOutputFactory
                    .newInstance()
                    .createXMLEventWriter(
                        System.out, "US-ASCII");
            while (eventReader.hasNext()) {
                // some irrelevant processing goes here
                eventWriter.add(eventReader.nextEvent());
            }
            eventWriter.close();
            // prints:
            // <?xml version="1.0" ?><example>&#xd835;&#xdc3a;</example>
            // rather than the desired:
            // <example>&#x1D43A;</example>
        }
    }
I edit this text in my text editor, which supports XML syntax
highlighting and folding. I then process it and may edit it again,
possibly cycling through several iterations. I want to be able to easily
see characters outside US-ASCII, and the syntax highlighting makes that
possible. If I don't remember what a particular character is, I can
easily look it up by its code point, but not so with the surrogate pair:
then I must first transcode the pair into a single code point. The
prepended XML declaration is not needed at this stage and has become a
simple irritation.
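
To make the transcoding step concrete, here is a minimal sketch of what I
currently have to do by hand, written as a post-processing pass over the
output text. It is not a StAX setting and does not answer my question; the
class name SurrogateReferenceMerger and the method merge are my own
hypothetical names, and the sketch assumes the pair is written as two
adjacent hexadecimal references, as in the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateReferenceMerger {
    // A hex reference to a high surrogate (D800-DBFF) immediately
    // followed by a hex reference to a low surrogate (DC00-DFFF).
    private static final Pattern SURROGATE_PAIR = Pattern.compile(
            "&#x(d[89ab][0-9a-f]{2});&#x(d[c-f][0-9a-f]{2});",
            Pattern.CASE_INSENSITIVE);

    // Replaces each surrogate pair of references with one reference
    // to the supplementary code point the pair encodes.
    public static String merge(String xml) {
        Matcher m = SURROGATE_PAIR.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char high = (char) Integer.parseInt(m.group(1), 16);
            char low = (char) Integer.parseInt(m.group(2), 16);
            int codePoint = Character.toCodePoint(high, low);
            String single =
                "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(single));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <example>&#x1D43A;</example>
        System.out.println(merge("<example>&#xd835;&#xdc3a;</example>"));
    }
}
```

Decimal references are not handled, and running a regular expression over
serialized XML would also touch matching text inside CDATA sections or
comments, which is part of why I would rather have the writer emit the
single reference in the first place.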