XMLEventWriter and numeric character references

From: Jeff Higgins <jeff@invalid.invalid>
Newsgroups: comp.lang.java.help
Date: Tue, 05 Feb 2013 03:52:57 -0500
Message-ID: <keqget$g1q$1@dont-email.me>

Hi
I have some text, well-formed XML encoded in US-ASCII, which contains
numeric character references to characters outside the BMP. I process
this text with the StAX event iterator API shipped with the latest (as
of this writing) Oracle JDK. All is well except for two hiccups: an XML
declaration is prepended to the output, and each such character
reference is written as a pair of surrogate character references
rather than the single numeric character reference I want.

I cannot find a way to change this behavior with the StAX event
iterator API and am hoping that there is a way, and that you can help
me find it. I would not like to have to change APIs just to eliminate
the minor irritation of the prepended declaration and the annoying
hindrance of the pair of character references.

A code sample follows. After it I describe my motivation, for those
who might be interested.

import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

public class HiccupingProcessor {

   public static void main(String[] args)
     throws XMLStreamException {

     XMLEventReader eventReader =
       XMLInputFactory
         .newInstance()
         .createXMLEventReader(
           new StringReader("<example>&#x1D43A;</example>"));

     XMLEventWriter eventWriter =
       XMLOutputFactory
         .newInstance()
         .createXMLEventWriter(
             System.out, "US-ASCII");

     while (eventReader.hasNext()) {
       // some irrelevant processing goes here
       eventWriter.add(eventReader.nextEvent());
     }
     eventWriter.close();

     // prints:
     // <?xml version="1.0" ?><example>&#xd835;&#xdc3a;</example>

     // rather than the desired:
     // <example>&#x1D43A;</example>
   }
}

I edit this text in a text editor that supports XML syntax
highlighting and folding. I then process it and may edit it again,
possibly through several iterations. I want to be able to see
characters outside US-ASCII easily, and the syntax highlighting of
character references makes that possible. If I don't remember what a
particular character is, I can easily look it up by its code point,
but not so with a surrogate pair: I must first transcode the pair to
a code point. The prepended XML declaration is not needed at this
stage and has become a simple irritation.
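
For reference, the transcoding step I mean can be sketched as below.
This is only an illustration of the lookup chore, not a fix for the
writer's behavior. The class name SurrogatePairLookup and its two
helper methods are made up for this post; the java.lang.Character
calls are the standard ones.

```java
public class SurrogatePairLookup {

    // Combine two UTF-16 surrogate values, as they appear in the
    // writer's output (e.g. 0xD835 and 0xDC3A), into one code point.
    public static int combine(int high, int low) {
        char h = (char) high;
        char l = (char) low;
        if (!Character.isSurrogatePair(h, l)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(h, l);
    }

    // Format a code point as a single hexadecimal character reference.
    public static String toReference(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    public static void main(String[] args) {
        int codePoint = combine(0xD835, 0xDC3A);
        // Prints the single reference I would like the writer to emit.
        System.out.println(toReference(codePoint));
        // Prints the character's Unicode name (Java 7 or later).
        System.out.println(Character.getName(codePoint));
    }
}
```

Having to run something like this for every pair before I can look a
character up is exactly the hindrance I would like to avoid.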