Re: CSV Parsing algorithms in Java
Simon Brooke wrote:
> Heavens, writing a CSV parser is trivial. It's simply a case of a
> StringTokenizer in a for loop:
[...]
There is no one official "CSV format," but even the simple
version described at http://www.wotsit.org/ is not parseable by
a mere StringTokenizer (which the JavaDoc calls a "legacy class"
whose use in new code is "discouraged," by the way).

    Brooke, 21 Elm Street
        // space before '2' should vanish but embedded spaces
        // should remain

    "Brooke, Simon" , 21 Elm Street
        // first comma does not end a field, quotes disappear,
        // both spaces surrounding second comma disappear

    "Brooke, Simon" , """The Beeches"", Herts"
        // doubled quotes become singles, only one of the three
        // commas is a field separator, more disappearing and
        // retained spaces

    "Brooke, Simon" , "21 Elm Street
    Apartment 3B"
        // embedded newline in second field

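
To make the failure concrete, here is a minimal sketch that feeds the
second sample line to a StringTokenizer split on commas. The names
TokenizerDemo and naiveSplit are made up for illustration; only
StringTokenizer itself comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {

    // Hypothetical helper: the "StringTokenizer in a loop" approach.
    static List<String> naiveSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Second sample line from above: a quoted field containing a comma.
        for (String f : naiveSplit("\"Brooke, Simon\" , 21 Elm Street")) {
            System.out.println("[" + f + "]");
        }
    }
}
```

The tokenizer cuts at the comma inside the quotes, so the line comes
back as three tokens instead of two, with the quote characters and the
surrounding spaces still attached.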
Parsing CSV -- even allowing for some variations beyond the
wotsit description -- is not difficult, but neither is it
trivial. My own CSVReader class runs to 376 lines, including
JavaDoc. (It could probably be tightened a bit; I wrote it as an
exercise when I was new to Java and would likely do things
differently nowadays.)
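
For a sense of the bookkeeping involved, here is a rough sketch of a
character-at-a-time parser that copes with the four sample lines above.
It is not the CSVReader class just mentioned; the names CsvSketch,
parse and finish are invented for illustration. The rules it applies
(trim unquoted fields, keep quoted content verbatim, treat a doubled
quote as one quote, skip blank lines, tolerate stray characters) are
assumptions drawn from those samples, not from any specification.

```java
import java.util.ArrayList;
import java.util.List;

class CsvSketch {

    // Unquoted fields lose surrounding whitespace; quoted content is kept as-is.
    private static String finish(StringBuilder field, boolean wasQuoted) {
        return wasQuoted ? field.toString() : field.toString().trim();
    }

    // Parses the whole text into records, each a list of field values.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;   // between an opening and a closing quote
        boolean wasQuoted = false;  // current field contained a quoted section

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);          // commas and newlines are data here
                } else if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"');        // doubled quote becomes one quote
                    i++;
                } else {
                    inQuotes = false;         // closing quote
                }
            } else if (c == '"' && !wasQuoted && field.toString().trim().isEmpty()) {
                field.setLength(0);           // drop spaces before the opening quote
                inQuotes = true;
                wasQuoted = true;
            } else if (c == ',') {
                record.add(finish(field, wasQuoted));
                field.setLength(0);
                wasQuoted = false;
            } else if (c == '\n') {
                // End of record; a completely empty line is skipped (assumption).
                if (wasQuoted || field.length() > 0 || !record.isEmpty()) {
                    record.add(finish(field, wasQuoted));
                    records.add(record);
                    record = new ArrayList<>();
                }
                field.setLength(0);
                wasQuoted = false;
            } else if (c == '\r') {
                // Ignore carriage returns outside quotes so CRLF input works.
            } else if (!(wasQuoted && Character.isWhitespace(c))) {
                field.append(c);              // spaces after a closing quote are dropped
            }
        }
        // Flush the last record if the input does not end with a newline.
        if (wasQuoted || field.length() > 0 || !record.isEmpty()) {
            record.add(finish(field, wasQuoted));
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "\"Brooke, Simon\" , \"\"\"The Beeches\"\", Herts\"\n";
        for (String f : parse(sample).get(0)) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Even this stripped-down version needs two flags and a one-character
lookahead, and it does no error reporting at all (an unterminated quote
simply swallows the rest of the input), which is some of what a fuller
reader would have to add.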
--
Eric Sosman
esosman@acm-dot-org.invalid