Re: Write a program that reads a Java source-code file and displays all the comments.

From: Lew <lew@lewscanon.com>
Newsgroups: comp.lang.java.help
Date: Sat, 23 Feb 2008 13:08:45 -0500
Message-ID: <C8udncZdOsqw_F3anZ2dnUVZ_hadnZ2d@comcast.com>

anon36@yahoo.com wrote:

I am trying to do exercise 17 on page 546 of Bruce Eckel's Thinking In
Java (4th edition):
"Write a program that reads a Java source-code file (you provide the
file name on the command line) and displays all the comments."
This is at the end of a section about regular expressions. We have
just learnt how to use appendReplacement().

Does anyone have a solution to this exercise? or any hints?

Joshua Cranmer wrote:

The way I would do it would be to create a Reader on the file, read each
character and perform simple lexical analysis there, like so:

boolean inEOLComment = false, inCComment = false, inString = false;
for each character in stream:
  if inEOLComment:
     print character
     if character is newline, inEOLComment = false
  else if inCComment:
     if character is * and next is /, inCComment = false and skip next character
     else print character
  else if inString:
     if character is \, skip next character
     else if character is ", inString = false
  else if character is /:
     if next character is /, inEOLComment = true and skip next character
     else if next character is *, inCComment = true and skip next character
  else if character is ", inString = true
  else, do nothing

(writing the actual Java code is left as an exercise to the reader)
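
For illustration, the pseudocode above might be turned into Java roughly as
follows. This is only a sketch under a few assumptions: the names
CommentScanner and extractComments are made up for the example, the whole file
is read into a String (rather than streamed through a Reader) so that looking
at the next character is a simple index check, and it tracks character
literals as well, which the pseudocode does not, so that '"' is not mistaken
for the start of a string. Unicode escapes and text blocks are not handled.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class CommentScanner {
    private enum State { CODE, LINE_COMMENT, BLOCK_COMMENT, STRING, CHAR }

    /** Returns the bodies of all comments in source, without their delimiters. */
    static String extractComments(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            // '\0' stands in for "no next character" at end of input.
            char next = (i + 1 < source.length()) ? source.charAt(i + 1) : '\0';
            switch (state) {
                case LINE_COMMENT:
                    out.append(c);
                    if (c == '\n') state = State.CODE;
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        out.append('\n');   // separate this comment from the next
                        i++;                // consume the '/'
                        state = State.CODE;
                    } else {
                        out.append(c);
                    }
                    break;
                case STRING:
                case CHAR:
                    if (c == '\\') {
                        i++;                // skip the escaped character
                    } else if (c == (state == State.STRING ? '"' : '\'')) {
                        state = State.CODE; // closing quote
                    }
                    break;
                default: // CODE
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i++;                // consume the second '/'
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i++;                // consume the '*'
                    } else if (c == '"') {
                        state = State.STRING;
                    } else if (c == '\'') {
                        state = State.CHAR;
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CommentScanner <file.java>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.print(extractComments(new String(bytes, StandardCharsets.UTF_8)));
    }
}
```

Run as "java CommentScanner Foo.java"; like the pseudocode, it prints the text
of each comment and drops the "//", "/*" and "*/" delimiters themselves.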

Another approach is to borrow from lex/yacc and have a "lexer" extract tokens
from the input, each tagged with a token-type enum value such as "string
literal", "identifier/keyword" or "punctuation". The output of the lexer
becomes the input to the parser, which examines each token and its type, and
operates a state machine with, say, states of NOT_IN_COMMENT,
IN_SINGLE_LINE_COMMENT and IN_MULTI_LINE_COMMENT.

You run the parser in a loop, with the interpretation of each token depending
on the current state. So, for example, in IN_SINGLE_LINE_COMMENT state you
treat each token as comment text up until the LINE_END token. In
IN_MULTI_LINE_COMMENT state you do the same until you reach the END_COMMENT
token ("*/"). Comment delimiters inside a string literal do not trigger a
false comment start or END_COMMENT, because the lexer has already folded the
whole literal into a single token. The parser will not see a string-embedded
"/*" or "*/", it'll only see a STRING_LITERAL token.
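
A rough sketch of the token-based idea, using java.util.regex since the
exercise comes from the regular-expressions chapter. It deliberately
simplifies the scheme described above: the lexer here recognizes a whole
comment as a single token instead of leaving the comment states to the parser,
because quote characters inside a comment would otherwise confuse a lexer that
does not know it is inside one. The names CommentLexer, TokenType, Token, lex
and comments are invented for the example, and only the token kinds that
matter for this exercise are produced; all other source text is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentLexer {
    enum TokenType { LINE_COMMENT, BLOCK_COMMENT, STRING_LITERAL, CHAR_LITERAL }

    static final class Token {
        final TokenType type;
        final String text;
        Token(TokenType type, String text) { this.type = type; this.text = text; }
    }

    // One alternative per token kind. Whichever token starts first in the
    // input is matched whole, so a "//" or "/*" inside a string literal is
    // swallowed by the string alternative and never seen as a comment.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<line>//[^\\n]*)"                      // comment to end of line
        + "|(?<block>/\\*(?s:.*?)\\*/)"             // /* ... */ (may span lines)
        + "|(?<str>\"(?:\\\\.|[^\"\\\\\\n])*\")"    // "..." with backslash escapes
        + "|(?<chr>'(?:\\\\.|[^'\\\\\\n])*')");     // '...' with backslash escapes

    /** Lexer: turns the source into a list of typed tokens. */
    static List<Token> lex(String source) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            TokenType type;
            if (m.group("line") != null) type = TokenType.LINE_COMMENT;
            else if (m.group("block") != null) type = TokenType.BLOCK_COMMENT;
            else if (m.group("str") != null) type = TokenType.STRING_LITERAL;
            else type = TokenType.CHAR_LITERAL;
            tokens.add(new Token(type, m.group()));
        }
        return tokens;
    }

    /** Consumer: looks only at each token's type and keeps the comments. */
    static List<String> comments(String source) {
        List<String> result = new ArrayList<>();
        for (Token t : lex(source)) {
            if (t.type == TokenType.LINE_COMMENT || t.type == TokenType.BLOCK_COMMENT) {
                result.add(t.text);
            }
        }
        return result;
    }
}
```

Calling CommentLexer.comments(source) returns each comment with its delimiters
intact; string and character literals are lexed only so that their contents
cannot be mistaken for comments.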

--
Lew