Re: Accessing attributes in HTML with DOM

From: "Daniel Pitts" <googlegroupie@coloraura.com>
Newsgroups: comp.lang.java.programmer
Date: 16 Jan 2007 14:22:40 -0800
Message-ID: <1168986159.934529.321340@l53g2000cwa.googlegroups.com>

Damo wrote:

Hi,
I'm trying to extract text from an HTML page using DOM. I ran JTidy
on it first. The HTML itself is not very descriptive: there are no
standout tags around the text I need to extract. The way I was
thinking of doing it was by accessing the attributes, but I keep getting a
NullPointerException. This is the HTML:

<div class="mb16">
<div id="r_t0" class="prel">
<a id="r0_t" class="L4"href="http://java.sun.com/"">
<b>Java</b> Technology</a></div>
<div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
extensions, news, tutorials, and product information.</div>
<div id="r_b0" class="prel T11"><a id="r0_b"
href="http://java.sun.com/">
<img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0"class="bb"
/></a>
<span id="r0_u" class="T10">java.sun.com/</span>
<strong>·</strong> <a class="L5 nw"
href="http://www.askcache.com">
Cached</a> 1f40 <strong>·</strong>
<a class="L5 L5V" href="javascript:void(0)">Save</a>
</div>
</div>

This is the part I want to skip to in order to extract the text. It's buried
in loads of other HTML. Can anyone please help me do this?

The example HTML is a good start, but you should also post
the code that produces the NPE and what you expect the output to be.
Also, if it's a valid XML document, consider using
XPath, which selects data based on the path to that data (including
selections based on element names, attributes, order, etc.).
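
To make the XPath suggestion concrete, here is a minimal sketch using the
javax.xml.xpath API that ships with Java 5 and later. The class name
XPathExtract and the helpers parse() and text() are made up for this post,
and SAMPLE is a trimmed, cleaned-up copy of your snippet (I added the missing
space before href and dropped the stray quote so that it parses as XML). In
your program you would pass in the Document you got from JTidy instead of
calling parse(); I have not checked how well JTidy's own DOM implementation
cooperates with javax.xml.xpath, so treat that part as an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExtract {
    // Hypothetical, cleaned-up subset of the HTML from the question.
    static final String SAMPLE =
        "<div class=\"mb16\">"
      + "<div id=\"r_t0\" class=\"prel\">"
      + "<a id=\"r0_t\" class=\"L4\" href=\"http://java.sun.com/\">"
      + "<b>Java</b> Technology</a></div>"
      + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>.</div>"
      + "<div id=\"r_b0\" class=\"prel T11\">"
      + "<span id=\"r0_u\" class=\"T10\">java.sun.com/</span></div>"
      + "</div>";

    // Parse a well-formed XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Return the string value of the first node matching the expression,
    // or an empty string if nothing matches (no null checks needed).
    static String text(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Select by element name plus attribute value.
        System.out.println(text(doc, "//a[@class='L4']"));       // prints "Java Technology"
        // Select an attribute directly.
        System.out.println(text(doc, "//a[@class='L4']/@href")); // prints "http://java.sun.com/"
        // The description text, with the nested <b> flattened.
        System.out.println(text(doc, "//div[@class='T1']"));     // prints "Sun's home for Java."
    }
}
```

The nice part of this approach is that an expression which matches nothing
just gives you an empty string. I can't tell without seeing your code, but
one common way to get an NPE when walking the tree by hand is calling
getAttributes() on a text node, which returns null for anything that is not
an element.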