Re: regexp lookahead
Michael Powe writes:
"Jussi" == Jussi Piitulainen writes:
Jussi> Negative lookahead:
>> String re = "(.*)\\[(?!\\S+)\\](.*)";
Jussi> The look-ahead pattern and the following pattern match at
Jussi> the same position: (?!\S+) matches the empty string between
Jussi> the \[ and something that _fails_ to match \S+ at that
Jussi> position, and that something should start with the
Jussi> \]. Where can this happen?
> In my test, it happens everywhere -- the regexp fails when there's
> nothing there and when there's text there.
Right, except I would say _nowhere_ rather than everywhere. If (?!\S+)
matches, \] does not. If \] matches, (?!\S+) does not.
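A minimal sketch to check this, using the pattern quoted above. The class name, the helper method and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // The quoted pattern: "[", then "not followed by \S+", then "]".
    static final Pattern RE = Pattern.compile("(.*)\\[(?!\\S+)\\](.*)");

    // True only if the whole input matches the pattern.
    static boolean matches(String input) {
        return RE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("a [] b"));     // false: "]" is \S, so the lookahead fails
        System.out.println(matches("a [text] b")); // false: "t" is \S, so the lookahead fails
        System.out.println(matches("a [ ] b"));    // false: lookahead succeeds, but \] meets a space
    }
}
```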
Jussi> Positive lookahead:
>> String re = "(.*)\\[(?=\\S+)\\](.*)";
Jussi> The look-ahead pattern and the following pattern match at
Jussi> the same position: (?=\S+) matches the empty string between
Jussi> the \[ and before an \S+, and that \S+ should start with
Jussi> the \]. Where can this happen?
> The reason for my testing was because the regexp fails to match the
> case where there is nothing between the brackets. Note that the
I thought that was the case that succeeded. That pattern is just like
(.*)\[\](.*) with an extra condition that the part of input that
matches \](.*) must also match \S+, which it does, since the ] is
there.
Are you sure that you understand that a lookahead pattern always
consumes an empty string? So your whole pattern can only match a pair
of brackets [], with the two groups on each side of it.
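Here is a small sketch of that claim, again with the quoted pattern. The class name, helper and inputs are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositiveLookaheadDemo {
    // The quoted pattern: "[", then "followed by \S+", then "]".
    static final Pattern RE = Pattern.compile("(.*)\\[(?=\\S+)\\](.*)");

    // Returns the two captured groups, or null when the whole input does not match.
    static String[] groups(String input) {
        Matcher m = RE.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] g = groups("a [] b");
        System.out.println(g[0] + "|" + g[1]);            // a | b
        System.out.println(groups("a [text] b") == null); // true: \] cannot match the "t"
    }
}
```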
> In the real-world case that led me to examine the lookahead option,
> I had a regexp matching a long string (9 group captures) that failed
> when one of the expected groups, inside a bracket pair, was empty.
> \\S+ does not match inside [] and thus caused the whole regex to
> fail.
\S matches the right bracket, and eats it, too. (?=\S+) also matches
the right bracket but doesn't eat it.
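The difference between eating and not eating can be seen with find() on the bare input "[]". This is only a sketch; the helper name firstMatch is an invention for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumeDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\[\\S+", "[]"));      // []   (\S+ consumes the "]")
        System.out.println(firstMatch("\\[(?=\\S+)", "[]"));  // [    (the lookahead consumes nothing)
        System.out.println(firstMatch("\\[\\S+\\]", "[]"));   // null (no "]" is left for \])
    }
}
```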
Nine groups sounds rather complicated. Do you need to do it all in one
expression?
> I would like to see a useful, nontrivial application of lookahead.
> It doesn't appear to me that there is one.
There may be a candidate in the other post I made, this morning I
think, where someone wanted to split a certain file at each <?xml...>
thingamajig in it.
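A rough sketch of how such a split might look. The input string and the class name are made up, and the exact shape of the other poster's file is not known here. It assumes Java 8 or later, where a zero-width match at the very start of the input does not produce an empty leading piece.

```java
class SplitAtXmlDecl {
    // Split before each "<?xml" without consuming it, so every piece keeps its declaration.
    static String[] split(String input) {
        return input.split("(?=<\\?xml)");
    }

    public static void main(String[] args) {
        String input = "<?xml version=\"1.0\"?><a/><?xml version=\"1.0\"?><b/>";
        for (String piece : split(input)) {
            System.out.println(piece);
        }
    }
}
```

Because the lookahead consumes nothing, the delimiter stays attached to the piece that follows it, which a plain split on "<\\?xml" would not do.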
(Which reminds me, you might consider the use of non-greedy patterns,
like .*?, since those .* try to eat the bracket pairs, too, and that
may lead to something that feels unintuitive.)
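To illustrate with a made-up input that has two bracket pairs (a sketch only; the helper firstGroup is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns group 1 when the whole input matches regex, otherwise null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a [] b [] c";
        // Greedy .* runs to the last bracket pair.
        System.out.println("<" + firstGroup("(.*)\\[\\](.*)", input) + ">");  // <a [] b >
        // Reluctant .*? stops at the first bracket pair.
        System.out.println("<" + firstGroup("(.*?)\\[\\](.*)", input) + ">"); // <a >
    }
}
```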
> And the negative lookahead just appears broken.
Let me contrive an example of sorts: a maximal digit sequence not
bounded by a . or a - or an e.
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class NonLook {
        public static void main(String[] args) {
            // Lookbehind: no ., e, - or digit just before the digits.
            // Lookahead: no ., e or - just after them.
            Matcher m = Pattern
                .compile("(?<![.e\\-\\d])\\d++(?![.e\\-])")
                .matcher("pi 3.14 314e-2 1024 e 2.7 27e-1 31415926");
            while (m.find()) {
                System.out.println(m.group(0)); // prints 1024, then 31415926
            }
        }
    }
Ok, I had to throw in a lookbehind, a possessive quantifier in \d++,
and a \d inside the lookbehind. This does not eat the preceding or
following character, and matches even where there is no following
character at all. It seems to work.
Jussi> (Javadoc for 1.4.2 was not too helpful here, so I
Jussi> experimented a bit, never having used these myself.)
> I actually have Habibi's book, _Java Regular Expressions_, but IMO
> it is not very useful if you already have good knowledge of regex.
Does it tell what (?>X) does? Sun's doc says it matches "X, as an
independent, non-capturing group". I have no idea what an independent
group is. (I know that I'm not looking at the latest documentation.)
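For reference, a sketch under the usual reading of (?>X), often called an atomic group: once X has matched, the engine does not backtrack into it to try another way of matching X. On that reading the possessive \d++ behaves like (?>\d+). The class name and the sample patterns below are only illustrative.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // True only if the whole input matches regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Ordinary group: "bc" is tried first, the final c fails, the engine backs up to "b".
        System.out.println(matches("a(?:bc|b)c", "abc"));  // true
        // Independent group: "bc" is kept, so nothing is left for the final c.
        System.out.println(matches("a(?>bc|b)c", "abc"));  // false
        System.out.println(matches("a(?>bc|b)c", "abcc")); // true
    }
}
```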
....
> Ironically, Habibi criticizes perl's conditional construct in regex,
> and it is exactly that construct that I need in the case described
> here.
There are likely to be other ways.
If your problem is that a pair of brackets in your input may contain
an empty string that you need to match, then you need to match an
empty string there. There is no way around that.
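One way to do that, as a sketch: use \S* instead of \S+ inside the brackets, so the group may be empty. The pattern below is a cut-down stand-in with three groups, not the nine-group expression from the real case, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyBracketsDemo {
    // \S* allows zero characters between the brackets; \S+ would demand at least one.
    static final Pattern RE = Pattern.compile("(.*?)\\[(\\S*)\\](.*)");

    // Returns the text between the brackets, or null when the input does not match.
    static String inside(String input) {
        Matcher m = RE.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println("<" + inside("a [] b") + ">");     // <>
        System.out.println("<" + inside("a [text] b") + ">"); // <text>
    }
}
```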