Re: simple regex pattern sought
On 5/26/2012 8:13 AM, Robert Klemme wrote:
On 26.05.2012 16:57, markspace wrote:
Finally I think this could be simplified slightly with Lew's
back-reference idea.
(['"])(?:\\.|[^\1\\])*
(Untested.) This allows empty strings between delimiters; instead of a *
use + for only non-empty strings between the quotes.
Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
^
Yup, [] is for characters, and \1 could be a string. Gets rejected. I
think you could use "negative lookahead" to say "not this string" when
parsing. Gets kinda ugly though.
<http://www.regular-expressions.info/conditional.html>
Java:
"(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"
Regex:
(['"])(?:\\.|(?!\1|\\).)+\1
I re-did Roedy's test program to be a bit more clear about what it was
looking for, and the results. This could be even cleaner if it was run
with a JUnit test harness.
At this point though the regex is basically just a mess. Download antlr
and get an XML/HTML grammar from online.
package quicktest;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static java.lang.System.out;
/**
*
* @author Brenden
*/
public class MindProdRegex {
}
/*
* [TestRegexFindQuotedString.java]
*
* Summary: Finding a quoted String with a regex.
..
*
* Copyright: (c) 2012 Roedy Green, Canadian Mind Products,
http://mindprod.com
*
* Licence: This software may be copied and used freely for any
purpose but military.
* http://mindprod.com/contact/nonmil.html
*
* Requires: JDK 1.7+
*
* Created with: JetBrains IntelliJ IDEA IDE
http://www.jetbrains.com/idea/
*
* Version History:
* 1.0 2012-05-25 initial release
*/
/**
* Finding a quoted String with a regex.
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0 2012-05-25 initial release
* @since 2012-05-25
*/
class TestRegexFindQuotedString
{
// ------------------------------
CONSTANTS------------------------------
private static final String[] vectors =
{"Basic: George said \"that's theticket\".",
"\"that's theticket\"",
"Nested: Jeb replied '\"ticket?\"what ticket'.",
"'\"ticket?\"what ticket'",
"Non-ASCII: \"How na\u00efve!\".",
"\"How na\u00efve!\"",
" empty: \"\"xx",
"\"\"",
" escaped: 'Bob\\'s your uncle.'",
"'Bob\\'s your uncle.'",
" 'unbalanced\"",
"",
};
// -------------------------- STATIC METHODS--------------------------
/**
* exercise that pattern to see what if can find
*/
static void exercisePattern( Pattern pattern )
{
out.println();
out.println( "Pattern: " + pattern.toString() );
for( int i = 0; i < vectors.length; i+=2 ) {
String test = vectors[i];
String result = vectors[i+1];
final Matcher m = pattern.matcher( test );
boolean found = m.find();
boolean correct = false;
String groupString = null;
if( found ) {
correct = m.group(0).equals( result );
groupString = m.group();
}
System.out.println( test+", found: "+ found +
", correct: "+correct+" ("+groupString+")");
}
}
// --------------------------- main() method---------------------------
/**
* test harness
*
* @param args not used
*/
public static void main( String[] args )
{
// We want to find Strings of the form "xx'xx" or 'xx"xx'
// We want to avoid the following problems:
// 1. Works even if String contains foreign languages,
evenRussian or accented letters.
// 2. If starts with " must end with ", if starts with '
mustend with '.
// 3. ' is ok inside "...", and " is ok inside '...'
// 4. We don't worry about how to use ' inside '...'.
// here are some suggested techniques:
exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" )
); // fails 1 2 3
exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) );
//fails 2 3
exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) );
//fails 3, uses a capturing group.
exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) );
//works, rejects empty strings by Mark Space.
exercisePattern( Pattern.compile(
"(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); //works, rejects empty strings
by Mark Space.
exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) );
//works, accepts empty strings by Robert Klemme.
exercisePattern( Pattern.compile(
"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, acceptsempty
strings
// (?: ) is a non-capturing group. This is Robert
Klemme'scontribution. I don't understand how it works.
}
}