Last updated 2004-06-28 by Roedy Green ©1996-2004 Canadian Mind Products
JDK 1.4 introduced the java.util.regex package. If regexes don't work, use Wassup to check which version of Java you are using; you may be inadvertently using an old one. Perl-like regular expressions are compiled into a Pattern (parsed into an internal state machine format, not byte code). You don't use a constructor to create a Pattern; you use the static method Pattern Pattern.compile(String). Then you create a Matcher object with Matcher Pattern.matcher(CharSequence), feeding it the string you want to test against the pattern. Finally, you call boolean Matcher.matches() to see if the whole string fits the pattern. There are many other things you can do, for example, call Matcher.find() repeatedly to find multiple matches in your String.
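The compile / matcher / matches sequence described above might look something like this minimal sketch. The class name RegexBasics, the method isAllDigits and the pattern [0-9]+ are made up for illustration; only Pattern.compile, Pattern.matcher and Matcher.matches come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBasics {
    // Hypothetical example pattern: one or more decimal digits.
    // Compiled once with the static factory and reused for every test.
    private static final Pattern ALL_DIGITS = Pattern.compile("[0-9]+");

    /** True if the whole of candidate consists of one or more digits. */
    static boolean isAllDigits(String candidate) {
        Matcher m = ALL_DIGITS.matcher(candidate);
        // matches() tests the entire string, not just some substring of it.
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345")); // prints true
        System.out.println(isAllDigits("123a5")); // prints false
    }
}
```

Because matches() insists on the whole string fitting, "123a5" fails even though it contains digits; find() would be the call to use for hunting substrings.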
| Regex Variants | | | |
|---|---|---|---|
| Java 1.4 | SlickEdit Unix | Funduc SR | Function |
| - + * ? ( ) [ ] { } \ | $ ^ < = | + * ? { } | | - + * ?
( ) [ ] \ | $ ^ ! |
reserved chars in search strings. Reserved characters must be quoted.
This does not mean you must enclose them in quotation marks, but rather you must
specially mark them as meant literally by preceding them with a \.
However, in Java source code, but not otherwise, the \ itself must be quoted with another \ to keep the Java string literal compiler happy. So, in Java source, \ as an ordinary char is \\\\! Quadrupled!! Arrgh. Since \ is used in almost every regex pattern, and these patterns appear in Java strings, you have Java \\-type quoting going on all the time, even when you don't have \\-type quoting going on in the regex grammar. It is truly unfortunate that Java and regexes use the same character \ for quoting. Be especially careful with File.separatorChar in regexes. If it is \ it must be doubled. Java 1.4.1 also offers \Q ... \E for quoting long passages. The quoter amanuensis will let you compose your regex strings then convert them to deal with Java \\ quoting. It won't hurt to quote punctuation that doesn't need it. Note that " and ' don't need regex quoting, though they need Java quoting.
In Java Regexes in Java source code:
|
| \ | \ | % \ < > | reserved chars in replace strings. Must be \ quoted in Funduc if used as data chars. |
| * | * | * | Zero or More of the preceding thing. .* matches anything. In Funduc, the * comes before the thing repeated, e.g. *[] to match anything even over multiple lines. In Java and SlickEdit, the * comes after, e.g. [a-z]*. |
| ++ | + | + | One or More of the preceding thing. |
| {1} | {1} | default | Exactly One of the preceding thing, similarly for any {n} |
| ?+ | Zero or One of the preceding thing. | ||
| (?!X) | anything but X, via zero width negative lookahead. After the non-match, you continue where you left off, not at the end of the non-matching string. | ||
| | | | | | | infix or Operator, (cat|dog) matches cat or dog. |
| ^ | ^ | ! | Not Operator, e.g. [^abc] means anything but a, b or c. In other contexts means start of line. |
| . | . | ? | any char but newline. In Java, . matches a line terminator only when Pattern.DOTALL (or the embedded flag (?s)) is in effect. |
| \r\n | \n | \r\n | newline, given for Windows. |
| ^ | ^ | ^ | Start of Line. In other contexts means not. |
| $ | $ | $ | End of Line |
| ^^ | Start of File | ||
| $$ | End of File | ||
| [] | [] | [] | Range Operator, list of chars, e.g. [ab] means match a or b. [a-z] matches any character in the range a through z. |
| () | () | () | Sub-Expression |
| +n | Column Operator | ||
| \1 | \1 | %1
%1< (to lower case) |
back reference to tagged expression #1, in (), for replace. In Java, \1 refers back to group 1 inside the pattern itself; in a replacement string the same group is written $1.
E.g. in SlickEdit, to replace all occurrences of <span class="jmethod"> used before an upper case name, converting them to <span class="jclass">. Search string : <span class="jmethod">([A-Z]) Replace string : <span class="jclass">\1 Remember to turn exact case matching on. |
| \d = digit
\D = non digit \s = single whitespace char \S = not whitespace \w = single word char (letter, digit or _) \W = not a word char \p{Lower} \p{Upper} \p{ASCII} \p{Alpha} \p{Digit} \p{Alnum} \p{Punct} \p{Graph} \p{Print} \p{Blank} \p{Cntrl} \p{XDigit} \p{Space} \p{Lu} \p{InGreek} \p{Sc} \P{InGreek} |
\:a alphanumeric
\:b blanks \:c alpha \:d numeric \:f filename part \:h hex \:i int \:n float \:p path \:q quoted string \:v c variable \:w word |
predefined match strings, e.g. \:w = ([A-Za-z]+) matches a word. Note that those are braces in \p{Alnum}, not parentheses. It can be hard to tell in some typefaces. The names are case sensitive, and in Java source such strings must be coded as \\p{Alnum}. | |
| X{n,m}
capturing / non-capturing constructs |
%%srpath%%
%%srfile%% %%srfiledate%% %%srfiletime%% %%srfilesize%% %%srdate%% %%srtime%% %%envvar=fruit%% |
Other notable features
X{n,m} means X appears at least n but not more than m times. |
|
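A few of the Java-column constructs from the table above can be tried out with a sketch along these lines. The class RegexConstructs, its method names and the sample patterns are invented for illustration; the jmethod/jclass replacement mirrors the SlickEdit example, using $1 because that is how Java spells a group reference in a replacement string.

```java
import java.util.regex.Pattern;

public class RegexConstructs {
    /** Infix or operator: the whole string is cat or dog. */
    static boolean isPet(String s) {
        return Pattern.compile("(cat|dog)").matcher(s).matches();
    }

    /** X{n,m}: at least two and at most four digits. */
    static boolean isTwoToFourDigits(String s) {
        return Pattern.compile("\\d{2,4}").matcher(s).matches();
    }

    /** (?!X): finds a foo that is not immediately followed by bar. */
    static boolean hasFooNotBar(String s) {
        return Pattern.compile("foo(?!bar)").matcher(s).find();
    }

    /** By default . refuses a newline; with DOTALL it accepts one. */
    static boolean dotMatchesNewline(boolean dotAll) {
        Pattern p = dotAll ? Pattern.compile("a.b", Pattern.DOTALL)
                           : Pattern.compile("a.b");
        return p.matcher("a\nb").matches();
    }

    /** Tagged expression () reused in the replacement as $1. */
    static String methodToClass(String html) {
        return Pattern.compile("<span class=\"jmethod\">([A-Z])")
                .matcher(html)
                .replaceAll("<span class=\"jclass\">$1");
    }

    public static void main(String[] args) {
        System.out.println(isPet("dog"));               // prints true
        System.out.println(isTwoToFourDigits("12345")); // prints false
        System.out.println(hasFooNotBar("foobar"));     // prints false
        System.out.println(dotMatchesNewline(true));    // prints true
        System.out.println(methodToClass("<span class=\"jmethod\">String</span>"));
    }
}
```

The last line of main prints the span with jmethod changed to jclass, since String starts with an upper case letter; a lower case name would pass through untouched.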
| How To Encode Awkward Characters | |
|---|---|
| How | Desired |
| \\\\ | \ The literal backslash character. You must double the \ twice since \ is the quoting character in both Java and Regex literals. |
| \\xhh | The character with hexadecimal value 0xhh, e.g. \\xff. Only works with two hex digits! |
| \uhhhh | The character with hexadecimal value 0xhhhh, e.g. \u20ac. Must always have exactly four hex digits. Don't use it for control characters e.g. 0..ff since \u expansion happens prior to compilation. In other words \u000a will start a new line in your program. Note there is only one lead \. |
| \\t | The tab character \u0009 |
| \\n | The newline (line feed) character \u000a |
| \\r | The carriage-return character \u000d |
| \\f | The form-feed character \u000c |
| \\a | The alert (bell) character \u0007 |
| \\e | The escape character \u001b |
| \\cx | control characters, e.g. \\cq for ctrl-q. |
| \\- | Literal -, not a regex range operator. |
| \\+ | Literal +, not a regex operator. |
| \\* | Literal *, not a regex operator. |
| \\? | Literal ?, not a regex operator. |
| \\( | Literal (, not a regex expression bracketer. |
| \\) | Literal ), not a regex expression bracketer. |
| \\[ | Literal [, not a regex expression bracketer. |
| \\] | Literal ], not a regex expression bracketer. |
| \\{ | Literal {, not a regex expression bracketer. |
| \\} | Literal }, not a regex expression bracketer. |
| \\| | Literal |, not a regex operator. |
| \\$ | Literal $, not a regex end of line. |
| \\^ | Literal ^, not regex operator. |
| \\< | Literal <, not regex operator. |
| \\= | Literal =, not regex operator. |
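To see the quoting rules from the table in action, here is a small sketch. The class RegexQuoting, its field and method names, and the sample strings are all made up for illustration; it shows the quadrupled backslash, a quoted +, and a \Q ... \E passage.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexQuoting {
    // Four backslashes in Java source give the regex two, which match ONE literal backslash.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    // Quoted plus: a literal + sign rather than the one-or-more operator.
    static final Pattern PLUS = Pattern.compile("\\+");

    // Everything between the Q and E markers is taken literally, operators and all.
    static final Pattern LITERAL = Pattern.compile("\\Q1+1=2 (really?)\\E");

    /** Counts the literal backslashes in s. */
    static int countBackslashes(String s) {
        Matcher m = BACKSLASH.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Splits s at each literal + sign. */
    static String[] splitOnPlus(String s) {
        return PLUS.split(s);
    }

    /** True only if s is exactly the quoted passage. */
    static boolean isLiteralSum(String s) {
        return LITERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The Java literal below is the 13-character string C:\temp\x.txt
        System.out.println(countBackslashes("C:\\temp\\x.txt")); // prints 2
        System.out.println(splitOnPlus("a+b+c").length);         // prints 3
        System.out.println(isLiteralSum("1+1=2 (really?)"));     // prints true
    }
}
```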
Java 1.4.1+ regexes have assertions, extra conditions placed on the match. Colourful regex terminology includes:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // find stuff between <td> ... </td> tags
    // prints:
    //   orca
    //   pilot whale
    Pattern p = Pattern.compile( "\\<td>([^\\<>]++)\\</td>" );
    Matcher m = p.matcher( "dolphin <td>orca</td> junk\n"
                           + "<td></td> empty"
                           + "<td>pilot whale</td> beluga\n" );
    while ( m.find() )
    {
        int gc = m.groupCount();
        // group 0 is the whole match
        // run from 1 to gc, not 0 to gc-1 as is traditional.
        for ( int i=1; i<=gc; i++ )
        {
            System.out.println( m.group( i ) );
        }
    }
    // Pattern to split into words separated by spaces or commas, ignoring null fields
    private static Pattern splitter = Pattern.compile( "[, ]++" );
    ...
    // Split phrase into words
    String[] words = splitter.split( phrase );
Beware, split treats leading, embedded and trailing separators differently. It drops trailing empty fields unless you use split( string, -1 /* limit */ ). It inherited this oddity from Perl.
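The difference is easiest to see by counting fields. This is only a sketch: the class SplitDemo, its method names and the sample phrase are invented, while the pattern is the same [, ]++ used above.

```java
import java.util.regex.Pattern;

public class SplitDemo {
    // One or more commas or spaces count as a single separator.
    private static final Pattern SPLITTER = Pattern.compile("[, ]++");

    /** Default split: trailing empty fields are dropped. */
    static String[] splitDefault(String phrase) {
        return SPLITTER.split(phrase);
    }

    /** Negative limit: trailing empty fields are kept. */
    static String[] splitKeepTrailing(String phrase) {
        return SPLITTER.split(phrase, -1 /* limit */);
    }

    public static void main(String[] args) {
        String phrase = ",a,,b,,";
        // Leading separator yields an empty first field; the embedded ,, is one separator.
        System.out.println(splitDefault(phrase).length);      // prints 3
        System.out.println(splitKeepTrailing(phrase).length); // prints 4
    }
}
```

With the default call the fields are an empty string, a and b; with the -1 limit a fourth, empty, field appears for the trailing separator.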
Please send errors, omissions and suggestions to improve this page to Roedy Green.
You can get a fresh copy of this page from http://mindprod.com/jgloss/regex.html or possibly from your local J: drive mirror: J:\mindprod\jgloss\regex.html