Java Glossary : regex

Last updated 2004-06-28 by Roedy Green ©1996-2004 Canadian Mind Products


regex
regular expression: a system of pattern masks to describe strings to search for, a more generalised form of wildcarding. Daniel Savarese wrote a free package which he no longer supports, and has since written a new one based on Perl regexes. Marc W.F. Meurrens maintains a list of regular expression processors, accompanied by some great Mozart. Look at the Apache Jakarta project. IBM Alphaworks has one; search there for regex. Jakarta-ORO (née OROMatcher) lets you add regex ability to your own Java programs. Funduc Search and Replace is a utility for doing global search and replace on files using regular expressions. The Quoter Amanuensis helps you compose regex expressions for Funduc Search and Replace. SlickEdit is a text editor that supports several kinds of regular expressions for global search and replace.

JDK 1.4 introduces the java.util.regex package. If the regex classes don't work, use Wassup to check which version of Java you are using; you may be inadvertently using an old one. Perl-like regex expressions are compiled into a Pattern (parsed into an internal state machine format, not byte code). You don't use a constructor to create a Pattern; you use the static method Pattern Pattern.compile(String). Then you create a Matcher object with Matcher Pattern.matcher(CharSequence), feeding it the string you want to test against the pattern. Finally, you call boolean Matcher.matches() to see if the string fits the pattern. There are many other things you can do, for example, find multiple matches in your String.
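The compile, matcher, matches sequence just described might look like this. It is only a minimal sketch: the class name MatchDemo, the method isPlate and the licence-plate-style pattern are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// MatchDemo and isPlate are hypothetical names, used only for this sketch.
class MatchDemo
   {
   // Compile once and reuse. The pattern (three capitals, then digits) is just an example.
   private static final Pattern PLATE = Pattern.compile( "[A-Z]{3}[0-9]+" );

   /** true if the entire candidate fits the pattern */
   public static boolean isPlate( String candidate )
      {
      Matcher m = PLATE.matcher( candidate );
      return m.matches();
      }

   public static void main( String[] args )
      {
      System.out.println( isPlate( "ABC123" ) );   // true
      System.out.println( isPlate( "ABC123x" ) );  // false, matches() wants the whole string to fit
      }
   }
```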

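Regex Variants
Each row below lists the Java 1.4 form, then the SlickEdit Unix form, then the Funduc SR form, then the function. Where a column is blank, the row gives no form for that tool.

Java 1.4: - + * ? ( ) [ ] { } \ | $ ^ < =
SlickEdit Unix: + * ? { } |
Funduc SR: - + * ? ( ) [ ] \ | $ ^ !
Function: reserved chars in search strings. Reserved characters must be quoted. This does not mean you must enclose them in quotation marks, but rather you must specially mark them as meant literally by preceding them with a \.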

However, in Java source code, but not otherwise, the \ too must be quoted with another \ to keep the Java string literal compiler happy. So, in Java source, \ as an ordinary char is \\\\! Quadrupled!! Arrgh. Since \ is used in almost every regex pattern, and these patterns appear in Java strings, you have Java \\-type quoting going on all the time, even when you don't have \\-type quoting going on in the regex grammar. It is truly unfortunate that Java and regexes use the same character \ for quoting. Be especially careful with File.separatorChar in regexes. If it is \ it must be doubled.

Java 1.4.1 also offers \Q ... \E for quoting long passages.

The quoter amanuensis will let you compose your regex strings then convert them to deal with Java \\ quoting.

It won't hurt to quote punctuation that doesn't need it. Note that " and ' don't need regex quoting, though they need Java quoting.

In Java Regexes in Java source code:
A reserved char like [ meant literally is written as \\[.
A newline character is written as \\n.
\ itself meant literally is written as \\\\.
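Here is a small sketch of the double layer of quoting, plus \Q ... \E, as it appears in Java source. The class and method names are made up for the example; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Pattern;

// QuotingDemo and its methods are hypothetical names, used only for this sketch.
class QuotingDemo
   {
   /** true if text contains a literal [ */
   public static boolean hasBracket( String text )
      {
      // the regex is \[ which Java source spells "\\["
      return Pattern.compile( "\\[" ).matcher( text ).find();
      }

   /** true if text contains a literal backslash */
   public static boolean hasBackslash( String text )
      {
      // the regex is \\ which Java source spells "\\\\"
      return Pattern.compile( "\\\\" ).matcher( text ).find();
      }

   /** true if text contains a.b*c taken literally, quoted as a block with \Q ... \E */
   public static boolean hasLiteral( String text )
      {
      return Pattern.compile( "\\Qa.b*c\\E" ).matcher( text ).find();
      }

   public static void main( String[] args )
      {
      System.out.println( hasBracket( "x[0]" ) );        // true
      System.out.println( hasBackslash( "C:\\temp" ) );  // true, the literal holds one backslash
      System.out.println( hasLiteral( "a.b*c" ) );       // true
      System.out.println( hasLiteral( "axbbc" ) );       // false, . and * were not operators
      }
   }
```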

\ \ % \ < > reserved chars in replace strings. Must be \ quoted in Funduc if used as data chars.
* * * Zero or More of the preceding thing. .* matches anything. In Funduc, the * comes before the thing repeated, e.g. *[] to match anything even over multiple lines. In Java and SlickEdit, the * comes after, e.g. [a-z]*.
++ + + One or More of the preceding thing. In Java, ++ is the possessive form; plain + also works and is greedy.
{1} {1} default Exactly One of the preceding thing, similarly for any {n}.
?+     Zero or One of the preceding thing. In Java, ?+ is the possessive form; plain ? also works.
(?!X)     anything but X, via zero width negative lookahead. After the non-match, you continue where you left off, not at the end of the non-matching string.
| | | infix or Operator, (cat|dog) matches cat or dog.
^ ^ ! Not Operator, e.g. [^abc] means anything but a, b or c. In other contexts means start of line.
. . ? any char but newline. In Java, . matches a newline only when the Pattern.DOTALL flag is set.
\r\n \n \r\n newline, given for Windows.
^ ^ ^ Start of Line. In other contexts means not.
$ $ $ End of Line
    ^^ Start of File
    $$ End of File
[] [] [] Range Operator, list of chars,[ab] means match a or b. [a-z] matches any character in range a through z.
() () () Sub-Expression
    +n Column Operator
\1 \1 %1 (or %1< to convert to lower case) back reference to tagged expression #1, in (), for replace. In Java, \1 is the form used inside the pattern; in a replacement string the same group is written $1.
E.g. in SlickEdit, to replace all occurrences of
<span class="jmethod">
used before an upper case name, converting them to
<span class="jclass"> .
Search string : <span class="jmethod">([A-Z])
Replace string : <span class="jclass">\1
Remember to turn exact case matching on.
\d = digit
\D = non digit
\s = single whitespace char
\S = not whitespace
\w = single word char (letter, digit or underscore)
\W = not a word char
\p{Lower}
\p{Upper}
\p{ASCII}
\p{Alpha}
\p{Digit}
\p{Alnum}
\p{Punct}
\p{Graph}
\p{Print}
\p{Blank}
\p{Cntrl}
\p{XDigit}
\p{Space}
\p{Lu}
\p{InGreek}
\p{Sc}
\P{InGreek}
\:a alphanumeric
\:b blanks
\:c alpha
\:d numeric
\:f filename part
\:h hex
\:i int
\:n float
\:p path
\:q quoted string
\:v c variable
\:w word
predefined match strings, e.g. \:w = ([A-Za-z]+) matches a word. Those are braces in \p{Alnum}, not parentheses. It can be hard to tell in some typefaces. The strings are case sensitive, and in Java source such strings must be coded as \\p{Alnum}.
Other notable features:
Java 1.4: X{n,m}, capturing / non-capturing constructs. X{n,m} means X appears at least n but not more than m times.
Funduc SR: %%srpath%% %%srfile%% %%srfiledate%% %%srfiletime%% %%srfilesize%% %%srdate%% %%srtime%% %%envvar=fruit%%
This table only covers the most common magic characters. See the documentation for each regex package for details.
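A few of the Java column entries from the table can be tried out with something like the following. Treat it as a rough sketch: OperatorDemo and its method names are invented, and the sample strings are arbitrary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// OperatorDemo and its methods are hypothetical names, used only for this sketch.
class OperatorDemo
   {
   /** infix or: true if the whole word is cat or dog */
   public static boolean isPet( String word )
      {
      return Pattern.matches( "(cat|dog)", word );
      }

   /** zero width negative lookahead: count each q that is not followed by u */
   public static int countQWithoutU( String text )
      {
      Matcher m = Pattern.compile( "q(?!u)" ).matcher( text );
      int count = 0;
      while ( m.find() )
         {
         count++;
         }
      return count;
      }

   /** tagged expression: the jmethod to jclass replacement from the table, done in Java */
   public static String methodToClass( String html )
      {
      // in a Java replacement string, tagged expression #1 is written $1
      return html.replaceAll( "<span class=\"jmethod\">([A-Z])",
                              "<span class=\"jclass\">$1" );
      }

   /** predefined class: true if s is one or more ASCII letters or digits */
   public static boolean isAlnum( String s )
      {
      return Pattern.matches( "\\p{Alnum}+", s );
      }

   public static void main( String[] args )
      {
      System.out.println( isPet( "dog" ) );                   // true
      System.out.println( countQWithoutU( "Iraq quit qat" ) ); // 2
      System.out.println( methodToClass( "<span class=\"jmethod\">String</span>" ) );
      System.out.println( isAlnum( "abc123" ) );              // true
      }
   }
```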

Awkward Characters

Here is how to represent various awkward characters. They represent the combined quoting needs for Java String literals and Regex Patterns.
How To Encode Awkward Characters
Each line gives How (the text to type in Java source) followed by Desired (the character you get).
\\\\ \ The literal backslash character. You must double the \ twice since \ is the quoting character in both Java and Regex literals.
\\xhh The character with hexadecimal value 0xhh, e.g. \\xff. Only works with two hex digits!
\uhhhh The character with hexadecimal value 0xhhhh, e.g. \u20ac. Must always have exactly four hex digits. Don't use it for line terminators such as \u000a or \u000d, since \u expansion happens prior to compilation. In other words \u000a will start a new line in your program. Note there is only one lead \.
\\t The tab character \u0009
\\n The newline (line feed) character \u000a
\\r The carriage-return character \u000d
\\f The form-feed character \u000c
\\a The alert (bell) character \u0007
\\e The escape character \u001b
\\cx control characters, e.g. \\cQ for ctrl-Q.
\\- Literal -, not a regex range operator.
\\+ Literal +, not a regex operator.
\\* Literal *, not a regex operator.
\\? Literal ?, not a regex operator.
\\( Literal (, not a regex expression bracketer.
\\) Literal ), not a regex expression bracketer.
\\[ Literal [, not a regex expression bracketer.
\\] Literal ], not a regex expression bracketer.
\\{ Literal {, not a regex expression bracketer.
\\} Literal }, not a regex expression bracketer.
\\| Literal |, not a regex operator.
\\$ Literal $, not a regex end of line.
\\^ Literal ^, not regex operator.
\\< Literal <, not regex operator.
\\= Literal =, not regex operator.
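A few lines from this list, tried in Java source, might look like the sketch below. AwkwardDemo and its method names are invented for the example.

```java
import java.util.regex.Pattern;

// AwkwardDemo and its methods are hypothetical names, used only for this sketch.
class AwkwardDemo
   {
   /** true if s contains a tab; the regex escape for tab is spelled "\\t" in Java source */
   public static boolean hasTab( String s )
      {
      return Pattern.compile( "\\t" ).matcher( s ).find();
      }

   /** true if s contains a euro sign */
   public static boolean hasEuro( String s )
      {
      // single lead backslash: the Java compiler expands the unicode escape
      // itself, so the regex engine just sees the euro character
      return Pattern.compile( "\u20ac" ).matcher( s ).find();
      }

   /** true if s is a literal $ followed by digits, e.g. $25 */
   public static boolean isDollarAmount( String s )
      {
      // "\\$" is a literal dollar sign, not the end of line operator
      return Pattern.matches( "\\$[0-9]+", s );
      }

   public static void main( String[] args )
      {
      System.out.println( hasTab( "a\tb" ) );          // true
      System.out.println( hasEuro( "5 \u20ac" ) );     // true
      System.out.println( isDollarAmount( "$25" ) );   // true
      System.out.println( isDollarAmount( "25" ) );    // false
      }
   }
```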

Terminology

Pattern.CASE_INSENSITIVE is a flag you can feed to Pattern.compile to do case-insensitive searches. This is much easier than trying to do them directly in the regex strings.
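For instance, a case-insensitive search could be sketched like this. The class name CaseDemo, the method mentionsOrca and the word searched for are arbitrary choices for the example.

```java
import java.util.regex.Pattern;

// CaseDemo and mentionsOrca are hypothetical names, used only for this sketch.
class CaseDemo
   {
   // The flag goes in as the second argument to Pattern.compile.
   private static final Pattern WHALE =
      Pattern.compile( "orca", Pattern.CASE_INSENSITIVE );

   /** true if text contains orca in any mixture of upper and lower case */
   public static boolean mentionsOrca( String text )
      {
      return WHALE.matcher( text ).find();
      }

   public static void main( String[] args )
      {
      System.out.println( mentionsOrca( "An ORCA breached" ) );  // true
      System.out.println( mentionsOrca( "a beluga dived" ) );    // false
      }
   }
```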

Java 1.4.1+ regexes have assertions, extra conditions placed on the match. Colourful regex terminology includes:

The easiest way to understand these terms is to experiment with the various regex operators on simple strings. You can make yourself a test program that reads strings from the console. That way, at least you can avoid having to deal with Java \ string quoting. You only need concern yourself with regex \ quoting. You can also use the Quoter Amanuensis to first apply regex quoting then Java string quoting and let you paste the result into your program.
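One possible shape for such a console test program is sketched below. It is not a polished tool: RegexConsole and report are invented names, the first line typed is taken as the regex, and every later line is tested against it. Because the regex arrives from the console rather than from a Java string literal, you type \d+ and not \\d+.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// RegexConsole and report are hypothetical names, used only for this sketch.
class RegexConsole
   {
   /** describe how regex fares against candidate */
   public static String report( String regex, String candidate )
      {
      Pattern p;
      try
         {
         p = Pattern.compile( regex );
         }
      catch ( PatternSyntaxException e )
         {
         return "bad regex: " + e.getDescription();
         }
      if ( p.matcher( candidate ).matches() )
         {
         return "matches";
         }
      Matcher m = p.matcher( candidate );
      if ( m.find() )
         {
         return "finds [" + m.group() + "]";
         }
      return "no match";
      }

   public static void main( String[] args ) throws IOException
      {
      BufferedReader in = new BufferedReader( new InputStreamReader( System.in ) );
      System.out.println( "Type a regex, then strings to test against it:" );
      String regex = in.readLine();
      String line;
      // stop at end of input
      while ( regex != null && ( line = in.readLine() ) != null )
         {
         System.out.println( report( regex, line ) );
         }
      }
   }
```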

Examples

The following examples use the Java conventions. For use on the command line, undouble the \\.



Matching vs Finding

When you want the entire String to match your Pattern, you use Matcher.matches. When you want to find fragments in your String that match the Pattern, use Matcher.find.

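import java.util.regex.Matcher;
import java.util.regex.Pattern;

// FindDemo is just a wrapper name so the fragment compiles and runs.
public class FindDemo
   {
   public static void main( String[] args )
      {
      // find stuff between <td> ... </td> tags
      // prints orca and pilot whale, one per line; the empty <td></td> is skipped
      Pattern p = Pattern.compile( "\\<td>([^\\<>]++)\\</td>" );
      Matcher m = p.matcher( "dolphin <td>orca</td> junk\n"
                             + "<td></td> empty"
                             + "<td>pilot whale</td> beluga\n" );
      while ( m.find() )
         {
         int gc = m.groupCount();
         // group 0 is the whole match
         // run from 1 to gc, not 0 to gc-1 as is traditional.
         for ( int i=1; i<=gc; i++ )
            {
            System.out.println( m.group( i ) );
            }
         }
      }
   }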
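To see the difference between matching and finding side by side, here is a small sketch. MatchVsFind, whole and fragments are invented names, and the digit pattern is only an example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// MatchVsFind and its methods are hypothetical names, used only for this sketch.
class MatchVsFind
   {
   /** true only if the entire string fits the regex */
   public static boolean whole( String regex, String s )
      {
      return Pattern.compile( regex ).matcher( s ).matches();
      }

   /** every fragment of s that fits the regex, in order of appearance */
   public static List<String> fragments( String regex, String s )
      {
      List<String> found = new ArrayList<String>();
      Matcher m = Pattern.compile( regex ).matcher( s );
      while ( m.find() )
         {
         found.add( m.group() );
         }
      return found;
      }

   public static void main( String[] args )
      {
      System.out.println( whole( "[0-9]+", "tel 555 1212" ) );      // false
      System.out.println( whole( "[0-9]+", "5551212" ) );           // true
      System.out.println( fragments( "[0-9]+", "tel 555 1212" ) );  // [555, 1212]
      }
   }
```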

Splitting

Regexes can be used to break phrases into individual words. Here is an example:

// Pattern to split into words separated by spaces or commas, ignoring null fields
private static Pattern splitter = Pattern.compile ("[, ]++" );
...
// Split phrase into words
String[] words = splitter.split( phrase );

Beware, split treats leading, embedded and trailing separators differently. A leading separator gives you an empty first element, whereas trailing empty elements are silently dropped unless you use split ( string, -1 /* limit */ ). It inherited this oddity from Perl.
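The leading and trailing quirks can be seen with the same [, ]++ splitter. This is only a sketch: SplitDemo and its method names are invented, and the sample phrase is contrived to have separators at both ends.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// SplitDemo and its methods are hypothetical names, used only for this sketch.
class SplitDemo
   {
   private static final Pattern SPLITTER = Pattern.compile( "[, ]++" );

   /** default split: trailing empty strings are dropped */
   public static String[] words( String phrase )
      {
      return SPLITTER.split( phrase );
      }

   /** split with a limit of -1: trailing empty strings are kept */
   public static String[] wordsKeepTrailing( String phrase )
      {
      return SPLITTER.split( phrase, -1 );
      }

   public static void main( String[] args )
      {
      String phrase = ",a, b,,c, ";
      // leading separator gives an empty first element; the doubled comma is one separator
      System.out.println( Arrays.asList( words( phrase ) ) );              // [, a, b, c]
      System.out.println( Arrays.asList( wordsKeepTrailing( phrase ) ) );  // [, a, b, c, ]
      }
   }
```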

Tips

Mastering Regular Expressions, Powerful Techniques for Perl and Other Tools, Second Edition
0-596-00289-0
Jeffrey E. Friedl, Andy Oram
Includes scripting languages such as Perl, Tcl, awk and Python. Does not specifically cover Java, though Java regexes were modeled on Perl. More a book for regex experts to hone their skills than for a newbie to learn regexes. It is a good place to find regex solutions to standard problems. While it isn't made up in cookbook style, the examples are usually real-life problems that can be put into practical use.


Please send errors, omissions and suggestions to improve this page to Roedy Green.
You can get a fresh copy of this page from: http://mindprod.com/jgloss/regex.html