The Java 2 Platform, Standard Edition (J2SE), version 1.4, introduced a new package called java.util.regex, enabling the use of regular expressions. The new functionality includes support for metacharacters, which give regular expressions their versatility.
A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
A typical invocation sequence is thus:
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();
The Pattern class also defines a static matches method as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement:
boolean b = Pattern.matches("a*b", "aaaaab");
is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
Instances of the Pattern class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.
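The reuse and thread-safety points above can be sketched as follows. This is a minimal illustration, not code from the original text: the class name SharedPatternDemo and the helper matchesAStarB are hypothetical. The pattern is compiled once and shared, while each call creates its own Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedPatternDemo {
    // Compiled once; Pattern instances are immutable and may be shared.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Each call creates its own Matcher, because matchers hold match state.
    static boolean matchesAStarB(CharSequence input) {
        Matcher m = A_STAR_B.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAStarB("aaaaab")); // true
        System.out.println(matchesAStarB("b"));      // true
        System.out.println(matchesAStarB("aaa"));    // false
    }
}
```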
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
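A few of the character classes listed above can be tried out with a short sketch such as the one below. The class name CharacterClassDemo and the sample inputs are chosen here for illustration only.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Thin wrapper so each check reads as "regex, input".
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-z&&[def]]", "e"));  // true: intersection
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false: subtraction
        System.out.println(matches("[a-d[m-p]]", "n"));    // true: union
        System.out.println(matches("[^abc]", "a"));        // false: negation
        System.out.println(matches("\\d\\s\\w", "7 x"));   // true: digit, space, word
    }
}
```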
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern is used to create a Matcher object that matches arbitrary character sequences against the regular expression. Many matchers can share the same pattern because it is stateless.
The compile method compiles the given regular expression into a pattern, then the matcher method creates a matcher that will match the given input against this pattern. The pattern method returns the regular expression from which this pattern was compiled.
The split method is a convenience method that splits the given input sequence around matches of this pattern. The following example uses split to break up a string of input separated by commas and/or whitespace:
|
..... |
The output:
|one|
|two|
|three|
|four|
|five|
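Output like the listing above could be produced by a sketch along these lines. The input string and the class name PatternSplitDemo are assumptions made for illustration; the pattern "[,\\s]+" treats any run of commas and whitespace as one delimiter.

```java
import java.util.regex.Pattern;

class PatternSplitDemo {
    // Splits around runs of commas and/or whitespace.
    static String[] splitOnCommasOrWhitespace(String input) {
        Pattern p = Pattern.compile("[,\\s]+");
        return p.split(input);
    }

    public static void main(String[] args) {
        String[] result = splitOnCommasOrWhitespace("one,two, three   four ,  five");
        for (int i = 0; i < result.length; i++) {
            System.out.println("|" + result[i] + "|");
        }
    }
}
```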
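Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers using the CharSequence interface to support matching against characters from a wide variety of input sources.
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations: the matches method attempts to match the entire input sequence against the pattern; the lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern; and the find method scans the input sequence looking for the next subsequence that matches the pattern.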
Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
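The three match operations of the Matcher class (matches, lookingAt, and find) can be compared side by side with a sketch like this one. The class and helper names are hypothetical, and a fresh Matcher is created per call so that the operations do not affect each other's state.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOperationsDemo {
    private static final Pattern P = Pattern.compile("a*b");

    // matches(): the entire input must match the pattern.
    static boolean wholeInputMatches(String s) {
        return P.matcher(s).matches();
    }

    // lookingAt(): the match must start at the beginning of the input.
    static boolean prefixMatches(String s) {
        return P.matcher(s).lookingAt();
    }

    // find(): scans for the next match anywhere; the matcher's state
    // (group, start, end) is queried after a successful match.
    static String firstMatch(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() + "@" + m.start() + "-" + m.end() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("aabxab")); // false
        System.out.println(prefixMatches("aabxab"));     // true
        System.out.println(firstMatch("xxaab"));         // aab@2-5
    }
}
```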
The Matcher class also defines methods for replacing matched sequences by new strings whose contents can, if desired, be computed from the match result.
The appendReplacement method appends everything up to the next match, followed by the replacement for that match. The appendTail method appends the remainder of the input after the last match.
The following code sample demonstrates the use of the java.util.regex package. This code writes "One dog, two dogs in the yard" to the standard-output stream:
|
..... |
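A replacement loop of this kind might look like the sketch below. The input sentence is an assumption inferred from the stated output, and the class name ReplaceDemo is hypothetical. StringBuffer is used because that is the type these Matcher methods accepted in the J2SE releases discussed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    static String replaceCats(String input) {
        Pattern p = Pattern.compile("cat");
        Matcher m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Copies the text before this match, then the replacement.
            m.appendReplacement(sb, "dog");
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceCats("One cat, two cats in the yard"));
    }
}
```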
Quantifiers specify the number of occurrences of a pattern. This allows us to control how many times a pattern occurs in a string. Table 3.1 summarizes how to use quantifiers:
Table 3.1. Quantifiers
| Greedy Quantifiers | Reluctant Quantifiers | Possessive Quantifiers | Occurrence of a pattern X |
| X? | X?? | X?+ | X, once or not at all |
| X* | X*? | X*+ | X, zero or more times |
| X+ | X+? | X++ | X, one or more times |
| X{n} | X{n}? | X{n}+ | X, exactly n times |
| X{n,} | X{n,}? | X{n,}+ | X, at least n times |
| X{n,m} | X{n,m}? | X{n,m}+ | X, at least n but not more than m times |
The first three columns show regular expressions that match strings in which X is repeated. The last column describes the meaning of the corresponding expressions. Each kind of occurrence can be specified with three types of quantifiers, which differ in how they match. It is important to understand the metacharacters used in quantifiers before looking at those differences.
The most general quantifier is {n,m}, where n and m are integers. X{n,m} matches strings in which X is repeated at least n times but no more than m times. For instance, X{3,5} matches XXX, XXXX, and XXXXX but not X, XX, or XXXXXX.
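The bounded quantifier can be checked with a small sketch such as this one. The class name BoundedQuantifierDemo and its helper are hypothetical; the loop tests strings of one to six X characters against X{3,5}.

```java
import java.util.regex.Pattern;

class BoundedQuantifierDemo {
    // True only when the whole string is three to five X characters.
    static boolean isThreeToFiveX(String s) {
        return Pattern.matches("X{3,5}", s);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int n = 1; n <= 6; n++) {
            sb.append('X');
            System.out.println(sb + " -> " + isThreeToFiveX(sb.toString()));
        }
    }
}
```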
Even with the metacharacters above to control occurrence, there are several ways a Matcher can go about matching a string against a regular expression. This is why each kind of occurrence has a greedy, a reluctant, and a possessive quantifier.
A greedy quantifier makes the Matcher consume as much of the input string as possible first. If the overall match fails, the Matcher backs off the input by one character, checks again, and repeats the process until a match is found or there are no more characters to give back.
A reluctant quantifier, on the other hand, makes the Matcher consume as little of the input as possible first. If the overall match fails, it takes in one more character and checks again, repeating the process until a match is found or the whole input has been consumed.
A possessive quantifier, unlike the other two, makes the Matcher consume as much of the input as possible and never back off, even if backing off would allow the overall match to succeed.
Table 3.2 illustrates the difference between the greedy quantifier (the first test), the reluctant quantifier (the second test), and the possessive quantifier (the third test). The string content is "whellowwwwwwhellowwwwww".
Table 3.2. Difference between quantifiers
| Regular Expression | Result |
| .*hello | Found the text "whellowwwwwwhello" starting at index 0 and ending at index 17. |
| .*?hello | Found the text "whello" starting at index 0 and ending at index 6. Found the text "wwwwwwhello" starting at index 6 and ending at index 17. |
| .*+hello | No match found. |
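The three results in the table can be reproduced with a sketch like the following. The class name, the helper findAll, and the "text@start-end" reporting format are illustrative choices, not part of the original test harness. The input is the same string as in the table, written as a concatenation only to make the run lengths easy to see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierComparisonDemo {
    // Same content as the table's test string, split for readability.
    static final String INPUT = "whello" + "wwwwww" + "hello" + "wwwwww";

    // Collects every match as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll(".*hello", INPUT));  // greedy: one long match
        System.out.println(findAll(".*?hello", INPUT)); // reluctant: two matches
        System.out.println(findAll(".*+hello", INPUT)); // possessive: no match
    }
}
```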
The above operations also work on groups of characters by using capturing groups. A capturing group is a way to treat a group of characters as a single unit. For instance, (java) is a capturing group, where java is a unit of characters, so javajava matches the regular expression (java)*. The part of the input string that matches a capturing group is saved and can be recalled later by back references.
Java numbers the capturing groups in a regular expression by counting their opening parentheses from left to right. For example, the regular expression ((A)(B(C))) contains the following four capturing groups:
1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)
You can invoke the Matcher method groupCount() to determine how many capturing groups there are in a Matcher's Pattern.
The numbering of capturing groups is necessary to recall a stored part of a string by back references. A back reference is written \n, where n is the number of the capturing group to recall; it matches the same text that the group most recently matched.
Table 3.3. Groups usage
| Whole Content | Regular Expression | Result |
| abab | ([a-z][a-z])\1 | Found the text "abab" starting at index 0 and ending at index 4. |
| abcd | ([a-z][a-z])\1 | No match found. |
| abcd | ([a-z][a-z]) | Found the text "ab" starting at index 0 and ending at index 2. Found the text "cd" starting at index 2 and ending at index 4. |
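Group numbering and back references can be explored with a sketch such as the one below. The class name GroupDemo and its helpers are hypothetical; note that the back reference \1 must be written "\\1" inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Number of capturing groups in the pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True when a two-letter pair is immediately followed by itself.
    static boolean hasRepeatedPair(String s) {
        return Pattern.compile("([a-z][a-z])\\1").matcher(s).find();
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        if (m.matches()) {
            // Group 0 is the whole match; groups 1..4 follow the numbering above.
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println(i + ": " + m.group(i));
            }
        }
        System.out.println(hasRepeatedPair("abab")); // true
        System.out.println(hasRepeatedPair("abcd")); // false
    }
}
```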
J2SE 1.4 added the split() method to the String class to simplify the task of breaking a string into substrings, or tokens. This method uses a regular expression to specify the delimiters. Regular expressions were popularized by Unix tools such as grep (the name comes from the ed command g/re/p, "globally search for a regular expression and print").
For details, see almost any introductory Unix text or the Java API documentation for the java.util.regex.Pattern class.
In its simplest form, searching for a regular expression consisting of a single character finds a match of that character. For example, the character 'x' is a match for the regular expression "x".
The split() method takes a parameter giving the regular expression to use as a delimiter and returns a String array containing the tokens so delimited. Using the split() method:
|
..... |
The output:
This
is
a
string
object
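NOTE: for this input, str.split(" ") gives the same result as str.split("\\s"). In general, " " matches only a single space character, while "\\s" matches any single whitespace character (space, tab, newline, and so on).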
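The relationship between the two delimiters can be seen in a sketch like this one. The sample strings and the class name SplitWhitespaceDemo are assumptions for illustration; the first string is inferred from the output shown above.

```java
import java.util.Arrays;

class SplitWhitespaceDemo {
    static String[] bySpace(String s) {
        return s.split(" ");     // delimiter: exactly one space character
    }

    static String[] byWhitespace(String s) {
        return s.split("\\s");   // delimiter: any one whitespace character
    }

    public static void main(String[] args) {
        String str = "This is a string object";
        System.out.println(Arrays.toString(bySpace(str)));
        System.out.println(Arrays.toString(byWhitespace(str)));

        // With a tab in the input the two delimiters behave differently.
        String tabbed = "tab\tseparated words";
        System.out.println(bySpace(tabbed).length);      // 2
        System.out.println(byWhitespace(tabbed).length); // 3
    }
}
```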
To use "*" (which is a "special" regex character) as a delimiter, specify "\\*" as the regular expression (escape it):
|
..... |
A
bunch
of
stars
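Splitting on a literal star might be sketched as follows. The input string is inferred from the output above and is an assumption, as is the class name SplitOnStarDemo. Pattern.quote, available since J2SE 5.0, is shown as an alternative to escaping by hand.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnStarDemo {
    // Escapes the star manually so it is treated as a literal character.
    static String[] splitEscaped(String s) {
        return s.split("\\*");
    }

    // Lets Pattern.quote produce the literal pattern instead.
    static String[] splitQuoted(String s) {
        return s.split(Pattern.quote("*"));
    }

    public static void main(String[] args) {
        String str = "A*bunch*of*stars";
        System.out.println(Arrays.toString(splitEscaped(str)));
        System.out.println(Arrays.toString(splitQuoted(str)));
    }
}
```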
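NOTE: always double the backslash when writing a regular expression escape in a Java string literal, e.g. "\\s", "\\d", "\\*". A single backslash starts a Java string escape, and a sequence such as "\d" or "\*" is not a legal one, so the code will not compile: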
|
..... |
|
..... |
The following example (splitting by single character):
|
..... |
gives the following output:
My1Da
y2cooks34pu
ing
The same string, split on the regular expression "\\d" (any digit) instead of the literal character "d":
|
..... |
The output:
My
Daddy
cooks
pudding
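The contrast between "d" and "\\d" can be reproduced with the sketch below. The input string is an assumption reconstructed from the two outputs above, and the class name is hypothetical. Printing the arrays with Arrays.toString also makes visible the empty tokens that arise wherever two delimiters are adjacent; these appear as blank lines when the tokens are printed one per line.

```java
import java.util.Arrays;

class SplitLiteralVsDigitDemo {
    // Assumed input, reconstructed from the outputs shown in the text.
    static final String INPUT = "My1Daddy2cooks34pudding";

    public static void main(String[] args) {
        // "d" is an ordinary character: splits on each lowercase d.
        System.out.println(Arrays.toString(INPUT.split("d")));
        // "\\d" is the digit class: splits on each single digit.
        System.out.println(Arrays.toString(INPUT.split("\\d")));
        // "\\d+" treats a run of digits as one delimiter, so no empty token.
        System.out.println(Arrays.toString(INPUT.split("\\d+")));
    }
}
```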
public String[] split(String regex)
Splits this string around matches of the given regular expression. This method works as if by invoking the two-argument split(...) method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array. The string "boo:and:foo", for example, yields the following results with these expressions:
|
..... |
The output:
|
..... |
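For the string "boo:and:foo", the behaviour described above can be checked with a sketch like this. The two delimiters ":" and "o" are chosen here for illustration, and the class name is hypothetical. With "o", the trailing empty strings produced by the final "oo" are dropped.

```java
import java.util.Arrays;

class SplitBooAndFooDemo {
    static String[] split(String regex) {
        return "boo:and:foo".split(regex);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(":"))); // [boo, and, foo]
        System.out.println(Arrays.toString(split("o"))); // [b, , :and:f]
    }
}
```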
|
..... |
Splits this string around matches of the given regular expression.
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded:
|
..... |
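The effect of the limit argument can be illustrated with a sketch along these lines, again using "boo:and:foo". The particular limits and the class name SplitLimitDemo are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    static final String INPUT = "boo:and:foo";

    static String[] split(String regex, int limit) {
        return INPUT.split(regex, limit);
    }

    public static void main(String[] args) {
        // Positive limit: at most limit - 1 splits, the rest stays in the last entry.
        System.out.println(Arrays.toString(split(":", 2)));  // [boo, and:foo]
        // Negative limit: split everywhere and keep trailing empty strings.
        System.out.println(split("o", -2).length);           // 5
        // Zero limit: split everywhere but discard trailing empty strings.
        System.out.println(split("o", 0).length);            // 3
        // The Pattern form of the same call.
        System.out.println(Pattern.compile("o").split(INPUT, 5).length); // 5
    }
}
```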
An invocation of this method of the form str.split(regex, n) yields the same result as the expression:
Pattern.compile(regex).split(str, n)
The scanner API provides basic input functionality for reading data from the system console or any data stream. The following example reads a String from standard input and expects a following int value:
|
..... |
The Scanner methods like next and nextInt will block if no data is available. If you need to process more complex input, Scanner also offers pattern-matching methods such as findInLine and next(Pattern), which are built on the java.util.regex package.
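Reading a String followed by an int might be sketched as follows. To keep the example runnable without console input, the scanner here reads from a String rather than System.in; that substitution, the sample data, and the class name are assumptions for illustration.

```java
import java.util.Scanner;

class ScannerBasicsDemo {
    // Reads one word and then one int from the given source text.
    static String describe(String source) {
        Scanner sc = new Scanner(source);
        try {
            String name = sc.next();
            int age = sc.nextInt();
            return name + " is " + age;
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("Alice 30")); // Alice is 30
    }
}
```

Replacing the String source with System.in gives the console-reading form, in which next and nextInt block until input arrives.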
java.util.Scanner is a simple text scanner which can parse primitive types and strings using regular expressions.
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various next methods.
For example, this code allows a user to read a number from System.in:
|
..... |
As another example, this code allows long types to be assigned from entries in a file myNumbers:
|
..... |
The scanner can also use delimiters other than whitespace. This example reads several items from a string:
|
..... |
prints the following output:
1
2
red
blue
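A custom delimiter producing the output above might be used as in this sketch. The input sentence and the delimiter pattern are assumptions chosen so that the tokens come out as 1, 2, red, and blue; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiterDemo {
    static List<String> readTokens(String input) {
        // The word "fish", with any surrounding whitespace, is the delimiter.
        Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
        List<String> out = new ArrayList<String>();
        out.add(String.valueOf(s.nextInt()));
        out.add(String.valueOf(s.nextInt()));
        out.add(s.next());
        out.add(s.next());
        s.close();
        return out;
    }

    public static void main(String[] args) {
        for (String token : readTokens("1 fish 2 fish red fish blue fish")) {
            System.out.println(token);
        }
    }
}
```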
The same output can be generated with this code, which uses a regular expression to parse all four tokens at once:
|
..... |
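Parsing all four tokens with a single regular expression could look like the sketch below, which uses Scanner.findInLine and then reads the capturing groups from the resulting MatchResult. The input sentence, the pattern, and the class name are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;

class ScannerFindInLineDemo {
    static List<String> readGroups(String input) {
        Scanner s = new Scanner(input);
        // One pattern with four capturing groups, one per wanted token.
        s.findInLine("(\\d+) fish (\\d+) fish (\\w+) fish (\\w+)");
        MatchResult result = s.match();
        List<String> out = new ArrayList<String>();
        for (int i = 1; i <= result.groupCount(); i++) {
            out.add(result.group(i));
        }
        s.close();
        return out;
    }

    public static void main(String[] args) {
        for (String token : readGroups("1 fish 2 fish red fish blue fish")) {
            System.out.println(token);
        }
    }
}
```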
Class java.util.Scanner implements a simple text scanner (lexical analyzer) which uses regular expressions to parse primitive types and strings from its source.
A Scanner converts the input from its source into tokens using a delimiter pattern, which by default matches whitespace.
The tokens can be converted into values of different types using the various next() methods:
|
..... |
|
..... |
Before parsing the next token with a particular next() method, for example at (3), a lookahead can be performed by the corresponding hasNext() method as shown at (2).
The next() and hasNext() methods and their primitive-type companion methods (such as nextInt() and hasNextInt()) first skip any input that matches the delimiter pattern, and then attempt to return the next token.
A scanner must be constructed to parse text:
Scanner(Type source)
Constructs a scanner that reads from the given source. Type can be a String, a File, an InputStream, a ReadableByteChannel, or a Readable (implemented by CharBuffer and various Readers).
A scanner throws an InputMismatchException when the next token cannot be translated into a valid value of requested type.
Lookahead methods:
|
..... |
The name XXX can be: Byte, Short, Int, Long, Float, Double or BigInteger.
Parsing the next token methods:
|
..... |
The name XXX can be: Byte, Short, Int, Long, Float, Double or BigInteger. The corresponding 'xxx' can be: byte, short, int, long, float, double or BigInteger.
Example:
|
..... |
The output:
true
123
true
45.56
true
true
true
567
true
722
true
blabla
false
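A lookahead-then-parse loop in the same spirit can be sketched as follows. This is not the elided example itself: the input text, the labels, and the class name are assumptions. The locale is fixed to Locale.US so that "45.56" is recognised as a double regardless of the default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerLookaheadDemo {
    // Classifies each token by trying the hasNextXxx lookaheads in turn.
    static List<String> classify(String source) {
        Scanner sc = new Scanner(source).useLocale(Locale.US);
        List<String> out = new ArrayList<String>();
        while (sc.hasNext()) {
            if (sc.hasNextInt()) {
                out.add("int:" + sc.nextInt());
            } else if (sc.hasNextDouble()) {
                out.add("double:" + sc.nextDouble());
            } else if (sc.hasNextBoolean()) {
                out.add("boolean:" + sc.nextBoolean());
            } else {
                out.add("word:" + sc.next());
            }
        }
        sc.close();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classify("123 45.56 true blabla"));
    }
}
```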
Error in parsing:
|
..... |
The output (runtime exception):
|
..... |
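One way to guard against such a failure is sketched below. The helper name tryReadInt and the sample inputs are hypothetical. When nextInt throws InputMismatchException the offending token is not consumed, so it can still be read with next.

```java
import java.util.InputMismatchException;
import java.util.Scanner;

class ScannerMismatchDemo {
    static String tryReadInt(String source) {
        Scanner sc = new Scanner(source);
        try {
            return "int: " + sc.nextInt();
        } catch (InputMismatchException e) {
            // The token that failed to parse is still available.
            return "not an int: " + sc.next();
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryReadInt("42"));     // int: 42
        System.out.println(tryReadInt("blabla")); // not an int: blabla
    }
}
```

Calling the matching hasNextInt method first avoids the exception altogether.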
Developers now have the option of using printf-type functionality to generate formatted output. This will help migrate legacy C applications, as the same text layout can be preserved with little or no change.
Most of the common C printf formatters are available, and in addition some Java classes like Date and BigInteger also have formatting rules. See the java.util.Formatter class for more information. Although the standard UNIX newline '\n' character is accepted, for cross-platform support of newlines the Java %n is recommended. Furthermore, J2SE 5.0 added a printf() method to the PrintStream class. So now you can use System.out.printf() to send formatted numerical output to the console. It uses a java.util.Formatter object internally:
|
..... |
The simplest of the overloaded versions of the method has the form:
|
..... |
The format argument is a string in which you embed specifier substrings that indicate how the arguments appear in the output. For example:
|
..... |
results in the console output:
1. pi = 3,142
2. pi = 3,1415
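The decimal comma in the output above comes from the default locale of the machine that ran the example. A sketch that makes the locale explicit is shown below; the class name PrintfDemo and the helper formatPi are hypothetical, and String.format is used alongside printf so the results can be inspected as strings.

```java
import java.util.Locale;

class PrintfDemo {
    // Formats pi with the requested number of decimals in the given locale.
    static String formatPi(Locale locale, int decimals) {
        // Builds a specifier such as "%.3f" for the requested precision.
        return String.format(locale, "pi = %." + decimals + "f", Math.PI);
    }

    public static void main(String[] args) {
        // %n emits the platform's line separator.
        System.out.printf("1. %s%n", formatPi(Locale.US, 3));      // 1. pi = 3.142
        System.out.printf("2. %s%n", formatPi(Locale.GERMANY, 3)); // 2. pi = 3,142
    }
}
```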