Regular Expressions and C#, .NET
This article explores regular expressions in the context of C# and the .NET support for them: meta-characters and their descriptions, character escapes, substitutions, character classes, regular expression options, and atomic zero-width assertions.
What are regular expressions?
Regular expressions are patterns that can be used to match strings. You can think of one as a formula for matching strings that follow some pattern. Regular expressions can also be considered a small language designed to manipulate text. With a pattern in hand, you can ask questions such as
- “Does the given string match the pattern?”, or
- “Does the given string contain characters that match a pattern?”.
Regular expressions may be used to find one or more occurrences of a pattern of characters within a string. You may then replace the matched text with other characters or perform some other task based on the results. These patterns can be simple or very complex. Regular expressions generally comprise two types of characters:
1) Literal or normal characters such as “abcd123”
2) Special characters that have a special meaning, such as “.”, “$” or “^”
Thanks to these special characters, regular expressions are a very powerful means of manipulating strings and text.
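As a minimal sketch of the two questions above, the following C# snippet uses the static Regex.IsMatch and Regex.Replace methods from System.Text.RegularExpressions. The class name, method names and sample strings are illustrative assumptions, not part of the original article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstMatchDemo
{
    // "Does the given string match the pattern?"
    // The ^ and $ anchors force the digits to span the whole string.
    public static bool IsWholeMatch(string input) =>
        Regex.IsMatch(input, @"^[0-9]+$");

    // "Does the given string contain characters that match a pattern?"
    public static bool ContainsMatch(string input) =>
        Regex.IsMatch(input, @"[0-9]+");

    // Replace every run of digits with a placeholder character.
    public static string MaskDigits(string input) =>
        Regex.Replace(input, @"[0-9]+", "#");

    public static void Main()
    {
        Console.WriteLine(IsWholeMatch("12345"));        // True
        Console.WriteLine(IsWholeMatch("order 12345"));  // False
        Console.WriteLine(ContainsMatch("order 12345")); // True
        Console.WriteLine(MaskDigits("order 12345"));    // order #
    }
}
```

The pattern mixes literal-free ranges ([0-9]) with special characters (^, $, +), which is the literal/special split described above.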
.NET support for Regular Expressions:
.NET provides an extensive set of regular expression constructs that you can use to create, modify or compare strings. They can be classified as follows:
a) Character Escapes
b) Substitutions
c) Character Classes
d) Regular Expression Options
e) Atomic Zero-Width Assertions
f) Quantifiers
g) Grouping Constructs
h) Backreference Constructs
i) Alternation Constructs
j) Miscellaneous Constructs
Meta-characters and their Description
The table uses traditional grep-style notation. Where the .NET syntax differs, the .NET form is given in the description.

| Character | Description |
|---|---|
| . | Matches any single character. For example, the regular expression s.t matches the strings sat and sit, but not sight. |
| $ | Matches the end of a line. For instance, the regular expression reason$ matches the end of the string "He has a reason" but not the string "He has his reasons". |
| ^ | Matches the beginning of a line. For instance, the regular expression ^Where matches the beginning of the string "Where is my cap" but does not match "Do you know Where it is". |
| * | Matches zero or more occurrences of the character immediately preceding. For example, the regular expression .* matches any number of any characters. |
| \ | The escape or quoting character. The character after it is treated as an ordinary character. For example, \^ matches the caret character (^) rather than the beginning of a line. Similarly, \. matches the "." character. |
| [ ] [c1-c2] [^c1-c2] | Matches any one of the characters between the brackets. For example, the regular expression s[ia]t matches sat and sit, but not set. Ranges of characters can be specified by using a hyphen: [0-9] matches any digit. Multiple ranges can be specified as well: [A-Za-z] matches any upper or lower case letter. To match any character except those listed (the complement), use the caret as the first character after the opening bracket: [^123a-z] matches any character except 1, 2, 3 and lower case letters. |
| \< \> | Matches the beginning (\<) or end (\>) of a word. For example, \<the matches "the" in the string "for the older" but does not match "the" in "rather". In .NET, \< and \> match literal angle brackets; use \b for a word boundary instead. |
| \( \) | Treats the expression between \( and \) as a group and saves the characters matched by it in a temporary holding area. Up to nine pattern matches can be saved in a single regular expression and referenced as \1 through \9. In .NET, groups are written with unescaped parentheses, ( and ), and the number of groups is not limited to nine. |
| \| (vertical bar) | ORs two conditions together. For example, (him\|her) matches the line "it belongs to him" and the line "it belongs to her" but does not match the line "it belongs to them." |
| + | Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression 9+ matches 9, 99, 999. |
| ? | Matches zero or one occurrence of the character or regular expression immediately preceding. |
| \{i\} \{i,j\} | Matches a specific number of instances, or a number of instances within a range, of the preceding character. For example, A[0-9]\{3\} matches "A" followed by exactly 3 digits: it matches A123, and in A1234 it matches only the leading A123. The expression [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. In .NET, the braces are written unescaped: {i} and {i,j}. |
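The examples from the table can be checked with a short C# sketch. The Found helper and the class name are illustrative assumptions; the last group of checks uses the .NET spellings (\b, unescaped braces and parentheses) rather than the grep-style forms.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaCharacterDemo
{
    // Thin wrapper so each check reads as "input, pattern".
    public static bool Found(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        Console.WriteLine(Found("sit", "s.t"));                        // True
        Console.WriteLine(Found("sight", "s.t"));                      // False
        Console.WriteLine(Found("He has a reason", "reason$"));        // True
        Console.WriteLine(Found("He has his reasons", "reason$"));     // False
        Console.WriteLine(Found("Do you know Where it is", "^Where")); // False
        Console.WriteLine(Found("set", "s[ia]t"));                     // False
        Console.WriteLine(Found("it belongs to her", "(him|her)"));    // True
        Console.WriteLine(Found("it belongs to them", "(him|her)"));   // False

        // .NET spellings of the grep-style constructs:
        Console.WriteLine(Found("for the older", @"\bthe\b"));         // True
        Console.WriteLine(Found("rather", @"\bthe\b"));                // False
        Console.WriteLine(Found("A123", @"^A[0-9]{3}$"));              // True
        Console.WriteLine(Found("A1234", @"^A[0-9]{3}$"));             // False
        Console.WriteLine(Found("letter", @"(\w)\1"));                 // True: "tt"
    }
}
```

The verbatim string prefix (@) keeps each backslash as typed, so the pattern text in the source matches what the regular expression parser sees.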
Character Escapes
The escape character \ (a single backslash) signals to the regular expression parser that the character following the backslash is not an operator.

| Escape | Description |
|---|---|
| \b | Matches a backspace, but only inside a [ ] character class. Elsewhere in a pattern, \b denotes a word boundary. |
| \t | Matches a tab. |
| \r | Matches a carriage return. |
| \v | Matches a vertical tab. |
| \f | Matches a form feed. |
| \n | Matches a new line. |
| \e | Matches an escape. |
| \040 | Matches an ASCII character as octal (up to three digits). |
| \x20 | Matches an ASCII character using hexadecimal representation (exactly two digits). |
| \cC | Matches an ASCII control character; for example, \cC is control-C. |
| \u0020 | Matches a Unicode character using hexadecimal representation (exactly four digits). |
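A brief sketch of several of these escapes in C# follows. The class and method names are illustrative assumptions; Regex.Escape is the framework method that adds the backslashes for you when user-supplied text must be matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // True when the pattern is found anywhere in the input.
    public static bool Found(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    // Escapes regex operators so the text is matched literally.
    public static string Literal(string text) => Regex.Escape(text);

    public static void Main()
    {
        Console.WriteLine(Found("a\tb", @"a\tb"));     // True: \t matches a tab
        Console.WriteLine(Found("a b", @"a\x20b"));    // True: hex 20 is a space
        Console.WriteLine(Found("a b", @"a\u0020b"));  // True: U+0020 is a space
        Console.WriteLine(Found("a b", @"a\040b"));    // True: octal 040 is a space
        Console.WriteLine(Found("\u0003", @"\cC"));    // True: control-C is U+0003
        Console.WriteLine(Found("3.14", @"^3\.14$"));  // True: \. is a literal dot
        Console.WriteLine(Found("3x14", @"^3\.14$"));  // False
        Console.WriteLine(Literal("1+1=2?"));          // 1\+1=2\?
    }
}
```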
Substitutions:
Provides information on the special constructs used in replacement patterns.
Substitutions are allowed only within replacement patterns.
| Character | Description |
|---|---|
| $number | Substitutes the last substring matched by group number number (decimal). |
| ${name} | Substitutes the last substring matched by a (?<name> ) group. |
| $$ | Substitutes a single "$" literal. |
| $& | Substitutes a copy of the entire match itself. |
| $` | Substitutes all the text of the input string before the match. |
| $' | Substitutes all the text of the input string after the match. |
| $+ | Substitutes the last group captured. |
| $_ | Substitutes the entire input string. |
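The substitutions above are used in the replacement argument of Regex.Replace. The following is a small sketch; the class name, method names and sample inputs are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo
{
    // $1 and $2 refer to the first and second numbered groups.
    public static string SwapNumbered(string input) =>
        Regex.Replace(input, @"(\w+) (\w+)", "$2, $1");

    // ${name} refers to a group declared as (?<name>...).
    public static string SwapNamed(string input) =>
        Regex.Replace(input, @"(?<first>\w+) (?<last>\w+)", "${last}, ${first}");

    // $$ emits a literal dollar sign; $& emits the whole match.
    public static string AddDollar(string input) =>
        Regex.Replace(input, @"\d+", "$$$&");

    // $` is the text before the match, $' the text after it.
    public static string ShowContext(string input) =>
        Regex.Replace(input, "b", "[$`|$']");

    public static void Main()
    {
        Console.WriteLine(SwapNumbered("John Smith")); // Smith, John
        Console.WriteLine(SwapNamed("John Smith"));    // Smith, John
        Console.WriteLine(AddDollar("cost 25"));       // cost $25
        Console.WriteLine(ShowContext("abc"));         // a[a|c]c
    }
}
```

Named groups tend to keep replacement patterns readable when a pattern grows, since adding a new group does not renumber the existing ones.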
Character Classes
A character class is a set of characters that will find a match if any one of the characters included in the set matches.
| Character class | Description |
|---|---|
| . | Matches any character except \n. If modified by the Singleline option, a period character matches any character. |
| [aeiou] | Matches any single character included in the specified set of characters. |
| [^aeiou] | Matches any single character not in the specified set of characters. |
| [0-9a-fA-F] | Use of a hyphen (-) allows specification of contiguous character ranges. |
| \p{name} | Matches any character in the named character class specified by {name}. |
| \P{name} | Matches text not included in groups and block ranges specified in {name}. |
| \w | Matches any word character. |
| \W | Matches any nonword character. |
| \s | Matches any white-space character. |
| \S | Matches any non-white-space character. |
| \d | Matches any decimal digit. |
| \D | Matches any nondigit. |
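To see what each class picks out of a string, the sketch below collects every match with Regex.Matches. The FindAll helper, the class name and the sample text are illustrative assumptions; Lu is the Unicode general category for uppercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Returns every match of the pattern, joined with commas.
    public static string FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        var values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return string.Join(",", values);
    }

    public static void Main()
    {
        string text = "Room 42, Floor 7";
        Console.WriteLine(FindAll(text, @"\d+"));            // 42,7
        Console.WriteLine(FindAll(text, @"\w+"));            // Room,42,Floor,7
        Console.WriteLine(FindAll(text, @"\p{Lu}"));         // R,F
        Console.WriteLine(FindAll("Room", "[aeiou]"));       // o,o
        Console.WriteLine(FindAll("a b\tc", @"\S+"));        // a,b,c
        Console.WriteLine(FindAll("0x1F", "[0-9a-fA-F]+"));  // 0,1F
    }
}
```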
Regular Expression Options:
You can modify a regular expression pattern with options that affect matching behavior.

| RegexOptions member | Inline character | Description |
|---|---|---|
| None | N/A | Specifies that no options are set. |
| IgnoreCase | i | Specifies case-insensitive matching. |
| Multiline | m | Specifies multiline mode. Changes the meaning of ^ and $ so that they match at the beginning and end, respectively, of any line, not just the beginning and end of the whole string. |
| ExplicitCapture | n | Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows parentheses to act as noncapturing groups without the syntactic clumsiness of (?:…). |
| Compiled | N/A | Specifies that the regular expression will be compiled to an assembly. Generates Microsoft intermediate language (MSIL) code for the regular expression; yields faster execution at the expense of startup time. |
| Singleline | s | Specifies single-line mode. Changes the meaning of the period character (.) so that it matches every character (instead of every character except \n). |
| IgnorePatternWhitespace | x | Specifies that unescaped white space is excluded from the pattern and enables comments following a number sign (#). |
| RightToLeft | N/A | Specifies that the search moves from right to left instead of from left to right. |
| ECMAScript | N/A | Specifies that ECMAScript-compliant behavior is enabled for the expression. This option can be used only in conjunction with the IgnoreCase, Multiline and Compiled flags. Use of ECMAScript with any other flags results in an exception. |
| CultureInvariant | N/A | Specifies that cultural differences in language are ignored. |
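The sketch below shows a few of these options in use, both as RegexOptions values and as the inline (?i) construct. The class name, method names and sample strings are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Counts the words that sit at a position where ^ matches.
    public static int CountLineStarts(string input, RegexOptions options) =>
        Regex.Matches(input, @"^\w+", options).Count;

    // IgnorePatternWhitespace allows layout and # comments in the pattern.
    public static bool MatchesPhone(string input)
    {
        string pattern = @"
            ^\d{3}   # three-digit prefix
            -        # separator
            \d{4}$   # four-digit line number";
        return Regex.IsMatch(input, pattern, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        // IgnoreCase through the enum and through the inline construct.
        Console.WriteLine(Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase)); // True
        Console.WriteLine(Regex.IsMatch("HELLO", "(?i)hello"));                      // True

        string lines = "first\nsecond";
        Console.WriteLine(CountLineStarts(lines, RegexOptions.None));      // 1
        Console.WriteLine(CountLineStarts(lines, RegexOptions.Multiline)); // 2

        // Singleline lets the period match \n as well.
        Console.WriteLine(Regex.IsMatch(lines, "first.second"));                          // False
        Console.WriteLine(Regex.IsMatch(lines, "first.second", RegexOptions.Singleline)); // True

        Console.WriteLine(MatchesPhone("555-1234")); // True

        // ExplicitCapture: only the named group is captured.
        Match m = Regex.Match("ab", @"(a)(?<second>b)", RegexOptions.ExplicitCapture);
        Console.WriteLine(m.Groups.Count); // 2: the whole match plus "second"
    }
}
```

Options can be combined with the bitwise OR operator, for example RegexOptions.IgnoreCase | RegexOptions.Multiline.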
Atomic Zero-Width Assertions
These metacharacters do not cause the engine to advance through the string or consume characters. They simply cause a match to succeed or fail depending on the current position in the string.
| Assertion | Description |
|---|---|
| ^ | Specifies that the match must occur at the beginning of the string or, when the Multiline option is set, the beginning of the line. |
| $ | Specifies that the match must occur at the end of the string, before \n at the end of the string, or, when the Multiline option is set, at the end of the line. |
| \A | Specifies that the match must occur at the beginning of the string. |
| \Z | Specifies that the match must occur at the end of the string or before \n at the end of the string. |
| \z | Specifies that the match must occur at the end of the string. |
| \G | Specifies that the match must occur at the point where the previous match ended. When used with Match.NextMatch(), this ensures that matches are all contiguous. |
| \b | Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters. The match must occur on word boundaries, that is, at the first or last characters in words separated by any nonalphanumeric characters. |
| \B | Specifies that the match must not occur on a \b boundary. |
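The differences between these assertions are easiest to see side by side. The following is a minimal sketch; the class name, method names and sample strings are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssertionDemo
{
    public static bool Found(string input, string pattern,
                             RegexOptions options = RegexOptions.None) =>
        Regex.IsMatch(input, pattern, options);

    public static int Count(string input, string pattern) =>
        Regex.Matches(input, pattern).Count;

    public static void Main()
    {
        string text = "one\ntwo\n";
        Console.WriteLine(Found(text, @"two$"));   // True: $ also matches before a final \n
        Console.WriteLine(Found(text, @"two\Z"));  // True: \Z behaves the same way
        Console.WriteLine(Found(text, @"two\z"));  // False: \z needs the very end
        Console.WriteLine(Found(text, @"^two"));   // False: ^ is the string start here
        Console.WriteLine(Found(text, @"^two", RegexOptions.Multiline));  // True
        Console.WriteLine(Found(text, @"\Atwo", RegexOptions.Multiline)); // False: \A ignores Multiline

        string words = "the other the";
        Console.WriteLine(Count(words, @"the"));      // 3
        Console.WriteLine(Count(words, @"\bthe\b"));  // 2: whole words only
        Console.WriteLine(Count(words, @"\Bthe\B"));  // 1: only inside "other"

        // \G keeps matches contiguous: the space stops the run.
        Console.WriteLine(Count("112233 44", @"\G\d\d")); // 3
    }
}
```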