Regular expressions

A regular expression is a sequence of characters that acts as a pattern for matching and manipulating strings. Regular expressions are used in the fn:matches, fn:replace, and fn:tokenize functions.
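The three kinds of operation named above (testing for a match, replacing matches, and splitting on matches) can be sketched as follows. This is only an illustration: it assumes Python's re module as a stand-in, with re.search, re.sub, and re.split playing roles loosely analogous to fn:matches, fn:replace, and fn:tokenize. Python's regular expression dialect is not identical to the one described in this topic.

```python
import re

# Assumption: Python's re module stands in for the XQuery functions.
text = "2024-01-15"

# Loosely analogous to fn:matches: does the pattern occur anywhere in the string?
has_digits = re.search(r"[0-9]+", text) is not None

# Loosely analogous to fn:replace: substitute every match of the pattern.
slashed = re.sub(r"-", "/", text)

# Loosely analogous to fn:tokenize: split the string at every match of the pattern.
parts = re.split(r"-", text)

print(has_digits, slashed, parts)
```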

Syntax

character

In a regular expression, character is a normal XML character that is not a metacharacter.

Metacharacters

Metacharacters are control characters in regular expressions. The regular expression metacharacters that are currently supported are:

backslash (\)

Begins a character class escape. A character class escape indicates that the metacharacter that follows is to be used as a character, instead of a metacharacter.

period (.)

Matches any single character except a newline character (\n).
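A minimal sketch of the period's behavior, assuming Python's re module as a stand-in (its default handling of the period and the newline character happens to agree with the description above):

```python
import re

# Assumption: Python's re module stands in for the XQuery functions.
# The period matches any single character...
matches_letter = re.search(r"a.c", "abc") is not None

# ...but it does not match a newline character.
matches_newline = re.search(r"a.c", "a\nc") is not None

print(matches_letter, matches_newline)
```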

caret (^)

If the caret appears outside of a character class, it anchors the characters that follow it to the start of the input string or, for multi-line input strings, to the start of a line. An input string is considered to be a multi-line input string if the function that uses the input string includes the m flag.

If the caret appears as the first character within a character class, the caret acts as a not-sign. The character class then matches any single character that is not in the character group.

dollar sign ($)

Matches the end of the input string or, for multi-line input strings, the end of a line. An input string is considered to be a multi-line input string if the function that uses the input string includes the m flag.
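The effect of the m flag on the two anchors can be sketched as follows. The example assumes Python's re module as a stand-in, with re.MULTILINE taken as the counterpart of the m flag; the strings are invented for illustration.

```python
import re

# Assumption: Python's re module stands in for the XQuery functions,
# and re.MULTILINE plays the role of the m flag.
text = "first line\nsecond line"

# Without the flag, ^ anchors only at the start of the whole input string.
start_single = re.search(r"^second", text) is not None

# With the flag, ^ also anchors at the start of each line.
start_multi = re.search(r"^second", text, re.MULTILINE) is not None

# Without the flag, $ anchors only at the end of the whole input string.
end_single = re.search(r"first line$", text) is not None

# With the flag, $ also anchors at the end of each line.
end_multi = re.search(r"first line$", text, re.MULTILINE) is not None

print(start_single, start_multi, end_single, end_multi)
```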

question mark (?)

Matches the preceding character or character group in the regular expression zero or one time.

asterisk (*)

Matches the preceding character or character group in the regular expression zero or more times.

plus sign (+)

Matches the preceding character or character group in the regular expression one or more times.

{n}

Matches the preceding character or character group in the regular expression exactly n times. n must be a positive integer.

{n,m}

Matches the preceding character or character group in the regular expression at least n times, but not more than m times. n must be a positive integer, and m must be a positive integer that is greater than or equal to n.

{n,}

Matches the preceding character or character group in the regular expression at least n times. n must be a positive integer.
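The repetition operators above can be compared side by side. The following sketch assumes Python's re module as a stand-in and uses a hypothetical helper, whole, that tests the pattern against the entire string so that the repetition count is the only thing that varies.

```python
import re

def whole(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# Each row: pattern, a string that satisfies the count, a string that does not.
cases = [
    (r"ab?c", "ac", "abbc"),         # ? allows zero or one b
    (r"ab*c", "abbbc", "abxc"),      # * allows zero or more b
    (r"ab+c", "abc", "ac"),          # + requires at least one b
    (r"ab{2}c", "abbc", "abc"),      # {2} requires exactly two b
    (r"ab{2,3}c", "abbbc", "abbbbc"),  # {2,3} allows two or three b
    (r"ab{2,}c", "abbbbc", "abc"),   # {2,} requires two or more b
]

results = [(p, whole(p, good), whole(p, bad)) for p, good, bad in cases]
for row in results:
    print(row)
```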

opening bracket ([) and closing bracket (])

The opening and closing brackets and the enclosed character group define a character class. For example, the character class [aeiou] matches any single vowel. Character classes also support character ranges. For example:

[a-z] means any lowercase letter.
[a-p] means any lowercase letter from a through p.
[0-9] means any single digit.
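As a brief illustration of character classes and ranges, the following sketch collects every single-character match in a sample string. It assumes Python's re module as a stand-in, and the sample strings are invented. The last line shows a class that begins with a caret, which matches characters that are not in the group.

```python
import re

# Assumption: Python's re module stands in for the XQuery functions.
vowels = re.findall(r"[aeiou]", "regular")    # each vowel in the string
a_to_p = re.findall(r"[a-p]", "zap")          # lowercase letters from a through p
digits = re.findall(r"[0-9]", "a1b22")        # each single digit
non_digits = re.findall(r"[^0-9]", "a1b22")   # leading caret negates the class

print(vowels, a_to_p, digits, non_digits)
```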

opening parenthesis (() and closing parenthesis ())

Opening and closing parentheses group a sequence of characters within a regular expression. You can then apply an operator, such as a repetition operator, to the entire group.
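The difference that grouping makes to a repetition operator can be sketched as follows, assuming Python's re module as a stand-in. Without parentheses, the plus sign applies only to the character b; with parentheses, it applies to the whole sequence ab.

```python
import re

def whole(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# (ab)+ repeats the group "ab"; ab+ repeats only the "b".
grouped_on_abab = whole(r"(ab)+", "ababab")
ungrouped_on_abab = whole(r"ab+", "ababab")
grouped_on_abbb = whole(r"(ab)+", "abbb")
ungrouped_on_abbb = whole(r"ab+", "abbb")

print(grouped_on_abab, ungrouped_on_abab, grouped_on_abbb, ungrouped_on_abbb)
```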

character-class-escape

A character class escape specifies that a special character is to be treated as an ordinary character, instead of as a metacharacter. A character class escape consists of a backslash (\), followed by a single metacharacter, newline character, return character, or tab character. The following table lists the character class escapes.

Table 1. Single-character character class escapes
Character escape	Character represented	Description
`\n`	#x0A	Newline
`\r`	#x0D	Return
`\t`	#x09	Tab
`\\`	\	Backslash
`\|`	|	Pipe
`\.`	.	Period
`\?`	?	Question mark
`\*`	*	Asterisk
`\+`	+	Plus sign
`\(`	(	Opening parenthesis
`\)`	)	Closing parenthesis
`\{`	{	Opening curly brace
`\}`	}	Closing curly brace
`\$`	$	Dollar sign
`\-`	-	Dash
`\[`	[	Opening bracket
`\]`	]	Closing bracket
`\^`	^	Caret
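A few of the escapes in the table can be exercised as follows. This is a sketch that assumes Python's re module as a stand-in; the escapes shown are written the same way there, but the sample strings are invented.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if the pattern occurs anywhere in the string."""
    return re.search(pattern, text) is not None

# An escaped period matches only a literal period.
literal_dot = (found(r"a\.b", "a.b"), found(r"a\.b", "axb"))

# An unescaped plus sign is a repetition operator; escaped, it is a literal character.
literal_plus = (found(r"3\+4", "3+4"), found(r"3\+4", "334"))
repeat_plus = found(r"3+4", "334")

# \t represents the tab character, and \$ a literal dollar sign.
tab = found(r"a\tb", "a\tb")
dollar = found(r"\$5", "cost: $5")

print(literal_dot, literal_plus, repeat_plus, tab, dollar)
```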

character-group

A character group is the set of characters in a character class. The character class is used for matching. It can consist of characters, character ranges, character class escapes, and an optional opening caret. If the caret is included, it indicates the complement of the set of characters that are defined by the rest of the character group.

Examples

The following examples demonstrate how each of the metacharacters affects a regular expression.

"hello[0-9]world" matches "hello3world", but not "hello world".
"^hello" matches this text:
```
hello world
```
However, "^hello" does not match this text:
```
world hello
```
"hello$" matches this text:
```
world hello
```
However, "hello$" does not match this text:
```
hello world
```
"(ca)|(bd)" matches "arcade" or "abdicate".
"^((ca)|(bd))" does not match "arcade" or "abdicate".
"w?s" matches "ws" or "s".
"w.*s" matches "was" or "waters".
"be+t" matches "beet" or "bet".
"be{1,3}t" matches "bet", "beet", or "beeet".
"\[n\]" matches "[n]".
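The examples above can be checked mechanically. The following sketch assumes Python's re module as a stand-in, with re.search used to test whether a pattern occurs anywhere in a string; the names EXAMPLES, found, and check_examples are hypothetical.

```python
import re

# Each entry: pattern, strings expected to match, strings expected not to match.
EXAMPLES = [
    (r"hello[0-9]world", ["hello3world"], ["hello world"]),
    (r"^hello", ["hello world"], ["world hello"]),
    (r"hello$", ["world hello"], ["hello world"]),
    (r"(ca)|(bd)", ["arcade", "abdicate"], []),
    (r"^((ca)|(bd))", [], ["arcade", "abdicate"]),
    (r"w?s", ["ws", "s"], []),
    (r"w.*s", ["was", "waters"], []),
    (r"be+t", ["beet", "bet"], []),
    (r"be{1,3}t", ["bet", "beet", "beeet"], []),
    (r"\[n\]", ["[n]"], []),
]

def found(pattern, text):
    """True if the pattern occurs anywhere in the string."""
    return re.search(pattern, text) is not None

def check_examples(examples):
    """Return the (pattern, string) pairs whose result differs from the expectation."""
    failures = []
    for pattern, should_match, should_not_match in examples:
        failures += [(pattern, s) for s in should_match if not found(pattern, s)]
        failures += [(pattern, s) for s in should_not_match if found(pattern, s)]
    return failures

print(check_examples(EXAMPLES))
```

An empty list indicates that every example behaved as described under this stand-in dialect.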