Extended regular expressions in the lex command

Specifying extended regular expressions in a lex specification file is similar to the methods used in the sed or ed commands.

An extended regular expression specifies a set of strings to be matched. The expression contains both text characters and operator characters. Text characters match the corresponding characters in the strings being compared. Operator characters specify repetitions, choices, and other features.

Numbers and letters of the alphabet are considered text characters. For example, the extended regular expression integer matches the string integer, and the expression a57D looks for the string a57D.

Operators

The following list describes how operators are used to specify extended regular expressions:
Character
Matches the character Character.

Example: a matches the literal character a, b matches the literal character b, and c matches the literal character c.

"String"
Matches the string enclosed within quotes, even if the string includes an operator.

Example: To prevent the lex command from interpreting $ (dollar sign) as an operator, enclose the symbol in quotes: "$".

\Character or \Digits
Escape character. When preceding a character class operator used in a string, the \ character indicates that the operator symbol represents a literal character rather than an operator. Valid escape sequences include:
\a
Alert
\b
Backspace
\f
Form-feed
\n
New line character (Do not use the actual new line character in an expression.)
\r
Return
\t
Tab
\v
Vertical tab
\\
Backslash
\Digits
The character whose encoding is represented by the one-digit, two-digit, or three-digit octal integer specified by the Digits string.
\xDigits
The character whose encoding is represented by the sequence of hexadecimal characters specified by the Digits string.

When the \ character precedes a character that is not in the preceding list of escape sequences, the lex command interprets the character literally.

Example: \c is interpreted as the c character unchanged, and [\^abc] represents the class of characters that includes the characters ^, a, b, and c.

Note: Never use \0 or \x0 in the lex command.
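
As a rough illustration of these escape sequences, the following sketch counts tabs, new lines, and backslashes on standard input. It assumes a POSIX-style lex; the counter names and the reporting code in main are made up for this example and are not part of the lex command.

```lex
%{
/* Sketch only: count tabs, new lines, and backslashes on standard input. */
#include <stdio.h>
static long tabs, lines, backslashes;   /* hypothetical counter names */
%}
%%
\t      { tabs++;        /* \t stands for the tab character */ }
\n      { lines++;       /* \n stands for the new line character */ }
\\      { backslashes++; /* \\ stands for one literal backslash */ }
.       { /* ignore every other character */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    printf("tabs=%ld lines=%ld backslashes=%ld\n", tabs, lines, backslashes);
    return 0;
}
```
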
[List]
Matches any one character in the enclosed range ([x-y]) or the enclosed list ([xyz]) based on the locale in which the lex command is invoked. All operator symbols, with the exception of the following, lose their special meaning within a bracket expression: - (dash), ^ (caret), and \ (backslash).

Example: [abc-f] matches a, b, c, d, e, or f in the en_US locale.

[:Class:]
Matches any of the characters belonging to the character class specified between the [::] delimiters as defined in the LC_CTYPE category in the current locale. The following character class names are supported in all locales:
alnum   cntrl   lower   space
alpha   digit   print   upper
blank   graph   punct   xdigit

The lex command also recognizes user-defined character class names. The [::] operator is valid only in a [] expression.

Example: [[:alpha:]] matches any character in the alpha character class in the current locale, but [:alpha:] matches only the characters :, a, l, p, and h.
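
The character class names can be tried out with a small specification such as the following. This is a minimal sketch, assuming a POSIX-style lex; the output labels are invented for the example, and which characters count as alpha, digit, or space depends on the locale in effect.

```lex
%{
/* Sketch only: label runs of letters, digits, and other characters. */
#include <stdio.h>
%}
%%
[[:alpha:]]+    { printf("word:   %s\n", yytext); }
[[:digit:]]+    { printf("number: %s\n", yytext); }
[[:space:]]+    { /* skip white space, including new lines */ }
.               { printf("other:  %s\n", yytext); }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```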

[.CollatingSymbol.]
Matches the collating symbol specified within the [..] delimiters as a single character. The [..] operator is valid only in a [ ] expression. The collating symbol must be a valid collating symbol for the current locale.

Example: [[.ch.]] matches c and h together while [ch] matches c or h.

[=CollatingElement=]
Matches the collating element specified within the [==] delimiters and all collating elements belonging to its equivalence class. The [==] operator is valid only in a [] expression.

Example: If w and v belong to the same equivalence class, [[=w=]] is the same as [wv] and matches w or v. If w does not belong to an equivalence class, then [[=w=]] matches w only.

[^Character]
Matches any character except the one following the ^ (caret) symbol. The resultant character class consists solely of single-byte characters. The character following the ^ symbol can be a multibyte character. However, for this operator to match multibyte characters, you must set %h and %m to greater than zero in the definitions section.

Example: [^c] matches any character except c.

CollatingElement-CollatingElement
In a character class, indicates a range of characters within the collating sequence defined for the current locale. Ranges must be in ascending order. The ending range point must collate equal to or higher than the starting range point. Because the range is based on the collating sequence of the current locale, a given range may match different characters, depending on the locale in which the lex command was invoked.

Expression?
Matches either zero or one occurrence of the expression immediately preceding the ? operator.

Example: ab?c matches either ac or abc.

Period character (.)
Matches any character except the new line character. For the period character (.) to match multibyte characters, %z must be set to greater than 0 in the definitions section of the lex specification file. If %z is not set, the period character (.) matches single-byte characters only.

Expression*
Matches zero or more occurrences of the expression immediately preceding the * operator. For example, a* matches any number of consecutive a characters, including zero. The usefulness of matching zero occurrences is more obvious in complicated expressions.

Example: The expression [A-Za-z][A-Za-z0-9]* matches all alphanumeric strings with a leading alphabetic character, including strings that are only one alphabetic character. You can use this expression for recognizing identifiers in computer languages.
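
A specification built around the identifier expression might look like the sketch below. It assumes a POSIX-style lex and simply prints each identifier it finds; the file name ident.l used in the note that follows is hypothetical.

```lex
%{
/* Sketch only: print every identifier found on standard input. */
#include <stdio.h>
%}
%%
[A-Za-z][A-Za-z0-9]*    { printf("identifier: %s\n", yytext); }
.|\n                    { /* ignore everything that is not an identifier */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```

If the specification is saved as ident.l, it can typically be built with lex ident.l followed by cc lex.yy.c -o ident. Because * allows zero repetitions of the second bracket expression, a single letter such as x is reported as an identifier too.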

Expression+
Matches one or more occurrences of the pattern immediately preceding the + operator.

Example: a+ matches one or more instances of a. Also, [a-z]+ matches all strings of one or more lowercase letters.

Expression|Expression
Indicates a match for the expression that precedes or follows the | (pipe) operator.

Example: ab|cd matches either ab or cd.

(Expression)
Matches the expression in the parentheses. The () (parentheses) operator is used for grouping and causes the expression within parentheses to be read into the yytext array. A group in parentheses can be used in place of any single character in any other pattern.

Example: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

^Expression
Indicates a match only when Expression is at the beginning of the line and the ^ (caret) operator is the first character in an expression.

Example: ^h matches an h at the beginning of a line.

Expression$
Indicates a match only when Expression is at the end of the line and the $ (dollar sign) operator is the last character in an expression.

Example: h$ matches an h at the end of a line.
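
The two anchoring operators can be sketched together as follows. The example assumes a POSIX-style lex; the counters and the choice of # and trailing white space as things to look for are illustrative only.

```lex
%{
/* Sketch only: count lines that begin with # and lines that end in
   blanks or tabs. The counter names are hypothetical. */
#include <stdio.h>
static long hash_lines, trailing_blank_lines;
%}
%%
^"#"        { hash_lines++;           /* # only as the first character of a line */ }
[ \t]+$     { trailing_blank_lines++; /* blanks or tabs directly before a new line */ }
.|\n        { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    printf("lines starting with #: %ld\n", hash_lines);
    printf("lines with trailing white space: %ld\n", trailing_blank_lines);
    return 0;
}
```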

Expression1/Expression2
Indicates a match only if Expression2 immediately follows Expression1. The / (slash) operator reads only the first expression into the yytext array.

Example: ab/cd matches the string ab, but only if followed by cd, and then reads ab into the yytext array.

Note: Only one / trailing context operator can be used in a single extended regular expression. The ^ (caret) and $ (dollar sign) operators cannot be used in the same expression with the / operator as they indicate special cases of trailing context.
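
The effect of trailing context on the yytext array can be observed with a sketch like the one below, which assumes a POSIX-style lex. Only ab is expected to appear in yytext; the cd that follows stays in the input and is scanned again by the later rules.

```lex
%{
/* Sketch only: report ab when it is immediately followed by cd. */
#include <stdio.h>
%}
%%
ab/cd   { printf("matched \"%s\" (length %d) before cd\n", yytext, (int)yyleng); }
.|\n    { /* ignore everything else, including the cd that was only looked at */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```
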
{DefinedName}
Matches the name as you defined it in the definitions section.

Example: If you defined D to be numerical digits, {D} matches all numerical digits.
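
As a minimal sketch of a defined name in use, assuming a POSIX-style lex, the specification below defines D in the definitions section and refers to it as {D} in the rules. The output labels are invented for the example.

```lex
%{
/* Sketch only: D is defined once and reused in two rules. */
#include <stdio.h>
%}
D       [0-9]
%%
{D}+"."{D}+     { printf("decimal: %s\n", yytext); }
{D}+            { printf("integer: %s\n", yytext); }
.|\n            { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```

Because lex prefers the longest match, an input such as 3.14 is reported by the decimal rule rather than as two integers.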

{Number1,Number2}
Matches Number1 to Number2 occurrences of the pattern immediately preceding it. The expressions {Number} and {Number,} are also allowed: {Number} matches exactly Number occurrences of the pattern preceding the expression, and {Number,} matches Number or more occurrences.

Example: xyz{2,4} matches xyzxyz, xyzxyzxyz, or xyzxyzxyzxyz. This differs from the +, *, and ? operators, which apply only to the character immediately preceding them. To apply the interval expression to only the character preceding it, use the grouping operator. For example, xy(z{2,4}) matches xyzz, xyzzz, or xyzzzz.
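
The sketch below spells out both readings with explicit parentheses, assuming a POSIX-style lex. Grouping is used on purpose: some other implementations, such as flex, document a different precedence for an ungrouped interval, so the parenthesized forms keep the intent unambiguous.

```lex
%{
/* Sketch only: interval expressions applied to a group and to one character. */
#include <stdio.h>
%}
%%
(xyz){2,4}      { printf("two to four xyz groups: %s\n", yytext); }
xy(z{2,4})      { printf("xy plus two to four z:  %s\n", yytext); }
.|\n            { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```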

<StartCondition>
Executes the associated action only if the lexical analyzer is in the indicated start condition.

Example: If being at the beginning of a line is start condition ONE, then the ^ (caret) operator equals the expression <ONE>.
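
A start condition in practice might be sketched as follows. This is an illustration under assumptions rather than part of the description above: it relies on the usual lex %s declaration and BEGIN action, which this section does not cover, and the name ONE is reused only for continuity with the example.

```lex
%{
/* Sketch only: print lowercase words, but only on lines that start with #. */
#include <stdio.h>
%}
%s ONE
%%
^"#"            { BEGIN ONE; /* enter condition ONE for the rest of the line */ }
<ONE>[a-z]+     { printf("word on a # line: %s\n", yytext); }
\n              { BEGIN 0;   /* return to the initial condition */ }
.               { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```

The rule prefixed with <ONE> is considered only while the analyzer is in condition ONE, so words on lines that do not start with # fall through to the rule that ignores single characters.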

To use the operator characters as text characters, use one of the escape operators: " " (double quotation marks) or \ (backslash). The " " operator indicates that what is enclosed is text. Thus, the following example matches the string xyz++:

xyz"++"

A portion of a string can be quoted. Quoting an ordinary text character has no effect. For example, the following expression is equivalent to the previous example:

"xyz++"

To ensure that text is interpreted as text, quote all characters that are not letters or numbers.

Another way to convert an operator character to a text character is to put a \ (backslash) character before the operator character. For example, the following expression is equivalent to the preceding examples:

xyz\+\+
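
To see the difference between a quoted and an unquoted +, the following sketch puts both forms in one specification. It assumes a POSIX-style lex, and the output labels are invented for the example.

```lex
%{
/* Sketch only: + as text (quoted) versus + as an operator (unquoted). */
#include <stdio.h>
%}
%%
xyz"++"     { printf("literal text:     %s\n", yytext); }
xyz+        { printf("xy and z repeats: %s\n", yytext); }
.|\n        { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();
    return 0;
}
```

With the input xyz++, the first rule produces the longer match and is chosen; with the input xyzzz, only the second rule applies.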