Extended regular expressions in the lex command
Specifying extended regular expressions in a lex specification file is similar to methods used in the sed or ed commands.
An extended regular expression specifies a set of strings to be matched. The expression contains both text characters and operator characters. Text characters match the corresponding characters in the strings being compared. Operator characters specify repetitions, choices, and other features.
Numbers and letters of the alphabet are considered text characters. For example, the extended regular expression integer matches the string integer, and the expression a57D looks for the string a57D.
Operators
- Character
- Matches the character Character.
Example: a matches the literal character a; b matches the literal character b, and c matches the literal character c.
- "String"
- Matches the string enclosed within quotes, even if the string
includes an operator.
Example: To prevent the lex command from interpreting $ (dollar sign) as an operator, enclose the symbol in quotes.
- \Character or \Digits
- Escape character. When preceding a character class operator used
in a string, the \ character indicates that the operator
symbol represents a literal character rather than an operator. Valid
escape sequences include:
- \a
- Alert
- \b
- Backspace
- \f
- Form-feed
- \n
- New line character (Do not use the actual new line character in an expression.)
- \r
- Return
- \t
- Tab
- \v
- Vertical tab
- \\
- Backslash
- \Digits
- The character whose encoding is represented by the one-digit, two-digit, or three-digit octal integer specified by the Digits string.
- \xDigits
- The character whose encoding is represented by the sequence of
hexadecimal characters specified by the Digits string.
When the \ character precedes a character that is not in the preceding list of escape sequences, the lex command interprets the character literally.
Example: \c is interpreted as the c character unchanged, and [\^abc] represents the class of characters that includes the characters ^abc.
Note: Never use \0 or \x0 in the lex command.
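To check which character a numeric escape denotes, the digits can be decoded by hand. The following sketch uses Python only as a calculator and is not part of the lex command. The helper names octal_escape_char and hex_escape_char are invented for this illustration, and the result is shown as a code point, which agrees with the single-byte encoding only in the ASCII range.

```python
def octal_escape_char(digits: str) -> str:
    """Return the character that \\Digits denotes (one to three octal digits)."""
    if not 1 <= len(digits) <= 3 or any(d not in "01234567" for d in digits):
        raise ValueError("expected one to three octal digits")
    return chr(int(digits, 8))


def hex_escape_char(digits: str) -> str:
    """Return the character that \\xDigits denotes (hexadecimal digits)."""
    return chr(int(digits, 16))


print(octal_escape_char("101"))       # A
print(hex_escape_char("41"))          # A
print(repr(octal_escape_char("11")))  # '\t'
```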
- [List]
- Matches any one character in the enclosed range ([x-y])
or the enclosed list ([xyz]) based on the
locale in which the lex command is invoked. All operator symbols,
with the exception of the following, lose their special meaning within
a bracket expression: - (dash), ^ (caret), and \ (backslash).
Example: [abc-f] matches a, b, c, d, e, or f in the en_US locale.
- [:Class:]
- Matches any of the characters belonging to the character class
specified between the [::] delimiters as defined in the LC_CTYPE
category in the current locale. The following character class names
are supported in all locales:
alnum cntrl lower space alpha digit print upper blank graph punct xdigit
The lex command also recognizes user-defined character class names. The [::] operator is valid only in a [] expression.
Example: [[:alpha:]] matches any character in the alpha character class in the current locale, but [:alpha:] matches only the characters :, a, l, p, and h.
- [.CollatingSymbol.]
- Matches the collating symbol specified within the [..] delimiters
as a single character. The [..] operator is valid only in
a [ ] expression. The collating symbol must be
a valid collating symbol for the current locale.
Example: [[.ch.]] matches c and h together while [ch] matches c or h.
- [=CollatingElement=]
- Matches the collating element specified within the [==] delimiters
and all collating elements belonging to its equivalence class. The [==] operator
is valid only in a [] expression.
Example: If w and v belong to the same equivalence class, [[=w=]] is the same as [wv] and matches w or v. If w does not belong to an equivalence class, then [[=w=]] matches w only.
- [^Character]
- Matches any character except the one following the ^ (caret)
symbol. The resultant character class consists solely of single-byte
characters. The character following the ^ symbol can be a
multibyte character. However, for this operator to match multibyte
characters, you must set %h and %m to greater than zero
in the definitions section.
Example: [^c] matches any character except c.
- CollatingElement-CollatingElement
- In a character class, indicates a range of characters within the collating sequence defined for the current locale. Ranges must be in ascending order. The ending range point must collate equal to or higher than the starting range point. Because the range is based on the collating sequence of the current locale, a given range may match different characters, depending on the locale in which the lex command was invoked.
- Expression?
- Matches either zero or one occurrence of the expression immediately
preceding the ? operator.
Example: ab?c matches either ac or abc.
- Period character (.)
- Matches any character except the new line character. For the period character (.) to match multibyte characters, %z must be set to greater than 0 in the definitions section of the lex specification file. If %z is not set, the period character (.) matches single-byte characters only.
- Expression*
- Matches zero or more occurrences of the expression immediately
preceding the * operator. For example, a* is any
number of consecutive a characters, including zero. The usefulness
of matching zero occurrences is more obvious in complicated expressions.
Example: The expression [A-Za-z][A-Za-z0-9]* matches all alphanumeric strings with a leading alphabetic character, including strings that are only one alphabetic character. You can use this expression for recognizing identifiers in computer languages.
- Expression+
- Matches one or more occurrences of the pattern immediately preceding
the + operator.
Example: a+ matches one or more instances of a. Also, [a-z]+ matches any string of one or more lowercase letters.
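The ?, *, and + examples above can be tried out in any engine that shares these three operators. The following sketch uses Python's re module as a stand-in for the lex command, which is an assumption of this illustration. The helper name matches is hypothetical, and it tests whether a whole string belongs to the set described by the expression.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# ? allows zero or one b between a and c.
print(matches(r"ab?c", "ac"), matches(r"ab?c", "abc"), matches(r"ab?c", "abbc"))
# True True False

# * allows the tail of an identifier to be empty.
identifier = r"[A-Za-z][A-Za-z0-9]*"
print(matches(identifier, "x"), matches(identifier, "a57D"), matches(identifier, "9lives"))
# True True False

# + needs at least one occurrence, * does not.
print(matches(r"a+", ""), matches(r"a*", ""))
# False True
```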
- Expression|Expression
- Indicates a match for the expression that precedes or follows
the | (pipe) operator.
Example: ab|cd matches either ab or cd.
- (Expression)
- Matches the expression in the parentheses. The () (parentheses)
operator is used for grouping and causes the expression within parentheses
to be read into the yytext array. A group in parentheses can
be used in place of any single character in any other pattern.
Example: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
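The grouping example above can be checked mechanically. The following sketch again uses Python's re module as a stand-in for the lex command (an assumption of this illustration) and tests each string as a whole. A generated lexical analyzer would instead take the longest matching prefix of its input, so the sketch only confirms which strings belong to the set that the expression describes.

```python
import re

# Same expression as in the example; grouping and alternation are written
# the same way in Python's re module.
PATTERN = re.compile(r"(ab|cd+)?(ef)*")


def whole_match(text: str) -> bool:
    """Return True if text, taken as a whole, matches the expression."""
    return PATTERN.fullmatch(text) is not None


accepted = [s for s in ("abefef", "efefef", "cdef", "cddd") if whole_match(s)]
rejected = [s for s in ("abc", "abcd", "abcdef") if not whole_match(s)]

print(accepted)  # ['abefef', 'efefef', 'cdef', 'cddd']
print(rejected)  # ['abc', 'abcd', 'abcdef']
```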
- ^Expression
- Indicates a match only when Expression is at the beginning
of the line and the ^ (caret) operator is the first character
in an expression.
Example: ^h matches an h at the beginning of a line.
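As an illustration of the beginning-of-line anchor, the following sketch finds every h that starts a line in a small multi-line input. It uses Python's re module with the MULTILINE flag as a stand-in for the lex command, which is an assumption of this illustration. The sample text is made up.

```python
import re

text = "hat\noh\nhi"

# With re.MULTILINE, ^ matches at the start of every line, which is the
# behavior described for ^Expression.
line_start_h = [m.start() for m in re.finditer(r"^h", text, re.MULTILINE)]

print(line_start_h)  # [0, 7]
```

The h in oh is skipped because it is not at the beginning of its line.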
- Expression$
- Indicates a match only when Expression is at the end of
the line and the $ (dollar sign) operator is the last character
in an expression.
Example: h$ matches an h at the end of a line.
- Expression1/Expression2
- Indicates a match only if Expression2 immediately follows Expression1.
The / (slash) operator reads only the first expression into
the yytext array.
Example: ab/cd matches the string ab, but only if followed by cd, and then reads ab into the yytext array.
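The effect of trailing context can be approximated with a lookahead assertion in engines that have one. The following sketch uses Python's re module, not the lex command, and the function name scan is hypothetical. The matched text returned by the function plays the role of the contents of the yytext array, and the remainder is what would still be waiting in the input.

```python
import re

# The lookahead (?=cd) stands in for the trailing context /cd: cd must be
# present, but it is not part of the matched text.
TRAILING = re.compile(r"ab(?=cd)")


def scan(text: str):
    """Return (matched_text, remaining_input), or None if there is no match."""
    m = TRAILING.match(text)
    if m is None:
        return None
    return m.group(0), text[m.end():]


print(scan("abcd"))  # ('ab', 'cd')
print(scan("abce"))  # None
```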
Note: Only one / trailing context operator can be used in a single extended regular expression. The ^ (caret) and $ (dollar sign) operators cannot be used in the same expression with the / operator, as they indicate special cases of trailing context.
- {DefinedName}
- Matches the name as you defined it in the definitions section.
Example: If you defined D to be numerical digits, {D} matches all numerical digits.
- {Number1,Number2}
- Matches Number1 to Number2 occurrences of the pattern
immediately preceding it. The expressions {Number} and {Number,} are
also allowed: {Number} matches exactly Number occurrences of the preceding
pattern, and {Number,} matches Number or more occurrences.
Example: xyz{2,4} matches either xyzxyz, xyzxyzxyz, or xyzxyzxyzxyz. The interval operator applies to the whole preceding pattern xyz, unlike the +, *, and ? operators, which apply only to the character immediately preceding them. To apply the interval expression only to the character preceding it, use the grouping operator. For example, xy(z{2,4}) matches xyzz, xyzzz, or xyzzzz.
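The two readings of the interval operator can be compared side by side. The following sketch uses Python's re module, in which an interval binds only to the item immediately before it. The grouping is therefore written out explicitly to reproduce the behavior described above for the lex command. The variable names and the helper matches are invented for this illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# Interval applied to the whole pattern, as described for xyz{2,4} in lex.
whole_pattern = r"(?:xyz){2,4}"
# Interval applied to the last character only, as in xy(z{2,4}).
last_char_only = r"xy(?:z{2,4})"

candidates = ("xyz", "xyzz", "xyzzzz", "xyzxyz", "xyzxyzxyzxyz")

print([s for s in candidates if matches(whole_pattern, s)])
# ['xyzxyz', 'xyzxyzxyzxyz']
print([s for s in candidates if matches(last_char_only, s)])
# ['xyzz', 'xyzzzz']
```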
- <StartCondition>
- Executes the associated action only if the lexical analyzer
is in the indicated start condition.
Example: If being at the beginning of a line is start condition ONE, then the ^ (caret) operator equals the expression <ONE>.
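Operator characters can be used as text characters by enclosing them in double quotation marks. For example, the following expression matches the string xyz++:
xyz"++"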
A portion of a string can be quoted. Quoting an ordinary text character has no effect. For example, the following expression is equivalent to the previous example:
"xyz++"
To ensure that text is interpreted as text, quote all characters that are not letters or numbers.
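Alternatively, precede each operator character with a \ (backslash). The following expression is equivalent to the previous examples:
xyz\+\+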
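When expressions are generated by a program, the same protection can be applied automatically. The following sketch uses re.escape from Python's re module, which produces the backslash form. It is an illustration in Python, not a feature of the lex command, and the variable names are made up.

```python
import re

# re.escape puts a backslash before each operator character, giving the
# same text as the hand-written backslash form.
literal_tail = re.escape("++")
pattern = "xyz" + literal_tail

print(pattern)                                      # xyz\+\+
print(re.fullmatch(pattern, "xyz++") is not None)   # True
print(re.fullmatch(pattern, "xyzz") is not None)    # False
```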