Parle pattern matching
Parle supports regex matching similar to flex.
The following POSIX character sets are also supported: [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], [:xdigit:].
The Unicode character classes are not enabled by default. Pass --enable-parle-utf32 at compile time to make them available.
A particular encoding can be mapped with a correctly constructed regex.
For example, to match the euro sign encoded in UTF-8, the regular expression [\xe2][\x82][\xac] can be used.
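The byte sequence behind that expression can be checked with a short sketch. It uses Python's `re` module on byte strings as a stand-in, not Parle itself, and the helper name `find_euro_offsets` is made up for illustration.

```python
import re

# The euro sign (U+20AC) is one code point that UTF-8 encodes as three bytes.
EURO_BYTES = "€".encode("utf-8")

# Byte-level pattern mirroring [\xe2][\x82][\xac]; one bracket per byte.
EURO_PATTERN = re.compile(rb"[\xe2][\x82][\xac]")

def find_euro_offsets(data: bytes) -> list:
    """Return the byte offsets at which a UTF-8 encoded euro sign starts."""
    return [m.start() for m in EURO_PATTERN.finditer(data)]

sample = "5 € or 10 €".encode("utf-8")
print(EURO_BYTES)                 # b'\xe2\x82\xac'
print(find_euro_offsets(sample))  # [2, 12]
```

The offsets are byte positions, not character positions, because the pattern works on the encoded form.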
The pattern [ -\x7f]{+}[\x80-\xbf]{+}[\xc2-\xdf]{+}[\xe0-\xef]{+}[\xf0-\xff]+ could be used for a UTF-8 encoded string.
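As a rough illustration of what the UTF-8 string pattern accepts, the sketch below folds the {+} unions into a single bracket expression and tests it with Python's `re` on bytes. It assumes that {+} is plain set union and that the trailing + applies to the combined set. The function name `looks_like_utf8_text` is hypothetical.

```python
import re

# Assumption: the five ranges joined by {+} form one byte set, repeated by "+".
UTF8_TEXT = re.compile(rb"[ -\x7f\x80-\xbf\xc2-\xdf\xe0-\xef\xf0-\xff]+")

def looks_like_utf8_text(data: bytes) -> bool:
    """True if every byte of data falls inside the combined byte set."""
    return UTF8_TEXT.fullmatch(data) is not None

print(looks_like_utf8_text("Grüße, 10 €".encode("utf-8")))  # True
print(looks_like_utf8_text(b"line\n"))                      # False, \n is below the space character
print(looks_like_utf8_text(b"\xc0\x80"))                    # False, 0xc0 is not in any range
```

Under this reading the pattern only restricts which byte values may occur. It does not check that lead and continuation bytes appear in a valid order, so a lone `b"\x80"` would also be accepted.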
Character representations
Sequence | Description |
---|---|
\a | Alert (bell). |
\b | Backspace. |
\e | ESC character, \x1b. |
\n | Newline. |
\r | Carriage return. |
\f | Form feed, \x0c. |
\t | Horizontal tab, \x09. |
\v | Vertical tab, \x0b. |
\oct | Character specified by a three-digit octal code. |
\xhex | Character specified by a hex code. |
\cchar | Named control character. |
Character classes
Sequence | Description |
---|---|
[...] | A single character listed or contained within a listed range. Ranges can be combined with the {+} and {-} operators. For example [a-z]{+}[0-9] is the same as [0-9a-z] and [a-z]{-}[aeiou] is the same as [b-df-hj-np-tv-z] . |
[^...] | A single character not listed and not contained within a listed range. |
. | Any character, default [^\n]. |
\d | Digit character, [0-9] . |
\D | Non-digit character, [^0-9] . |
\s | White space character, [ \t\n\r\f\v] . |
\S | Non-white space character, [^ \t\n\r\f\v] . |
\w | Word character, [a-zA-Z0-9_] . |
\W | Non-word character, [^a-zA-Z0-9_] . |
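The two equivalences given for the {+} and {-} operators can be verified by treating each bracket body as a set of characters. This is a minimal sketch in Python. The `expand` helper is made up for illustration and only handles literal characters and simple `x-y` ranges, not escapes or negation.

```python
def expand(spec: str) -> set:
    """Expand a simple bracket body such as 'a-z0-9' into a set of characters."""
    chars = set()
    i = 0
    while i < len(spec):
        # A range looks like <first>-<last>, for example a-z.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            first, last = ord(spec[i]), ord(spec[i + 2])
            chars.update(chr(c) for c in range(first, last + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars

# [a-z]{+}[0-9] is the same as [0-9a-z]: {+} behaves like set union.
print(expand("a-z") | expand("0-9") == expand("0-9a-z"))             # True

# [a-z]{-}[aeiou] is the same as [b-df-hj-np-tv-z]: {-} behaves like set difference.
print(expand("a-z") - expand("aeiou") == expand("b-df-hj-np-tv-z"))  # True
```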
Unicode character classes
Sequence | Description |
---|---|
\p{C} | Other. |
\p{Cc} | Other, control. |
\p{Cf} | Other, format. |
\p{Co} | Other, private use. |
\p{Cs} | Other, surrogate. |
\p{L} | Letter. |
\p{LC} | Letter, cased. |
\p{Ll} | Letter, lowercase. |
\p{Lm} | Letter, modifier. |
\p{Lo} | Letter, other. |
\p{Lt} | Letter, titlecase. |
\p{Lu} | Letter, uppercase. |
\p{M} | Mark. |
\p{Mc} | Mark, spacing combining. |
\p{Me} | Mark, enclosing. |
\p{Mn} | Mark, nonspacing. |
\p{N} | Number. |
\p{Nd} | Number, decimal digit. |
\p{Nl} | Number, letter. |
\p{No} | Number, other. |
\p{P} | Punctuation. |
\p{Pc} | Punctuation, connector. |
\p{Pd} | Punctuation, dash. |
\p{Pe} | Punctuation, close. |
\p{Pf} | Punctuation, final quote. |
\p{Pi} | Punctuation, initial quote. |
\p{Po} | Punctuation, other. |
\p{Ps} | Punctuation, open. |
\p{S} | Symbol. |
\p{Sc} | Symbol, currency. |
\p{Sk} | Symbol, modifier. |
\p{Sm} | Symbol, math. |
\p{So} | Symbol, other. |
\p{Z} | Separator. |
\p{Zl} | Separator, line. |
\p{Zp} | Separator, paragraph. |
\p{Zs} | Separator, space. |
These character classes are only available if the option --enable-parle-utf32 was passed at compile time.
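The names in the table are the Unicode general categories, so the class a given character belongs to can be looked up with Python's standard `unicodedata` module. The sketch below is only a lookup aid and does not use Parle. The `in_unicode_class` helper and the `GROUPS` table are hypothetical names.

```python
import unicodedata

# \p{LC} (cased letter) is the union of the Lu, Ll and Lt categories.
GROUPS = {"LC": ("Lu", "Ll", "Lt")}

def in_unicode_class(ch: str, name: str) -> bool:
    """Check whether ch would fall under \\p{name}, based on its general category."""
    category = unicodedata.category(ch)
    if name in GROUPS:
        return category in GROUPS[name]
    # One-letter names such as "L" cover every category starting with that letter.
    return category.startswith(name)

print(unicodedata.category("€"))    # Sc
print(in_unicode_class("€", "S"))   # True
print(in_unicode_class("A", "LC"))  # True
print(in_unicode_class("_", "Pc"))  # True
print(in_unicode_class("5", "L"))   # False
```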
Alternation and repetition
Sequence | Greedy | Description |
---|---|---|
...|... | - | Try sub-patterns in alternation. |
* | yes | Match 0 or more times. |
+ | yes | Match 1 or more times. |
? | yes | Match 0 or 1 times. |
{n} | no | Match exactly n times. |
{n,} | yes | Match at least n times. |
{n,m} | yes | Match at least n times but no more than m times. |
*? | no | Match 0 or more times. |
+? | no | Match 1 or more times. |
?? | no | Match 0 or 1 times. |
{n,}? | no | Match at least n times. |
{n,m}? | no | Match at least n times but no more than m times. |
{MACRO} | - | Include the regex MACRO in the current regex. |
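The difference between the greedy and non-greedy forms is easiest to see on a short input. The sketch below uses Python's `re` module, whose quantifier syntax matches the table. Parle's lexer is a different engine, so take this as an illustration of the general idea rather than a statement about its exact behaviour.

```python
import re

text = "<a><b>"

# Greedy: + takes as many characters as possible before the final ">".
greedy = re.match(r"<.+>", text).group()

# Non-greedy: +? stops at the first ">" that lets the pattern succeed.
lazy = re.match(r"<.+?>", text).group()

# The same contrast with a bounded repetition.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy)          # <a><b>
print(lazy)            # <a>
print(bounded_greedy)  # aaa
print(bounded_lazy)    # aa
```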
Anchors
Sequence | Description |
---|---|
^ | Start of string or after a newline. |
$ | End of string or before a newline. |
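The anchor behaviour described in the table corresponds to what Python's `re` module does in `re.MULTILINE` mode, which makes it a convenient stand-in for a quick experiment. This is a sketch with Python, not Parle.

```python
import re

text = "one two\nthree four"

# ^ matches at the start of the string and right after each newline.
first_words = re.findall(r"^\w+", text, re.MULTILINE)

# $ matches at the end of the string and right before each newline.
last_words = re.findall(r"\w+$", text, re.MULTILINE)

print(first_words)  # ['one', 'three']
print(last_words)   # ['two', 'four']
```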
Grouping
Sequence | Description |
---|---|
(...) | Group a regular expression to override default operator precedence. |
(?r-s:pattern) | Apply option r and omit option s while interpreting pattern. Options may be zero or more of the characters i, s, or x. i means case-insensitive; -i means case-sensitive. s alters the meaning of . to match any character whatsoever; -s alters the meaning of . to match any character except \n. x ignores comments and whitespace in patterns; whitespace is ignored unless it is backslash-escaped, contained within double quotes, or appears inside a character range. These options can be applied globally at the rules level by passing a combination of the bit flags to the lexer. |
(?# comment ) | Omit everything within (). The first ) character encountered ends the pattern. It is not possible for the comment to contain a ) character. The comment may span lines. |
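Python's `re` module accepts the same scoped-option and comment syntax for the i and s options, so the effect can be tried out there. The sketch below assumes the semantics carry over as described in the table. It leaves out the x option, whose quoting rules are specific to this regex flavour.

```python
import re

# (?i:...) makes only the enclosed part case-insensitive.
scoped_i = re.compile(r"(?i:select) x")
print(bool(scoped_i.fullmatch("SELECT x")))  # True
print(bool(scoped_i.fullmatch("SELECT X")))  # False, "x" is outside the group

# (?-i:...) switches case-insensitivity off again inside a global i flag.
scoped_off = re.compile(r"a(?-i:b)c", re.IGNORECASE)
print(bool(scoped_off.fullmatch("AbC")))     # True
print(bool(scoped_off.fullmatch("ABC")))     # False, "b" must be lowercase

# (?s:.) lets the dot match a newline; a plain dot does not.
print(bool(re.fullmatch(r"a(?s:.)b", "a\nb")))  # True
print(bool(re.fullmatch(r"a.b", "a\nb")))       # False

# (?# ...) is dropped from the pattern entirely.
print(bool(re.fullmatch(r"ab(?# trailing letter)c", "abc")))  # True
```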