Skip to content

Regular Expressions and Regular Expression Types

This document describes the MontiCore language RegEx, which introduces regular expressions. Based thereon, it introduces a new form of types RegExTypes that are used as subtypes of String and help to make model even more typesafe.

A regular expression describe patterns that can be applied to text in order to allow/deny certain sequences of characters and also to find specific sequences within the text.

Motivation and Basic Examples

In theoretical computer science, a regular expression is a character string that is used to describe sets of character strings with the help of certain syntactic rules. Regular expressions are used in a variety of applications. In addition to implementations in many programming languages, many text editors also process regular expressions in the Find and Replace function. A simple use case of regular expressions are wildcards.

Regular expressions can be used as filter criteria in text search by matching the text with the regular expression pattern. This process is also called pattern matching. For example, it is possible to find all words from a word list that begin with S and end with D without having to explicitly specify the intervening letters or their number.

This project introduces regular expressions as a MontiCore grammar, allowing modelers to write regular expressions. The regular expressions which are defined by the grammars in this project adhere to the definition which can be found at https://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory. More specifically, it is modeled after the Java syntax for regular expressions.

Examples for Regular Expressions

Below are a few examples for regular expressions along with words they match (taken and adapted from https://cs.lmu.edu/~ray/notes/regex/).

Regular Expression Example Matches
hello hello
gray|grey gray, grey
gr(a|e)y gray, grey
gr[ae]y gray, grey
b[aeiou]bble babble, bebble, bibble, bobble, bubble
[b-chm-pP]at|ot bat, cat, hat, mat, nat, oat, pat, Pat, ot
colou?r color, colour
rege(x(es)?|xps?) regex, regexes, regexp, regexps
go*gle ggle, gogle, google, gooogle, ..
go+gle gogle, google, gooogle, ...
g(oog)+le google, googoogle, googoogoogle, ...
z{3} zzz
z{3,6} zzz, zzzz, zzzzz, zzzzzz
z{3,} zzz, zzzz, zzzzz, ...
[Bb]rainch\*\*k Brainch**k, brainch**k
\d 0,1,2,3,4,5,6,7,8,9
\d{5}(-\d{4})? 12345-1234, 67890
1\d{10} 11231231231, 19999999999
[2-9]|[12]\d|3[0-6] 2, 3, 4, ..., 34, 35, 36
Hello\nworld Hello [newline] world
mi.....ft microsoft, mi.....ft, mi123456ft
\d+(.\d\d)? 123.45, 6.78, 9
[^i*&2@] anything besides i, *, &, 2 or @
//[^\r\n]*[\r\n] //hello world, //I am a comment
^dog dog at the start of a line
dog$ dog at the end of a line
^dog$ a line which only contains dog

Language Components for Regular Expressions

The regular expressions which are defined by these grammars adhere to the definition which can be found at https://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory. More specifically, it is modeled after the Java syntax for regular expressions.

This RegEx language component contains

  • two grammars,
  • context conditions for a regular expression
  • a symbol table infrastructure, and
  • pretty-printers for both grammars, resp. their main nonterminals.

Grammars

The RegEx language component contains two grammars: 1. RegularExpressions, 1. RegExType

RegularExpressions

The grammar RegularExpressions is a component grammar which defines the syntax of regular expressions. The grammar is supposed to be embedded in DSLs where a regular expression is of use, e.g. when describing allowed input patterns.

The grammar only extends the grammar MCCommonLiterals as it uses the Digit token as well as the Name nonterminal. Although the main nonterminals is called RegularExpression, this grammar and its nonterminals are defined independent of MontiCore's expression grammars.

The grammar uses a special mode REGEX to distinguish regular expressions from, e.g., from algebraic expressions.

The nonterminal RegularExpression is designed in such a way that it assumes that the mode REGEX has already been switched on (and will be switched off) by the calling nonterminal. This is provided by the nonterminal RegExLiteral in the grammar.

Context Conditions for the RegularExpressions language

Symbol Table for the RegularExpressions language

RegularExpressions introduces symbols for * named capturing groups (NamedCapturingGroupSymbol), which can then, e.g., be used to select a part of the accepted pattern for later repeated matching.

RegExType

The grammar RegExType extends the MCType hierarchy by the notion of regular types. The new nonterminal RegExType consists of a regular expression embedded in R"...". It is used to define a subtype of String which contains only the text accepted by the regular expressions.

Therefore, this grammar is used to define a type system for strings, which can be used to define variables that can only have values of a subset of all strings. This is, e.g., helpful in notoriously unsafe infrastructures, such as Internet-Of-Things or otherwise distributed systems, where permanently strings are accepted from unsafe external sources.

The grammar RegExType extends the grammars * RegularExpressions to use regular expressions, and * MCBasicTypes for the nonterminal MCType

TypeCheck for RegExType

TypeChecks are important when using the RegExType grammar, as they prevent the assignments of irregular values to variables with a regular expression type. Unfortunately, the type checks for our regular expressions cannot be completely performed at compile-time, as checking whether two different expression types are compatible is not solvable in general, because of the extended capabilities of the language, which in fact goes beyond mere regularity. However, if variables/fields with a regular expression type get assigned a static string, the compatibility can be checked at compile-time. And also, in a number of cases, when a string of one known regular type is assigned to a variable of a second regular type, the subtyping property can be statically resolved.

The TypeCheck classes in this project provide functionality using MontiCore's type checking mechanism. A new SymTypeExpression SymTypeOfRegEx is introduced, which carries information about which type of regular expression it accepts and is a subtype of the SymTypeExpression String.

For more info see TypeSystem.

Further Information