The Language
RL3 syntax
RL3 is intended to be a highly readable language with an uncluttered visual layout. RL3 adheres to the off-side rule, i.e. it uses indentation (spaces or tabs) to delimit blocks and declarations.
The following is an example of RL3 code:
rule
    assert category "Category1" 1.0
if
    match fact1 {pattern1}
    disjunction
        match fact2 {pattern2}
        match fact3 {pattern2}
The language follows a modular programming philosophy and provides mechanisms for isolating different parts of a program into modules, including or importing common parts, and assembling modules into a project.
The key structures in RL3 are rules, predicates, patterns, transformers and annotators.
Rules
Rules define conditional actions. A rule consists of two blocks: rule and if.
rule
    actions
if
    conditions
The rule block describes the actions to be executed when all conditions described in the if block are satisfied.
At the moment the only available action is assert:
assert label value expression
or, in the alternative syntax:
label = value [weight=expression, label=value]
This action creates a new fact in a factsheet. The value of the fact can be a text string or a variable. The weight can be a floating-point number given directly or through an expression, which may include:
- Constants (for instance 1.5);
- Functions:
  - The number of pattern occurrences: {count label pattern}, where label is the name of the fact in which pattern will be counted.
  - Pattern coverage (the aggregated size of all pattern occurrences): {coverage label pattern}
  - Fact size (the number of symbols in the corresponding fact value): {size label}
  - The size of the largest match: {size label pattern}
- Operations:
  - / - division
  - % - division with threshold (x%y = x/y if the result is less than 1, else 1).
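The % operation can be read as a division whose result is capped at 1. The following Python sketch models only that arithmetic; the function name capped_div is illustrative and is not part of RL3.

```python
def capped_div(x: float, y: float) -> float:
    """Model of the RL3 % operation: x / y if that is below 1, else 1."""
    result = x / y
    return result if result < 1 else 1.0


# A weight built from a pattern count divided by 4 with % would saturate
# at 1 once the count reaches 4 (assuming both operands are plain numbers).
print(capped_div(1, 4))  # 0.25
print(capped_div(4, 4))  # 1.0
print(capped_div(6, 4))  # 1.0
```

Capping keeps a weight within the 0 to 1 range regardless of how often a pattern occurs.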
Conditions are defined using built-in or user-defined predicates. Each predicate is written on a separate row. By default, all rows are combined with a conjunction.
Predicates
Predicates in RL3 are boolean functions that can be evaluated to true or false.
Built-in predicates
- match
- This predicate determines whether a given pattern matches the entire character sequence of a particular fact. The predicate returns true if the pattern matches at least one fact with the given name path:
match path pattern
- where path is a sequence of fact labels or * delimited by . (the path abc.def corresponds to all sub-facts with a def label in facts with an abc label)
- search
- This predicate searches the character sequence for a given pattern:
search path pattern
- dawg
- This predicate concatenates the given facts and looks up the result in a given dictionary:
dawg (label_1, label_2, ...) in dictionary [options]
- where
  - label_1, label_2, ... are fact labels;
  - dictionary is the name of a DAWG dictionary;
  - options are optional modifiers delimited with , :
    - nb - ignore spaces;
    - nc - case insensitive (default);
    - cs - case sensitive.
- count
- This predicate counts the number of pattern occurrences and compares it with a given value using a given operation (comparator):
count pattern in path comparator value
- where comparator can be one of the following operations: <, <=, =, >=, >.
- size
- This predicate compares the size of a given fact with a given value:
size path comparator value
- any
- This predicate iterates a given variable over the occurrences of a given pattern in a given fact and checks the conditions defined in the following block (if no pattern is given, the default pattern \A.*\Z is used). The predicate returns true if at least one occurrence of the pattern in the fact satisfies the conditions:
any variable in path [on pattern]
    conditions
- each
- This predicate iterates a given variable over the occurrences of a given pattern in a given fact and checks the conditions defined in the following block (if no pattern is given, the default pattern \A.*\Z is used). The predicate returns true if all occurrences of the pattern in the fact satisfy the conditions:
each variable in path [on pattern]
    predicates
- not
- This predicate inverts the evaluation result of a given predicate:
not predicate
- conjunction
- This predicate combines the predicates defined in the following block with a boolean conjunction:
conjunction
    predicates
- disjunction
- This predicate combines the predicates defined in the following block with a boolean disjunction:
disjunction
    predicates
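The behaviour of several built-in predicates can be modelled in a few lines of Python. This is a sketch of the semantics as described above, not RL3 code: facts are reduced to a plain list of string values, Python's re module stands in for the RL3 pattern engine, and the function names any_occurrence and each_occurrence are illustrative. It assumes that match requires the whole value to match while search accepts an occurrence anywhere, and that count sums occurrences over all facts on the path.

```python
import operator
import re

# Comparators accepted by the count and size predicates.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def match(values, pattern):
    """True if the pattern matches the whole value of at least one fact."""
    return any(re.fullmatch(pattern, v) is not None for v in values)


def search(values, pattern):
    """True if the pattern occurs somewhere in at least one fact value."""
    return any(re.search(pattern, v) is not None for v in values)


def count(values, pattern, comparator, threshold):
    """Compare the total number of pattern occurrences with a threshold."""
    total = sum(1 for v in values for _ in re.finditer(pattern, v))
    return COMPARATORS[comparator](total, threshold)


def any_occurrence(values, pattern, condition):
    """Like any: true if at least one occurrence satisfies the condition."""
    return any(condition(m.group(0)) for v in values for m in re.finditer(pattern, v))


def each_occurrence(values, pattern, condition):
    """Like each: true if every occurrence satisfies the condition."""
    return all(condition(m.group(0)) for v in values for m in re.finditer(pattern, v))


values = ["buy 2 apples and 10 pears"]
print(match(values, r"\d+"))                                  # False
print(search(values, r"\d+"))                                 # True
print(count(values, r"\d+", ">=", 2))                         # True
print(any_occurrence(values, r"\d+", lambda s: len(s) > 1))   # True
print(each_occurrence(values, r"\d+", lambda s: len(s) > 1))  # False
```

The last two calls show the difference between any and each: the occurrence 10 satisfies the condition, but the occurrence 2 does not.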
User-defined predicates
User-defined predicates are declared in a separate block and consist of a header and a body:
predicate name (arg1, arg2, ..., argN)
    predicates
where
- name - the predicate name;
- (arg1, arg2, ..., argN) - an optional list of arguments;
- predicates - the predicate body.
Patterns
The pattern language in RL3 is based on regular expression syntax, extended with built-in and user-defined patterns. The following core regular expression matchers are supported:
Matcher | Description |
---|---|
. | any symbol |
ab | sequence of symbols |
a\|b | OR |
(a\|b) | group |
a* | 0 or more reps of a (greedy) |
a+ | 1 or more reps of a (greedy) |
a? | 0 or 1 reps of a (greedy) |
a{n,m} | n to m reps of a (greedy) |
a*? | 0 or more reps of a (not greedy) |
a+? | 1 or more reps of a (not greedy) |
a?? | 0 or 1 reps of a (not greedy) |
a{n,m}? | n to m reps of a (not greedy) |
^ | start of row |
$ | end of row |
\< | start of word |
\> | end of word |
\A | start of sequence |
\Z | end of sequence |
\b | word boundary (or backspace if used inside []) |
\B | not word boundary |
\w | word symbol - same as [[:alnum:]] |
\W | not word symbol - same as [^[:alnum:]] |
\d | digit |
\D | not digit |
\s | space |
\S | not space |
\n | new line |
\f | new page |
\t | horizontal tab |
\v | vertical tab |
\e | escape |
\a | bell |
\c | control |
[[:alnum:]] | alphabetic or numeric symbol class |
[[:alpha:]] | alphabetic symbol class |
[[:blank:]] | horizontal space class |
[[:cntrl:]] | control symbol class |
[[:digit:]] | numeric symbol class |
[[:graph:]] | visible symbol class |
[[:lower:]] | lowercase symbol class |
[[:print:]] | printable symbol class |
[[:punct:]] | punctuation symbol class |
[[:space:]] | space symbol class |
[[:upper:]] | uppercase symbol class |
[[:xdigit:]] | hexadecimal symbol class |
[[:class:]] | symbol of class |
[^[:class:]] | not symbol of class |
[2-7] | symbol from given set (i.e. 2, 3, 4, 5, 6, or 7) |
[b-e] | symbol from given set (i.e. b, c, d, or e) |
[abc] | one of the given symbols |
[0-9abc] | symbol from given set or one of the given symbols |
[^abc] | not one of the given symbols |
(?i:a) | case insensitive match of a |
(?>a) | independent sub-match of a (disabled backtracking) |
(?=a) | look ahead for a |
(?!a) | inversion of look ahead for a |
(?<=a) | look behind for a |
(?<!a) | inversion of look behind for a |
(?P<name>a) | named group |
(?P=name) | reference to named group |
# | comment (at row start) |
?# | comment (inside row) |
?R | recursion |
?$[name] | rule assign |
?$[name] | rule reference |
[= | start of equivalence class |
=] | end of equivalence class |
[. | start of collation element |
.] | end of collation element |
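Several of the matchers above behave the same way in other regular expression engines, so their effect can be tried out in Python. The sketch below uses Python's re module as a stand-in. It is not the RL3 engine, and matchers such as \<, \>, [[:alpha:]] and ?R are not available in it.

```python
import re

text = "<b>one</b><b>two</b>"

# Greedy repetition takes as much as possible, non-greedy as little as possible.
greedy = re.search(r"<b>.*</b>", text).group(0)
lazy = re.search(r"<b>.*?</b>", text).group(0)
print(greedy)  # <b>one</b><b>two</b>
print(lazy)    # <b>one</b>

# Look behind (?<=a) and look ahead (?=a) check context without consuming it.
price = re.search(r"(?<=\$)\d+(?= USD)", "total: $42 USD").group(0)
print(price)  # 42

# Inverted look ahead (?!a): whole numbers that are not followed by %.
plain_numbers = re.findall(r"\b\d+\b(?!%)", "10% of 250")
print(plain_numbers)  # ['250']

# Named group (?P<name>a) and a reference to it (?P=name): a repeated word.
repeated = re.search(r"\b(?P<w>\w+) (?P=w)\b", "it is is fine").group("w")
print(repeated)  # is

# Case insensitive match limited to one group (?i:a).
print(bool(re.fullmatch(r"(?i:rl)3", "RL3")))  # True
```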
Built-in patterns
The following are the predefined patterns supported in RL3:
- annotation. Matches a sequence of symbols if it belongs to a given annotation: {annotation label}
- dawg. Matches a sequence of symbols if it belongs to a given dictionary: {dawg dictionary} or {dawg nb dictionary}, where dictionary is the name of a precompiled dictionary, and nb instructs the engine to ignore blank symbols (i.e. [ \x09_\-+" :.,l()]) in the matching process.
- doc_bname_fuzzy. Fuzzy (partial) match of the domain_basename fact: {doc_bname_fuzzy}
- number. Matches a sequence of numeric symbols - an optimized implementation of the (?>\d+) expression: {number}
- word. Matches a sequence of alphabetic symbols - an optimized implementation of the (?>[[:alpha:]]+) expression: {word}
- token. Matches a sequence of alphanumeric symbols - an optimized implementation of the (?>[[:alnum:]]+) expression: {token}
- ref. Provides a reference to a fact or facts defined earlier in the pattern: {ref strategy label} or {ref nb strategy label}, where label is a fact name, strategy selects which facts are searched or matched when more than one fact satisfies the condition (e.g. any of the given facts, the last fact), and nb instructs the engine to ignore blank symbols in the matching process.
- select. Matches a given pattern using the given strategy: {select strategy pattern}, where strategy instructs the engine how to choose the best matched sequence (longest = choose the longest matched sequence; shortest = choose the shortest matched sequence), and pattern is the pattern to be matched.
- =. Matches a given pattern and captures it under the given name: {name=pattern}
User-defined patterns
In RL3 you can define your own named patterns and use them inside other patterns:
pattern name (arg1, arg2, ..., argN) [icase]
    pattern body
where:
- name - a pattern name;
- (arg1, arg2, ..., argN) - an optional list of arguments;
- pattern body - a well-formed pattern consisting of core regular expression matchers, built-in patterns and/or user-defined patterns;
- [icase] - an optional parameter that instructs the engine to compile a case insensitive pattern.
A user-defined pattern may also include blocks of conditions:
pattern name (arg1, arg2, ..., argN)
    pattern_body
if
    conditions
where conditions is a block of predicates.
Such a pattern matches if the pattern body matches and the conjunction of all predicates in the if block evaluates to true.
Inside the if block the following additional (automatically generated) facts can be used:
- $$ - the value matched by the pattern body;
- $< - left context;
- $> - right context.
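The relation between the three generated facts can be illustrated in Python. The sketch assumes that the left and right contexts are everything before and after the match within the same value; the exact extent of the contexts in RL3 is not stated here, and the helper name contexts is illustrative.

```python
import re


def contexts(pattern, text):
    """Yield (matched value, left context, right context) for each occurrence."""
    for m in re.finditer(pattern, text):
        yield m.group(0), text[:m.start()], text[m.end():]


for value, left, right in contexts(r"\d+", "order 66 shipped"):
    print(repr(value), repr(left), repr(right))
# '66' 'order ' ' shipped'
```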
Transformers
Transformers are special rules that are executed before any other rules or annotations. They transform the values of input facts by replacing matched patterns with a given format.
transform label_1, ..., label_N pattern to format
where
- label_1, ..., label_N - facts to be transformed;
- pattern - a pattern to be searched (the space symbol in a pattern must be escaped with \s);
- format - a replacement template.
The format may contain the following expressions:
Expression | Description |
---|---|
$1, $2 | reference to the captured group |
\1, \2 | reference to the captured group |
\g<name> | reference to the captured named group |
$& | matched sequence |
$` | prefix of matched sequence |
$' | suffix of matched sequence |
$$ | symbol $ |
\l | next symbol transformed to lowercase |
\u | next symbol transformed to uppercase |
\L | start of sequence transformed to lowercase |
\U | start of sequence transformed to uppercase |
\E | end of sequence transformed to lower/upper case |
\a | symbol a |
\e | ESC |
\f | end of page |
\n | LF |
\r | CR |
\t | horizontal TAB |
\v | vertical TAB |
\xFF | hexadecimal |
\x{FFFF} | hexadecimal |
\cX | control symbol X |
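The effect of a transformer is comparable to a regular expression substitution. The Python sketch below models it with re.sub, which supports only the \1 and \g<name> references from the table above (not $1, $& or the case conversions). The function name transform is illustrative, and the fact values are made up for the example.

```python
import re


def transform(value, pattern, fmt):
    """Replace every occurrence of pattern in value using the format template."""
    return re.sub(pattern, fmt, value)


# Reorder a date using numbered group references.
print(transform("2024-05-17", r"(\d+)-(\d+)-(\d+)", r"\3.\2.\1"))
# 17.05.2024

# The same idea with named groups; the space is written as \s, as in RL3 patterns.
print(transform("Doe, John", r"(?P<last>\w+),\s(?P<first>\w+)", r"\g<first> \g<last>"))
# John Doe
```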
Annotators
Annotators annotate all occurrences of a given pattern in a given fact.
annotation
    actions
search label
    pattern
if
    conditions
The annotation block defines a set of actions to execute for each occurrence of the pattern in the fact referred to by label. The actions are executed if and only if all conditions defined in the if block evaluate to true. Actions can refer to captured named groups and to the following special (auto-generated) facts:
- $$ - matched value
- $< - left context
- $> - right context
Modules, includes & imports
An RL3 program may consist of multiple .rl3 files (modules).
The include directive inserts one file into another:
include "project_file.rl3"
or
include <stdlib_file.rl3>
The first statement will include a local file, while the second one will include a file from Standard Library.
The import directive includes a template and substitutes its variables:
import <template.rl3> with
    var1 = value1
    var2 = value2
    ...
This statement includes the file template.rl3 and replaces all occurrences of {var1}, {var2}, etc. with the corresponding values value1, value2, etc.
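The substitution performed by import can be pictured as a simple placeholder replacement. The following Python sketch is a model of that idea, not the RL3 implementation: the function name expand_template and the template text are hypothetical, and leaving unknown placeholders such as {number} untouched is an assumption.

```python
import re


def expand_template(template, variables):
    """Replace each {name} placeholder that has a value; leave the others as they are."""
    def substitute(m):
        name = m.group(1)
        return variables.get(name, m.group(0))

    return re.sub(r"\{(\w+)\}", substitute, template)


template = "match title {var1}\nmatch body {var2}\nmatch price {number}\n"
print(expand_template(template, {"var1": "value1", "var2": "value2"}))
# match title value1
# match body value2
# match price {number}
```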
Project
Multiple modules can be assembled into a project file and then compiled as a single model. A project file looks like this:
module module_1.rl3
module module_2.rl3
...
module module_N.rl3