The Language

From RL3 Wiki

RL3 syntax

RL3 is intended to be a highly readable language with an uncluttered visual layout. RL3 adheres to the off-side rule, i.e. it uses space or tab indentation to delimit blocks and declarations.

The following is an example of RL3 code:

rule
    assert category "Category1" 1.0
if
    match fact1 {pattern1}
    disjunction
        match fact2 {pattern2}
        match fact3 {pattern2}

The language follows a modular programming philosophy. It provides mechanisms to isolate different parts of a program into modules, to include or import common parts, and to assemble modules into a project.

The key structures in RL3 are rules, predicates, patterns, transformers and annotators.

Rules

Rules define conditional actions. A rule consists of two blocks: rule and if.

rule
    actions
if
    conditions

The rule block describes actions to be executed when all conditions described in the if block are matched.

At the moment the only available action is assert:

   assert label value expression

or, in the alternative syntax:

   label = value [weight=expression, label=value]

This action creates a new fact in a factsheet. The value of the fact can be a text string or a variable. The weight is a floating-point number, given directly or computed by an expression that may include:

  • Constants (for instance 1.5);
  • Functions:
    • The number of pattern occurrences: {count label pattern}, where label is the name of the fact in which the pattern is counted.
    • Pattern coverage (an aggregated size of all pattern occurrences): {coverage label pattern}
    • Fact size (the number of symbols in the corresponding fact value): {size label}
    • The size of the largest match: {size label pattern}
  • Operations:
    • / - division
    • % - division with threshold (x%y = x/y if the result is less than 1, else 1).
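The weight functions and operations listed above can be modelled outside RL3. The following is a minimal Python sketch, not the RL3 engine: the helper names (count, coverage, size, capped_div) are hypothetical, facts are plain strings, and occurrences are assumed to be non-overlapping matches as found by Python's re module.

```python
import re
from typing import Optional


def count(text: str, pattern: str) -> int:
    """Model of {count label pattern}: number of non-overlapping matches."""
    return sum(1 for _ in re.finditer(pattern, text))


def coverage(text: str, pattern: str) -> int:
    """Model of {coverage label pattern}: total length of all matches."""
    return sum(m.end() - m.start() for m in re.finditer(pattern, text))


def size(text: str, pattern: Optional[str] = None) -> int:
    """Model of {size label} and {size label pattern}.

    Without a pattern: number of symbols in the fact value.
    With a pattern: length of the largest match (0 if there is none).
    """
    if pattern is None:
        return len(text)
    return max((m.end() - m.start() for m in re.finditer(pattern, text)), default=0)


def capped_div(x: float, y: float) -> float:
    """Model of x % y: x / y if the result is less than 1, else 1.

    A zero divisor is not handled in this sketch.
    """
    return min(x / y, 1.0)


# Share of the fact value covered by the pattern, capped at 1.
body = "spam and more spam"
weight = capped_div(coverage(body, "spam"), size(body))
```

Here weight is 8 / 18, because the two occurrences of "spam" cover 8 of the 18 symbols in the value.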

Conditions can be defined using predefined or user-defined predicates. Each predicate is written on a separate line. By default, all lines are combined by conjunction.

Predicates

Predicates in RL3 are boolean functions that can be evaluated to true or false.

Built-in predicates

  • match
This predicate determines whether a given pattern matches the entire character sequence of a fact. It returns true if the pattern matches at least one fact on the given path:
    match path pattern
where path is a sequence of fact labels or * delimited by . (for example, the path abc.def corresponds to all sub-facts with a def label in facts with an abc label)
  • search
This predicate searches the character sequence for a given pattern:
    search path pattern
  • dawg
This predicate concatenates the given facts and searches the result in a given dictionary:
    dawg (label_1, label_2, ...) in dictionary [options]
where
  • label_1, label_2, ... are fact labels;
  • dictionary is the name of a DAWG dictionary;
  • options are optional modifiers delimited with ,:
    • nb - ignore spaces;
    • nc - case insensitive (default);
    • cs - case sensitive.
  • count
This predicate counts the number of pattern occurrences and compares it with a given value using a given comparator:
    count pattern in path comparator value
where comparator can be one of the following operations: <, <=, =, >=, >.
  • size
This predicate compares the size of a given fact with a given value:
    size path comparator value
  • any
This predicate iterates a given variable over the occurrences of a given pattern in a given fact and checks the conditions defined in the following block (if no pattern is given, the default pattern \A.*\Z is used). The predicate returns true if at least one occurrence of the pattern in the fact satisfies the conditions:
    any variable in path [on pattern]
        conditions
  • each
This predicate iterates a given variable over the occurrences of a given pattern in a given fact and checks the conditions defined in the following block (if no pattern is given, the default pattern \A.*\Z is used). The predicate returns true if all occurrences of the pattern in the fact satisfy the conditions:
    each variable in path [on pattern]
        predicates
  • not
This predicate inverts the evaluation result of a given predicate:
    not predicate
  • conjunction
This predicate concatenates predicates defined in the following block with a boolean conjunction:
    conjunction
        predicates
  • disjunction
This predicate concatenates predicates defined in the following block with a boolean disjunction:
    disjunction
        predicates
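The path, match, search, any and each semantics described above can be illustrated with a small model. This is a Python sketch under several assumptions, not the RL3 implementation: the Fact class and the function names are hypothetical, match is taken to require that the pattern covers the whole fact value while search accepts a match anywhere, and each is taken to be true when there are no occurrences at all.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Fact:
    label: str
    value: str = ""
    children: List["Fact"] = field(default_factory=list)


def resolve(facts: List[Fact], path: str) -> List[Fact]:
    """Follow a dot-delimited path of labels; * stands for any label."""
    current, found = facts, []
    for step in path.split("."):
        found = [f for f in current if step == "*" or f.label == step]
        current = [child for f in found for child in f.children]
    return found


def match(facts: List[Fact], path: str, pattern: str) -> bool:
    """True if the pattern covers the whole value of at least one fact."""
    return any(re.fullmatch(pattern, f.value) for f in resolve(facts, path))


def search(facts: List[Fact], path: str, pattern: str) -> bool:
    """True if the pattern occurs anywhere in at least one fact."""
    return any(re.search(pattern, f.value) for f in resolve(facts, path))


def any_of(facts: List[Fact], path: str, condition: Callable[[str], bool],
           pattern: str = r"\A.*\Z") -> bool:
    """True if at least one occurrence of the pattern satisfies the condition."""
    return any(condition(m.group())
               for f in resolve(facts, path)
               for m in re.finditer(pattern, f.value))


def each_of(facts: List[Fact], path: str, condition: Callable[[str], bool],
            pattern: str = r"\A.*\Z") -> bool:
    """True if every occurrence of the pattern satisfies the condition."""
    return all(condition(m.group())
               for f in resolve(facts, path)
               for m in re.finditer(pattern, f.value))


facts = [
    Fact("doc", children=[Fact("title", "RL3 syntax"),
                          Fact("body", "rules and patterns")]),
    Fact("meta", children=[Fact("title", "wiki")]),
]
```

With these facts, the path doc.title selects one fact and *.title selects two. The pattern RL3 is found by search in doc.title but does not satisfy match there, because it does not cover the whole value.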

User-defined predicates

User predicates are defined in a separate block and consist of a header and body:

predicate name (arg1, arg2, ..., argN)
    predicates

where

  • name - predicate name;
  • (arg1, arg2, ..., argN) - optional list of arguments;
  • predicates - predicate body.

Patterns

The pattern language in RL3 is based on regular expression syntax, extended with predefined and user-defined patterns. The following core regular expression matchers are supported:

Matcher Description
. any symbol
ab sequence of symbols
a|b OR
(a|b) group
a* 0 or more reps of a (greedy)
a+ 1 or more reps of a (greedy)
a? 0 or 1 reps of a (greedy)
a{n,m} n to m reps of a (greedy)
a*? 0 or more reps of a (not greedy)
a+? 1 or more reps of a (not greedy)
a?? 0 or 1 reps of a (not greedy)
a{n,m}? n to m reps of a (not greedy)
^ start of row
$ end of row
\< start of word
\> end of word
\A start of sequence
\Z end of sequence
\b word boundary (or backspace if used inside [])
\B not word boundary
\w word symbol - same as [[:alnum:]]
\W not word symbol - same as [^[:alnum:]]
\d digit
\D not digit
\s space
\S not space
\n new line
\f new page
\t horizontal tab
\v vertical tab
\e escape
\a bell
\c control
[[:alnum:]] alphabetic or numeric symbol class
[[:alpha:]] alphabetic symbol class
[[:blank:]] horizontal space class
[[:cntrl:]] control symbol class
[[:digit:]] numeric symbol class
[[:graph:]] visible symbol class
[[:lower:]] lowercase symbol class
[[:print:]] printable symbol class
[[:punct:]] punctuation symbol class
[[:space:]] space symbol class
[[:upper:]] uppercase symbol class
[[:xdigit:]] hexadecimal symbol class
[[:class:]] symbol of class
[^[:class:]] not symbol of class
[2-7] symbol from given set (i.e. 2, 3, 4, 5, 6, or 7)
[b-e] symbol from given set (i.e. b, c, d, or e)
[abc] one of the given symbols
[0-9abc] symbol from given set or one of the given symbols
[^abc] not one of the given symbols
(?i:a) case insensitive match of a
(?>a) independent sub-match of a (disabled backtracking)
(?=a) look-ahead for a
(?!a) negative look-ahead for a
(?<=a) look-behind for a
(?<!a) negative look-behind for a
(?P<name>a) named group
(?P=name) reference to named group
# comment (at row start)
?# comment (inside row)
?R recursion
?$[name] rule assign
?$[name] rule reference
[= start of equivalence class
=] end of equivalence class
[. start of collation element
.] end of collation element
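The difference between greedy and non-greedy quantifiers, and between look-ahead and look-behind, is easiest to see on concrete input. The sketch below uses Python's re module, which shares this part of the syntax. It is not the RL3 engine, so behaviour in other parts of the table may differ.

```python
import re

text = "<a><b>"
greedy = re.search(r"<.+>", text).group()   # "<a><b>": takes as much as possible
lazy = re.search(r"<.+?>", text).group()    # "<a>": stops at the first ">"

price = "cost: 42 USD, 7 EUR"
ahead = re.findall(r"\d+(?= USD)", price)           # number followed by " USD"
not_ahead = re.findall(r"\b\d+\b(?! USD)", price)   # number not followed by " USD"
behind = re.findall(r"(?<=cost: )\d+", price)       # number preceded by "cost: "
not_behind = re.findall(r"(?<!cost: )\b\d+", price) # number not preceded by "cost: "
```

The look-around groups only test the surrounding text. They are not part of the matched value, so ahead and behind both contain just "42", and not_ahead and not_behind both contain just "7".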

Built-in patterns

The following are the predefined patterns supported in RL3:

  • annotation. Matches a sequence of symbols if it belongs to a given annotation: {annotation label}
  • dawg. Matches a sequence of symbols if it belongs to a given dictionary: {dawg dictionary} or {dawg nb dictionary} where dictionary is the name of a precompiled dictionary, and nb instructs the engine to ignore blank symbols (i.e. [ \x09_\-+":.,l()]) during matching.
  • doc_bname_fuzzy. Fuzzy (partial) match of the domain_basename fact: {doc_bname_fuzzy}
  • number. Matches a sequence of numeric symbols - an optimized implementation of (?>\d+) expression: {number}
  • word. Matches a sequence of alphabetic symbols - an optimized implementation of (?>[[:alpha:]]+) expression: {word}
  • token. Matches a sequence of alphanumeric symbols - an optimized implementation of (?>[[:alnum:]]+) expression: {token}
  • ref. Provides a reference to a fact or facts defined earlier in the pattern: {ref strategy label} or {ref nb strategy label} where label is a fact name, strategy defines which facts are searched or matched when more than one fact satisfies the condition (e.g. any of the given facts, or the last fact), and nb instructs the engine to ignore blank symbols during matching.
  • select. Matches a given pattern using the given strategy: {select strategy pattern} where strategy instructs the engine how to choose the best matched sequence (longest = choose the longest matched sequence; shortest = choose the shortest matched sequence), and pattern is the pattern to be matched.
  • =. Matches a given pattern and captures it under the given name: {name=pattern}

User-defined patterns

In RL3 you can define your own named patterns and use them inside other patterns:

pattern name (arg1, arg2, ..., argN) [icase]
    pattern body

where:

  • name - a pattern name;
  • (arg1, arg2, ..., argN) - an optional list of arguments;
  • pattern body - a well-formed pattern consisting of core regular expression matchers, predefined patterns and/or user-defined patterns;
  • [icase] - an optional parameter that instructs the engine to compile a case insensitive pattern.

A user-defined pattern may also include blocks of conditions:

pattern name (arg1, arg2, ..., argN)
    pattern_body
if
    conditions

where conditions is a block of predicates.

Such a pattern matches if its body matches and the conjunction of all predicates in the if block evaluates to true.

Inside of the if block the following additional (automatically generated) facts can be used:

  • $$ - a value matched with a pattern body;
  • $< - left context;
  • $> - right context.

Transformers

Transformers are special rules that are executed before any other rules or annotations. They transform the values of input facts by replacing matched patterns with a given format.

transform label_1, ..., label_N pattern to format

where

  • label_1, ..., label_N - facts to be transformed;
  • pattern - a pattern to be searched (the space symbol in a pattern must be escaped with \s);
  • format - a replacement template.

The format may contain the following expressions:

Expression Description
$1, $2 reference to the captured group
\1, \2 reference to the captured group
\g<name> reference to the captured named group
$& matched sequence
$` prefix of matched sequence
$' suffix of matched sequence
$$ symbol $
\l next symbol transformed to lowercase
\u next symbol transformed to uppercase
\L start of sequence transformed to lowercase
\U start of sequence transformed to uppercase
\E end of sequence transformed to lower/upper case
\a symbol a
\e ESC
\f end of page
\n LF
\r CR
\t horizontal TAB
\v vertical TAB
\xFF hexadecimal
\x{FFFF} hexadecimal
\cX control symbol X
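The effect of a transformer can be approximated with an ordinary regex substitution. The following Python sketch is an illustration only: the transform function and the flat label-to-value dictionary are hypothetical, and Python's re.sub understands only part of the table above (\1, \g<name> and the common character escapes, but not $1, $& or the case-conversion escapes).

```python
import re
from typing import Dict, List


def transform(facts: Dict[str, str], labels: List[str],
              pattern: str, fmt: str) -> Dict[str, str]:
    """Return a copy of facts in which every match of pattern in the
    listed facts is replaced according to fmt; other facts are untouched."""
    compiled = re.compile(pattern)
    return {
        label: compiled.sub(fmt, value) if label in labels else value
        for label, value in facts.items()
    }


facts = {"title": "2024-05-17 release", "body": "built on 2024-05-16"}

# Rewrite ISO dates as day.month.year, mixing named and numbered references.
result = transform(
    facts,
    ["title"],
    r"(?P<y>\d{4})-(?P<m>\d\d)-(?P<d>\d\d)",
    r"\g<d>.\2.\g<y>",
)
```

Only the title fact is listed, so result["title"] becomes "17.05.2024 release" while result["body"] keeps its original value.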

Annotators

Annotators annotate all occurrences of a given pattern in a given fact.

annotation
    actions
search label
    pattern
if
    conditions

The annotation block defines a set of actions to execute for each occurrence of a pattern in the fact referred to by label. The actions are executed if and only if all conditions defined in the if block evaluate to true. Actions can refer to captured named groups and to the following special (auto-generated) facts:

  • $$ - matched value
  • $< - left context
  • $> - right context
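The three auto-generated facts can be pictured as a split of the fact value around each occurrence. The Python sketch below assumes that the left context is everything before the match and the right context is everything after it. The source does not state how far the contexts extend, and the occurrences helper is hypothetical.

```python
import re
from typing import Dict, Iterator


def occurrences(value: str, pattern: str) -> Iterator[Dict[str, str]]:
    """Yield the $$, $< and $> facts for each non-overlapping match."""
    for m in re.finditer(pattern, value):
        yield {
            "$$": m.group(),          # matched value
            "$<": value[:m.start()],  # left context (assumed: all preceding text)
            "$>": value[m.end():],    # right context (assumed: all following text)
        }


# Keep only the numbers whose left context ends with a dollar sign,
# which mirrors a condition on $< in an if block.
text = "pay $15 for 3 items"
amounts = [o["$$"] for o in occurrences(text, r"\d+") if o["$<"].endswith("$")]
```

Here amounts contains only "15", because the left context of "3" ends with a space rather than a dollar sign.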

Modules, includes & imports

An RL3 program may consist of multiple .rl3 files (modules).

The include directive inserts one file into another:

include "project_file.rl3"

or

include <stdlib_file.rl3>

The first statement includes a local file, while the second one includes a file from the Standard Library.

The import directive includes templates:

import <template.rl3> with
    var1 = value1
    var2 = value2
    ...

This statement will include the file template.rl3 and replace all occurrences of {var1}, {var2}, etc. with the corresponding values value1, value2, etc.
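The substitution step of import can be sketched as a simple text replacement. This Python model is an assumption about the behaviour, not the RL3 preprocessor: the expand_template name is hypothetical, and the sketch deliberately leaves any {name} that is not a declared variable untouched, since braces are also used for patterns such as {word}.

```python
import re
from typing import Dict


def expand_template(source: str, variables: Dict[str, str]) -> str:
    """Replace each {name} whose name is a declared variable.

    Braced expressions that are not declared variables are kept as-is.
    """
    def substitute(m: "re.Match[str]") -> str:
        return variables.get(m.group(1), m.group(0))

    return re.sub(r"\{(\w+)\}", substitute, source)


template = 'assert category "{cat}" 1.0\nmatch {fact} {word}'
expanded = expand_template(template, {"cat": "News", "fact": "title"})
```

In the expanded text, {cat} and {fact} are replaced by News and title, while {word} survives because it is not one of the declared variables.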

Project

Multiple modules can be assembled into a project file and then compiled as a single model. A project file looks like this:

module module_1.rl3
module module_2.rl3
...
module module_N.rl3