RL3 Best Practices
Most Common Mistakes
Using ^ and $ anchors instead of \A and \Z
It is often necessary to check the context of a matched pattern. For example, we may need to forbid creation of a Person entity when the match is followed by Ltd.
Incorrect:
annotation
Person = person [weight="1.0"]
search text
{person={pattern_person}}
if
not match $> ^ Ltd
The ^ matcher matches the start of a line, and the right context of our entity can contain many line starts, so the condition can match at any of them rather than only directly after the entity. It is highly recommended to use the \A matcher (the start of the sequence) instead.
Correct:
Way 1 (with \A):
annotation
Person = person [weight="1.0"]
search text
{person={pattern_person}}
if
not match $> \A Ltd
Way 2 (without \A):
annotation
Person = person [weight="1.0"]
search text
{person={pattern_person}}(?! Ltd)
if
true
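The difference between the two anchors can be illustrated outside RL3. The sketch below uses Python's re module purely as an analogy: it assumes that RL3's ^ behaves like Python's ^ in MULTILINE mode and that \A behaves like Python's \A, which may not hold exactly for the RL3 engine. The function names and sample strings are hypothetical.

```python
import re

def line_anchor_matches(right_context: str) -> bool:
    # Analogue of "match $> ^ Ltd": ^ can match at the start of ANY line
    # inside the right context (Python needs re.MULTILINE for that).
    return re.search(r"^ Ltd", right_context, re.MULTILINE) is not None

def sequence_anchor_matches(right_context: str) -> bool:
    # Analogue of "match $> \A Ltd": \A matches only at the very start
    # of the right context, i.e. directly after the entity.
    return re.search(r"\A Ltd", right_context) is not None

def find_person(text: str, name: str):
    # Analogue of Way 2: a negative lookahead placed in the main pattern.
    return re.search(re.escape(name) + r"(?! Ltd)", text)

# Right context of a person match; " Ltd" appears only on a LATER line.
later_line = " wrote the report.\n Ltd companies were excluded."
print(line_anchor_matches(later_line))      # True: the person would be wrongly rejected
print(sequence_anchor_matches(later_line))  # False: the person is kept
```

With the lookahead variant, find_person("John Smith Ltd was founded", "John Smith") returns None, while find_person("John Smith joined", "John Smith") returns a match.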
Productivity and Optimization (Avoiding Slow-Down Issues)
Below are the most common RL3 constructs that can slow down the annotator in the current version. They must be avoided.
(.*.*) vs (.*)
Here and in the following examples, "." can be replaced by any other pattern (letter, space, etc.) that occurs frequently in the text.
pattern fast
(.*)
pattern slow1
(.*.*)
pattern something_rare
(this pattern rarely occurs in text)
pattern use_slow1
({slow1}{something_rare})
pattern use_fast
({fast}{something_rare})
Patterns fast and slow1 are semantically equivalent, but their performance differs greatly.
Suppose {something_rare} = "test":
- there is only one way to parse ({fast}{something_rare});
- there are 5 ways to parse ({slow1}{something_rare}).
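The following table lists the 5 ways in which the two .* subexpressions of slow1 can divide the four characters of "test" between them: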
first subexpression .* | second subexpression .* |
---|---|
test | (empty) |
tes | t |
te | st |
t | est |
(empty) | test |
With use_slow1, the RL3 engine makes 5 useless parsing attempts at every position where the pattern does not match, and each attempt ends in a rejection (because {something_rare} has no match there).
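The count of 5 can be reproduced by enumerating every way two consecutive .* subexpressions can share a piece of text. This is a minimal Python sketch, not RL3 code; two_star_splits is a hypothetical helper, and Python's re.fullmatch stands in for a single .* subexpression.

```python
import re

def two_star_splits(text: str):
    """List every (first, second) split of text accepted by '.*' + '.*'."""
    splits = []
    for i in range(len(text) + 1):
        first, second = text[:i], text[i:]
        # Each half must be matched in full by its own ".*".
        if re.fullmatch(r".*", first) and re.fullmatch(r".*", second):
            splits.append((first, second))
    return splits

print(len(two_star_splits("test")))  # 5
```

In general a text of n characters (without line breaks) yields n + 1 splits, so the number of redundant attempts grows with the length of the text that .*.* can cover, whereas a single .* has exactly one way to cover it.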
.*(delimiter)?.* vs .*(delimiter.*)?
".*(delimiter)?.*" is a masked variety of ".*.*"
Since the (delimiter)? group is optional, the pattern is equivalent to the following (incorrect):
(.*.*|.*delimiter.*)
As described above, the .*.* pattern causes slow-downs and thus hurts performance. Therefore, use the following pattern instead (correct):
.*(delimiter.*)?
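That the rewritten pattern accepts the same strings can be checked by brute force on a small alphabet. The Python sketch below is only an analogy: it assumes "." is replaced by [a-z] and the delimiter by "-" (both chosen here for illustration), and it compares the two forms on every string of up to 5 characters over the alphabet {a, -}.

```python
import re
from itertools import product

# "[a-z]" stands in for "." and "-" for the delimiter (illustrative choices).
AMBIGUOUS = re.compile(r"[a-z]*(-)?[a-z]*")   # analogue of .*(delimiter)?.*
FACTORED = re.compile(r"[a-z]*(-[a-z]*)?")    # analogue of .*(delimiter.*)?

def sample_strings(alphabet: str = "a-", max_len: int = 5):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def patterns_agree() -> bool:
    """True if both patterns accept exactly the same sample strings."""
    return all(
        (AMBIGUOUS.fullmatch(s) is None) == (FACTORED.fullmatch(s) is None)
        for s in sample_strings()
    )

print(patterns_agree())  # True
```

The check covers acceptance only; it says nothing about speed. The performance argument is the one given above: in the factored form the second repetition is reachable only after a delimiter, so the two repetitions never compete for the same characters.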
Need for a defined left context
If the left margin is not clearly defined, e.g. (.{1,100}{entity_suffix}), parsing can slow down.
Consider the following solution instead:
({left_margin}.{1,100}{entity_suffix})
Here the {left_margin} stopper may include: the start of the line, punctuation, a copyright sign, or particular words (e.g. address, hotel, corporate prefix, etc.).
It is also recommended not to check {left_margin} in a condition (e.g. if not match $$ {left_margin}). A better solution is to include {left_margin} in the main pattern, e.g.: ({left_margin}.{1,100}{entity_suffix})
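One way to see why a defined left margin helps is to count the positions from which a match can begin. The Python sketch below is an analogy only and does not model the RL3 engine: the sample text, the entity suffix "Inc", and the margin definition (start of sequence, or punctuation followed by a space) are all assumptions made for illustration.

```python
import re

TEXT = "We met in Paris. Acme Holdings Inc signed the deal."

# Lookaheads make every viable start position visible as an empty match.
UNANCHORED = re.compile(r"(?=.{1,100}Inc)")
WITH_LEFT_MARGIN = re.compile(r"(?=(?:\A|[.,;] ).{1,100}Inc)")

def candidate_starts(pattern, text):
    """Return every offset at which the pattern could begin a match."""
    return [m.start() for m in pattern.finditer(text)]

print(len(candidate_starts(UNANCHORED, TEXT)))   # 31
print(candidate_starts(WITH_LEFT_MARGIN, TEXT))  # [0, 15]
```

Without a margin, every one of the 31 offsets before "Inc" is a viable start; with the margin, only the start of the text and the full stop after "Paris" are.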
Factoring out common prefixes
pattern slow
(
{heavy_common_prefix}{suffix1}
|{heavy_common_prefix}{suffix2}
|{heavy_common_prefix}{suffix3}
)
pattern faster
(
{heavy_common_prefix}({suffix1}|{suffix2}|{suffix3})
)
- slow pattern: the prefix is matched 3 times, once for each combination with suffixes 1, 2, and 3;
- faster pattern: the prefix is matched only once, which speeds up execution.
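The effect of factoring can be sketched with a hand-written matcher that counts how often the prefix is evaluated. This is a simplified Python model, not the RL3 engine: heavy_common_prefix, the prefix text, and the suffix list are hypothetical, and the slow variant assumes the engine re-matches the prefix for every alternative it tries.

```python
from typing import Optional

calls = {"prefix": 0}
SUFFIXES = ["Corp", "Ltd", "Inc"]  # stand-ins for suffix1..suffix3

def heavy_common_prefix(text: str) -> Optional[int]:
    """Hypothetical expensive prefix matcher; returns the end offset or None."""
    calls["prefix"] += 1
    prefix = "International Business "
    return len(prefix) if text.startswith(prefix) else None

def match_slow(text: str) -> bool:
    # One alternative per suffix; each alternative re-matches the prefix.
    for suffix in SUFFIXES:
        end = heavy_common_prefix(text)
        if end is not None and text.startswith(suffix, end):
            return True
    return False

def match_faster(text: str) -> bool:
    # The prefix is matched once, then only the cheap suffixes are tried.
    end = heavy_common_prefix(text)
    if end is None:
        return False
    return any(text.startswith(suffix, end) for suffix in SUFFIXES)

def count_prefix_calls(matcher, text: str) -> int:
    """Run the matcher on the text and report how often the prefix was evaluated."""
    calls["prefix"] = 0
    matcher(text)
    return calls["prefix"]

print(count_prefix_calls(match_slow, "International Business Inc"))    # 3
print(count_prefix_calls(match_faster, "International Business Inc"))  # 1
```

Both matchers accept the same inputs; the difference is only in how much prefix work is repeated, which is largest when the matching suffix is the last alternative or when no suffix matches.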