RL3 Best Practices

From RL3 Wiki

Most Common Mistakes

Using ^ and $ anchors instead of \A and \Z

It is often necessary to check the context of a matched pattern. For example, a Person entity must not be created if the match is followed by Ltd.

Incorrect:

annotation
    Person = person [weight="1.0"]
search text
    {person={pattern_person}}
if
    not match $> ^ Ltd

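The ^ matcher matches the start of a line, and the right context of the entity may contain many line starts, so the condition can fire on an Ltd at the start of any later line, not only on one that immediately follows the entity. It is highly recommended to use the \A matcher (the start of the sequence) instead.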

Correct:

Way 1 (with \A):

annotation
    Person = person [weight="1.0"]
search text
    {person={pattern_person}}
if
    not match $> \A Ltd

Way 2 (without \A):

annotation
    Person = person [weight="1.0"]
search text
    {person={pattern_person}}(?! Ltd)
if
    true
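
The difference between the two anchors can be sketched outside RL3. The snippet below uses Python's re module as an analogy only: it assumes that RL3's ^ behaves like a multi-line ^ and that \A behaves like Python's \A, as the text above describes. The function names, the sample strings and the use of \s* for the gap before Ltd are illustrative assumptions, not RL3 code.

```python
import re


def ltd_seen_with_line_anchor(right_context: str) -> bool:
    """Analogue of `match $> ^ Ltd`: ^ may match at every line start."""
    return re.search(r"^\s*Ltd\b", right_context, flags=re.MULTILINE) is not None


def ltd_seen_with_sequence_anchor(right_context: str) -> bool:
    """Analogue of `match $> \\A Ltd`: \\A matches only at the very start."""
    return re.search(r"\A\s*Ltd\b", right_context) is not None


# Hypothetical right contexts of a matched person name.
DIRECTLY_FOLLOWED = " Ltd announced its results."
LTD_ON_LATER_LINE = " visited the office.\nLtd companies were not mentioned."

for context in (DIRECTLY_FOLLOWED, LTD_ON_LATER_LINE):
    print(
        repr(context),
        ltd_seen_with_line_anchor(context),
        ltd_seen_with_sequence_anchor(context),
    )
```

In the second context the line anchor reports an Ltd that does not directly follow the entity, so a rule built on it would wrongly reject the Person; the sequence anchor does not.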

Performance and Optimization (Avoiding Slow-Downs)

Below are the most common RL3 constructs that can slow the annotator in the current version. They must be avoided.


(.*.*) vs (.*)

pattern fast
    (.*)
pattern slow1
    (.*.*)
pattern something_rare
    (this pattern rarely occurs in text)
pattern use_slow1
    ({slow1}{something_rare})
pattern use_fast
    ({fast}{something_rare})

Patterns fast and slow1 are semantically equivalent; however, they differ greatly in parsing cost.

Suppose the text that the .* part has to consume is "test":

  • there is only one way to parse ({fast}{something_rare});
  • there are 5 ways to parse ({slow1}{something_rare}).

Look at the following table:

first subexpression .*    second subexpression .*
(empty)                   test
t                         est
te                        st
tes                       t
test                      (empty)


With use_slow1, the RL3 engine makes 5 parsing attempts at every position where the pattern does not match, and each of them ends in a rejection (because there is no match for {something_rare}). All of these attempts are wasted work.
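
The count of 5 can be reproduced with a few lines of Python. This is a minimal sketch that only enumerates the ways a text can be divided between subexpressions; it does not model the RL3 engine itself, and the function names are illustrative.

```python
from typing import List, Tuple


def two_star_splits(text: str) -> List[Tuple[str, str]]:
    """Every way (.*.*) can divide `text` between its two .* subexpressions."""
    return [(text[:i], text[i:]) for i in range(len(text) + 1)]


def one_star_splits(text: str) -> List[str]:
    """(.*) has a single subexpression, so it can consume `text` in one way only."""
    return [text]


for first, second in two_star_splits("test"):
    print(repr(first), repr(second))

print(len(two_star_splits("test")), "ways for (.*.*)")
print(len(one_star_splits("test")), "way for (.*)")
```

For a text of n characters the two-star form has n + 1 splits, so the wasted work grows with the length of the text being scanned.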

.*(delimiter)?.* vs .*(delimiter.*)?

".*(delimiter)?.*" is a disguised variant of ".*.*".

Since the (delimiter)? group is optional, the pattern is equivalent to the following (incorrect):

   (.*.*|.*delimiter.*)

As described above, the .*.* pattern causes slow-downs and thus hurts performance. Use the following pattern instead (correct):

   .*(delimiter.*)?
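
That the rewrite accepts the same texts can be checked with an analogy in Python's re module. The sketch below assumes a hypothetical delimiter ";" and a hypothetical trailing token END standing in for whatever follows the pattern; it compares match results only and says nothing about the speed of the RL3 engine.

```python
import re

# Hypothetical stand-ins: ";" plays the delimiter, "END" the following pattern.
ORIGINAL = re.compile(r".*(?:;)?.*END")    # .*(delimiter)?.*
REWRITTEN = re.compile(r".*(?:;.*)?END")   # .*(delimiter.*)?

SAMPLES = ["a;bEND", "abEND", ";END", "END", "a;b;cEND", "a;b", ""]

# Both forms should accept and reject exactly the same sample texts.
results = [
    (text, bool(ORIGINAL.fullmatch(text)), bool(REWRITTEN.fullmatch(text)))
    for text in SAMPLES
]
all_agree = all(original == rewritten for _, original, rewritten in results)

for text, original, rewritten in results:
    print(repr(text), original, rewritten)
print("all agree:", all_agree)
```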

Need for a defined left context

If the left margin is not clearly defined, e.g. (.{1,100}{entity_suffix}), parsing can slow down.

Consider the following solution instead:

   ({left_margin}.{1,100}{entity_suffix})

Here the {left_margin} stopper may include: the start of a line, punctuation, a copyright sign, or particular words (e.g. address, hotel, corporate prefix, etc.).
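
One way to picture the benefit is to count the positions from which a match attempt can start. The Python sketch below is a rough model under stated assumptions, not a description of how the RL3 engine schedules its work: it assumes that an unanchored pattern may be tried from every character position, while a pattern with a left margin is tried only right after a stopper. The stopper set and the sample text are hypothetical.

```python
import re
from typing import List

# Hypothetical stopper set: start of a line, some punctuation, copyright sign.
STOPPERS = re.compile(r"^|[.,;:©]", flags=re.MULTILINE)


def starts_without_margin(text: str) -> List[int]:
    """Assumed model: an unanchored pattern may be tried from every position."""
    return list(range(len(text)))


def starts_with_margin(text: str) -> List[int]:
    """Assumed model: attempts begin only immediately after a stopper."""
    return [match.end() for match in STOPPERS.finditer(text)]


TEXT = "Contact: Grand Plaza Hotel, 5 Main St.\nAcme Trading Ltd"

print(len(starts_without_margin(TEXT)), "candidate starts without a left margin")
print(len(starts_with_margin(TEXT)), "candidate starts with a left margin")
```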

Factoring out common prefixes

pattern slow
    (
         {heavy_common_prefix}{suffix1}
        |{heavy_common_prefix}{suffix2}
        |{heavy_common_prefix}{suffix3}
    )
pattern faster
    (
        {heavy_common_prefix}({suffix1}|{suffix2}|{suffix3})
    )

  • slow pattern: the prefix is matched up to 3 times, once for each combination with suffixes 1, 2 and 3;
  • faster pattern: the prefix is matched only once, which speeds up execution.
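
The effect can be modelled by counting how often the prefix is evaluated. The Python sketch below is a simplified model, assuming an engine that tries the alternatives one after another without caching the prefix result; the CountingPrefix class, the suffix list and the sample text are hypothetical.

```python
from typing import Optional

# Hypothetical stand-ins for {suffix1}, {suffix2}, {suffix3}.
SUFFIXES = ("Ltd", "Inc", "GmbH")


class CountingPrefix:
    """Stand-in for {heavy_common_prefix} that counts how often it is evaluated."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.calls = 0

    def match(self, text: str) -> Optional[int]:
        """Return the end offset of the prefix in `text`, or None if absent."""
        self.calls += 1
        return len(self.prefix) if text.startswith(self.prefix) else None


def match_slow(prefix: CountingPrefix, text: str) -> bool:
    """prefix+suffix1 | prefix+suffix2 | prefix+suffix3: prefix redone per branch."""
    for suffix in SUFFIXES:
        end = prefix.match(text)
        if end is not None and text.startswith(suffix, end):
            return True
    return False


def match_faster(prefix: CountingPrefix, text: str) -> bool:
    """prefix(suffix1|suffix2|suffix3): prefix evaluated once, then suffixes tried."""
    end = prefix.match(text)
    if end is None:
        return False
    return any(text.startswith(suffix, end) for suffix in SUFFIXES)


TEXT = "Acme Trading GmbH"
slow_prefix = CountingPrefix("Acme Trading ")
fast_prefix = CountingPrefix("Acme Trading ")
slow_result = match_slow(slow_prefix, TEXT)
fast_result = match_faster(fast_prefix, TEXT)
print("slow:", slow_result, "prefix evaluations:", slow_prefix.calls)
print("faster:", fast_result, "prefix evaluations:", fast_prefix.calls)
```

Both functions accept the same texts; only the number of prefix evaluations differs, and the gap is largest when the matching suffix is the last alternative or when no suffix matches at all.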