Named Entity Recognition Example

From RL3 Wiki

Named-entity recognition (NER) is the process of locating and identifying real-world entities and other important concepts in text (named entities, i.e. anything that can be referred to by a proper noun). These entities are annotated with meta-information that may include start and end positions, a confidence score (weight), and a category (type) such as person name, organization, numerical entity (e.g. date, time, monetary value), location (e.g. lake, mountain), geopolitical entity (e.g. country, state, province), or facility (e.g. bridge, airport).
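The meta-information listed above can be pictured as a small record. The following is a minimal Python sketch, not part of RL3: the class name Annotation and the helper covers are hypothetical, and the sketch assumes that end points just past the last character of the value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Illustrative container for the meta-information of one named entity."""
    label: str     # category (type), e.g. "person"
    value: str     # the annotated text
    weight: float  # confidence score
    start: int     # offset of the first character of the value
    end: int       # offset just past the last character (assumption)

    def covers(self, text: str) -> bool:
        """Return True if text[start:end] reproduces the annotated value."""
        return text[self.start:self.end] == self.value


text = "Meet Audrey Hepburn in Rome."
ann = Annotation(label="person", value="Audrey Hepburn", weight=1.0, start=5, end=19)
print(ann.covers(text))
```

Keeping the offsets next to the value makes it possible to check an annotation against the source text, which is what covers does.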

In RL3, named-entity recognition is done by searching for and annotating patterns in a specific Fact in a Factsheet. Patterns and annotation rules are defined in source modules and then compiled into a binary model, which can be used in your program (refer to RL3 APIs) or in RL3 tools. Many useful patterns (including NER patterns) are already defined in the RL3 StdLib and can be used in your annotators.

NER in RL3

Let's extract person names from a text file using a pattern from the standard library in three simple steps:

  • define an annotation rule;
  • compile a model;
  • run the model against the text file and get the result.

Define an annotation rule

Let's create a source module ner.rl3 with the following annotation rule:

1 include <person.rl3>
2 
3 annotation
4     person=$$
5 search text
6     \<{PERSON_FULL_NAME}\>
7 if
8     true

This rule defines the logic for extracting and annotating a named entity (in our case, the full name of a person) from the given text:

  • we search for all occurrences of person full names in the specified fact;
  • if a name in the text matches the {PERSON_FULL_NAME} pattern (predefined in the standard library of frequently used RL3 patterns), the search is considered successful and the matched value is annotated.

Line-by-line description:

  • line 1: the include directive imports person patterns from the standard library;
  • line 3: introduces an annotation block (to define annotation logic);
  • line 4: provides the annotation label (the name by which the annotation can be referred to) and a reference to the matched value (only the group defined by the rule is captured, not the left or right context);
  • line 5: introduces a search block (to define search parameters) and an input fact name or label (in our case, text);
  • line 6: defines the search pattern (in our case, the {PERSON_FULL_NAME} pattern from the standard library, surrounded by the \< and \> matchers, which correspond to the start and end of a word respectively);
  • line 7: introduces an if condition block (additional test logic executed for every matched value);
  • line 8: provides a condition (in our case, no specific conditions are defined; however, they can be set by means of built-in and custom predicates and context checks).
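For readers more familiar with regular expressions, the structure of the rule can be approximated in Python. This is only a rough analogy under stated assumptions, not how RL3 works internally: FULL_NAME is a hypothetical stand-in for {PERSON_FULL_NAME} (the actual StdLib definition is not shown on this page), Python's \b takes the place of the \< and \> matchers, and the condition argument mirrors the if block.

```python
import re

# Stand-in for {PERSON_FULL_NAME}: two capitalized words separated by a space.
# This is an assumption for illustration only, not the StdLib definition.
FULL_NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"

# \b is Python's word-boundary assertion, playing the role of \< and \>.
SEARCH = re.compile(rf"\b(?:{FULL_NAME})\b")


def annotate(text, label="person", condition=lambda match: True):
    """Return one annotation per match that passes the condition."""
    annotations = []
    for match in SEARCH.finditer(text):
        if condition(match):  # corresponds to the "if true" block
            annotations.append({
                "label": label,
                "value": match.group(0),  # the matched value only, no context
                "weight": 1.0,
                "start": match.start(),
                "end": match.end(),
            })
    return annotations


# The sample sentence used later on this page; \u2013 is the en dash.
TEXT = ('Nothing is impossible, the word itself says, '
        '"I\'m possible!" \u2013 Audrey Hepburn')

for annotation in annotate(TEXT):
    print(annotation)
```

Passing a stricter condition, for example one that inspects the text around the match, corresponds to replacing true in line 8 with a predicate or context check.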

Compile a model

To compile a model, use the RL3 Compiler:

$ rl3c -m ner.rl3 -o ner.rl3c

Run a model and get the result

Let's run our model against the test.txt file (UTF-8 encoded) with the following content:

Nothing is impossible, the word itself says, "I'm possible!" – Audrey Hepburn

You can run the compiled model in either of two ways:

1. With Tools:

To run a model, use the RL3 Run tool:

$ rl3run --model ner.rl3c --type text --input test.txt --fact text

Running the binary model on the specified text file produces the following output:

{"label":"person","value":"Audrey Hepburn","weight":1.0,"start":63,"end":77}

where:

  • label is the name by which the annotation can be referred to;
  • value is the annotated person's full name;
  • weight is a confidence score (if not specified directly in the rule, as in our case, the weight is set to 1.0 by default);
  • start and end are the positions of the annotated value in the text (in this example they are character offsets, with end pointing just past the last character of the value).
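The output line can be consumed by any JSON parser. The sketch below uses Python's standard json module to read the line shown above and check the offsets against the input text. It assumes each annotation arrives as one JSON object on its own line, which is what the single-line sample suggests but is not confirmed here.

```python
import json

# Content of test.txt; \u2013 is the en dash before the name.
TEXT = ('Nothing is impossible, the word itself says, '
        '"I\'m possible!" \u2013 Audrey Hepburn')

# The output line produced by rl3run in the example above.
OUTPUT_LINE = ('{"label":"person","value":"Audrey Hepburn",'
               '"weight":1.0,"start":63,"end":77}')

annotation = json.loads(OUTPUT_LINE)

# Slicing by character offsets recovers the annotated value.
span = TEXT[annotation["start"]:annotation["end"]]
print(span == annotation["value"])

# The same position measured in UTF-8 bytes is larger, because the
# en dash occupies three bytes but counts as one character.
byte_start = len(TEXT[:annotation["start"]].encode("utf-8"))
print(annotation["start"], byte_start)
```

The comparison at the end shows why the offsets in this sample are best read as character positions rather than byte positions in the UTF-8 file.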

2. Through an API:

RL3 provides a set of programming interfaces for integrating RL3 with your existing code. To compile and run a model through an API, you need to set up a text processing pipeline, which includes:

  • engine creation and initialization;
  • either loading an already compiled model or compiling one directly from RL3 sources;
  • creation and initialization of input/output factsheets;
  • running a model.

For more information on RL3 APIs, please refer to the APIs section on the Main Page.

Tips and Tricks

Organize Your Code

Real-world examples are often sophisticated and involve hierarchical structures, so knowing how to shape and maintain your code is essential. See the tips for good code design in RL3 Tips and Tricks.

Look for Context, but Only Where Needed

When the goal is to improve NER quality, setting up the necessary contexts is the first thing to consider. However, to keep your code reusable, it is strongly advised to follow the basic context-setting rules.

Link to Context & Disambiguate

Disambiguation in NER is vital. RL3 offers ways to interpret entities and link them to the right context (including strong/weak concepts and fact weights).
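To illustrate how weights can help with disambiguation in general, here is a minimal Python sketch that keeps the highest-weight annotation whenever two candidates overlap. It is a generic greedy technique, not a description of RL3's own mechanism; the function name resolve_overlaps and the weights in the example are hypothetical.

```python
def resolve_overlaps(annotations):
    """Keep the best of any overlapping annotations.

    Candidates are ranked by weight (higher first), then by span length
    (longer first). A candidate is kept only if it does not overlap an
    already kept annotation. The result is returned in text order.
    """
    ranked = sorted(
        annotations,
        key=lambda a: (-a["weight"], -(a["end"] - a["start"])),
    )
    kept = []
    for candidate in ranked:
        if all(candidate["end"] <= k["start"] or candidate["start"] >= k["end"]
               for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda a: a["start"])


# "Paris" alone could be a location, but the longer match is a person name.
candidates = [
    {"label": "location", "value": "Paris", "weight": 0.6, "start": 0, "end": 5},
    {"label": "person", "value": "Paris Hilton", "weight": 0.9, "start": 0, "end": 12},
]
for annotation in resolve_overlaps(candidates):
    print(annotation["label"], annotation["value"])
```

In this example only the person annotation survives, because it outweighs the overlapping location candidate.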