Text Processing Pipeline

The goal of a text processing/annotation pipeline is to produce:

  • meaningful output (i.e. data annotated from text, e.g. a location) ...
  • from raw input (i.e. unstructured data, e.g. text of any kind: a webpage, a non-annotated corpus of news articles, e-mails, etc.).

Both input and output are considered Facts in terms of RL3 basic concepts. Therefore, it can be stated that:

  • the pipeline takes Facts as input (e.g. the text of a webpage);
  • as a result of processing, the pipeline produces new Facts (e.g. the category of a webpage) that can then be executed (e.g. extracted) or updated in an existing RL3 Factsheet.
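
The Facts-in, Facts-out idea can be illustrated with a minimal Python sketch. It does not use the RL3 API: the Fact type, the categorize function, and the keyword rule are all hypothetical stand-ins chosen only to show the shape of the data flow.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """Hypothetical stand-in for an RL3 Fact: a labelled piece of data."""
    label: str
    value: str


def categorize(facts: list[Fact]) -> list[Fact]:
    """Toy 'pipeline': read input text Facts and emit new category Facts."""
    produced = []
    for fact in facts:
        if fact.label == "text":
            # Assumed keyword rule, for illustration only.
            category = "sports" if "match" in fact.value.lower() else "other"
            produced.append(Fact("category", category))
    return produced


input_facts = [Fact("text", "The match ended in a draw.")]
output_facts = categorize(input_facts)
print(output_facts)
```

The input list is left untouched here; whether the new Facts end up in a separate collection or in the existing one corresponds to the execute and update modes described in the pipeline steps.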

A simplified RL3 pipeline may look as follows:

  1. RL3_engine object creation.
  2. RL3_engine object initialization (as an RL3 type object).
  3. Processing annotation patterns (in one of two ways):
    1. if an RL3 model has already been compiled (by the built-in Compiler), the file containing the compiled engine model can be loaded;
    2. otherwise, the model can be compiled directly from RL3 sources:
      1. first, by parsing the source (inline, from a single RL3 Module file, or from a whole RL3 Project);
      2. then, by linking (compiling) the data from the individual parses.
  4. Factsheet creation.
  5. Assertion of fact(s) (e.g. the input text) into the factsheet.
  6. Running the engine to perform the necessary annotations.
  7. Handling the results in one of two ways (depending on the selected execution mode):
    1. execute (e.g. extract all annotated items and add resulting facts to an output factsheet);
    2. update (e.g. update the facts in an existing factsheet).
  8. RL3 object deletion.
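
The steps above can be sketched end to end in Python. This is a toy model under stated assumptions, not the RL3 API: the Factsheet and ToyEngine classes, their method names, and the mode argument are hypothetical, and regular expressions stand in for RL3 annotation patterns. Loading a precompiled model (step 3.1) is omitted.

```python
import re
from collections import defaultdict


class Factsheet:
    """Hypothetical container mapping a label to a list of values."""

    def __init__(self):
        self.facts = defaultdict(list)

    def assert_fact(self, label, value):
        self.facts[label].append(value)


class ToyEngine:
    """Hypothetical engine; regexes stand in for RL3 annotation patterns."""

    def __init__(self):
        self.parsed = []   # results of the individual parses
        self.model = None  # linked (compiled) model

    def parse(self, label, pattern):
        self.parsed.append((label, pattern))

    def link(self):
        self.model = [(label, re.compile(p)) for label, p in self.parsed]

    def run(self, factsheet, mode="execute"):
        if self.model is None:
            raise RuntimeError("link() must be called before run()")
        # "update" writes into the given factsheet, "execute" into a new one.
        target = factsheet if mode == "update" else Factsheet()
        for text in list(factsheet.facts["text"]):
            for label, regex in self.model:
                for match in regex.finditer(text):
                    target.assert_fact(label, match.group(0))
        return target


engine = ToyEngine()                             # steps 1-2
engine.parse("year", r"\b(?:19|20)\d{2}\b")      # step 3.2.1
engine.link()                                    # step 3.2.2
sheet = Factsheet()                              # step 4
sheet.assert_fact("text", "Founded in 1998, relaunched in 2015.")  # step 5
extracted = engine.run(sheet, mode="execute")    # steps 6 and 7.1
engine.run(sheet, mode="update")                 # steps 6 and 7.2
del engine                                       # step 8
```

In this sketch, execute mode returns a fresh output factsheet containing only the annotated items, while update mode adds the annotations to the factsheet that already holds the input text.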