Predefined Page Facts
If the input text comes from a web page, RL3 lets you use page metadata in conditions through predefined page facts.
Every page that serves as input for an RL3 annotator can carry the following information:
URL
Page URL, e.g. http://europe.yamaha.com/en/products/proaudio/
domain
The full host name of the page, e.g. the domain for http://europe.yamaha.com/en/products/proaudio/ is "europe.yamaha.com".
domain_name
The main registered domain, without subdomains, e.g. http://europe.yamaha.com/en/products/proaudio/ has the domain name "yamaha.com".
domain_basename
The "base name" of a domain, i.e. the domain name without a zone (e.g. .com, .ru, .co.uk) or hosting info (.narod.ru). For the page http://europe.yamaha.com/en/products/proaudio/, domain_basename is "yamaha". For the page http://www.zorallabs.com/, domain_basename is "zorallabs".
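The relationship between URL, domain, domain_name and domain_basename can be illustrated with a minimal Python sketch. It is not RL3's actual implementation: the function name `page_domain_facts` and the tiny `SUFFIXES` set are hypothetical, and a real system would need a complete list of zones and hosting suffixes. How domain_name is computed for pages on hosting services is not stated above, so the sketch simply keeps one label in front of the recognized suffix.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny list of zones and hosting suffixes.
# A real implementation would need a complete, maintained list.
SUFFIXES = {"com", "ru", "co.uk", "narod.ru"}


def page_domain_facts(url):
    """Derive the domain-related page facts from a URL (illustrative only)."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".") if host else []

    # Look for the longest known suffix, always leaving at least one label in front.
    suffix_len = 0
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in SUFFIXES:
            suffix_len = n
            break
    else:
        # Unknown suffix: assume the last label is the zone (an assumption).
        suffix_len = 1 if len(labels) > 1 else 0

    # The base name is the single label directly in front of the suffix.
    name_labels = labels[-(suffix_len + 1):]
    return {
        "URL": url,
        "domain": host,
        "domain_name": ".".join(name_labels),
        "domain_basename": name_labels[0] if name_labels else "",
    }


facts = page_domain_facts("http://europe.yamaha.com/en/products/proaudio/")
print(facts["domain"], facts["domain_name"], facts["domain_basename"])
```

For the Yamaha page used above, the sketch yields "europe.yamaha.com", "yamaha.com" and "yamaha", which agrees with the values given in the descriptions.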
languages
Page languages. If a page uses more than one language, several values will be available for the fact.
E.g. to annotate only pages with German text:
annotation Product=product search text {product={german_specific_product_pattern}} if each xx in languages match xx german
In the example above, the predicate is executed only if all the languages of the page match \Agerman\Z, i.e. the page has only one language and it equals "german". If it is enough to check that at least one of the languages on the page is German, try the following (either with 'match' or 'search'):
annotation Product=product search text {product={german_specific_product_pattern}} if any xx in languages search xx german
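The difference between the two conditions can be mimicked in Python. This is only an analogy, not RL3 code: `match` is modelled as a whole-value match (equivalent to anchoring the pattern with \A and \Z, as described above), `search` as a match anywhere in the value, and the helper names are hypothetical. How RL3 treats an empty list of languages is not stated here; Python's `all` returns True in that case.

```python
import re


def match(value, pattern):
    # Whole value must match, like \A<pattern>\Z.
    return re.fullmatch(pattern, value) is not None


def search(value, pattern):
    # Pattern may occur anywhere inside the value.
    return re.search(pattern, value) is not None


def each_match(values, pattern):
    # Analogue of: if each xx in languages match xx german
    return all(match(v, pattern) for v in values)


def any_search(values, pattern):
    # Analogue of: if any xx in languages search xx german
    return any(search(v, pattern) for v in values)


monolingual = ["german"]
bilingual = ["german", "english"]

print(each_match(monolingual, "german"))  # every language is German
print(each_match(bilingual, "german"))    # "english" breaks the condition
print(any_search(bilingual, "german"))    # one German entry is enough
```

With `each`, a single non-German language makes the condition fail; with `any`, one German entry is sufficient.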
category
Page category, e.g. "index", "about", "contacts". Category can have more than one value.
title
Page title, extracted from the HTML markup (i.e. the text of the <title>...</title> element). E.g. for the page http://zorallabs.com/ the title is "Home Page - Zoral Labs".
description
Page description, extracted from the HTML markup (i.e. the content attribute of the <meta name="description" content="..."> element).
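As a rough illustration of where the title and description values come from, the following Python sketch pulls both out of an HTML string with the standard library parser. It is not the extractor RL3 uses: the class and function names are hypothetical, the sample description text is invented, and the sketch assumes the description is stored in the content attribute of the meta element, as is usual in HTML.

```python
from html.parser import HTMLParser


class PageMetaParser(HTMLParser):
    """Collects the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title_and_description(html):
    parser = PageMetaParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description.strip()


# Invented sample page; only the title text is taken from the example above.
sample = """<html><head>
<title>Home Page - Zoral Labs</title>
<meta name="description" content="Sample description text">
</head><body></body></html>"""

print(extract_title_and_description(sample))
```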
pathway
The so-called "breadcrumbs" (http://en.wikipedia.org/wiki/Breadcrumb_(navigation) ), often used for navigation on corporate websites. E.g. for the page http://zorallabs.com/company/management-team the value is "Home › Company › Management Team".
The following facts are still to be documented (TBD):
abstract
lpaths
parent_categories
mf_parent_categories
mf_lpaths
smcat
ltexts
mf_ltexts
mp_ltexts
ltexts_from_index