Apache UIMA - hands on code
  Gestione delle Informazioni su Web - 2010/2011
                  Tommaso Teofili
          tommaso [at] apache [dot] org
Use Cases - Agenda

UC1 : Real estate market analysis

UC2 : Automatic information extraction from tenders

UIMA & search engines

Tutorial

Assignment
UC1 : Source


An online announcement site for sellers and
buyers

General purpose (cars, real estate, hi-fi, etc.)

Local scope (Rome and nearby)
UC1 - Goals

Track real estate market in order to:

  Make smart decisions

  Predict how things will go in the (near) future

Estate listings text is unstructured

Aggregate queries for statistical analysis need
structured information
UC1 - Source
UC1 - Blocks
UC1 - Crawler
A specialized crawler extracts data from the source

Estate listing data is stored in files, grouped by zone,
in a directory on a managed machine

Site navigation is defined with one XML file for each
city zone

The crawler downloads page fragments twice a
week

The free text extracted from the estate listings is saved
as XML, grouped by zone
UC1 - Crawler

Issues :

  Cookies must be enabled

  Some HTTP headers are required

  Fixed sleep intervals had to be added
  between requests
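
The three issues above could be handled roughly as follows. This is a minimal Python sketch using only the standard library, not the crawler from the talk. The header values, the 2-second delay, and the names `build_opener` and `crawl` are assumptions made for illustration.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical values: the headers the real site requires are not given in the slides.
DEFAULT_HEADERS = (
    ("User-Agent", "Mozilla/5.0"),
    ("Accept-Language", "it-IT,it;q=0.8"),
)


def build_opener(headers=DEFAULT_HEADERS):
    """Return an opener that keeps cookies across requests and sends fixed headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(headers)
    return opener


def crawl(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing a fixed interval between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # fixed pause between consecutive requests
        pages.append(fetch(url))
    return pages
```

With a real site, `fetch` could be `lambda url: opener.open(url).read()` for an `opener` returned by `build_opener()`. Passing `fetch` and `sleep` as parameters keeps the pacing logic testable without network access.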
UC1 - Domain


Announcement

Zone

MagazineNumber

HouseStructure (with properties)
UC1 - Information Extraction Engine
Goal : extract price, zone and telephone
number

The first version used huge regular
expressions

Hard to maintain and inefficient

Poor extraction
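
To give an idea of what a regular-expression approach looks like, here is a small Python sketch for price and telephone number. The patterns are assumptions (dot-grouped prices such as "295.000", phone numbers starting with 0 or 3), not the expressions used in the first version of the engine.

```python
import re

# Assumed formats: prices with dot-separated thousands (e.g. "295.000"),
# phone numbers of 6 to 11 digits starting with 0 or 3, optionally preceded by "tel".
PRICE_RE = re.compile(r"\b(\d{1,3}(?:\.\d{3})+)\b")
PHONE_RE = re.compile(r"\b(?:tel\.?\s*)?([03]\d{5,10})\b", re.IGNORECASE)


def extract_price(text):
    """Return the last dot-grouped number in the text as an integer, or None."""
    matches = PRICE_RE.findall(text)
    return int(matches[-1].replace(".", "")) if matches else None


def extract_phone(text):
    """Return the first phone-like digit sequence, or None."""
    match = PHONE_RE.search(text)
    return match.group(1) if match else None
```

Even this toy version shows the maintenance problem: every new listing format means another alternative inside the patterns.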
UC1 - IE Engine

New requirements: extract the structure of
the house

  Number of rooms, box (garage), garden(s), external
  spaces, number of bathrooms, kitchen,
  etc.

  Track finer-grained zones
Sample text


“ven 26 Dic APPIA via grottaferrata metro 2°
piano assolato ingresso salone americana
cucina camera cameretta bagno soppalco
posto auto e 295.000”
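
One way to get the house structure out of a text like the sample above is a dictionary lookup. The sketch below is a plain-Python illustration under assumptions: the term list and the feature names in `FEATURE_TERMS` are hypothetical, and the actual annotators are not shown in the slides.

```python
import re
from collections import Counter

# Hypothetical dictionary mapping Italian listing terms to house-structure features.
FEATURE_TERMS = {
    "ingresso": "entrance",
    "salone": "living_room",
    "cucina": "kitchen",
    "camera": "bedroom",
    "cameretta": "bedroom",
    "bagno": "bathroom",
    "soppalco": "mezzanine",
    "posto auto": "parking_space",
    "box": "garage",
}

# Longest terms first, so multi-word terms are tried before shorter ones.
_TERMS_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, FEATURE_TERMS), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_house_structure(text):
    """Count the dictionary features mentioned in a listing."""
    return Counter(
        FEATURE_TERMS[m.group(1).lower()] for m in _TERMS_RE.finditer(text)
    )
```

Each match could become an annotation with begin and end offsets instead of a simple count; the counting version is only meant to show the lookup.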
UC1 - ContentAnnotator

 Only the estate listings must be extracted from
 the XML produced by the crawler

 A simple parser gets each node containing
 an estate listing (whose text is in turn
 unstructured)

 A ContentAnnotation is created over the
 document
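
A rough idea of this step in plain Python follows. The XML layout (`<zone name="..."><listing>...</listing></zone>`), the `zone` field and the function name are assumptions, since the slides do not show the crawler's output format; a real UIMA annotator would add the annotations to the CAS instead of returning them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    begin: int  # start offset in the document text
    end: int    # end offset (exclusive)
    zone: str   # zone the listing was grouped under


def annotate_content(xml_text, listing_tag="listing"):
    """Build the document text from listing nodes and annotate each listing span."""
    root = ET.fromstring(xml_text)
    parts, annotations, offset = [], [], 0
    for zone in root.iter("zone"):
        for node in zone.iter(listing_tag):
            text = (node.text or "").strip()
            if not text:
                continue  # skip empty nodes
            annotations.append(
                ContentAnnotation(offset, offset + len(text), zone.get("name", ""))
            )
            parts.append(text)
            offset += len(text) + 1  # +1 for the newline separator
    return "\n".join(parts), annotations
```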
ContentAnnotation
UC1 - Entities
UC1 - ZoneAnnotation
UC1 - Consuming extracted information
The previous version of the IE engine
produced XML files that had to be
parsed again to store structured data in the
DB

With UIMA, a CAS Consumer at the end of
the analysis pipeline can automatically write
the extracted information to the DB
UIMA - CAS Consumer
Analysis Engine responsible for consuming
information contained inside the CAS

Can write extracted information to:

  DBMS

  Lucene index

  Filesystem

  ...
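
As an illustration of the DBMS case, the sketch below writes extracted fields to an SQLite table and runs an aggregate query on them. It is plain Python, not the UIMA CAS Consumer API: the `listing` table, its columns and the dict-based input are assumptions standing in for the feature structures a consumer would read from the CAS.

```python
import sqlite3


def consume(annotations, connection):
    """Store extracted listing fields in a relational table.

    `annotations` is a list of dicts with zone, price and rooms keys.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS listing (zone TEXT, price INTEGER, rooms INTEGER)"
    )
    connection.executemany(
        "INSERT INTO listing (zone, price, rooms) VALUES (:zone, :price, :rooms)",
        annotations,
    )
    connection.commit()


def average_price_by_zone(connection):
    """Aggregate query of the kind that structured data makes possible."""
    rows = connection.execute(
        "SELECT zone, AVG(price) FROM listing GROUP BY zone ORDER BY zone"
    )
    return dict(rows.fetchall())
```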
UC1 - Analysis Graphs
UC1 - Analysis Graphs
UC2 - Monitor of EU announcements

Monitor various sources which provide
announcements and tenders

Automate the long process of monitoring such
sources and automatically extract useful
common information from the announcements’
texts
UC2 - Blocks
Different input texts
Different input texts
Different input texts
UC2 - Domain annotations
Language

Abstract

Activity

Beneficiary

Budget

Expiration date

Funding type

Geographic region

Sector

Subject

Title

Tags
UC2 - Domain entities


First and most important is an entity that
represents the entire tender or
announcement

The domain annotations eventually fill
that entity's properties
UC2 - Simple first

Each annotator first looks:

   at whether some metadata was extracted during navigation

   for the most common patterns used to present
   information in such announcements

e.g.: “Budget: 200000$” or “Financial information: ......”

Such patterns are common across different languages
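
The "simple first" lookup might be sketched like this in Python. The label list (including the Italian "importo") and the function name are hypothetical; the idea is only that crawl-time metadata wins, and otherwise a "Label: value" line is searched for.

```python
import re

# Hypothetical labels; a real annotator would hold one list per field and language.
BUDGET_LABELS = ("budget", "financial information", "importo")


def find_labelled_value(text, labels, metadata=None, key=None):
    """Return crawl-time metadata if present, else the value after a 'Label:' line."""
    if metadata and key in metadata:
        return metadata[key]
    label_re = "|".join(re.escape(label) for label in labels)
    match = re.search(
        rf"^\s*(?:{label_re})\s*:\s*(.+?)\s*$",
        text,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return match.group(1) if match else None
```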
UC2 - AbstractAnnotator
 The abstract is usually in the first part of the
 document

 We use a Tokenizer and a Tagger to get Tokens (with
 PoS tags) and Sentences

 We use a dictionary of “good” words and linguistic
 patterns

 We search the first sentences of the document
 for the objectives of the announcement
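
A simplified version of this idea, in plain Python: a naive regex splitter stands in for the Tokenizer and Tagger, PoS tags and linguistic patterns are left out, and the word list in `GOOD_WORDS` is a hypothetical example.

```python
import re

# Hypothetical dictionary of "good" words that tend to introduce objectives.
GOOD_WORDS = {"aim", "aims", "objective", "objectives", "purpose", "support", "promote"}


def split_sentences(text):
    """Naive sentence splitter standing in for a real Tokenizer/Tagger."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, max_sentences=5, good_words=GOOD_WORDS):
    """Return the leading sentences that contain at least one good word."""
    selected = []
    for sentence in split_sentences(text)[:max_sentences]:
        tokens = {t.lower() for t in re.findall(r"\w+", sentence)}
        if tokens & good_words:
            selected.append(sentence)
    return " ".join(selected)
```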
UC2 - ExpirationDateAnnotator


  A DateAnnotator is executed first

  Iterate over the DateAnnotations

  Get the sentences wrapping such DateAnnotations

  Check whether terms or patterns like “the
  deadline is ...” appear near a DateAnnotation
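
The steps above can be sketched in plain Python as follows. The dd/mm/yyyy date format, the cue terms in `DEADLINE_CUES` and the period-based sentence boundaries are all assumptions; in the UIMA pipeline the dates and sentences would come from earlier annotators.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # assumed dd/mm/yyyy format
# Hypothetical cue terms signalling a deadline.
DEADLINE_CUES = ("deadline", "expires", "no later than", "closing date")


def sentence_around(text, begin, end):
    """Return the span of the sentence wrapping text[begin:end] (naive, period based)."""
    prev = text.rfind(". ", 0, begin)
    start = prev + 2 if prev != -1 else 0
    nxt = text.find(". ", end)
    stop = nxt + 1 if nxt != -1 else len(text)
    return start, stop


def find_expiration_dates(text):
    """Keep only the dates whose wrapping sentence contains a deadline cue."""
    results = []
    for m in DATE_RE.finditer(text):
        start, stop = sentence_around(text, m.start(), m.end())
        sentence = text[start:stop].lower()
        if any(cue in sentence for cue in DEADLINE_CUES):
            day, month, year = map(int, m.groups())
            try:
                results.append(date(year, month, day))
            except ValueError:
                continue  # not a valid calendar date
    return results
```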
UC2 - BandoEntity
UIMA & Search Engines

 Decorate documents with automatically
 extracted metadata to improve search
 experience

   relevance

   results

   clustering
Information Retrieval and Named Entities
UIMA & Search Engines
 “Push” scenario:

    documents are sent to UIMA, which extracts metadata and
    writes it to the index with a CAS Consumer

 “Pull” scenario:

    documents are sent to Lucene, which asks UIMA to extract
    metadata for them and then writes the metadata to the
    index itself

 “On demand” scenario:

    metadata is extracted only on demand, each time a
    document is retrieved/shown...
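
To make the "push" scenario concrete, here is a toy Python sketch. `InMemoryIndex` is a stand-in for a Lucene index, and `extract_metadata` with its list of known zones is a hypothetical analysis step; neither reflects the real Lucene or UIMA APIs.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy stand-in for a search index: field -> value -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, fields):
        for field, values in fields.items():
            for value in values:
                self._postings[field][value.lower()].add(doc_id)

    def search(self, field, value):
        return sorted(self._postings[field].get(value.lower(), set()))


def extract_metadata(text, known_zones=("APPIA", "EUR")):
    """Hypothetical analysis step: tag a document with the zones it mentions."""
    return {"zone": [z for z in known_zones if z.lower() in text.lower()]}


def push(index, doc_id, text):
    """Push scenario: analysis runs first, then text and metadata are written."""
    fields = {"text": text.split()}
    fields.update(extract_metadata(text))
    index.add(doc_id, fields)
```

In the "pull" scenario the call order is reversed: the indexing side would receive the document and call `extract_metadata` itself before writing.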
UIMA - tutorial


create a Type System

create an Analysis Engine descriptor

create a simple Annotator
Assignment

Named Entity Recognition

  sport: person, player, coach, team,
  competition

  videogames: person, videogame character,
  videogame, software house, hardware
  requirement

Precision & Recall test
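
For the test, precision is the fraction of extracted entities that are correct and recall is the fraction of expected entities that were extracted. A minimal Python sketch, assuming exact matching and annotations represented as `(begin, end, type)` tuples (that representation is an assumption, not part of the assignment text):

```python
def precision_recall(predicted, gold):
    """Compute exact-match precision and recall over (begin, end, type) annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

An annotation with the right span but the wrong type (for example a coach tagged as a player) counts as an error for both measures.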
