Apache UIMA - Hands on code

Gestione delle Informazioni su Web - 2010/2011
Tommaso Teofili
tommaso [at] apache [dot] org

Use Cases - Agenda

UC1 : Real Estate market analysis

UC2 : Automatic information extraction
from tenders

UIMA & search engines

Tutorial

Assignment

UC1 : Source

An online announcement site for sellers and
buyers

General purpose (cars, real estate, hi-fi, etc.)

Local scope (Rome and nearby)

UC1 - Goals

Track real estate market in order to:

Take smart decisions

Predict how things will go in the (near) future

The text of estate listings is unstructured

Aggregate queries for statistical analysis need
structured information

UC1 - Crawler
A specialized crawler extracts data from the source

Estate listing data are stored in files, grouped by zone,
in a directory on a managed machine

Navigation of the site is defined using one XML file for
each city zone

The crawler downloads page fragments twice a
week

The free text extracted from the estate listings is saved
as XML, grouped by zone

UC1 - Crawler

Issues :

Cookies must be enabled

Some HTTP headers are required

Fixed sleep intervals are needed
between requests
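
The three issues above can be sketched with the JDK's own HTTP client. This is only an illustration, not the crawler used in the project: the class name `PoliteFetcher`, the header values and the five-second delay are assumptions, since the slides do not say which headers or interval were actually needed.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class PoliteFetcher {
    // Assumed values: the slides do not state the real delay or headers.
    static final Duration DELAY = Duration.ofSeconds(5);
    static final String USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)";

    // A cookie handler makes the client store and resend session cookies.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
            .build();

    // Builds a GET request carrying the headers the site is assumed to expect.
    static HttpRequest buildRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("User-Agent", USER_AGENT)
                .header("Accept-Language", "it-IT,it;q=0.9")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    // Downloads each page in order, sleeping a fixed interval between requests.
    List<String> fetchAll(List<URI> pages) throws IOException, InterruptedException {
        List<String> bodies = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (i > 0) {
                Thread.sleep(DELAY.toMillis());
            }
            HttpResponse<String> response =
                    client.send(buildRequest(pages.get(i)), HttpResponse.BodyHandlers.ofString());
            bodies.add(response.body());
        }
        return bodies;
    }
}
```

A fixed delay is the simplest way to stay under a site's rate limits; a real crawler would probably also handle error status codes and retries.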

UC1 - Domain

Announcement

Zone

MagazineNumber

HouseStructure (with properties)

UC1 - Information
Extraction Engine
Goal : extract price, zone and telephone
number

The first version used huge regular
expressions

Hard to maintain and inefficient

Poor extraction

UC1 - IE Engine

New requirements: extract the structure of
the house

Number of rooms, box, garden(s), external
spaces, number of bathrooms, kitchen,
etc...

Track more fine-grained zones

Sample text

“ven 26 Dic APPIA via grottaferrata metro 2°
piano assolato ingresso salone americana
cucina camera cameretta bagno soppalco
posto auto € 295.000”
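
To make the regular-expression approach concrete, here is a minimal sketch that pulls the price and the zone out of a listing like the one above. The class name `ListingFields` and both patterns are assumptions made for illustration: the zone is assumed to be the first all-uppercase word, and the price is assumed to follow a euro marker and use dots as thousands separators. Telephone numbers are left out because the sample contains none.

```java
import java.util.Optional;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListingFields {
    // Euro marker ("\u20AC" is the euro sign, "euro", or a lone "e" as sometimes
    // left behind by lossy text extraction) followed by an amount such as 295.000.
    private static final Pattern PRICE = Pattern.compile(
            "(?:\u20AC|\\beuro\\b|\\be\\b)\\s*(\\d{1,3}(?:\\.\\d{3})+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the zone is written as a word of three or more capital letters.
    private static final Pattern ZONE = Pattern.compile("\\b[A-Z]{3,}\\b");

    // Returns the price in euros, or empty if no price pattern is found.
    static OptionalInt price(String listing) {
        Matcher m = PRICE.matcher(listing);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1).replace(".", "")))
                : OptionalInt.empty();
    }

    // Returns the first all-uppercase word, taken here to be the zone.
    static Optional<String> zone(String listing) {
        Matcher m = ZONE.matcher(listing);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

Even this small example hints at why the first version was hard to maintain: every new field, such as rooms or bathrooms, needs another hand-tuned pattern.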

UC1 - ContentAnnotator

From the XML produced by the crawler only
estate listings must be extracted

A simple parser gets each node containing
an estate listing (whose text is itself
unstructured)

Create a ContentAnnotation over the
document
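
The parsing step can be sketched with the JDK's DOM parser. The element name `listing` and the class name `ListingReader` are hypothetical, because the slides do not show the crawler's XML format. In the real annotator each extracted text would become the span of a ContentAnnotation; that UIMA part is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ListingReader {
    // Hypothetical element name: the crawler's real XML schema is not given.
    static final String LISTING_ELEMENT = "listing";

    // Returns the free text of every listing node found in the XML.
    static List<String> listings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(LISTING_ELEMENT);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Malformed input is reported as an unchecked exception in this sketch.
            throw new IllegalArgumentException("Cannot parse crawler XML", e);
        }
    }
}
```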

UC1 - Consuming
extracted information
The previous version of the IE engine
produced XML files that had to be
re-parsed to store structured data in the
DB

With UIMA, a CAS Consumer at the end of
the analysis pipeline can automatically write
the extracted information to the DB

UIMA - CAS Consumer
Analysis Engine responsible for consuming
information contained inside the CAS

Can write extracted information to:

DBMS

Lucene index

Filesystem

...

UC2 - Monitor of EU
announcements

Monitor various sources which provide
announcements and tenders

Automate the long monitoring process of such
sources and automatically extract useful
common information from announcements’
texts

UC2 - Domain
annotations
Language

Funding type

Abstract

Geographic region

Activity

Sector

Beneficiary

Subject

Budget

Title

Expiration date

Tags

UC2 - Domain entities

First and most important is an entity that
represents the entire tender or
announcement

Annotations in the domain ultimately fill
the properties of this entity

UC2 - Simple first

Each annotator first checks:

whether some metadata was already extracted during navigation

the most common patterns for stating
information inside such announcements

e.g.: “Budget: 200000$” or “Financial information: ......”

Such patterns are common in different languages
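
The "simple first" pattern can be sketched as a generic "Label: value" extractor. This is an assumption about how such a check might look, not the project's code; the class name `LabeledFields` is hypothetical. Each annotator would then look up the labels it cares about, in whatever languages it supports.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LabeledFields {
    // One "Label: value" pair per line; the label is made of letters and spaces.
    private static final Pattern FIELD =
            Pattern.compile("(?m)^\\s*([\\p{L} ]{2,40}?)\\s*:\\s*(.+?)\\s*$");

    // Returns lower-cased labels mapped to their values, first occurrence wins.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }
}
```

A budget annotator could, for example, call `extract(text).get("budget")` and fall back to heavier linguistic analysis only when the label is missing.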

UC2 - AbstractAnnotator
The abstract is usually in the first part of the
document

We use Tokenizer and Tagger to get Tokens (with
PoS tags) and Sentences

We use a dictionary of “good” words and linguistic
patterns

We search the first sentences of the document
for the objectives of the announcement
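
A heavily simplified sketch of this idea follows. It swaps the Tokenizer and PoS Tagger for naive punctuation-based splitting, and the word list, the five-sentence window and the class name `AbstractFinder` are all assumptions; the real dictionary and linguistic patterns are not given in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AbstractFinder {
    // Hypothetical dictionary of "good" words that signal an objective.
    static final Set<String> GOOD_WORDS =
            Set.of("aims", "objective", "objectives", "purpose", "supports");

    // Assumption: only the first few sentences are worth inspecting.
    static final int MAX_SENTENCES = 5;

    // Returns the leading sentences that contain at least one "good" word.
    static List<String> candidateSentences(String document) {
        // Naive sentence split on ., ! or ? followed by whitespace.
        String[] sentences = document.trim().split("(?<=[.!?])\\s+");
        List<String> hits = new ArrayList<>();
        int limit = Math.min(MAX_SENTENCES, sentences.length);
        for (int i = 0; i < limit; i++) {
            // Naive tokenization on non-word characters.
            for (String token : sentences[i].toLowerCase(Locale.ROOT).split("\\W+")) {
                if (GOOD_WORDS.contains(token)) {
                    hits.add(sentences[i]);
                    break;
                }
            }
        }
        return hits;
    }
}
```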

UC2 - ExpirationDateAnnotator

A DateAnnotator runs first

Iterate over DateAnnotations

Get sentences wrapping such DateAnnotations

Check if some terms or patterns like “the
deadline is ...” appear near a DateAnnotation
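
The steps above can be sketched in plain Java. A regular expression for numeric dates stands in for the DateAnnotations, sentences are delimited naively by ". ", and the cue terms are assumptions; the class name `ExpirationDateFinder` is hypothetical as well.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpirationDateFinder {
    // Stand-in for DateAnnotations: numeric dates such as 30/06/2011 only.
    private static final Pattern DATE =
            Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    // Hypothetical cue terms that suggest a date is a deadline.
    private static final Pattern CUE = Pattern.compile(
            "\\b(deadline|expires?|closing date|no later than)\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns the first date whose wrapping sentence contains a cue term.
    static Optional<String> expirationDate(String text) {
        Matcher date = DATE.matcher(text);
        while (date.find()) {
            // Sentence wrapping the date: from the previous ". " to the next one.
            int prev = text.lastIndexOf(". ", date.start());
            int start = prev < 0 ? 0 : prev + 2;
            int next = text.indexOf(". ", date.end());
            int end = next < 0 ? text.length() : next + 1;
            String sentence = text.substring(start, end);
            if (CUE.matcher(sentence).find()) {
                return Optional.of(date.group());
            }
        }
        return Optional.empty();
    }
}
```

Checking only the wrapping sentence keeps a publication date or an event date elsewhere in the text from being mistaken for the deadline.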

UIMA & Search Engines

Decorate documents with automatically
extracted metadata to improve search
experience

relevance

results

clustering

Information Retrieval and
Named Entities

UIMA & Search Engines
“Push” scenario:

documents are sent to UIMA which extracts metadata and
writes on the index with a CAS Consumer

“Pull” scenario:

documents are sent to Lucene which asks UIMA to extract
metadata for it and then Lucene itself writes them to the
index

“On demand” scenario:

metadata are extracted only on demand each time a
document is retrieved/shown...

UIMA - tutorial

create a Type System

create an Analysis Engine descriptor

create a simple Annotator

Assignment

Named Entity Recognition

sport: person, player, coach, team,
competition

videogames: person, videogame character,
videogame, software house, hardware
requirement

Precision & Recall test
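
As a reminder of what the test measures, here is a small sketch that computes both values from two sets of entity mentions. Representing each mention as a string and the class name `PrecisionRecall` are assumptions for illustration; precision is the share of predicted entities that are correct, recall is the share of gold entities that were found.

```java
import java.util.HashSet;
import java.util.Set;

final class PrecisionRecall {
    final double precision;
    final double recall;

    private PrecisionRecall(double precision, double recall) {
        this.precision = precision;
        this.recall = recall;
    }

    // predicted: entities found by the annotator; gold: manually annotated entities.
    static PrecisionRecall evaluate(Set<String> predicted, Set<String> gold) {
        Set<String> truePositives = new HashSet<>(predicted);
        truePositives.retainAll(gold);
        // By convention in this sketch, an empty denominator yields 0.
        double p = predicted.isEmpty() ? 0.0 : (double) truePositives.size() / predicted.size();
        double r = gold.isEmpty() ? 0.0 : (double) truePositives.size() / gold.size();
        return new PrecisionRecall(p, r);
    }
}
```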
