Information Extraction with UIMA - Use Cases

Gestione delle Informazioni su Web (Web Information Management) - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org

Friday, 16 April 2010

Use Cases - Agenda

UC1 : Real estate market analysis

UC2 : Automatic information extraction
from tenders


UC1 : Source

An online announcement site for sellers and
buyers

General purpose (cars, real estate, hi-fi, etc.)

Local scope (Rome and nearby)


UC1 - Goals

Are you looking for houses?

A specific subcategory of the site is dedicated to
real estate

I would like to monitor the Rome real estate market to

Make smart decisions

Predict how things will go in the (near) future


UC1 - Source

UC1 - Goals
I want to build a separate web application to
monitor such estate listings

I have to use a crawler to automatically
download selected pages periodically from the
source

Estate listing text is unstructured

I want to make aggregate queries on structured
information


UC1 - Information
Extraction

I have to write an information extraction
engine to populate a relational schema DB
with structured information from free text
of estate listings


UC1 - Blocks

UC1 - Crawler

A specialized crawler extracts data from the
source

Estate listing data is stored, grouped by
zone, in files in a directory on a
managed machine


UC1 - Crawler

Define the navigation of the site using one XML
file for each city zone

The crawler downloads page fragments twice
a week

The free text extracted from the estate listings is
saved as XML, grouped by zone


UC1 - Crawler Modules

UC1 - Navigation definition

UC1 - Crawler

Issues:

Cookies had to be enabled

Some HTTP headers were required

Fixed sleeping intervals had to be added
between requests
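
The three issues above can be sketched as follows. This is a minimal Python illustration, not the crawler described in the slides: the header values, the delay and the names make_opener and fetch_all are assumptions introduced here.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header values: the slides do not say which headers the site required.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler)",
    "Accept-Language": "it-IT,it;q=0.9",
}


def make_opener(headers=DEFAULT_HEADERS):
    """Return an opener that keeps cookies across requests and sends fixed headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(headers.items())
    return opener


def fetch_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, sleeping a fixed interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause before every request except the first
        pages.append(fetch(url))
    return pages
```

A real run would pass something like `lambda url: opener.open(url).read()` as `fetch`; keeping `fetch` and `sleep` injectable lets the loop be tested without touching the network.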


UC1 - Domain

EstateListing (Announcement)

Zone

MagazineNumber (Uscita)

HouseStructure with properties


UC1 - Information
Extraction Engine
Goal : extract price, zone and telephone
number

The first version contained a specialized IE
engine which used huge regular expressions

Hard to maintain and inefficient

Extracted only a little information
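
To give an idea of what a regular-expression approach looks like, here is a small Python sketch that pulls price, zone and telephone number out of a listing. The patterns, the zone names and the function extract_fields are hypothetical; the original expressions are not shown in the slides and were reportedly much larger.

```python
import re

# Hypothetical patterns, far simpler than the "huge" expressions the slides mention.
PRICE_RE = re.compile(r"(?:€|euro)\s*(\d{1,3}(?:\.\d{3})+)", re.IGNORECASE)
PHONE_RE = re.compile(r"\b(0\d{1,3}[\s/-]?\d{5,8})\b")
ZONE_RE = re.compile(r"\b(APPIA|PRATI|TRASTEVERE)\b")  # assumed zone names


def extract_fields(listing):
    """Return price (as an integer), zone and phone, or None where nothing matched."""
    price = PRICE_RE.search(listing)
    phone = PHONE_RE.search(listing)
    zone = ZONE_RE.search(listing)
    return {
        # Italian prices use "." as the thousands separator
        "price": int(price.group(1).replace(".", "")) if price else None,
        "phone": phone.group(1) if phone else None,
        "zone": zone.group(1) if zone else None,
    }
```

Every new field means another pattern tuned by hand against real listings, which is one way to read the maintenance problem noted above.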


UC1 - IE Engine

New requirement: also extract the structure
of the house

Number of rooms, garage (box), garden(s), external
spaces, number of bathrooms, kitchen, etc.

Using RegEx again proved hard to
maintain and inefficient


UC1 - IE Engine
Replace the RegEx-based IE engine with a UIMA-based
IE engine in order to:

reuse previous work (RegExs can live inside UIMA
too)

reuse existing components

be able to modify and enhance IE rules easily

be much more efficient

extract more information


UC1 - Analysis pipeline

UC1 - TypeSystem

Crawled XML

Sample text

“ven 26 Dic APPIA via grottaferrata metro 2
¡ piano assolato ingresso salone americana
cucina camera cameretta bagno soppalco
posto auto e 295.000”


UC1 - ContentAnnotator

From the XML produced by the crawler only
estate listings must be extracted

A simple parser gets each node containing
an estate listing (whose text is in turn
unstructured)

Create a ContentAnnotation over the
document
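
A rough Python sketch of this step follows. The XML layout (zone elements with a name attribute, containing listing elements) and the ContentAnnotation fields are assumptions made for illustration; the crawler's real format and the UIMA type system are shown only as images in the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    """Span of one listing inside the document text (begin inclusive, end exclusive)."""
    begin: int
    end: int
    zone: str


def annotate_content(xml_text):
    """Return (document_text, annotations) for a hypothetical crawler XML layout."""
    root = ET.fromstring(xml_text)
    parts, annotations, offset = [], [], 0
    for zone in root.iter("zone"):
        for listing in zone.iter("listing"):
            text = (listing.text or "").strip()
            if not text:
                continue  # skip empty nodes
            annotations.append(
                ContentAnnotation(offset, offset + len(text), zone.get("name", ""))
            )
            parts.append(text)
            offset += len(text) + 1  # +1 for the newline that joins the listings
    return "\n".join(parts), annotations
```

Later annotators can then work only on the text covered by each ContentAnnotation and ignore the rest of the crawled file.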


UC1 - ContentAnnotator

ContentAnnotation

UC1 - ACAnnotator

UC1 - Entities

ZoneAnnotator - Dictionary &
RegEx
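
One possible way to combine a dictionary with a regular expression is sketched below in Python. The dictionary entries and the helper names are hypothetical, and this is not the UIMA component shown on the slide.

```python
import re

# Hypothetical dictionary of zone names.
ZONE_DICTIONARY = ["APPIA", "PRATI", "SAN GIOVANNI"]


def build_zone_pattern(entries):
    """Compile one alternation from the dictionary, longest entries first."""
    alternatives = sorted((re.escape(e) for e in entries), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def annotate_zones(text, pattern):
    """Return (begin, end, normalized zone) for every dictionary hit in the text."""
    return [(m.start(), m.end(), m.group(0).upper()) for m in pattern.finditer(text)]
```

Adding a zone then means adding a dictionary entry rather than editing a pattern by hand.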

ZoneAnnotator - Learning
dictionaries

UC1 - ZoneAnnotation

UC1 - Consuming
extracted information
The previous version of the IE engine
produced (again) XML files that had to be
parsed to store structured data in the
DB

With UIMA, a CAS Consumer at the end of
the analysis pipeline can automatically write
the extracted information to the DB
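
The idea of a consumer at the end of the pipeline can be sketched in Python as follows. SQLite stands in for the relational DB, and the estate_listing table, its columns and the consume function are assumptions, not the schema or the UIMA CAS Consumer API used in the project.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical table for extracted listings."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS estate_listing ("
        "id INTEGER PRIMARY KEY, zone TEXT, price INTEGER, phone TEXT)"
    )


def consume(conn, extracted):
    """Persist the fields extracted from one listing; called once per analysed document."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO estate_listing (zone, price, phone) VALUES (?, ?, ?)",
            (extracted.get("zone"), extracted.get("price"), extracted.get("phone")),
        )
```

Compared with the earlier design, the intermediate XML output and its parser disappear: the extracted values go straight from the analysis result into the table.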


UC1 - Analyzing real
estate market data

A simple webapp written in Java with Spring
framework modules (Spring core, DAO, JDBC,
MVC) querying aggregate data on a MySQL DB

UI enriched with jQuery
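
An aggregate query of the kind such a webapp might run is sketched below in Python, with SQLite in place of MySQL. The table layout and the function average_price_by_zone are hypothetical.

```python
import sqlite3


def average_price_by_zone(conn):
    """Return (zone, rounded average price, number of listings) sorted by zone."""
    rows = conn.execute(
        "SELECT zone, AVG(price), COUNT(*) FROM estate_listing "
        "WHERE price IS NOT NULL GROUP BY zone ORDER BY zone"
    )
    return [(zone, round(avg), count) for zone, avg, count in rows]


# Hypothetical data, only to make the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE estate_listing (zone TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO estate_listing VALUES (?, ?)",
    [("APPIA", 295000), ("APPIA", 305000), ("PRATI", 450000)],
)
summary = average_price_by_zone(conn)
```

Queries like this only become possible once price and zone are stored as structured columns rather than free text.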


UC1 - Analysis Graphs

UC2 - Monitor of
tenders/announcements
Monitor various sources that publish
announcements and tenders to which interested
people and companies can subscribe

We want to automate the long process of
monitoring such sources and also
automatically extract useful common
information from the announcements’ text


UC2 - Blocks

Different input texts

UC2 - Crawling
Similar to the UC1 crawler, but using a Firefox
plugin we can define navigation patterns for
the pages of each source

We can also mark metadata seen during
navigation that carries useful information

Again, an XML file is generated so that it
can be saved to storage and executed
periodically


UC2 - Defining navigation

UC2 - Domain
annotations
Language

Abstract

Activity

Beneficiary

Budget

Expiration date

Funding type

Geographic region

Sector

Subject

Title

Tags


UC2 - Domain entities

First and most important is an entity that
represents the entire tender or
announcement

Annotations in the domain are finally used to fill
the properties of that entity


UC2 - Pipeline

UC2 - Simple first

Each annotator first looks:

if some metadata was extracted during navigation

for the most common pattern for defining
information inside such announcements

e.g.: “Budget: 200000$” or “Financial information: ......”

Such patterns are treated as language independent (although
this often does not hold in practice)
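
The "simple first" strategy can be sketched in Python as below: use navigation metadata when it exists, otherwise fall back to the common "Label: value" pattern. The function find_field, its parameters and the regular expression are assumptions introduced for illustration.

```python
import re

# One "Label: value" pair per line; labels are limited to 40 characters (an arbitrary choice).
LABEL_VALUE_RE = re.compile(
    r"^[ \t]*(?P<label>[^:\n]{1,40}):[ \t]*(?P<value>\S.*?)[ \t]*$", re.MULTILINE
)


def find_field(field, text, metadata=None, labels=()):
    """Return (value, source), where source is 'metadata', 'pattern' or None."""
    # 1. Prefer metadata captured during navigation.
    if metadata and metadata.get(field):
        return metadata[field], "metadata"
    # 2. Otherwise look for a line whose label names the field.
    wanted = {label.lower() for label in labels} | {field.lower()}
    for match in LABEL_VALUE_RE.finditer(text):
        if match.group("label").strip().lower() in wanted:
            return match.group("value"), "pattern"
    return None, None
```

The `labels` argument is where language dependence creeps in: each language needs its own list of label names for the same field.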


UC2 - AbstractAnnotator
The abstract is usually in the first part of the
document

We use a Tokenizer and a Tagger to get Tokens (with
PoS tags) and Sentences

We use a Dictionary to provide a list of “good”
words

We search the first sentences of the document
for the objectives of the announcement
(mixing good words and regular expressions)
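
A much simplified Python sketch of this heuristic follows. A naive regular-expression sentence splitter stands in for the Tokenizer and Tagger, and the word list, the window size and the function names are hypothetical.

```python
import re

# Hypothetical "good" words; the real dictionary is not shown in the slides.
GOOD_WORDS = {"aim", "objective", "support", "promote", "fund"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, good_words=GOOD_WORDS, window=3, min_hits=1):
    """Return the sentence among the first `window` with the most good words, or None."""
    candidates = []
    for sentence in split_sentences(text)[:window]:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        hits = len(tokens & good_words)
        if hits >= min_hits:
            candidates.append((hits, sentence))
    if not candidates:
        return None
    # On ties, max() keeps the earliest candidate.
    return max(candidates, key=lambda candidate: candidate[0])[1]
```

PoS tags would allow a stricter rule, for example requiring a good word to appear as a noun or verb, which this sketch does not attempt.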


UC2 -
ExpirationDateAnnotator

A DateAnnotator runs first

Iterate over DateAnnotations

Get the sentences wrapping such DateAnnotations

Check whether terms like “deadline” appear in
the same sentence as a DateAnnotation
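
The steps above can be sketched in Python as follows. The date pattern (dd/mm/yyyy only), the sentence splitter and the list of deadline terms are assumptions; the actual annotators are shown only as images in the slides.

```python
import re

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")  # hypothetical: dd/mm/yyyy only
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")  # naive: breaks on dates written with dots
DEADLINE_TERMS = ("deadline", "expires", "no later than")  # assumed terms


def annotate_expiration_dates(text):
    """Return (begin, end, date text) for dates whose sentence mentions a deadline term."""
    dates = [(m.start(), m.end()) for m in DATE_RE.finditer(text)]
    sentences = [(m.start(), m.end()) for m in SENTENCE_RE.finditer(text)]
    result = []
    for d_begin, d_end in dates:
        for s_begin, s_end in sentences:
            # Find the sentence that wraps this date annotation.
            if s_begin <= d_begin and d_end <= s_end:
                sentence = text[s_begin:s_end].lower()
                if any(term in sentence for term in DEADLINE_TERMS):
                    result.append((d_begin, d_end, text[d_begin:d_end]))
                break
    return result
```

In a multi-language setting the term list would need one entry set per language.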


Date patterns

ExpirationDateAnnotator

GeographicRegionAnnotator

UC2 - ActivityAnnotator

Conclusions on IE
UC1 : simple and stable sentence patterns

UC2 : multi-language, with much more complex
and varied sentence structures and
patterns

Fine-grained metadata is very important

Need to play with NLP

Need to establish good test cases

