Information Extraction
                     with UIMA - Use Cases
                         Gestione delle Informazioni su Web - 2009/2010
                                          Tommaso Teofili
                                  tommaso [at] apache [dot] org




Friday, 16 April 2010
Use Cases - Agenda


                         UC1: Real estate market analysis

                         UC2: Automatic information extraction
                         from tenders




UC1 : Source


                         An online classified-ads site for sellers and
                         buyers

                         General purpose (cars, real estate, hi-fi, etc.)

                         Local scope (Rome and nearby)




UC1 - Goals

                         Are you looking for houses?

                         A specific subcategory of the site is dedicated to
                         real estate

                         I would like to monitor the Rome real estate market to:

                           Make smart decisions

                           Predict how things will go in the (near) future



UC1 - Source
UC1 - Goals
                         I want to build a separate web application to
                         monitor such estate listings

                         I need a crawler to periodically download
                         selected pages from the source

                         The text of estate listings is unstructured

                         I want to run aggregate queries on structured
                         information


UC1 - Information
                              Extraction


                         I have to write an information extraction
                         engine that populates a relational database
                         with structured information taken from the
                         free text of estate listings




UC1 - Blocks
UC1 - Crawler


                         A specialized crawler extracts data from the
                         source

                         Estate listing data is stored in files, grouped
                         by zone, in a directory on a managed
                         machine




UC1 - Crawler

                         Navigation of the site is defined with one XML
                         file per city zone

                         The crawler downloads page fragments twice
                         a week

                         The free text extracted from estate listings is
                         saved as XML, grouped by zone



UC1 - Crawler Modules
UC1 - navigation definition
UC1 - Crawler

                         Issues:

                           Cookies must be enabled

                           Some HTTP headers are required

                           Fixed sleep intervals are needed between
                           requests
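
The three issues above can be illustrated in a few lines. This is only a sketch in Python, not the crawler described in the slides: the function names, the header values and the two-second default delay are assumptions made for the example.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def build_opener(headers):
    """Return an opener that keeps cookies and sends the given headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = list(headers.items())
    return opener


def http_fetch(opener):
    """Adapt an opener to the fetch(url) -> str callable used by crawl()."""
    def fetch(url):
        with opener.open(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    return fetch


def crawl(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing a fixed interval between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause between consecutive requests
        pages[url] = fetch(url)
    return pages
```

A real run would call `crawl(urls, http_fetch(build_opener({"User-Agent": "..."})))`; passing `fetch` and `sleep` as parameters keeps the loop testable without network access.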



UC1 - Domain


                         EstateListing (Announcement)

                         Zone

                         MagazineNumber (Uscita)

                         HouseStructure with properties




UC1 - Information
                           Extraction Engine
                         Goal: extract price, zone and telephone
                         number

                         The first version was a specialized IE engine
                         built on huge regular expressions

                         Hard to maintain and inefficient

                         It extracted little information
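
To give an idea of the regular-expression approach, here is a small sketch in Python. The patterns are assumptions made for illustration (a price written as a euro marker followed by dot-grouped thousands, a telephone number written as 6 to 11 digits starting with 0 or 3); the expressions used in the original engine are not shown in the slides.

```python
import re

# Assumed price format: "€ 295.000" or "e 295.000" (dot-grouped thousands).
PRICE_RE = re.compile(r"(?:€|\be)\s*(\d{1,3}(?:\.\d{3})+)")
# Assumed phone format: 6 to 11 digits, starting with 0 or 3.
PHONE_RE = re.compile(r"\b(?:0|3)\d{5,10}\b")


def extract_fields(text):
    """Return the first price (as an integer) and phone number found, if any."""
    price = PRICE_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "price": int(price.group(1).replace(".", "")) if price else None,
        "phone": phone.group(0) if phone else None,
    }
```

Even with only two fields, each pattern already encodes several formatting assumptions, which hints at why a single large expression covering many fields becomes hard to maintain.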



UC1 - IE Engine

                         New requirement: also extract the structure
                         of the house

                         Number of rooms, garage (box), garden(s), outdoor
                         spaces, number of bathrooms, kitchen, etc.

                         Using regular expressions again proved hard to
                         maintain and inefficient



UC1 - IE Engine
                         Substitute the RegEx-based IE engine with a
                         UIMA-based IE engine to:

                           reuse previous work (RegExs can live inside UIMA
                           too)

                           reuse existing components

                           be able to modify and enhance IE rules easily

                           be much more efficient

                           extract more information


UC1 - Analysis pipeline
UC1 - TypeSystem
Crawled XML
Sample text


                         “ven 26 Dic APPIA via grottaferrata metro
                         2° piano assolato ingresso salone americana
                         cucina camera cameretta bagno soppalco
                         posto auto e 295.000”




UC1 - ContentAnnotator

                         Only the estate listings must be extracted
                         from the XML produced by the crawler

                         A simple parser gets each node containing
                         an estate listing (whose text is itself
                         unstructured)

                         It creates a ContentAnnotation over the
                         document
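
The idea can be sketched as follows, in plain Python rather than with the UIMA API. The element name `listing` and the choice of one annotation per listing are assumptions: the crawled XML and the annotator code appear only as images in the slides. Each annotation is a stand-off span (begin and end offsets) over the document text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    begin: int  # offset of the first character of the listing
    end: int    # offset one past the last character of the listing


def annotate_listings(xml_string, listing_tag="listing"):
    """Return (document_text, annotations), one annotation per listing node."""
    root = ET.fromstring(xml_string)
    parts, annotations, offset = [], [], 0
    for node in root.iter(listing_tag):
        text = (node.text or "").strip()
        if not text:
            continue  # skip empty nodes
        annotations.append(ContentAnnotation(offset, offset + len(text)))
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline that joins the parts
    return "\n".join(parts), annotations
```

Later annotators can then restrict their work to `document_text[a.begin:a.end]` for each annotation `a`.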



ContentAnnotation
UC1 - ACAnnotator
UC1 - Entities
ZoneAnnotator - Dictionary &
                              RegEx
ZoneAnnotator - Learning
                               dictionaries
UC1 - ZoneAnnotation
UC1 - Consuming
                         extracted information
                         The previous version of the IE engine
                         produced (again) XML files that had to be
                         parsed to store structured data in the
                         DB

                         With UIMA, a CAS Consumer at the end of
                         the analysis pipeline can automatically write
                         the extracted information to the DB
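
The consuming step boils down to writing extracted fields directly into a table, with no intermediate XML. A minimal sketch in Python, using SQLite from the standard library as a stand-in for the real database; the table name, the column names and the dictionary layout of a listing are all hypothetical.

```python
import sqlite3


def consume(connection, listings):
    """Write extracted listing fields straight into the database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS estate_listing "
        "(zone TEXT, price INTEGER, phone TEXT)"
    )
    connection.executemany(
        "INSERT INTO estate_listing (zone, price, phone) VALUES (?, ?, ?)",
        [(item["zone"], item["price"], item.get("phone")) for item in listings],
    )
    connection.commit()
```

In a UIMA pipeline the same logic would sit in the last component, reading the annotations from the CAS instead of a list of dictionaries.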



UC1 - Analyzing real
                          estate market data

                         A simple webapp written in Java with Spring
                         framework modules (Spring Core, DAO, JDBC,
                         MVC) that queries aggregate data on a MySQL DB

                         UI enriched with jQuery
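
An aggregate query of the kind such a webapp might run is sketched below. It is written in Python with SQLite only to stay self-contained; the schema (`estate_listing` with `zone` and `price` columns) is hypothetical and is not taken from the slides.

```python
import sqlite3


def create_schema(connection):
    """Create the hypothetical table used by this example."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS estate_listing (zone TEXT, price INTEGER)"
    )


def average_price_by_zone(connection):
    """Return (zone, average price, number of listings), sorted by zone."""
    cursor = connection.execute(
        "SELECT zone, AVG(price), COUNT(*) FROM estate_listing "
        "GROUP BY zone ORDER BY zone"
    )
    return cursor.fetchall()
```

Grouping by zone and by date of issue in the same way would give the time series that a price graph needs.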




UC1 - Analysis Graphs
UC2 - Monitor of
                     tenders/announcements
                         Monitor various sources that publish
                         announcements and tenders to which interested
                         people and companies can subscribe

                         We want to automate the lengthy process of
                         monitoring such sources and automatically
                         extract useful common information from the
                         text of the announcements



UC2 - Blocks
Different input texts
UC2 - Crawling
                         Similar to the UC1 crawler, but with a Firefox
                         plugin we can define navigation patterns for
                         the pages of each source

                         We can also mark metadata seen during
                         navigation that carries information

                         Again, an XML file is generated so that it
                         can be saved to storage and executed
                         periodically


UC2 - Defining navigation
UC2 - Domain
                                     annotations
                         Language           Funding type

                         Abstract           Geographic region

                         Activity           Sector

                         Beneficiary        Subject

                         Budget             Title

                         Expiration date    Tags



UC2 - Domain entities


                         First and most important is an entity that
                         represents the entire tender or
                         announcement

                         The domain annotations are ultimately used
                         to fill the properties of this entity




UC2 - Pipeline
UC2 - Simple first

                         Each annotator first checks:

                            whether some metadata was extracted during navigation

                            for the most common patterns used to state
                            information inside such announcements

                         e.g.: “Budget: 200000$” or “Financial information: ......”

                         Such patterns are assumed to be language independent
                         (although this often does not hold)
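
The "simple first" strategy can be sketched as a lookup with a fallback. This is an illustrative Python function, not the annotators' actual code: the function name, the metadata dictionary and the one-field-per-line assumption are all made up for the example.

```python
import re


def find_field(label, text, metadata=None):
    """Return the value for `label`, preferring crawl-time metadata."""
    if metadata and metadata.get(label):
        return metadata[label]
    # Fall back to the common "Label: value" pattern, one field per line.
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None
```

Only when both checks fail would an annotator move on to more expensive, language-dependent analysis.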



UC2 - AbstractAnnotator
                         The abstract is usually in the first part of the
                         document

                         We use a Tokenizer and a Tagger to get Tokens
                         (with PoS tags) and Sentences

                         We use a Dictionary to provide a list of “good”
                         words

                         We search the first sentences of the document
                         for the objectives of the announcement
                         (combining good words and regular expressions)
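
A rough Python sketch of this heuristic follows. It is a simplification under stated assumptions: a naive regular-expression splitter stands in for the tokenizer and tagger, the list of good words is invented, and the three-sentence window is an arbitrary choice.

```python
import re

GOOD_WORDS = {"aim", "objective", "support", "promote", "fund"}  # hypothetical


def split_sentences(text):
    """Naive splitter standing in for a real tokenizer and tagger."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, good_words=GOOD_WORDS, window=3):
    """Return the best-scoring sentence among the first `window` sentences."""
    best, best_score = None, 0
    for sentence in split_sentences(text)[:window]:
        tokens = re.findall(r"\w+", sentence.lower())
        score = sum(1 for token in tokens if token in good_words)
        if score > best_score:
            best, best_score = sentence, score
    return best
```

Sentences outside the window are ignored even if they score well, which reflects the observation that the abstract usually sits at the start of the document.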


UC2 -
                    ExpirationDateAnnotator

                         A DateAnnotator is executed first

                         Iterate over the DateAnnotations

                         Get the sentences covering such DateAnnotations

                         Check whether terms like “deadline” appear in
                         the same sentence as a DateAnnotation
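
The four steps above can be sketched in Python as follows. The date pattern (dd/mm/yyyy only), the sentence splitter and the list of deadline terms are assumptions for illustration; the real date patterns are shown only as an image in the slides.

```python
import re

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")  # assumed dd/mm/yyyy only
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")  # naive sentence spans
DEADLINE_TERMS = ("deadline", "expires", "no later than")  # hypothetical


def find_expiration_dates(text):
    """Return (begin, end) spans of dates whose sentence has a deadline term."""
    sentences = [(m.start(), m.end()) for m in SENTENCE_RE.finditer(text)]
    spans = []
    for date in DATE_RE.finditer(text):
        for begin, end in sentences:
            # Find the sentence that covers this date annotation.
            if begin <= date.start() and date.end() <= end:
                sentence = text[begin:end].lower()
                if any(term in sentence for term in DEADLINE_TERMS):
                    spans.append((date.start(), date.end()))
                break
    return spans
```

Dates written with dots (such as 30.04.2010) would need a sentence splitter that does not break on them, which is one reason a proper tokenizer is preferable to this naive version.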



Date patterns
ExpirationDateAnnotator
GeographicRegionAnnotator
UC2 - ActivityAnnotator
Conclusions on IE
                         UC1: simple and stable sentence patterns

                         UC2: multiple languages, with much more
                         complex and varied sentence structures and
                         patterns

                         Fine-grained metadata is very important

                         Need to experiment with NLP

                         Need to establish good test cases

