Information Extraction with UIMA - Usecases

Slides about "Usecases for Information Extraction with UIMA" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University


Statistics

Total views: 4,348 (4,317 on SlideShare, 31 embedded)
Likes: 6, Downloads: 163, Comments: 0

Embeds (31 total): http://www.slideshare.net (23), http://www.linkedin.com (4), http://www.docshut.com (2), http://www.slashdocs.com (2)

Upload format: Adobe PDF
Usage rights: CC Attribution License


Information Extraction with UIMA - Use Cases: Presentation Transcript

  • Information Extraction with UIMA - Use Cases. Gestione delle Informazioni su Web, 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org
  • Use Cases - Agenda. UC1: real estate market analysis. UC2: automatic information extraction from tenders.
  • UC1 - Source. An online classified ads site for sellers and buyers. Wide purpose (cars, real estate, hi-fi, etc.). Local scope (Rome and surroundings).
  • UC1 - Goals. Are you looking for a house? A specific subcategory of the site is dedicated to real estate. I would like to monitor the Rome real estate market to take smart decisions and predict how things will go in the (near) future.
  • UC1 - Source
  • UC1 - Goals. I want to build a separate web application to monitor such estate listings. I have to use a crawler to periodically and automatically download selected pages from the source. The text of the estate listings is unstructured, and I want to run aggregate queries on structured information.
  • UC1 - Information Extraction. I have to write an information extraction engine that populates a relational DB with structured information taken from the free text of the estate listings.
  • UC1 - Blocks
  • UC1 - Crawler. A specialized crawler extracts data from the source. Estate listings data are stored in files, grouped by zone, in a directory on a managed machine.
  • UC1 - Crawler. Define the navigation of the site using one XML file per city zone. The crawler downloads page fragments twice a week. The extracted free text of the estate listings is saved as XML, grouped by zone.
  • UC1 - Crawler Modules
  • UC1 - Navigation definition
  • UC1 - Crawler issues: cookies had to be enabled, some HTTP headers were needed, and fixed sleep intervals had to be inserted between requests.
  • UC1 - Domain. EstateListing (Announcement), Zone, MagazineNumber (Uscita), HouseStructure with its properties.
  • UC1 - Information Extraction Engine. Goal: extract price, zone and telephone number. The first version contained a specialized IE engine based on huge regular expressions: hard to maintain, inefficient, and not extracting much information.
  • UC1 - IE Engine. New requirement: also extract the structure of the house (number of rooms, garage (box), garden(s), external spaces, number of bathrooms, kitchen, etc.). Using regexes again proved hard to maintain and inefficient.
  • UC1 - IE Engine. Substitute the regex-based IE engine with a UIMA-based IE engine to: exploit the previous work (regexes can live inside UIMA too); exploit existing components; be able to modify and enhance the IE rules easily; get much better efficiency; extract more information. A minimal annotator along these lines is sketched below.
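
A minimal sketch of how one of the existing regular expressions can be wrapped in a UIMA annotator; the PriceAnnotation type and the price pattern are illustrative assumptions, not the project's actual type system or rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

/**
 * Regex-based annotator: the old regular expressions keep working,
 * but now each one lives in a small, replaceable UIMA analysis engine.
 */
public class PriceAnnotator extends JCasAnnotator_ImplBase {

  // Illustrative pattern for prices like "e 295.000" in the sample listing text
  private static final Pattern PRICE = Pattern.compile("e\\s?\\d{1,3}(\\.\\d{3})*");

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Matcher matcher = PRICE.matcher(text);
    while (matcher.find()) {
      // PriceAnnotation is assumed to be generated by JCasGen from the use case type system
      PriceAnnotation price = new PriceAnnotation(jcas, matcher.start(), matcher.end());
      price.addToIndexes();
    }
  }
}
```
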
  • UC1 - Analysis pipeline
  • UC1 - TypeSystem
  • Crawled XML
  • Sample text: “ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
  • UC1 - ContentAnnotator. From the XML produced by the crawler, only the estate listings must be extracted: a simple parser gets each node containing an estate listing (whose content is in turn unstructured) and creates a ContentAnnotation over it in the document.
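
A minimal sketch of this idea, assuming the crawled XML is loaded as the CAS document text and that each listing sits inside a hypothetical <announcement> element; the element name and the ContentAnnotation type are assumptions, not the project's actual crawler format or type system.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;

/**
 * Marks the free-text body of each estate listing found in the crawled XML
 * with a ContentAnnotation, so that downstream annotators only analyze listing text.
 */
public class ContentAnnotator extends JCasAnnotator_ImplBase {

  // Hypothetical element name; the real tag depends on the crawler's XML format
  private static final Pattern LISTING =
      Pattern.compile("<announcement>(.*?)</announcement>", Pattern.DOTALL);

  @Override
  public void process(JCas jcas) {
    Matcher matcher = LISTING.matcher(jcas.getDocumentText());
    while (matcher.find()) {
      // ContentAnnotation is assumed to be defined in the use case type system
      ContentAnnotation content = new ContentAnnotation(jcas, matcher.start(1), matcher.end(1));
      content.addToIndexes();
    }
  }
}
```
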
  • UC1 - ContentAnnotator
  • ContentAnnotation
  • UC1 - ACAnnotator
  • UC1 - Entities
  • ZoneAnnotator - Dictionary & RegEx
  • ZoneAnnotator - Learning dictionaries
  • UC1 - ZoneAnnotation
  • UC1 - Consuming extracted information. The previous version of the IE engine produced (again) XML files that had to be parsed in order to store structured data inside the DB. With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically put the extracted information into the DB, as sketched below.
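
A sketch of such a CAS Consumer, under several assumptions: the JDBC URL, the listing table and its columns, and the PriceAnnotation/ZoneAnnotation/PhoneAnnotation types are hypothetical placeholders; a real consumer would take its connection settings from the descriptor's configuration parameters.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceProcessException;

/**
 * CAS Consumer that stores the extracted annotations in the DB at the end of the pipeline.
 */
public class EstateListingDbConsumer extends CasConsumer_ImplBase {

  @Override
  public void processCas(CAS aCas) throws ResourceProcessException {
    try {
      JCas jcas = aCas.getJCas();
      // Hypothetical connection settings and schema, hard-coded only for brevity
      try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/estates", "user", "pwd");
           PreparedStatement stmt = conn.prepareStatement(
               "INSERT INTO listing (price, zone, phone) VALUES (?, ?, ?)")) {
        // PriceAnnotation, ZoneAnnotation and PhoneAnnotation are assumed JCasGen-generated types
        stmt.setString(1, firstCoveredText(jcas, PriceAnnotation.type));
        stmt.setString(2, firstCoveredText(jcas, ZoneAnnotation.type));
        stmt.setString(3, firstCoveredText(jcas, PhoneAnnotation.type));
        stmt.executeUpdate();
      }
    } catch (Exception e) {
      throw new ResourceProcessException(e);
    }
  }

  // Returns the text covered by the first annotation of the given type, or null if none exists
  private String firstCoveredText(JCas jcas, int type) {
    FSIterator<Annotation> it = jcas.getAnnotationIndex(type).iterator();
    return it.hasNext() ? it.next().getCoveredText() : null;
  }
}
```
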
  • UC1 - Analyzing real estate market data. A simple web app written in Java with Spring Framework modules (Spring Core, DAO, JDBC, MVC), running aggregate queries on the MySQL DB, with a UI enriched with jQuery.
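
A small sketch of the kind of aggregate query such a web app can run via Spring's JdbcTemplate; the table and column names are assumptions, not the actual schema.

```java
import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

/**
 * DAO sketch for the analysis web app: average price per zone over the extracted listings.
 */
public class ListingAnalysisDao {

  private final JdbcTemplate jdbcTemplate;

  public ListingAnalysisDao(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }

  // Hypothetical schema: listing(price DECIMAL, zone VARCHAR, ...)
  public List<Map<String, Object>> averagePricePerZone() {
    return jdbcTemplate.queryForList(
        "SELECT zone, AVG(price) AS avg_price FROM listing GROUP BY zone ORDER BY avg_price DESC");
  }
}
```
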
  • UC1 - Analysis Graphs
  • UC1 - Analysis Graphs
  • UC2 - Monitoring of tenders/announcements. Monitor various sources which publish announcements and tenders that interested people and companies can subscribe to. We want to automate the long process of monitoring such sources and also automatically extract useful common information from the announcements’ text.
  • UC2 - Blocks
  • Different input texts
  • Different input texts
  • Different input texts
  • Different input texts
  • UC2 - Crawling. Similar to the UC1 crawler, but using a Firefox plugin we can define navigation patterns for the pages of each source. We can also define metadata, seen during navigation, that carry information. Again an XML definition is generated, so that it can be saved on storage and executed periodically.
  • UC2 - Defining navigation
  • UC2 - Domain annotations: Language, Funding type, Abstract, Geographic region, Activity, Sector, Beneficiary, Subject, Budget, Title, Expiration date, Tags.
  • UC2 - Domain entities. The first and most important is an entity representing the entire tender or announcement; the domain annotations will eventually fill that entity’s properties.
  • UC2 - Pipeline
  • UC2 - Simple things first. Each annotator first checks whether some metadata was extracted during navigation, then looks for the most common patterns used to state information inside such announcements, e.g. “Budget: 200000$” or “Financial information: ......”. Such patterns are language independent (although this is often not true).
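
A minimal sketch of an annotator looking for one of these “label: value” patterns; the BudgetAnnotation type and the pattern are assumptions for illustration (and in practice the metadata extracted during navigation would be checked first).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;

/**
 * Looks for the common "label: value" pattern used to state the budget,
 * e.g. "Budget: 200000$".
 */
public class BudgetAnnotator extends JCasAnnotator_ImplBase {

  // Illustrative pattern; the real labels would come from a per-language resource
  private static final Pattern BUDGET = Pattern.compile("(?i)budget\\s*:\\s*([\\d.,]+\\s*[$€]?)");

  @Override
  public void process(JCas jcas) {
    Matcher matcher = BUDGET.matcher(jcas.getDocumentText());
    if (matcher.find()) {
      // BudgetAnnotation is assumed to be defined in the UC2 type system
      new BudgetAnnotation(jcas, matcher.start(1), matcher.end(1)).addToIndexes();
    }
  }
}
```
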
  • UC2 - AbstractAnnotator. The abstract is usually in the first part of the document. We use the Tokenizer and Tagger to get Tokens (with PoS tags) and Sentences, and a Dictionary to provide a list of “good” words. We then scan the first sentences of the document looking for the objectives of the announcement, mixing good words and regular expressions.
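
A sketch under stated assumptions: the Sentence type comes from the upstream tokenizer, the “good” word list is illustrative (in the real pipeline it is an external Dictionary resource), and AbstractAnnotation is a hypothetical type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

/**
 * Scans the sentences at the beginning of the document and marks the first one
 * containing "good" dictionary words as the abstract.
 */
public class AbstractAnnotator extends JCasAnnotator_ImplBase {

  // Illustrative dictionary; in the real pipeline it is loaded as a UIMA resource
  private static final Set<String> GOOD_WORDS =
      new HashSet<String>(Arrays.asList("objective", "aim", "purpose", "call"));

  private static final int HEAD_LIMIT = 2000; // only look at the first part of the document

  @Override
  public void process(JCas jcas) {
    // Sentence annotations are produced upstream by the tokenizer/tagger
    FSIterator<Annotation> sentences = jcas.getAnnotationIndex(Sentence.type).iterator();
    while (sentences.hasNext()) {
      Annotation sentence = sentences.next();
      if (sentence.getBegin() > HEAD_LIMIT) {
        break; // the index is sorted by begin offset, so we are past the document head
      }
      String text = sentence.getCoveredText().toLowerCase();
      for (String word : GOOD_WORDS) {
        if (text.contains(word)) {
          // AbstractAnnotation is assumed to be defined in the UC2 type system
          new AbstractAnnotation(jcas, sentence.getBegin(), sentence.getEnd()).addToIndexes();
          return;
        }
      }
    }
  }
}
```
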
  • UC2 - ExpirationDateAnnotator. A DateAnnotator is executed beforehand. Iterate over the DateAnnotations, get the sentences wrapping them, and check whether terms like “deadline” appear in the same sentence as a DateAnnotation.
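
A sketch of this logic; DateAnnotation and Sentence are assumed to be produced by earlier pipeline stages, and ExpirationDateAnnotation is assumed to be defined in the UC2 type system. Only the English term “deadline” is checked here, while the real annotator would use a multi-language term list.

```java
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

/**
 * For each DateAnnotation produced by the upstream DateAnnotator, checks whether
 * its surrounding sentence mentions a deadline and, if so, marks it as expiration date.
 */
public class ExpirationDateAnnotator extends JCasAnnotator_ImplBase {

  @Override
  public void process(JCas jcas) {
    FSIterator<Annotation> dates = jcas.getAnnotationIndex(DateAnnotation.type).iterator();
    while (dates.hasNext()) {
      Annotation date = dates.next();
      Annotation sentence = coveringSentence(jcas, date);
      if (sentence != null && sentence.getCoveredText().toLowerCase().contains("deadline")) {
        // ExpirationDateAnnotation is assumed to be defined in the UC2 type system
        new ExpirationDateAnnotation(jcas, date.getBegin(), date.getEnd()).addToIndexes();
      }
    }
  }

  // Returns the sentence annotation that wraps the given date, or null if none is found
  private Annotation coveringSentence(JCas jcas, Annotation date) {
    FSIterator<Annotation> sentences = jcas.getAnnotationIndex(Sentence.type).iterator();
    while (sentences.hasNext()) {
      Annotation sentence = sentences.next();
      if (sentence.getBegin() <= date.getBegin() && sentence.getEnd() >= date.getEnd()) {
        return sentence;
      }
    }
    return null;
  }
}
```
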
  • Date patterns
  • ExpirationDateAnnotator
  • GeographicRegionAnnotator
  • UC2 - ActivityAnnotator
  • UC2 - ActivityAnnotator
  • Conclusions on IE. UC1: simple and stable sentence patterns. UC2: multi-language, with much more complex and varied sentence structures and patterns. Fine-grained metadata are very important. You need to play with NLP and to establish good test cases.