Information Extraction with UIMA - Usecases

Slides about "Usecases for Information Extraction with UIMA" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University

Published in: Technology

  1. Information Extraction with UIMA - Use Cases
     Information Management on the Web (Gestione delle Informazioni su Web) - 2009/2010
     Tommaso Teofili, tommaso [at] apache [dot] org
     Friday, 16 April 2010
  2. Use Cases - Agenda
     • UC1: Real estate market analysis
     • UC2: Automatic information extraction from tenders
  3. UC1: Source
     • An online announcement site for sellers and buyers
     • Wide purpose (cars, real estate, hi-fi, etc.)
     • Local scope (Rome and nearby)
  4. UC1 - Goals
     • Are you looking for houses?
     • A specific subcategory of the site is dedicated to real estate
     • I would like to monitor the Rome real estate market in order to:
       • make smart decisions
       • predict how things will go in the (near) future
  5. UC1 - Source
  6. UC1 - Goals
     • I want to build a separate web application to monitor such estate listings
     • I have to use a crawler to automatically and periodically download selected pages from the source
     • The text of estate listings is unstructured
     • I want to run aggregate queries on structured information
  7. UC1 - Information Extraction
     • I have to write an information extraction engine that populates a relational DB schema with structured information taken from the free text of estate listings
  8. UC1 - Blocks
  9. UC1 - Crawler
     • A specialized crawler extracts data from the source
     • Estate listing data are stored, grouped by zone, in files in a directory on a managed machine
  10. UC1 - Crawler
      • Site navigation is defined with one XML file for each city zone
      • The crawler downloads page fragments twice a week
      • The free text extracted from estate listings is saved as XML, grouped by zone
  11. UC1 - Crawler Modules
  12. UC1 - navigation definition
  13. UC1 - Crawler issues
      • Cookies must be enabled
      • Some HTTP headers are required
      • Fixed sleep intervals are needed between requests
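
The three crawler issues above (cookies, required headers, fixed pauses) can be sketched as follows. This is a minimal illustration in Python, not the crawler from the slides: the header values, the 5-second default delay and the function names are assumptions, since the slides do not give them.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header values; the slides do not say which headers the site required.
DEFAULT_HEADERS = [
    ("User-Agent", "Mozilla/5.0"),
    ("Accept-Language", "it-IT,it;q=0.8"),
]


def build_opener(headers=DEFAULT_HEADERS):
    """Return a urllib opener that keeps cookies and sends fixed headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = list(headers)
    return opener


def fetch_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, sleeping a fixed interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause before every request except the first
        pages.append(fetch(url))
    return pages
```

A real run would pass something like `lambda url: opener.open(url).read()` as `fetch`, where `opener` comes from `build_opener()`. Keeping `fetch` and `sleep` injectable makes the pacing logic testable without touching the network.
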
  14. UC1 - Domain
      • EstateListing (Announcement)
      • Zone
      • MagazineNumber (Uscita, the magazine issue)
      • HouseStructure with properties
  15. UC1 - Information Extraction Engine
      • Goal: extract price, zone and telephone number
      • The first version contained a specialized IE engine that used huge regular expressions
      • It was hard to maintain and inefficient
      • It extracted little information
  16. UC1 - IE Engine
      • New requirement: also extract the structure of the house
      • Number of rooms, garage (box), garden(s), external spaces, number of bathrooms, kitchen, etc.
      • Using regular expressions again proved hard to maintain and inefficient
  17. UC1 - IE Engine
      • Substitute the RegEx-based IE engine with a UIMA-based IE engine in order to:
        • exploit previous work (RegExs can live inside UIMA too)
        • exploit existing components
        • be able to modify and enhance IE rules easily
        • be much more efficient
        • extract more information
  18. UC1 - Analysis pipeline
  19. UC1 - TypeSystem
  20. Crawled XML
  21. Sample text
      “ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
      (Roughly: “Fri 26 Dec, APPIA, via Grottaferrata, metro, 2nd floor, sunny, entrance hall, living room, American-style kitchen, bedroom, small bedroom, bathroom, loft, parking space, e 295.000”)
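
To make the extraction task concrete, here is a small Python sketch that pulls the price and some house features out of an abridged copy of the sample listing. It is only an illustration of regex plus keyword matching, not the UIMA engine from the slides: the price pattern, the keyword list and the function names are all assumptions.

```python
import re

# Abridged copy of the sample listing (the floor detail is omitted).
SAMPLE = (
    "ven 26 Dic APPIA via grottaferrata metro ingresso salone americana "
    "cucina camera cameretta bagno soppalco posto auto e 295.000"
)

# Price: an "e" or "€" marker followed by digits with "." as thousands separator.
PRICE_RE = re.compile(r"(?:€|\be)\s*(\d{1,3}(?:\.\d{3})+)\b")

# Hypothetical keyword list for house features; the real dictionary is not shown.
FEATURE_WORDS = [
    "ingresso", "salone", "cucina", "camera",
    "cameretta", "bagno", "soppalco", "posto auto",
]


def extract_price(text):
    """Return the price as an integer number of euros, or None if absent."""
    match = PRICE_RE.search(text)
    return int(match.group(1).replace(".", "")) if match else None


def extract_features(text):
    """Return the feature keywords that occur as whole words in the text."""
    found = []
    for word in FEATURE_WORDS:
        if re.search(r"\b" + re.escape(word) + r"\b", text):
            found.append(word)
    return found
```

Even this toy version hints at why one large regular expression becomes hard to maintain: each new feature needs its own pattern or dictionary entry, which is the kind of rule a separate annotator can own.
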
  22. UC1 - ContentAnnotator
      • Only the estate listings must be extracted from the XML produced by the crawler
      • A simple parser gets each node containing an estate listing (whose text is in turn unstructured)
      • A ContentAnnotation is created over the document
  23. UC1 - ContentAnnotator
  24. ContentAnnotation
  25. UC1 - ACAnnotator
  26. UC1 - Entities
  27. ZoneAnnotator - Dictionary & RegEx
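
The slide content for the ZoneAnnotator is not in the transcript, so the following is only a guess at the general idea of combining a dictionary with a regular expression. It is a Python sketch, not UIMA code: `ZoneAnnotation`, `ZONE_DICTIONARY` and `annotate_zones` are hypothetical names, and only APPIA comes from the sample text. The begin/end offsets mirror how UIMA annotations mark a span of the document.

```python
import re
from dataclasses import dataclass


@dataclass
class ZoneAnnotation:
    begin: int  # offset of the first character of the match
    end: int    # offset just past the last character of the match
    zone: str


# Hypothetical dictionary; only APPIA appears in the sample text of the slides.
ZONE_DICTIONARY = ["APPIA", "PRATI", "EUR"]


def annotate_zones(text, dictionary=ZONE_DICTIONARY):
    """Return one ZoneAnnotation per dictionary entry found as a whole word."""
    if not dictionary:
        return []
    # Longest entries first, so a longer zone name wins over a shorter prefix.
    ordered = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(z) for z in ordered) + r")\b")
    return [ZoneAnnotation(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
```

Compiling the dictionary into a single alternation keeps the matching in one pass over the text; matching is case-sensitive here because the zone is written in capitals in the sample.
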
  28. ZoneAnnotator - Learning dictionaries
  29. UC1 - ZoneAnnotation
  30. UC1 - Consuming extracted information
      • The previous version of the IE engine produced (again) XML files that had to be parsed in order to store structured data in the DB
      • With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB
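
The consuming step can be pictured as "take the extracted fields and insert one row per listing". The Python sketch below shows only that idea and is not a UIMA CAS Consumer: it uses an in-memory-capable SQLite database in place of the MySQL DB from the slides, and the `estate_listing` table, its columns and the function names are assumptions.

```python
import sqlite3


def create_schema(connection):
    """Create a hypothetical table for extracted listings (schema is assumed)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS estate_listing ("
        "id INTEGER PRIMARY KEY, zone TEXT NOT NULL, price INTEGER, rooms INTEGER)"
    )


def consume(connection, listings):
    """Insert extracted listings, given as dicts with zone, price and rooms keys."""
    with connection:  # commits on success, rolls back if an insert fails
        connection.executemany(
            "INSERT INTO estate_listing (zone, price, rooms) "
            "VALUES (:zone, :price, :rooms)",
            listings,
        )
```

For example, `consume(conn, [{"zone": "APPIA", "price": 295000, "rooms": None}])` would store the zone and price of the sample listing, after `create_schema(conn)` has been called on a connection from `sqlite3.connect(":memory:")`.
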
  31. UC1 - Analyzing real estate market data
      • A simple webapp written in Java with Spring framework modules (Spring core, DAO, JDBC, MVC)
      • It queries aggregate data on a MySQL DB
      • The UI is enriched with jQuery
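
An aggregate query of the kind the webapp might run is sketched below in Python with SQLite. The actual application is Java/Spring on MySQL and its queries are not shown in the slides, so the `estate_listing(zone, price)` table and the function name are assumptions made purely for illustration.

```python
import sqlite3


def average_price_by_zone(connection):
    """Return (zone, average price, listing count) tuples, ordered by zone.

    Assumes a hypothetical table estate_listing(zone TEXT, price INTEGER).
    """
    cursor = connection.execute(
        "SELECT zone, AVG(price), COUNT(*) FROM estate_listing "
        "WHERE price IS NOT NULL GROUP BY zone ORDER BY zone"
    )
    return [(zone, average, count) for zone, average, count in cursor]
```

Grouping by zone is only possible because the zone and price were extracted into separate columns, which is the point of turning free text into structured information first.
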
  32. UC1 - Analysis Graphs
  33. UC1 - Analysis Graphs
  34. UC2 - Monitor of tenders/announcements
      • Monitor various sources that publish announcements and tenders, to which interested people and companies can subscribe
      • We want to automate the long process of monitoring such sources and also automatically extract useful common information from the announcements’ text
  35. UC2 - Blocks
  36. Different input texts
  37. Different input texts
  38. Different input texts
  39. Different input texts
  40. UC2 - Crawling
      • Similar to the UC1 crawler, but a Firefox plugin lets us define navigation patterns for the pages of each source
      • We can also mark metadata that we see during navigation and that carries information
      • Again an XML file is generated, so that it can be saved to storage and executed periodically
  41. UC2 - Defining navigation
  42. UC2 - Domain annotations
      • Language
      • Funding type
      • Abstract
      • Geographic region
      • Activity
      • Sector
      • Beneficiary
      • Subject
      • Budget
      • Title
      • Expiration date
      • Tags
  43. UC2 - Domain entities
      • First and most important is an entity that represents the entire tender or announcement
      • The annotations in the domain will eventually fill the properties of this entity
  44. UC2 - Pipeline
  45. UC2 - Simple first
      • Each annotator first checks:
        • whether some metadata was extracted during navigation
        • for the most common patterns used to state information inside such announcements, e.g. “Budget: 200000$” or “Financial information: ......”
      • Such patterns are assumed to be language independent (although this is often not true)
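
The "simple first" strategy can be sketched as a lookup that prefers navigation metadata and falls back to a "Label: value" line in the text. This Python sketch is an assumption about how such a check could look, not the annotators from the slides; `find_field` and `LABEL_VALUE_RE` are hypothetical names.

```python
import re

# A line of the form "Label: value", e.g. "Budget: 200000$".
LABEL_VALUE_RE = re.compile(
    r"^[ \t]*(?P<label>[^:\n]+?)[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


def find_field(label, text, metadata=None):
    """Return the value for a label, preferring metadata over the text pattern."""
    # Step 1: metadata captured during navigation wins if present.
    if metadata and label in metadata:
        return metadata[label]
    # Step 2: fall back to the most common "Label: value" pattern in the text.
    for match in LABEL_VALUE_RE.finditer(text):
        if match.group("label").lower() == label.lower():
            return match.group("value")
    return None
```

The label itself ("Budget") is still language dependent, which matches the caveat on the slide: the pattern shape carries across languages, the vocabulary does not.
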
  46. UC2 - AbstractAnnotator
      • The abstract is usually in the first part of the document
      • We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences
      • We use a Dictionary to provide a list of “good” words
      • We search the first sentences of the document for the objectives of the announcement (mixing good words and regular expressions)
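
A much simplified version of this idea is sketched below in Python. It replaces the UIMA Tokenizer and Tagger with a naive regex sentence splitter and ignores PoS tags entirely, so it is a stand-in rather than the annotator from the slides; the word list, the three-sentence window and the function names are assumptions.

```python
import re

# Hypothetical "good" words; the dictionary used in the slides is not shown.
GOOD_WORDS = {"aim", "objective", "objectives", "purpose", "support"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by spaces."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, good_words=GOOD_WORDS, max_sentences=3):
    """Join the leading sentences that contain at least one good word."""
    selected = []
    for sentence in split_sentences(text)[:max_sentences]:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & good_words:
            selected.append(sentence)
    return " ".join(selected)
```

Restricting the search to the first few sentences encodes the observation that the abstract usually sits at the start of the document.
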
  47. UC2 - ExpirationDateAnnotator
      • A DateAnnotator runs first
      • Iterate over the DateAnnotations
      • Get the sentences that contain those DateAnnotations
      • Check whether terms like “deadline” appear in the same sentence as a DateAnnotation
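
The steps above can be sketched in Python as follows: find dates, locate the sentence around each one, and keep the date only if a deadline term shares that sentence. This is an illustration, not the UIMA annotator: the dd/mm/yyyy pattern, the list of terms and the function names are assumptions, and the sentence splitter is deliberately naive.

```python
import re

# Assumed date format dd/mm/yyyy; the real date patterns are not in the transcript.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Hypothetical trigger terms; only "deadline" is named in the slides.
DEADLINE_TERMS = ("deadline", "expires", "no later than", "closing date")


def sentence_spans(text):
    """Return (begin, end) offsets of sentences ending in '.', '!' or '?'."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?](?:\s+|$)", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))  # trailing text without a final stop
    return spans


def find_expiration_dates(text):
    """Return dates whose enclosing sentence also contains a deadline term."""
    spans = sentence_spans(text)
    results = []
    for date in DATE_RE.finditer(text):
        for begin, end in spans:
            if begin <= date.start() and date.end() <= end:
                sentence = text[begin:end].lower()
                if any(term in sentence for term in DEADLINE_TERMS):
                    results.append(date.group(0))
                break
    return results
```

Working with offsets rather than copied strings follows the annotation style, where a sentence "wraps" a date when its span covers the date's span.
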
  48. Date patterns
  49. ExpirationDateAnnotator
  50. GeographicRegionAnnotator
  51. UC2 - ActivityAnnotator
  52. UC2 - ActivityAnnotator
  53. Conclusions on IE
      • UC1: simple and stable sentence patterns
      • UC2: multi-language, with much more complex and varied sentence structures and patterns
      • Fine-grained metadata are very important
      • Need to play with NLP
      • Need to establish good test cases
