Apache UIMA - Hands on code



A lesson on real-world UIMA use cases, integration with search engines, and a short hands-on coding session.

Published in: Technology

  1. Apache UIMA - hands on code. Gestione delle Informazioni su Web - 2010/2011. Tommaso Teofili, tommaso [at] apache [dot] org
  2. Use Cases - Agenda. UC1: real estate market analysis; UC2: automatic information extraction from tenders; UIMA & search engines; tutorial; assignment
  3. UC1 - Source. An online announcement site for sellers and buyers; wide purpose (cars, real estate, hi-fi, etc.); local scope (Rome and nearby)
  4. UC1 - Goals. Track the real estate market in order to make smart decisions and predict how things will go in the (near) future. The text of estate listings is unstructured, while aggregate queries for statistical analysis need structured information
  5. UC1 - Source
  6. UC1 - Blocks
  7. UC1 - Crawler. A specialized crawler extracts data from the source. Estate listing data is stored, grouped by zone, in files in a directory on a managed machine. Site navigation is defined with one XML file per city zone. The crawler downloads page fragments twice a week. The free text extracted from the estate listings is saved as XML, grouped by zone
  8. UC1 - Crawler. Issues: cookies had to be enabled; some HTTP headers were required; fixed sleep intervals had to be inserted between requests
  9. UC1 - Domain: Announcement, Zone, MagazineNumber, HouseStructure (with properties)
  10. UC1 - Information Extraction Engine. Goal: extract price, zone and telephone number. The first version used huge regular expressions: hard to maintain and inefficient, with poor extraction quality
  11. UC1 - IE Engine. New requirements: extract the structure of the house (number of rooms, box, garden(s), external spaces, number of bathrooms, kitchen, etc.); track more fine-grained zones
  12. Sample text: “ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
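
To make the new requirements concrete, here is a minimal sketch in plain Python (not the UIMA Java API) of dictionary-based extraction over the sample listing. It assumes the whitespace lost in the transcript is restored; the `Annotation` class, the `FEATURE_TERMS` list and the `PRICE` pattern are hypothetical names chosen for illustration. Each result is a typed span with begin/end offsets, which mirrors the idea of an annotation over a document.

```python
import re
from dataclasses import dataclass

SAMPLE = ("ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso "
          "salone americana cucina camera cameretta bagno soppalco posto auto e 295.000")

@dataclass(frozen=True)
class Annotation:
    """A typed span over the text (hypothetical stand-in for a UIMA annotation)."""
    kind: str
    begin: int
    end: int

    def covered_text(self, text):
        return text[self.begin:self.end]

# Assumed dictionary of Italian terms describing the structure of the house.
FEATURE_TERMS = ["salone", "cucina", "camera", "cameretta",
                 "bagno", "soppalco", "posto auto"]

# Assumed price format: groups of three digits separated by dots, e.g. 295.000
PRICE = re.compile(r"\b\d{1,3}(?:\.\d{3})+\b")

def annotate(text):
    """Return Feature and Price annotations sorted by start offset."""
    found = []
    for term in FEATURE_TERMS:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            found.append(Annotation("Feature", m.start(), m.end()))
    for m in PRICE.finditer(text):
        found.append(Annotation("Price", m.start(), m.end()))
    return sorted(found, key=lambda a: a.begin)

for a in annotate(SAMPLE):
    print(a.kind, a.begin, a.end, a.covered_text(SAMPLE))
```

Keeping one small dictionary or pattern per annotation type is one way to avoid the single huge regular expression that the first version relied on.
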
  13. UC1 - ContentAnnotator. Only the estate listings must be extracted from the XML produced by the crawler. A simple parser gets each node containing an estate listing (whose text is in turn unstructured) and creates a ContentAnnotation over the document
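
The ContentAnnotator step could be sketched as follows, again in plain Python rather than the UIMA Java API. The slides do not show the crawler's XML format, so the `zone` and `listing` element names are assumptions, and the function name `content_annotations` is hypothetical. The sketch parses the XML, then records the offsets of each listing's text inside the raw document.

```python
import xml.etree.ElementTree as ET

def content_annotations(xml_document, tag="listing"):
    """Return (type, begin, end) spans covering the text of each listing node."""
    root = ET.fromstring(xml_document)
    spans = []
    cursor = 0  # search forward only, so repeated texts map to distinct spans
    for node in root.iter(tag):
        content = (node.text or "").strip()
        if not content:
            continue
        begin = xml_document.find(content, cursor)
        if begin < 0:
            # The raw text differs from the parsed text (e.g. escaped entities);
            # this sketch simply skips such nodes.
            continue
        end = begin + len(content)
        spans.append(("ContentAnnotation", begin, end))
        cursor = end
    return spans

doc = ("<zone name='APPIA'>"
       "<listing>salone cucina camera bagno 295.000</listing>"
       "<listing>camera cameretta bagno 180.000</listing>"
       "</zone>")
for kind, begin, end in content_annotations(doc):
    print(kind, begin, end, doc[begin:end])
```
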
  14. ContentAnnotation
  15. UC1 - Entities
  16. UC1 - ZoneAnnotation
  17. UC1 - Consuming extracted information. The previous version of the IE engine produced XML files that had to be reparsed to store structured data in the DB. With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB
  18. UIMA - CAS Consumer. An Analysis Engine responsible for consuming the information contained in the CAS. It can write extracted information to: DBMS, Lucene index, filesystem, ...
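
As an illustration of the consumer idea, the sketch below writes already-extracted values to a database. It is plain Python with the standard `sqlite3` module, not a real UIMA CAS Consumer; the `listing` table, its columns and the `consume` function are hypothetical, and the extracted values are passed as a dictionary instead of being read from a CAS.

```python
import sqlite3

# Assumed schema: one row per estate listing.
SCHEMA = """CREATE TABLE IF NOT EXISTS listing (
    id TEXT PRIMARY KEY, zone TEXT, price INTEGER, phone TEXT)"""

def consume(conn, listing_id, extracted):
    """Store the extracted values for one listing; missing fields become NULL."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO listing (id, zone, price, phone) VALUES (?, ?, ?, ?)",
        (listing_id, extracted.get("zone"), extracted.get("price"), extracted.get("phone")),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
consume(conn, "appia-001", {"zone": "APPIA", "price": 295000})
print(conn.execute("SELECT zone, price, phone FROM listing").fetchall())
```

Placing this step at the end of the pipeline removes the intermediate XML files that the previous engine had to reparse.
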
  19. UC1 - Analysis Graphs
  20. UC1 - Analysis Graphs
  21. UC2 - Monitor of EU announcements. Monitor various sources that provide announcements and tenders. Automate the long monitoring process of such sources and automatically extract useful common information from the announcements’ texts
  22. UC2 - Blocks
  23. Different input texts
  24. Different input texts
  25. Different input texts
  26. UC2 - Domain annotations: Language, Abstract, Activity, Beneficiary, Budget, Expiration date, Funding type, Geographic region, Sector, Subject, Title, Tags
  27. UC2 - Domain entities. First and most important is an entity that represents the entire tender or announcement. The annotations in the domain ultimately fill the properties of this entity
  28. UC2 - Simple first. Each annotator first checks: whether some metadata was extracted during navigation; the most common patterns for stating information in such announcements, e.g. “Budget: 200000$” or “Financial information: ......”. Such patterns are common across different languages
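
The "simple first" strategy can be sketched in plain Python as below. Metadata found during navigation takes priority, and only then is the text searched for `Label: value` lines. The `LABELS` table and the function name are hypothetical; a real system would carry labels for each supported language.

```python
import re

# Assumed labels per field; extend with equivalents in other languages.
LABELS = {
    "budget": ("budget", "financial information"),
    "title": ("title",),
}

def extract_labeled_fields(text, metadata=None):
    """Fill fields from navigation metadata first, then from 'Label: value' lines."""
    fields = dict(metadata or {})
    for field, labels in LABELS.items():
        if field in fields:
            continue  # already known from navigation, no need to search the text
        for label in labels:
            pattern = r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$"
            m = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if m:
                fields[field] = m.group(1)
                break
    return fields

text = "Title: Support for SMEs\nBudget: 200000$\nOther details follow."
print(extract_labeled_fields(text))
print(extract_labeled_fields(text, metadata={"budget": "250000$"}))
```
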
  29. UC2 - AbstractAnnotator. The abstract is usually in the first part of the document. We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences. We use a dictionary of “good” words and linguistic patterns. We search the first sentences of the document for the objectives of the announcement
  30. UC2 - ExpirationDateAnnotator. A DateAnnotator is executed first. Iterate over the DateAnnotations; get the sentences covering them; check whether terms or patterns like “the deadline is ...” appear near a DateAnnotation
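
The ExpirationDateAnnotator steps can be sketched in plain Python as follows (not the UIMA Java API). The date format, the naive sentence splitter and the `CUES` list are assumptions made for the example; "near" is approximated here as "inside the same sentence".

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    kind: str
    begin: int
    end: int

DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")   # assumed dd/mm/yyyy dates
SENTENCE = re.compile(r"[^.!?]+[.!?]?")           # naive sentence splitter
CUES = ("deadline", "expires", "closing date", "no later than")  # assumed cue terms

def date_annotations(text):
    return [Span("Date", m.start(), m.end()) for m in DATE.finditer(text)]

def sentence_annotations(text):
    return [Span("Sentence", m.start(), m.end()) for m in SENTENCE.finditer(text)]

def expiration_dates(text):
    """Keep only the dates whose covering sentence contains a deadline cue."""
    sentences = sentence_annotations(text)
    result = []
    for date in date_annotations(text):
        for s in sentences:
            if s.begin <= date.begin and date.end <= s.end:
                if any(cue in text[s.begin:s.end].lower() for cue in CUES):
                    result.append(Span("ExpirationDate", date.begin, date.end))
                break  # a date belongs to exactly one sentence
    return result

text = "The call was published on 01/02/2011. The deadline is 15/03/2011. Budget: 200000$"
for span in expiration_dates(text):
    print(span.kind, text[span.begin:span.end])
```
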
  31. UC2 - BandoEntity
  32. UIMA & Search Engines. Decorate documents with automatically extracted metadata to improve: search experience, relevance, results clustering
  33. Information Retrieval and Named Entities
  34. UIMA & Search Engines. “Push” scenario: documents are sent to UIMA, which extracts metadata and writes it to the index with a CAS Consumer. “Pull” scenario: documents are sent to Lucene, which asks UIMA to extract metadata and then writes it to the index itself. “On demand” scenario: metadata is extracted only on demand, each time a document is retrieved or shown...
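
The difference between the push and pull scenarios is mainly about who calls whom. The toy sketch below uses a small in-memory inverted index in plain Python; it does not use the Lucene or UIMA APIs, and `TinyIndex`, `extract_metadata`, `push` and the zone list are all hypothetical.

```python
from collections import defaultdict

KNOWN_ZONES = ("appia", "prati")  # assumed zone dictionary

def extract_metadata(text):
    """Stand-in for the analysis pipeline: returns extracted metadata fields."""
    zones = [z for z in KNOWN_ZONES if z in text.lower()]
    return {"zone": " ".join(zones)}

class TinyIndex:
    """Toy fielded inverted index; an optional enricher enables the pull scenario."""
    def __init__(self, enricher=None):
        self.enricher = enricher
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        if self.enricher is not None:
            # Pull: the index itself asks the extractor for metadata.
            fields = {**fields, **self.enricher(fields["text"])}
        for field, value in fields.items():
            for token in str(value).lower().split():
                self.postings[(field, token)].add(doc_id)

    def search(self, field, token):
        return sorted(self.postings.get((field, token.lower()), set()))

def push(index, doc_id, text):
    """Push: extraction runs first, then its consumer writes to the index."""
    index.add(doc_id, {"text": text, **extract_metadata(text)})

push_index = TinyIndex()
push(push_index, "d1", "APPIA via grottaferrata")

pull_index = TinyIndex(enricher=extract_metadata)
pull_index.add("d2", {"text": "flat in Prati"})

print(push_index.search("zone", "appia"), pull_index.search("zone", "prati"))
```

In both cases the index ends up with a `zone` field that the raw text alone would not provide as structured metadata.
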
  35. UIMA - tutorial: create a Type System; create an Analysis Engine descriptor; create a simple Annotator
  36. Assignment. Named Entity Recognition: sport (person, player, coach, team, competition); videogames (person, videogame character, videogame, software house, hardware requirement). Precision & Recall test
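
For the precision and recall test, one possible setup is to compare the annotations an annotator produces against a hand-labeled gold set, counting an annotation as correct only when offsets and type match exactly. The sketch below assumes that exact-match criterion; the tuples and the function name are illustrative.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (begin, end, type) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

gold = {(0, 5, "person"), (10, 15, "team"), (20, 28, "competition")}
predicted = {(0, 5, "person"), (10, 15, "team"), (30, 35, "player"), (40, 44, "coach")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four predicted annotations are correct (precision 0.50) and two of the three gold annotations are found (recall 0.67).
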