Apache UIMA - Hands on code

A lesson on real-world UIMA use cases, integration with search engines, and a short hands-on coding session.

  • 1. Apache UIMA - hands on code
    Gestione delle Informazioni su Web - 2010/2011
    Tommaso Teofili, tommaso [at] apache [dot] org
  • 2. Use Cases - Agenda
    UC1: Real estate market analysis
    UC2: Automatic information extraction from tenders
    UIMA & search engines
    Tutorial
    Assignment
  • 3. UC1: Source
    An online announcement (classified ads) site for sellers and buyers
    Wide purpose (cars, real estate, hi-fi, etc.)
    Local scope (Rome and nearby)
  • 4. UC1 - Goals
    Track the real estate market in order to:
      make smart decisions
      predict how things will go in the (near) future
    Estate listing text is unstructured
    Aggregate queries for statistical analysis need structured information
  • 5. UC1 - Source
  • 6. UC1 - Blocks
  • 7. UC1 - Crawler
    A specialized crawler extracts data from the source
    Estate listing data is stored, grouped by zone, in files in a directory on a managed machine
    Site navigation is defined with one XML file for each city zone
    The crawler downloads page fragments twice a week
    The free text extracted from the estate listings is saved as XML, grouped by zone
  • 8. UC1 - Crawler
    Issues:
      cookies had to be enabled
      some HTTP headers were needed
      fixed sleeping intervals were needed between requests
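The three crawler issues above can be sketched with the JDK's own HTTP client (Java 11+). This is only an illustration, not the crawler used in the project: the header values, the delay and the example URL are assumptions, since the slides do not say which ones were actually needed.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class PoliteCrawler {
    // Hypothetical values: the slides do not name the headers or the delay.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)";
    static final Duration FIXED_DELAY = Duration.ofSeconds(5);

    // A client with a CookieManager stores and resends the cookies the site sets.
    static HttpClient newClient() {
        return HttpClient.newBuilder().cookieHandler(new CookieManager()).build();
    }

    // Every request carries the extra headers the site is assumed to expect.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept-Language", "it-IT,it;q=0.8")
                .GET()
                .build();
    }

    // Downloads the pages one by one, sleeping a fixed interval after each request.
    static List<String> fetchAll(HttpClient client, List<String> urls, Duration delay)
            throws IOException, InterruptedException {
        List<String> bodies = new ArrayList<>();
        for (String url : urls) {
            HttpResponse<String> response =
                    client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
            bodies.add(response.body());
            Thread.sleep(delay.toMillis());
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Only builds a request (no network access) to show the headers that would be sent.
        HttpRequest request = buildRequest("http://example.org/zone/appia.html");
        System.out.println(request.headers().map());
    }
}
```

A caller would invoke `fetchAll(newClient(), urls, FIXED_DELAY)` with the page URLs of one city zone.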
  • 9. UC1 - DomainAnnouncementZoneMagazineNumberHouseStructure (with properties)
  • 10. UC1 - Information Extraction Engine
    Goal: extract price, zone and telephone number
    The first version used huge regular expressions
    Hard to maintain and inefficient
    Poor extraction
  • 11. UC1 - IE Engine
    New requirements:
      extract the structure of the house: number of rooms, box (garage), garden(s), external spaces, number of bathrooms, kitchen, etc.
      track more fine-grained zones
  • 12. Sample text
    “ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
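To make the extraction task on the sample text concrete, here is a minimal sketch that pulls out a price and some room keywords using small, separate rules. It is not the engine described in the slides: the price pattern (digits with "." as thousands separator) and the keyword list are assumptions chosen to fit this one listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListingExtractor {
    // Assumption: prices are written with "." as thousands separator, e.g. 295.000
    private static final Pattern PRICE = Pattern.compile("\\b(\\d{1,3}(?:\\.\\d{3})+)\\b");

    // Hypothetical keyword list for the house structure (Italian room names).
    private static final List<String> ROOM_WORDS =
            List.of("ingresso", "salone", "cucina", "camera", "cameretta", "bagno", "soppalco");

    // Returns the last price-like number in the text, if any.
    static OptionalLong extractPrice(String text) {
        Matcher m = PRICE.matcher(text);
        OptionalLong last = OptionalLong.empty();
        while (m.find()) {
            last = OptionalLong.of(Long.parseLong(m.group(1).replace(".", "")));
        }
        return last;
    }

    // Returns the room keywords that occur as whole words in the text.
    static List<String> extractRooms(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String word : ROOM_WORDS) {
            if (Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(lower).find()) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String listing = "ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso "
                + "salone americana cucina camera cameretta bagno soppalco posto auto e 295.000";
        System.out.println(extractPrice(listing));
        System.out.println(extractRooms(listing));
    }
}
```

Keeping each rule small and separate mirrors the move away from one huge regular expression. In a UIMA pipeline each rule would live in its own annotator.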
  • 13. UC1 - ContentAnnotator
    Only the estate listings must be extracted from the XML produced by the crawler
    A simple parser gets each node containing an estate listing (whose text is in turn unstructured)
    A ContentAnnotation is created over the document
  • 14. ContentAnnotation
  • 15. UC1 - Entities
  • 16. UC1 - ZoneAnnotation
  • 17. UC1 - Consuming extracted information
    The previous version of the IE engine produced XML files that had to be reparsed to store structured data in the DB
    With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB
  • 18. UIMA - CAS Consumer
    An Analysis Engine responsible for consuming the information contained in the CAS
    Can write extracted information to:
      DBMS
      Lucene index
      Filesystem
      ...
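The idea of a consumer at the end of the pipeline can be illustrated without the UIMA library. In the sketch below, `Span` and `consume` are hypothetical stand-ins (not UIMA API): a span plays the role of an annotation over the document text, and the consumer writes one line per span to any `Writer`, for example a file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

final class SpanConsumerSketch {
    /** Stand-in for an annotation: a type name plus begin/end offsets into the text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Writes one tab-separated line per span: type, begin, end, covered text. */
    static void consume(String text, List<Span> spans, Writer out) throws IOException {
        for (Span s : spans) {
            out.write(s.type + "\t" + s.begin + "\t" + s.end + "\t"
                    + text.substring(s.begin, s.end) + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "APPIA via grottaferrata e 295.000";
        List<Span> spans = List.of(new Span("Zone", 0, 5), new Span("Price", 26, 33));
        StringWriter out = new StringWriter();
        consume(text, spans, out);
        System.out.print(out);
    }
}
```

Swapping the `Writer` for a database insert or an index update changes only the sink and leaves the analysis steps untouched, which is the point of placing the consumer last.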
  • 19. UC1 - Analysis Graphs
  • 20. UC1 - Analysis Graphs
  • 21. UC2 - Monitor of EU announcements
    Monitor various sources that provide announcements and tenders
    Automate the long process of monitoring such sources and automatically extract useful common information from the announcements’ texts
  • 22. UC2 - Blocks
  • 23. Different input texts
  • 24. Different input texts
  • 25. Different input texts
  • 26. UC2 - Domain annotations
    Language, Abstract, Activity, Beneficiary, Budget, Expiration date
    Funding type, Geographic region, Sector, Subject, Title, Tags
  • 27. UC2 - Domain entities
    First and most important is an entity that represents the entire tender or announcement
    The annotations in the domain ultimately fill the properties of this entity
  • 28. UC2 - Simple first
    Each annotator first looks:
      at whether some metadata was extracted during navigation
      for the most common patterns used to state information in such announcements, e.g. “Budget: 200000$” or “Financial information: ......”
    Such patterns are common across different languages
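The "label: value" pattern from the slide above can be sketched as a small lookup that tries a list of labels, one per language. The label list passed in (for example "Bilancio" as an Italian variant of "Budget") is an assumption for illustration and does not come from the project.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LabelledFieldExtractor {
    // Looks for a line of the form "<label>: <value>" and returns the value
    // for the first label that matches (case-insensitive).
    static Optional<String> findValue(String text, List<String> labels) {
        for (String label : labels) {
            Pattern p = Pattern.compile(
                    "(?im)^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
            Matcher m = p.matcher(text);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Title: Call for proposals\nBudget: 200000$\nDeadline: 15/03/2011";
        // Hypothetical label variants for the same field in two languages.
        System.out.println(findValue(text, List.of("Budget", "Bilancio")));
    }
}
```

Only when such a cheap lookup fails would an annotator need to fall back to heavier linguistic analysis.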
  • 29. UC2 - AbstractAnnotator
    The abstract is usually in the first part of the document
    We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences
    We use a dictionary of “good” words and linguistic patterns
    We search the first sentences of the document for the objectives of the announcement
  • 30. UC2 - ExpirationDateAnnotator
    A DateAnnotator is executed beforehand
    Iterate over the DateAnnotations
    Get the sentences wrapping such DateAnnotations
    Check whether terms or patterns like “the deadline is ...” appear near a DateAnnotation
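The steps of the ExpirationDateAnnotator can be sketched in plain Java. This is a simplified illustration and not the actual annotator: a regular expression for dd/mm/yyyy dates stands in for the DateAnnotator, sentences are delimited naively by ".", "!" and "?", and the cue list is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeadlineFinder {
    // Stand-in for the DateAnnotator: matches dates such as 15/03/2011.
    private static final Pattern DATE = Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    // Hypothetical cue terms that mark a date as an expiration date.
    private static final List<String> CUES = List.of("deadline", "expires", "no later than");

    // Iterates over the dates, takes the sentence wrapping each one,
    // and keeps the date if a cue term appears in that sentence.
    static List<String> findExpirationDates(String text) {
        List<String> result = new ArrayList<>();
        Matcher date = DATE.matcher(text);
        while (date.find()) {
            String sentence = sentenceAround(text, date.start(), date.end())
                    .toLowerCase(Locale.ROOT);
            if (CUES.stream().anyMatch(sentence::contains)) {
                result.add(date.group());
            }
        }
        return result;
    }

    // Expands [begin, end) left and right until a sentence terminator is met.
    static String sentenceAround(String text, int begin, int end) {
        int start = begin;
        while (start > 0 && !isTerminator(text.charAt(start - 1))) {
            start--;
        }
        int stop = end;
        while (stop < text.length() && !isTerminator(text.charAt(stop))) {
            stop++;
        }
        return text.substring(start, stop).trim();
    }

    private static boolean isTerminator(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    public static void main(String[] args) {
        String text = "The call was published on 01/02/2011. The deadline is 15/03/2011. Contact us.";
        System.out.println(findExpirationDates(text));
    }
}
```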
  • 31. UC2 - BandoEntity
  • 32. UIMA & Search Engines
    Decorate documents with automatically extracted metadata to improve:
      search experience
      relevance
      results clustering
  • 33. Information Retrieval and Named Entities
  • 34. UIMA & Search Engines
    “Push” scenario: documents are sent to UIMA, which extracts metadata and writes it to the index with a CAS Consumer
    “Pull” scenario: documents are sent to Lucene, which asks UIMA to extract the metadata and then writes it to the index itself
    “On demand” scenario: metadata is extracted only on demand, each time a document is retrieved/shown...
  • 35. UIMA - tutorial
    create a Type System
    create an Analysis Engine descriptor
    create a simple Annotator
  • 36. Assignment
    Named Entity Recognition
      sport: person, player, coach, team, competition
      videogames: person, videogame character, videogame, software house, hardware requirement
    Precision & Recall test
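For the Precision & Recall test, a minimal sketch follows. It assumes each entity is encoded as a "type:text" string and that a prediction counts as correct only when it matches a gold entity exactly; the sample entities are invented for illustration. Precision is TP / (TP + FP) and recall is TP / (TP + FN).

```java
import java.util.HashSet;
import java.util.Set;

final class PrecisionRecall {
    // True positives: predicted entities that also appear in the gold standard.
    static int truePositives(Set<String> predicted, Set<String> gold) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(gold);
        return common.size();
    }

    // Fraction of predicted entities that are correct (0.0 when nothing was predicted).
    static double precision(Set<String> predicted, Set<String> gold) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(predicted, gold) / predicted.size();
    }

    // Fraction of gold entities that were found (0.0 when the gold set is empty).
    static double recall(Set<String> predicted, Set<String> gold) {
        if (gold.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(predicted, gold) / gold.size();
    }

    public static void main(String[] args) {
        // Invented sample data: entities encoded as "type:text".
        Set<String> gold = Set.of("player:Mario Rossi", "team:Roma",
                "competition:Serie A", "coach:Luca Bianchi");
        Set<String> predicted = Set.of("player:Mario Rossi", "team:Roma", "team:Serie A");
        System.out.println("precision = " + precision(predicted, gold));
        System.out.println("recall = " + recall(predicted, gold));
    }
}
```

With this sample data, two of the three predictions are correct and two of the four gold entities are found, so precision is 2/3 and recall is 1/2.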