Apache UIMA - Hands on code

A lesson about real UIMA use cases, integration with search engines, and a short hands-on coding session.

Published in: Technology

Transcript of "Apache UIMA - Hands on code"

  1. Apache UIMA - hands on code. Gestione delle Informazioni su Web - 2010/2011. Tommaso Teofili, tommaso [at] apache [dot] org
  2. Use Cases - Agenda. UC1: Real estate market analysis. UC2: Automatic information extraction from tenders. UIMA & search engines. Tutorial. Assignment.
  3. UC1: Source. An online announcement site for sellers and buyers. General purpose (cars, real estate, hi-fi, etc.). Local scope (Rome and nearby).
  4. UC1 - Goals. Track the real estate market in order to: take smart decisions; predict how things will go in the (near) future. Estate listing text is unstructured. Aggregate queries for statistical analysis need structured information.
  5. UC1 - Source
  6. UC1 - Blocks
  7. UC1 - Crawler. A specialized crawler extracts data from the source. Estate listing data are stored, grouped by zone, in files in a directory on a managed machine. Navigation of the site is defined using one XML file for each city zone. The crawler downloads page fragments twice a week. The free text extracted from the estate listings is saved as XML, grouped by zone.
  8. UC1 - Crawler. Issues: cookies must be enabled; some HTTP headers are needed; fixed sleeping intervals had to be put between requests.
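
The three crawler issues above can be sketched in a few lines. This is a plain-Python illustration, not the crawler used in the project: the header values, the `delay_seconds` default and the function names are all assumptions made for the example.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header values; the slides only say "some HTTP headers needed".
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; listings-crawler)",
    "Accept-Language": "it-IT,it;q=0.9",
}


def make_opener():
    """Build an opener that keeps cookies and sends the extra headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(HEADERS.items())
    return opener


def crawl(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, sleeping a fixed interval between requests.

    `fetch` and `sleep` are injected so the loop can be exercised offline.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # fixed pause, never before the first request
        pages.append(fetch(url))
    return pages
```

With a real site, `fetch` could be something like `lambda url: make_opener().open(url).read()`, reusing a single opener so that cookies persist across requests.
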
  9. UC1 - Domain: Announcement Zone Magazine Number House Structure (with properties)
  10. UC1 - Information Extraction Engine. Goal: extract price, zone and telephone number. The first version used huge regular expressions: hard to maintain and inefficient, with poor extraction.
  11. UC1 - IE Engine. New requirements: extract the structure of the house (number of rooms, box (garage), garden(s), external spaces, number of bathrooms, kitchen, etc.); track more fine-grained zones.
  12. Sample text: “ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
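
To make the extraction targets concrete, here is a minimal sketch that pulls a zone, a price and a rough room inventory out of the sample listing. It is only an illustration in plain Python: the patterns, the `ROOM_WORDS` list and the assumption that the zone is the first all-uppercase word are mine, and the sample string is a lightly normalised copy of the slide text (fused words separated).

```python
import re

SAMPLE = (
    "ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso "
    "salone americana cucina camera cameretta bagno soppalco posto auto e 295.000"
)

# Assumption: the price closes the listing, written with dots as thousands
# separators and preceded by a euro sign or by a stray "e".
PRICE_RE = re.compile(r"(?:€|\be)\s*(\d{1,3}(?:\.\d{3})+)\s*$")
# Assumption: the zone is the first run of all-uppercase words.
ZONE_RE = re.compile(r"\b([A-Z]{2,}(?:\s[A-Z]{2,})*)\b")
# Hypothetical, deliberately short vocabulary of room names.
ROOM_WORDS = ("salone", "cucina", "camera", "cameretta", "bagno")


def extract(listing):
    """Return zone, price in euros and per-room counts for one listing."""
    price = PRICE_RE.search(listing)
    zone = ZONE_RE.search(listing)
    tokens = listing.lower().split()
    return {
        "zone": zone.group(1) if zone else None,
        "price_eur": int(price.group(1).replace(".", "")) if price else None,
        "rooms": {w: tokens.count(w) for w in ROOM_WORDS if w in tokens},
    }
```

Even on this one listing the patterns lean on several assumptions, which hints at why the regex-only first version was hard to maintain.
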
  13. UC1 - ContentAnnotator. Only the estate listings must be extracted from the XML produced by the crawler. A simple parser gets each node containing an estate listing (whose text is in turn unstructured) and creates a ContentAnnotation over the document.
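
A rough sketch of the ContentAnnotator idea, in plain Python rather than the UIMA Java API: parse the crawler XML, find the listing nodes, and record a span (begin and end offsets) over the document for each one. The `listing` tag name, the `Annotation` class and the XML layout are assumptions; the slides do not show the crawler's schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Annotation:
    """A typed span over the document text, loosely mimicking a UIMA annotation."""
    type: str
    begin: int
    end: int


def content_annotations(xml_text, tag="listing"):
    """Create one ContentAnnotation per non-empty listing node."""
    root = ET.fromstring(xml_text)
    annotations = []
    cursor = 0  # search forward only, so repeated texts map to distinct spans
    for node in root.iter(tag):
        text = (node.text or "").strip()
        if not text:
            continue
        begin = xml_text.find(text, cursor)
        if begin < 0:
            continue  # e.g. the raw XML escaped a character in this text
        end = begin + len(text)
        annotations.append(Annotation("ContentAnnotation", begin, end))
        cursor = end
    return annotations
```

Downstream annotators can then restrict their work to `xml_text[a.begin:a.end]` for each annotation `a`, ignoring the surrounding markup.
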
  14. ContentAnnotation
  15. UC1 - Entities
  16. UC1 - ZoneAnnotation
  17. UC1 - Consuming extracted information. The previous version of the IE engine produced XML files that needed to be reparsed to store structured data inside the DB. With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically put the extracted information into the DB.
  18. UIMA - CAS Consumer. An Analysis Engine responsible for consuming the information contained in the CAS. It can write extracted information to: DBMS; Lucene index; filesystem; ...
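
The consumer step can be pictured as follows. This is a sketch under stated assumptions, using Python's `sqlite3` in place of the project's DBMS; the table layout, column names and function names are hypothetical. It also shows the kind of aggregate query that the structured data makes possible.

```python
import sqlite3


def consume(records, conn):
    """Store extracted listing records (dicts) in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing "
        "(zone TEXT, price_eur INTEGER, rooms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO listing (zone, price_eur, rooms) VALUES (?, ?, ?)",
        [(r["zone"], r["price_eur"], r["rooms"]) for r in records],
    )
    conn.commit()


def average_price_by_zone(conn):
    """Aggregate query over the structured data: mean price per zone."""
    rows = conn.execute("SELECT zone, AVG(price_eur) FROM listing GROUP BY zone")
    return dict(rows)
```

Calling `consume(records, sqlite3.connect(":memory:"))` at the end of a pipeline run replaces the earlier write-XML-then-reparse step with a single write.
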
  19. UC1 - Analysis Graphs
  20. UC1 - Analysis Graphs
  21. UC2 - Monitor of EU announcements. Monitor various sources which provide announcements and tenders. Automate the long monitoring process of such sources and automatically extract useful common information from the announcements' texts.
  22. UC2 - Blocks
  23. Different input texts
  24. Different input texts
  25. Different input texts
  26. UC2 - Domain annotations: Language, Funding type, Abstract, Geographic region, Activity, Sector, Beneficiary, Subject, Budget, Title, Expiration date, Tags
  27. UC2 - Domain entities. First and most important is an entity that represents the entire tender or announcement. The annotations inside the domain will finally fill the properties of that entity.
  28. UC2 - Simple first. Each annotator first looks: whether some metadata was extracted during navigation; for the most common pattern for defining information inside such announcements, e.g. “Budget: 200000$” or “Financial information: ......”. Such patterns are common in different languages.
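
The "simple first" order of checks might look like this in plain Python. The field name, the label list and the returned source tags are assumptions for illustration; only the two labels quoted on the slide are used.

```python
import re

# Labels taken from the slide; a real deployment would add per-language variants.
LABEL_PATTERNS = {
    "budget": re.compile(
        r"^\s*(?:budget|financial information)\s*:\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_field(field, text, navigation_metadata=None):
    """Return (value, source), trying the cheapest source first."""
    metadata = navigation_metadata or {}
    if field in metadata:
        # 1. Metadata already captured while navigating the source site.
        return metadata[field], "navigation"
    pattern = LABEL_PATTERNS.get(field)
    match = pattern.search(text) if pattern else None
    if match:
        # 2. The common "Label: value" pattern inside the announcement text.
        return match.group(1), "label-pattern"
    # 3. Nothing simple worked; heavier linguistic analysis would start here.
    return None, "not-found"
```

For example, `extract_field("budget", "Title: X\nBudget: 200000$\n")` finds the value through the label pattern, while passing `navigation_metadata={"budget": ...}` short-circuits the text search entirely.
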
  29. UC2 - AbstractAnnotator. The abstract is usually in the first part of the document. We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences. We use a dictionary of “good” words and linguistic patterns. We look in the first sentences of the document for the objectives of the announcement.
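
A much-simplified sketch of the dictionary part of this idea: score only the first few sentences by how many "good" words they contain and keep the best one. It swaps the Tokenizer and PoS Tagger for naive regex splitting and ignores linguistic patterns altogether; the word list and the `window` size are hypothetical.

```python
import re

# Hypothetical dictionary of "good" words signalling an objective.
GOOD_WORDS = {"aims", "objective", "objectives", "purpose", "promote", "supports"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, window=3, good_words=GOOD_WORDS):
    """Return the best-scoring sentence among the first `window`, or None."""
    best, best_score = None, 0
    for sentence in split_sentences(text)[:window]:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        score = sum(1 for token in tokens if token in good_words)
        if score > best_score:
            best, best_score = sentence, score
    return best
```

Sentences beyond the window are never considered, which mirrors the observation that the abstract usually sits at the start of the document.
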
  30. UC2 - ExpirationDateAnnotator. A DateAnnotator is executed first. Iterate over the DateAnnotations. Get the sentences wrapping such DateAnnotations. Check whether terms or patterns like “the deadline is ...” appear near a DateAnnotation.
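
The steps above can be sketched in plain Python as spans over the text. The date format (dd/mm/yyyy), the sentence splitter and the list of deadline terms are assumptions; in the real pipeline the dates and sentences would come from upstream annotators.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    begin: int
    end: int


DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")  # assumed date format
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")          # naive sentence splitter
DEADLINE_TERMS = ("deadline", "expires", "closing date", "no later than")


def spans(pattern, text):
    """Turn every regex match into a Span, standing in for an annotator."""
    return [Span(m.start(), m.end()) for m in pattern.finditer(text)]


def expiration_dates(text):
    """Keep the dates whose covering sentence mentions a deadline term."""
    sentences = spans(SENTENCE_RE, text)
    found = []
    for date in spans(DATE_RE, text):
        for sentence in sentences:
            if sentence.begin <= date.begin and date.end <= sentence.end:
                covering = text[sentence.begin:sentence.end].lower()
                if any(term in covering for term in DEADLINE_TERMS):
                    found.append(text[date.begin:date.end])
                break  # a date belongs to exactly one sentence
    return found
```

Here "near" is approximated as "in the same sentence"; a token-distance window would be a stricter alternative.
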
  31. UC2 - BandoEntity
  32. UIMA & Search Engines. Decorate documents with automatically extracted metadata to improve the search experience: relevance, results clustering.
  33. Information Retrieval and Named Entities
  34. UIMA & Search Engines. “Push” scenario: documents are sent to UIMA, which extracts metadata and writes to the index with a CAS Consumer. “Pull” scenario: documents are sent to Lucene, which asks UIMA to extract metadata and then writes them to the index itself. “On demand” scenario: metadata are extracted only on demand, each time a document is retrieved/shown...
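
The difference between the three scenarios is mostly about who calls the extractor and when. The toy sketch below makes that explicit; it uses an in-memory dictionary instead of Lucene and a trivial stand-in instead of a UIMA pipeline, and every name in it is hypothetical.

```python
def extract_metadata(text):
    """Stand-in for a UIMA pipeline: treat capitalised words as entities."""
    return sorted({w.strip(".,") for w in text.split() if w[:1].isupper()})


class Index:
    """Toy in-memory index standing in for a search engine."""

    def __init__(self, enricher=None):
        self.docs = {}
        self.enricher = enricher  # set only in the "pull" scenario

    def add(self, doc_id, text, metadata=None):
        if metadata is None and self.enricher is not None:
            metadata = self.enricher(text)  # pull: the index asks for metadata
        self.docs[doc_id] = {"text": text, "metadata": metadata}

    def get(self, doc_id):
        return self.docs[doc_id]


def push(index, doc_id, text):
    """Push: analysis runs first and writes the enriched document itself."""
    index.add(doc_id, text, metadata=extract_metadata(text))


def on_demand(index, doc_id):
    """On demand: nothing is stored; metadata is computed at retrieval time."""
    doc = index.get(doc_id)
    return {**doc, "metadata": extract_metadata(doc["text"])}
```

Push and pull pay the analysis cost once at indexing time, while on demand pays it on every retrieval but never stores stale metadata.
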
  35. UIMA - tutorial: create a Type System; create an Analysis Engine descriptor; create a simple Annotator
  36. Assignment. Named Entity Recognition: sport (person, player, coach, team, competition); videogames (person, videogame character, videogame, software house, hardware requirement). Precision & Recall test.
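
For the precision and recall test, one common convention is to count a predicted entity as correct only when its span and its type both match a gold annotation exactly. The sketch below assumes that convention and represents each entity as a `(begin, end, type)` tuple; the assignment itself does not prescribe a matching rule.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (begin, end, type) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Precision is the fraction of predicted entities that are correct, and recall is the fraction of gold entities that were found; an annotator that emits very few, very safe entities scores high on the first and low on the second.
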