Your SlideShare is downloading. ×
  • Like
2010 10-building-global-listening-platform-with-solr
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Now you can save presentations on your phone or tablet

Available for both IPhone and Android

Text the download link to your phone

Standard text messaging rates apply

2010 10-building-global-listening-platform-with-solr

  • 990 views
Published

 

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
990
On SlideShare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
16
Comments
0
Likes
2

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Building a Global Listening Platform with SolrSteve KearnsRosette Product ManagerBasis TechnologyOctober 7, 2010Monday, October 04, 2010 2
  • 2. Agenda• Agenda• Who Am I?• What is a “Listening Platform”?• Challenges (Technical & Global)• Details• Demonstration 3
  • 3. About me• Product Manager at Basis Technology – Rosette linguistics platform • Language ID • Language Support for Search • Entity Extraction • Entity Translation/Search •…• Related history – Media Monitoring at BBN Technologies • Video, Web content extraction: STT, MT, Search 4
  • 4. What is a Listening Platform?• Content aggregator for online media• Targets: – Social/Brand monitoring – Government OSINT• Functions: – Content acquisition – Content analysis – Search indexing – Search (UI) – Visualization 5
  • 5. Content Acquisition• What: – News Articles – Social Media• How: – Web Crawler • Nutch! – RSS feed reader/aggregator • ROME/Curn – Pay a 3rd party aggregator • Good option, if you can afford it. 6
  • 6. Content Acquisition Problems• Web Crawling – Content Extraction! • BoilerPipe, Readability – Crawl History • Duplicate detection • Article updates? – Crawl Control • Crawl Depth? • Crawl Restrictions?• Per-site configuration doesn’t scale. 7
  • 7. Content Acquisition: How • CURN – Customizable Utilitarian RSS Notifier CURN History CURN RSS Yes New NoRSS Feed story? EndFeed Solr List Output Plug-In Download Extract Create Solr Story URL Content Message (BoilerPipe || Readability) 8
  • 8. What is a Listening Platform?• Content aggregator for online media• Targets: – Social/Brand monitoring – Government OSINT• Functions: – Content acquisition – Content analysis – Search indexing – Other visualization 9
  • 9. Content Analysis• Language Identification• Entity Extraction• Relationship Extraction• Classification• Near-Duplicate Detection• Story Tracking 10
  • 10. Content Analysis: How?• Preprocessor to Solr – Custom – OpenPipeline• Solr UpdateRequestProcessors – Chain of URP’s defined in SolrConfig – Add, edit, remove fields 11
  • 11. Content Analysis: How• Custom distributed processing pipeline – Complexity of components – Number of components – Some components require their own data storage• Solr Indexing is the final processing step 12
  • 12. Content Analysis Details• Language Identification for: – Indexing – Faceting/Searching – Entity Extraction• Language-specific indexing for: – Improved recall with high precision• Entity Extraction for: – Faceting – Entity search – Input to relationship extraction 13
  • 13. Language Identification• Detect dominant language• Find language regions 14
  • 14. Language-Specific Indexing• Every language has unique challenges: – Tokenization • Morphological Analysis vs. N-Gram – Stemming vs. Lemmatization • All European and Middle Eastern languages – Compound words • Swedish, Danish, Norwegian, Dutch, German 15
  • 15. Morphological Analysis vs. N-Gram• Search Term: 東京 ルパン上映時間• N-Gram:• Morphological: 16
  • 16. Stemming vs. Lemmatization• Stemming: – Set of rules for removing characters from words – Increased recall at the expense of precision – Example EN rule: Remove trailing “ing” or “al”• Lemmatization: – Complex set of approaches for producing the dictionary form of a word – Increased recall without hurting precision – Uses context to disambiguate candidates 17
  • 17. Stemming vs. Lemmatization• English: “I have spoken at several conferences”• Stemming:• Lemmatization: 18
  • 18. Stemming vs. Lemmatization• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”• Stemming:• Lemmatization (and decompounding!) 19
  • 19. Stemming and Lemmatization Challenges • Can I index text from many languages into the same field? – Yes, but it’s not always a good idea! • Query language ID is not accurate. – You need a custom Query Analyzer that does stemming/lemmatization in many languages for the same query. • How do I query text in multiple fields? – Dismax parser allows you to specify multiple fields to search. 20
  • 20. Entity Extraction• Process of identifying people, places, organizations, dates, times, etc. in unstructured text.• Methods: – List-based – Rules-based – Statistical-based• Define your goals upfront! – Some extraction methods work better for certain entity types • Rules work well for dates, email addresses, and URL’s, but not people • Lists work well for titles, but not locations • Statistical extractors work well for ambiguous entities like people, locations, organizations 21
  • 21. Entity Extraction Example 22
  • 22. User Interface• Google-style search results aren’t enough!• Design UI around workflow and/or• Design for Exploration 23
  • 23. Dashboard/Summary 24
  • 24. Faceting 25
  • 25. Link Analysis 26
  • 26. Details• Data Acquisition: Curn RSS Aggregator• Analysis: – Basis Technology: • Language ID • Relationship Extraction • Search Enablement • Entity Search • Entity Extraction• Indexing: Solr• UI: – JSP – Javascript InfoViz Toolkit (theJit.org) – Yahoo UI (YUI) – gRaphaël (g.raphaeljs.com) 27
  • 27. Architecture CURN Rosette Analysis Document(RSS Harvester) Components Classification and • Lang ID Clustering • Entity Extraction • Relationship Extraction Indexing / User Interface Query Service MySQL (Long Term Datastore) Name Solr Indexer 28
  • 28. Demo• Listening Platform built on Solr• I built this version in 3 months using Solr and products from Basis Technology• I would be happy to show you the Solr config and let you try it out 29