2010 10-building-global-listening-platform-with-solr

  1. Building a Global Listening Platform with Solr
     Steve Kearns, Rosette Product Manager, Basis Technology
     October 7, 2010
     Slide date stamp: Monday, October 04, 2010
  2. Agenda
     • Agenda
     • Who Am I?
     • What is a “Listening Platform”?
     • Challenges (Technical & Global)
     • Details
     • Demonstration
  3. About me
     • Product Manager at Basis Technology
       – Rosette linguistics platform
         • Language ID
         • Language Support for Search
         • Entity Extraction
         • Entity Translation/Search
         • …
     • Related history
       – Media Monitoring at BBN Technologies
         • Video, Web content extraction: STT, MT, Search
  4. What is a Listening Platform?
     • Content aggregator for online media
     • Targets:
       – Social/Brand monitoring
       – Government OSINT
     • Functions:
       – Content acquisition
       – Content analysis
       – Search indexing
       – Search (UI)
       – Visualization
  5. Content Acquisition
     • What:
       – News Articles
       – Social Media
     • How:
       – Web Crawler
         • Nutch!
       – RSS feed reader/aggregator
         • ROME/Curn
       – Pay a 3rd party aggregator
         • Good option, if you can afford it.
  6. Content Acquisition Problems
     • Web Crawling
       – Content Extraction!
         • BoilerPipe, Readability
       – Crawl History
         • Duplicate detection
         • Article updates?
       – Crawl Control
         • Crawl Depth?
         • Crawl Restrictions?
     • Per-site configuration doesn’t scale.
  7. Content Acquisition: How
     • CURN – Customizable Utilitarian RSS Notifier
     [Flow diagram: CURN reads the RSS Feed List and each RSS Feed, then checks the CURN History to decide “New story?”. If No: End. If Yes: the Solr Output Plug-In runs Download Story URL → Extract Content (BoilerPipe || Readability) → Create Solr Message.]
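The acquisition flow on this slide (check the history, extract the content, create a Solr message) can be sketched roughly as follows. This is a minimal illustration, not the CURN plug-in itself: the function names, the document field names, and the `extract` callable (a stand-in for BoilerPipe or Readability) are all assumptions, and JSON is used only for illustration because the slides do not state the message format.

```python
import hashlib
import json


def story_key(url: str) -> str:
    """Stable key for the crawl history: a SHA-1 hash of the story URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def build_solr_doc(url: str, title: str, body: str) -> dict:
    """Build the document to index (field names are hypothetical)."""
    return {"id": story_key(url), "url": url, "title": title, "body": body}


def process_feed(entries, history: set, extract) -> list:
    """Return documents for stories that are not yet in the history.

    entries: iterable of dicts with 'url', 'title' and 'html' keys
    history: set of story keys already processed (the crawl history)
    extract: callable that strips boilerplate from HTML; it stands in
             for BoilerPipe or Readability
    """
    docs = []
    for entry in entries:
        key = story_key(entry["url"])
        if key in history:   # "New story?" -> No: skip it
            continue
        history.add(key)     # "New story?" -> Yes: remember it
        body = extract(entry["html"])
        docs.append(build_solr_doc(entry["url"], entry["title"], body))
    return docs


def to_update_message(docs: list) -> str:
    """Serialize the documents as a JSON array (format chosen for illustration)."""
    return json.dumps(docs, ensure_ascii=False)
```

Keying the history on the URL alone catches repeated feed entries but not article updates or the same story under a different URL, which is the duplicate-detection problem listed on the previous slide.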
  8. What is a Listening Platform?
     • Content aggregator for online media
     • Targets:
       – Social/Brand monitoring
       – Government OSINT
     • Functions:
       – Content acquisition
       – Content analysis
       – Search indexing
       – Other visualization
  9. Content Analysis
     • Language Identification
     • Entity Extraction
     • Relationship Extraction
     • Classification
     • Near-Duplicate Detection
     • Story Tracking
  10. Content Analysis: How?
      • Preprocessor to Solr
        – Custom
        – OpenPipeline
      • Solr UpdateRequestProcessors
        – Chain of URPs defined in solrconfig.xml
        – Add, edit, remove fields
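The idea of a processor chain that adds, edits, and removes fields can be modeled in a few lines. This is a conceptual sketch only: real UpdateRequestProcessors are Java classes configured in Solr, and every function and field name below is hypothetical.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Processor = Callable[[Doc], Doc]


def add_language(doc: Doc) -> Doc:
    """Add a field. The fixed value stands in for a real language identifier."""
    return {**doc, "language": "en"}


def normalize_title(doc: Doc) -> Doc:
    """Edit a field: trim surrounding whitespace from the title."""
    return {**doc, "title": str(doc.get("title", "")).strip()}


def drop_raw_html(doc: Doc) -> Doc:
    """Remove a field that should not reach the index."""
    return {k: v for k, v in doc.items() if k != "raw_html"}


def run_chain(doc: Doc, chain: List[Processor]) -> Doc:
    """Pass the document through each processor in order."""
    for processor in chain:
        doc = processor(doc)
    return doc
```

Each step sees the output of the previous one, so order matters: a processor that depends on the detected language has to come after the one that adds it.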
  11. Content Analysis: How
      • Custom distributed processing pipeline
        – Complexity of components
        – Number of components
        – Some components require their own data storage
      • Solr Indexing is the final processing step
  12. Content Analysis Details
      • Language Identification for:
        – Indexing
        – Faceting/Searching
        – Entity Extraction
      • Language-specific indexing for:
        – Improved recall with high precision
      • Entity Extraction for:
        – Faceting
        – Entity search
        – Input to relationship extraction
  13. Language Identification
      • Detect dominant language
      • Find language regions
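The two tasks on this slide, a dominant label for the whole text and a split into regions, can be illustrated with a deliberately crude stand-in: Unicode script detection. Script is a much coarser signal than language (it cannot tell English from German), and this sketch makes no claim about how a real language identifier works. All function names are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch: str):
    """Return the first word of the character's Unicode name, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text: str):
    """Return the script that covers the most letters, or None if there are no letters."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


def script_regions(text: str) -> list:
    """Split text into (script, substring) runs; non-letters stay with the current run."""
    regions, current, buf = [], None, []
    for ch in text:
        script = script_of(ch)
        if script is not None and current is not None and script != current:
            regions.append((current, "".join(buf)))
            buf = []
        if script is not None:
            current = script
        buf.append(ch)
    if buf:
        regions.append((current, "".join(buf)))
    return regions
```

For example, `script_regions("Hello мир")` yields one LATIN run followed by one CYRILLIC run, while `dominant_script` reports LATIN because it covers more letters.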
  14. Language-Specific Indexing
      • Every language has unique challenges:
        – Tokenization
          • Morphological Analysis vs. N-Gram
        – Stemming vs. Lemmatization
          • All European and Middle Eastern languages
        – Compound words
          • Swedish, Danish, Norwegian, Dutch, German
  15. Morphological Analysis vs. N-Gram
      • Search Term: 東京 ルパン上映時間
      • N-Gram:
      • Morphological:
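To make the N-gram side of this comparison concrete, here is a minimal sketch of character bigram tokenization applied to the search term on the slide. It is an illustration of the general technique, not the output shown in the original slide image, and `char_ngrams` is a hypothetical helper. Morphological analysis needs a dictionary and is not sketched here.

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Split each whitespace-separated chunk into overlapping character n-grams."""
    tokens = []
    for chunk in text.split():
        if len(chunk) <= n:
            # Chunks no longer than n are kept whole.
            tokens.append(chunk)
        else:
            tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


print(char_ngrams("東京 ルパン上映時間"))
# ['東京', 'ルパ', 'パン', 'ン上', '上映', '映時', '時間']
```

Several of these bigrams straddle word boundaries (ン上, 映時), and パン on its own is the word for bread, so a bigram index can match documents unrelated to the query. That is the precision cost that morphological tokenization aims to avoid.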
  16. Stemming vs. Lemmatization
      • Stemming:
        – Set of rules for removing characters from words
        – Increased recall at the expense of precision
        – Example EN rule: Remove trailing “ing” or “al”
      • Lemmatization:
        – Complex set of approaches for producing the dictionary form of a word
        – Increased recall without hurting precision
        – Uses context to disambiguate candidates
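The example rule above (remove trailing “ing” or “al”) is easy to try out. The sketch below applies exactly that rule and contrasts it with a tiny dictionary lookup. The lookup table is hypothetical and only stands in for lemmatization; a real lemmatizer also uses context, which this does not attempt.

```python
SUFFIXES = ("ing", "al")  # the example English rule from the slide

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"spoken": "speak", "conferences": "conference"}


def naive_stem(word: str) -> str:
    """Strip a trailing 'ing' or 'al' if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def lookup_lemma(word: str) -> str:
    """Return the dictionary form if the word is in the table, else the word itself."""
    return LEMMAS.get(word, word)


print(naive_stem("speaking"))   # speak
print(naive_stem("several"))    # sever
print(naive_stem("spoken"))     # spoken
print(lookup_lemma("spoken"))   # speak
```

The rule conflates “several” with “sever” (a precision loss) and leaves “spoken” untouched (a recall loss), while the dictionary form maps “spoken” to “speak” without touching “several”.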
  17. Stemming vs. Lemmatization
      • English: “I have spoken at several conferences”
      • Stemming:
      • Lemmatization:
  18. Stemming vs. Lemmatization
      • German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
      • Stemming:
      • Lemmatization (and decompounding!)
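Decompounding can be illustrated with a small dictionary-based splitter applied to “Samstagmorgen” from the sentence above. This is a sketch under simplifying assumptions: the lexicon is a hypothetical three-word set, linking elements such as the German Fugen-s are ignored, and the first full split found (longest head first) is returned without any ranking.

```python
def decompound(word: str, lexicon: set) -> list:
    """Split a word into lexicon entries if it can be covered completely.

    Returns the lowercased parts, or [word] unchanged if no full split exists.
    """
    def split(rest: str):
        if not rest:
            return []
        # Try the longest possible head first.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word.lower())
    return parts if parts else [word]


lexicon = {"samstag", "morgen", "tag"}  # hypothetical mini-lexicon
print(decompound("Samstagmorgen", lexicon))  # ['samstag', 'morgen']
print(decompound("Boston", lexicon))         # ['Boston']
```

Indexing the parts alongside the compound lets a query for “Samstag” or “Morgen” match the document, which a suffix-stripping stemmer alone would not achieve.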
  19. Stemming and Lemmatization Challenges
      • Can I index text from many languages into the same field?
        – Yes, but it’s not always a good idea!
      • Query language ID is not accurate.
        – You need a custom Query Analyzer that does stemming/lemmatization in many languages for the same query.
      • How do I query text in multiple fields?
        – The DisMax parser lets you specify multiple fields to search.
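As a rough illustration of the multi-field approach, the sketch below builds the query string for a DisMax request that searches one field per language. `defType` and `qf` are standard Solr request parameters; the field names (`text_en`, `text_de`, `text_ja`) and the helper function are assumptions, since the slides do not show the actual schema.

```python
from urllib.parse import urlencode


def dismax_params(query: str, fields: list) -> str:
    """Build a URL query string for a DisMax search across several fields."""
    return urlencode({
        "defType": "dismax",     # select the DisMax query parser
        "q": query,
        "qf": " ".join(fields),  # query fields, space-separated
    })


# Hypothetical per-language fields, one per analysis chain.
print(dismax_params("boston", ["text_en", "text_de", "text_ja"]))
# defType=dismax&q=boston&qf=text_en+text_de+text_ja
```

With one field per language, each field can use its own analyzer at index time, and the same user query is matched against all of them without needing to identify the query language first.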
  20. Entity Extraction
      • Process of identifying people, places, organizations, dates, times, etc. in unstructured text.
      • Methods:
        – List-based
        – Rules-based
        – Statistical-based
      • Define your goals upfront!
        – Some extraction methods work better for certain entity types
          • Rules work well for dates, email addresses, and URLs, but not people
          • Lists work well for titles, but not locations
          • Statistical extractors work well for ambiguous entities like people, locations, organizations
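The rules-based and list-based methods can be sketched with regular expressions and a small lookup list. The patterns and the title list below are simplified assumptions for illustration (for instance, the date rule only covers the “Month D, YYYY” form), and statistical extraction for people, locations, and organizations is not attempted.

```python
import re

# Rules: simplified patterns, not production-grade validators.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)

# List: hypothetical set of job titles.
TITLES = ("Product Manager", "Chief Executive Officer")


def extract_entities(text: str) -> list:
    """Return (label, value) pairs found by the rules and the title list."""
    entities = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("URL", URL_RE), ("DATE", DATE_RE)):
        entities.extend((label, m.group(0)) for m in pattern.finditer(text))
    lowered = text.lower()
    for title in TITLES:
        if title.lower() in lowered:
            entities.append(("TITLE", title))
    return entities
```

Neither method can recognize a person name it has never seen, which is why the slide recommends statistical extractors for ambiguous entity types.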
  21. Entity Extraction Example
  22. User Interface
      • Google-style search results aren’t enough!
      • Design UI around workflow and/or
      • Design for Exploration
  23. Dashboard/Summary
  24. Faceting
  25. Link Analysis
  26. Details
      • Data Acquisition: Curn RSS Aggregator
      • Analysis:
        – Basis Technology:
          • Language ID
          • Relationship Extraction
          • Search Enablement
          • Entity Search
          • Entity Extraction
      • Indexing: Solr
      • UI:
        – JSP
        – JavaScript InfoViz Toolkit (theJit.org)
        – Yahoo UI (YUI)
        – gRaphaël (g.raphaeljs.com)
  27. Architecture
      [Architecture diagram. Components: CURN (RSS Harvester); Rosette Analysis Components (Lang ID, Entity Extraction, Relationship Extraction); Document Classification and Clustering; Indexing / Query Service; Name Indexer; Solr; MySQL (Long Term Datastore); User Interface]
  28. Demo
      • Listening Platform built on Solr
      • I built this version in 3 months using Solr and products from Basis Technology
      • I would be happy to show you the Solr config and let you try it out
