2010 10-building-global-listening-platform-with-solr

Building a Global Listening
Platform with Solr
Steve Kearns
Rosette Product Manager
Basis Technology
October 7, 2010

Monday, October 04, 2010

2

Agenda
• Agenda
• Who Am I?
• What is a “Listening Platform”?
• Challenges (Technical & Global)
• Details
• Demonstration

3

About me
• Product Manager at Basis Technology
– Rosette linguistics platform
• Language ID
• Language Support for Search
• Entity Extraction
• Entity Translation/Search
•…
• Related history
– Media Monitoring at BBN Technologies
• Video, Web content extraction: STT, MT, Search

4

What is a Listening Platform?
• Content aggregator for online media
• Targets:
– Social/Brand monitoring
– Government OSINT
• Functions:
– Content acquisition
– Content analysis
– Search indexing
– Search (UI)
– Visualization

5

Content Acquisition
• What:
– News Articles
– Social Media
• How:
– Web Crawler
• Nutch!
– RSS feed reader/aggregator
• ROME/Curn
– Pay a 3rd party aggregator
• Good option, if you can afford it.

6

Content Acquisition Problems
• Web Crawling
– Content Extraction!
• BoilerPipe, Readability
– Crawl History
• Duplicate detection
• Article updates?
– Crawl Control
• Crawl Depth?
• Crawl Restrictions?

• Per-site configuration doesn’t scale.
7

Content Acquisition: How
• CURN – Customizable Utilitarian RSS Notifier

[Flow diagram: CURN reads the RSS Feed List and fetches each RSS Feed, then checks CURN History: "New story?" If No, End. If Yes, the Solr Output Plug-In runs: Download Story URL, Extract Content (BoilerPipe || Readability), Create Solr Message, send to Solr.]
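
The harvesting flow on this slide can be sketched roughly as follows. This is a minimal Python illustration, not CURN's actual plug-in API (CURN itself is a Java tool): `process_feed`, `extract_content`, and the `id`/`title`/`body` field names are hypothetical, and the extractor is a stand-in for BoilerPipe or Readability. The only real format assumed is Solr's XML `<add><doc><field name="...">` update message.

```python
import xml.etree.ElementTree as ET


def extract_content(html):
    """Stand-in for BoilerPipe or Readability (assumption): returns input unchanged."""
    return html


def build_solr_add(doc):
    """Serialize a dict of field name -> value as a Solr XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def process_feed(items, history, download):
    """Build Solr messages for stories not yet in the crawl history.

    items: iterable of (url, title) pairs from one RSS feed
    history: set of URLs already processed (updated in place)
    download: callable mapping a URL to the page HTML
    """
    messages = []
    for url, title in items:
        if url in history:  # "New story?" -> No -> End
            continue
        history.add(url)
        body = extract_content(download(url))
        messages.append(build_solr_add({"id": url, "title": title, "body": body}))
    return messages
```

Passing the downloader in as a callable keeps the sketch testable without network access; a real harvester would also need error handling and some policy for article updates.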

8

What is a Listening Platform?
• Content aggregator for online media
• Targets:
– Social/Brand monitoring
– Government OSINT
• Functions:
– Content acquisition
– Content analysis
– Search indexing
– Other visualization

9

Content Analysis
• Language Identification
• Relationship Extraction
• Classification
• Near-Duplicate Detection
• Story Tracking
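
As one illustration of near-duplicate detection, the sketch below compares documents by the overlap of their word shingles (Jaccard similarity). This is a commonly used technique chosen here for brevity; the talk does not say which method the platform uses, and the shingle size and any threshold are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Shingle-based similarity between two texts, from 0.0 to 1.0."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Two stories whose similarity exceeds a chosen threshold would be treated as near-duplicates; the right threshold depends on the content and has to be tuned.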

10

Content Analysis: How?
• Preprocessor to Solr
– Custom
– OpenPipeline
• Solr UpdateRequestProcessors
– Chain of URPs defined in solrconfig.xml
– Add, edit, remove fields
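
The idea of a processor chain can be sketched as below. Real Solr UpdateRequestProcessors are Java classes configured in solrconfig.xml; this Python version is only a conceptual analogue, and the three processors and their field names are hypothetical.

```python
def add_language(doc):
    """Add a field. A real processor would call a language identifier here."""
    doc["language"] = "en"  # hard-coded for illustration only
    return doc


def trim_title(doc):
    """Edit a field: strip surrounding whitespace from the title."""
    doc["title"] = doc["title"].strip()
    return doc


def drop_raw_html(doc):
    """Remove a field that should not be indexed."""
    doc.pop("raw_html", None)
    return doc


def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


CHAIN = [add_language, trim_title, drop_raw_html]
```

Each processor sees the document as left by the previous one, so order matters: a processor that reads the `language` field must come after the one that sets it.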

11

Content Analysis: How
• Custom distributed processing pipeline
– Complexity of components
– Number of components
– Some components require their own data storage

• Solr Indexing is the final processing step

12

Content Analysis Details
• Language Identification for:
– Indexing
– Faceting/Searching
– Entity Extraction
• Language-specific indexing for:
– Improved recall with high precision
• Entity Extraction for:
– Faceting
– Entity search
– Input to relationship extraction

13

Language Identification
• Detect dominant language
• Find language regions
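
The difference between the two tasks can be illustrated with a deliberately crude stand-in: the sketch below labels characters by Unicode script rather than by language. Script is not language (English and German share a script), and production language identifiers rely on statistical models; this is only meant to show what "dominant" versus "regions" means. All function names are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label: the first word of the character's Unicode name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text):
    """Return the script label covering the most letters, or None."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


def script_regions(text):
    """Split text into consecutive runs of letters sharing one script label."""
    regions = []
    for ch in text:
        label = script_of(ch)
        if label is None:
            # Spaces, digits, and punctuation stay with the current region.
            if regions:
                regions[-1][1] += ch
        elif regions and regions[-1][0] == label:
            regions[-1][1] += ch
        else:
            regions.append([label, ch])
    return [(label, chunk.strip()) for label, chunk in regions]
```

For a string such as "Hello world привет мир", the first function gives a single label for the whole text, while the second returns one region per run.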

14

Language-Specific Indexing
• Every language has unique challenges:
– Tokenization
• Morphological Analysis vs. N-Gram
– Stemming vs. Lemmatization
• All European and Middle Eastern languages
– Compound words
• Swedish, Danish, Norwegian, Dutch, German

15

Morphological Analysis vs. N-Gram
• Search Term: 東京ルパン上映時間
• N-Gram:

• Morphological:
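
As a rough illustration of the n-gram side of this comparison, the sketch below produces overlapping character n-grams for the search term, assuming bigrams (n = 2). A morphological tokenizer needs a dictionary and language model, so it is not reproduced here.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


terms = char_ngrams("東京ルパン上映時間")
# terms == ["東京", "京ル", "ルパ", "パン", "ン上", "上映", "映時", "時間"]
```

Several of these bigrams straddle word boundaries (京ル, ン上), and パン is itself the word for "bread", so an n-gram index can match queries that have nothing to do with the original term.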

16

Stemming vs. Lemmatization
• Stemming:
– Set of rules for removing characters from words
– Increased recall at the expense of precision
– Example EN rule: Remove trailing “ing” or “al”

• Lemmatization:
– Complex set of approaches for producing the
dictionary form of a word
– Increased recall without hurting precision
– Uses context to disambiguate candidates
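
The precision cost of stemming can be seen with a toy implementation of the example rule above. This is a minimal sketch, not a real stemmer; the minimum-length guard and the tiny `LEMMAS` table are assumptions made for illustration, and a real lemmatizer uses dictionaries plus context rather than a fixed lookup.

```python
def naive_stem(word):
    """Strip a trailing "ing" or "al", leaving at least three characters."""
    for suffix in ("ing", "al"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical dictionary standing in for a real lemmatizer.
LEMMAS = {"spoken": "speak", "speaking": "speak", "conferences": "conference"}


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)
```

The rule turns "speaking" into "speak", but it also turns "several" into "sever", an unrelated word, and it leaves "spoken" untouched. The lookup maps both "spoken" and "speaking" to "speak" and leaves "several" alone.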

17

• English: “I have spoken at several conferences”
• Stemming:

• Lemmatization:

18

• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
• Stemming:

• Lemmatization (and decompounding!)
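
Decompounding can be sketched as a dictionary-driven split. The version below looks for the split of a word into the fewest lexicon entries; the four-word `LEXICON` is hypothetical, and real German decompounding also has to handle linking elements and ambiguous splits, which this sketch ignores.

```python
# Hypothetical tiny lexicon; a real system would use a full dictionary.
LEXICON = {"samstag", "morgen", "sams", "tag"}


def decompound(word, lexicon=LEXICON):
    """Split a word into the fewest lexicon entries; return [word] if none fits."""
    word = word.lower()
    best = {0: []}  # best[i] holds the shortest split found for word[:i]
    for end in range(1, len(word) + 1):
        for start in range(end):
            part = word[start:end]
            if start in best and part in lexicon:
                candidate = best[start] + [part]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(word), [word])
```

With this lexicon, "Samstagmorgen" splits into "samstag" and "morgen" rather than "sams", "tag", "morgen", because the shorter split wins. Indexing the parts lets a query for "Morgen" or "Samstag" find the sentence.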

19

Stemming and Lemmatization Challenges
• Can I index text from many languages into
the same field?
– Yes, but it’s not always a good idea!
• Query language ID is not accurate.
– You need a custom Query Analyzer that does
stemming/lemmatization in many languages for
the same query.

• How do I query text in multiple fields?
– Dismax parser allows you to specify multiple
fields to search.
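
One way to combine these two points is to index each document into a language-specific field and let DisMax search all of those fields at once. The sketch below assumes hypothetical field names (`text_en`, `text_de`, `text_ja`, `text_general`) that would have to exist in the Solr schema; `defType` and `qf` are the standard Solr request parameters for selecting the parser and listing its query fields.

```python
from urllib.parse import urlencode

# Hypothetical mapping from detected language to a language-specific field.
LANG_FIELDS = {"en": "text_en", "de": "text_de", "ja": "text_ja"}


def to_solr_doc(doc_id, text, lang):
    """Place the text in the field that matches the document's language."""
    field = LANG_FIELDS.get(lang, "text_general")
    return {"id": doc_id, "language": lang, field: text}


def dismax_params(user_query):
    """Build a query string that searches every language-specific field."""
    return urlencode({
        "q": user_query,
        "defType": "dismax",
        "qf": " ".join(LANG_FIELDS.values()),
    })
```

Because each field has its own analyzer, the same query text is analyzed once per language, which sidesteps the need to identify the language of a short query.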

20

Entity Extraction
• Process of identifying people, places, organizations, dates,
times, etc. in unstructured text.
• Methods:
– List-based
– Rules-based
– Statistical-based

• Define your goals upfront!
– Some extraction methods work better for certain entity types
• Rules work well for dates, email addresses, and URLs, but not people
• Lists work well for titles, but not locations
• Statistical extractors work well for ambiguous entities like people,
locations, organizations
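
The rule-based and list-based methods are simple enough to sketch. The patterns and the `TITLES` list below are deliberately simplified assumptions (the date rule covers only one English format), and no statistical extractor is shown, since that requires a trained model.

```python
import re

# Rule-based: regular expressions for well-structured entity types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"]+")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)

# List-based: a hypothetical gazetteer of job titles.
TITLES = ("Product Manager", "President", "Professor")


def extract_entities(text):
    """Return (label, matched text) pairs found by the rules and the list."""
    entities = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("URL", URL_RE), ("DATE", DATE_RE)):
        entities += [(label, m.group()) for m in pattern.finditer(text)]
    entities += [("TITLE", title) for title in TITLES if title in text]
    return entities
```

Neither method would find the person name in "Steve Kearns, Product Manager": no rule describes what a name looks like, and no list can enumerate every name, which is where statistical extractors come in.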

21

Entity Extraction Example

22

User Interface
• Google-style search results aren’t enough!

• Design UI around workflow
and/or
• Design for Exploration

23

Dashboard/Summary

24

Details
• Data Acquisition: Curn RSS Aggregator
• Analysis:
– Basis Technology:
• Language ID
• Search Enablement
• Relationship Extraction
• Entity Search
• Indexing: Solr
• UI:
– JSP
– Yahoo UI (YUI)
– Javascript InfoViz Toolkit (theJit.org)
– gRaphaël (g.raphaeljs.com)

27

Architecture

[Architecture diagram; components shown:]
• CURN (RSS Harvester)
• Rosette Analysis Components
– Lang ID
– Relationship Extraction
• Document Classification and Clustering
• Indexing / Query Service
• User Interface
• MySQL (Long Term Datastore)
• Name Indexer
• Solr

28

Demo
• Listening Platform built on Solr

• I built this version in 3 months using Solr
and products from Basis Technology

• I would be happy to show you the Solr
config and let you try it out

29
