Solr At AOL, Presented by Sean Timm at SolrExchange DC

  1. Solr and Lucene @ AOL
     SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
  2. 1999
     • Believe, Cher and Livin’ la Vida Loca, Ricky Martin
     • The Matrix and The Phantom Menace
     • Windows 98 Second Edition
     • AltaVista, Northern Light, Yahoo, ODP, Inktomi
       – Google
     • PPC Text search ads invented 1998
       – Banner ads
  3. A Brief History of Search @ AOL
     • Acquired PLS in 1998
     • AOL Search used ODP
     • Site Search
     • Local Search
     • Built into AOL Server
     • CPL
       – VSM then BM25
       – Phrase, numeric, date, text, and proximity boosting
       – Conflation classes (like synonyms)
  4. Relevance
     • Precision/recall
     • “free alcohol” vs. “alcohol free”
     • Lawyer versus Attorney
     • Iron and ironic → same stem (Porter)
     • Beyonce vs. Beyoncé
     • Eagles
       – Bird, sports teams, band, AMC Eagle
     • F 15, F-15, F15
     • FREAK
     (Diagram labels: Relevant, Retrieved)
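The precision/recall bullet on the Relevance slide can be made concrete with a small calculation. The following is a minimal sketch using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The function name and the document IDs are hypothetical and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned.
    relevant:  IDs a human judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only: four results returned, three documents judged relevant.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four returned documents are relevant, and two of the three relevant documents were found, so tightening a query tends to raise the first number at the expense of the second.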
  5. The Dawn of Solr
     • Prohibitively expensive to continue CPL development
     • Complicated deployment
     • 2005: Investigating migration to Lucene
     • 2006: CNET open sourced Solr
  6. Contributions
     • Local Lucene/Solr (superseded by SpatialSearch)
     • Query Timeout
     • Data Import Handler (DIH)
     • Numerous smaller patches
     • Committers: Noble Paul, Shalin Mangar, Patrick O’Leary
  7. Contributing to Solr/Lucene
     • Learn
       – Join the mailing lists
         • solr-user@lucene.apache.org
         • dev@lucene.apache.org
       – Read search and Solr related blogs
       – The #solr IRC channel on freenode
  8. Contributing to Solr/Lucene
     • Help others
       – Answer questions.
       – Improve documentation in the code, the wiki, or the website.
       – Make improvements to the Solr Admin UI.
  9. Contributing to Solr/Lucene
     • Confirm a bug
     • Submit a patch for a reported bug or feature request
     • Improve a patch
     • Try out a patch and see if it works
  10. Contributing to Solr/Lucene
     • Submit your own tickets
       – Bug
       – Feature request
     • Start with solr-user@lucene
     • Discuss on dev@lucene
     • Create Jira ticket, ideally with patches and unit tests
     • Yonik’s Law of Patches:
       – A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.
  11. Applications
     • MapQuest (SpatialSearch)
     • Mail
     • AIM
     • AOL Search
     • Site Search
     • News Search
     • RUM
     • Sarah Palin e-mails (admin)
     • Demand
     • Wikipedia article pattern detection
  12. MapQuest Discover
  13. Travel Blogs
  14. MQ Local Search
  15. Related Searches
  16. Bipartite graph snippet
  17. Related Searches Graph
     (Diagram labels: “The Eagles”, The band, NFL, Boston College, Hotel California, Tribute)
  18. Related Searches
     • Simple query
       – User
         • New York Library
       – Solr query
         • Lower case
         • Prefer exact match “new york library”
         • Use phrase slop to allow terms in same order and near each other, e.g., new york city public library
         • primeQuery:“new york library” OR “new york library”~3
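The query-building steps on the Related Searches slide (lower-case the input, prefer an exact phrase, then allow a sloppy phrase) could be sketched roughly as follows. The field name `primeQuery` and the slop value 3 come from the slide. The function name, the escaping, and the choice to prefix both clauses with the field are assumptions made for illustration, not the presenter's code.

```python
def build_related_query(user_query, field="primeQuery", slop=3):
    """Build an 'exact phrase OR sloppy phrase' Solr query string."""
    # Lower-case and collapse runs of whitespace.
    normalized = " ".join(user_query.lower().split())
    # Escape backslashes and double quotes so the phrase stays well-formed.
    escaped = normalized.replace("\\", "\\\\").replace('"', '\\"')
    exact = f'{field}:"{escaped}"'
    sloppy = f'{field}:"{escaped}"~{slop}'
    # A document matching both clauses scores higher than one matching only
    # the sloppy clause, which is how the exact match ends up preferred.
    return f"{exact} OR {sloppy}"


print(build_related_query("New York Library"))
```

With a slop of 3, a phrase query for new york library can still match text such as new york city public library, where two extra terms sit between "york" and "library".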
  19. Wikipedia Traffic Correlation Schema
     <field name="title" type="string" indexed="true" stored="true" required="true" />
     <field name="title_norm" type="string" indexed="true" stored="true" required="true" />
     <field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
     <!-- Dynamic field definitions. If a field name is not found, dynamicFields
          will be used if the name matches any of the patterns.
          RESTRICTION: the glob-like pattern in the name attribute must have
          a "*" only at the start or the end.
          EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
          Longer patterns will be matched first. if equal size patterns
          both match, the first appearing in the schema will be used. -->
     <!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
     <dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
     <!-- page views. field name contains date string, e.g., "pvs_20110622" -->
     <dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
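To show how the dynamic fields in this schema might be populated, here is a minimal sketch that builds one document with per-day `pvs_YYYYMMDD` and `trend_YYYYMMDD` fields. The slides do not say how `trend_*` or `title_norm` are computed, so the sign of the day-over-day change and simple lower-casing are assumptions. The function name and the sample numbers are hypothetical.

```python
from datetime import date


def build_traffic_doc(title, daily_views):
    """Build a dict shaped like the schema's fields.

    daily_views maps datetime.date -> page views for that day.
    """
    doc = {
        "title": title,
        "title_norm": title.lower(),  # assumption: normalization is lower-casing
        "total_pvs": sum(daily_views.values()),
    }
    previous = None
    for day in sorted(daily_views):
        views = daily_views[day]
        stamp = day.strftime("%Y%m%d")
        doc[f"pvs_{stamp}"] = views  # matched by the pvs_* dynamic field
        if previous is not None:
            # Assumption: trend is -1, 0, or 1 for falling, flat, or rising views.
            doc[f"trend_{stamp}"] = (views > previous) - (views < previous)
        previous = views
    return doc


# Illustrative numbers only.
doc = build_traffic_doc(
    "Example Page",
    {date(2011, 6, 21): 100, date(2011, 6, 22): 250},
)
print(doc)
```

Because the date is part of the field name, each new day adds fields without a schema change, which is the main reason to reach for dynamic fields in a design like this.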
  20. Temporal Traffic Correlation of Wikipedia Page Views
  21. Sarah Palin E-mail Stats
     • 13,177 documents
     • 4 hours from receiving data to production install
     • ~150 K requests per day at launch
     • Now about 6-7 K requests per day
     • Running on 3 VMs in two different data centers behind a NetScaler
  22. Faceting and Clustering
  23. Huffington Post Comments
     • Solr 4
     • Uses Solr Cloud
     • Single shard
     • ReplicationFactor 3
     • Real-time
     • 90 days of comments
     • Tested up to 100 writes / second
  24. More HuffPost comments
     • Used by editors and moderators
       – Topic investigation
       – Troll detection
     • Config
       – Special features: search for emoticons, prefer exact match, date boosting
     • Hack-a-thon comment clustering, timeline, and summarization
  25. Solr Comments Architecture
     (Diagram labels: Message Queue, MongoDB, Mongo Ingestor, Solr Ingestor, Solr Cloud, Tools Server, JuLiA)
     Uses SolrJ CloudSolrServer
  26. Relevance in Solr
     • “free alcohol” vs. “alcohol free”
       – Phrase queries and phrase slop
     • Lawyer versus Attorney
       – SynonymFilterFactory
     • Iron and ironic
       – Kstem, or Lemmatization via the SynonymFilterFactory instead of Snowball/Porter
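The first two problems on this slide can be illustrated with a toy example outside Solr. The sketch below shows why matching on unordered terms cannot tell “free alcohol” from “alcohol free” while an ordered phrase match can, and what synonym expansion does conceptually. It is plain Python for illustration only and does not reproduce Solr's phrase queries or `SynonymFilterFactory`. All names and the synonym table are hypothetical.

```python
def tokens(text):
    """Lower-case and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def bag_of_words_match(query, doc):
    """True if every query term appears somewhere in the document."""
    return set(tokens(query)) <= set(tokens(doc))


def exact_phrase_match(query, doc):
    """True if the query terms appear adjacent and in order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical synonym table: each term maps to its full equivalence set.
SYNONYMS = {
    "lawyer": {"lawyer", "attorney"},
    "attorney": {"lawyer", "attorney"},
}


def expand_synonyms(query):
    """Replace each term with the set of terms treated as equivalent."""
    return [SYNONYMS.get(term, {term}) for term in tokens(query)]


doc = "alcohol free beer"
print(bag_of_words_match("free alcohol", doc))  # terms present, order ignored
print(exact_phrase_match("free alcohol", doc))  # order matters
print(expand_synonyms("Lawyer fees"))
```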
  27. Relevance in Solr
     • Beyonce vs. Beyoncé
       – Various Folding Filters
     • Eagles
       – Boost on other fields, such as popularity, publish date
       – Use related searches, facets, or clustering
     • F 15, F-15, F15
       – WordDelimiterFilter
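The folding and delimiter cases on this slide amount to normalizing variants to one indexed form. Here is a rough sketch of the idea using only the Python standard library. It is not Solr's folding filters or `WordDelimiterFilter`, which handle many more cases. The function names are hypothetical.

```python
import re
import unicodedata


def fold_accents(text):
    """Strip combining marks so accented letters match their plain forms."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def join_letter_digit(text):
    """Drop spaces or hyphens between a letter and a following digit, then lower-case."""
    return re.sub(r"(?<=[A-Za-z])[\s\-]+(?=\d)", "", text).lower()


print(fold_accents("Beyonc\u00e9"))
for variant in ("F 15", "F-15", "F15"):
    print(variant, "->", join_letter_digit(variant))
```

Applying the same normalization at index time and at query time is what lets all three spellings of the aircraft name retrieve the same documents.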
  28. Bringing a New Search Project Online
     • Understand the domain
     • Ingest (sample) data
     • Clean data
     • Repeat
     • Relevance testing
     • Scale out
     • Launch/Success