Solr and Lucene @ AOL
SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
1999
• Believe, Cher and Livin’ la Vida Loca, Ricky Martin
• The Matrix and The Phantom Menace
• Windows 98 Second Edition
• AltaVista, Northern Light, Yahoo, ODP, Inktomi
– Google
• PPC Text search ads invented 1998
– Banner ads
A Brief History of Search @ AOL
• Acquired PLS in 1998
• AOL Search used ODP
• Site Search
• Local Search
• Built into AOL Server
• CPL
– VSM then BM25
– Phrase, numeric, date, text, and
proximity boosting
– Conflation classes (like synonyms)
Relevance
• Precision/recall
• “free alcohol” vs. “alcohol free”
• Lawyer versus Attorney
• Iron and ironic → same stem (Porter)
• Beyonce vs. Beyoncé
• Eagles
–Bird, sports teams, band, AMC Eagle
• F 15, F-15, F15
• FREAK
[Figure: relevant vs. retrieved documents]
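The precision/recall bullet can be made concrete with a small sketch. It assumes relevance judgments are available as sets of document ids; the function name and the ids are made up for illustration and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids the engine returned
    relevant:  iterable of document ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for one query (ids are invented).
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that were retrieved.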
The Dawn of Solr
• Prohibitively expensive to continue CPL development
• Complicated deployment
• 2005: Investigating migration to Lucene
• 2006: CNET open sourced Solr
Contributions
• Local Lucene/Solr (superseded by SpatialSearch)
• Query Timeout
• Data Import Handler (DIH)
• Numerous smaller patches
• Committers: Noble Paul, Shalin Mangar, Patrick
O’Leary
Contributing to Solr/Lucene
• Learn
–Join the mailing lists
•solr-user@lucene.apache.org
•dev@lucene.apache.org
–Read search and Solr related blogs
–The #solr IRC channel on freenode
Contributing to Solr/Lucene
• Help others
–Answer questions.
–Improve documentation in the code, the wiki, or
the website.
–Make improvements to the Solr Admin UI.
Contributing to Solr/Lucene
• Confirm a bug
• Submit a patch for a reported bug or feature
request
• Improve a patch
• Try out a patch and see if it works
Contributing to Solr/Lucene
• Submit your own tickets
– Bug
– Feature request
• Start with solr-user@lucene
• Discuss on dev@lucene
• Create Jira ticket, ideally with patches and unit tests
• Yonik’s Law of Patches:
– A half-baked patch in Jira, with no documentation, no tests, and no
backwards compatibility is better than no patch at all.
Applications
• MapQuest (SpatialSearch)
• Mail
• AIM
• AOL Search
• Site Search
• News Search
• RUM
• Sarah Palin e-mails (admin)
• Demand
• Wikipedia article pattern detection
MapQuest Discover
Travel Blogs
MQ Local Search
Related Searches
Bipartite graph snippet
Related Searches Graph
“The Eagles”
The band
NFL
Boston College
Hotel California
Tribute
Related Searches
• Simple query
– User
• New York Library
– Solr query
• Lower case
• Prefer exact match “new york library”
• Use phrase slop to allow terms in same order and near each
other, e.g., new york city public library
• primeQuery:"new york library" OR "new york library"~3
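A minimal sketch of how the query above might be assembled from user input. The field name primeQuery and the slop of 3 come from the slide; the function name is hypothetical, and qualifying both clauses with the field is an assumption (the slide leaves the second phrase unqualified).

```python
def related_search_query(user_query, field="primeQuery", slop=3):
    """Build a Solr query matching the exact phrase or a sloppy phrase."""
    # Lower-case and collapse whitespace, as on the slide.
    phrase = " ".join(user_query.lower().split())
    # Escape characters that would end or break a quoted phrase.
    phrase = phrase.replace("\\", "\\\\").replace('"', '\\"')
    exact = f'{field}:"{phrase}"'
    sloppy = f'{field}:"{phrase}"~{slop}'
    return f"{exact} OR {sloppy}"


print(related_search_query("New York Library"))
```

A document containing the exact phrase satisfies both clauses, so it would normally score higher than one that only matches the sloppy clause.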
Wikipedia Traffic Correlation Schema
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
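To show how a document might populate these fields, here is a rough sketch. The field names follow the schema above; the normalization used for title_norm, the +1/0/-1 encoding of trend direction, and the page-view numbers are assumptions made for illustration.

```python
from datetime import date


def build_doc(title, daily_views):
    """Map {date: page views} onto title/total_pvs/pvs_*/trend_* fields."""
    doc = {
        "title": title,
        # Assumption: title_norm is the title lower-cased with "_" as spaces.
        "title_norm": title.replace("_", " ").lower(),
        "total_pvs": sum(daily_views.values()),
    }
    previous = None
    for day in sorted(daily_views):
        views = daily_views[day]
        suffix = day.strftime("%Y%m%d")  # e.g. "20110622"
        doc[f"pvs_{suffix}"] = views
        if previous is not None:
            # Assumption: trend is +1, 0 or -1 versus the previous day.
            doc[f"trend_{suffix}"] = (views > previous) - (views < previous)
        previous = views
    return doc


# Invented page-view counts for two consecutive days.
doc = build_doc("Sarah_Palin", {date(2011, 6, 21): 900, date(2011, 6, 22): 1500})
print(doc)
```

Because the date is part of the field name, each new day adds fields to a document without any schema change; the dynamicField patterns pick them up.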
Temporal Traffic Correlation of Wikipedia Page
Views
Sarah Palin E-mail Stats
• 13,177 documents
• 4 hours from receiving data to production install
• ~150 K requests per day at launch
• Now about 6-7 K requests per day
• Running on 3 VMs in two different data centers
behind a NetScaler
Faceting and Clustering
Huffington Post Comments
• Solr 4
• Uses Solr Cloud
• Single shard
• ReplicationFactor 3
• Real-time
• 90 days of comments
• Tested up to 100 writes / second
More HuffPost comments
• Used by editors and moderators
–Topic investigation
–Troll detection
• Config
–Special features: search for emoticons, prefer
exact match, date boosting
• Hack-a-thon comment clustering, timeline, and
summarization
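The slide does not say how date boosting was configured. One common pattern from the Solr function query documentation is recip(ms(NOW,<date field>),3.16e-11,1,1), where recip(x,m,a,b) is a/(m*x+b). The sketch below only reproduces that arithmetic so the decay is visible; it is an assumption about the approach, not the HuffPost configuration, and date_boost is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def recip(x, m, a, b):
    """Same arithmetic as Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(published, now):
    """Boost that decays with age, mirroring recip(ms(NOW,<date>),3.16e-11,1,1)."""
    age_ms = (now - published).total_seconds() * 1000.0
    return recip(age_ms, 3.16e-11, 1.0, 1.0)


now = datetime(2014, 6, 1, tzinfo=timezone.utc)
print(date_boost(now, now))                        # brand-new comment
print(date_boost(now - timedelta(days=365), now))  # one-year-old comment
```

The constant 3.16e-11 is roughly one over the number of milliseconds in a year, so a year-old document gets about half the boost of a new one.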
Solr Comments Architecture
[Figure: architecture diagram. Labels: Message Queue, MongoDB, Mongo Ingestor, Solr Ingestor, Solr Cloud, "Uses SolrJ CloudSolrServer", Tools Server, JuLiA]
Relevance in Solr
• “free alcohol” vs. “alcohol free”
–Phrase queries and phrase slop
• Lawyer versus Attorney
–SynonymFilterFactory
• Iron and ironic
–KStem, or lemmatization via the
SynonymFilterFactory, instead of
Snowball/Porter
Relevance in Solr
• Beyonce vs. Beyoncé
–Various Folding Filters
• Eagles
–Boost on other fields, such as
popularity, publish date
–Use related searches, facets, or clustering
• F 15, F-15, F15
–WordDelimiterFilter
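The effect of folding and word-delimiter handling can be approximated in a few lines. This is a rough stand-in for what those analysis filters achieve, not their implementation: the real WordDelimiterFilter can emit several tokens depending on its options, and both helper names are hypothetical.

```python
import re
import unicodedata


def fold_accents(text):
    """Strip combining marks so "Beyoncé" and "Beyonce" index identically."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def catenate_parts(text):
    """Join letter/digit runs split by spaces or punctuation: "F-15" -> "f15"."""
    return "".join(re.findall(r"[a-z0-9]+", fold_accents(text).lower()))


print(fold_accents("Beyoncé"))
print({catenate_parts(v) for v in ("F 15", "F-15", "F15")})
```

With both steps applied at index and query time, the spelling variants on the slide collapse to a single indexed form.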
Bringing a New Search Project Online
• Understand the domain
• Ingest (sample) data
• Clean data
• Repeat
• Relevance testing
• Scale out
• Launch/Success

Solr At AOL, Presented by Sean Timm at SolrExchange DC
