Deduplication using SOLR 
Neeraj Jain, Software Engineer, Stubhub, Inc. 
neeraj.adi@gmail.com
About myself 
RIDE ON
StubHub is about…
World's largest ticketing marketplace
10M active listings
We enable “access” to events
We want to be more!!!
Some Fun Facts about StubHub! 
Ø An eBay owned company 
Ø Over 25 million users and growing 
Ø We sell one ticket per second 
Ø ~8.5 million page views a day, on an average 
Ø ~ 3 million additional page views per day on Mobile devices 
Ø ~10 M tickets for sale in sports, concerts and others. 
Ø ~ 1 TB of data processed monthly by the analytics infrastructure – This number will 
significantly go up as we bring in data from many of the unstructured data sources 
Ø ~300 Million SQL executions/day
Search at Stubhub! 
SOLR 1.2
SOLR 3+, Geo spatial search
SOLR 4, NRT
SOLR Cloud
2010 2011 2012 2013 2014
Agenda 
Ø Use case 
Ø Challenges 
Ø Legacy solution 
Ø Our approach 
Ø Results
Use Case : Content Ingestion 
Input record -> Pre-deduplication -> Deduplication -> Post-deduplication
Normalize, Filtering, Classification, Geocode
Review, Insert, Update, Discard
Feed-1, Feed-2, Feed-3, Feed-n
Form
Event DB
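The Normalize step is only named on the slide. As a rough sketch of what address normalization could look like before deduplication, the class below lower-cases an address, strips punctuation, and maps a few spelled-out words to short forms. The class name `AddressNormalizer` and the small word table are assumptions made for illustration, not the pipeline's actual rules.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical normalizer; the word table is illustrative and far from complete. */
final class AddressNormalizer {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("street", "st");
        CANONICAL.put("avenue", "ave");
        CANONICAL.put("boulevard", "blvd");
        CANONICAL.put("north", "n");
        CANONICAL.put("east", "e");
        CANONICAL.put("west", "w");
        CANONICAL.put("fourth", "4th");
        CANONICAL.put("seventh", "7th");
    }

    /** Lower-cases, replaces punctuation with spaces, and rewrites known tokens. */
    static String normalize(String address) {
        String cleaned = address.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", " ")
                .trim();
        StringBuilder out = new StringBuilder();
        for (String token : cleaned.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the whole input was blank
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(CANONICAL.getOrDefault(token, token));
        }
        return out.toString();
    }
}
```

With this table, "1601 E. Seventh St." and "1601 E 7th St" normalize to the same string, so a later comparison step sees them as identical.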
Challenges : Deduplication 
Ø Problem space 
² Event catalog
Ø Performance considerations
² Real time processing
² Batch processing
Ø Speed and data quality
Legacy Solution : Deduplication Flow 
Client
Feed Ingestor, Batch Job, UGC
DeduplicationModule
Event DB
1: getDuplicates()
2: getSubsetByLocation()
3: loop
for each document
for each field
Normalize, Filter, Compute Score
4: DuplicateList
5: upsert()
Approach : Problem Model 
Ø Milpitas 
Library 
vs 
Milpitas 
Public 
Library 
Ø 1601 
E 
7th 
St 
vs 
1601 
E. 
Seventh 
St. 
Ø Pick 
up 
the 
right 
algo, 
edit 
distance, 
jaccard. 
Library, 
Restaurant, 
etc 
Milpitas 
Library 
160 
N. 
Main 
St; 
40 
N. 
Milpitas 
Blvd. 
Distance 
: 
~0.5 
mi 
e.g. 
venue 
Boost 
name, 
street 
number 
Dup 
detec<on 
-­‐ 
name, 
address 
etc 
Subset 
-­‐ 
Text 
Similarity 
on 
Categories 
Subset 
-­‐ 
Geo 
spa<al 
distance 
Venue 
Deduplica.on
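The slide names edit distance and Jaccard as candidate algorithms without showing either. The sketch below implements both in plain Java so the trade-off is easier to see: Jaccard compares token sets, while edit distance counts character-level changes. The class name `NameSimilarity` and the tokenization rule are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper showing the two text-similarity measures named on the slide. */
final class NameSimilarity {

    /** Lower-cases, replaces punctuation with spaces, and splits on whitespace. */
    static Set<String> tokens(String s) {
        String cleaned = s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", " ").trim();
        if (cleaned.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty names are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein edit distance (case-sensitive), computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For "Milpitas Library" vs "Milpitas Public Library", Jaccard gives 2/3 because two of three distinct tokens are shared. For "7th" vs "Seventh" the edit distance is 5, which illustrates why character-level distance alone handles ordinals poorly.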
Approach : Deduplication Flow 
Feed Ingestor, Batch Job, UGC
Client
DeduplicationService, QueryBuilder, QueryExecuter, Scorer
Filter, NameFilter, AddressFilter, *Filter
IndexUpdater
SOLR Index
Event DB
1: /dedupe
2: build()
3: execute()
3.1: /select
4: compute()
6: DedupeResponse
7: /update
8: upsert()
A1: poll()
A2: /update, /delete
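The flow shows a QueryBuilder feeding a /select request but not the query itself. Below is a minimal sketch of how such parameters might be assembled, using standard Solr request parameters (`defType=edismax`, `qf` boosts, and a `{!geofilt}` filter query, whose `d` is in kilometers). The class name `VenueQueryBuilder`, the field names `name`, `street_number`, `address`, and `location`, and the boost values are all assumptions; the talk does not state them. The code needs Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical builder for a venue dedupe query; schema field names are assumed. */
final class VenueQueryBuilder {

    /** Builds /select parameters: boosted text match plus a radius filter. */
    static Map<String, String> build(String name, String streetNumber,
                                     double lat, double lon, double radiusKm) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        // Real input should be escaped for Solr query syntax before being placed in q.
        params.put("q", name + " " + streetNumber);
        // Boost name and street number, as the problem-model slide suggests (values assumed).
        params.put("qf", "name^2 street_number^1.5 address");
        params.put("fq", String.format(Locale.ROOT,
                "{!geofilt sfield=location pt=%.6f,%.6f d=%.2f}", lat, lon, radiusKm));
        params.put("fl", "id,name,address,score");
        params.put("rows", "10");
        return params;
    }

    /** Encodes the parameters as a URL query string. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }
}
```

A caller would append `toQueryString(build(...))` to the core's /select URL. Restricting candidates with a filter query first keeps the text scoring focused on the geographic subset.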
Approach : Deduplication Service 
public interface DeduplicationService<T> {
    /**
     * Checks for duplicates of an entity and returns a DeduplicationResponse describing the duplicates
     * found. Each possible duplicate carries a justification for why it is considered a duplicate.
     *
     * @param t entity for which duplicates need to be found.
     * @param options options used to find and filter the results.
     * @return a non-null DeduplicationResponse instance.
     * @throws DeduplicationConnectivityException if there was an issue connecting to the dedupe data store.
     */
    DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}
Approach : Deduplication Service 
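@Component(value = "VenueDeduplicationService")
public class VenueDeduplicationService
        implements DeduplicationService<Venue> {
    @Override
    public DeduplicationResponse<Venue> findDuplicates(Venue venue, DedupeOptions options)
            throws DeduplicationConnectivityException {
        // Body not shown on the slide; placeholder so the class compiles.
        throw new UnsupportedOperationException("body not shown on the slide");
    }
}
@Component(value = "EventDeduplicationService")
public class EventDeduplicationService
        implements DeduplicationService<Event> {
    @Override
    public DeduplicationResponse<Event> findDuplicates(Event event, DedupeOptions options)
            throws DeduplicationConnectivityException {
        // Body not shown on the slide; placeholder so the class compiles.
        throw new UnsupportedOperationException("body not shown on the slide");
    }
}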
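The slides leave the method bodies empty and do not show the fields of `DeduplicationResponse` or `DedupeOptions`. To illustrate the contract (a non-null response, with a justification per candidate), here is a toy implementation over an in-memory list of names instead of a SOLR index. Every type body, the `minScore` option, the `Candidate` class, and the token-overlap score are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins: the slides name these types but do not show their contents.
final class DedupeOptions {
    final double minScore; // assumed option: minimum score for reporting a candidate
    DedupeOptions(double minScore) { this.minScore = minScore; }
}

final class Candidate<T> {
    final T entity;
    final double score;
    final String justification;
    Candidate(T entity, double score, String justification) {
        this.entity = entity;
        this.score = score;
        this.justification = justification;
    }
}

final class DeduplicationResponse<T> {
    final List<Candidate<T>> duplicates = new ArrayList<>();
}

class DeduplicationConnectivityException extends Exception {
    DeduplicationConnectivityException(String message) { super(message); }
}

interface DeduplicationService<T> {
    DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}

/** Toy service: scores known names by the fraction of shared lower-cased tokens. */
final class InMemoryNameDeduplicationService implements DeduplicationService<String> {
    private final List<String> knownNames;

    InMemoryNameDeduplicationService(List<String> knownNames) {
        this.knownNames = knownNames;
    }

    private static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    @Override
    public DeduplicationResponse<String> findDuplicates(String name, DedupeOptions options) {
        DeduplicationResponse<String> response = new DeduplicationResponse<>();
        Set<String> input = tokens(name);
        for (String known : knownNames) {
            Set<String> other = tokens(known);
            Set<String> shared = new HashSet<>(input);
            shared.retainAll(other);
            int larger = Math.max(input.size(), other.size());
            double score = (double) shared.size() / larger;
            if (score >= options.minScore) {
                response.duplicates.add(new Candidate<>(known, score,
                        shared.size() + " of " + larger + " name tokens match"));
            }
        }
        return response; // always non-null, matching the documented contract
    }
}
```

Looking up "Milpitas Library" against "Milpitas Public Library" and "Lush Lounge" with a minimum score of 0.5 returns one candidate with the justification "2 of 3 name tokens match".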
Approach : Optimizations 
Ø How to keep the score consistent? 
² <similarity class="TfSimilarity"/>
Ø Auto commit settings
² <autoSoftCommit><maxTime>5</maxTime></autoSoftCommit>
Ø Custom PostFilter
² <queryParser name="fdist" class="DistanceQParserPlugin"/>
Ø Custom update handler
² <processor class="VenueUpdateProcessorFactory"></processor>
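The custom PostFilter is registered as `fdist` via `DistanceQParserPlugin`, but its code is not shown. As a guess at the kind of check a distance-based filter could apply to each candidate, the sketch below computes great-circle distance with the haversine formula and tests it against a radius. The class name `GeoDistance`, the use of miles, and the Earth radius constant are assumptions; this is plain Java, not a Solr plugin.

```java
/** Hypothetical distance check; not the DistanceQParserPlugin from the talk. */
final class GeoDistance {
    static final double EARTH_RADIUS_MILES = 3958.8; // mean Earth radius, approximate

    /** Great-circle distance between two lat/lon points (degrees), in miles. */
    static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2)) * sinLon * sinLon;
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    /** True when the candidate lies within maxMiles of the input point. */
    static boolean withinMiles(double lat1, double lon1, double lat2, double lon2, double maxMiles) {
        return haversineMiles(lat1, lon1, lat2, lon2) <= maxMiles;
    }
}
```

Running a check like this only on documents that already matched the text query is the usual motivation for a post filter: the relatively expensive distance computation is applied to a small candidate set.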
Results : Sample Output 
Input Venue | Matched Venue | Score | Distance
Jillian's Billiards Club, 101 Fourth St. | Jillian's, 175 4th St. | 1.5573 | 5.6352
Lush Lounge, 1092 Post St. | Lush Lounge, 1221 Polk St. | 12.9836 | 16.6501
Mountain Theatre, 10 Panoramic Hwy. | Mountain Theater, Nearby E Ridgecrest Boulevard and Pantoll Road | 3.2509 | 5.8913
Results : Sample Output 
Input Venue | Matched Venue | Score | Distance
The Hedley Club at Hotel DeAnza, 233 W. Santa Clara St. | Hedley Club, 233 W. Santa Clara St. | 5.0805 | 0.0000
Sonya Paz Fine Art Gallery, 1793 Lafayette St. | Sonya Paz Gallery and Studio, 1793 Lafayette St. Suite 110 | 6.6764 | 0.0069
Pearl Avenue Library Community Room, 4270 Pearl Ave. | Pearl Avenue Branch Library, 4270 Pearl Ave. | 5.7024 | 0.0000
Milpitas Library, 160 N. Main St. | Milpitas Library, 40 N. Milpitas Blvd. | 16.4318 | 0.7284
Summary 
Ø Use case 
² Content ingestion
Ø Challenges
² Deduplication
Ø Legacy solution
Ø Our approach
² Used SOLR for text similarity
² Extended default behavior
² REST endpoint over SOLR interface
Ø Next steps
² Big data
² Performer matching
² I18n
Ø Results
Thank You

More Related Content

What's hot

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and RecommendersLucidworks
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetLucidworks
 
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks FusionWebinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks FusionLucidworks
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...lucenerevolution
 
10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks
10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks
10 Keys to Solr's Future: Presented by Grant Ingersoll, LucidworksLucidworks
 
Boosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User PreferencesBoosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User PreferencesLucidworks (Archived)
 
Relevancy hacks for eCommerce
Relevancy hacks for eCommerceRelevancy hacks for eCommerce
Relevancy hacks for eCommerceVarun Thacker
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Lucidworks
 
Managed Search: Presented by Jacob Graves, Getty Images
Managed Search: Presented by Jacob Graves, Getty ImagesManaged Search: Presented by Jacob Graves, Getty Images
Managed Search: Presented by Jacob Graves, Getty ImagesLucidworks
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Lucidworks
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Lucidworks
 
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...Ramzi Alqrainy
 
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...Lucidworks
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseRelevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseLucidworks
 

What's hot (20)

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and Recommenders
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
 
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks FusionWebinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks Fusion
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
 
10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks
10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks
10 Keys to Solr's Future: Presented by Grant Ingersoll, Lucidworks
 
Boosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User PreferencesBoosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User Preferences
 
Relevancy hacks for eCommerce
Relevancy hacks for eCommerceRelevancy hacks for eCommerce
Relevancy hacks for eCommerce
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
 
Managed Search: Presented by Jacob Graves, Getty Images
Managed Search: Presented by Jacob Graves, Getty ImagesManaged Search: Presented by Jacob Graves, Getty Images
Managed Search: Presented by Jacob Graves, Getty Images
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
 
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity...
 
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseRelevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
 

Similar to Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Johann de Boer
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxSumant Tambe
 
Interactively querying Google Analytics reports from R using ganalytics
Interactively querying Google Analytics reports from R using ganalyticsInteractively querying Google Analytics reports from R using ganalytics
Interactively querying Google Analytics reports from R using ganalyticsJohann de Boer
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Eric D. Boyd
 
CQRS and Event Sourcing: A DevOps perspective
CQRS and Event Sourcing: A DevOps perspectiveCQRS and Event Sourcing: A DevOps perspective
CQRS and Event Sourcing: A DevOps perspectiveMaria Gomez
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
5 must have patterns for your microservice - techorama
5 must have patterns for your microservice - techorama5 must have patterns for your microservice - techorama
5 must have patterns for your microservice - techoramaAli Kheyrollahi
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon RedshiftAmazon Web Services
 
Idempotency of commands in distributed systems
Idempotency of commands in distributed systemsIdempotency of commands in distributed systems
Idempotency of commands in distributed systemsMax Małecki
 
Hw09 Analytics And Reporting
Hw09   Analytics And ReportingHw09   Analytics And Reporting
Hw09 Analytics And ReportingCloudera, Inc.
 
What can Bioinformaticians learn from YouTube?
What can Bioinformaticians learn from YouTube?What can Bioinformaticians learn from YouTube?
What can Bioinformaticians learn from YouTube?Matt Wood
 
.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech
.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech
.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTechNETFest
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemErdi Olmezogullari
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdfsash236
 
Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016
Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016
Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016Phil Leggetter
 
eBay EDW元数据管理及应用
eBay EDW元数据管理及应用eBay EDW元数据管理及应用
eBay EDW元数据管理及应用mysqlops
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 

Similar to Deduplication Using Solr: Presented by Neeraj Jain, Stubhub (20)

Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and Rx
 
Interactively querying Google Analytics reports from R using ganalytics
Interactively querying Google Analytics reports from R using ganalyticsInteractively querying Google Analytics reports from R using ganalytics
Interactively querying Google Analytics reports from R using ganalytics
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
 
CQRS and Event Sourcing: A DevOps perspective
CQRS and Event Sourcing: A DevOps perspectiveCQRS and Event Sourcing: A DevOps perspective
CQRS and Event Sourcing: A DevOps perspective
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
5 must have patterns for your microservice - techorama
5 must have patterns for your microservice - techorama5 must have patterns for your microservice - techorama
5 must have patterns for your microservice - techorama
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
 
Idempotency of commands in distributed systems
Idempotency of commands in distributed systemsIdempotency of commands in distributed systems
Idempotency of commands in distributed systems
 
Hw09 Analytics And Reporting
Hw09   Analytics And ReportingHw09   Analytics And Reporting
Hw09 Analytics And Reporting
 
What can Bioinformaticians learn from YouTube?
What can Bioinformaticians learn from YouTube?What can Bioinformaticians learn from YouTube?
What can Bioinformaticians learn from YouTube?
 
.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech
.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech
.NET Fest 2018. Антон Молдован. One year of using F# in production at SBTech
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management System
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016
Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016
Real-Time Web Apps & .NET. What Are Your Options? NDC Oslo 2016
 
eBay EDW元数据管理及应用
eBay EDW元数据管理及应用eBay EDW元数据管理及应用
eBay EDW元数据管理及应用
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 

More from Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 

Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

  • 2. Deduplication using SOLR Neeraj Jain, Software Engineer, Stubhub, Inc. neeraj.adi@gmail.com
  • 4. StubHub is about… World's largest ticketing marketplace. 10M active listings. We enable “access” to events. We want to be more!!!
  • 5. Some Fun Facts about StubHub!
      - An eBay-owned company
      - Over 25 million users and growing
      - We sell one ticket per second
      - ~8.5 million page views a day, on average
      - ~3 million additional page views per day on mobile devices
      - ~10 M tickets for sale in sports, concerts and others
      - ~1 TB of data processed monthly by the analytics infrastructure; this number will go up significantly as we bring in data from many of the unstructured data sources
      - ~300 million SQL executions/day
  • 6. Search at Stubhub! [Timeline, 2010 to 2014] SOLR 1.2; SOLR 3+, geospatial search; SOLR 4, NRT; SOLR Cloud
  • 7. Agenda
      - Use case
      - Challenges
      - Legacy solution
      - Our approach
      - Results
  • 8. Use Case : Content Ingestion [Diagram]
      - Sources: Feed-1, Feed-2, Feed-3, Feed-n, Form
      - Stages: Input record, Pre-deduplication, Deduplication, Post-deduplication
      - Steps shown: Normalize, Filtering, Classification, Geocode, Review, Insert, Update, Discard
      - Target: Event DB
  • 9. Challenges : Deduplication
      - Problem space
          - Event catalog
      - Performance considerations
          - Real-time processing
          - Batch processing
      - Speed and data quality
  • 10. Legacy Solution : Deduplication Flow [Sequence diagram]
      - Callers: Feed Ingestor, Batch Job, UGC
      - Participants: Client, DeduplicationModule, Event DB
      - Calls: 1: getDuplicates(); 2: getSubsetByLocation(); 3: loop; 4: DuplicateList; 5: upsert()
      - Loop labels: for each document; for each field
      - Steps shown in the loop: Normalize, Filter, Compute Score
  • 11. Approach : Problem Model
      - Milpitas Library vs Milpitas Public Library
      - 1601 E 7th St vs 1601 E. Seventh St.
      - Pick the right algorithm: edit distance, Jaccard.
      - [Diagram] Venue Deduplication
          - Subset: geospatial distance (Milpitas Library, 160 N. Main St; 40 N. Milpitas Blvd. Distance: ~0.5 mi)
          - Subset: text similarity on categories (Library, Restaurant, etc.)
          - Dup detection: name, address, etc.
          - Boost name, street number (e.g. venue)
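The two metrics named on this slide can be sketched in a few lines of Java. This is only an illustration of Jaccard similarity and Levenshtein edit distance applied to the slide's own examples; the class name `VenueSimilarity`, its method names, and the tokenization rule (lowercase, strip punctuation, split on whitespace) are assumptions, not something shown in the talk.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: names and tokenization are hypothetical, not from the talk. */
final class VenueSimilarity {

    /** Lowercases, replaces punctuation with spaces, and splits on whitespace. */
    static Set<String> tokens(String s) {
        String cleaned = s.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim();
        if (cleaned.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    /** Jaccard similarity: size of the intersection over size of the union of word tokens. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this tokenization, "Milpitas Library" and "Milpitas Public Library" share two of three distinct tokens, and "1601 E 7th St" and "1601 E. Seventh St." share three of five. Token-level Jaccard tolerates an inserted word, while edit distance is better suited to spelling variants such as "Theatre" and "Theater".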
  • 12. Approach : Deduplication Flow [Sequence diagram]
      - Callers: Feed Ingestor, Batch Job, UGC
      - Participants: Client, DeduplicationService, QueryBuilder, QueryExecuter, Scorer, Filter (NameFilter, AddressFilter, *Filter), SOLR Index, IndexUpdater, Event DB
      - Calls: 1: /dedupe; 2: build(); 3: execute(); 3.1: /select; 4: compute(); 6: DedupeResponse; 7: /update; 8: upsert()
      - Asynchronous calls: A1: poll(); A2: /update, /delete
  • 13. Approach : Deduplication Service

        public interface DeduplicationService<T> {
            /**
             * Checks for duplicate entity and return a DeduplicationResponse containing
             * information about duplicates found. For each possible duplicate, there is
             * a justification as to why it's a duplicate.
             *
             * @param t entity for which duplicates need to be found.
             * @param options use options provided by this object to find and filter the results.
             * @return a not null instance of DeduplicationResponse object.
             * @throws DeduplicationConnectivityException if there was an issue in connecting
             *         to the dedupe data store.
             */
            public DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
                    throws DeduplicationConnectivityException;
        }
  • 14. Approach : Deduplication Service

        @Component(value = "VenueDeduplicationService")
        public class VenueDeduplicationService implements DeduplicationService<Venue> {
            @Override
            public DeduplicationResponse<Venue> findDuplicates(Venue venue, DedupeOptions options)
                    throws DeduplicationConnectivityException {
                // body left empty on the slide
            }
        }

        @Component(value = "EventDeduplicationService")
        public class EventDeduplicationService implements DeduplicationService<Event> {
            @Override
            public DeduplicationResponse<Event> findDuplicates(Event event, DedupeOptions options)
                    throws DeduplicationConnectivityException {
                // body left empty on the slide
            }
        }
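Since the slides leave the method bodies empty, the following is a minimal sketch of how an implementation of the interface might look, assuming a toy in-memory list in place of the Solr index. The type names come from the slides, but every field, constructor, the `Duplicate` class, the `minScore` option, and `InMemoryNameDeduplicationService` are hypothetical and chosen only to make the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Type names follow the slides; every field and constructor below is an assumption.
class DeduplicationConnectivityException extends Exception {
    DeduplicationConnectivityException(String message) { super(message); }
}

class DedupeOptions {
    final double minScore; // hypothetical option: ignore candidates scoring below this
    DedupeOptions(double minScore) { this.minScore = minScore; }
}

/** Hypothetical holder for one candidate and the reason it was flagged. */
class Duplicate<T> {
    final T entity;
    final double score;
    final String justification;
    Duplicate(T entity, double score, String justification) {
        this.entity = entity;
        this.score = score;
        this.justification = justification;
    }
}

class DeduplicationResponse<T> {
    final List<Duplicate<T>> duplicates = new ArrayList<>();
}

interface DeduplicationService<T> {
    DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}

/** Toy stand-in for a Solr-backed service: scans a list and scores names by token overlap. */
class InMemoryNameDeduplicationService implements DeduplicationService<String> {
    private final List<String> knownNames;

    InMemoryNameDeduplicationService(List<String> knownNames) {
        this.knownNames = knownNames;
    }

    @Override
    public DeduplicationResponse<String> findDuplicates(String name, DedupeOptions options) {
        DeduplicationResponse<String> response = new DeduplicationResponse<>();
        for (String candidate : knownNames) {
            double score = tokenOverlap(name, candidate);
            if (score >= options.minScore) {
                response.duplicates.add(
                        new Duplicate<>(candidate, score, "name token overlap = " + score));
            }
        }
        return response; // never null, matching the contract in the Javadoc
    }

    /** Jaccard overlap of lowercase, whitespace-separated tokens. */
    static double tokenOverlap(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb);
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }
}
```

Keeping the interface generic, as the slides do, means a caller depends only on `DeduplicationService<T>`, so a test double like this one could be swapped for the Solr-backed venue or event service.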
  • 15. Approach : Optimizations
      - How to keep the score consistent?
          <similarity class="TfSimilarity"/>
      - Auto commit settings
          <autoSoftCommit><maxTime>5</maxTime></autoSoftCommit>
      - Custom PostFilter
          <queryParser name="fdist" class="DistanceQParserPlugin"/>
      - Custom update handler
          <processor class="VenueUpdateProcessorFactory"></processor>
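The slide does not show what the custom distance filter computes. As a rough sketch, assuming it compares great-circle distances between venue coordinates, the check could look like the code below. The class `GeoDistance`, its methods, and the use of miles are all hypothetical; this is plain Java and not the Solr plugin code.

```java
/** Illustrative great-circle distance check; not the DistanceQParserPlugin from the talk. */
final class GeoDistance {

    /** Mean Earth radius in miles. */
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Haversine distance in miles between two latitude/longitude points given in degrees. */
    static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    /** True when the two points are no farther apart than maxMiles. */
    static boolean withinMiles(double lat1, double lon1, double lat2, double lon2, double maxMiles) {
        return haversineMiles(lat1, lon1, lat2, lon2) <= maxMiles;
    }
}
```

Running an exact distance check like this only on documents that already matched the text query is the usual motivation for a post filter, since the per-document computation is more expensive than a term lookup.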
  • 16. Results : Sample Output
      Input Venue | Matched Venue | Score | Distance
      Jillian's Billiards Club, 101 Fourth St. | Jillian's, 175 4th St. | 1.5573 | 5.6352
      Lush Lounge, 1092 Post St. | Lush Lounge, 1221 Polk St. | 12.9836 | 16.6501
      Mountain Theatre, 10 Panoramic Hwy. | Mountain Theater, Nearby E Ridgecrest Boulevard and Pantoll Road | 3.2509 | 5.8913
  • 17. Results : Sample Output
      Input Venue | Matched Venue | Score | Distance
      The Hedley Club at Hotel DeAnza, 233 W. Santa Clara St. | Hedley Club, 233 W. Santa Clara St. | 5.0805 | 0.0000
      Sonya Paz Fine Art Gallery, 1793 Lafayette St. | Sonya Paz Gallery and Studio, 1793 Lafayette St. Suite 110 | 6.6764 | 0.0069
      Pearl Avenue Library Community Room, 4270 Pearl Ave. | Pearl Avenue Branch Library, 4270 Pearl Ave. | 5.7024 | 0.0000
      Milpitas Library, 160 N. Main St. | Milpitas Library, 40 N. Milpitas Blvd. | 16.4318 | 0.7284
  • 18. Summary
      - Use case
          - Content ingestion
      - Challenges
          - Deduplication
      - Legacy solution
      - Our approach
          - Used SOLR for text similarity
          - Extended default behavior
          - REST endpoint over SOLR interface
      - Next steps
          - Big data
          - Performer matching
          - I18n
      - Results