Fully Automated QA System For Large Scale Search And Recommendation Engines Using Spark

Spark Summit
Khalifeh AlJadda
Lead Data Scientist, Search Data Science
• Joined CareerBuilder in 2013
• PhD, Computer Science – University of Georgia (2014)
• BSc, MSc, Computer Science, Jordan University of Science and Technology
Activities:
• Founder and Chairman of CB Data Science Council
• Frequent public speaker in the field of data science
• Creator of GELATO (Glycomic Elucidation and Annotation Tool)
• ...and many more
Talk Flow
• Introduction
• How to Measure Relevancy
• How to Label Dataset
• The Fully Automated System
Learning to Rank (LTR)
What is Information Retrieval (IR)?
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).*
*Introduction to Information Retrieval: http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf
Information Retrieval (IR) vs Relational Database (RDB)
                     RDB          IR
Objects              Records      Unstructured Documents
Model                Relational   Vector Space
Main Data Structure  Table        Inverted Index
Queries              SQL          Free text
The inverted index (diagram)
Vocabulary
• Relevancy: information need satisfaction
• Precision: accuracy
• Recall: coverage
• Search: find documents that match a user’s query
• Recommendation: leveraging context to automatically suggest relevant results
Learning to Rank (LTR)
Motivation
• Users will turn away if they get irrelevant results.
• New algorithms and features need testing.
• A/B testing is expensive since it impacts end users.
• An A/B test requires days before a conclusion can be drawn.
How to Measure Relevancy?
A = retrieved documents, C = related (relevant) documents, B = the retrieved documents that are relevant (A ∩ C)
Precision = B/A
Recall = B/C
F1 = 2 * (Prec * Rec) / (Prec + Rec)
Assumption:
We have only 3 jobs for "aquatic director" in our Solr index; a query returns 4 documents, 2 of which are relevant.
Precision = 2/4 = 0.5
Recall = 2/3 ≈ 0.67
F1 = 2 * (0.5 * 0.67) / (0.5 + 0.67) ≈ 0.57
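A minimal Python sketch of these set-based metrics, reproducing the aquatic-director numbers above (the function and variable names are illustrative, not from the talk):

```python
def precision_recall_f1(relevant_retrieved, retrieved, relevant_total):
    """Set-based relevancy metrics: A = retrieved, C = relevant, B = their overlap."""
    precision = relevant_retrieved / retrieved        # B / A
    recall = relevant_retrieved / relevant_total      # B / C
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Aquatic-director example: 4 documents retrieved, 2 of them relevant, 3 relevant jobs in the index.
print(precision_recall_f1(2, 4, 3))  # (0.5, 0.666..., 0.571...)
```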
Problem:
Assume Prec = 90% and Rec = 100%, but the 10% irrelevant documents are ranked at the top of the results. Is that OK? Precision, recall, and F1 ignore where in the ranking the relevant documents appear.
Discounted Cumulative Gain (DCG)

Given ranking:
Rank  Relevancy
1     0.95
2     0.65
3     0.80
4     0.85

Ideal ranking:
Rank  Relevancy
1     0.95
2     0.85
3     0.80
4     0.65

• Position is considered in quantifying relevancy.
• A labeled dataset is required.
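A short Python sketch of DCG/nDCG applied to the two rankings above, assuming the common log2 position discount (the exact discount used in the talk is not shown):

```python
import math

def dcg(relevancies):
    """Discounted Cumulative Gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevancies, start=1))

given = [0.95, 0.65, 0.80, 0.85]     # ranking produced by the engine
ideal = sorted(given, reverse=True)  # [0.95, 0.85, 0.80, 0.65]

ndcg = dcg(given) / dcg(ideal)       # normalize by the ideal ordering
print(round(ndcg, 3))                # ~0.98: penalized for placing 0.65 too high
```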
Learning to Rank (LTR)
How to get labeled data?
● Manually
○ Pros:
■ Accuracy
○ Cons:
■ Not scalable
■ Expensive
○ How:
■ Hire employees, contractors, or interns
■ Crowd-sourcing
● Less cost
● Less accuracy
● Automatically: infer relevancy from implicit user feedback
How to infer relevancy?

Ranked results for a query:
Rank  Document ID
1     Doc1
2     Doc2
3     Doc3
4     Doc4

Click Graph (query → document, edge weight = clicks):
Query → Doc1: 0, Doc2: 1, Doc3: 1

Skip Graph (query → document, edge weight = skips):
Query → Doc1: 1, Doc2: 0, Doc3: 0
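A toy Python sketch of the click/skip intuition behind these graphs: for one query impression, a document counts as skipped if it was shown above a clicked document but received no click itself (a hypothetical helper, not the production logic):

```python
def click_skip_counts(ranked_doc_ids, clicked_doc_ids):
    """Return {doc_id: (clicks, skips)} for a single query impression.

    A document counts as a skip if it was shown above some clicked document
    but was not clicked itself.
    """
    clicked = set(clicked_doc_ids)
    lowest_clicked_rank = max(
        (rank for rank, doc in enumerate(ranked_doc_ids, start=1) if doc in clicked),
        default=0,
    )
    counts = {}
    for rank, doc in enumerate(ranked_doc_ids, start=1):
        clicks = 1 if doc in clicked else 0
        skips = 1 if clicks == 0 and rank < lowest_clicked_rank else 0
        counts[doc] = (clicks, skips)
    return counts

# Matches the toy graphs above: Doc2 and Doc3 clicked, Doc1 skipped, Doc4 below the last click.
print(click_skip_counts(["Doc1", "Doc2", "Doc3", "Doc4"], ["Doc2", "Doc3"]))
```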
Query Log
Field            Example
Query ID         Q1234567890
Browser ID       B12345ABCD789
Session ID       S123456ABCD7890
Raw Query        Spark or hadoop and Scala or java
Host Site        US
Language         EN
Ranked Results   D1, D2, D3, D4, .. , Dn

Action Log
Field              Example
Query ID           Q1234567890
Action Type*       Click
Document ID        D1
Document Location  1

*Possible Action Types: Click, Download, Print, Block, Unblock, Save, Apply, Dwell time, Post-click path
Learning to Rank (LTR)
System Architecture
(Diagram) Click/skip logs flow into HDFS, an ETL stage builds click/skip data and per-document relevancy (Doc Rel) stored in HDFS, the results are exported from HDFS, and an nDCG calculator consumes them.
ETL

Input: query log record
Field            Example
Query ID         Q1234567890
Browser ID       B12345ABCD789
Session ID       S123456ABCD7890
Raw Query        Spark or hadoop and Scala or java
Ranked Results   D1, D2, D3, D4, .. , Dn

Input: action log record
Field              Example
Query ID           Q1234567890
Action Type        Click
Document ID        D1
Document Location  1

Intermediate table: Keyword | DocumentID | Rank | Clicks | Skips | Popularity
Output table:       Keyword | DocumentID | Relevancy
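A hedged PySpark sketch of an ETL step like the one above: join query-log and action-log DataFrames and aggregate per (keyword, document). The paths, column names, popularity proxy, and the clicks/(clicks+skips) relevancy formula are illustrative assumptions, not CareerBuilder's production logic:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("relevancy-etl").getOrCreate()

# Assumed inputs on HDFS (paths and columns are hypothetical):
# impressions: query_id, keyword, doc_id, rank    -- one row per displayed document
# actions:     query_id, doc_id, clicked, skipped -- 0/1 flags per user action
impressions = spark.read.parquet("hdfs:///logs/query_log")
actions = spark.read.parquet("hdfs:///logs/action_log")

joined = impressions.join(actions, ["query_id", "doc_id"], "left")

per_doc = joined.groupBy("keyword", "doc_id").agg(
    F.min("rank").alias("rank"),
    F.sum(F.coalesce(F.col("clicked"), F.lit(0))).alias("clicks"),
    F.sum(F.coalesce(F.col("skipped"), F.lit(0))).alias("skips"),
    F.countDistinct("query_id").alias("popularity"),  # illustrative popularity proxy
)

# Illustrative relevancy score: fraction of impressions that ended in a click
# (null when a document has neither clicks nor skips).
relevancy = per_doc.withColumn(
    "relevancy", F.col("clicks") / (F.col("clicks") + F.col("skips"))
).select("keyword", "doc_id", "relevancy")

relevancy.write.mode("overwrite").parquet("hdfs:///relevancy/doc_rel")
```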
Noise Challenge
• At least 10 distinct users must take an action on a document for it to be included in the nDCG calculation.
• A skip is ignored if the same browser ID clicks that document in a different session.
• Actions other than clicks are weighted more heavily: for example, we count a Download as 20 clicks and a Print as 100 clicks.
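A hedged PySpark sketch of applying the first and third noise rules (the action weights come from the slide; the paths and column names are assumptions, and the skip-then-click rule is only noted in a comment because it needs session-level history):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("noise-filter").getOrCreate()

# Assumed schema (hypothetical): keyword, doc_id, browser_id, action_type.
actions = spark.read.parquet("hdfs:///logs/action_log")

# Rule 3: heavier actions count as multiple clicks (example weights from the slide).
weight = (
    F.when(F.col("action_type") == "Download", 20)
     .when(F.col("action_type") == "Print", 100)
     .when(F.col("action_type") == "Click", 1)
     .otherwise(0)
)
weighted = actions.withColumn("weighted_clicks", weight)

# Rule 2 (not shown): drop a skip when the same browser_id clicks the same
# document in a different session; that needs the session-level skip history.

# Rule 1: keep only (keyword, doc) pairs acted on by at least 10 distinct users,
# using browser_id as a stand-in for a distinct user.
per_doc = (
    weighted.groupBy("keyword", "doc_id")
    .agg(
        F.countDistinct("browser_id").alias("distinct_users"),
        F.sum("weighted_clicks").alias("clicks"),
    )
    .filter(F.col("distinct_users") >= 10)
)
per_doc.show()
```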
Accuracy
500 resumes were manually reviewed by our data analyst. The accuracy of the relevancy scores calculated by our system is 96%.

Dataset by the Numbers
19 million+ | 10+ | 100,000+ | 250,000+ | 7
Query Synthesizer

Synthesize Queries: logs in HDFS → ETL → (Query, Docs with Relevancy) table → Search → HDFS Export

Query            Docs with Relevancy
java developer   d1, d2, d3, ..
spark or hadoop  d11, d12, d13, ..
Current Search Algorithm vs. Proposed Semantic Algorithms
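One way the comparison could be run, as a minimal Python sketch: replay labeled queries against a Solr select handler for each algorithm and compare average nDCG. The endpoint URLs, core names, and document field names are hypothetical; only Solr's standard select API and the nDCG definition above are assumed:

```python
import math
import requests

def ndcg_at_k(returned_doc_ids, labeled_relevancy, k=10):
    """nDCG of a returned ranking against the labeled relevancy scores."""
    gains = [labeled_relevancy.get(doc, 0.0) for doc in returned_doc_ids[:k]]
    ideal = sorted(labeled_relevancy.values(), reverse=True)[:k]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def average_ndcg(solr_select_url, labeled_queries, k=10):
    """labeled_queries: {raw_query: {doc_id: relevancy}} as produced by the ETL above."""
    scores = []
    for query, relevancy in labeled_queries.items():
        resp = requests.get(solr_select_url, params={"q": query, "rows": k, "wt": "json"})
        doc_ids = [doc["id"] for doc in resp.json()["response"]["docs"]]
        scores.append(ndcg_at_k(doc_ids, relevancy, k))
    return sum(scores) / len(scores)

# Hypothetical handlers for the two algorithms under test.
labeled = {"java developer": {"d1": 0.9, "d2": 0.7, "d3": 0.4}}
print("current :", average_ndcg("http://solr:8983/solr/jobs/select", labeled))
print("proposed:", average_ndcg("http://solr:8983/solr/jobs_semantic/select", labeled))
```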
Learning to Rank (LTR)
● It applies machine learning techniques to discover the combination of features that provides the best ranking.
● It requires a labeled set of documents with relevancy scores for a given set of queries.
● Features used for ranking are usually more computationally expensive than the ones used for matching.
● It works on a subset of the matched documents (e.g., the top 100).
LambdaMART Example
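As one possible concrete instance of LambdaMART, a hedged sketch using LightGBM's LGBMRanker on labeled (query, document, relevancy) data; the toy features and the binning of relevancy scores into integer grades are illustrative assumptions, not the talk's actual setup:

```python
import numpy as np
from lightgbm import LGBMRanker

# Toy labeled data: rows grouped by query, graded labels derived from relevancy scores.
X = np.array([
    [0.9, 12.0], [0.4, 3.0], [0.1, 1.0],   # query 1 features (e.g. match score, popularity)
    [0.8, 7.0],  [0.6, 9.0], [0.2, 0.0],   # query 2 features
])
y = np.array([3, 1, 0, 3, 2, 0])           # graded relevance labels (e.g. binned relevancy scores)
group = [3, 3]                             # number of documents per query, in row order

ranker = LGBMRanker(
    objective="lambdarank",                # LambdaMART = LambdaRank gradients + boosted trees
    n_estimators=100,
    min_child_samples=1,                   # tiny toy dataset; real data keeps the defaults
)
ranker.fit(X, y, group=group)

# Score and re-rank the candidate documents for a new query (e.g. the top-100 matched docs).
candidates = np.array([[0.7, 5.0], [0.3, 8.0], [0.95, 2.0]])
scores = ranker.predict(candidates)
print(np.argsort(-scores))                 # candidate indices from best to worst
```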
Mohammed Korayem, Hai Liu, David Lin, Chengwei Li
Thank You!