SlideShare a Scribd company logo
Reflected Intelligence: Evolving self-learning Information
Retrieval systems
Khalifeh AlJadda
Lead Data Scientist
Khalifeh AlJadda
Lead Data Scientist, Search Data Science
• Joined CareerBuilder in 2013
• PhD, Computer Science – University of Georgia
• BSc, MSc, Computer Science, Jordan University of Science and Technology
Activities:
• Founder and Chair of the Southern Data Science Conference (www.southerndatascience.com)
• Co-founder of ATLytiCS (www.atlytics.org)
• Founder and Chairman of CB Data Science Council
• Frequent public speaker in data science and big data conferences.
• Creator of GELATO (Glycomic Elucidation and Annotation Tool)
About Us
LinkedIn Profile
Bay Area Search
At CareerBuilder, Solr Powers...Search At CareerBuilder
Search by the Numbers
5
Powering 50+ Search Experiences Including:
100million +
Searches per day
500+
Search Servers
1,5billion +
Documents indexed and
searchable
...and many more
The standard
for enterprise
search.
of Fortune 500
uses Solr.
90%
Big Data Platform by the Numbers
Big Data Technologies:
87
Data Nodes
5.5
Petabyte Storage
3650
vCores
16
TB RAM
what is “reflected intelligence”?
The Three C’s
Content:
Keywords and other features in your documents
Collaboration:
How other’s have chosen to interact with your system
Context:
Available information about your users and their intent
Reflected Intelligence
“Leveraging previous data and interactions to improve how
new data and interactions should be interpreted”
Feedback Loops
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
● Recommendation Engines
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out ontologies automatically from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from
click logs
● Learning to Rank - using relevancy judgements and machine learning
to train a relevance model
● Identifying misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
Examples of Reflected Intelligence
big data ecosystem
Bay Area Search
• Massive data volume
• Can’t fit on single machine’s memory
• Can’t be processed on multi-core single machine in reasonable time
• The “1000 genomes” project will produce 1 petabyte of data per year from
multiple sources in multiple countries.
One algorithm used in this project will need 9 years to converge with 300
cores computing power.
• Facebook’s daily log 60 TB
Time to read 1TB from disk ~3 hours
The Big Data Problem
Hadoop
● Distributed computing framework
● Simplify hardware requirements (commodity computers), but move complexity to
software.
● Can run on multi-core single machine as well as on a cluster of commodity
machines.
● Hadoop basic components:
○ HDFS
○ Map/Reduce
● Hadoop echo system:
○ Workflow engine (oozie)
○ SQL-like language (Hive)
○ Pig
○ Zoo Keeper
○ Machine Learning Library (Mahout)
Apache Spark
Features Hadoop Map/Reduce Spark
Storage Disk Memory & Disk
Operations Map/Reduce Map/Reduce/Join/Filter/Sample
Execution Model Batch Batch/Interactive/Streaming
Programming Language Java Java/Scala/Python/R
Solr is the popular, blazing-fast,
open source enterprise search
platform built on Apache Lucene™.
Information Retrieval (IR) vs Relational Database (RDB)
RDB IR
Objects Records Unstructured Documents
Model Relational Vector Space
Main Data Structure Table Inverted Index
Queries SQL Free text
Term Documents
a doc1 [2x]
brown doc3 [1x]
, doc5 [1x]
cat doc4 [1x]
cow doc2 [1x]
, doc5 [1x]
… ...
once doc1 [1x]
, doc5 [1x]
over doc2 [1x]
, doc3 [1x]
the doc2 [2x]
, doc3 [2x]
,
doc4[2x]
, doc5 [1x]
… …
Document Content Field
doc1 once upon a time, in a land far, far
away
doc2 the cow jumped over the moon.
doc3 the quick brown fox jumped over
the lazy dog.
doc4 the cat in the hat
doc5 The brown cow said “moo” once.
… …
What you SEND to Lucene/Solr:
How the content is INDEXED into
Lucene/Solr (conceptually):
The inverted index
Matching text queries to text fields
/solr/select/?q=jobcontent:“software engineer”
Job Content Field Documents
… …
engineer doc1, doc3, doc4,
doc5
…
mechanical doc2, doc4, doc6
… …
software doc1, doc3, doc4,
doc7, doc8
… …
doc5
doc7 doc8
doc1 doc3
doc4
engineer
software
software engineer
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead
Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● … many more
*source: Solr in Action, chapter 2
Recommendations
Leveraging context to automatically suggest relevant results
Dimensions
Right Item Right Person Right Time
What Who When
Sending too many emails to too many users in short time may
cause:
● users may unsubscribe from future emails.
● They become desensitized and ignore future emails.
● email service providers may mark such emails as spam if
too many of their users are being contacted in a short
time-window.
Problem Description
Our Goal
Optimizing email recommendation systems such
that they can yield a maximum response rate for
a minimum number of email sends.
Methodology
● In order to figure out who to send to and what to send
for each time period we consider:
○ Individual user behavior
○ Historical group behavior from other users within
the same classification
Activity Score
● We utilize each user’s most recent behavioral data.
● The hypothesis is that active users were more recently
interested and therefore are more likely to respond in
general.
Group Trends
Group Trends
Experiment and Results
measuring & improving relevancy
How to Measure Relevancy?
A B C
Retrieved
Documents
Related
Documents
Precsion = B/A
Recall = B/C
Assumption:
We have only 3 jobs for aquatic director in our Solr index
Precision = 2/4 = 0.5
Recall = 2/3 = 0.66
F1 = 2 * (0.5 * 0.66) / (0.5 + 0.66) =
0.56
Problem:
Assume Prec = 90% and Rec = 100% but assume the
10% irrelevant documents were ranked at the top of the results
is that OK?
Cumulative Discount Gain
Rank Relevancy
1 0.95
2 0.65
3 0.80
4 0.85
Rank Relevancy
1 0.95
2 0.65
3 0.80
4 0.85
Ranking
Ideal
Given
• Position is
considered in
quantifying
relevancy.
• Labeled dataset
is required.
How to get labeled data?
● Manually
○ Pros:
■ Accuracy
○ Cons:
■ Not scalable
■ Expensive
○ How:
■ Hire employees, contractors, or interns
■ Crowd-sourcing
● Less cost
● Less accuracy
● Infer relevancy utilizing Reflected Intelligence (RI)
Derive Item-Relevacy Score
● we build a bipartite graph with two types of nodes (Query and
Item) and two types of edges (Click and Skip).
● This Click/Skip graph is created by analyzing how end-users
interact with the results they get for their submitted queries.
● Each click edge e between a query node q and an item node i
stores the number of distinct users who clicked on that item i
when it was retrieved within the results of the query q.
How to infer relevancy?
Rank Document ID
1 Doc1
2 Doc2
3 Doc3
4 Doc4
Query
Click/Skip Graph for Group of Queries
Query Synthesizer
ETL
ETL
Logs
HDFS
Query Docs with
Relevancy
java developer d1,d2,d3,..
spark or hadoop d11,d12,d13,..
HDFS Export
Results (Evaluation)
Results (Monitoring)
Learning to Rank (LTR)
It applies machine learning techniques to discover the best combination of features that
provide best ranking.
It requires labeled set of documents with relevancy scores for given set of queries
Features used for ranking are usually more computationally expensive than the ones used
for matching
It works on subset of the matched documents (e.g. top 100)
LTR Algorithms
• RankNet*
(Neural Network, boosted trees)
• LambdaMart*
(set of regression trees)
• SVM Rank**
(SVM classifier)
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
LambdaMart Example
Team
Mohammed Korayem Trey GraingerKhalifeh AlJadda
Thank You!

More Related Content

What's hot

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Caserta
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 
DataHub
DataHubDataHub
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge Management
Trey Grainger
 
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...
Markus Harrer
 
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Edureka!
 
Exploring Big Data Analytics Tools
Exploring Big Data Analytics ToolsExploring Big Data Analytics Tools
Exploring Big Data Analytics Tools
Multisoft Virtual Academy
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
Indu Khemchandani
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
markgrover
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
Vital.AI
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace
Mohamadreza Mohtat
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
Neo4j
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientistsAjay Ohri
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
Alex Meadows
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
Chandan Rajah
 
Self Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracySelf Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System Accuracy
DataWorks Summit
 
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
PyData
 
Data Science Retrospective
Data Science RetrospectiveData Science Retrospective
Data Science Retrospective
Sarah Guido
 

What's hot (20)

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
DataHub
DataHubDataHub
DataHub
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge Management
 
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online...
 
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
 
Exploring Big Data Analytics Tools
Exploring Big Data Analytics ToolsExploring Big Data Analytics Tools
Exploring Big Data Analytics Tools
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientists
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Self Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracySelf Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System Accuracy
 
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
Dark Data: A Data Scientists Exploration of the Unknown by Rob Witoff PyData ...
 
Data Science Retrospective
Data Science RetrospectiveData Science Retrospective
Data Science Retrospective
 

Similar to SDSC18 and DSATL Meetup March 2018

Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
Trey Grainger
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
Bhupesh Bansal
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
markgrover
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
MongoDB
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
Trey Grainger
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
OpenSource Connections
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Philip Filleul
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
Tao Feng
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
1 data science with python
1 data science with python1 data science with python
1 data science with python
Vishal Sathawane
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
Trivadis
 
Market Research Meets Big Data Analytics for Business Transformation
Market Research Meets Big Data Analytics  for Business Transformation Market Research Meets Big Data Analytics  for Business Transformation
Market Research Meets Big Data Analytics for Business Transformation Sally Sadosky
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with Databricks
Grega Kespret
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
Trey Grainger
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
Max De Marzi
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Neo4j
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Spark Summit
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Lucidworks
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 

Similar to SDSC18 and DSATL Meetup March 2018 (20)

Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
1 data science with python
1 data science with python1 data science with python
1 data science with python
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 
Market Research Meets Big Data Analytics for Business Transformation
Market Research Meets Big Data Analytics  for Business Transformation Market Research Meets Big Data Analytics  for Business Transformation
Market Research Meets Big Data Analytics for Business Transformation
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with Databricks
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 

Recently uploaded

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 

SDSC18 and DSATL Meetup March 2018

  • 1. Reflected Intelligence: Evolving self-learning Information Retrieval systems Khalifeh AlJadda Lead Data Scientist
  • 2. Khalifeh AlJadda Lead Data Scientist, Search Data Science • Joined CareerBuilder in 2013 • PhD, Computer Science – University of Georgia • BSc, MSc, Computer Science, Jordan University of Science and Technology Activities: • Founder and Chair of the Southern Data Science Conference (www.southerndatascience.com) • Co-founder of ATLytiCS (www.atlytics.org) • Founder and Chairman of CB Data Science Council • Frequent public speaker in data science and big data conferences. • Creator of GELATO (Glycomic Elucidation and Annotation Tool) About Us LinkedIn Profile
  • 3.
  • 4. Bay Area Search At CareerBuilder, Solr Powers...Search At CareerBuilder
  • 5. Search by the Numbers 5 Powering 50+ Search Experiences Including: 100million + Searches per day 500+ Search Servers 1,5billion + Documents indexed and searchable ...and many more
  • 6. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 7. Big Data Platform by the Numbers Big Data Technologies: 87 Data Nodes 5.5 Petabyte Storage 3650 vCores 16 TB RAM
  • 8. what is “reflected intelligence”?
  • 9. The Three C’s Content: Keywords and other features in your documents Collaboration: How other’s have chosen to interact with your system Context: Available information about your users and their intent Reflected Intelligence “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
  • 11. ● Recommendation Engines ● Building user profiles from past searches, clicks, and other actions ● Identifying correlations between keywords/phrases ● Building out ontologies automatically from content and queries ● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs ● Learning to Rank - using relevancy judgements and machine learning to train a relevance model ● Identifying misspellings, synonyms, acronyms, and related keywords ● Disambiguation of keyword phrases with multiple meanings ● Learning what’s important in your content Examples of Reflected Intelligence
  • 13. Bay Area Search • Massive data volume • Can’t fit on single machine’s memory • Can’t be processed on multi-core single machine in reasonable time • The “1000 genomes” project will produce 1 petabyte of data per year from multiple sources in multiple countries. One algorithm used in this project will need 9 years to converge with 300 cores computing power. • Facebook’s daily log 60 TB Time to read 1TB from disk ~3 hours The Big Data Problem
  • 14. Hadoop ● Distributed computing framework ● Simplify hardware requirements (commodity computers), but move complexity to software. ● Can run on multi-core single machine as well as on a cluster of commodity machines. ● Hadoop basic components: ○ HDFS ○ Map/Reduce ● Hadoop echo system: ○ Workflow engine (oozie) ○ SQL-like language (Hive) ○ Pig ○ Zoo Keeper ○ Machine Learning Library (Mahout)
  • 15. Apache Spark Features Hadoop Map/Reduce Spark Storage Disk Memory & Disk Operations Map/Reduce Map/Reduce/Join/Filter/Sample Execution Model Batch Batch/Interactive/Streaming Programming Language Java Java/Scala/Python/R
  • 16. Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
  • 17. Information Retrieval (IR) vs Relational Database (RDB) RDB IR Objects Records Unstructured Documents Model Relational Vector Space Main Data Structure Table Inverted Index Queries SQL Free text
  • 18. Term Documents a doc1 [2x] brown doc3 [1x] , doc5 [1x] cat doc4 [1x] cow doc2 [1x] , doc5 [1x] … ... once doc1 [1x] , doc5 [1x] over doc2 [1x] , doc3 [1x] the doc2 [2x] , doc3 [2x] , doc4[2x] , doc5 [1x] … … Document Content Field doc1 once upon a time, in a land far, far away doc2 the cow jumped over the moon. doc3 the quick brown fox jumped over the lazy dog. doc4 the cat in the hat doc5 The brown cow said “moo” once. … … What you SEND to Lucene/Solr: How the content is INDEXED into Lucene/Solr (conceptually): The inverted index
  • 19. Matching text queries to text fields /solr/select/?q=jobcontent:“software engineer” Job Content Field Documents … … engineer doc1, doc3, doc4, doc5 … mechanical doc2, doc4, doc6 … … software doc1, doc3, doc4, doc7, doc8 … … doc5 doc7 doc8 doc1 doc3 doc4 engineer software software engineer
  • 20. Key Solr Features: ● Multilingual Keyword search ● Relevancy Ranking of results ● Faceting & Analytics ● Highlighting ● Spelling Correction ● Autocomplete/Type-ahead Prediction ● Sorting, Grouping, Deduplication ● Distributed, Fault-tolerant, Scalable ● Geospatial search ● Complex Function queries ● Recommendations (More Like This) ● … many more *source: Solr in Action, chapter 2
  • 21. Recommendations Leveraging context to automatically suggest relevant results
  • 22. Dimensions Right Item Right Person Right Time What Who When
  • 23. Sending too many emails to too many users in short time may cause: ● users may unsubscribe from future emails. ● They become desensitized and ignore future emails. ● email service providers may mark such emails as spam if too many of their users are being contacted in a short time-window. Problem Description
  • 24. Our Goal Optimizing email recommendation systems such that they can yield a maximum response rate for a minimum number of email sends.
  • 25. Methodology ● In order to figure out who to send to and what to send for each time period we consider: ○ Individual user behavior ○ Historical group behavior from other users within the same classification
  • 26. Activity Score ● We utilize each user’s most recent behavioral data. ● The hypothesis is that active users were more recently interested and therefore are more likely to respond in general.
  • 31. How to Measure Relevancy? A B C Retrieved Documents Related Documents Precsion = B/A Recall = B/C
  • 32. Assumption: We have only 3 jobs for aquatic director in our Solr index Precision = 2/4 = 0.5 Recall = 2/3 = 0.66 F1 = 2 * (0.5 * 0.66) / (0.5 + 0.66) = 0.56 Problem: Assume Prec = 90% and Rec = 100% but assume the 10% irrelevant documents were ranked at the top of the results is that OK?
  • 33. Cumulative Discount Gain Rank Relevancy 1 0.95 2 0.65 3 0.80 4 0.85 Rank Relevancy 1 0.95 2 0.65 3 0.80 4 0.85 Ranking Ideal Given • Position is considered in quantifying relevancy. • Labeled dataset is required.
  • 34. How to get labeled data? ● Manually ○ Pros: ■ Accuracy ○ Cons: ■ Not scalable ■ Expensive ○ How: ■ Hire employees, contractors, or interns ■ Crowd-sourcing ● Less cost ● Less accuracy ● Infer relevancy utilizing Reflected Intelligence (RI)
  • 35. Derive Item-Relevacy Score ● we build a bipartite graph with two types of nodes (Query and Item) and two types of edges (Click and Skip). ● This Click/Skip graph is created by analyzing how end-users interact with the results they get for their submitted queries. ● Each click edge e between a query node q and an item node i stores the number of distinct users who clicked on that item i when it was retrieved within the results of the query q.
  • 36. How to infer relevancy? Rank Document ID 1 Doc1 2 Doc2 3 Doc3 4 Doc4 Query
  • 37. Click/Skip Graph for Group of Queries
  • 39. ETL ETL Logs HDFS Query Docs with Relevancy java developer d1,d2,d3,.. spark or hadoop d11,d12,d13,.. HDFS Export
  • 42. Learning to Rank (LTR) It applies machine learning techniques to discover the best combination of features that provide best ranking. It requires labeled set of documents with relevancy scores for given set of queries Features used for ranking are usually more computationally expensive than the ones used for matching It works on subset of the matched documents (e.g. top 100)
  • 43. LTR Algorithms • RankNet* (Neural Network, boosted trees) • LambdaMart* (set of regression trees) • SVM Rank** (SVM classifier) ** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf * http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
  • 45. Team Mohammed Korayem Trey GraingerKhalifeh AlJadda