This presentation was given at one of the DSATL Meetups in March 2018, in partnership with the Southern Data Science Conference 2018 (www.southerndatascience.com).
Fortune Teller API - Doing Data Science with Apache Spark (Bas Geerdink)
This presentation from the Endpoint 2015 conference gives an overview of a short data science project: predicting the future happiness of a person, as if he or she walks into a circus tent! First, the domain problem is analyzed. Then, the data is gathered and analyzed. Finally, a linear regression model is created and the app is published in the form of a REST API. The demo uses Apache Spark and Zeppelin; the code can be found on GitHub: https://github.com/geerdink/FortuneTellerApi
Balancing the Dimensions of User Intent (Trey Grainger)
The first step in returning relevant search results is successfully interpreting the user’s intent. This requires combining a holistic understanding of your content, your users, and your domain. Traditional keyword search focuses on the content understanding dimension. Knowledge graphs are then typically built and leveraged to represent an understanding of your domain. Finally, collaborative recommendations and user profile learning are typically the tools of choice for generating and modeling an understanding of the preferences of each user.
While these systems (search, recommendations, and knowledge graphs) are often built and used in isolation, combining them together is the key to truly understanding a user’s query intent. For example, combining traditional keyword search with your knowledge graph leads to semantic search capabilities, and combining traditional keyword search with recommendations leads to personalized search experiences. Combining all of these dimensions together in an appropriately balanced way will ultimately lead to the most accurate interpretation of a user’s query, resulting in a better query to the core search engine and ultimately a better, more relevant search experience.
In this talk, we’ll demonstrate strategies for delivering and combining each of these dimensions of user intent, and we’ll walk through concrete examples of how to balance each so that you don’t over-personalize, over-contextualize, or underappreciate the nuances of your users’ intent.
South Big Data Hub: Text Data Analysis Panel (Trey Grainger)
Slides from Trey's opening presentation for the South Big Data Hub's Text Data Analysis Panel on December 8th, 2016. Trey provided a quick introduction to Apache Solr, described how companies are using Solr to power relevant search in industry, and provided a glimpse of where the industry is heading with regard to implementing more intelligent and relevant semantic search.
Closing keynote by Trey Grainger from Activate 2018 in Montreal, Canada. Covers trends in the intersection of Search (Information Retrieval) and Artificial Intelligence, and the underlying capabilities needed to deliver those trends at scale.
"Searching for Meaning: The Hidden Structure in Unstructured Data". Presentation by Trey Grainger at the Southern Data Science Conference (SDSC) 2018. Covers linguistic theory, application in search and information retrieval, and knowledge graph and ontology learning methods for automatically deriving contextualized meaning from unstructured (free text) content.
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t... (Databricks)
It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and more than $5 billion in R&D investment, only to see fewer than 5% of the drugs make it to market.
AstraZeneca is a global, innovation-driven biopharmaceutical business that focuses on the discovery, development, and commercialization of prescription medicines for some of the world’s most serious diseases. Our scientists have been able to improve our success rate over the past 5 years by moving to a data-driven approach (the “5R”) to help develop better drugs faster, choose the right treatment for a patient and run safer clinical trials.
However, our scientists are still unable to make these decisions with all of the available scientific information at their fingertips. Data is sparse across our company as well as external public databases; every new technology requires a different data processing pipeline; and new data comes at an increasing pace. It is often repeated that a new scientific paper appears every 30 seconds, which makes it impossible for any individual expert to keep up to date with the pace of scientific discovery.
To help our scientists integrate all of this information and make targeted decisions, we have used Spark on Azure Databricks to build a knowledge graph of biological insights and facts. The graph powers a recommendation system which enables any AZ scientist to generate novel target hypotheses, for any disease, leveraging all of our data.
In this talk, I will describe the applications of our knowledge graph and focus on the Spark pipelines we built to quickly assemble and create projections of the graph from hundreds of sources. I will also describe the NLP pipelines we have built – leveraging spaCy, BioBERT, or Snorkel – to reliably extract meaningful relations between entities and add them to our knowledge graph.
The Apache Solr Semantic Knowledge Graph (Trey Grainger)
What if instead of a query returning documents, you could alternatively return other keywords most related to the query: i.e. given a search for "data science", get back results like "machine learning", "predictive modeling", "artificial neural networks", etc.? Solr’s Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields), allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguation of multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords based upon conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and will walk you through how to get up and running with an example dataset to explore the meaningful relationships hidden within your data.
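For a concrete feel of this capability, recent Solr versions (7.4+) expose the Semantic Knowledge Graph's scoring through the relatedness() aggregation in the JSON Facet API. A minimal sketch of querying it from Python — the collection name, field names, and URL here are hypothetical, not from the talk:

```python
import requests

# Foreground: docs matching "data science"; background: the whole corpus.
payload = {
    "query": 'body:"data science"',
    "limit": 0,
    "params": {"fore": 'body:"data science"', "back": "*:*"},
    "facet": {
        "related_terms": {
            "type": "terms",
            "field": "keywords",
            "limit": 10,
            "sort": {"r": "desc"},                      # rank terms by relatedness
            "facet": {"r": "relatedness($fore,$back)"}  # semantic significance score
        }
    },
}
resp = requests.post("http://localhost:8983/solr/my_collection/select", json=payload)
print(resp.json()["facets"]["related_terms"])  # e.g. "machine learning", ...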
Workshop with Joe Caserta, President of Caserta Concepts, at Data Summit 2015 in NYC.
Data science, the ability to sift through massive amounts of data to discover hidden patterns and predict future trends and actions, may be considered the "sexiest" job of the 21st century, but it requires an understanding of many elements of data analytics. This workshop introduced basic concepts, such as SQL and NoSQL, MapReduce, Hadoop, data mining, machine learning, and data visualization.
For notes and exercises from this workshop, click here: https://github.com/Caserta-Concepts/ds-workshop.
For more information, visit our website at www.casertaconcepts.com
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online... (Markus Harrer)
Let’s tackle problems in software development in an automated, data-driven and reproducible way!
As developers, we often feel that there might be something wrong with the way we develop software. Unfortunately, a gut feeling alone isn’t sufficient for the complex, interconnected problems in software systems.
We need solid, understandable arguments to gain budgets for improvement projects or to defend ourselves against political decisions. But we can help ourselves: every step in the development or use of software leaves valuable digital traces. With clever analysis, this data can show us the root causes of problems in our software and deliver new insights – understandable for everybody.
If concrete problems and their impact are known, developers and managers can create solutions and take sustainable actions aligned to existing business goals.
In this meetup, I talk about the analysis of software data by using a digital notebook approach. This allows you to express your gut feelings explicitly with the help of hypotheses, explorations and visualizations step by step.
I show how open source analysis tools (Jupyter, Pandas, jQAssistant and, of course, Neo4j) work together to inspect problems in Java applications and their environment. We have a look at performance hotspots, knowledge loss and worthless code parts – completely automated from raw data up to visualizations for management.
Participants learn how they can translate their vague gut feelings into solid evidence for obtaining budgets for dedicated improvement projects with the help of data analysis.
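To give a flavor of this notebook approach, here is a minimal sketch (not from the talk) that turns one kind of digital trace — the Git history — into a change-hotspot list with Pandas; it assumes it is run inside a Git repository:

```python
import subprocess
import pandas as pd

# "git log --numstat" emits one "added<TAB>deleted<TAB>path" line per changed file.
out = subprocess.run(
    ["git", "log", "--numstat", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

rows = []
for line in out.splitlines():
    parts = line.split("\t")
    if len(parts) == 3 and parts[0].isdigit():  # skips blank lines and binary files
        rows.append({"added": int(parts[0]), "deleted": int(parts[1]), "file": parts[2]})

df = pd.DataFrame(rows)
# Files changed most often are candidates for hotspot and knowledge-loss analysis.
print(df["file"].value_counts().head(10))
```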
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed... (Edureka!)
** Hadoop Training: https://www.edureka.co/hadoop **
This Edureka tutorial on "Data Science vs Big Data vs Data Analytics" will explain the similarities and differences between them. You will also get a complete insight into the skills required to become a Data Scientist, a Big Data Professional, or a Data Analyst.
Below topics are covered in this tutorial:
1. What is Data Science, Big Data, Data Analytics?
2. Roles and Responsibilities of Data Scientist, Big Data Professional and Data Analyst
3. Required skill sets
4. Understanding how data science, big data, and data analytics are used to drive the success of Netflix
Check our complete Hadoop playlist here: https://goo.gl/hzUO0m
This presentation was prepared by one of our renowned tutors, Suraj.
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed: there is a lot of manual "data wrangling" to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in data definitions is increasing as well. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk presents tools and a methodology to manage Big Data Models in a rapidly changing world, covering:
- Creating Semantic Metadata Models of Big Data Resources
- Graphical UI Tools for Big Data Models
- Tools to synchronize Big Data Models and Application Code
- Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
- Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
- Using Big Data Models with Machine Learning to generate Predictive Models
- Developer Collaborative/Coordination processes using Big Data Models and Git
- Managing change – Big Data Models with rapidly changing Data Resources
"Big Data" is big business, but what does it really mean? How will big data impact industries and consumers? This slide deck goes through some of the high level details of the market and how it is revolutionizing the world.
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy, forming Big Data Science. It highlights the benefits of investing in them and defines a path for their evolution within most organisations.
Dark Data: A Data Scientist's Exploration of the Unknown by Rob Witoff PyData ... (PyData)
Modern data science is enabling NASA's engineers to uncover actionable information from our "dark" data coffers. From starting small to operating at scale, Rob will discuss applications in telemetry, workforce analytics, and liberating data from the Mars rovers. Tools include IPython, Pandas, Boto, and more.
Reflected Intelligence: Evolving Self-Learning Data Systems (Trey Grainger)
In this presentation, we’ll talk about evolving self-learning search and recommendation systems which are able to accept user queries, deliver relevance-ranked results, and iteratively learn from the users’ subsequent interactions to continually deliver a more relevant experience. Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the collective feedback from all prior user interactions with the system. Through iterative feedback loops, such a system can leverage user interactions to learn the meaning of important phrases and topics within a domain, identify alternate spellings and disambiguate multiple meanings of those phrases, learn the conceptual relationships between phrases, and even learn the relative importance of features to automatically optimize its own ranking algorithms on a per-query, per-category, or per-user/group basis.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
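To make the sharding topic concrete, here is a brief sketch of what enabling horizontal scale can look like from Python with PyMongo — the database, collection, and shard key are hypothetical, and this assumes a sharded cluster reachable through mongos:

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # connect via mongos

# Index optimization: support the common query pattern before scaling out.
client.appdb.events.create_index([("user_id", ASCENDING), ("ts", DESCENDING)])

# Horizontal scaling: shard the collection on a high-cardinality key.
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.events", key={"user_id": "hashed"})
```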
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
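The core scoring idea is to compare how often a term co-occurs with a foreground set of documents against its background rate in the whole corpus. A minimal sketch of such a foreground/background significance score (illustrative only; the paper's exact scoring and normalization differ in detail):

```python
import math

def relatedness(fg_hits, fg_size, bg_hits, bg_size):
    """z-score of a term's foreground frequency vs. its corpus-wide rate.

    fg_hits: foreground docs containing the term   fg_size: total foreground docs
    bg_hits: corpus docs containing the term       bg_size: total corpus docs
    """
    p = bg_hits / bg_size                 # background probability of the term
    expected = fg_size * p
    return (fg_hits - expected) / math.sqrt(fg_size * p * (1 - p))

# "machine learning" in 180 of 200 "data science" docs, but only 2,000 of
# 1,000,000 docs overall -> a strongly related node.
print(relatedness(180, 200, 2_000, 1_000_000))    # large positive score
# A common word appearing at its background rate scores ~0.
print(relatedness(140, 200, 700_000, 1_000_000))  # ~0: no latent relationship
```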
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
Bitkom Cray presentation - on HPC affecting big data analytics in FS (Philip Filleul)
High-value analytics in financial services (FS) are being enabled by graph, machine learning, and Spark technologies. To make these real at production scale, HPC technologies are more appropriate than commodity clusters.
How Celtra Optimizes its Advertising Platform with Databricks (Grega Kespret)
Leading brands such as Pepsi and Macy’s use Celtra’s technology platform for brand advertising. To inform better product design and resolve issues faster, Celtra relies on Databricks to gather insights from large-scale, diverse, and complex raw event data. Learn how Celtra uses Databricks to simplify their Spark deployment, achieve faster project turnaround time, and empower people to make data-driven decisions.
In this webinar, you will learn how Databricks helps Celtra to:
- Utilize Apache Spark to power their production analytics pipeline.
- Build a “Just-in-Time” data warehouse to analyze diverse data sources such as Elastic Load Balancer access logs, raw tracking events, operational data, and reportable metrics.
- Go beyond simple counting and group events into sequences (i.e., sessionization) and perform more complex analysis such as funnel analytics.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... (Ilkay Altintas, Ph.D.)
Scientific workflows are used by many scientific communities to capture, automate, and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, and application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures, and presents a methodology for workflow-driven science based on these maturing requirements.
The Relevance of the Apache Solr Semantic Knowledge Graph (Trey Grainger)
The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between any arbitrary queries or terms within the search index. It is a relevancy Swiss Army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, clean up noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords based upon conceptual cohesion to reduce noise, summarize documents by extracting their most significant terms, generate recommendations and personalized search, and power numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. This talk will walk you through how to set up and use this plugin in concert with other open source tools (a probabilistic query parser, SolrTextTagger for entity extraction) to parse, interpret, and model the true intent of user searches much more accurately than traditional keyword-based search approaches.
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne... (Neo4j)
With the torrent of data available to us on the Internet, it's been increasingly difficult to separate the signal from the noise. We set out on a journey with a simple directive: figure out a way to discover emerging technology trends. Through a series of experiments, trials, and pivots, we found our answer in the power of graph databases. We essentially built our "Emerging Tech Radar" with graph databases at the center of our discovery platform. Using a mix of NoSQL databases and open source libraries, we built a scalable information digestion platform which touches upon multiple topics such as NLP, named entity extraction, data cleansing, Cypher queries, multiple visualizations, and polymorphic persistence.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
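For reference, the "Monolithic" baseline mentioned in the abstract is standard power-iteration PageRank, which touches every vertex in every iteration. A minimal sketch (dead ends handled here by spreading their rank uniformly, one common strategy):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Monolithic PageRank: all vertices processed in each iteration.

    adj maps each vertex to the list of vertices it links to.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in adj}
        for v, outs in adj.items():
            if not outs:                 # dead end: distribute rank everywhere
                for u in adj:
                    nxt[u] += damping * rank[v] / n
            else:
                for u in outs:
                    nxt[u] += damping * rank[v] / len(outs)
        rank = nxt
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```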
Empowering the Data Analytics Ecosystem: A Laser Focus on Value

The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:

1. Democratize Access, Not Data:
- Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
- Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.

2. Foster Collaboration with Clear Roles:
- Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
- Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.

3. Leverage Advanced Analytics Strategically:
- AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
- Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.

4. Prioritize Data Quality with Automation:
- Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
- Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.

5. Cultivate a Data-Driven Mindset:
- Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
- Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.

Benefits of a Precise Ecosystem:
- Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
- Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
- Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
- Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.

By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Khalifeh AlJadda
Lead Data Scientist, Search Data Science
• Joined CareerBuilder in 2013
• PhD, Computer Science – University of Georgia
• BSc, MSc, Computer Science – Jordan University of Science and Technology
Activities:
• Founder and Chair of the Southern Data Science Conference (www.southerndatascience.com)
• Co-founder of ATLytiCS (www.atlytics.org)
• Founder and Chairman of the CB Data Science Council
• Frequent public speaker at data science and big data conferences
• Creator of GELATO (Glycomic Elucidation and Annotation Tool)
5. Search by the Numbers
Powering 50+ search experiences, including:
• 100 million+ searches per day
• 500+ search servers
• 1.5 billion+ documents indexed and searchable
...and many more
9. The Three C’s
Content: keywords and other features in your documents
Collaboration: how others have chosen to interact with your system
Context: available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
11. Examples of Reflected Intelligence
● Recommendation Engines
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out ontologies automatically from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Identifying misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
13. The Big Data Problem
• Massive data volume
• Can’t fit in a single machine’s memory
• Can’t be processed on a multi-core single machine in reasonable time
• The “1000 Genomes” project will produce 1 petabyte of data per year from multiple sources in multiple countries. One algorithm used in this project will need 9 years to converge with 300 cores of computing power.
• Facebook’s daily log: 60 TB; time to read 1 TB from disk: ~3 hours.
14. Hadoop
● Distributed computing framework
● Simplifies hardware requirements (commodity computers), but moves complexity into software.
● Can run on a multi-core single machine as well as on a cluster of commodity machines.
● Hadoop basic components:
○ HDFS
○ Map/Reduce
● Hadoop ecosystem:
○ Workflow engine (Oozie)
○ SQL-like language (Hive)
○ Pig
○ ZooKeeper
○ Machine learning library (Mahout)
15. Apache Spark
Features | Hadoop Map/Reduce | Spark
Storage | Disk | Memory & Disk
Operations | Map/Reduce | Map/Reduce/Join/Filter/Sample
Execution Model | Batch | Batch/Interactive/Streaming
Programming Language | Java | Java/Scala/Python/R
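To illustrate the richer operation set in the table, a minimal PySpark sketch (assuming a local pyspark installation; the tiny inline dataset is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-vs-mapreduce").getOrCreate()

lines = spark.sparkContext.parallelize([
    "the cow jumped over the moon",
    "the cat in the hat",
])

# Chained map/filter/reduce operations, kept in memory between steps.
counts = (lines.flatMap(lambda line: line.split())
               .filter(lambda w: w != "the")       # filter: drop a stopword
               .map(lambda w: (w, 1))              # map: pair each word with 1
               .reduceByKey(lambda a, b: a + b))   # reduce: sum counts per word

print(counts.collect())
spark.stop()
```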
16. Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
17. Information Retrieval (IR) vs Relational Database (RDB)
 | RDB | IR
Objects | Records | Unstructured documents
Model | Relational | Vector space
Main Data Structure | Table | Inverted index
Queries | SQL | Free text
18. The inverted index

What you SEND to Lucene/Solr:

Document | Content Field
doc1 | once upon a time, in a land far, far away
doc2 | the cow jumped over the moon.
doc3 | the quick brown fox jumped over the lazy dog.
doc4 | the cat in the hat
doc5 | The brown cow said “moo” once.
... | ...

How the content is INDEXED into Lucene/Solr (conceptually):

Term | Documents
a | doc1 [2x]
brown | doc3 [1x], doc5 [1x]
cat | doc4 [1x]
cow | doc2 [1x], doc5 [1x]
... | ...
once | doc1 [1x], doc5 [1x]
over | doc2 [1x], doc3 [1x]
the | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
... | ...
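To make the structure concrete, a toy version of this index in plain Python (illustrative only; a real Lucene/Solr index adds analysis, positions, and compression):

```python
from collections import defaultdict

docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}

index = defaultdict(lambda: defaultdict(int))  # term -> {doc_id: frequency}
for doc_id, text in docs.items():
    for token in text.lower().replace(",", " ").replace(".", " ").replace('"', " ").split():
        index[token][doc_id] += 1

print(dict(index["brown"]))  # {'doc3': 1, 'doc5': 1}
print(dict(index["the"]))    # {'doc2': 2, 'doc3': 2, 'doc4': 2, 'doc5': 1}
```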
23. Problem Description
Sending too many emails to too many users in a short time may cause:
● Users to unsubscribe from future emails.
● Users to become desensitized and ignore future emails.
● Email service providers to mark such emails as spam if too many of their users are contacted in a short time window.
24. Our Goal
Optimizing email recommendation systems such that they yield a maximum response rate for a minimum number of email sends.
25. Methodology
● To figure out whom to send to and what to send for each time period, we consider:
○ Individual user behavior
○ Historical group behavior from other users within the same classification
26. Activity Score
● We utilize each user’s most recent behavioral data.
● The hypothesis is that active users were more recently interested and are therefore more likely to respond in general.
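The slides do not give a formula, but one common way to realize such a score is an exponential recency decay over a user's events; a hypothetical sketch (the half-life is an assumption, not from the talk):

```python
import math
import time

def activity_score(event_timestamps, half_life_days=14.0, now=None):
    """Sum of exponentially decayed event weights: recent activity counts more."""
    now = now if now is not None else time.time()
    decay = math.log(2) / (half_life_days * 86400)  # per-second decay rate
    return sum(math.exp(-decay * (now - ts)) for ts in event_timestamps)

day = 86400
print(activity_score([time.time() - 1 * day]))   # clicked yesterday: ~0.95
print(activity_score([time.time() - 30 * day]))  # clicked a month ago: ~0.23
```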
31. How to Measure Relevancy?
Let A be the set of retrieved documents, C the set of relevant documents, and B their intersection (the retrieved documents that are actually relevant). Then:
Precision = B / A
Recall = B / C
32. Assumption: we have only 3 jobs for "aquatic director" in our Solr index, and a search returns 4 documents, of which 2 are relevant.
Precision = 2/4 = 0.5
Recall = 2/3 ≈ 0.67
F1 = 2 × (0.5 × 0.67) / (0.5 + 0.67) ≈ 0.57
Problem: assume precision = 90% and recall = 100%, but the 10% irrelevant documents were ranked at the top of the results. Is that OK?
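A tiny sketch to verify the arithmetic (the document IDs are made up):

```python
def precision_recall_f1(retrieved, relevant):
    hits = len(retrieved & relevant)       # B: retrieved AND relevant
    precision = hits / len(retrieved)      # B / A
    recall = hits / len(relevant)          # B / C
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

retrieved = {"job1", "job2", "job3", "job4"}   # 4 results returned
relevant = {"job1", "job2", "job5"}            # 3 truly relevant jobs
print(precision_recall_f1(retrieved, relevant))  # (0.5, 0.667, 0.571)
```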
33. Discounted Cumulative Gain (DCG)

Given ranking:
Rank | Relevancy
1 | 0.95
2 | 0.65
3 | 0.80
4 | 0.85

The ideal ranking orders the same results by descending relevancy: 0.95, 0.85, 0.80, 0.65.

• Position is considered in quantifying relevancy.
• A labeled dataset is required.
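A minimal sketch of DCG and its normalized form over the example relevancies above (using the classic log2 positional discount; the slide does not specify a formula variant):

```python
import math

def dcg(relevancies):
    # rel_i / log2(i + 1) over 1-indexed ranks: lower positions earn less gain.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevancies, start=1))

given = [0.95, 0.65, 0.80, 0.85]
ideal = sorted(given, reverse=True)  # best possible ordering of the same items

ndcg = dcg(given) / dcg(ideal)       # normalized to [0, 1]
print(round(dcg(given), 3), round(dcg(ideal), 3), round(ndcg, 3))
```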
34. How to get labeled data?
● Manually
○ Pros: accuracy
○ Cons: not scalable; expensive
○ How: hire employees, contractors, or interns, or use crowd-sourcing (less cost, less accuracy)
● Infer relevancy utilizing Reflected Intelligence (RI)
35. Derive Item-Relevancy Score
● We build a bipartite graph with two types of nodes (Query and Item) and two types of edges (Click and Skip).
● This Click/Skip graph is created by analyzing how end-users interact with the results they get for their submitted queries.
● Each click edge e between a query node q and an item node i stores the number of distinct users who clicked on item i when it was retrieved within the results of query q.
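A minimal sketch of accumulating such a Click/Skip graph from a search log (the log format and the final ratio are assumptions for illustration):

```python
from collections import defaultdict

# (query, item) -> sets of distinct users who clicked or skipped the item.
edges = defaultdict(lambda: {"click": set(), "skip": set()})

# Hypothetical log rows: (user, query, item shown, clicked?)
log = [
    ("u1", "aquatic director", "job1", True),
    ("u1", "aquatic director", "job2", False),
    ("u2", "aquatic director", "job1", True),
    ("u2", "aquatic director", "job2", False),
]
for user, query, item, clicked in log:
    edges[(query, item)]["click" if clicked else "skip"].add(user)

# One possible relevancy signal: distinct clicks relative to impressions.
for (query, item), e in edges.items():
    clicks, skips = len(e["click"]), len(e["skip"])
    print(query, item, clicks / (clicks + skips))
```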
36. How to infer relevancy?
[Figure: a submitted query and its ranked result list, Doc1 through Doc4, from which click and skip edges are recorded]
42. Learning to Rank (LTR)
● Applies machine learning techniques to discover the combination of features that provides the best ranking.
● Requires a labeled set of documents with relevancy scores for a given set of queries.
● Features used for ranking are usually more computationally expensive than the ones used for matching.
● Works on a subset of the matched documents (e.g. the top 100).
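As a toy illustration of the pointwise flavor of LTR (a sketch assuming scikit-learn; production systems use richer features and pairwise/listwise objectives such as LambdaMART):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-(query, document) features for the top matched docs:
# [text match score, freshness, historical click-through rate]
X = np.array([
    [12.1, 0.9, 0.30],
    [10.5, 0.2, 0.05],
    [ 8.7, 0.7, 0.22],
    [ 7.3, 0.1, 0.01],
])
y = np.array([0.95, 0.40, 0.80, 0.10])  # labeled relevancy scores

model = GradientBoostingRegressor().fit(X, y)

# At query time, re-rank only the top matched subset (e.g. top 100).
order = np.argsort(-model.predict(X))
print(order)  # document indices from most to least predicted relevance
```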