Webinar Presentation: Building a Big Data Recommendation Engine

Focused Expertise
• Data Warehouse Design
• Business Intelligence
• Big Data Analytics
• Search / Relevance
• Infographics

Industries Served
• Healthcare / Insurance
• Financial Services
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education

445 Park Ave New York, NY | 1-855-755-2246 | info@casertaconcepts.com

Big Data
Analytics

Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not
have thought of

• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”

[Figure: a 23” LED TV has just been sold. One row of suggestions shows 24” and 25” LED TVs; the other shows Blu-Ray, Home Theater, and HDMI Cables.]

Where can recommendation
engines be found?
• Recommendation engines are used in a wide variety of industries
and applications:
• Travel
• Service Industry
• Music/Online radio
• TV and Video
• Online Publications
• Retail
…and countless others

Our Use Case: Movie Ratings!

Our Goal
• Create a powerful, scalable recommendation engine with minimal
development

• Make recommendations to users as they are browsing movie titles -
instantaneously

• Recommendation must have context to the movie they are currently
viewing.
OOPS! – too much surprise!

How do we hope to accomplish this?
Hadoop – distributed file system and processing platform
Mahout – collection of machine learning libraries

We will leverage 2 algorithms:
• Item Similarity – how similar is this particular movie to other
movies based on usage
• Item-Based Recommender – predict an individual's
preference based on their peers' ratings

• Both algorithms only require a simple dataset of 3 fields:
“User ID” , “Item ID”, “Rating”

Item Similarity – Context, Content Filtering
“People who liked this movie liked these as well”

• Item Similarity builds a matrix of items to other items and calculates
similarity (based on user rating)

• The most similar items are then output as a list:
• Item ID, Similar Item ID, Similarity Score
• Items with the highest score are most similar
• In this example users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)

7 100 0.690951001800917
7 50 0.653299445638532
7 117 0.643701303640083
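
As a rough illustration of what an item-similarity job computes, the sketch below builds item-to-item cosine similarity from (User ID, Item ID, Rating) triples. The ratings are made up, and cosine over co-rating users is only one of several possible similarity measures; it is not necessarily the measure that produced the scores above.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user_id, item_id, rating) triples, not the webinar's data.
ratings = [
    (1, 7, 5.0), (1, 100, 5.0), (1, 50, 1.0),
    (2, 7, 4.0), (2, 100, 5.0), (2, 50, 2.0),
    (3, 7, 1.0), (3, 100, 2.0), (3, 50, 5.0),
    (4, 7, 5.0), (4, 100, 4.0), (4, 50, 1.0),
]


def item_similarity(triples):
    """Return {(item_a, item_b): cosine similarity} over users who rated both."""
    by_user = defaultdict(dict)
    for user, item, rating in triples:
        by_user[user][item] = rating

    dot = defaultdict(float)
    norm_a = defaultdict(float)
    norm_b = defaultdict(float)
    for user_ratings in by_user.values():
        # Each pair is keyed as (smaller_id, larger_id) so it is stored once.
        for a, b in combinations(sorted(user_ratings), 2):
            ra, rb = user_ratings[a], user_ratings[b]
            dot[(a, b)] += ra * rb
            norm_a[(a, b)] += ra * ra
            norm_b[(a, b)] += rb * rb

    return {pair: dot[pair] / sqrt(norm_a[pair] * norm_b[pair]) for pair in dot}


similarities = item_similarity(ratings)
# Emit rows in the same "Item ID, Similar Item ID, Similarity Score" shape.
for (a, b), score in sorted(similarities.items(), key=lambda kv: -kv[1]):
    print(a, b, round(score, 3))
```

With only a handful of co-rating users, cosine scores can be misleadingly high, which is one reason to tune the similarity measure.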

Item-Based – Peer, Collaborative Filtering
“People with similar taste to you liked these movies”
• Item-Based takes the Item Similarity matrix and weights it based on
“peer” user preference.

• Essentially it determines the best movie critics for you to follow

• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score,…., Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are
“Seven” and “Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
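
One common way to turn item similarities into per-user scores is a similarity-weighted average of the ratings the user has already given. The sketch below assumes that formulation; the similarity values, ratings, and function names are hypothetical, and Mahout's own job may differ in its details.

```python
def predict(user_ratings, similarity, candidate):
    """Similarity-weighted average of the ratings this user already gave."""
    weighted_sum = 0.0
    weight_total = 0.0
    for item, rating in user_ratings.items():
        # The similarity table stores each pair once, so look it up both ways.
        sim = similarity.get((candidate, item), similarity.get((item, candidate), 0.0))
        weighted_sum += sim * rating
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


def recommend(user_ratings, similarity, all_items, top_n=10):
    """Score every item the user has not rated and return the top_n."""
    scored = []
    for item in all_items:
        if item in user_ratings:
            continue
        score = predict(user_ratings, similarity, item)
        if score is not None:
            scored.append((item, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical inputs: item-pair similarities and one user's ratings.
similarity = {(1, 2): 0.9, (1, 3): 0.2, (1, 4): 0.5,
              (2, 3): 0.1, (2, 4): 0.8, (3, 4): 0.3}
user_ratings = {1: 5.0, 3: 2.0}

recs = recommend(user_ratings, similarity, all_items=[1, 2, 3, 4])
# Same shape as the output above: User ID [Item ID:Score, ...]
print(1, "[" + ",".join(f"{item}:{score:.3f}" for item, score in recs) + "]")
```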

Recommendation Store
• Serving recommendations needs to be instantaneous
We need a database!

• The core of this solution is two reference tables:

Rec_Item_Similarity     Rec_User_Item_Base
Item_ID                 User_ID
Similar_Item            Item_ID
Similarity_Score        Recommendation_Score

• When called to make recommendations we query our store
• Rec_Item_Similarity based on the Item_ID they are viewing
• Rec_User_Item_Base based on their User_ID
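
A minimal sketch of the two lookups, using Python's built-in sqlite3 as a stand-in for whatever database actually backs the store. The table and column names come from the slide; the indexes, function names, and the choice of SQLite are assumptions. The sample rows are the first few values from the Mahout output shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real store
conn.executescript("""
    CREATE TABLE Rec_Item_Similarity (
        Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL);
    CREATE TABLE Rec_User_Item_Base (
        User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL);
    -- Assumed indexes so each lookup is a single keyed read.
    CREATE INDEX idx_item ON Rec_Item_Similarity (Item_ID);
    CREATE INDEX idx_user ON Rec_User_Item_Base (User_ID);
""")
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917), (7, 50, 0.653299445638532),
     (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)


def similar_items(conn, item_id, limit=10):
    """Items most similar to the one currently being viewed."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()


def user_recommendations(conn, user_id, limit=10):
    """Items with the highest peer-based score for this user."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


print(similar_items(conn, 7))
print(user_recommendations(conn, 572))
```

Because both tables are precomputed in batch, serving a page only needs these two keyed reads.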

Delivering Recommendations
So if Johny is viewing “12 Monkeys” we query our
recommendation store and present the results
Item Similarity
Movie                       Raw Score   Score
Fargo                       0.691       1.000
Star Wars                   0.653       0.946
Rock, The                   0.644       0.932
Pulp Fiction                0.628       0.909
Return of the Jedi          0.627       0.908
Independence Day            0.618       0.894
Willy Wonka                 0.603       0.872
Mission: Impossible         0.597       0.864
Silence of the Lambs, The   0.596       0.863
Star Trek: First Contact    0.594       0.859
Raiders of the Lost Ark     0.584       0.845
Terminator, The             0.574       0.831
Blade Runner                0.571       0.826
Usual Suspects, The         0.569       0.823
Seven (Se7en)               0.569       0.823

Item-Based (Peer): peers like these movies
Movie                       Raw Score   Score
Seven                       5.000       1.000
Donnie Brasco               4.707       0.941
Babe                        4.688       0.938
Heat                        4.688       0.938
To Kill a Mockingbird       4.686       0.937
Jaws                        4.683       0.937
Monty Python, Holy Grail    4.670       0.934
Blade Runner                4.670       0.934
Get Shorty                  4.655       0.931

Best Recommendations: Top 10 Recommendations
Movie                       Score
Seven (Se7en)               1.823
Blade Runner                1.760
Fargo                       1.000
Star Wars                   0.946
Donnie Brasco               0.941
Babe                        0.938
Heat                        0.938
To Kill a Mockingbird       0.937
Jaws                        0.937
Monty Python, Holy Grail    0.934
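
The Score values above appear to be each raw score divided by the top raw score in its list, and the Top 10 totals appear to be the sum of a movie's two normalized scores (for example, 0.823 + 1.000 = 1.823 for Seven). That reading is inferred from the numbers, not stated in the slides. A minimal sketch of it, using a handful of the raw scores shown:

```python
# A few raw scores copied from the slide; the full lists are longer.
item_similarity_raw = {
    "Fargo": 0.691,
    "Star Wars": 0.653,
    "Blade Runner": 0.571,
    "Seven (Se7en)": 0.569,
}
item_based_raw = {
    "Seven (Se7en)": 5.000,
    "Donnie Brasco": 4.707,
    "Babe": 4.688,
    "Blade Runner": 4.670,
}


def normalize(raw_scores):
    """Scale scores so the best item in the list gets 1.0 (assumed method)."""
    top = max(raw_scores.values())
    return {item: score / top for item, score in raw_scores.items()}


def combine(*score_lists):
    """Sum normalized scores so items found by several algorithms rise."""
    combined = {}
    for scores in score_lists:
        for item, score in normalize(scores).items():
            combined[item] = combined.get(item, 0.0) + score
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


ranked = combine(item_similarity_raw, item_based_raw)
for movie, score in ranked:
    print(f"{movie:<16}{score:.3f}")
```

Small differences in the third decimal place against the slide are expected, because the slide normalizes unrounded raw scores.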

From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
…but the 6th result would have been “Babe” the children's movie
OOPS!

• Tuning the algorithms might help: parameter changes, similarity
measures.

• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means, or Fuzzy K-Means

Delivery Scoring and Filters
Apply assumptions to control the results of collaborative filtering
• One or more categories must match
• Only children's movies will be recommended for children's movies.

Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller
Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0
Babe                   0       0          1           1       0      1      0          0       0        0       0
Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1
Star Wars              1       1          0           0       0      0      0          0       1        1       0
Blade Runner           0       0          0           0       0      0      1          0       0        1       0
Fargo                  0       0          0           0       1      1      0          0       0        0       1
Willy Wonka            0       1          1           1       0      0      0          0       0        0       0
Monty Python           0       0          0           1       0      0      0          0       0        0       0
Jaws                   1       0          0           0       0      0      0          1       0        0       0
Heat                   1       0          0           0       1      0      0          0       0        0       1
Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0
To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0
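
The two filter rules can be sketched as a simple predicate over the category flags above. The second rule is read here as "the Children's flag of the candidate must match the movie being viewed", which is one interpretation of the slide; the function name is hypothetical.

```python
# Category sets transcribed from the table above.
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)": {"Crime", "Drama", "Thriller"},
    "Star Wars": {"Action", "Adventure", "Romance", "Sci-Fi"},
    "Blade Runner": {"Film-Noir", "Sci-Fi"},
    "Fargo": {"Crime", "Drama", "Thriller"},
    "Monty Python": {"Comedy"},
    "Jaws": {"Action", "Horror"},
    "Heat": {"Action", "Crime", "Thriller"},
    "Donnie Brasco": {"Crime", "Drama"},
    "To Kill a Mockingbird": {"Drama"},
}


def passes_filter(viewing, candidate):
    """Apply both delivery rules to one candidate recommendation."""
    viewing_genres = GENRES[viewing]
    candidate_genres = GENRES[candidate]
    # Rule 1: one or more categories must match.
    shares_category = bool(viewing_genres & candidate_genres)
    # Rule 2 (assumed reading): the Children's flag must agree on both sides.
    same_audience = ("Children's" in viewing_genres) == ("Children's" in candidate_genres)
    return shares_category and same_audience


# Unfiltered top 10 for a user viewing "Twelve Monkeys", best first.
top_10 = ["Seven (Se7en)", "Blade Runner", "Fargo", "Star Wars", "Donnie Brasco",
          "Babe", "Heat", "To Kill a Mockingbird", "Jaws", "Monty Python"]

filtered = [movie for movie in top_10 if passes_filter("Twelve Monkeys", movie)]
print(filtered)
```

Under this reading “Babe” is dropped by the second rule, and “Heat”, “Jaws”, and “Monty Python” are dropped by the first.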

Similar logic could be applied to promote more favorable options
• New Releases
• Retail Case: Items that are on-sale, overstock

Additional Algorithm – K-Means
“These movies are similar based on their attributes”

• Treats items as coordinates
• Places a number of random “centroids” and assigns the nearest items
• Moves the centroids around based on average location
• Process repeats until the assignments stop changing

We would use the major attributes of the Movie to create coordinate points.
• Categories
• Actors
• Director
• Synopsis Text
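
The loop described above can be sketched in plain Python. This is a bare-bones illustration rather than Mahout's implementation: the initial centroids are picked from the items themselves, distance is squared Euclidean, and the two-dimensional points are hypothetical stand-ins for encoded movie attributes.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (cluster index per point, final centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen items as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical feature vectors: two visibly separate groups of items.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

Text attributes such as the synopsis would first need to be turned into numeric vectors before they could be used as coordinates.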

Integrating K-Means into the process
Movies recommended by more than 1 algorithm are the most highly rated

[Diagram: results from K-Means, Item-Based, and Item Similarity overlap; the intersection is labeled “Best Recommendations”.]

Summary
• Mahout and Hadoop can provide a relatively low cost and
extremely scalable platform for recommendations

• Mahout offers a great collection of established Machine Learning
algorithms, reducing development effort

• A good recommendation system combines Collaborative and
Content filtering algorithms

elliott@casertaconcepts.com
