Dublin Ireland Spark Meetup October 15, 2015

Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
chris@fregly.com, IBM Spark Technology Center (spark.tc)

Who am I?
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced (2016)

Advanced Apache Spark Meetup
Total Spark Experts: ~1300 in 3 mos!
Top 5 most active Spark Meetup globally!
Main Goals
Dig deep into the Spark & extended-Spark codebase
Study integrations such as Cassandra, ElasticSearch,
Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share the patterns and idioms of these
well-designed, distributed, big data components

Why “Spark After Dark”?
“Playboy After Dark”
Late 1960’s TV Show
Progressive Show For Its Time
And it rhymes!!

What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
…
BlinkDB: approx queries

Tools of this Talk
① Redis
② Docker
③ Cassandra
④ MLlib, GraphX
⑤ Parquet, JSON
⑥ Apache Zeppelin
⑦ Spark Streaming, Kafka
⑧ Spark SQL, DataFrames
⑨ Spark JDBC/ODBC Hive ThriftServer
⑩ ElasticSearch, Logstash, Kibana (ELK)
and…

SMACK Stack!
① Spark (Data Processing)
② Mesos (Cluster Manager)
③ Akka (Actors)
④ Cassandra (NoSQL)
⑤ Kafka (Streaming)

Themes of This Talk
①Parallelism
②Performance
③Streaming
④Approximations
⑤Similarity Measures
⑥Recommendations
and…

Goals of Spark After Dark?
① Generate high-quality recommendations
② Demonstrate high-level libraries:
    Spark Streaming -> Kafka, Approximations
    Spark SQL -> DataFrames, Cassandra
    GraphX -> PageRank, Shortest Path
    MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com, however not affiliated with Tinder in any way.

My First Experience with Parallelism
Brady Bunch circa 1980
Season 5, Episode 18: “Two Petes in a Pod”

Parallel Algorithm: O(log n)

Non-parallel Algorithm: O(n)
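
To make the two complexity figures concrete, here is a minimal sketch in Python. The function names are illustrative and not from the talk. It counts the dependent steps needed to sum n numbers: a sequential loop performs n - 1 additions one after another, while a pairwise tree reduction needs about log2(n) rounds, assuming there are enough workers to run every addition within a round at the same time.

```python
def sequential_sum(values):
    """Add values one at a time; returns (total, number of dependent steps)."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise reduction; the additions inside one round are independent.

    Returns (total, number of rounds).
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        # Combine neighbours; an odd leftover element is carried forward.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level, rounds = nxt, rounds + 1
    return level[0], rounds


data = list(range(1, 1025))           # n = 1024
seq_total, seq_steps = sequential_sum(data)
par_total, par_rounds = tree_sum(data)
print(seq_total, seq_steps)           # 524800 1023
print(par_total, par_rounds)          # 524800 10
```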

Daytona Gray Sort Contest
① On-disk only
② 28,000 partitions
③ No in-memory caching

Improved Shuffle and Network Layer
①“Sort-based shuffle”
②Minimize OS resources
③Switched to async Netty
④Keep CPUs hot
⑤Reuse byte buffers to minimize GC
⑥Use epoll for I/O to stay in kernel space

Project Tungsten: CPU and Memory
① More JVM bytecode generation, JIT optimize
② CPU-cache-aware data structs and algos
③ Custom memory management
(Figure labels: Serializers, Performance, HashMap)

DataFrames and Catalyst Optimizer
Please Use DataFrames!
--> JVM bytecode generation
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Columnar Storage Format
* Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
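
The min-max idea can be sketched in a few lines of Python. This is only an illustration of the heuristic, not Parquet's actual reader: the chunk layout, the dictionary keys, and the function names are assumptions made for the example. Each chunk records its minimum and maximum when written, and a lookup skips every chunk whose range cannot contain the target value.

```python
def make_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks and store min/max with each one."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        vals = sorted_values[i:i + chunk_size]
        chunks.append({"min": vals[0], "max": vals[-1], "values": vals})
    return chunks


def scan_equal(chunks, target):
    """Return (matching values, number of chunks actually read)."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # the stored statistics prove no match is possible here
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v == target)
    return matches, chunks_read


column = sorted([7, 3, 9, 1, 4, 8, 2, 6, 5, 10, 12, 11])
chunks = make_chunks(column, chunk_size=4)
matches, chunks_read = scan_equal(chunks, 6)
print(matches, chunks_read, len(chunks))  # [6] 1 3
```

On unsorted data the ranges of different chunks overlap heavily, so far fewer chunks can be skipped, which is why the slide restricts the trick to sorted data.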

Parquet File Format
①Based on Google Dremel Paper
②Implemented by Twitter and Cloudera
③Columnar storage format
④Optimized for fast columnar aggregations
⑤Tight compression
⑥Supports pushdowns
⑦Nested, self-describing, evolving schema

Types of Compression
① Run Length Encoding: repeated data
② Dictionary Encoding: fixed set of values
③ Delta, Prefix Encoding: sorted dataset
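
As a rough illustration of the encodings above, here is a small Python sketch. The function names are made up for this example and the output formats are simplified; real columnar formats pack these encodings into binary pages. Prefix encoding is not shown.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def delta_encode(sorted_values):
    """Keep the first value, then only the difference to the previous one."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


print(run_length_encode(["a", "a", "a", "b", "b", "a"]))
# [('a', 3), ('b', 2), ('a', 1)]
print(dictionary_encode(["US", "IE", "US", "US", "IE"]))
# (['IE', 'US'], [1, 0, 1, 1, 0])
print(delta_encode([1000, 1003, 1004, 1010]))
# [1000, 3, 1, 6]
```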

Types of Query Optimizations
① Column, Partition Pruning
② Row, Predicate Pushdown
SELECT b FROM table WHERE a IN (a2, a3)

Direct Kafka Streaming - KafkaRDD
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At least once delivery guarantee
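
The recovery idea in points ③ and ④ can be mimicked with a toy model in Python. Nothing here uses the real Kafka or Spark APIs: `OffsetRange`, the in-memory `log`, and `read` are hypothetical stand-ins. The point is that a partition defined purely by its offsets can be recomputed after a failure by reading the same range again, and that any side effect of the first attempt may then happen twice, hence at-least-once.

```python
from collections import namedtuple

# Hypothetical stand-in for the offsets carried by each KafkaRDD partition.
OffsetRange = namedtuple("OffsetRange", "topic partition from_offset until_offset")

# Toy in-memory "topic": partition id -> ordered, immutable list of messages.
log = {0: ["a", "b", "c", "d"], 1: ["w", "x", "y", "z"]}


def read(offset_range):
    """Pull exactly the messages in [from_offset, until_offset)."""
    messages = log[offset_range.partition]
    return messages[offset_range.from_offset:offset_range.until_offset]


batch = [OffsetRange("likes", 0, 1, 3), OffsetRange("likes", 1, 0, 2)]

first_attempt = [read(r) for r in batch]
# Simulated worker failure: the same offset ranges are simply read again.
retry = [read(r) for r in batch]

print(first_attempt)           # [['b', 'c'], ['w', 'x']]
print(first_attempt == retry)  # True
```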

Count Min Sketch
①Approximate counters
②Better than HashMap
③Low, fixed memory
④Known error bounds
⑤Large num of counters
⑥From Twitter’s Algebird
⑦Streaming example in codebase
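
For readers who have not seen the data structure, here is a minimal Count-Min Sketch in plain Python. It is a sketch for illustration only and is not Algebird's implementation: the class, the hashing scheme (MD5 salted with the row number), and the sizes are assumptions. It shows the two properties listed above: memory is fixed at depth x width counters, and the estimate can overcount because of collisions but never undercounts.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    def __init__(self, width=200, depth=4):
        self.width, self.depth = width, depth
        # Fixed memory: depth * width counters, however many keys are seen.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum is the
        # tightest estimate and is never below the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


cms, truth = CountMinSketch(), Counter()
for i in range(10_000):
    user = f"user-{(i * i) % 500}"   # synthetic stream of "liked" user ids
    cms.add(user)
    truth[user] += 1

worst = max(cms.estimate(u) - truth[u] for u in truth)
print("largest overcount:", worst)
```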

HyperLogLog
①Approximate cardinality
Approx count distinct
②Low memory
1.5KB @ 2% error
10^9 elements!
③From Twitter’s Algebird
④Streaming example in codebase
⑤RDD: countApproxDistinctByKey()
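
A compact HyperLogLog can be written in plain Python to show where the low memory comes from. This is a teaching sketch under stated assumptions, not Algebird's or Spark's implementation: the class name, the SHA-1 based 64-bit hash, and the choice of 2048 registers are all made up for the example. The estimator keeps one small register per bucket, and its typical relative error is roughly 1.04 / sqrt(number of registers).

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p                      # number of registers (2048)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash value
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")          # duplicates must not change the estimate

estimate = hll.count()
print(f"estimated distinct users: {estimate:.0f} (true value: 50000)")
```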

Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials
Converge on expected value
SparkPi example in codebase
Pi ~ (# red dots / # total dots) * 4
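
The formula can be tried out with a short single-machine Python sketch. It follows the same idea as the SparkPi example but is not that code. It assumes the "red dots" are the random points that land inside the circle, which the missing figure presumably showed, and it samples only the unit quarter-circle for simplicity.

```python
import random


def estimate_pi(num_samples, seed=42):
    """Throw random darts at the unit square; count those inside the circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * inside / num_samples


for n in (100, 10_000, 200_000):
    print(n, estimate_pi(n))
```

By the Law of Large Numbers the estimate tends to move closer to Pi as the number of samples grows, although any single run can still be off.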

Audience Participation Needed!
①Navigate to sparkafterdark.com
②Click 3 actors and 3 actresses

Types of Recommendations
Non-personalized
Cold Start
No preference or behavior data for user, yet
Personalized
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity
Items similar to your previously-liked items

Non-personalized Recommendations

Summary Statistics and Aggregations
①Top Users by Like Count
“I might like users with the highest sum aggregation of
likes overall.”
SparkSQL + DataFrame: Aggregations
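
The aggregation behind this recommendation is a plain group-and-count. The sketch below uses Python's standard library in place of Spark SQL; the `likes` pairs and the user ids are invented sample data, with each pair read as (from_user, to_user).

```python
from collections import Counter

# Hypothetical like events: (from_user, to_user).
likes = [
    ("u1", "u3"), ("u2", "u3"), ("u4", "u3"),
    ("u1", "u2"), ("u3", "u2"),
    ("u2", "u1"),
]

# Equivalent in spirit to: GROUP BY to_user, COUNT(*), ORDER BY count DESC.
like_counts = Counter(to_user for _, to_user in likes)
top_users = like_counts.most_common(2)
print(top_users)  # [('u3', 3), ('u2', 2)]
```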

Like Graph Analysis
②Top Influencers by Like Graph
“I might like users who have the highest probability of
me liking them randomly while walking the like graph.”
GraphX: PageRank
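
The "random walk over the like graph" reading of PageRank can be sketched with a simple power iteration in Python. This is a textbook-style version for illustration; GraphX's implementation may differ in details, and the graph, the damping factor of 0.85, and the function name are assumptions for the example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power iteration over a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing likes spreads its rank over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: an edge (a, c) means "a liked c".
likes = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(likes)
top_influencer = max(ranks, key=ranks.get)
print(top_influencer)  # c
```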

Demo!
Spark SQL + DataFrames + GraphX
+ Hive ThriftServer

Types of Similarity
Euclidean: linear measure
Magnitude bias
Cosine: angle measure
Adjust for magnitude bias
Jaccard: (intersection / union)
Popularity bias
Log Likelihood
Adjust for popularity bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
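
The first three measures are easy to compare side by side in Python. The vectors below are hypothetical like-vectors invented for the example; they are not the rows of the table above, whose column positions did not survive extraction. Log likelihood is not shown.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Treat each vector as the set of positions holding a non-zero value."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


u = [1, 1, 0, 1, 0]
v = [2, 2, 0, 2, 0]   # same direction as u, twice the magnitude
w = [1, 0, 1, 1, 1]

print(euclidean_distance(u, v))   # about 1.73: penalized for magnitude
print(cosine_similarity(u, v))    # about 1.0: the angle ignores magnitude
print(jaccard_similarity(u, w))   # 0.4: 2 shared likes out of 5 overall
```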

All-Pairs Similarity Comparison
Compare everything to everything
a.k.a. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols
Must minimize shuffle through approximations
Reduce m (rows)
Sampling and bucketing
Reduce n (cols): Remove most frequent value (e.g. 0)

Reduce m: DIMSUM Sampling
Dimension Independent Matrix Square Using MR
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)
Twitter: 40% efficiency gain over Cosine

Reduce m: LSH Bucketing
Locality Sensitive Hashing
Split m into b buckets
Use similarity hash algo
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
500k x 500k matrix
O(1.25E17) -> O(1.25E13); b=50
github.com/mrsqueeze/spark-hash
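
The two figures quoted for the 500k x 500k matrix can be checked with a few lines of Python. This is only an arithmetic check; it assumes the bucketed expression is evaluated literally from left to right, which is the reading that reproduces the second figure.

```python
m = n = 500_000   # rows, cols
b = 50            # buckets

naive = m * n ** 2
# Evaluated left to right, m*n/b*b^2 simplifies to m*n*b.
bucketed = m * n / b * b ** 2

print(f"{naive:.3g}")             # 1.25e+17
print(f"{bucketed:.3g}")          # 1.25e+13
print(f"{naive / bucketed:.0f}")  # 10000
```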

Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Choose most frequent value – may not be zero!
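
A small Python sketch of this representation follows. The function names and the sample row are made up for illustration; the example deliberately uses a row whose most frequent value is 3 rather than 0.

```python
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value; keep the rest as (index, value) pairs."""
    default = Counter(row).most_common(1)[0][0]
    pairs = [(i, v) for i, v in enumerate(row) if v != default]
    return default, len(row), pairs


def to_dense(default, length, pairs):
    """Rebuild the original row from its sparse form."""
    row = [default] * length
    for i, v in pairs:
        row[i] = v
    return row


row = [3, 3, 5, 3, 3, 3, 1, 3]
sparse = to_sparse(row)
print(sparse)                    # (3, 8, [(2, 5), (6, 1)])
print(to_dense(*sparse) == row)  # True
```

Only the two stored pairs take part in a comparison, which is where the nnz << n saving comes from.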

Personalized Recommendations

Terminology of Recommendations
User
User seeking recommendations
Item
Item that has been liked or rated
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll

Collaborative Filtering Personalized Recs
③Like behavior of similar users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
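
The quoted intuition can be shown with a tiny neighborhood-based recommender in Python. Note that this is a simpler, swapped-in technique and not the matrix factorization that MLlib provides; the user names, the like sets, and the choice of Jaccard as the user-to-user similarity are all assumptions for the example.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes, top_n=3):
    """Score unseen items by the similarity of the users who liked them."""
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], other_likes)
        if similarity == 0:
            continue   # nothing in common, so their likes tell us nothing
        for item in other_likes - likes[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical data: who each user has liked so far.
likes = {
    "me": {"ann", "bea", "cat"},
    "you": {"ann", "bea", "dee", "eve"},
    "other": {"fay"},
}
print(recommend("me", likes))  # ['dee', 'eve']
```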

Demo!
Spark SQL + DataFrames + MLlib

Text-based Personalized Recs
④Similar profiles to me
“Our profiles have similar, unique k-skip n-grams.
We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
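
The TF/IDF and document similarity part of this slide can be sketched in plain Python. It is an illustration under assumptions, not MLlib code: the profile texts are invented, tokenization is a whitespace split, and the IDF uses a common smoothed form. Word2Vec and k-skip n-grams are not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Terms that appear in every document get weight log(1) = 0.
        vectors.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical profile texts.
profiles = [
    "loves hiking and spark",
    "loves hiking and kafka",
    "loves cooking and wine",
]
vectors = tf_idf(profiles)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(sim_01 > sim_02)  # True
```

Profiles 0 and 2 share only the words that every profile contains, so after IDF weighting their similarity drops to zero, while the shared rarer word "hiking" keeps profiles 0 and 1 related.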

More Text-based Personalized Recs
⑤Similar profiles from my past likes
“Your profile shares a similar feature vector space to
others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity

More Text-based Personalized Recs
⑥Relevant, High-Value Emails
“Your initial email has similar named entities to my profile.
I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
(Figure: Her Email vs. My Profile)

The Future of Personalized Recommendations

Facial Recognition
⑦Eigenfaces
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Conversation Bot
⑧NLP and DecisionTrees
“If your responses to my trite opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTree,
Sentiment Analysis
Positive Negative
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Couples’ Recommendations
⑨Pathways of Similarity
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> <- similar actors
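
One way to picture "pathways of similarity" is a shortest-path search over an item-item similarity graph. The Python sketch below uses breadth-first search instead of GraphX; the graph is invented, and "Film A" and "Film B" are placeholder titles rather than real recommendations.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal."""
    queue, previous = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None   # the two items are not connected


# Hypothetical "is similar to" edges between films.
similar = {
    "Mad Max": ["Film A"],
    "Film A": ["Mad Max", "Film B"],
    "Film B": ["Film A", "Message in a Bottle"],
    "Message in a Bottle": ["Film B"],
}
path = shortest_path(similar, "Mad Max", "Message in a Bottle")
print(path)        # ['Mad Max', 'Film A', 'Film B', 'Message in a Bottle']
print(path[1:-1])  # ['Film A', 'Film B']
```

The items strictly between the two endpoints are the candidates for "something in between".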

⑩ Get Off The Computer and Meet People!
chris@fregly.com
@cfregly
IBM Spark Technology Center (spark.tc)
advancedspark.com
github.com/fluxcapacitor/pipeline
hub.docker.com/r/fluxcapacitor/pipeline/
Thank you!!
Image courtesy of http://www.duchess-france.org/
