Generating Real-time, Streaming Recommendations
[NiFi + Kafka + Spark ML]
Kafka Summit SF
April 26, 2016
Who am I?
Chris Fregly, Principal Data Solutions Engineer
@ IBM Spark Technology Center
Previously, Data Engineer @ Netflix and Databricks
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow
Author @ Advanced Spark (advancedspark.com)
Relevant Spark Contribution
SPARK-1981: Add Kinesis support for Spark Streaming
Me
Fun Meetup!
Fun Workshop!!
San Jose: May 14th (full details @ advancedspark.com)
Fun Github Repo!!!
Agenda
Live, Interactive Demo!
NiFi
Spark Streaming
Streaming Recommendations
Netflix Pipeline (Bonus!)
Live, Interactive Demo
http://demo2.advancedspark.com
NiFi
NiFi = “Niagara Files”
Maintainers @ Hortonworks since 2015
Developed @ NSA over last 8+ years
Integrates with EVERYTHING!
Provides Data Provenance
Data Flow Management
Me, Normal Guy
Joe Witt, NiFi Co-Creator
Buffalo Wild Wings Hat
NiFi + Kafka
NiFi Routing: Http Request
NiFi Geo-Enrichment
NiFi Extract Kafka Topic
NiFi Kafka PUT (Finally!)
NiFi Post-Kafka HttpResponse
NiFi Data Provenance
NiFi Provenance Event Types
ATTRIBUTES_MODIFIED (e.g. Extract Topic Name)
CONTENT_MODIFIED (e.g. Enrich with Geo)
RECEIVE (e.g. Handle HTTP Request)
ROUTE (e.g. Check HTTP Method)
SEND (e.g. PutKafka)
DROP (e.g. Handle HTTP Response)
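To make the event types concrete, here is a minimal Python sketch of one request moving through the flow on the previous slides and collecting a provenance trail. NiFi itself is configured in its UI, so everything here (the `FlowFile` class, `run_flow`, the attribute names, the geo value) is a hypothetical stand-in and not a NiFi API.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified model of a flow file and its provenance trail.
@dataclass
class FlowFile:
    content: dict
    attributes: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)

    def record(self, event_type, detail):
        self.provenance.append((event_type, detail))


def run_flow(http_method, body):
    ff = FlowFile(content=dict(body))
    ff.record("RECEIVE", "handle HTTP request")

    ff.record("ROUTE", f"check HTTP method: {http_method}")
    if http_method != "POST":
        ff.record("DROP", "unsupported method")
        return ff

    # Geo-enrichment changes the content (the lookup result is made up).
    ff.content["geo"] = "unknown"
    ff.record("CONTENT_MODIFIED", "enrich with geo")

    # The Kafka topic name is pulled out of the content into an attribute.
    ff.attributes["kafka.topic"] = ff.content.get("topic", "default")
    ff.record("ATTRIBUTES_MODIFIED", "extract topic name")

    ff.record("SEND", f"put to Kafka topic {ff.attributes['kafka.topic']}")
    ff.record("DROP", "handle HTTP response")
    return ff


ff = run_flow("POST", {"topic": "item_ratings", "userId": 1, "itemId": 7})
for event_type, detail in ff.provenance:
    print(event_type, "-", detail)
```

Reading the printed trail top to bottom gives the same kind of lineage that the provenance screens show for a single request.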
NiFi Search Data Provenance
NiFi Kafka Provenance Event
NiFi Provenance Lineage
Spark Streaming
Submits Time-Based Micro Batches of Data as Spark Jobs
Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA!
Framework for Custom Streaming Receivers
Flexible Window Operations, Optimized State Management
Basic Back Pressure and Throttling Support
At Least Once Guarantees through Write Ahead Log (WAL)
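The micro-batch and window ideas above can be illustrated without Spark. The sketch below is plain Python, not the Spark Streaming API: `micro_batches` and `sliding_windows` are hypothetical helper names, and the events are made up.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed time-based batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batch_start = int(ts // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


def sliding_windows(batches, window_length):
    """Combine the last `window_length` batches at each batch boundary.

    Simplification: only non-empty batches are present in `batches`.
    """
    out = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_length + 1): i + 1]
        out.append((batches[i][0], [v for _, vals in window for v in vals]))
    return out


events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (2.9, "d"), (4.0, "e")]
batches = micro_batches(events, batch_interval=2)
windows = sliding_windows(batches, window_length=2)
print(batches)
print(windows)
```

Each batch would be handed to the engine as one job; a window is just a view over the most recent batches.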
Original Kafka Receiver
Newer Kafka “Direct” Approach (No Receiver)
Spark Streaming KafkaRDD
Kafka “Direct” Streaming Implementation (Spark 1.4+)
Recover/Replay from Kafka using File System-like Offsets
Removes need for Write Ahead Log (WAL)
Uses Kafka, itself, as the WAL!
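Why offsets remove the need for a separate WAL can be shown with a toy log. This is a plain Python sketch under simplifying assumptions (one partition, in memory); `PartitionLog` and `plan_batch` are hypothetical names, not Kafka or Spark APIs.

```python
# Hypothetical in-memory stand-in for one Kafka topic partition: an
# append-only log where each record is addressed by its offset.
class PartitionLog:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read(self, from_offset, until_offset):
        # Half-open range [from_offset, until_offset), like a file slice.
        return self.records[from_offset:until_offset]


def plan_batch(log, last_committed):
    """Pick the offset range for the next batch without reading any data."""
    return (last_committed, len(log.records))


log = PartitionLog()
for r in ["r0", "r1", "r2", "r3"]:
    log.append(r)

committed = 0
offset_range = plan_batch(log, committed)
first_attempt = log.read(*offset_range)

# Suppose the job fails here, before committing, and new data arrives.
log.append("r4")

# Replaying the same offset range returns exactly the same records.
replay = log.read(*offset_range)
committed = offset_range[1]
next_range = plan_batch(log, committed)
```

Because a batch is fully described by its offset range, recovery only needs the range to be remembered; the data is re-read from the log itself.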
KafkaRDD
Streaming Recommendations
Incremental Matrix Factorization!!
(Based on github.com/brkyvz/streaming-matrix-factorization)
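As a rough illustration of the "incremental" part, the sketch below applies one stochastic gradient step per incoming rating. It is a generic SGD update in plain Python, not the code from the repository linked above; the class name, hyperparameters, and sample rating are assumptions for the example.

```python
import random


class IncrementalMF:
    """Toy matrix factorization updated one rating at a time."""

    def __init__(self, rank=2, learning_rate=0.05, reg=0.01, seed=42):
        self.rank = rank
        self.lr = learning_rate
        self.reg = reg
        self.rng = random.Random(seed)
        self.user_factors = {}
        self.item_factors = {}

    def _factors(self, table, key):
        # New users and items start with small random factors.
        if key not in table:
            table[key] = [self.rng.uniform(0.0, 0.1) for _ in range(self.rank)]
        return table[key]

    def predict(self, user, item):
        u = self._factors(self.user_factors, user)
        v = self._factors(self.item_factors, item)
        return sum(a * b for a, b in zip(u, v))

    def update(self, user, item, rating):
        """Apply one SGD step for a single newly arrived rating."""
        u = self._factors(self.user_factors, user)
        v = self._factors(self.item_factors, item)
        err = rating - sum(a * b for a, b in zip(u, v))
        for k in range(self.rank):
            u_k, v_k = u[k], v[k]
            u[k] += self.lr * (err * v_k - self.reg * u_k)
            v[k] += self.lr * (err * u_k - self.reg * v_k)
        return err


model = IncrementalMF()
first_error = abs(model.update("user_1", "item_7", 4.0))
for _ in range(200):
    last_error = abs(model.update("user_1", "item_7", 4.0))
```

In a streaming job, each micro-batch of ratings would drive calls like `update`, so the model moves a little with every event instead of being retrained from scratch.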
Recommendation Serving Layer
Use Case: Recommendation Service Depends on Redis Cache
Problem: Redis Cache Goes Down!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States:
Closed: Service OK
Open: Service DOWN
Fallback to Non-Personalized Recommendations from Disk
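The closed/open behavior can be sketched in a few lines. Hystrix is a Java library, so this Python class is only an illustration of the pattern, not the Hystrix API; the thresholds, function names, and fallback list are assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        return "closed" if self.opened_at is None else "open"

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the failing service
            # Timeout elapsed: let one trial call through. The failure count
            # is kept, so a single failure re-opens the circuit.
            self.opened_at = None
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def redis_recommendations():
    raise ConnectionError("redis is down")  # simulated outage


def non_personalized_from_disk():
    return ["top-1", "top-2", "top-3"]


breaker = CircuitBreaker(failure_threshold=2)
results = [
    breaker.call(redis_recommendations, non_personalized_from_disk)
    for _ in range(3)
]
```

After two failures the circuit opens, and the third call never touches the cache; users still get the non-personalized list.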
Netflix Data Pipeline
9 million events, 22 GB per second @ peak!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority
Recommendations Pipeline
Batch Matrix Factorization
Keep Batch Video (V) Matrix
Calculate Newer User (U) Matrix
Compute U x V Dot Product
Save Model to Disk and EVCache
https://github.com/Netflix/EVCache
Throw away batch user factors (U)
Keep video factors (V)
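The "keep V, recompute U" step can be sketched as follows. This is an illustrative toy in plain Python, not Netflix's code: the video factors, ratings, and hyperparameters are invented, and the user vector is fit with simple gradient steps while the video factors stay fixed.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_user_factors(ratings, video_factors, rank, steps=500, lr=0.05, reg=0.01):
    """Learn one user's factor vector while the video factors stay fixed."""
    u = [0.0] * rank
    for _ in range(steps):
        for video, rating in ratings.items():
            v = video_factors[video]
            err = rating - dot(u, v)
            u = [u_k + lr * (err * v_k - reg * u_k) for u_k, v_k in zip(u, v)]
    return u


def recommend(u, video_factors, seen, top_n=2):
    """Score unseen videos with the U x V dot product and return the best."""
    scores = {vid: dot(u, v) for vid, v in video_factors.items() if vid not in seen}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Video factors (V) kept from the batch job; values are made up.
video_factors = {
    "action_1": [1.0, 0.0],
    "action_2": [0.9, 0.1],
    "drama_1": [0.0, 1.0],
    "drama_2": [0.1, 0.9],
}

# Fresh ratings from one user, used to compute a newer user vector (U).
recent_ratings = {"action_1": 5.0, "drama_1": 1.0}

user_vector = fit_user_factors(recent_ratings, video_factors, rank=2)
recommendations = recommend(user_vector, video_factors, seen=recent_ratings)
```

Only the small user vector is recomputed from recent activity; the expensive video factors are reused from the last batch run.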
Thank You, Kafka Summit SF!
Chris Fregly
@cfregly
All Source Code, Demos, and Docker Images Available @ advancedspark.com, github.com/fluxcapacitor/pipeline
Join the Global Meetup for Slides, Videos, Book @ advancedspark.com

Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi, Kafka, and Spark - Chris Fregly