Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi, Kafka, and Spark - Chris Fregly
[NiFi + Kafka + Spark ML]
Kafka Summit SF
April 26, 2016
Who am I?
Chris Fregly, Principal Data Solutions Engineer
@ IBM Spark Technology Center
Previously, Data Engineer @ Netflix and Databricks
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow
Author @ Advanced Spark (advancedspark.com)
Relevant Spark Contribution
SPARK-1981: Add Kinesis support for Spark Streaming
NiFi = “Niagara Files”
Maintainers @ Hortonworks since 2015
Developed @ NSA over last 8+ years
Integrates with EVERYTHING!
Provides Data Provenance
Data Flow Management
Submits Time-Based Micro Batches of Data as Spark Jobs
Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA!
Framework for Custom Streaming Receivers
Flexible Window Operations, Optimized State Management
Basic Back Pressure and Throttling Support
At Least Once Guarantees through Write Ahead Log (WAL)
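The micro-batch and window ideas above can be sketched in plain Python. This is only an illustration of the concept and not Spark Streaming code: the function names `micro_batches` and `windowed_counts`, the `(timestamp, item_id)` event shape, and the sample events are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, item_id) events into fixed-length time batches."""
    batches = defaultdict(list)
    for ts, item_id in events:
        batches[int(ts // batch_seconds)].append(item_id)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Emit empty batches too, so every time interval is represented.
    return [batches.get(i, []) for i in range(first, last + 1)]

def windowed_counts(batches, window_batches, slide_batches=1):
    """Count items over a sliding window that spans `window_batches` batches."""
    results = []
    for end in range(window_batches, len(batches) + 1, slide_batches):
        window = batches[end - window_batches:end]
        results.append(Counter(item for batch in window for item in batch))
    return results

# Hypothetical click events: (seconds since start, item id).
events = [(0.5, "a"), (1.2, "b"), (1.9, "a"), (3.1, "a")]
batches = micro_batches(events, batch_seconds=1)
counts = windowed_counts(batches, window_batches=2)
```

Each window here covers two one-second batches and slides forward one batch at a time. A real streaming engine would compute the same kind of result incrementally rather than rescanning every window.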
Incremental Matrix Factorization!!
(Based on github.com/brkyvz/streaming-matrix-factorization)
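One common way to factorize incrementally is a stochastic gradient step per incoming rating. The sketch below shows that general technique under stated assumptions. It is not the code from the repository referenced above, and the class name `IncrementalMF`, its hyperparameters, and the sample rating are hypothetical.

```python
import random

class IncrementalMF:
    """Toy online matrix factorization: one SGD step per (user, item, rating)."""

    def __init__(self, rank=2, lr=0.05, reg=0.01, seed=42):
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.user_factors = {}
        self.item_factors = {}

    def _vec(self, table, key):
        # Lazily create a small random factor vector for unseen users/items.
        if key not in table:
            table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.rank)]
        return table[key]

    def predict(self, user, item):
        u = self._vec(self.user_factors, user)
        v = self._vec(self.item_factors, item)
        return sum(a * b for a, b in zip(u, v))

    def update(self, user, item, rating):
        """Apply one regularized SGD step and return the pre-update error."""
        u = self._vec(self.user_factors, user)
        v = self._vec(self.item_factors, item)
        err = rating - sum(a * b for a, b in zip(u, v))
        for k in range(self.rank):
            u_k, v_k = u[k], v[k]
            u[k] += self.lr * (err * v_k - self.reg * u_k)
            v[k] += self.lr * (err * u_k - self.reg * v_k)
        return err

# Replay the same hypothetical rating many times, as if it kept arriving on a stream.
model = IncrementalMF()
errors = [model.update("user_1", "video_1", 1.0) for _ in range(500)]
```

Because each event touches only one user vector and one item vector, the model can be refreshed as events arrive instead of waiting for a full batch retrain.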
Recommendation Serving Layer
Use Case: Recommendation Service Depends on Redis Cache
Problem: Redis Cache Goes Down!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Closed: Service OK
Open: Service DOWN
Fallback to Non-Personalized Recommendations from Disk
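The closed/open/fallback behavior above can be sketched as a small state machine. This is a minimal Python illustration of the circuit breaker pattern, not the Hystrix API (Hystrix is a Java library). The `CircuitBreaker` class, its thresholds, and the two stand-in functions are assumptions made for the example.

```python
import time

class CircuitBreaker:
    """Closed: calls go to the primary. Open: calls go straight to the fallback."""

    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the cool-down, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_seconds

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

def personalized_from_cache():
    # Hypothetical stand-in for a cache lookup; simulates the cache being down.
    raise ConnectionError("cache unavailable")

def popular_from_disk():
    # Hypothetical non-personalized fallback list.
    return ["top-1", "top-2", "top-3"]

breaker = CircuitBreaker(failure_threshold=2)
results = [breaker.call(personalized_from_cache, popular_from_disk) for _ in range(3)]
```

In this run the first two calls fail and trip the breaker. The third call skips the failing dependency entirely and returns the fallback list, which is what protects the serving layer while the cache is down.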
Netflix Data Pipeline
9 million events, 22 GB per second @ peak!
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Splits high and
Batch Matrix Factorization
Keep Batch Video (V) Matrix
Calculate Newer User (U) Matrix
Compute U x V Dot Product
Save Model to Disk and EVCache
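The steps above (hold V fixed, refresh U, score with a dot product) can be sketched as follows. This is an illustrative sketch under assumptions, not the pipeline from the talk: the helper names `fold_in_user` and `recommend`, the gradient-descent solver, and the toy factors are hypothetical. Persisting the model is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fold_in_user(ratings, video_factors, rank, lr=0.05, reg=0.01, steps=500):
    """Fit one user's factor vector while the video factors (V) stay fixed."""
    u = [0.1] * rank
    for _ in range(steps):
        for video, rating in ratings.items():
            v = video_factors[video]
            err = rating - dot(u, v)
            for k in range(rank):
                u[k] += lr * (err * v[k] - reg * u[k])
    return u

def recommend(u, video_factors, seen, top_n=2):
    """Score unseen videos by the U x V dot product and return the top ones."""
    scores = {vid: dot(u, v) for vid, v in video_factors.items() if vid not in seen}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical batch-trained video factors (V), rank 2.
video_factors = {
    "action_1": [1.0, 0.0],
    "action_2": [0.9, 0.1],
    "drama_1": [0.0, 1.0],
    "drama_2": [0.1, 0.9],
}
# Hypothetical recent ratings for one user.
ratings = {"action_1": 5.0, "drama_1": 1.0}

user_vector = fold_in_user(ratings, video_factors, rank=2)
top = recommend(user_vector, video_factors, seen=set(ratings))
```

Only the user side is recomputed here, which is much cheaper than refactorizing both matrices.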
batch user factors (U)
Thank You, Kafka Summit SF!
All Source Code, Demos, and Docker Images Available
Join the Global Meetup for Slides, Videos, Book