Slides for my entry to the DC Apache Spark Meetup Spark Bake-off. I built a demo of a distributed, real-time TCP packet analysis system with Apache Spark, Azure Event Hubs, and Power BI.
DC Spark bake off - Realtime TCP Packet Analysis using Spark and Azure Event Hubs
1. Washington DC Area Apache Spark Interactive
Spark Bake-off
Team Name: Silvio Fiorito
Solution Title: Real-time Packet Analysis using Spark
2. Team Introductions
Silvio Fiorito
– Background in development and app security
– Started working with Hadoop in 2012
– Started using Spark at v0.6 in early 2013
– Built a few prototypes for low-latency query services with Spark/Shark and then SparkSQL
– Twitter: @granturing
3. Solution Overview
Real-time TCP packet analysis of geographically distributed hosts
– Must support high throughput from many hosts
– 3 demo VMs (2 x Azure & 1 x AWS)
Local Flume agent pushes events to Azure Event Hub
Events are partitioned and persisted for up to 7 days
Spark Streaming app ingests streams
– Reconstructs packets
– Looks up geo-IP data and port descriptions
– Clusters packets using a pre-trained k-means model
– Saves data to Azure Table Storage and publishes to a Service Bus topic
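The enrichment and clustering steps above can be sketched in a few lines. This is only an illustration in plain Python, not the code from the demo: the port table, the centroids, the feature choice (destination port and packet length), and the function names are all assumptions made up for the example. Geo-IP lookup and the Spark Streaming wiring are left out.

```python
# Hypothetical port-description table; a real one would be much larger.
PORT_DESCRIPTIONS = {22: "ssh", 53: "dns", 80: "http", 443: "https"}

# Hypothetical centroids standing in for a pre-trained k-means model.
# Each vector is (destination port, packet length in bytes), chosen for illustration.
CENTROIDS = [(80.0, 500.0), (443.0, 1400.0), (22.0, 100.0)]


def describe_port(port):
    """Return a service name for the port, or "unknown" if it is not in the table."""
    return PORT_DESCRIPTIONS.get(port, "unknown")


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(features, centroids):
    """Return the index of the centroid closest to the feature vector."""
    distances = [squared_distance(features, c) for c in centroids]
    return distances.index(min(distances))


def enrich(packet):
    """Return a copy of the packet dict with a port description and cluster id added."""
    features = (float(packet["dst_port"]), float(packet["length"]))
    return {
        **packet,
        "service": describe_port(packet["dst_port"]),
        "cluster": nearest_cluster(features, CENTROIDS),
    }
```

With these made-up centroids, enrich({"dst_port": 443, "length": 1300}) comes back tagged with service "https" and cluster 1. In a streaming job the same function would be mapped over each reconstructed packet before the results are saved and published.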
6. Final Comments & Questions
With more time
– Add true anomaly detection with MLlib
– Test on hosts with real traffic
– Wire up end-to-end with d3.js viz and SparkSQL backend
– Integrate with existing IDS/IPS rules
– Add a lookup for known-bad IPs