Washington DC Area Apache Spark Interactive
Spark Bake-off 
Team Name: Silvio Fiorito 
Solution Title: Real-time Packet Analysis using Spark
Team Introductions 
• Silvio Fiorito
– Background in development and app security
– Started working with Hadoop in 2012
– Started using Spark at v0.6 in early 2013
– Built a few prototypes for low-latency query services with Spark/Shark and then SparkSQL
– Twitter: @granturing
Solution Overview 
• Real-time TCP packet analysis of geographically distributed hosts
– Must support high throughput from many hosts
– 3 demo VMs (2 x Azure & 1 x AWS)
• Local Flume agent pushes events to Azure Event Hub
• Events are partitioned and persisted up to 7 days
• Spark Streaming app ingests streams
– Reconstructs packets
– Looks up geo-IP and port description
– Clusters using pre-trained k-means model
– Saves data to Azure Table Storage and publishes on Service Bus Topic
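
The per-event work inside the streaming app (lookup, then cluster assignment) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the talk and not the Spark API: the lookup table, the centroids, the two-feature vector, and the names `PORT_DESCRIPTIONS`, `CENTROIDS`, `nearest_cluster`, and `enrich` are all hypothetical. Assigning a point to a pre-trained k-means model amounts to picking the nearest centroid.

```python
import math

# Hypothetical lookup table; a real deployment would load a fuller port registry.
PORT_DESCRIPTIONS = {22: "ssh", 53: "dns", 80: "http", 443: "https"}

# Hypothetical centroids standing in for a pre-trained k-means model.
# Assumed feature vector: (destination port, packet length in bytes).
CENTROIDS = [(80.0, 500.0), (443.0, 1400.0), (22.0, 100.0)]


def nearest_cluster(features, centroids):
    """Return (index, distance) of the centroid closest to `features`."""
    distances = [math.dist(features, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def enrich(event):
    """Add a port description and a cluster assignment to one packet event."""
    features = (float(event["dst_port"]), float(event["length"]))
    cluster, distance = nearest_cluster(features, CENTROIDS)
    return {
        **event,
        "port_description": PORT_DESCRIPTIONS.get(event["dst_port"], "unknown"),
        "cluster": cluster,
        "distance": distance,
    }
```

In a streaming job a function like `enrich` would be mapped over each micro-batch of events. The features here are unscaled for brevity; a real model would normally be trained and applied on normalized features.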
Solution Overview
Sample Dashboard with Power BI
Final Comments & Questions 
• With more time
– Add true anomaly detection with MLlib
– Test on hosts with real traffic
– Wire up end-to-end with d3.js viz and SparkSQL backend
– Integrate with existing IDS/IPS rules
– Add a lookup for known-bad IPs
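
Two of the follow-up ideas above can be sketched together in plain Python. This is an assumption-laden illustration, not the author's planned design: it uses a simple distance-to-nearest-centroid threshold as a stand-in for anomaly detection (one common heuristic, not an MLlib API), and the blocklist contents, the threshold, and the names `BAD_IPS`, `distance_to_nearest`, and `flag_event` are hypothetical.

```python
import math

# Hypothetical blocklist (documentation-range addresses); the slides name no source.
BAD_IPS = {"203.0.113.7", "198.51.100.23"}


def distance_to_nearest(features, centroids):
    """Euclidean distance from `features` to the closest centroid."""
    return min(math.dist(features, c) for c in centroids)


def flag_event(src_ip, features, centroids, threshold):
    """Return the list of reasons an event looks suspicious (empty if none)."""
    reasons = []
    if src_ip in BAD_IPS:
        reasons.append("bad_ip")
    # Heuristic: a point far from every cluster is treated as unusual.
    if distance_to_nearest(features, centroids) > threshold:
        reasons.append("far_from_clusters")
    return reasons
```

For example, `flag_event("203.0.113.7", (1.0, 1.0), [(0.0, 0.0)], 5.0)` returns `["bad_ip"]`. The threshold would need tuning against real traffic, which the slides list as still to be tested.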

DC Spark bake off - Realtime TCP Packet Analysis using Spark and Azure Event Hubs
