STREAMING EARLY WARNING
Data Day Seattle 6-27-2015
Chance Coble
Use Case Profile
➾ Telecommunications company
 • Had business problems/pain
 • Scalable analytics infrastructure is a problem
 • Pushing infrastructure to its limits
 • Open to a proof-of-concept engagement with emerging technology
 • Wanted to test on historical data
➾ We introduced Spark Streaming
 • Technology would scale
 • Could prove it enabled new analytic techniques (incident detection)
 • Open to the Scala requirement
 • Wanted to prove it was easy to deploy – EC2 helped
Organization Profile
➾ Telecommunications wholesale business
 • Processes 90 million calls per day
 • Scales up to 1,000 calls per second
 • Nearly half a million calls in a 5-minute window
 • Technology is loosely split into
   • Operational Support Systems (OSS)
   • Business Support Systems (BSS)
➾ Core technology is mature
 • Analytics on a LAMP stack
 • Technology team is strongly skilled in that stack
Jargon
➾ Number
 • Composed of an optional country code, area code (NPA), exchange (NXX), and four more digits
 • Area codes and exchanges are often geo-coded
Example: 1 867 530 9512 (country code, NPA, NXX, last four digits)
Jargon
➾ Trunk Group
 • A trunk is a line carrying transmissions between two points. A trunk group is a set of trunks with some common property, in this case being owned by the same entity.
 • Traffic arriving on ingress trunks is routed out over egress trunks.
➾ Route – in this case, the selection of a trunk group to terminate the call at its destination
➾ QoS – Quality of Service, governed by metrics (sketched in code below)
 • Call Duration – short calls are an indication of quality problems
 • ASR – Average Seizure Rate
   • This company measures it as #connected calls / #calls attempted
➾ Real-time: within 5 minutes
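As a concrete illustration of the two QoS inputs, here is a minimal Scala sketch (not from the talk). CallRecord and its fields are hypothetical stand-ins for the real 180-field CDR, and the 10-second short-call cutoff is an assumption.

// Hypothetical record type: the real CDRs carry roughly 180 fields.
case class CallRecord(connected: Boolean, durationSeconds: Double)

object QosMetrics {
  // ASR as this company measures it: #connected calls / #calls attempted.
  def asr(attempts: Seq[CallRecord]): Double =
    if (attempts.isEmpty) 0.0
    else attempts.count(_.connected).toDouble / attempts.size

  // Share of connected calls that ended quickly; short calls hint at quality problems.
  // The 10-second cutoff is an assumed illustration, not the client's threshold.
  def shortCallRate(attempts: Seq[CallRecord], cutoffSeconds: Double = 10.0): Double = {
    val connected = attempts.filter(_.connected)
    if (connected.isEmpty) 0.0
    else connected.count(_.durationSeconds < cutoffSeconds).toDouble / connected.size
  }
}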
The Problem
➾ A switch handles most of their routing
➾ A configuration table in the switch governs routing
 • if-this-then-that style logic
➾ Proprietary technology handles adjustments to that table
 • Manual intervention is also required
[Diagram: Call Logs → Business Rules Application → Database → Intranet Portal]
The Problem
➾ The backend system receives a log of calls from the switch
 • A file is dumped every few minutes
 • 180 well-defined fields representing features of a call event
 • Supports downstream analytics once enriched with pricing, geo-coding, and account information
Their job is to connect calls at the most efficient price without sacrificing quality.
Why Spark?
➾ They were interested because
 • A workbench can simplify operationalizing analytics
 • They can skip a generation of clunky big data tools
 • It works with their data structures
 • It will "scale out" rather than up
 • It can handle fault-tolerant in-memory updates
Spark Basics - Architecture
[Diagram: the Spark driver hosts the SparkContext and talks to a cluster manager, which allocates executors; each executor runs tasks and keeps a cache.]
Spark Basics – Call Status Count Example
import org.apache.spark.{SparkConf, SparkContext}

val cdrLogPath = "/cdrs/cdr20140731042210.ssv"               // semicolon-separated call detail records
val conf = new SparkConf().setAppName("CDR Count")
val sc = new SparkContext(conf)
val cdrLines = sc.textFile(cdrLogPath)
val cdrDetails = cdrLines.map(_.split(";")).cache()          // cached: both counts reuse the parsed RDD
val successful = cdrDetails.filter(x => x(6) == "S").count()   // field index 6 carries the call status flag
val unsuccessful = cdrDetails.filter(x => x(6) == "U").count()
println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))
Spark Basics - RDDs
➾ Operations on data generate distributable tasks through a directed acyclic graph (DAG)
 • Functional programming FTW!
➾ Resilient
 • Lost data can be recomputed through the generated DAG (its lineage), and can also be stored redundantly
➾ Distributed
 • Work is broken into small tasks, each covering a slice of the data, scheduled through optimizations in the Spark planning engine
➾ Dataset
➾ This construct is native to Spark computation
Spark Basics - RDDs
➾ Lazy – transformations only record lineage; nothing runs until an action is called
➾ Transformations generate tasks over slices (partitions) – see the sketch below
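To make the laziness point concrete, a small self-contained sketch (not from the deck): the transformations below only record lineage, and the count() action is what turns the DAG into tasks, one per slice. The data and slice count are arbitrary.

import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[2]"))

    // Transformations are lazy: these lines only build up the DAG.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8) // 8 slices (partitions)
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n.toLong * n)

    // The action schedules tasks (one per slice) and runs them on executors.
    println(s"count = ${squares.count()}")

    sc.stop()
  }
}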
Streaming Applications – Why try it?
➾ Streaming applications
 • Site activity statistics
 • Spam detection
 • System monitoring
 • Intrusion detection
 • Telecommunications network data
Streaming Models
➾ Record-at-a-time
 • Receive one record and process it
 • Simple, low latency
 • High throughput
➾ Micro-batch (a toy illustration follows below)
 • Receive records and periodically run a batch process over a window
 • The process *must* run fast enough to handle all records collected
 • Harder to reduce latency
 • Easy reasoning
 • Global state
 • Fault tolerance
 • Unified code
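To make the micro-batch trade-off concrete, here is a toy sketch in plain Scala, with no Spark involved: records accumulate in a buffer and a scheduled job drains them once per second, so each batch must finish within the interval or work piles up. Every name and number in it is illustrative.

import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object MicroBatchToy {
  def main(args: Array[String]): Unit = {
    val buffer = new ConcurrentLinkedQueue[String]()       // records arrive here continuously
    val scheduler = Executors.newScheduledThreadPool(1)

    // Every second, drain whatever arrived and process it as one small batch.
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        var batch = List.empty[String]
        var rec = buffer.poll()
        while (rec != null) { batch ::= rec; rec = buffer.poll() }
        // If this step takes longer than the interval, batches start to back up.
        println(s"processed batch of ${batch.size} records")
      }
    }, 1, 1, TimeUnit.SECONDS)

    // Simulated producer pushing roughly one record per millisecond.
    for (i <- 1 to 5000) { buffer.add(s"record-$i"); Thread.sleep(1) }
    scheduler.shutdown()
  }
}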
DStreams
➾ Stands for Discretized Streams
➾ A series of RDDs, one per batch interval (sketched below)
➾ Spark already provided a computation model on RDDs
➾ Note that records are ordered as they are received
 • They are also time-stamped for computation in that order
 • Is that always the way you want to see your data?
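A minimal sketch (not from the talk) of the "series of RDDs" idea: each batch interval the DStream hands over one ordinary RDD stamped with its batch time. The socket source on port 9999 is just an assumed test input.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamAsRdds").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))        // one RDD every 5 seconds

    val lines = ssc.socketTextStream("localhost", 9999)     // assumed test source (e.g. `nc -lk 9999`)
    lines.foreachRDD { (rdd, time) =>
      // Each batch is an ordinary RDD, stamped with the time the batch was formed.
      println(s"batch at $time: ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}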
Fault Tolerance – Parallel Recovery
➾ Failed Nodes
➾ Stragglers!
Fault Tolerance - Recompute
Throughput vs. Latency
Anatomy of a Spark Streaming Program
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.SynchronizedQueue

val sparkConf = new SparkConf().setAppName("QueueStream")
val ssc = new StreamingContext(sparkConf, Seconds(1))       // 1-second batch interval

val rddQueue = new SynchronizedQueue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)                 // DStream fed from a queue of RDDs
val mappedStream = inputStream.map(x => (x % 10, 1))        // key by last digit
val reducedStream = mappedStream.reduceByKey(_ + _)         // count per key, per batch
reducedStream.print()

ssc.start()
for (i <- 1 to 30) {                                        // push one RDD per second for 30 seconds
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  Thread.sleep(1000)
}
ssc.stop()
Utilities are also available for
 • Twitter
 • Kafka
 • Flume
 • Filestream
Windows
[Diagram: a window sliding over the DStream – window length vs. slide interval.]
Streaming Call Analysis with Windows
val path = "/Users/chance/Documents/cdrdrop"                // directory watched for new CDR files
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("CDRIncidentDetection")
  .set("spark.executor.memory", "8g")
// iteration, window and slide (seconds) are configuration values defined elsewhere
val ssc = new StreamingContext(conf, Seconds(iteration))
val callStream = ssc.textFileStream(path)
val cdr = callStream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
val cdrArr = cdr.filter(c => c.length > 136)                // drop malformed or short records
  .map(c => extractCallDetailRecord(c))                     // helper defined elsewhere
val result = detectIncidents(cdrArr)                        // helper defined elsewhere
result.foreach(rdd => rdd.take(10)
  .foreach { case (x, (d, high, low, res)) =>
    println(x + "," + high + "," + d + "," + low + "," + res) })
ssc.start()
ssc.awaitTermination()
Can we enable new analytics?
➾ Incident detection
 • Chose a univariate technique[1] to detect behavior out of profile relative to recent events (a simple sketch of the idea follows below)
 • The technique identifies
   • out-of-profile events
   • dramatic shifts in the profile
 • Easy to understand
[Diagram: current behavior is compared against a profile built over a recent window.]
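The slides cite the univariate technique only by reference, so the following is an illustrative stand-in rather than the project's actual method: a z-score check that flags the latest value of some per-trunk-group metric when it falls outside the profile built over a recent window.

object UnivariateProfile {
  // Flag `latest` if it lies more than k standard deviations from the mean of the recent window.
  def outOfProfile(recent: Seq[Double], latest: Double, k: Double = 3.0): Boolean =
    if (recent.size < 2) false                              // not enough history to form a profile
    else {
      val mean = recent.sum / recent.size
      val variance = recent.map(v => (v - mean) * (v - mean)).sum / (recent.size - 1)
      val stddev = math.sqrt(variance)
      stddev > 0 && math.abs(latest - mean) > k * stddev
    }
}

// e.g. flag a trunk group when its current 5-minute ASR drifts out of its recent profile:
//   UnivariateProfile.outOfProfile(recentAsrWindow, currentAsr)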
Is it simple to deploy?
➾ EC2 helped
➾ The client had no Hadoop and little NoSQL expertise
➾ Develop and deploy
 • Built with sbt, ran on the master
➾ The architecture involved
 • New call detail logs pushed to HDFS on EC2
 • Streaming picks up new data and updates RDDs accordingly
 • Results were explored in two ways
   • Accessing results through data virtualization
   • Writing (small) RDD results to a SQL database (a JDBC sketch follows below)
   • Using a business intelligence tool to create report content
[Diagram: call logs land in HDFS on EC2; Spark Streaming processes the current data; analysis and reporting feed dashboards, with multiple delivery options.]
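Because the per-window results were small, writing them to SQL can be as simple as collecting to the driver and using plain JDBC. The sketch below is an assumption-laden illustration: the incidents table, its columns, and the (trunkGroup, score) result shape are hypothetical, not the client's schema.

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

object ResultSink {
  // Sketch: persist one small batch of results over JDBC.
  def writeIncidents(results: RDD[(String, Double)], jdbcUrl: String): Unit = {
    val rows = results.collect()                            // results are small, so collect on the driver
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO incidents (trunk_group, score) VALUES (?, ?)")
      rows.foreach { case (trunkGroup, score) =>
        stmt.setString(1, trunkGroup)
        stmt.setDouble(2, score)
        stmt.executeUpdate()
      }
      stmt.close()
    } finally conn.close()
  }
}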
Results
[Chart: throughput (MB), y-axis 0–350, at x values 5 and 10, with a "WordCount (Published)" series for comparison.]
Summary of Results
➾ Technology would scale
 • Handled 5 minutes of data in under a second
➾ Proved new analytics were enabled
 • Solved single-variable incident detection
 • Small, simple code
➾ Made a case for Scala adoption
 • Team is still skeptical about big data
➾ Wanted to prove it was easy to deploy – EC2 helped
 • Burned by the forward-slash bug in AWS secret tokens
Incident Visual
References
➾ [1] Zaharia et al., "Discretized Streams"
➾ [2] Zaharia et al., "Discretized Streams: Fault-Tolerant Streaming"
➾ [3] Das, "Spark Streaming – Real-time Big-Data Processing"
➾ [4] Spark Streaming Programming Guide
➾ [5] Running Spark on EC2
➾ [6] Spark on EMR
➾ [7] Ahelegby, "Time Series Outliers"
Contact Us
Email: chance at blacklightsolutions.com
Phone: 512.795.0855
Web: www.blacklightsolutions.com
Twitter: @chancecoble