© 2016 MapR Technologies
Tugdual Grall
Technical Evangelist
@tgrall
Anomaly Detection in Telecom with Spark
Codemotion Amsterdam
12 May 2016
{“about” : “me”}
Tugdual “Tug” Grall
• MapR: Technical Evangelist
• MongoDB: Technical Evangelist
• Couchbase: Technical Evangelist
• eXo: CTO
• Oracle: Developer/Product Manager, mainly Java/SOA
• Developer in consulting firms
• Web: @tgrall, http://tgrall.github.io, tgrall
• NantesJUG co-founder
• Pet project: http://www.resultri.com
• tug@mapr.com
• tugdual@gmail.com
Agenda
• Introduction
• Anomaly Detection : Why?
• Anomaly Detection : How?
• Use Cases and Demonstration: Telco Sample Application
Anomaly Detection
Who Needs Anomaly Detection?
• Utility providers using smart meters
• Feedback from manufacturing assembly lines
• Monitoring data traffic on communication networks
What is Anomaly Detection?
• The goal is to discover rare events
– especially those that shouldn’t have happened
• Find a problem before other people see it
– especially before it causes a problem for customers
• Why is this a challenge?
– I don’t know what an anomaly looks like (yet)
Looks pretty anomalous to me
Basic idea:

Find “normal” first
Steps in Anomaly Detection
• Build a model: collect and process data for training
• Use the machine learning model to determine what the normal pattern is
• Decide how far from this normal pattern a value must be to count as anomalous
• Use the anomaly detection (AD) model to detect anomalies in new data
– Methods such as clustering for discovery can be helpful
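The steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: "normal" is modelled as the mean and standard deviation of a training sample, distance is the z-score, and both the cutoff of 3 standard deviations and the sample values are made up for the example.

```python
from statistics import mean, stdev

def build_model(training_values):
    """Learn what "normal" looks like (here: mean and spread of the data)."""
    return mean(training_values), stdev(training_values)

def is_anomaly(x, model, max_distance=3.0):
    """Flag x when it lies more than max_distance standard deviations
    from the normal pattern. The cutoff is an assumed value.
    Assumes the training data is not constant (sigma > 0)."""
    mu, sigma = model
    return abs(x - mu) / sigma > max_distance

# Hypothetical training data: values seen during normal operation.
normal_traffic = [98, 102, 101, 99, 100, 103, 97, 100, 101, 99]
model = build_model(normal_traffic)

# Apply the model to new data.
new_values = [100, 104, 150]
flags = [is_anomaly(x, model) for x in new_values]
```

Only the last value is flagged: 104 is within about two standard deviations of the training mean, while 150 is far outside it.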
How hard is it to set an alert for anomalies?
Grey data is from normal events; x’s are anomalies.
Where would you set the threshold?
Basic idea:

Set adaptive thresholds
[Figure: signal with its 99.9%-ile marked]
With Spikes
[Figure: signal with spikes; 99.9%-ile including spikes]
How Hard Can it Be?
[Diagram: each value x feeds an online summarizer that tracks the 99.9%-ile as threshold t; if x > t, raise an alarm]
Key Steps in Anomaly Detection
• What is normal?
• What will you measure to identify things that are “far” from normal?
• How far is “far”, if something is to be considered anomalous?
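One hedged way to answer these three questions in code: treat a sliding window of recent values as "normal", measure each new value against the window's 99.9th percentile, and raise an alarm when the value exceeds that threshold. The class name, window size, warm-up length, and nearest-rank percentile below are assumptions for illustration. A production summarizer would more likely use a streaming sketch such as the t-digest than a sorted window.

```python
from bisect import bisect_left, insort
from collections import deque
import math

class AdaptiveThreshold:
    """Toy online summarizer: threshold t is a high percentile of recent values."""

    def __init__(self, window=1000, percentile=99.9, warmup=100):
        self.window = window
        self.percentile = percentile
        self.warmup = warmup      # observations required before alarming
        self.recent = deque()     # values in arrival order
        self.ordered = []         # the same values, kept sorted

    def threshold(self):
        """Nearest-rank percentile of the values currently in the window."""
        rank = math.ceil(self.percentile / 100 * len(self.ordered))
        return self.ordered[max(rank, 1) - 1]

    def observe(self, x):
        """Return True if x > t (alarm), then fold x into the summary."""
        alarm = len(self.ordered) >= self.warmup and x > self.threshold()
        self.recent.append(x)
        insort(self.ordered, x)
        if len(self.recent) > self.window:
            # Evict the oldest value so the threshold adapts over time.
            oldest = self.recent.popleft()
            del self.ordered[bisect_left(self.ordered, oldest)]
        return alarm

detector = AdaptiveThreshold(window=500, percentile=99.9, warmup=100)
# Hypothetical signal: a repeating pattern between 0 and 9, then one spike.
signal = [i % 10 for i in range(300)] + [50]
alarms = [i for i, x in enumerate(signal) if detector.observe(x)]
```

Here the only alarm is raised at index 300, the spike. Note that with fewer than 1000 values in the window, the nearest-rank 99.9th percentile is simply the largest value seen, so a realistic window would be much larger.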
A lot more…
• Model normal, then find anomalies
• t-digest for adaptive threshold
• Probabilistic models for complex patterns
[Figure: plot of offset+noise+pulse1+pulse2, x-axis from 0 to 15, y-axis from −2 to 10, with labels A and B]
Learn more about Machine Learning & Anomaly Detection
https://www.mapr.com/ebook
Yes… but how do I build such an application?
Data flow and processing
1. Device to Antenna ➡ Pure mobile GSM, LTE, 5G, …
2. Antenna to main data center ➡ Streaming Technology
3. Application should:
✓ Store the data ➡ Big Data Storage
✓ Analyse/process the data ➡ Distributed Processing
✓ Detect anomalies and alert IT ➡ Machine Learning
Architecture
[Diagram: Streams, HDFS/MapR-FS, HBase/MapR-DB JSON, Streaming, SQL Engine, Analytics, JDBC/ODBC]
Apache Spark
• Cluster computing platform
• Extends the “MapReduce” model with
– Streaming
– Interactive analytics
• Runs in memory
Spark components
• Spark SQL
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• Spark Core (General execution engine)
• Cluster managers: Mesos, Hadoop YARN
• Distributed File System (HDFS, MapR-FS, S3, …)
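Spark Resilient Distributed Datasets “RDD”
[Diagram: sc.textFile loads a file into the “Sensor RDD”, split into partitions P1 to P4 spread over three workers (W), each running an executor: P4 on the first, P1 and P3 on the second, P2 on the third]
Sample rows per partition:
P1: 8213034705, 95, 2.927373, jake7870, 0……
P2: 8213034705, 115, 2.943484, Davidbresler2, 1….
P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
P4: 8213034705, 117, 2.998947, daysrus, 95….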
Spark Resilient Distributed Datasets
[Diagram: RDD ➡ Transformation, e.g. filter() ➡ newRDD ➡ Action, e.g. count() ➡ Value]
Transformations
• Process an RDD, return a new RDD
• Examples:
– map(): one value => another value
– mapToPair(): one value => a (key, value) tuple
– filter(): keeps the values/tuples that match a given condition
– groupByKey(): groups values by key
– reduceByKey(): aggregates values by key
– join(), cogroup(), …: join RDDs
Actions
• Process an RDD, return a value
• Examples:
– count(): counts the number of items in the dataset
– first(): returns the first element
– take(n): returns an array of the first n elements
– foreach(): applies a function to each element
– collect(): returns all elements
– saveAsTextFile(): saves the elements to text files
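To make the difference between transformations and actions concrete, here is a toy, single-machine stand-in written in plain Python. `MiniRDD` is a hypothetical class invented for this sketch, not Spark's API: its transformations only record work and return a new object, and nothing runs until an action is called. The sample records are made up.

```python
class MiniRDD:
    """Hypothetical, non-distributed imitation of an RDD (not the Spark API)."""

    def __init__(self, source):
        self._source = source  # zero-argument callable returning an iterable

    # Transformations: lazy, each returns a new MiniRDD.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    def reduce_by_key(self, f):
        def run():
            acc = {}
            for key, value in self._source():
                acc[key] = f(acc[key], value) if key in acc else value
            return iter(acc.items())
        return MiniRDD(run)

    # Actions: trigger the computation and return a plain value.
    def count(self):
        return sum(1 for _ in self._source())

    def collect(self):
        return list(self._source())

parsed = []  # records every line that actually gets parsed

def parse(line):
    parsed.append(line)
    tower, duration = line.split(",")
    return tower, int(duration)

lines = MiniRDD(lambda: ["t1,30", "t2,5", "t1,12", "t2,40"])
pairs = lines.map(parse)                                # nothing runs yet
long_calls = pairs.filter(lambda kv: kv[1] >= 10)       # still nothing
totals = long_calls.reduce_by_key(lambda a, b: a + b)   # still nothing
parsed_before_action = len(parsed)

result = dict(totals.collect())   # action: the whole chain executes now
n_long = long_calls.count()       # action: the chain is recomputed
```

No line is parsed until `collect()` is called. Each action re-runs the chain from the source, which is the behaviour Spark's caching is designed to avoid for reused RDDs.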
Apache Kafka
• Feeds of messages are organised in topics
• Processes that publish messages are called producers
• Processes that subscribe to topics and process the messages are called consumers
• A Kafka cluster is made of one or more brokers (broker == node)
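The vocabulary above can be illustrated with a toy in-memory model. This is a sketch of the concepts only, not the Kafka client API: the `Broker` class and its `publish` and `poll` methods are hypothetical names, and real Kafka adds partitions, replication across brokers, and durable storage.

```python
from collections import defaultdict

class Broker:
    """Hypothetical single-node broker: each topic is an append-only list."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Called by producers: append a message to a topic."""
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Called by consumers: return the messages this consumer has not seen."""
        start = self.offsets[(consumer, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(consumer, topic)] = start + len(messages)
        return messages

broker = Broker()
broker.publish("cdr", "tower-1,call-start")
broker.publish("cdr", "tower-2,call-drop")
first = broker.poll("alerting", "cdr")    # both messages so far
broker.publish("cdr", "tower-1,call-end")
second = broker.poll("alerting", "cdr")   # only the new message
other = broker.poll("storage", "cdr")     # a second consumer reads everything
```

Each consumer keeps its own position in the topic, so the same feed can serve several independent applications (here, a hypothetical alerting job and a storage job).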
[Diagram: three producers publish to, and three consumers read from, a cluster of three brokers (Broker 1, Broker 2, Broker 3), each hosting Topic A and Topic B]
What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
• Extension of the core Spark API
[Diagram: Data Sources ➡ Spark Streaming ➡ Data Sinks]
Spark Streaming Architecture
• Divides the data stream into batches of X seconds (micro-batching)
• The result is called a DStream: a sequence of RDDs
[Diagram: input data stream ➡ Spark Streaming ➡ DStream of RDD batches, one per batch interval: data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, data from time 2 to 3 becomes RDD @ time 3]
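The micro-batching idea can be sketched without Spark: group timestamped events into consecutive intervals of X seconds and process each interval as one small batch. The function name, the (timestamp, value) event shape, and the sample events are assumptions for this illustration. Spark Streaming does this grouping itself and hands over each batch as an RDD.

```python
from itertools import groupby

def micro_batches(events, batch_interval):
    """Yield (batch_end_time, [values]) for each non-empty interval.

    events: iterable of (timestamp_seconds, value), sorted by timestamp.
    An event with timestamp ts belongs to the batch covering
    [k * batch_interval, (k + 1) * batch_interval).
    """
    def batch_index(event):
        return int(event[0] // batch_interval)

    for k, group in groupby(events, key=batch_index):
        yield (k + 1) * batch_interval, [value for _, value in group]

# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = list(micro_batches(events, batch_interval=1))
```

With a 1-second interval, the events from time 0 to 1 form the batch at time 1, and so on. Unlike this sketch, which skips intervals that have no events, a real stream also produces empty batches.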
Demonstration
https://github.com/mapr-demos/telco-anomaly-detection-spark
Sample Code
• Universe of antennas and users (Akka / actors)
• Antennas send data to Spark (Kafka/Streams & Spark Streaming)
• Aggregate CDR (call detail record) data by tower (Spark & MapR-DB)
• Analyse tower behaviour and send alerts when needed (Spark & Kafka/Streams)
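The last two steps can be sketched in plain Python. The CDR fields (`tower`, `status`), the failure-rate metric, and the 50% alert threshold are assumptions made up for this example. They are not taken from the demo repository, which holds the actual implementation.

```python
from collections import Counter

def failure_rate_by_tower(cdrs):
    """Aggregate call detail records by tower: fraction of failed calls."""
    total, failed = Counter(), Counter()
    for cdr in cdrs:
        total[cdr["tower"]] += 1
        if cdr["status"] == "FAIL":
            failed[cdr["tower"]] += 1
    return {tower: failed[tower] / total[tower] for tower in total}

def towers_to_alert(rates, max_failure_rate=0.5):
    """Return the towers whose failure rate exceeds the assumed threshold."""
    return sorted(t for t, rate in rates.items() if rate > max_failure_rate)

# Hypothetical CDRs, one dictionary per call.
cdrs = [
    {"tower": "T1", "status": "OK"},
    {"tower": "T1", "status": "OK"},
    {"tower": "T1", "status": "FAIL"},
    {"tower": "T2", "status": "FAIL"},
    {"tower": "T2", "status": "FAIL"},
    {"tower": "T2", "status": "OK"},
]
rates = failure_rate_by_tower(cdrs)
alerts = towers_to_alert(rates)
```

Tower T2 fails two calls out of three and is the only one reported. In a streaming setup the same aggregation would run on each batch, and a fixed threshold could be replaced by an adaptive one.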
Conclusion
• Build a streaming-based application to capture data in real time
– Apache Kafka / MapR Streams
• Store data in a scalable data store
– MapR-FS/DB, Hadoop, NoSQL with Spark support
• Use Spark & Spark Streaming to process data in real time
• Run analytics jobs using Spark or SQL on “Hadoop” (Apache Drill)
Interesting Skills to Add to your Resume
• Apache Kafka
• Apache Spark
• NoSQL
• Machine Learning Techniques
IoT: Racing Cars
[Diagram: producers send sensor data to consumers for real-time analytics]
https://github.com/mapr-demos/racing-time-series
Free eBooks
http://mapr.com/ebook
Q&A
@tgrall maprtech
tug@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
