Anomaly Detection with Apache Spark
Sean Owen, Director of Data Science, Cloudera
Summer course “Disruptive Innovation in Security Technologies” (Curso de Verano “Innovación Disruptiva en tecnologías de seguridad”), Vicálvaro Campus, URJC

  1. Anomaly Detection with Apache Spark: Workshop
     Sean Owen / Director of Data Science / Cloudera
  2. Anomaly Detection
     • What is “Unusual”? Server metrics, access patterns, transactions
     • Labeled, or not: sometimes we know examples of “unusual”, sometimes not
     • Applications: network security, IT monitoring, fraud detection, error detection
  3. Clustering
     • Identify dense clusters of data points
     • Unusual = far from any cluster
     • What is “far”?
     • Unsupervised learning; can “supervise” with some labels to improve or interpret
     en.wikipedia.org/wiki/Cluster_analysis
  4. k-means++ clustering
     • Simple, well-known, parallel algorithm
     • Iteratively assign points, update centers (“means”)
     • Goal: points close to nearest cluster center
     • Must choose k, the number of clusters
     mahout.apache.org/users/clustering/fuzzy-k-means.html
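Slide 4 describes the assign/update iteration only in words. As a rough, self-contained illustration (plain local Scala with naive random initialization, not the distributed MLlib implementation the workshop uses; every name here is made up for the sketch, and k-means++ would choose better-spread seeds than this random init):

    import scala.math.sqrt
    import scala.util.Random

    object KMeansSketch {
      type Point = Array[Double]

      def distance(a: Point, b: Point): Double =
        sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      def mean(points: Seq[Point]): Point =
        points.transpose.map(col => col.sum / points.length).toArray

      def kMeans(data: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
        // Naive init: k random points from the data
        var centers: Seq[Point] = Random.shuffle(data.toVector).take(k)
        for (_ <- 1 to iterations) {
          // Assignment step: group points by the index of their nearest center
          val assigned = data.groupBy(p => centers.indices.minBy(i => distance(centers(i), p)))
          // Update step: each center moves to the mean of the points assigned to it
          centers = centers.indices.map(i => assigned.get(i).map(mean).getOrElse(centers(i)))
        }
        centers
      }
    }

    // e.g. KMeansSketch.kMeans(Seq(Array(0.0, 0.0), Array(0.1, 0.2), Array(9.0, 9.0)), k = 2, iterations = 10)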
  5. Anomaly Detection in KDD Cup ’99
  6. KDD Cup 1999
     • Annual ML competition: www.sigkdd.org/kddcup/index.php
     • ’99: computer network intrusion detection
     • 4.9M connections
     • Most normal, many known to be attacks
     • Not a realistic sample!
  7. 0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
     (The slide’s callouts mark the label, the final field “normal.”, along with the service “http”, the bytes received, and the % SYN errors columns.)
  8. Apache Spark: Something For Everyone
     • From MS Dryad, UC Berkeley, Databricks
     • Scala-based: expressive, efficient, JVM-based
     • Scala-like API: distributed works like local, works like streaming; Collection-like, like Apache Crunch; interactive REPL
     • Distributed and Hadoop-friendly: integrates with where the data and cluster already are; ETL no longer separate
     • MLlib
  9. Clustering, Take #0
  10. val rawData = sc.textFile("/user/srowen/kddcup.data", 120)
      rawData: org.apache.spark.rdd.RDD[String] =
        MappedRDD[13] at textFile at <console>:15
      rawData.count
      ...
      res1: Long = 4898431
  11. 0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
  12. import scala.collection.mutable.ArrayBuffer
      val dataAndLabel = rawData.map { line =>
        val buffer = ArrayBuffer[String]()
        buffer.appendAll(line.split(","))
        buffer.remove(1, 3)
        val label = buffer.remove(buffer.length - 1)
        val vector = buffer.map(_.toDouble).toArray
        (vector, label)
      }
      val data = dataAndLabel.map(_._1).cache()
  13. import org.apache.spark.mllib.clustering._
      val kmeans = new KMeans()
      val model = kmeans.run(data)
      model.clusterCenters.foreach(centroid =>
        println(java.util.Arrays.toString(centroid)))
      val clusterAndLabel = dataAndLabel.map {
        case (data, label) => (model.predict(data), label)
      }
      val clusterLabelCount = clusterAndLabel.countByValue
      clusterLabelCount.toList.sorted.foreach {
        case ((cluster, label), count) =>
          println(f"$cluster%1s$label%18s$count%8s")
      }
  14. 0 back. 2203
      0 buffer_overflow. 30
      0 ftp_write. 8
      0 guess_passwd. 53
      0 imap. 12
      0 ipsweep. 12481
      0 land. 21
      0 loadmodule. 9
      0 multihop. 7
      0 neptune. 1072017
      0 nmap. 2316
      0 normal. 972781
      0 perl. 3
      0 phf. 4
      0 pod. 264
      0 portsweep. 10412
      0 rootkit. 10
      0 satan. 15892
      0 smurf. 2807886
      0 spy. 2
      0 teardrop. 979
      0 warezclient. 1020
      0 warezmaster. 20
      1 portsweep. 1
      Terrible: virtually every point, whatever its label, lands in a single cluster.
  15. Clustering, Take #1: Choose k
  16. import scala.math._
      import org.apache.spark.rdd._
      def distance(a: Array[Double], b: Array[Double]) =
        sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)
      def clusteringScore(data: RDD[Array[Double]], k: Int) = {
        val kmeans = new KMeans()
        kmeans.setK(k)
        val model = kmeans.run(data)
        val centroids = model.clusterCenters
        data.map(datum =>
          distance(centroids(model.predict(datum)), datum)).mean
      }
      val kScores = (5 to 40 by 5).par.map(k =>
        (k, clusteringScore(data, k)))
  17. (Chart slide, no text: the clustering score versus k.)
  18. (5,  1938.8583418059309)
      (10, 1614.7511288131)
      (15, 1406.5960973638971)
      (20, 1111.5970245349558)
      (25,  905.536686115762)
      (30,  931.7399112938756)
      (35,  550.3231624120361)
      (40,  443.10108628017787)
  19. kmeans.setRuns(10)
      kmeans.setEpsilon(1.0e-6)
      (30 to 100 by 10)
      (30,  886.974050712821)
      (40,  747.4268153420192)
      (50,  370.2801596900413)
      (60,  325.883722754848)
      (70,  276.05785104442657)
      (80,  193.53996444359856)
      (90,  162.72596475533814)
      (100, 133.19275833671574)
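Slide 19 shows only the two extra settings and the wider range of k; how they plug into the scoring loop is not shown, so the following is an assumption that simply folds them into the clusteringScore function from slide 16 (it reuses the distance function and the parsed data RDD from the earlier slides):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.rdd.RDD

    def clusteringScore(data: RDD[Array[Double]], k: Int) = {
      val kmeans = new KMeans()
      kmeans.setK(k)
      kmeans.setRuns(10)         // keep the best of 10 random restarts per k
      kmeans.setEpsilon(1.0e-6)  // iterate until the centroids barely move
      val model = kmeans.run(data)
      val centroids = model.clusterCenters
      data.map(datum => distance(centroids(model.predict(datum)), datum)).mean
    }
    val kScores = (30 to 100 by 10).par.map(k => (k, clusteringScore(data, k)))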
  20. Clustering, Take #2: Normalize
  21. data.unpersist(true)
      val numCols = data.take(1)(0).length
      val n = data.count
      val sums = data.reduce((a, b) =>
        a.zip(b).map(t => t._1 + t._2))
      val sumSquares = data.fold(new Array[Double](numCols))(
        (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2))
      val stdevs = sumSquares.zip(sums).map {
        case (sumSq, sum) => sqrt(n * sumSq - sum * sum) / n
      }
      val means = sums.map(_ / n)
      val normalizedData = data.map(
        (_, means, stdevs).zipped.map((value, mean, stdev) =>
          if (stdev <= 0) (value - mean) else (value - mean) / stdev)).cache()
      val kScores = (50 to 120 by 10).par.map(k =>
        (k, clusteringScore(normalizedData, k)))
  22. (50,  0.008184436460307516)
      (60,  0.005003794119180148)
      (70,  0.0036252446694127255)
      (80,  0.003448993315406253)
      (90,  0.0028508261816040984)
      (100, 0.0024371619202127343)
      (110, 0.002273862516438719)
      (120, 0.0022075535103855447)
  23. Clustering, Take #3: Categoricals
  24. val protocols = rawData.map(
        _.split(",")(1)).distinct.collect.zipWithIndex.toMap
      ...
      val dataAndLabel = rawData.map { line =>
        val buffer = ArrayBuffer[String]()
        buffer.appendAll(line.split(","))
        val protocol = buffer.remove(1)
        val vector = buffer.map(_.toDouble)
        val newProtocolFeatures = new Array[Double](protocols.size)
        newProtocolFeatures(protocols(protocol)) = 1.0
        ...
        vector.insertAll(1, newProtocolFeatures)
        ...
        (vector.toArray, label)
      }
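The slide elides how the label and the other categorical columns are handled. A filled-in sketch for the protocol column alone, under the assumption that the label is peeled off as on slide 12 and that the service and flag columns are simply dropped here (on the slide they are presumably one-hot encoded the same way):

    import scala.collection.mutable.ArrayBuffer

    val protocols = rawData.map(_.split(",")(1)).distinct.collect.zipWithIndex.toMap

    val dataAndLabel = rawData.map { line =>
      val buffer = ArrayBuffer[String]()
      buffer.appendAll(line.split(","))
      val protocol = buffer.remove(1)               // pull out the categorical protocol column
      buffer.remove(1, 2)                           // drop service and flag (not encoded in this sketch)
      val label = buffer.remove(buffer.length - 1)  // the last field is the label
      val vector = buffer.map(_.toDouble)
      val newProtocolFeatures = new Array[Double](protocols.size)
      newProtocolFeatures(protocols(protocol)) = 1.0 // one-hot: a 0/1 indicator per distinct protocol value
      vector.insertAll(1, newProtocolFeatures)
      (vector.toArray, label)
    }
    val data = dataAndLabel.map(_._1).cache()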
  25. (50,  0.09807063330707691)
      (60,  0.07344136010921463)
      (70,  0.05098421746285664)
      (80,  0.04059365147197857)
      (90,  0.03647143491690264)
      (100, 0.02384443440377552)
      (110, 0.016909326439972006)
      (120, 0.01610738339266529)
      (130, 0.014301399891441647)
      (140, 0.008563067306283041)
  26. Clustering, Take #4: Labels, Entropy
  27. 0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
     (Callout: the final field, “normal.”, is the label.)
  28. Using Labels with Entropy
      • Measures mixed-ness
      • Bad clusters have very mixed labels
      • Function of the cluster’s label frequencies, p(x)
      • Good clustering = low-entropy clusters
      -Σ p log p
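For example (counts made up for illustration): a cluster holding 90 “normal.” points and 10 “neptune.” points has label frequencies p = (0.9, 0.1), so its entropy is -(0.9 ln 0.9 + 0.1 ln 0.1) ≈ 0.33; a 50/50 mix scores ln 2 ≈ 0.69, and a pure cluster scores 0. Lower is better.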
  29. def entropy(counts: Iterable[Int]) = {
        val values = counts.filter(_ > 0)
        val sum: Double = values.sum
        values.map { v =>
          val p = v / sum
          -p * log(p)
        }.sum
      }
      def clusteringScore(data: RDD[Array[Double]],
                          labels: RDD[String], k: Int) = {
        ...
        val labelsInCluster = data.map(model.predict(_)).zip(labels).
          groupByKey.values
        val labelCounts = labelsInCluster.map(
          _.groupBy(l => l).map(t => t._2.length))
        val n = data.count
        labelCounts.map(m => m.sum * entropy(m)).sum / n
      }
  30. (30,  1.0266922080881913)
      (40,  1.0226914826265483)
      (50,  1.019971839275925)
      (60,  1.0162839563855304)
      (70,  1.0108882243857347)
      (80,  1.0076114958062241)
      (95,  0.4731290640152461)
      (100, 0.5756131018520718)
      (105, 0.9090079450132587)
      (110, 0.8480807836884104)
      (120, 0.3923520444828631)
  31. Detecting an Anomaly
  32. val kmeans = new KMeans()
      kmeans.setK(95)
      kmeans.setRuns(10)
      kmeans.setEpsilon(1.0e-6)
      val model = kmeans.run(normalizedData)
      def distance(a: Array[Double], b: Array[Double]) =
        sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)
      val centroids = model.clusterCenters
      val distances = normalizedData.map(datum =>
        (distance(centroids(model.predict(datum)), datum), datum))
      distances.top(5)(
        Ordering.by[(Double, Array[Double]), Double](_._1))
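The slide stops at listing the five most distant points. One way to turn the same distances into an anomaly flag, sketched here as an assumption rather than anything shown in the deck (the cutoff at the 100th-largest training distance is arbitrary and would need tuning; model, centroids, distance, and normalizedData are those defined above):

    // Threshold: the 100th-largest distance observed in the training data
    val threshold = distances.map(_._1).top(100).last

    // Flag any (already normalized, already encoded) point farther than that from its nearest centroid
    def isAnomaly(datum: Array[Double]): Boolean =
      distance(centroids(model.predict(datum)), datum) > threshold

    val anomalies = normalizedData.filter(isAnomaly)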
  33. From Here to Production?
      • Real data set!
      • Algorithmic: other distance metrics, k-means|| init, use data point IDs
      • Real-time: Spark Streaming? Storm?
      • Continuous pipeline
      • Visualization
  34. sowen@cloudera.com
