
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath

Spark Summit East talk



  1. Petabyte scale data science using Spark & R. Sridhar Alla, Kiran Muglurmath, Comcast
  2. Who we are • Sridhar Alla, Director, Solution Architecture, Comcast • Kiran Muglurmath, Executive Director, Data Science, Comcast • Both focus on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.
  3. Top Initiatives • Customer Churn Prediction • Clickthru Analytics • Personalization • Customer Journey • Modeling
  4. Spark Stack
  5. SparkR • Enables using R packages to process data • Can run Machine Learning and Statistical Analysis algorithms
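A minimal SparkR sketch (not from the deck; assumes the Spark 2.x API, where sparkR.session() replaced the 1.x sparkR.init()):

    library(SparkR)
    sparkR.session(appName = "sparkr-demo")   # entry point since Spark 2.0
    df <- as.DataFrame(faithful)              # local R data.frame -> distributed DataFrame
    # DataFrame operations are executed on the cluster
    head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))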
  6. Spark MLlib • Implements various Machine Learning Algorithms • Classification, Regression, Collaborative Filtering, Clustering, Decomposition • Works with Streaming, Spark SQL, GraphX or with SparkR.
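As a hedged illustration of invoking an MLlib algorithm from SparkR (Spark 2.x API; the dataset, formula, and k = 3 here are arbitrary choices for the sketch):

    library(SparkR)
    sparkR.session()
    training <- as.DataFrame(iris)   # SparkR rewrites '.' in column names to '_'
    # K-means clustering backed by Spark MLlib
    model <- spark.kmeans(training, ~ Sepal_Length + Sepal_Width, k = 3)
    summary(model)
    head(predict(model, training))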
  7. Using PySpark & SparkR
  8. Hidden Markov Model (HMM)
  9. Dataset Preparation: Training Data
  10. Dataset Preparation: Raw Data
  11. Baum-Welch algorithm for state detection 1. Given the download/upload levels (observations) for a given time interval, the model detects the hidden streaming state for that interval. 2. Given a set of observations (i = 1 .. n), the ith hidden state depends only on the (i - 1)th hidden state (the Markov property). For a discrete random variable Xt with N possible values, assume P(Xt | Xt-1) is independent of the time t (a time-homogeneous model). 3. From the observations, estimate transition probabilities for the N possible states, then recursively compute maximum likelihoods over all observations, backwards and forwards, to identify the most probable state for each observation; the standard recursions are sketched below.
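For reference, the textbook forward-backward recursions behind step 3 (standard notation, not from the deck): with transition probabilities $a_{ij}$, emission probabilities $b_j(o_t)$, and initial distribution $\pi$,

$$\alpha_1(j) = \pi_j\, b_j(o_1), \qquad \alpha_t(j) = b_j(o_t) \sum_i \alpha_{t-1}(i)\, a_{ij}$$
$$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_j a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_j \alpha_t(j)\, \beta_t(j)}$$

and the most probable state at time t is $\arg\max_i \gamma_t(i)$.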
  12. Sample Code (R):
    library('RHmm')
    # Read training and test observations (note the escaped double quote for the quote argument)
    indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
    testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
    downloads <- c(as.numeric(indata$V4))
    # Fit a 3-state HMM to the download observations
    downloadModel <- HMMFit(downloads, nStates = 3)
    testdownloads <- c(as.numeric(testdata$V4))
    # Viterbi decoding: most likely hidden-state path for the test data
    tVitPath <- viterbi(downloadModel, testdownloads)
    # Forward-backward procedure, compute probabilities
    tfb <- forwardBackward(downloadModel, testdownloads)
    # Plot implied states
    layout(1:3)
    plot(testdownloads[1:100], ylab = "Down Bandwidth", type = "l", main = "Download bytes")
    plot(tVitPath$states[1:100], ylab = "Download States", type = "l", main = "Download States")
  13. Output for a test dataset
  14. Parallelizing in Hadoop Steps: • Create a sample dataset to build the model. This can be a small sample (~2000 - 5000 rows), or a size sufficient to build a generalized model. • Script the model as an R file, except that it should use streamed input instead of reading from CSV files. Separate map.R and reduce.R files can be created if a reduction stage is required to create unified output datasets (a sketch of map.R follows below). • Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown before: cat dataset.csv | map.R | reduce.R > output.csv • Ensure that Hive tables are in delimited text format. Deploy and run the model using Hadoop streaming with the sample command line below: hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar -D mapred.min.split.size=268435456 -D mapreduce.task.timeout=300000000 -D mapreduce.map.memory.mb=3584 -D mapreduce.reduce.memory.mb=8092 -input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE -output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ -file ./map.R -file <sample dataset to build model.csv> -mapper ./map.R
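A hedged sketch of what such a map.R could look like (assumptions not in the deck: tab-delimited input, the download observation in column 4, and a small sample.csv shipped to each mapper via -file for model fitting):

    #!/usr/bin/env Rscript
    suppressMessages(library(RHmm))

    # Fit the HMM once per mapper from the bundled sample dataset
    train <- read.csv("sample.csv", header = FALSE)
    model <- HMMFit(as.numeric(train$V4), nStates = 3)

    # Stream records from stdin, decode states, emit each row with its flagged state
    con <- file("stdin", open = "r")
    lines <- readLines(con)
    fields <- strsplit(lines, "\t")
    obs <- as.numeric(sapply(fields, `[`, 4))
    states <- viterbi(model, obs)$states   # most likely state per observation
    cat(paste(lines, states, sep = "\t"), sep = "\n")
    close(con)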
  15. Flagged output
  16. Performance • 1.7B observations/day • About 30 minutes processing time/day • 380 shared nodes • 92% accuracy in detecting streaming events
  19. We are hiring! • Big Data Engineers (Hadoop, Spark, Kafka, ...) • Data Analysts (R, SAS, ...) • Big Data Analysts (Hive, Pig, ...) sridhar_alla@cable.comcast.com
  20. THANK YOU.
