Real Time Machine Learning Visualization with Spark


  1. Real Time Machine Learning Visualization with Spark. Chester Chen, Ph.D., Sr. Manager, Data Science & Engineering, GoPro, Inc. Hadoop Summit, San Jose 2016
  2. Who am I?
     • Sr. Manager of Data Science & Engineering at GoPro
     • Founder and organizer of SF Big Analytics Meetup (4500+ members)
     • Previous employment: Alpine Data, Tinga, Clearwell/Symantec, AltaVista, Ascent Media, ClearStory Systems, WebWare
     • Experience with Spark
       – Exposed to Spark since Spark 0.6
       – Architect for Alpine Spark integration on Spark 1.1, 1.3 and 1.5.x
     • Hadoop distributions: CDH, HDP and MapR
  3. Growing data needs
  4. Lightning-fast cluster computing. Real Time ML Visualization with Spark. http://spark.apache.org/
  5. Iris data set, K-Means clustering with K=3 [Figure: sepal width vs. petal length; legend: Cluster 0, Cluster 1, Cluster 2, Centroids]
  6. Iris data set, K-Means clustering with K=3 [Figure label: distance]
  7. What is K-Means?
     • Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector,
     • k-means clustering aims to partition the n observations into k (≤ n) sets S = {S_1, S_2, …, S_k}.
     • The clusters are determined by minimizing the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point to the center of its cluster. In other words, the objective is to find
       $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
     • where μ_i is the mean of the points in S_i.
     • https://en.wikipedia.org/wiki/K-means_clustering
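The objective above can be made concrete with a small sketch in plain Scala (no Spark). It is only an illustration of the cost and of one assign-then-update iteration; the object name `KMeansSketch` and its helper functions are assumptions made for this example and do not come from the slides or from Spark MLlib.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p (ties go to the lower index).
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Cost: sum of squared distances from each point to its nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => sqDist(p, centers(closest(p, centers)))).sum

  // One iteration: assign each point to its nearest center,
  // then move every center to the mean of its assigned points.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => closest(p, centers))
    centers.indices.map { i =>
      groups.get(i) match {
        case Some(members) =>
          val dim = members.head.length
          Array.tabulate(dim)(d => members.map(_(d)).sum / members.size)
        case None =>
          centers(i) // a center that attracted no points stays where it is
      }
    }
  }
}
```

Calling `step` repeatedly and recording `cost` after each call gives the kind of cost-per-iteration series that the talk goes on to visualize.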
  8. Visualization [Figure: "Cost vs Iteration" line chart; cost axis from 35 to 38.5, iteration axis from 0 to 25]
  9. Real Time ML Visualization
     • Use cases
       – Use visualization to decide whether to end the training early
     • Need a way to visualize the training process, including convergence, clustering or residual plots, etc.
     • Need a way to stop the training and save the current model
     • Need a way to disable or enable the visualization
  10. Real Time ML Visualization with Spark: DEMO
  11. How to Enable Real Time ML Visualization?
      • A callback interface for Spark machine learning algorithms to send messages
        – Algorithms decide when and what message to send
        – Algorithms don't care how the message is delivered
      • A task channel to handle the message delivery from Spark Driver to Spark Client
        – It doesn't care about the content of the message or who sent the message
      • The message is delivered from Spark Client to the browser
        – We use HTML5 Server-Sent Events (SSE) and HTTP Chunked Response (PUSH)
        – Pull is possible, but requires a message queue
      • Visualization using the JavaScript frameworks Plot.ly and D3
  12. Spark Job in Yarn-Cluster mode [Diagram labels: Spark Client, Hadoop Cluster, Yarn-Container, Spark Driver, Spark Job, Spark Context, Spark ML algorithm, Command Line, Rest API, Servlet, Application Host]
  13. Spark Job in Yarn-Cluster mode [Diagram labels: Spark Client, Hadoop Cluster, Command Line, Rest API, Servlet, Application Host, Spark Job, App Context, Spark ML Algorithms, ML Listener, Message Logger]
  14. Spark Job in Yarn-Cluster mode [Diagram labels: Spark Client, Hadoop Cluster, Application Host, Spark Job, App Context, Spark ML Algorithms, ML Listener, Message Logger, Web/Rest API Server, Akka, Browser]
  15. Enable Real Time ML Visualization [Diagram labels: SSE, Plotly, D3, Browser, Rest API Server, Web Server, Spark Client, Hadoop Cluster, Spark Job, App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener, Akka, Chunked Response]
  16. Enable Real Time ML Visualization [Diagram: same labels as slide 15]
  17. Machine Learning Listeners
  18. Callback Interface: ML Listener
      trait MLListener {
        def onMessage(message: => Any): Unit
      }
  19. Callback Interface: MLListenerSupport
      trait MLListenerSupport {
        // rest of code
        def sendMessage(message: => Any): Unit = {
          if (enableListener) {
            listeners.foreach(l => l.onMessage(message))
          }
        }
      }
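The slide elides the rest of `MLListenerSupport`, so the following is only a self-contained sketch of how such a mix-in could work, not the author's code. The method names `addListener`, `enableListener` and `listenerEnabled` appear in calls on other slides, but their signatures here are assumptions, as are the private fields `listeners` and `enabled`. The point it illustrates is the by-name parameter (`message: => Any`): the message is only built when the listener is enabled.

```scala
import scala.collection.mutable.ArrayBuffer

trait MLListener {
  def onMessage(message: => Any): Unit
}

// Sketch of the mix-in; field names and signatures are assumptions.
trait MLListenerSupport {
  private val listeners = ArrayBuffer.empty[MLListener]
  @volatile private var enabled = false

  def addListener(l: MLListener): this.type = { listeners += l; this }
  def enableListener(flag: Boolean): this.type = { enabled = flag; this }
  def listenerEnabled(): Boolean = enabled

  // `message` is by-name, so it is not evaluated unless a listener reads it.
  def sendMessage(message: => Any): Unit =
    if (enabled) listeners.foreach(_.onMessage(message))
}
```

Because the argument is passed by name, an algorithm can call `sendMessage(expensiveStats())` on every iteration without paying for `expensiveStats()` while visualization is switched off. Note that each listener that reads the message re-evaluates it.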
  20. KMeansExt: KMeans with MLListener
      class KMeansExt private (…) extends Serializable with Logging with MLListenerSupport {
        ...
      }
  21. KMeansExt: KMeans with MLListener
      case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)
      private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
        ...
        while (!stopIteration && iteration < maxIterations && !activeRuns.isEmpty) {
          ...
          if (listenerEnabled()) {
            sendMessage(KMeansCoreStats(…))
          }
          ...
        }
      }
  22. KMeans ML Listener
      class KMeansListener(columnNames: List[String], data: RDD[Vector], logger: MessageLogger) extends MLListener {
        var sampleDataOpt: Option[Array[Vector]] = None
        override def onMessage(message: => Any): Unit = {
          message match {
            case coreStats: KMeansCoreStats =>
              // sample the data once and reuse the sample for every iteration
              if (sampleDataOpt.isEmpty)
                sampleDataOpt = Some(data.takeSample(withReplacement = false, num = 100))
              // use the KMeans model of the current iteration to predict sample cluster indexes
              val kMeansModel = new KMeansModel(coreStats.centers)
              val cluster = sampleDataOpt.get.map(vector => (vector.toArray, kMeansModel.predict(vector)))
              val msg = KMeansStats(…)
              logger.sendBroadCastMessage(MLConstants.KMEANS_CENTER, msg)
            case _ =>
              println(" message lost")
          }
        }
      }
  23. KMeans Spark Job Setup
      val appCtxOpt: Option[ApplicationContext] = …
      val kMeans = new KMeansExt()
        .setK(numClusters)
        .setEpsilon(epsilon)
        .setMaxIterations(maxIterations)
        .enableListener(enableVisualization)
        .addListener(new KMeansListener(...))
      appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))
      kMeans.run(vectors)
  24. ML Task Observer
      • Receives commands from the user to update the running Spark job
      • Once it receives an UpdateTaskCmd through the notify call, it performs the necessary update operation
      trait TaskObserver {
        def notify(task: UpdateTaskCmd): Unit
      }
      class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger) extends TaskObserver {
        // implement notify
      }
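The slides leave the body of `notify` and the shape of `UpdateTaskCmd` unspecified. As a rough sketch of the observer idea only, the example below invents two commands (`StopTraining`, `SetVisualization`), a small `TrainingControl` holder standing in for the control flags, and a toy `TrainingLoop`; all of these names are hypothetical and are not taken from the talk.

```scala
// Hypothetical commands; the slides only name the type UpdateTaskCmd.
sealed trait UpdateTaskCmd
case object StopTraining extends UpdateTaskCmd
final case class SetVisualization(enabled: Boolean) extends UpdateTaskCmd

// Hypothetical stand-in for the flags a running algorithm checks.
class TrainingControl {
  @volatile var stopIteration: Boolean = false
  @volatile var visualize: Boolean = true
}

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

// Translates user commands into flag updates on the running job.
class ControlObserver(control: TrainingControl) extends TaskObserver {
  override def notify(task: UpdateTaskCmd): Unit = task match {
    case StopTraining         => control.stopIteration = true
    case SetVisualization(on) => control.visualize = on
  }
}

object TrainingLoop {
  // Toy iterative loop: checks the stop flag before every iteration
  // and returns the number of iterations actually run.
  def run(control: TrainingControl, maxIterations: Int)(onIteration: Int => Unit): Int = {
    var iteration = 0
    while (!control.stopIteration && iteration < maxIterations) {
      iteration += 1
      onIteration(iteration)
    }
    iteration
  }
}
```

The loop never blocks on the observer; it only reads a volatile flag, which mirrors the `!stopIteration` check in the K-Means loop shown earlier.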
  25. Logistic Regression MLListener
      class LogisticRegression(…) extends MLListenerSupport {
        def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
          // initialization code
          val (rawWeights, loss) = OWLQN.runOWLQN(…)
          generateLORModel(…)
        }
      }
  26. Logistic Regression MLListener
      object OWLQN extends Logging {
        def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
          val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
          val states: Iterator[lbfgs.State] = lbfgs.iterations(
            new CachedDiffFunction(costFun),
            initialWeights.toBreeze.toDenseVector
          )
          …
        }
      }
  27. Logistic Regression MLListener. In the cost function:
      override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
        // stop early if the user has asked to end the training
        val shouldStop = mlSupport.exists(_.stopIteration)
        if (!shouldStop) {
          …
          mlSupport.filter(_.listenerEnabled()).foreach { s =>
            s.sendMessage((iState.iteration, w, loss))
          }
          …
        } else {
          …
        }
      }
  28. Task Communication Channel
  29. Task Channel: Akka Messaging [Diagram labels: Spark Application, Application Context, Actor System, Messager Actor, Task Channel Actor, SparkContext, Spark tasks, Akka]
  30. Task Channel: Akka Messaging [Diagram labels: SSE, Plotly, D3, Browser, Rest API Server, Web Server, Spark Client, Hadoop Cluster, Spark Job, App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener, Akka, Chunked Response]
  31. Push To The Browser
  32. HTTP Chunked Response and SSE [Diagram labels: SSE, Plotly, D3, Browser, Rest API Server, Web Server, Spark Client, Hadoop Cluster, Spark Job, App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener, Akka, Chunked Response]
  33. HTML5 Server-Sent Events (SSE)
      • Server-Sent Events (SSE) is one-way messaging
        – An event is when a web page automatically gets an update from the server
      • Register an event source (JavaScript)
        var source = new EventSource(url);
      • The callback onMessage(data)
        source.onmessage = function(message){...}
      • Data format (each line starts with "data:" and ends with \n; a blank line ends the event):
        data: {\n
        data: "key": "value"\n
        data: }\n\n
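The data format above can be produced by a small helper. This is a minimal sketch; the object name `Sse` and the function `event` are made up for illustration. It emits one `data:` line per line of the payload and terminates the event with a blank line, which is how the browser's `EventSource` recognizes where an event ends.

```scala
object Sse {
  // Formats a payload as a single SSE event:
  // one "data: ..." line per payload line, then a blank line to end the event.
  def event(payload: String): String =
    payload.split("\n", -1).map(line => s"data: $line\n").mkString + "\n"
}
```

On the browser side, the `data:` lines of one event are joined with newlines before being handed to `onmessage`, so a multi-line JSON payload arrives as one string.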
  34. HTTP Chunked Response
      • Spray REST server supports chunked responses
        // open the event stream, then push one chunk per message
        val responseStart = HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n"))
        requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)
        val nextChunk = MessageChunk(s"data: $r \n\n")
        requestCtx.responder ! nextChunk.withAck(Messages.Ack)
        requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
        requestCtx.responder ! ChunkedMessageEnd
  35. Push vs. Pull: Push
      • Pros
        – The data is streamed (pushed) to the browser via chunked response
        – There is no need for a data queue, but data can be lost if it is not consumed
        – Multiple pages can be pushed to at the same time, which allows multiple visualization views
      • Cons
        – With a slow network, a slow browser and fast iterations, the data might all show up in the browser at once, rather than as a nice iteration-by-iteration display
        – If you throttle the chunked response on network acknowledgements, the visualization may not show up at all, because slow acknowledgements keep the data from being pushed
  36. Push vs. Pull: Pull
      • Pros
        – Messages do not get lost, since they can be temporarily stored in the message queue
        – The visualization renders at an even pace
      • Cons
        – The browser needs to periodically send requests to the server for updates
        – A message queue is needed to hold messages until they are consumed
        – Hard to support rendering on multiple pages with a simple message queue
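To make the pull trade-offs concrete, here is a minimal sketch of a server-side buffer that a polling endpoint could drain. `PullBuffer` is a hypothetical name; the talk does not show a pull implementation, and a real deployment would likely use a proper message queue instead.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Holds messages until a polling client asks for them.
class PullBuffer[A] {
  private val queue = new ConcurrentLinkedQueue[A]()

  // Called by the producer (e.g. once per training iteration).
  def publish(msg: A): Unit = { queue.add(msg); () }

  // Called by each client poll: returns and removes the oldest unread message.
  def poll(): Option[A] = Option(queue.poll())
}
```

Nothing is lost while the client is slow, and each poll renders exactly one iteration. It also shows the multi-page limitation: a message taken by one poller is gone for every other poller, so a single simple queue cannot feed several views.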
  37. Visualization: Plot.ly + D3 [Figure panels: Cost vs. Iteration (two panels), ArrTime vs. Distance, ArrTime vs. DepTime, Alpine Workflow]
  38. Use Plot.ly to render graph
      function showCost(dataParsed) {
        var costTrace = { … };
        var data = [ costTrace ];
        var costLayout = {
          xaxis: {…},
          yaxis: {…},
          title: …
        };
        // draw the chart into the element with id 'cost'
        Plotly.newPlot('cost', data, costLayout);
      }
  39. Real Time ML Visualization: Summary
      • Training a machine learning model involves a lot of experimentation, so we need a way to visualize the training process.
      • We presented a system that enables real time machine learning visualization with Spark. It:
        – Gives visibility into the training of a model
        – Allows us to monitor the convergence of the algorithms during training
        – Lets us stop the iterations when convergence is good enough
  40. Thank You
      • Chester Chen, chesterxgchen@yahoo.com
      • LinkedIn: https://www.linkedin.com/in/chester-chen-3205992
      • SlideShare: http://www.slideshare.net/ChesterChen/presentations
      • Demo video: https://youtu.be/DkbYNYQhrao
