Real Time Machine Learning Visualization With Spark

Chester Chen
Director of Engineering
Alpine Data
March 13, 2016

Who am I ?
• Director of Engineering at Alpine Data
• Founder and Organizer of SF Big Analytics Meetup (3500+ members)
• Previous Employment:
– Architect / Director at Tinga, Symantec, AltaVista, Ascent Media, ClearStory
Systems, WebWare.
• Experience with Spark
– Exposed to Spark since Spark 0.6
– Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
• Hadoop Distribution
– CDH, HDP and MapR

Alpine Data at a Glance
Enterprise Scale Predictive Analytics with deep experience in Machine Learning, Data Science, and
Distributed Data Architectures
Industry Innovations and IP
Broad patents awarded for in-cluster and in-database machine learning - 2012
First web-based solution for end-to-end Predictive analytics - 2012
Created Industry first integrated Analytics Services Platform - 2013
First Predictive Analytics solution to be certified on Spark - 2014
Launched Touchpoints, Industry first predictive applications service layer- 2015
Global Brand Names in Financial Services, Telco/Media, Healthcare, Manufacturing, Public Sector and Retail
Visionary in the Gartner Magic Quadrant for Advanced Analytics
Key Partners:

Lightning-fast cluster computing
Real Time ML Visualization with Spark
-- What is Spark
http://spark.apache.org/

Iris data set, K-Means clustering with K=3
[Figure: scatter plot of sepal width vs. petal length, showing Cluster 0, Cluster 1, Cluster 2 and the centroids]

Iris data set, K-Means clustering with K=3
[Figure: cluster plot annotated with "distance"]

What is K-Means ?
• Given a set of observations (x1, x2, …, xn), where each observation is a
d-dimensional real vector,
• k-means clustering aims to partition the n observations into k (≤ n) sets
S = {S1, S2, …, Sk}
• The clusters are determined by minimizing the within-cluster sum of
squares (WCSS), i.e. the sum of squared distances from each point to the
center of its cluster. In other words, the objective is to find
\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
• where μi is the mean of points in Si.
• https://en.wikipedia.org/wiki/K-means_clustering
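The cost plotted later in the deck is this within-cluster sum of squares. As a minimal sketch of how it could be computed for a given partition, assuming plain Scala collections rather than Spark RDDs (the object name WcssSketch and the Point alias are illustrative and not part of the talk):

```scala
object WcssSketch {
  // Illustrative stand-in for a d-dimensional observation.
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty set of points (the centroid).
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / points.size).toVector

  // Sum over all clusters of squared distances to the cluster mean.
  def wcss(clusters: Seq[Seq[Point]]): Double =
    clusters.filter(_.nonEmpty).map { s =>
      val mu = mean(s)
      s.map(p => squaredDistance(p, mu)).sum
    }.sum
}
```

For example, the cluster {(0,0), (2,0)} has mean (1,0) and contributes 1 + 1 = 2, while a single-point cluster contributes 0.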

Visualization Cost
[Figure: Cost vs. Iteration line chart; x-axis: iteration (0 to 25), y-axis: cost (35 to 38.5)]

Real Time ML Visualization – Why ?
• Use Cases
– Use visualization to decide whether to end the training early
• Need a way to visualize the training process, including
convergence, clustering or residual plots, etc.
• Need a way to stop the training and save the current model
• Need a way to disable or enable the visualization

Real Time ML Visualization with Spark
DEMO

How to Enable Real Time ML Visualization ?
• A callback interface that lets Spark machine learning algorithms send messages
– Algorithms decide when to send a message and what it contains
– Algorithms don’t care how the message is delivered
• A task channel that delivers messages from the Spark Driver to the Spark Client
– It doesn’t care about the content of the message or who sent it
• The message is then delivered from the Spark Client to the browser
– We use HTML5 Server-Sent Events (SSE) and HTTP chunked responses (push)
– Pull is possible, but requires a message queue
• Visualization uses the JavaScript frameworks Plot.ly and D3

Spark Job in Yarn-Cluster mode
[Diagram: an Application Host (Command Line, REST API, Servlet) runs the Spark Client; on the Hadoop Cluster, a YARN container hosts the Spark Driver with the Spark Job, Spark Context and Spark ML algorithm]

[Diagram: the same layout, with the Spark Job containing the App Context, Spark ML Algorithms, ML Listener and Message Logger]

[Diagram: the previous layout with a Web/REST API Server, a Browser and Akka messaging added]

Enable Real Time ML Visualization
[Diagram: end-to-end flow. Hadoop Cluster: Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger) → Task Channel (Akka) → Spark Client → Web Server / REST API Server (Akka, Chunked Response) → Browser (SSE, Plotly, D3)]

Machine Learning Listeners

Callback Interface: ML Listener
trait MLListener {
  // By-name parameter: the message is only built if a listener evaluates it
  def onMessage(message: => Any): Unit
}

Callback Interface: MLListenerSupport
trait MLListenerSupport {
  // rest of code
  def sendMessage(message: => Any): Unit = {
    // Skip delivery entirely when visualization is turned off
    if (listenerEnabled()) {
      listeners.foreach(l => l.onMessage(message))
    }
  }
}
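The callback interface relies on a by-name parameter (message: => Any), so an expensive message is never built while the listener is disabled. The following is a minimal, self-contained sketch of that idea, not the Alpine implementation: the listener storage, the `enabled` flag, the Collector class and the demo method are assumptions added for illustration.

```scala
import scala.collection.mutable.ListBuffer

object ListenerSketch {
  trait MLListener {
    def onMessage(message: => Any): Unit
  }

  trait MLListenerSupport {
    // Assumed internal state; the slides elide this as "rest of code".
    private val listeners = ListBuffer.empty[MLListener]
    @volatile private var enabled = false

    def addListener(l: MLListener): this.type = { listeners += l; this }
    def enableListener(flag: Boolean): this.type = { enabled = flag; this }
    def listenerEnabled(): Boolean = enabled

    // The by-name argument is forwarded unevaluated to each listener.
    def sendMessage(message: => Any): Unit =
      if (enabled) listeners.foreach(_.onMessage(message))
  }

  // Hypothetical listener that simply records what it receives.
  class Collector extends MLListener {
    val received = ListBuffer.empty[Any]
    def onMessage(message: => Any): Unit = received += message
  }

  // Returns (how many times the message was built, what was received).
  def demo(): (Int, List[Any]) = {
    var built = 0
    def expensive(): String = { built += 1; s"stats-$built" }

    val algo = new MLListenerSupport {}
    val collector = new Collector
    algo.addListener(collector)

    algo.sendMessage(expensive()) // listener disabled: expensive() never runs
    algo.enableListener(true)
    algo.sendMessage(expensive()) // listener enabled: built once, delivered

    (built, collector.received.toList)
  }
}
```

Note that with several listeners, each one that evaluates the by-name message rebuilds it; a listener that needs the value more than once should bind it to a local val first.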

KMeansExt: KMeans with MLListener
class KMeansExt private (…) extends Serializable
  with Logging
  with MLListenerSupport {
  ...
}

KMeansExt: KMeans with MLListener
case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)
private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
  ...
  // stopIteration lets the user end training before maxIterations
  while (!stopIteration &&
         iteration < maxIterations && !activeRuns.isEmpty) {
    ...
    // Only send per-iteration statistics when visualization is on
    if (listenerEnabled()) {
      sendMessage(KMeansCoreStats(…))
    }
    ...
  }
}

KMeans Spark Job Setup
val kMeans = new KMeansExt().setK(numClusters)
  .setEpsilon(epsilon)
  .setMaxIterations(maxIterations)
  .enableListener(enableVisualization)
  .addListener(new KMeansListener(...))
appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))
kMeans.run(vectors)

KMeans ML Listener
class KMeansListener(columnNames: List[String],
                     data: RDD[Vector],
                     logger: MessageLogger) extends MLListener {
  // sampling the data
  override def onMessage(message: => Any): Unit = message match {
    case coreStats: KMeansCoreStats =>
      // use the KMeans model of the current iteration to predict the
      // cluster indexes of the sample
      // construct a message consisting of sample, cost, iteration and centroids
      // use logger to send the message out
    case _ => // ignore other message types
  }
}

ML Task Observer
• Receives commands from the user to update a running Spark job
• Once it receives an UpdateTaskCmd through a notify call, it performs
the necessary update operation
trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}
class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger)
  extends TaskObserver {
  // implement notify
}
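The observer is what lets a user command reach a running training loop. As a rough sketch of that path, assuming a single hypothetical StopTraining command and a toy loop in place of a real Spark job (ToyTrainer, StopObserver and requestStop are illustrative names, not from the talk):

```scala
import java.util.concurrent.atomic.AtomicBoolean

object StopSketch {
  // Assumed command type; the deck does not show UpdateTaskCmd's definition.
  sealed trait UpdateTaskCmd
  case object StopTraining extends UpdateTaskCmd

  trait TaskObserver {
    def notify(task: UpdateTaskCmd): Unit
  }

  // Toy stand-in for an iterative algorithm with a stop flag.
  class ToyTrainer(maxIterations: Int) {
    private val stopRequested = new AtomicBoolean(false)
    def requestStop(): Unit = stopRequested.set(true)

    // Runs until maxIterations or a stop request; returns iterations done.
    def run(onIteration: Int => Unit): Int = {
      var iteration = 0
      while (!stopRequested.get() && iteration < maxIterations) {
        iteration += 1
        onIteration(iteration)
      }
      iteration
    }
  }

  // Translates a user command into an update on the running trainer.
  class StopObserver(trainer: ToyTrainer) extends TaskObserver {
    def notify(task: UpdateTaskCmd): Unit = task match {
      case StopTraining => trainer.requestStop()
    }
  }

  // Simulates a user pressing "stop" while iteration 5 is running.
  def demo(): Int = {
    val trainer = new ToyTrainer(maxIterations = 20)
    val observer = new StopObserver(trainer)
    trainer.run(i => if (i == 5) observer.notify(StopTraining))
  }
}
```

The flag is checked once per iteration, so the loop finishes the iteration in progress and then exits with its current state, which is what makes saving the current model possible.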

Logistic Regression MLListener
class LogisticRegression(…) extends MLListenerSupport {
  def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
    // initialization code
    val (rawWeights, loss) = OWLQN.runOWLQN(…)
    generateLORModel(…)
  }
}

object OWLQN extends Logging {
  def runOWLQN(/* args */, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
    // The cost function carries mlSupport so it can report each iteration
    val costFun = new CostFun(data, mlSupport, IterationState(), /* other args */)
    val states: Iterator[lbfgs.State] =
      lbfgs.iterations(
        new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector
      )
    …
  }
}

In the cost function:
override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
  val shouldStop = mlSupport.exists(_.stopIteration)
  if (!shouldStop) {
    …
    // foreach, not map: sendMessage is called only for its side effect
    mlSupport.filter(_.listenerEnabled()).foreach { s =>
      s.sendMessage((iState.iteration, w, loss))
    }
    …
  }
  else {
    …
  }
}

Task Communication Channel

Task Channel : Akka Messaging
[Diagram: the Spark Application contains an Application Context with an Actor System (Messager Actor, Task Channel Actor), alongside the SparkContext and Spark tasks; messaging is over Akka]

Task Channel : Akka messaging
[Diagram: the end-to-end architecture from "Enable Real Time ML Visualization", repeated for the Task Channel (Akka) section]

Push To The Browser

HTTP Chunked Response and SSE
[Diagram: the end-to-end architecture from "Enable Real Time ML Visualization", repeated for the Chunked Response and SSE section]

HTML5 Server-Sent Events (SSE)
• Server-Sent Events (SSE) is one-way messaging, from server to browser
– An event is when a web page automatically gets an update from the server
• Register an event source (JavaScript)
var source = new EventSource(url);
• The callback onmessage(message)
source.onmessage = function(message){...}
• Data format:
data: {\n
data: "key" : "value"\n
data: }\n\n
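In the data format above, every line of the payload is prefixed with "data: " and the event is terminated by a blank line. A small helper along these lines could produce that framing on the server side; this is only a sketch, and the names SseSketch and toSseEvent are made up for illustration:

```scala
object SseSketch {
  // Frames a (possibly multi-line) payload as one Server-Sent Event:
  // each payload line becomes "data: <line>\n", and a final empty line
  // tells the browser that the event is complete.
  def toSseEvent(payload: String): String =
    payload.split("\n", -1).map(line => s"data: $line\n").mkString + "\n"
}
```

On the browser side, the EventSource joins the data lines back together with newlines, so message.data holds the original payload and can be passed to JSON.parse.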

HTTP Chunked Response
• The Spray REST server supports chunked responses
// Each SSE event is terminated by a blank line ("\n\n")
val responseStart =
  HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n\n"))
requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)
val nextChunk = MessageChunk(s"data: $r\n\n")
requestCtx.responder ! nextChunk.withAck(Messages.Ack)
requestCtx.responder ! MessageChunk(s"data: Finished\n\n")
requestCtx.responder ! ChunkedMessageEnd

Push vs. Pull
Push
• Pros
– The data is streamed (pushed) to the browser via a chunked response
– There is no need for a data queue, but data can be lost if it is not consumed
– Multiple pages can be pushed to at the same time, which allows multiple
visualization views
• Cons
– With a slow network, a slow browser and fast iterations, the data might all
show up in the browser at once, rather than as a nice iteration-by-iteration
display
– If you throttle the chunked response by network acknowledgement,
the visualization may not show up at all, because data is not pushed while
acknowledgements are slow

Push vs. Pull
Pull
• Pros
– Messages do not get lost, since they can be temporarily stored in the
message queue
– The visualization renders at an even pace
• Cons
– The browser needs to periodically send requests to the server for updates
– We need a message queue to hold messages until they are consumed
– It is hard to support rendering on multiple pages with a simple message
queue
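To make the pull trade-off concrete, here is a minimal sketch of a drain-on-poll message queue, assuming an in-memory queue on the server that the browser polls through some REST endpoint (the endpoint is omitted; PullSketch and MessageQueue are hypothetical names):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

object PullSketch {
  // Holds messages from the training job until a page asks for them.
  class MessageQueue[A] {
    private val queue = new ConcurrentLinkedQueue[A]()

    // Called by the producer (e.g. once per training iteration).
    def publish(msg: A): Unit = { queue.add(msg); () }

    // Called per poll request: removes and returns everything pending.
    def poll(): List[A] =
      Iterator
        .continually(Option(queue.poll()))
        .takeWhile(_.isDefined)
        .collect { case Some(m) => m }
        .toList
  }
}
```

Nothing is lost while no one is polling, but whichever page polls first drains the queue, so a second page polling afterwards receives nothing. That is the multiple-page limitation listed under Cons.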

Visualization: Plot.ly + D3
[Figures: Cost vs. Iteration; ArrTime vs. Distance; ArrTime vs. DepTime; Alpine Workflow]

Use Plot.ly to render graph
function showCost(dataParsed) {
  var costTrace = { … };
  var data = [ costTrace ];
  var costLayout = {
    xaxis: {…},
    yaxis: {…},
    title: …
  };
  // Render into the element with id 'cost'
  Plotly.newPlot('cost', data, costLayout);
}

Real Time ML Visualization: Summary
• Training a machine learning model involves a lot of experimentation,
so we need a way to visualize the training process.
• We presented a system that enables real time machine learning
visualization with Spark:
– Gives visibility into the training of a model
– Allows us to monitor the convergence of the algorithms during training
– Lets us stop the iterations when convergence is good enough

Thank You
Chester Chen
chester@alpinenow.com
LinkedIn
https://www.linkedin.com/in/chester-chen-3205992
SlideShare
http://www.slideshare.net/ChesterChen/presentations
demo video
https://youtu.be/DkbYNYQhrao
