Michal Malohlava's Sparkling Water Applications Meetup on 07.21.15, focusing on the Ask Craig use case.
http://h2o.ai/blog/2015/06/ask-craig-sparkling-water/
8. Apache Spark: open-source distributed execution platform
User-friendly API for data transformation based on RDDs
Platform components: SQL, MLlib, text mining
Multitenancy
Large and active community
9. H2O: open-source scalable machine learning platform
Tuned for efficient computation and memory use
Production-ready machine learning algorithms
R, Python, Java, Scala APIs
Interactive UI, robust data parser
11. Sparkling Water
Provides:
Transparent integration of H2O with the Spark ecosystem
Transparent use of H2O data structures and algorithms with the Spark API
Excels in existing Spark workflows requiring advanced machine learning algorithms
A platform for building Smarter Applications
13. Data Distribution
[Diagram: a Sparkling Water cluster of Spark Executor JVMs, each also hosting an H2O instance. Data is read from a source (e.g. HDFS) into a Spark RDD and an H2O RDD; RDDs and DataFrames share the same memory space, with toRDD and toH2OFrame converting between the Spark and H2O representations.]
16. ML Workflow
1. Perform feature extraction on words + munging
2. Run the Word2Vec algorithm (MLlib) on job-title words
3. Create "title vectors" from the individual word vectors of each job title
4. Pass the Spark RDD to an H2O RDD for ML in Flow
5. Run the H2O GBM algorithm on the H2O RDD
6. Create a Spark Streaming application + score new job titles
18. App Skeleton
// Sparkling environment: the injected SparkContext, SQLContext, and H2OContext
// Required capabilities: buildModels and classify
class CraigslistJobTitlesApp(jobsFile: String = "…")
                            (@transient override val sc: SparkContext,
                             @transient override val sqlContext: SQLContext,
                             @transient override val h2oContext: H2OContext)
  extends SparklingWaterApp
  with SparkContextSupport
  with GBMSupport
  with ModelMetricsSupport
  with Serializable {

  def buildModels(datafile: String,
                  modelName: String): (Model[_, _, _], Word2VecModel)

  def classify(jobTitle: String,
               modelId: String,
               w2vModel: Word2VecModel): (String, Array[Double])
}
19. Data: text munging
Example: "Site Supervisor and Pre K Teachers Needed Now!!!"
Post tokenization: Seq(site, supervisor, pre, teachers, needed)
val tokens = jobTitles.map(line => tokenize(line))
Next: apply Spark's Word2Vec model to each word
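The tokenization step can be sketched in plain Scala. This is a minimal illustration under assumptions: the stop-word list here is a toy subset, and the exact filtering rules of the app's real `tokenize` helper are not shown on the slide.

```scala
// Minimal tokenization sketch: lowercase, split on non-word characters,
// drop stop words and single-letter fragments (e.g. the "K" in "Pre K").
// STOP_WORDS is a toy subset, not the app's real list.
object TokenizeSketch {
  val STOP_WORDS = Set("and", "the", "a", "an", "now")

  def tokenize(line: String): Seq[String] =
    line.toLowerCase
      .split("\\W+")             // split on non-word characters
      .filter(_.nonEmpty)
      .filterNot(STOP_WORDS)     // drop stop words
      .filter(_.length > 1)      // drop single-letter fragments
      .toSeq

  def main(args: Array[String]): Unit = {
    val title = "Site Supervisor and Pre K Teachers Needed Now!!!"
    println(tokenize(title).mkString(", "))
    // site, supervisor, pre, teachers, needed
  }
}
```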
20. Data: Word2Vec model
Simply: a mathematical way to represent a single word as a vector of numbers. These vector 'representations' encode information about a given word (i.e. its meaning).
Post tokenization: Seq(site, supervisor, pre, teachers, needed)
Post Word2Vec results:
needed, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
site, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
supervisor, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
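The "meaning" these vectors encode shows up as geometric closeness: related words end up with similar vectors. A toy illustration in plain Scala, with hand-made 3-dimensional vectors (real Word2Vec vectors are learned and have far more dimensions):

```scala
// Toy illustration of word vectors compared by cosine similarity.
// The vectors and words below are made up for the example.
object WordVecSketch {
  val vecs: Map[String, Array[Double]] = Map(
    "teacher"    -> Array(0.9, 0.1, 0.2),
    "supervisor" -> Array(0.8, 0.2, 0.3),
    "hdfs"       -> Array(0.1, 0.9, 0.7)
  )

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    // "teacher" should be closer to "supervisor" than to "hdfs"
    println(cosine(vecs("teacher"), vecs("supervisor")))
    println(cosine(vecs("teacher"), vecs("hdfs")))
  }
}
```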
21. Data: job title vectors
In steps:
1. Sum the Word2Vec vectors of the words in a given title
2. Divide this sum by the number of words in the title
Result: ~ average vector for a given title of N words

  needed,     mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
+ site,       mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
+ supervisor, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
÷ total words (post tokenization)
~ (site supervisor … needed), [0.998, 0.349, 0.621, …, 0.915]
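The two steps above amount to an element-wise average. A minimal sketch over plain arrays (the real code operates on mllib.linalg.Vector; the helper name is an assumption):

```scala
// Average a list of word vectors element-wise to get one "title vector".
object TitleVectorSketch {
  def averageVectors(wordVecs: Seq[Array[Double]]): Array[Double] = {
    require(wordVecs.nonEmpty, "title must contain at least one known word")
    val dim = wordVecs.head.length
    // 1. Sum the vectors component-wise
    val sum = wordVecs.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
      acc.zip(v).map { case (a, b) => a + b }
    }
    // 2. Divide by the number of words
    sum.map(_ / wordVecs.length)
  }

  def main(args: Array[String]): Unit = {
    val vecs = Seq(Array(0.25, 0.5), Array(0.75, 1.5))
    println(averageVectors(vecs).mkString(", "))  // 0.5, 1.0
  }
}
```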
22. Pass to H2O and Build GBM Model

val finalRdd = filteredTokenizedRdd.map(row => {
  val label = row._1
  val tokens = row._2
  // Compute vector for given list of word tokens; unknown words are ignored
  val vec = wordsToVector(tokens, w2vModel)
  JobOffer(label, vec)  // single-row representation
})

// fv: vector representing the job title
case class JobOffer(category: String, fv: mllib.linalg.Vector)

// Publish the Spark DataFrame as an H2OFrame
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(finalRdd.toDF)

// Build the GBM model
val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50)
25. Classify new job title

def classify(jobTitle: String,
             modelId: String,
             w2vModel: Word2VecModel): (String, Array[Double]) = {
  val model = water.DKV.getGet(modelId)
  // Transform the job title into a vector with help of the Word2Vec model
  val tokens = tokenize(jobTitle, STOP_WORDS)
  val vec = wordsToVector(tokens, w2vModel)
  val modelOutput = model._output.asInstanceOf[Output]
  val nclasses = modelOutput.nclasses()
  val classNames = modelOutput.classNames()
  // Score the vector with the GBM model
  // (row: the vector converted to a scoring row; conversion elided on the slide)
  val pred = model.score(row, new Array[Double](nclasses + 1))
  (classNames(pred(0).asInstanceOf[Int]), pred.slice(1, pred.length))
}
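The last line of classify maps H2O's raw prediction array (predicted class index at position 0, per-class probabilities after it) back to a label. The decoding step can be sketched in plain Scala; the category names below are illustrative, not the app's actual classes:

```scala
// Decode an H2O-style prediction array: element 0 is the predicted class
// index, the remaining elements are the per-class probabilities.
object DecodePredictionSketch {
  def decode(pred: Array[Double],
             classNames: Array[String]): (String, Array[Double]) =
    (classNames(pred(0).toInt), pred.slice(1, pred.length))

  def main(args: Array[String]): Unit = {
    val classNames = Array("education", "labor", "accounting")  // illustrative
    val pred = Array(0.0, 0.7, 0.2, 0.1)  // class index 0 predicted
    val (label, probs) = decode(pred, classNames)
    println(label)                 // education
    println(probs.mkString(", "))  // 0.7, 0.2, 0.1
  }
}
```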