Michal Malohlava's Sparkling Water Applications Meetup on 07.21.15, focusing on the Ask Craig use case.
http://h2o.ai/blog/2015/06/ask-craig-sparkling-water/
8. Apache Spark: open-source distributed execution platform
User-friendly API for data transformation based on RDDs
Platform components: SQL, MLlib, text mining
Multitenancy
Large and active community
9. H2O: open-source scalable machine learning platform
Tuned for efficient computation and memory use
Production-ready machine learning algorithms
R, Python, Java, Scala APIs
Interactive UI, robust data parser
11. Sparkling Water
Provides:
Transparent integration of H2O with the Spark ecosystem
Transparent use of H2O data structures and algorithms with the Spark API
Excels in existing Spark workflows requiring advanced machine learning algorithms
A platform for building Smarter Applications
13. Data Distribution
[Diagram: a Sparkling Water cluster of Spark Executor JVMs, each also hosting an H2O instance. Data is read from a source (e.g. HDFS) into a Spark RDD and an H2O RDD; RDDs and DataFrames share the same memory space, with toRDD and toH2OFrame converting between the Spark and H2O representations.]
16. ML Workflow
1. Perform feature extraction on words + munging
2. Run the Word2Vec algorithm (MLlib) on job-title words
3. Create "title vectors" from the individual word vectors of each job title
4. Pass the Spark RDD to an H2O RDD for ML in Flow
5. Run the H2O GBM algorithm on the H2O RDD
6. Create a Spark Streaming application + score new job titles
18. App Skeleton
// Sparkling environment: the injected SparkContext, SQLContext, and H2OContext
// Required capabilities: buildModels and classify
class CraigslistJobTitlesApp(jobsFile: String = "…")
                            (@transient override val sc: SparkContext,
                             @transient override val sqlContext: SQLContext,
                             @transient override val h2oContext: H2OContext)
  extends SparklingWaterApp
  with SparkContextSupport
  with GBMSupport
  with ModelMetricsSupport
  with Serializable {

  def buildModels(datafile: String,
                  modelName: String): (Model[_, _, _], Word2VecModel)

  def classify(jobTitle: String,
               modelId: String,
               w2vModel: Word2VecModel): (String, Array[Double])
}
19. Data: text munging
Example: "Site Supervisor and Pre K Teachers Needed Now!!!"
Post tokenization: Seq(site, supervisor, pre, teachers, needed)
val tokens = jobTitles.map(line => tokenize(line))
Next: apply Spark's Word2Vec model to each word
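The tokenization step can be sketched in plain Scala. This is a minimal illustration under assumptions: the stop-word list here is a toy subset, and the exact filtering rules of the app's real `tokenize` helper are not shown on the slide.

```scala
// Minimal tokenization sketch: lowercase, split on non-word characters,
// drop stop words and single-letter fragments (e.g. the "K" in "Pre K").
// STOP_WORDS is a toy subset, not the app's real list.
object TokenizeSketch {
  val STOP_WORDS = Set("and", "the", "a", "an", "now")

  def tokenize(line: String): Seq[String] =
    line.toLowerCase
      .split("\\W+")             // split on non-word characters
      .filter(_.nonEmpty)
      .filterNot(STOP_WORDS)     // drop stop words
      .filter(_.length > 1)      // drop single-letter fragments
      .toSeq

  def main(args: Array[String]): Unit = {
    val title = "Site Supervisor and Pre K Teachers Needed Now!!!"
    println(tokenize(title).mkString(", "))
    // site, supervisor, pre, teachers, needed
  }
}
```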
20. Data: Word2Vec model
Simply: a mathematical way to represent a single word as a vector of numbers. These vector 'representations' encode information about a given word (i.e. its meaning).
Post tokenization: Seq(site, supervisor, pre, teachers, needed)
Post Word2Vec results:
needed, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
site, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
supervisor, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
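The "meaning" these vectors encode shows up as geometric closeness: related words end up with similar vectors. A toy illustration in plain Scala, with hand-made 3-dimensional vectors (real Word2Vec vectors are learned and have far more dimensions):

```scala
// Toy illustration of word vectors compared by cosine similarity.
// The vectors and words below are made up for the example.
object WordVecSketch {
  val vecs: Map[String, Array[Double]] = Map(
    "teacher"    -> Array(0.9, 0.1, 0.2),
    "supervisor" -> Array(0.8, 0.2, 0.3),
    "hdfs"       -> Array(0.1, 0.9, 0.7)
  )

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    // "teacher" should be closer to "supervisor" than to "hdfs"
    println(cosine(vecs("teacher"), vecs("supervisor")))
    println(cosine(vecs("teacher"), vecs("hdfs")))
  }
}
```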
21. Data: job title vectors
In steps:
1. Sum the Word2Vec vectors of the words in a given title
2. Divide this sum by the number of words in the title
Result: ~ average vector for a given title of N words

  needed,     mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
+ site,       mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
+ supervisor, mllib.linalg.Vector[0.456, 0.123, 0.678, …, 0.987]
÷ total words (post tokenization)
~ (site supervisor … needed), [0.998, 0.349, 0.621, …, 0.915]
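The two steps above amount to an element-wise average. A minimal sketch over plain arrays (the real code operates on mllib.linalg.Vector; the helper name is an assumption):

```scala
// Average a list of word vectors element-wise to get one "title vector".
object TitleVectorSketch {
  def averageVectors(wordVecs: Seq[Array[Double]]): Array[Double] = {
    require(wordVecs.nonEmpty, "title must contain at least one known word")
    val dim = wordVecs.head.length
    // 1. Sum the vectors component-wise
    val sum = wordVecs.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
      acc.zip(v).map { case (a, b) => a + b }
    }
    // 2. Divide by the number of words
    sum.map(_ / wordVecs.length)
  }

  def main(args: Array[String]): Unit = {
    val vecs = Seq(Array(0.25, 0.5), Array(0.75, 1.5))
    println(averageVectors(vecs).mkString(", "))  // 0.5, 1.0
  }
}
```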
22. Pass to H2O and Build GBM Model

val finalRdd = filteredTokenizedRdd.map(row => {
  val label = row._1
  val tokens = row._2
  // Compute vector for given list of word tokens; unknown words are ignored
  val vec = wordsToVector(tokens, w2vModel)
  JobOffer(label, vec)  // single-row representation
})

// fv: vector representing the job title
case class JobOffer(category: String, fv: mllib.linalg.Vector)

// Publish the Spark DataFrame as an H2OFrame
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(finalRdd.toDF)

// Build the GBM model
val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50)
25. Classify new job title

def classify(jobTitle: String,
             modelId: String,
             w2vModel: Word2VecModel): (String, Array[Double]) = {
  val model = water.DKV.getGet(modelId)
  // Transform the job title into a vector with help of the Word2Vec model
  val tokens = tokenize(jobTitle, STOP_WORDS)
  val vec = wordsToVector(tokens, w2vModel)
  val modelOutput = model._output.asInstanceOf[Output]
  val nclasses = modelOutput.nclasses()
  val classNames = modelOutput.classNames()
  // Score the vector with the GBM model
  // (row: the vector converted to a scoring row; conversion elided on the slide)
  val pred = model.score(row, new Array[Double](nclasses + 1))
  (classNames(pred(0).asInstanceOf[Int]), pred.slice(1, pred.length))
}
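The last line of classify maps H2O's raw prediction array (predicted class index at position 0, per-class probabilities after it) back to a label. The decoding step can be sketched in plain Scala; the category names below are illustrative, not the app's actual classes:

```scala
// Decode an H2O-style prediction array: element 0 is the predicted class
// index, the remaining elements are the per-class probabilities.
object DecodePredictionSketch {
  def decode(pred: Array[Double],
             classNames: Array[String]): (String, Array[Double]) =
    (classNames(pred(0).toInt), pred.slice(1, pred.length))

  def main(args: Array[String]): Unit = {
    val classNames = Array("education", "labor", "accounting")  // illustrative
    val pred = Array(0.0, 0.7, 0.2, 0.1)  // class index 0 predicted
    val (label, probs) = decode(pred, classNames)
    println(label)                 // education
    println(probs.mkString(", "))  // 0.7, 0.2, 0.1
  }
}
```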