Building Machine Learning Applications with Sparkling Water
MLConf 2015 NYC
Michal Malohlava and Alex Tellez and H2O.ai
Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Building Machine Learning Applications with Sparkling Water: Writing applications that process and analyze large amounts of data is still hard. It often requires designing and running machine learning experiments at small scale, then consolidating them into an application and running it at large scale. Several distributed machine learning platforms try to reduce this effort. This talk focuses on Sparkling Water, which combines the benefits of two platforms: H2O and Spark. H2O is an open-source distributed math engine that provides a tuned machine learning library; Spark is an execution platform for processing large amounts of data. The talk demonstrates Sparkling Water's features and shows its benefits for building rich and robust machine learning applications.

Published in: Technology

  1. Building Machine Learning Applications with Sparkling Water. MLConf 2015 NYC. Michal Malohlava and Alex Tellez and H2O.ai
  2. Team@H2O.ai: TBD, Head of Sales; Distributed Systems Engineers Making ML Scale!
  3. Scalable Machine Learning For Smarter Applications
  4. Smarter Applications
  5. Scalable Applications
      • Distributed
      • Able to process huge amounts of data from different sources
      • Easy to develop and experiment
      • Powerful machine learning engine inside
  6. BUT how to build them?
  8. Build an application with … ?
  9. …with Spark and H2O
  10. Spark
      • Open-source distributed execution platform
      • User-friendly API for data transformation based on RDD
      • Platform components: SQL, MLlib, text mining
      • Multitenancy
      • Large and active community
  11. H2O
      • Open-source scalable machine learning platform
      • Tuned for efficient computation and memory use
      • Production-ready machine learning algorithms
      • R, Python, Java, Scala APIs
      • Interactive UI, robust data parser
  12. Sparkling Water provides
      • Transparent integration of H2O with the Spark ecosystem
      • Transparent use of H2O data structures and algorithms with the Spark API
      • Platform for building Smarter Applications
      • Excels in existing Spark workflows requiring advanced Machine Learning algorithms
  13. Let's build an application!
  16. OR Detect spam text messages
  17. Data example
  18. ML Workflow
      1. Extract data
      2. Transform, tokenize messages
      3. Build Tf-IDF
      4. Create and evaluate Deep Learning model
      5. Use the model
      Goal: For a given text message, identify whether it is spam or not
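  19. Lego #1: Data load
      import org.apache.spark.rdd.RDD

      // Data load: split each line on tabs and drop rows with an empty first field.
      // sc is the SparkContext (for example, the one provided by the Spark shell).
      def load(dataFile: String): RDD[Array[String]] = {
        sc.textFile(dataFile).map(l => l.split("\t")).filter(r => !r(0).isEmpty)
      }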
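The same parsing step can be tried on a local collection without a cluster. This is only a sketch: `LoadSketch`, `parse`, and any sample rows passed to it are hypothetical, and the extra `headOption` guard (which tolerates a line that splits into no fields at all) is an addition that the slide code does not have.

```scala
object LoadSketch {
  // Split each line on the tab character and keep only rows whose
  // first field (the ham/spam label) is present and non-empty.
  def parse(lines: Seq[String]): Seq[Array[String]] =
    lines.map(_.split("\t")).filter(r => r.headOption.exists(_.nonEmpty))
}
```

For instance, `LoadSketch.parse(Seq("ham\tSee you at 5", "\tno label"))` keeps only the first row.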
  20. Lego #2: Ad-hoc Tokenization
      def tokenize(data: RDD[String]): RDD[Seq[String]] = {
        val ignoredWords = Seq("the", "a", …)
        val ignoredChars = Seq(',', ':', …)

        val texts = data.map( r => {
          // Lowercase the message and blank out ignored characters
          var smsText = r.toLowerCase
          for( c <- ignoredChars) {
            smsText = smsText.replace(c, ' ')
          }

          // Keep distinct words that are longer than 2 characters and not ignored
          val words = smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length > 2).distinct
          words.toSeq
        })
        texts
      }
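To see what this tokenizer does to a single message, here is a minimal local sketch of the same logic applied to one string. The stop lists are illustrative assumptions only (the slide elides the full lists), and `TokenizeSketch` is a hypothetical name.

```scala
object TokenizeSketch {
  // Illustrative stop lists; the lists on the slide are elided.
  val ignoredWords: Set[String] = Set("the", "a")
  val ignoredChars: Seq[Char] = Seq(',', ':', '?')

  def tokenize(text: String): List[String] = {
    // Lowercase, then replace every ignored character with a space.
    val cleaned = ignoredChars.foldLeft(text.toLowerCase)((s, c) => s.replace(c, ' '))
    // Split on spaces, drop ignored and short words, remove duplicates.
    cleaned.split(" ")
      .filter(w => !ignoredWords.contains(w) && w.length > 2)
      .distinct
      .toList
  }
}
```

With these assumed lists, short words such as "in" and "MV" are dropped by the length filter, and punctuation never reaches the token list.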
  23. Lego #3: Tf-IDF
      def buildIDFModel(tokens: RDD[Seq[String]],
                        minDocFreq: Int = 4,
                        hashSpaceSize: Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
        // Hash strings into the given space
        val hashingTF = new HashingTF(hashSpaceSize)
        val tf = hashingTF.transform(tokens)
        // Build term frequency-inverse document frequency
        val idfModel = new IDF(minDocFreq = minDocFreq).fit(tf)
        val expandedText = idfModel.transform(tf)
        (hashingTF, idfModel, expandedText)
      }
      Callouts: Hash words into large space; Term freq scale
      Example: "Thank for the order…" becomes […,0,3.5,0,1,0,0.3,0,1.3,0,0,…], with entries marked for "Thank" and "Order"
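The two callouts (hashing words into a fixed-size space, then scaling term frequencies) can be illustrated with plain Scala collections. This is a sketch under stated assumptions, not MLlib's implementation: the bucket function simply uses `String.hashCode`, the IDF weight uses the smoothed form log((m + 1) / (df + 1)) with buckets below `minDocFreq` set to 0, and `TfIdfSketch` and its methods are hypothetical names.

```scala
object TfIdfSketch {
  // Map a term to a bucket in [0, numFeatures) via a non-negative modulo of its hash.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Term-frequency vector: number of tokens that fall into each bucket.
  def tf(tokens: Seq[String], numFeatures: Int): Array[Double] = {
    val v = new Array[Double](numFeatures)
    tokens.foreach(t => v(bucket(t, numFeatures)) += 1.0)
    v
  }

  // Per-bucket IDF weight over a non-empty set of TF vectors of equal length.
  def idf(docs: Seq[Array[Double]], minDocFreq: Int): Array[Double] = {
    val m = docs.size
    Array.tabulate(docs.head.length) { j =>
      val df = docs.count(d => d(j) > 0.0)
      if (df >= minDocFreq) math.log((m + 1.0) / (df + 1.0)) else 0.0
    }
  }

  // Scale a TF vector element-wise by the IDF weights.
  def tfIdf(tfVec: Array[Double], idfVec: Array[Double]): Array[Double] =
    tfVec.zip(idfVec).map { case (t, w) => t * w }
}
```

A bucket hit in every document gets weight log(1) = 0, which is why very common words contribute nothing to the resulting vector. Distinct words can collide in the same bucket; a larger `hashSpaceSize` makes that less likely.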
  24. Lego #4: Build a model
      def buildDLModel(train: Frame, valid: Frame,
                       epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                       hidden: Array[Int] = Array[Int](200, 200))
                      (implicit h2oContext: H2OContext): DeepLearningModel = {
        import h2oContext._
        // Build a model
        val dlParams = new DeepLearningParameters()
        dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]
        dlParams._train = train
        dlParams._valid = valid
        dlParams._response_column = 'target
        dlParams._epochs = epochs
        dlParams._l1 = l1
        dlParams._l2 = l2
        dlParams._hidden = hidden

        // Create a job
        val dl = new DeepLearning(dlParams)
        val dlModel = dl.trainModel.get

        // Compute metrics on both datasets
        dlModel.score(train).delete()
        dlModel.score(valid).delete()

        dlModel
      }
      Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
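The phrase "an input layer followed by multiple layers of nonlinear transformations" can be made concrete with a tiny forward pass. This is an illustrative sketch only, not H2O's implementation: `FeedForwardSketch` and `Layer` are hypothetical names, tanh is chosen arbitrarily as the nonlinearity, and training (including the l1 and l2 penalties) is left out.

```scala
object FeedForwardSketch {
  // One dense layer: weights(i) holds the input weights of output unit i.
  final case class Layer(weights: Array[Array[Double]], biases: Array[Double])

  // Weighted sum of the inputs plus bias, passed through tanh, for each output unit.
  def dense(x: Array[Double], layer: Layer): Array[Double] =
    layer.weights.zip(layer.biases).map { case (w, b) =>
      math.tanh(w.zip(x).map { case (wi, xi) => wi * xi }.sum + b)
    }

  // Feed the input through all layers in order.
  def forward(x: Array[Double], layers: Seq[Layer]): Array[Double] =
    layers.foldLeft(x)(dense)
}
```

In the slide code, `hidden = Array(200, 200)` would correspond to two such hidden layers with 200 units each.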
  28. Assembly application
      // Data load
      val data = load(DATAFILE)
      // Extract response spam or ham
      val hamSpam = data.map( r => r(0))
      val message = data.map( r => r(1))
      // Tokenize message content
      val tokens = tokenize(message)

      // Build IDF model
      var (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)

      // Merge response with extracted vectors
      val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))
      val table: DataFrame = resultRDD

      // Split table
      val keys = Array[String]("train.hex", "valid.hex")
      val ratios = Array[Double](0.8)
      val frs = split(table, keys, ratios)
      val (train, valid) = (frs(0), frs(1))
      table.delete()

      // Build a model
      val dlModel = buildDLModel(train, valid)
      Callouts: Data munging; Split dataset; Build model
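The split step takes one ratio (0.8) and yields two frames, so the last part is implied as the remainder. A minimal sketch of that idea on a local collection follows. It assumes a deterministic, in-order split, which may differ from how the `split` helper on the slide actually assigns rows; `SplitSketch` is a hypothetical name.

```scala
object SplitSketch {
  // Split rows into ratios.size + 1 consecutive parts; the final part
  // receives whatever remains after the given ratios are taken.
  def split[A](rows: Seq[A], ratios: Seq[Double]): Seq[Seq[A]] = {
    val bounds = ratios.scanLeft(0.0)(_ + _).map(r => math.round(r * rows.size).toInt) :+ rows.size
    bounds.zip(bounds.tail).map { case (from, to) => rows.slice(from, to) }
  }
}
```

With ten rows and `Seq(0.8)`, the first part holds eight rows and the second holds two.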
  29. Data exploration
  31. Model evaluation
      val trainMetrics = binomialMM(dlModel, train)
      val validMetrics = binomialMM(dlModel, valid)
      Callout: Collect model metrics
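The slide does not show which metrics are collected. As a rough illustration of what binomial model metrics typically summarize, the sketch below derives a confusion matrix and a few ratios from scored examples. `BinomialMetricsSketch`, `Confusion`, and any scores passed in are hypothetical and unrelated to the `binomialMM` helper.

```scala
object BinomialMetricsSketch {
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  }

  // scored: (actual label is spam, predicted spam probability)
  def confusion(scored: Seq[(Boolean, Double)], threshold: Double): Confusion = {
    val preds = scored.map { case (actual, p) => (actual, p >= threshold) }
    Confusion(
      tp = preds.count { case (a, p) => a && p },
      fp = preds.count { case (a, p) => !a && p },
      tn = preds.count { case (a, p) => !a && !p },
      fn = preds.count { case (a, p) => a && !p })
  }
}
```

Comparing these numbers on the training and validation sets is one simple way to spot overfitting.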
  35. Spam predictor
      def isSpam(msg: String,
                 dlModel: DeepLearningModel,
                 hashingTF: HashingTF,
                 idfModel: IDFModel,
                 hamThreshold: Double = 0.5): Boolean = {
        val msgRdd = sc.parallelize(Seq(msg))
        val msgVector: SchemaRDD = idfModel.transform(
                                     hashingTF.transform(
                                       tokenize(msgRdd))).map(v => SMS("?", v))
        val msgTable: DataFrame = msgVector
        msgTable.remove(0) // remove first column
        val prediction = dlModel.score(msgTable)
        prediction.vecs()(1).at(0) < hamThreshold
      }
      Callouts: Prepared models; Default decision threshold; Scoring
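The last line of the predictor is easy to misread, so here is the decision rule in isolation. The sketch assumes that column 1 of the prediction frame holds the predicted probability of "ham", which the parameter name `hamThreshold` suggests but the slide does not state. `ThresholdSketch` is a hypothetical name.

```scala
object ThresholdSketch {
  // A message is flagged as spam when its ham probability falls below the threshold.
  def isSpam(hamProbability: Double, hamThreshold: Double = 0.5): Boolean =
    hamProbability < hamThreshold
}
```

Raising `hamThreshold` makes the predictor stricter: more messages are flagged as spam, at the cost of more false alarms.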
  40. Predict spam
      isSpam("Michal, beer tonight in MV?")
      isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")
  41. More info
      • Check out H2O.ai Training Books: http://learn.h2o.ai/
      • Check out H2O.ai Blog: http://h2o.ai/blog/
      • Check out H2O.ai Youtube Channel: https://www.youtube.com/user/0xdata
      • Check out GitHub: https://github.com/h2oai/sparkling-water
      • Meetups: https://meetup.com/
  42. Thank you! Sparkling Water is an open-source ML application platform combining the power of Spark and H2O. Learn more at h2o.ai. Follow us at @h2oai.
