SlideShare a Scribd company logo
1 of 33
Download to read offline
Michal Malohlava, Alex Tellez, and H2O.ai
Building Machine Learning Applications with Sparkling Water Series
07/21/2015 Meetup
Ask Craig
Download
now
Hack later
Spark 1.4
Sparkling
Water 1.4.3
h2o.ai/download
Smarter Applications
Scalable Applications
Distributed
Able to process huge amount of data from
different sources
Easy to develop and experiment
Powerful machine learning engine inside
BUT
how to build
them?
Build an application
with …
?
…with Spark and H2O
Open-source distributed execution platform
User-friendly API for data transformation based on RDD
Platform components - SQL, MLLib, text mining
Multitenancy
Large and active community
Open-source scalable machine
learning platform
Tuned for efficient computation
and memory use
Production ready machine
learning algorithms
R, Python, Java, Scala APIs
Interactive UI, robust data parser
Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and
algorithms with Spark API
Excels in existing Spark workflows
requiring advanced Machine Learning
algorithms
Platform for building
Smarter
Applications
Sparkling Water Design
spark-submit
Spark
Master
JVM
Spark
Worker
JVM
Spark
Worker
JVM
Spark
Worker
JVM
Sparkling Water Cluster
Spark
Executor
JVM
H2O
Spark
Executor
JVM
H2O
Spark
Executor
JVM
H2O
Sparkling
App
implements
Regular Spark application
containing also
Sparkling Water
classes
Data Distribution
H2O
H2O
H2O
Sparkling Water Cluster
Spark Executor JVM
Data
Source
(e.g.
HDFS)
H2O
RDD
Spark Executor JVM
Spark Executor JVM
Spark
RDD
RDDs and DataFrames
share same memory
space
toRDD
toH2OFrame
Lets build
an application !
Task: Predict the job category from
a Craigslist Ad Title
ML Workflow
1. Perform Feature Extraction on Words + Munging
2. Run Word2Vec algo (MLlib) on JobTitle words
3. Create “title vectors” from
individual word vectors for each job title
4. Pass the Spark RDD H2O RDD for ML in Flow
5. Run H2O GBM algorithm on H2O RDD
6. Create Spark Streaming Application +
Score on new job titles
App
Architecture
Posting
job title
Stream
Craigslist jobs
Word2Vec
Model
GBM

Model
Word2Vec
Categorize
a job title
Build models
“It is a labor job”
“HIRING
Painting
CONTRACTORS
NOW!!!”
App Skeleton
class CraigslistJobTitlesApp(jobsFile: String = “…”)

(@transient override val sc: SparkContext,

@transient override val sqlContext: SQLContext,

@transient override val h2oContext: H2OContext)
extends SparklingWaterApp

with SparkContextSupport
with GBMSupport
with ModelMetricsSupport
with Serializable {
def buildModels(datafile: String,
modelName: String): (Model[_,_,_], Word2VecModel)
def classify(jobTitle: String,
modelId: String,
w2vModel: Word2VecModel): (String, Array[Double])
}
Sparkling
environment
Required
capabilities
Data: text munging
Example: “Site Supervisor and Pre K Teachers Needed Now!!!”
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
val tokens = jobTitles.map(line => token(line))
Next: Apply Spark’s Word2Vec model to each word
Data: Word2Vec model
Simply: A mathematical way to represent a single word as a vector of
numbers. These vector ‘representations’ encode information about the
about a given word (i.e. its meaning)
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
Data: job title vectors
In Steps:
1. Sum the word2vec vectors in a given title
2. Divide this sum by # of words in a given title
Result: ~ Average vector for a given title of N words
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]+
+
Divide by Total Words (post tokenization)
~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
Pass to H2O and Build
GBM Model
val finalRdd = filteredTokenizedRdd.map(row => {

val label = row._1

val tokens = row._2

// Compute vector for given list of word tokens, unknown words are ignored

val vec = wordsToVector(tokens, w2vModel)

JobOffer(label, vec)

})
case class JobOffer(category: String, fv: mllib.linalg.Vector)
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(finalRdd.toDF)
Single rowrepresentation
Vector representing job title
Publish Spark DataFrame

as H2OFrame
val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50)
Build GBM model
GBM: 80% accuracy
Algo: Gradient Boosting Machine
#Trees: 50
# Bins: 20
Depth: 5
(ALL DEFAULTVALUES)
~ 20% Error Rate
App
Architecture
Posting
job title
Stream
Craigslist jobs
Word2Vec
Model
GBM

Model
Word2Vec
Categorize
a job title
Build models
“It is a labor job”
“HIRING
Painting
CONTRACTORS
NOW!!!”
Classify new job title
def classify(jobTitle: String,
modelId: String,
w2vModel: Word2VecModel): (String, Array[Double]) = {
val model = water.DKV.getGet(modelId)
val tokens = tokenize(jobTitle, STOP_WORDS)

val vec = wordsToVector(tokens, w2vModel)
val modelOutput = m._output.asInstanceOf[Output]

val nclasses = modelOutput.nclasses()

val classNames = modelOutput.classNames()

val pred = model.score(row, new Array[Double](nclasses + 1))

(classNames(pred(0).asInstanceOf[Int]), pred slice (1, pred.length))
}
Transform
the job title into
a vector with
help of Wor2Vec
model
Score the vector
with GBM model
Almost done…
Streaming part
val ssc = new StreamingContext(sc, Seconds(10))



// Build an initial model

val staticApp = new CraigslistJobTitlesApp()(sc, sqlContext, h2oContext)

val (svModel, w2vModel) = staticApp.buildModels("craigslistJobTitles.csv",
"initialModel")

val modelId = svModel._key.toString



// Start streaming context

val jobTitlesStream = ssc.socketTextStream("localhost", 9999)



// Classify incoming messages

jobTitlesStream.filter(!_.isEmpty)

.map(jobTitle => (jobTitle, staticApp.classify(jobTitle, modelId, w2vModel)))

.map(pred => """ + pred._1 + "" = " + show(pred._2, classNames))

.print()



ssc.start()

ssc.awaitTermination()
Process data
every 10seconds
Create Spark
socket stream
exposed
on port 9999
Define
stream
processing
NetCat for sending
messages to localhost:9999
Application is producing
job
Where is the code?
https://github.com/h2oai/sparkling-water/
blob/master/examples/meetups
Sparkling Water Download
h2o.ai/download
Checkout H2O.ai Training Books
http://learn.h2o.ai/

Checkout H2O.ai Blog
http://h2o.ai/blog/

Checkout H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata

Checkout GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
More info
Learn more at h2o.ai
Follow us at @h2oai
Thank you!
Sparkling Water is
open-source

ML application platform
combining

power of Spark and H2O

More Related Content

What's hot

The Apollo and GraphQL Stack
The Apollo and GraphQL StackThe Apollo and GraphQL Stack
The Apollo and GraphQL StackSashko Stubailo
 
From ActiveRecord to EventSourcing
From ActiveRecord to EventSourcingFrom ActiveRecord to EventSourcing
From ActiveRecord to EventSourcingEmanuele DelBono
 
Kapil Thangavelu - Cloud Custodian
Kapil Thangavelu - Cloud CustodianKapil Thangavelu - Cloud Custodian
Kapil Thangavelu - Cloud CustodianServerlessConf
 
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...Chris Shenton
 
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.govNot Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.govChris Shenton
 
Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ...
 Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ... Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ...
Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ...AWS Chicago
 
Using Google App Engine Python
Using Google App Engine PythonUsing Google App Engine Python
Using Google App Engine PythonAkshay Mathur
 
Cloud Abstraction Libraries: Implementation and Comparison
Cloud Abstraction Libraries: Implementation and ComparisonCloud Abstraction Libraries: Implementation and Comparison
Cloud Abstraction Libraries: Implementation and ComparisonUdit Agarwal
 
Long running aws lambda - Joel Schuweiler, Minneapolis
Long running aws lambda -  Joel Schuweiler, MinneapolisLong running aws lambda -  Joel Schuweiler, Minneapolis
Long running aws lambda - Joel Schuweiler, MinneapolisAWS Chicago
 
Tutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHPTutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHPAndrew Rota
 
I/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したか
I/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したかI/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したか
I/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したかLINE Corporation
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB
 
Observability for developer ( Inny So & Andrew Jones, ThoughtWorks) Kafka Su...
Observability for developer ( Inny So & Andrew Jones, ThoughtWorks)  Kafka Su...Observability for developer ( Inny So & Andrew Jones, ThoughtWorks)  Kafka Su...
Observability for developer ( Inny So & Andrew Jones, ThoughtWorks) Kafka Su...confluent
 
Getting Started with GraphQL && PHP
Getting Started with GraphQL && PHPGetting Started with GraphQL && PHP
Getting Started with GraphQL && PHPAndrew Rota
 
Serverless APIs with JavaScript - Matt Searle - ChocPanda
Serverless APIs with JavaScript - Matt Searle - ChocPandaServerless APIs with JavaScript - Matt Searle - ChocPanda
Serverless APIs with JavaScript - Matt Searle - ChocPandaPaul Dykes
 

What's hot (20)

The Apollo and GraphQL Stack
The Apollo and GraphQL StackThe Apollo and GraphQL Stack
The Apollo and GraphQL Stack
 
Serverless GraphQL with AWS AppSync & AWS Amplify
Serverless GraphQL with AWS AppSync & AWS AmplifyServerless GraphQL with AWS AppSync & AWS Amplify
Serverless GraphQL with AWS AppSync & AWS Amplify
 
Ajax
AjaxAjax
Ajax
 
From ActiveRecord to EventSourcing
From ActiveRecord to EventSourcingFrom ActiveRecord to EventSourcing
From ActiveRecord to EventSourcing
 
Kapil Thangavelu - Cloud Custodian
Kapil Thangavelu - Cloud CustodianKapil Thangavelu - Cloud Custodian
Kapil Thangavelu - Cloud Custodian
 
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
[AWS DC Meetup] Not Your Father’s WebApp: The Cloud-Native Architecture of im...
 
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.govNot Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
 
Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ...
 Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ... Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ...
Server-less solution for moving Millions of Images in Cloud - Brett Sutter, ...
 
iOS Realm.io
iOS Realm.ioiOS Realm.io
iOS Realm.io
 
Using Google App Engine Python
Using Google App Engine PythonUsing Google App Engine Python
Using Google App Engine Python
 
Cloud Abstraction Libraries: Implementation and Comparison
Cloud Abstraction Libraries: Implementation and ComparisonCloud Abstraction Libraries: Implementation and Comparison
Cloud Abstraction Libraries: Implementation and Comparison
 
Long running aws lambda - Joel Schuweiler, Minneapolis
Long running aws lambda -  Joel Schuweiler, MinneapolisLong running aws lambda -  Joel Schuweiler, Minneapolis
Long running aws lambda - Joel Schuweiler, Minneapolis
 
Angular js
Angular jsAngular js
Angular js
 
Tutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHPTutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHP
 
I/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したか
I/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したかI/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したか
I/O intensiveなKafka ConsumerアプリケーションのスループットをLINE Ads Platformではどのように改善したか
 
Introduction to ajax
Introduction to ajaxIntroduction to ajax
Introduction to ajax
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
 
Observability for developer ( Inny So & Andrew Jones, ThoughtWorks) Kafka Su...
Observability for developer ( Inny So & Andrew Jones, ThoughtWorks)  Kafka Su...Observability for developer ( Inny So & Andrew Jones, ThoughtWorks)  Kafka Su...
Observability for developer ( Inny So & Andrew Jones, ThoughtWorks) Kafka Su...
 
Getting Started with GraphQL && PHP
Getting Started with GraphQL && PHPGetting Started with GraphQL && PHP
Getting Started with GraphQL && PHP
 
Serverless APIs with JavaScript - Matt Searle - ChocPanda
Serverless APIs with JavaScript - Matt Searle - ChocPandaServerless APIs with JavaScript - Matt Searle - ChocPanda
Serverless APIs with JavaScript - Matt Searle - ChocPanda
 

Similar to Sparkling Water Applications Meetup 07.21.15

H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...Codemotion
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Rp 6 session 2 naresh bhatia
Rp 6  session 2 naresh bhatiaRp 6  session 2 naresh bhatia
Rp 6 session 2 naresh bhatiasapientindia
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterSri Ambati
 
The Art and Science of Shipping Ember Apps
The Art and Science of Shipping Ember AppsThe Art and Science of Shipping Ember Apps
The Art and Science of Shipping Ember AppsJesse Cravens
 
Smoothing Your Java with DSLs
Smoothing Your Java with DSLsSmoothing Your Java with DSLs
Smoothing Your Java with DSLsintelliyole
 
Rob Tweed :: Ajax and the Impact on Caché and Similar Technologies
Rob Tweed :: Ajax and the Impact on Caché and Similar TechnologiesRob Tweed :: Ajax and the Impact on Caché and Similar Technologies
Rob Tweed :: Ajax and the Impact on Caché and Similar Technologiesgeorge.james
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Databricks
 
From Backbone to Ember and Back(bone) Again
From Backbone to Ember and Back(bone) AgainFrom Backbone to Ember and Back(bone) Again
From Backbone to Ember and Back(bone) Againjonknapp
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020Thodoris Bais
 
Scala in a wild enterprise
Scala in a wild enterpriseScala in a wild enterprise
Scala in a wild enterpriseRafael Bagmanov
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Using the Tooling API to Generate Apex SOAP Web Service Clients
Using the Tooling API to Generate Apex SOAP Web Service ClientsUsing the Tooling API to Generate Apex SOAP Web Service Clients
Using the Tooling API to Generate Apex SOAP Web Service ClientsSalesforce Developers
 
e-suap - client technologies- english version
e-suap - client technologies- english versione-suap - client technologies- english version
e-suap - client technologies- english versionSabino Labarile
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN StackRob Davarnia
 

Similar to Sparkling Water Applications Meetup 07.21.15 (20)

H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Rp 6 session 2 naresh bhatia
Rp 6  session 2 naresh bhatiaRp 6  session 2 naresh bhatia
Rp 6 session 2 naresh bhatia
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling Water
 
The Art and Science of Shipping Ember Apps
The Art and Science of Shipping Ember AppsThe Art and Science of Shipping Ember Apps
The Art and Science of Shipping Ember Apps
 
Smoothing Your Java with DSLs
Smoothing Your Java with DSLsSmoothing Your Java with DSLs
Smoothing Your Java with DSLs
 
Rob Tweed :: Ajax and the Impact on Caché and Similar Technologies
Rob Tweed :: Ajax and the Impact on Caché and Similar TechnologiesRob Tweed :: Ajax and the Impact on Caché and Similar Technologies
Rob Tweed :: Ajax and the Impact on Caché and Similar Technologies
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
From Backbone to Ember and Back(bone) Again
From Backbone to Ember and Back(bone) AgainFrom Backbone to Ember and Back(bone) Again
From Backbone to Ember and Back(bone) Again
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
 
Scala in a wild enterprise
Scala in a wild enterpriseScala in a wild enterprise
Scala in a wild enterprise
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Using the Tooling API to Generate Apex SOAP Web Service Clients
Using the Tooling API to Generate Apex SOAP Web Service ClientsUsing the Tooling API to Generate Apex SOAP Web Service Clients
Using the Tooling API to Generate Apex SOAP Web Service Clients
 
Advanced JavaScript
Advanced JavaScriptAdvanced JavaScript
Advanced JavaScript
 
e-suap - client technologies- english version
e-suap - client technologies- english versione-suap - client technologies- english version
e-suap - client technologies- english version
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN Stack
 
Offline Html5 3days
Offline Html5 3daysOffline Html5 3days
Offline Html5 3days
 

More from Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

More from Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Recently uploaded

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 

Sparkling Water Applications Meetup 07.21.15

  • 1. Michal Malohlava, Alex Tellez, and H2O.ai Building Machine Learning Applications with Sparkling Water Series 07/21/2015 Meetup Ask Craig
  • 4. Scalable Applications Distributed Able to process huge amount of data from different sources Easy to develop and experiment Powerful machine learning engine inside
  • 8. Open-source distributed execution platform User-friendly API for data transformation based on RDD Platform components - SQL, MLLib, text mining Multitenancy Large and active community
  • 9. Open-source scalable machine learning platform Tuned for efficient computation and memory use Production ready machine learning algorithms R, Python, Java, Scala APIs Interactive UI, robust data parser
  • 10.
  • 11. Sparkling Water Provides Transparent integration of H2O with Spark ecosystem Transparent use of H2O data structures and algorithms with Spark API Excels in existing Spark workflows requiring advanced Machine Learning algorithms Platform for building Smarter Applications
  • 12. Sparkling Water Design spark-submit Spark Master JVM Spark Worker JVM Spark Worker JVM Spark Worker JVM Sparkling Water Cluster Spark Executor JVM H2O Spark Executor JVM H2O Spark Executor JVM H2O Sparkling App implements Regular Spark application containing also Sparkling Water classes
  • 13. Data Distribution H2O H2O H2O Sparkling Water Cluster Spark Executor JVM Data Source (e.g. HDFS) H2O RDD Spark Executor JVM Spark Executor JVM Spark RDD RDDs and DataFrames share same memory space toRDD toH2OFrame
  • 15. Task: Predict the job category from a Craigslist Ad Title
  • 16. ML Workflow 1. Perform Feature Extraction on Words + Munging 2. Run Word2Vec algo (MLlib) on JobTitle words 3. Create “title vectors” from individual word vectors for each job title 4. Pass the Spark RDD H2O RDD for ML in Flow 5. Run H2O GBM algorithm on H2O RDD 6. Create Spark Streaming Application + Score on new job titles
  • 17. App Architecture Posting job title Stream Craigslist jobs Word2Vec Model GBM
 Model Word2Vec Categorize a job title Build models “It is a labor job” “HIRING Painting CONTRACTORS NOW!!!”
  • 18. App Skeleton class CraigslistJobTitlesApp(jobsFile: String = “…”)
 (@transient override val sc: SparkContext,
 @transient override val sqlContext: SQLContext,
 @transient override val h2oContext: H2OContext) extends SparklingWaterApp
 with SparkContextSupport with GBMSupport with ModelMetricsSupport with Serializable { def buildModels(datafile: String, modelName: String): (Model[_,_,_], Word2VecModel) def classify(jobTitle: String, modelId: String, w2vModel: Word2VecModel): (String, Array[Double]) } Sparkling environment Required capabilities
  • 19. Data: text munging Example: “Site Supervisor and Pre K Teachers Needed Now!!!” Post Tokenization: Seq(site, supervisor, pre, teachers, needed) val tokens = jobTitles.map(line => token(line)) Next: Apply Spark’s Word2Vec model to each word
  • 20. Data: Word2Vec model Simply: A mathematical way to represent a single word as a vector of numbers. These vector ‘representations’ encode information about the about a given word (i.e. its meaning) Post Tokenization: Seq(site, supervisor, pre, teachers, needed) Post Word2Vec Results: needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
  • 21. Data: job title vectors In Steps: 1. Sum the word2vec vectors in a given title 2. Divide this sum by # of words in a given title Result: ~ Average vector for a given title of N words needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]+ + Divide by Total Words (post tokenization) ~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
  • 22. Pass to H2O and Build GBM Model val finalRdd = filteredTokenizedRdd.map(row => {
 val label = row._1
 val tokens = row._2
 // Compute vector for given list of word tokens, unknown words are ignored
 val vec = wordsToVector(tokens, w2vModel)
 JobOffer(label, vec)
 }) case class JobOffer(category: String, fv: mllib.linalg.Vector) val h2oFrame: H2OFrame = h2oContext.asH2OFrame(finalRdd.toDF) Single rowrepresentation Vector representing job title Publish Spark DataFrame
 as H2OFrame val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50) Build GBM model
  • 23. GBM: 80% accuracy Algo: Gradient Boosting Machine #Trees: 50 # Bins: 20 Depth: 5 (ALL DEFAULTVALUES) ~ 20% Error Rate
  • 24. App Architecture Posting job title Stream Craigslist jobs Word2Vec Model GBM
 Model Word2Vec Categorize a job title Build models “It is a labor job” “HIRING Painting CONTRACTORS NOW!!!”
  • 25. Classify new job title def classify(jobTitle: String, modelId: String, w2vModel: Word2VecModel): (String, Array[Double]) = { val model = water.DKV.getGet(modelId) val tokens = tokenize(jobTitle, STOP_WORDS)
 val vec = wordsToVector(tokens, w2vModel) val modelOutput = m._output.asInstanceOf[Output]
 val nclasses = modelOutput.nclasses()
 val classNames = modelOutput.classNames()
 val pred = model.score(row, new Array[Double](nclasses + 1))
 (classNames(pred(0).asInstanceOf[Int]), pred slice (1, pred.length)) } Transform the job title into a vector with help of Wor2Vec model Score the vector with GBM model
  • 27. Streaming part val ssc = new StreamingContext(sc, Seconds(10))
 
 // Build an initial model
 val staticApp = new CraigslistJobTitlesApp()(sc, sqlContext, h2oContext)
 val (svModel, w2vModel) = staticApp.buildModels("craigslistJobTitles.csv", "initialModel")
 val modelId = svModel._key.toString
 
 // Start streaming context
 val jobTitlesStream = ssc.socketTextStream("localhost", 9999)
 
 // Classify incoming messages
 jobTitlesStream.filter(!_.isEmpty)
 .map(jobTitle => (jobTitle, staticApp.classify(jobTitle, modelId, w2vModel)))
 .map(pred => """ + pred._1 + "" = " + show(pred._2, classNames))
 .print()
 
 ssc.start()
 ssc.awaitTermination() Process data every 10seconds Create Spark socket stream exposed on port 9999 Define stream processing
  • 28.
  • 29. NetCat for sending messages to localhost:9999 Application is producing job
  • 30. Where is the code? https://github.com/h2oai/sparkling-water/ blob/master/examples/meetups
  • 32. Checkout H2O.ai Training Books http://learn.h2o.ai/
 Checkout H2O.ai Blog http://h2o.ai/blog/
 Checkout H2O.ai Youtube Channel https://www.youtube.com/user/0xdata
 Checkout GitHub https://github.com/h2oai/sparkling-water Meetups https://meetup.com/ More info
  • 33. Learn more at h2o.ai Follow us at @h2oai Thank you! Sparkling Water is open-source
 ML application platform combining
 power of Spark and H2O