
Sparkling Water Applications Meetup 07.21.15



Michal Malohlava's Sparkling Water Applications Meetup on 07.21.15, focusing on the Ask Craig use case.
http://h2o.ai/blog/2015/06/ask-craig-sparkling-water/



  1. Michal Malohlava, Alex Tellez, and H2O.ai. Building Machine Learning Applications with Sparkling Water series, 07/21/2015 Meetup: Ask Craig
  2. Download now, hack later: Spark 1.4 and Sparkling Water 1.4.3 from h2o.ai/download
  3. Smarter Applications
  4. Scalable Applications: distributed; able to process huge amounts of data from different sources; easy to develop and experiment with; powerful machine learning engine inside
  5. BUT how to build them?
  6. Build an application with …?
  7. …with Spark and H2O
  8. Spark: open-source distributed execution platform; user-friendly API for data transformation based on RDDs; platform components: SQL, MLlib, text mining; multitenancy; large and active community
  9. H2O: open-source scalable machine learning platform; tuned for efficient computation and memory use; production-ready machine learning algorithms; R, Python, Java, and Scala APIs; interactive UI; robust data parser
  10. Sparkling Water provides: transparent integration of H2O with the Spark ecosystem; transparent use of H2O data structures and algorithms with the Spark API; excels in existing Spark workflows requiring advanced machine learning algorithms; a platform for building Smarter Applications
  11. Sparkling Water Design: a Sparkling application is a regular Spark application that also contains Sparkling Water classes. It is launched via spark-submit to the Spark master JVM, and each Spark executor JVM on the worker nodes also runs an embedded H2O instance, forming the Sparkling Water cluster.
  12. Data Distribution: data is loaded from a source (e.g. HDFS) into a Spark RDD spread across the executor JVMs; the H2O frame lives in the same executor JVMs, so Spark RDDs/DataFrames and H2O RDDs share the same memory space and are converted with toRDD and toH2OFrame.
  13. Let's build an application!
  14. Task: predict the job category from a Craigslist ad title
  15. ML Workflow:
      1. Perform feature extraction on words + munging
      2. Run the Word2Vec algorithm (MLlib) on job-title words
      3. Create "title vectors" from individual word vectors for each job title
      4. Convert the Spark RDD to an H2O RDD for ML in Flow
      5. Run the H2O GBM algorithm on the H2O RDD
      6. Create a Spark Streaming application and score new job titles
  16. App Architecture: Craigslist job postings stream in and are used to build the Word2Vec and GBM models; a posted job title such as "HIRING Painting CONTRACTORS NOW!!!" is then categorized by the two models, e.g. "It is a labor job".
  17. App Skeleton:

      class CraigslistJobTitlesApp(jobsFile: String = "…")
                                  (@transient override val sc: SparkContext,
                                   @transient override val sqlContext: SQLContext,
                                   @transient override val h2oContext: H2OContext)
        extends SparklingWaterApp
        with SparkContextSupport with GBMSupport with ModelMetricsSupport with Serializable {

        def buildModels(datafile: String, modelName: String): (Model[_, _, _], Word2VecModel)

        def classify(jobTitle: String, modelId: String, w2vModel: Word2VecModel): (String, Array[Double])
      }

      The second parameter list carries the Sparkling environment (Spark, SQL, and H2O contexts); the mixed-in traits provide the required capabilities.
  18. Data: text munging
      Example: "Site Supervisor and Pre K Teachers Needed Now!!!"
      Post-tokenization: Seq(site, supervisor, pre, teachers, needed)

      val tokens = jobTitles.map(line => token(line))

      Next: apply Spark's Word2Vec model to each word.
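The app's tokenizer is Scala; as a self-contained illustration, the same idea can be sketched in Python. The stop-word list and the one-letter-token filter below are assumptions for the example, not the app's actual STOP_WORDS set.

```python
import re

# Illustrative stop words only; the app's real STOP_WORDS set differs.
STOP_WORDS = {"and", "now", "the", "a", "for", "to"}

def tokenize(title):
    # Lowercase, keep alphabetic tokens only, drop stop words and 1-letter tokens
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]

print(tokenize("Site Supervisor and Pre K Teachers Needed Now!!!"))
# → ['site', 'supervisor', 'pre', 'teachers', 'needed']
```

This reproduces the slide's example output: punctuation and casing are stripped, and filler words are removed before Word2Vec sees the tokens.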
  19. Data: Word2Vec model
      Simply: a mathematical way to represent a single word as a vector of numbers. These vector "representations" encode information about a given word (i.e. its meaning).
      Post-tokenization: Seq(site, supervisor, pre, teachers, needed)
      Post-Word2Vec results:
      needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
      site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
      supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
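One way to see what "encoding meaning" buys: words used in similar contexts end up with vectors pointing in similar directions, which can be measured with cosine similarity. A toy Python sketch with made-up 3-dimensional vectors (real Word2Vec vectors have far more dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, near 0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, purely for illustration
vec = {
    "supervisor": [0.9, 0.8, 0.1],
    "manager":    [0.85, 0.75, 0.2],
    "painting":   [0.1, 0.2, 0.9],
}

# Related job words point in similar directions
assert cosine(vec["supervisor"], vec["manager"]) > cosine(vec["supervisor"], vec["painting"])
```

This geometric property is exactly what makes averaging word vectors into a title vector (next slide) meaningful.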
  20. Data: job title vectors
      In steps:
      1. Sum the Word2Vec vectors of the words in a given title
      2. Divide this sum by the number of words in the title
      Result: ~ the average vector for a given title of N words
        needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
      + site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
      + supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
      ÷ total words (post-tokenization)
      ~ (site supervisor … needed), [0.998, 0.349, 0.621…….0.915]
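The two steps above amount to an element-wise average of the word vectors. A minimal Python sketch with made-up numbers:

```python
def title_vector(word_vectors):
    # Element-wise sum of the word vectors, divided by the number of words
    n = len(word_vectors)
    return [sum(vals) / n for vals in zip(*word_vectors)]

words = [
    [0.5, 0.25, 1.0],   # e.g. "site"
    [1.0, 0.5,  0.0],   # e.g. "supervisor"
]
print(title_vector(words))  # → [0.75, 0.375, 0.5]
```

Words the Word2Vec model has never seen contribute nothing; the app simply skips them before averaging ("unknown words are ignored" on the next slide).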
  21. Pass to H2O and Build GBM Model:

      case class JobOffer(category: String, fv: mllib.linalg.Vector)  // single-row representation

      val finalRdd = filteredTokenizedRdd.map(row => {
        val label = row._1
        val tokens = row._2
        // Compute vector for given list of word tokens, unknown words are ignored
        val vec = wordsToVector(tokens, w2vModel)  // vector representing the job title
        JobOffer(label, vec)
      })

      // Publish Spark DataFrame as H2OFrame
      val h2oFrame: H2OFrame = h2oContext.asH2OFrame(finalRdd.toDF)

      // Build GBM model
      val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50)
  22. GBM: 80% accuracy
      Algo: Gradient Boosting Machine; #trees: 50; #bins: 20; depth: 5 (all default values); ~20% error rate
  23. App Architecture: Craigslist job postings stream in and are used to build the Word2Vec and GBM models; a posted job title such as "HIRING Painting CONTRACTORS NOW!!!" is then categorized by the two models, e.g. "It is a labor job".
  24. Classify a new job title:

      def classify(jobTitle: String, modelId: String, w2vModel: Word2VecModel): (String, Array[Double]) = {
        val model = water.DKV.getGet(modelId)
        // Transform the job title into a vector with help of the Word2Vec model
        val tokens = tokenize(jobTitle, STOP_WORDS)
        val vec = wordsToVector(tokens, w2vModel)
        val modelOutput = model._output.asInstanceOf[Output]
        val nclasses = modelOutput.nclasses()
        val classNames = modelOutput.classNames()
        // Score the vector with the GBM model
        // (row = the title vector in H2O's row format; the conversion is elided on the slide)
        val pred = model.score(row, new Array[Double](nclasses + 1))
        (classNames(pred(0).asInstanceOf[Int]), pred.slice(1, pred.length))
      }
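The last two lines of the method unpack the scored array: element 0 holds the predicted class index, and the remaining elements hold the per-class probabilities. That final step can be sketched in self-contained Python (class names and numbers below are invented for illustration):

```python
def pick_class(pred, class_names):
    # pred[0] is the predicted class index; pred[1:] are per-class probabilities
    return class_names[int(pred[0])], pred[1:]

class_names = ["accounting", "education", "labor"]
label, probs = pick_class([2.0, 0.1, 0.2, 0.7], class_names)
print(label)  # → labor
```

This mirrors `classNames(pred(0).asInstanceOf[Int])` plus `pred.slice(1, pred.length)` in the Scala code.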
  25. Almost done…
  26. Streaming part:

      // Process data every 10 seconds
      val ssc = new StreamingContext(sc, Seconds(10))

      // Build an initial model
      val staticApp = new CraigslistJobTitlesApp()(sc, sqlContext, h2oContext)
      val (svModel, w2vModel) = staticApp.buildModels("craigslistJobTitles.csv", "initialModel")
      val modelId = svModel._key.toString

      // Create a Spark socket stream exposed on port 9999
      val jobTitlesStream = ssc.socketTextStream("localhost", 9999)

      // Define stream processing: classify incoming messages
      jobTitlesStream.filter(!_.isEmpty)
        .map(jobTitle => (jobTitle, staticApp.classify(jobTitle, modelId, w2vModel)))
        .map(pred => "\"" + pred._1 + "\" = " + show(pred._2, classNames))
        .print()

      // Start the streaming context
      ssc.start()
      ssc.awaitTermination()
  27. NetCat sends messages to localhost:9999; the application produces job categories for them.
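Anything that writes lines to the socket works as a feed for socketTextStream; NetCat is simply the slide's choice. As a self-contained sketch of the mechanism, here a local Python server stands in for the Spark receiver and a client plays the role of NetCat, sending one job title (an ephemeral port replaces 9999 so the demo cannot collide with a running service):

```python
import socket
import threading

def serve(server_sock, received):
    # Accept one connection and record whatever line arrives
    conn, _ = server_sock.accept()
    received.append(conn.recv(1024).decode())
    conn.close()

server = socket.socket()
server.bind(("localhost", 0))   # port 0 = pick any free port
server.listen(1)
port = server.getsockname()[1]

received = []
t = threading.Thread(target=serve, args=(server, received))
t.start()

# The "NetCat" side: connect and send one job title
client = socket.create_connection(("localhost", port))
client.sendall(b"HIRING Painting CONTRACTORS NOW!!!\n")
client.close()
t.join()

print(received[0].strip())  # → HIRING Painting CONTRACTORS NOW!!!
```

In the real setup the receiving end is Spark's socketTextStream, which splits the incoming bytes into lines and hands each one to the classification pipeline.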
  28. Where is the code? https://github.com/h2oai/sparkling-water/blob/master/examples/meetups
  29. Sparkling Water download: h2o.ai/download
  30. More info:
      Check out the H2O.ai training books: http://learn.h2o.ai/
      Check out the H2O.ai blog: http://h2o.ai/blog/
      Check out the H2O.ai YouTube channel: https://www.youtube.com/user/0xdata
      Check out GitHub: https://github.com/h2oai/sparkling-water
      Meetups: https://meetup.com/
  31. Learn more at h2o.ai. Follow us at @h2oai. Thank you! Sparkling Water is an open-source ML application platform combining the power of Spark and H2O.
