2014 09 30_sparkling_water_hands_on


Published on

How Sparkling Water brings fast, scalable machine learning via H2O to Apache Spark.

By Michal Malohlava and
Our 100th Meetup at 0xdata, September 30, 2014
Open Source meets Out Door.

- Powered by the open source machine learning software Contributors welcome at:
- To view videos on H2O open source machine learning software, go to:

Published in: Data & Analytics

  1. 1. @hexadata & @mmalohlava present Sparkling Water “Killer App for Spark”
  2. 2. Spark and H2O Several months ago…
  3. 3. Sparkling Water. Before: Tachyon-based, with unnecessary data duplication. Now: pure H2ORDD, with transparent use of H2O data and algorithms through the Spark API.
  4. 4. Sparkling Water: RDD (immutable world) + DataFrame (mutable world)
  5. 5. Sparkling Water RDD DataFrame
  6. 6. Sparkling Water Design (diagram): Sparkling App jar file; spark-submit; Spark Master JVM; three Spark Worker JVMs; Sparkling Water Cluster of three Spark Executor JVMs, each hosting H2O.
  7. 7. Data Distribution (diagram): Data Source (e.g. HDFS); Sparkling Water Cluster of three Spark Executor JVMs, each hosting H2O; H2O RDD; Spark RDD.
  8. 8. Hands-on Time
  9. 9. Example: load and parse CSV data; use the Spark API and run a SQL query; create a Deep Learning model; use the model for prediction.
  10. 10. Requirements: Linux or Mac OS X; Oracle Java 1.7. A virtual image is provided for Windows users.
  11. 11. Download
  12. 12. Install and Launch: unpack the zip file, or open the provided virtual image in VirtualBox, and launch h2o-examples/sparkling-shell
  13. 13. What is Sparkling Shell? A standard spark-shell that launches the H2O extension:
      export MASTER="local-cluster[3,2,1024]"
      spark-shell --jars shaded.jar --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension
      MASTER is the Spark Master address, --jars names the JAR containing H2O code, and spark.extensions is the name of the H2O extension provided by the JAR.
  14. 14. …more on launching…
      ‣ By default, single JVM, multi-threaded (export MASTER=local[*]), or
      ‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster, or
      ‣ launch a standalone Spark cluster via sbin/ and export MASTER="spark://localhost:7077"
  15. 15. Let's play with Sparkling Shell…
  16. 16. Create H2O Client
      import water.{H2O, H2OClientApp}
      // Start the H2O client inside the shell
      H2OClientApp.start()
      // Wait up to 10000 ms for an H2O cloud of 3 nodes
      H2O.waitForCloudSize(3, 10000)
  17. 17. Is Spark Running? http://localhost:4040
  18. 18. Is H2O running? http://localhost:54321/steam/index.html
  19. 19. Data: load some data and parse it
      import java.io.File
      import org.apache.spark.examples.h2o._
      import org.apache.spark.h2o._
      val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"
      // Create DataFrame - involves parse of data
      val airlinesData = new DataFrame(new File(dataFile))
  20. 20. Where is the data? Go to http://localhost:54321/steam/index.html
  21. 21. Use Spark API
      import org.apache.spark.rdd.RDD
      // H2O Context provides useful implicits for conversions
      val h2oContext = new H2OContext(sc)
      import h2oContext._
      // Create RDD wrapper around DataFrame
      val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
      airlinesTable.count
      // And use Spark RDD API directly
      val flightsOnlyToSF = airlinesTable.filter(f =>
        f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
      flightsOnlyToSF.count
  22. 22. Use Spark SQL
      import org.apache.spark.sql.SQLContext
      // We need to create SQL context
      val sqlContext = new SQLContext(sc)
      import sqlContext._
      // Register the RDD under a table name so the query can refer to it
      airlinesTable.registerTempTable("airlinesTable")
      val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
      // Invoke query
      val result = sql(query) // Using a registered context and tables
      result.count
      assert(result.count == flightsOnlyToSF.count)
  23. 23. Launch H2O Algorithms
      import hex.deeplearning._
      import hex.deeplearning.DeepLearningModel.DeepLearningParameters
      // Setup deep learning parameters
      val dlParams = new DeepLearningParameters()
      dlParams._training_frame = result(
        'Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
        'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
        'Distance, 'IsDepDelayed)
      dlParams.response_column = '
      // Create a new model builder
      val dl = new DeepLearning(dlParams)
      val dlModel = dl.train.get
  24. 24. Make a prediction
      // Use model to score data
      val prediction = dlModel.score(result)('predict)
      // Collect predicted values via RDD API
      val predictionValues = toRDD[DoubleHolder](prediction)
        .collect
        .map(_.result.getOrElse(Double.NaN))
  25. 25. What is under the hood?
  26. 26. Spark App Extension
      /** Notion of Spark application platform extension. */
      trait PlatformExtension extends Serializable {
        /** Method to start extension */
        def start(conf: SparkConf): Unit
        /** Method to stop extension */
        def stop(conf: SparkConf): Unit
        /* Point in Spark infrastructure which will be intercepted by this extension. */
        def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
        /* User-friendly description of extension */
        def desc: String
        override def toString = s"$desc@$intercept"
      }
      /** Supported interception points.
        *
        * Currently only Executor life cycle is supported.
        */
      object InterceptionPoints extends Enumeration {
        type InterceptionPoints = Value
        val EXECUTOR_LC /* Inject into executor lifecycle */ = Value
      }
  27. 27. Using App Extensions
      val conf = new SparkConf()
        .setAppName("Sparkling H2O Example")
      // Setup expected size of H2O cloud
      conf.set("spark.h2o.cluster.size", h2oWorkers)
      // Add H2O extension
      conf.addExtension[H2OPlatformExtension]
      // Create Spark Context from the configuration built above
      val sc = new SparkContext(conf)
  28. 28. Spark Changes: we keep them small (~30 lines of code). JIRA SPARK-3270 - Platform App Extensions
  29. 29. You can participate! Epic PUBDEV-21 aka Sparkling Water: PUBDEV-23 Test HDFS reader; PUBDEV-26 Implement toSchemaRDD; PUBDEV-27 Boolean transfers; PUBDEV-31 Support toRDD[X : Numeric]; PUBDEV-32/33 Mesos/YARN support
  30. 30. More info: check out the 0xdata blog for tutorials; check out the 0xdata YouTube channel; check out GitHub
  31. 31. Thank you! Learn more about H2O at or neo
      for r in h2o-dev perrier; do git clone "$r.git"; done
      Follow us at @hexadata
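
A reading aid for the launch options on slides 13 and 14: in a master string such as local-cluster[3,2,1024], Spark reads the three numbers as the number of workers, the cores per worker, and the memory per worker in MB. That matches the cloud size of 3 that slide 16 waits for. The sketch below only illustrates the string format; MasterUrlSketch, LocalCluster, and parse are hypothetical names and are not part of Spark or Sparkling Water.

```scala
// Hypothetical helper for illustration only; not a Spark or Sparkling Water API.
object MasterUrlSketch {
  /** Shape of the embedded cluster described by a local-cluster master string. */
  final case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int)

  // Matches strings such as "local-cluster[3,2,1024]", allowing spaces around the numbers.
  private val Pattern = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  /** Returns Some(cluster) for a local-cluster master string and None for anything else. */
  def parse(master: String): Option[LocalCluster] = master match {
    case Pattern(w, c, m) => Some(LocalCluster(w.toInt, c.toInt, m.toInt))
    case _                => None
  }
}
```

Other master strings from the slides, such as local[*] or spark://localhost:7077, do not describe an embedded cluster, so parse returns None for them.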
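
The RDD filter on slide 21 and the getOrElse call on slide 24 both treat column values as Option, so missing entries have to be handled explicitly. Below is a minimal sketch of that pattern on plain Scala collections, assuming a simplified stand-in row type. Flight, onlyToSF, and valuesOrNaN are hypothetical names; the real Airlines class and the RDD API are not used here.

```scala
// Illustration on plain Scala collections; no Spark or H2O classes are involved.
object AirlinesSketch {
  // Simplified, hypothetical stand-in for a row: only two columns are modeled.
  final case class Flight(Dest: Option[String], Distance: Option[Double])

  private val targetAirports = Set("SFO", "SJC", "OAK")

  /** Keeps flights whose destination is present and is one of the three airports. */
  def onlyToSF(flights: Seq[Flight]): Seq[Flight] =
    flights.filter(_.Dest.exists(targetAirports.contains))

  /** Turns optional values into doubles, substituting NaN for missing entries. */
  def valuesOrNaN(flights: Seq[Flight]): Seq[Double] =
    flights.map(_.Distance.getOrElse(Double.NaN))
}
```

Using Option.exists keeps rows with a missing destination out of the result without a separate null check, which is the same effect the slide obtains by comparing against Some(...) values.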