@hexadata & @mmalohlava
present
Sparkling Water
“Killer App for Spark”
Spark and H2O 
Several months ago…
Sparkling Water
Before: Tachyon-based, with unnecessary data duplication
Now: a pure H2ORDD, giving transparent use of H2O data and algorithms with the Spark API
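
To make "transparent" concrete, here is a minimal sketch using the H2OContext conversions demonstrated later in this deck (Airlines comes from org.apache.spark.examples.h2o; the file name is illustrative):

val h2oContext = new H2OContext(sc)
import h2oContext._
// H2O parses the file into a (mutable) DataFrame...
val frame = new DataFrame(new File("flights.csv"))
// ...which can then be viewed as an ordinary (immutable) Spark RDD
val rdd: RDD[Airlines] = toRDD[Airlines](frame)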
Sparkling Water
RDD (immutable world) + DataFrame (mutable world)
Sparkling Water
[Diagram: conversions between RDD and DataFrame]
Sparkling Water Design
[Diagram: a Sparkling App jar file is submitted via spark-submit to the Spark Master JVM, which drives the Spark Worker JVMs; inside the Sparkling Water Cluster, each Spark Executor JVM hosts an embedded H2O instance]
Data Distribution
[Diagram: within the Sparkling Water Cluster, each Spark Executor JVM hosts an H2O node; data is read from a data source (e.g. HDFS) into the H2O RDD, which sits alongside the Spark RDD across the executors]
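
In code, the hands-on below shows the DataFrame-to-RDD direction via toRDD; a sketch of the opposite direction, publishing a Spark RDD as an H2O DataFrame, might look as follows (the conversion name toDataFrame and the parse helper parseAirlines are assumptions, not confirmed by this deck):

// Hypothetical: build a Spark RDD, then hand it to H2O
val rawRdd: RDD[Airlines] =
  sc.textFile("hdfs://namenode/flights.csv").map(parseAirlines) // parseAirlines is hypothetical
val frame: DataFrame = toDataFrame(rawRdd) // assumed reverse conversion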
Hands-on Time
Example
Load & parse CSV data
Use the Spark API, do an SQL query
Create a Deep Learning model
Use the model for prediction
Requirements
Linux or Mac OS X
Oracle Java 1.7
A virtual image is provided for Windows users
Download 
http://0xdata.com/download/
Install and Launch
Unpack the zip file, or open the provided virtual image in VirtualBox,
and launch h2o-examples/sparkling-shell
What is Sparkling Shell?
A standard spark-shell that launches the H2O extension:

# Spark Master address
export MASTER="local-cluster[3,2,1024]"

# --jars: the JAR containing the H2O code
# --conf: the name of the H2O extension provided by the JAR
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension
…more on launching…
‣ By default, a single multi-threaded JVM (export MASTER=local[*]), or
‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster, or
‣ Launch a standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
Let’s play with the Sparkling
shell…
Create H2O Client
import water.{H2O, H2OClientApp}
// Start the H2O client inside the Spark driver
H2OClientApp.start()
// Block until the H2O cloud reaches 3 nodes (timeout 10000 ms)
H2O.waitForCloudSize(3, 10000)
Is Spark Running? 
http://localhost:4040
Is H2O running? 
http://localhost:54321/steam/index.html
Data
Load some data and parse it
import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create DataFrame - involves a parse of the data
val airlinesData = new DataFrame(new File(dataFile))
Where is the data?
Go to http://localhost:54321/steam/index.html
Use Spark API
// H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._
// Create an RDD wrapper around the DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count
// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
flightsOnlyToSF.count
Use Spark SQL
import org.apache.spark.sql.SQLContext
// We need to create an SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable")
val query =
  "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke the query
val result = sql(query) // using the registered context and table
result.count
assert(result.count == flightsOnlyToSF.count)
Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Set up deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
  'FlightNum, 'TailNum, 'CRSElapsedTime,
  'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name
// Create a new model builder
val dl = new DeepLearning(dlParams)
// Launch the training job and block for the resulting model
val dlModel = dl.train.get
Make a prediction
// Use the model to score the data
val prediction = dlModel.score(result)('predict)

// Collect the predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse(Double.NaN))
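
As a quick follow-up (a sketch, not part of the original demo), the collected array can be inspected directly:

// Peek at the first ten predicted values
predictionValues.take(10).foreach(println)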
What is under the hood?
Spark App Extension
/** Notion of a Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start the extension */
  def start(conf: SparkConf): Unit
  /** Method to stop the extension */
  def stop(conf: SparkConf): Unit
  /** Point in the Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /** User-friendly description of the extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only the Executor life cycle is supported. */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC = Value // inject into the executor lifecycle
}
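
For illustration, a minimal extension implementing the trait above might look as follows (LoggingExtension is hypothetical; the real H2OPlatformExtension ships in the shaded JAR):

import org.apache.spark.SparkConf

// Hypothetical extension, for illustration only
class LoggingExtension extends PlatformExtension {
  def start(conf: SparkConf): Unit =
    println(s"starting extension for ${conf.get("spark.app.name", "unknown")}")
  def stop(conf: SparkConf): Unit =
    println("stopping extension")
  def desc: String = "logging-extension"
}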
Using App Extensions
val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Set up the expected size of the H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add the H2O extension
conf.addExtension[H2OPlatformExtension]

// Create the Spark context
val sc = new SparkContext(conf)
Spark Changes
We keep them small (~30 lines of code)
JIRA SPARK-3270 - Platform App Extensions
https://issues.apache.org/jira/browse/SPARK-3270
You can participate!
Epic PUBDEV-21 aka Sparkling Water
PUBDEV-23 Test HDFS reader
PUBDEV-26 Implement toSchemaRDD
PUBDEV-27 Boolean transfers
PUBDEV-31 Support toRDD[X: Numeric]
PUBDEV-32/33 Mesos/YARN support
More info
Check out the 0xdata blog for tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/perrier
Thank you!
Learn more about H2O at
0xdata.com
or
for r in h2o-dev perrier; do
  git clone "git@github.com:0xdata/$r.git"
done
Follow us at @hexadata
