@hexadata & @mmalohlava
present
Sparkling Water
“Killer App for Spark”
Spark and H2O 
Several months ago…
Sparkling Water
Before: Tachyon-based, with unnecessary data duplication
Now: a pure H2ORDD, giving transparent use of H2O data and algorithms with the Spark API
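
To make "transparent" concrete, here is a minimal sketch using the H2OContext conversions demonstrated later in this deck (Airlines comes from org.apache.spark.examples.h2o; the file name is illustrative):

val h2oContext = new H2OContext(sc)
import h2oContext._
// H2O parses the file into a (mutable) DataFrame...
val frame = new DataFrame(new File("flights.csv"))
// ...which can then be viewed as an ordinary (immutable) Spark RDD
val rdd: RDD[Airlines] = toRDD[Airlines](frame)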
Sparkling Water
RDD (immutable world) + DataFrame (mutable world)
Sparkling Water
[Diagram: conversions between RDD and DataFrame]
Sparkling Water Design
[Diagram: a Sparkling App jar file is submitted via spark-submit to the Spark Master JVM, which drives the Spark Worker JVMs; inside the Sparkling Water Cluster, each Spark Executor JVM hosts an embedded H2O instance]
Data Distribution
[Diagram: within the Sparkling Water Cluster, each Spark Executor JVM hosts an H2O node; data is read from a data source (e.g. HDFS) into the H2O RDD, which sits alongside the Spark RDD across the executors]
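
In code, the hands-on below shows the DataFrame-to-RDD direction via toRDD; a sketch of the opposite direction, publishing a Spark RDD as an H2O DataFrame, might look as follows (the conversion name toDataFrame and the parse helper parseAirlines are assumptions, not confirmed by this deck):

// Hypothetical: build a Spark RDD, then hand it to H2O
val rawRdd: RDD[Airlines] =
  sc.textFile("hdfs://namenode/flights.csv").map(parseAirlines) // parseAirlines is hypothetical
val frame: DataFrame = toDataFrame(rawRdd) // assumed reverse conversion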
Hands-on Time
Example
Load & parse CSV data
Use the Spark API, do an SQL query
Create a Deep Learning model
Use the model for prediction
Requirements
Linux or Mac OS X
Oracle Java 1.7
A virtual image is provided for Windows users
Download 
http://0xdata.com/download/
Install and Launch
Unpack the zip file, or open the provided virtual image in VirtualBox,
and launch h2o-examples/sparkling-shell
What is Sparkling Shell?
A standard spark-shell that launches the H2O extension:

# Spark Master address
export MASTER="local-cluster[3,2,1024]"

# --jars: the JAR containing the H2O code
# --conf: the name of the H2O extension provided by the JAR
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension
…more on launching…
‣ By default, a single multi-threaded JVM (export MASTER=local[*]), or
‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster, or
‣ Launch a standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
Let’s play with the Sparkling
shell…
Create H2O Client
import water.{H2O, H2OClientApp}
// Start the H2O client inside the Spark driver
H2OClientApp.start()
// Block until the H2O cloud reaches 3 nodes (timeout 10000 ms)
H2O.waitForCloudSize(3, 10000)
Is Spark Running? 
http://localhost:4040
Is H2O running? 
http://localhost:54321/steam/index.html
Data
Load some data and parse it
import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create DataFrame - involves a parse of the data
val airlinesData = new DataFrame(new File(dataFile))
Where is the data?
Go to http://localhost:54321/steam/index.html
Use Spark API
// H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._
// Create an RDD wrapper around the DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count
// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
flightsOnlyToSF.count
Use Spark SQL
import org.apache.spark.sql.SQLContext
// We need to create an SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable")
val query =
  "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke the query
val result = sql(query) // using the registered context and table
result.count
assert(result.count == flightsOnlyToSF.count)
Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Set up deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
  'FlightNum, 'TailNum, 'CRSElapsedTime,
  'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name
// Create a new model builder
val dl = new DeepLearning(dlParams)
// Launch the training job and block for the resulting model
val dlModel = dl.train.get
Make a prediction
// Use the model to score the data
val prediction = dlModel.score(result)('predict)

// Collect the predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse(Double.NaN))
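
As a quick follow-up (a sketch, not part of the original demo), the collected array can be inspected directly:

// Peek at the first ten predicted values
predictionValues.take(10).foreach(println)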
What is under the hood?
Spark App Extension
/** Notion of a Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start the extension */
  def start(conf: SparkConf): Unit
  /** Method to stop the extension */
  def stop(conf: SparkConf): Unit
  /** Point in the Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /** User-friendly description of the extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only the Executor life cycle is supported. */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC = Value // inject into the executor lifecycle
}
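
For illustration, a minimal extension implementing the trait above might look as follows (LoggingExtension is hypothetical; the real H2OPlatformExtension ships in the shaded JAR):

import org.apache.spark.SparkConf

// Hypothetical extension, for illustration only
class LoggingExtension extends PlatformExtension {
  def start(conf: SparkConf): Unit =
    println(s"starting extension for ${conf.get("spark.app.name", "unknown")}")
  def stop(conf: SparkConf): Unit =
    println("stopping extension")
  def desc: String = "logging-extension"
}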
Using App Extensions
val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Set up the expected size of the H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add the H2O extension
conf.addExtension[H2OPlatformExtension]

// Create the Spark context
val sc = new SparkContext(conf)
Spark Changes
We keep them small (~30 lines of code)
JIRA SPARK-3270 - Platform App Extensions
https://issues.apache.org/jira/browse/SPARK-3270
You can participate!
Epic PUBDEV-21 aka Sparkling Water
PUBDEV-23 Test HDFS reader
PUBDEV-26 Implement toSchemaRDD
PUBDEV-27 Boolean transfers
PUBDEV-31 Support toRDD[X: Numeric]
PUBDEV-32/33 Mesos/YARN support
More info
Check out the 0xdata blog for tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/perrier
Thank you!
Learn more about H2O at
0xdata.com
or
for r in h2o-dev perrier; do
  git clone "git@github.com:0xdata/$r.git"
done
Follow us at @hexadata
