How Sparkling Water Brings Fast, Scalable Machine Learning via H2O to Apache Spark
By Michal Malohlava and H2O.ai
Our 100th Meetup at 0xdata, September 30, 2014
Open Source Meets Outdoors.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
12. Install and Launch
Unpack the zip file
or
open the provided virtual image in VirtualBox,
and
launch h2o-examples/sparkling-shell
13. What is Sparkling Shell?
A standard spark-shell that launches the H2O extension:

export MASTER="local-cluster[3,2,1024]"
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension

‣ MASTER: the Spark master address
‣ --jars shaded.jar: the JAR containing the H2O code
‣ --conf spark.extensions=...: the name of the H2O extension provided by the JAR
14. …more on launching…
‣ By default a single multi-threaded JVM (export MASTER=local[*]), or
‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster, or
‣ launch a standalone Spark cluster via sbin/launch-spark-cloud.sh
and export MASTER="spark://localhost:7077"
19. Data
Load some data and parse it:

import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._

val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create a DataFrame - this involves parsing the data
val airlinesData = new DataFrame(new File(dataFile))
20. Where is the data?
Go to http://localhost:54321/steam/index.html
21. Use Spark API
// H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create an RDD wrapper around the DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK") )
flightsOnlyToSF.count
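The filter above works because the fields of the Airlines row class are Options. As a plain-collection sketch of the same predicate that runs outside Spark (Flight here is a made-up stand-in for the generated Airlines class):

```scala
// Stand-in for the Airlines row class; Dest is an Option, as in the session above.
case class Flight(Dest: Option[String])

val flights = Seq(Flight(Some("SFO")), Flight(Some("LAX")),
                  Flight(None), Flight(Some("OAK")))

// Option.exists avoids spelling out f.Dest == Some("SFO") per airport
val bayArea = Set("SFO", "SJC", "OAK")
val toSF = flights.filter(f => f.Dest.exists(bayArea.contains))

println(toSF.size) // 2
```

f.Dest.exists(bayArea.contains) is equivalent to the chain of == Some(...) checks but scales to any number of airports.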
22. Use Spark SQL
import org.apache.spark.sql.SQLContext

// We need to create an SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._

airlinesTable.registerTempTable("airlinesTable")

val query =
  "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"

// Invoke the query using the registered context and tables
val result = sql(query)
result.count
assert(result.count == flightsOnlyToSF.count)
23. Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Set up deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
                                  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
                                  'FlightNum, 'TailNum, 'CRSElapsedTime,
                                  'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams._response_column = 'IsDepDelayed.name

// Create a new model builder and train the model
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
24. Make a prediction
// Use the model to score data
val prediction = dlModel.score(result)('predict)

// Collect the predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse(Double.NaN))
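Once the predictions are collected as plain Doubles, downstream work is ordinary Scala. A hypothetical follow-up, comparing predictions against known labels (both sequences below are illustrative values, not output of the model above):

```scala
// Hypothetical: compare collected predictions with known labels.
// In the session, predictionValues comes from toRDD[DoubleHolder](prediction).
val predictionValues = Seq(1.0, 0.0, 1.0, 1.0)
val actual           = Seq(1.0, 0.0, 0.0, 1.0)

// Fraction of positions where prediction and label agree
val accuracy = predictionValues.zip(actual)
  .count { case (p, a) => p == a }
  .toDouble / actual.size

println(f"accuracy = $accuracy%.2f") // accuracy = 0.75
```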
26. Spark App Extension
/** Notion of a Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start the extension */
  def start(conf: SparkConf): Unit
  /** Method to stop the extension */
  def stop(conf: SparkConf): Unit
  /** Point in the Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /** User-friendly description of the extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}
/** Supported interception points.
  *
  * Currently only the executor life cycle is supported. */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC = Value // inject into executor lifecycle
}
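As a sketch of how a concrete extension plugs into this trait, the snippet below re-declares the trait against a minimal stand-in for SparkConf so it runs outside Spark. LoggingExtension is purely illustrative; the real extension is H2OPlatformExtension from the shaded JAR.

```scala
// Minimal stand-in for SparkConf so this sketch runs outside Spark.
case class Conf(settings: Map[String, String] = Map.empty) {
  def get(key: String, default: String = ""): String =
    settings.getOrElse(key, default)
}

object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC = Value // inject into executor lifecycle
}
import InterceptionPoints._

trait PlatformExtension extends Serializable {
  def start(conf: Conf): Unit
  def stop(conf: Conf): Unit
  def intercept: InterceptionPoints = EXECUTOR_LC
  def desc: String
  override def toString = s"$desc@$intercept"
}

// Illustrative extension; a real one (e.g. H2OPlatformExtension) would start
// an H2O node inside each executor here instead of printing.
class LoggingExtension extends PlatformExtension {
  def start(conf: Conf): Unit =
    println(s"starting for app '${conf.get("spark.app.name")}'")
  def stop(conf: Conf): Unit = println("stopping")
  def desc = "logging-extension"
}

val ext = new LoggingExtension
println(ext) // logging-extension@EXECUTOR_LC
ext.start(Conf(Map("spark.app.name" -> "Sparkling H2O Example")))
ext.stop(Conf())
```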
27. Using App Extensions
val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")

// Set up the expected size of the H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add the H2O extension
conf.addExtension[H2OPlatformExtension]

// Create the Spark context
val sc = new SparkContext(conf)
28. Spark Changes
We keep them small (~30 lines of code)
JIRA SPARK-3270 - Platform App Extensions
https://issues.apache.org/jira/browse/SPARK-3270
29. You can participate!
Epic PUBDEV-21, aka Sparkling Water
PUBDEV-23 Test HDFS reader
PUBDEV-26 Implement toSchemaRDD
PUBDEV-27 Boolean transfers
PUBDEV-31 Support toRDD[X: Numeric]
PUBDEV-32/33 Mesos/YARN support
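The toRDD[X: Numeric] in PUBDEV-31 refers to a Scala context bound. As a plain-collection sketch of that pattern (sumAll is illustrative, not a Sparkling Water API):

```scala
// A context bound [X: Numeric] means: works for any X that has an implicit
// Numeric[X] instance - the shape proposed for toRDD[X: Numeric].
def sumAll[X: Numeric](xs: Seq[X]): X = {
  val num = implicitly[Numeric[X]]
  xs.foldLeft(num.zero)(num.plus)
}

println(sumAll(Seq(1, 2, 3)))    // 6
println(sumAll(Seq(1.5, 2.5)))   // 4.0
```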
30. More info
Check out the 0xdata blog for tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/perrier
31. Thank you!
Learn more about H2O at
0xdata.com
or
for r in h2o-dev perrier; do
  git clone "git@github.com:0xdata/$r.git"
done
Follow us at @hexadata