Copyright©2016 NTT corp. All Rights Reserved.	
Hivemall  Meets  XGBoost  in
DataFrame/Spark
2016/9/8
Takeshi  Yamamuro  (maropu)  @  NTT
Who  am  I?
XGBoost is...
• Short for eXtreme Gradient Boosting
  • https://github.com/dmlc/xgboost
• It is...
  • a variant of the gradient boosting machine
  • a tree-based model
  • an open-source tool (Apache2 license)
  • written in C++
  • R/Python/Julia/Java/Scala interfaces provided
  • widely used in Kaggle competitions
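The "gradient boosting machine" idea can be illustrated with a toy sketch: each round fits a small tree (here, a depth-1 stump) to the current residuals and adds it to the ensemble, scaled by a learning rate. This is a didactic simplification, not XGBoost's actual algorithm, and all names below are ours:

```scala
// Toy gradient boosting for 1-D regression under squared loss.
// NOT XGBoost's algorithm: no regularization, no second-order terms.
object ToyBoosting {
  case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x < threshold) left else right
  }

  private def mean(ys: Seq[Double]): Double =
    if (ys.isEmpty) 0.0 else ys.sum / ys.length

  // Fit the stump that minimizes squared error on (x, residual) pairs
  def fitStump(points: Seq[(Double, Double)]): Stump = {
    points.map(_._1).distinct.map { t =>
      val (l, r) = points.partition(_._1 < t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val err = points.map { case (x, y) => val d = y - stump.predict(x); d * d }.sum
      (err, stump)
    }.minBy(_._1)._2
  }

  // Each round fits a stump to the residuals and adds it, scaled by eta
  def train(xs: Seq[Double], ys: Seq[Double], rounds: Int, eta: Double): Double => Double = {
    var stumps = List.empty[Stump]
    var preds = Seq.fill(xs.length)(0.0)
    for (_ <- 0 until rounds) {
      val residuals = ys.zip(preds).map { case (y, p) => y - p }
      val s = fitStump(xs.zip(residuals))
      stumps ::= s
      preds = xs.zip(preds).map { case (x, p) => p + eta * s.predict(x) }
    }
    x => stumps.map(s => eta * s.predict(x)).sum
  }
}
```

With enough rounds the residuals shrink geometrically, so the ensemble converges to the training targets on this toy data.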
Hivemall in DataFrame/Spark
• Most of the Hivemall functions are supported in Spark v1.6 and v2.0
  • the v2.0 support is not released yet
• XGBoost integration is under development
  • distributed/parallel predictions
  • native libraries bundled for major platforms
    • Mac/Linux on x86_64
  • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
5Copyright©2016 NTT corp. All Rights Reserved.	
Spark Quick Examples
• Fetch a binary Spark v2.0.0
  • http://spark.apache.org/downloads.html

$ <SPARK_HOME>/bin/spark-shell

scala> :paste
val textFile = sc.textFile("hoge.txt")
val counts = textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
XGBoost in spark-shell
• Scala interface bundled in the Hivemall jar

$ bunzip2 E2006.train.bz2
$ <SPARK_HOME>/bin/spark-shell \
    --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

scala> import ml.dmlc.xgboost4j.scala._
scala> :paste
// Read training data
val trainData = new DMatrix("E2006.train")
// Define parameters
val paramMap = List(
  "eta" -> 0.1, "max_depth" -> 2, "objective" -> "reg:logistic"
).toMap
// Train the model
val model = XGBoost.train(trainData, paramMap, 2)
// Save the model to a file
model.saveModel("xgboost_models_dir/xgb_0001.model")
Load test data in parallel

$ <SPARK_HOME>/bin/spark-shell \
    --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

// Create a DataFrame for the test data
scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
  .load("E2006.test.bz2")

scala> testDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
Load test data in parallel
• testDf is split across partitions (Partition1 … PartitionN) and loaded in parallel because bzip2 is splittable
• #partitions depends on three parameters
  • spark.default.parallelism: #cores by default
  • spark.sql.files.maxPartitionBytes: 128MB by default
  • spark.sql.files.openCostInBytes: 4MB by default
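How the three parameters combine into a target split size can be sketched as follows. This is a simplified model of Spark 2.0's file-scan planning (it ignores the per-file open cost that Spark adds to the total size); the object and method names are ours, not Spark's:

```scala
// Simplified sketch of Spark 2.0's split-size computation: spread the
// input bytes over the available cores, but never exceed
// maxPartitionBytes and never go below openCostInBytes.
object PartitionSizing {
  def maxSplitBytes(totalBytes: Long,
                    defaultParallelism: Int,
                    maxPartitionBytes: Long = 128L * 1024 * 1024,
                    openCostInBytes: Long = 4L * 1024 * 1024): Long = {
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

For example, a 1GB input on 8 cores yields 128MB splits (the cap), while an 8MB input on 8 cores yields 4MB splits (the open-cost floor) rather than 1MB ones.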
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions

scala> import org.apache.spark.hive.HivemallOps._
scala> :paste
// Load built models from persistent storage
val modelsDf = sqlContext.sparkSession.read.format(xgboost)
  .load("xgboost_models_dir")
// Do predictions in parallel via cross-joins
val predict = modelsDf.join(testDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions
• Broadcast cross-joins expected
  • The size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)

testDf (cross-joined with modelsDf in parallel):
  rowid  label   features
  1      0.392   1:0.3 5:0.1…
  2      0.929   3:0.2…
  3      0.132   2:0.9…
  4      0.3923  5:0.4…
  …

modelsDf:
  model_id        pred_model
  xgb_0001.model  <binary data>
  xgb_0002.model  <binary data>
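If the model DataFrame grows past the 10MB default, the threshold can be raised through the runtime config so the join still broadcasts. A hedged config sketch (the 100MB value is an arbitrary example; the value is in bytes, and -1 disables broadcast joins entirely):

```scala
// Raise the broadcast-join size threshold to 100MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
```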
Do predictions for streaming data
• Structured Streaming in Spark 2.0
  • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
  • alpha component in v2.0

scala> :paste
// Initialize a streaming DataFrame
val testStreamingDf = spark.readStream
  .format("libsvm") // Not supported in v2.0
  …
// Do predictions for streaming data
val predict = modelsDf.join(testStreamingDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Build models in parallel
• One model per partition
  • WIP: Build models with different parameters

scala> :paste
// Set options for XGBoost
val xgbOptions = XGBoostOptions()
  .set("num_round", "10000")
  .set("max_depth", "32,48,64") // Randomly selected by workers
// Set # of models to output
val numModels = 4
// Build models and save them in persistent storage
trainDf.repartition(numModels)
  .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
  .write
  .format(xgboost)
  .save("xgboost_models_dir")
Compile a binary on your platform
• If you hit an UnsatisfiedLinkError, you need to compile a binary by yourself

$ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
$ ls target
hivemall-core-0.4.2-rc.2-with-dependencies.jar
hivemall-core-0.4.2-rc.2.jar
hivemall-mixserv-0.4.2-rc.2-fat.jar
hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
hivemall-nlp-0.4.2-rc.2.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
hivemall-xgboost-0.4.2-rc.2.jar
hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
hivemall-xgboost_0.60-0.4.2-rc.2.jar
Future Work
• Rabit integration for parallel learning
  • http://dmlc.cs.washington.edu/rabit.html
• Python support
• spark.ml interface support
• Bundle more binaries for portability
  • Windows and x86 platforms
• Others?

20160908 hivemall meetup
