Copyright©2016 NTT corp. All Rights Reserved.	
Hivemall  Meets  XGBoost  in
DataFrame/Spark
2016/9/8
Takeshi  Yamamuro  (maropu)  @  NTT
Who  am  I?
XGBoost is...
• Short for eXtreme Gradient Boosting
  • https://github.com/dmlc/xgboost
• It is...
  • a variant of the gradient boosting machine
  • a tree-based model
  • an open-source tool (Apache2 license)
  • written in C++
  • R/Python/Julia/Java/Scala interfaces provided
  • widely used in Kaggle competitions
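The "gradient boosting machine" idea can be illustrated with a toy sketch: each round fits a small tree (here, a depth-1 stump) to the current residuals and adds it to the ensemble, scaled by a learning rate. This is a didactic simplification, not XGBoost's actual algorithm, and all names below are ours:

```scala
// Toy gradient boosting for 1-D regression under squared loss.
// NOT XGBoost's algorithm: no regularization, no second-order terms.
object ToyBoosting {
  case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x < threshold) left else right
  }

  private def mean(ys: Seq[Double]): Double =
    if (ys.isEmpty) 0.0 else ys.sum / ys.length

  // Fit the stump that minimizes squared error on (x, residual) pairs
  def fitStump(points: Seq[(Double, Double)]): Stump = {
    points.map(_._1).distinct.map { t =>
      val (l, r) = points.partition(_._1 < t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val err = points.map { case (x, y) => val d = y - stump.predict(x); d * d }.sum
      (err, stump)
    }.minBy(_._1)._2
  }

  // Each round fits a stump to the residuals and adds it, scaled by eta
  def train(xs: Seq[Double], ys: Seq[Double], rounds: Int, eta: Double): Double => Double = {
    var stumps = List.empty[Stump]
    var preds = Seq.fill(xs.length)(0.0)
    for (_ <- 0 until rounds) {
      val residuals = ys.zip(preds).map { case (y, p) => y - p }
      val s = fitStump(xs.zip(residuals))
      stumps ::= s
      preds = xs.zip(preds).map { case (x, p) => p + eta * s.predict(x) }
    }
    x => stumps.map(s => eta * s.predict(x)).sum
  }
}
```

With enough rounds the residuals shrink geometrically, so the ensemble converges to the training targets on this toy data.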
Hivemall in DataFrame/Spark
• Most of the Hivemall functions are supported in Spark v1.6 and v2.0
  • the v2.0 support is not released yet
• XGBoost integration is under development
  • distributed/parallel predictions
  • native libraries bundled for major platforms
    • Mac/Linux on x86_64
  • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
5Copyright©2016 NTT corp. All Rights Reserved.	
Spark Quick Examples
• Fetch a binary Spark v2.0.0
  • http://spark.apache.org/downloads.html

$ <SPARK_HOME>/bin/spark-shell

scala> :paste
val textFile = sc.textFile("hoge.txt")
val counts = textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
XGBoost in spark-shell
• Scala interface bundled in the Hivemall jar

$ bunzip2 E2006.train.bz2
$ <SPARK_HOME>/bin/spark-shell \
    --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

scala> import ml.dmlc.xgboost4j.scala._
scala> :paste
// Read training data
val trainData = new DMatrix("E2006.train")
// Define parameters
val paramMap = List(
  "eta" -> 0.1, "max_depth" -> 2, "objective" -> "reg:logistic"
).toMap
// Train the model
val model = XGBoost.train(trainData, paramMap, 2)
// Save the model to a file
model.saveModel("xgboost_models_dir/xgb_0001.model")
Load test data in parallel

$ <SPARK_HOME>/bin/spark-shell \
    --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

// Create a DataFrame for the test data
scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
  .load("E2006.test.bz2")

scala> testDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
Load test data in parallel
• testDf is split across partitions (Partition1 … PartitionN) and loaded in parallel because bzip2 is splittable
• #partitions depends on three parameters
  • spark.default.parallelism: #cores by default
  • spark.sql.files.maxPartitionBytes: 128MB by default
  • spark.sql.files.openCostInBytes: 4MB by default
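How the three parameters combine into a target split size can be sketched as follows. This is a simplified model of Spark 2.0's file-scan planning (it ignores the per-file open cost that Spark adds to the total size); the object and method names are ours, not Spark's:

```scala
// Simplified sketch of Spark 2.0's split-size computation: spread the
// input bytes over the available cores, but never exceed
// maxPartitionBytes and never go below openCostInBytes.
object PartitionSizing {
  def maxSplitBytes(totalBytes: Long,
                    defaultParallelism: Int,
                    maxPartitionBytes: Long = 128L * 1024 * 1024,
                    openCostInBytes: Long = 4L * 1024 * 1024): Long = {
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

For example, a 1GB input on 8 cores yields 128MB splits (the cap), while an 8MB input on 8 cores yields 4MB splits (the open-cost floor) rather than 1MB ones.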
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions

scala> import org.apache.spark.hive.HivemallOps._
scala> :paste
// Load built models from persistent storage
val modelsDf = sqlContext.sparkSession.read.format(xgboost)
  .load("xgboost_models_dir")
// Do predictions in parallel via cross-joins
val predict = modelsDf.join(testDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions
• Broadcast cross-joins expected
  • The size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)

testDf (cross-joined with modelsDf in parallel):
  rowid  label   features
  1      0.392   1:0.3 5:0.1…
  2      0.929   3:0.2…
  3      0.132   2:0.9…
  4      0.3923  5:0.4…
  …

modelsDf:
  model_id        pred_model
  xgb_0001.model  <binary data>
  xgb_0002.model  <binary data>
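If the model DataFrame grows past the 10MB default, the threshold can be raised through the runtime config so the join still broadcasts. A hedged config sketch (the 100MB value is an arbitrary example; the value is in bytes, and -1 disables broadcast joins entirely):

```scala
// Raise the broadcast-join size threshold to 100MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
```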
Do predictions for streaming data
• Structured Streaming in Spark 2.0
  • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
  • alpha component in v2.0

scala> :paste
// Initialize a streaming DataFrame
val testStreamingDf = spark.readStream
  .format("libsvm") // Not supported in v2.0
  …
// Do predictions for streaming data
val predict = modelsDf.join(testStreamingDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Build models in parallel
• One model per partition
  • WIP: Build models with different parameters

scala> :paste
// Set options for XGBoost
val xgbOptions = XGBoostOptions()
  .set("num_round", "10000")
  .set("max_depth", "32,48,64") // Randomly selected by workers
// Set # of models to output
val numModels = 4
// Build models and save them in persistent storage
trainDf.repartition(numModels)
  .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
  .write
  .format(xgboost)
  .save("xgboost_models_dir")
Compile a binary on your platform
• If you hit an UnsatisfiedLinkError, you need to compile a binary by yourself

$ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
$ ls target
hivemall-core-0.4.2-rc.2-with-dependencies.jar
hivemall-core-0.4.2-rc.2.jar
hivemall-mixserv-0.4.2-rc.2-fat.jar
hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
hivemall-nlp-0.4.2-rc.2.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
hivemall-xgboost-0.4.2-rc.2.jar
hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
hivemall-xgboost_0.60-0.4.2-rc.2.jar
Future Work
• Rabit integration for parallel learning
  • http://dmlc.cs.washington.edu/rabit.html
• Python support
• spark.ml interface support
• Bundle more binaries for portability
  • Windows and x86 platforms
• Others?

20160908 hivemall meetup
