
20160908 hivemall meetup

A slide for Hivemall Meetup#3

  1. Hivemall Meets XGBoost in DataFrame/Spark
     2016/9/8 Takeshi Yamamuro (maropu) @ NTT
     Copyright©2016 NTT corp. All Rights Reserved.
  2. Who am I?
  3. XGBoost is...
     • short for eXtreme Gradient Boosting
       • https://github.com/dmlc/xgboost
     • a variant of the gradient boosting machine
     • a tree-based model
     • an open-sourced tool (Apache2 license)
       • written in C++
       • R/Python/Julia/Java/Scala interfaces provided
     • widely used in Kaggle competitions
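As a toy illustration of the gradient-boosting idea mentioned above (not XGBoost's actual code): each round fits a weak learner to the residuals of the current ensemble and adds it, scaled by a learning rate. Here the "weak learner" is just the residual mean, purely for illustration.

```scala
object BoostingSketch {
  // Repeatedly fit the residual mean and add it, scaled by eta.
  // With real gradient boosting the weak learner would be a tree.
  def boost(labels: Seq[Double], rounds: Int, eta: Double): Double = {
    var pred = 0.0
    for (_ <- 1 to rounds) {
      val residualMean = labels.map(_ - pred).sum / labels.size
      pred += eta * residualMean
    }
    pred
  }

  def main(args: Array[String]): Unit = {
    // With enough rounds the prediction converges to the label mean.
    println(boost(Seq(1.0, 2.0, 3.0), 200, 0.1))
  }
}
```

The `eta` and round count play the same role as the `"eta"` and `num_round` parameters used on the later slides: smaller steps, more rounds, less overfitting per step.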
  4. Hivemall in DataFrame/Spark
     • Most Hivemall functions are supported in Spark v1.6 and v2.0
       • the v2.0 support is not released yet
     • XGBoost integration is under development
       • distributed/parallel predictions
       • native libraries bundled for major platforms (Mac/Linux on x86_64)
     • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
  5. Spark Quick Examples
     • Fetch a binary Spark v2.0.0
       • http://spark.apache.org/downloads.html

     $ <SPARK_HOME>/bin/spark-shell
     scala> :paste
     val textFile = sc.textFile("hoge.txt")
     val counts = textFile.flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
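For reference, the same word count expressed with plain Scala collections (no Spark) makes the `reduceByKey` semantics explicit: group the `(word, 1)` pairs by word, then sum the counts per key.

```scala
object WordCountSketch {
  // Local-collection equivalent of flatMap + map + reduceByKey.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("a b a", "b c")))
}
```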
  6. Fetch training and test data
     • E2006 tfidf regression dataset
       • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

     $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
     $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
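The downloaded files are in the LIBSVM sparse format: one example per line, `label index:value index:value ...`, with indices ascending. A minimal parser sketch (the object and method names here are illustrative, not Hivemall or Spark APIs):

```scala
object LibSVMSketch {
  // Parse one LIBSVM-format line into a label and a sparse feature map.
  def parseLine(line: String): (Double, Map[Int, Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { t =>
      val Array(idx, value) = t.split(":")
      idx.toInt -> value.toDouble
    }.toMap
    (label, features)
  }

  def main(args: Array[String]): Unit =
    println(parseLine("0.5 1:0.3 5:0.1"))
}
```

Spark's `libsvm` data source (used on a later slide) does this parsing for you and produces `label`/`features` columns.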
  7. XGBoost in spark-shell
     • Scala interface bundled in the Hivemall jar

     $ bunzip2 E2006.train.bz2
     $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
     scala> import ml.dmlc.xgboost4j.scala._
     scala> :paste
     // Read training data
     val trainData = new DMatrix("E2006.train")
     // Define parameters
     val paramMap = List(
       "eta" -> 0.1,
       "max_depth" -> 2,
       "objective" -> "reg:logistic"
     ).toMap
     // Train the model
     val model = XGBoost.train(trainData, paramMap, 2)
     // Save the model to a file
     model.saveModel("xgboost_models_dir/xgb_0001.model")
  8. Load test data in parallel

     $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
     // Create a DataFrame for the test data
     scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
       .load("E2006.test.bz2")
     scala> testDf.printSchema
     root
      |-- label: double (nullable = true)
      |-- features: vector (nullable = true)
  9. Load test data in parallel
     [Figure: testDf split into Partition1, Partition2, ..., PartitionN; the file is loaded in parallel because bzip2 is splittable]
     • #partitions depends on three parameters
       • spark.default.parallelism: #cores by default
       • spark.sql.files.maxPartitionBytes: 128MB by default
       • spark.sql.files.openCostInBytes: 4MB by default
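How those three parameters combine can be sketched as follows. This is a simplified version of Spark 2.0's file-split sizing heuristic (the real planner also packs files into partitions after computing the split size): each file is padded with the open cost, the total is spread over the default parallelism, and the result is clamped between the open cost and the max partition size.

```scala
object SplitSizeSketch {
  // Simplified sketch of Spark 2.0's target split-size computation.
  def maxSplitBytes(fileSizes: Seq[Long],
                    maxPartitionBytes: Long = 128L * 1024 * 1024,
                    openCost: Long = 4L * 1024 * 1024,
                    defaultParallelism: Int = 8): Long = {
    val totalBytes = fileSizes.map(_ + openCost).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCost, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // A single ~1GB file on 8 cores hits the 128MB cap per split.
    println(SplitSizeSketch.maxSplitBytes(Seq(1024L * 1024 * 1024)))
  }
}
```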
  10. Do predictions in parallel
     • XGBoost in DataFrame
       • Load built models and do cross-joins for predictions

     scala> import org.apache.spark.hive.HivemallOps._
     scala> :paste
     // Load built models from persistent storage
     val modelsDf = sqlContext.sparkSession.read.format(xgboost)
       .load("xgboost_models_dir")
     // Do prediction in parallel via cross-joins
     val predict = modelsDf.join(testDf)
       .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
       .groupBy("rowid")
       .avg()
  11. Do predictions in parallel
     • XGBoost in DataFrame
       • Load built models and do cross-joins for predictions
     • Broadcast cross-joins expected
       • Size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)

     testDf:
     | rowid | label  | features     |
     | 1     | 0.392  | 1:0.3 5:0.1… |
     | 2     | 0.929  | 3:0.2…       |
     | 3     | 0.132  | 2:0.9…       |
     | 4     | 0.3923 | 5:0.4…       |

     modelsDf (cross-joined with testDf in parallel):
     | model_id       | pred_model    |
     | xgb_0001.model | <binary data> |
     | xgb_0002.model | <binary data> |
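The prediction pattern above, every model scored against every row and the results averaged per row, can be sketched with plain Scala collections. `Model` and `predictAvg` here are hypothetical stand-ins for the broadcast cross-join plus `xgboost_predict`, not Hivemall APIs:

```scala
object EnsembleSketch {
  // Hypothetical per-model scorer; real code would invoke XGBoost.
  type Model = Map[Int, Double] => Double

  // Cross-join every model with every row, then average per rowid,
  // mirroring modelsDf.join(testDf) ... groupBy("rowid").avg().
  def predictAvg(models: Seq[Model],
                 rows: Seq[(Int, Map[Int, Double])]): Map[Int, Double] = {
    val crossed = for {
      model <- models
      (rowid, features) <- rows
    } yield (rowid, model(features))
    crossed.groupBy(_._1).map { case (rowid, preds) =>
      rowid -> preds.map(_._2).sum / preds.size
    }
  }

  def main(args: Array[String]): Unit = {
    val models: Seq[Model] = Seq(_ => 1.0, _ => 3.0) // two dummy models
    val rows = Seq(1 -> Map(1 -> 0.3), 2 -> Map(3 -> 0.2))
    println(predictAvg(models, rows)) // each rowid averages its two scores
  }
}
```

Averaging per-model predictions like this is the usual way to combine an ensemble of independently trained regressors.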
  12. Do predictions for streaming data
     • Structured Streaming in Spark 2.0
       • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
       • alpha component in v2.0

     scala> :paste
     // Initialize a streaming DataFrame
     val testStreamingDf = spark.readStream
       .format("libsvm") // Not supported in v2.0
     …
     // Do prediction for streaming data
     val predict = modelsDf.join(testStreamingDf)
       .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
       .groupBy("rowid")
       .avg()
  13. Build models in parallel
     • One model for a partition
       • WIP: Build models with different parameters

     scala> :paste
     // Set options for XGBoost
     val xgbOptions = XGBoostOptions()
       .set("num_round", "10000")
       .set("max_depth", "32,48,64") // Randomly selected by workers
     // Set # of models to output
     val numModels = 4
     // Build models and save them in persistent storage
     trainDf.repartition(numModels)
       .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
       .write
       .format(xgboost)
       .save("xgboost_models_dir")
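The comma-separated `max_depth` values above are meant to be picked at random per worker, so each partition trains a model with a different depth. Roughly (a sketch of the idea, not Hivemall's code):

```scala
object ParamPickSketch {
  // Pick one value at random from a comma-separated option string,
  // as each worker would when building its own model.
  def pick(optionValue: String, rng: scala.util.Random): Int = {
    val candidates = optionValue.split(",").map(_.trim.toInt)
    candidates(rng.nextInt(candidates.length))
  }

  def main(args: Array[String]): Unit =
    println(pick("32,48,64", new scala.util.Random()))
}
```

Randomizing hyperparameters per partition gives the resulting ensemble some diversity, which the averaged predictions on the earlier slides can exploit.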
  14. Compile a binary on your platform
     • If you get stuck in an UnsatisfiedLinkError, you need to compile a binary by yourself

     $ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
     $ ls target
     hivemall-core-0.4.2-rc.2-with-dependencies.jar
     hivemall-core-0.4.2-rc.2.jar
     hivemall-mixserv-0.4.2-rc.2-fat.jar
     hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
     hivemall-nlp-0.4.2-rc.2.jar
     hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
     hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
     hivemall-xgboost-0.4.2-rc.2.jar
     hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
     hivemall-xgboost_0.60-0.4.2-rc.2.jar
  15. Future Work
     • Rabit integration for parallel learning
       • http://dmlc.cs.washington.edu/rabit.html
     • Python support
     • spark.ml interface support
     • Bundle more binaries for portability
       • Windows and x86 platforms
     • Others?
