Copyright©2016 NTT corp. All Rights Reserved.	
Hivemall Meets XGBoost in DataFrame/Spark
2016/9/8
Takeshi Yamamuro (maropu) @ NTT
Who am I?
XGBoost is...
• Short for eXtreme Gradient Boosting
  • https://github.com/dmlc/xgboost
• It is...
  • a variant of the gradient boosting machine
  • a tree-based model
  • an open-source tool (Apache 2 license)
  • written in C++
  • R/Python/Julia/Java/Scala interfaces provided
  • widely used in Kaggle competitions
Hivemall in DataFrame/Spark
• Most Hivemall functions supported in Spark v1.6 and v2.0
  • the v2.0 support is not released yet
• XGBoost integration under development
  • distributed/parallel predictions
  • native libraries bundled for major platforms
    • Mac/Linux on x86_64
  • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
Spark Quick Examples
• Fetch a binary distribution of Spark v2.0.0
  • http://spark.apache.org/downloads.html
$ <SPARK_HOME>/bin/spark-shell
scala> :paste
// Count the occurrences of each word in a text file
val textFile = sc.textFile("hoge.txt")
val counts = textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
XGBoost in spark-shell
• Scala interface bundled in the Hivemall jar
$ bunzip2 E2006.train.bz2
$ <SPARK_HOME>/bin/spark-shell \
  --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
scala> import ml.dmlc.xgboost4j.scala._
scala> :paste
// Read training data
val trainData = new DMatrix("E2006.train")
// Define parameters
val paramMap = List(
  "eta" -> 0.1, "max_depth" -> 2, "objective" -> "reg:logistic"
).toMap
// Train the model
val model = XGBoost.train(trainData, paramMap, 2)
// Save model to the file
model.saveModel("xgboost_models_dir/xgb_0001.model")
Load test data in parallel
$ <SPARK_HOME>/bin/spark-shell \
  --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
// Create DataFrame for the test data
scala> val testDf = sqlContext.sparkSession.read.format("libsvm").load("E2006.test.bz2")
scala> testDf.printSchema
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
Load test data in parallel
[Figure: libsvm-format rows of the test data, e.g. "0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…", are distributed over the testDf partitions Partition1, Partition2, Partition3, …, PartitionN]
• Load in parallel because bzip2 is splittable
• #partitions depends on three parameters
  • spark.default.parallelism: #cores by default
  • spark.sql.files.maxPartitionBytes: 128MB by default
  • spark.sql.files.openCostInBytes: 4MB by default
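A rough sketch of how the three parameters above might interact, written in plain Scala without Spark. It assumes the split-size rule of Spark's file-based sources in v2.0, i.e. min(maxPartitionBytes, max(openCostInBytes, totalBytes / defaultParallelism)), where every file is charged an extra openCostInBytes. The names `SplitSizeSketch`, `maxSplitBytes`, and `numSplits` are hypothetical and exist only for this illustration; the real partition count can be lower because Spark packs small splits together.

```scala
object SplitSizeSketch {
  // Defaults taken from the bullet list above
  val MaxPartitionBytes: Long = 128L * 1024 * 1024
  val OpenCostInBytes: Long = 4L * 1024 * 1024

  // Assumed rule: min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))
  def maxSplitBytes(
      fileSizes: Seq[Long],
      defaultParallelism: Int,
      maxPartitionBytes: Long = MaxPartitionBytes,
      openCostInBytes: Long = OpenCostInBytes): Long = {
    // Each file is charged its length plus the cost of opening it
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Number of splits a single splittable file is cut into (ceiling division)
  def numSplits(fileSize: Long, splitBytes: Long): Long =
    (fileSize + splitBytes - 1) / splitBytes
}
```

Under these assumptions, a single 64MB file read with 8 cores gets a split size of 8.5MB and is therefore cut into 8 splits.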
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions
scala> import org.apache.spark.hive.HivemallOps._
scala> :paste
// Load built models from persistent storage
val modelsDf = sqlContext.sparkSession.read.format(xgboost)
  .load("xgboost_models_dir")
// Do prediction in parallel via cross-joins
val predict = modelsDf.join(testDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions
• Broadcast cross-joins expected
  • Size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)
testDf
rowid  label   features
1      0.392   1:0.3 5:0.1…
2      0.929   3:0.2…
3      0.132   2:0.9…
4      0.3923  5:0.4…
…
modelsDf
model_id        pred_model
xgb_0001.model  <binary data>
xgb_0002.model  <binary data>
[Figure: testDf and modelsDf are cross-joined in parallel]
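The cross-join-then-average pattern above can be sketched with plain Scala collections, with no Spark or XGBoost involved. Everything here is a stand-in: `TestRow`, `Model`, and `averagePredictions` are hypothetical names, and each "model" is just a function from features to a score rather than a loaded XGBoost booster.

```scala
object CrossJoinAverageSketch {
  // Hypothetical stand-ins for the rows of testDf and modelsDf
  final case class TestRow(rowid: Int, features: Map[Int, Double])
  final case class Model(modelId: String, predict: Map[Int, Double] => Double)

  def averagePredictions(models: Seq[Model], rows: Seq[TestRow]): Map[Int, Double] = {
    // Cross-join: pair every model with every test row and score the pair
    val scored = for (m <- models; r <- rows) yield (r.rowid, m.predict(r.features))
    // Group by rowid and average the per-model predictions
    scored.groupBy(_._1).map { case (rowid, preds) =>
      rowid -> preds.map(_._2).sum / preds.size
    }
  }
}
```

Because every (model, row) pair is scored independently, the pairs can be evaluated in any order, which is what lets the real cross-join run in parallel across partitions.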
Do predictions for streaming data
• Structured Streaming in Spark 2.0
  • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
  • alpha component in v2.0
scala> :paste
// Initialize streaming DataFrame
val testStreamingDf = spark.readStream
  .format("libsvm") // Not supported in v2.0
  …
// Do prediction for streaming data
val predict = modelsDf.join(testStreamingDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Build models in parallel
• One model for a partition
  • WIP: Build models with different parameters
scala> :paste
// Set options for XGBoost
val xgbOptions = XGBoostOptions()
  .set("num_round", "10000")
  .set("max_depth", "32,48,64") // Randomly selected by workers
// Set # of models to output
val numModels = 4
// Build models and save them in persistent storage
trainDf.repartition(numModels)
  .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
  .write
  .format(xgboost)
  .save("xgboost_models_dir")
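One way the "randomly selected by workers" comment above could work is sketched below in plain Scala. This is only an assumption about the idea, not Hivemall's implementation: `RandomParamSketch`, `candidates`, and `pickForPartition` are hypothetical names, and the seed is derived from the partition id purely so the sketch is reproducible.

```scala
import scala.util.Random

object RandomParamSketch {
  // Parse a comma-separated candidate list such as "32,48,64"
  def candidates(spec: String): Vector[Int] =
    spec.split(",").map(_.trim.toInt).toVector

  // Each partition (one model per partition) picks one candidate value
  def pickForPartition(spec: String, partitionId: Int, seed: Long = 42L): Int = {
    val cs = candidates(spec)
    cs(new Random(seed + partitionId).nextInt(cs.size))
  }
}
```

With `numModels = 4`, calling `pickForPartition("32,48,64", id)` for ids 0 to 3 would give each of the four models its own `max_depth` drawn from the candidate list.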
Compile a binary on your platform
• If you get stuck with an UnsatisfiedLinkError, you need to compile a binary yourself
$ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
$ ls target
hivemall-core-0.4.2-rc.2-with-dependencies.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
hivemall-core-0.4.2-rc.2.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
hivemall-mixserv-0.4.2-rc.2-fat.jar
hivemall-xgboost-0.4.2-rc.2.jar
hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
hivemall-nlp-0.4.2-rc.2.jar
hivemall-xgboost_0.60-0.4.2-rc.2.jar
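Before rebuilding, it may help to confirm that the failure really is a missing native library for the current platform. The following is a minimal sketch using only standard JVM calls (`System.loadLibrary`, `System.getProperty`); `NativeLibCheckSketch` and its methods are hypothetical helpers, and the library name passed to `canLoad` is up to the caller since the bundled library's name is not stated in these slides.

```scala
object NativeLibCheckSketch {
  // Returns true when the JVM can load the named native library
  def canLoad(libName: String): Boolean =
    try { System.loadLibrary(libName); true }
    catch { case _: UnsatisfiedLinkError => false }

  // OS and CPU architecture, to compare against the bundled platforms
  // (Mac/Linux on x86_64 according to the slides)
  def platform: String =
    s"${System.getProperty("os.name")}/${System.getProperty("os.arch")}"
}
```

If `canLoad` returns false and `platform` is not one of the bundled platforms, compiling the binary with the Maven command above is the likely fix.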
Future Work
• Rabit integration for parallel learning
  • http://dmlc.cs.washington.edu/rabit.html
• Python support
• spark.ml interface support
• Bundle more binaries for portability
  • Windows and x86 platforms
• Others?

More Related Content

What's hot

Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Spark Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
DataWorks Summit/Hadoop Summit
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
Jim Dowling
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Jan Wiegelmann
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Databricks
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
DataWorks Summit
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Yahoo Developer Network
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
Kazuaki Ishizaki
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
Spark Summit
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Summit
 
Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)
Takuya UESHIN
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
Vladimir Starostenkov
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Kim Hammar
 

What's hot (20)

Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
 

Viewers also liked

Sano hmm 20150512
Sano hmm 20150512Sano hmm 20150512
Sano hmm 20150512
Masakazu Sano
 
hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話
eventdotsjp
 
Hivemallmtup 20160908
Hivemallmtup 20160908Hivemallmtup 20160908
Hivemallmtup 20160908
Kazuki Ohmori
 
Hivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupMakoto Yui
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020
Makoto Yui
 
Hivemall meetup vol2 oisix
Hivemall meetup vol2 oisixHivemall meetup vol2 oisix
Hivemall meetup vol2 oisix
Taisuke Fukawa
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービス
Kentaro Yoshida
 
Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04
Masakazu Sano
 

Viewers also liked (8)

Sano hmm 20150512
Sano hmm 20150512Sano hmm 20150512
Sano hmm 20150512
 
hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話
 
Hivemallmtup 20160908
Hivemallmtup 20160908Hivemallmtup 20160908
Hivemallmtup 20160908
 
Hivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetup
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020
 
Hivemall meetup vol2 oisix
Hivemall meetup vol2 oisixHivemall meetup vol2 oisix
Hivemall meetup vol2 oisix
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービス
 
Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04
 

Similar to 20160908 hivemall meetup

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
J On The Beach
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Databricks
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Jason Dai
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
Paris Data Engineers !
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
Tim Ellison
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Canada
 
Graal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllGraal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them All
Thomas Wuerthinger
 
20180417 hivemall meetup#4
20180417 hivemall meetup#420180417 hivemall meetup#4
20180417 hivemall meetup#4
Takeshi Yamamuro
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
Alex Thompson
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data Analytics
Cisco Canada
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Databricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
Tim Ellison
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
Yoni Davidson
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
inside-BigData.com
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
Joe Stein
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 

Similar to 20160908 hivemall meetup (20)

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven Telemetry
 
Graal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllGraal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them All
 
20180417 hivemall meetup#4
20180417 hivemall meetup#420180417 hivemall meetup#4
20180417 hivemall meetup#4
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data Analytics
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 

More from Takeshi Yamamuro

LT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature ExpectationLT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature Expectation
Takeshi Yamamuro
 
Apache Spark + Arrow
Apache Spark + ArrowApache Spark + Arrow
Apache Spark + Arrow
Takeshi Yamamuro
 
Quick Overview of Upcoming Spark 3.0 + α
Quick Overview of Upcoming Spark 3.0 + αQuick Overview of Upcoming Spark 3.0 + α
Quick Overview of Upcoming Spark 3.0 + α
Takeshi Yamamuro
 
MLflowによる機械学習モデルのライフサイクルの管理
MLflowによる機械学習モデルのライフサイクルの管理MLflowによる機械学習モデルのライフサイクルの管理
MLflowによる機械学習モデルのライフサイクルの管理
Takeshi Yamamuro
 
Taming Distributed/Parallel Query Execution Engine of Apache Spark
Taming Distributed/Parallel Query Execution Engine of Apache SparkTaming Distributed/Parallel Query Execution Engine of Apache Spark
Taming Distributed/Parallel Query Execution Engine of Apache Spark
Takeshi Yamamuro
 
LLJVM: LLVM bitcode to JVM bytecode
LLJVM: LLVM bitcode to JVM bytecodeLLJVM: LLVM bitcode to JVM bytecode
LLJVM: LLVM bitcode to JVM bytecode
Takeshi Yamamuro
 
An Experimental Study of Bitmap Compression vs. Inverted List Compression
An Experimental Study of Bitmap Compression vs. Inverted List CompressionAn Experimental Study of Bitmap Compression vs. Inverted List Compression
An Experimental Study of Bitmap Compression vs. Inverted List Compression
Takeshi Yamamuro
 
Sparkのクエリ処理系と周辺の話題
Sparkのクエリ処理系と周辺の話題Sparkのクエリ処理系と周辺の話題
Sparkのクエリ処理系と周辺の話題
Takeshi Yamamuro
 
20150513 legobease
20150513 legobease20150513 legobease
20150513 legobease
Takeshi Yamamuro
 
20150516 icde2015 r19-4
20150516 icde2015 r19-420150516 icde2015 r19-4
20150516 icde2015 r19-4
Takeshi Yamamuro
 
VLDB2013 R1 Emerging Hardware
VLDB2013 R1 Emerging HardwareVLDB2013 R1 Emerging Hardware
VLDB2013 R1 Emerging HardwareTakeshi Yamamuro
 
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4Takeshi Yamamuro
 
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)Takeshi Yamamuro
 
Introduction to Modern Analytical DB
Introduction to Modern Analytical DBIntroduction to Modern Analytical DB
Introduction to Modern Analytical DBTakeshi Yamamuro
 
SIGMOD’12勉強会 -Session 7-
SIGMOD’12勉強会 -Session 7-SIGMOD’12勉強会 -Session 7-
SIGMOD’12勉強会 -Session 7-Takeshi Yamamuro
 
A x86-optimized rank&select dictionary for bit sequences
A x86-optimized rank&select dictionary for bit sequencesA x86-optimized rank&select dictionary for bit sequences
A x86-optimized rank&select dictionary for bit sequencesTakeshi Yamamuro
 
VLDB’11勉強会 -Session 9-
VLDB’11勉強会 -Session 9-VLDB’11勉強会 -Session 9-
VLDB’11勉強会 -Session 9-Takeshi Yamamuro
 
研究動向から考えるx86/x64最適化手法
研究動向から考えるx86/x64最適化手法研究動向から考えるx86/x64最適化手法
研究動向から考えるx86/x64最適化手法Takeshi Yamamuro
 
VLDB'10勉強会 -Session 20-
VLDB'10勉強会 -Session 20-VLDB'10勉強会 -Session 20-
VLDB'10勉強会 -Session 20-Takeshi Yamamuro
 

More from Takeshi Yamamuro (20)

LT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature ExpectationLT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature Expectation
 
Apache Spark + Arrow
Apache Spark + ArrowApache Spark + Arrow
Apache Spark + Arrow
 

20160908 hivemall meetup

  • 1. Hivemall Meets XGBoost in DataFrame/Spark
    2016/9/8
    Takeshi Yamamuro (maropu) @ NTT
    Copyright©2016 NTT corp. All Rights Reserved.
  • 2. Who am I?
  • 3. XGBoost is...
    • Short for eXtreme Gradient Boosting
      • https://github.com/dmlc/xgboost
    • It is...
      • a variant of the gradient boosting machine
      • a tree-based model
      • an open-source tool (Apache 2 license)
      • written in C++
      • R/Python/Julia/Java/Scala interfaces provided
      • widely used in Kaggle competitions
  • 4. Hivemall in DataFrame/Spark
    • Most Hivemall functions are supported in Spark v1.6 and v2.0
      • the v2.0 support is not released yet
    • XGBoost integration is under development
      • distributed/parallel predictions
      • native libraries bundled for major platforms
        • Mac/Linux on x86_64
      • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
  • 5. Spark Quick Examples
    • Fetch a binary Spark v2.0.0
      • http://spark.apache.org/downloads.html
    $ <SPARK_HOME>/bin/spark-shell
    scala> :paste
    val textFile = sc.textFile("hoge.txt")
    val counts = textFile.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
  • 6. Fetch training and test data
    • E2006 tfidf regression dataset
      • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
    $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
    $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
  • 7. XGBoost in spark-shell
    • Scala interface bundled in the Hivemall jar
    $ bunzip2 E2006.train.bz2
    $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
    scala> import ml.dmlc.xgboost4j.scala._
    scala> :paste
    // Read training data
    val trainData = new DMatrix("E2006.train")
    // Define parameters
    val paramMap = List(
      "eta" -> 0.1,
      "max_depth" -> 2,
      "objective" -> "reg:logistic"
    ).toMap
    // Train the model (2 boosting rounds)
    val model = XGBoost.train(trainData, paramMap, 2)
    // Save the model to a file
    model.saveModel("xgboost_models_dir/xgb_0001.model")
  • 8. Load test data in parallel
    $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
    // Create DataFrame for the test data
    scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
      .load("E2006.test.bz2")
    scala> testDf.printSchema
    root
     |-- label: double (nullable = true)
     |-- features: vector (nullable = true)
  • 9. Load test data in parallel
    [Figure: the libsvm-format test data is read into testDf as Partition1, Partition2, Partition3, ..., PartitionN; it is loaded in parallel because bzip2 is splittable]
    • #partitions depends on three parameters
      • spark.default.parallelism: #cores by default
      • spark.sql.files.maxPartitionBytes: 128MB by default
      • spark.sql.files.openCostInBytes: 4MB by default
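As a rough illustration of how the three parameters on slide 9 interact, the sketch below mirrors the split-size rule that Spark 2.0 applies to file-based sources, as far as I understand it. The object and method names are hypothetical, and the split count it returns is only an estimate taken before Spark packs splits into partitions, so the real partition count can be lower.

```scala
// Minimal sketch, assuming the split-size rule of Spark 2.0 file-based sources.
// SplitSizeSketch, maxSplitBytes and estimatedSplits are illustrative names.
object SplitSizeSketch {
  // Target size of one split, derived from the three parameters on the slide.
  def maxSplitBytes(
      fileSizes: Seq[Long],
      defaultParallelism: Int,                       // spark.default.parallelism
      maxPartitionBytes: Long = 128L * 1024 * 1024,  // spark.sql.files.maxPartitionBytes
      openCostInBytes: Long = 4L * 1024 * 1024       // spark.sql.files.openCostInBytes
  ): Long = {
    // Every file is charged an extra "open cost" on top of its size.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Number of splits if every file is cut into chunks of maxSplitBytes.
  def estimatedSplits(fileSizes: Seq[Long], defaultParallelism: Int): Long = {
    val split = maxSplitBytes(fileSizes, defaultParallelism)
    fileSizes.map(size => (size + split - 1) / split).sum
  }
}
```

With the defaults, a single 1 GiB splittable file on 8 cores is capped at 128 MiB per split, which gives 8 splits.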
  • 10. Do predictions in parallel
    • XGBoost in DataFrame
      • Load built models and do cross-joins for predictions
    scala> import org.apache.spark.hive.HivemallOps._
    scala> :paste
    // Load built models from persistent storage
    val modelsDf = sqlContext.sparkSession.read.format(xgboost)
      .load("xgboost_models_dir")
    // Do prediction in parallel via cross-joins
    val predict = modelsDf.join(testDf)
      .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
      .groupBy("rowid")
      .avg()
  • 11. Do predictions in parallel
    • XGBoost in DataFrame
      • Load built models and do cross-joins for predictions
    • Broadcast cross-joins expected
      • Size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)
    testDf
      rowid  label   features
      1      0.392   1:0.3 5:0.1…
      2      0.929   3:0.2…
      3      0.132   2:0.9…
      4      0.3923  5:0.4…
      …
    modelsDf
      model_id        pred_model
      xgb_0001.model  <binary data>
      xgb_0002.model  <binary data>
    [Figure: testDf and modelsDf are cross-joined in parallel]
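The cross-join-then-average pattern from slides 10 and 11 can be sketched with plain Scala collections. This is not Hivemall's implementation: `TestRow`, `Model` and `predictAll` are hypothetical names, and each model is reduced to a function from features to a score so the example runs without Spark or XGBoost.

```scala
// Minimal sketch of "cross-join models with rows, then average per rowid".
// All names here are illustrative stand-ins, not Hivemall or XGBoost APIs.
object CrossJoinPredictSketch {
  final case class TestRow(rowid: Long, features: Map[Int, Double])
  final case class Model(modelId: String, predict: Map[Int, Double] => Double)

  def predictAll(models: Seq[Model], rows: Seq[TestRow]): Map[Long, Double] = {
    // Cross join: every model scores every row.
    val scored = for {
      m <- models
      r <- rows
    } yield (r.rowid, m.predict(r.features))
    // groupBy("rowid").avg(): average the per-model scores for each row.
    scored.groupBy(_._1).map { case (rowid, ps) =>
      rowid -> ps.map(_._2).sum / ps.size
    }
  }
}
```

Because every row is paired with every model, the model side has to be small enough to ship to each worker, which is why the slide expects `modelsDf` to fit under the broadcast threshold.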
  • 12. Do predictions for streaming data
    • Structured Streaming in Spark 2.0
      • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
      • alpha component in v2.0
    scala> :paste
    // Initialize streaming DataFrame
    val testStreamingDf = spark.readStream
      .format("libsvm") // Not supported in v2.0
      …
    // Do prediction for streaming data
    val predict = modelsDf.join(testStreamingDf)
      .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
      .groupBy("rowid")
      .avg()
  • 13. Build models in parallel
    • One model per partition
      • WIP: Build models with different parameters
    scala> :paste
    // Set options for XGBoost
    val xgbOptions = XGBoostOptions()
      .set("num_round", "10000")
      .set("max_depth", "32,48,64") // Randomly selected by workers
    // Set # of models to output
    val numModels = 4
    // Build models and save them in persistent storage
    trainDf.repartition(numModels)
      .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
      .write
      .format(xgboost)
      .save("xgboost_models_dir")
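To make the "one model per partition, parameter picked by the worker" idea on slide 13 concrete, here is a small plain-Scala sketch. It only simulates the assignment step: rows are dealt round-robin as a stand-in for `repartition(numModels)`, and each partition draws one `max_depth` from the comma-separated candidates. `Assignment` and `assign` are hypothetical names, and the seeded random choice is an assumption about how such a selection could work, not Hivemall's actual logic.

```scala
import scala.util.Random

// Minimal sketch: which partition trains with which max_depth.
// Names and the selection strategy are illustrative assumptions.
object PerPartitionModelsSketch {
  final case class Assignment(partition: Int, maxDepth: Int, rowCount: Int)

  def assign(numRows: Int, numModels: Int, maxDepthOption: String, seed: Long): Seq[Assignment] = {
    // Parse a candidate list such as "32,48,64".
    val candidates = maxDepthOption.split(",").map(_.trim.toInt)
    val rng = new Random(seed)
    // Deal rows round-robin into numModels partitions.
    val rowsByPartition = (0 until numRows).groupBy(_ % numModels)
    (0 until numModels).map { p =>
      val depth = candidates(rng.nextInt(candidates.length))
      Assignment(p, depth, rowsByPartition.get(p).map(_.size).getOrElse(0))
    }
  }
}
```

Each `Assignment` corresponds to one model written to `xgboost_models_dir`, so `numModels = 4` yields four models that may differ in depth.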
  • 14. Compile a binary on your platform
    • If you run into an UnsatisfiedLinkError, you need to compile the binary yourself
    $ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
    $ ls target
    hivemall-core-0.4.2-rc.2-with-dependencies.jar
    hivemall-core-0.4.2-rc.2.jar
    hivemall-mixserv-0.4.2-rc.2-fat.jar
    hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
    hivemall-nlp-0.4.2-rc.2.jar
    hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
    hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
    hivemall-xgboost-0.4.2-rc.2.jar
    hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
    hivemall-xgboost_0.60-0.4.2-rc.2.jar
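An UnsatisfiedLinkError is a JVM `Error`, not an `Exception`, so it has to be caught explicitly. The sketch below is one hypothetical way to probe whether a native library can be resolved from `java.library.path` before starting a job. `NativeLibCheck` and `canLoad` are illustrative names, and since bundled libraries are usually extracted from the jar and loaded by absolute path, this check is only an approximation of what Hivemall does.

```scala
// Minimal sketch: report whether the JVM can load a native library by name.
// The library name passed in is up to the caller; none is assumed here.
object NativeLibCheck {
  def canLoad(libName: String): Boolean =
    try {
      System.loadLibrary(libName)
      true
    } catch {
      // Thrown when no matching native binary exists for this platform.
      case _: UnsatisfiedLinkError => false
    }
}
```

If the check returns false for the XGBoost native library on your platform, building with the `-Pcompile-xgboost` profile as shown on the slide is the way forward.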
  • 15. Future Work
    • Rabit integration for parallel learning
      • http://dmlc.cs.washington.edu/rabit.html
    • Python support
    • spark.ml interface support
    • Bundle more binaries for portability
      • Windows and x86 platforms
    • Others?