Copyright©2018 NTT corp. All Rights Reserved.	
An Introduction to Spark v2.3 & Hivemall-on-Spark v0.5.0
Takeshi Yamamuro @ NTT Lab.

About Me
• R&D/OSS engineer
• Ph.D. in CS (Database Systems)
• Love OSS activities
  • Apache Spark
  • Apache Hivemall
  • PostgreSQL
  • ...
• My Active GitHub Projects
  • spark-sql-server
    • Yet Another Spark SQL JDBC/ODBC server based on the PostgreSQL V3 protocol
    • https://github.com/maropu/spark-sql-server
  • lljvm-translator
    • A lightweight library to inject LLVM bitcode into JVMs
    • https://github.com/maropu/lljvm-translator

HIVEMALL ON SPARK v0.5.0

What's Hivemall on Spark?
• Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • Plus some utilities that make Hivemall easy to use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try Hivemall examples easily on your laptop
  • improve the performance of some Hivemall functions in Spark

Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
  • The barrier to adding new algorithms to MLlib is high
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Status of Hivemall-on-Spark v0.5.0
• Supported Spark Versions
  • v2.0, v2.1, and v2.2
  • The upcoming release will support v2.3
• Custom Operations
  • Top-K Join SparkPlan: https://bit.ly/2HnaeG1
  • Utility Functions: https://bit.ly/2qlk8zH
  • ...
• Installation via Spark Packages
  • https://spark-packages.org
  • ./bin/spark-shell --packages apache-hivemall:apache-hivemall:0.5.1-spark2.2

Example) Top-K Join Processing
• Joins only the top-K entries
  • “Vanilla Join + Rank Over” is too slow
[Figure: leftDf (join key x) is joined with rightDf (join key y); the join keeps only the top-K rows that have the highest score values, e.g., f(x, y)]
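
The top-K join idea can be sketched on plain Scala collections. This is only an illustration of the semantics, not the Hivemall SparkPlan: the names `LeftRow`, `RightRow`, and `topKJoin` are hypothetical, and the sketch assumes the top-K rows are chosen per left row.

```scala
// A plain-Scala sketch of top-K join semantics (hypothetical names, not the Hivemall API).
object TopKJoinSketch {
  // Hypothetical row types for illustration only.
  final case class LeftRow(key: Int, x: Double)
  final case class RightRow(key: Int, y: Double)

  /** For each left row, joins the matching right rows and keeps the k highest-scoring pairs. */
  def topKJoin(left: Seq[LeftRow], right: Seq[RightRow], k: Int)(
      score: (Double, Double) => Double): Seq[(LeftRow, RightRow, Double)] = {
    // Index the right side by join key once.
    val rightByKey = right.groupBy(_.key)
    left.flatMap { l =>
      rightByKey.getOrElse(l.key, Seq.empty[RightRow])
        .map(r => (l, r, score(l.x, r.y)))
        .sortBy(t => -t._3) // highest score first
        .take(k)            // keep only the top-K pairs for this left row
    }
  }
}
```

A "vanilla join + rank over" plan would materialize every joined pair before ranking; the sketch only shows which rows survive, not how the Hivemall operator avoids that cost.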

Quick Example
• 1. Download a Spark binary
• 2. Fetch training and test data
• 3. Load the data in Spark
• 4. Build a model
• 5. Do predictions

1. Download a Spark binary
• Download a Spark v2.2.1 binary
  • https://spark.apache.org/downloads.html

2. Fetch training and test data
• E2006 tfidf regression dataset
  • https://bit.ly/2GOC0di
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

3. Load training data in Spark
$ <SPARK_HOME>/bin/spark-shell \
    --packages apache-hivemall:apache-hivemall:0.5.1-spark2.2
scala> import org.apache.spark.sql.hive.HivemallOps._
scala> import org.apache.spark.sql._
scala> :paste
// Creates a DataFrame from the bzip'd libsvm-formatted file
val rawTrainDf = spark.read.format("libsvm").load("E2006.train.bz2")
// Since `label` must be in [0.0, 1.0], rescales it first
val maxmin = rawTrainDf.select(max($"label"), min($"label")).collect.map {
  case Row(max: Double, min: Double) => (max, min)
}.head
val trainDf = rawTrainDf.select(
  // maxmin is (max, min), so _2 is passed as the minimum and _1 as the maximum
  rescale($"label", lit(maxmin._2), lit(maxmin._1)).as("label"),
  $"features")
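
The rescaling step above is min-max normalization. A minimal plain-Scala sketch of the arithmetic follows; the object name `MinMaxSketch` is hypothetical, and the `require` guard for `max == min` is an assumption of this sketch rather than Hivemall's documented behavior.

```scala
// A minimal sketch of min-max rescaling (hypothetical helper, not the Hivemall UDF).
object MinMaxSketch {
  /** Maps `value` from the range [min, max] onto [0.0, 1.0]. */
  def rescale(value: Double, min: Double, max: Double): Double = {
    // Assumption of this sketch: reject a degenerate range instead of guessing a result.
    require(max > min, "max must be greater than min")
    (value - min) / (max - min)
  }
}
```

The same minimum and maximum taken from the training labels are reused for the test labels, so both sets are rescaled consistently.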

3. Load test data in Spark
scala> val rawTestDf = spark.read.format("libsvm").load("E2006.test.bz2")
scala> :paste
val testDf = rawTestDf.select(
    rowid(),
    rescale($"label", lit(maxmin._2), lit(maxmin._1)).as("label"),
    $"features")
  .explode_vector($"features")
  .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
  .cache
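
The test data is reshaped from one row per example into one row per (rowid, feature) pair, so that it can later be joined with the model on `feature`. A rough plain-Scala sketch of that reshaping follows; `SparseVec`, `FeatureRow`, and `explode` are hypothetical names, and representing the feature as a string index is an assumption of this sketch.

```scala
// A plain-Scala sketch of exploding a sparse vector into per-feature rows
// (hypothetical types; not the actual explode_vector implementation).
object ExplodeVectorSketch {
  final case class SparseVec(indices: Array[Int], values: Array[Double])
  final case class FeatureRow(rowid: String, target: Double, feature: String, value: Double)

  /** Emits one FeatureRow per non-zero entry of the vector. */
  def explode(rowid: String, target: Double, v: SparseVec): Seq[FeatureRow] =
    v.indices.zip(v.values).map { case (i, x) =>
      FeatureRow(rowid, target, i.toString, x)
    }.toSeq
}
```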

4. Build a model - DataFrame
scala> :paste
val modelDf = trainDf.train_logistic_regr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
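
The `groupBy("feature")` plus `avg` step combines the (feature, weight) rows emitted by training into a single weight per feature. A plain-Scala sketch of that aggregation follows; `averageWeights` is a hypothetical helper, and the assumption that training may emit several weights for the same feature is this sketch's, not a statement about Hivemall internals.

```scala
// A plain-Scala sketch of averaging per-feature weights (hypothetical helper).
object ModelAveragingSketch {
  /** Averages all weights reported for the same feature. */
  def averageWeights(partial: Seq[(String, Double)]): Map[String, Double] =
    partial.groupBy(_._1).map { case (feature, rows) =>
      feature -> (rows.map(_._2).sum / rows.size)
    }
}
```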

5. Do predictions - DataFrame
scala> :paste
// Do predictions
val predictDf = testDf
  .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
  .select($"rowid", ($"avg(weight)" * $"value").as("value"))
  .groupBy("rowid").sum("value")
  .select(
    $"rowid",
    sigmoid($"sum(value)").as("predicted"))
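
The prediction query computes, for each row, the sigmoid of the sum of weight times value over its features. A plain-Scala sketch of the same arithmetic follows; `predict` is a hypothetical helper, and treating features missing from the model as weight 0.0 is an assumption of this sketch (the DataFrame version gets nulls from the left outer join instead).

```scala
// A plain-Scala sketch of logistic-regression prediction (hypothetical helper).
object PredictionSketch {
  /** Standard logistic function. */
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** rows: (rowid, feature, value); model: feature -> averaged weight. */
  def predict(
      rows: Seq[(String, String, Double)],
      model: Map[String, Double]): Map[String, Double] =
    rows.groupBy(_._1).map { case (rowid, fs) =>
      // Assumption: a feature absent from the model contributes nothing to the sum.
      val z = fs.map { case (_, feature, value) => model.getOrElse(feature, 0.0) * value }.sum
      rowid -> sigmoid(z)
    }
}
```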

Current Work for Future Releases
• Feature Selection + Spark Optimizer = Fast Data Extraction
  • HIVEMALL-181: Plan rewriting rules to filter meaningful training data before feature selection
[Figure: tables (key, v0) and (key, v1, v2) are joined into (key, v0, v1, v2) by data extraction (e.g., by SQL); feature selection (e.g., by scikit-learn) then yields the selected features]
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
[Figure: with join pruning by data statistics, data extraction and feature selection are combined into one step; tables (key, v0) and (key, v1, v2) yield (key, v1, v2)]

SPARK v2.3

What's Apache Spark
• Distributed data analytics engine, generalizing MapReduce
[Figure: Spark GitHub]

What's Apache Spark
• 1. Unified Engine
  • supports end-to-end APIs, e.g., MLlib and Streaming
• 2. High-level APIs
  • easy to use, rich optimization
• 3. Integrates broadly
  • storage systems, libraries, ...

Spark Release History
• v2.3.0 was released in February 2018
• v2.x releases focus on API stability
  • minor releases: 4 months of development + 1 month of QA
• Community discussion for v3.0 started recently
  • time for Apache Spark 3.0?: https://bit.ly/2qjcd6f
[Figure: Spark release timeline, 2012 to 2018. Milestones shown: the original paper (RDD) published; incubated in ASF; became an ASF top-level project; releases v0.6 through v2.3; DataFrame APIs; codegen support; Dataset APIs; Structured Streaming. Today's talk covers v2.3.]

An Introduction to Spark v2.3
Cited from: What's New in Upcoming Apache Spark 2.3, https://bit.ly/2GNS2nP

An Introduction to Spark v2.3
• Presented using the slides: What's New in Upcoming Apache Spark 2.3
  • https://bit.ly/2GNS2nP

Recap
• Hivemall on Spark
  • Wrapper implementations for DataFrame/SQL
  • Plus some utilities that make Hivemall easy to use in Spark
• Feature Selection + Spark Optimizer = Fast Data Extraction
  • WIP for future Hivemall releases
• Spark v2.3
  • Structured Streaming
  • Image support
  • Pandas UDF performance improvement
  • Spark on Kubernetes
  • ...

 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 

20180417 hivemall meetup#4

  • 1. An Introduction to Spark v2.3 & Hivemall-on-Spark v0.5.0
       Takeshi Yamamuro @ NTT Lab.
       Copyright©2018 NTT corp. All Rights Reserved.
  • 2. Self-Introduction
       • R&D/OSS engineer
       • Ph.D. in CS (Database Systems)
       • Love OSS activities
         • Apache Spark
         • Apache Hivemall
         • PostgreSQL
         • ...
       • My Active GitHub Projects
         • spark-sql-server
           • Yet Another Spark SQL JDBC/ODBC server based on the PostgreSQL V3 protocol
           • https://github.com/maropu/spark-sql-server
         • lljvm-translator
           • A lightweight library to inject LLVM bitcode into JVMs
           • https://github.com/maropu/lljvm-translator
  • 3. HIVEMALL ON SPARK v0.5.0
  • 4. What's Hivemall on Spark?
       • Hivemall wrapper for Spark
         • Wrapper implementations for DataFrame/SQL
         • Plus some utilities for ease of use in Spark
       • The wrapper lets you...
         • run most Hivemall functions in Spark
         • try Hivemall examples easily on your laptop
         • improve the performance of some Hivemall functions in Spark
  • 5. Why Hivemall on Spark?
       • Hivemall already has many fascinating ML algorithms and useful utilities
         • There are high barriers to adding newer algorithms to MLlib
           https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  • 6. Status of Hivemall-on-Spark v0.5.0
       • Supported Spark Versions
         • v2.0, v2.1, and v2.2
         • The upcoming release will support v2.3
       • Custom Operations
         • Top-K Join SparkPlan: https://bit.ly/2HnaeG1
         • Utility Functions: https://bit.ly/2qlk8zH
         • ...
       • Installation via Spark Packages
         • https://spark-packages.org
         • ./bin/spark-shell --packages apache-hivemall:apache-hivemall:0.5.1-spark2.2
  • 7. Example) Top-K Join Processing
       • Joins the top-K entries only
         • "Vanilla Join + Rank Over" is too slow
       [Figure: leftDf (join key, x) is joined with rightDf (join key, y); only the top-K rows that have the highest score values, e.g., f(x, y), are joined.]
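The idea on this slide can be sketched without Spark at all. The following is a minimal illustration on plain Scala collections, not the Hivemall SparkPlan itself: the names `LeftRow`, `RightRow`, and `topKJoin` are hypothetical, and the score function stands in for the slide's f(x, y). Instead of materializing every joined pair and ranking afterwards, it keeps a bounded heap of at most k candidates per left row.

```scala
import scala.collection.mutable

object TopKJoinSketch {
  // Hypothetical row types; the slide only names a join key, x, and y.
  final case class LeftRow(key: Int, x: Double)
  final case class RightRow(key: Int, y: Double)

  // For every left row, keep only the k matching right rows with the highest score.
  def topKJoin(left: Seq[LeftRow], right: Seq[RightRow], k: Int)(
      score: (Double, Double) => Double): Seq[(LeftRow, RightRow, Double)] = {
    val rightByKey: Map[Int, Seq[RightRow]] = right.groupBy(_.key)
    // Reversed ordering, so the queue head is the lowest score kept so far.
    val lowestFirst = Ordering.fromLessThan[(Double, RightRow)](_._1 > _._1)
    left.flatMap { l =>
      val heap = mutable.PriorityQueue.empty[(Double, RightRow)](lowestFirst)
      rightByKey.getOrElse(l.key, Seq.empty).foreach { r =>
        val s = score(l.x, r.y)
        if (heap.size < k) heap.enqueue((s, r))
        else if (k > 0 && s > heap.head._1) {
          heap.dequeue() // evict the weakest candidate
          heap.enqueue((s, r))
        }
      }
      // Emit the surviving candidates, best score first.
      heap.toList.sortWith(_._1 > _._1).map { case (s, r) => (l, r, s) }
    }
  }
}
```

Because each heap never holds more than k entries, the memory needed per left row is bounded by k rather than by the number of matching right rows, which is the intuition behind avoiding a full join followed by a rank.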
  • 8. Quick Example
       • 1. Download a Spark binary
       • 2. Fetch training and test data
       • 3. Load these data in Spark
       • 4. Build a model
       • 5. Do predictions
  • 9. 1. Download a Spark binary
       • Download a Spark v2.2.1 binary
         • https://spark.apache.org/downloads.html
  • 10. 2. Fetch training and test data
        • E2006 tfidf regression dataset
          • https://bit.ly/2GOC0di
        $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
        $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
  • 11. 3. Load training data in Spark
        $ <SPARK_HOME>/bin/spark-shell --packages apache-hivemall:apache-hivemall:0.5.1-spark2.2
        scala> import org.apache.spark.sql.hive.HivemallOps._
        scala> import org.apache.spark.sql._
        scala> :paste
        // Creates a DataFrame from the bzip2-compressed, libsvm-formatted file
        val rawTrainDf = spark.read.format("libsvm").load("E2006.train.bz2")
        // Since `label` must be in [0.0, 1.0], rescale it first
        val maxmin = rawTrainDf.select(max($"label"), min($"label")).collect.map {
          case Row(max: Double, min: Double) => (max, min)
        }.head
        val trainDf = rawTrainDf.select(
          rescale($"label", lit(maxmin._2), lit(maxmin._1)).as("label"),
          $"features")
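The rescaling step above is a min-max normalization of the label column. As a rough sketch in plain Scala (not the Hivemall UDF itself), the arithmetic looks like the following; `RescaleSketch` and `unscale` are hypothetical names, and mapping a constant column to 0.5 is an assumption made for this example.

```scala
object RescaleSketch {
  // Min-max normalization: maps [min, max] linearly onto [0.0, 1.0].
  def rescale(value: Double, min: Double, max: Double): Double = {
    require(min <= max, "min must not exceed max")
    // Assumption for this sketch: a constant column maps to the midpoint.
    if (min == max) 0.5 else (value - min) / (max - min)
  }

  // Inverse mapping, e.g. to bring a prediction back to the original label scale.
  def unscale(scaled: Double, min: Double, max: Double): Double =
    scaled * (max - min) + min
}
```

Note the argument order: the slide passes `maxmin._2` (the minimum) before `maxmin._1` (the maximum), which matches the `(value, min, max)` order used here.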
  • 12. 3. Load test data in Spark
        scala> val rawTestDf = spark.read.format("libsvm").load("E2006.test.bz2")
        scala> :paste
        val testDf = rawTestDf.select(
            rowid(),
            rescale($"label", lit(maxmin._2), lit(maxmin._1)).as("label"),
            $"features")
          .explode_vector($"features")
          .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
          .cache
  • 13. 4. Build a model - DataFrame
        scala> :paste
        val modelDf = trainDf.train_logistic_regr($"features", $"label")
          .groupBy("feature")
          .agg("weight" -> "avg")
  • 14. 5. Do predictions - DataFrame
        scala> :paste
        // Do predictions
        val predictDf = testDf
          .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
          .select($"rowid", ($"avg(weight)" * $"value").as("value"))
          .groupBy("rowid").sum("value")
          .select(
            $"rowid",
            sigmoid($"sum(value)").as("predicted"))
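The prediction query above joins each (rowid, feature, value) triple with its averaged weight, sums weight * value per row, and applies a sigmoid. A minimal sketch of the same arithmetic on plain Scala collections may make the data flow easier to follow; `PredictSketch` and its signature are hypothetical, and a `Map` stands in for `modelDf`.

```scala
object PredictSketch {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // rows: (rowid, feature, value) triples, as produced by exploding a feature vector.
  // model: feature -> averaged weight.
  def predict(rows: Seq[(Long, Int, Double)], model: Map[Int, Double]): Map[Long, Double] =
    rows.groupBy(_._1).map { case (rowid, feats) =>
      // Features absent from the model contribute nothing to the sum.
      // Assumption of this sketch: a row with no known feature gets sigmoid(0.0),
      // whereas the SQL version would produce NULL for such a row.
      val margin = feats.flatMap { case (_, f, v) => model.get(f).map(_ * v) }.sum
      rowid -> sigmoid(margin)
    }
}
```

The left outer join on the slide plays the role of `model.get(f)` here: test features that never appeared in training are kept in the row but do not affect the score.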
  • 15. Current Work for Future Releases
        • Feature Selection + Spark Optimizer = Fast Data Extraction
          • HIVEMALL-181: Plan rewriting rules to filter meaningful training data before feature selection
        [Figure: tables (key, v0) and (key, v1, v2) are joined into (key, v0, v1, v2) by data extraction (e.g., by SQL); feature selection (e.g., by scikit-learn) then produces the selected features.]
        Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
  • 16. Current Work for Future Releases
        • Feature Selection + Spark Optimizer = Fast Data Extraction
          • HIVEMALL-181: Plan rewriting rules to filter meaningful training data before feature selection
        [Figure: data extraction and feature selection run together; from the inputs (key, v0) and (key, v1, v2), join pruning by data statistics leaves (key, v1, v2).]
        Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
  • 17. SPARK v2.3
  • 18. What's Apache Spark
        • Distributed data analytics engine, generalizing MapReduce
        [Figure: Spark GitHub]
  • 19. What's Apache Spark
        • 1. Unified Engine
          • supports end-to-end APIs, e.g., MLlib and Streaming
        • 2. High-level APIs
          • easy to use, rich optimization
        • 3. Integrates broadly
          • storage systems, libraries, ...
  • 20. Spark Release History
        • v2.3.0 was released in February 2018
        • v2.x releases focus on API stability
          • minor releases: 4 months of development + 1 month of QA
        • Community discussion for v3.0 started recently
          • "Time for Apache Spark 3.0?": https://bit.ly/2qjcd6f
        [Figure: release timeline from 2012 to 2018, covering the original RDD paper, incubation in the ASF, promotion to an ASF top-level project, releases v0.6 through v2.3, and the introduction of the DataFrame APIs, codegen support, the Dataset APIs, and Structured Streaming; v2.3 is the topic of today's talk.]
  • 21. An Introduction to Spark v2.3
        Cited from: What's New in Upcoming Apache Spark 2.3, https://bit.ly/2GNS2nP
  • 22. An Introduction to Spark v2.3
        Cited from: What's New in Upcoming Apache Spark 2.3, https://bit.ly/2GNS2nP
  • 23. An Introduction to Spark v2.3
        • Presented using the slides "What's New in Upcoming Apache Spark 2.3"
          • https://bit.ly/2GNS2nP
  • 24. Recap
        • Hivemall on Spark
          • Wrapper implementations for DataFrame/SQL
          • Plus some utilities for ease of use in Spark
        • Feature Selection + Spark Optimizer = Fast Data Extraction
          • WIP for future Hivemall releases
        • Spark v2.3
          • Structured Streaming
          • Image support
          • Pandas UDF performance improvement
          • Spark on Kubernetes
          • ...