20180417 hivemall meetup#4

Copyright©2018 NTT corp. All Rights Reserved.
An Introduction to Spark v2.3 &
Hivemall-on-Spark v0.5.0
Takeshi Yamamuro @ NTT Lab.

Introducing Myself
• R&D/OSS engineer
• Ph.D. in CS (Database Systems)
• Love OSS activities
•  Apache Spark
•  Apache Hivemall
•  PostgreSQL
•  ...
• My Active GitHub Projects
•  spark-sql-server
•  Yet another Spark SQL JDBC/ODBC server based on the PostgreSQL V3 protocol
•  https://github.com/maropu/spark-sql-server
•  lljvm-translator
•  A lightweight library to inject LLVM bitcode into JVMs
•  https://github.com/maropu/lljvm-translator

HIVEMALL ON SPARK v0.5.0

What's Hivemall on Spark?
• Hivemall wrapper for Spark
•  Wrapper implementations for DataFrame/SQL
•  + some utilities for ease of use in Spark
• The wrapper lets you...
•  run most Hivemall functions in Spark
•  try Hivemall examples easily on your laptop
•  improve the performance of some Hivemall functions in Spark

Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
•  High barriers to adding newer algorithms to MLlib
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Status of Hivemall-on-Spark v0.5.0
• Supported Spark Versions
•  v2.0, v2.1, and v2.2
•  Upcoming release will support v2.3
• Custom Operations
•  Top-K Join SparkPlan: https://bit.ly/2HnaeG1
•  Utility Functions: https://bit.ly/2qlk8zH
•  ...
• Installation via Spark Packages
•  https://spark-packages.org
•  ./bin/spark-shell --packages apache-hivemall:apache-hivemall:0.5.1-spark2.2

Example) Top-K Join Processing
• Joins Top-K entries only
•  "Vanilla Join + Rank Over" is too slow
•  Joins only the top-K rows that have higher score values, e.g., f(x, y)
[Figure: leftDf and rightDf are joined on each join key (x, y, ...)]
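The semantics of a top-K join can be sketched on plain Scala collections. This is only a reference for what the "Vanilla Join + Rank Over" baseline computes, assuming the top K is taken per join key; it is not the Hivemall SparkPlan itself. The row types, the `topKJoin` helper, and the score function are hypothetical names chosen for illustration.

```scala
// Hypothetical row types, for illustration only.
final case class LeftRow(key: String, x: Double)
final case class RightRow(key: String, y: Double)

// Reference semantics: join on `key`, score every joined pair,
// then keep the k highest-scoring pairs for each key.
def topKJoin(left: Seq[LeftRow], right: Seq[RightRow], k: Int)(
    score: (LeftRow, RightRow) => Double): Seq[(LeftRow, RightRow, Double)] = {
  val rightByKey = right.groupBy(_.key)
  val joined = for {
    l <- left
    r <- rightByKey.getOrElse(l.key, Seq.empty[RightRow])
  } yield (l, r, score(l, r))
  joined.groupBy(_._1.key).values.toSeq
    .flatMap(rows => rows.sortBy(t => -t._3).take(k))
}

val leftRows = Seq(LeftRow("a", 1.0), LeftRow("a", 2.0), LeftRow("b", 1.0))
val rightRows = Seq(RightRow("a", 10.0), RightRow("a", 20.0), RightRow("b", 5.0))
// f(x, y) = x * y is an arbitrary example score.
val top1 = topKJoin(leftRows, rightRows, k = 1)((l, r) => l.x * r.y)
```

This baseline materializes and sorts every joined pair before discarding most of them, which is the cost a dedicated top-K join operator is meant to avoid.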

Quick Example
• 1. Download a Spark binary
• 2. Fetch training and test data
• 3. Load the data in Spark
• 4. Build a model
• 5. Do predictions

1. Download a Spark binary
• Download a Spark v2.2.1 binary
•  https://spark.apache.org/downloads.html

2. Fetch training and test data
• E2006 tfidf regression dataset
•  https://bit.ly/2GOC0di
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

3. Load training data in Spark
$ <SPARK_HOME>/bin/spark-shell \
    --packages apache-hivemall:apache-hivemall:0.5.1-spark2.2
scala> import org.apache.spark.sql.hive.HivemallOps._
scala> import org.apache.spark.sql._
scala> :paste
// Creates a DataFrame from the bzip'd libsvm-formatted file
val rawTrainDf = spark.read.format("libsvm").load("E2006.train.bz2")
// Since `label` must be in [0.0, 1.0], rescales it first
val maxmin = rawTrainDf.select(max($"label"), min($"label")).collect.map {
  case Row(max: Double, min: Double) => (max, min)
}.head
val trainDf = rawTrainDf.select(
  rescale($"label", lit(maxmin._2), lit(maxmin._1)).as("label"),
  $"features")
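The rescaling step above is min-max normalization of the label. A minimal sketch on plain Scala values, assuming the usual formula (value - min) / (max - min); `rescaleValue` is a hypothetical helper and the 0.5 fallback for a constant column is an assumption of this sketch, not a statement about Hivemall's implementation.

```scala
// Min-max normalization into [0.0, 1.0].
// The 0.5 fallback when max == min is an assumption of this sketch.
def rescaleValue(value: Double, min: Double, max: Double): Double =
  if (max == min) 0.5 else (value - min) / (max - min)

// Made-up labels, for illustration only.
val labels = Seq(-3.0, -1.0, 1.0)
val rescaled = labels.map(rescaleValue(_, labels.min, labels.max))
```

Note the argument order: the minimum comes before the maximum, which is why the slide passes `maxmin._2` first.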

3. Load test data in Spark
scala> val rawTestDf = spark.read.format("libsvm").load("E2006.test.bz2")
scala> :paste
val testDf = rawTestDf.select(
    rowid(),
    rescale($"label", lit(maxmin._2), lit(maxmin._1)).as("label"),
    $"features")
  .explode_vector($"features")
  .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
  .cache
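Exploding a feature vector turns one row per example into one row per (example, feature) pair, which is the shape the later join against the model needs. A rough sketch on plain collections; `SparseVec` and `explodeVector` are hypothetical stand-ins, and the column types are assumptions rather than the exact output schema of `explode_vector`.

```scala
// Hypothetical stand-in for a sparse feature vector (parallel index/value sequences).
final case class SparseVec(indices: Seq[Int], values: Seq[Double])

// One output row per stored entry: (rowid, feature, weight).
def explodeVector(rowid: String, vec: SparseVec): Seq[(String, Int, Double)] =
  vec.indices.zip(vec.values).map { case (i, w) => (rowid, i, w) }

val exploded = explodeVector("r1", SparseVec(Seq(0, 7), Seq(0.5, 2.0)))
```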

4. Build a model - DataFrame
scala> :paste
val modelDf = trainDf.train_logistic_regr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
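The `groupBy("feature")` plus average step combines the (feature, weight) rows emitted by training into one weight per feature. A small sketch of that aggregation on plain collections, assuming the same feature can be emitted more than once (for example, once per partition); the rows below are made up for illustration.

```scala
// Hypothetical (feature, weight) rows, as if emitted by two independently trained partitions.
val emitted = Seq(("1", 0.25), ("2", -0.5), ("1", 0.75), ("3", 1.0))

// Plain-collections equivalent of groupBy("feature").agg("weight" -> "avg").
val model: Map[String, Double] =
  emitted.groupBy(_._1).map { case (feature, rows) =>
    feature -> rows.map(_._2).sum / rows.size
  }
```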

5. Do predictions - DataFrame
scala> :paste
// Do predictions
val predictDf = testDf
  .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
  .select($"rowid", ($"avg(weight)" * $"value").as("value"))
  .groupBy("rowid").sum("value")
  .select(
    $"rowid",
    sigmoid($"sum(value)").as("predicted"))
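The prediction query above computes, for each test row, the sigmoid of the dot product between the row's feature values and the model weights. A minimal sketch for a single row on plain collections; `sigmoidFn`, the weights, and the test row are hypothetical values chosen for illustration.

```scala
// Logistic function, written out locally for this sketch.
def sigmoidFn(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// Hypothetical model weights and one sparse test row, both keyed by feature.
val weights = Map("1" -> 0.5, "2" -> -0.5)
val testRow = Seq(("1", 2.0), ("2", 2.0), ("9", 3.0)) // feature "9" is absent from the model

// Features missing from the model contribute nothing to the sum, similar to how
// SUM skips the NULL products produced by the LEFT OUTER join.
val margin = testRow.map { case (f, v) => weights.getOrElse(f, 0.0) * v }.sum
val predicted = sigmoidFn(margin)
```

One difference from the DataFrame version: if none of a row's features appear in the model, the SQL sum is NULL, whereas this sketch yields a margin of 0.0.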

Current Work for Future Releases
• Feature Selection + Spark Optimizer = Fast Data Extraction
•  HIVEMALL-181: Plan rewriting rules to filter meaningful training data before feature selection
[Figure: Data Extraction (e.g., by SQL) joins tables (key, v0) and (key, v1, v2) into (key, v0, v1, v2); Feature Selection (e.g., by scikit-learn) then produces the Selected Features]
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.

Current Work for Future Releases (cont.)
• Data Extraction + Feature Selection
•  Join Pruning by Data Statistics
[Figure: tables (key, v0) and (key, v1, v2) yield (key, v1, v2)]

SPARK v2.3

What's Apache Spark?
• Distributed data analytics engine, generalizing MapReduce
[Figure: Spark GitHub]

What's Apache Spark?
• 1. Unified Engine
•  supports end-to-end APIs, e.g., MLlib and Streaming
• 2. High-level APIs
•  easy to use, rich optimization
• 3. Integrates broadly
•  storage systems, libraries, ...

Spark Release History
• v2.3.0 released in February 2018
• v2.x releases focus on API stability
•  minor releases: 4 months of development + 1 month of QA
• Community discussion for v3.0 started recently
•  time for Apache Spark 3.0?: https://bit.ly/2qjcd6f
[Figure: release timeline, 2012 to 2018. Milestones shown: the original paper (RDD) published; incubated in ASF; promoted to an ASF top-level project; releases v0.6 through v2.3; DataFrame APIs; Codegen Support; Dataset APIs; Structured Streaming; v2.3 is marked as the topic of today's talk]

An Introduction to Spark v2.3
• Presented using the slides: What's New in Upcoming Apache Spark 2.3
•  https://bit.ly/2GNS2nP
Cited from: What's New in Upcoming Apache Spark 2.3, https://bit.ly/2GNS2nP

Recap
• Hivemall on Spark
•  Wrapper implementations for DataFrame/SQL
•  + some utilities for ease of use in Spark
•  Feature Selection + Spark Optimizer = Fast Data Extraction
•  WIP for future Hivemall releases
• Spark v2.3
•  Structured Streaming
•  Image support
•  Pandas UDF performance improvement
•  Spark on Kubernetes
•  ...
