Spark 2.0 What's Next （Hadoop / Spark Conference Japan 2016 キーノート講演資料）

Spark 2.0: What’s Next
Reynold Xin @rxin
Spark Conference Japan
Feb 8, 2016

Please put up your hand
if you know what Spark is?

Put up your hand
if you think your significant other
know what Spark is?
(girlfriend, boyfriend, wife, husband, …)

This Talk
What is Spark?
How are people using it?
Spark 2.0

open source data processing engine built around
speed, ease of use, and sophisticated analytics
スピード、使いやすさ、洗練された分析を兼ね合わせた
オープンソースのデータ処理エンジン

About Databricks
Founded by creators of Spark & behind Spark development
Cloud Enterprise Spark Platform
•  Cluster management, interactive notebooks,
dashboards, production jobs,
data governance, security, …
Databricks について
Spark開発者とSpark開発を支持する人たちによって設立された
エンタープライズクラウド Spark プラットフォーム

・クラスタ管理、対話型ノートブック

・ダッシュボード、ジョブ生成

・データガバナンス、セキュリティ

2015: Great Year for Spark
Most active open source project in data (1000+ contributors)
New language: R
Widespread industry support & adoption
2015: Sparkにとって大きな年
データ上、最も活発なオープンソースプロジェクト (1000人以上の貢献者)
新しい言語 : R
幅広い業界サポートと採用

Meetup Groups: December 2014
source: meetup.com

Meetup Groups: December 2015
source: meetup.com
Tokyo
Spark Meetup

IBMはApache Spark の高度化
へのコミットメントをアナウンス、
次の10年で最も重要なオープン
ソースプロジェクトとなる可能性
を秘めているという
Spark Or Hadoop – どちらがベスト
なビッグデータフレームワーク？
Apache Spark が人気急上昇だと
調査結果が示す

Diverse Runtime Environments
HOW RESPONDENTS ARE
RUNNING SPARK
51%
on a public cloud
MOST COMMON SPARK DEPLOYMENT
ENVIRONMENTS (CLUSTER MANAGERS)
48%
40%
11%
Standalone mode YARN Mesos
Cluster Managers
さまざまな実行環境

Industries Using Spark
Other
Software
(SaaS, Web, Mobile)
Consulting (IT)
Retail,
e-Commerce
Advertising,
Marketing, PR
Banking, Finance
Health, Medical,
Pharmacy, Biotech
Carriers,
Telecommunications
Education
Computers, Hardware
29.4%
17.7%
14.0%
9.6%
6.7%
6.5%
4.4%
4.4%
3.9%
3.5%
Sparkを利用している業界
ソフトウェア
コンサルティング (IT)
銀行、金融
コンピューター、ハードウェア
教育
健康、医療、薬剤、バイオテクノロジー
キャリア、通信
広告、
マーケティング、
PR
小売、eコマース
その他

Top Applications
29%
36%
40%
44%
52%
68%
Fraud Detection / Security
User-Facing Services
Log Processing
Recommendation
Data Warehousing
Business Intelligence
ビジネスインテリジェンス
データウェアハウジング
レコメンデーション
ログ処理
ユーザー向けサービス
不正検出 / セキュリティ
上位のアプリケーション

Are we done?
No. Development is faster than ever!
もう完成？
いいえ。開発は今まで以上に活発になって続いている！

2012
started
@
Berkeley
2010
research
paper
2013
Databricks
started
& donated
to ASF
2014
Spark 1.0 & libraries
(SQL, ML, GraphX)
2015
DataFrames
Tungsten
ML Pipelines
2016
Spark 2.0

SQL Streaming MLlib
Spark Core (RDD)
GraphX
Spark stack diagram Sparkのスタック図

Frontend
(user facing APIs)
Backend
(execution)
Spark stack diagram
(a different take)
Sparkのスタック図
(違う見方で)
フロントエンド
(ユーザーに面するAPI)
バックエンド
(実行)

Frontend
(RDD, DataFrame, ML pipelines, …)
Backend
(scheduler, shuﬀle, operators, …)
Spark stack diagram
(a different take)
Sparkのスタック図
(違う見方で)
フロントエンド
(RDD, DataFrame,
ML pipelines, …)
バックエンド
(スケジューラ、
シャッフル、演算子、
…)

Frontend
API Foundation
Streaming
DataFrame/Dataset
SQL
Backend
10X Performance
Whole-stage Codegen
Vectorization
Spark 2.0
フロントエンド API の創設
ストリーミング
DataFrame/Dataset
SQL
バックエンド 10倍のパフォーマンス
全ステージコード生成
ベクトル化

Guiding Principles for API Foundation
1.  Simple yet expressive
2.  (Semantics) well-defined
3.  Suﬀiciently abstracted to allow optimized backends
API を創るにあたっての指針
シンプルだが表現豊かに
(セマンティクスが) 十分定義されている
バックエンドの最適化ができるよう十分に抽象化されている

Java/Scala
frontend
JVM
backend
RDD
DataFrame
frontend
Logical Plan
Physical
execution
Catalyst
optimizer
DataFrame

Python Java/Scala SQL
DataFrame
Logical Plan
JVM Tungsten …

API Foundations in Spark 2.0
1.  Streaming DataFrames
2.  Maturing and merging DataFrame and Dataset
3.  ANSI SQL
•  natural join, subquery, view support
Spark 2.0 におけるAPIの創設
ストリーミング DataFrames
DataFrame と Dataset の成熟とマージ
自然結合、サブクエリ、ビューのサポート

Challenges with Stream Processing
Stream processing is hard to reason about
•  Output over time
•  Late data
•  Failures
•  Distribution
And all this has to work across complex operations
•  Windows, sessions, aggregation, etc
ストリーム処理に関する課題
ストリーム処理が難しい理由は

・長い期間に渡るアウトプット

・遅れてくるデータ

・障害

・分散
これらすべてが複雑なオペレーションにわたって機能しなければならない

・ウィンドウ、セッション、アグリゲーション、など

Next-gen Streaming with DataFrames
1.  Easy-to-use APIs (batch, streaming, and interactive)
2.  Well-defined semantics
•  Out-of-order data
•  Failures
•  Sources/sinks with exactly-once semantics
3.  Leverages Tungsten backend
DataFramesによる次世代ストリーミング
1. 使いやすいAPI (バッチ、ストリーミング、インタラクティブ)
2. うまく定義されたセマンティクス
・順序通りでないデータ
・障害
・exactly-once セマンティクスを持つ source / sink

3. Tungsten バックエンドの利用

Next-gen Streaming with DataFrames
1.  Easy-to-use APIs (batch, streaming, and interactive)
2.  Well-defined semantics
•  Out-of-order data
•  Failures
•  Sources/sinks with exactly-once semantics
3.  Leverages Tungsten backend
DataFramesによる次世代ストリーミング
1. 使いやすいAPI (バッチ、ストリーミング、インタラクティブ)
2. うまく定義されたセマンティクス
・順序通りでないデータ
・障害
・exactly-once セマンティクスを持つ source / sink

3. Tungsten バックエンドの利用
More details next few weeks
数週間後により詳細を

Spark is already pretty fast.
Can we make it 10X faster in 2.0?
Spark はすでにかなり速い
2.0 で 10 倍高速にできるのだろうか？

Spark 1.6 13.95 million
rows/sec
Spark 2.0
work-in-progress
125 million
rows/sec
High throughput
高スループット
Teaser: SQL/DataFrame Performance
come to my talk this afternoon to learn more
詳しく知りたいようでしたら午後のわたしの話を聞きに来てください
少しだけ宣伝: SQL/DataFrame のパフォーマンス

Tungsten Execution
PythonSQL R Streaming
DataFrame (& Dataset)
Advanced
Analytics

Spark 2.0 Release Schedule
Under active development on GitHub
March – April: code freeze
April – May: oﬀicial release
Spark 2.0 のリリーススケジュール
GitHub 上で活発に開発中
3月-4月 : コードフリーズ
4月-5月 : 正式リリース

ありがとうございました
@rxin

Spark 2.0 What's Next （Hadoop / Spark Conference Japan 2016 キーノート講演資料）

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Spark 2.0 What's Next （Hadoop / Spark Conference Japan 2016 キーノート講演資料）

Similar to Spark 2.0 What's Next （Hadoop / Spark Conference Japan 2016 キーノート講演資料） (20)

More from Hadoop / Spark Conference Japan

More from Hadoop / Spark Conference Japan (12)

Recently uploaded

Recently uploaded (11)

Spark 2.0 What's Next （Hadoop / Spark Conference Japan 2016 キーノート講演資料）