State of the art Stream Processing #hadoopreading

Hadoop Summit 2016 San Jose Trip Report
- State of the art Stream Processing -
Hadoop Source Code Reading #21
http://www.yahoo.co.jp/
Yahoo Japan Corporation, Shingo Furuyama
August 18, 2016

1. Introduction
2. Apache Flink
3. Apache Storm/Heron
4. Other frameworks
5. Conclusion

Introduction
• About this talk
• About the speaker
• Sessions the speaker attended
Copyright (C) 2016 Yahoo Japan Corporation. All Rights Reserved. Unauthorized quotation or reproduction prohibited.

About this talk
• This talk introduces the stream processing engines covered in the sessions I attended at Hadoop Summit 2016 San Jose

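About the speaker
• 2016/4 ~
– Manager of the stream processing team at Yahoo! JAPAN
• 2014/4 ~
– Finance, data, and related work at Yahoo! JAPAN
• 2011/10 ~
– Asakusa Framework and related work at Nautilus Technologies
– The item shown at left was made for me during my time at Nautilus
• 2007/4 ~
– Various finance-related work at Simplex Technology
Reference: https://www.linkedin.com/in/shingofuruyama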

Sessions the speaker attended
Monday June 27, 2016
• Flink Meetup
Tuesday, June 28, 2016
• What the #$* is a Business Catalog and Why You Need It!
• Streaming in the Wild with Apache Flink
• Governed Self Service Analytics at eBay
• Analysis of Major Trends in Big Data Analytics
• H2O: A Platform for Big Math
• How to Build a Successful Data Lake
• Instrument Your Instruments: Data-Driven Ops
Wednesday, June 29, 2016
• Blink - Improved Runtime for Flink and its Application in Alibaba Search
• Scalable Realtime Analytics using Druid
• Fine-Grained Security for Spark and Hive
• Lambda-less Stream Processing @ Scale in LinkedIn
Thursday, June 30, 2016
• The Future of Apache Storm
• Turning the Stream Processor into a Database: Building Online Applications on Streams
• Managing Hadoop, HBase, and Storm Clusters at Yahoo Scale
• Next Gen Big Data Analytics with Apache Apex
• Apache Beam: A Unified Model for Batch and Streaming Data Processing
• BoF: Streaming & Data Flow

Flink
Source: https://www.youtube.com/watch?v=1JV5o5g30-k
• Streaming in the Wild with Apache Flink

Sessions attended
• Flink Meetup (a kind of pre-conference evening event)
• Robust Stream Processing with Apache Flink
• Streaming in the Wild with Apache Flink
• Turning the Stream Processor into a Database: Building Online Applications on Streams
• Blink - Improved Runtime for Flink and its Application in Alibaba Search

Flink's execution model manages state so that it can recover from failures
Source: http://www.slideshare.net/HadoopSummit/streaming-in-the-wild-with-apache-flink-63921696

What writing an application feels like
Source: http://www.slideshare.net/HadoopSummit/the-stream-processor-as-a-database-apache-flink

What we want to achieve
• Exactly-once guarantees
• Low latency
• High throughput
• Powerful computation model
• Low overhead of the fault tolerance mechanism in the absence of failures
• Flow control
Source: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
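As a rough illustration of what an exactly-once guarantee for state means, here is a minimal single-process sketch in Python. It is not Flink code. It assumes a replayable input log and a checkpoint that stores the state together with the input offset it corresponds to; the function name run_word_count and its parameters are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def run_word_count(log: List[str], fail_at: Optional[int] = None,
                   checkpoint_every: int = 2) -> Dict[str, int]:
    """Count words from a replayable log, surviving one simulated crash."""
    counts: Dict[str, int] = {}
    offset = 0
    # A checkpoint holds the state AND the input position it belongs to.
    checkpoint: Tuple[int, Dict[str, int]] = (0, {})
    failed = False
    while offset < len(log):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Simulated crash: in-memory state is lost, so restore the last
            # checkpoint and replay the input from the recorded offset.
            offset, counts = checkpoint[0], dict(checkpoint[1])
            continue
        word = log[offset]
        counts[word] = counts.get(word, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, dict(counts))
    return counts


log = ["a", "b", "a", "c", "a"]
# The run that crashes at offset 3 ends with the same counts as a clean run.
print(run_word_count(log, fail_at=3) == run_word_count(log))
```

Because state and offset are restored together, records replayed after the crash are counted once in the final state, even though they are physically processed twice.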

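Theoretical background for achieving this
• If, when a failure occurs, the point up to which data has been processed is recorded accurately and the unprocessed data is still available, re-execution yields the correct result (the catch is that this is a distributed system)
• If you stop the input and wait for all computation to finish, you can take a consistent snapshot (but computation stays halted until the snapshot is complete)
• That is too disruptive, so Chandy-Lamport is widely used, but its space efficiency is poor
• For its implementation of Global Distributed Savepoints, Flink adopts Asynchronous Barrier Snapshotting, proposed in the paper shown at left
• The snapshot is built from the state of the tasks and edges in the application's DAG (execution plan), so it is space efficient
Source: http://arxiv.org/abs/1506.08603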
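The barrier idea behind Asynchronous Barrier Snapshotting can be sketched for a single operator with several inputs. The following Python is a simplified, single-process illustration under stated assumptions, not Flink's implementation: the names SumOperator and BARRIER are hypothetical, the topology is acyclic, and snapshots are kept in a local list instead of durable storage.

```python
from collections import deque
from typing import Deque, List, Union

BARRIER = "BARRIER"  # marker that the sources inject into the stream
Item = Union[int, str]


class SumOperator:
    """Multi-input operator that sums integers and snapshots on aligned barriers."""

    def __init__(self, num_inputs: int) -> None:
        self.total = 0
        self.blocked = [False] * num_inputs
        self.buffers: List[Deque[int]] = [deque() for _ in range(num_inputs)]
        self.snapshots: List[int] = []
        self.output: List[Item] = []

    def receive(self, channel: int, item: Item) -> None:
        if item == BARRIER:
            # Block this input until the barrier has arrived on every input.
            self.blocked[channel] = True
            if all(self.blocked):
                self._snapshot()
        elif self.blocked[channel]:
            # Post-barrier record: hold it back so it stays out of the snapshot.
            self.buffers[channel].append(item)
        else:
            self.total += item

    def _snapshot(self) -> None:
        self.snapshots.append(self.total)  # state covers only pre-barrier records
        self.output.append(BARRIER)        # forward the barrier downstream
        self.blocked = [False] * len(self.blocked)
        for buf in self.buffers:           # resume the records that were held back
            while buf:
                self.total += buf.popleft()


op = SumOperator(num_inputs=2)
op.receive(0, 1)
op.receive(0, BARRIER)  # input 0 is now blocked
op.receive(0, 100)      # buffered, not yet applied
op.receive(1, 2)
op.receive(1, BARRIER)  # barriers aligned: snapshot is taken here
print(op.snapshots, op.total)
```

The operator never stops the whole pipeline: only the inputs that have already delivered their barrier are paused, and only until the remaining barriers arrive.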

A sufficiently advanced stream processing framework is indistinguishable from a KVS
Source: http://www.slideshare.net/HadoopSummit/the-stream-processor-as-a-database-apache-flink
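To make the KVS analogy concrete, here is a toy Python sketch of an operator whose keyed state can be read by key while the stream is being processed. It only illustrates the idea and is not Flink's API; the class KeyedCounter and its methods are hypothetical.

```python
from typing import Dict, Iterable, Tuple


class KeyedCounter:
    """Keeps a running sum per key; the state doubles as a read-only KVS."""

    def __init__(self) -> None:
        self._state: Dict[str, int] = {}

    def process(self, events: Iterable[Tuple[str, int]]) -> None:
        # Write path: every incoming event updates the state for its key.
        for key, value in events:
            self._state[key] = self._state.get(key, 0) + value

    def get(self, key: str) -> int:
        # Read path: an external client looks up the current value by key.
        return self._state.get(key, 0)


counter = KeyedCounter()
counter.process([("user1", 1), ("user2", 5), ("user1", 2)])
print(counter.get("user1"), counter.get("user2"), counter.get("unknown"))
```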

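What is great about Flink
• The trade-off between latency and consistency is a thing of the past
• Flink delivers both low latency and consistency

What is not so great about Flink
• Does it not scale yet, at this point?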

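Flink's future
• 1.0: the current latest release
• 1.1: enhanced monitoring
– I have not looked at the details, but it looks like catching up with Storm
• 1.2: dynamic scaling
– I have not looked at the details, but it looks like catching up with Apex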

Storm vs Heron
• The Future of Apache Storm
Source: https://www.youtube.com/watch?v=_Q2uzRIkTd8

Storm vs Heron
• Storm
– Released as OSS sometime in September 2011
– Originally developed by Twitter
– Details in "Storm @Twitter", SIGMOD '14
• Heron
– Its existence was announced on the Twitter blog on June 2, 2015
• https://blog.twitter.com/2015/flying-faster-with-twitter-heron
– Originally developed by Twitter
– Details in "Twitter Heron: Stream Processing at Scale", SIGMOD '15

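Why Heron drew attention
• Nathan Marz, the engineer who worked on Storm at Twitter, left the community, and the Storm community declined
• Because Storm was originally built at Twitter, and because Heron was presented as fully upward-compatible with Storm, the view that Storm was finished gained traction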

Heron's features
• Off the shelf scheduler: runs on Mesos
• Handling spikes and congestion: backpressure
• Easy debugging: improved task granularity and visualization of the application's structure
• Compatibility with Storm: the same applications that run on Storm also run on Heron
• Scalability and latency: roughly 10-14x the throughput of Storm
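The general idea of backpressure listed above can be sketched with a bounded buffer between a fast producer and a slow consumer. This Python sketch shows only the concept and makes no claim about how Heron or Storm implement it; run_pipeline, spout, and bolt are hypothetical names borrowed from Storm's vocabulary.

```python
import queue
import threading
import time
from typing import List


def run_pipeline(n_items: int, capacity: int) -> List[int]:
    """Fast producer, slow consumer, connected by a bounded buffer."""
    buffer: queue.Queue = queue.Queue(maxsize=capacity)
    results: List[int] = []

    def spout() -> None:
        for i in range(n_items):
            # put() blocks while the buffer is full: the slow consumer
            # throttles the producer instead of tuples piling up or being lost.
            buffer.put(i)
        buffer.put(None)  # end-of-stream marker

    def bolt() -> None:
        while True:
            item = buffer.get()
            if item is None:
                break
            time.sleep(0.001)  # simulate a slow downstream task
            results.append(item * 2)

    threads = [threading.Thread(target=spout), threading.Thread(target=bolt)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(len(run_pipeline(n_items=20, capacity=2)))
```

With capacity=2 the producer can never run more than two items ahead of the consumer, yet every item is still delivered in order.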

Features added in Storm 1.x
• Pacemaker
• Distributed Cache
• Nimbus HA
• Window handling (introduces timestamps and watermarks)
• Stateful Bolt
• Checkpointing
• Backpressure
• Resource Aware Scheduler
• Dynamic Log Levels
• Tuple Sampling
• Distributed Log Search
• Dynamic Profiling
• Supervisor Health Check
• Performance: roughly 16x the throughput on the same stream
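The timestamp/watermark style of window handling mentioned in the list above can be illustrated with a small event-time tumbling window. This Python sketch is not Storm's windowing API. It assumes the watermark trails the largest event time seen so far by a fixed lag; tumbling_counts, window, and max_lag are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_counts(events: List[Tuple[int, str]], window: int,
                    max_lag: int) -> Tuple[List[Tuple[int, int]], List[Tuple[int, str]]]:
    """Count events per event-time window; a window closes once the watermark passes its end.

    events are (event_time, payload) pairs in arrival order.
    Returns (emitted windows as (window_start, count), events dropped as late).
    """
    open_windows: Dict[int, int] = defaultdict(int)
    emitted: List[Tuple[int, int]] = []
    late: List[Tuple[int, str]] = []
    watermark = float("-inf")
    for ts, payload in events:
        start = ts - ts % window
        if start + window <= watermark:
            late.append((ts, payload))  # its window has already been emitted
        else:
            open_windows[start] += 1
        # Assumption: the watermark trails the largest timestamp by max_lag.
        watermark = max(watermark, ts - max_lag)
        for s in sorted(w for w in open_windows if w + window <= watermark):
            emitted.append((s, open_windows.pop(s)))
    for s in sorted(open_windows):      # flush remaining windows at end of input
        emitted.append((s, open_windows.pop(s)))
    return emitted, late


events = [(1, "a"), (4, "b"), (12, "c"), (3, "d"), (25, "e"), (8, "f")]
print(tumbling_counts(events, window=10, max_lag=5))
```

In this run the out-of-order event at time 3 is still counted because its window is open, while the event at time 8 arrives after the watermark has passed the end of its window and is reported as late.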

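What Heron has that Storm lacks
• Off the shelf scheduler
– Storm applications cannot run on Mesos or YARN
– That said, Storm on YARN seems to be possible with something like Slider
• Using backpressure together with ZooKeeper is reportedly problematic (this came to light only recently)
– STORM-1949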

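Storm's future
• 1.0.x has improved so much that it is no exaggeration to call it more featureful than Heron with equal performance
• 1.1.x will further improve metrics
• 2.x will be ported to Java "to broaden the committer base"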

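Other frameworks
Samza
• Operations and management are reportedly well done
• See "Lambda-less Stream Processing @ Scale in LinkedIn"
Apex
• Selling points are dynamic optimization of partitions and zero-downtime application upgrades
• See "Next Gen Big Data Analytics with Apache Apex"
Beam/MillWheel
Cloud Dataflow
• MillWheel is referenced so widely that it could be called the origin of everything
• Some even say Flink has surpassed MillWheel?
• See "Apache Beam: A Unified Model for Batch and Streaming Data Processing"
Spark Streaming
• In the sessions I observed, it was a constant target of criticism
• Reportedly the Cloudera session pushed Spark Streaming, so the situation seems to differ from vendor to vendor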

Conclusion
• Summary
• References

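Summary
• Flink handles state artistically and feels the most promising; worth watching going forward
• Storm has the edge in scalability and in operations and management, so it looks like the better choice for production use
• The others each have their strengths and weaknesses, so the field resembles a warring-states period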

References
• Agenda of the event
– http://hadoopsummit.org/san-jose/agenda/
• Videos of the event
– https://www.youtube.com/channel/UCAPa-K_rhylDZAUHVxqqsRA
