State of the art Stream Processing #hadoopreading
1. Hadoop Summit 2016 San Jose Trip Report
- State of the art Stream Processing -
Hadoop Source Code Reading #21
http://www.yahoo.co.jp/
Yahoo Japan Corporation, 古山 慎悟
August 18, 2016
7. Sessions the presenter attended
Monday, June 27, 2016
• Flink Meetup
Tuesday, June 28, 2016
• What the #$* is a Business Catalog and Why You Need It!
• Streaming in the Wild with Apache Flink
• Governed Self Service Analytics at eBay
• Analysis of Major Trends in Big Data Analytics
• H2O: A Platform for Big Math
• How to Build a Successful Data Lake
• Instrument Your Instruments: Data-Driven Ops
Wednesday, June 29, 2016
• Blink - Improved Runtime for Flink and its Application in Alibaba Search
• Scalable Realtime Analytics using Druid
• Fine-Grained Security for Spark and Hive
• Lambda-less Stream Processing @ Scale in LinkedIn
Thursday, June 30, 2016
• The Future of Apache Storm
• Turning the Stream Processor into a Database: Building Online Applications on Streams
• Managing Hadoop, HBase, and Storm Clusters at Yahoo Scale
• Next Gen Big Data Analytics with Apache Apex
• Apache Beam: A Unified Model for Batch and Streaming Data Processing
• BoF: Streaming & Data Flow
Copyright (C) 2016 Yahoo Japan Corporation. All Rights Reserved. Unauthorized quotation or reproduction prohibited.
9. Flink
Source: https://www.youtube.com/watch?v=1JV5o5g30-k
• Streaming in the Wild with Apache Flink
10. Sessions attended
• Flink Meetup (a pre-conference eve gathering of sorts)
• Robust Stream Processing with Apache Flink
• Streaming in the Wild with Apache Flink
• Turning the Stream Processor into a Database: Building Online Applications on Streams
• Blink - Improved Runtime for Flink and its Application in Alibaba Search
11. Flink's execution model manages state for the sake of recovery
Source: http://www.slideshare.net/HadoopSummit/streaming-in-the-wild-with-apache-flink-63921696
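The slide title says only that Flink manages state so that it can recover. One way to picture that idea is the toy sketch below, in plain Python rather than Flink's API: an operator snapshots its state together with its read position in a replayable source, and after a failure it rolls both back and replays. The names `CountingOperator`, `take_checkpoint`, and `recover` are hypothetical, chosen for illustration only.

```python
import copy


class CountingOperator:
    """Toy keyed-count operator whose state is snapshotted with the source offset."""

    def __init__(self):
        self.counts = {}            # operator state: key -> count
        self.offset = 0             # index of the next record to read from the source
        self.checkpoint = ({}, 0)   # last snapshot: (copy of state, offset)

    def process(self, source, until):
        # Consume records from the replayable source up to (not including) `until`.
        while self.offset < until:
            key = source[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def take_checkpoint(self):
        # Snapshot state and offset together so they stay consistent with each other.
        self.checkpoint = (copy.deepcopy(self.counts), self.offset)

    def recover(self):
        # After a failure, roll back both the state and the read position.
        state, offset = self.checkpoint
        self.counts = copy.deepcopy(state)
        self.offset = offset


source = ["a", "b", "a", "c", "a", "b"]

op = CountingOperator()
op.process(source, until=3)
op.take_checkpoint()
op.process(source, until=5)            # progress made after the checkpoint...
op.recover()                           # ...is discarded when the operator fails
op.process(source, until=len(source))  # replay from the checkpointed offset
result = op.counts
```

Because the offset is restored along with the counts, each record affects the final state exactly once even though two records were read twice.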
12. What writing an application feels like
Source: http://www.slideshare.net/HadoopSummit/the-stream-processor-as-a-database-apache-flink
13. What we want to achieve
• Exactly-once guarantees
• Low latency
• High throughput
• Powerful computation model
• Low overhead of the fault tolerance mechanism in the absence of failures
• Flow control
Source: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
15. A sufficiently advanced stream processing framework is indistinguishable from a KVS
Source: http://www.slideshare.net/HadoopSummit/the-stream-processor-as-a-database-apache-flink
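The comparison with a KVS can be made concrete with a small sketch, assuming the point is that state held inside the stream processor can be read by key directly instead of being exported to a separate store first. This is plain Python, not any framework's API; `KeyedStateOperator`, `on_event`, and `query` are hypothetical names.

```python
class KeyedStateOperator:
    """Toy operator: folds events into per-key state and answers point lookups."""

    def __init__(self):
        self._state = {}  # key -> running total, owned by the operator

    def on_event(self, key, amount):
        # Stream side: every incoming event updates the state for its key.
        self._state[key] = self._state.get(key, 0) + amount

    def query(self, key, default=None):
        # Serving side: read the current value, as a key-value store would.
        return self._state.get(key, default)


op = KeyedStateOperator()
for key, amount in [("user1", 5), ("user2", 3), ("user1", 7)]:
    op.on_event(key, amount)

latest = op.query("user1")
missing = op.query("user9", default=0)
```

From the caller's side, `query` looks like a read against a key-value store, while the writes come only from the event stream.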
20. Storm vs Heron
• The Future of Apache Storm
Source: https://www.youtube.com/watch?v=_Q2uzRIkTd8
21. Storm vs Heron
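• Storm
– Released as OSS sometime in September 2011
– Originally developed by Twitter (it began at BackType, which Twitter acquired in 2011)
– Details in "Storm @Twitter", SIGMOD’14
• Heron
– Its existence was announced on the Twitter blog on June 2, 2015
• https://blog.twitter.com/2015/flying-faster-with-twitter-heron
– Originally developed by Twitter
– Details in "Twitter Heron: Stream Processing at Scale", SIGMOD’15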
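25. What Heron has that Storm lacks
• Off the shelf scheduler
– Storm applications cannot be run on Mesos or YARN as-is
– Storm on YARN does appear to be possible with something like Slider
• Backpressure: in Storm it reportedly breaks down when used together with ZooKeeper (this came to light only recently)
– STORM-1949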
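The slide names backpressure without describing it. As a generic illustration only, and not the mechanism used by Storm or Heron, the sketch below puts a bounded buffer between a fast producer and a slow consumer: when the buffer is full the producer blocks, so the slow stage sets the pace. The function `run_pipeline` and its parameters are hypothetical.

```python
import queue
import threading
import time


def run_pipeline(n_items, capacity):
    buffer = queue.Queue(maxsize=capacity)  # bounded buffer between the two stages
    received = []
    max_depth = 0

    def consumer():
        while True:
            item = buffer.get()
            if item is None:       # sentinel: upstream is finished
                break
            time.sleep(0.001)      # simulate a slow downstream stage
            received.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        buffer.put(i)              # blocks while the buffer is full: the backpressure
        max_depth = max(max_depth, buffer.qsize())
    buffer.put(None)
    worker.join()
    return received, max_depth


received, max_depth = run_pipeline(n_items=20, capacity=2)
```

No records are dropped and the buffer never grows past its capacity; the cost is that the producer is slowed to the consumer's rate.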
28. Other frameworks
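Samza
• Operations and management are reportedly well done
• See “Lambda-less Stream Processing @ Scale in LinkedIn”
Apex
• Selling points are dynamic partition optimization and zero-downtime application upgrades
• See “Next Gen Big Data Analytics with Apache Apex”
Beam/MillWheel
Cloud Dataflow
• MillWheel is referenced so widely that it could be called the origin of it all
• Some even suggest that Flink has surpassed MillWheel?
• See “Apache Beam: A Unified Model for Batch and Streaming Data Processing”
Spark Streaming
• As far as I could observe, it was the framework everyone was dissing
• From what I hear, Cloudera's sessions were pushing Spark Streaming, so the picture seems to vary by vendor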