HADOOP PLATFORM
AT YAHOO
A YEAR IN REVIEW
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Agenda
1. Platform Overview
2. Infrastructure and Metrics
3. CaffeOnSpark for Distributed DL
4. Compute and Sketches
5. HBase and Omid
6. Oozie
7. Ease of Use
8. Q&A
Platform Evolution
[Chart: number of servers (0 to 50,000) and raw HDFS storage in PB (0 to 800) by year, 2006 to 2016]
Milestones annotated on the chart:
• Yahoo! Commits to Scaling Hadoop for Production Use
• Research Workloads in Search and Advertising
• Production (Modeling) with machine learning & WebMap
• Revenue Systems with Security, Multi-tenancy, and SLAs
• Open Sourced with Apache
• Hortonworks Spinoff for Enterprise hardening
• Nextgen Hadoop (H 0.23 YARN)
• New Services (HBase, Storm, Spark, Hive)
• Increased User-base with partitioned namespaces
• Apache H2.7 (Scalable ML, Latency, Utilization, Productivity)
Deployment Models (On-Premise / Public Cloud)
Private (dedicated) Clusters
• Large demanding use cases
• New technology not yet platformized
• Data movement and regulation issues
Hosted Multi-tenant (private cloud) Clusters
• Source of truth for all of the org's data
• App delivery agility
• Operational efficiency and cost savings through economies of scale
Hosted Compute Clusters
• When more cost effective than on-premise
• Time to market / results matter
• Data already in public cloud
Purpose-built Big Data Clusters
• For performance, tighter integration with tech stack
• Value added services such as monitoring, alerts, tuning and common tools
Platform Today
Legend: Apache / Open Source Projects, Yahoo Projects
Services: ZK, DBMS, MON, SSHOP, LOG WH, TOOLS
Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
Compute: YARN, CS, MR, Tez, Spark, Storm
Technology Stack Assembly
Legend: Apache Projects, Yahoo Projects
Services: ZK, DBMS, MON, SSHOP, LOG WH, TOOLS
Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
Compute: YARN, CS, MR, Tez, Spark, Storm
Base layers: HDFS (File System); YARN (Scheduling, Resource Management); Common; RHEL6 64-bit, JDK8
Status legend: Platformized Tech with Production Support; In-progress, Unmet needs or Apache Alignment
Common Backplane
[Diagram: cluster components attached to a shared network backplane]
• Hadoop: NameNode, RM, DataNode, NodeManager
• HBase: NameNode, HBase Master, DataNodes, RegionServers
• Storm: Nimbus, Supervisor
• ZooKeeper Pools
• HTTP/HDFS/GDM Load Proxies
• Oozie Server
• HS2/HCat
• Administration, Management and Monitoring
• Applications and Data: Data Feeds, Data Stores
• Network Backplane
Research Cluster Consolidation
[Charts: compute total and used (TB) per cluster, one month sample (2015)]
• Cluster 1 (2,000 servers): HDFS 12 PB, Compute 23 TB, Avg. Util: 26%
• Cluster 2 (3,100 servers): HDFS 21 PB, Compute 52 TB, Avg. Util: 40%
• Cluster 3 (5,400 servers): HDFS 36 PB, Compute 70 TB, Avg. Util: 59%
Consolidated Research Cluster Characteristics
[Chart: compute total and used (TB), one month sample (2016)]
Consolidated Cluster: HDFS 65 PB, Compute 240 TB, Avg. Util: 70%
• Servers: 10,500 before, 2,200 after
• 40% decrease in TCO
• 65% increase in compute capacity
• 50% increase in avg. utilization
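The compute and utilization percentages can be cross-checked against the per-cluster figures from the 2015 sample. The following is only a quick sanity check in Python; it assumes the "before" average utilization is weighted by each cluster's compute capacity, which the slides do not state.

```python
# Per-cluster figures from the 2015 sample: (servers, compute TB, avg. utilization)
before = [(2000, 23, 0.26), (3100, 52, 0.40), (5400, 70, 0.59)]
after_compute_tb, after_util = 240, 0.70

servers_before = sum(servers for servers, _, _ in before)
compute_before = sum(compute for _, compute, _ in before)

# Assumption: weight each cluster's utilization by its compute capacity.
util_before = sum(compute * util for _, compute, util in before) / compute_before

compute_gain = (after_compute_tb - compute_before) / compute_before
util_gain = (after_util - util_before) / util_before

print(servers_before)         # total servers before consolidation
print(f"{compute_gain:.1%}")  # relative increase in compute capacity
print(f"{util_gain:.1%}")     # relative increase in average utilization
```

Under that weighting assumption the three clusters add up to 10,500 servers and 145 TB of compute, giving roughly a 65.5% compute increase and a 49% utilization increase, in line with the rounded figures on the slide.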
Common Hadoop Cluster Configuration
[Diagram: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs & 10GbE, connected by the network backplane]
New Hadoop Cluster Configuration
[Diagram: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs & 10GbE on the network backplane, plus GPU servers and Hi-Mem servers; 100Gbps InfiniBand]
YARN Node Labels
[Diagram: a Hadoop cluster whose nodes are labeled x or y; jobs J1 to J4 are submitted through three queues]
• Queue 1, 40%: Label x
• Queue 2, 40%: Label x, y
• Queue 3, 20%
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label asked for by the queue
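As an illustration of how the two properties fit together, here is a minimal Python sketch that generates capacity-scheduler name/value pairs for the three queues in the diagram. The queue names (q1, q2, q3) are hypothetical, and picking the first label as the default expression is an assumption made for the example.

```python
# Hypothetical queue names; capacities and labels follow the diagram.
queues = {
    "q1": {"capacity": 40, "labels": ["x"]},
    "q2": {"capacity": 40, "labels": ["x", "y"]},
    "q3": {"capacity": 20, "labels": []},
}


def capacity_scheduler_properties(queues, parent="root"):
    """Build capacity-scheduler.xml name/value pairs for queues with node labels."""
    props = {f"yarn.scheduler.capacity.{parent}.queues": ",".join(queues)}
    for name, spec in queues.items():
        prefix = f"yarn.scheduler.capacity.{parent}.{name}"
        props[f"{prefix}.capacity"] = str(spec["capacity"])
        if spec["labels"]:
            props[f"{prefix}.accessible-node-labels"] = ",".join(spec["labels"])
            # Assumption: jobs in the queue default to its first label.
            props[f"{prefix}.default-node-label-expression"] = spec["labels"][0]
    return props


for key, value in capacity_scheduler_properties(queues).items():
    print(f"{key} = {value}")
```

A working configuration would also need per-label capacities for each queue, which this sketch leaves out.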
CaffeOnSpark – Distributed Deep Learning
[Stack diagram, top to bottom]
• CaffeOnSpark for DL | MLLib for non-DL | Hive or SparkSQL | . . .
• Spark
• YARN (RM and Scheduling)
• HDFS (Datasets)
Few Use Cases – Yahoo Weather
Few Use Cases – Flickr Facial Recognition
Few Use Cases – Flickr Scene Detection
CaffeOnSpark Architecture – Common Cluster
[Diagram: a Spark Driver and three identical executor nodes]
Each executor node:
• Spark Executor (for data feeding and control)
• Caffe (enhanced with multi-GPU/CPU)
• Model Synchronizer (across nodes)
• HDFS Datasets
Interconnect: MPI on RDMA / TCP
Model O/P on HDFS
CaffeOnSpark Architecture – Incremental Learning
Feature Engineering: Deep Learning
val cos = new CaffeOnSpark(ctx)
val conf = new Config(ctx, args).init()
val dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) // train the DL model
val lr_raw_source = DataSource.getSource(conf, false)
val ext_df = cos.features(lr_raw_source) // extract features via DL
Train Classifiers: Non-deep Learning
val lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
val lr_model = lr.fit(lr_input_df) …
CaffeOnSpark Architecture – Single Command
spark-submit
--num-executors #Exes
--class CaffeOnSpark
my-caffe-on-spark.jar
-devices #GPUs
-model dl_model_file
-output lr_model_file
Distributed Deep Learning
CaffeOnSpark: github.com/yahoo/caffeonspark
• Apache License
• Existing Clusters
• Powerful DL Platform
• Fully Distributed
• High-level API
• Incremental Learning
Hadoop Compute Sources
[Stack diagram]
• Pig (Scripting), Hive (SQL), Java MR APIs
• Tez (Execution Engine for Pig and Hive), Spark (Alternate Exec Engine), MapReduce (Legacy)
• YARN (Resource Management and Scheduling)
• HDFS (File System and Storage)
• Workloads: Data Processing, ML, Custom App on Slider, Oozie, Data Management
Compute Growth
[Chart: number of MR, Tez and Spark jobs (in millions) per month, Mar-13 to Mar-16; labeled data points: 13.3, 20.4, 23.8, 27.2, 32.3, 34.1, 39.1]
Pushing Batch Compute Boundaries
[Chart: % of total compute (memory-sec) by engine (MapReduce, Tez, Spark), Q1 2016]
112 Million Batch Jobs in Q1’16
Labeled shares: Jan 78%, 8%, 14%; Mar 67%, 21%, 12%
Multi-tenant Apache Storm
Recent Apache Storm Developments at Yahoo
• MT & RA Scheduler
• Dist. Cache API
• 8x Throughput
• Improved Debuggability
• Pacemaker Server
• Streaming Benchmark [1]
[1] github.com/yahoo/streaming-benchmarks
Data Sketches Algorithms
Data Sketches Algorithms Library: datasketches.github.io
Characteristics
• Good enough approximate answers for problem queries
• Streamable
• Approximate with predictable error
• Sub-linear in size
• Mergeable / additive
• Highly parallelizable
• Maven deployable
Distinct Count Sketch, High-level View
Basic Sketch Elements
Big Data Stream -> Transform (White Noise) -> Data Structure -> Estimator -> Result + / - ε
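The transform, data structure and estimator stages can be illustrated with a small k-minimum-values (KMV) sketch in Python. This is a simplified stand-in chosen for readability, not the Data Sketches library's implementation; the class name, the SHA-1 hash and the default k are all assumptions made for the example.

```python
import hashlib
import heapq


class KMVSketch:
    """K-minimum-values distinct-count sketch (illustrative only)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap over the k smallest values
        self._members = set()  # the same hashes, for duplicate detection and merging

    @staticmethod
    def _transform(item):
        # Transform: hash each item to a pseudo-random point in the unit interval.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def update(self, item):
        self._offer(self._transform(item))

    def _offer(self, h):
        # Data structure: retain only the k smallest distinct hash values seen so far.
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        # Mergeable: the k smallest hashes of a union come from the two retained sets.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self):
        # Estimator: exact below k distinct items, else (k - 1) / k-th smallest hash.
        if len(self._heap) < self.k:
            return float(len(self._heap))
        return (self.k - 1) / -self._heap[0]


left, right = KMVSketch(), KMVSketch()
for user_id in range(0, 6000):
    left.update(user_id)
for user_id in range(4000, 10000):
    right.update(user_id)
print(round(left.merge(right).estimate()))  # approximates the 10,000 distinct ids
```

Each sketch holds at most k hash values no matter how long the stream is, and two sketches built on different partitions merge into the same state a single sketch over the union would reach, which is what makes the approach sub-linear in size and parallelizable.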
Apache HBase at Yahoo
• Security
• Isolated Deployment
• Multi-tenant
• Region Server Group
• Namespace
• Unsupported Features
[Diagram: Compute Cluster and HBase Cluster]
Compute Cluster: JobTracker, Namenode, TaskTrackers with DataNodes running M/R Tasks, MR Client, HBase Clients
HBase Cluster: HBase Master, Zookeeper Quorum, Namenode, RegionServers with DataNodes
Access: Gateway/Launcher, Rest Proxy, HTTP Client
Security
• Authentication
  • Kerberos (users, processes)
  • Delegation Token (MapReduce, YARN, etc.)
• Authorization
  • HBase ACLs (Read, Write, Create, Admin)
  • Grant permissions to User or Unix Group
  • ACL for Table, Column Family or Column
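The authorization model above can be pictured as a lookup over grants scoped to a table, a column family, or a single column. The Python sketch below is a simplified model written for illustration and is not HBase's AccessController; the `Grant` type and `is_allowed` function are hypothetical. The `@` prefix for group principals follows the convention HBase uses when granting to a group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    principal: str                   # user name, or "@group" for a Unix group
    actions: frozenset               # subset of {"R", "W", "C", "A"}
    table: str
    family: Optional[str] = None     # None means the grant covers the whole table
    qualifier: Optional[str] = None  # None means the grant covers the whole family


def is_allowed(grants, user, groups, action, table, family=None, qualifier=None):
    """Return True if any grant for the user or one of their groups covers the request."""
    principals = {user} | {f"@{group}" for group in groups}
    for grant in grants:
        if grant.principal not in principals or action not in grant.actions:
            continue
        if grant.table != table:
            continue
        if grant.family is not None and grant.family != family:
            continue
        if grant.qualifier is not None and grant.qualifier != qualifier:
            continue
        return True
    return False


grants = [
    Grant("alice", frozenset("R"), "clicks"),                # table-wide read
    Grant("@analysts", frozenset("RW"), "clicks", "stats"),  # one column family
]
print(is_allowed(grants, "alice", [], "R", "clicks", "raw"))           # True
print(is_allowed(grants, "bob", ["analysts"], "W", "clicks", "raw"))   # False
```

A broader grant covers everything beneath it, so a table-level grant answers requests for any column family, while a family-level grant does not extend to other families.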
Region Server Groups
• Dedicated region servers for a set of tables
• Resource Isolation (CPU, Memory, IO, etc.)
RegionServer Group Foo: Region Server 1...5; TableA, TableB, TableC, TableD, TableE, TableF
RegionServer Group Bar: Region Server 6…10; Table1, Table2, Table3, Table4, Table5, Table6
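The isolation idea can be sketched as a placement rule: regions of a table may only land on servers in that table's group. The Python below is a toy model under that assumption; the server names (rs1 to rs10) and the round-robin placement are hypothetical and do not reflect HBase's actual balancer.

```python
# Group membership and table placement as drawn in the diagram (server names hypothetical).
groups = {
    "Foo": {
        "servers": [f"rs{i}" for i in range(1, 6)],
        "tables": {"TableA", "TableB", "TableC", "TableD", "TableE", "TableF"},
    },
    "Bar": {
        "servers": [f"rs{i}" for i in range(6, 11)],
        "tables": {"Table1", "Table2", "Table3", "Table4", "Table5", "Table6"},
    },
}


def candidate_servers(table, groups):
    """Return the region servers dedicated to the group that owns the table."""
    for group in groups.values():
        if table in group["tables"]:
            return list(group["servers"])
    raise KeyError(f"table {table!r} is not assigned to any group")


def assign_regions(table, regions, groups):
    """Spread a table's regions round-robin over its own group's servers only."""
    servers = candidate_servers(table, groups)
    return {region: servers[i % len(servers)] for i, region in enumerate(regions)}


print(assign_regions("TableA", ["r0", "r1", "r2"], groups))
```

Because a table's regions never leave its group, load from tables in group Foo cannot consume CPU, memory or IO on the servers reserved for group Bar.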
Namespaces
• Analogous to “Database”
• Namespace ACL to create tables
• Default group
• Quota
  • Tables
  • Regions
[Diagram: a Namespace with its Group, Tables, Quota and ACL]
Split Meta to Spread Load and Avoid Large Regions
Favored Nodes for HDFS Locality
Humongous Tables
Scaling HBase to Handle Millions of Regions on a Cluster
• Region Server Groups
• Split Meta
• Split ZK
• Favored Nodes
• Humongous Tables
Transactions on HBase with Omid [1]
• Highly performant and fault tolerant ACID transactional framework
• New Apache Incubator project: incubator.apache.org/projects/omid.html
• Handles millions of transactions per day for search and personalization products
[1] Omid stands for “Hope” in Persian
Omid Components
Omid Data Model
Oozie Data Pipelines
[Diagram: Data Producer, HDFS, HCatalog, Message Bus, Oozie]
Data Producer:
• Produce data (distcp, pig, M/R..) on HDFS: /data/click/2014/06/02
• Update metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02')
Notification flow:
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
Start workflow
Large Scale Data Pipeline Requirements
Administrative
• One should be able to start, stop and pause all related pipelines at the same time
Dependency Management
• Output of a coordinator “n+1” action is dependent on coordinator “n” action (dataset dependency)
• If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
• Start as soon as mandatory data is available; other feeds are optional
• Data is not guaranteed; start processing even if only partial data is available
SLA Management
• Monitor pipeline processing to take immediate action in case of failures or SLA misses
• Pipeline owners should get notified if an SLA is missed
Multiple Providers
• If data is available from multiple providers, I want to specify the provider priority
• Combine datasets from multiple providers to fill the gaps in data a single provider may have
BCP And Mandatory / Optional Feeds
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Dataset B is optional; Oozie will start processing as soon as A is available. It will include the dataset from A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
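The two rules above can be modeled as a small recursive evaluation over the input-logic tree. The Python sketch below is a simplified model for illustration, not Oozie's implementation: it ignores the `wait` attribute, and the helper names (`data_in`, `or_`, `and_`, `resolve`) are hypothetical.

```python
def data_in(dataset, min_instances=None):
    """Leaf node: a dataset, optionally with a minimum number of instances."""
    return ("data-in", dataset, min_instances)


def or_(*children):
    return ("or", children)


def and_(*children):
    return ("and", children)


def resolve(node, available, requested):
    """Return the datasets to process, or None if the condition is not met yet.

    available: dataset name -> number of instances currently present
    requested: dataset name -> number of instances the action asks for
    """
    if node[0] == "data-in":
        _, dataset, min_instances = node
        needed = requested[dataset] if min_instances is None else min_instances
        return [dataset] if available.get(dataset, 0) >= needed else None
    results = [resolve(child, available, requested) for child in node[1]]
    if node[0] == "or":
        # The first satisfied child, in document order, wins.
        return next((r for r in results if r is not None), None)
    # "and": every child must be satisfied.
    if any(r is None for r in results):
        return None
    return [dataset for r in results for dataset in r]


requested = {"A": 1, "B": 1}
a_or_b = or_(data_in("A"), data_in("B"))
optional_b = and_(data_in("A"), data_in("B", min_instances=0))

print(resolve(a_or_b, {"B": 1}, requested))      # ['B']
print(resolve(optional_b, {"A": 1}, requested))  # ['A', 'B']
print(resolve(optional_b, {"B": 1}, requested))  # None
```

With `min="0"` dataset B is always considered satisfied, so the `and` condition reduces to waiting for A alone and then taking whatever part of B has arrived.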
Data Not Guaranteed / Priority Among Dataset Instances
A will have higher precedence over B, and B will have higher precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
Oozie will start processing if available A instances are >= 10. Min can also be combined with wait (as shown for dataset B).
<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>
Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
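The gap-filling behaviour can be illustrated with a few lines of Python. This is a sketch of the idea only, not Oozie's code; the function name and the use of instance offsets such as -5 to -1 as keys are assumptions made for the example.

```python
def combine(instances_a, instances_b, wanted):
    """For each wanted instance, take it from A if present, otherwise from B.

    Returns (resolved, missing): resolved maps instance -> provider name,
    missing lists the instances that neither provider has yet.
    """
    resolved, missing = {}, []
    for instance in wanted:
        if instance in instances_a:
            resolved[instance] = "A"
        elif instance in instances_b:
            resolved[instance] = "B"
        else:
            missing.append(instance)
    return resolved, missing


wanted = [-5, -4, -3, -2, -1]   # instance offsets, as in current(-5) .. current(-1)
available_a = {-5, -4, -2}      # provider A is missing two instances
available_b = {-3, -2, -1}      # provider B can fill them
resolved, missing = combine(available_a, available_b, wanted)
print(resolved)  # {-5: 'A', -4: 'A', -3: 'B', -2: 'A', -1: 'B'}
print(missing)   # []
```

Provider A is always preferred when both have an instance, so B only contributes the instances that A lacks.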
Automated Onboarding / Collaboration Portal
Built for Tenant Transparency
Queue Utilization Dashboard
Data Discovery and Access
Audits, Compliance, and Efficiency
[Diagram: log sources feeding Starling and the Log Warehouse]
Log Sources:
• FS, Job, Task logs (Cluster 1, Cluster 2, ... Cluster n)
• CF, Region, Action, Query Stats (Cluster 1, Cluster 2, ... Cluster n)
• DB, Tbl., Part., Colmn. Access Stats (MS 1, MS 2, ... MS n)
• GDM: Data Defn., Flow, Feed, Source (F 1, F 2, ... F n)
Audits, Compliance, and Efficiency (cont’d)
Data Discovery and Access
Governance Classification: Public, Non-sensitive, Financial $, Restricted
Requirements: No addn. reqmt., LMS Integration, Stock Admin Integration, Approval Flow
Hosted UI – Hue as a Service
[Diagram: full-stack HA Hue deployment]
• Users reach a VIP in front of two hot instances, Hue-1.Cluster-1 and Hue-2.Cluster-1
• WSGI: serving pages and static content (jQuery, Bootstrap, Knockout.js, Love)
• SAML Auth. via IdP
• Hue MySQL DB (HA): cookies, saved queries, workflows etc.
• REST / Thrift to the Hadoop Cluster: HS2, HCat Meta, Oozie Server, YARN RM, WebHDFS, NMs
Going Forward
• Increased Intelligence
• Greater Speed
• Higher Efficiency
• Necessary Scale
Increased Intelligence
Applications (Solve Complex Problems): Click Prediction, Search Ranking, Keyword Auctions, Ad Relevance, Abuse Detection
ML Libraries (Proven to Work at Scale): GBDT, FTRL, SGD, Deep Learning, Random Forests
Parameter Server: Globally Shared Parameters
Compute Engines: Distributed Processing, …
YARN (Resource Manager)
Core Grid Enhancements: Heterogeneous Scheduling, Long-running Services, GPUs, Large Memory Support, …
Greater Speed
Productivity Dimensions and Improvements
Data Management
• Real-time Pipelines
• Unified Metadata & Lineage
• Fine-grained Access Control
• Self-serve Data Movement
Ease of Use
• SLA & Cost Transparency
• Intuitive UIs
• Planning & Collab. Tools
• Central Grid Portal
Analytics, BI & Reporting
• Query times < 1 sec
• 4x Speedups in ETL
• SQL on HBase
• Limitless BI Clients
Higher Efficiency
Achieve five 9’s availability and 70% average compute utilization across clusters
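To put the availability target in concrete terms, "five 9's" (99.999%) leaves only a few minutes of downtime per year. A quick calculation in Python, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability):
    """Minutes of downtime per year permitted by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


print(round(allowed_downtime_minutes(0.99999), 2))  # 5.26 minutes per year
print(round(allowed_downtime_minutes(0.999), 1))    # 525.6 minutes, for comparison
```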
Hadoop Users at Yahoo
• Slingstone & Aviate
• Mail Anti-Spam
• Gemini Campaign Mgmt.
• Search Assist
• Audience Analytics
• Flickr
• YAM+ & Targeting
• Membership Abuse
… and many more.
Yahoo at the Apache Open Source Foundation
[Project logos 1 to 8 were not captured in the transcript; counts are listed in transcript order]
• 10 Committers (6 PMC)
• 3 Committers (3 PMC)
• 3 Committers (2 PMC)
• 6 Committers (5 PMC)
• 1 Committer
• 3 Committers (2 PMC)
• 7 Committers (6 PMC)
• 1 Committer
Join Us @ yahoohadoop.tumblr.com
THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)

More Related Content

What's hot

Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesAshraf Uddin
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadhMithlesh Sadh
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyRohit Dubey
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data WarehousingEyad Manna
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 

What's hot (20)

Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Apache hive
Apache hiveApache hive
Apache hive
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Big Data
Big DataBig Data
Big Data
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Big data, Big decision
Big data, Big decisionBig data, Big decision
Big data, Big decision
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 

Viewers also liked

Process Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at ExpediaProcess Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at Expediahuguk
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaSkillspeed
 
Pinterest hadoop summit_talk
Pinterest hadoop summit_talkPinterest hadoop summit_talk
Pinterest hadoop summit_talkKrishna Gade
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfutureIT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfutureYahoo!デベロッパーネットワーク
 
ユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイントユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイントRecruit Technologies
 
What i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaWhat i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaRakuten Group, Inc.
 
Rakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakuten Group, Inc.
 
Kafka Connect(Japanese)
Kafka Connect(Japanese)Kafka Connect(Japanese)
Kafka Connect(Japanese)Roman Shtykh
 
ビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分けビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分けTetsutaro Watanabe
 
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-Recruit Technologies
 
新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場Recruit Technologies
 
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...Recruit Technologies
 

Viewers also liked (20)

Process Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at ExpediaProcess Scheduling on Hadoop at Expedia
Process Scheduling on Hadoop at Expedia
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Pinterest hadoop summit_talk
Pinterest hadoop summit_talkPinterest hadoop summit_talk
Pinterest hadoop summit_talk
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfutureIT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
IT業界のリーディングカンパニーとして描く「少し先の未来」〜Yahoo! JAPANの事例を通して〜#a11yfuture
 
ユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイントユーザー企業内製CSIRTにおける対応のポイント
ユーザー企業内製CSIRTにおける対応のポイント
 
What i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaWhat i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawa
 
Rakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichi
 
Yahoo! JAPANのデータ基盤とHadoop #dbts2016
Yahoo! JAPANのデータ基盤とHadoop #dbts2016Yahoo! JAPANのデータ基盤とHadoop #dbts2016
Yahoo! JAPANのデータ基盤とHadoop #dbts2016
 
Yahoo! JAPANを支えるビッグデータプラットフォーム技術
Yahoo! JAPANを支えるビッグデータプラットフォーム技術Yahoo! JAPANを支えるビッグデータプラットフォーム技術
Yahoo! JAPANを支えるビッグデータプラットフォーム技術
 
Prestoクエリログの保存/分析機能の構築 #yjdsnight
Prestoクエリログの保存/分析機能の構築 #yjdsnightPrestoクエリログの保存/分析機能の構築 #yjdsnight
Prestoクエリログの保存/分析機能の構築 #yjdsnight
 
Yahoo! JAPANにおけるオンライン機械学習実例 #streamctjp
Yahoo! JAPANにおけるオンライン機械学習実例 #streamctjpYahoo! JAPANにおけるオンライン機械学習実例 #streamctjp
Yahoo! JAPANにおけるオンライン機械学習実例 #streamctjp
 
Kafka Connect(Japanese)
Kafka Connect(Japanese)Kafka Connect(Japanese)
Kafka Connect(Japanese)
 
ビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分けビックデータ処理技術の全体像とリクルートでの使い分け
ビックデータ処理技術の全体像とリクルートでの使い分け
 
Apache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreading
Apache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreadingApache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreading
Apache Big Data Miami 2017 - Hadoop Source Code Reading #23 #hadoopreading
 
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
Struggling with BIGDATA -リクルートおけるデータサイエンス/エンジニアリング-
 
新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場新卒2年目が鍛えられたコードレビュー道場
新卒2年目が鍛えられたコードレビュー道場
 
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
Company Recommendation for New Graduates via Implicit Feedback Multiple Matri...
 

Similar to Hadoop Platform at Yahoo: A Year in Review

Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120Hyoungjun Kim
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and dockerBob Ward
 
Hadoop Platform at Yahoo: A Year in Review

  • 1. Hadoop Platform at Yahoo: A Year in Review. Sumeet Singh (@sumeetksingh), Sr. Director, Cloud and Big Data Platforms
  • 2. Agenda: (1) Platform Overview; (2) Infrastructure and Metrics; (3) CaffeOnSpark for Distributed DL; (4) Compute and Sketches; (5) HBase and Omid; (6) Oozie; (7) Ease of Use; (8) Q&A
  • 3. Platform Evolution. [Chart: number of servers (axis 0 to 50,000) and raw HDFS storage in PB (axis 0 to 800) by year, 2006 to 2016.] Milestones, in the order shown: Yahoo! commits to scaling Hadoop for production use; research workloads in search and advertising; production (modeling) with machine learning and WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; next-gen Hadoop (H 0.23 YARN); new services (HBase, Storm, Spark, Hive); increased user base with partitioned namespaces; Apache H2.7 (scalable ML, latency, utilization, productivity).
  • 4. Deployment Models. On-premise: private (dedicated) clusters (large, demanding use cases; new technology not yet platformized; data movement and regulation issues) and hosted multi-tenant (private cloud) clusters (source of truth for all of the org's data; app delivery agility; operational efficiency and cost savings through economies of scale). Public cloud: hosted compute clusters (when more cost-effective than on-premise; time to market/results matter; data already in public cloud) and purpose-built big data clusters (for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools).
  • 5. Platform Today. Services: ZK, DBMS, MON, SSHOP, LOG WH, TOOLS. Compute: YARN, CS, MR, Tez, Spark, Storm. Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH. Tools: Pig, Hive, Oozie, Hue, GDM, Big ML. Legend: Apache / open source projects; Yahoo projects.
  • 6. Technology Stack Assembly. Services: ZK, DBMS, MON, SSHOP, LOG WH, TOOLS. Compute: YARN, CS, MR, Tez, Spark, Storm. Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH. Tools: Pig, Hive, Oozie, Hue, GDM, Big ML. Base: HDFS (file system), YARN (scheduling, resource management), Common, RHEL6 64-bit, JDK8. Legend: Apache projects; Yahoo projects; platformized tech with production support; in-progress, unmet needs, or Apache alignment.
  • 7. Common Backplane. [Diagram labels:] Hadoop (NameNode, RM, DataNode, NodeManager); HBase (NameNode, HBase Master, DataNodes, RegionServers); Storm (Nimbus, Supervisor); administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; applications and data; data feeds; data stores; Oozie server; HS2/HCat; network backplane.
  • 8. Research Cluster Consolidation (one-month sample, 2015). [Charts: compute total and used (TB) per cluster.] Cluster 1 (2,000 servers): HDFS 12 PB, compute 23 TB, avg. utilization 26%. Cluster 2 (3,100 servers): HDFS 21 PB, compute 52 TB, avg. utilization 40%. Cluster 3 (5,400 servers): HDFS 36 PB, compute 70 TB, avg. utilization 59%.
  • 9. Consolidated Research Cluster Characteristics (one-month sample, 2016). [Chart: compute total and used (TB).] Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. utilization 70%. Before: 10,500 servers; after: 2,200 servers. 40% decrease in TCO; 65% increase in compute capacity; 50% increase in avg. utilization.
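The headline percentages on slide 9 can be cross-checked against the per-cluster figures on slide 8. The sketch below is only an arithmetic check: it assumes the "before" average utilization is a compute-weighted mean of the three clusters, which the slides do not state, and all variable names are illustrative.

```python
# Per-cluster figures from slide 8: (servers, HDFS PB, compute TB, avg. utilization)
clusters_2015 = [
    (2000, 12, 23, 0.26),
    (3100, 21, 52, 0.40),
    (5400, 36, 70, 0.59),
]

# Consolidated figures from slide 9
compute_after_tb = 240
util_after = 0.70

servers_before = sum(c[0] for c in clusters_2015)
compute_before_tb = sum(c[2] for c in clusters_2015)

# Assumption: "before" utilization is weighted by each cluster's compute capacity.
util_before = sum(c[2] * c[3] for c in clusters_2015) / compute_before_tb

capacity_increase = compute_after_tb / compute_before_tb - 1
util_increase = util_after / util_before - 1

print(f"servers before: {servers_before}")
print(f"compute before: {compute_before_tb} TB")
print(f"capacity increase: {capacity_increase:.1%}")
print(f"weighted utilization before: {util_before:.1%}")
print(f"utilization increase: {util_increase:.1%}")
```

Under that assumption the computed values land close to the 65% and 50% quoted on the slide, and the server total matches the 10,500 shown as "before".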
  • 10. Common Hadoop Cluster Configuration. [Diagram: racks 1 through N of CPU servers with JBODs and 10GbE, connected by a network backplane.]
  • 11. New Hadoop Cluster Configuration. [Diagram: racks 1 through N of CPU servers with JBODs and 10GbE on a network backplane, with 100 Gbps InfiniBand, GPU servers, and hi-mem servers added.]
  • 12. YARN Node Labels. [Diagram: a Hadoop cluster whose nodes carry label x or label y; jobs J1 to J4 are submitted to three queues.] Queue 1: 40% capacity, label x. Queue 2: 40% capacity, labels x and y. Queue 3: 20% capacity. yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name> sets the labels a queue may access; yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label requested by the queue.
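As a rough illustration of the queue-to-label relationship on slide 12, the sketch below models which nodes a request could land on. It is a deliberate simplification and not how the YARN scheduler is implemented; the node names are hypothetical, and only the queue capacities and accessible labels come from the slide.

```python
# Queue settings as listed on the slide: capacity share and accessible labels.
QUEUES = {
    "queue1": {"capacity": 40, "labels": {"x"}},
    "queue2": {"capacity": 40, "labels": {"x", "y"}},
    "queue3": {"capacity": 20, "labels": set()},
}

# Hypothetical nodes, each carrying exactly one label.
NODES = {"node1": "x", "node2": "x", "node3": "y", "node4": "y"}


def eligible_nodes(queue, requested_label):
    """Return the nodes a request may use, given the queue's accessible labels."""
    if requested_label not in QUEUES[queue]["labels"]:
        return []  # the queue has no access to this label
    return sorted(n for n, label in NODES.items() if label == requested_label)


print(eligible_nodes("queue1", "x"))
print(eligible_nodes("queue1", "y"))
print(eligible_nodes("queue2", "y"))
```

A queue without access to a label simply gets no nodes for that label in this model, which mirrors the role of accessible-node-labels in the configuration shown above.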
  • 14. CaffeOnSpark – Distributed Deep Learning. [Stack diagram, top to bottom:] CaffeOnSpark for DL, MLlib for non-DL, Hive or SparkSQL; Spark; YARN (RM and scheduling); HDFS (datasets).
  • 15. Few Use Cases – Yahoo Weather
  • 16. Few Use Cases – Flickr Facial Recognition
  • 17. Few Use Cases – Flickr Scene Detection
  • 18. CaffeOnSpark Architecture – Common Cluster. [Diagram: a Spark driver and three Spark executors (for data feeding and control). Each executor runs Caffe (enhanced with multi-GPU/CPU) with a model synchronizer (across nodes) and reads datasets from HDFS. The synchronizers communicate via MPI on RDMA / TCP; model output is written to HDFS.]
  • 19. CaffeOnSpark Architecture – Incremental Learning. Feature engineering (deep learning):
        val cos = new CaffeOnSpark(ctx)
        val conf = new Config(ctx, args).init()
        val dl_train_source = DataSource.getSource(conf, true)
        cos.train(dl_train_source)                      // train the DL model
        val lr_raw_source = DataSource.getSource(conf, false)
        val ext_df = cos.features(lr_raw_source)        // extract features via DL
  • 20. CaffeOnSpark Architecture – Incremental Learning. Feature engineering (deep learning), then training classifiers (non-deep learning):
        // Feature engineering: deep learning
        val cos = new CaffeOnSpark(ctx)
        val conf = new Config(ctx, args).init()
        val dl_train_source = DataSource.getSource(conf, true)
        cos.train(dl_train_source)                      // train the DL model
        val lr_raw_source = DataSource.getSource(conf, false)
        val ext_df = cos.features(lr_raw_source)        // extract features via DL
        // Train classifiers: non-deep learning
        val lr_input_df = ext_df
          .withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
          .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
        val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
        val lr_model = lr.fit(lr_input_df)
        …
  • 21. CaffeOnSpark Architecture – Single Command (<num_executors> and <num_gpus> are placeholders for the executor and GPU counts):
        spark-submit --num-executors <num_executors> --class CaffeOnSpark my-caffe-on-spark.jar \
            -devices <num_gpus> -model dl_model_file -output lr_model_file
  • 22. CaffeOnSpark (github.com/yahoo/caffeonspark): distributed deep learning; Apache License; existing clusters; powerful DL platform; fully distributed; high-level API; incremental learning.
  • 24. Hadoop Compute Sources. [Stack diagram labels:] HDFS (file system and storage); YARN (resource management and scheduling); Tez (execution engine for Pig and Hive); Spark (alternate exec engine); MapReduce (legacy); Pig (scripting); Hive (SQL); Java MR APIs; data processing; ML; custom app on Slider; Oozie; data management.
  • 26. Pushing Batch Compute Boundaries. [Chart: % of total compute (memory-sec) by engine over Q1 2016; series: MapReduce, Tez, Spark. Data labels: Jan 78%, Mar 67%; Mar 21%, 12%; Jan 8%, 14%.] 112 million batch jobs in Q1'16.
  • 28. Recent Apache Storm Developments at Yahoo: MT & RA scheduler; dist. cache API; 8x throughput; improved debuggability; Pacemaker server; streaming benchmark (github.com/yahoo/streaming-benchmarks).
  • 29. Data Sketches Algorithms Library (datasketches.github.io). Good-enough approximate answers for problem queries. Characteristics: streamable; approximate with predictable error; sub-linear in size; mergeable / additive; highly parallelizable; Maven deployable.
  • 30. Distinct Count Sketch, High-level View. Basic sketch elements: big data stream → transform (to white noise) → data structure → estimator → result ± ε.
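The four sketch elements on slide 30 can be illustrated with a small K-minimum-values (KMV) distinct-count sketch. This is a teaching sketch under stated assumptions, not the Data Sketches library's implementation: the class name `KMVSketch`, the choice of SHA-1 as the transform, and the default k are all hypothetical choices made here.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; estimates the distinct count from them."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # same hashes, for de-duplication

    @staticmethod
    def _hash(item):
        # Transform: map each item to a pseudo-random point in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _add_hash(self, h):
        # Data structure: retain only the k smallest distinct hash values.
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._add_hash(self._hash(item))

    def merge(self, other):
        # Mergeable: the k smallest of the union come from the two retained sets.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        # Estimator: exact below k items, otherwise (k - 1) / (k-th smallest hash).
        if len(self._heap) < self.k:
            return float(len(self._heap))
        return (self.k - 1) / (-self._heap[0])


# Two overlapping streams, sketched separately and then merged.
a, b = KMVSketch(), KMVSketch()
for i in range(60000):
    a.update(i)
for i in range(40000, 100000):
    b.update(i)
print(round(a.merge(b).estimate()))
```

The sketch holds at most k values regardless of stream length (sub-linear in size), and merging two sketches built with the same k gives the same result as sketching the combined stream, which is what makes the approach parallelizable.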
  • 33. Apache HBase at Yahoo: security; isolated deployment; multi-tenant; region server group; namespace; unsupported features. [Diagram: a compute cluster (JobTracker, NameNode, and TaskTracker/DataNode nodes running M/R tasks with HBase clients) and a separate HBase cluster (NameNode, HBase Master, ZooKeeper quorum, and RegionServer/DataNode nodes), accessed through a gateway/launcher (HBase client, MR client) and a REST proxy (HTTP client).]
  • 34. Security. Authentication: Kerberos (users, processes); delegation token (MapReduce, YARN, etc.). Authorization: HBase ACLs (Read, Write, Create, Admin); grant permissions to a user or Unix group; ACL for table, column family, or column.
  • 35. Region Server Groups: dedicated region servers for a set of tables; resource isolation (CPU, memory, IO, etc.). [Diagram:] RegionServer group Foo: region servers 1...5 hosting TableA to TableF. RegionServer group Bar: region servers 6…10 hosting Table1 to Table6.
  • 36. Namespaces: analogous to a "database"; namespace ACL to create tables; default group; quota (tables, regions). [Diagram: a namespace with its group, tables, quota, and ACL.]
  • 37. Split Meta to Spread Load and Avoid Large Regions
  • 38. Favored Nodes for HDFS Locality
  • 40. Scaling HBase to Handle Millions of Regions on a Cluster: region server groups; split meta; split ZK; favored nodes; humongous tables.
  • 41. Transactions on HBase with Omid (Omid means "hope" in Persian). Highly performant and fault-tolerant ACID transactional framework. New Apache Incubator project: incubator.apache.org/projects/omid.html. Handles millions of transactions per day for search and personalization products.
  • 45. Oozie Data Pipelines. [Diagram: Oozie, message bus, HCatalog, data producer, HDFS.] The data producer produces data (distcp, Pig, M/R..) into HDFS at /data/click/2014/06/02 and updates metadata: ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'. Numbered steps: 1. Query/poll partition. 2. Register topic. 3. Push notification <New Partition>. 4. Notify new partition. Oozie then starts the workflow.
  • 46. Large Scale Data Pipeline Requirements. Administrative: one should be able to start, stop, and pause all related pipelines at the same time. Dependency management: the output of a coordinator "n+1" action depends on the coordinator "n" action (dataset dependency); if a dataset has a BCP instance, the workflow should run with either, whichever arrives first; start as soon as mandatory data is available, since other feeds are optional; data is not guaranteed, so start processing even if only partial data is available. SLA management: monitor pipeline processing to take immediate action in case of failures or SLA misses; pipeline owners should be notified if an SLA is missed. Multiple providers: if data is available from multiple providers, the provider priority should be specifiable; combine datasets from multiple providers to fill the gaps a single provider may have.
  • 48. BCP And Mandatory / Optional Feeds. Pull data from A or B: specify the dataset as AorB. The action will start running as soon as either dataset A or B is available. <input-logic> <or name="AorB"> <data-in dataset="A" wait="10"/> <data-in dataset="B"/> </or> </input-logic> Dataset B is optional: Oozie will start processing as soon as A is available. It will include the dataset from A and whatever is available from B. <input-logic> <and name="optional"> <data-in dataset="A"/> <data-in dataset="B" min="0"/> </and> </input-logic> 48
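The mandatory/optional behaviour described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not Oozie code: the function name `inputs_for_run` and the dictionary of available instances are hypothetical, and the sketch assumes a mandatory feed counts as "available" once it has at least one instance.

```python
from typing import Dict, List, Optional


def inputs_for_run(
    available: Dict[str, List[str]],
    mandatory: List[str],
    optional: List[str],
) -> Optional[List[str]]:
    """Return the instances to process, or None if the run must keep waiting.

    Hypothetical helper: models min="0" on an optional feed by never
    blocking on it and simply taking whatever has already arrived.
    """
    # Every mandatory dataset needs at least one available instance.
    if any(not available.get(name) for name in mandatory):
        return None

    chosen: List[str] = []
    # Mandatory feeds first, then whatever the optional feeds have so far.
    for name in mandatory + optional:
        chosen.extend(available.get(name, []))
    return chosen
```

With `A` mandatory and `B` optional, a call such as `inputs_for_run({"A": ["a1"], "B": []}, ["A"], ["B"])` starts with only the instances from `A`, while an empty `A` keeps the run waiting.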
  • 49. Data Not Guaranteed / Priority Among Dataset Instances. A will have higher precedence over B, and B will have higher precedence over C. <input-logic> <or name="AorBorC"> <data-in dataset="A"/> <data-in dataset="B"/> <data-in dataset="C"/> </or> </input-logic> Oozie will start processing if the available A instances are >= 10. Min can also be combined with wait (as shown for dataset B). <input-logic> <data-in dataset="A" min="10"/> <data-in dataset="B" min="10" wait="20"/> </input-logic> 49
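A rough Python model of the `min`/`wait` rule and the A-over-B-over-C precedence may help make the semantics concrete. It is a sketch under stated assumptions, not Oozie's implementation: `is_ready` and `pick_by_priority` are hypothetical names, `wait` and `elapsed` are taken to be in the same time unit, and `wait` is only modelled in combination with `min` (a `wait` attribute on its own is not covered here).

```python
from typing import Dict, List, Optional


def is_ready(
    have: int,
    expected: int,
    minimum: Optional[int] = None,
    wait: int = 0,
    elapsed: int = 0,
) -> bool:
    """Decide whether a dataset dependency is satisfied.

    have     -- instances currently available
    expected -- instances the coordinator asked for
    minimum  -- the `min` attribute, or None if partial data is not allowed
    wait     -- the `wait` attribute; elapsed is time waited so far
    """
    # All requested instances are present: always ready.
    if have >= expected:
        return True
    # No `min` given: partial data is never enough.
    if minimum is None:
        return False
    # `min` is met: still hold off until the wait time has elapsed.
    return have >= minimum and elapsed >= wait


def pick_by_priority(order: List[str], ready: Dict[str, bool]) -> Optional[str]:
    """Return the first ready dataset in precedence order, or None."""
    for name in order:
        if ready.get(name, False):
            return name
    return None
```

For example, `is_ready(10, 24, minimum=10, wait=20, elapsed=5)` is false because the wait window is still open, and `pick_by_priority(["A", "B", "C"], {"A": False, "B": True, "C": True})` selects `B`.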
  • 50. Combining Datasets From Multiple Providers. The combine function will first check instances from A and then go to B for whatever is missing in A. <data-in name="A" dataset="dataset_A"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:latest(-1)}</end-instance> </data-in> <data-in name="B" dataset="dataset_B"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(-1)}</end-instance> </data-in> <input-logic> <combine name="AB"> <data-in dataset="A"/> <data-in dataset="B"/> </combine> </input-logic> 50
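The gap-filling idea behind combine can be illustrated with a small Python sketch. It is not how Oozie implements it: the function `combine`, the integer instance offsets, and the provider tuples are all hypothetical, and the sketch only reports which provider would supply each instance without deciding whether the action should start.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def combine(
    instances: Iterable[int],
    providers: List[Tuple[str, Set[int]]],
) -> Dict[int, Optional[str]]:
    """Map each requested instance to the first provider that has it.

    providers is ordered by preference, e.g. [("A", {...}), ("B", {...})].
    Instances no provider can supply map to None.
    """
    resolved: Dict[int, Optional[str]] = {}
    for instance in instances:
        # Check A first, then fall through to B (and so on) for gaps.
        resolved[instance] = next(
            (name for name, have in providers if instance in have),
            None,
        )
    return resolved
```

Calling `combine(range(-5, 0), [("A", {-5, -4, -2}), ("B", {-3, -2, -1})])` takes instances -5, -4, and -2 from `A` and fills -3 and -1 from `B`.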
  • 52. Automated Onboarding / Collaboration Portal 52
  • 53. Built for Tenant Transparency 53
  • 55. Data Discovery and Access 55
  • 56. Audits, Compliance, and Efficiency Starling FS, Job, Task logs Cluster 1 Cluster 2 Cluster n... CF, Region, Action, Query Stats Cluster 1 Cluster 2 Cluster n... DB, Tbl., Part., Colmn. Access Stats ...MS 1 MS 2 MS n GDM Data Defn., Flow, Feed, Source F 1 F 2 F n Log Warehouse Log Sources 56
  • 57. Audits, Compliance, and Efficiency (cont’d) Data Discovery and Access Public Non-sensitive Financial $ Governance Classification No addn. reqmt. LMS Integration Stock Admin Integration Approval Flow Restricted 57
  • 58. Hosted UI – Hue as a Service WSGI Hue-1.Cluster-1 (Hot) VIP Users HS2 Hue MySQL DB (HA) Hadoop Cluster HCat Meta Oozie Server YARN RM Web HDFS NMs WSGI Hue-2.Cluster-1 (hot) HS2 IdP SAML Auth. Serving pages and static content Cookies, saved queries, workflows etc. Full Stack HA REST / Thrift (jQuery, Bootstrap, Knockout.js, Love) 58
  • 60. Increased Intelligence GBDT FTRL SGD Deep Learning Random Forests ML Libraries Click Prediction Search Ranking Keyword Auctions Ad Relevance Abuse Detection Applications Proven to Work at Scale Solve Complex Problems YARN (Resource Manager) Heterogeneous Scheduling Long-running Services GPUs Large Memory Support Core Grid Enhancements … Parameter Server Globally Shared Parameters Compute Engines Distributed Processing … 60
  • 61. Greater Speed Data Management Ease of Use Productivity Dimensions Real-time Pipelines Unified Metadata & Lineage Fine-grained Access Control Self-serve Data Movement SLA & Cost Transparency Intuitive UIs Planning & Collab. Tools Central Grid Portal Improvements Query times < 1 sec 4x Speedups in ETL SQL on HBase Limitless BI Clients Analytics, BI & Reporting 61
  • 62. Higher Efficiency Achieve five 9’s availability and 70% average compute utilization across clusters 62
  • 63. Hadoop Users at Yahoo Slingstone & Aviate Mail Anti-Spam Gemini Campaign Mgmt. Search Assist Audience Analytics Flickr YAM+ & Targeting Membership Abuse … and many more. 63
  • 64. Yahoo at the Apache Open Source Foundation 10 Committers (6 PMC) 3 Committers (3 PMC) 3 Committers (2 PMC) 6 Committers (5 PMC) 1 Committer 3 Committers (2 PMC) 7 Committers (6 PMC) 1 Committer 64
  • 65. Join Us @ yahoohadoop.tumblr.com 65
  • 66. THANK YOU SUMEET SINGH (@sumeetksingh) Sr. Director, Cloud and Big Data Platforms Icon Courtesy – iconfinder.com (under Creative Commons)

Editor's Notes

  1. JIRA 1976 (Oozie 4.3)
  2. While $coord:latest allows skipping to the available instances, the workflow will never trigger unless the specified minimum number of instances is found. Min can also be combined with wait: if not all dependencies are met but the MIN dependencies are, Oozie keeps waiting for more instances until the wait time elapses or all data dependencies are met.
  3. (30 secs) T: 2 min 30 secs xyz
  4. Protocols: REST – use python-requests and a custom client to streamline RESTful interface calls; Thrift – custom connection pooling and socket multiplexing to streamline Thrift calls. Accessibility: Middleware – make Hadoop interfaces accessible in request objects. Hue uses the CherryPy web server. You can use the following options to change the IP address and port that the web server listens on. The default setting is port 8888 on all configured IP addresses. If you don’t specify a secret key, your session cookies will not be secure. Hue will run, but it will also display error messages telling you to set the secret key. You can configure Hue to serve over HTTPS. To do so, you must install "pyOpenSSL" within Hue’s context and configure your keys.