Oozie towards zero downtime
Hadoop Summit 2015/04/15
Purshotam Shah purushah@yahoo-inc.com
Ryota Egashira egashira@yahoo-inc.com
Agenda
● Introduction
■ Scale at Yahoo
■ Use Cases
■ Why zero downtime matters
● Architectural Overview
● Technical Challenge
■ Security
■ Log Streaming
■ HCatalog Integration in HA
● Experiences
● Future Work
2 Yahoo Confidential & Proprietary
Why Oozie?
The Problem
▪ Doing something on the grid often required multiple steps
  ▪ MapReduce job
  ▪ Pig job
  ▪ Streaming job
  ▪ HDFS operation (mkdir, chmod, etc)…
▪ Multiple ad-hoc solutions existed
  ▪ custom job control
  ▪ cron…
The Need
▪ Workflow scheduler with better support for grid jobs (native integration with Hadoop)
  ▪ orchestrate dependency between jobs
  ▪ execute at specific time or on data availability
  ▪ retry jobs in the event of failures (reliable)
▪ Common framework for communication and execution of production process
  ▪ sync (clocked dataset) awareness
  ▪ async (unspecified freq) data awareness
A server-based workflow scheduling system to manage Hadoop jobs
Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2,000+ projects
255 monthly users
28.9 million Hadoop Jobs monthly (Jan 2015, total)
72% from Oozie (including launcher jobs)
108,000 workflow jobs daily (Feb 2015, one busy cluster)
Between 1 and 8 actions per workflow (avg. 4 actions/workflow)
Extreme use case: 100-200 workflow jobs submitted per minute
1,700 coordinator jobs daily (Feb 2015, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
67% of workflow jobs kicked off by coordinators
60 bundle jobs daily (Feb 2015, one busy cluster)
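To put the daily figures above on a per-minute scale, here is a rough back-of-the-envelope calculation. It assumes the load is spread evenly over the day (which the later slides show it is not) and that the average of 4 actions per workflow applies to the same busy cluster; the variable names are illustrative only.

```python
# Figures taken from the slide (Feb 2015, one busy cluster).
WORKFLOWS_PER_DAY = 108_000
AVG_ACTIONS_PER_WORKFLOW = 4
MINUTES_PER_DAY = 24 * 60

# Assumption: load is uniform over the day, so these are averages only.
actions_per_day = WORKFLOWS_PER_DAY * AVG_ACTIONS_PER_WORKFLOW
workflows_per_minute = WORKFLOWS_PER_DAY / MINUTES_PER_DAY
actions_per_minute = actions_per_day / MINUTES_PER_DAY

print(actions_per_day, workflows_per_minute, actions_per_minute)
```

Under these assumptions the average is about 75 workflow submissions per minute, so the quoted extreme of 100-200 per minute is roughly 1.3 to 2.7 times the daily average.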
Hadoop Jobs on the Platform
[Chart: job distribution (Jan 2015)]
Y! business processed by Oozie
Ad Exchange
Ad Latency
Search Advertising
Content Agility
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Y! business processed by Oozie
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Event Pipeline
Use Case - Data pipeline
Number of actions created hourly
[Chart: actions created per hour (x-axis: time of day)]
Number of actions created per minute
SCALE
▪ At certain points in time, the 5 min, 15 min, 30 min, hourly, daily, and monthly coordinator jobs all fire together, producing a burst of coordinator actions that a single host can't handle.
▪ We noticed processing delays, and customers complained about slowness.
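The collision described above can be illustrated with a small count of how many coordinator actions fall on each minute of a day. This is only a sketch: the job counts per frequency are hypothetical (100 each), every job is assumed to start at minute 0, and monthly jobs are left out because they do not have a fixed minute interval.

```python
from collections import Counter

# Frequencies in minutes: 5 min, 15 min, 30 min, hourly, daily.
FREQUENCIES_MIN = [5, 15, 30, 60, 1440]


def actions_per_minute(jobs_per_frequency, horizon_min=1440):
    """Count coordinator actions materialized at each minute of the horizon."""
    counts = Counter()
    for freq, n_jobs in jobs_per_frequency.items():
        # Assumption: every job of this frequency fires at minute 0, freq, 2*freq, ...
        for minute in range(0, horizon_min, freq):
            counts[minute] += n_jobs
    return counts


# Hypothetical workload: 100 coordinator jobs at each frequency.
jobs = {freq: 100 for freq in FREQUENCIES_MIN}
counts = actions_per_minute(jobs)

for minute in (0, 5, 15, 30, 60):
    print(f"minute {minute:4d}: {counts[minute]} actions")
```

With this toy workload, an ordinary 5-minute tick creates 100 actions, the top of an hour creates 400, and midnight creates 500, all landing on one server in the non-HA setup.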
Why Downtime matters? Downtime needed
Oozie upgrade (major release: > 1 per quarter; minor release: > 1 per month)
Why Downtime matters? Downtime needed
Dependent Hadoop projects upgrade (YARN, HDFS, Hive, HBase, etc.)
[Diagram: Oozie and the projects it depends on: YARN, HDFS, Hive, HBase, Pig, HCatalog]
Why Downtime matters? Downtime needed
Configuration error / change
Why Downtime matters? Downtime needed
Hardware error / upgrade
Why Downtime matters? Customers
Revenue-impacting applications need to run all the time, with no delay!
Why Downtime matters? Ops
Ops: under pressure to minimize downtime
Solution: High Availability
● Definition: failure of a component != failure of the entire system
o achieved by removing single points of failure
● Requirement: transparency to users
o Users should not need to know whether Oozie is running in HA or not
o No change in API or usage pattern
Architecture
[Diagram labels: Load Balancer (submit request, request redirection); Oozie Server 1 … Oozie Server n (inter-server communication); RDB; Hadoop Cluster; ZooKeeper (Curator)]
Architectural Overview: Database
● Oozie stores most of its state in a database
o (submitted jobs, workflow definitions, etc.)
● An Oracle database in HA (2 racks, hot-warm) is used.
● ZooKeeper (via Curator) is used for coordination.
Architectural Overview: Access
● Users and client programs need a single address to connect to
o Web UI, REST/Java API, JobTracker/ResourceManager callbacks, etc.
● A virtual IP (VIP) is used as the user-facing URL.
Architectural Overview: Security
● We use Kerberos and some internal security systems for communication among components.
Architectural Overview: Authentication
[Diagram: the HA architecture (Load Balancer, Oozie Server 1 … Oozie Server n, RDB, Hadoop Cluster, ZooKeeper/Curator) annotated with the security used on each link. Labels: "Security: https + Kerberos / cookie-based auth" (twice); "Security: https + Kerberos"; "Security: Kerberos" (twice); "Inter-server communication for log streaming etc."; "ZooKeeper for lock and management"]
Technical Challenge: Log Streaming
● Each Oozie server only has access to its own logs
● Jobs can execute on any server
o Job execution can switch among servers
● Users need to see one sequential log rather than separate server1 and server2 logs.
Architecture: Log Streaming
1. User request comes to server1
2. Server1 fetches the server list from ZK and calls the other servers to fetch logs
3. Server1 merges the logs from all servers using the log timestamp
4. The merged log is displayed to the user
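Step 3, merging by log timestamp, can be sketched as a k-way merge over per-server log streams. This is a minimal illustration, not Oozie's implementation: it assumes each server returns its lines already in time order and that every line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp; `parse_ts` and `merge_logs` are hypothetical helper names.

```python
import heapq
from datetime import datetime


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def merge_logs(*per_server_logs):
    """Merge time-ordered log streams from several servers into one stream."""
    # heapq.merge assumes each input is already sorted by the key.
    return list(heapq.merge(*per_server_logs, key=parse_ts))


# Hypothetical log fragments for one job that moved between two servers.
server1 = [
    "2015-04-15 10:00:01,120 INFO action started (server1)",
    "2015-04-15 10:00:07,450 INFO action resumed (server1)",
]
server2 = [
    "2015-04-15 10:00:03,300 INFO callback received (server2)",
    "2015-04-15 10:00:09,010 INFO action finished (server2)",
]

merged = merge_logs(server1, server2)
for line in merged:
    print(line)
```

A streaming merge like this avoids loading and sorting all logs at once, which matters when logs are fetched from several servers on demand.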
Caveat: Log Streaming
● If an Oozie server goes down, any logs from it will be unavailable
Technical Challenge: HCatalog Integration
• Hive Metastore (HCatalog): manages metadata for datasets
– Oozie registers with HCatalog for the datasets it depends on
– Oozie receives notifications from HCatalog through JMS (e.g., ActiveMQ)
– Oozie starts the job immediately after the data becomes ready
[Diagram: Oozie Server 1 and Oozie Server 2 register a topic; a new partition triggers a push notification to JMS (e.g., ActiveMQ), which notifies Oozie of the new partition, and the job starts]
Technical Challenge: Hive Metastore Integration
• Oozie maintains an in-memory list of datasets that need notification.
• A notification goes to only one server.
• When a notification arrives at one server, Oozie needs to invalidate the cache on all other servers.
• This is done by a periodic task on each server that checks the status of the job waiting on each dataset; if the job is no longer waiting, the task removes the dataset from the cache.
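The periodic cleanup described above can be sketched as follows. This is an illustration under assumptions, not Oozie's code: the cache is modeled as a plain dict from dataset name to the ID of the job waiting on it, the job status lookup is a callable standing in for a database query, and `purge_stale_registrations` is a hypothetical name.

```python
def purge_stale_registrations(cache, job_status):
    """Remove cached datasets whose dependent job is no longer waiting.

    cache: dict mapping dataset name -> ID of the job waiting on it
    job_status: callable returning the current status string for a job ID
                (assumed to read the shared database)
    """
    stale = [ds for ds, job_id in cache.items() if job_status(job_id) != "WAITING"]
    for dataset in stale:
        del cache[dataset]
    return stale


# Hypothetical state: another server already handled job-2 and job-3.
statuses = {"job-1": "WAITING", "job-2": "RUNNING", "job-3": "SUCCEEDED"}
cache = {
    "db/table/dt=1": "job-1",
    "db/table/dt=2": "job-2",
    "db/table/dt=3": "job-3",
}

removed = purge_stale_registrations(cache, statuses.get)
print(removed, cache)
```

On a real server this function would be invoked on a timer; because it relies only on shared job status, no server-to-server invalidation message is needed.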
Technical Challenge: Hive Metastore Integration
[Diagram: 2. Register Topic; 3. Push notification <New Partition> to JMS (e.g., ActiveMQ); 4. Notify New Partition to Oozie Server 1 / Oozie Server 2, which runs the job; a periodic check removes the registration]
Challenges
● Distributed Job ID
o Maintain a distributed sequence number for job IDs using Apache Curator + ZooKeeper
● ZooKeeper Failure Handling
o Oozie servers automatically shut down when ZooKeeper is down
● Sharelib
o Support sharelib update in HA
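To show why the job ID sequence has to be shared, here is a small simulation in which several "servers" (threads) draw from one counter. It is a stand-in only: the `SharedSequence` class replaces the ZooKeeper-backed counter with an in-process lock, and the ID format produced by `make_job_id` is illustrative rather than Oozie's actual format.

```python
import threading


class SharedSequence:
    """In-process stand-in for a sequence number kept in ZooKeeper."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next_value(self):
        # The lock plays the role of the atomic increment that the
        # coordination service would provide across servers.
        with self._lock:
            self._value += 1
            return self._value


def make_job_id(sequence, server_tag):
    """Build a job ID from the shared sequence (illustrative format)."""
    return f"{sequence.next_value():07d}-{server_tag}-W"


def submit_many(sequence, server_tag, count, out):
    """Simulate one server accepting `count` workflow submissions."""
    for _ in range(count):
        out.append(make_job_id(sequence, server_tag))


sequence = SharedSequence()
ids = []
threads = [
    threading.Thread(target=submit_many, args=(sequence, f"server{i}", 250, ids))
    for i in range(1, 5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

sequence_numbers = {job_id.split("-")[0] for job_id in ids}
print(len(ids), len(sequence_numbers))
```

If each server kept its own local counter instead, two servers would both issue sequence number 1, which is the collision the distributed sequence avoids.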
More Challenges
• SLA support
– Oozie has an in-memory data structure to track SLA status for each job (start/duration/end met/miss and notifications)
– Add a check of SLA status against the database
– Use a ZK lock to synchronize updates to the same job from multiple servers
• Distributed Locks
– Reentrant distributed lock using Apache Curator + ZooKeeper
Experiences
• HA has been running on all production grids for more than 7 months at Yahoo!
– Stable!
Issues
– ZooKeeper down (when upgrading the ZK quorum hardware)
– Servers going out of sync (during upgrade, sharelib)
Benefits
▪ Zero downtime for applications
▪ Rolling upgrade (zero downtime)
› Maintenance upgrade
› Configuration upgrade
▪ No more materialization delay
Workflow Job Submission Throughput
Future work
• Faster job fail-over
– Currently, fail-over waits for a thread (Recovery Service) that picks up non-progressing jobs every few minutes
– An Oozie server should notice immediately when another server is down and fail over its jobs (e.g., using a ZK watcher)
• Improve log streaming
Acknowledgement
Robert Kanter
Olga L. Natkovich
Rohini Palaniswamy
Michelle Chiang
Jacob Tolar
Sumeet Singh

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Oozie towards zero downtime

  • 1. Oozie towards zero downtime Hadoop Summit 2015/04/15 Purshotam Shah purushah@yahoo-inc.com Ryota Egashira egashira@yahoo-inc.com
  • 2. Agenda ● Introduction ■ Scale at Yahoo ■ Use Cases ■ Why zero-down time matters? ● Architectural Overview ● Technical Challenge ■ Security ■ Log Streaming ■ HCatalog Integration in HA ● Experiences ● Future Work
  • 3. Why Oozie? ● The Problem ▪ Doing something on the grid often required multiple steps: MapReduce job, Pig job, Streaming job, HDFS operation (mkdir, chmod, etc)… ▪ Multiple ad-hoc solutions existed: custom job control, cron… ● The Need ▪ Workflow scheduler with better support for grid jobs (native integration with Hadoop): orchestrate dependency between jobs, execute at specific time or on data availability, retry jobs in the event of failures (reliable) ▪ Common framework for communication and execution of production process: sync (clocked dataset) awareness, async (unspecified freq) data awareness ● A server-based workflow scheduling system to manage Hadoop jobs
  • 4. Scale at Yahoo ● Deployed on all clusters (production, non-production); one instance per cluster ● 75 products / 2000+ projects ● 255 monthly users ● 28.9 million Hadoop jobs monthly (Jan 2015, total); 72% from Oozie (including launcher jobs) ● 108,000 workflow jobs daily (Feb 2015, one busy cluster); between 1 and 8 actions, avg. 4 actions/workflow; extreme use case: 100-200 workflow jobs submitted per min ● 1,700 coordinator jobs daily (Feb 2015, one busy cluster); frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min); 67% of workflow jobs kicked off from a coordinator ● 60 bundle jobs daily (Feb 2015, one busy cluster)
  • 5. Hadoop Jobs on the Platform: job distribution (Jan 2015) (chart)
  • 6. Y! business processed by Oozie Ad Exchange Ad Latency Search Advertising Content Agility Content Optimization Content Personalization Flickr Video Audience Targeting Behavioral Targeting Partner Targeting Retargeting Web Targeting Advertisement Content Targeting
  • 7. Y! business processed by Oozie Anti Spam Content Retargeting Research Dashboards & Reports Forecasting Email Data Intelligence Data Management Audience Event Pipeline
  • 8. Use Case - Data pipeline (diagram)
  • 9. Number of actions created hourly (chart; time axis runs from midnight through the AM and PM hours)
  • 10. Number of actions created per minute (chart)
  • 11. SCALE ▪ At certain points in time, all 5 min, 15 min, 30 min, hourly, daily, and monthly coordinator jobs collide, producing a burst of coordinator actions that a single host can't handle. ▪ We noticed processing delays, and customers complained about slowness.
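The collision described on slide 11 can be illustrated with a little arithmetic. The following is a minimal sketch, assuming every coordinator job is aligned to minute 0 and using made-up job counts per frequency (none of these numbers come from the talk; the monthly frequency is left out because it is not a fixed number of minutes).

```python
from collections import Counter

# Hypothetical mix: frequency in minutes -> number of coordinator jobs.
# The counts are illustration values only, not figures from the talk.
JOBS_PER_FREQUENCY = {5: 100, 15: 100, 30: 100, 60: 100, 1440: 100}


def actions_per_minute(jobs_per_frequency, horizon_minutes):
    """Count coordinator actions created at each minute of the horizon."""
    load = Counter()
    for freq, jobs in jobs_per_frequency.items():
        # A job with frequency `freq` creates one action every `freq` minutes.
        for minute in range(0, horizon_minutes, freq):
            load[minute] += jobs
    return load


load = actions_per_minute(JOBS_PER_FREQUENCY, 24 * 60)
peak_minute, peak_load = max(load.items(), key=lambda kv: kv[1])
print("minute 5:", load[5], "| minute 60:", load[60], "| peak:", peak_minute, peak_load)
```

Under these assumptions a typical 5-minute tick creates 100 actions, the top of each hour creates 400, and the start of the day creates 500, which is the kind of burst the slide attributes to colliding frequencies.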
  • 12. Why Downtime matters? Downtime needed: Oozie upgrade (major release > 1 per quarter, minor > 1 per month)
  • 13. Why Downtime matters? Downtime needed: upgrade of dependent Hadoop projects (YARN, HDFS, Hive, HBase, etc). Diagram: Oozie alongside YARN, HDFS, Hive, HBase, Pig, HCatalog
  • 14. Why Downtime matters? Downtime needed: configuration error / change
  • 15. Why Downtime matters? Downtime needed: hardware error / upgrade
  • 16. Why Downtime matters? Customers: revenue-impacting applications need to run all the time, with no delay!
  • 17. Why Downtime matters? Ops: under pressure to minimize downtime
  • 18. Solution: High Availability ● Definition: failure of a component != failure of the entire system o achieved by removing single points of failure ● Requirement: transparency to users o Users should not need to know whether it's HA or not o No change in API or usage pattern
  • 19. Architecture (diagram) ● Components: Load Balancer, Oozie Server 1 … Oozie Server n, RDB, Hadoop Cluster, ZooKeeper (Curator) ● Labeled flows: submit request, request redirection, inter-server communication
  • 20. Architectural Overview: Database ● Oozie stores most of its state in a database o (submitted jobs, workflow definitions, etc) ● An Oracle database (2 racks) in HA (hot-warm) is used. ● ZooKeeper (Curator) is used for coordination
  • 21. Architectural Overview: Access ● Users and client programs need a single address to connect to o Web UI, REST/Java API, JobTracker/ResourceManager callbacks, etc ● A virtual IP (VIP) is used as the user-facing URL.
  • 22. Architectural Overview: Security ● We use Kerberos and some internal security systems for communication among components.
  • 23. Architectural Overview: Authentication (diagram) ● Components: Load Balancer, Oozie Server 1 … Oozie Server n, RDB, Hadoop Cluster, ZooKeeper (Curator) ● Labeled flows: submit request, request redirection, inter-server communication for log streaming etc, ZooKeeper for lock and management ● Security labels on the links: https + Kerberos / cookie-based auth (shown twice), https + Kerberos, Kerberos (shown twice)
  • 24. Technical Challenge: Log Streaming ● Each Oozie server only has access to its own logs ● Jobs can execute on any server o Job execution can switch among servers ● Users need to see one sequential log rather than separate server1 and server2 logs.
  • 25. Architecture: Log Streaming (diagram) 1. User request comes to server1 2. Fetch server list from ZK 3. Call all other servers to fetch logs and merge them using the log timestamp 4. Log is displayed to the user
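The merge step on slide 25 can be sketched as a k-way merge by timestamp. This is a minimal illustration, not Oozie's implementation: it assumes each server's log lines are already in time order, that each line starts with a 23-character timestamp in the layout shown, and the server names and log lines are invented.

```python
import heapq
from datetime import datetime

# Assumed timestamp layout at the start of each log line (hypothetical).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23


def parse_ts(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def merge_logs(per_server_logs):
    """Merge per-server log lists (each in time order) into one ordered list."""
    return list(heapq.merge(*per_server_logs.values(), key=parse_ts))


# Invented example: the job ran on server1, moved to server2, then moved back.
logs = {
    "server1": [
        "2015-04-15 10:00:01,000 INFO workflow started",
        "2015-04-15 10:00:05,000 INFO action 2 started",
    ],
    "server2": [
        "2015-04-15 10:00:03,000 INFO action 1 ended",
    ],
}
merged = merge_logs(logs)
for line in merged:
    print(line)
```

`heapq.merge` keeps only one pending line per server in memory, which suits streaming. Multi-line entries such as stack traces would need to be grouped with their first line before merging.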
  • 26. Caveat: Log Streaming ● If an Oozie server goes down, any logs from it will be unavailable
  • 27. Technical Challenge: HCatalog Integration • Hive Metastore (HCatalog): manages metadata for datasets – Oozie registers for a dataset with HCatalog – Oozie receives notifications from HCatalog through JMS (e.g., ActiveMQ) – Oozie starts the job immediately after the data becomes ready. Diagram labels: Register Topic, Push notification <New Partition>, Notify New Partition; components: JMS (e.g., ActiveMQ), Job, Oozie Server 1, Oozie Server 2
  • 28. Technical Challenge: Hive Metastore Integration • Oozie maintains an in-memory list of datasets that need notification. • Each notification comes to only one server. • Because a notification reaches only one server, Oozie needs to invalidate the cache on all other servers. • This is done by a periodic task on each server that checks the job status for each dataset and, if the job is no longer waiting, removes the dataset from the cache.
  • 29. Technical Challenge: Hive Metastore Integration (diagram) 2. Register Topic 3. Push notification <New Partition> 4. Notify New Partition; additional labels: Periodic check, Remove registration; components: JMS (e.g., ActiveMQ), Job, Oozie Server 1, Oozie Server 2
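The periodic check from slides 28 and 29 can be sketched as follows. This is an illustration only: the class, the status strings, the dataset URIs, and the dictionary standing in for the shared database are all hypothetical, not Oozie's actual data structures.

```python
class DependencyCache:
    """In-memory map of datasets a server still expects notifications for."""

    def __init__(self):
        self._waiting = {}  # dataset URI -> set of job IDs waiting on it

    def register(self, dataset, job_id):
        self._waiting.setdefault(dataset, set()).add(job_id)

    def datasets(self):
        return set(self._waiting)

    def purge(self, job_status):
        """Periodic task: drop jobs that are no longer waiting in shared state."""
        for dataset in list(self._waiting):
            still_waiting = {
                job for job in self._waiting[dataset]
                if job_status(job) == "WAITING"
            }
            if still_waiting:
                self._waiting[dataset] = still_waiting
            else:
                del self._waiting[dataset]


# The shared database is simulated with a dict (hypothetical job IDs).
shared_db = {"action-1": "WAITING", "action-2": "WAITING"}
server1, server2 = DependencyCache(), DependencyCache()
for server in (server1, server2):
    server.register("hcat://db/table/dt=20150415", "action-1")
    server.register("hcat://db/table/dt=20150416", "action-2")

# The notification for the first partition reaches one server only,
# which moves the job forward in the shared database.
shared_db["action-1"] = "READY"

# Each server's periodic task then reconciles its cache with the database.
server1.purge(shared_db.get)
server2.purge(shared_db.get)
print(sorted(server2.datasets()))
```

The design trades immediacy for simplicity: the server that missed the notification keeps a stale entry until its next periodic run, but no server-to-server invalidation message is needed.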
  • 30. Challenges ● Distributed Job ID o Maintain a distributed sequence number for the Job ID using Apache Curator + ZooKeeper ● ZooKeeper Failure Handling o Oozie servers automatically shut down when ZooKeeper is down ● Sharelib o Support sharelib update in HA
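The distributed Job ID idea can be sketched with a single shared counter. In the sketch below a lock-protected in-process counter stands in for the ZooKeeper-backed sequence, and the ID layout (zero-padded sequence, timestamp, system ID, job type) is an assumption modeled on the general look of Oozie job IDs; all names and values are hypothetical.

```python
import itertools
import threading


class SharedSequence:
    """Stand-in for a ZooKeeper-backed counter shared by every server."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counter = itertools.count(1)

    def next(self):
        # The lock plays the role of the atomic increment a real
        # coordination service would provide.
        with self._lock:
            return next(self._counter)


def make_job_id(sequence, start_stamp, system_id, job_type):
    """Build an ID from the shared sequence plus fixed suffix parts."""
    return f"{sequence.next():07d}-{start_stamp}-{system_id}-{job_type}"


sequence = SharedSequence()
ids = []


def server(submissions):
    """Simulate one server handling a number of job submissions."""
    for _ in range(submissions):
        ids.append(make_job_id(sequence, "150415100000000", "oozie-oozi", "W"))


threads = [threading.Thread(target=server, args=(50,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(ids), "IDs,", len(set(ids)), "unique")
```

Because both simulated servers draw from the same sequence, no two jobs receive the same ID even when the rest of the ID is identical.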
  • 31. More Challenges • SLA support – Oozie has an in-memory data structure to track SLA status for each job (start/duration/end met/miss and notifications) – add a check of SLA status against the database – use a ZK lock to synchronize updates on the same job from multiple servers. • Distributed Locks – Reentrant distributed lock using Apache Curator + ZooKeeper
  • 32. Experiences • HA has been running on all production grids for more than 7 months at Yahoo! – Stable!
  • 33. Issues – ZooKeeper down (when upgrading ZK quorum h/w) – Server going out of sync (during upgrade, sharelib)
  • 34. Benefits ▪ Zero downtime for applications ▪ Rolling upgrade (zero downtime) › Maintenance upgrade › Configuration upgrade ▪ No more materialization delay
  • 35. Workflow Job Submission Throughput (chart)
  • 36. Future work • Faster job fail-over – currently a thread (Recovery Service) picks up non-progressing jobs every few minutes, so fail-over has to wait for it – an Oozie server should immediately notice when another server is down and fail over its jobs (e.g., using a ZK watcher) • Improve log streaming
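The polling-based recovery that slide 36 describes as the current behavior can be sketched as a scan for jobs that have not progressed recently. This is a hedged illustration: the `Job` fields, the status string, and the threshold are hypothetical and do not reflect the actual Recovery Service code.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    status: str
    last_modified: float  # seconds since some epoch (hypothetical field)


def find_stalled(jobs, now, older_than_seconds):
    """Return IDs of running jobs with no progress for longer than the threshold."""
    return [
        job.job_id
        for job in jobs
        if job.status == "RUNNING" and now - job.last_modified > older_than_seconds
    ]


# Invented state: job "a" was owned by a server that went down at t=600.
jobs = [
    Job("a", "RUNNING", last_modified=600.0),
    Job("b", "RUNNING", last_modified=900.0),
    Job("c", "SUCCEEDED", last_modified=100.0),
]
stalled = find_stalled(jobs, now=1000.0, older_than_seconds=300.0)
print(stalled)
```

With this approach a job orphaned by a failed server waits up to the staleness threshold plus one polling interval before another server picks it up, which is the delay a watcher-based notification would aim to remove.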
  • 37. Acknowledgement: Robert Kanter, Olga L. Natkovich, Rohini Palaniswamy, Michelle Chiang, Jacob Tolar, Sumeet Singh