Hadoop 2 @Twitter,
Elephant Scale
Lohit VijayaRenu Gera Shegalov
@lohitvijayarenu @gerashegalov
@TwitterHadoop
1 / 29 v1.0
About this talk
Share @twitterhadoop’s efforts, experience, and lessons learned in
moving a thousand-plus users and multi-petabyte workloads from
Hadoop 1 to Hadoop 2
@twitterhadoop
2 / 29 v1.0
Use cases
Personalization
Graph analysis, Recommendations, Trends, User/topic modeling
Analytics
A/B testing, user behavior analysis, API analytics
Growth
Network Digest, People Recommendations, Email
Revenue
Engagement prediction, Ad targeting, ads analytics, marketplace optimization
Nielsen Twitter TV Rating
Tweet impressions processing
Backups & Scribe Logs
MySQL backups, Manhattan backups, FrontEnd scribe logs
Many more...
3 / 29 v1.0
Hadoop and Data pipeline
[Diagram: data pipeline. Labeled components: TFE; MySQL; SVN, Git, ...; Search, Ads, etc.; Partners; Vertica; Manhattan; and Hadoop clusters for real time, processing, warehouse, cold, backups, hbase, and tst.]
4 / 29 v1.0
Elephant Scale
➔ Tens of thousands of Hadoop servers (mix of hardware)
➔ Hundreds of thousands of disk drives
➔ A few hundred PB of data stored in HDFS
➔ Hundreds of thousands of daily Hadoop jobs
➔ Tens of millions of daily Hadoop tasks
Individual Cluster Stats
➔ More than 3500 nodes
➔ 30-50+ PB data stored in HDFS
➔ 35K RPC/second on NNs
➔ 30K+ jobs per day
➔ 10M+ tasks per day
➔ 6PB+ data crunched per day
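To put the per-cluster daily figures into per-second terms, here is a back-of-the-envelope conversion. It assumes the lower bounds quoted above, decimal petabytes (1 PB = 10^15 bytes), and load spread evenly over 24 hours; real load is bursty, so these are averages only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Lower bounds taken from the slide; the even spread over a day is an assumption.
JOBS_PER_DAY = 30_000
TASKS_PER_DAY = 10_000_000
BYTES_PER_DAY = 6 * 10**15  # 6 PB, decimal units assumed
NODES = 3_500


def per_second(daily_total):
    """Average rate per second for a daily total."""
    return daily_total / SECONDS_PER_DAY


jobs_per_sec = per_second(JOBS_PER_DAY)
tasks_per_sec = per_second(TASKS_PER_DAY)
tasks_per_job = TASKS_PER_DAY / JOBS_PER_DAY
cluster_gb_per_sec = per_second(BYTES_PER_DAY) / 10**9
node_mb_per_sec = per_second(BYTES_PER_DAY) / NODES / 10**6

if __name__ == "__main__":
    print(f"jobs/s:        {jobs_per_sec:.2f}")
    print(f"tasks/s:       {tasks_per_sec:.1f}")
    print(f"tasks per job: {tasks_per_job:.0f}")
    print(f"cluster GB/s:  {cluster_gb_per_sec:.1f}")
    print(f"per-node MB/s: {node_mb_per_sec:.1f}")
```

Under these assumptions a single cluster averages roughly one new job every three seconds and on the order of a hundred task launches per second.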
5 / 29 v1.0
Hadoop 1 Challenges (Q4-2012)
Growth:
Supporting Twitter growth, request for new features on older branch, new Java
Scalability:
NameNode files/blocks, NN operations, GC pause, checkpointing
JobTracker GC pause, task assignment
Reliability:
SPOF NN and JT, NameNode restart delays
Efficiency:
Slot utilization, QoS, Multi Tenant, new features & frameworks
Maintenance:
Old codebase, numerous issues fixed in later versions, dev branch
6 / 29 v1.0
Hadoop 2 Configuration (Q1-2013)
[Diagram: worker nodes each running a NodeManager and a DataNode; a YARN ResourceManager; six JournalNodes (JN); labels logs, user, tmp, Trash; tooling: ViewFS, HDFS Balancer, Admin tools, hRaven, Metrics, Alerts.]
7 / 29 v1.0
Hadoop 2 Migration (Q2-Q4 2013)
Phase 1: Testing
➔ Apache 2.0.3 branch
➔ New Hardware*, New OS and JVM
➔ Benchmarks and user jobs (lots of them…)
➔ Dependent component updates
➔ Data movement between different versions
➔ Metrics, Alerts and tools
Phase 2: Semi production
➔ Production use cases running in 2 clusters in parallel
➔ Tuning/parameter updates and learnings
➔ Started contributing fixes back to community
➔ Educating users about new version and changes
➔ Benefits of Hadoop 2
Phase 3: Production
➔ Stable Apache 2.0.5 release with many fixes and backports
➔ Multiple internal releases
➔ Template for new clusters
➔ Ready to roll Apache 2.3 release
*http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
8 / 29 v1.0
CPU Utilization
[Chart] Hadoop 1 CPU utilization for one day (45% peaks)
[Chart] Hadoop 2 CPU utilization for one day (85% peaks)
9 / 29 v1.0
Memory Utilization
[Chart] Hadoop 1 memory utilization for one day (68% peaks)
[Chart] Hadoop 2 memory utilization for one day (96% peaks)
10 / 29 v1.0
Migration Challenge: web-based FS
Need a web-based FS to deal with H1/H2 interactions
● Hftp based on cross-DC LogMover experience
● Apps broken due to no FNF (FileNotFoundException) on non-existing paths: HDFS-6143
● Faced challenges with cross-version checksums
11 / 29 v1.0
Migration Challenge: hard-coded FS
1000s of occurrences of hdfs://${NN}/path and absolute URIs
● For cluster1 dial hdfs://hadoop-cluster1-nn.dc CNAME
● For cluster2 dial …
Ideal: use logical paths and viewfs as defaultFS
More realistic and faster:
● HDFSCompatibleViewFS HADOOP-9985
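The idea of resolving both logical paths and legacy hard-coded `hdfs://` URIs through one mount table can be sketched as follows. This is a minimal illustration of longest-prefix mount resolution, not the HADOOP-9985 implementation; the mount table contents, host names, and the `resolve` helper are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mount table: logical prefix -> physical namespace URI.
MOUNT_TABLE = {
    "/user": "hdfs://nn-user.example",
    "/logs": "hdfs://nn-logs.example",
    "/tmp": "hdfs://nn-tmp.example",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Map a logical path or a hard-coded hdfs:// URI to a physical URI.

    The authority of a hard-coded URI is ignored; only the path part is
    matched against the mount table, using the longest matching prefix.
    """
    logical = urlparse(path).path if "://" in path else path
    best = None
    for prefix in mount_table:
        on_mount = logical == prefix or logical.startswith(prefix.rstrip("/") + "/")
        if on_mount and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise FileNotFoundError(f"no mount point for {logical}")
    return mount_table[best] + logical


if __name__ == "__main__":
    print(resolve("/user/alice/data"))
    print(resolve("hdfs://hadoop-cluster1-nn.dc/logs/2014"))
```

With this shape, existing jobs keep their absolute URIs while the mount table alone decides which NameNode serves each path.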
12 / 29 v1.0
Migration Challenge: Interoperability
Migration in progress: H1 job requires input from H2
● hftp://OMGwhatNN/has/my/path problem
● ideal: use viewfs on H1 resolving to correct H2-NN
● realistic: see above “hardcoded FS”
● Even if you know OMGwhatNN, is it active?
13 / 29 v1.0
[Diagram: an H1 client reaches the cluster through a cluster CNAME; NameNodes are shown as Active/Standby pairs.]
Load client-side mounttable on the server side:
1. redirect to the right namespace
2. redirect to active within namespace
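The two redirect steps above can be sketched as a small function. Everything here is an assumption made for illustration: the namespace names, host names, the `is_active` probe (a stand-in for however the server learns HA state), and the use of an `hftp://` URL as the redirect target.

```python
# Hypothetical topology: namespace -> NameNode hosts in its HA pair.
NAMESPACES = {
    "ns-user": ["nn-user-a.example", "nn-user-b.example"],
    "ns-logs": ["nn-logs-a.example", "nn-logs-b.example"],
}

# Hypothetical server-side copy of the client mount table.
MOUNTS = {"/user": "ns-user", "/logs": "ns-logs"}


def redirect(path, is_active):
    """Return a redirect URL for an absolute path.

    is_active is a caller-supplied function host -> bool.
    """
    # Step 1: redirect to the right namespace (keyed on the first path component).
    top = "/" + path.strip("/").split("/")[0]
    namespace = MOUNTS.get(top)
    if namespace is None:
        raise FileNotFoundError(path)
    # Step 2: redirect to the active NameNode within that namespace.
    for host in NAMESPACES[namespace]:
        if is_active(host):
            return f"hftp://{host}{path}"
    raise RuntimeError(f"no active NameNode for {namespace}")


if __name__ == "__main__":
    # Pretend the "-b" host of every pair is currently active.
    print(redirect("/user/alice", lambda host: host.endswith("-b.example")))
```

The client only needs to know the cluster CNAME; which namespace and which NameNode is active stay server-side concerns.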
14 / 29 v1.0
Migration: Tools and Ecosystem
● Port/recompile/package:
o Data Access Layer/HCatalog,
o Pig,
o Cascading/Scalding
o ElephantBird
o hadoop-lzo
● PIG-3913 (local mode counters),
● Analytics team fixed PIG-2888 (performance)
● hRaven fixes:
o translation between slot_millis and mb_millis
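One way to picture the slot_millis/mb_millis translation is to treat every Hadoop 1 slot as a fixed amount of memory. This is only a sketch under that assumption, not hRaven's code; the `mb_per_slot` parameter and both function names are hypothetical.

```python
def slot_millis_to_mb_millis(slot_millis, mb_per_slot):
    """Convert slot-time to memory-time, assuming a fixed memory size per slot."""
    return slot_millis * mb_per_slot


def mb_millis_to_slot_millis(mb_millis, mb_per_slot):
    """Inverse conversion, so H2 jobs can be compared in H1-style units."""
    return mb_millis / mb_per_slot


if __name__ == "__main__":
    # One minute on a slot assumed to be worth 1024 MB.
    print(slot_millis_to_mb_millis(60_000, 1024))
```

A common unit like this lets job cost be compared across the two versions during a migration.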
15 / 29 v1.0
HadOops found and fixed
● ViewFS can’t be used for public DistributedCache (DC)
o HADOOP-10191, YARN-1542
● getFileStatus RPC storm on public DC:
o YARN-1771
● No user-specified progress string in MR-AM UI task
o MAPREDUCE-5550
● Uberized jobs for scheduling small jobs great but ...
o can you kill them? MAPREDUCE-5841
o size correctly for map-only? YARN-1190
16 / 29 v1.0
More HadOops
Incident: a job blacklists nodes by logging terabytes
● need capping, but userlog.limit.kb loses valuable log tail
● RollingFileAppender for MR-AM/tasks: MAPREDUCE-5672
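Why a rolling appender keeps the valuable tail while still capping disk usage can be shown with Python's standard `RotatingFileHandler`, used here as an analogue of log4j's RollingFileAppender. The size limits, file names, and the chatty loop are illustrative assumptions, not the values used on the clusters.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_capped_logger(log_path, max_bytes=10_000, backups=2):
    """Logger whose on-disk footprint is bounded by (backups + 1) files."""
    logger = logging.getLogger(f"capped.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger, handler


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "task.log")
logger, handler = make_capped_logger(log_path)

# Simulate a task that logs far more than the cap allows.
for i in range(5_000):
    logger.info("record %d: chatty task output", i)
handler.close()

files = sorted(os.listdir(log_dir))
total_bytes = sum(os.path.getsize(os.path.join(log_dir, name)) for name in files)
with open(log_path) as f:
    last_line = f.read().splitlines()[-1]

if __name__ == "__main__":
    print(files, total_bytes, last_line)
```

The oldest records are discarded as files roll over, so the most recent output, usually the part needed for debugging, is still on disk when the task ends.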
17 / 29 v1.0
Diagnostics improvement
App/Job/Task kill:
● DAG processors/users can say why
o MAPREDUCE-5648, YARN-1551
● MR-AM: “speculation”, “reducer preemption”
o MAPREDUCE-5692, MAPREDUCE-5825
● Thread Dumps
o On task timeout: MAPREDUCE-5044
o On demand from CLI/UI: MAPREDUCE-5784, ...
18 / 29 v1.0
UX/UI improvements
● NameNode state and cluster stats
● App size in MB on RM Apps Page
● RM Scheduler UI improvements: queue descriptions, bug fixes in min/max resource calculation
● Task Attempt state filtering in MR-AM
HDFS-5928, YARN-1945, HDFS-5296...
19 / 29 v1.0
YARN reliability improvements
● Unhealthy nodes / positive feedback
o drain containers instead of killing: YARN-1996
o don’t rerun maps when all reduces committed: MAPREDUCE-5817
● RM crashes: JIRAs fixed either internally only or publicly
o YARN-351, YARN-502
20 / 29 v1.0
MapReduce usability
● Memory.mb as a single tunable: Xmx, sort.mb auto-set
o mb is optimized on case-by-case basis
o MAPREDUCE-5785
● Users want newer artifacts like guava: job.classloader
o MAPREDUCE-5146 / 5751 / 5813 / 5814
● Help users debug
o thread dump on timeout, and on demand via UI
o educate users about heap dumps on OOM and java profiling
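The "single tunable" idea can be sketched as deriving the JVM heap and sort buffer from the container size. The two ratios below and the `derive_task_settings` helper are assumptions chosen for illustration; the keys reuse the slide's shorthand rather than full Hadoop property names.

```python
HEAP_RATIO = 0.8   # assumed share of the container given to the JVM heap
SORT_RATIO = 0.25  # assumed share of the heap given to the sort buffer


def derive_task_settings(memory_mb, heap_ratio=HEAP_RATIO, sort_ratio=SORT_RATIO):
    """Derive Xmx and sort.mb from the one value the user sets: memory.mb."""
    heap_mb = int(memory_mb * heap_ratio)
    sort_mb = int(heap_mb * sort_ratio)
    return {
        "memory.mb": memory_mb,
        "java.opts": f"-Xmx{heap_mb}m",
        "sort.mb": sort_mb,
    }


if __name__ == "__main__":
    print(derive_task_settings(2048))
```

Users then tune one number per job, and the dependent settings cannot drift out of sync with the container size.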
21 / 29 v1.0
Multi-DC environment
MR clients across latency boundaries. Submit fast:
● moving split calculation to MR-AM: MAPREDUCE-207
DSCP bit coloring for DataXfer
● HDFS-5175
● Hftp (switched to Apache Commons HttpClient)
DataXfer throttling (client RW)
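DSCP coloring means tagging a connection's packets with a traffic class so the network can prioritize or throttle them. Here is a generic sketch using the standard socket API, not the HDFS-5175 patch; the helper names and the example class values are illustrative assumptions, and `socket.IP_TOS` is not available on every platform.

```python
import socket


def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the IP TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def mark_socket(sock, dscp):
    """Ask the OS to stamp outgoing packets on this socket with the given DSCP."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


if __name__ == "__main__":
    # Example: color a bulk-transfer connection with DSCP 10 (AF11).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        mark_socket(s, 10)
        print(hex(dscp_to_tos(10)))
```

Marking at the socket level keeps the policy decision (which class gets which bandwidth) in the network rather than in each job.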
22 / 29 v1.0
YARN: Beyond Java & MapReduce
● MR-AM and other REST APIs across the stack for easy
integration in non-JVM tools.
● Vowpal Wabbit: (production)
o no extra spanning tree step
● Spark (semi-production)
23 / 29 v1.0
Ongoing Project: Shared Cache
MapReduce function shipping: computation->data
● Teams have jobs with 100s of jars uploaded via libjars
o Ideal: manage a jar repo on HDFS
o Reference jars via DistributedCache instead of uploading
o Real: currently hard to coordinate
● YARN-1492: Manage artifacts cache transparently
● Measure it:
o YARN-1529: Localization overhead/cache hits NM metrics
o MAPREDUCE-5696: Job localization counters
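A transparent artifact cache can be pictured as a store keyed by content checksum, so identical jars are uploaded once and later jobs reuse them. This is an in-memory sketch under that assumption, not the YARN-1492 design; the `SharedCache` class, the path layout, and the choice of SHA-256 are all hypothetical.

```python
import hashlib


class SharedCache:
    """In-memory stand-in for a checksum-keyed artifact cache."""

    def __init__(self):
        self._by_checksum = {}
        self.hits = 0
        self.misses = 0

    def use(self, name, content):
        """Return the cached location for this content, storing it on a miss."""
        key = hashlib.sha256(content).hexdigest()
        if key in self._by_checksum:
            self.hits += 1  # identical bytes already cached: no upload needed
        else:
            self.misses += 1  # first sighting: this is where an upload would happen
            self._by_checksum[key] = f"/sharedcache/{key[:2]}/{key}/{name}"
        return self._by_checksum[key]


if __name__ == "__main__":
    cache = SharedCache()
    cache.use("scalding.jar", b"jar bytes")
    cache.use("scalding.jar", b"jar bytes")
    print(cache.hits, cache.misses)
```

The hit and miss counters play the role of the localization metrics mentioned above: they show how much upload and localization work the cache actually saves.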
24 / 29 v1.0
Upcoming Challenges
● Reduce ops complexity:
o grow to 10K+-node clusters
o try to avoid adding more clusters
● Scalability limits for NN, RM
● NN heap sizes: large Java heap vs namespace splitting
● RPC QoS Issues
● NN startup: long initial block report processing
● Integrating non-MR frameworks with hRaven
25 / 29 v1.0
Future Work Ideas
● Productize RM HA and work-preserving restart
● HDFS Readable Standby NN
● Whole DAG in a single NN namespace
● Contribute to HDFS-5477 - Dedicated BM service
● NN SLA: fairshare for RPC queues: HADOOP-10598
● Finer lock granularity in NN
26 / 29 v1.0
Summary: Hadoop 2 @ Twitter
● No JT bottleneck: Lightweight RM + MR-AM
● High compute density with flexible slots
● Reduced NN bottleneck using Federation
● HDFS HA removes the angst to try out new NN configs
● Much closer to upstream to consume/contribute fixes
o Development on 2.3 branch
● Adopting new frameworks on YARN
27 / 29 v1.0
Conclusion
Migrating 1000+ users/use cases is anything but trivial
… however,
● Hadoop 2 made it worthwhile
● Hadoop 2 contributions:
o 40+ patches committed
o ~40 in review
28 / 29 v1.0
Thank you! Questions?
@JoinTheFlock about.twitter.com/careers
@TwitterHadoop
Catch up with us in person
@LohitVijayaRenu
@GeraShegalov
29 / 29 v1.0

More Related Content

What's hot

MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
mcsrivas
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Chris Fregly
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
Mathieu Dumoulin
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
Yahoo Developer Network
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
Big Data Montreal
 
LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
Linaro
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
Subhas Kumar Ghosh
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
Data Works MD
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
veeracynixit
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
DataWorks Summit
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Clusterairbots
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
inside-BigData.com
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
Great Wide Open
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
Spark Summit
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Mathieu Dumoulin
 
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
inside-BigData.com
 
RAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature EngineeringRAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature Engineering
Keith Kraus
 
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceShift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
MapR Technologies
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Databricks
 

What's hot (20)

MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
 
RAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature EngineeringRAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature Engineering
 
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceShift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 

Similar to Hadoop 2 @Twitter, Elephant Scale. Presented at

Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReducehuguk
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopHortonworks
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
David Morin
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Sumeet Singh
 
HBase @ Twitter
HBase @ TwitterHBase @ Twitter
HBase @ Twitter
ctrezzo
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Savanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSavanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStack
Sergey Lukjanov
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
Cloudera, Inc.
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Yahoo Developer Network
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
spinningmatt
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
StampedeCon
 
Evolution of Drupal and the Drupal community
Evolution of Drupal and the Drupal communityEvolution of Drupal and the Drupal community
Evolution of Drupal and the Drupal community
Angela Byron
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Frank Munz
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
Adam Doyle
 
Hadoop Cluster on Docker Containers
Hadoop Cluster on Docker ContainersHadoop Cluster on Docker Containers
Hadoop Cluster on Docker Containers
pranav_joshi
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Wangda Tan
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 

Similar to Hadoop 2 @Twitter, Elephant Scale. Presented at (20)

Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
ha_module5
ha_module5ha_module5
ha_module5
 
HBase @ Twitter
HBase @ TwitterHBase @ Twitter
HBase @ Twitter
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
 
Savanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStackSavanna - Elastic Hadoop on OpenStack
Savanna - Elastic Hadoop on OpenStack
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
HugNov14
HugNov14HugNov14
HugNov14
 
Evolution of Drupal and the Drupal community
Evolution of Drupal and the Drupal communityEvolution of Drupal and the Drupal community
Evolution of Drupal and the Drupal community
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Hadoop Cluster on Docker Containers
Hadoop Cluster on Docker ContainersHadoop Cluster on Docker Containers
Hadoop Cluster on Docker Containers
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 

More from lohitvijayarenu

OpenSource and the Cloud ApacheCon.pptx
OpenSource and the Cloud  ApacheCon.pptxOpenSource and the Cloud  ApacheCon.pptx
OpenSource and the Cloud ApacheCon.pptx
lohitvijayarenu
 
The Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at TwitterThe Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at Twitter
lohitvijayarenu
 
Log Events @Twitter
Log Events @TwitterLog Events @Twitter
Log Events @Twitter
lohitvijayarenu
 
Story of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streamingStory of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streaming
lohitvijayarenu
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitter
lohitvijayarenu
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
lohitvijayarenu
 
Managing 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloud
lohitvijayarenu
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
lohitvijayarenu
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storage
lohitvijayarenu
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
lohitvijayarenu
 
Large Scale EventLog Management @Twitter
Large Scale EventLog Management @TwitterLarge Scale EventLog Management @Twitter
Large Scale EventLog Management @Twitter
lohitvijayarenu
 
Routing trillion events per day @twitter
Routing trillion events per day @twitterRouting trillion events per day @twitter
Routing trillion events per day @twitter
lohitvijayarenu
 
Open Source india 2014
Open Source india 2014Open Source india 2014
Open Source india 2014
lohitvijayarenu
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
lohitvijayarenu
 

More from lohitvijayarenu (14)

OpenSource and the Cloud ApacheCon.pptx
OpenSource and the Cloud  ApacheCon.pptxOpenSource and the Cloud  ApacheCon.pptx
OpenSource and the Cloud ApacheCon.pptx
 
The Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at TwitterThe Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at Twitter
 
Log Events @Twitter
Log Events @TwitterLog Events @Twitter
Log Events @Twitter
 
Story of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streamingStory of migrating event pipeline from batch to streaming
Story of migrating event pipeline from batch to streaming
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitter
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
 
Managing 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloud
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storage
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
 
Large Scale EventLog Management @Twitter
Large Scale EventLog Management @TwitterLarge Scale EventLog Management @Twitter
Large Scale EventLog Management @Twitter
 
Routing trillion events per day @twitter
Routing trillion events per day @twitterRouting trillion events per day @twitter
Routing trillion events per day @twitter
 
Open Source india 2014
Open Source india 2014Open Source india 2014
Open Source india 2014
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
 


Hadoop 2 @Twitter, Elephant Scale. Presented at

  • 1. Hadoop 2 @Twitter, Elephant Scale Lohit VijayaRenu Gera Shegalov @lohitvijayarenu @gerashegalov @TwitterHadoop 1 / 29 v1.0
  • 2. About this talk Share @twitterhadoop’s efforts, experience and learning in moving thousand users and multi petabyte workloads from Hadoop 1 to Hadoop 2
  • 3. Use cases Personalization Graph analysis, Recommendations, Trends, User/topic modeling Analytics a/b testing, user behavior analysis, api analytics Growth Network Digest, People Recommendations, Email Revenue Engagement prediction, Ad targeting, ads analytics, marketplace optimization Nielsen Twitter TV Rating Tweet impressions processing Backups & Scribe Logs MySQL backups, Manhattan backups, FrontEnd scribe logs Many more...
  • 4. Hadoop and Data pipeline [Diagram labels: TFE, hadoop real time, hadoop processing, hadoop warehouse, hadoop cold, hadoop backups, Search, Ads, etc, Partners, MySQL, hadoop hbase, Vertica, Manhattan, hadoop tst, SVN, Git, ...]
  • 5. Elephant Scale ➔ Tens of thousands Hadoop servers (Mix of hardware) ➔ Hundreds of thousands of disk drives ➔ Few hundred PB data stored in HDFS ➔ Hundreds of thousands of daily hadoop jobs ➔ Tens of millions of daily hadoop tasks Individual Cluster Stats ➔ More than 3500 nodes ➔ 30-50+ PB data stored in HDFS ➔ 35K RPC/second on NNs ➔ 30K+ jobs per day ➔ 10M+ tasks per day ➔ 6PB+ data crunched per day
  • 6. Hadoop 1 Challenges (Q4-2012) Growth: Supporting twitter growth, Request for new features on older branch, new JAVA Scalability: NameNode files/blocks, NN Operations, GC pause, Checkpointing JobTracker GC pause, task assignment Reliability: SPOF NN and JT, NameNode restart delays Efficiency: Slot utilization, QoS, Multi Tenant, New features & frameworks Maintenance: Old codebase, Numerous issues fixed in later versions, dev branch
  • 7. Hadoop 2 Configuration (Q1-2013) [Diagram labels: NodeManager, DataNode, YARN ResourceManager, JN, ViewFS, HDFS Balancer, Admin tools, hRaven, Metrics, Alerts, logs, user, tmp, Trash]
  • 8. Hadoop 2 Migration (Q2-Q4 2013) Phase 1 : Testing Phase 2 : Semi production Phase 3 : Production ➔ Apache 2.0.3 branch ➔ New Hardware*, New OS and JVM ➔ Benchmarks and user jobs (lots of them…) ➔ Dependent component updates ➔ Data movement between different versions ➔ Metrics, Alerts and tools ➔ Production use cases running in 2 clusters in parallel. ➔ Tuning/parameter updates and learnings ➔ Started contributing fixes back to community ➔ Educating users about new version and changes ➔ Benefits of Hadoop 2 ➔ Stable Apache 2.0.5 release with many fixes and backports ➔ Multiple internal releases ➔ Template for new clusters ➔ Ready to roll Apache 2.3 release *http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
  • 9. CPU Utilization Hadoop 1 CPU Utilization for one day. (45% peaks) Hadoop 2 CPU Utilization for one day. (85% peaks)
  • 10. Memory Utilization Hadoop 1 Memory Utilization for one day (68% peaks) Hadoop 2 Memory Utilization for one day (96% peaks)
  • 11. Migration Challenge: web-based FS Need a web-based FS to deal with H1/H2 interactions ● Hftp based on cross-DC LogMover experience ● Apps broken due to no FNF on non-existing paths HDFS-6143 ● Faced challenges cross-version checksums
  • 12. Migration Challenge: hard-coded FS 1000’s of occurrences hdfs://${NN}/path and absolute URIs ● For cluster1 dial hdfs://hadoop-cluster1-nn.dc CNAME ● For cluster2 dial … Ideal: use logical paths and viewfs as defaultFS More realistic and faster: ● HDFSCompatibleViewFS HADOOP-9985
  • 13. Migration Challenge: Interoperability Migration in progress: H1 job requires input from H2 ● hftp://OMGwhatNN/has/my/path problem ● ideal: use viewfs on H1 resolving to correct H2-NN ● realistic: see above “hardcoded FS” ● Even if you know OMGwhatNN, is it active?
  • 14. [Diagram labels: Cluster CNAME, H1 client, Active, Standby] Load client-side mounttable on the server side: 1. redirect to the right namespace 2. redirect to active within namespace
  • 15. Migration: Tools and Ecosystem ● Port/recompile/package: o Data Access Layer/HCatalog, o Pig, o Cascading/Scalding o ElephantBird o hadoop-lzo ● PIG-3913 (local mode counters), ● Analytics team fixed PIG-2888 (performance) ● hRaven fixes: o translation between slot_millis and mb_millis
  • 16. HadOops found and fixed ● ViewFS can’t be used for public DistributedCache (DC) o HADOOP-10191, YARN-1542 ● getFileStatus RPC storm on public DC: o YARN-1771 ● No user-specified progress string in MR-AM UI task o MAPREDUCE-5550 ● Uberized jobs for scheduling small jobs great but ... o can you kill them? MAPREDUCE-5841 o size correctly for map-only? YARN-1190
  • 17. More HadOops Incident: a job blacklists nodes by logging terabytes ● need capping, but userlog.limit.kb loses valuable log tail ● RollingFileAppender for MR-AM/tasks MAPREDUCE-5672
  • 18. Diagnostics improvement App/Job/Task kill: ● DAG processors/users can say why o MAPREDUCE-5648, YARN-1551 ● MR-AM: “speculation”, “reducer preemption” o MAPREDUCE-5692, MAPREDUCE-5825 ● Thread Dumps o On task timeout: MAPREDUCE-5044 o On demand from CLI/UI: MAPREDUCE-5784, ...
  • 19. UX/UI improvements ● NameNode state and cluster stats ● App size in MB on RM Apps Page ● RM Scheduler UI improvements: queue descriptions, bugs min/max resource calc. ● Task Attempt state filtering in MR-AM HDFS-5928, YARN-1945, HDFS-5296...
  • 20. YARN reliability improvements ● Unhealthy nodes / positive feedback o drain containers instead of killing: YARN-1996 o don’t rerun maps when all reduces committed: MAPREDUCE-5817 ● RM crashes JIRA fixed either just internally or public o YARN-351, YARN-502
  • 21. MapReduce usability ● Memory.mb as a single tunable: Xmx, sort.mb auto-set o mb is optimized on case-by-case basis o MAPREDUCE-5785 ● Users want newer artifacts like guava: job.classloader o MAPREDUCE-5146 / 5751 / 5813 / 5814 ● Help users debug o thread dump on timeout, and on demand via UI o educate users about heap dumps on OOM and java profiling
  • 22. Multi-DC environment MR clients across latency boundaries. Submit fast: ● moving split calculation to MR-AM: MAPREDUCE-207 DSCP bit coloring for DataXfer ● HDFS-5175 ● Hftp (switched to Apache Commons HttpClient) DataXfer throttling (client RW)
  • 23. YARN: Beyond Java & MapReduce ● MR-AM and other REST API’s across the stack for easy integration in non-JVM tools. ● Vowpal Wabbit: (production) o no extra spanning tree step ● Spark (semi-production)
  • 24. Ongoing Project: Shared Cache MapReduce function shipping: computation->data ● Teams have jobs with 100’s of jars uploaded via libjars o Ideal: manage a jar repo on HDFS o Reference jars via DistributedCache instead of uploading o Real: currently hard to coordinate ● YARN-1492: Manage artifacts cache transparently ● Measure it: o YARN-1529: Localization overhead/cache hits NM metrics o MAPREDUCE-5696: Job localization counters
  • 25. Upcoming Challenges ● Reduce ops complexity: o grow to 10K+-node clusters o try to avoid adding more clusters ● Scalability limits for NN, RM ● NN heap sizes: large Java heap vs namespace splitting ● RPC QoS Issues ● NN startup: long initial block report processing ● Integrating non-MR frameworks with hRaven
  • 26. Future Work Ideas ● Productize RM HA and work-preserving restart ● HDFS Readable Standby NN ● Whole DAG in a single NN namespace ● Contribute to HDFS-5477 - Dedicated BM service ● NN SLA: fairshare for RPC queues: HADOOP-10598 ● Finer lock granularity in NN
  • 27. Summary: Hadoop 2 @ Twitter ● No JT bottleneck: Lightweight RM + MR-AM ● High compute density with flexible slots ● Reduced NN bottleneck using Federation ● HDFS HA removes the angst to try out new NN configs ● Much closer to upstream to consume/contribute fixes o Development on 2.3 branch ● Adopting new frameworks on YARN
  • 28. Conclusion Migrating 1000+ users/use cases is anything but trivial … however, ● Hadoop 2 made it worthwhile ● Hadoop 2 contributions: o 40+ patches committed o ~40 in review
  • 29. Thank you! Questions @JoinTheFlock about.twitter.com/careers @TwitterHadoop Catch up with us in person @LohitVijayaRenu @GeraShegalov
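
Slide 21 describes treating the container size (memory.mb) as the single tunable, with the JVM heap (Xmx) and the sort buffer derived from it. A minimal sketch of that idea is shown here. The function name, the 0.8 heap ratio and the 0.25 sort-buffer fraction are illustrative assumptions, not values taken from the talk.

```python
def derive_task_settings(container_mb, heap_ratio=0.8, sort_fraction=0.25):
    """Derive JVM heap and sort buffer sizes from one container-size knob.

    heap_ratio and sort_fraction are assumed values for illustration only.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    # Leave headroom between the container limit and the JVM heap.
    heap_mb = int(container_mb * heap_ratio)
    # Size the sort buffer as a fraction of the derived heap.
    sort_mb = int(heap_mb * sort_fraction)
    return {
        "memory_mb": container_mb,
        "java_opts": f"-Xmx{heap_mb}m",
        "sort_mb": sort_mb,
    }


if __name__ == "__main__":
    print(derive_task_settings(2048))
```

With this shape a user changes only the container size per job, and the dependent settings follow from it rather than being tuned separately.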

Editor's Notes

  1. With scale and growth like this, Twitter faced different kinds of challenges with Hadoop 1. The JobTracker (JT) used to run more than 20K jobs per day.
  2. The JobTracker caches a number of jobs per user and does not take the size of each job into account. Frequent JT full GCs.
  3. Reasoning behind why Twitter had to choose different namespaces. As of now, all DataNodes talk to all NameNodes; we have been thinking about different combinations where a subset of DataNodes can talk to different namespaces as well.
  4. We decided to build new Hadoop 2 clusters instead of worrying about migrating/upgrading Hadoop 1 clusters. This avoided huge downtime issues. Around phase two is when users started seeing the benefits of moving to Hadoop 2. Simple fixes went a long way in helping lots of customers.
  5. v1
  6. The Hadoop community made a lot of progress.
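
The two-step redirect from slide 14 (first to the right namespace, then to the active NameNode within it) can be sketched as follows. This is a simplified illustration, not Twitter's implementation: the mount table entries, namespace names and host names are all hypothetical, and a real deployment would read them from ViewFS and HA configuration.

```python
# Hypothetical mount table: logical path prefix -> namespace name.
MOUNT_TABLE = {
    "/user": "ns-user",
    "/logs": "ns-logs",
    "/tmp": "ns-tmp",
}

# Hypothetical HA state: namespace -> list of (NameNode host, state).
NAMENODES = {
    "ns-user": [("nn1.user.example", "standby"), ("nn2.user.example", "active")],
    "ns-logs": [("nn1.logs.example", "active"), ("nn2.logs.example", "standby")],
    "ns-tmp": [("nn1.tmp.example", "active"), ("nn2.tmp.example", "standby")],
}


def resolve_namespace(path, mount_table):
    """Step 1: pick the namespace whose mount point is the longest prefix of path."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/username" does not match "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point for {path}")
    return mount_table[best]


def active_namenode(namespace, namenodes):
    """Step 2: pick the NameNode currently marked active in that namespace."""
    for host, state in namenodes[namespace]:
        if state == "active":
            return host
    raise RuntimeError(f"no active NameNode for {namespace}")


def redirect(path, mount_table=MOUNT_TABLE, namenodes=NAMENODES):
    """Return the URI a client would be redirected to for a logical path."""
    namespace = resolve_namespace(path, mount_table)
    return f"hdfs://{active_namenode(namespace, namenodes)}{path}"


if __name__ == "__main__":
    print(redirect("/user/alice/data"))
```

Keeping the two lookups separate mirrors the slide: the mount table changes rarely, while the active/standby state can change on every failover.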