Hadoop 2 @ Twitter, Elephant Scale


Speaker notes:
  • With scale and growth like this, Twitter faced a different kind of challenge with Hadoop 1. The JobTracker used to run more than 20K jobs per day.
  • The JobTracker caches a fixed number of jobs per user and does not take job size into account, which led to frequent JT full GCs.
  • Reasoning behind why Twitter chose separate namespaces. As of now all DataNodes talk to all NameNodes; we have been thinking about combinations where subsets of DataNodes talk to different namespaces as well.
  • We decided to build new Hadoop 2 clusters instead of migrating/upgrading the Hadoop 1 clusters, which avoided huge downtime issues. Around phase two is when users started seeing the benefits of moving to Hadoop 2. Simple fixes went a long way in helping lots of customers.
  • The Hadoop community made a lot of progress.

    1. Hadoop 2 @Twitter, Elephant Scale. Lohit VijayaRenu (@lohitvijayarenu), Gera Shegalov (@gerashegalov). @TwitterHadoop
    2. About this talk: Share @twitterhadoop's efforts, experience, and learnings in moving a thousand users and multi-petabyte workloads from Hadoop 1 to Hadoop 2.
    3. Use cases
       Personalization: Graph analysis, recommendations, trends, user/topic modeling
       Analytics: A/B testing, user behavior analysis, API analytics
       Growth: Network Digest, people recommendations, email
       Revenue: Engagement prediction, ad targeting, ads analytics, marketplace optimization
       Nielsen Twitter TV Rating: Tweet impressions processing
       Backups & Scribe logs: MySQL backups, Manhattan backups, front-end scribe logs
       Many more...
    4. Hadoop and data pipeline: [architecture diagram] Hadoop clusters for real time, processing, warehouse, cold storage, backups, and test, connected to TFE, Search, Ads, Partners, MySQL, HBase, Vertica, Manhattan, and SVN/Git.
    5. Elephant Scale
       ➔ Tens of thousands of Hadoop servers (mix of hardware)
       ➔ Hundreds of thousands of disk drives
       ➔ A few hundred PB of data stored in HDFS
       ➔ Hundreds of thousands of daily Hadoop jobs
       ➔ Tens of millions of daily Hadoop tasks
       Individual cluster stats:
       ➔ More than 3500 nodes
       ➔ 30-50+ PB of data stored in HDFS
       ➔ 35K RPC/second on NameNodes
       ➔ 30K+ jobs per day
       ➔ 10M+ tasks per day
       ➔ 6PB+ of data crunched per day
    6. Hadoop 1 Challenges (Q4-2012)
       Growth: Supporting Twitter growth, requests for new features on an older branch, new Java
       Scalability: NameNode files/blocks, NN operations, GC pauses, checkpointing; JobTracker GC pauses, task assignment
       Reliability: SPOF NN and JT, NameNode restart delays
       Efficiency: Slot utilization, QoS, multi-tenancy, new features & frameworks
       Maintenance: Old codebase, numerous issues already fixed in later versions, dev branch
    7. Hadoop 2 Configuration (Q1-2013): [cluster diagram] NodeManagers and DataNodes co-located on worker nodes; YARN ResourceManager; JournalNodes for HA; federated namespaces for user, logs, tmp, and Trash; ViewFS, HDFS Balancer, admin tools, hRaven, metrics, and alerts on top.
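To make the federated, ViewFS-fronted layout above concrete, here is a minimal sketch of a client-side mount table built with the Hadoop Configuration API. The logical cluster name tw-cluster1 and the NameNode host names are hypothetical placeholders, not Twitter's actual configuration; in practice the same keys live in core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Clients see a single logical file system ...
    conf.set("fs.defaultFS", "viewfs://tw-cluster1");
    // ... whose top-level directories map onto separate federated namespaces.
    conf.set("fs.viewfs.mounttable.tw-cluster1.link./user",
             "hdfs://user-nn.dc.example.com:8020/user");
    conf.set("fs.viewfs.mounttable.tw-cluster1.link./logs",
             "hdfs://logs-nn.dc.example.com:8020/logs");
    conf.set("fs.viewfs.mounttable.tw-cluster1.link./tmp",
             "hdfs://tmp-nn.dc.example.com:8020/tmp");

    // Paths resolve through the mount table, so application code never
    // needs to know which NameNode owns which subtree.
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.makeQualified(new Path("/user/alice")));
  }
}
```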
    8. Hadoop 2 Migration (Q2-Q4 2013)
       Phase 1: Testing
       ➔ Apache 2.0.3 branch
       ➔ New hardware*, new OS and JVM
       ➔ Benchmarks and user jobs (lots of them…)
       ➔ Dependent component updates
       ➔ Data movement between different versions
       ➔ Metrics, alerts, and tools
       Phase 2: Semi-production
       ➔ Production use cases running in 2 clusters in parallel
       ➔ Tuning/parameter updates and learnings
       ➔ Started contributing fixes back to the community
       ➔ Educating users about the new version and changes
       ➔ Benefits of Hadoop 2
       Phase 3: Production
       ➔ Stable Apache 2.0.5 release with many fixes and backports
       ➔ Multiple internal releases
       ➔ Template for new clusters
       ➔ Ready to roll the Apache 2.3 release
       *http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
    9. CPU Utilization: [charts] Hadoop 1 CPU utilization for one day (45% peaks) vs. Hadoop 2 CPU utilization for one day (85% peaks).
    10. Memory Utilization: [charts] Hadoop 1 memory utilization for one day (68% peaks) vs. Hadoop 2 memory utilization for one day (96% peaks).
    11. Migration Challenge: web-based FS. Need a web-based FS to deal with H1/H2 interactions.
       ● Hftp, based on cross-DC LogMover experience
       ● Apps broken because no FileNotFoundException was thrown for non-existing paths (HDFS-6143)
       ● Faced challenges with cross-version checksums
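A minimal sketch of the defensive client code this slide implies: reading an input over Hftp and handling the missing-path case explicitly. The host name and path are hypothetical placeholders; HDFS-6143 tracks making the web file systems surface a clean FileNotFoundException instead of a confusing downstream failure.

```java
import java.io.FileNotFoundException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HftpReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hftp lets a Hadoop 1 client read from a Hadoop 2 cluster (and vice
    // versa) over HTTP, sidestepping RPC wire-compatibility problems.
    Path src = new Path("hftp://h2-nn.dc.example.com:50070/logs/2013/04/01/part-00000");
    FileSystem fs = src.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(src)) {
      in.read(); // web file systems may only contact the server on first read
    } catch (FileNotFoundException e) {
      System.err.println("Input does not exist: " + src);
    }
  }
}
```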
    12. Migration Challenge: hard-coded FS. Thousands of occurrences of hdfs://${NN}/path and absolute URIs.
       ● For cluster1, dial the hdfs://hadoop-cluster1-nn.dc CNAME
       ● For cluster2, dial …
       Ideal: use logical paths and viewfs as the defaultFS. More realistic and faster:
       ● HDFSCompatibleViewFS (HADOOP-9985)
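To illustrate the difference (with made-up cluster and user names), here is a sketch contrasting a hard-coded URI with a logical path whose scheme and authority come from fs.defaultFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogicalPathSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hard-coded: breaks the moment data moves to another cluster/namespace.
    Path hardCoded =
        new Path("hdfs://hadoop-cluster1-nn.dc.example.com:8020/user/alice/input");

    // Logical: scheme and authority are filled in from fs.defaultFS (ideally
    // viewfs://), so the same code runs unchanged against H1 or H2 clusters.
    Path logical = new Path("/user/alice/input");
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Resolved:   " + fs.makeQualified(logical));
    System.out.println("Hard-coded: " + hardCoded);
  }
}
```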
    13. Migration Challenge: Interoperability. Migration in progress: an H1 job requires input from H2.
       ● The hftp://OMGwhatNN/has/my/path problem
       ● Ideal: use viewfs on H1, resolving to the correct H2 NN
       ● Realistic: see "hard-coded FS" above
       ● Even if you know OMGwhatNN, is it active?
    14. [diagram] An H1 client dials a cluster CNAME fronting multiple namespaces, each with an Active and a Standby NameNode. Load the client-side mount table on the server side to:
       1. redirect to the right namespace
       2. redirect to the active NN within that namespace
    15. Migration: Tools and Ecosystem
       ● Port/recompile/package:
         o Data Access Layer/HCatalog
         o Pig
         o Cascading/Scalding
         o ElephantBird
         o hadoop-lzo
       ● PIG-3913 (local mode counters)
       ● Analytics team fixed PIG-2888 (performance)
       ● hRaven fixes:
         o translation between slot_millis and mb_millis
    16. HadOops found and fixed
       ● ViewFS can't be used for the public DistributedCache (DC)
         o HADOOP-10191, YARN-1542
       ● getFileStatus RPC storm on the public DC:
         o YARN-1771
       ● No user-specified progress string in the MR-AM task UI
         o MAPREDUCE-5550
       ● Uberized jobs are great for scheduling small jobs, but ...
         o can you kill them? MAPREDUCE-5841
         o do they size correctly for map-only? YARN-1190
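For context on the public DistributedCache items above, a minimal sketch of how a job typically registers a cache file. The hdfs:// URI is a placeholder; the point of HADOOP-10191/YARN-1542 is that a viewfs:// URI in this position did not work for the public cache.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-demo");
    // Files registered here are localized on the NodeManagers before tasks
    // run. A world-readable file on HDFS lands in the *public* cache and is
    // shared across users; with ViewFS as the defaultFS this used to break.
    job.addCacheFile(new URI("hdfs://nn.example.com:8020/shared/dictionaries/stopwords.txt"));
  }
}
```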
    17. More HadOops. Incident: a job blacklists nodes by logging terabytes.
       ● Need capping, but userlog.limit.kb loses the valuable log tail
       ● RollingFileAppender for the MR-AM and tasks: MAPREDUCE-5672
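A sketch of the two knobs in play, set programmatically for illustration (they would normally go into mapred-site.xml or the job config). The backup-count property names are my recollection of the MAPREDUCE-5672 rolling-appender work and should be checked against your release's mapred-default.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class TaskLogCapSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Cap per-task user logs so one runaway job can't fill local disks and
    // blacklist nodes; with a plain cap, valuable log data is lost once the
    // limit is hit.
    conf.setInt("mapreduce.task.userlog.limit.kb", 10 * 1024);
    // With a rolling appender (MAPREDUCE-5672), keep a few rolled segments so
    // useful context survives. (Assumed property names; verify per release.)
    conf.setInt("yarn.app.mapreduce.task.container.log.backups", 3);
    conf.setInt("yarn.app.mapreduce.am.container.log.backups", 3);
    System.out.println("log limit kb = " + conf.get("mapreduce.task.userlog.limit.kb"));
  }
}
```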
    18. Diagnostics improvement. App/Job/Task kill:
       ● DAG processors/users can say why
         o MAPREDUCE-5648, YARN-1551
       ● MR-AM: "speculation", "reducer preemption"
         o MAPREDUCE-5692, MAPREDUCE-5825
       ● Thread dumps
         o On task timeout: MAPREDUCE-5044
         o On demand from CLI/UI: MAPREDUCE-5784, ...
    19. UX/UI improvements
       ● NameNode state and cluster stats
       ● App size in MB on the RM Apps page
       ● RM Scheduler UI improvements: queue descriptions, bugs in min/max resource calculation
       ● Task attempt state filtering in the MR-AM
       HDFS-5928, YARN-1945, HDFS-5296, ...
    20. YARN reliability improvements
       ● Unhealthy nodes / positive feedback
         o drain containers instead of killing them: YARN-1996
         o don't rerun maps when all reducers have committed: MAPREDUCE-5817
       ● RM crash JIRAs, fixed either just internally or publicly
         o YARN-351, YARN-502
    21. MapReduce usability
       ● memory.mb as a single tunable: Xmx and sort.mb auto-set
         o memory.mb is optimized on a case-by-case basis
         o MAPREDUCE-5785
       ● Users want newer artifacts like Guava: job.classloader
         o MAPREDUCE-5146 / 5751 / 5813 / 5814
       ● Help users debug
         o thread dump on timeout, and on demand via the UI
         o educate users about heap dumps on OOM and Java profiling
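A sketch of what these knobs look like from a job's point of view; the sizes chosen are arbitrary examples. With MAPREDUCE-5785-style behavior the container memory is the single tunable, and -Xmx plus io.sort.mb are derived from it instead of being set (and mis-set) independently by each user.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UsabilityKnobsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Single tunable: container size per task type.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);

    // Isolate user code (e.g. a newer Guava) from Hadoop's own classpath.
    conf.setBoolean("mapreduce.job.classloader", true);

    Job job = Job.getInstance(conf, "usability-knobs-demo");
    System.out.println(job.getConfiguration().get("mapreduce.map.memory.mb"));
  }
}
```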
    22. Multi-DC environment. MR clients across latency boundaries.
       Submit fast:
       ● moving split calculation to the MR-AM: MAPREDUCE-207
       DSCP bit coloring for DataXfer:
       ● HDFS-5175
       ● Hftp (switched to Apache Commons HttpClient)
       DataXfer throttling (client RW)
    23. YARN: Beyond Java & MapReduce
       ● MR-AM and other REST APIs across the stack for easy integration with non-JVM tools
       ● Vowpal Wabbit (production)
         o no extra spanning-tree step
       ● Spark (semi-production)
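A minimal sketch of what "REST APIs for non-JVM tools" buys you, shown in Java only for consistency with the other examples: any HTTP client can query the ResourceManager's web services. The host and port are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RmRestSketch {
  public static void main(String[] args) throws Exception {
    // The RM web services under ws/v1/cluster return JSON describing cluster
    // metrics, queues, and applications.
    URL url = new URL("http://rm.example.com:8088/ws/v1/cluster/apps?states=RUNNING");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```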
    24. Ongoing Project: Shared Cache. MapReduce function shipping: computation -> data.
       ● Teams have jobs with hundreds of jars uploaded via libjars
         o Ideal: manage a jar repo on HDFS
         o Reference jars via the DistributedCache instead of uploading
         o Real: currently hard to coordinate
       ● YARN-1492: manage the artifact cache transparently
       ● Measure it:
         o YARN-1529: localization overhead / cache-hit NM metrics
         o MAPREDUCE-5696: job localization counters
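A sketch of the "reference jars instead of uploading" idea; the repo paths and jar names are hypothetical. Jars that already live on HDFS are put on the task classpath through the DistributedCache, so repeated -libjars uploads of the same artifacts are avoided and NodeManager-local caching can kick in.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class SharedJarRepoSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "shared-jar-demo");
    // Point tasks at jars in a shared HDFS repo; they are localized via the
    // DistributedCache rather than re-uploaded from the client on each submit.
    job.addFileToClassPath(new Path("hdfs://nn.example.com:8020/repo/jars/parsing-lib-1.2.3.jar"));
    job.addFileToClassPath(new Path("hdfs://nn.example.com:8020/repo/jars/thrift-protos-0.9.jar"));
  }
}
```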
    25. Upcoming Challenges
       ● Reduce ops complexity:
         o grow to 10K+-node clusters
         o try to avoid adding more clusters
       ● Scalability limits for NN, RM
       ● NN heap sizes: large Java heap vs namespace splitting
       ● RPC QoS issues
       ● NN startup: long initial block report processing
       ● Integrating non-MR frameworks with hRaven
    26. Future Work Ideas
       ● Productize RM HA and work-preserving restart
       ● HDFS readable Standby NN
       ● Whole DAG in a single NN namespace
       ● Contribute to HDFS-5477: dedicated Block Manager service
       ● NN SLA: fair share for RPC queues: HADOOP-10598
       ● Finer lock granularity in the NN
    27. Summary: Hadoop 2 @ Twitter
       ● No JT bottleneck: lightweight RM + MR-AM
       ● High compute density with flexible slots
       ● Reduced NN bottleneck using Federation
       ● HDFS HA removes the angst of trying out new NN configs
       ● Much closer to upstream, to consume/contribute fixes
         o Development on the 2.3 branch
       ● Adopting new frameworks on YARN
    28. Conclusion. Migrating 1000+ users/use cases is anything but trivial … however,
       ● Hadoop 2 made it worthwhile
       ● Hadoop 2 contributions:
         o 40+ patches committed
         o ~40 in review
    29. Thank you! Questions? @JoinTheFlock, about.twitter.com/careers, @TwitterHadoop. Catch up with us in person: @LohitVijayaRenu, @GeraShegalov