
Yahoo! - Arun Murthy - Hadoop World 2010

Apache Hadoop in the Enterprise

Arun Murthy
Yahoo!

Published in: Technology

Transcript

  • 1. Hadoop at Yahoo! Ready for Business Arun C. Murthy Hadoop Team acm@yahoo-inc.com @acmurthy
  • 2. Existential Angst – Who Am I?
      • Yahoo
          – Lead, Hadoop Map-Reduce development team
      • Apache Hadoop
          – Full-time contributor since April 2006
          – Long-term Committer
          – Member of Apache Hadoop Project Management Committee
  • 3. Outline
      • Hadoop is mission critical for Yahoo
      • Making Hadoop enterprise-ready for Yahoo
  • 4. Hadoop at Yahoo!
      • Hadoop is mission critical for Yahoo
      • Making Hadoop enterprise-ready for Yahoo
  • 5. Hadoop at Yahoo! - Scale of Operation
      • Washington: 25000 nodes
      • Nebraska: 9000 nodes
      • Virginia: 10000 nodes
  • 6. The Team - Hadoop Development
  • 7. Hadoop Contributions [Figure: Hadoop Patches, Feb-06 to Oct-10; y-axis: patches, 0 to 4500; series: yahoo, powerset, other, facebook, cloudera]
  • 8. Hadoop at Yahoo! [Figure: Availability SLA (%), axis 99.2 to 99.9; categories: Production, Research, Sandbox; data labels: 99.85, 99.47, 99.69]
  • 9. Hadoop Usage at Yahoo!
      [Figure: Thousands of Servers and Petabytes over time; stages labeled Research, Science Impact, Daily Production ("Behind every click")]
      • Today
          – 44K Hadoop Servers
          – 170 PB Raw Hadoop Storage
          – 1M+ Monthly Hadoop Jobs
  • 10. Research to Mission Critical (timeline: 2006/2007, 2008, 2009, 2010)
      • Research workloads
          – Search
          – Advertising Modeling
          – Machine Learning
          – WebMap (production)
      • Revenue Systems
          – Strong Security
          – Improved SLAs
          – Small Jobs
      • Increased user base
          – Partitioned Namespaces
          – All data storage and processing
          – Mainstream
  • 11. Application Patterns
      • ETL / Warehouse
          – Data Processing and Aggregations
          – Data co-located in a shared environment
          – Batch processing of Data
          – Processing 100 Billion events per day
      • Analytics & Sciences
          – Modeling and Machine Learning Algorithms
          – Weekly/Monthly run of algorithms
      • Nearline Production
          – Derive insights from the production data
          – Feedback for optimizations in the production environments
          – Nearline production optimizations
  • 12. Getting there…
      • Hadoop is mission critical for Yahoo
      • Making Hadoop enterprise-ready for Yahoo
  • 13. Crossing the Chasm
      • Hadoop grew rapidly, charting new territory in features, abstractions, APIs, scale, …
          – Small team
          – Small number of early customers who needed a new platform
      • Today: dramatic growth in customer base
          – New requirements and expectations
      • Choices/tradeoffs in approaches – past and future
          – Scale
          – Backward Compatibility
          – Security
          – SLAs & Predictability
      Geoffrey A Moore*
  • 14. Evolution of Hadoop at Yahoo!
      • Utilization at Scale
      • Security
      • Multi-tenancy
      • Super-size
      [Figure: release timeline with marks at 04/09, 09/09, 04/10, 09/10 and 04/11; tracks: Yahoo Hadoop, Apache Hadoop; labels: hadoop-0.20, yhadoop-0.20, 20.S, Fred, hadoop-next; milestones: CapacityScheduler, Security, Multi-Tenancy, HDFS Federation]
      4400+ patches on hadoop-0.20!
  • 15. Utilization at Scale [Figure: release timeline from slide 14, 04/09 to 04/11]
  • 16. Motivation
      • Exploit shared storage
          – Unified namespace
      • Provide compute elasticity
          – Stop relying on private clusters (Hadoop on Demand)
      • Higher utilization at massive scale
  • 17. CapacityScheduler
      • Resource allocation in a shared, multi-tenant cluster
      • A cluster is funded by several organizations
      • Each organization gets queue allocations based on its funding
          – Guaranteed capacity
          – Control who can submit jobs to their queues
          – Set job priorities within their queues
  • 18. CapacityScheduler - Benefits
      • Improved utilization and latency
      • Almost dedicated hardware via virtual clusters
      • Significantly better utilization of excess capacity
          – Mix SLA critical and ad-hoc jobs
      • Predictable behavior
      [Figure: Normalized Throughput (Job, InputBytes, OutputBytes), Hadoop 20 vs. Hadoop 18; data label: 936 GB/hr]
      [Figure: Slot Utilization (%), MapSlot and ReduceSlot, Hadoop 20]
  • 19. Security [Figure: release timeline from slide 14, 04/09 to 04/11]
  • 20. Motivation
      • Revenue bearing applications
      • Strong security for data on multi-tenant clusters
          – Enable sharing clusters between disjoint kinds of users
      • Auditing
          – Access to data
          – Access and change management
  • 21. Secure Hadoop
      • Kerberos based strong authentication
          – Client-based authentication introduced in hadoop-0.16 (2007)
          – Authenticate RPC and HTTP connections
      • Multiple man-years of development
      • Integration with existing security mechanisms in Yahoo
      • Authorization
          – Use HDFS Authorization
          – Add MapReduce Authorization
  • 22. Multi-Tenancy [Figure: release timeline from slide 14, 04/09 to 04/11]
  • 23. Motivation
      • Ever growing demand
          – Consolidation for economies of scale and operability
          – Several clusters of 4k nodes each
      • Growing demand for stability
          – Isolation for applications
          – Shield framework from poorly designed or rogue applications
  • 24. Fred
      • Limits
          – Plug uptime vulnerabilities in the framework
          – Enforce best practices http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
      • Shield clusters from poorly written applications
          – NameNode exposed to applications performing too many metadata operations from the backend tasks
          – JobTracker (JT) exposed to applications with too many Counters
      • Shield users from each other
          – Isolation
      • Metrics and Monitoring
  • 25. Super-Sized Hadoop [Figure: release timeline from slide 14, 04/09 to 04/11]
  • 26. Motivation
      • Massive storage and processing
          – Hardware gets more capable per dollar
          – (4k 2011 nodes) = (12k 2009 nodes)
          – Continued consolidation for economics and operability
  • 27. HDFS Federation
      • Redefine the meaning of an HDFS cluster
          – Scale horizontally by having multiple NameNodes per cluster
      • Striping – Already in production
          – Shared storage pool
          – Shared namespace
      • Striping – Mount tables in production
          – Helps availability
          – Better isolation
      • 72 PB raw storage per cluster
          – 6000 nodes per cluster
          – 12 TB raw, per node
  • 28. Availability
      • Mission critical system
      • HDFS
          – Faster HDFS restarts
              • Full cluster restart in 75 min (down from 3-4 hrs)
              • NN bounce in 15 minutes
              • Part of the problem is the NameNode’s size – Federation will help
          – Steps towards automated failover
              • Backup NN
              • Move state off the NN server so we can fail over easily
          – Federation will significantly improve NN isolation, availability, & stability
      • Availability for Map-Reduce framework and jobs
          – Continued operation across HDFS restarts
  • 29. Conclusions
      • Yahoo Hadoop is behind every click at Yahoo!
          – Stable, scalable and secure
          – The most tested and reliable version of Hadoop – 4400 patches!
      • Yahoo continues to be the primary contributor to Apache Hadoop
  • 30. Questions? Thanks!
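
Slide 17 describes the CapacityScheduler model: organizations fund a shared cluster, each gets a queue with guaranteed capacity in proportion to its funding, and queue owners control who may submit jobs. The sketch below illustrates that idea only. It is not the actual CapacityScheduler algorithm; the `Queue` class, the function names, and the rule for handing out spare slots (largest guarantee first) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical queue owned by one funding organization."""
    name: str
    funding: float                       # relative funding share (assumed unit)
    demand: int                          # slots the queue's jobs currently want
    allowed_users: frozenset = frozenset()


def guaranteed_slots(queues, total_slots):
    """Split the cluster's slots in proportion to each queue's funding."""
    total_funding = sum(q.funding for q in queues)
    return {q.name: int(total_slots * q.funding / total_funding) for q in queues}


def allocate(queues, total_slots):
    """Give each queue up to its guarantee, then lend out any spare slots.

    Assumption: spare capacity goes to queues with unmet demand,
    visiting the queue with the largest guarantee first.
    """
    guarantee = guaranteed_slots(queues, total_slots)
    alloc = {q.name: min(q.demand, guarantee[q.name]) for q in queues}
    spare = total_slots - sum(alloc.values())
    for q in sorted(queues, key=lambda q: guarantee[q.name], reverse=True):
        if spare == 0:
            break
        extra = min(spare, q.demand - alloc[q.name])
        alloc[q.name] += extra
        spare -= extra
    return alloc


def can_submit(queue, user):
    """Queue owners control who may submit jobs to their queue."""
    return user in queue.allowed_users
```

With three made-up queues funded 60/30/10 on a 100-slot cluster, where the 30-share queue only wants 10 slots, `allocate` leaves that queue at 10 and lends its 20 idle slots to the busiest funded queue. This is the "better utilization of excess capacity" point from slide 18 in miniature.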

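The sizing figures on slides 26 and 27 can be checked with simple arithmetic. The sketch below reproduces the 72 PB raw figure from 6000 nodes at 12 TB each and the 3x consolidation implied by "(4k 2011 nodes) = (12k 2009 nodes)". The decimal unit conversion and the replication factor of 3 (the HDFS default) are assumptions; the slides state neither, and the function names are hypothetical.

```python
NODES_PER_CLUSTER = 6000   # from slide 27
RAW_TB_PER_NODE = 12       # from slide 27
TB_PER_PB = 1000           # assumption: decimal units (1 PB = 1000 TB)
REPLICATION = 3            # assumption: HDFS default replication, not stated on the slides


def raw_capacity_pb(nodes, tb_per_node):
    """Raw storage of a cluster in petabytes."""
    return nodes * tb_per_node / TB_PER_PB


def usable_capacity_pb(raw_pb, replication=REPLICATION):
    """Rough usable capacity if every block is stored `replication` times."""
    return raw_pb / replication


def consolidation_ratio(old_nodes, new_nodes):
    """How many old nodes one new node replaces."""
    return old_nodes / new_nodes


raw = raw_capacity_pb(NODES_PER_CLUSTER, RAW_TB_PER_NODE)
print(raw)                               # 72.0 PB raw, matching slide 27
print(usable_capacity_pb(raw))           # 24.0 PB under the assumed replication
print(consolidation_ratio(12000, 4000))  # 3.0, from slide 26
```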