Crossing the Chasm


Sanjay Radia, Founder and Architect at Hortonworks, talks about Apache Hadoop and its uptake in the industry.

Speaker notes:
  • Sanjay Radia has been on the Hadoop project at Yahoo! for the last 4 years. These 4 years have been a blast. Very proud to be a Yahoo! employee: Yahoo!'s leadership has made Hadoop the success it is today (forward thinking in spinning out Hortonworks). I am very excited to be part of the team at Hortonworks and to continue growing Apache Hadoop as an open-source project.
  • Early customers at Yahoo!: a key idea was to provide Hadoop as a service; otherwise, adoption would not have been so rapid. The customer starts focusing on their application on Hadoop rather than figuring out how to get the capex and convince IT to deploy Hadoop.
  • Some early decisions, why they were made, and how well they worked. Choices and tradeoffs were based on immediate customer needs, time, resources, and operational needs. Goal: gain acceptance, evolve, and grow the customer base. Fundamental architectural improvements that cut across various issues. Conclude with the community effect.
  • Stronger security driven by multiple tenants in shared clusters. Security will be critical for new enterprise users, especially those with financial data. A 10 person-year effort completed entirely by Yahoo! and a major milestone for Apache Hadoop.
  • Early customers were happy with a new platform that let them solve problems they could not solve otherwise. But many new customers … Unused capacity is especially effective if applications peak at different times. Mix production and non-production jobs. Fairness is not a goal; customers are guaranteed the capacity they have paid for. The NN under federation also provides isolation.
  • Scaling is enough to meet the needs of very large customers, like FB and Yahoo!; other customers should be more than okay. Before you need more namespace, it will be there: a NN that stores only a partial namespace in memory.
  • Data: can I read what I wrote, and is the service available? When I asked one of the original authors of GFS if there were any decisions they would revisit: random writers. Simplicity is key. Raw disk: file systems take time to stabilize, so we can take advantage of ext4, XFS, or ZFS.
  • Pipeline – useful for slow appender
  • Certification once ready for release. HDFS (over 92% coverage): balancer, block replication, corruption, fsck, security, command lines, viewfs, faults and injection, checkpointing; federation testing ready to move to release testing. MR: dist cache, capacity scheduler, speculative execution, block listing, decommissioning, UI testing; failures and injections; scale testing with 4 TT per node. Performance: sort, scan, compression, GridMix v3, SLive, … HIT integration of the whole stack (HDFS, MR, Oozie, Pig, Hive, HCat). Sandbox: customer application validation; release vote for 23.0. Research: larger variety of jobs and load than production clusters.
  • So far I have been explaining how we have been addressing specific issues. But we have made fundamental changes to the architecture of both HDFS and MR, fundamental changes that cut across issues.
  • HDFS: the change was fairly simple. MR: a redesign that makes allocation of compute resources a general layer, separating the Scheduler from the App/Job Manager.
  • New append, security, federation: Yahoo!. RAID, HighTide: FB. HBase improvements: FB & Cloudera.
  • The fundamental design improvements will accelerate the rate of improvement. If there is a feature that customers need, it will be provided quickly.
  • Crossing the Chasm

    1. Crossing the Chasm: Hadoop for the Enterprise
       Sanjay Radia – Hortonworks Founder & Architect
       Formerly Hadoop Architect @ Yahoo!
       4 years @ Yahoo!
       @srr (@hortonworks)
       © Hortonworks Inc. 2011
       June 29, 2011
    2. Crossing the Chasm (Geoffrey A. Moore)
       Apache Hadoop grew rapidly, charting new territories in features, abstractions, APIs, scale, fault tolerance, multi-tenancy, operations …
       A small number of early customers who needed a new platform
       Provide Hadoop as a service to make adoption easy
       Today:
       Dramatic growth in adoption and customer base
       Growth of the Hadoop stack and applications
       New requirements and expectations
       [Adoption-curve diagram labels: Early Adopters, Early Majority, Late Majority, Mission Critical]
    3. Crossing the Chasm: Overview
       How the chasm is being crossed:
       Security
       SLAs & Predictability
       Scalability
       Availability & Data Integrity
       Backward Compatibility
       Quality & Testing
       Fundamental architectural improvements: Federation &
       Adapt to changing, sometimes unforeseen, needs
       Fuel innovation and rapid development
       The Community Effect
    4. Security
       Early gains:
       No authorization or authentication requirements
       Added permissions and passed client-side userid to the server (0.16)
       Addresses accidental deletes by another user
       Service authorization (0.18, 0.20)
       Issues: stronger authorization required
       Shared clusters – multiple tenants
       Critical data
       New categories of users (financial)
       SOX compliance
       Our response:
       Authentication using Kerberos (0.20.203)
       10 person-year effort by Yahoo!
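
       What Kerberos authentication means for client code, as a minimal sketch: the client logs in from a keytab before talking to the secured cluster. The principal name and keytab path below are placeholders, not values from the talk.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.security.UserGroupInformation;

          public class SecureClientLogin {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  // Enable Kerberos authentication and service-level authorization
                  // (cluster-wide, these settings normally live in core-site.xml).
                  conf.set("hadoop.security.authentication", "kerberos");
                  conf.set("hadoop.security.authorization", "true");
                  UserGroupInformation.setConfiguration(conf);

                  // Log in from a keytab; principal and path are placeholders.
                  UserGroupInformation.loginUserFromKeytab(
                          "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab");
                  System.out.println("Authenticated as "
                          + UserGroupInformation.getCurrentUser().getUserName());
              }
          }
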
    5. SLAs and Predictability
       Issue: customers are uncomfortable with shared clusters
       Customers traditionally plan for peaks with dedicated HW
       Dedicated clusters had poor utilization
       Response: Capacity Scheduler (0.20)
       Guaranteed capacities in a multi-tenant shared cluster – almost like dedicated hardware
       Each organization is given queue(s) with a guaranteed capacity, and:
       controls who is allowed to submit jobs to their queues
       sets the priorities of jobs within their queue
       creates sub-queues (0.21) for finer-grain control within their capacity
       Unused capacity is given to tasks in other queues
       Better than a private cluster – access to unused capacity when in a crunch
       Resource limits for tasks – deals with misbehaved apps
       Response: FairShare Scheduler (0.20)
       Focus is fair share of resources, but it does have pools
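
       A minimal sketch of the queue model from the job's point of view, assuming a hypothetical queue named "marketing"; the scheduler-side property key is quoted from memory of the 0.20-era Capacity Scheduler configuration and should be treated as illustrative.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapred.JobConf;

          public class QueueExample {
              public static void main(String[] args) {
                  // Job side: each tenant submits into its own queue and gets
                  // that queue's guaranteed share of the cluster's task slots.
                  JobConf job = new JobConf();
                  job.setQueueName("marketing");   // hypothetical queue name

                  // Scheduler side (normally capacity-scheduler.xml): give the
                  // queue 20% of cluster capacity; key name is illustrative.
                  Configuration sched = new Configuration(false);
                  sched.set("mapred.capacity-scheduler.queue.marketing.capacity", "20");

                  System.out.println("Job will be submitted to queue: " + job.getQueueName());
              }
          }
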
    6. Scalability
       Early gains:
       Simple design allowed rapid improvements
       Single master, namespace in RAM, simpler locking
       Cluster size improvements: 1K → 2K → 4K
       Vertical scaling: tuned GC + efficient memory usage
       Archive file system – reduces files and blocks (0.20)
       Current issues:
       Growth of files and storage limited by the single NN (0.20)
       Only an issue for very, very large clusters
       JobTracker does not scale beyond 30K tasks – needs a redesign
       Our response:
       RW locks in the NN (0.22)
       – complete rewrite of MR servers (JT, TT) – 100K tasks (0.23)
       Federation: horizontal scaling of the namespace – billion files (0.23)
       NN that keeps only part of the namespace in memory – trillion files (0.23.x)
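
       To make federation concrete, here is a small client-side sketch of a viewfs mount table that stitches two namenodes into one namespace; the host names and mount points are invented for illustration.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class FederatedView {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  // Client-side mount table: each mount point maps to the namenode
                  // that owns that part of the namespace (hosts/paths are placeholders).
                  conf.set("fs.defaultFS", "viewfs:///");
                  conf.set("fs.viewfs.mounttable.default.link./user",
                          "hdfs://nn1.example.com:8020/user");
                  conf.set("fs.viewfs.mounttable.default.link./projects",
                          "hdfs://nn2.example.com:8020/projects");

                  FileSystem fs = FileSystem.get(conf);
                  // The path below resolves through the mount table to nn2.
                  System.out.println(fs.exists(new Path("/projects/clickstream")));
              }
          }
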
    7. HDFS Availability & Data Integrity: Early Gains
       Simple design, Java, storage fault tolerance
       Java – saved us from pointer errors that lead to data corruption
       Simplicity – a subset of POSIX; random writers not supported
       Storage: rely on the OS's file system rather than use raw disk
       Storage fault tolerance: multiple replicas, active monitoring
       Single NameNode master
       Persistent state: multiple copies + checkpoints
       Restart on failure
       How well did it work?
       Lost 650 blocks out of 329M on 10 clusters with 20K nodes in 2009
       82% – abandoned open files (append bug, fixed in 0.21)
       15% – files created with a single replica (data reliability not needed)
       3% – due to roughly 7 bugs that were then fixed (0.21)
       Over the last 18 months: 22 failures on 25 clusters
       Only 8 would have benefitted from HA failover!! (0.23 failures per cluster-year)
       The NN is very robust and can take a lot of abuse
       The NN is resilient against overload caused by misbehaving apps
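
       The replica-based fault tolerance above is visible to applications. Here is a small sketch, with invented paths, of choosing a per-file replication factor (e.g. a single replica for regenerable data, as in the 15% figure above).

          import java.net.URI;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class ReplicationExample {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

                  // Scratch data that can be regenerated: one replica is enough.
                  Path scratch = new Path("/tmp/scratch-output");   // invented path
                  fs.setReplication(scratch, (short) 1);

                  // Important data: keep the usual replication factor of 3 (or more).
                  Path critical = new Path("/data/ledger");          // invented path
                  fs.setReplication(critical, (short) 3);

                  fs.close();
              }
          }
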
    8. HDFS Availability & Data Integrity: Response
       Data integrity:
       Append/flush/sync redesign (0.21)
       Pipeline recruits new replicas rather than just removing them on failures (0.23)
       Improving availability of the NN:
       Faster HDFS restarts – NN bounce in 20 minutes (0.23)
       Federation allows smaller NNs (0.23)
       Federation will significantly improve NN isolation and hence availability (0.23)
       Why did we wait this long for an HA NN?
       The failure rates did not demand making this a high priority
       Failover requires corner cases to be correctly addressed
       Correct fencing of shared state during failover is critical
       It can lead to corruption of data and reduce availability!!
       Many factors impact availability, not just failover
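
       One user-visible piece of the append/flush/sync redesign is the flush-to-pipeline call. Below is a small sketch, with an invented file path, of a writer pushing its buffered data out to the datanode pipeline before continuing (hflush() is the post-redesign name; earlier releases exposed a similar sync() call).

          import java.net.URI;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class DurableWriter {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

                  FSDataOutputStream out = fs.create(new Path("/logs/events.log")); // invented path
                  try {
                      out.writeBytes("event-1\n");
                      // Push buffered bytes to every datanode in the write pipeline so
                      // readers can see them even if this client dies afterwards.
                      out.hflush();
                  } finally {
                      out.close();
                  }
                  fs.close();
              }
          }
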
    9. HDFS Availability & Data Integrity: Response: HA NN
       Active work has started on the HA NN (failover)
       HA NN – detailed design (HDFS-1623)
       Community effort: HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073
       HA prototype work:
       Backup NN (0.21)
       Avatar NN (Facebook)
       HA NN prototype using Linux HA (Yahoo!)
       HA NN prototype with Backup NN and block report replicator (eBay)
       HA is the highest priority for 23.x
    10. MapReduce: Fault Tolerance and Availability
       Early gains: fault tolerance of tasks and compute nodes
       Current issue: loss of the job queue if the JobTracker is restarted
       Our response: designed with fault tolerance and availability
       HA Resource Manager (0.23.x)
       Loss of the Resource Manager – degraded mode; recover via restart or failover
       Apps continue with their current resources
       The App Manager can reschedule with current resources
       New apps cannot be submitted or launched, and new resources cannot be allocated
       Loss of an App Manager – recovers: the app is restarted and its state is recovered
       Loss of tasks and nodes – recovers, as in old MapReduce
    11. Backward Compatibility
       Early gains:
       Early success stemmed from a philosophy of "ship early and often", resulting in changing APIs
       Data and metadata compatibility were always maintained
       The early customers paid the price; current customers reap the benefits of more mature interfaces
       Issue: increased adoption leads to increased expectations of backward compatibility
    12. Backward Compatibility: Response
       Interface classification – audience and stability tags (0.21)
       Patterned on enterprise-quality software process
       Evolve interfaces but maintain backward compatibility
       Added newer, forward-looking interfaces; old interfaces maintained
       Test for compatibility:
       Run old jars of automation tests and real Yahoo! applications
       Applications adopting higher abstractions (Pig, Hive) are insulated from the lower primitive interfaces
       Wire compatibility (HADOOP-7347):
       Maintain compatibility with the current protocol (Java serialization)
       Adapters for addressing future discontinuity, e.g. a serialization or protocol change
       Moved to Protocol Buffers for the data transfer protocol
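
       The audience and stability tags are ordinary Java annotations in org.apache.hadoop.classification; the class names below are invented examples of how a public, stable API and a private, unstable helper would be marked.

          import org.apache.hadoop.classification.InterfaceAudience;
          import org.apache.hadoop.classification.InterfaceStability;

          /**
           * Public + Stable: downstream applications may depend on this API,
           * and it is expected to remain compatible across releases.
           */
          @InterfaceAudience.Public
          @InterfaceStability.Stable
          public class RecordFormat {
              // ... public, supported API ...
          }

          /**
           * Private + Unstable: internal plumbing that may change or disappear
           * without notice; not part of the compatibility contract.
           */
          @InterfaceAudience.Private
          @InterfaceStability.Unstable
          class RecordFormatInternals {
              // ... implementation detail ...
          }
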
    13. Testing & Quality
       Nightly testing:
       1200 automated tests on 30 nodes
       Against live data and live applications
       QE certification for release:
       Large variety and scale tests on 500 nodes
       Performance benchmarking
       QE HIT integration testing of the whole stack
       Release testing:
       Sandbox clusters – 3 clusters, each with 400–1K nodes
       Major releases: 2 months of testing on actual data; all production projects must sign off
       Research clusters – 6 clusters (non-revenue production jobs, 4K nodes)
       Major releases: a minimum of 2 months before moving to production
       0.25 to 0.5 million jobs per week
       If it clears research, then it is mostly fine in production
       Release:
       Production clusters – 11 clusters (4.5K nodes)
       Revenue generating, stricter SLAs
    14. Fundamental Architecture Changes that Cut Across Several Issues
       [Diagram: coupled, one-to-one layering – MapReduce (Job Manager, Resource Scheduler) over compute resources; HDFS Namesystem over storage resources]
       HDFS storage: mostly a separate layer – but one customer: one NN
       Federation generalizes the layer
       MapReduce – compute resource scheduling tightly coupled to MapReduce job management
       – separates the layers
    15. Fundamental Architecture Changes that Cut Across Several Issues
       [Diagram: layered, one-to-many – a generic Resource Scheduler over compute resources serving MR apps (including an MR app with a different version of the MR lib), MR tmp, MPI, and HBase; multiple HDFS Namesystems, including an alternate NN implementation, over storage resources]
       Scalability, isolation, availability
       Generic lower layer: first-class support for new applications on top – MR tmp, HBase, MPI, …
       Layering facilitates faster development of new work
       A NN that caches the namespace – a few months of work
       New implementations of the MR App Manager
       Compatibility: support multiple versions of MR
       Tenants upgrade at their own pace – crucial for shared clusters
    16. The Community Effect
       Some projects are done entirely by teams at Yahoo!, FB, or Cloudera, but several projects are joint work:
       Yahoo! & FB on NN scalability and concurrency, especially in the face of misbehaved apps
       Edits log v2 and refactoring of the edits log (Cloudera and Yahoo!/Hortonworks): HDFS-1073, 2003, 1557, 1926
       NN HA – Yahoo!/Hortonworks, Cloudera, FB, eBay: HDFS-1623, 1971, 1972, 1973, 1974, 1975, 2005
       Features to support HBase: FB, Cloudera, Yahoo!, and the HBase community
       Expect to see rapid improvements in the very near future:
       Further scalability – a NN that caches part of the namespace
       Improved IO performance – DN performance improvements
       Wire compatibility – wire protocols, operational improvements
       New App Managers for
       Continued improvement of management and operability
    17. Hadoop is Successfully Crossing the Chasm
       Hadoop is used in enterprises for revenue-generating applications
       Apache Hadoop is improving at a rapid rate:
       Addressing many issues, including HA
       Fundamental design improvements to fuel innovation
       The might of a large, growing developer community
       Battle tested on large clusters and a variety of applications
       At Yahoo!, Facebook, and the many other Hadoop customers
       Data integrity has been a focus from the early days
       A level of testing that even the large commercial vendors cannot match!
       Can you trust your data to anything less?
    18. Q & A
       Hortonworks @ Hadoop Summit:
       1:45pm: Next Generation Apache Hadoop MapReduce – Community track, by Arun Murthy
       2:15pm: Introducing HCatalog (Hadoop Table Manager) – Community track, by Alan Gates
       4:00pm: Large Scale Math with Hadoop MapReduce – Applications and Research track, by Tsz-Wo Sze
       4:30pm: HDFS Federation and Other Features – Community track, by Suresh Srinivas and Sanjay Radia
    19. About Hortonworks
       Mission: revolutionize and commoditize the storage and processing of big data via open source
       Vision: half of the world's data will be stored in Apache Hadoop within five years
       Strategy: drive advancements that make Apache Hadoop projects more consumable for the community, enterprises, and the ecosystem
       Make Apache Hadoop easy to install, manage, and use
       Improve Apache Hadoop performance and availability
       Make Apache Hadoop easy to integrate and extend
       © Hortonworks Inc. 2011
    20. Thank You.
       © Hortonworks Inc. 2011