Cloud-Friendly Hadoop and Hive - StampedeCon 2013


At the StampedeCon 2013 Big Data conference in St. Louis, Shrikanth Shankar, Head of Engineering at Qubole, presented Cloud-Friendly Hadoop and Hive. The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks like Hive, Oozie, and Sqoop are becoming tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, he describes how working through these challenges taught us to extend Hadoop and Hive to exploit the new tradeoffs. The use cases presented show how the challenges met at large scale at Facebook are now making it extremely easy for a significantly smaller end user to leverage these technologies in the cloud.

Published in: Technology

  1. CLOUD FRIENDLY HADOOP/HIVE
     Shrikanth Shankar | Qubole VP of Engineering
     Thursday, July 25, 2013
  2. INTRODUCTION
     • Hadoop has revolutionized big data processing
     • Becoming the de facto platform for new data projects
     • Started as file system (HDFS) + programming framework (Map-Reduce). An ecosystem of projects has sprung up on top of Hadoop
        • Hive, Pig, Cascading etc. - Simple ways of processing data
        • Sqoop, Flume etc. - Data movement into and out of HDFS
        • Oozie, Azkaban etc. - Workflow scheduling
     • However, these systems were all designed with an on-premise architecture in mind.
     • The cloud is different enough - some things can/should change.
  3. ON-PREMISE HADOOP ARCHITECTURE
     [Diagram. Labels: Hadoop Cluster (Namenode, JobTracker, DN/TT nodes), IT control, Relational systems (Hive metastore etc.), End Users]
  4. HADOOP ON-PREMISE
     • Usually deployed on bare-metal nodes*
     • HDFS is store of choice (3-way replication for safety). Locality of data access is a big design point
     • Clusters are mostly static - new machines are added on IT schedule*
     • Static clusters mean users can focus on their tasks (MR jobs, Hive queries) and not on cluster management
     • IT bears the burden of managing clusters
  5. HADOOP ON-PREMISE
     • Partitioning of resources
        • Static partitioning with different clusters for Batch and Interactive workloads
        • Within a cluster load balancing is done by the JT scheduler
     • Capex costs are significant
     • IT controlled - requires an Ops team (Hadoop ops, Sysadmin etc.)
  7. CLOUD COMPONENTS
     • Object stores
     • Ephemeral compute nodes
     • Block stores
     • PaaS offerings (RDS, etc.)
  8. INFRASTRUCTURE CHARACTERISTICS
     • Running in a VM
        • Not that big a deal usually - except plan for performance variability
     • No locality information
     • Nodes are ephemeral - if you lose a node you will lose data on the node
     • AZ-wide correlated failures are to be expected. Region-wide failures are possible (but rare)
     • High-capacity object stores with high cross-sectional bandwidth
        • High latency, variability in perf, REMOTE*. Not POSIX compliant
     • Persistent block stores
        • REMOTE, variable perf
  9. INFRASTRUCTURE CHARACTERISTICS
     • ELASTIC
        • Add 100 nodes on demand in a few minutes
     • Costs are Op-ex (largely).
        • Nodes are billed per hour (CPU + Disk), storage is billed per GB
        • Cost management is a key challenge
        • Some interesting payment choices (On-demand, Spot, Reserved)
  10. LET'S PUT THESE WORLDS TOGETHER
  11. STORAGE
     • From a cost perspective, using HDFS for long-term storage means you pay for both CPU and disk.
     • It's also more expensive to make HDFS reliable (cross AZ, maybe even cross Region?)
     • Using an object store allows you to pay only for storage
     • With object stores you see latency issues since data is remote
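
The cost argument on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: every price, the data volume, and the per-node disk size are hypothetical placeholders, not figures from the talk or from any provider's price list. It assumes HDFS nodes must stay up all month just to hold the data.

```python
import math

def hdfs_monthly_cost(data_gb, replication, disk_gb_per_node,
                      node_hourly_price, hours_per_month=730):
    """Cost of keeping data in HDFS: whole nodes (CPU + disk) stay up to hold it."""
    raw_gb = data_gb * replication                 # e.g. 3-way replication
    nodes = math.ceil(raw_gb / disk_gb_per_node)   # nodes needed for disk alone
    return nodes * node_hourly_price * hours_per_month

def object_store_monthly_cost(data_gb, price_per_gb_month):
    """Cost of keeping the same data in an object store: storage only."""
    return data_gb * price_per_gb_month

# Hypothetical inputs: 10 TB of data, 1.5 TB of disk per node,
# $0.50 per node-hour, $0.10 per GB-month.
hdfs_cost = hdfs_monthly_cost(10_000, 3, 1_500, 0.50)
object_cost = object_store_monthly_cost(10_000, 0.10)
print(f"HDFS: ${hdfs_cost:,.0f}/month, object store: ${object_cost:,.0f}/month")
```

The comparison ignores request and transfer charges and the latency penalty the slide mentions, so it only illustrates the "pay for CPU and disk" versus "pay only for storage" point.
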
  12. STORAGE
     • But node storage is still needed when jobs and queries are active
        • For intermediate job results (not all results should go back to S3 - e.g. stage outputs in Hive)
        • For intermediate data (mapper output)
     • Makes scaling nodes challenging
     • Also, since HDFS performance is better, you may want to move remote data to HDFS before accessing it
  13. COMPUTE AND CLUSTERS
     • If you don't need Hadoop for persistent storage - when do you need a cluster?
     • Bring them up on demand - maybe for every job?
     • But that can be expensive - no multiplexing
     • Ideally you want to share Hadoop clusters as much as possible. Shut down the cluster when it is not being used
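
Why a cluster per job is expensive can be seen with hourly billing. The sketch below assumes, as a simplification, that each cluster is billed in whole hours (partial hours rounded up) and that jobs on the shared cluster run back to back; the function names and the job durations are hypothetical.

```python
import math

def per_job_node_hours(job_minutes, nodes):
    """Each job starts its own cluster, and each cluster pays for whole hours."""
    return sum(math.ceil(m / 60) * nodes for m in job_minutes)

def shared_node_hours(job_minutes, nodes):
    """Jobs share one cluster that is shut down after the last job finishes."""
    return math.ceil(sum(job_minutes) / 60) * nodes

jobs = [10, 15, 5, 20]  # hypothetical job durations in minutes
print(per_job_node_hours(jobs, nodes=10))  # four clusters, one billed hour each
print(shared_node_hours(jobs, nodes=10))   # one cluster, 50 minutes of work
```

Under these assumptions the four short jobs cost four billed hours per node when run on separate clusters and one billed hour per node when multiplexed on a shared one.
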
  14. COMPUTE AND CLUSTERS
     • If the cluster is dynamic and you need sharing - how do you ‘discover’ it?
     • How about cluster sizing?
        • Static is a leftover from on-premise
        • Be dynamic on the cloud. Hard for end users to do manually
  15. COMPUTE AND CLUSTER
     • Adding nodes needs to be done based on load
        • E.g. most of the time jobs need < 5 nodes. A batch job comes in that needs 100 nodes. We should expand the cluster (for as long as needed)
     • Removing nodes is trickier
        • If we lose intermediate results, lots of work will be lost.
        • Job 1 uses 100 nodes and produces data spread over all of them. Job 2 consumes the results but only needs 10 nodes. How do you give up 90 nodes?
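
The asymmetry between adding and removing nodes can be sketched as a resize planner. This is a minimal illustration under stated assumptions, not the approach described in the talk: `Node`, `holds_live_data`, and `plan_resize` are hypothetical names, and the sketch assumes the cluster can tell which nodes still hold intermediate data that a pending job needs.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    holds_live_data: bool  # hypothetical flag: intermediate output still needed

def plan_resize(nodes, nodes_needed, min_nodes=1):
    """Return (nodes_to_add, names_safe_to_remove) for the current load."""
    current = len(nodes)
    if nodes_needed > current:
        # Scaling up is the easy direction: just request the difference.
        return nodes_needed - current, []
    # Scaling down: only give up nodes whose data no pending job depends on.
    surplus = current - max(nodes_needed, min_nodes)
    removable = [n.name for n in nodes if not n.holds_live_data]
    return 0, removable[:surplus]

# Job 1 left results on all 100 nodes; Job 2 needs only 10 nodes.
cluster = [Node(f"n{i}", holds_live_data=True) for i in range(100)]
print(plan_resize(cluster, nodes_needed=10))  # nothing is safe to remove yet
```

In the Job 1 / Job 2 case from the slide, the planner refuses to release any node until the data is consumed or moved elsewhere, which is exactly the difficulty the slide raises.
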
  16. COMPUTE AND CLUSTER
     • Pricing choices are interesting
        • E.g. spot nodes average half the price of an on-demand node
        • But if the price spikes you lose all the spot nodes at once
        • Hadoop fault tolerance can retry failed jobs (but expensive) - what about data loss when you lose all the spot nodes?
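
The spot-pricing tradeoff can be illustrated with a mixed cluster. The sketch below takes the "half the price" figure from the slide as the default discount; the on-demand price, the cluster size, and the spot fraction are hypothetical, and mixing node types is one possible way to limit the damage of a price spike, not necessarily the one the talk proposes.

```python
def blended_cluster(total_nodes, spot_fraction, on_demand_price, spot_discount=0.5):
    """Return (hourly_cost, nodes_surviving_a_spot_loss) for a mixed cluster."""
    spot_nodes = round(total_nodes * spot_fraction)
    on_demand_nodes = total_nodes - spot_nodes
    spot_price = on_demand_price * spot_discount
    hourly_cost = on_demand_nodes * on_demand_price + spot_nodes * spot_price
    # If the spot price spikes, every spot node disappears at once.
    return hourly_cost, on_demand_nodes

# Hypothetical: 100 nodes at $0.40 per on-demand node-hour.
all_on_demand, _ = blended_cluster(100, 0.0, 0.40)
mixed_cost, survivors = blended_cluster(100, 0.8, 0.40)
print(f"on-demand only: ${all_on_demand:.2f}/h")
print(f"80% spot: ${mixed_cost:.2f}/h, {survivors} nodes survive a price spike")
```

The saving grows with the spot fraction, but so does the amount of data and work lost in a spike, which is the data-loss question the slide ends on.
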
  17. END USER EXPERIENCE
     • The cloud isn't just about cost - it's also about agility. To allow this we need to focus on the end user experience
     • End users would prefer to focus on higher-level APIs
        • E.g. run a Hadoop job or a Hive query - specifics of clusters should be hidden from them
     • Some things should be persistent (log files, results, ...)
        • They get this for free on premise
  18. BETTER END STATE
     • IT/dev ops/users should set high level controls
        • Usage governance (max cluster size, max bill, cpu hours used per month etc.)
     • End users should focus at the level they understand
     • Smart software should bridge the gap
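
One way to picture "high level controls" is a policy object that software consults before resizing a cluster. The sketch below is an assumption-laden illustration: `UsagePolicy` and `allowed_nodes` are hypothetical names, only two of the controls listed on the slide (max cluster size and max bill) are modeled, and the budget check looks just one billed hour ahead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UsagePolicy:
    max_cluster_size: int    # set by IT / dev ops
    max_monthly_bill: float  # set by IT / dev ops

def allowed_nodes(policy, requested_nodes, spent_so_far, node_hourly_price):
    """Clamp a resize request to what the policy permits for the next hour."""
    budget_left = max(policy.max_monthly_bill - spent_so_far, 0.0)
    affordable = int(budget_left // node_hourly_price)
    return min(requested_nodes, policy.max_cluster_size, affordable)

policy = UsagePolicy(max_cluster_size=50, max_monthly_bill=1000.0)
print(allowed_nodes(policy, requested_nodes=100, spent_so_far=0.0, node_hourly_price=0.5))
print(allowed_nodes(policy, requested_nodes=100, spent_so_far=995.0, node_hourly_price=0.5))
```

End users keep submitting jobs and queries; the clamp is applied by the software in between, which is the "bridge the gap" role the slide describes.
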
  19. QUESTIONS?