Jan 2013 HUG: Cloud-Friendly Hadoop and Hive



The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks such as Hive, Oozie, and Sqoop are becoming the tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which present different tradeoffs from a cloud environment, so making them run well in the cloud poses some challenges. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and how we offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS makes it extremely easy for an end user to use these technologies in the cloud.

Speaker: Ashish Thusoo, CEO, Qubole



  1. Hadoop User Group
     Ashish Thusoo, Jan 16, 2013
     Qubole Inc., Proprietary
  2. About Me
     • Big Data Veteran
     • Ran the data infrastructure team at Facebook before starting Qubole
     • Co-created Hive in 2007 @ Facebook
  3. What is Qubole?
     • A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud
       - Turnkey Infrastructure
       - Cloud Optimized Stack
       - Open Data Formats
     • Useful for exploring data and creating batch processing applications/data pipelines
  4. Why Qubole?
     [Diagram. Labels: Heterogeneous Data (Structured & Unstructured); The Intermediaries (Data Scientists and Engineers); End Users (User Ops, Product Managers, etc.); BOTTLENECK]
  5. Qubole Service
     [Diagram. Labels in extraction order: Cloud Data Service; Explore; Schedule; SDK; API; ODBC; Logs; Cloud Data Platform; Connectors; Events; "Elastic . Robust . Fast"; Data Marts; DBs; Big Data Technology Stack; Metrics; EC2 / S3; Cloud Sources]
  6. Cloud vs Bare Metal
     • Dynamic vs Fixed Provisioning
     • Separation between Compute and Storage
     • Purchasing and Budgeting
  7. Dynamic Provisioning
     • Advantage: Transient Clusters
     • Burden: How big of a cluster do I need?
     • Solution: Auto-scaled Hadoop
  8. Challenges: Auto-scaled Hadoop
     http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
     • Adapting to Burstiness
       - Looking at current load is not enough; future load must also be predicted
     • Adapting Statefully
       - Removing HDFS nodes is risky without decommissioning
  9. Implementation: Auto-scaled Hadoop
     http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
     • TaskTrackers report launch times to the JobTracker (JT)
     • The JT computes the amount of time required to finish existing workloads
     • If that time is above a certain threshold, more nodes are added
     • Nodes are removed at hourly boundaries
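The scale-up rule on slide 9 can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: the function name, its parameters, and the assumption that the backlog is measured in task-seconds and spread evenly over task slots are all hypothetical. It also ignores the prediction of future load that slide 8 calls for.

```python
import math


def nodes_to_add(backlog_task_seconds, current_nodes, slots_per_node, threshold_seconds):
    """Return how many nodes to add so the backlog finishes within the threshold.

    Hypothetical model: the backlog is a total of task-seconds that is
    spread evenly across all task slots in the cluster.
    """
    if current_nodes < 1 or slots_per_node < 1 or threshold_seconds <= 0:
        raise ValueError("cluster size, slots and threshold must be positive")

    current_slots = current_nodes * slots_per_node
    estimated_finish = backlog_task_seconds / current_slots

    # Below the threshold the cluster is big enough: add nothing.
    if estimated_finish <= threshold_seconds:
        return 0

    # Otherwise size the cluster so the backlog fits inside the threshold.
    needed_slots = math.ceil(backlog_task_seconds / threshold_seconds)
    needed_nodes = math.ceil(needed_slots / slots_per_node)
    return needed_nodes - current_nodes
```

For example, 7,200 task-seconds of backlog on 4 nodes with 2 slots each would take 900 seconds; with a 600-second threshold the sketch asks for 2 more nodes.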
  10. Implementation: Auto-scaled Hadoop
      http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
      • Restrictions on Deleting Nodes:
        - Nodes Containing Task Outputs of Current Jobs
        - Fast Decommissioning Done for Data Nodes
        - Minimum Cluster Size Threshold
      • Fast Decommissioning
        - Possible because HDFS is a cache for us
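The removal restrictions on slide 10, combined with the hourly-boundary rule from slide 9, could be expressed along these lines. The `Node` fields, the five-minute window before each hour boundary, and the function name are assumptions made for illustration. The decommissioning step itself is not modeled.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass
class Node:
    name: str
    uptime_seconds: int
    holds_live_task_output: bool  # task output still needed by a running job


def removable_nodes(nodes, min_cluster_size, window_seconds=300):
    """Pick nodes that may be removed right now.

    A node qualifies only if it holds no task output of a current job and
    is within `window_seconds` of completing another full hour of uptime.
    The cluster is never shrunk below `min_cluster_size`.
    """
    max_removals = max(0, len(nodes) - min_cluster_size)
    candidates = [
        n for n in nodes
        if not n.holds_live_task_output
        and n.uptime_seconds % HOUR >= HOUR - window_seconds
    ]
    return [n.name for n in candidates[:max_removals]]
```

Waiting for the hour boundary presumably reflects hourly instance billing: once an hour has been paid for, releasing the node earlier saves nothing.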
  11. Compute & Storage on the Cloud (EC2/S3)
      • On the cloud Compute and Storage are Separate!!
      • Advantage: Don’t Pay for CPU for Storing Data
      • Burden: Separation Can Cause Slowness & Variability
      • Solutions:
        - Caching File System
  12. Caching File System
      http://www.qubole.com/blog/index.php/columnar-cloud-cache/
  13. Caching File System
      http://www.qubole.com/blog/index.php/columnar-cloud-cache/
      • Benefits:
        - Masks the performance variance associated with S3 while reading data
        - Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
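The slides do not describe how the caching file system is built. As a rough mental model only, a read-through cache keyed by file and column behaves like the sketch below: the first read of a column pays the remote cost, and later reads are served locally. The class, the `(path, column)` key, and the `fetch_column` callback are all hypothetical.

```python
class ReadThroughColumnCache:
    """Toy read-through cache keyed by (path, column).

    `fetch_column` stands in for a slow remote read (for example from S3);
    it is a caller-supplied function, not a real library API.
    """

    def __init__(self, fetch_column):
        self._fetch_column = fetch_column
        self._cache = {}
        self.misses = 0

    def read(self, path, column):
        key = (path, column)
        if key not in self._cache:
            # Cache miss: go to the remote store once and remember the result.
            self.misses += 1
            self._cache[key] = self._fetch_column(path, column)
        return self._cache[key]
```

Caching per column rather than per file means a query that touches two columns of a wide table only pulls and stores those two.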
  14. Masking S3 Latency
      http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
      • File operations in S3 are much slower than in HDFS
      • Problem: This leads to bad performance when data is spread across many files
      • Solution:
        - Fast Split Generation Algorithm
        - Pipelined File Opens
  15. Faster Split Generation
      http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
      • Directory operations with merging instead of per-file metadata (up to 8x speedup)
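To see why slide 15 prefers directory operations over per-file metadata, the sketch below counts requests against a fake in-memory object store. `FakeObjectStore` and its `head` and `list_prefix` methods are invented for this illustration and are not a real S3 client. The "merging" mentioned on the slide is not modeled.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self, objects):
        self.objects = dict(objects)  # key -> size in bytes
        self.requests = 0

    def head(self, key):
        """One request per file: return the size of a single object."""
        self.requests += 1
        return self.objects[key]

    def list_prefix(self, prefix):
        """One request per directory: return (key, size) for every match."""
        self.requests += 1
        return sorted((k, s) for k, s in self.objects.items() if k.startswith(prefix))


def make_splits(key, size, split_size):
    """Cut one file into (key, offset, length) splits."""
    return [(key, off, min(split_size, size - off)) for off in range(0, size, split_size)]


def splits_per_file(store, keys, split_size):
    """Baseline: fetch metadata file by file."""
    splits = []
    for key in keys:
        splits.extend(make_splits(key, store.head(key), split_size))
    return splits


def splits_from_listing(store, prefix, split_size):
    """Use the sizes that come back with a single directory listing."""
    splits = []
    for key, size in store.list_prefix(prefix):
        splits.extend(make_splits(key, size, split_size))
    return splits
```

Both functions produce the same splits, but the listing variant issues one request per directory instead of one per file. A real S3 listing is paginated (at most 1,000 keys per request), so the saving is large but not unlimited.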
  16. Pipelined File Opens
      http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
      • Open S3 files before they are read (30% improvement on simple queries)
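The idea on slide 16, opening files before they are read, can be illustrated with a small thread-pool pipeline. This is a generic sketch, not the Hadoop change described in the talk: `open_fn` and `read_fn` are caller-supplied placeholders, and `lookahead` is an assumed tuning knob.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_pipelined(paths, open_fn, read_fn, lookahead=2):
    """Read files in order while opening the next ones in the background.

    Up to `lookahead` opens are in flight at any time, so the latency of
    opening file N+1 overlaps with reading file N.
    """
    results = []
    remaining = iter(paths)
    in_flight = deque()

    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        # Prime the pipeline with the first `lookahead` opens.
        for path in remaining:
            in_flight.append(pool.submit(open_fn, path))
            if len(in_flight) >= lookahead:
                break

        while in_flight:
            handle = in_flight.popleft().result()
            # Start opening one more file before reading the current one.
            for path in remaining:
                in_flight.append(pool.submit(open_fn, path))
                break
            results.append(read_fn(handle))

    return results
```

Results come back in the original file order, since futures are consumed first in, first out.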
  17. Purchasing Instances
      • Buying Instances at Spot Prices vs On-Demand Prices
      • Benefits: Cheaper on average by 50-60%
      • Problems: Spot instances are not guaranteed and can be taken away at any time
        - Bad for MapReduce
        - Disastrous for HDFS
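As a back-of-the-envelope check on what a 50-60% spot discount means for a mixed cluster, the blended hourly cost can be computed as below. The function and all prices are illustrative assumptions, not figures from the talk.

```python
def blended_hourly_cost(nodes, on_demand_price, spot_fraction, spot_discount):
    """Hourly cost of a cluster where a fraction of the nodes run on spot.

    spot_fraction: share of nodes bought on the spot market (0..1)
    spot_discount: average spot saving relative to on-demand (0..1)
    """
    if not 0 <= spot_fraction <= 1 or not 0 <= spot_discount <= 1:
        raise ValueError("fractions must be between 0 and 1")

    spot_nodes = nodes * spot_fraction
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price
```

With 10 nodes at a hypothetical $1.00 per hour on-demand, half of them on spot at a 55% discount, the cluster costs about $7.25 per hour instead of $10.00, a saving of roughly 27.5% overall.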
  18. Spotted Hadoop Clusters
      http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/
      • Simplified Spot Bidding Strategy
        - Configuring Bidding Timeouts
        - Configuring % of instances through spot
        - Configuring bid prices
      • Spot Instance Aware HDFS Block Placement
        - Ensures One Replica of Each Block Resides on On-Demand Nodes
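The placement constraint on slide 18 (at least one replica of each block on an on-demand node) can be sketched as a simple target-selection function. This is an illustration of the constraint only: it ignores racks, disk usage, and everything else a real HDFS block placement policy considers, and the function name and arguments are hypothetical.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose replica targets so that one replica is always on on-demand.

    The first replica is pinned to an on-demand node; the remaining ones
    are drawn from all other nodes, spot or on-demand.
    """
    if not on_demand_nodes:
        raise ValueError("at least one on-demand node is required")

    first = rng.choice(on_demand_nodes)
    others = [n for n in list(on_demand_nodes) + list(spot_nodes) if n != first]
    extra = min(replication - 1, len(others))
    return [first] + rng.sample(others, extra)
```

If every spot node is reclaimed at once, each block still has a surviving copy, which is what makes spot instances tolerable for HDFS.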
  19. Conclusion
      • Cloud is Different from Bare Metal
      • Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog: http://www.qubole.com/blog/
  20. Thank you.
      Free sign-up for Qubole at https://api.qubole.com/users/sign_up
      Careers at http://www.qubole.com/careers