Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks such as Hive, Oozie, and Sqoop are becoming the tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS makes it extremely easy for an end user to use these technologies in the cloud.

Speaker: Ashish Thusoo, CEO, Qubole



  • 1. Hadoop User Group
      • Ashish Thusoo
      • Jan 16, 2013
      • Qubole Inc., Proprietary
  • 2. About Me
      • Big Data Veteran
      • Ran the data infrastructure team at Facebook before starting Qubole
      • Co-created Hive in 2007 @ Facebook
  • 3. What is Qubole?
      • A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud
          - Turnkey Infrastructure
          - Cloud Optimized Stack
          - Open Data Formats
      • Useful for exploring data and creating batch processing applications/data pipelines
  • 4. Why Qubole?
      • [Diagram labels: BOTTLENECK; Heterogeneous Data (Structured & Unstructured); The Intermediaries (Data Scientists and Engineers); End Users (User Ops, Product Managers, etc.)]
  • 5. Qubole Service
      • [Diagram labels: Cloud Data Service; Explore; Schedule; SDK; API; ODBC; Cloud Data Platform; Connectors; Elastic, Robust, Fast; Big Data Technology Stack; EC2 / S3; Logs; Events; Data Marts; DBs; Metrics; Cloud Sources]
  • 6. Cloud vs Bare Metal
      • Dynamic vs Fixed Provisioning
      • Separation between Compute and Storage
      • Purchasing and Budgeting
  • 7. Dynamic Provisioning
      • Advantage: Transient Clusters
      • Burden: How big of a cluster do I need?
      • Solution: Auto-scaled Hadoop
  • 8. Challenges: Auto-scaled Hadoop
      • Adapting to Burstiness
          - Current load is not enough, also need to predict future load
      • Adapting State-fully
          - Removing HDFS nodes is risky without decommissioning
  • 9. Implementation: Auto-scaled Hadoop
      • TaskTrackers report launch times to the JobTracker (JT)
      • JT computes the amount of time required to finish existing workloads
      • If the time is above a certain threshold then more nodes are added
      • At hourly boundaries the nodes are removed
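The scale-up rule on slide 9 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Qubole's implementation: the function names, the "waves of tasks" time estimate, the slot counts, and the threshold are all hypothetical and chosen only to make the idea concrete.

```python
import math


def estimated_finish_seconds(pending_tasks, avg_task_seconds, nodes, slots_per_node):
    """Estimate how long the cluster needs to drain the pending tasks.

    Assumption: tasks run in "waves" that fill every slot, and each wave
    takes roughly the average task time.
    """
    if nodes < 1 or slots_per_node < 1:
        raise ValueError("need at least one node and one slot per node")
    total_slots = nodes * slots_per_node
    waves = math.ceil(pending_tasks / total_slots)
    return waves * avg_task_seconds


def nodes_to_add(pending_tasks, avg_task_seconds, nodes, slots_per_node,
                 threshold_seconds, max_nodes):
    """Return how many nodes to add so the estimate falls to the threshold.

    Growth stops at max_nodes even if the threshold is still exceeded.
    """
    target = nodes
    while (target < max_nodes and
           estimated_finish_seconds(pending_tasks, avg_task_seconds,
                                    target, slots_per_node) > threshold_seconds):
        target += 1
    return target - nodes


# Hypothetical workload: 400 pending tasks of about 60 s each on 5 nodes.
extra = nodes_to_add(pending_tasks=400, avg_task_seconds=60, nodes=5,
                     slots_per_node=4, threshold_seconds=600, max_nodes=20)
print(extra)
```

A real scheduler would also need the load prediction mentioned on slide 8; this sketch only reacts to the work already queued.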
  • 10. Implementation: Auto-scaled Hadoop
      • Restrictions on Deleting Nodes:
          - Nodes Containing Task Outputs of Current Jobs
          - Fast Decommissioning Done for Data Nodes
          - Minimum Cluster Size Threshold
      • Fast Decommissioning
          - possible because HDFS is a cache for us
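One way to picture the removal restrictions on slides 9 and 10 is a filter over the cluster's nodes. The sketch below is an assumption-laden illustration: the `Node` fields, the five-minute window before the hourly boundary, and the `removable_nodes` helper are hypothetical, and the decommissioning step itself is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    launch_time: float                      # seconds, as reported at launch
    holds_current_task_output: bool = False


def removable_nodes(nodes, now, min_cluster_size,
                    window_seconds=300, billing_seconds=3600):
    """Pick nodes that may be removed as they approach an hourly boundary."""
    candidates = []
    for node in nodes:
        # Restriction: keep nodes that still hold task output of running jobs.
        if node.holds_current_task_output:
            continue
        uptime = now - node.launch_time
        seconds_to_boundary = billing_seconds - (uptime % billing_seconds)
        if seconds_to_boundary <= window_seconds:
            candidates.append(node)
    # Restriction: never shrink below the minimum cluster size.
    allowed = max(0, len(nodes) - min_cluster_size)
    return candidates[:allowed]


cluster = [
    Node("a", launch_time=0),
    Node("b", launch_time=0, holds_current_task_output=True),
    Node("c", launch_time=1000),
    Node("d", launch_time=50),
]
print([n.name for n in removable_nodes(cluster, now=3500, min_cluster_size=2)])
```

The launch times reported by the TaskTrackers are what make the hourly-boundary check possible, since each node's boundary depends on when it started.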
  • 11. Compute & Storage on the Cloud (EC2/S3)
      • On the cloud Compute and Storage are Separate!!
      • Advantage: Don’t Pay for CPU for Storing Data
      • Burden: Separation Can Cause Slowness & Variability
      • Solutions:
          - Caching File System
  • 12. Caching File System
  • 13. Caching File System
      • Benefits:
          - Masks the performance variance associated with S3 while reading data
          - Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
  • 14. Masking S3 Latency
      • File Operations in S3 are much slower than HDFS
      • Problem: This leads to bad performance when data is distributed in a lot of files
      • Solution:
          - Fast Split Generation Algorithm
          - Pipelined File Opens
  • 15. Faster Split Generation
      • Directory operations with merging instead of per file metadata (up to 8x speedup)
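The slide does not spell out the algorithm, so the following is only one possible reading of "directory operations instead of per file metadata": fetch sizes for a whole directory in a single listing request rather than issuing one metadata request per file. `FakeStore` is a hypothetical in-memory stand-in that counts requests; it is not an S3 client.

```python
class FakeStore:
    """In-memory stand-in for an object store; counts metadata requests."""

    def __init__(self, objects):
        self.objects = dict(objects)    # path -> size in bytes
        self.requests = 0

    def stat(self, path):
        """One request per file."""
        self.requests += 1
        return self.objects[path]

    def list_prefix(self, prefix):
        """One request returns every (path, size) under the prefix."""
        self.requests += 1
        return sorted((p, s) for p, s in self.objects.items()
                      if p.startswith(prefix))


def make_splits(path, size, split_size):
    """Cut one file into (path, offset, length) splits."""
    return [(path, off, min(split_size, size - off))
            for off in range(0, size, split_size)]


def splits_per_file(store, paths, split_size):
    splits = []
    for path in paths:
        splits.extend(make_splits(path, store.stat(path), split_size))
    return splits


def splits_from_listing(store, prefix, split_size):
    splits = []
    for path, size in store.list_prefix(prefix):
        splits.extend(make_splits(path, size, split_size))
    return splits


files = {"t/a": 250, "t/b": 100}
slow, fast = FakeStore(files), FakeStore(files)
same = splits_per_file(slow, sorted(files), 100) == splits_from_listing(fast, "t/", 100)
print(same, slow.requests, fast.requests)
```

Both functions produce the same splits; the difference is the number of round trips, which grows with the file count in the per-file version.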
  • 16. Pipelined File Opens
      • Open S3 files before they are read (30% improvements in simple queries)
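The idea of opening files ahead of the reader can be sketched with a small thread pool. This is an illustration under assumptions, not the QDS code: `read_pipelined`, the `lookahead` parameter, and `fake_open` are hypothetical names, and `fake_open` simply returns an in-memory buffer in place of a remote file handle.

```python
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_pipelined(paths, open_fn, lookahead=2):
    """Yield (path, data) in order while opening later files in the background."""
    paths = list(paths)
    lookahead = max(1, lookahead)
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        pending = deque()
        next_idx = 0
        # Start the first few opens before any data is read.
        while next_idx < len(paths) and len(pending) < lookahead:
            pending.append((paths[next_idx], pool.submit(open_fn, paths[next_idx])))
            next_idx += 1
        while pending:
            path, future = pending.popleft()
            # Keep the window full: start the next open before reading this file.
            if next_idx < len(paths):
                pending.append((paths[next_idx], pool.submit(open_fn, paths[next_idx])))
                next_idx += 1
            with future.result() as handle:
                yield path, handle.read()


def fake_open(path):
    """Hypothetical opener; a real one would contact the object store."""
    return io.BytesIO(path.encode())


result = list(read_pipelined(["a", "b", "c"], fake_open))
print(result)
```

The open latency of file N+1 overlaps with the read of file N, which is where a gain would come from when each open is slow.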
  • 17. Purchasing Instances
      • Buying Instances on Spot Prices vs On-Demand Prices
      • Benefits: Cheaper on average by 50-60%
      • Problems: Spot instances are not guaranteed and can be taken away anytime
          - Bad for MapReduce
          - Disastrous for HDFS
  • 18. Spotted Hadoop Clusters
      • Simplified Spot Bidding Strategy
          - Configuring Bidding Timeouts
          - Configuring % of instances through spot
          - Configuring bid prices
      • Spot Instance Aware HDFS Block Placement
          - Ensures one replica of each block resides on on-demand nodes
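The placement constraint on slide 18 can be illustrated with a toy replica chooser. This is a simplified sketch, not the HDFS block placement policy: `place_replicas` and the node labels are hypothetical, and rack awareness and disk usage are ignored.

```python
import random


def place_replicas(nodes, replication=3, rng=random):
    """Choose replica locations so that at least one is an on-demand node.

    nodes maps a node name to "on_demand" or "spot".
    """
    on_demand = [name for name, kind in nodes.items() if kind == "on_demand"]
    if not on_demand:
        raise ValueError("need at least one on-demand node")
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the node count")
    # Pin the first replica to an on-demand node so the block survives
    # even if every spot instance is reclaimed.
    first = rng.choice(on_demand)
    others = [name for name in nodes if name != first]
    return [first] + rng.sample(others, replication - 1)


cluster_nodes = {"od1": "on_demand", "s1": "spot", "s2": "spot", "s3": "spot"}
chosen = place_replicas(cluster_nodes, replication=3, rng=random.Random(0))
print(chosen)
```

Because the first replica is always on an on-demand node, losing all spot nodes at once costs replicas but not the block.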
  • 19. Conclusion
      • Cloud is Different from Bare Metal
      • Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog
  • 20. Thank you.
      • Free Sign up for Qubole at
      • Careers at