Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks such as Hive, Oozie, and Sqoop are becoming the tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS makes it easy for an end user to use these technologies in the cloud.

Speaker: Ashish Thusoo, CEO, Qubole

Jan 2013 HUG: Cloud-Friendly Hadoop and Hive Presentation Transcript

  • Hadoop User Group
    Ashish Thusoo
    Jan 16, 2013
    Qubole Inc., Proprietary
  • About Me
    • Big Data Veteran
    • Ran the data infrastructure team at Facebook before starting Qubole
    • Co-created Hive in 2007 @ Facebook
  • What is Qubole?
    • A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud
        - Turnkey Infrastructure
        - Cloud Optimized Stack
        - Open Data Formats
    • Useful for exploring data and creating batch processing applications/data pipelines
  • Why Qubole?
    [Diagram; labels: BOTTLENECK; End Users (User Ops, Product Managers, etc.); Heterogeneous Data (Structured & Unstructured); The Intermediaries (Data Scientists and Engineers)]
  • Qubole Service
    [Architecture diagram; labels: Cloud Data Service, Explore, Schedule, SDK, API, ODBC, Logs, Cloud Data Platform, Connectors, Events, Elastic . Robust . Fast, Data Marts, DBs, Big Data Technology Stack, Metrics, EC2 / S3, Cloud Sources]
  • Cloud vs Bare Metal
    • Dynamic vs Fixed Provisioning
    • Separation between Compute and Storage
    • Purchasing and Budgeting
  • Dynamic Provisioning
    • Advantage: Transient Clusters
    • Burden: How big of a cluster do I need?
    • Solution: Auto-scaled Hadoop
  • Challenges: Auto-scaled Hadoop
    http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
    • Adapting to Burstiness
        - Current load is not enough, also need to predict future load
    • Adapting Statefully
        - Removing HDFS nodes is risky without decommissioning
  • Implementation: Auto-scaled Hadoop
    http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
    • TaskTrackers report their launch times to the JobTracker (JT)
    • JT computes the amount of time required to finish existing workloads
    • If the time is above a certain threshold, more nodes are added
    • At hourly boundaries the nodes are removed
  • Implementation: Auto-scaled Hadoop
    http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
    • Restrictions on Deleting Nodes:
        - Nodes Containing Task Outputs of Current Jobs
        - Fast Decommissioning Done for Data Nodes
        - Minimum Cluster Size Threshold
    • Fast Decommissioning - possible because HDFS is a cache for us
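
The scaling rules on the two implementation slides can be sketched roughly as follows. This is a minimal illustration and not Qubole's code: the `Node` class, the function names, and the `window_seconds` parameter are hypothetical, and the sketch assumes the time-to-finish estimate is supplied by the caller. Fast decommissioning of data nodes is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    launch_time: float                    # seconds, as reported by the node
    holds_live_task_output: bool = False  # task output of a running job


def should_add_nodes(estimated_seconds_to_finish: float,
                     threshold_seconds: float) -> bool:
    """Scale up when the existing workload would take longer than the threshold."""
    return estimated_seconds_to_finish > threshold_seconds


def near_hour_boundary(node: Node, now: float, window_seconds: int = 300) -> bool:
    """True when the node is within window_seconds of finishing an hour of uptime."""
    seconds_into_hour = (now - node.launch_time) % 3600
    return seconds_into_hour >= 3600 - window_seconds


def removable_nodes(nodes: List[Node], now: float, min_cluster_size: int,
                    window_seconds: int = 300) -> List[Node]:
    """Pick nodes that may be removed without violating the slide's restrictions."""
    candidates = [
        n for n in nodes
        if near_hour_boundary(n, now, window_seconds)
        and not n.holds_live_task_output
    ]
    # Never shrink below the minimum cluster size.
    max_removals = max(0, len(nodes) - min_cluster_size)
    return candidates[:max_removals]
```

Removing nodes only near the end of a launch-relative hour reflects hourly billing: an instance that has already been paid for the hour costs nothing extra to keep until the boundary.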
  • Compute & Storage on the Cloud (EC2/S3)
    • On the cloud Compute and Storage are Separate!!
    • Advantage: Don’t Pay for CPU for Storing Data
    • Burden: Separation Can Cause Slowness & Variability
    • Solutions:
        - Caching File System
  • Caching File System
    http://www.qubole.com/blog/index.php/columnar-cloud-cache/
  • Caching File System
    http://www.qubole.com/blog/index.php/columnar-cloud-cache/
    • Benefits:
        - Masks the performance variance associated with S3 while reading data
        - Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
  • Masking S3 Latency
    http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
    • File Operations in S3 are much slower than HDFS
    • Problem: This leads to bad performance when data is distributed in a lot of files
    • Solution:
        - Fast Split Generation Algorithm
        - Pipelined File Opens
  • Faster Split Generation
    http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
    • Directory operations with merging instead of per file metadata (up to 8x speedup)
  • Pipelined File Opens
    http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
    • Open S3 files before they are read (30% improvement in simple queries)
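
The idea behind pipelined file opens can be illustrated with a small sketch: while one file is being read, the open call for the next file is already in flight on a background thread, so the open latency overlaps with the read. This is only an illustration under stated assumptions, not the Hadoop implementation. `read_pipelined`, `open_fn`, and `read_fn` are hypothetical names; the caller supplies the functions that open and read a file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

H = TypeVar("H")  # handle type returned by open_fn
R = TypeVar("R")  # result type returned by read_fn


def read_pipelined(paths: Sequence[str],
                   open_fn: Callable[[str], H],
                   read_fn: Callable[[H], R]) -> List[R]:
    """Read files in order, opening file i+1 while file i is being read."""
    results: List[R] = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(open_fn, paths[0])
        for i in range(len(paths)):
            handle = pending.result()           # wait for the current open
            if i + 1 < len(paths):
                # Start the next (slow) open before reading the current file.
                pending = pool.submit(open_fn, paths[i + 1])
            results.append(read_fn(handle))
    return results
```

A single worker keeps at most one open ahead of the reader, which is the simplest form of the pipeline; a deeper prefetch window would be a design choice beyond what the slide states.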
  • Purchasing Instances
    • Buying Instances on Spot Prices vs On-Demand Prices
    • Benefits: Cheaper on average by 50-60%
    • Problems: Spot instances are not guaranteed and can be taken away anytime
        - Bad for MapReduce
        - Disastrous for HDFS
  • Spotted Hadoop Clusters
    http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/
    • Simplified Spot Bidding Strategy
        - Configuring Bidding Timeouts
        - Configuring % of instances through spot
        - Configuring bid prices
    • Spot Instance Aware HDFS Block Placement
        - Ensures One Replica of Each Block Resides on On-Demand Nodes
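
The placement constraint on the slide (one replica of every block on an on-demand node) can be sketched as below. This is a simplified illustration, not HDFS's block placement policy: `place_replicas` and its parameters are hypothetical, and rack awareness and node capacity are ignored.

```python
import random
from typing import List, Optional, Sequence


def place_replicas(on_demand_nodes: Sequence[str],
                   spot_nodes: Sequence[str],
                   replication: int = 3,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Choose nodes for one block so that the first replica is always on-demand."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not on_demand_nodes:
        raise ValueError("at least one on-demand node is required")
    rng = rng or random.Random()
    # Pin the first replica to a node that cannot be reclaimed by the provider.
    first = rng.choice(list(on_demand_nodes))
    # Remaining replicas may land on any other node, spot or on-demand.
    others = [n for n in list(on_demand_nodes) + list(spot_nodes) if n != first]
    k = min(replication - 1, len(others))
    return [first] + rng.sample(others, k)
```

With this rule, losing every spot instance at once still leaves one copy of each block in the cluster.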
  • Conclusion
    • Cloud is Different from Bare Metal
    • Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog http://www.qubole.com/blog/
  • Thank you.
    Free Sign up for Qubole at https://api.qubole.com/users/sign_up
    Careers at http://www.qubole.com/careers