Cloud Friendly Hadoop and Hive

Published in: Technology

  1. Cloud Friendly Hadoop & Hive. Joydeep Sen Sarma, Qubole
  2. Agenda
     • What is Qubole Data Service
     • Hadoop as a Service in Cloud
     • Hive as a Service in Cloud
  3. Qubole Data Service (architecture diagram)
     Labels: SDK, ODBC, API; Explore – Integrate – Analyze – Schedule; Vertica, Oozie, Hive, Pig, Sqoop, Mysql, Hadoop; AWS EC2; AWS S3 (S3://adco/logs)
  4. Agenda
     • What is Qubole Data Service
     • Hadoop as a Service in Cloud
     • Hive as a Service in Cloud
  5. Step 1 (Optional): Setup Hadoop
  6. Step 2: Fire Away (diagram: AdCo, Hadoop)

       hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…

       select t.county, count(1) from (
         select transform(a.zip) using 'geo.py' as a.county
         from SMALL_TABLE a) t
       group by t.county;

       insert overwrite table dest
       select a.id, a.zip, count(distinct b.uid)
       from ads a join LARGE_TABLE b on (a.id=b.ad_id)
       group by a.id, a.zip;
  7. Come back anytime
  8. Hadoop as Service
     1. Detect when cluster is required
        – Not all Hive statements require cluster (EXPLAIN/SHOW/..)
     2. Atomically create cluster
        – Long running process, concurrency control using Mysql
     3. Shutdown when not in use
        – Do on hour boundary (whose?)
        – Not if User Sessions are active!
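
The first and third points of slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Qubole's implementation: the function names, the five-minute window, and the choice to measure the "hour boundary" from cluster uptime are all hypothetical (the slide itself leaves "whose?" open), and the list of metadata-only statements contains just the two keywords the slide names.

```python
# Hypothetical list of statement keywords assumed not to need a cluster.
# The slide names only EXPLAIN and SHOW and elides the rest as "..".
METADATA_ONLY = ("explain", "show")


def needs_cluster(statement: str) -> bool:
    """Return True if the Hive statement is assumed to need a running cluster."""
    words = statement.strip().split(None, 1)
    if not words:
        return False
    return words[0].lower() not in METADATA_ONLY


def should_shut_down(uptime_seconds: int, active_sessions: int,
                     window_seconds: int = 300) -> bool:
    """Shut down only near the end of an hour of uptime, and never while
    user sessions are active. Measuring the hour from cluster uptime and
    the 300-second window are assumptions made for this sketch."""
    if active_sessions > 0:
        return False
    seconds_into_hour = uptime_seconds % 3600
    return seconds_into_hour >= 3600 - window_seconds
```

Checking only the leading keyword keeps the test cheap; a real service would presumably consult the query plan, since some SELECT statements can also run without a cluster.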
  9. Hadoop as Service
     • Archive Job History/Logs to S3
       – Transparent access to Old jobs
     • Auto-Config different node types
       – Use ALL ephemeral drives for HDFS/MR
       – Use right number of slots per machine
     • Scrub, Scrub, Scrub
       – Bad Nodes, Bad Clusters, AWS timeouts
  10. Scaling Up (diagram)

       insert overwrite table dest
       select … from ads join campaigns on …
       group by …;

      Labels: Slaves, Progress, Map Tasks, Job Tracker, Reduce Tasks, Supply, Demand, Master, StarCluster, AWS
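
Slide 10 shows only a diagram, so the scale-up rule is not spelled out. One plausible reading of the "Supply" and "Demand" labels is to compare pending map and reduce tasks against free slots and add enough nodes to cover the shortfall. The sketch below assumes exactly that; the function name, its parameters, and the maximum-size cap are hypothetical.

```python
import math


def nodes_to_add(pending_tasks: int, free_slots: int, slots_per_node: int,
                 current_nodes: int, max_nodes: int) -> int:
    """Estimate how many slave nodes to add.

    Demand is taken to be the number of pending tasks and supply the number
    of free slots; both definitions are assumptions made for this sketch.
    """
    shortfall = max(0, pending_tasks - free_slots)
    wanted = math.ceil(shortfall / slots_per_node)
    headroom = max(0, max_nodes - current_nodes)
    return min(wanted, headroom)
```

For example, 100 pending tasks against 20 free slots on 8-slot nodes leaves a shortfall of 80 tasks, which is 10 nodes if the cap allows it.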
  11. Scaling Down
      1. On hour boundary – check if node is required:
         – Can’t remove nodes with map-outputs (today)
         – Don’t go below minimum cluster size
      2. Remove node from Map-Reduce Cluster
      3. Request HDFS Decommissioning – fast!
         – Delete affected cache files instead of re-replicating
         – One surviving replica and we are Done.
      4. Delete Instance
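
Step 1 of slide 11 amounts to a filter over the cluster's nodes. The following is a minimal sketch of that check, assuming each node reports how far it is into its current hour and whether it still holds map outputs; the `Node` fields, the function name, and the five-minute window are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    seconds_into_hour: int   # assumed: time elapsed in the node's current hour
    has_map_outputs: bool    # assumed: reported by the Map-Reduce layer


def removable_nodes(nodes: List[Node], min_cluster_size: int,
                    window_seconds: int = 300) -> List[str]:
    """Pick nodes that may be removed without violating the slide's rules."""
    budget = len(nodes) - min_cluster_size  # never go below the minimum size
    chosen: List[str] = []
    for node in nodes:
        if budget <= 0:
            break
        near_boundary = node.seconds_into_hour >= 3600 - window_seconds
        if near_boundary and not node.has_map_outputs:
            chosen.append(node.name)
            budget -= 1
    return chosen
```

The nodes returned here would then go through steps 2 to 4 (removal from the Map-Reduce cluster, HDFS decommissioning, instance deletion), which are not modelled.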
  12. Spot Instances: On average 50-60% cheaper
  13. Spot Instance: Challenges
      • Can lose Spot nodes anytime
        – Disastrous for HDFS
        – Hybrid Mode: Use mix of On-Demand and Spot
        – Hybrid Mode: Keep one replica in On-Demand nodes
      • Spot Instances may not be available
        – Timeout and use On-Demand nodes as fallback
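
The last bullet of slide 13 (time out, then fall back to On-Demand) can be sketched as a polling loop. This is an illustration only: `request_spot` and `request_on_demand` are hypothetical callables standing in for whatever provisioning API is used, each assumed to return at most the number of node identifiers asked for, and the timeout and polling interval are made-up defaults.

```python
import time
from typing import Callable, List


def acquire_nodes(count: int,
                  request_spot: Callable[[int], List[str]],
                  request_on_demand: Callable[[int], List[str]],
                  timeout_seconds: float = 600.0,
                  poll_seconds: float = 30.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Try Spot nodes until the timeout, then fill the gap with On-Demand."""
    deadline = clock() + timeout_seconds
    nodes: List[str] = []
    while len(nodes) < count and clock() < deadline:
        nodes.extend(request_spot(count - len(nodes)))
        if len(nodes) < count:
            sleep(poll_seconds)  # wait before asking for Spot capacity again
    if len(nodes) < count:
        # Spot capacity did not arrive in time: fall back to On-Demand.
        nodes.extend(request_on_demand(count - len(nodes)))
    return nodes
```

Passing the clock and sleep functions as parameters lets the loop be exercised without real waiting. The hybrid-mode replica placement from the first bullet is not covered here.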
  14. Agenda
      • What is Qubole Data Service
      • Hadoop as a Service in Cloud
      • Hive as a Service in Cloud
  15. Query History/Results
  16. Cheap to Test
      • Evaluate expressions on sample data
      • Run Query on Sample
  17. Fastest Hive SaaS
      • Works with Small Files!
        – Faster Split Computation (8x)
        – Prefetching S3 files (30%)
      • Direct writes to S3
        – HIVE-1620
      • Stable JVM Reuse!
        – Fix re-entrancy issues
        – 1.2-2x speedup
      • Columnar Cache
        – Use HDFS as cache for S3
        – Up to 5x faster for JSON data
      • NEW – Multi-Tenant Hive Server
  18. Questions? @Qubole. Free Trial: www.qubole.com
