Qubole hadoop-summit-2013-europe

  1. Cloud Friendly Hadoop & Hive. Joydeep Sen Sarma, Qubole
  2. Agenda: What is Qubole Data Service; Hadoop as a Service in Cloud; Hive as a Service in Cloud
  3. Qubole Data Service (diagram: AWS EC2, AWS S3)
  4. Qubole Data Service: API, Oozie, Hive, Pig, Sqoop, Hadoop on AWS EC2, AWS S3
  5. Qubole Data Service: API, Vertica, Oozie, Hive, Pig, Sqoop, MySQL, Hadoop on AWS EC2; AWS S3 (s3://adco/logs)
  6-7. Qubole Data Service: SDK, ODBC; Explore - Integrate - Analyze - Schedule; API, Vertica, Oozie, Hive, Pig, Sqoop, MySQL, Hadoop on AWS EC2; AWS S3 (s3://adco/logs)
  8. Agenda: What is Qubole Data Service; Hadoop as a Service in Cloud; Hive as a Service in Cloud
  9. Step 1 (optional): Set up Hadoop
  10. Step 2: Fire Away (diagram: AdCo, Hadoop)
  11-12. Step 2: Fire Away. Query: select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county; (diagram: AdCo, Hadoop)
  13-14. Step 2: Fire Away. Commands: hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner org.apache…; select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county; insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip; (diagram: AdCo, Hadoop)
  15-16. Step 2: Fire Away. Command: hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner org.apache… (diagram: AdCo, Hadoop)
  17. Step 2: Fire Away (diagram: AdCo, Hadoop)
  18. Come back anytime
  19. Hadoop as a Service: 1. Detect when a cluster is required (not all Hive statements require a cluster, e.g. EXPLAIN/SHOW/...). 2. Atomically create the cluster (long-running process; concurrency control using MySQL). 3. Shut down when not in use (on an hour boundary (whose?), and never while user sessions are active). A Python sketch of this lifecycle appears after the slide list.
  20. Hadoop as a Service: Archive job history/logs to S3 (transparent access to old jobs). Auto-configure different node types (use ALL ephemeral drives for HDFS/MR; use the right number of slots per machine; see the node-configuration sketch after the slide list). Scrub, scrub, scrub (bad nodes, bad clusters, AWS timeouts).
  21. Scaling Up (diagram: slaves, map tasks, reduce tasks, Job Tracker, master, StarCluster, AWS)
  22-24. Scaling Up: insert overwrite table dest select … from ads join campaigns on … group by …; (diagram: slaves, map tasks, reduce tasks, Job Tracker, master, StarCluster, AWS)
  25. Scaling Up: insert overwrite table dest select … from ads join campaigns on … group by …; (diagram: progress, map tasks, reduce tasks, Job Tracker, master, StarCluster, AWS)
  26-27. Scaling Up: insert overwrite table dest select … from ads join campaigns on … group by …; (diagram: progress, map tasks, reduce tasks, supply vs. demand, Job Tracker, master, StarCluster, AWS). See the supply/demand sketch after the slide list.
  28-29. Scaling Up: insert overwrite table dest select … from ads join campaigns on … group by …; (diagram: progress, map tasks, reduce tasks, Job Tracker, master, StarCluster, AWS)
  30. Scaling Down: 1. On an hour boundary, check whether the node is still required (can't remove nodes holding map outputs (today); don't go below the minimum cluster size). 2. Remove the node from the Map-Reduce cluster. 3. Request HDFS decommissioning, fast: delete affected cache files instead of re-replicating; one surviving replica and we are done. 4. Delete the instance. A sketch of this check appears after the slide list.
  31. Spot Instances: on average 50-60% cheaper
  32. Spot Instance Challenges: spot nodes can be lost at any time (disastrous for HDFS; hybrid mode: use a mix of on-demand and spot nodes, and keep one replica on on-demand nodes). Spot instances may not be available (time out and use on-demand nodes as fallback). See the spot-fallback sketch after the slide list.
  33. Agenda: What is Qubole Data Service; Hadoop as a Service in Cloud; Hive as a Service in Cloud
  34. Query History/Results
  35. Cheap to Test: evaluate expressions on sample data
  36. Cheap to Test: run the query on a sample
  37-41. Fastest Hive SaaS (built up over several slides): Works with small files! (faster split computation (8x); prefetching S3 files (30%); see the S3-prefetch sketch after the slide list). Stable JVM reuse! (fix re-entrancy issues; 1.2-2x speedup). Direct writes to S3 (HIVE-1620). Columnar cache (use HDFS as a cache for S3; up to 5x faster for JSON data). NEW: Multi-Tenant Hive Server.
  42. Questions? @Qubole. Free Trial: www.qubole.com
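
Sketches referenced in the slide notes above follow. They are hedged illustrations of the ideas on the slides, not Qubole's code.

Slide 19 describes the cluster lifecycle. Below is a minimal Python sketch of that flow; all names (ClusterManager, ensure_cluster, the 55-minute threshold) are assumptions, and a local thread lock stands in for the MySQL-based concurrency control mentioned on the slide.

    import threading
    import time

    # Step 1 (slide 19): metadata-only Hive statements need no running cluster.
    CLUSTER_FREE_PREFIXES = ("EXPLAIN", "SHOW", "DESCRIBE", "USE")

    def needs_cluster(hive_statement: str) -> bool:
        return not hive_statement.lstrip().upper().startswith(CLUSTER_FREE_PREFIXES)

    class ClusterManager:
        def __init__(self):
            # Slide 19 uses MySQL for concurrency control; a local lock stands
            # in here so the sketch stays self-contained.
            self._lock = threading.Lock()
            self.running = False
            self.started_at = None
            self.active_sessions = 0

        def ensure_cluster(self):
            # Step 2: cluster creation is a long-running operation; make it
            # atomic so concurrent queries trigger exactly one bring-up.
            with self._lock:
                if not self.running:
                    self.running = True
                    self.started_at = time.time()

        def maybe_shutdown(self, idle: bool) -> bool:
            # Step 3: shut down only near an hourly billing boundary, and never
            # while user sessions are active or work is in flight.
            if not self.running or self.active_sessions > 0 or not idle:
                return False
            minutes_into_hour = ((time.time() - self.started_at) / 60.0) % 60
            if minutes_into_hour >= 55:  # assumed threshold near the boundary
                self.running = False
                return True
            return False

    if __name__ == "__main__":
        mgr = ClusterManager()
        print(needs_cluster("SHOW TABLES"))                # False: no cluster needed
        print(needs_cluster("SELECT count(1) FROM logs"))  # True
        mgr.ensure_cluster()
        print(mgr.maybe_shutdown(idle=True))               # False until near the boundary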
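
Slide 20 mentions auto-configuring Hadoop per node type (all ephemeral drives, the right number of slots). The sketch below is illustrative only: the instance-spec table and the slot heuristics are assumed values, not Qubole's actual tuning; the property names are the Hadoop 1.x configuration keys.

    # Illustrative per-instance-type Hadoop configuration (slide 20).
    INSTANCE_SPECS = {
        # instance type: (cores, ram_gb, ephemeral_drives) -- assumed values
        "m1.large":  (2, 7.5, 2),
        "m1.xlarge": (4, 15, 4),
        "c1.xlarge": (8, 7, 4),
    }

    def hadoop_config_for(instance_type: str) -> dict:
        cores, ram_gb, drives = INSTANCE_SPECS[instance_type]
        # Use ALL ephemeral drives for HDFS data and MapReduce spill space.
        local_dirs = ",".join("/mnt%s/hadoop" % (i or "") for i in range(drives))
        return {
            "dfs.data.dir": local_dirs,
            "mapred.local.dir": local_dirs,
            # Rough slot heuristic: about one map slot per core, half as many
            # reduce slots, both bounded by memory (~1.5 GB per task assumed).
            "mapred.tasktracker.map.tasks.maximum": min(cores, int(ram_gb / 1.5)),
            "mapred.tasktracker.reduce.tasks.maximum": max(1, min(cores // 2, int(ram_gb / 3))),
        }

    if __name__ == "__main__":
        print(hadoop_config_for("m1.xlarge"))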
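
Slides 26-27 frame upscaling as comparing demand (pending map/reduce tasks) against supply (free task slots). A hedged sketch of that check follows; the field names, the slots-per-node figure, and the capping rule are assumptions, not the actual StarCluster-based implementation.

    from dataclasses import dataclass
    import math

    @dataclass
    class ClusterStats:
        pending_map_tasks: int
        pending_reduce_tasks: int
        free_map_slots: int
        free_reduce_slots: int
        nodes: int
        max_nodes: int
        slots_per_node: int = 8   # assumed; depends on instance type (slide 20)

    def nodes_to_add(stats: ClusterStats) -> int:
        # Demand is the backlog of runnable tasks; supply is the free slot count.
        demand = stats.pending_map_tasks + stats.pending_reduce_tasks
        supply = stats.free_map_slots + stats.free_reduce_slots
        shortfall = demand - supply
        if shortfall <= 0:
            return 0
        wanted = math.ceil(shortfall / stats.slots_per_node)
        return min(wanted, stats.max_nodes - stats.nodes)   # respect the cluster cap

    if __name__ == "__main__":
        stats = ClusterStats(pending_map_tasks=120, pending_reduce_tasks=10,
                             free_map_slots=16, free_reduce_slots=4,
                             nodes=4, max_nodes=10)
        print(nodes_to_add(stats))   # 14 nodes wanted, capped at 6 additional nodes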
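
Slide 30's scale-down sequence (hour-boundary check, remove from MapReduce, fast HDFS decommission, terminate) reads as a per-node check. The sketch below uses hypothetical Node/Cluster stand-ins; a real implementation would talk to the JobTracker, the NameNode, and the EC2 API.

    from dataclasses import dataclass

    @dataclass
    class Node:
        instance_id: str
        minutes_into_billing_hour: float
        has_map_outputs: bool
        decommissioned: bool = False
        terminated: bool = False

    @dataclass
    class Cluster:
        nodes: list
        min_size: int

        def try_remove(self, node: Node) -> bool:
            # 1. Only consider removal near the node's hourly billing boundary;
            #    never shrink below the minimum size, and (today) never remove a
            #    node that still holds map outputs reducers may fetch.
            if node.minutes_into_billing_hour < 55:   # assumed threshold
                return False
            if len(self.nodes) <= self.min_size or node.has_map_outputs:
                return False
            # 2. Remove the node from the MapReduce cluster (stop its TaskTracker).
            self.nodes.remove(node)
            # 3. Fast HDFS decommissioning: drop affected cache files rather than
            #    re-replicating them; one surviving replica elsewhere is enough.
            node.decommissioned = True
            # 4. Finally delete the instance.
            node.terminated = True
            return True

    if __name__ == "__main__":
        a = Node("i-aaa", minutes_into_billing_hour=57, has_map_outputs=False)
        b = Node("i-bbb", minutes_into_billing_hour=57, has_map_outputs=True)
        cluster = Cluster(nodes=[a, b], min_size=1)
        print(cluster.try_remove(a))   # True: near the boundary and safe to drop
        print(cluster.try_remove(b))   # False: at minimum size / still holds map outputs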
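
Slide 32 mitigates spot-instance risk two ways: keep at least one HDFS replica on on-demand nodes (hybrid mode) and fall back to on-demand capacity when spot requests do not fill in time. The sketch below uses a hypothetical `cloud` client object, not a real AWS SDK; only `pick_replica_nodes` is exercised in the demo.

    import time

    def provision_nodes(cloud, count, max_spot_price, timeout_s=600, poll_s=30):
        # Prefer spot nodes, but time out and fall back to on-demand (slide 32).
        # `cloud` and its methods are hypothetical stand-ins for the EC2 APIs.
        request_ids = cloud.request_spot(count=count, max_price=max_spot_price)
        deadline = time.time() + timeout_s
        fulfilled = cloud.fulfilled_instances(request_ids)
        while time.time() < deadline and len(fulfilled) < count:
            time.sleep(poll_s)
            fulfilled = cloud.fulfilled_instances(request_ids)
        missing = count - len(fulfilled)
        if missing > 0:
            cloud.cancel_spot(request_ids)                       # stop waiting
            fulfilled += cloud.launch_on_demand(count=missing)   # top up on-demand
        return fulfilled

    def pick_replica_nodes(replication, on_demand_nodes, spot_nodes):
        # Hybrid mode: pin the first replica of every block to an on-demand node
        # so losing all spot nodes at once cannot lose HDFS data.
        chosen = []
        for i in range(replication):
            pool = on_demand_nodes if i == 0 else (spot_nodes or on_demand_nodes)
            chosen.append(pool[i % len(pool)])
        return chosen

    if __name__ == "__main__":
        print(pick_replica_nodes(3, ["od-1"], ["sp-1", "sp-2"]))
        # ['od-1', 'sp-2', 'sp-1']: first replica stays on an on-demand node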
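
Slides 37-41 credit part of the Hive speedup to prefetching S3 files so a task is not blocked on each read. Below is a toy sketch of overlapping downloads with split processing; `fetch_object` is a local stand-in for an S3 read, not a real client call.

    from concurrent.futures import ThreadPoolExecutor

    def fetch_object(key: str) -> bytes:
        # Stand-in for an S3 GET (e.g. via an S3 client); local so the sketch runs.
        return ("contents of " + key).encode()

    def process_splits(keys, prefetch_workers=8):
        # Kick off downloads for all upcoming splits, then consume them in order;
        # later objects download while earlier ones are being processed.
        with ThreadPoolExecutor(max_workers=prefetch_workers) as pool:
            futures = [pool.submit(fetch_object, k) for k in keys]
            for key, fut in zip(keys, futures):
                data = fut.result()   # usually already fetched by the time we need it
                yield key, len(data)

    if __name__ == "__main__":
        sample_keys = ["s3://bucket/logs/part-%05d" % i for i in range(4)]
        for key, size in process_splits(sample_keys):
            print(key, size)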
