Qubole hadoop-summit-2013-europe
Published in: Technology
Transcript

  • 1. Cloud Friendly Hadoop & Hive (Joydeep Sen Sarma, Qubole)
  • 2. Agenda
      • What is Qubole Data Service
      • Hadoop as a Service in Cloud
      • Hive as a Service in Cloud
  • 3. Qubole Data Service [diagram labels: AWS EC2; AWS S3]
  • 4. Qubole Data Service [diagram labels: API; Oozie, Hive, Pig, Sqoop; Hadoop; AWS EC2; AWS S3]
  • 5. Qubole Data Service [diagram labels: API; Vertica; Oozie, Hive, Pig, Sqoop; MySQL; Hadoop; AWS EC2; AWS S3; S3://adco/logs]
  • 6. Qubole Data Service [diagram labels: SDK; ODBC; Explore – Integrate – Analyze – Schedule; API; Vertica; Oozie, Hive, Pig, Sqoop; MySQL; Hadoop; AWS EC2; AWS S3; S3://adco/logs]
  • 7. Qubole Data Service [diagram labels: SDK; ODBC; Explore – Integrate – Analyze – Schedule; API; Vertica; Oozie, Hive, Pig, Sqoop; MySQL; Hadoop; AWS EC2; AWS S3; S3://adco/logs]
  • 8. Agenda
      • What is Qubole Data Service
      • Hadoop as a Service in Cloud
      • Hive as a Service in Cloud
  • 9. Step 1 (Optional): Set up Hadoop
  • 10. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
  • 11. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t group by t.county;
  • 12. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t group by t.county;
  • 13. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t group by t.county;
      insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
  • 14. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t group by t.county;
      insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
  • 15. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
  • 16. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
  • 17. Step 2: Fire Away [diagram labels: AdCo; Hadoop]
  • 18. Come back anytime
  • 19. Hadoop as Service
      1. Detect when cluster is required
          – Not all Hive statements require a cluster (EXPLAIN/SHOW/..)
      2. Atomically create cluster
          – Long-running process, concurrency control using MySQL
      3. Shut down when not in use
          – Do on hour boundary (whose?)
          – Not if User Sessions are active!
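
The three steps on slide 19 can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: the function names, the `cluster_lock` table, and the five-minute window are hypothetical, `sqlite3` stands in for the MySQL database the slide mentions so the example runs on its own, and the "hour boundary" is assumed to mean the billing hour counted from cluster launch (the slide itself leaves "whose?" open).

```python
import sqlite3

# Statement keywords assumed to need no cluster. The slide names EXPLAIN and
# SHOW; DESCRIBE is an additional guess made for this sketch.
METADATA_ONLY = ("explain", "show", "describe")


def needs_cluster(statement: str) -> bool:
    """Step 1: return True when the statement's first keyword is not metadata-only."""
    words = statement.strip().lower().split(None, 1)
    return bool(words) and words[0] not in METADATA_ONLY


def try_claim_cluster_start(conn: sqlite3.Connection, account_id: str) -> bool:
    """Step 2: let exactly one caller win the right to start the cluster.

    The primary key on account_id makes the INSERT act as a lock: a second
    caller hits a uniqueness violation and knows a start is already underway.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cluster_lock (account_id) VALUES (?)", (account_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def should_shut_down(minutes_since_launch: int, active_sessions: int,
                     running_jobs: int, window_minutes: int = 5) -> bool:
    """Step 3: shut down only when idle and close to the end of a paid hour."""
    if active_sessions > 0 or running_jobs > 0:
        return False
    return minutes_since_launch % 60 >= 60 - window_minutes


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cluster_lock (account_id TEXT PRIMARY KEY)")
    print(needs_cluster("show tables"))
    print(try_claim_cluster_start(conn, "adco"))
    print(try_claim_cluster_start(conn, "adco"))
    print(should_shut_down(57, active_sessions=0, running_jobs=0))
```

Using a uniqueness constraint as the lock keeps the decision in one atomic database operation, which matters because cluster creation is long-running and several queries may arrive while it is in progress.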
  • 20. Hadoop as Service
      • Archive Job History/Logs to S3
          – Transparent access to Old jobs
      • Auto-Config different node types
          – Use ALL ephemeral drives for HDFS/MR
          – Use right number of slots per machine
      • Scrub, Scrub, Scrub
          – Bad Nodes, Bad Clusters, AWS timeouts
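
As one illustration of the "Auto-Config different node types" bullet, the sketch below derives per-node settings from a node description. The `NodeType` class, the directory layout, and the slot heuristic (one map slot per core, capped by memory, and half as many reduce slots) are assumptions made for this example; the slides do not say how Qubole sizes slots. The property names are standard Hadoop 1.x settings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeType:
    """Hypothetical description of an instance type."""
    name: str
    cores: int
    memory_mb: int
    ephemeral_mounts: List[str]


def hadoop_config(node: NodeType, mb_per_slot: int = 1024,
                  reserved_mb: int = 2048) -> Dict[str, str]:
    """Spread HDFS/MR directories over every ephemeral drive and size task slots."""
    # Assumed heuristic: one map slot per core, but never more than memory allows.
    by_memory = max(1, (node.memory_mb - reserved_mb) // mb_per_slot)
    map_slots = max(1, min(node.cores, by_memory))
    reduce_slots = max(1, map_slots // 2)
    return {
        # Comma-separated lists make Hadoop use all listed drives.
        "dfs.data.dir": ",".join(f"{m}/hdfs/data" for m in node.ephemeral_mounts),
        "mapred.local.dir": ",".join(f"{m}/mapred/local" for m in node.ephemeral_mounts),
        "mapred.tasktracker.map.tasks.maximum": str(map_slots),
        "mapred.tasktracker.reduce.tasks.maximum": str(reduce_slots),
    }


if __name__ == "__main__":
    example = NodeType("example-large", cores=4, memory_mb=15360,
                       ephemeral_mounts=["/mnt", "/mnt1"])
    for key, value in hadoop_config(example).items():
        print(f"{key}={value}")
```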
  • 21. Scaling Up [diagram labels: Slaves; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
  • 22. Scaling Up [diagram labels: Slaves; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 23. Scaling Up [diagram labels: Slaves; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 24. Scaling Up [diagram labels: Slaves; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 25. Scaling Up [diagram labels: Slaves; Progress; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 26. Scaling Up [diagram labels: Slaves; Progress; Map Tasks; Reduce Tasks; Supply; Demand; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 27. Scaling Up [diagram labels: Slaves; Progress; Map Tasks; Reduce Tasks; Supply; Demand; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 28. Scaling Up [diagram labels: Slaves; Progress; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 29. Scaling Up [diagram labels: Slaves; Progress; Map Tasks; Reduce Tasks; Job Tracker; Master; StarCluster; AWS]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 30. Scaling Down
      1. On hour boundary, check if node is required:
          – Can't remove nodes with map-outputs (today)
          – Don't go below minimum cluster size
      2. Remove node from Map-Reduce Cluster
      3. Request HDFS Decommissioning (fast!)
          – Delete affected cache files instead of re-replicating
          – One surviving replica and we are done.
      4. Delete Instance
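
The scale-down checks on slide 30 can be sketched as below. The `Node` fields, the function names, and the five-minute window before the hour boundary are hypothetical, and the hour is assumed to be counted from each node's launch; the slides do not specify these details. The second function illustrates the "one surviving replica" rule: only blocks whose sole replica sits on the departing node need to be copied before the instance is deleted.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Node:
    """Hypothetical view of a slave node as seen by the scaling logic."""
    node_id: str
    minutes_since_launch: int
    has_map_outputs: bool
    running_tasks: int


def pick_nodes_to_remove(nodes: List[Node], min_cluster_size: int,
                         window_minutes: int = 5) -> List[str]:
    """Step 1: choose idle nodes near their hour boundary, respecting the minimum size."""
    budget = len(nodes) - min_cluster_size  # how many nodes may leave at most
    chosen: List[str] = []
    for node in nodes:
        if len(chosen) >= budget:
            break
        near_boundary = node.minutes_since_launch % 60 >= 60 - window_minutes
        if near_boundary and not node.has_map_outputs and node.running_tasks == 0:
            chosen.append(node.node_id)
    return chosen


def blocks_needing_copy(replicas: Dict[str, Set[str]], leaving: str) -> List[str]:
    """Step 3: blocks with no replica outside the leaving node must be copied first."""
    return sorted(
        block for block, holders in replicas.items()
        if leaving in holders and not (holders - {leaving})
    )


if __name__ == "__main__":
    cluster = [
        Node("n1", 58, False, 0),
        Node("n2", 58, True, 0),   # still holds map outputs, so it stays
        Node("n3", 20, False, 0),  # far from its hour boundary
        Node("n4", 59, False, 0),
    ]
    print(pick_nodes_to_remove(cluster, min_cluster_size=2))
    print(blocks_needing_copy({"b1": {"n1", "n2"}, "b2": {"n1"}}, leaving="n1"))
```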
  • 31. Spot Instances: On average 50-60% cheaper
  • 32. Spot Instance: Challenges
      • Can lose Spot nodes anytime
          – Disastrous for HDFS
          – Hybrid Mode: Use mix of On-Demand and Spot
          – Hybrid Mode: Keep one replica in On-Demand nodes
      • Spot Instances may not be available
          – Timeout and use On-Demand nodes as fallback
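
The two mitigations on slide 32 can be sketched as follows. Everything here is illustrative: `place_replicas`, `acquire_nodes`, and the two request callables are hypothetical names, and no real AWS or HDFS API is called. The first function keeps one replica of each block on an On-Demand node; the second falls back to On-Demand capacity when a Spot request does not deliver enough nodes within the timeout.

```python
import random
from typing import Callable, List


def place_replicas(on_demand: List[str], spot: List[str],
                   replication: int = 2, rng=random) -> List[str]:
    """Pin the first replica to an On-Demand node; place the rest on any other node."""
    if not on_demand:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    first = rng.choice(on_demand)
    others = [n for n in on_demand + spot if n != first]
    rest = rng.sample(others, min(replication - 1, len(others)))
    return [first] + rest


def acquire_nodes(count: int,
                  request_spot: Callable[[int, float], List[str]],
                  request_on_demand: Callable[[int], List[str]],
                  timeout_s: float) -> List[str]:
    """Try Spot first; cover any shortfall after the timeout with On-Demand nodes.

    request_spot(count, timeout_s) is assumed to return the ids of the Spot
    nodes granted within the timeout (possibly fewer than requested).
    """
    granted = list(request_spot(count, timeout_s))
    missing = count - len(granted)
    if missing > 0:
        granted += request_on_demand(missing)
    return granted


if __name__ == "__main__":
    # Stub providers standing in for real provisioning calls.
    spot_stub = lambda count, timeout_s: ["spot-1"]
    on_demand_stub = lambda count: [f"od-{i}" for i in range(count)]
    print(acquire_nodes(3, spot_stub, on_demand_stub, timeout_s=600))
    print(place_replicas(["od-0"], ["spot-1", "spot-2"], replication=2))
```

Pinning one replica to On-Demand capacity means that losing every Spot node at once still leaves each block with a surviving copy.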
  • 33. Agenda
      • What is Qubole Data Service
      • Hadoop as a Service in Cloud
      • Hive as a Service in Cloud
  • 34. Query History/Results
  • 35. Cheap to Test: Evaluate expressions on sample data
  • 36. Cheap to Test: Run Query on Sample
  • 37. Fastest Hive SaaS
      • Works with Small Files!
          – Faster Split Computation (8x)
          – Prefetching S3 files (30%)
  • 38. Fastest Hive SaaS
      • Works with Small Files!
          – Faster Split Computation (8x)
          – Prefetching S3 files (30%)
      • Stable JVM Reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
  • 39. Fastest Hive SaaS
      • Works with Small Files!
          – Faster Split Computation (8x)
          – Prefetching S3 files (30%)
      • Stable JVM Reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
      • Direct writes to S3
          – HIVE-1620
  • 40. Fastest Hive SaaS
      • Works with Small Files!
          – Faster Split Computation (8x)
          – Prefetching S3 files (30%)
      • Stable JVM Reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
      • Direct writes to S3
          – HIVE-1620
      • Columnar Cache
          – Use HDFS as cache for S3
          – Up to 5x faster for JSON data
  • 41. Fastest Hive SaaS
      • Works with Small Files!
          – Faster Split Computation (8x)
          – Prefetching S3 files (30%)
      • Stable JVM Reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
      • Direct writes to S3
          – HIVE-1620
      • Columnar Cache
          – Use HDFS as cache for S3
          – Up to 5x faster for JSON data
      • NEW – Multi-Tenant Hive Server
  • 42. Questions? @Qubole. Free Trial: www.qubole.com
