Qubole Overview at the Fifth Elephant Conference


Published in: Technology
  • Intro: self, Qubole. In this video, we will see how users set up a Qubole cluster in 3 simple steps. Those 3 steps are…
  • They are… 1, 2, 3. Now let's look at the details of each step, starting with step #1.
  • Qubole is a big data platform that is optimized to run on cloud infrastructure. Today, Qubole supports the Amazon cloud. A Qubole cluster deploys each Qubole compute node on an Amazon EC2 instance and persists data on S3. You need to decide how you will deploy the Qubole cluster. You have the option of deploying on your own Amazon storage and compute or … you also have the option of… Qubole recommends… Next, let's take a look at step 2.
  • Qubole uses a distributed processing architecture to process and analyze very large structured and semi-structured data sets. It is designed to run on cloud infrastructure…. Today, Qubole supports the Amazon cloud. The Qubole cluster management software instantiates and manages Qubole compute nodes on EC2 instances and persists data on S3. To set up, a decision on… As a user, you can choose to deploy Qubole on your own AWS storage and compute resources or their own… you also have the option of…

    1. The Elephant in the Cloud Qubole Data Platform
    2. Cloud is Awesome
         • On-Demand
         • Elastic
         • Cheap – Spot Instances!
         • Infinite Storage
    3. But it’s Complicated ..
    4. But it’s Complicated ..
         • Set up my own Hive metastore .. damn.
         • Set up my own cluster, hmmm ..
             – How many nodes?
             – What type of nodes?
             – Spot vs. On-Demand? How to bid?
             – What happens if Spot instances disappear?
         • Why did my query fail last night?
         • How to schedule something to run periodically?
    5. Easier:
    6. Easier: month old Job
    7. Auto-Scaling
         select t.county, count(1)
         from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t
         group by t.county;
         Newco_Hadoop
         insert overwrite table dest
         select a.id, a.zip, count(distinct b.uid)
         from ads a join LARGE_TABLE b on (a.id=b.ad_id)
         group by a.id, a.zip;
    8. Consolidation=Efficiency
    9. Consolidation=Efficiency
    10. Engineering Trivia
         • When to add/delete nodes?
             – Project future demand using JobTracker Stats
         • How to safely delete nodes?
             – Don’t if they hold intermediate data
             – Decommission from HDFS
             – Delete cache blocks
         • How to place Data?
             – One copy on Core Nodes
             – Cached File to Node Affinity
    11. Issues with Cloud Storage
         • Slower compared to Local Drives (4x)
         • Very slow on small files (5x)
         • Tremendous Variance (avg: 95, stddev: 25)
    12. Switch to HDFS?
         • S3DistCp for Efficient Copy between S3 and HDFS
           We have also made available S3DistCp, an extension of the open source Apache DistCp tool for distributed data copy, that has been optimized to work with Amazon S3. Using S3DistCp, you can efficiently copy large amounts of data between Amazon S3 and HDFS on your Amazon EMR job flow or copy files between Amazon S3 buckets. During data copy you can also optimize your files for Hadoop processing. This includes modifying compression schemes, concatenating small files, and creating partitions.
    13. Switch to HDFS? Use HDFS as Cache
    14. Columnar-Cloud-Cache [diagram labels: Cluster-1, MapTask, Uploader, MR, HDFS, page_views.json, S3]
    15. Columnar-Cloud-Cache [diagram labels: Cluster-1, MapTask, MR, HDFS, page_views.json, S3]
    16. Columnar-Cloud-Cache [diagram labels: Cluster-2, MR, HDFS, S3]
    17. Columnar-Cloud-Cache [diagram labels: Cluster-2, MapTask, Uploader, MR, HDFS, page_views.json, S3]
    18. vs. S3 [chart labels: csv, json]
         • Up to 5x faster
         • Predictable
    19. HDFS as Cache
         • Drop cached files liberally:
             – When nodes are decommissioned
             – When nodes fail
         • Make block placement smart:
             – Always maintain copy in core node
    20. Tip of Iceberg
         • Extract data samples to MySQL
             – Quick expression evaluation
         • Checkbox for Fast and Dirty Queries
             – Sample data automatically
             – Stop computation after 90%
             – Approximate count distinct
         • Periodic Jobs!
         • Query Authoring widgets
    21. Q&A
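
The "Engineering Trivia" slide lists rules for when to add or delete nodes and how to delete them safely. The Python sketch below is one possible reading of those rules, not Qubole's actual implementation. Every name in it (`Node`, `target_cluster_size`, `removable_nodes`) and every parameter is hypothetical, and the demand projection is a simple pending-tasks-per-slot estimate standing in for the JobTracker statistics the slide mentions. Treating core nodes as never removable is also an assumption, drawn from the "One copy on Core Nodes" bullet.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a cluster node; fields are illustrative only."""
    name: str
    is_core: bool                  # assumed: core nodes hold the durable data copy
    holds_intermediate_data: bool  # e.g. map output that reducers still need


def target_cluster_size(pending_tasks, slots_per_node, min_nodes, max_nodes):
    """Project demand from pending work and clamp it to [min_nodes, max_nodes].

    A stand-in for "project future demand using JobTracker stats".
    """
    needed = math.ceil(pending_tasks / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


def removable_nodes(nodes, current_size, target_size):
    """Return the nodes that look safe to remove when shrinking the cluster.

    Nodes holding intermediate data are skipped ("don't if they hold
    intermediate data"), and core nodes are skipped by assumption.
    Decommissioning from HDFS and deleting cache blocks would happen
    afterwards and are not modeled here.
    """
    surplus = max(0, current_size - target_size)
    safe = [n for n in nodes if not n.is_core and not n.holds_intermediate_data]
    return safe[:surplus]


if __name__ == "__main__":
    cluster = [
        Node("core-1", is_core=True, holds_intermediate_data=False),
        Node("spot-1", is_core=False, holds_intermediate_data=True),
        Node("spot-2", is_core=False, holds_intermediate_data=False),
        Node("spot-3", is_core=False, holds_intermediate_data=False),
    ]
    target = target_cluster_size(pending_tasks=10, slots_per_node=8,
                                 min_nodes=1, max_nodes=10)
    victims = removable_nodes(cluster, len(cluster), target)
    print(target, [n.name for n in victims])
```

With 10 pending tasks and 8 slots per node the projected size is 2, so the 4-node cluster has a surplus of 2. Only `spot-2` and `spot-3` qualify for removal, because `core-1` is a core node and `spot-1` still holds intermediate data.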