NATC 2013 - Big Data as a Service

NASSCOM Annual Technology Conference 2013

Speaker: Joydeep Sen Sarma, Co-Founder, Qubole

Published in Technology, Business


  • 1. The Big Data SaaS Company Big Data as a Service Joydeep Sen Sarma | The Big Data SaaS Company
  • 2. Who’s Qubole • Founded 10/2011: – Ashish Thusoo & Joydeep Sen Sarma, Apache Hive, Facebook – +Alumni - Oracle, GreenPlum, Vertica, Aster, Karmasphere, TerraCotta, Microsoft • Rapidly growing: – Engineering: Palo Alto (5), Bangalore (16) – Business: Palo Alto (4) • Series-A from LightSpeed and Charles River
  • 3. Thesis: Managed Big Data as a Service in the Cloud • SaaS will displace shipped software • Cloud will displace bare-metal • Big Data is already displacing RDBMS
  • 4. Big Data Puzzle [Diagram components: GUI (Hue), Interfaces (ODBC/JDBC), Operations Dashboard, Data Connectors (MongoAdaptor..), Scheduler (Oozie), Cloud Orchestration (Whirr) or Compute + Storage, Hadoop, Hive/Pig, Mahout/Weka]
  • 5. Meet “Qubole” [Diagram components: Operations Dashboard, Cloud Orchestration (Whirr) or Compute + Storage, GUI (Hue), Hadoop, Interfaces (ODBC/JDBC), Data Connectors (MongoAdaptor..), Scheduler (Oozie), Hive/Pig, Mahout/Weka] • Fully integrated Big Data service • Users focus on analyzing data and building data-driven apps • Qubole manages infrastructure and cloud provisioning
  • 6. Customers
  • 7. Use Cases • Summarizing Logs and Reporting • Data Integration • Ad-Hoc analysis of Historical Data • Preparing Data for Data Mining • Indexing Data for Search • Users – Developers (of end-products) – Java/C++/Python – ETL and Data Engineers – SQL/Java/Python – Analysts – SQL / R
  • 8. Qubole Data Service: Integrate – Analyze – Schedule – Visualize [Diagram components: Vertica, MySQL, Oozie, Hive, Sqoop, Presto!, Pig, Hadoop, AWS EC2, AWS S3 (S3://adco/logs)]
  • 9. Now on GCE!
  • 10. What Users Like • Simplicity – Great Visual User Interface – Zero Operations – Accessible to Analysts (i.e. non-Engineers) • Efficiency – Significantly faster than competition (in most cases) – Cluster Consolidation is a game changer – Spot Instance integration
  • 11. What Users Like • Managed Service Model – Constantly upgrading software – Support when needed – Dealing with AWS issues • Nine-Course Meal – Seamless integration of Hadoop/Hive/Pig/.. – Unified Command/Workflow model (also Simplicity) – Fewer things to learn/manage: • “Please help us avoid Pentaho, Tableau, …”
  • 12. Core Technology • Auto-Scaling Hadoop Clusters in Cloud – Including OpenStack, Rackspace, GCE etc. • Fastest Hive SaaS – Numerous Optimizations for Cloud Storage – 5x faster than EMR • Connectors – RDBMS, MongoDB/NoSQL, GA – Incremental Data Scrapes • Job Scheduler – Dependencies, Workflows, Incremental Jobs
  • 13. Auto-Scaling hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache… select t.county, count(1) from (select transform( using ‘’ as a.county from SMALL_TABLE a) t group by t.county; insert overwrite table dest select,, count(distinct b.uid) from ads a join LARGE_TABLE b on ( group by,; [Diagram labels: AdCo, Hadoop]
  • 14. Scaling Up insert overwrite table dest select … from ads join campaigns on … group by …; [Diagram labels: Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks, Progress, Supply, Demand, AWS, StarCluster]
  • 15. Scaling Down 1. On hour boundary – check if node is required: – Can’t remove nodes with map-outputs (today) – Don’t go below minimum cluster size 2. Remove node from Map-Reduce Cluster 3. Request HDFS Decommissioning – fast! – Delete affected cache files instead of re-replicating – One surviving replica and we are done. 4. Delete Instance
  • 16. Fastest Hive SaaS • Works with Small Files! – Faster Split Computation (8x) – Prefetching S3 files (30%) • Direct writes to S3 – HIVE-1620 • Multi-Tenant Hive Server • Stable JVM Reuse! – Fix re-entrancy issues – 1.2-2x speedup • Columnar Cache – Use HDFS as cache for S3 – Up to 5x faster for JSON data – HIVE-4226 • 5x faster than EMR in TPC-H against S3
  • 17. Spot Instance Integration: Up to 90% off
  • 18. Spot Instance Integration • Can lose Spot nodes at any time – Disastrous for HDFS – Hybrid Mode: Use mix of On-Demand and Spot – Hybrid Mode: Keep one replica in On-Demand nodes • Spot Instances may not be available – Time out and use On-Demand nodes as fallback
  • 19. Closing Thoughts • AWS (/Cloud) is the new BIOS • Large multi-tenant [I/S]aaS is the new mainframe – Feedback loop is not available to average developers – Will be dominated by a few large companies • Open Source is the ocean that lifts the SaaS boat – But the boat has proprietary stuff – SaaS requires software innovation at a different pace • SaaS has network effects – Static software cannot keep up with rapidly evolving SaaS
  • 20. Questions? Us: @Qubole Free Trial:
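
The scale-down procedure on slide 15 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Qubole's implementation: the `Node` fields, the `boundary_window` parameter, and the `idle` flag standing in for "node is not required" are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    minutes_into_hour: int  # minutes elapsed in the instance's current hour (assumed field)
    has_map_outputs: bool   # node still holds map outputs, so it cannot be removed
    idle: bool              # stand-in for "node is not required" (assumption)


# Ordered steps from the slide, applied to each node chosen for removal.
SCALE_DOWN_STEPS = (
    "remove node from Map-Reduce cluster",
    "request HDFS decommissioning",
    "delete instance",
)


def removable_nodes(nodes: List[Node], min_cluster_size: int,
                    boundary_window: int = 5) -> List[Node]:
    """Pick nodes that may be removed without violating the slide's constraints."""
    chosen = []
    remaining = len(nodes)
    for node in nodes:
        if remaining <= min_cluster_size:
            break  # never go below the minimum cluster size
        # Only consider a node when it is close to its hour boundary.
        near_boundary = node.minutes_into_hour >= 60 - boundary_window
        if near_boundary and node.idle and not node.has_map_outputs:
            chosen.append(node)
            remaining -= 1
    return chosen
```

Each node returned by `removable_nodes` would then go through `SCALE_DOWN_STEPS` in order. Checking only near the hour boundary reflects the slide's first step; the five-minute window is an arbitrary choice for the sketch.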
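
The hybrid Spot/On-Demand ideas on slide 18 can be illustrated in the same hedged way. The sketch below assumes two hypothetical callables, `request_spot` and `request_on_demand`, that return lists of node ids; they are placeholders for whatever provisioning layer is in use and are not real AWS or Qubole APIs. `place_replicas` shows one simple way to keep one replica of a block on an On-Demand node; the actual placement policy is not described in the slides.

```python
from typing import Callable, List, Sequence


def acquire_nodes(count: int,
                  request_spot: Callable[[int, float], List[str]],
                  request_on_demand: Callable[[int], List[str]],
                  timeout_s: float) -> List[str]:
    """Try Spot first; cover any shortfall after the timeout with On-Demand nodes."""
    # request_spot is assumed to return whatever it obtained within timeout_s,
    # which may be fewer than `count` nodes (or none).
    granted = list(request_spot(count, timeout_s))
    shortfall = count - len(granted)
    if shortfall > 0:
        granted += request_on_demand(shortfall)
    return granted


def place_replicas(on_demand: Sequence[str], spot: Sequence[str],
                   replication: int = 3) -> List[str]:
    """Choose nodes for one block, keeping at least one replica on On-Demand."""
    if not on_demand:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    chosen = [on_demand[0]]
    # Fill the remaining replicas from Spot nodes first, then other On-Demand nodes.
    pool = list(spot) + list(on_demand[1:])
    chosen.extend(pool[: replication - 1])
    return chosen
```

With this placement, losing every Spot node at once still leaves one surviving replica of each block, which is the property the slide asks for.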