The Big Data SaaS Company

Big Data as a Service
Joydeep Sen Sarma

NATC 2013 - Big Data as a Service

Published on

NASSCOM Annual Technology Conference 2013

Speaker: Joydeep Sen Sarma, Co-Founder, Qubole

Published in: Technology, Business
NATC 2013 - Big Data as a Service

  1. The Big Data SaaS Company
     Big Data as a Service
     Joydeep Sen Sarma
  2. Who’s Qubole
     • Founded 10/2011:
       – Ashish Thusoo & Joydeep Sen Sarma, Apache Hive, Facebook
       – Plus alumni of Oracle, GreenPlum, Vertica, Aster, Karmasphere, TerraCotta, Microsoft
     • Rapidly growing:
       – Engineering: Palo Alto (5), Bangalore (16)
       – Business: Palo Alto (4)
     • Series-A from LightSpeed and Charles River
  3. Thesis
     Managed Big Data as a Service in the Cloud
     • SaaS will displace shipped software
     • Cloud will displace bare-metal
     • Big Data is already displacing RDBMS
  4. Big Data Puzzle
     [Diagram labels]
     • GUI (Hue)
     • Interfaces (ODBC/JDBC)
     • Operations Dashboard
     • Data Connectors (MongoAdaptor..)
     • Scheduler (Oozie)
     • Cloud Orchestration (Whirr) or Compute + Storage
     • Hadoop
     • Hive/PIG
     • Mahout/Weka
  5. Meet “Qubole”
     [Diagram labels]
     • Operations Dashboard
     • Cloud Orchestration (Whirr) or Compute + Storage
     • GUI (Hue)
     • Hadoop
     • Interfaces (ODBC/JDBC)
     • Data Connectors (MongoAdaptor..)
     • Scheduler (Oozie)
     • Hive/PIG
     • Mahout/Weka
     Takeaways:
     • Fully Integrated Big Data Service
     • Users focus on analyzing and building data-driven apps
     • Qubole manages infrastructure and cloud provisioning
  6. Customers
  7. Use Cases
     • Summarizing Logs and Reporting
     • Data Integration
     • Ad-Hoc analysis of Historical Data
     • Preparing Data for Data Mining
     • Indexing Data for Search
     • Users
       – Developers (of end-products): Java/C++/Python
       – ETL and Data Engineers: SQL/Java/Python
       – Analysts: SQL / R
  8. Qubole Data Service
     Integrate – Analyze – Schedule – Visualize
     [Diagram labels: Vertica, MySQL, Oozie, Hive, Sqoop, Presto!, Pig, Hadoop, AWS EC2, AWS S3, S3://adco/logs]
  9. Now on GCE!
  10. What Users Like
      • Simplicity
        – Great Visual User Interface
        – Zero Operations
        – Accessible to Analysts (i.e. non-Engineers)
      • Efficiency
        – Significantly faster than competition (in most cases)
        – Cluster Consolidation is a game changer
        – Spot Instance integration
  11. What Users Like
      • Managed Service Model
        – Constantly Upgrading software
        – Support when needed
        – Dealing with AWS issues
      • Nine-Course Meal
        – Seamless integration of Hadoop/Hive/Pig/..
        – Unified Command/Workflow model (also Simplicity)
        – Fewer things to learn/manage:
          • “Please help us avoid Pentaho, Tableau, …”
  12. Core Technology
      • Auto-Scaling Hadoop Clusters in Cloud
        – Including OpenStack, Rackspace, GCE etc.
      • Fastest Hive SaaS
        – Numerous Optimizations for Cloud Storage
        – 5x faster than EMR
      • Connectors
        – RDBMS, MongoDB/NoSQL, GA
        – Incremental Data Scrapes
      • Job Scheduler
        – Dependencies, Workflows, Incremental Jobs
  13. Auto-Scaling
      [Diagram: example jobs; labels: AdCo, Hadoop]
      • hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      • select t.county, count(1)
        from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t
        group by t.county;
      • insert overwrite table dest
        select a.id, a.zip, count(distinct b.uid)
        from ads a join LARGE_TABLE b on (a.id=b.ad_id)
        group by a.id, a.zip;
  14. Scaling Up
      [Diagram labels: Master, Job Tracker, Slaves, Map Tasks, Reduce Tasks, Progress, Supply, Demand, AWS, StarCluster]
      insert overwrite table dest
      select … from ads join campaigns on … group by …;
  15. Scaling Down
      1. On hour boundary, check if node is required:
         – Can’t remove nodes with map-outputs (today)
         – Don’t go below minimum cluster size
      2. Remove node from Map-Reduce Cluster
      3. Request HDFS Decommissioning – fast!
         – Delete affected cache files instead of re-replicating
         – One surviving replica and we are done.
      4. Delete Instance
  16. Fastest Hive SaaS
      • Works with Small Files!
        – Faster Split Computation (8x)
        – Prefetching S3 files (30%)
      • Direct writes to S3
        – HIVE-1620
      • Multi-Tenant Hive Server
      • Stable JVM Reuse!
        – Fix re-entrancy issues
        – 1.2-2x speedup
      • Columnar Cache
        – Use HDFS as cache for S3
        – Up to 5x faster for JSON data
        – HIVE-4226
      • 5x faster than EMR in TPC-H against S3
  17. Spot Instance Integration
      Up to 90% off
  18. Spot Instance Integration
      • Can lose Spot nodes anytime
        – Disastrous for HDFS
        – Hybrid Mode: Use mix of On-Demand and Spot
        – Hybrid Mode: Keep one replica in On-Demand nodes
      • Spot Instances may not be available
        – Timeout and use On-Demand nodes as fallback
  19. Closing Thoughts
      • AWS (/Cloud) is the new BIOS
      • Large multi-tenant [I/S]aaS is the new mainframe
        – Feedback loop is not available to average developers
        – Will be dominated by a few large companies
      • Open Source is the ocean that lifts the SaaS Boat
        – But the Boat has proprietary stuff
        – SaaS requires software innovation at a different pace
      • SaaS has network effects
        – Static software cannot keep up with rapidly evolving SaaS
  20. Questions?
      Me: joydeep@qubole.com
      Us: @Qubole
      Free Trial: www.qubole.com
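
The scale-down check on slide 15 (Scaling Down) can be read as a simple filter over cluster nodes. The following is a minimal Python sketch of that reading, not Qubole's implementation. The `Node` fields, the `window` of minutes that counts as "on the hour boundary", and treating a node that is running tasks as "required" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node record; field names are illustrative only.
    name: str
    minutes_since_launch: int  # uptime in minutes
    has_map_outputs: bool      # still holds map outputs that reducers may fetch
    busy: bool                 # currently running tasks (assumed meaning of "required")


def at_hour_boundary(node, window=5):
    """True when the node is within `window` minutes of completing an hour of uptime."""
    return node.minutes_since_launch > 0 and node.minutes_since_launch % 60 >= 60 - window


def nodes_to_remove(nodes, min_cluster_size, window=5):
    """Return the nodes that may be removed without violating the slide's two rules."""
    removable = []
    remaining = len(nodes)
    for node in nodes:
        if remaining <= min_cluster_size:
            break  # never go below the minimum cluster size
        if not at_hour_boundary(node, window):
            continue  # only consider nodes close to an hour boundary
        if node.busy or node.has_map_outputs:
            continue  # node is still required, or holds map outputs
        removable.append(node)
        remaining -= 1
    return removable
```

Each node returned here would then go through the remaining steps on the slide: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.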
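
The two Hybrid Mode rules and the On-Demand fallback on slide 18 (Spot Instance Integration) can be sketched in the same spirit. This is an illustrative Python sketch under stated assumptions, not the actual HDFS placement policy or any cloud API: `place_replicas`, `acquire_nodes`, and the `request_spot` / `request_on_demand` callbacks are hypothetical names introduced here.

```python
import random


def place_replicas(on_demand, spot, replication=3, rng=random):
    """Choose target nodes for one block, keeping one replica on an On-Demand node."""
    if not on_demand:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    first = rng.choice(on_demand)  # the replica that survives losing every Spot node
    others = [n for n in on_demand + spot if n != first]
    rest = rng.sample(others, min(replication - 1, len(others)))
    return [first] + rest


def acquire_nodes(count, request_spot, request_on_demand, timeout_s):
    """Try Spot first; cover any shortfall with On-Demand nodes after the timeout.

    request_spot(count, timeout_s) and request_on_demand(count) are hypothetical
    callbacks that each return a list of node identifiers.
    """
    granted = list(request_spot(count, timeout_s))
    shortfall = count - len(granted)
    if shortfall > 0:
        granted += list(request_on_demand(shortfall))
    return granted
```

Pinning the first replica to an On-Demand node is what makes losing all Spot nodes at once survivable for HDFS in this sketch; the remaining replicas can land on either kind of node.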