NATC 2013 - Big Data as a Service

NASSCOM Annual Technology Conference 2013

Speaker: Joydeep Sen Sarma, Co-Founder, Qubole

Transcript

  • 1. The Big Data SaaS Company Big Data as a Service Joydeep Sen Sarma | The Big Data SaaS Company
  • 2. Who’s Qubole
      • Founded 10/2011:
          – Ashish Thusoo & Joydeep Sen Sarma, Apache Hive, Facebook
          – + Alumni: Oracle, GreenPlum, Vertica, Aster, Karmasphere, TerraCotta, Microsoft
      • Rapidly growing:
          – Engineering: Palo Alto (5), Bangalore (16)
          – Business: Palo Alto (4)
      • Series-A from LightSpeed and Charles River
  • 3. Thesis: Managed Big Data as a Service in the Cloud
      • SaaS will displace shipped software
      • Cloud will displace bare-metal
      • Big Data is already displacing RDBMS
  • 4. Big Data Puzzle
      • Diagram labels: GUI (Hue); Interfaces (ODBC/JDBC); Operations Dashboard; Data Connectors (MongoAdaptor..); Scheduler (Oozie); Cloud Orchestration (Whirr) or Compute + Storage; Hadoop; Hive/Pig; Mahout/Weka
  • 5. Meet “Qubole”
      • Diagram labels: Operations Dashboard; Cloud Orchestration (Whirr) or Compute + Storage; GUI (Hue); Hadoop; Interfaces (ODBC/JDBC); Data Connectors (MongoAdaptor..); Scheduler (Oozie); Hive/Pig; Mahout/Weka
      • Fully integrated Big Data service
      • Users focus on analyzing data and building data-driven apps
      • Qubole manages infrastructure and cloud provisioning
  • 6. Customers
  • 7. Use Cases
      • Summarizing Logs and Reporting
      • Data Integration
      • Ad-Hoc analysis of Historical Data
      • Preparing Data for Data Mining
      • Indexing Data for Search
      • Users
          – Developers (of end-products): Java/C++/Python
          – ETL and Data Engineers: SQL/Java/Python
          – Analysts: SQL/R
  • 8. Qubole Data Service: Integrate – Analyze – Schedule – Visualize
      • Diagram labels: Vertica; MySQL; Sqoop; Oozie; Hive; Pig; Presto!; Hadoop; AWS EC2; AWS S3; S3://adco/logs
  • 9. Now on GCE!
  • 10. What Users Like
      • Simplicity
          – Great visual user interface
          – Zero operations
          – Accessible to analysts (i.e., non-engineers)
      • Efficiency
          – Significantly faster than competition (in most cases)
          – Cluster consolidation is a game changer
          – Spot Instance integration
  • 11. What Users Like
      • Managed Service Model
          – Constantly upgrading software
          – Support when needed
          – Dealing with AWS issues
      • Nine-Course Meal
          – Seamless integration of Hadoop/Hive/Pig/..
          – Unified Command/Workflow model (also Simplicity)
          – Fewer things to learn/manage:
              • “Please help us avoid Pentaho, Tableau, …”
  • 12. Core Technology
      • Auto-Scaling Hadoop Clusters in Cloud
          – Including OpenStack, Rackspace, GCE etc.
      • Fastest Hive SaaS
          – Numerous optimizations for cloud storage
          – 5x faster than EMR
      • Connectors
          – RDBMS, MongoDB/NoSQL, GA
          – Incremental data scrapes
      • Job Scheduler
          – Dependencies, workflows, incremental jobs
  • 13. Auto-Scaling
      • Diagram labels: AdCo; Hadoop
      • Example commands shown on the slide:
          hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…

          select t.county, count(1)
          from (select transform(a.zip) using 'geo.py' as (county) from SMALL_TABLE a) t
          group by t.county;

          insert overwrite table dest
          select a.id, a.zip, count(distinct b.uid)
          from ads a join LARGE_TABLE b on (a.id = b.ad_id)
          group by a.id, a.zip;
  • 14. Scaling Up
      • Example query: insert overwrite table dest select … from ads join campaigns on … group by …;
      • Diagram labels: Master; Slaves; Job Tracker; Map Tasks; Reduce Tasks; Progress; Supply; Demand; AWS; StarCluster
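
Slide 14 only names the signals (task progress, supply, demand). A minimal sketch of how such a scale-up decision could be computed follows. The function name, its parameters, and the "finish within a target time" policy are assumptions for illustration, not Qubole's actual algorithm.

```python
import math


def nodes_to_add(pending_tasks, avg_task_secs, slots_per_node,
                 current_nodes, target_secs, max_nodes):
    """Return how many nodes to request (hypothetical policy).

    Demand: task slots needed to drain the pending tasks within target_secs.
    Supply: task slots the cluster already has.
    """
    if pending_tasks <= 0:
        return 0
    demand_slots = math.ceil(pending_tasks * avg_task_secs / target_secs)
    supply_slots = current_nodes * slots_per_node
    shortfall = demand_slots - supply_slots
    if shortfall <= 0:
        return 0  # supply already covers demand
    wanted = math.ceil(shortfall / slots_per_node)
    # Never grow past the configured maximum cluster size.
    return max(0, min(wanted, max_nodes - current_nodes))
```

In this sketch the estimate would be recomputed as the job tracker reports progress, so a long-running query keeps pulling in nodes while a short one never triggers a request.
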
  • 15. Scaling Down
      1. On hour boundary, check if node is required:
          – Can’t remove nodes with map-outputs (today)
          – Don’t go below minimum cluster size
      2. Remove node from Map-Reduce cluster
      3. Request HDFS decommissioning – fast!
          – Delete affected cache files instead of re-replicating
          – One surviving replica and we are done
      4. Delete instance
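
The first step on slide 15 (deciding which nodes may go) can be sketched as below. The `Node` fields, the `window_secs` tolerance around the hour boundary, and the idea of treating a busy node as "required" are assumptions made for the example; the slide states only the two constraints on map outputs and minimum cluster size.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    uptime_secs: int       # seconds since the instance was launched
    has_map_outputs: bool  # running jobs still need map outputs stored here
    busy: bool             # tasks are currently running on the node


def near_hour_boundary(uptime_secs, hour_secs=3600, window_secs=300):
    """True if the node is within window_secs of completing a billed hour."""
    return hour_secs - (uptime_secs % hour_secs) <= window_secs


def pick_nodes_to_remove(nodes, min_cluster_size, window_secs=300):
    """Return the nodes that can be removed without violating the slide's rules."""
    removable = [
        n for n in nodes
        if near_hour_boundary(n.uptime_secs, window_secs=window_secs)
        and not n.busy
        and not n.has_map_outputs  # can't remove nodes with map outputs
    ]
    # Don't go below the minimum cluster size.
    spare = max(0, len(nodes) - min_cluster_size)
    return removable[:spare]
```

Each node returned would then go through the remaining steps in order: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.
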
  • 16. Fastest Hive SaaS
      • Works with small files!
          – Faster split computation (8x)
          – Prefetching S3 files (30%)
      • Direct writes to S3
          – HIVE-1620
      • Multi-tenant Hive server
      • Stable JVM reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
      • Columnar cache
          – Use HDFS as cache for S3
          – Up to 5x faster for JSON data
          – HIVE-4226
      • 5x faster than EMR in TPC-H against S3
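
The "use HDFS as cache for S3" idea on slide 16 is an instance of a read-through cache. The sketch below shows only that general pattern, with an in-memory dict standing in for HDFS and a caller-supplied function standing in for an S3 read. The class and method names are hypothetical, and the columnar layout of the real cache is not modeled.

```python
class ReadThroughCache:
    """Serve reads from a fast local store, falling back to a slow remote one."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: key -> bytes (stand-in for S3)
        self._cache = {}                   # stand-in for cache files on HDFS
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Miss: read from the remote store and keep a local copy.
        self.misses += 1
        data = self._fetch_remote(key)
        self._cache[key] = data
        return data

    def evict(self, key):
        """Drop a cached copy; the remote store remains the source of truth."""
        self._cache.pop(key, None)
```

Because the remote store stays authoritative, cached copies are disposable, which is consistent with slide 15 deleting affected cache files rather than re-replicating them when a node is decommissioned.
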
  • 17. Spot Instance Integration: up to 90% off
  • 18. Spot Instance Integration
      • Can lose Spot nodes anytime
          – Disastrous for HDFS
          – Hybrid Mode: use a mix of On-Demand and Spot nodes
          – Hybrid Mode: keep one replica on On-Demand nodes
      • Spot Instances may not be available
          – Time out and use On-Demand nodes as fallback
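
The two safeguards on slide 18 can be sketched as follows. `request_spot` and `request_on_demand` are hypothetical callables supplied by the caller (each takes a count and returns a list of node ids); they do not correspond to any real cloud SDK call. The polling loop and the replica-placement rule are likewise assumptions that illustrate the idea.

```python
import time


def acquire_nodes(count, request_spot, request_on_demand, timeout_secs,
                  poll_secs=1.0, clock=time.monotonic, sleep=time.sleep):
    """Try Spot capacity until timeout_secs elapses, then fill with On-Demand."""
    deadline = clock() + timeout_secs
    granted = []
    while len(granted) < count and clock() < deadline:
        granted.extend(request_spot(count - len(granted)))
        if len(granted) < count:
            sleep(poll_secs)
    missing = count - len(granted)
    if missing > 0:
        # Spot capacity did not arrive in time: fall back to On-Demand nodes.
        granted.extend(request_on_demand(missing))
    return granted


def place_replicas(on_demand_nodes, spot_nodes, replication=3):
    """Pick replica targets so one copy always lives on an On-Demand node."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    chosen = [on_demand_nodes[0]]
    others = spot_nodes + on_demand_nodes[1:]
    chosen.extend(others[:replication - 1])
    return chosen
```

Pinning one replica to an On-Demand node means that losing every Spot node at once still leaves a surviving copy of each block.
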
  • 19. Closing Thoughts
      • AWS (/Cloud) is the new BIOS
      • Large multi-tenant [I/S]aaS is the new mainframe
          – Feedback loop is not available to average developers
          – Will be dominated by a few large companies
      • Open Source is the ocean that lifts the SaaS boat
          – But the boat has proprietary stuff
          – SaaS requires software innovation at a different pace
      • SaaS has network effects
          – Static software cannot keep up with rapidly evolving SaaS
  • 20. Questions?
      • Me: joydeep@qubole.com
      • Us: @Qubole
      • Free Trial: www.qubole.com