NATC 2013 - Big Data as a Service
NASSCOM Annual Technology Conference 2013

Speaker: Joydeep Sen Sarma, Co-Founder, Qubole

    Presentation Transcript

    • Big Data as a Service
      • Joydeep Sen Sarma | The Big Data SaaS Company
    • Who’s Qubole
      • Founded 10/2011:
        – Ashish Thusoo & Joydeep Sen Sarma (Apache Hive, Facebook)
        – Plus alumni of Oracle, GreenPlum, Vertica, Aster, Karmasphere, TerraCotta, Microsoft
      • Rapidly growing:
        – Engineering: Palo Alto (5), Bangalore (16)
        – Business: Palo Alto (4)
      • Series A from LightSpeed and Charles River
    • Thesis: Managed Big Data as a Service in the Cloud
      • SaaS will displace shipped software
      • Cloud will displace bare-metal
      • Big Data is already displacing RDBMS
    • Big Data Puzzle (diagram of separate pieces)
      • GUI (Hue)
      • Interfaces (ODBC/JDBC)
      • Operations Dashboard
      • Data Connectors (MongoAdaptor..)
      • Scheduler (Oozie)
      • Cloud Orchestration (Whirr) or Compute + Storage
      • Hadoop
      • Hive/Pig
      • Mahout/Weka
    • Meet “Qubole”
      • Diagram pieces: Operations Dashboard, Cloud Orchestration (Whirr) or Compute + Storage, GUI (Hue), Hadoop, Interfaces (ODBC/JDBC), Data Connectors (MongoAdaptor..), Scheduler (Oozie), Hive/Pig, Mahout/Weka
      • Fully integrated Big Data service
      • Users focus on analyzing data and building data-driven apps
      • Qubole manages infrastructure and cloud provisioning
    • Customers
    • Use Cases
      • Summarizing Logs and Reporting
      • Data Integration
      • Ad-Hoc Analysis of Historical Data
      • Preparing Data for Data Mining
      • Indexing Data for Search
      • Users
        – Developers (of end-products): Java/C++/Python
        – ETL and Data Engineers: SQL/Java/Python
        – Analysts: SQL/R
    • Qubole Data Service
      • Integrate – Analyze – Schedule – Visualize
      • Diagram labels: Vertica, Mysql, Oozie, Hive, Sqoop, Presto!, Pig, Hadoop, AWS EC2, AWS S3 (S3://adco/logs)
    • Now on GCE!
    • What Users Like
      • Simplicity
        – Great visual user interface
        – Zero operations
        – Accessible to analysts (i.e. non-engineers)
      • Efficiency
        – Significantly faster than the competition (in most cases)
        – Cluster consolidation is a game changer
        – Spot Instance integration
    • What Users Like (continued)
      • Managed Service Model
        – Constantly upgraded software
        – Support when needed
        – Qubole deals with AWS issues
      • Nine-Course Meal
        – Seamless integration of Hadoop/Hive/Pig/..
        – Unified Command/Workflow model (also Simplicity)
        – Fewer things to learn and manage:
          • “Please help us avoid Pentaho, Tableau, …”
    • Core Technology
      • Auto-Scaling Hadoop Clusters in the Cloud
        – Including OpenStack, Rackspace, GCE etc.
      • Fastest Hive SaaS
        – Numerous optimizations for cloud storage
        – 5x faster than EMR
      • Connectors
        – RDBMS, MongoDB/NoSQL, GA
        – Incremental data scrapes
      • Job Scheduler
        – Dependencies, workflows, incremental jobs
    • Auto-Scaling (AdCo Hadoop)
      • hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      • select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
      • insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
    • Scaling Up
      • Query: insert overwrite table dest select … from ads join campaigns on … group by …;
      • Diagram labels: Master, Job Tracker, Slaves, Progress, Map Tasks, Reduce Tasks, Supply, Demand, AWS, StarCluster
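The slide only labels "Supply" and "Demand" at the Job Tracker, so the following is a minimal sketch of one way such a comparison could drive scale-up, not Qubole's actual algorithm. Every name and parameter here (`nodes_to_add`, `target_secs`, the slot model) is a hypothetical chosen for illustration.

```python
import math


def nodes_to_add(pending_tasks, avg_task_secs, slots_per_node,
                 current_nodes, max_nodes, target_secs):
    """Return how many nodes to request so pending work fits in target_secs.

    Hypothetical model: demand is the task-seconds still waiting to run,
    supply is the task-seconds the current cluster can deliver in the window.
    """
    demand_secs = pending_tasks * avg_task_secs
    supply_secs = current_nodes * slots_per_node * target_secs
    if demand_secs <= supply_secs:
        return 0  # current cluster already keeps up with demand
    # Each extra node contributes slots_per_node * target_secs task-seconds.
    shortfall = demand_secs - supply_secs
    extra = math.ceil(shortfall / (slots_per_node * target_secs))
    # Never exceed the configured maximum cluster size.
    return min(extra, max(0, max_nodes - current_nodes))
```

With 5 nodes of 4 slots each and a 10-minute target, 300 pending one-minute tasks would ask for 3 more nodes under this model.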
    • Scaling Down
      1. On the hour boundary, check whether the node is still required:
        – Can’t remove nodes with map outputs (today)
        – Don’t go below the minimum cluster size
      2. Remove the node from the Map-Reduce cluster
      3. Request HDFS decommissioning (fast!)
        – Delete affected cache files instead of re-replicating
        – One surviving replica and we are done
      4. Delete the instance
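The first step of the scale-down procedure can be sketched as a selection function. This is only an illustration under stated assumptions: the `Node` fields, the `boundary_window` of 5 minutes, and the `busy` flag standing in for "node is required" are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_into_hour: int  # minutes since this node's billing hour began
    has_map_outputs: bool   # still holds map outputs that reducers may fetch
    busy: bool              # hypothetical stand-in for "node is required"


def removable_nodes(nodes, min_cluster_size, boundary_window=5):
    """Return names of nodes that may be released at their hour boundary."""
    removable = []
    remaining = len(nodes)
    for node in nodes:
        if remaining <= min_cluster_size:
            break  # never shrink below the minimum cluster size
        near_boundary = node.minutes_into_hour >= 60 - boundary_window
        if near_boundary and not node.busy and not node.has_map_outputs:
            removable.append(node.name)
            remaining -= 1
    return removable
```

The nodes returned here would then go through the remaining steps: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.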
    • Fastest Hive SaaS
      • Works with small files!
        – Faster split computation (8x)
        – Prefetching S3 files (30%)
      • Direct writes to S3
        – HIVE-1620
      • Multi-tenant Hive Server
      • Stable JVM reuse!
        – Fix re-entrancy issues
        – 1.2-2x speedup
      • Columnar cache
        – Use HDFS as a cache for S3
        – Up to 5x faster for JSON data
        – HIVE-4226
      • 5x faster than EMR in TPC-H against S3
    • Spot Instance Integration: up to 90% off
    • Spot Instance Integration
      • Can lose Spot nodes at any time
        – Disastrous for HDFS
        – Hybrid Mode: use a mix of On-Demand and Spot nodes
        – Hybrid Mode: keep one replica on On-Demand nodes
      • Spot Instances may not be available
        – Time out and use On-Demand nodes as a fallback
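The two hybrid-mode ideas above (one replica pinned to On-Demand nodes, and On-Demand fallback after a Spot timeout) can be sketched as follows. This is a toy illustration, not Qubole's implementation: `place_replicas`, `acquire_nodes`, and the two request callbacks are hypothetical names, and no real AWS or HDFS API is called.

```python
def place_replicas(on_demand_nodes, spot_nodes, replication=3):
    """Pick target nodes for one block, keeping one replica on On-Demand."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    # The first replica always lands on an On-Demand node, so losing every
    # Spot node still leaves one surviving copy.
    targets = [on_demand_nodes[0]]
    # Remaining replicas prefer Spot nodes, then other On-Demand nodes.
    candidates = list(spot_nodes) + list(on_demand_nodes[1:])
    targets.extend(candidates[:replication - 1])
    return targets


def acquire_nodes(count, request_spot, request_on_demand, timeout_secs):
    """Try Spot first; cover any shortfall after the timeout with On-Demand.

    request_spot(count, timeout_secs) and request_on_demand(count) are
    caller-supplied callbacks that return lists of acquired node names.
    """
    acquired = list(request_spot(count, timeout_secs))
    missing = count - len(acquired)
    if missing > 0:
        acquired.extend(request_on_demand(missing))
    return acquired
```

Passing the provisioning calls in as callbacks keeps the sketch independent of any particular cloud SDK.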
    • Closing Thoughts
      • AWS (/Cloud) is the new BIOS
      • Large multi-tenant [I/S]aaS is the new mainframe
        – Feedback loop is not available to average developers
        – Will be dominated by a few large companies
      • Open Source is the ocean that lifts the SaaS boat
        – But the boat has proprietary stuff
        – SaaS requires software innovation at a different pace
      • SaaS has network effects
        – Static software cannot keep up with rapidly evolving SaaS
    • Questions?
      • Me: joydeep@qubole.com
      • Us: @Qubole
      • Free Trial: www.qubole.com