Qubole
Click to Query your Big Data on the Cloud
A company like Facebook provides data infrastructure as a service (created by the founders of Qubole)
- More than 30% of the company uses this infrastructure every month
- Users range from developers and analysts to business analysts and business users
- Manages over an exabyte of data
- Has made the company more data-driven and agile in its use of data
- It took the founders a team of over 30 people to create this infrastructure; the team managing it now has more than 100 people
[Diagram: roles served by the Data Infrastructure: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]
QUBOLE VISION: DATA FOR ALL, CLICK-TO-QUERY
- ~170+ PB of data processed per month
- 10 – 3000 node clusters on a daily basis
- 300,000 machines per month
- 20,000 jobs on a daily basis
AGILITY TIME-TO-INSIGHT CLICK-TO-QUERY
CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.
Industries and Use Cases
Industries: Media & Advertising; Oil & Gas; Retail; Life Sciences; Financial Services; Security; Social Networking & Gaming
Use cases: Targeted Advertising; Seismic Analysis; Image and Video Processing; Customer Profile; Transaction Analysis; Genome Analysis; Monte Carlo Simulations; Risk Analysis; Fraud Detection; Anti-virus; Image Recognition; In-game Metrics; Usage Analysis; User Demographics
Workload Classifications
Match Your Processing Engines to Your Workload Parameters
Workloads: Predefined Reporting; Ad Hoc Analytics; Statistical Analytics; Predictive Analytics; Machine Learning; MapReduce Streaming
Processing engines: SQL; Data Pipeline; MapReduce; Spark; NoSQL Store
Qubole on the cloud:
• 10-1000+ nodes in < 5 min
• Flexible - different nodes for different loads
• Data For All - usable by many
• Low TCO - only ON when needed
Conventional always-on Hadoop deployment:
• Extensive planning required - inflexible and static
• Not built for the cloud
• Needs Hadoop experts to install, maintain and use
• High TCO - always ON
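The "only ON when needed" versus "always ON" contrast above comes down to node-hours. The following is a rough arithmetic sketch, not a pricing model: the node count, the hourly price, and the usage pattern are all made-up assumptions chosen only to show the shape of the calculation.

```python
# Illustrative only: every price and usage figure below is a hypothetical assumption.

def monthly_cost(nodes: int, hours_on: float, price_per_node_hour: float) -> float:
    """Cost of running `nodes` machines for `hours_on` hours in one month."""
    return nodes * hours_on * price_per_node_hour


HOURS_IN_MONTH = 24 * 30   # assumed 30-day month
PRICE = 0.50               # hypothetical price in dollars per node-hour

# Always-on cluster: sized for peak load and running around the clock.
always_on = monthly_cost(nodes=100, hours_on=HOURS_IN_MONTH, price_per_node_hour=PRICE)

# On-demand cluster: same peak size, but up for only 6 hours on each of 22 working days.
on_demand = monthly_cost(nodes=100, hours_on=6 * 22, price_per_node_hour=PRICE)

# Fraction of the always-on bill that the on-demand pattern avoids.
savings = 1 - on_demand / always_on

print(f"always on: ${always_on:,.0f}  on demand: ${on_demand:,.0f}  savings: {savings:.0%}")
```

The saving depends entirely on how many hours the cluster actually needs to be up, so a workload that runs around the clock would see little difference.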
[Architecture diagram, labels only]
- User access: Qubole UI via browser, SDK, ODBC
- Connections: REST API (HTTPS), SSH
- Accounts: Qubole's AWS account, customer's AWS account
- Ephemeral Hadoop clusters, managed by Qubole: master and slave nodes, encrypted HDFS
- Ephemeral web tier: web servers
- Encrypted result cache
- RDS: Qubole user and account configurations (encrypted credentials)
- Amazon S3: no HDFS load, with S3 Server Side Encryption
- Hive metastore: default, or custom (optional)
- Other RDS, Redshift (optional)
- Data flow within customer's AWS
Encryption Options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
BUILT FOR CLOUD PERFORMANCE COST-EFFICIENT
Ephemeral Clusters:
• Auto-Scaling - both up and down
• Spot Instances - data management and back-fill
• VMs deployed with awareness of time
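As a minimal sketch of what scaling "both up and down" can mean, the snippet below sizes a cluster from its task backlog and clamps the result to configured bounds. This is not Qubole's algorithm: the function names, the backlog-based rule, and the bounds of 10 and 1000 nodes are assumptions made for illustration.

```python
import math


def target_nodes(pending_tasks: int, tasks_per_node: int,
                 min_nodes: int, max_nodes: int) -> int:
    """Return the node count needed for the backlog, clamped to [min_nodes, max_nodes]."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes: int, target: int) -> str:
    """Describe the change required to move from the current size to the target size."""
    if target > current_nodes:
        return f"add {target - current_nodes} nodes"
    if target < current_nodes:
        return f"remove {current_nodes - target} nodes"
    return "no change"


# Hypothetical usage: a cluster of 10 nodes facing a backlog of 950 tasks.
target = target_nodes(pending_tasks=950, tasks_per_node=10, min_nodes=10, max_nodes=1000)
print(scaling_action(current_nodes=10, target=target))
```

A production policy would also have to account for running tasks, data held on nodes being removed, and the availability of spot instances, none of which this sketch models.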
Demo
Why Qubole?
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform lot more scalable and
stable”

Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because of stability, and rapidly expanded big data usage by giving users beyond developers access to data.
- Rapid expansion of big data beyond developers (240 users out of a 600-person company)
Use Cases
- Rapid expansion in use cases, ranging from ETL and search to ad hoc querying and product analytics
- Rock-solid infrastructure sees 50% fewer failures compared to AWS Elastic MapReduce
- Enterprise-scale processing and data access
[Chart: User and Query Growth]
Why Qubole?
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to Big data on the cloud (from internal Oracle
clusters) because getting to analysis was much
quicker than operating infrastructure themselves.
Used to answer client queries and power client
dashboards.
[Chart: # Commands Per Month, Aug-13 to Feb-14; y-axis: number of queries, 0 to 5000]
Use Cases
- Segment audiences based on their behavior, including user pathway and multi-dimensional recency analysis
- Build customer profiles (both univariate and multivariate) across thousands of first-party (i.e., client CRM files) and third-party (i.e., demographic) segments
- Simplify attribution insights, showing the effects of upper-funnel prospecting on lower-funnel remarketing media strategies

BIPD Tech Tuesday Presentation - Qubole
