Atlanta MLConf

September 18, 2015
Jason Huang
Senior Solutions Architect, Qubole Inc.

Company Founding
Qubole founders built the Facebook data platform.
The Facebook model changed the role of data in an enterprise.
• Needed to turn the data assets into a “utility” to make a viable business.
– Collaborative: over 30% of employees use the data directly.
– Accessible: developers, analysts, business analysts and business users all run queries. This has made the company more data driven and more agile in its use of data.
– Scalable: exabytes of data, moving fast.
It took the founders a team of over 30 people to create this infrastructure, and the team currently managing it has more than 100 people.
Work at Facebook inspired the founding of Qubole.
[Diagram labels: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers; Data Infrastructure]

Impediments for an Aspiring Data Driven Enterprise
Where Big Data falls short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives were classified as “Successful” in 2014
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite the skills gap as a major inhibitor
[Diagram labels: Rigid and inflexible infrastructure; Non-adaptive software services; Highly specialized systems; Difficult to build and operate]

State of the Big Data Industry (n=417)
[Bar chart. Y-axis: 0% to 80%. X-axis: Hadoop, MapReduce, Pig, Spark, Storm, Presto, Cassandra, HBase, Hive.]

Apache Spark
• Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Analytic Libraries:
• Spark Streaming (Streaming Data)
• Spark SQL (Data Processing)
• MLlib (Machine Learning)
• GraphX (Graph Processing)

Common Spark Use Cases
• Streaming Data
– Process streaming data with Spark built-in functions
– Applications such as fraud detection and log processing
– ETL via data ingestion
• Machine Learning
– Helps users run repeated queries and machine learning algorithms on data sets
– MLlib can work in areas such as clustering, classification, and dimensionality reduction
– Used for very common big data functions: predictive intelligence, customer segmentation, and sentiment analysis
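As a rough illustration of the log processing use case, the sketch below groups a stream of log lines into small batches and counts log levels per batch, which is the micro-batch idea Spark Streaming is built on. It is plain Python rather than the Spark API; the helper names (`micro_batches`, `count_levels`), the sample log lines and the batch size are all assumptions made for the example.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_levels(batch):
    """Count log lines per level; the level is assumed to be the first token."""
    return Counter(line.split(" ", 1)[0] for line in batch)


# Hypothetical log lines standing in for a live stream.
log_lines = [
    "INFO user=1 login",
    "ERROR user=2 payment declined",
    "INFO user=3 login",
    "ERROR user=2 payment declined",
    "WARN user=4 slow response",
]

# One Counter per micro-batch of two lines.
batch_counts = [count_levels(batch) for batch in micro_batches(log_lines, 2)]
for index, counts in enumerate(batch_counts):
    print(index, dict(counts))
```

In a real deployment the batches would come from a streaming source and the per-batch counting would run in parallel across the cluster; the sketch only shows the shape of the computation.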

Common Spark Use Cases
• Interactive Analysis
– MapReduce was built to handle batch processing
– MapReduce-based tools such as Hive or Pig can be too slow for interactive analysis
– Spark is fast enough to perform exploratory queries without sampling
– Provides multiple language-specific APIs including R, Python, Scala and Java
• Fog Computing
– The Internet of Things: objects and devices with tiny embedded sensors that communicate with each other and with users, creating a fully interconnected world
– Decentralize data processing and storage, and use Spark Streaming analytics and interactive real-time queries

Why Spark MLlib?
Use Spark for distributed computation:
- Combine Spark SQL and GraphX with MLlib in the same Spark program
- Ability to use your language of choice: Python, Scala, R or Java
- Extensive algorithms (http://spark.apache.org/docs/latest/mllib-guide.html)

Algorithms
• Classification and Regression: logistic regression, linear regression, linear support vector machine (SVM), naive Bayes, decision trees
• Collaborative Filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture
• Dimensionality Reduction: singular value decomposition (SVD), principal component analysis (PCA)
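To make the clustering entry concrete, here is a minimal single-machine sketch of k-means in plain Python. It is not MLlib code and says nothing about how MLlib implements or distributes the algorithm; it only shows the assign-then-recompute loop. The function names, the sample points and the choice to seed centroids with the first k points are assumptions for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute."""
    # Assumption for the sketch: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

On a cluster, the assignment step is the part that parallelizes naturally, since each point can be matched to its nearest centroid independently.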

How about R? Use SparkR!
• Spark: Fast, Scalable and Flexible
• R: Statistics, Packages and Plots
SparkR combines both, which is very powerful.
Use the SparkR API to take advantage of Spark, then bring the data back into R for machine learning, data visualization, etc.

What about the cloud?
[Diagram labels: Central Governance & Security; Internet Scale; Instant Deployment; Isolated Multi-tenancy; Elastic; Object Store Underpinnings]

Spark in the Cloud
• Zero configuration: Spark, SparkR, MLlib, GraphX, etc. all pre-installed on all cluster nodes
– e.g. submit SparkR programs via a client-side API to an on-demand compute cluster
• ETL (data cleansing, transformations, table joins, etc.) is required prior to any ML modeling and analysis
– e.g. use other Big Data tools to prepare data: Hive, Hadoop, Cascading, Pig…

Spark in the Cloud
• Use the AWS S3 object store to decouple compute and storage; scale processing power and storage capacity independently
• S3 is highly available, reliable, scalable and cost effective
• Elastic compute provides unlimited scale on demand: calculations may require 10, 100 or 1,000+ compute nodes
• Ability to have multiple clusters to separate teams, workloads, production and non-production (R&D/test)

Spark in the Cloud
Cloud object store for data sets, e.g. AWS S3
• Flexible compute resource options
– High memory instances: AWS EC2 r3.* for high memory workloads to cache and manipulate large Spark RDDs
– High CPU: AWS EC2 c3.* for CPU intensive workloads
• Automatic cluster termination when idle
• Periodically check for bad instances and remove them
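The housekeeping rules on this slide can be sketched as a few small policy functions. This is an illustrative sketch only, not Qubole's implementation: the `ClusterState` type, the one-hour idle timeout, the node names and the health-check callback are all hypothetical. Only the r3 and c3 instance families come from the slide.

```python
from dataclasses import dataclass

# Instance families named on the slide, keyed by a hypothetical workload label.
INSTANCE_FAMILY = {"memory": "r3", "cpu": "c3"}


@dataclass
class ClusterState:
    running_jobs: int
    idle_seconds: float


def choose_family(workload):
    """Pick an EC2 instance family for a 'memory' or 'cpu' bound workload."""
    return INSTANCE_FAMILY[workload]


def should_terminate(state, idle_timeout_seconds=3600.0):
    """Terminate only when nothing is running and the idle timeout has passed.

    The one-hour default is an assumption for the sketch.
    """
    return state.running_jobs == 0 and state.idle_seconds >= idle_timeout_seconds


def remove_bad_nodes(nodes, is_healthy):
    """Return the nodes that pass the caller-supplied health check."""
    return [node for node in nodes if is_healthy(node)]


# Usage with made-up values.
print(choose_family("memory"))
print(should_terminate(ClusterState(running_jobs=0, idle_seconds=7200.0)))
print(remove_bad_nodes(["node-1", "node-2", "node-3"], lambda n: n != "node-2"))
```

A periodic scheduler would call `should_terminate` and `remove_bad_nodes` on a timer; how health is actually measured is left to the callback.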

DIY - Getting Started on the Cloud
• Install Spark on EC2 (HDFS if required)
• Choose Spark backend cluster mode and configure it
– Standalone
– YARN
– Mesos
• Spin up a cluster of instances
CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.
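Each of the three cluster modes is selected through the master URL given to Spark. The helper below is a small sketch that builds that string; the function name `master_url` and the example hosts are hypothetical, while the URL schemes and default ports (7077 for a standalone master, 5050 for Mesos) follow the Spark documentation. For YARN it returns `yarn`, the form used by later Spark releases; Spark 1.x releases used `yarn-client` or `yarn-cluster` instead.

```python
def master_url(mode, host=None, port=None):
    """Build a Spark master URL for the given cluster mode.

    mode is one of 'standalone', 'yarn' or 'mesos'. YARN reads the cluster
    location from the Hadoop configuration, so it needs no host or port.
    """
    if mode == "yarn":
        # Spark 1.x releases used "yarn-client" or "yarn-cluster" here.
        return "yarn"
    if mode not in ("standalone", "mesos"):
        raise ValueError(f"unknown cluster mode: {mode}")
    if host is None:
        raise ValueError(f"{mode} mode needs a master host")
    if mode == "standalone":
        return f"spark://{host}:{port or 7077}"
    return f"mesos://{host}:{port or 5050}"


# Usage with made-up hosts.
print(master_url("standalone", "10.0.0.1"))
print(master_url("mesos", "10.0.0.2"))
print(master_url("yarn"))
```

The resulting string is what would be passed as the `--master` option of `spark-submit`.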

DIY - Getting Started on the Cloud
EC2 scripts can help: http://spark.apache.org/docs/latest/ec2-scripts.html
- Help spin up named clusters
- Create a security group and come pre-baked with Spark installed

Qubole Case Study
[Diagram labels: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]
Ease of use for analysts
• Dozens of data scientist and analyst users
• Produces double-digit TBs of data per day
• Does not have dedicated staff to set up and manage clusters and Hadoop distributions

Qubole Case Study
[Pipeline diagram. Stages: Producers, Continuous Processing, Storage, Analytics. Labels: CDN, Real Time Bidding, Retargeting Platform, Customer Data, ETL, Kinesis, S3, Redshift, Machine Learning, Streaming.]
Why Spark?
“Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.”
Yekesa Kosuru
VP, Technology
