What’s New in the Berkeley Data Analytics Stack

What’s Next for the Berkeley Data Analytics Stack
UC BERKELEY
Michael Franklin
July 20 2015
Data Science Summit
SF

The Berkeley AMPLab
80+ Students, Postdocs, Faculty and Staff from:
Databases, Machine Learning, Systems, Security, and Networking
Mission Statement: Making Sense of Data at Scale by Integrating
• Algorithms – Machine Learning, Statistical Methods,
• Machines – Cluster and Cloud Computing
• People – Crowdsourcing and Human Computation
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez

AMPLab: A Public/Private Partnership
NSF CISE Expedition Award:
Part of 2012 White House Big Data Initiative
DARPA XData Program
DoE/Lawrence Berkeley National Lab
And these Industrial Sponsors:

Stack (Apache and BSD open source)
• In House Applications: Cancer Genomics, Energy Debugging, Smart Buildings
• Access and Interfaces: Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean, Velox Model Serving
• Processing Engine: Spark
• Storage: Tachyon; HDFS, S3, …
• Resource Virtualization: Mesos, Yarn

Big Data Ecosystem Evolution
• General batch processing: MapReduce
• Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …

AMPLab Unification Philosophy
Don’t specialize MapReduce – Generalize it!
Two additions to Hadoop MR can enable all the models shown earlier!
1. General Task DAGs
2. Data Sharing
For Users:
• Fewer Systems to Use
• Less Data Movement
On top of Spark: Spark Streaming, GraphX, SparkSQL, MLbase, …
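
The two additions named on this slide can be illustrated with a toy, single-process sketch. It is not Spark code: the `Node` class and its `map`, `filter`, `cache`, and `collect` methods are hypothetical names chosen for illustration. Each node is one task in a DAG, and a node marked with `cache()` keeps its result in memory so that several downstream computations share it instead of recomputing it.

```python
from typing import Callable, List, Optional


class Node:
    """One task in a toy dataflow DAG (hypothetical, not a Spark API)."""

    def __init__(self, fn: Callable[..., list], parents: Optional[List["Node"]] = None):
        self.fn = fn
        self.parents = parents or []
        self.cached = False
        self._result: Optional[list] = None
        self.runs = 0  # how many times this task actually executed

    def cache(self) -> "Node":
        # Mark this node's output as shared in-memory data.
        self.cached = True
        return self

    def map(self, f: Callable) -> "Node":
        return Node(lambda xs: [f(x) for x in xs], [self])

    def filter(self, pred: Callable) -> "Node":
        return Node(lambda xs: [x for x in xs if pred(x)], [self])

    def collect(self) -> list:
        # Reuse the in-memory result if this node was cached and already ran.
        if self.cached and self._result is not None:
            return self._result
        inputs = [parent.collect() for parent in self.parents]
        self.runs += 1
        result = self.fn(*inputs)
        if self.cached:
            self._result = result
        return result


# A small DAG: one source, one shared intermediate, two consumers.
source = Node(lambda: list(range(10)))
squares = source.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
large = squares.filter(lambda x: x > 20)

evens_out = evens.collect()
large_out = large.collect()
```

After both `collect()` calls, `squares.runs` is 1: the second consumer reads the shared result rather than re-running the upstream tasks.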

In-Memory Dataflow System
M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets,” USENIX HotCloud, 2010.
• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark

Apache Spark Meetups Around the World (Jan ‘15)

Apache Spark Meetups Around the World (July ‘15)
Labels on the map: +72%, +124, +79%, +57%

Stack
• In-house Apps: Buildings
• Access and Interfaces: Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib, MLBase, SampleClean, G-OLA, SparkR, Velox, MLPipelines, Splash
• Processing Engine: Spark Core
• Storage: Tachyon, Succinct; HDFS, S3, Ceph, …
• Resource Virtualization: Mesos, Hadoop Yarn
Spark Core
• Major rearchitecture and features (community)
– DataFrames API
– Tungsten: bringing Spark closer to bare metal
  • Memory Management and Binary Processing
  • Cache-aware computation
  • Code generation
• R interface
• Spark SQL and Spark Streaming enhancements
• Still rapidly growing!
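
To give a feel for the "code generation" item on this slide, here is a toy Python sketch. It is not how Tungsten is implemented; the tuple-based expression format and the names `interpret`, `to_source`, and `compile_expr` are assumptions made for illustration. It contrasts walking an expression tree for every row with generating source for the whole expression once and compiling it into a single function.

```python
# Expression tree format (an assumption for this sketch):
#   ("col", name) | ("lit", value) | (op, left, right) with op in "+", "-", "*"
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def interpret(expr, row):
    """Evaluate by walking the tree for every row (the slow, generic path)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr) -> str:
    """Turn the tree into Python source text for one flat expression."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError(f"unsupported operator: {kind}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate code once, then reuse the compiled function for all rows."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


expr = ("+", ("*", ("col", "price"), ("col", "qty")), ("lit", 5))
rows = [{"price": 2, "qty": 3}, {"price": 10, "qty": 1}]

generated = compile_expr(expr)
interpreted_out = [interpret(expr, r) for r in rows]
generated_out = [generated(r) for r in rows]
```

Both paths return the same values; the generated function avoids the per-row tree walk and dispatch, which is the general idea behind generating code for query expressions.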

BDAS: Latest Developments
• Velox – Model Serving and Personalization
– KeystoneML integration
– Improved service APIs and deployment tools
– Open source alpha release

Introducing Velox: Model Serving
Data → Training → Model
Where do models go?
• Conference Papers
• Sales Reports
• Drive Actions
Driving Actions
• Suggesting Items at Checkout
• Fraud Detection
• Cognitive Assistance
• Internet of Things
Low-Latency, Personalized, Rapidly Changing

Problem: Separate Systems
• Offline Analytics Systems: Sophisticated ML on static data.
• Online Serving Systems (e.g., MongoDB): Low-latency data serving.
How do we serve low-latency predictions and train on live data?


Velox Model Serving System
[Crankshaw, Bailis, Gonzalez et al. CIDR’15]
Decompose personalized predictive models:
• Split: Feature Model (Batch) and Personalization Model (Online)
• Feature Model: Feature Caching, Approx. Features
• Personalization Model: Online Updates, Active Learning
Order-of-magnitude reductions in prediction latencies.
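
The split can be sketched as follows, under the simplifying assumption that a prediction is the dot product of a per-user weight vector with the output of a shared feature function. This is one reading of the slide, not Velox's API: the `SplitModel` class, the toy feature table, and the learning rate are all illustrative. Item features stand in for the batch-trained feature model and are cached; only the per-user weights are updated online.

```python
from typing import Callable, Dict, List


class SplitModel:
    """Toy split model: shared (batch) features, per-user (online) weights."""

    def __init__(self, feature_fn: Callable[[str], List[float]], dim: int,
                 learning_rate: float = 0.1):
        self.feature_fn = feature_fn        # stands in for the batch-trained feature model
        self.dim = dim
        self.lr = learning_rate
        self.feature_cache: Dict[str, List[float]] = {}
        self.user_weights: Dict[str, List[float]] = {}
        self.feature_calls = 0              # counts real feature computations

    def features(self, item_id: str) -> List[float]:
        # Feature caching: compute an item's features once, then reuse them.
        if item_id not in self.feature_cache:
            self.feature_calls += 1
            self.feature_cache[item_id] = self.feature_fn(item_id)
        return self.feature_cache[item_id]

    def predict(self, user_id: str, item_id: str) -> float:
        weights = self.user_weights.get(user_id, [0.0] * self.dim)
        feats = self.features(item_id)
        return sum(w * f for w, f in zip(weights, feats))

    def observe(self, user_id: str, item_id: str, label: float) -> None:
        # Online update: one SGD step on squared error, touching only this user's weights.
        feats = self.features(item_id)
        weights = self.user_weights.setdefault(user_id, [0.0] * self.dim)
        error = self.predict(user_id, item_id) - label
        for i in range(self.dim):
            weights[i] -= self.lr * error * feats[i]


ITEM_FEATURES = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # made-up feature vectors
model = SplitModel(lambda item: ITEM_FEATURES[item], dim=2, learning_rate=0.5)

before = model.predict("alice", "a")
for _ in range(20):
    model.observe("alice", "a", 4.0)
after = model.predict("alice", "a")
```

Because the expensive part (features) is shared and cached while the cheap part (a small weight vector per user) is updated in place, serving a prediction reduces to a cache lookup and a dot product in this sketch.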

BDAS: Latest Developments
• MLPipelines → KeystoneML
– Alpha release
– End-to-end pipelines in vision, speech, and NLP
– Horizontal scalability to hundreds of machines and multi-terabyte datasets

What is KeystoneML?
Software framework for building scalable end-to-end machine learning pipelines.
Helps us explore how to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.
Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast.
Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines”
Public release last month! http://keystone-ml.org/

How does it fit with BDAS?
• Batch Model Training: KeystoneML, alongside MLlib, GraphX, and ml-matrix, on Spark
• Real Time Serving: Velox Model Server
http://amplab.github.io/velox-modelserver

Example: Image Classification
Pipeline before .fit( ): Images (VOC2007) → Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → MaxClassifier
Fitted pipeline: Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → MaxClassifier
• Achieves performance of Chatfield et al., 2011
• Embarrassingly parallel featurization and evaluation
• 15 min on a modest cluster
• 5K examples, 40K features, 20 classes
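
The pattern on this slide, where calling `.fit( )` turns a chain of stages into a chain of fitted transformers, can be sketched in a few lines. This is a toy Python illustration, not KeystoneML's API: `Transformer`, `Estimator`, `MeanCenterer`, and `Pipeline` are hypothetical names, and the numeric stages merely stand in for the image-processing stages named above.

```python
from typing import Callable, List, Sequence, Union


class Transformer:
    """A fixed function applied to one item at a time."""

    def __init__(self, fn: Callable):
        self.fn = fn

    def apply(self, x):
        return self.fn(x)


class Estimator:
    """A stage that must see training data; fit() returns a Transformer."""

    def fit(self, data: Sequence) -> Transformer:
        raise NotImplementedError


class MeanCenterer(Estimator):
    """Toy estimator standing in for a data-dependent stage such as PCA."""

    def fit(self, data: Sequence[float]) -> Transformer:
        mean = sum(data) / len(data)
        return Transformer(lambda x: x - mean)


class Pipeline:
    def __init__(self, stages: Sequence[Union[Transformer, Estimator]]):
        self.stages = list(stages)

    def fit(self, data: Sequence) -> Transformer:
        fitted: List[Transformer] = []
        current = list(data)
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            step = stage.fit(current) if isinstance(stage, Estimator) else stage
            fitted.append(step)
            current = [step.apply(x) for x in current]

        def run(x):
            for step in fitted:
                x = step.apply(x)
            return x

        return Transformer(run)


pipeline = Pipeline([
    Transformer(lambda x: x * 2.0),            # stands in for featurization
    MeanCenterer(),                            # stands in for a fitted stage
    Transformer(lambda x: 1 if x > 0 else 0),  # stands in for the final classifier
])
model = pipeline.fit([1.0, 2.0, 3.0])
predictions = [model.apply(x) for x in [1.0, 2.0, 3.0, 10.0]]
```

The fitted `model` is itself a `Transformer`, so it can be applied to new items independently, which is what makes featurization and evaluation easy to parallelize.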

Current Software Features
Data Loaders
» CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
Transformers
» NLP - Tokenization, n-grams, term frequency, NER*, parsing*
» Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
» Speech - MFCCs*
» Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
» Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
Estimators
» Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
Example Pipelines
» NLP - 20 Newsgroups, Wikipedia Language model
» Images - MNIST, CIFAR, VOC, ImageNet
» Speech - TIMIT
Evaluation Metrics
» Binary Classification
» Multiclass Classification
» Multilabel Classification
* - Links to external library: MLlib, ml-matrix, VLFeat, EncEval

Research Direction: Automatic Resource Estimation
Long, complicated pipelines.
» Just a composition of dataflows!
How long will this thing take to run?
When do I cache?
» Pose as a constrained optimization problem.
Enables Efficient Hyperparameter Tuning
(ref. E. Sparks et al. “Automating Model Search for Large Scale Machine Learning”, SoCC, Aug 2015)
Pipeline figure nodes: Resize, Grayscale, SIFT, PCA, Fisher Vector, LCS, PCA, Fisher Vector, Block Linear Solver, Weighted Block Linear Solver, Top 5 Classifier
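
One way to read "When do I cache?" as a constrained optimization problem is sketched below. Everything here is an assumption made for illustration, not the AMPLab method: the pipeline is simplified to a linear chain, the timings and sizes are made up, and the cost model says a stage runs once if it or any later stage is cached, and once per solver pass otherwise. The search minimizes total compute time subject to a memory budget.

```python
from itertools import combinations
from typing import FrozenSet, Tuple

# (stage name, compute seconds per run, output size in GB); all numbers are made up.
STAGES = [
    ("Resize", 10.0, 8.0),
    ("SIFT", 60.0, 20.0),
    ("PCA", 15.0, 4.0),
    ("FisherVector", 40.0, 12.0),
]


def total_time(cached: FrozenSet[str], passes: int) -> float:
    """Total compute time when a downstream solver reads the chain `passes` times."""
    total = 0.0
    for i, (_, seconds, _) in enumerate(STAGES):
        downstream = [name for name, _, _ in STAGES[i:]]
        # A stage runs once if its own or any later output is held in memory.
        runs = 1 if any(name in cached for name in downstream) else passes
        total += seconds * runs
    return total


def best_cache_set(memory_gb: float, passes: int) -> Tuple[float, FrozenSet[str]]:
    """Brute-force search: minimize time subject to cached sizes <= memory_gb."""
    names = [name for name, _, _ in STAGES]
    size = {name: gb for name, _, gb in STAGES}
    best = (total_time(frozenset(), passes), frozenset())
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(size[n] for n in subset) > memory_gb:
                continue  # violates the memory constraint
            cost = total_time(frozenset(subset), passes)
            if cost < best[0]:
                best = (cost, frozenset(subset))
    return best


roomy = best_cache_set(memory_gb=12.0, passes=5)
tight = best_cache_set(memory_gb=5.0, passes=5)
```

With a 12 GB budget the search caches the last stage; with 5 GB only the small PCA output fits, so the expensive final stage is recomputed on every pass. A real estimator would need measured or predicted costs and a search that scales beyond brute force.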

BDAS: Latest Developments
SampleClean
• Released two Spark Packages
– SampleClean: SparkSQL-integrated library for record dedup, entity resolution, and active learning
– AMPCrowd: web service for crowdsourcing through Amazon Mechanical Turk or an "internal" crowd
• REST API to allow for human-in-the-loop,
SampleClean Framework
Current research focus: Latency Reduction for human-in-the-loop
• Straggler Mitigation
• Pool Maintenance
• Active Learning

Summary
• AMPLab project
• Cross-disciplinary team, Industry engagement
• Open Source development and community building
• BDAS philosophy: Unification
• Spark + SQL + Graphs + ML + …
• After graduating Mesos, Tachyon & Spark we are moving up the stack to support declarative and real-time Machine Learning and analytics.

To find out more or get involved:
amplab.berkeley.edu
franklin@berkeley.edu
Thanks to NSF CISE Expeditions in Computing, DARPA XData,
Founding Sponsors: Amazon Web Services, Google, IBM, and SAP,
the Thomas and Stacy Siebel Foundation,
all our industrial sponsors and partners, and all the members of the AMPLab Team.
