What’s Next for the Berkeley Data Analytics Stack
Michael Franklin, UC BERKELEY
July 20, 2015
Data Science Summit, SF
The Berkeley AMPLab
80+ Students, Postdocs, Faculty and Staff from:
Databases, Machine Learning, Systems, Security, and Networking
Mission Statement: Making Sense of Data at Scale by Integrating
• Algorithms – Machine Learning, Statistical Methods,
• Machines – Cluster and Cloud Computing
• People – Crowdsourcing and Human Computation
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez
AMPLab: A Public/Private Partnership
NSF CISE Expedition Award:
Part of 2012 White House Big Data Initiative
DARPA XData Program
DoE/Lawrence Berkeley National Lab
And these Industrial Sponsors:
Berkeley Data Analytics Stack
(Apache and BSD open source)
• In House Applications: Cancer Genomics, Energy Debugging, Smart Buildings
• Access and Interfaces: Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean, Velox Model Serving
• Processing Engine: Spark
• Storage: Tachyon; HDFS, S3, …
• Resource Virtualization: Mesos, Yarn
Big Data Ecosystem Evolution
• General batch processing: MapReduce
• Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …
AMPLab Unification Philosophy
Don’t specialize MapReduce – Generalize it!
Two additions to Hadoop MR can enable all the models shown earlier!
1. General Task DAGs
2. Data Sharing
For Users:
Fewer Systems to Use
Less Data Movement
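The two additions can be illustrated with a toy, single-process sketch in plain Python (this is not Spark's API). Each dataset is a node in a task DAG that records how to compute itself from its parents, and `cache()` marks a node whose result is kept in memory and shared by every downstream consumer. The `Dataset` class, its method names, and the `load` function are assumptions made for illustration.

```python
class Dataset:
    """A node in a task DAG: knows its parents and how to compute itself."""

    def __init__(self, compute, parents=()):
        self.compute = compute      # function of the parents' results
        self.parents = parents
        self.keep = False           # set by cache()
        self.value = None
        self.ready = False

    def cache(self):
        """Mark this node's result to be kept in memory after first use."""
        self.keep = True
        return self

    def map(self, fn):
        return Dataset(lambda xs: [fn(x) for x in xs], (self,))

    def filter(self, pred):
        return Dataset(lambda xs: [x for x in xs if pred(x)], (self,))

    def collect(self):
        """Evaluate the DAG lazily, reusing cached results where present."""
        if self.keep and self.ready:
            return self.value
        result = self.compute(*(p.collect() for p in self.parents))
        if self.keep:
            self.value, self.ready = result, True
        return result


load_count = []

def load():
    load_count.append(1)            # count how often the source is read
    return list(range(10))

base = Dataset(load).cache()        # shared working set
evens = base.filter(lambda x: x % 2 == 0).collect()
total = sum(base.map(lambda x: x * x).collect())
```

Two different downstream computations read `base`, but the source is loaded only once because the cached result is shared. Without the `cache()` call, each `collect()` would re-run `load`.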
On Spark: Streaming, GraphX, SparkSQL, MLbase, …
In-Memory Dataflow System
M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets,” USENIX HotCloud, 2010.
• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark
Apache Spark Meetups Around the World (Jan ‘15)
Apache Spark Meetups Around the World (July ‘15)
[World map; regional growth labels: + 72%, +124, + 79%, + 57%]
Berkeley Data Analytics Stack
• In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings
• Access and Interfaces: Spark Streaming, SparkSQL, BlinkDB, G-OLA, GraphX, MLlib, MLBase, MLPipelines, Splash, SparkR, SampleClean, Velox
• Processing Engine: Spark Core
• Storage: Tachyon, Succinct; HDFS, S3, Ceph, …
• Resource Virtualization: Mesos, Hadoop Yarn
Spark Core
• Major rearchitecture and features (community)
– DataFrames API
– Tungsten: bringing Spark closer to bare metal
  • Memory Management and Binary Processing
  • Cache-aware computation
  • Code generation
• R interface
• Spark SQL and Spark Streaming enhancements
• Still rapidly growing!
Velox
• Velox – Model Serving and Personalization
– KeystoneML integration
– Improved service APIs and deployment tools
– Open source alpha release
BDAS: Latest Developments
Where do models go?
Data, Training, Model
• Conference Papers
• Sales Reports
• Drive Actions
Introducing Velox: Model Serving
Driving Actions
• Suggesting Items at Checkout
• Fraud Detection
• Cognitive Assistance
• Internet of Things
Low-Latency, Personalized, Rapidly Changing
Problem: Separate Systems
• Offline Analytics Systems: Sophisticated ML on static data.
• Online Serving Systems (MongoDB): Low-latency data serving.
How do we serve low-latency predictions and train on live data?
Velox Model Serving System
[Crankshaw, Bailis, Gonzalez et al. CIDR’15]
Decompose personalized predictive models:
• Split: Personalization Model (Online), Feature Model (Batch)
• Feature Caching
• Approx. Features
• Online Updates
• Active Learning
Order-of-magnitude reductions in prediction latencies.
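The split can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Velox's API: the `features` function stands in for a batch-trained feature model, `UserModel` holds per-user weights, and the online update is a plain SGD step chosen here for brevity (the update rule Velox itself uses may differ). All names and constants are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def features(item_id):
    """Stand-in for an expensive, batch-trained feature model.

    Results are cached per item, so repeated predictions skip recomputation.
    """
    return (1.0, float(item_id % 3), float(item_id % 5))


class UserModel:
    """Per-user linear weights over the shared features, updated online."""

    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.lr = lr

    def predict(self, item_id):
        f = features(item_id)
        return sum(wi * fi for wi, fi in zip(self.w, f))

    def observe(self, item_id, rating):
        """One SGD step on squared error; only the user weights change."""
        f = features(item_id)
        err = self.predict(item_id) - rating
        self.w = [wi - self.lr * err * fi for wi, fi in zip(self.w, f)]


user = UserModel(dim=3)
for _ in range(50):
    user.observe(item_id=7, rating=4.0)
```

Serving a prediction touches only a cached feature vector and a small per-user weight vector, which is where the latency savings in this kind of design would come from. Feedback updates the user weights immediately, while the feature model can be retrained offline.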
MLPipelines
• MLPipelines → KeystoneML
– Alpha release
– End-to-end pipelines in vision, speech, and NLP
– Horizontal scalability to 100s of machines and multi-terabyte datasets
What is KeystoneML?
Software framework for building scalable end-to-end machine learning pipelines.
Helps us explore how to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.
Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast.
Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines”
Public release last month! http://keystone-ml.org/
How does it fit with BDAS?
• Batch Model Training: KeystoneML, alongside MLlib, GraphX, and ml-matrix, on Spark
• Real Time Serving: Velox Model Server
http://amplab.github.io/velox-modelserver
Example: Image Classification
Pipeline: Images (VOC2007) → Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → MaxClassifier
After .fit( ): Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → MaxClassifier
• Achieves performance of Chatfield et al., 2011
• Embarrassingly parallel featurization and evaluation
• 15 min on a modest cluster
• 5K examples, 40K features, 20 classes
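The transformer/estimator pattern behind `.fit( )` can be sketched in plain Python. This is a toy illustration, not KeystoneML's API: a stage is either a plain function (a transformer) or an object with a `fit` method (an estimator) that learns from the data seen so far and returns a transformer. The names `Standardize` and `fit_pipeline` and the tiny numeric dataset are assumptions for illustration.

```python
from statistics import mean, pstdev


class Standardize:
    """Estimator: learns mean and standard deviation, returns a transformer."""

    def fit(self, xs):
        mu = mean(xs)
        sd = pstdev(xs) or 1.0      # avoid dividing by zero on constant data
        return lambda x: (x - mu) / sd


def fit_pipeline(stages, data):
    """Fit each estimator on the output of the stages before it.

    Returns a function that applies the fitted stages to one input.
    """
    fitted = []
    for stage in stages:
        fn = stage.fit(data) if hasattr(stage, "fit") else stage
        fitted.append(fn)
        data = [fn(x) for x in data]    # feed transformed data downstream

    def run(x):
        for fn in fitted:
            x = fn(x)
        return x

    return run


# Transformer, then estimator, then a thresholding "classifier".
stages = [abs, Standardize(), lambda z: 1 if z > 0 else 0]
model = fit_pipeline(stages, [-1, 2, -3, 4])
labels = [model(x) for x in (4, -1)]
```

Fitting replaces every estimator with the transformer it learned, so the fitted pipeline is a plain chain of functions, mirroring how PCA becomes a PCA Map and Linear Regression becomes a Linear Model in the slide.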
Current Software Features
Data Loaders
» CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
Transformers
» NLP - Tokenization, n-grams, term frequency, NER*, parsing*
» Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
» Speech - MFCCs*
» Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
» Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
Estimators
» Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
Example Pipelines
» NLP - 20 Newsgroups, Wikipedia Language model
» Images - MNIST, CIFAR, VOC, ImageNet
» Speech - TIMIT
Evaluation Metrics
» Binary Classification
» Multiclass Classification
» Multilabel Classification
* - Links to external library: MLlib, ml-matrix, VLFeat, EncEval
Research Direction: Automatic Resource Estimation
Long, complicated pipelines.
» Just a composition of dataflows!
How long will this thing take to run?
When do I cache?
» Pose as a constrained optimization problem.
Enables Efficient Hyperparameter Tuning
(ref. E. Sparks et al. “Automating Model Search for Large Scale Machine Learning”, SoCC, Aug 2015)
[Pipeline diagram: Resize, Grayscale, SIFT, PCA, Fisher Vector; LCS, PCA, Fisher Vector; Block Linear Solver; Weighted Block Linear Solver; Top 5 Classifier]
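One way to read "When do I cache?" as a constrained optimization problem is sketched below in plain Python. It is a deliberately simplified model, not the method from the cited work: each intermediate result has a size, a recompute time, and a number of reads, recompute costs are treated as independent, and the goal is to pick the set that saves the most time within a memory budget (a small 0/1 knapsack solved by brute force). All stage names and numbers are hypothetical.

```python
from itertools import combinations


def choose_cache_set(nodes, budget_gb):
    """Pick the intermediate results to cache under a memory budget.

    Caching a node saves its recompute time on every read after the first.
    Brute force over subsets; fine for a handful of pipeline stages.
    """
    best_saving, best_set = 0.0, ()
    names = list(nodes)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            size = sum(nodes[n]["size_gb"] for n in subset)
            if size > budget_gb:
                continue            # violates the memory constraint
            saving = sum(
                nodes[n]["recompute_s"] * (nodes[n]["reads"] - 1)
                for n in subset
            )
            if saving > best_saving:
                best_saving, best_set = saving, subset
    return set(best_set), best_saving


# Hypothetical per-stage estimates for a featurization pipeline.
nodes = {
    "sift":   {"size_gb": 40, "recompute_s": 600, "reads": 2},
    "pca":    {"size_gb": 5,  "recompute_s": 100, "reads": 2},
    "fisher": {"size_gb": 20, "recompute_s": 900, "reads": 10},
}
chosen, saved_seconds = choose_cache_set(nodes, budget_gb=30)
```

With a 30 GB budget the sketch caches the Fisher vectors, which an iterative solver reads many times, plus the small PCA output, and skips the SIFT descriptors because they do not fit. In a real pipeline the recompute cost of a stage depends on which upstream stages are cached, so the problem is harder than this independent-cost version.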
SampleClean
• Released two Spark Packages
– SampleClean: SparkSQL-integrated library for record dedup, entity resolution, and active learning
– AMPCrowd: web service for crowdsourcing through Amazon Mechanical Turk or an "internal" crowd
• REST API to allow for human-in-the-loop,
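The human-in-the-loop idea behind record dedup can be sketched in plain Python. This is an illustration under stated assumptions, not SampleClean's or AMPCrowd's API: pairs of records are scored with token-level Jaccard similarity, clear matches are merged automatically, and borderline pairs are queued for a crowd to judge. The function names, thresholds, and sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings, ignoring case."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def triage_pairs(records, hi=0.8, lo=0.4):
    """Split candidate pairs into automatic matches and crowd questions.

    Pairs scoring at least `hi` are treated as duplicates; pairs between
    `lo` and `hi` are uncertain and would be sent to human workers.
    """
    auto, crowd = [], []
    for i, j in combinations(range(len(records)), 2):
        score = jaccard(records[i], records[j])
        if score >= hi:
            auto.append((i, j))
        elif score >= lo:
            crowd.append((i, j))
    return auto, crowd


records = [
    "UC Berkeley AMPLab",
    "uc berkeley amplab",
    "Berkeley AMPLab",
    "Stanford InfoLab",
]
auto_matches, crowd_questions = triage_pairs(records)
```

Only the uncertain pairs cost human time, which is why the latency of the crowd step matters for the overall cleaning job.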
SampleClean Framework
Current research focus: Latency Reduction for human-in-the-loop
• Straggler Mitigation
• Pool Maintenance
• Active Learning
Summary
• AMPLab project
• Cross-disciplinary team, Industry engagement
• Open Source development and community building
• BDAS philosophy: Unification
• Spark + SQL + Graphs + ML + …
• After graduating Mesos, Tachyon & Spark we are moving up the stack to support declarative and real-time Machine Learning and analytics.
To find out more or get involved:
amplab.berkeley.edu
franklin@berkeley.edu
UC BERKELEY
Thanks to NSF CISE Expeditions in Computing, DARPA XData,
Founding Sponsors: Amazon Web Services, Google, IBM, and SAP,
the Thomas and Stacy Siebel Foundation,
all our industrial sponsors and partners, and all the members of the AMPLab Team.

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 

What’s New in the Berkeley Data Analytics Stack

  • 1. What’s Next for the Berkeley Data Analytics Stack UC BERKELEY Michael Franklin July 20 2015 Data Science Summit SF
  • 2. The Berkeley AMPLab 80+ Students, Postdocs, Faculty and Staff from: Databases, Machine Learning, Systems, Security, and Networking Mission Statement: Making Sense of Data at Scale by Integrating • Algorithms – Machine Learning, Statistical Methods • Machines – Cluster and Cloud Computing • People – Crowdsourcing and Human Computation Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez
  • 3. AMPLab: A Public/Private Partnership NSF CISE Expedition Award: Part of 2012 White House Big Data Initiative Darpa XData Program DoE/Lawrence Berkeley National Lab And these Industrial Sponsors:
  • 4. Berkeley Data Analytics Stack (Apache and BSD open source) [stack diagram] Layers: Access and Interfaces, Processing Engine, Resource Virtualization, Storage. Components: Velox Model Serving, Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean, Spark, Tachyon, HDFS, S3, …, Mesos, Yarn. In House Applications: Cancer Genomics, Energy Debugging, Smart Buildings.
  • 5. Big Data Ecosystem Evolution MapReduce Pregel Dremel GraphLab Storm Giraph Drill Tez Impala S4 … Specialized systems (iterative, interactive and streaming apps) General batch processing
  • 6. AMPLab Unification Philosophy Don’t specialize MapReduce – Generalize it! Two additions to Hadoop MR can enable all the models shown earlier! 1. General Task DAGs 2. Data Sharing For Users: Fewer Systems to Use Less Data Movement Spark Streaming GraphX …SparkSQL MLbase
  • 7. In-Memory Dataflow System M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets,” USENIX HotCloud, 2010. • Developed in AMPLab and its predecessor the RADLab • Alternative to Hadoop MapReduce • 10-100x speedup for ML and interactive queries • Central component of the BDAS Stack • “Graduated” to Apache Foundation -> Apache Spark
  • 8. Apache Spark Meetups Around the World (Jan ‘15)
  • 9. Apache Spark Meetups Around the World (July ‘15) + 72% +124 + 79%+ 57%
  • 10. Berkeley Data Analytics Stack [stack diagram] Layers: Resource Virtualization, Storage, Processing Engine, Access and Interfaces, In-house Apps. Components: Mesos, Hadoop Yarn, Tachyon, Succinct, HDFS, S3, Ceph, …, Spark Core, Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib, MLBase, SampleClean, G-OLA, SparkR, Velox, MLPipelines, Splash. In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings.
  • 11. Berkeley Data Analytics Stack [stack diagram, same layers and components as slide 10] Spark Core • Major rearchitecture and features (community) – DataFrames API – Tungsten: bringing Spark closer to bare metal • Memory Management and Binary Processing • Cache-aware computation • Code generation • R interface • Spark SQL and Spark Streaming enhancements • Still rapidly growing!
  • 12. BDAS: Latest Developments [stack diagram, same layers and components as slide 10] Velox • Velox – Model Serving and Personalization – KeystoneML integration – Improved service APIs and deployment tools – Open source alpha release
  • 13. Introducing Velox: Model Serving Data Model Where do models go? Conference Papers Sales Reports Drive Actions Training
  • 14. Driving Actions Suggesting Items at Checkout Fraud Detection Cognitive Assistance Internet of Things Low-Latency Personalized Rapidly Changing
  • 15. Problem: Separate Systems Offline Analytics Systems Sophisticated ML on static data. Low-Latency data serving How do we serve low-latency predictions and train on live data? Online Serving Systems MongoDB
  • 16. Velox Model Serving System Decompose personalized predictive models: [CIDR’15]
  • 17. Velox Model Serving System Decompose personalized predictive models: [Crankshaw, Bailis, Gonzalez et al. CIDR’15] Split Personalization Model Feature Model Online Batch Feature Caching Approx. Features Online Updates Active Learning Order-of-magnitude reductions in prediction latencies.
  • 18. BDAS: Latest Developments [stack diagram, same layers and components as slide 10] MLPipelines • MLPipelines -> KeystoneML – Alpha release – End-to-end pipelines in vision, speech, and NLP – Horizontal scalability to 100’s of machines and multi-terabyte datasets
  • 19. What is KeystoneML? Software framework for building scalable end-to-end machine learning pipelines. Helps us explore how to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge. Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast. Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines” Public release last month! http://keystone-ml.org/
  • 20. How does it fit with BDAS? Spark MLlib GraphX ml-matrix KeystoneML Batch Model Training Velox Model Server Real Time Serving http://amplab.github.io/velox-modelserver
  • 21. Example: Image Classification Images (VOC2007) .fit( ) Resize Grayscale SIFT PCA Fisher Vector MaxClassifier Linear Regression Resize Grayscale SIFT MaxClassifier PCA Map Fisher Encoder Linear Model Achieves performance of Chatfield et al., 2011 Embarrassingly parallel featurization and evaluation 15 min on a modest cluster 5K examples, 40K features, 20 classes
  • 22. Current Software Features Data Loaders » CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups Transformers » NLP - Tokenization, n-grams, term frequency, NER*, parsing* » Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy » Speech - MFCCs* » Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT » Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers. Estimators » Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM* • Example Pipelines • NLP - 20 Newsgroups, Wikipedia Language model • Images - MNIST, CIFAR, VOC, ImageNet • Speech - TIMIT • Evaluation Metrics • Binary Classification • Multiclass Classification • Multilabel Classification * - Links to external library: MLlib, ml-matrix, VLFeat, EncEval
  • 23. Research Direction: Automatic Resource Estimation Long, complicated pipelines. » Just a composition of dataflows! How long will this thing take to run? When do I cache? » Pose as a constrained optimization problem. Enables Efficient Hyperparameter Tuning (ref. E. Sparks et al. “Automating Model Search for Large Scale Machine Learning”, SoCC, Aug 2015) Resize Grayscale SIFT PCA Fisher Vector Top 5 Classifier LCS PCA Fisher Vector Block Linear Solver Weighted Block Linear Solver
  • 24. BDAS: Latest Developments [stack diagram, same layers and components as slide 10] SampleClean • Released two Spark Packages – SampleClean: SparkSQL-integrated library for record dedup, entity resolution, and active learning – AMPCrowd: web service for crowdsourcing through Amazon Mechanical Turk or an "internal" crowd • REST API to allow for human-in-the-loop,
  • 25. SampleClean Framework Current research focus: Latency Reduction for human-in-the-loop • Straggler Mitigation • Pool Maintenance • Active Learning
  • 26. Summary • AMPLab project • Cross-disciplinary team, Industry engagement • Open Source development and community building • BDAS philosophy: Unification • Spark + SQL + Graphs + ML + … • After graduating Mesos, Tachyon & Spark we are moving up the stack to support declarative and real-time Machine Learning and analytics.
  • 27. To find out more or get involved: amplab.berkeley.edu franklin@berkeley.edu UC BERKELEY Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, IBM, and SAP, the Thomas and Stacy Siebel Foundation, all our industrial sponsors and partners, and all the members of the AMPLab Team.
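
Slide 21 describes a pipeline as a chain of transformers (Resize, Grayscale, SIFT, ...) with fitted stages (PCA, Linear Regression) produced by a `.fit( )` call. The Python sketch below illustrates that composition pattern only. It is not the KeystoneML API (KeystoneML is a Scala library); the class names `Transformer`, `Estimator`, `MeanCenter` and the methods `then` and `then_fit` are hypothetical, and the toy stages merely stand in for the image-processing steps named on the slide.

```python
from typing import Callable, Sequence


class Transformer:
    """A deterministic function applied to one item at a time."""

    def __init__(self, fn: Callable):
        self.fn = fn

    def __call__(self, item):
        return self.fn(item)

    def then(self, nxt: "Transformer") -> "Transformer":
        # Chain two stages into a single transformer.
        return Transformer(lambda item: nxt(self(item)))

    def then_fit(self, est: "Estimator", data: Sequence) -> "Transformer":
        # Fit the estimator on the output of the stages so far,
        # then chain the fitted stage onto the pipeline.
        fitted = est.fit([self(x) for x in data])
        return self.then(fitted)


class Estimator:
    """A stage that must see training data before it can transform."""

    def fit(self, data: Sequence) -> Transformer:
        raise NotImplementedError


class MeanCenter(Estimator):
    """Toy estimator standing in for a fitted stage such as PCA."""

    def fit(self, data: Sequence) -> Transformer:
        n, dim = len(data), len(data[0])
        mean = [sum(row[j] for row in data) / n for j in range(dim)]
        return Transformer(lambda row: [v - m for v, m in zip(row, mean)])


# Toy stand-ins for a featurization step and a max classifier.
scale = Transformer(lambda row: [v / 255.0 for v in row])
argmax = Transformer(lambda row: max(range(len(row)), key=row.__getitem__))

train = [[0.0, 255.0], [255.0, 0.0]]
pipeline = scale.then_fit(MeanCenter(), train).then(argmax)
label_a = pipeline([0.0, 255.0])
label_b = pipeline([255.0, 0.0])
```

Because every stage is a plain function of one item, featurization and evaluation can be mapped over a dataset independently per item, which is the property slide 21 calls embarrassingly parallel.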

Editor's Notes

  1. Connect to political bias story
  2. Spark batch analytics vs low-latency serving system
  3. Use rect linear loss
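
Slide 17 splits a personalized model into a feature model (trained in batch, with feature caching) and a per-user personalization model (updated online). A minimal Python sketch of that split follows. It is an illustration under stated assumptions, not the Velox implementation: the `features` function, the `PersonalizedModel` class, the linear per-user weights, and the squared-loss gradient step are all choices made here for brevity and are not taken from the slides.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

DIM = 3  # assumed feature dimension for this toy example


@lru_cache(maxsize=1024)
def features(item_id: int) -> Tuple[float, ...]:
    # Hypothetical stand-in for the batch-trained feature model.
    # lru_cache plays the role of the feature cache on the slide.
    return (1.0, float(item_id % 3), float(item_id % 5))


class PersonalizedModel:
    """Per-user linear weights over shared, cached item features."""

    def __init__(self, learning_rate: float = 0.05) -> None:
        self.learning_rate = learning_rate
        self.user_weights: Dict[str, List[float]] = {}

    def _weights(self, user: str) -> List[float]:
        return self.user_weights.setdefault(user, [0.0] * DIM)

    def predict(self, user: str, item_id: int) -> float:
        w = self._weights(user)
        return sum(wj * fj for wj, fj in zip(w, features(item_id)))

    def observe(self, user: str, item_id: int, label: float) -> None:
        # Online update: one gradient step on squared loss (an assumption),
        # touching only this user's weights, never the feature model.
        f = features(item_id)
        w = self._weights(user)
        error = sum(wj * fj for wj, fj in zip(w, f)) - label
        for j in range(DIM):
            w[j] -= self.learning_rate * error * f[j]


model = PersonalizedModel()
model.observe("alice", 1, 1.0)
first = model.predict("alice", 1)
for _ in range(50):
    model.observe("alice", 1, 1.0)
later = model.predict("alice", 1)
other = model.predict("bob", 1)
```

The design point is that an online update costs a few multiplications per user, while the expensive feature computation is shared across users and served from cache.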