The Challenges of Bringing Machine Learning to the Masses
Alice Zheng and Sethu Raman
GraphLab Inc.
NIPS workshop on Software Engineering for Machine Learning
December 13, 2014
Self introduction
ML Research
“Accessible ML”
The need for accessible ML
• So much potential in ML
• Everyone trying to make sense of their data
• ML is transforming lives and industries:
personalized medicine, internet search, social
networks, advertising, etc.
• But success is unattainable to most
Building a predictive app
• Was using 217 business rules, hoping the world doesn’t change
• Have an inspiring idea to reinvent their business
Key pains:
Hiring Talent
35% shortfall in data-savvy workers needed to make sense of big data by 2018 [McKinsey 2011]
Noisy Space of Tools
“Data scientists use a variety of tools, across different programming languages… require a lot of context-switching… affects productivity and impedes reproducibility.”
Ben Lorica, “Data Analysis: Just one component of the Data Science workflow”
Building a predictive app
[Pipeline diagram: Data; Feature engineering; Model definition; Training/evaluation; Deployment; Monitoring]
Pure ML is not enough
• Building a predictive application involves much
more than just building ML models
• System engineering: data storage, computation
infrastructure, networking…
• Data Science: problem definition, data cleaning,
feature engineering
• Software development: turn prototype model into
bullet-proof production code
• Operations engineering: deploy and monitor app
• …
Pain points
• What are the right features?
• What model should I use?
• How do I train it?
• How do I set the tuning parameters?
• Do I even have the right data?
• Ok, I have a working prototype, now what?
Pain points
• Increase in data size or decrease in
latency requires complete rewrite of code
and new toolset
• GB: R/scikit-learn/Matlab
• TB-PB: Hadoop/Mahout/Spark
• Many forms of data and data structures
• Images, text, speech, logs
• Dense lists, sparse dictionaries, time series
• Tables, graphs, matrices, tensors
The need for an ML platform
• Minimize tool/code switching, maximize
performance (speed/accuracy/scale)
• Graceful transition from small to large
dataset sizes
• Flexible, interoperable data types
• Minimize complexity
• System-agnostic
• Simple API
• Auto-tune parameters
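As one illustration of what auto-tuning parameters could look like, here is a minimal sketch, assuming a one-dimensional ridge regression and a held-out validation set. The function names and the toy data are hypothetical and are not part of any GraphLab API.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge fit without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def auto_tune(train, valid, candidates):
    """Try each regularization strength; keep the one with the lowest validation error."""
    best_err, best_lam, best_w = None, None, None
    for lam in candidates:
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mean_squared_error(w, valid[0], valid[1])
        if best_err is None or err < best_err:
            best_err, best_lam, best_w = err, lam, w
    return best_lam, best_w


# Toy data (hypothetical): roughly y = 2x, with a little noise in the training set.
train = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
valid = ([4.0, 5.0], [8.0, 10.0])
best_lam, best_w = auto_tune(train, valid, candidates=[0.0, 0.1, 1.0, 10.0])
print(best_lam, best_w)
```

The user supplies only a list of candidates; the search loop, not the user, decides which setting is "good enough".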
The parallel to databases
• What’s an example of a mega-successful
platform for data operations?
• Databases!
• SQL, Oracle, NoSQL, …
• What lessons can we bring in from the
database world?
Database engine components
[Diagram: Storage engine; Query execution; Query optimizer; Storage]
Database engine components
[Diagram: Storage engine; Query execution; Query optimizer; Storage]
Complex but self-contained, has clean API, only changes when there’s new hardware.
Database engine components
[Diagram: Storage engine; Query execution; Query optimizer; Storage]
Complex bag of tricks, no formalism, constantly changing to adapt to data, query, disk characteristics.
ML engine components
[Diagram: Data; Feature engineering; Model definition; Training/evaluation]
Bags of tricks, expert knowledge, experience, lots of trial and error
Advances in databases
• Reasonable abstraction—relational DB
• Hardware speedups
• Pragmatic software implementation
Successful platform
• Take-away lesson: fast computation
engine + “good enough” execution plan
To advance ML platforms
• ML will be end-user friendly when the platform is clever enough to handle less-than-optimal directions from the user
• What needs to happen?
• The complexity needs to be automated and
wrapped away with neat interfaces between
components
• Fast components, “good enough” directions
GraphLab
• Started as a research project at CMU in
2009
• Now a Seattle-based startup
The GraphLab Create™ Solution
• Flexible, interoperable data types
• SArray+SFrame+SGraph inter-translatable
• dense list, sparse array, image, text, tables, graphs
• Graceful transition between data sizes
• SFrame: memory to disk to distributed
• One environment, many substrates
• Python front-end
• Localhost, cluster, Hadoop, EC2
• End-to-end
• Data ingestion+feature engineering+model building+
deployment in a single environment
GraphLab Create ML Toolkits
• Business Task: Recommender, Target, Social Match, … (Developers)
• Machine Learning Task: Regression, Classification, Data Matching, … (Savvy Dev & Data Sci.)
• Algorithms & SDK: SVM, Matrix Factorization, LDA, … (ML experts)
Demos
GLC SDK example
• Task: fill in missing values in an array using the previous value
• Existing solution:
• E.g., use Pandas, a Python library providing in-memory dataframes
• Problem:
• Given, say, 25M rows and 50 cols, takes forever to even load the data
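For reference, the fill-forward logic itself is small. Below is a minimal pure-Python sketch; the name `forward_fill` is illustrative and is not part of Pandas or GraphLab Create. It is written as a generator so that values are processed one at a time instead of being loaded all at once.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def forward_fill(values: Iterable[Optional[T]]) -> Iterator[Optional[T]]:
    """Yield each value, replacing None with the most recent non-None value."""
    last = None  # stays None until the first real value is seen
    for value in values:
        if value is not None:
            last = value
        yield last


print(list(forward_fill([1, 2, 3, None, 6])))  # prints [1, 2, 3, 3, 6]
```

Leading missing values stay missing, since there is no earlier value to copy.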
GLC SDK solution
> cat fill.cpp
#include <flexible_type/flexible_type.hpp>
#include <unity/lib/toolkit_function_macros.hpp>
#include <unity/lib/gl_sarray.hpp>
using namespace graphlab;

// Forward-fill: replace each missing element with the last value seen.
gl_sarray fill(gl_sarray sa) {
  gl_sarray_writer writer(sa.dtype(), 1);  // one output segment
  flexible_type last_value = sa[0];
  for (const auto &elem: sa.range_iterator()) {
    if (elem != FLEX_UNDEFINED)
      last_value = elem;
    writer.write(last_value, 0);  // write to segment 0
  }
  return writer.close();
}

// Register fill() so that it can be called from Python.
BEGIN_FUNCTION_REGISTRATION
REGISTER_FUNCTION(fill, "sa");
END_FUNCTION_REGISTRATION
GLC SDK solution
> cat Makefile
all: fill.so
fill.so: fill.cpp
	g++ -std=c++11 $^ -I graphlab -I ~/graphlab-dev/deps -shared -fPIC -o $@ -O3
> python
>>> import graphlab as gl
>>> gl.ext_import('fill.so', 'example')
>>> sa = gl.SArray([1, 2, 3, None, 6])
>>> print gl.extensions.example.fill.fill(sa)
[1, 2, 3, 3, 6]
Join the revolution!
• Research methods to make the following
efficient and automatic:
• Feature engineering
• Model selection
• Model debugging
• Problem formulation (??)
• Develop novel algorithms on top of our SDK
• Backed by scalable, flexible typed data structures
• Automatic Python wrappers
• Make them available to many other people
• We’re hiring! jobs@graphlab.com

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

The Challenges of Bringing Machine Learning to the Masses

  • 1. The Challenges of Bringing Machine Learning to the Masses Alice Zheng and Sethu Raman GraphLab Inc. NIPS workshop on Software Engineering for Machine Learning December 13, 2014
  • 3. The need for accessible ML • So much potential in ML • Everyone trying to make sense of their data • ML is transforming lives and industries: personalized medicine, internet search, social networks, advertising, etc. • But success is unattainable to most
  • 4. Building a predictive app • Was using 217 business rules, hoping world doesn’t change • Have an inspiring idea to reinvent their business • Key pains: • Hiring Talent: 35% shortfall in data-savvy workers needed to make sense out of big data by 2018 [McKinsey 2011] • Noisy Space of Tools: “Data scientists use a variety of tools, across different programming languages… require a lot of context-switching… affects productivity and impedes reproducibility.” Ben Lorica, Data Analysis: Just one component of the Data Science workflow
  • 5. Building a predictive app • Feature engineering • Model definition • Training evaluation • Data • Deployment • Monitoring
  • 6. Pure ML is not enough • Building a predictive application involves much more than just building ML models • System engineering: data storage, computation infrastructure, networking… • Data Science: problem definition, data cleaning, feature engineering • Software development: turn prototype model into bullet-proof production code • Operations engineering: deploy and monitor app • …
  • 7. Pain points • What are the right features? • What model should I use? • How do I train it? • How do I set the tuning parameters? • Do I even have the right data? • Ok, I have a working prototype, now what?
  • 8. Pain points • Increase in data size or decrease in latency requires complete rewrite of code and new toolset • GB: R/scikit-learn/Matlab • TB-PB: Hadoop/Mahout/Spark • Many forms of data and data structures • Images, text, speech, logs • Dense lists, sparse dictionaries, time series • Tables, graphs, matrices, tensors
  • 9. The need for an ML platform • Minimize tool/code switching, maximize performance (speed/accuracy/scale) • Graceful transition from small to large dataset sizes • Flexible, interoperable data types • Minimize complexity • System-agnostic • Simple API • Auto-tune parameters
  • 10. The parallel to databases • What’s an example of a mega-successful platform for data operations? • Databases! • SQL, Oracle, NoSQL, … • What lessons can we bring in from the database world?
  • 12. Database engine components • Storage engine • Query execution • Query optimizer • Storage • Complex but self-contained, has clean API, only changes when there’s new hardware.
  • 13. Database engine components • Storage engine • Query execution • Query optimizer • Storage • Complex bag of tricks, no formalism, constantly changing to adapt to data, query, disk characteristics.
  • 14. ML engine components • Feature engineering • Model definition • Training evaluation • Data • Bags of tricks, expert knowledge, experience, lots of trial and error
  • 15. Advances in databases • Reasonable abstraction—relational DB • Hardware speedups • Pragmatic software implementation Successful platform • Take-away lesson: fast computation engine + “good enough” execution plan
  • 16. To advance ML platforms • ML will be end-user friendly when the platform is clever enough to handle less-than-optimal directions from the user • What needs to happen? • The complexity needs to be automated and wrapped away with neat interfaces between components • Fast components, “good enough” directions
  • 17. GraphLab • Started as a research project at CMU in 2009 • Now a Seattle-based startup
  • 18. The GraphLab Create™ Solution • Flexible, interoperable data types • SArray+SFrame+SGraph inter-translatable • dense list, sparse array, image, text, tables, graphs • Graceful transition between data sizes • SFrame: memory to disk to distributed • One environment, many substrates • Python front-end • Localhost, cluster, Hadoop, EC2 • End-to-end • Data ingestion+feature engineering+model building+deployment in a single environment
  • 19. GraphLab Create ML Toolkits • Business Task: Recommender, Target, Social Match, … (Developers) • Machine Learning Task: Regression, Classification, Data Matching, … (Savvy Dev & Data Sci.) • Algorithms & SDK: SVM, Matrix Factorization, LDA, … (ML experts)
  • 20. Demos
  • 21. GLC SDK example • Task: fill in missing value in an array using previous value • Existing solution: • E.g., use Pandas, a Python library providing in-memory dataframes • Problem: • Given, say, 25M rows and 50 cols, takes forever to even load the data
  • 22. GLC SDK solution
      > cat fill.cpp
      #include <flexible_type/flexible_type.hpp>
      #include <unity/lib/toolkit_function_macros.hpp>
      #include <unity/lib/gl_sarray.hpp>
      using namespace graphlab;

      // Replace each missing element with the most recent non-missing value.
      gl_sarray fill(gl_sarray sa) {
        // Writer with the input's element type and a single output segment.
        gl_sarray_writer writer(sa.dtype(), 1);
        flexible_type last_value = sa[0];
        for (const auto &elem: sa.range_iterator()) {
          if (elem != FLEX_UNDEFINED) last_value = elem;
          writer.write(last_value, 0);  // write to segment 0
        }
        return writer.close();
      }

      // Expose fill() to Python with one named argument, "sa".
      BEGIN_FUNCTION_REGISTRATION
      REGISTER_FUNCTION(fill, "sa");
      END_FUNCTION_REGISTRATION
  • 23. GLC SDK solution
      > cat Makefile
      all: fill.so
      fill.so: fill.cpp
          g++ -std=c++11 $^ -l graphlab -l ~/graphlab-dev/deps/shared-fPIC -o $@ -O3

      > python
      >>> import graphlab as gl
      >>> gl.ext_import('fill.so', 'example')
      >>> sa = gl.SArray([1, 2, 3, None, 6])
      >>> print gl.extensions.example.fill.fill(sa)
      [1, 2, 3, 3, 6]
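
For readers without the SDK at hand, the logic of the fill function on slides 22 and 23 can be sketched in plain Python. This is only an illustration of the single-pass "carry the last seen value forward" idea, not the GraphLab implementation: the name fill_forward is hypothetical, None stands in for FLEX_UNDEFINED, and a generator stands in for the streaming writer.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def fill_forward(values: Iterable[Optional[T]]) -> Iterator[Optional[T]]:
    """Yield each element, replacing None with the last non-missing value.

    Hypothetical helper for illustration; None plays the role of
    FLEX_UNDEFINED in the C++ version.
    """
    last_value: Optional[T] = None  # stays None until a real value is seen
    for elem in values:
        if elem is not None:
            last_value = elem
        # Emit one output per input, so the data is only traversed once
        # and never has to be held in memory as a whole.
        yield last_value


print(list(fill_forward([1, 2, 3, None, 6])))  # [1, 2, 3, 3, 6]
```

As in the C++ version, leading missing values stay missing because there is no earlier value to copy. Writing it as a generator keeps memory use constant, which is the property the slides contrast with loading everything into an in-memory dataframe.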
  • 24. Join the revolution! • Research methods to make the following efficient and automatic: • Feature engineering • Model selection • Model debugging • Problem formulation (??) • Develop novel algorithms on top of our SDK • Backed by scalable, flexible typed data structures • Automatic Python wrappers • Make them available to many other people • We’re hiring! jobs@graphlab.com