A New Year in Data Science: ML Unpaused

Data Day Texas 2015 keynote talk http://datadaytexas.com/

A New Year in Data Science: ML Unpaused
Data Day Texas
Austin, 2015-01-10
Paco Nathan, @pacoid
Observations about Machine Learning, Data Science, Big Data, Open Source, Cluster Computing, Notebooks, etc., over the past year … plus, a look ahead
Backstory
Backstory: The Sun Also Rises
Some wake early in the morning and go build buildings
Backstory: The Sun Also Rises
Some gaze into the heavens, sit back, and explain the process…

Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
 
Transcript: Trending now: Book subjects on the move in the Canadian market - ...
Transcript: Trending now: Book subjects on the move in the Canadian market - ...Transcript: Trending now: Book subjects on the move in the Canadian market - ...
Transcript: Trending now: Book subjects on the move in the Canadian market - ...
 
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
 
Geospatial Synergy: Amplifying Efficiency with FME & Esri
Geospatial Synergy: Amplifying Efficiency with FME & EsriGeospatial Synergy: Amplifying Efficiency with FME & Esri
Geospatial Synergy: Amplifying Efficiency with FME & Esri
 
Roundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfRoundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdf
 
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlueCloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
CloudStack Authentication Methods – Harikrishna Patnala, ShapeBlue
 
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
 
Act Like an Owner, Challenge Like a VC by former CPO, Tripadvisor
Act Like an Owner,  Challenge Like a VC by former CPO, TripadvisorAct Like an Owner,  Challenge Like a VC by former CPO, Tripadvisor
Act Like an Owner, Challenge Like a VC by former CPO, Tripadvisor
 
AI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientAI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficient
 
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
 
My Journey towards Artificial Intelligence
My Journey towards Artificial IntelligenceMy Journey towards Artificial Intelligence
My Journey towards Artificial Intelligence
 
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
 
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptxGraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
 
How to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanHow to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response Plan
 
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
 

A New Year in Data Science: ML Unpaused

  • 1. A New Year in Data Science: 
 ML Unpaused Data Day Texas
 Austin, 2015-01-10 Paco Nathan, @pacoid
  • 2. Observations about Machine Learning, Data Science, Big Data, Open Source, Cluster Computing, Notebooks, etc., over the past year … plus, a look ahead
  • 4. Backstory: The Sun Also Rises Some wake early in the morning and go build buildings
  • 6. Backstory: The Sun Also Rises Some gaze into the heavens, sit back, and explain the process…
  • 7. Backstory: The Sun Also Rises Some gaze into the heavens, sit back, and explain the process… Clearly, provably, 
 our Sun revolves around the Earth 
 at an observable rate
  • 8. Backstory: The Sun Also Rises Others create and evaluate models to predict the Earth’s orbit of the Sun
  • 9. Backstory: The Sun Also Rises Sometimes, when 
 the sky gods become angry and obscure the Sun as our due punishment… We grow scared and react: sacrifices must be offered, our plans must change, etc.
  • 10. Backstory: The Sun Also Rises These points are what I’d like to discuss today
  • 13. Feel free to disagree, but I find that definition 
 to be flawed… Whither Data Science?
  • 14. Feel free to disagree, but I find that definition 
 to be flawed… 1. That ignores DevOps (how’s that working out?) 
 and Visualization/Design (ditto) Whither Data Science?
  • 15. Feel free to disagree, but I find that definition 
 to be flawed… 1. That ignores DevOps (how’s that working out?) 
 and Visualization/Design (ditto) 2. When the CEO asks you to help explain why 
 revenue nose-dived over the past month… neither field has a clue about how to model business phenomena Whither Data Science?
  • 16. Software Engineering: implement and test a model that somebody selected …almost ignores the matter of modeling entirely, at least not since old school types like Dijkstra. Statistics: measure and justify a model that somebody selected …was never particularly good at teaching how to model problems – as two renowned statisticians, William Cleveland and Leo Breiman, noted. Whither Data Science?
  • 17. Software Engineering, Statistics: both fields are necessary, but not sufficient. Whither Data Science?
  • 18. The Thorn in the Side of Big Data: too few artists
 Christopher Ré, Stanford
 safaribooksonline.com/library/view/strata-conference-santa/9781491900321/part92.html Whither Data Science?
  • 19. The Thorn in the Side of Big Data: too few artists Christopher Ré, Stanford safaribooksonline.com/library/view/strata-conference-santa/9781491900321/part92.html “You should think about features and not algorithms” Whither Data Science?
  • 21. Floyd Marinescu observed about the aftermath 
 of EJBs in Brief History… Intended for building framework components,
 e.g., for IBM, Oracle, Sun, but not many others Based on RMI, prior to notions 
 like RESTful web services Enterprise Java Beans: Lessons from hate-watch reality television
  • 22. Maybe a handful of people in the world would 
 ever actually need to use EJBs, but those few people wanted a spec Then, for tragic political reasons (MSFT envy), 
 Sun Microsystems made EJBs prominent in 
 their Java APIs Enterprise Java Beans: Lessons from hate-watch reality television
  • 23. Fortunately, we evolved: Spring, JBoss, etc., 
 those came along as relatively more sane tech Now we see the Docker thing soar, with notions such as microservices displacing legacy cruft (BTW, if you haven’t yet, check out Weave) Enterprise Java Beans: Lessons from hate-watch reality television
  • 24. I mention this because, to me, EJB represented 
 a convoluted form of template thinking: Enterprise Java Beans: Lessons from hate-watch reality television developing complex web apps 
 for the sake of 
 developing complex web apps
  • 25. Enterprise Java Beans: Lessons from hate-watch reality television IRL developers and template thinking don’t determine public policy… right?
  • 26. Enterprise Java Beans: Lessons from hate-watch reality television To paraphrase Dean Wampler, consider WordCount, a simple app written for MapReduce in Hadoop … ~50 lines of unapologetic Java that feels hella like writing EJBs:
  • 27. Enterprise Java Beans: Lessons from hate-watch reality television Compare that with functional programming, where 
 the same WC app is three lines of easily-read Scala when run in Apache Spark:
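The Scala listing shown on the slide did not survive in this transcript. As a stand-in, here is a minimal plain-Python sketch of the same functional shape: flatten lines into words, then sum a count per distinct word. It is not Dean Wampler's code and it does not use the Spark API; the function name `word_count` and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words across lines: flatten, then sum per distinct word."""
    # Flatten every line into one stream of words (the "flatMap" step).
    words = chain.from_iterable(line.split() for line in lines)
    # Counter folds "map each word to 1" and "sum by key" into one step.
    return Counter(words)


counts = word_count(["the sun also rises", "the sun sets"])
```

The point of the comparison on the slide is the shape of the program: the whole job is a short chain of transformations over a collection, with no framework scaffolding around it.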
  • 28. Enterprise Java Beans: Lessons from hate-watch reality television Check out Dean’s talk at 11:00, “Why Scala is Taking Over the Big Data World”
  • 29. Enterprise Java Beans: Lessons from hate-watch reality television Hadoop suffers because, IMHO, that convoluted 
 EJB style of developer-centric template thinking staged a coup Perhaps we could “donate” some OSS talent… Send a pull request… Or something.
  • 30. Lies, Damn Lies, 
 Statistics, and 
 Data Science
  • 31. Probability got going, formally, in the 16th c. – 
 although interesting mathematical estimations 
 trace back to classical times Arabs in the 9th c. used frequency analysis – 
 later rediscovered by Europeans during the 
 early Italian Renaissance Statistics followed, originally more about what 
 we might call demographics – through 18th c. Lies, Damn Lies, Statistics, Data Science
  • 32. Laplace, Gauss, et al., bridged the fields in the late 18th c. using distributions (what we studied in Stats 101) to infer the probability of errors in estimates. Much of the 19th/20th c. work was about using goodness of fit tests, etc., justifying some distribution • generally speaking, that requires samples • that, in turn, implies batch windows Lies, Damn Lies, Statistics, Data Science
  • 33. Lies, Damn Lies, Statistics, Data Science That kind of template thinking in action
 really lurvs it some batch windows
  • 34. While 19th/20th c. stats work focused on defensibility, 21st c. work, w.r.t. Big Data apps, focuses more on predictability – plus there’s a shift in how we make estimates… Lies, Damn Lies, Statistics, Data Science BTW, doesn’t it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results ultimately get used to approximate features? Why not perform that in stream?
  • 35. A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet Provides approximation with error bounds using far fewer resources (RAM, CPU, etc.) highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ Lies, Damn Lies, Statistics, Data Science
  • 36. Lies, Damn Lies, Statistics, Data Science
      algorithm          use case                  example
      Bloom Filter       set membership            code
      MinHash            set similarity            code
      HyperLogLog        set cardinality           code
      Count-Min Sketch   frequency summaries       code
      DSQ                streaming quantiles       code
      SkipList           ordered sequence search   code
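The "code" entries in that table were links on the original slide and are not recoverable here. To make the first row concrete, the following is a minimal sketch of a Bloom filter for approximate set membership. The class name, the default sizes, and the salted SHA-256 hashing scheme are assumptions chosen for illustration, not taken from the talk or from any particular library.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, kept simple for clarity

    def _positions(self, item):
        # Derive k positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)
```

Memory stays fixed at `num_bits` no matter how many items are added; the price is a false-positive rate that grows as the filter fills up, which is the trade the slides describe.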
  • 37. Lies, Damn Lies, Statistics, Data Science E.g., accepting ±4% error could buy you two orders of magnitude reduction in the required memory footprint for an analytics app. OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale
  • 38. Lies, Damn Lies, Statistics, Data Science Oscar Boykin at 14:00, “Aggregators: Modeling Data Queries Functionally” co-author of Algebird, Scalding
  • 40. Data Science is inherently interdisciplinary To paraphrase Chris Ré, emphasis on algorithms 
 is relatively minor in the grand scheme – Especially when compared to needs for modeling business problems effectively To wit: beyond phenomenology, leading 
 into quantitative analysis and repeatable results On the one hand, CS + Stats do not quite address those needs… The Interzone
  • 41. On the other hand, Physics does well to teach modeling – I like to hire physicists to work on Data teams… The Interzone They tend to get the interdisciplinary aspects: 
 got the math background, coding experience, generally good at systems engineering, etc. Not saying we should all rush out to get Physics degrees; there’s something to be learned there, 
 vital for the work and priorities ahead
  • 42. I mention this because we are at a crossroads, 
 which has more to do with the physical world – 
 some talks here at DDTx15 help illustrate that Vast implications for Health Care, Transportation, Agriculture, Energy, Gov, Manufacturing in general… More about that 
 in a bit – The Interzone
  • 44. Most of the ML libraries that one encounters 
 today focus on two general kinds of solutions: • convex optimization • matrix factorization The Libraries: Alexandria Redux
  • 45. One might think of the convex optimization in this case as a kind of curve fitting – generally with some regularization term to avoid overfitting, which is not good [figure panels: Good, Bad] The Libraries: Alexandria Redux
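As a rough illustration of "curve fitting with a regularization term", here is a minimal sketch that fits a line by gradient descent on a convex objective: mean squared error plus an L2 penalty on the slope. The function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions made for this example.

```python
def fit_line(xs, ys, l2=0.0, lr=0.01, steps=5000):
    """Minimize mean squared error + l2 * w**2 by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the data term plus the gradient of the L2 penalty on w.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n + 2.0 * l2 * w
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
w_plain, b_plain = fit_line(xs, ys)          # no penalty: recovers slope 2
w_ridge, b_ridge = fit_line(xs, ys, l2=1.0)  # penalty shrinks the slope
```

On this clean toy data the penalty only costs accuracy; its value shows up on noisy data with many parameters, where shrinking the weights keeps the fit from chasing noise.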
  • 46. For supervised learning, used to create classifiers: 1. categorize the expected data into N classes 2. split a sample of the data into train/test sets 3. use learners to optimize classifiers based on
 the training set, to label the data into N classes 4. evaluate the classifiers against the test set, measuring error in predicted vs. expected labels The Libraries: Alexandria Redux
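The four steps above can be sketched end to end in a few lines. This is a minimal sketch under stated assumptions: the data is synthetic (two well-separated Gaussian clusters), the "learner" is a nearest-centroid classifier picked for brevity, and the helper names are hypothetical.

```python
import random


def nearest_centroid_fit(points, labels):
    """'Learner': average the training points of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}


def predict(centroids, point):
    """Label a point with the class of the closest centroid."""
    x, y = point
    return min(
        centroids,
        key=lambda c: (centroids[c][0] - x) ** 2 + (centroids[c][1] - y) ** 2,
    )


# 1. two expected classes (N = 2), drawn from synthetic clusters
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "good") for _ in range(50)]
data += [((rng.gauss(10, 1), rng.gauss(10, 1)), "bad") for _ in range(50)]

# 2. split a sample of the data into train/test sets
rng.shuffle(data)
train, test = data[:70], data[70:]

# 3. fit the classifier on the training set
centroids = nearest_centroid_fit([p for p, _ in train], [lab for _, lab in train])

# 4. evaluate against the test set: predicted vs. expected labels
errors = sum(predict(centroids, p) != lab for p, lab in test)
error_rate = errors / len(test)
```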
  • 47. Bokay, great for security problems with simply two classes: good guys vs. bad guys How do you decide what the classes are 
 for more complex problems in business? That’s where the matrix factorization parts come in handy… The Libraries: Alexandria Redux
  • 48. For unsupervised learning, which is often used 
 to reduce dimension: 1. create a covariance matrix of the data 2. solve for the eigenvectors and eigenvalues 
 of the matrix 3. select the top N eigenvectors, based on diminishing returns for how they explain variance in the data 4. those eigenvectors define your N classes The Libraries: Alexandria Redux
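Steps 1 through 3 can be sketched without any libraries for the smallest interesting case: build the covariance matrix, then use power iteration to find the top eigenvector and the share of variance it explains. Power iteration is swapped in here as a simple stand-in for a full eigen-solver; the function names and the toy data are assumptions.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of rows (each row is one observation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector (symmetric matrix)."""
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v


# Points lying close to the line y = x, so most variance runs along (1, 1).
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
cov = covariance_matrix(rows)
eigenvalue, direction = top_eigenpair(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of total variance
```

Here a single direction explains nearly all of the variance, which is the "diminishing returns" signal from step 3: the second eigenvector adds almost nothing, so one dimension is enough.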
  • 49. An excellent overview of ML definitions (up to this point) is given in: A Few Useful Things to Know about Machine Learning
 Pedro Domingos
 CACM 55:10 (Oct 2012)
 http://dl.acm.org/citation.cfm?id=2347755 To wit: Generalization = Representation + Optimization + Evaluation The Libraries: Alexandria Redux
  • 50. [Workflow diagram, circa 2010, in three phases: representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; visualize, reporting; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback; developers; algorithms] Algorithms and developer-centric template thinking only go so far in a workflow… Results are shown in blue, and the real work is highlighted in red The Libraries: Alexandria Redux
  • 51. [Same workflow diagram as slide 50] 1. focus on features not algorithms 2. learn how to model business problems by leveraging data 3. notice the workflows needed? 4. leave the dev-centric thinking for odd city council meetings The Libraries: Alexandria Redux
  • 52. [Same workflow diagram as slide 50] The Libraries: Alexandria Redux Matthew Kirk 12:00
 “Lessons Learned: Machine Learning and Technical Debt” Ted Dunning 13:00
 “Computing with Chaos” Julia Evans 15:00
 “Data Pipelines. They're a lot of work!” Christopher Johnson 16:00
 “Scala Data Pipelines for Music Recommendations”
  • 53. Even so, business demands extend far beyond what classifiers and labels alone can give us… Businesses lurv Optimization, gobs of it; in that context ML libraries today merely scratch the surface Round hole, square peg The Libraries: Alexandria Redux
  • 54. Imagine that you compete with FedEx… how do you optimize delivery routes for airplanes, trucks, trains, nanodrones, hoverboards, etc.? Which do you optimize: fuel cost, delivery time, maintenance schedules, minimizing lost packages? Doesn’t sound much like online advertising, social networks, or 
 any episode of Silicon Valley The Libraries: Alexandria Redux
  • 56. What were the origins of machine learning? • Marvin Minsky @MIT, 1950s • Support Vector Machines @Bell Labs, 1990s • Google @Stanford, 1990s • Ray Kurzweil, 2000s Nope… ML, Unpaused
  • 57. ML has been an aspect of AI research for a long while, through several different vectors A good early history (up to 1980s) is given in: Machine Learning: A Historical and Methodological Analysis
 Jaime Carbonell, Ryszard Michalski, Tom Mitchell
 AI Magazine 4:3 (1983)
 http://dx.doi.org/10.1609/aimag.v4i3.406 To wit: task-oriented studies, knowledge acquisition, cognitive simulation, theoretical exploration … overall, a much broader class of optimization problems ML, Unpaused
  • 58. An era of anticipation – AI was making inroads… • emphasis on capturing/representing knowledge and expertise – production use cases in medicine • Fifth Generation Computing (parallel h/w) in Japan, MCC, etc. However: • few outside academia had enough cluster compute power – aside from 3-letter agencies and AT&T • meanwhile ML was not yet considered “academic” enough within academia Circa early 1980s:
  • 59. Stock market “corrected” in 1987: But…
  • 60. Some fundamental tech platforms emerge… • Hubble Space Telescope, Human Genome Project, WWW, electric cars relaunched And throughout that decade: • Linux, Java @Sun, JavaScript @Netscape • Firefly, an initial commercial ML app 
 on teh interwebs @MIT Media Lab • Rise of e-commerce leveraging horizontal 
 scale-out with commodity hardware Circa early 1990s:
  • 61. Stock market “tumbled” in 2000: But…
  • 62. GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR emerged out of the dust… • web apps dominated for search, e-commerce, 
 social networks, etc. • did we mention EJBs and template thinking? • mobile picked up traction • recommender systems went mainstream • AI picked up with semantic web efforts… Circa early 2000s:
  • 63. Stock market “went free-fall” in 2008: But…
  • 64. Successful e-commerce firms have IPO’ed and are now busy building skyscrapers in downtown SF… Circa mid 2010s: LinkedIn, 350 Bush Transbay Transit Salesforce, 415 Mission
  • 65. An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects… But… Sears Tower, Chicago Lehman Brothers, London Fontainebleau, Las Vegas
  • 66. An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects… But…
  • 67. Businesses lurv Optimization, lots of it… • ML circa 1985 focused on those needs, but got knocked back to something inevitably more aristotelian and predictable • Outside of Silicon Valley, we’ve made big strides • One danger: next downturn cycle, VCs might reshape tech industry, reverting to “safe bets” Circa mid 2010s: Back to the Future However, a few extremely interesting aspects have emerged…
  • 68. [Workflow diagram from slide 50, shown twice] We have approximation, deep learning and symbolic regression to assist on “Features” Or, maybe, cognitive computing will help on several of the more difficult aspects of this… Circa mid 2010s: Extremely Interesting Emerging Aspects
  • 69. Circa mid 2010s: Extremely Interesting Emerging Aspects DeepDive @Stanford http://deepdive.stanford.edu/ Knowledge Graph @Google http://www.google.com/insidesearch/features/search/knowledge.html IBM Watson http://www.ibm.com/smarterplanet/us/en/ibmwatson/ Scaled Inference https://scaledinference.com/
  • 70. Circa mid 2010s: Extremely Interesting Emerging Aspects Rhetorical postures: “Is AI a good idea, or potentially harmful?” 
 – per Elon Musk, et al.
  • 71. Circa mid 2010s: Extremely Interesting Emerging Aspects Clearly: good idea 
 brewbot.io Rhetorical postures: “Is AI a good idea, or potentially harmful?” 
 – per Elon Musk, et al.
  • 72. Circa mid 2010s: Extremely Interesting Emerging Aspects Speaking of which, a highly recommended podcast 
 by actual data scientists drinking really good beers: partiallyderivative.com
  • 73. Circa mid 2010s: Extremely Interesting Emerging Aspects 2015: Notebooks in Containers in the Cloud “Keep simple things simple and complex things possible.” databricks.com/product Publishing Workflows for Jupyter Andrew Odewahn, Kyle Kelley, Rune Madsen odewahn.github.io/publishing-workflows-for-jupyter IPython Interactive Demo
 Nature Magazine + Rackspace nature.com/news/ipython-interactive-demo-7.21492
  • 74. Circa mid 2010s: Extremely Interesting Emerging Aspects 2015: Notebooks in Containers in the Cloud Makes me wonder about the “data engineer” role … notebooks simplify ops needs, while ultimately the domain experts wield the real power with data
  • 76. Frontstory: The Sun Also Rises Some wake early in the morning and go build buildings dev-centric templates
  • 77. Some gaze into the heavens, sit back, and explain the process… 20th c. stats Frontstory: The Sun Also Rises
  • 78. Sometimes, when the sky gods become angry and obscure the Sun as our due punishment… VCs during recessions Frontstory: The Sun Also Rises
  • 79. Others create and evaluate models to predict the Earth’s orbit of the Sun What’s needed most Frontstory: The Sun Also Rises
  • 80. Forward Motion: SV trend: early data scientists displace old-school product managers Because there are hard 
 problems to be solved… Because we need 
 new eyes on target… Because use cases…
  • 82. Because Use Cases: Health Care Employing tech such as deep learning and cognitive computing for vital use cases in health care: “In fact, using our Topological Data Analysis system, they were able to discover multiple types of Type 2 diabetes … huge impact on all the hundreds of millions of people” – Ayasdi “Nobody knows what to do with those archives … They’re just sitting there, costing money. This is just seen as a big opportunity. It’s like, ‘Oh, this is what we were saving this up for!’” – Enlitic “Sloan-Kettering is also training Watson on 1,500 real-world lung cancer cases, helping it to decipher physician notes and learn from the hospital’s expertise in treating cancer.” – IBM Watson
  • 83. Because Use Cases: Transportation http://automatic.com/ Detects events like hard braking, acceleration – uploaded in real-time with geolocation to a Spark Streaming pipeline … data trends indicate road hazards, blind intersections, bad signal placement, and other input to improve traffic planning. Also detects inefficient vehicle operation, under-inflated tires, poor driving behaviors, aggressive acceleration, etc.
  • 84. Because Use Cases: Education https://databricks.com/blog/2014/12/08/ pearson… Integrates Kafka + Spark Streaming + Cassandra + Blur, running within a YARN cluster on AWS to provide a scalable, reliable, cloud-based platform for services that analyze student performance across product and institution boundaries. Delivers immersive learning experiences designed for how students read, think, and learn; as well as efficacy insights to both learners and institutions which were not possible before. Reliability features handle Kafka node failures, receiver failures, leader changes, committed offset in ZK, plus adjustable data-rate throughput.
  • 85. Because Use Cases: Language, everywhere http://idibon.com/ http://digitalreasoning.com/ Our social fabric is encoded as text documents, and similarly it gets tested, deployed, maintained, and monitored there – it’s the launch point for cognitive computing.
  • 86. Because Use Cases: Language, everywhere http://idibon.com/ http://digitalreasoning.com/ Robert Munroe, 12:00 “Building Better Experts: co-optimization of human and machine intelligence at Idibon” Andrew Trask, David Gilmore 11:00 “Deep Learning for Natural Language Processing”
  • 87. Because Use Cases: Geospatial Advanced geo use cases throughout all levels of gov and industry for Big Data, machine learning, graph algorithms, approximations, etc. If you roll trucks you probably use licenses from ESRI. Also consider the IoT sensor data, e.g., from National Instruments' customers – where does it go, what do organizations use to analyze it? These are the large-scale optimization problems you were looking for… http://esri.github.io/gis-tools-for-hadoop/ (and Spark) http://thunderheadxpler.blogspot.com/ http://geotrellis.io/ http://www.oculusinfo.com/tiles/ https://databricks.com/blog/2014/12/03/app...
  • 88. Because Use Cases: Telecom, Travel, Banking, etc. http://spark-summit.org/2014/talk/stratio-streaming… Stratio represents one of the most sophisticated integrations for Spark Streaming – the union of a real-time messaging bus with a complex event processing engine: Kafka, Spark Streaming, Cassandra, along with the Siddhi CEP engine Telecom, in particular, is leveraging this new streaming technology as a big win near-term http://www.openstratio.org/
 https://github.com/stratio https://github.com/Stratio/streaming-cep-engine BTW if you’re in Madrid next fall check out Big Data Hispano
  • 89. Because Use Cases… Common theme: many of those use cases are powered by Apache Spark – Especially notice Spark Streaming, which is a big game-changer for analytics across industry
  • 90. Because Use Cases… Taylor Goetz 11:00
 “Beyond the Tweeting Toaster: IoT Streaming Analytics With Apache Storm, Kafka, and Arduino” Hari Shreedharan 12:00
 “Real Time Data Processing Using Spark Streaming”
  • 91. Because Use Cases: Agriculture Ag+Data Issues
 http://radar.oreilly.com/2014/04/agdata.html Data Guild whitepaper: Ag Systems + Data Outlook
 http://goo.gl/OK8RFf • livelihood for 40% of world population • $15T/year annual GDP globally • data-intensive issues, much legal impasse Over a half billion small farms worldwide, and most are family-run farms that rely on rain-fed agriculture Nudge, and I just might propose DWave clusters into cold craters on the Lunar South Pole with routers @L5 and an LLO skyhook… to handle the vector quantization demands. Or something. Layered Sensing Networks: microsats e.g., Planet Labs, 400 km; airships e.g., JP Aerospace, 40 km; atmosats e.g., Titan Aerospace, 20 km; drones e.g., HoneyComb, 120 m; robots e.g., Blue River, 1 m; sensors e.g., Hortau, -0.3 m
  • 93. Apache Spark developer certificate program • http://oreilly.com/go/sparkcert • defined by Spark experts @Databricks • assessed by O’Reilly Media • establishes the bar for Spark expertise certification:
  • 94. MOOCs: Anthony Joseph
 UC Berkeley begins 2015-02-23 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181 Ameet Talwalkar
 UCLA begins 2015-04-14 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
  • 95. community: spark.apache.org/community.html events worldwide: goo.gl/2YqJZK video+preso archives: spark-summit.org resources: databricks.com/spark-training-resources workshops: databricks.com/spark-training
  • 97. confs: Strata CA
 San Jose, Feb 18-20
 strataconf.com/strata2015 Spark Summit East
 NYC, Mar 18-19
 spark-summit.org/east Big Data Tech Con
 Boston, Apr 26-28
 bigdatatechcon.com Strata EU
 London, May 5-7
 strataconf.com/big-data-conference-uk-2015 Spark Summit 2015
 SF, Jun 15-17
 spark-summit.org
  • 98. books: Fast Data Processing with Spark
 Holden Karau
 Packt (2013)
 shop.oreilly.com/product/9781782167068.do Spark in Action
 Chris Fregly
 Manning (2015*)
 sparkinaction.com/ Learning Spark
 Holden Karau, Andy Konwinski, Matei Zaharia
 O’Reilly (2015*)
 shop.oreilly.com/product/0636920028512.do
  • 99. presenter: Just Enough Math O’Reilly, 2014 justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/ Enterprise Data Workflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/0636920028536.do