
Dataiku - From Big Data To Machine Learning

Dataiku

This presentation was given to an audience of CIOs to raise awareness of big data in practical terms and of the new uses of machine learning and analytics.

Dataiku 6/4/2013
Hi!
Current Life:
CEO, Dataiku
Tweet about this: @dataiku @club_dsi_gun
Past Life:
Criteo
IsCool Entertainment
Exalead
Florian Douetteau
Available on SlideShare
http://www.slideshare.net/Dataiku
Goals Today:
• Concrete Feedback on Data Analytics Projects
• Data Team in practice and Key technologies
• Motivate you to start a data science project
Allergic to slide decks? Check:
https://github.com/dataiku
Dataiku
“Dataiku: An open source platform to help you build your data lab”
Collocation
Big Apple
Big Mama
Big Data
A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
“Big” Data in 1999
struct Element {
    Key key;          /* lookup key */
    void *stat_data;  /* untyped pointer to the data stored for this key */
};
….
C
Optimized Data structures
Perfect Hashing
HP-UNIX Servers – 4GB Ram
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
1 Month
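To make the "Perfect Hashing" item concrete, here is a minimal sketch in C of a collision-free table over a fixed key set. It is not the 1999 implementation, which the slide does not show. The string `Key` type, the seeded FNV-1a hash, the brute-force seed search, and the names `TABLE_SIZE`, `MAX_SEED`, `build`, and `lookup` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 8      /* assumption: at least as many slots as keys */
#define MAX_SEED 100000u  /* assumption: give up after this many seeds */

typedef const char *Key;  /* assumption: keys are C strings */

struct Element {
    Key key;
    void *stat_data;
};

/* FNV-1a string hash, perturbed by a seed so different seeds give different layouts. */
static uint32_t hash(const char *s, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Search for a seed that sends every key to its own slot.
   Returns 1 and stores the seed on success, 0 if no seed was found. */
static int build(struct Element *table, const Key *keys, size_t n, uint32_t *seed_out) {
    assert(n <= TABLE_SIZE);
    for (uint32_t seed = 0; seed < MAX_SEED; seed++) {
        size_t i;
        for (i = 0; i < TABLE_SIZE; i++) {
            table[i].key = NULL;
            table[i].stat_data = NULL;
        }
        for (i = 0; i < n; i++) {
            uint32_t slot = hash(keys[i], seed) % TABLE_SIZE;
            if (table[slot].key != NULL)
                break;  /* collision: this seed is not perfect, try the next one */
            table[slot].key = keys[i];
        }
        if (i == n) {
            *seed_out = seed;
            return 1;
        }
    }
    return 0;
}

/* One hash and one comparison per lookup; no probing or chaining is needed. */
static struct Element *lookup(struct Element *table, uint32_t seed, const char *key) {
    struct Element *e = &table[hash(key, seed) % TABLE_SIZE];
    return (e->key != NULL && strcmp(e->key, key) == 0) ? e : NULL;
}
```

Because the key set is fixed when the table is built, the cost of finding a collision-free seed is paid once, and every later lookup touches exactly one slot. That trade-off suits a setting where memory is tight and the data is loaded once and then read many times.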

 
National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...MichaelBenis1
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...DianaGray10
 
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...DianaGray10
 

Recently uploaded (20)

Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
 
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptxGraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
 
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlueVM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
 
Confoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceConfoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data science
 
Launching New Products In Companies Where It Matters Most by Product Director...
Launching New Products In Companies Where It Matters Most by Product Director...Launching New Products In Companies Where It Matters Most by Product Director...
Launching New Products In Companies Where It Matters Most by Product Director...
 
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
 
Act Like an Owner, Challenge Like a VC by former CPO, Tripadvisor
Act Like an Owner,  Challenge Like a VC by former CPO, TripadvisorAct Like an Owner,  Challenge Like a VC by former CPO, Tripadvisor
Act Like an Owner, Challenge Like a VC by former CPO, Tripadvisor
 
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
 
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
 
How to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanHow to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response Plan
 
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
 
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHubHow We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostinger
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
 
New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024
 
Transcript: Trending now: Book subjects on the move in the Canadian market - ...
Transcript: Trending now: Book subjects on the move in the Canadian market - ...Transcript: Trending now: Book subjects on the move in the Canadian market - ...
Transcript: Trending now: Book subjects on the move in the Canadian market - ...
 
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
 
National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
 
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
 

Dataiku - From Big Data To Machine Learning

  • 2. 6/4/2013Dataiku 2 Hi ! Current Life: CEO, Dataiku Tweet about this: @dataiku @club_dsi_gun Past Life: Criteo IsCool Entertainment Exalead Florian Douetteau Available on Slide Share http://www.slideshare.net/Dataiku Goals Today: • Concrete Feedback on Data Analytics Projects • Data Team in practice and Key technologies • Motivate you to start a data science project Slide deck allergic ? Check: https://github.com/dataiku
  • 3. 6/4/2013Dataiku 3 Dataiku Dataiku : An open source platform to help you build your data lab ‟ ”
  • 5. Collocation 6/4/2013Dataiku 5 Big Apple Big Mama Big Data A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
  • 6. “Big” Data in 1999 6/4/2013Dataiku 6 struct Element { Key key; void* stat_data ; } …. C Optimized Data structures Perfect Hashing HP-UNIX Servers – 4GB Ram 100 GB data Web Crawler – Socket reuse HTTP 0.9 1 Month
  • 7. • Hadoop • Java / Pig / Hive / Scala / Clojure / … • A Dozen NoSQL data stores • MPP Databases • Real-Time 6/4/2013Dataiku 7 Big Data in 2013 1 Hour
  • 8. Data Analytics: The Stakes 6/4/2013Dataiku 8 1 TB ? $ Social Gaming 2011 Web Search 1999 Logistics 2004 Online Advertising 2012 1 TB 100M $ E-Commerce 2013 Banking CRM 2008 1 TB 1B $ Web Search 2010 100 TB ? $ 10 TB 10M $ 1000 TB 500M $ 50 TB 1B $
  • 9. Meet Hal Alowne 6/4/2013Dataiku - Data Tuesday 9 Big Guys • 10B$+ Revenue • 100M+ customers • 100+ Data Scientists Hal Alowne BI Manager Dim’s Private Showroom “Hey Hal ! We need a big data platform like the big guys. Let’s just do as they do!” European E-commerce Web site • 100M$ Revenue • 1 Million customers • 1 Data Analyst (Hal Himself) Dim Sum CEO & Founder Dim’s Private Showroom Big Data Copy Cat Project
  • 10. Technology is complex 6/4/2013Dataiku 10 Hadoop Ceph Sphere Cassandra Spark Scikit-Learn Mahout WEKA MLBase RapidMiner Pandas D3 Crossfilter InfiniDB LucidDB Impala Elastic Search SOLR MongoDB Riak Membase Pig Hive Cascading Talend Machine Learning Mystery Land Scalability Central NoSQL-Slavia SQL Columnar Republic Visualization County Data Clean Wasteland Statistician Old House R
  • 11. Statistics and Machine Learning is complex ! 6/4/2013Dataiku 11  Try to understand myself
  • 12. (Some Book you might want to read) 6/4/2013Dataiku 12
  • 13. Plumbing is not complex (but difficult) 6/4/2013Dataiku 13 Implicit User Data (Views, Searches…) Content Data (Title, Categories, Price, …) Explicit User Data (Click, Buy, …) User Information (Location, Graph…) 500TB 50TB 1TB 200GB Transformation Matrix Transformation Predictor Per User Stats Per Content Stats User Similarity Rank Predictor Content Similarity
  • 14. MERIT = TIME + ROI 6/4/2013Dataiku 14 Targeted Newsletter Recommender Systems Adapted Product / Promotions TIME : 6 MONTHS ROI : APPS  Build a lab in 6 months (rather than 18 months) Find the right people (6 months?) Choose the technology (6 months?) Make it work (6 months?) Build the lab (6 months)  Deploy apps that actually deliver value 2013 2014 2013 • Train People • Reuse working patterns
  • 15. The Problem 6/4/2013Dataiku 15 It’s utterly complex and unreasonable
  • 16. Our Goal 6/4/2013Dataiku 16 Our Goal: Change his perspective on data science projects (sorry, we couldn’t find a picture of Hal Smiling)
  • 17.  Why and For What ? ◦ Business Theory ◦ Concrete Projects  How people and project ? ◦ How to start ◦ Dedicated team ?  What technologies ? ◦ Machine Learning ◦ Architecture Agenda 6/4/2013Dataiku 17
  • 19.  Product Success driven by Quality !  Margin / Customer Value / Traffic / Acquisition 6/4/2013Dataiku 19 Example: Launching an App on the App Store
  • 20.  Margin for new customers might decline …  Margin for new features might decline …  Is your business really scalable ? 6/4/2013Dataiku 20 you continue growing ….
  • 21.  Existing Customers Profiles  Existing Product Assets  Existing Specific Business Model  And your KNOWLEDGE of it 6/4/2013Dataiku 21 Where is your core business advantage ?
  • 22. 6/4/2013Dataiku 22 Data Driven Business What is your value ? Number of Customers Customer Knowledge Increase over time with: - Time spent in your app - User relationship (network effect) - Partner / Other Apps Interactions Your Value
  • 23. Data Impact Not all businesses are equal 6/4/2013Dataiku 23 Online Advertising Telecommunication Insurance Ability to Acquire Margin New Services Overall Subscription Market Infrastructure Driver Selling Data Risk / Price Optimization Subscription Market Subscription Market
  • 24. From Theory To Practice 6/4/2013Dataiku 24
  • 25.  What should be free in the application ?  How to optimize conversion ?  How to plan and create a business model ? Main Pain Point: How to plan and optimize pricing in the application ? 6/4/2013Dataiku 25 Freemium Application
  • 26. Example (Freemium Application) Freemium Model Optimization 6/4/2013Dataiku 26 Business Model User Cluster Simulation • Optimized Pricing: Margin +23% • Business Planning Capability 1 month → 9 months • R + Python + InfiniDB On-Premise 1TB Dataset 5-week project
  • 27. • The Business Intelligence stack has scalability and maintenance issues • The back office implements business rules that are being challenged • The existing infrastructure cannot cope with per-user information Main Pain Point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day. 6/4/2013Dataiku 27 Large E-Retailer
  • 28. • Relieve their current DWH and accelerate production of some aggregates/KPIs • Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc. • Train existing people in machine learning and segmentation • 1h12 to compute the aggregates, available every morning • New home page personalization deployed in a few weeks • Hadoop Cluster (24 cores) Google Compute Engine Python + R + Vertica 12 TB dataset 6-week project 6/4/2013Dataiku - Data Tuesday 28 Large E-Retailer : The Datalab
  • 29. • BI performed directly on production databases • New reports required the CTO’s direct work for design and implementation • Each photo tag manually validated and completed Large Photo Bank 6/4/2013Dataiku - Data Tuesday 29 Main pain point: No visibility on new users’ behaviours
  • 30. • Implementing a Cloud-based data lab to : • centralize all available data, previously scattered between SQL DBs and file systems, • improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation, • create content-based recommendation engines with keyword clustering and association. 6/4/2013Dataiku - Data Tuesday 30 Large Photo Bank : The Datalab • R + Vertica + Hadoop Amazon Web Services 8-week project • Automated content filtering and recommendation
  • 31. • Large set of manually crafted linguistic resources for interpreting user queries • New brands, rare terms .. hard to maintain 6/4/2013Dataiku 31 Large Online Directory Main Pain Point: Ability to maintain a very large ontological knowledge set, with more than 100k concepts
  • 32. • Analyze clicks, rephrasing navigation to detect queries that require specific processing • Gather web and external data to enrich the existing index • Train the team on Hadoop and Machine Learning • Continuous Relevance Monitoring • Automated enrichment • 2x more productivity • Hadoop (48 cores) Python On Premise 10-week project 6/4/2013Dataiku 32 Large Online Directory: The Data Lab
  • 33.  Launch A Marketing campaign  After a few days PREDICT based on behaviours ◦  Total ARPU for users after 3 months ◦  Efficiency of a campaign ◦ Continue or not ? Example ( E-Application ) Marketing Campaign Prediction Dataiku 33
  • 34. A very large community Some mid-size communities Lots of small clusters (mostly 2 players) • Correlation ◦ between community size and engagement / virality • Meaningful patterns ◦ 2 players / Family / Group • What is the minimum number of friends to have in the application to get additional engagement ? Example (Social Gaming) Social Gaming Communities 6/4/2013Dataiku 34
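The community picture on this slide (one very large community, a few mid-size ones, many 2-player clusters) can be reproduced by counting connected components in the friendship graph. The following is a minimal sketch in plain Python; the `friendships` edge list, the player names and the function name `community_sizes` are hypothetical and are not taken from the deck.

```python
from collections import Counter, defaultdict


def community_sizes(friendships):
    """Return a Counter mapping community size -> number of communities."""
    # Build an undirected adjacency list from (player, player) pairs.
    graph = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    sizes = Counter()
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search over one connected component.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes[size] += 1
    return sizes


# Hypothetical friendships: two 2-player clusters and one group of four.
edges = [("ann", "bob"), ("cat", "dan"),
         ("eve", "fay"), ("fay", "gus"), ("gus", "eve"), ("gus", "hal")]
sizes = community_sizes(edges)
for size, count in sorted(sizes.items()):
    print(f"{count} community(ies) of size {size}")
```

Joining the size of each player's community with an engagement metric would then allow the correlation mentioned on the slide to be measured.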
  • 35.  What others do ? ◦ Concrete Projects  How people and project ? ◦ How to start ◦ Dedicated team ?  What technologies ? ◦ Machine Learning ◦ Architecture Agenda 6/4/2013Dataiku 35
  • 37. • An A / B Test (or the equivalent for your business) is the first step toward a “data-driven” mindset • No advanced analytics required; some existing tools can help • Changing a button color: +21% 6/4/2013Dataiku 37 (1) Be Data Driven
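A lift such as the +21% quoted on this slide is only meaningful if it is unlikely to be noise. One common way to check this is a two-proportion z-test, sketched below with the Python standard library. The visitor and conversion counts are made up so that the lift comes out at +21%; the deck does not give the underlying numbers, and `ab_test` is a hypothetical helper name.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test.

    Returns (relative_lift, z_score, p_value) for variant B against variant A.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value


# Hypothetical experiment: 10,000 visitors per variant.
lift, z, p_value = ab_test(conv_a=200, n_a=10_000, conv_b=242, n_b=10_000)
print(f"lift={lift:+.0%}  z={z:.2f}  p={p_value:.3f}")
```

With smaller samples the same relative lift would not reach significance, which is why the test size matters as much as the observed difference.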
  • 38.  People  Microsoft Excel 6/4/2013Dataiku 38 (2) Use Excel
  • 39.  Data Team  Data Tools 6/4/2013Dataiku 39 (3) Build a team The Business Expert who knows maths The Analyst that reveals patterns The Coding Guy That is enthusiastic
  • 40.  data lab, (n. m): a small group with all the expertise, including business minded people, machine learning knowledge and the right technology  A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…) TEAM + TOOLS = LAB 6/4/2013Dataiku 40
  • 41. Organization 6/4/2013Dataiku 41 Targeted campaigns Price optimization Personalized experience Quality Assurance Workload and yield management User Feedback (A/B Test) Continuous improvement Data Product Designer Business & Marketing Engineers User Voice
  • 42. Short Term Focus Long Term Drive Business People Optimize Margin, …. Create new business revenue streams Marketing People Optimize click ratio Brand awareness and impact IT People Make IT work Clean and efficient Architecture Data People Get Stats Right, make predictions Create Data Driven Features It’s just a new team 6/4/2013Dataiku 42
  • 43. Super Intern 6/4/2013Dataiku 43 What is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product ?
  • 44.  What others do ? ◦ Concrete Projects  How people and project ? ◦ How to start ◦ Dedicated team ?  What technologies ? ◦ Machine Learning ◦ Architecture Agenda 6/4/2013Dataiku 44
  • 45. An oversimplified view of big data architecture 6/4/2013Dataiku 45
  • 47. (What it really looks like) 6/4/2013Dataiku 47
  • 48. What kind of scale? 6/4/2013Dataiku 48 Database Business Layer Application Or Data Science App Or ?
  • 49. What kind of interaction ? 6/4/2013Dataiku 49 Database Business Layer Application Data Science App ? ? ? ? ? ?
  • 50. Classic Columnar Architecture 6/4/2013Dataiku 50 Some data Some Place To Pour It In Some Tool To Do Some Maths And Graphs
  • 51. Classic Columnar Architecture 6/4/2013Dataiku 51 Lots of data Some Place To Pour It In Some Tool To Do Some Maths And Graphs Web Tracking Logs Raw Server Logs Order / Product / Customer Facebook Info Open Data (Weather, Currency …)
  • 52. The Corinthian Architecture 6/4/2013Dataiku 52 Lots of data Some Place To Perform Rapid Calculations Some Tools To Do Some Maths And Charts Some Place To Pour It In And Clean / Prepare It
  • 53. Data Storage And Preparation 6/4/2013Dataiku 53 Large Scale: Hadoop Cluster Cassandra MPP SQL Columnar Medium/Large Scale: CouchBase MongoDB …. Selection Drivers Volume Scalability
  • 54. Calculations 6/4/2013Dataiku 54 Classic Database • PostgreSQL • MySQL • …. MPP SQL Database • Vertica, Vectorwise, InfiniDB, GreenplumHD…. Hadoop New Databases • Impala … Selection Drivers: Speed ( Interactivity ) Expressivity
  • 55. The Corinthian Architecture 6/4/2013Dataiku 55 Lots of data Some Place To Perform Rapid Calculations Some Tools To Do Some Maths And Charts Some Place To Pour It In And Clean / Prepare It Statistics Cohorts Regressions Bar Charts For Marketing Nice Infographics for your Company Board
  • 56. The Corinthian Architecture 6/4/2013Dataiku 56 Lots of data Some Database To Perform Rapid Calculations Some Tools To Do Some Maths Some Other To Do Some Charts Some Place To Pour It In And Clean / Prepare It
  • 57. Statistical Tools 6/4/2013Dataiku 57 Open Source: • IPython • RStudio Commercial • RapidMiner • SAS • RevolutionR Selection Drivers Existing Know-how Scalability
  • 58. 6/4/2013Dataiku 58 What is a statistical tool ?  Interact and explore data  Some stats capabilities  Some Graph Capabilities
  • 59. Visualization Tools 6/4/2013Dataiku 59 Commercial: • SpotFire • Tableau • QlikView SAAS • BIME • ChartIO • RevolutionR HTML5 / AdHoc • D3 • GraphViz Selection Drivers How Many Contributors / Readers ? Scalability
  • 60. The “One Database Won’t Do It All” Problem 6/4/2013Dataiku 60 Lots of data Some Database To Perform Rapid Calculations Some Tools To Do Some Maths Some Other To Do Some Charts Some Place To Pour It In And Clean / Prepare It JOIN / Aggregate Rapid Group By Computations Direct Access to the computed Results to production etc..
  • 61. The Roman Social Forum 6/4/2013Dataiku 61 Lots of data Some Database To Perform Rapid Calculations And Some Database For Graphs Some Tools To Do Some Maths Some Other To Do Some Charts Some Place To Pour It In And Clean / Prepare It
  • 62. Graph 6/4/2013Dataiku 62 Databases • Neo4J • Titan • OrientDB • InfiniteGraph Analytic / Visualization • Gephi Selection Drivers Scalability What Algorithms ? Licensing Constraints
  • 63. The Key Value Store 6/4/2013Dataiku 63 Lots of data Some Database To Perform Rapid Calculations And Some Database For Graphs And Some Distributed Key Value Store Some Tools To Do Some Maths Some Other To Do Some Charts Some Place To Pour It In And Clean / Prepare It
  • 64. NoSQL 6/4/2013Dataiku 64 Search • SOLR • ElasticSearch Document • MongoDB • CouchDB KeyValue • Redis • HBase … Selection Drivers Durability / Availability … Performance Ease of use and API Indexing
  • 65. Action requires Prediction 6/4/2013Dataiku 65 Lots of data Some Database To Perform Rapid Calculations And some database for graphs And Some Distributed Key Value Store Some Tools To Do Some Maths Some Other To Do Some Charts Some Place To Pour It In And Clean / Prepare It Draw A Line  For the future What are my real users groups ? Should I launch a discount offering or not ? To everybody or to specific users only ?
  • 66. The Medieval Fairy Land 6/4/2013Dataiku 66 Lots of data Some Tools To Do Some Maths Some Other To Do Some Charts and some MACHINE LEARNING Some Place To Pour It In And Clean / Prepare It Some Database To Perform Rapid Calculations And Some Database For Graphs And Some Distributed Key Value Store
  • 67. Predictions 6/4/2013Dataiku 67 Java • Mahout (Hadoop) • WEKA Python • Scikit-Learn • PyML R Commercial • Kxen • SAS • SPSS… Selection Drivers Scalability Black Box / White Box ? Data Management Integration
  • 69. • Exploratory Data Analysis ◦ Identifying and visualizing key patterns and correlations within the dataset • Unsupervised Learning ◦ Creating groups of similar observations that share the same patterns (aka Clustering, Segmentation) • Supervised Learning ◦ Modeling a target variable using independent features (aka Scoring, Predictive Modeling, Classification) • Time Series Forecasting ◦ Predicting a time-dependent variable using its own history, and sometimes other covariates • Graph Analysis ◦ Analyzing relationships between a set of “nodes”, linked by “edges” • Association / Sequence Mining ◦ Identifying frequently associated items within transaction or event databases, sometimes ordered over time • And many more… Classes of Machine Learning Problems 04/06/2013Dataiku - Innovation Services 69
  • 70. Mapping ML to Business Questions 04/06/2013Dataiku - Innovation Services 70 Class: Sample Business Questions ◦ Exploratory Data Analysis: What does my dataset look like ? What are the key correlations in my data ? ◦ Unsupervised Learning: Can I create groups of users who share the same purchasing behavior ? The same navigation behavior ? ◦ Supervised Learning: Which users are likely to click on ad X ? Which users are likely to convert to paying users ? Who is going to leave my service ? What is the profile of the users who do X ? ◦ Time Series Forecasting: What is my revenue forecast for next month ? Given the weather forecast, can I also forecast my sales ? Product sales forecast (for overbooking) ◦ Graph Analysis: Can I identify influencers in my user community ? Can I recommend new friends to my users ? ◦ Association & Sequence Mining: Which products are frequently bought together ? What is the typical navigation path on my website ?
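The association-mining question "Which products are frequently bought together ?" can be answered, at its simplest, by counting co-occurring pairs in baskets and keeping those above a support threshold. The sketch below does exactly that; it is only the pair-counting step and not a full Apriori implementation. The baskets, the product names and the function name `frequent_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): support} for pairs bought together often enough.

    Support is the fraction of baskets that contain both items.
    """
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so that each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical transactions.
baskets = [
    ["shoes", "socks"],
    ["shoes", "socks", "belt"],
    ["belt", "hat"],
    ["shoes", "socks", "hat"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

On large transaction logs the same counting is usually pushed down to a database or a Hadoop job, since the number of candidate pairs grows quadratically with basket size.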
  • 71. Machine Learning Methods Detailed 04/06/2013Dataiku - Innovation Services 71 Analytical Task / ML Task: Sample Algorithms [Shape of Dataset] ◦ Exploratory Data Analysis / Univariate Analysis: distributions, frequencies, histograms, boxplots, fit tests... [N obs. (1 row per obs.) × P features] ◦ Exploratory Data Analysis / Bivariate Analysis: scatterplots, correlations (Pearson, Spearman), GLM, Chi Square... [N obs. (1 row per obs.) × P features] ◦ Exploratory Data Analysis / Multivariate Analysis: principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… [N obs. (1 row per obs.) × P features] ◦ “Oriented” Data Analysis / Unsupervised Learning: K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... [N obs. (1 row per obs.) × P features] ◦ “Oriented” Data Analysis / Supervised Learning: linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… [N obs. (1 row per obs.) × P features] ◦ “Oriented” Data Analysis / Time Series Forecasting: ARMA, VARMAX, ARIMA… [Time series (rows: time period, columns: measures)] ◦ “Oriented” Data Analysis / Graph Analysis: centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… [Node and edge lists (+ attributes)] ◦ “Oriented” Data Analysis / Associations & Sequences: frequent itemsets, Apriori, market basket… [(Timestamped) events or transactions]
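As an illustration of the graph-analysis row, PageRank can be sketched with power iteration over a node and edge list. This is a simplified version under stated assumptions: a damping factor of 0.85, a fixed number of iterations rather than a convergence test, and a hypothetical `follows` graph. A graph library or graph database would normally be used instead.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [nodes it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical "who follows whom" graph.
follows = {"ann": ["cat"], "bob": ["cat"], "cat": ["ann"], "dan": ["cat"]}
rank = pagerank(follows)
for user, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(f"{user}: {score:.3f}")
```

The highest-ranked users are candidate "influencers" in the sense of the business question on the previous slide.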
  • 72.  Cluster a dataset into K Buckets by choosing the “closest” neighbours 6/4/2013Dataiku 72 Unsupervised Method K-Means
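The K-means idea on this slide, assigning each point to its closest centroid and then moving each centroid to the mean of its points, can be sketched in a few lines of plain Python. Two simplifications are assumed here: the first `k` points seed the centroids so the result is deterministic, and the loop runs a fixed number of iterations. The sample points are made up; a library such as Scikit-Learn would be the usual choice in practice.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: seed centroids with the first k points (deterministic sketch).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```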
  • 73.  Predict the color of a point depending on the colors of its K closest neighbours 6/4/2013Dataiku 73 Supervised K-Nearest-Neighbours
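The K-nearest-neighbours rule described on this slide is a majority vote among the K closest labelled points. A minimal sketch follows, assuming Euclidean distance and a small made-up training set of coloured points; `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of `query` from (point, label) pairs by majority vote."""
    # Keep the k training points closest to the query.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points.
train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]
prediction_near_red = knn_predict(train, (2, 2))
prediction_near_blue = knn_predict(train, (6, 5))
print(prediction_near_red, prediction_near_blue)
```

Sorting the whole training set for every query is fine for a sketch; at scale, spatial indexes or approximate nearest-neighbour search are used instead.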
  • 74.  Find the most “significant” input variable and split value  Split the dataset recursively 6/4/2013Dataiku 74 Supervised Decision Tree
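The two steps on this slide, finding the most "significant" variable and split value and then splitting recursively, can be sketched as below. Gini impurity is assumed as the significance criterion (the slide does not name one), and the `(age, visits)` rows and labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split; a leaf is a label, a node is (feature, threshold, left, right)."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk down the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


# Hypothetical data: (age, visits) -> did the user buy?
rows = [(25, 1), (30, 0), (45, 1), (50, 0), (23, 0), (52, 1)]
labels = ["no", "no", "yes", "yes", "no", "yes"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (28, 1)), predict(tree, (48, 0)))
```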
  • 75. Several Paths to Machine Learning 04/06/2013Dataiku - Innovation Services 75 [Decision flowchart starting from an Analytical Dataset, with Yes / No branches] ◦ Exploratory Data Analysis (“I just want to explore”; “I’m looking variable by variable, or pairs”; “All my variables are numeric”; “I have a distance matrix”): Data Viz..., Correlation Analysis, GLM, parametric and non-parametric stat. tests, PCA …, CA…, MDS... ◦ Unsupervised Learning (“I’m looking for clusters”; “I know how many groups to look for”; “Small Dataset (<<1K)”; “Medium Dataset (<<100K)”; “I can sample”): HCA …, Partitioning (K-means…), GMM …, DP GMM …, K-means + Gap | Silhouette | …, 2-steps clustering, Affinity Propagation, Mean Shift… ◦ Supervised Learning* (“I want to predict a variable”; “I value interpretability”: Yes / Not Only): Generalized Linear Model, Simple Decision Tree, Support Vector Machines, Neural Networks, K-Nearest Neighbors, Ensembles (Random Forest, Gradient Boosted Tree), MARS, Generalized Additive Model * Methods generally working for both classification & regression
  • 76. 6/4/2013Dataiku 76 Questions ?  Take Away ◦ There are new ways to perform data analytics that are within your reach and can bring business value  Some Additional Resources ◦ Open Source Projects  Dataiku Cloud Transport Client http://dctc.io  Dataiku Web Tracker https://github.com/dataiku/wt1 ◦ Our Technical Blog  http://www.dataiku.com/blog