The Big Data Dead Valley Dilemma
and Much More
francis@qmining.com
Founder QMining
@fraka6
Unhidden Agenda
● Big Data Big Picture
● Big Data Dead Valley Dilemma
● Elastic MapReduce (EMR) numbers
● Scaling Learning (MPI & Hadoop)
Big Data
=
Lots of Data
(evidence)
+
CPU bound
(forgotten)
Big Data
=
Lots of Data
(evidence)
-
IO bound
(reality)
IO bound
CPU < 100%
Data
● HD/Bus speed
● Network
● File server
Big Data Scalability
(e.g. Hadoop)
=
Cluster
+
Locality + node failure
(data kept close to the CPU)
The Big Data Dilemma
Big Data Dead Valley
[Chart. Axes: Techno Maturity / Risk vs. Enterprise size. Labels: SMB, Enterprise, Start-ups, Techno Maturity, Risk]
Big Data
=
SMALL
MARKET
(B2B vs B2C)
Small Market......hum?
WHY?????
Maturity
Data, Process, QA, infra, talent, $, Long term vision
Data -> Analytics -> BI -> Big Data -> Data Mining ->
Data Access & Quality
User data privacy, IT outsourcing protection, Data Quality
Enterprise Slowness
1. Boston CXO Forum, 24 October: Best Practice on Global Innovation (IBM, EMC, P&G, Intuit)
Exploit vs Explore - M&A
2. Brad Feld (Managing Director at Foundry Group)
Hierarchy vs network
Big Data Dead Valley
[Chart. Axes: Techno Maturity / Risk vs. Enterprise Maturity. Labels: SMB, Enterprise, Start-ups, Techno Maturity, Risk]
QMarketing example
Leveraging Hadoop
● map = hits to session
● reduce = sessions to ROI
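The two bullets above can be sketched in plain Python. This is only an illustration, not QMarketing's actual job: the hit fields (`visitor`, `ts`, `channel`, `revenue`), the 30-minute session timeout, the first-hit channel attribution, the per-channel costs and the formula ROI = (revenue - cost) / cost are all assumptions made for the example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def _summarize(session):
    # Assumption: a session is attributed to the channel of its first hit.
    return session[0]["channel"], sum(h["revenue"] for h in session)


def map_hits_to_sessions(hits):
    """Map step: group raw hits into sessions, yielding (channel, revenue)."""
    by_visitor = defaultdict(list)
    for hit in hits:
        by_visitor[hit["visitor"]].append(hit)
    for visitor_hits in by_visitor.values():
        visitor_hits.sort(key=lambda h: h["ts"])
        session = [visitor_hits[0]]
        for hit in visitor_hits[1:]:
            # A long pause closes the current session and starts a new one.
            if hit["ts"] - session[-1]["ts"] > SESSION_GAP_SECONDS:
                yield _summarize(session)
                session = []
            session.append(hit)
        yield _summarize(session)


def reduce_sessions_to_roi(sessions, cost_by_channel):
    """Reduce step: sum revenue per channel, then ROI = (revenue - cost) / cost."""
    revenue = defaultdict(float)
    for channel, amount in sessions:
        revenue[channel] += amount
    return {
        channel: (revenue[channel] - cost) / cost
        for channel, cost in cost_by_channel.items()
    }


# Hypothetical sample data, for illustration only.
hits = [
    {"visitor": "a", "ts": 0, "channel": "PPC", "revenue": 0.0},
    {"visitor": "a", "ts": 60, "channel": "PPC", "revenue": 30.0},
    {"visitor": "b", "ts": 10, "channel": "Email", "revenue": 10.0},
    {"visitor": "a", "ts": 4000, "channel": "Email", "revenue": 20.0},
]
roi = reduce_sessions_to_roi(map_hits_to_sessions(hits), {"PPC": 20.0, "Email": 10.0})
```

On a real cluster the grouping by visitor would be done by the framework's shuffle rather than by an in-memory dictionary; the sketch only shows the shape of the two steps.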
Online Marketing
Management
Channel          % budget   ROI
----------------------------------------------
PPC              50%        ?
Organic          20%        ?
Email Campaign   20%        ?
Social Media     10%        ?
ROI Dashboard
All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )
Minimize A Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● Hard drive directly attached to the server
● Low-level Linux command-line tools (cut, grep, sed, etc.)
● High-level languages: Python
Abstraction = 20X benefits
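As a rough illustration of "work on files" combined with a high-level language, the sketch below streams a delimited text file once, roughly what `grep keyword | cut -f N | sort | uniq -c` would do on the command line. The function name `count_by_field`, the tab-separated layout and the sample rows are hypothetical and not taken from the talk.

```python
import os
import tempfile
from collections import Counter


def count_by_field(path, field_index, keyword=None, sep="\t"):
    """Stream a delimited file line by line and count values of one column."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Optional grep-like filter on the raw line.
            if keyword is not None and keyword not in line:
                continue
            fields = line.rstrip("\n").split(sep)
            if field_index < len(fields):
                counts[fields[field_index]] += 1
    return counts


# Hypothetical sample log: visitor, channel, page.
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\tPPC\t/home\nb\tEmail\t/buy\na\tPPC\t/buy\n")
    sample_path = tmp.name

counts = count_by_field(sample_path, field_index=1, keyword="/buy")
os.remove(sample_path)
```

Because the file is read sequentially and never loaded whole, memory use stays flat regardless of file size, which is the point of keeping the stack of abstractions short.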
EMR vs AWS & S3 1.0
(no data locality optimization + network & ~IO bound)
EMR = 45 min
AWS = 4 min
EMR vs AWS & S3 2.0
EMR = 5+10 min*
AWS = ~4 min
*30 min preprocessing ;)
EMR = 5+4 min if (big files & compressed files)
Scaling Machine Learning
● Scaling Data-Preprocessing = Hadoop
● Small dataset = GPU
● Train with Big Dataset = ?? Communication Infrastructures =
MPI & MapReduce (John Langford http://hunch.net/?p=2094)
MPI allreduce
Hadoop vs MPI
MPI
● No fault tolerance by default
● Poor visibility into where the data is (manual split across nodes + poor communication & programming complexity)
● Scale limited to ~100 nodes in practice (sharing unavoidable)
● Shared cluster -> slow-node issues arise before disk/node failures
MapReduce
● Setup and teardown costs are significant (interaction with the scheduler & communicating the program to a large number of nodes)
● Worst: MapReduce waits for free nodes + many MapReduce iterations are needed to reach high-quality predictions
● Flaw: requires refactoring code into map/reduce
Hadoop-compatible AllReduce -
Vowpal Wabbit (Hadoop + MPI)
● MPI = AllReduce (all nodes end in the same state)
● MapReduce = conceptual simplicity
● MPI: no need to refactor code
● MapReduce: data locality (map only)
● MPI: ability to use local storage (or RAM): temp files on local disk can be cached in RAM by the OS
● MapReduce: automatic cleanup of local resources (tmp files)
● MPI: fast optimization approaches remain within the conceptual scope: AllReduce = function call
● MapReduce: robustness (speculative execution to deal with slow nodes)
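To make "AllReduce = function call" and "all nodes same state" concrete, here is a minimal single-process simulation. It is not MPI and not Vowpal Wabbit's implementation: `allreduce_sum`, the one-parameter model y ~ weight * x, the data shards and the learning rate are all assumptions chosen for illustration.

```python
def allreduce_sum(local_vectors):
    """Simulated AllReduce: every node receives the element-wise sum."""
    total = [sum(values) for values in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def local_gradient(weight, shard):
    """Squared-error gradient for y ~ weight * x on one node's shard,
    returned together with the shard size so the sum can be averaged."""
    return [sum(2 * (weight * x - y) * x for x, y in shard), len(shard)]


def train(shards, steps=200, learning_rate=0.01):
    weights = [0.0 for _ in shards]  # one model copy per simulated node
    for _ in range(steps):
        # Each node works only on its own local data...
        local = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
        # ...then a single call leaves every node with the same totals.
        reduced = allreduce_sum(local)
        weights = [
            w - learning_rate * g / n for w, (g, n) in zip(weights, reduced)
        ]
    return weights


# Hypothetical shards of (x, y) pairs drawn from y = 2x, split over two nodes.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
weights = train(shards)
```

The training loop keeps its ordinary sequential shape; the only distributed step is the `allreduce_sum` call, which is why no refactoring into map and reduce phases is needed.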
Summary
● Big Data Big Picture
○ Big Data: Cluster + IO bound (Locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small Market / Maturity / Data: access, quality / Slowness
● EMR (AWS) = Slow
● Minimize Tower of Abstraction
● Scaling MP: bottleneck = ML
○ MPI: no fault tolerance + where is the data?
○ Hadoop: slow setup & teardown + requires refactoring
○ Hadoop-compatible AllReduce
References: MPI & Hadoop
Blogs:
http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094
Video & slides of John Langford's presentation:
Learning From Lots Of Data (full)
Speaker: John Langford, Senior Research Scientist, Microsoft Research
Slides: http://lisaweb.iro.umontrea...
Implementation:
vowpal_wabbit
hum...
Questions?
francis@qmining.com
