1©MapR Technologies - Confidential
Scalability in Hadoop and
Similar Systems
Big is the next big thing
 Big data and Hadoop are exploding
 Companies are being funded
 Books are being written
 Applications are sprouting up everywhere
Slow Motion Explosion
Hadoop Explosion
Why Now?
 But Moore’s law has applied for a long time
 Why is Hadoop exploding now?
 Why not 10 years ago?
 Why not 20?
8/13/2013
Size Matters, but …
 If it were just availability of data then existing big companies would
adopt big data technology first
They didn’t
Or Maybe Cost
 If it were just a net positive value then finance companies should
adopt first because they have higher opportunity value / byte
They didn’t
Backwards adoption
 Under almost any threshold argument startups would not adopt
big data technology first
They did
Everywhere at Once?
 Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
The Conventional Answer
 More data is being produced more quickly
 Data sizes are bigger than even a very large computer can hold
 Cost to create and store continues to decrease
Analytics Scaling Laws
 Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
 The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
 Cost/performance has changed radically
– IF you can use many commodity boxes
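The scaling argument above can be sketched numerically. This is a minimal illustration, not the model behind the slides: the saturating value curve, both cost curves, and every constant (tau, base, growth, slope) are assumptions chosen only so that value runs from 0 to 1 over scales from 0 to 2,000. It finds the scale that maximizes net value (value minus cost) under each cost law.

```python
import math

def value(scale, tau=200.0):
    """Assumed 80-20 style value curve: big early gains, then diminishing returns."""
    return 1.0 - math.exp(-scale / tau)

def exponential_cost(scale, base=0.01, growth=0.01):
    """Assumed 'old school' cost: grows exponentially with scale."""
    return base * (math.exp(growth * scale) - 1.0)

def linear_cost(scale, slope=0.0002):
    """Assumed 'big data' cost: linear in scale with a low constant."""
    return slope * scale

def best_scale(cost, scales):
    """Return the scale on the grid with the highest net value."""
    return max(scales, key=lambda s: value(s) - cost(s))

scales = range(0, 2001, 10)
old_optimum = best_scale(exponential_cost, scales)
new_optimum = best_scale(linear_cost, scales)

for name, cost, s in [("exponential", exponential_cost, old_optimum),
                      ("linear", linear_cost, new_optimum)]:
    print(f"{name:12s} optimum scale={s:5d} net value={value(s) - cost(s):.3f}")
```

With these assumed curves, the exponential cost forces the optimum to a small scale, while the linear cost moves it far out and raises the net value at the optimum.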
We knew that
We should have known that
We didn’t know that!
You’re kidding, people do that?
[Figure: Value (0 to 1) plotted against Scale (0 to 2,000). Labels on the plot: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
[Figure: two plots of Value (0 to 1) against Scale (0 to 2,000)]
Net value optimum has a sharp peak well before maximum effort
But scaling laws are changing
both slope and shape
[Figure: Value (0 to 1) against Scale (0 to 2,000)]
More than just a little
[Figure: Value (0 to 1) against Scale (0 to 2,000)]
They are changing a LOT!
[Figures: Value (0 to 1) against Scale (0 to 2,000), repeated on three consecutive slides]
Initially, linear cost scaling
actually makes things worse
A tipping point is reached and
things change radically …
Prerequisites for Tipping
 To reach the tipping point,
 Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
 Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
Yeah… but wait
The Standard Sort of Model
 People talk about the law of large numbers as if it were …
 Well, as if it were a law
 It’s not …
 It is a context- and assumption-dependent theorem
What if …
 These assumptions are:
 Changes have a
– stationary,
– independent,
– finite variance distribution
 What happens if these assumptions are wrong?
 And which of them is really wrong?
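To see where each assumption enters, here is a standard textbook form of the weak law of large numbers; it is not taken from the slides. Assume the changes X_1, ..., X_n are independent and identically distributed (the stationary case) with mean \mu and finite variance \sigma^2. Independence and finite variance give the variance of the average, and Chebyshev's inequality gives the convergence:

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\operatorname{Var}\!\left(\bar{X}_n\right) = \frac{\sigma^2}{n}, \qquad
P\!\left(\left|\bar{X}_n - \mu\right| \ge \varepsilon\right)
  \le \frac{\sigma^2}{n\,\varepsilon^2} \xrightarrow[n \to \infty]{} 0
```

If \sigma^2 is infinite, or the X_i are not independent, the middle equality no longer holds and the bound says nothing.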
For Example
[Figure: Stuff plotted against Time]
[Figure: Stuff plotted against Time]
End point has a nice, tractable distribution
What if the Assumptions are Wrong?
 Take the finite variance assumption as a simple example
 Dropping it leads to Lévy stable distributions
 Like the Cauchy distribution
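A small simulation makes the contrast concrete. This is a sketch under assumed settings (200 trials per sample size, a fixed seed, and the interquartile range as the measure of spread); none of these choices come from the talk. It compares how tightly the sample mean clusters for normal draws versus standard Cauchy draws as the sample size grows.

```python
import math
import random
import statistics

def normal_draw(rng):
    """Standard normal draw: finite variance."""
    return rng.gauss(0.0, 1.0)

def cauchy_draw(rng):
    """Standard Cauchy draw via the inverse CDF, tan(pi * (u - 1/2)): infinite variance."""
    return math.tan(math.pi * (rng.random() - 0.5))

def spread_of_means(draw, n, trials, rng):
    """Interquartile range of the sample mean of n draws, over repeated trials."""
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(42)  # fixed seed so the run is repeatable
spreads = {}
for n in (10, 100, 1000):
    spreads[n] = (spread_of_means(normal_draw, n, 200, rng),
                  spread_of_means(cauchy_draw, n, 200, rng))
    print(f"n={n:5d}  normal IQR={spreads[n][0]:.3f}  Cauchy IQR={spreads[n][1]:.3f}")
```

The normal column should shrink roughly like one over the square root of n. The Cauchy column should not shrink, because the mean of n standard Cauchy draws is itself standard Cauchy.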
Is it Really Different?
[Figure: Stuff plotted against Time]
What About Real Life?
But is it Really Infinite Variance?
 Or are there other kinds of phenomena that show this?
 What about the independence assumption?
 What if the supposedly independent components of the system
communicate?
 Like we do. Every day. All the time.
Why the Difference?
[Diagram labels: Law of large numbers; Infinite variance; Interacting agents; The space of all things that change; The space of interacting things]
Apologies and credit to Simon DeDeo, SFI
What Happens with Interactions
 Social phenomena defeat the law of large numbers
 Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian buffet process
 Limiting distributions are heavy tailed, power law
 We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of Github projects
– equity pricing and volumes
– sizes of cities
– popularity of web-sites
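A "rich get richer" process is easy to simulate. The sketch below uses the Chinese restaurant construction of the Pitman-Yor process: each new customer joins an existing table with probability proportional to (table size - discount), or opens a new table with probability proportional to (concentration + discount * number of tables). The function name and the parameter values (discount 0.5, concentration 1.0, 5,000 customers) are illustrative assumptions, not figures from the talk.

```python
import random
import statistics

def pitman_yor_tables(n_customers, discount=0.5, concentration=1.0, seed=0):
    """Seat customers one at a time and return the list of table sizes."""
    rng = random.Random(seed)
    sizes = []
    for n in range(n_customers):
        # Weight for opening a new table; the total weight is n + concentration.
        new_weight = concentration + discount * len(sizes)
        r = rng.random() * (n + concentration)
        if r < new_weight:
            sizes.append(1)
            continue
        r -= new_weight
        # Otherwise pick an existing table, favoring the large ones.
        for k, size in enumerate(sizes):
            r -= size - discount
            if r < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point rounding
    return sizes

sizes = pitman_yor_tables(5000)
print("tables:", len(sizes))
print("largest five:", sorted(sizes, reverse=True)[:5])
print("median size:", statistics.median(sizes))
```

A few tables should end up holding a large share of the customers while the typical table stays tiny, which is the heavy-tailed pattern the list above describes.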
What are the
Implications?
[Figure: Value (0 to 1) against Scale (0 to 2,000)]
In a Nutshell
 Scalability is much more important than we thought
 Mashups are more important than we thought
 Network effects are more important than we thought
 Exploration is more important than we thought
 Hadoop-style linear scaling must be mixed with ad hoc analysis
Thank You
whoami?
 Ted Dunning
– @ted_dunning
– tdunning@maprtech.com (MapR distribution for Hadoop)
– tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
– ted.dunning@gmail.com (me)
 More info:
http://www.mapr.com/company/events/hadoop-in-finance-2012

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 

Chicago Hadoop in Finance - Ted Dunning

  • 1. 1©MapR Technologies - Confidential Scalability in Hadoop and Similar Systems
  • 2. Big is the next big thing: Big data and Hadoop are exploding; companies are being funded; books are being written; applications are sprouting up everywhere
  • 3. Slow Motion Explosion
  • 4. Hadoop Explosion
  • 5. Why Now? But Moore’s law has applied for a long time. Why is Hadoop exploding now? Why not 10 years ago? Why not 20? (8/13/2013)
  • 6. Size Matters, but … If it were just availability of data then existing big companies would adopt big data technology first
  • 7. Size Matters, but … If it were just availability of data then existing big companies would adopt big data technology first. They didn’t
  • 8. Or Maybe Cost: If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
  • 9. Or Maybe Cost: If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte. They didn’t
  • 10. Backwards adoption: Under almost any threshold argument startups would not adopt big data technology first
  • 11. Backwards adoption: Under almost any threshold argument startups would not adopt big data technology first. They did
  • 12. Everywhere at Once? Something very strange is happening: big data is being applied at many different scales; at many value scales; by large companies and small
  • 13. Everywhere at Once? Something very strange is happening: big data is being applied at many different scales; at many value scales; by large companies and small. Why?
  • 14. The Conventional Answer: More data is being produced more quickly; data sizes are bigger than even a very large computer can hold; cost to create and store continues to decrease
  • 15. Analytics Scaling Laws: Analytics scaling is all about the 80-20 rule (big gains for little initial effort; rapidly diminishing returns). The key to net value is how costs scale (old school: exponential scaling; big data: linear scaling, low constant). Cost/performance has changed radically, IF you can use many commodity boxes
  • 16. We knew that; We should have known that; We didn’t know that!; You’re kidding, people do that?
  • 17. [Chart: axes Scale and Value; labeled points: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
  • 18. [Two charts: axes Scale and Value] Net value optimum has a sharp peak well before maximum effort
  • 19. But scaling laws are changing both slope and shape
  • 20. [Chart: axes Scale and Value] More than just a little
  • 21. [Chart: axes Scale and Value] They are changing a LOT!
  • 22. [No slide text]
  • 23. [No slide text]
  • 24. [Chart: axes Scale and Value]
  • 25. [Chart: axes Scale and Value]
  • 26. [Chart: axes Scale and Value] Initially, linear cost scaling actually makes things worse. A tipping point is reached and things change radically …
  • 27. Pre-requisites for Tipping: To reach the tipping point, algorithms must scale out horizontally (on commodity hardware that can and will fail) and data practice must change (denormalized is the new black; flexible data dictionaries are the rule; structured data becomes rare)
  • 28. Yeah… but wait
  • 29. The Standard Sort of Model: People talk about the law of large numbers as if it were … Well, as if it were a law. It’s not … It is a context- and assumption-dependent theorem
  • 30. What if … These assumptions are: changes have a stationary, independent, finite variance distribution. What happens if these assumptions are wrong? And which of them is really wrong?
  • 31. For Example [Chart: axes Time and Stuff]
  • 32. [Chart: axes Time and Stuff] End point has nice tractable distribution
  • 33. What if the Assumptions are Wrong? Take the finite variance assumption as a simple example. Dropping it leads to Lévy stable distributions, like the Cauchy distribution
  • 34. Is it Really Different?
  • 35. [Chart: axes Time and Stuff]
  • 36. What About Real Life?
  • 37. [No slide text]
  • 38. But is it Really Infinite Variance? Or are there other kinds of phenomena that show this? What about the independence assumption? What if the supposedly independent components of the system communicate? Like we do. Every day. All the time.
  • 39. Why the Difference? [Diagram labels: Law of large numbers; Infinite variance; Interacting agents; The space of all things that change; The space of interacting things] Apologies and credit to Simon DeDeo, SFI
  • 40. What Happens with Interactions: Social phenomena defeat the law of large numbers. Distributions are well modeled by “rich get richer” processes (Pitman-Yor process, Indian buffet process). Limiting distributions are heavy tailed, power law. We see these distributions everywhere: price of cotton in the 19th century; word frequencies; popularity of GitHub projects; equity pricing and volumes; sizes of cities; popularity of web sites
  • 41. What are the Implications?
  • 42. [Chart: axes Scale and Value]
  • 43. In a Nutshell: Scalability is much more important than we thought; mashups are more important than we thought; network effects are more important than we thought; exploration is more important than we thought; Hadoop style linear scaling must be mixed with ad hoc analysis
  • 44. Thank You
  • 45. whoami? Ted Dunning: @ted_dunning; tdunning@maprtech.com (MapR distribution for Hadoop); tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill); ted.dunning@gmail.com (me). More info: http://www.mapr.com/company/events/hadoop-in-finance-2012
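
The point of slides 29 to 33, that the law of large numbers depends on a finite variance assumption, can be checked with a small simulation. The sketch below is not from the talk; the function names, sample sizes, and seed are illustrative assumptions. It compares running sample means of standard normal draws (finite variance) with standard Cauchy draws (a Lévy stable distribution with no finite mean or variance).

```python
import math
import random
import statistics


def cauchy_sample(rng):
    """Draw one standard Cauchy variate using the inverse CDF: tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def running_means(samples, checkpoints):
    """Return {n: mean of the first n samples} for each n in checkpoints."""
    wanted = set(checkpoints)
    means = {}
    total = 0.0
    for i, x in enumerate(samples, start=1):
        total += x
        if i in wanted:
            means[i] = total / i
    return means


def simulate(n=100_000, seed=42):
    """Generate n normal and n Cauchy draws (n and seed are arbitrary choices)."""
    rng = random.Random(seed)
    normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    cauchy = [cauchy_sample(rng) for _ in range(n)]
    return normal, cauchy


if __name__ == "__main__":
    normal, cauchy = simulate()
    checkpoints = [100, 1_000, 10_000, 100_000]
    normal_means = running_means(normal, checkpoints)
    cauchy_means = running_means(cauchy, checkpoints)
    for n in checkpoints:
        print(f"n={n:>7}  normal mean={normal_means[n]: .4f}  "
              f"cauchy mean={cauchy_means[n]: .4f}")
    # The median is still a stable summary for the Cauchy case.
    print("cauchy median:", statistics.median(cauchy))
```

With normal draws the running mean settles toward zero as n grows. With Cauchy draws a single extreme value can move the mean at any point, so it never settles, while the median remains well behaved.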

Editor's Notes

  1. Why is big data so fashionable with big and small companies from different industries? What has suddenly changed?
  2. Google searches are up 10x over just four years ago.
  3. Hadoop use is exploding. This example shows job trends for Hadoop, which is further evidence that you should pay attention during this talk.
  4. But we have seen constant growth for a long time. And simple growth would only explain some kinds of companies starting with big data (probably big ones) and then slow adoption. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now and why is it exploding at all?
  5. The different kinds of scaling laws have different shapes, and I think that shape is the key.
  6. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  7. In classical analytics, the cost of doing analytics increases sharply.
  8. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  9. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  10. This next sequence shows how the net value changes with linear cost models of different slopes.
  11. Notice how the best net value has jumped up significantly.
  12. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
  13. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
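
The trade-off described in the notes above can be sketched numerically. This is a minimal illustration, not the model behind the slides: the saturating value curve, the exponential "classical" cost, and every constant are assumptions chosen only to reproduce the qualitative shapes, and the 0 to 2000 scale range simply mirrors the chart axis.

```python
import math


def value(scale, k=200.0):
    """Assumed value curve: quick initial gains, then diminishing returns."""
    return 1.0 - math.exp(-scale / k)


def exponential_cost(scale, c0=0.01, s0=100.0):
    """Assumed classical cost: grows exponentially with scale."""
    return c0 * (math.exp(scale / s0) - 1.0)


def linear_cost(scale, slope):
    """Assumed scale-out cost: grows linearly with scale."""
    return slope * scale


def best_scale(cost_fn, scales=range(0, 2001)):
    """Return (scale, net value) at the scale that maximizes value minus cost."""
    best = max(scales, key=lambda s: value(s) - cost_fn(s))
    return best, value(best) - cost_fn(best)


if __name__ == "__main__":
    scale, net = best_scale(exponential_cost)
    print(f"exponential cost: optimum scale={scale}, net value={net:.3f}")
    for slope in (0.002, 0.001, 0.0002, 0.00005):
        scale, net = best_scale(lambda s, a=slope: linear_cost(s, a))
        print(f"linear cost, slope={slope}: optimum scale={scale}, net value={net:.3f}")
```

Under these assumed constants, a steep linear cost gives a lower peak net value than the exponential cost, while flatter slopes move the optimum to much larger scale and raise the peak.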