Bigger Data. Better Insights.™
Is Bigger Data Really Better?
10 Facts from Theory and Practice
Alexander Gray, PhD
CTO, Skytree
Adj. Assoc. Prof., Georgia Tech
2
Is bigger data necessarily better?
If so, when and when not?
To what extent?
Even if it is, can we realize the gains?
3
First, what is the link between
bigger data and
bigger business value?
Let’s start with your high-value prediction problem
Healthcare:
Diagnosis
Prescription
Prognosis
Prevention
Drug screening
Drug efficacy
Cost optimization
Energy:
Remote sensing
Automatic equipment operation
Telco/data center:
Churn
Load prediction/provisioning
Asset-intensive:
Predictive maintenance
Prescriptive maintenance
Fault diagnosis
Dynamic allocation
Govt/law enforcement:
Association/phone call analysis
Threat scoring
Security:
Data loss prevention
Intrusion detection
Point-of-compromise identification
Malware identification
Marketing/sales:
Lead scoring
Recommendation
Personalized pricing
Personalized product/service
Product/service optimization
Optimal next action
Opportunity scoring
Retail:
Demand forecasting
Optimal pricing
Promo planning
Ensemble planning
Workforce allocation
Demand-driven supply chain
Insurance:
Loss model
Bind model
Claims leakage
Claims fraud
Bank/credit card:
Transaction fraud
Credit/loan scoring
Investing/trading
Money laundering
Advertising:
Ad selection
User/site bidding
Spend optimization
5
1. More $ ← Better prediction
Increasing business value is
achieved by
increasing predictive power.
Example: fraud detection
• False negative: Costs $2000
• False positive: Costs $100
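The fraud example can be turned into arithmetic. The sketch below is illustrative only: the two cost figures come from the slide, while the function names and the error counts for the two models are hypothetical. It shows how a model's mistakes translate into dollars, and the fraud probability above which flagging a transaction is the cheaper decision.

```python
# Costs taken from the slide; everything else below is hypothetical.
COST_FALSE_NEGATIVE = 2000.0  # a missed fraud
COST_FALSE_POSITIVE = 100.0   # a legitimate transaction flagged as fraud


def total_cost(false_negatives: int, false_positives: int) -> float:
    """Dollar cost of a model's mistakes on an evaluation set."""
    return (false_negatives * COST_FALSE_NEGATIVE
            + false_positives * COST_FALSE_POSITIVE)


def flag_threshold() -> float:
    """Fraud probability above which flagging has the lower expected cost.

    Flag when p * C_FN > (1 - p) * C_FP, i.e. p > C_FP / (C_FP + C_FN).
    """
    return COST_FALSE_POSITIVE / (COST_FALSE_POSITIVE + COST_FALSE_NEGATIVE)


# Two hypothetical models scored on the same evaluation set.
baseline = total_cost(false_negatives=50, false_positives=400)
improved = total_cost(false_negatives=40, false_positives=450)
savings = baseline - improved

print(f"baseline: ${baseline:,.0f}, improved: ${improved:,.0f}, savings: ${savings:,.0f}")
print(f"flag when P(fraud) > {flag_threshold():.3f}")
```

Because a missed fraud costs 20 times as much as a false alarm, the improved model saves money even though it raises more false alarms.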
6
The fundamental sources of error
7
2. Data size is a basic lever for predictive power
The training data size is one of
the main determinants of your
model’s predictive power.
8
3. More data → more predictive power
When you use
more training data,
you increase predictive power.
9
For realistic high-value models,
how do things work?
10
4. More sophisticated models → need more data
When you move to more
sophisticated models, you need
more training data.
e.g. for nonparametric regression,
density estimation:
e.g. nonparametric methods like k-NN
(or GBT, RF, SVM, NN, etc.)
converge to zero estimation error
for near-arbitrary data:
11
5. More features → need more data
When you use more features,
you need more training data.
e.g. for nonparametric regression,
density estimation:
Note that more features improve
accuracy, speaking generally
(more on that in a different talk, or
ask me)
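One standard result of this kind is the classical minimax rate for nonparametric regression, in which squared error shrinks like n^(-2s/(2s+d)) for a target with smoothness s in d features. Using it here is an assumption; it is not necessarily the formula shown on the slide. The sketch below inverts that rate to show how the data requirement grows with the number of features. The function name and the default smoothness are hypothetical, and constants are ignored, so only the scaling is meaningful.

```python
def samples_needed(target_error: float, num_features: int,
                   smoothness: float = 2.0) -> float:
    """Approximate n for which n ** (-2s / (2s + d)) equals target_error.

    Assumes the classical nonparametric rate and ignores constants,
    so only the growth with num_features is meaningful.
    """
    s, d = smoothness, num_features
    return target_error ** (-(2 * s + d) / (2 * s))


# Same target error, increasing number of features.
for d in (1, 5, 10, 20):
    print(f"d={d:2d}: n ~ {samples_needed(0.1, d):,.0f}")
```

Under this assumption, each added feature multiplies the data needed for the same error by a constant factor, which is why feature-rich nonparametric models are data-hungry.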
12
6. More data → better prediction is real
Real empirical ML results
follow the math:
More training data increases
predictive power.
13
How else can
down-sampling the data be harmful,
creating poor results?
14
7. Down-sampling for CV → wrong parameters
The optimal hyperparameters of
a model are actually dependent
on the training set size.
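A simple case where this dependence is explicit is the normal-reference rule of thumb for the bandwidth of a Gaussian kernel density estimate, h = 1.06 σ n^(-1/5). It is used here only as an illustration and is an assumption, not an example from the talk. The dataset sizes and the function name are hypothetical.

```python
def rule_of_thumb_bandwidth(n: int, sigma: float = 1.0) -> float:
    """Normal-reference bandwidth for a Gaussian KDE: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * sigma * n ** (-1 / 5)


full_n = 1_000_000      # hypothetical full training set
subsample_n = 10_000    # hypothetical 1% subsample used for tuning

h_full = rule_of_thumb_bandwidth(full_n)
h_sub = rule_of_thumb_bandwidth(subsample_n)
ratio = h_sub / h_full  # equals 100 ** 0.2, roughly 2.5

print(f"bandwidth for full data:   {h_full:.4f}")
print(f"bandwidth for 1% subsample: {h_sub:.4f}")
print(f"subsample bandwidth is {ratio:.2f}x too wide for the full data")
```

A bandwidth chosen on the subsample would oversmooth the full dataset. The same effect applies to other size-dependent hyperparameters, such as k in k-NN.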
15
8. Down-sampling may be throwing out gems
In many cases the important
data points are too rare to be
further reduced.
• High-interest outliers or small
clusters
• High-value but rare known objects
or events
• Rare but high-value discrete values
or classes
• Missing values mean each point is
less informative
• Natural systems with massive
variation
Another thing: non-uniform sampling,
without appropriate corrections, may warp
important probabilities
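The warping from non-uniform sampling can be made concrete for one common case. The sketch assumes that only the negative (majority) class is down-sampled uniformly at random, that every positive is kept, and that the model is well calibrated on the data it sees. The function names and numbers are hypothetical.

```python
def sampled_probability(p_true: float, keep_rate: float) -> float:
    """Probability a calibrated model reports after negatives are down-sampled.

    keep_rate is the fraction of negative examples kept; positives are all kept.
    """
    return p_true / (p_true + keep_rate * (1.0 - p_true))


def corrected_probability(p_sampled: float, keep_rate: float) -> float:
    """Map a probability learned on down-sampled data back to the original balance."""
    return keep_rate * p_sampled / (keep_rate * p_sampled + (1.0 - p_sampled))


p_true = 0.001     # hypothetical true event probability
keep_rate = 0.01   # keep 1% of the negatives

p_warped = sampled_probability(p_true, keep_rate)
p_fixed = corrected_probability(p_warped, keep_rate)

print(f"true: {p_true:.4f}, after down-sampling: {p_warped:.4f}, corrected: {p_fixed:.4f}")
```

Without the correction, the model overstates the event probability by roughly a factor of 90 in this example, which would distort any decision that compares the probability against a cost-based threshold.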
16
What else should we know,
toward best practices
for big data?
17
9. ML on big data is now possible*
It is now actually possible to
fully train models with very
large amounts of data.
*with Skytree!
18
9. ML on big data is now possible*
It is now actually possible to
fully train models with very
large amounts of data.
Even with full tuning at each size,
to find the optimal parameters!
*with Skytree!
19
Let’s look again at the basic sources of error
20
Let’s look again at the basic sources of error
If the error due to an insufficient model
class (e.g. linear models like logistic regression)
dominates, adding more data won’t help
The amount of training data is not your worst
problem
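A minimal, noise-free sketch of this point, under assumptions that are not from the talk: the target is y = x² on [-1, 1], the too-simple model is a least-squares line, and the flexible model is 1-nearest-neighbor. All names are hypothetical. The line's error stays flat as the training set grows, while the nearest-neighbor error keeps shrinking.

```python
from bisect import bisect_left


def grid(n: int) -> list[float]:
    """n evenly spaced points covering [-1, 1]."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]


def target(x: float) -> float:
    """Hypothetical nonlinear ground truth, with no noise."""
    return x * x


def linear_fit_mse(train_x: list[float], test_x: list[float]) -> float:
    """Fit a least-squares line on train_x and return its test MSE."""
    n = len(train_x)
    train_y = [target(x) for x in train_x]
    mean_x = sum(train_x) / n
    mean_y = sum(train_y) / n
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    errors = [(target(x) - (intercept + slope * x)) ** 2 for x in test_x]
    return sum(errors) / len(errors)


def nearest_neighbor_mse(train_x: list[float], test_x: list[float]) -> float:
    """Test MSE of 1-nearest-neighbor prediction; train_x must be sorted."""
    total = 0.0
    for x in test_x:
        i = bisect_left(train_x, x)
        # The nearest training point is one of the two neighbors of position i.
        candidates = train_x[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda t: abs(t - x))
        total += (target(x) - target(nearest)) ** 2
    return total / len(test_x)


test_x = grid(2001)
sizes = (10, 100, 1000)
linear_errors = [linear_fit_mse(grid(n), test_x) for n in sizes]
nn_errors = [nearest_neighbor_mse(grid(n), test_x) for n in sizes]

for n, lin, nn in zip(sizes, linear_errors, nn_errors):
    print(f"n={n:5d}  linear MSE={lin:.4f}  1-NN MSE={nn:.2e}")
```

The line cannot represent the curve, so its error is dominated by the model class and extra data is wasted on it. The nearest-neighbor model is limited only by the amount of data.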
21
Let’s look again at the basic sources of error
If the error due to incomplete model
optimization (e.g. stochastic gradient descent for
parameters, or a too-small grid in cross-validation
for hyperparameters) dominates, adding more
data won’t help
The amount of training data is not your worst
problem
22
10. Your other errors may be holding you back
It is necessary to minimize all
the sources of error at the
same time.
• Training (too-) simple models in
order to handle large datasets may
yield no benefit
• Performing (too-) incomplete
training in order to handle large
datasets may yield no benefit
23
Summary: 10 facts from theory and practice
1. Better prediction → More $
2. Data size is a basic lever for predictive
power
3. More data → predictive power
4. More sophisticated models → need
more data
5. More features → need more data
6. More data → better prediction is real
7. Down-sampling for CV → wrong
parameters
8. Down-sampling may be throwing out
gems
9. ML on big data is now possible*
10. Your other errors may be holding you
back
A written form (14-page white paper)
is available at the Skytree booth.
24
Conclusions: 5 practical upshots
• Training on a subsample of the data gives up measurable predictive power,
and thus significant business value.
• When a dataset contains rare objects or values, which is common,
subsampling can be disastrous.
• Training too-simple models may block the benefit from the data size.
• Performing too-incomplete training may block the benefit from the data size.
• Performing cross-validation on a subsample selects hyperparameters for the wrong training set size.
When you are ready to max out your data’s potential
with true state-of-the-art ML:
www.skytree.net
Thanks!
Alexander Gray, PhD
CTO, Skytree

Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Is Bigger Data Really Better? 10 Facts from Theory and Practice

  • 1. Bigger Data. Better Insights.™ Is Bigger Data Really Better? 10 Facts from Theory and Practice Alexander Gray, PhD CTO, Skytree Adj. Assoc. Prof., Georgia Tech
  • 2. Is bigger data necessarily better? If so, when and when not? To what extent? Even if it is, can we realize the gains?
  • 3. First, what is the link between bigger data and bigger business value?
  • 4. Let’s start with your high-value prediction problem. Healthcare: Diagnosis, Prescription, Prognosis, Prevention, Drug screening, Drug efficacy, Cost optimization. Energy: Remote sensing, Automatic equipment operation. Telco/data center: Churn, Load prediction/provisioning. Asset-intensive: Predictive maintenance, Prescriptive maintenance, Fault diagnosis, Dynamic allocation. Govt/law enforcement: Association/phone call analysis, Threat scoring. Security: Data loss prevention, Intrusion detection, Point-of-compromise identification, Malware identification. Marketing/sales: Lead scoring, Recommendation, Personalized pricing, Personalized product/service, Product/service optimization, Optimal next action, Opportunity scoring. Retail: Demand forecasting, Optimal pricing, Promo planning, Ensemble planning, Workforce allocation, Demand-driven supply chain. Insurance: Loss model, Bind model, Claims leakage, Claims fraud. Bank/credit card: Transaction fraud, Credit/loan scoring, Investing/trading, Money laundering. Advertising: Ad selection, User/site bidding, Spend optimization.
  • 5. 1. More $ ← Better prediction. Increasing business value is achieved by increasing predictive power. Example: fraud detection. • False negative: costs $2000 • False positive: costs $100
  • 7. 2. Data size is a basic lever for predictive power. The training data size is one of the main determinants of your model’s predictive power.
  • 8. 3. More data → more predictive power. When you use more training data, you increase predictive power.
  • 9. For realistic high-value models, how do things work?
  • 10. 4. More sophisticated models → need more data. When you move to more sophisticated models, you need more training data. e.g. for nonparametric regression, density estimation: e.g. nonparametric methods like k-NN (or GBT, RF, SVM, NN, etc.) converge to zero estimation error for near-arbitrary data:
  • 11. 5. More features → need more data. When you use more features, you need more training data. e.g. for nonparametric regression, density estimation: Note that more features improve accuracy, speaking generally (more on that in a different talk, or ask me).
  • 12. 6. More data → better prediction is real. Real empirical ML results follow the math: more training data increases predictive power.
  • 13. How else can down-sampling the data be harmful, creating poor results?
  • 14. 7. Down-sampling for CV → wrong parameters. The optimal hyperparameters of a model are actually dependent on the training set size.
  • 15. 8. Down-sampling may be throwing out gems. In many cases the important data points are too rare to be further reduced. • High-interest outliers or small clusters • High-value but rare known objects or events • Rare but high-value discrete values or classes • Missing values mean each point is less informative • Natural systems with massive variation. Another thing: non-uniform sampling, without appropriate corrections, may warp important probabilities.
  • 16. What else should we know, toward best practices for big data?
  • 17. 9. ML on big data is now possible* It is now actually possible to fully train models with very large amounts of data. *with Skytree!
  • 18. 9. ML on big data is now possible* It is now actually possible to fully train models with very large amounts of data. Even with full tuning at each size, to find the optimal parameters! *with Skytree!
  • 19. Let’s look again at the basic sources of error
  • 20. Let’s look again at the basic sources of error. If your error due to having an insufficient model class (e.g. linear models like logistic regression) dominates, adding more data won’t help. Error due to the amount of data is not your worst problem.
  • 21. Let’s look again at the basic sources of error. If your error due to having incomplete model optimization (e.g. stochastic gradient descent for parameters, or a too-small grid in cross-validation for hyperparameters) dominates, adding more data won’t help. Error due to the amount of data is not your worst problem.
  • 22. 10. Your other errors may be holding you back. It is necessary to minimize all the sources of error at the same time. • Training (too-) simple models in order to handle large datasets may not yield a benefit • Performing (too-) incomplete training in order to handle large datasets may not yield a benefit
  • 23. Summary: 10 facts from theory and practice. 1. Better prediction → More $ 2. Data size is a basic lever for predictive power 3. More data → predictive power 4. More sophisticated models → need more data 5. More features → need more data 6. More data → better prediction is real 7. Down-sampling for CV → wrong parameters 8. Down-sampling may be throwing out gems 9. ML on big data is now possible* 10. Your other errors may be holding you back. A written form (14-page white paper) is available at the Skytree booth.
  • 24. Conclusions: 5 practical upshots • Training on a subsample of the data is giving up measurable predictive power, and thus significant business value. • When a dataset contains rare objects or values, which is common, subsampling can be disastrous. • Training too-simple models may block the benefit from the data size. • Performing too-incomplete training may block the benefit from the data size. • Performing cross-validation on a subsample is incorrect. When you are ready to max out your data’s potential with true state-of-the-art ML: www.skytree.net Thanks!
  • 25. Bigger Data. Better Insights.™ Thanks! Alexander Gray, PhD CTO, Skytree
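
Two points in the transcript lend themselves to a small worked example: the fraud-detection costs on slide 5 ($2000 per false negative, $100 per false positive) and the slide 15 remark that non-uniform sampling needs "appropriate corrections". The sketch below is illustrative only. The error counts, the function names, and the `keep_rate` parameter are hypothetical and do not come from the talk. The correction shown is the standard odds rescaling for a model trained after down-sampling only the negative class, which may or may not be the correction the speaker had in mind.

```python
# Per-error costs are taken from slide 5; every count below is a made-up illustration.
COST_FALSE_NEGATIVE = 2000  # dollars per missed fraud case
COST_FALSE_POSITIVE = 100   # dollars per false alarm


def error_cost(false_negatives: int, false_positives: int) -> int:
    """Total dollar cost of a model's mistakes under the slide-5 cost figures."""
    return (false_negatives * COST_FALSE_NEGATIVE
            + false_positives * COST_FALSE_POSITIVE)


def correct_for_negative_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Rescale a predicted probability back to the original class balance.

    Assumes the model was trained on data where every positive was kept and
    each negative was kept with probability `keep_rate` (hypothetical setup).
    Down-sampling negatives inflates the odds by 1 / keep_rate, so the odds
    are multiplied by keep_rate to undo it.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


if __name__ == "__main__":
    # Hypothetical counts for a baseline model and a slightly better one.
    baseline = error_cost(false_negatives=50, false_positives=400)
    improved = error_cost(false_negatives=40, false_positives=380)
    print(f"baseline cost: ${baseline}, improved cost: ${improved}")
    print(f"value of the improvement: ${baseline - improved}")

    # A score of 0.5 from a model trained on 10% of the negatives
    # corresponds to a much lower probability on the full population.
    print(correct_for_negative_downsampling(0.5, keep_rate=0.1))
```

With these made-up counts, catching ten more fraud cases and raising twenty fewer false alarms is worth $22,000, which is the sense in which better prediction translates into business value. The second function shows why uncorrected scores from a down-sampled training set overstate risk.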