Is Bigger Data Really Better? 10 Facts from Theory and Practice
1. Bigger Data. Better Insights.™
Is Bigger Data Really Better?
10 Facts from Theory and Practice
Alexander Gray, PhD
CTO, Skytree
Adj. Assoc. Prof., Georgia Tech
2. Is bigger data necessarily better?
If so, when and when not?
To what extent?
Even if it is, can we realize the gains?
3. First, what is the link between bigger data and bigger business value?
4. Let’s start with your high-value prediction problem
Healthcare:
Diagnosis
Prescription
Prognosis
Prevention
Drug screening
Drug efficacy
Cost optimization
Energy:
Remote sensing
Automatic equipment operation
Telco/data center:
Churn
Load prediction/provisioning
Asset-intensive:
Predictive maintenance
Prescriptive maintenance
Fault diagnosis
Dynamic allocation
Govt/law enforcement:
Association/phone call analysis
Threat scoring
Security:
Data loss prevention
Intrusion detection
Point-of-compromise identification
Malware identification
Marketing/sales:
Lead scoring
Recommendation
Personalized pricing
Personalized product/service
Product/service optimization
Optimal next action
Opportunity scoring
Retail:
Demand forecasting
Optimal pricing
Promo planning
Ensemble planning
Workforce allocation
Demand-driven supply chain
Insurance:
Loss model
Bind model
Claims leakage
Claims fraud
Bank/credit card:
Transaction fraud
Credit/loan scoring
Investing/trading
Money laundering
Advertising:
Ad selection
User/site bidding
Spend optimization
5. 1. More $ ← Better prediction
Increasing business value is achieved by increasing predictive power.
Example: fraud detection (a quick cost calculation is sketched after this list)
• False negative: costs $2,000
• False positive: costs $100
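A minimal sketch of that cost argument in Python; the per-error costs come from the slide, while the transaction volume, fraud rate, and model error rates are hypothetical numbers of my own:

```python
# Cost-sensitive view of fraud detection. The $2,000 / $100 costs are
# from the slide; the volumes and error rates below are hypothetical.
COST_FN = 2000  # cost of a missed fraud case (false negative)
COST_FP = 100   # cost of a wrongly flagged transaction (false positive)

def expected_cost(n_cases, fraud_rate, fn_rate, fp_rate):
    """Expected total cost of running a fraud model on n_cases transactions."""
    n_fraud = n_cases * fraud_rate
    n_legit = n_cases - n_fraud
    return n_fraud * fn_rate * COST_FN + n_legit * fp_rate * COST_FP

# Hypothetical: 1M transactions, 0.5% of them fraudulent.
baseline = expected_cost(1_000_000, 0.005, fn_rate=0.20, fp_rate=0.02)
improved = expected_cost(1_000_000, 0.005, fn_rate=0.15, fp_rate=0.02)
print(f"Baseline expected cost: ${baseline:,.0f}")
print(f"Improved expected cost: ${improved:,.0f}")
print(f"Value of 5 points of recall: ${baseline - improved:,.0f}")
```

Under these assumed numbers, a five-point improvement in recall is worth $500,000, which is the sense in which better prediction translates directly into dollars.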
10. 4. More sophisticated models need more data
When you move to more sophisticated models, you need more training data.
E.g., nonparametric methods like k-NN (or GBT, RF, SVM, NN, etc.) converge to zero estimation error for near-arbitrary data, as in nonparametric regression and density estimation, but the rate depends on the amount of data (see below).
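The convergence-rate formula that accompanied this slide did not survive extraction. As a stand-in, a standard textbook bound (my addition, not the deck's exact formula) for k-NN regression with a Lipschitz target function in d dimensions is

\[
\mathbb{E}\left[\big(\hat{f}_n(x) - f(x)\big)^2\right] = O\!\left(n^{-2/(2+d)}\right),
\]

so the estimation error does go to zero as the training-set size n grows, but the rate is governed jointly by n and the dimension d.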
11. 5. More features need more data
When you use more features, you need more training data, e.g. for nonparametric regression and density estimation.
Note that, generally speaking, more features improve accuracy (more on that in a different talk, or ask me).
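Inverting the bound above makes the cost of extra features concrete (again a textbook-style calculation, not taken from the slide). To reach a target error \(\varepsilon\) you need roughly

\[
n(\varepsilon) \;\propto\; \varepsilon^{-(2+d)/2},
\]

so each added feature multiplies the required sample size: with d = 10, for example, halving the error calls for about \(2^{6} = 64\) times more data under this idealized bound.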
12. 6. More data → better prediction is real
Real empirical ML results follow the math: more training data increases predictive power. (A sketch of how to check this on your own data follows.)
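A minimal learning-curve sketch, my illustration rather than the deck's experiment: hold the model fixed, grow the training set, and watch held-out accuracy rise.

```python
# Learning curve on synthetic data: accuracy vs. training-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, n_features=20,
                           n_informative=10, random_state=0)

sizes, _, test_scores = learning_curve(
    GradientBoostingClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=3, scoring="accuracy")

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"n = {n:6d}   held-out accuracy = {s:.3f}")
```

On most real datasets the curve keeps climbing (with diminishing returns) well past the sizes people typically subsample to, which is the empirical content of this fact.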
14. 7. Down-sampling for CV → wrong parameters
The optimal hyperparameters of a model actually depend on the training set size, so cross-validating on a down-sample tunes for the wrong size (sketched below).
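A minimal sketch of that effect, again my illustration: pick k for k-NN by grid search on a 5% subsample and on the full data, and compare. On most datasets the best k chosen on the subsample differs from the best k on the full set.

```python
# Hyperparameter choice depends on training-set size: tune k-NN's k
# on a 5% subsample vs. the full (synthetic) dataset and compare.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20_000, n_features=15, random_state=0)
X_small, _, y_small, _ = train_test_split(X, y, train_size=0.05,
                                          random_state=0)

grid = {"n_neighbors": [1, 3, 9, 27, 81, 243]}
for name, (Xs, ys) in [("5% subsample", (X_small, y_small)),
                       ("full data", (X, y))]:
    search = GridSearchCV(KNeighborsClassifier(), grid, cv=3).fit(Xs, ys)
    print(f"{name}: best k = {search.best_params_['n_neighbors']}")
```

This matches the theory: for k-NN, the optimal k grows with n, so a value tuned at small n is systematically wrong at large n.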
15. 8. Down-sampling may be throwing out gems
In many cases the important data points are too rare to be reduced further:
• High-interest outliers or small clusters
• High-value but rare known objects or events
• Rare but high-value discrete values or classes
• Missing values mean each point is less informative
• Natural systems with massive variation
Another thing: non-uniform sampling, without appropriate corrections, may warp important probabilities (see the sketch after this list).
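A back-of-the-envelope sketch with hypothetical numbers of my own: how many rare cases survive a uniform down-sample, and the chance of losing every one of them.

```python
# Rare events under uniform subsampling (hypothetical numbers).
n_total = 10_000_000   # rows in the full dataset
n_rare = 200           # e.g. confirmed fraud cases
keep = 0.01            # keep 1% of rows uniformly at random

expected_survivors = n_rare * keep
p_lose_all = (1 - keep) ** n_rare   # probability no rare case survives

print(f"Expected rare cases kept: {expected_survivors:.0f}")   # ~2
print(f"P(all rare cases lost):   {p_lose_all:.3f}")           # ~0.134

# Over-sampling the rare class instead avoids losing them, but then
# each point must be re-weighted by its inverse sampling rate or the
# estimated class probabilities will be warped.
```

With these numbers a 1% sample keeps two rare cases on average and loses all of them about 13% of the time, which is what "throwing out gems" means in practice.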
17. 9. ML on big data is now possible*
It is now actually possible to fully train models with very large amounts of data, even with full tuning at each size to find the optimal hyperparameters!
*with Skytree!
20. Let's look again at the basic sources of error
If your error due to having an insufficient model class (e.g. linear models like logistic regression) dominates, adding more data won't help: error due to the number of data points is not your worst problem.
21. Let's look again at the basic sources of error
Likewise, if your error due to incomplete model optimization (e.g. stochastic gradient descent for the parameters, or a too-coarse grid in cross-validation for the hyperparameters) dominates, adding more data won't help: again, error due to the number of data points is not your worst problem.
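These two slides, together with the next, match the standard three-way decomposition of a learned model's excess error (in the spirit of Bottou and Bousquet's "The Tradeoffs of Large-Scale Learning"; the notation here is mine, not the deck's):

\[
\mathcal{E} \;=\; \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}},
\]

where \(\mathcal{E}_{\mathrm{app}}\) comes from a too-limited model class, \(\mathcal{E}_{\mathrm{est}}\) from finite training data, and \(\mathcal{E}_{\mathrm{opt}}\) from incomplete optimization or tuning. More data shrinks only \(\mathcal{E}_{\mathrm{est}}\).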
22. 10. Your other errors may be holding you back
It is necessary to minimize all the sources of error at the same time.
• Training (too-)simple models in order to handle large datasets may forfeit the benefit
• Performing (too-)incomplete training in order to handle large datasets may forfeit the benefit
23. Summary: 10 facts from theory and practice
1. Better prediction → More $
2. Data size is a basic lever for predictive power
3. More data → predictive power
4. More sophisticated models need more data
5. More features need more data
6. More data → better prediction is real
7. Down-sampling for CV → wrong parameters
8. Down-sampling may be throwing out gems
9. ML on big data is now possible*
10. Your other errors may be holding you back
A written version (a 14-page white paper) is available at the Skytree booth.
24. Conclusions: 5 practical upshots
• Training on a subsample of the data gives up measurable predictive power, and thus significant business value.
• When a dataset contains rare objects or values, which is common, subsampling can be disastrous.
• Training too-simple models may block the benefit of the data size.
• Performing too-incomplete training may block the benefit of the data size.
• Performing cross-validation on a subsample is incorrect: it tunes hyperparameters for the wrong training-set size.
When you are ready to max out your data's potential with true state-of-the-art ML:
www.skytree.net
Thanks!