Is Bigger Data Really Better? 10 Facts from Theory and Practice
1. Bigger Data. Better Insights.™
Is Bigger Data Really Better?
10 Facts from Theory and Practice
Alexander Gray, PhD
CTO, Skytree
Adj. Assoc. Prof., Georgia Tech
2. Is bigger data necessarily better?
If so, when and when not?
To what extent?
Even if it is, can we realize the gains?
3. First, what is the link between bigger data and bigger business value?
4. Let’s start with your high-value prediction problem
Healthcare:
Diagnosis
Prescription
Prognosis
Prevention
Drug screening
Drug efficacy
Cost optimization
Energy:
Remote sensing
Automatic equipment operation
Telco/data center:
Churn
Load prediction/provisioning
Asset-intensive:
Predictive maintenance
Prescriptive maintenance
Fault diagnosis
Dynamic allocation
Govt/law enforcement:
Association/phone call analysis
Threat scoring
Security:
Data loss prevention
Intrusion detection
Point-of-compromise identification
Malware identification
Marketing/sales:
Lead scoring
Recommendation
Personalized pricing
Personalized product/service
Product/service optimization
Optimal next action
Opportunity scoring
Retail:
Demand forecasting
Optimal pricing
Promo planning
Ensemble planning
Workforce allocation
Demand-driven supply chain
Insurance:
Loss model
Bind model
Claims leakage
Claims fraud
Bank/credit card:
Transaction fraud
Credit/loan scoring
Investing/trading
Money laundering
Advertising:
Ad selection
User/site bidding
Spend optimization
5. 1. More $ ← Better prediction
Increasing business value is achieved by increasing predictive power.
Example: fraud detection (a quick cost calculation is sketched after this list)
• False negative: costs $2,000
• False positive: costs $100
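A minimal sketch of that cost argument in Python; the per-error costs come from the slide, while the transaction volume, fraud rate, and model error rates are hypothetical numbers of my own:

```python
# Cost-sensitive view of fraud detection. The $2,000 / $100 costs are
# from the slide; the volumes and error rates below are hypothetical.
COST_FN = 2000  # cost of a missed fraud case (false negative)
COST_FP = 100   # cost of a wrongly flagged transaction (false positive)

def expected_cost(n_cases, fraud_rate, fn_rate, fp_rate):
    """Expected total cost of running a fraud model on n_cases transactions."""
    n_fraud = n_cases * fraud_rate
    n_legit = n_cases - n_fraud
    return n_fraud * fn_rate * COST_FN + n_legit * fp_rate * COST_FP

# Hypothetical: 1M transactions, 0.5% of them fraudulent.
baseline = expected_cost(1_000_000, 0.005, fn_rate=0.20, fp_rate=0.02)
improved = expected_cost(1_000_000, 0.005, fn_rate=0.15, fp_rate=0.02)
print(f"Baseline expected cost: ${baseline:,.0f}")
print(f"Improved expected cost: ${improved:,.0f}")
print(f"Value of 5 points of recall: ${baseline - improved:,.0f}")
```

Under these assumed numbers, a five-point improvement in recall is worth $500,000, which is the sense in which better prediction translates directly into dollars.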
10. 4. More sophisticated models need more data
When you move to more sophisticated models, you need more training data.
E.g., nonparametric methods like k-NN (or GBT, RF, SVM, NN, etc.) converge to zero estimation error for near-arbitrary data, as in nonparametric regression and density estimation, but the rate depends on the amount of data (see below).
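The convergence-rate formula that accompanied this slide did not survive extraction. As a stand-in, a standard textbook bound (my addition, not the deck's exact formula) for k-NN regression with a Lipschitz target function in d dimensions is

\[
\mathbb{E}\left[\big(\hat{f}_n(x) - f(x)\big)^2\right] = O\!\left(n^{-2/(2+d)}\right),
\]

so the estimation error does go to zero as the training-set size n grows, but the rate is governed jointly by n and the dimension d.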
11. 5. More features need more data
When you use more features, you need more training data, e.g. for nonparametric regression and density estimation.
Note that, generally speaking, more features improve accuracy (more on that in a different talk, or ask me).
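Inverting the bound above makes the cost of extra features concrete (again a textbook-style calculation, not taken from the slide). To reach a target error \(\varepsilon\) you need roughly

\[
n(\varepsilon) \;\propto\; \varepsilon^{-(2+d)/2},
\]

so each added feature multiplies the required sample size: with d = 10, for example, halving the error calls for about \(2^{6} = 64\) times more data under this idealized bound.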
12. 6. More data → better prediction is real
Real empirical ML results follow the math: more training data increases predictive power. (A sketch of how to check this on your own data follows.)
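A minimal learning-curve sketch, my illustration rather than the deck's experiment: hold the model fixed, grow the training set, and watch held-out accuracy rise.

```python
# Learning curve on synthetic data: accuracy vs. training-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, n_features=20,
                           n_informative=10, random_state=0)

sizes, _, test_scores = learning_curve(
    GradientBoostingClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=3, scoring="accuracy")

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"n = {n:6d}   held-out accuracy = {s:.3f}")
```

On most real datasets the curve keeps climbing (with diminishing returns) well past the sizes people typically subsample to, which is the empirical content of this fact.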
14. 7. Down-sampling for CV → wrong parameters
The optimal hyperparameters of a model actually depend on the training set size, so cross-validating on a down-sample tunes for the wrong size (sketched below).
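A minimal sketch of that effect, again my illustration: pick k for k-NN by grid search on a 5% subsample and on the full data, and compare. On most datasets the best k chosen on the subsample differs from the best k on the full set.

```python
# Hyperparameter choice depends on training-set size: tune k-NN's k
# on a 5% subsample vs. the full (synthetic) dataset and compare.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20_000, n_features=15, random_state=0)
X_small, _, y_small, _ = train_test_split(X, y, train_size=0.05,
                                          random_state=0)

grid = {"n_neighbors": [1, 3, 9, 27, 81, 243]}
for name, (Xs, ys) in [("5% subsample", (X_small, y_small)),
                       ("full data", (X, y))]:
    search = GridSearchCV(KNeighborsClassifier(), grid, cv=3).fit(Xs, ys)
    print(f"{name}: best k = {search.best_params_['n_neighbors']}")
```

This matches the theory: for k-NN, the optimal k grows with n, so a value tuned at small n is systematically wrong at large n.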
15. 8. Down-sampling may be throwing out gems
In many cases the important data points are too rare to be reduced further:
• High-interest outliers or small clusters
• High-value but rare known objects or events
• Rare but high-value discrete values or classes
• Missing values mean each point is less informative
• Natural systems with massive variation
Another thing: non-uniform sampling, without appropriate corrections, may warp important probabilities (see the sketch after this list).
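A back-of-the-envelope sketch with hypothetical numbers of my own: how many rare cases survive a uniform down-sample, and the chance of losing every one of them.

```python
# Rare events under uniform subsampling (hypothetical numbers).
n_total = 10_000_000   # rows in the full dataset
n_rare = 200           # e.g. confirmed fraud cases
keep = 0.01            # keep 1% of rows uniformly at random

expected_survivors = n_rare * keep
p_lose_all = (1 - keep) ** n_rare   # probability no rare case survives

print(f"Expected rare cases kept: {expected_survivors:.0f}")   # ~2
print(f"P(all rare cases lost):   {p_lose_all:.3f}")           # ~0.134

# Over-sampling the rare class instead avoids losing them, but then
# each point must be re-weighted by its inverse sampling rate or the
# estimated class probabilities will be warped.
```

With these numbers a 1% sample keeps two rare cases on average and loses all of them about 13% of the time, which is what "throwing out gems" means in practice.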
17. 9. ML on big data is now possible*
It is now actually possible to fully train models with very large amounts of data, even with full tuning at each size to find the optimal hyperparameters!
*with Skytree!
20. Let's look again at the basic sources of error
If your error due to having an insufficient model class (e.g. linear models like logistic regression) dominates, adding more data won't help: error due to the number of data points is not your worst problem.
21. Let's look again at the basic sources of error
Likewise, if your error due to incomplete model optimization (e.g. stochastic gradient descent for the parameters, or a too-coarse grid in cross-validation for the hyperparameters) dominates, adding more data won't help: again, error due to the number of data points is not your worst problem.
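These two slides, together with the next, match the standard three-way decomposition of a learned model's excess error (in the spirit of Bottou and Bousquet's "The Tradeoffs of Large-Scale Learning"; the notation here is mine, not the deck's):

\[
\mathcal{E} \;=\; \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}},
\]

where \(\mathcal{E}_{\mathrm{app}}\) comes from a too-limited model class, \(\mathcal{E}_{\mathrm{est}}\) from finite training data, and \(\mathcal{E}_{\mathrm{opt}}\) from incomplete optimization or tuning. More data shrinks only \(\mathcal{E}_{\mathrm{est}}\).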
22. 10. Your other errors may be holding you back
It is necessary to minimize all the sources of error at the same time.
• Training (too-)simple models in order to handle large datasets may forfeit the benefit
• Performing (too-)incomplete training in order to handle large datasets may forfeit the benefit
23. Summary: 10 facts from theory and practice
1. Better prediction → More $
2. Data size is a basic lever for predictive power
3. More data → predictive power
4. More sophisticated models need more data
5. More features need more data
6. More data → better prediction is real
7. Down-sampling for CV → wrong parameters
8. Down-sampling may be throwing out gems
9. ML on big data is now possible*
10. Your other errors may be holding you back
A written version (a 14-page white paper) is available at the Skytree booth.
24. Conclusions: 5 practical upshots
• Training on a subsample of the data gives up measurable predictive power, and thus significant business value.
• When a dataset contains rare objects or values, which is common, subsampling can be disastrous.
• Training too-simple models may block the benefit of the data size.
• Performing too-incomplete training may block the benefit of the data size.
• Performing cross-validation on a subsample is incorrect: it tunes hyperparameters for the wrong training-set size.
When you are ready to max out your data's potential with true state-of-the-art ML:
www.skytree.net
Thanks!