Machine Learning in Big Data
- Look forward or be left behind
V. William Porto
Hadoop Summit June 2015
2  RedPoint Global Inc. 2015 Confidential
Machine Learning – keeping ahead of the curve
Three basic tenets for success in today’s world
Prediction - you need to learn and use what you’ve learned
Optimization - the world is a dynamic place
Automation - because people don’t scale well
Machine Learning – why bother?
“If you have always done it that way, it is probably wrong” - Charles Kettering
Machine Learning – what really is it all about?
Learning vs. instruction
Humans learn instinctively – computers not so much
Intelligent Systems
Memory
Prediction (modeling)
Assessment
Feedback
Adaptation
Data Modeling – what, why, how
Regression – what happened in the past
Prediction – what will happen in the future
“Prediction is very difficult – especially if it’s about the future”
- Niels Bohr
Data Modeling – what, why, how
Choices, choices - the wide world of data modeling
Supervised models
you have historical data and known correlated outputs (truth)
Unsupervised models
you have historical data, but may not have (or trust) the associated outputs
Supervised vs. Unsupervised Models
Linear Models
Major Assumption: the world is linear
Pros:
the math is easy!
fast execution
Cons:
the real world isn’t really linear
not all errors are equal
easy to generate misleading results
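A minimal sketch of the "misleading results" point, assuming a plain least-squares line fit written by hand. The function name fit_line and the toy data are illustrative, not from the talk. The same fit recovers a truly linear relationship exactly, but reports a slope of zero for data that depends entirely on x through a curve.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [-3, -2, -1, 0, 1, 2, 3]
linear_ys = [2 * x + 1 for x in xs]   # the world really is linear here
curved_ys = [x * x for x in xs]       # y is fully determined by x, but not linearly

print(fit_line(xs, linear_ys))  # recovers intercept 1 and slope 2
print(fit_line(xs, curved_ys))  # slope is 0: the fit suggests "no relationship"
```

The second fit is not wrong as arithmetic; the linearity assumption is what fails.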
Decision Trees
Major Assumption: the world is discrete
Pros:
easy to understand
fast execution
no linearity assumptions
Cons:
lots of ‘human time’ to create
bias in unbalanced trees
some concepts need very large trees
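As a rough illustration of the "discrete world" assumption, here is a sketch of the smallest possible tree: a single split (a decision stump) chosen automatically rather than by hand. The name best_stump and the age/purchase data are hypothetical.

```python
def best_stump(values, labels):
    """Return (threshold, errors) for the rule 'predict True when value > threshold'.

    Every observed value is tried as a threshold; the first one with the
    fewest misclassifications wins.
    """
    best = None
    for threshold in sorted(set(values)):
        errors = sum((v > threshold) != label for v, label in zip(values, labels))
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


ages = [22, 25, 31, 38, 45, 52]
bought = [False, False, False, True, True, True]

threshold, errors = best_stump(ages, bought)
print(threshold, errors)
```

A full tree repeats this search inside each branch, which is where the very large trees come from when a concept does not line up with simple thresholds.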
Non-Linear Models
Major Assumption: data is representative
Pros:
‘universal’ modeling tools
fast execution
no linearity assumptions
Cons:
lots of parameters, many techniques
training can be slow
difficult to explain and understand
Artificial Neural Network
Bayesian Network
Clustering/Segmentation
Basic Question – which one describes the data the best?
Raw data
Clustering/Segmentation – group think
Collaborative Filtering
Relationship Matrix
Clustering/Segmentation with Statistics
Statistical Techniques:
K-Means
Vector Quantization
Pros:
relatively simple
statistically-backed results
Cons:
assumptions: data distribution
how many clusters really are there?
K-Means Clustering
Vector Quantization
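A minimal K-Means sketch (Lloyd's algorithm) in plain Python, assuming 2-D points and caller-supplied starting centroids. The function name and sample points are illustrative. Note that the number of clusters is an input, which is exactly the "how many clusters really are there?" problem: the algorithm will not answer it.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids.

    Returns the final centroids and the point assignments made in the
    last iteration.
    """
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Starting centroids are fixed here to keep the run repeatable; in practice they are usually chosen at random, and different starts can give different clusters.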
Clustering/Segmentation – data driven
Feature Maps:
Pros:
lets the data speak for itself
useful boundary relationships
Cons:
slow to train
Customer Demographics
Model Selection – how to choose?
Basic Model Type (prediction or segmentation)
inputs + correlated outputs
inputs only?
Basic Questions:
which one to use for my problem?
parameters?
is this the best choice?
could I do better, and how?
Optimization – making the best choices
Standard (old-school) Techniques:
PCA, Partial Least Squares, etc.
Pros:
the math is easy!
Cons:
lots of (usually incorrect) assumptions
new data = start from scratch
Optimization – is that the only way?
Optimization – Evolving better solutions
Simulated Evolution
Pros:
fast, efficient search
always have a solution
arbitrary ‘evaluation’ functions
can start with existing solution(s)
Cons:
CPU time + memory – but that’s why we have distributed processing!
Optimization – Evolving Models
What does a ‘solution’ look like?
model type
parameters
data (training + testing)
Variation – alter model type, parameters
Assessment – how well does the model work?
Selection – survival of the fittest
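The variation / assessment / selection loop can be sketched as below. This is an illustration under simplifying assumptions, not the implementation from the talk: the model type is fixed to a straight line, a solution is just its parameter list [slope, intercept], variation is Gaussian mutation, and all names (evolve, evaluate, the seed values) are hypothetical.

```python
import random


def evolve(evaluate, seed_population, generations=50,
           offspring_per_parent=3, sigma=0.5, rng=None):
    """Simple evolutionary search; lower evaluate() scores are better."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    population = [list(s) for s in seed_population]
    size = len(population)
    for _ in range(generations):
        # Variation: perturb every parameter of every parent.
        offspring = [
            [gene + rng.gauss(0, sigma) for gene in parent]
            for parent in population
            for _ in range(offspring_per_parent)
        ]
        # Assessment + selection: parents and offspring compete and the
        # best `size` survive, so there is always a current best solution.
        population = sorted(population + offspring, key=evaluate)[:size]
    return population[0]


# Toy training data generated from y = 3x + 2.
data = [(x, 3 * x + 2) for x in range(-5, 6)]


def evaluate(params):
    """Sum of squared errors of the line y = slope * x + intercept."""
    slope, intercept = params
    return sum((slope * x + intercept - y) ** 2 for x, y in data)


seeds = [[0.0, 0.0], [1.0, -1.0], [5.0, 5.0], [-2.0, 3.0]]
best = evolve(evaluate, seeds)
print(best, evaluate(best))
```

Because the seed population can hold existing solutions and evaluate can be any function, this matches two of the listed pros: "can start with existing solution(s)" and "arbitrary 'evaluation' functions".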
Evolutionary Optimization in a Hadoop Environment
Challenges:
data partitioning
distributed computation
communication
MapReduce
Optimization in a Hadoop Environment – what really works
MapReduce:
algorithmic task partitioning
iterative tasks vs. fully compartmented tasks
aggregation – distribution tasks
communication / synchronization costs
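One pattern that partitions cleanly is an algorithm whose statistics are plain sums over the data. The sketch below simulates it in a single process, assuming a straight-line least-squares fit: each mapper computes partial sums over its own partition, and one reduce step adds them. The function names and data are illustrative and this is not Hadoop API code.

```python
from functools import reduce


def map_partition(rows):
    """Mapper: partial sums (n, sum x, sum y, sum x*x, sum x*y) for one partition."""
    return (
        len(rows),
        sum(x for x, _ in rows),
        sum(y for _, y in rows),
        sum(x * x for x, _ in rows),
        sum(x * y for x, y in rows),
    )


def combine(a, b):
    """Reducer: element-wise addition of two partial-sum tuples."""
    return tuple(p + q for p, q in zip(a, b))


def solve(stats):
    """Closed-form least-squares line from the aggregated sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Three partitions of (x, y) rows drawn from y = 2x + 1.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]]

stats = reduce(combine, map(map_partition, partitions))
intercept, slope = solve(stats)
print(stats, intercept, slope)
```

Only five numbers per partition cross the network, and a single pass is enough. Iterative algorithms repeat this round trip every iteration, which is where the communication and synchronization costs come from.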
ML in a Hadoop Environment – Single Algorithm Architecture
Multi-Core Machine (per Chu, Kim, et al. 2006, Stanford NLPG)
[Diagram: ML Algorithm, Engine, Master, four Mappers, Data, Reducer. Flow labels: input, map (split data), intermediate data, reduce, query info, result.]
Machine Learning in a Hadoop Environment
ML Algorithms:
Locally Weighted Linear Regression
k-Nearest Neighbors (KNN)
Feed-forward Multi-layer Neural Network (MLP)
Principal Component Analysis (PCA)
Support Vector Machine (SVM)
Machine Learning in a Hadoop Environment – example
Hadoop Multi-Core Tests (per Chu, Kim, et al. 2006, Stanford NLPG)
[Chart: speed increase vs. number of processors]
ML in a Hadoop Environment – Evolutionary Optimization Architecture
[Diagram: Initial (seed) Population -> Coordinator -> Master (Variation) -> Offspring Partitions -> map stage (evaluation) -> 1st reduction stage (local selection) -> 2nd reduction stage (global selection) -> Nth generation solutions]
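The stages named in this architecture (map stage for evaluation, a first reduction for local selection, a second reduction for global selection) can be mimicked in one process. This is a sketch only: the fitness function, the candidate values, and the function names are hypothetical stand-ins, and real candidates would be models rather than single numbers.

```python
def evaluate(candidate):
    """Hypothetical fitness: squared distance from a target value (lower is better)."""
    return (candidate - 7.0) ** 2


def map_stage(partition):
    """Map stage: score every offspring in one partition."""
    return [(evaluate(c), c) for c in partition]


def local_selection(scored, keep):
    """1st reduction stage: keep the best few within a partition."""
    return sorted(scored)[:keep]


def global_selection(local_winners, keep):
    """2nd reduction stage: merge the local winners and keep the best overall."""
    merged = [pair for winners in local_winners for pair in winners]
    return [candidate for _, candidate in sorted(merged)[:keep]]


# Three offspring partitions, as a coordinator might hand them out.
partitions = [[1.0, 6.0, 12.0], [8.5, 3.0, 7.5], [0.0, 9.0, 6.8]]

local = [local_selection(map_stage(p), keep=2) for p in partitions]
survivors = global_selection(local, keep=2)
print(survivors)
```

Local selection shrinks what each partition sends on, so the global stage only compares a handful of finalists. The survivors would seed the variation step for the next generation.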
Machine Learning – Hadoop, MPI, GPU?
Analyze the algorithmic bottlenecks
Use Hadoop / MapReduce if:
large number of features
relatively few inter-process communication steps
e.g., on-line training
Use MPI, GPUs if:
large number of training samples
e.g., batch training
Optimization – Don’t Stop Now
Adaptation
update models regularly
drop old data, retrain
Model with different time scales
daily, weekly, seasonal, yearly, multi-year
Automate the process!
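"Drop old data, retrain" can be as simple as a fixed-size window. The sketch below assumes the simplest possible model, a mean over the most recent observations; the class name and the numbers are illustrative only.

```python
from collections import deque


class RollingMeanModel:
    """Predicts the mean of the most recent `window` observations."""

    def __init__(self, window):
        # deque(maxlen=...) discards the oldest value automatically.
        self.recent = deque(maxlen=window)

    def update(self, value):
        """Add a new observation; anything older than the window is dropped."""
        self.recent.append(value)

    def predict(self):
        return sum(self.recent) / len(self.recent)


model = RollingMeanModel(window=3)
for observation in [10, 10, 10, 20, 20, 20]:
    model.update(observation)

print(model.predict())  # reflects only the latest three observations
```

Running several such models with different window lengths (daily, weekly, seasonal) is one way to cover the different time scales listed above.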
A Word about RedPoint Global
Launched 2006
Founded and staffed by industry veterans
Headquarters: Wellesley, Massachusetts
Offices in US, UK, Australia, Philippines
Global customer base
Serves most major industries
Magic Quadrant: Data Quality
Magic Quadrant: Multichannel Campaign Management
Magic Quadrant: Integrated Marketing Management
Time for Q&A
For more information contact:
Bill Porto
RedPoint Global Inc.
36 Washington St., Suite 120
Wellesley Hills, MA 02481
vwporto@redpoint.net
