2. Agenda
Approaches for Big Data in MATLAB
Map-Reduce & Hadoop Intro
Hadoop & MATLAB Integration
MATLAB on Amazon cloud EC2
3. Considerations for Choosing an
Approach:
Data characteristics
Size, type and location of your data
Compute platform
Single desktop machine or cluster
Analysis Characteristics
Embarrassingly Parallel
Analyze sub-segments of data and aggregate results
Operate on entire dataset
4. Techniques for Big Data in MATLAB
Complexity
Embarrassingly
Parallel
Non-Partitionable
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
5. Techniques for Big Data in MATLAB
Complexity
Embarrassingly
Parallel
Non-Partitionable
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
6. Techniques for Big Data in MATLAB
Embarrass
ingly
Parallel
Non-
Partitionab
le
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
Complexit
y
7. Techniques for Big Data in MATLAB
Embarrass
ingly
Parallel
Non-
Partitionab
le
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
Complexit
y
8. When to Use datastore
• Data Characteristics
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
• Compute Platform
– Desktop
• Analysis Characteristics
– Supports Load, Analyze, Discard workflows
– Incrementally read chunks of data,
process within a whileloop
9. Techniques for Big Data in MATLAB
ComplexityEmbarrass
ingly
Parallel
Non-
Partitionab
le
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
10. :
“MapReduce is a programming model for processing and
generating large data sets with a parallel
distributed algorithm on a cluster.”
It consists of 2 parts:
Mapping Phase
Filter & process subsets of data in parallel at the data location
Reduce Phase
Aggregate results from the mapping phase
What is MapReduce?
12. Mapping function:
function maxFuncMapper (data, KeyValueDataStorage)
% Mapper function for the MaxMapreduce demo.
partMax = max(data.Array);
add(KeyValueDataStorage, 'PartialMaxArr',partMax);
function maxFuncReducer(KeyValueDataStorage, outKeyValueStore)
% Reducer function for the MaxMapreduce demo.
while hasnext(KeyValueDataStorage)
maxVal = max(getnext(KeyValueDataStorage), maxVal);
end
% The key-value pair added to outKVStore will become the
output of mapreduce
add(outKeyValueStore,'MaxArr,maxVal);
Reduce function:
Finding the max value of a large array
18. When to Use mapreduce?
Data Characteristics
– Dataset will not fit into memory
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
Compute Platform
– Desktop – MATLAB
– Scales to run on a Cluster – MATLAB - MDCS
– Scales to run within Hadoop MapReduce
on data in HDFS
Analysis Characteristics
– Must be able to be Partitioned into two phases
1. Map: filter or process sub-segments of data
2. Reduce: aggregate interim results and calculate final answer
19. Demo: Predicting Internal Building Temperature
Using Machine Learning, Datastore and MapReduce
Data
– Internal Building Zone temperatures
– 30+ Zones per Building
– Each zone ~165MB csv (GB’s in total)
Challenge
– Build a Predictive model from ALL the data
– There is too much data for 1 single model
Solution
– Create a multi-layered Model
– Layer 1 – Train multiples models from
each chunk of data
– Layer 2 – Use the output of Layer 1
as input to a 2nd model for prediction
22. What is Hadoop?
Hadoop is a software framework for distributed
processing of large datasets across large clusters
Large datasets Terabytes or petabytes of data
Large clusters hundreds or thousands of nodes
Hadoop consists of two main parts:
HDFS – Hadoop file system
MapReduce environment
23. Who uses Hadoop?
Google: Inventors of MapReduce computing
paradigm
Yahoo: Developing Hadoop open-source of
MapReduce
IBM, Microsoft, Oracle, Facebook, Amazon,
AOL, NetFlix
Many others + universities and research labs
38. spmd blocks
spmd
% single program across workers
end
• Mix parallel and serial code in the same function
• Run on a pool of MATLAB resources
• Single Program runs simultaneously across workers
• Multiple Data spread across multiple workers
40. When to Use Distributed Memory
• Data Characteristics
– Data must be fit in collective
memory across machines
• Compute Platform
– Prototype (subset of data) on desktop
– Run on a cluster or cloud
• Analysis Characteristics
– Consists of:
• Parts that can be run on data in memory (spmd)
• Supported functions for distributed arrays
41. Big Data on the Desktop
• Expand workspace
• 64 bit processor support – increased in-memory data set
handling
• Access portions of data too big to fit into memory
• Memory mapped variables – huge binary file
• Datastore – huge text file or collections of text files
• Database – query portion of a big database table
• Variety of programming constructs
• System Objects – analyze streaming data
• MapReduce – process text files that won’t fit into
memory
• Increase analysis speed
• Parallel for-loops with multicore/multi-process machines
• GPU Arrays
42. Further Scaling Big Data Capacity
MATLAB supports a number of programming
constructs for use with clusters
• General compute clusters
• Parallel for-loops – embarrassingly parallel
algorithms
• SPMD and distributed arrays – distributed memory
• Hadoop clusters
• MapReduce – analyze data stored in the Hadoop
Distributed File System
Use these
constructs on the
desktop to develop
your algorithms
Migrate to a
cluster without
algorithm
changes
43. MATLAB on Amazon cloud - EC2
Custom clusters in 10 minutes
Unlimited workers ( 256 per cluster)
45. Conclusion
Big Data approaches in MATLAB –
locally & beyond
Map-Reduce programming in MATLAB
Hadoop & MATLAB Integration
MATLAB on Amazon cloud EC2
46. Learn More
MATLAB Documentation
– Strategies for Efficient Use of Memory
– Resolving "Out of Memory" Errors
Big Data with MATLAB
– www.mathworks.com/discovery/big-data-matlab.html
MATLAB MapReduce and Hadoop
– www.mathworks.com/discovery/matlab-mapreduce-hadoop.html