MATLAB_BIg_Data_ds_Haddop_22032015

Big Data Handling
in MATLAB
Section-2
Asaf Ben Gal

Agenda
• Approaches for Big Data in MATLAB
• Map-Reduce & Hadoop Intro
• Hadoop & MATLAB Integration
• MATLAB on Amazon cloud EC2

Considerations for Choosing an Approach
Data characteristics
• Size, type and location of your data
Compute platform
• Single desktop machine or cluster
Analysis characteristics
• Embarrassingly parallel
• Analyze sub-segments of data and aggregate results
• Operate on entire dataset

Techniques for Big Data in MATLAB
[Figure: techniques plotted by complexity (embarrassingly parallel to non-partitionable) and by memory use (out-of-memory to in-memory)]
• Load, Analyze, Discard: parfor, datastore
• MapReduce
• Distributed Memory: SPMD and distributed arrays

When to Use datastore
• Data Characteristics
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
• Compute Platform
– Desktop
• Analysis Characteristics
– Supports Load, Analyze, Discard workflows
– Incrementally read chunks of data,
process within a while loop
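The Load, Analyze, Discard workflow above can be sketched roughly as follows. This is a minimal illustration, not the presenter's demo code: the CSV file, its single column `Value`, and the chunk size of 100 rows are all assumptions made so the example is self-contained.

```matlab
% Create a small, hypothetical CSV file so the sketch can run on its own.
csvFile = [tempname '.csv'];
writetable(table((1:1000)', 'VariableNames', {'Value'}), csvFile);

% A datastore describes the file without loading it into memory.
ds = datastore(csvFile);
ds.ReadSize = 100;                  % rows returned by each call to read

runningMax = -Inf;
while hasdata(ds)
    chunk = read(ds);                                 % Load one chunk (a table)
    runningMax = max(runningMax, max(chunk.Value));   % Analyze it
end                                                   % Discard: chunk is overwritten

delete(csvFile);
```

Only one chunk is held in memory at a time, which is why this pattern suits files that are larger than available RAM.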

What is MapReduce?
“MapReduce is a programming model for processing and generating large data sets with a parallel distributed algorithm on a cluster.”
It consists of two parts:
• Mapping Phase: filter and process subsets of data in parallel at the data location
• Reduce Phase: aggregate results from the mapping phase

Finding the max value of a large array
Mapping function:
function maxFuncMapper(data, ~, KeyValueDataStorage)
% Mapper function for the MaxMapreduce demo.
% data is one chunk of the datastore, read as a table; the info argument is unused.
partMax = max(data.Array);
add(KeyValueDataStorage, 'PartialMaxArr', partMax);
end
Reduce function:
function maxFuncReducer(~, KeyValueDataStorage, outKeyValueStore)
% Reducer function for the MaxMapreduce demo.
% The intermediate key is unused; KeyValueDataStorage iterates over its values.
maxVal = -Inf;
while hasnext(KeyValueDataStorage)
    maxVal = max(getnext(KeyValueDataStorage), maxVal);
end
% The key-value pair added to outKeyValueStore becomes the output of mapreduce.
add(outKeyValueStore, 'MaxArr', maxVal);
end
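To show how a mapper and reducer of this shape are driven end to end, here is a hedged sketch that runs them on a tiny file. The file contents, the names `maxMapper` and `maxReducer`, and the temporary output folder are assumptions for illustration. It also assumes a MATLAB release that allows local functions at the end of a script; in older releases the two functions would go in their own files.

```matlab
% Hypothetical input: a CSV file with one numeric column named Array.
csvFile = [tempname '.csv'];
writetable(table([3; 41; 7; 19], 'VariableNames', {'Array'}), csvFile);
ds = datastore(csvFile);
ds.ReadSize = 2;                    % small chunks, so the mapper runs more than once

mapreducer(0);                      % run serially in the local MATLAB session
outds = mapreduce(ds, @maxMapper, @maxReducer, ...
    'OutputFolder', tempname, 'Display', 'off');

result = readall(outds);            % table with variables Key and Value
maxValue = result.Value{1};
delete(csvFile);

function maxMapper(data, ~, intermKVStore)
    % Emit the maximum of this chunk under a single shared key.
    add(intermKVStore, 'PartialMaxArr', max(data.Array));
end

function maxReducer(~, intermValIter, outKVStore)
    % Combine the per-chunk maxima into one overall maximum.
    maxVal = -Inf;
    while hasnext(intermValIter)
        maxVal = max(maxVal, getnext(intermValIter));
    end
    add(outKVStore, 'MaxArr', maxVal);
end
```

Because every chunk emits the same key, the reducer is called once and sees all partial maxima.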

MapReduce Programming Model
[Diagram: Input Files → MAP → REDUCE → Output Files, running inside an Execution Environment]
• datastore
• mapreducer
• mapreduce

datastore
• datastore: stores information about the input files and output files
[Diagram: MAP → REDUCE]

mapreducer
• mapreducer: controls the execution environment of the MapReduce operation
[Diagram: MAP → REDUCE]

mapreducer
Execution environments: Serial, Local Pool (multi-core CPU), Cluster
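As a rough sketch of how the execution environment is selected, the snippet below picks the serial option and leaves the other two as comments, since they need extra products and infrastructure. The Hadoop install path is a placeholder, not a value from the presentation.

```matlab
% Serial: run mapreduce in the local MATLAB session.
mr = mapreducer(0);

% Local pool (assumes Parallel Computing Toolbox is installed):
% mr = mapreducer(gcp);

% Hadoop cluster (assumes MDCS and a Hadoop installation; the path is hypothetical):
% cluster = parallel.cluster.Hadoop('HadoopInstallFolder', '/path/to/hadoop');
% mr = mapreducer(cluster);
```

The same mapper and reducer functions are then passed to mapreduce regardless of which environment is active.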

mapreduce
• mapreduce: executes the MapReduce algorithm
[Diagram: MAP → REDUCE inside the Execution Environment]

When to Use mapreduce?
• Data Characteristics
– Dataset will not fit into memory
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
• Compute Platform
– Desktop: MATLAB
– Scales to run on a cluster: MATLAB with MDCS
– Scales to run within Hadoop MapReduce
on data in HDFS
• Analysis Characteristics
– Must be partitionable into two phases
1. Map: filter or process sub-segments of data
2. Reduce: aggregate interim results and calculate final answer

Demo: Predicting Internal Building Temperature
Using Machine Learning, Datastore and MapReduce
• Data
– Internal building zone temperatures
– 30+ zones per building
– Each zone ~165 MB CSV (GBs in total)
• Challenge
– Build a predictive model from ALL the data
– There is too much data for a single model
• Solution
– Create a multi-layered model
– Layer 1: train multiple models, one from each chunk of data
– Layer 2: use the output of Layer 1 as input to a second model for prediction
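The two-layer idea can be illustrated with a small sketch. The slides do not say which model type the demo uses, so plain least-squares regression stands in here, and the data is synthetic. The variable names, chunk count, and coefficients are all assumptions.

```matlab
rng(0);                                        % reproducible synthetic data
n = 300;
X = [ones(n, 1), rand(n, 2)];                  % hypothetical inputs: intercept, Value, Value2
y = X * [17; 0.5; 1.5] + 0.01 * randn(n, 1);   % hypothetical target: Value3

% Layer 1: fit one model per chunk of rows.
numChunks = 3;
rowsPerChunk = n / numChunks;
layer1 = zeros(size(X, 2), numChunks);
for k = 1:numChunks
    rows = (k - 1) * rowsPerChunk + (1:rowsPerChunk);
    layer1(:, k) = X(rows, :) \ y(rows);       % least-squares fit on this chunk only
end

% Layer 2: the layer-1 predictions become the inputs of a second model.
P = X * layer1;                                % n-by-numChunks matrix of predictions
layer2 = pinv(P) * y;                          % weights that combine the layer-1 models

yHat = P * layer2;                             % final prediction
maxAbsError = max(abs(yHat - y));
```

In a MapReduce setting, the layer-1 fits correspond to work done per chunk, and the layer-2 fit only needs the much smaller matrix of predictions.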

'Time' 'Temp_Label' 'Value' 'AirTemp_Label' 'Value2' 'ZoneTemp_Label' 'Value3'
161.642 'TEMPERATURE' 36.88 'SupplyAirTemperature' 14.599 'ZoneTemperature' 22.722
US 245
MapReduce with Machine Learning
Data Store → Map → Reduce
'Time' 'Temp_Label' 'Value' 'AirTemp_Label' 'Value2' 'ZoneTemp_Label' 'Value3'
'Value' 'Value2' 'Value3'
36.88 14.599 22.722
36.88 14.599 22.722
36.88 14.599 22.722
36.88 14.599 22.722
Train a Model
36.88 14.599 22.722
36.88 14.599 22.722
36.88 14.327 22.722
36.88 14.265 22.722
Train a Model
36.88 14.265 22.722
36.88 14.265 22.722
36.88 14.265 22.722
36.88 14.265 22.722
Train a Model
Test Data
Value Value2 Value3
35.67 14.8 21.1
37.5 15.4 22.8
36.34 16.2 24.3
Input 1
Input 2
Input N
Train a Model

Final Workflow
New Data → Predict Layer 1 → Predict Layer 2
'Time' 'Value' 'Value2' 'Value3'
340.81 20.5 14.057 17.571
340.81 20.5 14.057 17.571
340.81 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
Predict1 Predict2 … Predict N
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
Predict Final
17.489
17.489
17.4632
17.4632
17.4632
17.446
17.446
17.33
17.33
17.33

What is Hadoop?
• Hadoop is a software framework for distributed
processing of large datasets across large clusters
– Large datasets: terabytes or petabytes of data
– Large clusters: hundreds or thousands of nodes
• Hadoop consists of two main parts:
– HDFS: Hadoop Distributed File System
– MapReduce environment

Who uses Hadoop?
• Google: inventors of the MapReduce computing paradigm
• Yahoo: developed Hadoop, an open-source implementation of MapReduce
• IBM, Microsoft, Oracle, Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs

Distributions of Hadoop
• Apache Hadoop (open source)
• Cloudera
• Hortonworks
• MapR

Demo: Vehicle Registry Analysis
Using MapReduce
• Data
– Massachusetts Vehicle Registration Data from 2008-2011
– 16M records, 45 fields
• Analysis
– Examine hybrid adoptions
– Calculate % of hybrids registered by quarter

mapreduce
Data Store → Map → Reduce
Veh_typ  Q3_08  Q4_08  Q1_09  Hybrid
Car      1      1      1      0
SUV      0      1      1      1
Car      1      1      1      1
Car      0      0      1      1
Car      0      1      1      1
Car      1      1      1      1
Car      0      0      1      1
SUV      0      1      1      0
Car      1      1      1      0
SUV      1      1      1      1
Car      0      1      1      1
Car      1      0      0      0
Key: Q3_08, Hybrid values: 0 1 1 0 1 1 1
Key: Q4_08, Hybrid values: 0 1 1 1 1
Key: Q1_09, Hybrid values: 0 0 0 1 0 1
Output: Key, % Hybrid (Value)
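A keyed version of this computation might look like the sketch below, which uses the twelve sample rows from the slide. The mapper and reducer names, the choice to emit a pair of counts per quarter, and the temporary file and folder are illustrative assumptions rather than the demo's actual code. Local functions in a script require a release that supports them.

```matlab
% Sample rows copied from the slide, written to a hypothetical CSV file.
Veh_typ = {'Car';'SUV';'Car';'Car';'Car';'Car';'Car';'SUV';'Car';'SUV';'Car';'Car'};
Q3_08  = [1;0;1;0;0;1;0;0;1;1;0;1];
Q4_08  = [1;1;1;0;1;1;0;1;1;1;1;0];
Q1_09  = [1;1;1;1;1;1;1;1;1;1;1;0];
Hybrid = [0;1;1;1;1;1;1;0;0;1;1;0];
csvFile = [tempname '.csv'];
writetable(table(Veh_typ, Q3_08, Q4_08, Q1_09, Hybrid), csvFile);

ds = datastore(csvFile);
ds.ReadSize = 4;                    % several chunks, so each key gets several values
mapreducer(0);                      % serial execution
outds = mapreduce(ds, @hybridMapper, @hybridReducer, ...
    'OutputFolder', tempname, 'Display', 'off');
result = readall(outds);            % Key: quarter, Value: percent hybrid
delete(csvFile);

function hybridMapper(data, ~, intermKVStore)
    % For each quarter, emit [hybrids registered, vehicles registered] in this chunk.
    quarters = {'Q3_08', 'Q4_08', 'Q1_09'};
    for q = 1:numel(quarters)
        registered = data.(quarters{q}) == 1;
        counts = [sum(registered & data.Hybrid == 1), sum(registered)];
        add(intermKVStore, quarters{q}, counts);
    end
end

function hybridReducer(key, intermValIter, outKVStore)
    % Sum the per-chunk counts for one quarter and convert to a percentage.
    counts = [0 0];
    while hasnext(intermValIter)
        counts = counts + getnext(intermValIter);
    end
    add(outKVStore, key, 100 * counts(1) / counts(2));
end
```

Emitting counts instead of percentages from the mapper keeps the reduce step exact, since percentages from chunks of different sizes cannot simply be averaged.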

Hadoop + MATLAB
[Diagram: a MATLAB datastore points at data in HDFS; Map and Reduce tasks run on each Hadoop node next to the data it holds]

Explore and Analyze Data on Hadoop
[Diagram: MATLAB MapReduce code uses a datastore on HDFS; MATLAB Distributed Computing Server runs Map and Reduce on each node that holds data]

Deployed MATLAB applications with Hadoop
[Diagram: MATLAB MapReduce code with a datastore on HDFS; Map and Reduce run on each node that holds data, using the MATLAB Compiler runtime]

Demo Environment: 2 Linux OS virtual machines (Master and Slave) on one network
• Hadoop node, MDCS job scheduler and workers
• Hadoop master, MATLAB (R2014b), MDCS workers

Demo VM: MDCS
[Diagram: MATLAB MapReduce code uses a datastore on HDFS; MATLAB Distributed Computing Server runs Map and Reduce on each node that holds data]

Parallel Computing – Distributed Memory
Using More Computers (RAM)
[Diagram: several machines, each with four cores (Core 1 to Core 4) and its own RAM]

Distributed Arrays
• Distributed array lives on the cluster
• Remotely manipulate the array from the desktop
• Available from Parallel Computing Toolbox and MATLAB Distributed Computing Server
[Diagram: a numeric array whose elements are spread across the cluster workers]

spmd blocks
spmd
% single program across workers
end
• Mix parallel and serial code in the same function
• Run on a pool of MATLAB resources
• Single Program runs simultaneously across workers
• Multiple Data spread across multiple workers
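A small sketch of an spmd block follows. It assumes Parallel Computing Toolbox and an available parallel pool, and uses the classic `labindex` and `gplus` functions; the data each worker builds is made up for illustration.

```matlab
pool = gcp;                               % current pool, or start the default one

spmd
    % Every worker runs this same block on its own data.
    localPart = labindex * ones(1, 3);    % worker k holds [k k k]
    total = gplus(sum(localPart));        % global sum, available on every worker
end

% Outside the block, total is a Composite; index it to fetch a worker's copy.
totalOnClient = total{1};
```

Code before and after the spmd block runs serially on the client, which is how parallel and serial code are mixed in one function.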

Example: Airline Delay Analysis
• Data
– BTS/RITA Airline
On-Time Statistics
– 123.5M records, 29 fields
• Analysis
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate
predictive models

When to Use Distributed Memory
• Data Characteristics
– Data must fit in the collective
memory across machines
• Compute Platform
– Prototype (subset of data) on desktop
– Run on a cluster or cloud
• Analysis Characteristics
– Consists of:
• Parts that can be run on data in memory (spmd)
• Supported functions for distributed arrays
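For the second point, a distributed array can be used much like an ordinary one when the function supports it. The sketch below assumes Parallel Computing Toolbox and an open pool; the small matrix stands in for data that would really be too large for one machine.

```matlab
% Spread a matrix across the workers of the current pool.
A = distributed(reshape(1:12, 3, 4));

% sum is one of the functions that accepts distributed arrays; it runs on the workers.
colSums = sum(A, 1);

% gather brings the (small) result back to the client workspace.
colSumsOnClient = gather(colSums);
```

Only the final result is gathered, so the full array never has to fit in the client's memory.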

Big Data on the Desktop
• Expand workspace
• 64-bit processor support – increased in-memory data set
handling
• Access portions of data too big to fit into memory
• Memory mapped variables – huge binary file
• Datastore – huge text file or collections of text files
• Database – query portion of a big database table
• Variety of programming constructs
• System Objects – analyze streaming data
• MapReduce – process text files that won’t fit into
memory
• Increase analysis speed
• Parallel for-loops with multicore/multi-process machines
• GPU Arrays
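As a minimal sketch of the parallel for-loop item, the loop below has independent iterations, which is what makes it embarrassingly parallel. The computation itself is a placeholder. Without Parallel Computing Toolbox the same loop still runs, just serially.

```matlab
n = 8;
results = zeros(1, n);

% Each iteration depends only on k, so iterations can run on different workers.
parfor k = 1:n
    results(k) = sum((1:k) .^ 2);
end
```

Here `results` is a sliced output variable: each iteration writes only its own element, which is the pattern parfor requires.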

Further Scaling Big Data Capacity
MATLAB supports a number of programming
constructs for use with clusters
• General compute clusters
• Parallel for-loops – embarrassingly parallel
algorithms
• SPMD and distributed arrays – distributed memory
• Hadoop clusters
• MapReduce – analyze data stored in the Hadoop
Distributed File System
Use these constructs on the desktop to develop your algorithms
Migrate to a cluster without algorithm changes

MATLAB on Amazon cloud - EC2
• Custom clusters in 10 minutes
• Unlimited workers (256 per cluster)

Conclusion
• Big Data approaches in MATLAB –
locally & beyond
• Map-Reduce programming in MATLAB
• Hadoop & MATLAB Integration
• MATLAB on Amazon cloud EC2

Learn More
• MATLAB Documentation
– Strategies for Efficient Use of Memory
– Resolving "Out of Memory" Errors
• Big Data with MATLAB
– www.mathworks.com/discovery/big-data-matlab.html
• MATLAB MapReduce and Hadoop
– www.mathworks.com/discovery/matlab-mapreduce-hadoop.html

Thank you for listening!
Asaf Ben Gal
asafbg@systematics.co.il
03-7660111
