SlideShare a Scribd company logo
1 of 47
Download to read offline
Big Data Handling
in MATLAB
Section-2
Asaf Ben Gal
Agenda
 Approaches for Big Data in MATLAB
 Map-Reduce & Hadoop Intro
 Hadoop & MATLAB Integration
 MATLAB on Amazon cloud EC2
Considerations for Choosing an
Approach:
Data characteristics
 Size, type and location of your data
Compute platform
 Single desktop machine or cluster
Analysis Characteristics
 Embarrassingly Parallel
 Analyze sub-segments of data and aggregate results
 Operate on entire dataset
Techniques for Big Data in MATLAB
Complexity
Embarrassingly
Parallel
Non-Partitionable
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
Techniques for Big Data in MATLAB
Complexity
Embarrassingly
Parallel
Non-Partitionable
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
Techniques for Big Data in MATLAB
Embarrass
ingly
Parallel
Non-
Partitionab
le
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
Complexit
y
Techniques for Big Data in MATLAB
Embarrass
ingly
Parallel
Non-
Partitionab
le
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
Complexit
y
When to Use datastore
• Data Characteristics
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
• Compute Platform
– Desktop
• Analysis Characteristics
– Supports Load, Analyze, Discard workflows
– Incrementally read chunks of data,
process within a whileloop
Techniques for Big Data in MATLAB
ComplexityEmbarrass
ingly
Parallel
Non-
Partitionab
le
MapReduce
Distributed Memory
SPMD and distributed arrays
Load,
Analyze,
Discard
parfor,
datastore
out-of-
memory
in-memory
 :
“MapReduce is a programming model for processing and
generating large data sets with a parallel
distributed algorithm on a cluster.”
It consists of 2 parts:
 Mapping Phase
Filter & process subsets of data in parallel at the data location
 Reduce Phase
Aggregate results from the mapping phase
What is MapReduce?
Map Reduce process
Mapping function:
function maxFuncMapper (data, KeyValueDataStorage)
% Mapper function for the MaxMapreduce demo.
partMax = max(data.Array);
add(KeyValueDataStorage, 'PartialMaxArr',partMax);
function maxFuncReducer(KeyValueDataStorage, outKeyValueStore)
% Reducer function for the MaxMapreduce demo.
while hasnext(KeyValueDataStorage)
maxVal = max(getnext(KeyValueDataStorage), maxVal);
end
% The key-value pair added to outKVStore will become the
output of mapreduce
add(outKeyValueStore,'MaxArr,maxVal);
Reduce function:
Finding the max value of a large array
MapReduce Programming Model
 mapreducer
MAP REDUCE
Execution Environment
Output FilesInput Files
 mapreduce datastore
datastore
 datastore: Stores information
regarding input files and output files
MAP REDUCE
Execution Environment
Output FilesInput Files
mapreducer
 mapreducer: Control the Execution
Environment of the MapReduce operation
MAP REDUCE
Output FilesInput Files
Execution Environment
mapreducer
Serial ClusterLocal Pool
Multi-core
CPU
mapreduce
MAP REDUCE
Output FilesInput Files
Execution
Environment
 mapreduce:Executes the MapReduce
algorithm
When to Use mapreduce?
 Data Characteristics
– Dataset will not fit into memory
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
 Compute Platform
– Desktop – MATLAB
– Scales to run on a Cluster – MATLAB - MDCS
– Scales to run within Hadoop MapReduce
on data in HDFS
 Analysis Characteristics
– Must be able to be Partitioned into two phases
1. Map: filter or process sub-segments of data
2. Reduce: aggregate interim results and calculate final answer
Demo: Predicting Internal Building Temperature
Using Machine Learning, Datastore and MapReduce
 Data
– Internal Building Zone temperatures
– 30+ Zones per Building
– Each zone ~165MB csv (GB’s in total)
 Challenge
– Build a Predictive model from ALL the data
– There is too much data for 1 single model
 Solution
– Create a multi-layered Model
– Layer 1 – Train multiples models from
each chunk of data
– Layer 2 – Use the output of Layer 1
as input to a 2nd model for prediction
'Time' 'Temp_Label' 'Value' 'AirTemp_Label' 'Value2' 'ZoneTemp_Label''Value3'
161.642 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.643 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.645 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.646 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.327 'ZoneTemperature'22.722
161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.648 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.65 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
US 245
Data Store Map Reduce
MapReduce with Machine Learning
'Time' 'Temp_Label' 'Value' 'AirTemp_Label' 'Value2' 'ZoneTemp_Label''Value3'
161.642 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.643 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.645 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.646 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722
161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.327 'ZoneTemperature'22.722
161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.648 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
161.65 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722
'Value' 'Value2''Value3'
36.88 14.599 22.722
36.88 14.599 22.722
36.88 14.599 22.722
36.88 14.599 22.722
Train a Model
36.88 14.599 22.722
36.88 14.599 22.722
36.88 14.327 22.722
36.88 14.265 22.722
Train a Model
36.88 14.265 22.722
36.88 14.265 22.722
36.88 14.265 22.722
36.88 14.265 22.722
Train a Model
Test Data
Value Value2 Value3
35.67 14.8 21.1
37.5 15.4 22.8
36.34 16.2 24.3
Input 1
Input 2
Input N
Train a Model
New Data
Predict
Layer 1
Predict
Layer 2
Final Workflow
'Time' 'Value' 'Value2''Value3'
340.81 20.5 14.057 17.571
340.81 20.5 14.057 17.571
340.81 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
340.82 20.5 14.057 17.571
Predict1 Predict2 … Predict N
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
20.368 22.098 … 21.755
Predict Final
17.489
17.489
17.4632
17.4632
17.4632
17.446
17.446
17.33
17.33
17.33
What is Hadoop?
 Hadoop is a software framework for distributed
processing of large datasets across large clusters
 Large datasets  Terabytes or petabytes of data
 Large clusters  hundreds or thousands of nodes
 Hadoop consists of two main parts:
 HDFS – Hadoop file system
MapReduce environment
Who uses Hadoop?
 Google: Inventors of MapReduce computing
paradigm
 Yahoo: Developing Hadoop open-source of
MapReduce
 IBM, Microsoft, Oracle, Facebook, Amazon,
AOL, NetFlix
 Many others + universities and research labs
Distributions of Hadoop
 Apache Hadoop (Open source)
 Cloudera
 Hortonworks
 MapR
Demo: Vehicle Registry Analysis
Using MapReduce
• Data
– Massachusetts
Vehicle Registration
Data from 2008-2011
– 16M records, 45
fields
• Analysis
– Examine hybrid
adoptions
– Calculate % of
hybrids registered by
quarter
mapreduce
Data
Store
Map
Redu
ce
Veh_typ Q3_08 Q4_08 Q1_09 Hybrid
Car 1 1 1 0
SUV 0 1 1 1
Car 1 1 1 1
Car 0 0 1 1
Car 0 1 1 1
Car 1 1 1 1
Car 0 0 1 1
SUV 0 1 1 0
Car 1 1 1 0
SUV 1 1 1 1
Car 0 1 1 1
Car 1 0 0 0
Key: Q3_08
0
1
1
0
1
1
1
Key: Q4_08
0
1
1
1
1
Key: Q1_09
Hybrid
0
0
0
1
0
1
Key % Hybrid (Value)
Map-Reduce
Map
Map
Map
Reduce
Reduce
Reduce
Map Reduce
Map
Map
Reduce
Reduce
Datastore
HDFS
Hadoop! + MATLAB
Node
Node
NodeData
Data
Data
Datastore
Explore and Analyze Data on Hadoop
MATLAB
MapReduce Code
HDFS
NodeData
MATLAB
Distributed Computing Server
NodeData
NodeData
Map Reduce
Map Reduce
Map Reduce
Deployed MATALB applications with Hadoop
MATLAB
MapReduce Code
Datastore
HDFS
NodeData
NodeData
NodeData
Map Reduce
Map Reduce
Map Reduce
MATLAB Compiler
runtime
Demo Environment -
2 Linux OS Virtual Machines:
Hadoop node
MDCS – Job
Scheduler
& workers
Hadoop Master
MATLAB (R2014b)
MDCS - Workers
Master Slave
Network
Hadoop Environment Summary
Hadoop DataNode info
Hadoop File System
MDCS – Admin Center
Datastore
Demo VM: MDCS
MATLAB
MapReduce Code
HDFS
NodeData
MATLAB
Distributed Computing Server
NodeData
NodeData
Map Reduce
Map Reduce
Map Reduce
Parallel Computing – Distributed
Memory
Core 1
Core 3 Core 4
Core 2
RAM
Using More Computers (RAM)
Core 1
Core 3 Core 4
Core 2
RAM
…
TOOLBOXES
BLOCKSETS
Distributed Array
Lives on the Cluster
Remotely Manipulate
Array
from Desktop
1
1
2
6
4
1
1
2
2
7
4
2
1
3
2
8
4
3
1
4
2
9
4
4
1
5
3
0
4
5
1
6
3
1
4
6
1
7
3
2
4
7
1
7
3
3
4
8
1
9
3
4
4
9
2
0
3
5
5
0
2
1
3
6
5
1
2
2
3
7
5
2
Distributed Arrays
Available from
Parallel Computing
Toolbox
MATLAB Distributed
Computing Server
spmd blocks
spmd
% single program across workers
end
• Mix parallel and serial code in the same function
• Run on a pool of MATLAB resources
• Single Program runs simultaneously across workers
• Multiple Data spread across multiple workers
Example: Airline Delay Analysis
• Data
– BTS/RITA Airline
On-Time Statistics
– 123.5M records, 29 fields
• Analysis
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate
predictive models
When to Use Distributed Memory
• Data Characteristics
– Data must be fit in collective
memory across machines
• Compute Platform
– Prototype (subset of data) on desktop
– Run on a cluster or cloud
• Analysis Characteristics
– Consists of:
• Parts that can be run on data in memory (spmd)
• Supported functions for distributed arrays
Big Data on the Desktop
• Expand workspace
• 64 bit processor support – increased in-memory data set
handling
• Access portions of data too big to fit into memory
• Memory mapped variables – huge binary file
• Datastore – huge text file or collections of text files
• Database – query portion of a big database table
• Variety of programming constructs
• System Objects – analyze streaming data
• MapReduce – process text files that won’t fit into
memory
• Increase analysis speed
• Parallel for-loops with multicore/multi-process machines
• GPU Arrays
Further Scaling Big Data Capacity
MATLAB supports a number of programming
constructs for use with clusters
• General compute clusters
• Parallel for-loops – embarrassingly parallel
algorithms
• SPMD and distributed arrays – distributed memory
• Hadoop clusters
• MapReduce – analyze data stored in the Hadoop
Distributed File System
Use these
constructs on the
desktop to develop
your algorithms
Migrate to a
cluster without
algorithm
changes
MATLAB on Amazon cloud - EC2
 Custom clusters in 10 minutes
 Unlimited workers ( 256 per cluster)
MATLAB on Amazon cloud - EC2
Conclusion
 Big Data approaches in MATLAB –
locally & beyond
 Map-Reduce programming in MATLAB
 Hadoop & MATLAB Integration
 MATLAB on Amazon cloud EC2
Learn More
 MATLAB Documentation
– Strategies for Efficient Use of Memory
– Resolving "Out of Memory" Errors
 Big Data with MATLAB
– www.mathworks.com/discovery/big-data-matlab.html
 MATLAB MapReduce and Hadoop
– www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
‫ההקשבה‬ ‫על‬ ‫תודה‬!
‫בן‬ ‫אסף‬‫גל‬
asafbg@systematics.co.il
03-7660111

More Related Content

What's hot

Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA EditionHadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA EditionBig Data Joe™ Rossi
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseDan Han
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkDache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkSafir Shah
 

What's hot (20)

Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA EditionHadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBase
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkDache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
 

Viewers also liked

Design and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processorDesign and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processorNajeeb Ahmad
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel ProgrammingUday Sharma
 
Writing Fast MATLAB Code
Writing Fast MATLAB CodeWriting Fast MATLAB Code
Writing Fast MATLAB CodeJia-Bin Huang
 
Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLABIntroduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLABRay Phan
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 

Viewers also liked (6)

Design and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processorDesign and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processor
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Matlab ppt
Matlab pptMatlab ppt
Matlab ppt
 
Writing Fast MATLAB Code
Writing Fast MATLAB CodeWriting Fast MATLAB Code
Writing Fast MATLAB Code
 
Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLABIntroduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLAB
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Similar to MATLAB_BIg_Data_ds_Haddop_22032015

Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
FinalprojectpresentationSANTOSH WAYAL
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsRobert Grossman
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfWasyihunSema2
 
Data Warehouse Offload
Data Warehouse OffloadData Warehouse Offload
Data Warehouse OffloadJohn Berns
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big DataDhafer Malouche
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 

Similar to MATLAB_BIg_Data_ds_Haddop_22032015 (20)

Data Science
Data ScienceData Science
Data Science
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
Finalprojectpresentation
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Unit 2
Unit 2Unit 2
Unit 2
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Data Warehouse Offload
Data Warehouse OffloadData Warehouse Offload
Data Warehouse Offload
 
Hadoop
HadoopHadoop
Hadoop
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 

MATLAB_BIg_Data_ds_Haddop_22032015

  • 1. Big Data Handling in MATLAB Section-2 Asaf Ben Gal
  • 2. Agenda  Approaches for Big Data in MATLAB  Map-Reduce & Hadoop Intro  Hadoop & MATLAB Integration  MATLAB on Amazon cloud EC2
  • 3. Considerations for Choosing an Approach: Data characteristics  Size, type and location of your data Compute platform  Single desktop machine or cluster Analysis Characteristics  Embarrassingly Parallel  Analyze sub-segments of data and aggregate results  Operate on entire dataset
  • 4. Techniques for Big Data in MATLAB Complexity Embarrassingly Parallel Non-Partitionable MapReduce Distributed Memory SPMD and distributed arrays Load, Analyze, Discard parfor, datastore out-of- memory in-memory
  • 5. Techniques for Big Data in MATLAB Complexity Embarrassingly Parallel Non-Partitionable MapReduce Distributed Memory SPMD and distributed arrays Load, Analyze, Discard parfor, datastore out-of- memory in-memory
  • 6. Techniques for Big Data in MATLAB Embarrass ingly Parallel Non- Partitionab le MapReduce Distributed Memory SPMD and distributed arrays Load, Analyze, Discard parfor, datastore out-of- memory in-memory Complexit y
  • 7. Techniques for Big Data in MATLAB Embarrass ingly Parallel Non- Partitionab le MapReduce Distributed Memory SPMD and distributed arrays Load, Analyze, Discard parfor, datastore out-of- memory in-memory Complexit y
  • 8. When to Use datastore • Data Characteristics – Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) • Compute Platform – Desktop • Analysis Characteristics – Supports Load, Analyze, Discard workflows – Incrementally read chunks of data, process within a whileloop
  • 9. Techniques for Big Data in MATLAB ComplexityEmbarrass ingly Parallel Non- Partitionab le MapReduce Distributed Memory SPMD and distributed arrays Load, Analyze, Discard parfor, datastore out-of- memory in-memory
  • 10.  : “MapReduce is a programming model for processing and generating large data sets with a parallel distributed algorithm on a cluster.” It consists of 2 parts:  Mapping Phase Filter & process subsets of data in parallel at the data location  Reduce Phase Aggregate results from the mapping phase What is MapReduce?
  • 12. Mapping function: function maxFuncMapper (data, KeyValueDataStorage) % Mapper function for the MaxMapreduce demo. partMax = max(data.Array); add(KeyValueDataStorage, 'PartialMaxArr',partMax); function maxFuncReducer(KeyValueDataStorage, outKeyValueStore) % Reducer function for the MaxMapreduce demo. while hasnext(KeyValueDataStorage) maxVal = max(getnext(KeyValueDataStorage), maxVal); end % The key-value pair added to outKVStore will become the output of mapreduce add(outKeyValueStore,'MaxArr,maxVal); Reduce function: Finding the max value of a large array
  • 13. MapReduce Programming Model  mapreducer MAP REDUCE Execution Environment Output FilesInput Files  mapreduce datastore
  • 14. datastore  datastore: Stores information regarding input files and output files MAP REDUCE Execution Environment Output FilesInput Files
  • 15. mapreducer  mapreducer: Control the Execution Environment of the MapReduce operation MAP REDUCE Output FilesInput Files Execution Environment
  • 17. mapreduce MAP REDUCE Output FilesInput Files Execution Environment  mapreduce:Executes the MapReduce algorithm
  • 18. When to Use mapreduce?  Data Characteristics – Dataset will not fit into memory – Text data in files, databases or stored in the Hadoop Distributed File System (HDFS)  Compute Platform – Desktop – MATLAB – Scales to run on a Cluster – MATLAB - MDCS – Scales to run within Hadoop MapReduce on data in HDFS  Analysis Characteristics – Must be able to be Partitioned into two phases 1. Map: filter or process sub-segments of data 2. Reduce: aggregate interim results and calculate final answer
  • 19. Demo: Predicting Internal Building Temperature Using Machine Learning, Datastore and MapReduce  Data – Internal Building Zone temperatures – 30+ Zones per Building – Each zone ~165MB csv (GB’s in total)  Challenge – Build a Predictive model from ALL the data – There is too much data for 1 single model  Solution – Create a multi-layered Model – Layer 1 – Train multiples models from each chunk of data – Layer 2 – Use the output of Layer 1 as input to a 2nd model for prediction
  • 20. 'Time' 'Temp_Label' 'Value' 'AirTemp_Label' 'Value2' 'ZoneTemp_Label''Value3' 161.642 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.643 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.645 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.646 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.327 'ZoneTemperature'22.722 161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.648 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.65 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 US 245 Data Store Map Reduce MapReduce with Machine Learning 'Time' 'Temp_Label' 'Value' 'AirTemp_Label' 'Value2' 'ZoneTemp_Label''Value3' 161.642 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.643 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.644 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.645 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.646 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.599 'ZoneTemperature'22.722 161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.327 'ZoneTemperature'22.722 161.647 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.648 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.649 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 161.65 'TEMPERATURE' 36.88 'SupplyAirTemperature'14.265 'ZoneTemperature'22.722 'Value' 'Value2''Value3' 36.88 14.599 22.722 36.88 14.599 22.722 36.88 14.599 22.722 36.88 14.599 22.722 Train a Model 36.88 14.599 22.722 36.88 14.599 22.722 36.88 14.327 22.722 36.88 14.265 22.722 Train a Model 36.88 14.265 22.722 36.88 14.265 22.722 36.88 14.265 22.722 36.88 14.265 22.722 Train a Model Test Data Value Value2 Value3 35.67 14.8 21.1 37.5 15.4 22.8 36.34 16.2 24.3 Input 1 Input 2 Input N Train a Model
  • 21. New Data Predict Layer 1 Predict Layer 2 Final Workflow 'Time' 'Value' 'Value2''Value3' 340.81 20.5 14.057 17.571 340.81 20.5 14.057 17.571 340.81 20.5 14.057 17.571 340.82 20.5 14.057 17.571 340.82 20.5 14.057 17.571 340.82 20.5 14.057 17.571 340.82 20.5 14.057 17.571 340.82 20.5 14.057 17.571 340.82 20.5 14.057 17.571 340.82 20.5 14.057 17.571 Predict1 Predict2 … Predict N 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 20.368 22.098 … 21.755 Predict Final 17.489 17.489 17.4632 17.4632 17.4632 17.446 17.446 17.33 17.33 17.33
  • 22. What is Hadoop?  Hadoop is a software framework for distributed processing of large datasets across large clusters  Large datasets  Terabytes or petabytes of data  Large clusters  hundreds or thousands of nodes  Hadoop consists of two main parts:  HDFS – Hadoop file system MapReduce environment
  • 23. Who uses Hadoop?  Google: Inventors of MapReduce computing paradigm  Yahoo: Developing Hadoop open-source of MapReduce  IBM, Microsoft, Oracle, Facebook, Amazon, AOL, NetFlix  Many others + universities and research labs
  • 24. Distributions of Hadoop  Apache Hadoop (Open source)  Cloudera  Hortonworks  MapR
  • 25. Demo: Vehicle Registry Analysis Using MapReduce • Data – Massachusetts Vehicle Registration Data from 2008-2011 – 16M records, 45 fields • Analysis – Examine hybrid adoptions – Calculate % of hybrids registered by quarter
  • 26. mapreduce Data Store Map Redu ce Veh_typ Q3_08 Q4_08 Q1_09 Hybrid Car 1 1 1 0 SUV 0 1 1 1 Car 1 1 1 1 Car 0 0 1 1 Car 0 1 1 1 Car 1 1 1 1 Car 0 0 1 1 SUV 0 1 1 0 Car 1 1 1 0 SUV 1 1 1 1 Car 0 1 1 1 Car 1 0 0 0 Key: Q3_08 0 1 1 0 1 1 1 Key: Q4_08 0 1 1 1 1 Key: Q1_09 Hybrid 0 0 0 1 0 1 Key % Hybrid (Value)
  • 28. Datastore Explore and Analyze Data on Hadoop MATLAB MapReduce Code HDFS NodeData MATLAB Distributed Computing Server NodeData NodeData Map Reduce Map Reduce Map Reduce
  • 29. Deployed MATALB applications with Hadoop MATLAB MapReduce Code Datastore HDFS NodeData NodeData NodeData Map Reduce Map Reduce Map Reduce MATLAB Compiler runtime
  • 30. Demo Environment - 2 Linux OS Virtual Machines: Hadoop node MDCS – Job Scheduler & workers Hadoop Master MATLAB (R2014b) MDCS - Workers Master Slave Network
  • 34. MDCS – Admin Center
  • 35. Datastore Demo VM: MDCS MATLAB MapReduce Code HDFS NodeData MATLAB Distributed Computing Server NodeData NodeData Map Reduce Map Reduce Map Reduce
  • 36. Parallel Computing – Distributed Memory Core 1 Core 3 Core 4 Core 2 RAM Using More Computers (RAM) Core 1 Core 3 Core 4 Core 2 RAM …
  • 37. TOOLBOXES BLOCKSETS Distributed Array Lives on the Cluster Remotely Manipulate Array from Desktop 1 1 2 6 4 1 1 2 2 7 4 2 1 3 2 8 4 3 1 4 2 9 4 4 1 5 3 0 4 5 1 6 3 1 4 6 1 7 3 2 4 7 1 7 3 3 4 8 1 9 3 4 4 9 2 0 3 5 5 0 2 1 3 6 5 1 2 2 3 7 5 2 Distributed Arrays Available from Parallel Computing Toolbox MATLAB Distributed Computing Server
  • 38. spmd blocks spmd % single program across workers end • Mix parallel and serial code in the same function • Run on a pool of MATLAB resources • Single Program runs simultaneously across workers • Multiple Data spread across multiple workers
  • 39. Example: Airline Delay Analysis • Data – BTS/RITA Airline On-Time Statistics – 123.5M records, 29 fields • Analysis – Calculate delay patterns – Visualize summaries – Estimate & evaluate predictive models
  • 40. When to Use Distributed Memory • Data Characteristics – Data must be fit in collective memory across machines • Compute Platform – Prototype (subset of data) on desktop – Run on a cluster or cloud • Analysis Characteristics – Consists of: • Parts that can be run on data in memory (spmd) • Supported functions for distributed arrays
  • 41. Big Data on the Desktop • Expand workspace • 64 bit processor support – increased in-memory data set handling • Access portions of data too big to fit into memory • Memory mapped variables – huge binary file • Datastore – huge text file or collections of text files • Database – query portion of a big database table • Variety of programming constructs • System Objects – analyze streaming data • MapReduce – process text files that won’t fit into memory • Increase analysis speed • Parallel for-loops with multicore/multi-process machines • GPU Arrays
  • 42. Further Scaling Big Data Capacity MATLAB supports a number of programming constructs for use with clusters • General compute clusters • Parallel for-loops – embarrassingly parallel algorithms • SPMD and distributed arrays – distributed memory • Hadoop clusters • MapReduce – analyze data stored in the Hadoop Distributed File System Use these constructs on the desktop to develop your algorithms Migrate to a cluster without algorithm changes
  • 43. MATLAB on Amazon cloud - EC2  Custom clusters in 10 minutes  Unlimited workers ( 256 per cluster)
  • 44. MATLAB on Amazon cloud - EC2
  • 45. Conclusion  Big Data approaches in MATLAB – locally & beyond  Map-Reduce programming in MATLAB  Hadoop & MATLAB Integration  MATLAB on Amazon cloud EC2
  • 46. Learn More  MATLAB Documentation – Strategies for Efficient Use of Memory – Resolving "Out of Memory" Errors  Big Data with MATLAB – www.mathworks.com/discovery/big-data-matlab.html  MATLAB MapReduce and Hadoop – www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
  • 47. ‫ההקשבה‬ ‫על‬ ‫תודה‬! ‫בן‬ ‫אסף‬‫גל‬ asafbg@systematics.co.il 03-7660111