MapReduce on ZeroVM 
A Lightweight virtualization for Big Data Processing 
Joy Rahman 
Research Assistant 
Cloud and Big Data Lab, UTSA
MapReduce and Big Data 
● Big data is an all-encompassing term for any collection of data sets so large and 
complex that it becomes difficult to process using traditional data processing 
applications. 
● MapReduce is a distributed processing framework that supports Big Data 
Processing. 
● A MapReduce program is composed of a Map() procedure that performs filtering 
and sorting and a Reduce() procedure that performs a summary operation. 
● MapReduce libraries have been written in many programming languages. A 
popular open-source implementation is Apache Hadoop (http://hadoop.apache.org/).
Let's start with an example 
Challenge: Count all the words in a file 
Lorem Ipsum is simply dummy text of the printing and 
typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of 
Letraset sheets containing Lorem Ipsum passages, and 
more recently with desktop publishing software like Aldus 
PageMaker including versions of Lorem Ipsum. 
Contrary to popular belief, Lorem Ipsum is not simply 
random text. It has roots in a piece of classical Latin 
literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney 
College in Virginia, looked up one of the more 
obscure Latin words, consectetur, from a Lorem Ipsum 
passage, and going through the cites of the word in 
classical literature, discovered the undoubtable source. 
Word Count 
-------- -------- 
Lorem 5 
.... 1 
.... 1 
.... 1 
dummy 1 
Any problem with this approach? 
- Yes, the file may be too big for a single program to process.
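The single-program approach above can be sketched in a few lines of Python. This is only an illustration: the function name `count_words`, the tokenizing regular expression, and the shortened sample text are assumptions, not part of the original slides. Note that the whole input is held in memory at once, which is exactly what stops working when the file is too big.

```python
import re
from collections import Counter

def count_words(text):
    """Count how often each word occurs in a string held entirely in memory."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)

sample = (
    "Lorem Ipsum is simply dummy text. "
    "Lorem Ipsum has been the standard dummy text."
)
counts = count_words(sample)

# Print a small Word / Count table like the one on the slide.
for word, count in counts.most_common(3):
    print(f"{word:<8} {count}")
```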
Let's see an example (cont.) 
A better approach: Divide and Conquer 
Lorem Ipsum is simply dummy text of the printing and 
typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the 
release of Letraset sheets containing Lorem Ipsum 
passages, and more recently with desktop publishing 
software like Aldus PageMaker including versions of 
Lorem Ipsum. 
Contrary to popular belief, Lorem Ipsum is not simply 
random text. It has roots in a piece of classical Latin 
literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at 
Hampden-Sydney College in Virginia, looked up one of 
the more obscure Latin words, consectetur, from a 
Lorem Ipsum passage, and going through the cites of the 
word in classical literature, discovered the undoubtable 
source. 
Program 1: (Lorem, 2), (simply, 1), (has, 1) 
Program 2: (Lorem, 1), (was, 2), (has, 5) 
Program 3: (Lorem, 3), (from, 2), (has, 1) 
Each result is a (key, value) pair. 
Do you see any problem with this approach?
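A minimal sketch of the divide-and-conquer step, assuming the input has already been tokenized into a list of words. The helper names `split_into_chunks` and `count_chunk` are hypothetical, and the three worker threads merely stand in for Program 1, Program 2 and Program 3; a real deployment would run the pieces in separate processes or on separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(words, n_chunks):
    """Split a list of words into n_chunks contiguous, nearly equal pieces."""
    size, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks

def count_chunk(chunk):
    """Work done by one program: count the words in its own piece only."""
    return Counter(chunk)

words = "Lorem has Lorem simply was has Lorem from has".split()
chunks = split_into_chunks(words, 3)

# Each worker produces its own list of (key, value) results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))

for i, partial in enumerate(partial_counts, start=1):
    print(f"Program {i}:", dict(partial))
```

Each program only sees its own piece, so the same word can appear in several partial results with different counts.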
We need to combine the results 
- We have divided the big input file into multiple pieces so that parallel 
processes can work on the file simultaneously, lowering the total 
processing time. 
- But the results from the individual processes need to be combined. 
Partial results: 
Program 1: (Lorem, 2), (simply, 1), (has, 1) 
Program 2: (Lorem, 1), (was, 2), (has, 5) 
Program 3: (Lorem, 3), (from, 2), (has, 1) 
Combined result: 
Lorem, 6 
simply, 1 
has, 7 
from, 2 
.... 
....
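The combining step can be sketched as follows, using the partial results from the slide as input. The function name `combine` is an assumption made for illustration.

```python
from collections import Counter

# Partial results of the three programs, taken from the slide.
partial_counts = [
    Counter({"Lorem": 2, "simply": 1, "has": 1}),
    Counter({"Lorem": 1, "was": 2, "has": 5}),
    Counter({"Lorem": 3, "from": 2, "has": 1}),
]

def combine(partials):
    """Add up the counts of identical keys across all partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total

total = combine(partial_counts)
print(dict(total))
```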
MapReduce 
● The example we have just seen is a typical 
MapReduce program for big data processing. 
● The first phase (splitting up and processing the input) is 
called Map. 
● The final phase (combining the results) is called 
Reduce.
Formal Definitions 
❏ The Map and Reduce functions of MapReduce are both defined with respect to 
data structured in (key, value) pairs. 
❏ Map takes one pair of data with a type in one data domain, and returns a list of 
pairs in a different domain: 
Map(k1,v1) → list(k2,v2) 
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the 
MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each 
key. 
❏ The Reduce function is then applied in parallel to each group, which in turn 
produces a collection of values in the same domain: 
Reduce(k2, list (v2)) → list(v3) 
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
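The definitions above can be sketched as a small in-memory function. This is an illustrative, single-machine sketch and not how Hadoop or ZeroCloud implement it; the names `map_reduce`, `word_count_map` and `word_count_reduce` are assumptions.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply Map(k1, v1) -> list(k2, v2), group by k2, then apply
    Reduce(k2, list(v2)) -> list(v3) to every group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # collect all values that share a key
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

def word_count_map(name, line):
    # k1 = document name, v1 = one line of text; emit (word, 1) pairs
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Returns a list to match the list(v3) signature
    return [sum(counts)]

records = [("doc", "Lorem Ipsum has"), ("doc", "Lorem has has")]
result = map_reduce(records, word_count_map, word_count_reduce)
print(result)
```

In a distributed framework the map calls and the per-key reduce calls would run in parallel; here they run sequentially to keep the data flow visible.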
[Figure: Split → [k1, v1] → sort by k1 → Merge → [k1, [v1, v2, v3, ...]]]
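The sort-and-merge step in the figure can be sketched with the standard library: sort the pairs by key, then merge runs of equal keys into one entry. The function name `sort_and_merge` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_merge(pairs):
    """Turn [k1, v1] pairs into [k1, [v1, v2, v3, ...]] groups."""
    ordered = sorted(pairs, key=itemgetter(0))  # sort by k1 (stable)
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

pairs = [("has", 1), ("Lorem", 2), ("has", 5), ("Lorem", 1)]
merged = sort_and_merge(pairs)
print(merged)
```

Sorting first matters because `groupby` only merges adjacent items with the same key.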
Existing Limitations of Big Data 
Processing on the Cloud 
● Current cloud implementations have two distinct clusters: 
○ 1) Computation cluster (e.g., Amazon EC2) 
○ 2) Storage cluster (e.g., Amazon S3) 
● The computation cluster is used for CPU-intensive processing, whereas the 
storage cluster is used to store the persistent data. 
● Running MapReduce on the cloud is costly because considerable overhead 
is incurred in fetching the data from the storage cluster to the 
computation cluster and writing it back after processing.
Example: Amazon EMR 
[Figure: costly data transfer between storage and computation] 
Image source & ref: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
Challenges 
● How can we avoid the data transfer overhead in big data processing? 
○ Answer: take the computation to the storage cluster. 
[Figure: apps running inside the storage cluster] 
But traditional OS-level virtualization is 
● bulky and CPU-intensive to run 
inside a cluster that is optimized 
for storage I/O only 
● slow to spin up 
● expensive to scale horizontally
ZeroVM to the rescue 
● ZeroVM is an open-source lightweight virtualization platform 
based on the Chromium Native Client (NaCl) project; NaCl provides the 
essential isolation through software fault isolation. 
● ZeroVM makes it possible to safely execute arbitrary code (C/C++, Python) 
from untrusted users in multi-tenant environments. 
● The ZeroVM core is only 75 KB in size and can spin up in 5 ms. 
● This makes it an ideal candidate to run on top of storage clusters 
such as OpenStack Swift. 
● ZeroVM takes computation to the storage, enabling cost-effective 
MapReduce on the cloud.
ZeroVM Properties 
1. ZeroVM is small, light, fast, secure, and hyper-scalable. 
2. ZeroVM virtualizes the application, not the operating system. 
3. Single-threaded (thus deterministic) execution: the same executable 
produces the same results each time it is run. 
4. Resource constraints are predefined before execution: 
● Channel-based I/O 
● Predefined socket port / network 
● Restricted memory access 
● Limited reads/writes (in bytes) 
● Short-lived sessions / predefined session_timeout
Credit: Ryan McKinney, Senior Software Engineer, Rackspace
ZeroCloud 
● ZeroCloud is the cloud module that runs on top of Swift and makes it possible 
to run ZeroVM sessions on different servers of the cluster. 
● ZeroCloud makes it easy to create large clusters of instances, aggregating the 
compute power of many individual physical servers into a single execution 
environment. 
● Users can leverage the power of hundreds of physical servers for a few seconds or 
even milliseconds at a time. 
● Horizontal scalability is a key design goal for ZeroVM.
ZeroCloud (on Swift) 
[Figure: inside the OpenStack Swift cluster, the user supplies a job description 
with the executables (apps) in a GET/POST request to the Swift proxy running 
ZeroCloud. For exec requests, the object servers spawn ZeroVM sessions that run 
the apps, and the results are returned in the response.]
MapReduce on ZeroVM 
● ZeroVM running on ZeroCloud is inherently targeted at big 
data processing, particularly in the MapReduce style. 
● Users can define multi-stage jobs, and any stage can 
connect to any other stage. 
● Users need to provide only the executables. 
● Since the data is already inside the Swift cluster, a job 
execution request sent through GET/POST is enough to start the big 
data processing immediately and obtain the result. 
● This ensures data locality and eliminates the costly data transfer.
Demonstration 
Would you like to give ZeroVM a try? http://zebra.zerovm.org/
Our Research on ZeroVM 
● There is a lot of ongoing research on ZeroVM. 
● The UTSA Cloud and Big Data Lab has several ongoing research 
projects. 
● I am currently working under the supervision of Dr. Lama to 
improve MapReduce on ZeroVM. 
● Our project involves developing a scheduler for ZeroCloud 
that ensures data locality and is interference-, heterogeneity-, 
and skew-aware.
Our Research on ZeroVM (contd.) 
● Data locality is of great importance for big data processing. 
● The current implementation ensures data locality for the Map phase, 
since the executables run where the input data is stored. 
● We would like to optimize and ensure data locality for the 
Reduce phase as well. 
● We would like to design a scheduler that intelligently mitigates the 
data/computational skew problem (which is inherent in 
every MapReduce environment) and is 
currently handled manually by the end user.
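To make the data-locality idea concrete, here is a small greedy sketch that places each task on a node that already stores its input, choosing the least-loaded replica holder. It is a hypothetical illustration only, not the scheduler under development and not ZeroCloud's behaviour; the names `assign_tasks`, `object_locations` and the node labels are assumptions.

```python
def assign_tasks(object_locations, node_load=None):
    """Assign each input object's task to the least-loaded node holding a replica.

    object_locations maps an object name to the nodes that store a copy of it.
    """
    load = dict(node_load or {})
    assignment = {}
    for obj, nodes in sorted(object_locations.items()):
        # Prefer the replica holder with the fewest tasks; break ties by name.
        best = min(nodes, key=lambda n: (load.get(n, 0), n))
        assignment[obj] = best
        load[best] = load.get(best, 0) + 1
    return assignment

object_locations = {
    "part-0": ["node-a", "node-b"],
    "part-1": ["node-a", "node-c"],
    "part-2": ["node-a", "node-b"],
}
assignment = assign_tasks(object_locations)
print(assignment)
```

Because every task runs on a node that already holds its input, no input data crosses the network; balancing by task count is a crude stand-in for the skew and interference awareness described above.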
Thanks 
Get this ppt from: http://goo.gl/6fJpbn 
Credits: 
[1] Prosunjit Biswas, UTSA 
[2] Carina C. Zona, Rackspace 
[3] Ryan McKinney, Rackspace 
References: 
[1] ZeroVM: http://www.zerovm.org 
[2] Apache Hadoop: http://hadoop.apache.org 
[3] Amazon EMR: http://aws.amazon.com/elasticmapreduce 
[4] MapReduce: http://en.wikipedia.org/wiki/MapReduce 
[5] Native Client: A Sandbox for Portable, Untrusted x86 Native Code: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34913.pdf 
More about ZeroVM 
Website: www.zerovm.org 
GitHub: https://github.com/zerovm/ 
User mailing list: zerovm@googlegroups.com 
IRC: #zerovm on Freenode
