RDD – Overview 
(Resilient Distributed Datasets*) 
Nov 1st 2014 
Oakland CA 
By Taposh Dutta Roy 
* Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Contents 
• What is RDD 
• Motivation Behind RDD 
• Use Cases for RDD 
• Challenges for RDD 
• RDD: Solve
What is RDD 
“RDDs are fault-tolerant, parallel data structures 
that let users explicitly persist intermediate 
results in memory, control their partitioning to 
optimize data placement, and manipulate them 
using a rich set of operators.” 
In a nutshell, RDDs are an abstraction 
that enables efficient data reuse in a broad 
range of applications.
Motivation behind RDD 
Current frameworks like MapReduce and 
Dryad provide numerous abstractions for 
accessing a cluster's computational resources 
but lack abstractions for leveraging 
distributed memory. 
Data reuse is common in many iterative 
machine learning and graph algorithms, such as 
PageRank, K-means clustering and logistic 
regression.
Motivation behind RDD 
Another use case is when a user runs 
multiple ad hoc queries on the same subset of 
data. 
Unfortunately, in current frameworks the 
only way to reuse data between 
computations (i.e., between two jobs) is to write 
it to an external storage system, e.g. a 
distributed file system or an object store such as Amazon S3.
Use cases for RDD 
1. Solving Interactive Problems 
Existing solution: slow, needs high I/O 
RDD: fast, in memory
Use cases for RDD 
Example: Suppose I have to scan web server 
access logs for an error code or a 
certain piece of text.
Use cases for RDD 
Example (cont'd): I run a MapReduce job on a cluster. 
It returns a set of files containing the lines that 
matched the search, shuts down the cluster and writes 
the files to the Amazon S3 location specified in 
the script. 
If we then look at the result files and need to 
extract some other text from them, we must 
write or reuse another MapReduce 
job. That costs extra time to fetch the files, 
process them and produce the results.
Use cases for RDD 
RDD solves this problem by keeping the data 
in memory and giving the 
user the ability to re-query the subset.
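The re-query idea can be sketched in plain Python, without Spark: the filtered subset is kept in memory once, and later queries run against it instead of scanning storage again. The log lines, the `load_logs` helper and the `storage_reads` counter are hypothetical, made up for illustration; this is a toy model of the behavior, not Spark's API.

```python
# Toy model of in-memory reuse; not Spark code.
# SAMPLE_LOGS, load_logs and storage_reads are hypothetical, for illustration only.

SAMPLE_LOGS = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.3 POST /login 500",
    "10.0.0.1 GET /cart 500",
    "10.0.0.4 GET /about 200",
]

storage_reads = 0  # counts how often the "external storage" is scanned


def load_logs():
    """Stand-in for an expensive read from external storage."""
    global storage_reads
    storage_reads += 1
    return list(SAMPLE_LOGS)


# First pass: scan storage once and keep the error subset in memory.
errors = [line for line in load_logs() if line.endswith(("404", "500"))]

# Follow-up queries reuse the in-memory subset; storage is not read again.
server_errors = [line for line in errors if line.endswith("500")]
login_errors = [line for line in errors if "/login" in line]

print(len(errors), len(server_errors), len(login_errors), storage_reads)
```

Both follow-up queries run without calling `load_logs` again, which is the saving the slide describes.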
Use cases for RDD 
2. Solving Iterative Problems 
The second use case is iterative 
algorithms such as logistic regression, which 
reuse the same data on every iteration.
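Logistic regression shows why reuse matters: every gradient step scans the same dataset again. Below is a minimal sketch in plain Python, not Spark code. The four data points, the learning rate and the iteration count are hypothetical values chosen only to keep the example small.

```python
import math

# Toy model of an iterative algorithm reusing in-memory data; not Spark code.
# The data points, learning rate and iteration count are hypothetical.

# (feature, label) pairs, loaded once and kept in memory.
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


w = 0.0               # single model weight, no bias term
learning_rate = 0.5
passes_over_data = 0

for _ in range(20):
    # Each iteration scans the same in-memory dataset again.
    gradient = sum((sigmoid(w * x) - y) * x for x, y in points)
    w -= learning_rate * gradient
    passes_over_data += 1

predictions = [1 if sigmoid(w * x) >= 0.5 else 0 for x, _ in points]
print(w, predictions, passes_over_data)
```

With 20 iterations the data is scanned 20 times. If each scan had to go through external storage, the I/O cost would grow with the iteration count.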
Challenges for RDD 
The main challenge in designing RDD is 
defining a programming interface that can 
provide fault tolerance efficiently.
Challenges for RDD 
Existing solutions such as distributed shared 
memory, key-value stores and databases offer 
an interface based on fine-grained updates 
to mutable state. 
With such systems, the only ways to get 
fault tolerance are to replicate the data across 
machines or to log updates across machines. Both 
approaches are expensive for data-intensive 
workloads: they need high bandwidth to move 
the data over the cluster network and they 
need large amounts of storage.
RDD: Solve 
RDD solves these problems by providing an 
interface based on coarse-grained 
transformations such as map, filter and join. 
These transformations apply the same 
operation to many data items. 
This allows RDDs to provide fault 
tolerance efficiently by logging the transformations 
used to build a dataset (i.e., its lineage) rather 
than the actual data. If a partition of an RDD is lost, 
the RDD has enough information about how 
it was derived from other RDDs to 
recompute just that partition. The lost data 
can be recovered quickly, without costly 
replication.
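Lineage-based recovery can be sketched with a toy class in plain Python. `ToyRDD` and its methods are hypothetical names invented for this illustration, not Spark's implementation: each derived dataset records only its parent and the function applied to it, and a lost partition is rebuilt on demand from that record.

```python
# Toy model of lineage-based recovery; not Spark's implementation.
# ToyRDD and all of its methods are hypothetical names for illustration.


class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: per-partition function applied to the parent
        # Source partitions stand in for data held in stable storage.
        self.cache = dict(enumerate(partitions)) if partitions is not None else {}

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def partition(self, i):
        # Recompute partition i from the parent if it is not in memory.
        if i not in self.cache:
            self.cache[i] = self.fn(self.parent.partition(i))
        return self.cache[i]

    def lose(self, i):
        """Simulate a node failure that drops partition i from memory."""
        self.cache.pop(i, None)


source = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.partition(1)  # computed and cached
evens_squared.lose(1)                # partition 1 is lost
after = evens_squared.partition(1)   # rebuilt from lineage alone
print(before, after)
```

Only the lost partition is recomputed. Partition 0 is never touched, and no copy of the derived data is kept elsewhere.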
Applications Not Suitable for RDD 
RDDs are less suitable for applications 
that make asynchronous fine-grained updates 
to shared state, such as a storage system for a 
web application or an incremental web 
crawler. For such applications it is more 
efficient to use systems that perform traditional 
update logging and data checkpointing, 
such as databases.
Conclusion RDD 
RDD's goal is to provide an efficient 
programming model for batch 
analytics. 
RDD has been implemented in a system 
called Spark.

Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
UReason
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
ecqow
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 

Resilient Distributed DataSets - Apache SPARK

  • 1. RDD – Overview (Resilient Distributed Datasets*) Nov 1st 2014, Oakland CA. By Taposh Dutta Roy. * Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 2. Contents • What is RDD • Motivation Behind RDD • Use Cases for RDD • Challenges for RDD • RDD: Solve
  • 3. What is RDD “RDDs are fault tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operations.” In a nutshell, RDDs are a level of abstraction that enables efficient data reuse in a broad range of applications.
  • 4. Motivation behind RDD Current frameworks like MapReduce & Dryad provide numerous abstractions for accessing a cluster’s computational resources but lack abstractions for leveraging distributed memory. Data reuse is common in many iterative machine learning and graph algorithms, such as PageRank, K-means clustering, and logistic regression.
  • 5. Motivation behind RDD Another use case is when a user runs multiple ad hoc queries on the same subset of data. Unfortunately, in current frameworks the only way to reuse data between computations, i.e. between two jobs, is to write it to an external storage system, e.g. a distributed file system or a store such as Amazon S3.
  • 6. Use cases for RDD 1. Solving Iterative Problems Existing solution: slow, needs high I/O. RDD: fast, in memory.
  • 7. Use cases for RDD Example: Suppose I have to look at the webserver access logs and look for an error_code or certain text.
  • 8. Use cases for RDD Example (cont’d): I run the above code on a server, which returns a set of files containing the grepped lines, shuts down the cluster, and puts the files into an Amazon S3 location specified in the script. If we then look at the result files and need to extract some other text from them, we need to write or use another set of map-reduce code. This may take extra time to fetch the files, process them, and provide the results.
  • 9. Use cases for RDD RDD solves this problem by storing the data in memory and giving the user the ability to re-query the subset.
  • 10. Use cases for RDD 2. Solving Interactive Problems The second use case is interactive analysis, where a user repeatedly queries the same data, as in the log example above. Iterative algorithms such as logistic regression likewise need the data to be re-used.
  • 11. Challenge for RDD The main challenge in designing RDD is defining a programming interface that can provide fault tolerance efficiently.
  • 12. Challenges for RDD Existing solutions such as distributed shared memory, key-value stores, and databases offer an interface based on fine-grained updates. With such systems, the only way to get fault tolerance is to replicate the data across machines or to log updates across machines. Both approaches are data intensive: they need high bandwidth to move the data over the cluster network, and large amounts of storage.
  • 13. RDD: Solve RDD solves these problems by providing an interface based on coarse-grained transformations such as map, filter and join. These transformations apply the same operation to many data items. This allows RDDs to provide fault tolerance efficiently by logging the transformations used to build a dataset (i.e. its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it ...
  • 14. RDD: Solve (Cont’d) ... was derived from other RDDs to recompute just that partition. The lost data can be recovered quickly, without costly replication.
  • 15. Applications not suitable : RDD RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler. For such applications, systems that perform traditional update logging and data checkpointing, such as databases, are more efficient.
  • 16. Conclusion RDD RDD's goal is to provide an efficient programming model for batch analytics. RDD has been implemented in a system called Spark.
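
The two ideas the slides rely on, explicit in-memory persistence and recovery through lineage, can be illustrated with a small single-process sketch. This is not the Spark API: the class `MiniRDD`, its methods, the `computed` log, and the sample log lines are all hypothetical names made up for illustration, and "losing" a partition is simulated by dropping it from an in-memory dictionary.

```python
class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source=None, parent=None, op=None):
        self._source = source    # base partitions, standing in for stable storage
        self._parent = parent    # lineage: the dataset this one was derived from
        self._op = op            # lineage: the coarse-grained function applied per partition
        self._persisted = False  # whether computed partitions are kept in memory
        self._cache = {}         # partition index -> materialized rows
        self.computed = []       # log of partition indices computed (illustration only)

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return MiniRDD(parent=self, op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return MiniRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def persist(self):
        self._persisted = True
        return self

    def partition(self, i):
        # Serve from memory when possible; otherwise rebuild from lineage.
        if self._persisted and i in self._cache:
            return self._cache[i]
        if self._parent is None:
            rows = list(self._source[i])
        else:
            rows = self._op(self._parent.partition(i))
        self.computed.append(i)
        if self._persisted:
            self._cache[i] = rows
        return rows

    def collect(self):
        return [row for i in range(self.num_partitions) for row in self.partition(i)]

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)


# Two partitions of made-up web server log lines.
logs = MiniRDD(source=[
    ["GET /a 200", "GET /b 500"],
    ["GET /c 404", "GET /d 500"],
])
errors = logs.filter(lambda line: line.endswith("500")).persist()
paths = errors.map(lambda line: line.split()[1])

first = paths.collect()    # computes both partitions of `errors` and keeps them in memory
second = paths.collect()   # re-query: served from the persisted partitions
errors.lose_partition(1)   # simulated failure
third = paths.collect()    # only partition 1 of `errors` is rebuilt from its lineage
```

After the three queries, `errors.computed` holds `[0, 1, 1]`: the second query recomputed nothing, and the simulated failure triggered recomputation of the lost partition only. No copy of the data was replicated; the recorded `filter` and the base partitions were enough to rebuild it.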