GANDHINAGAR INSTITUTE OF TECHNOLOGY
Information Technology Department
RDD Transformations
Presented By: Shaishav Shah
Student ID: GIT_IT_B_21
Guided By
Prof. Pooja Shah
BDA (2171607)
What is an RDD?
• RDD stands for Resilient Distributed Dataset.
• Spark revolves around the concept of the RDD, a fault-tolerant
collection of elements that can be operated on in parallel.
• There are two ways to create an RDD: by parallelizing an
existing collection in your driver program, or by referencing a
dataset in an external storage system (such as HDFS, HBase, or
any data source offering a Hadoop InputFormat).
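The first creation route, parallelizing a driver-side collection, can be pictured with a small plain-Python sketch. This is a toy model and not the Spark API: the function name `parallelize` is borrowed only for illustration, and the contiguous-slice splitting rule is an assumption made here.

```python
def parallelize(data, num_partitions):
    """Toy model: split a local collection into contiguous partitions."""
    data = list(data)
    n = len(data)
    # Slice boundaries are spread as evenly as integer division allows.
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


partitions = parallelize(range(1, 11), 3)
print(partitions)
```

Each inner list stands for one partition that a separate worker could process in parallel.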
RDDs and Their Operations
• Spark supports two types of RDD operations:
1. Transformations
2. Actions
Transformations
• RDD transformations are functions that take an RDD as input
and produce one or more new RDDs as output.
• Because RDDs are immutable, the input RDD is never changed.
• Transformations are lazy: they define new RDDs, but nothing
is computed until an action is called.
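Immutability and laziness can be illustrated with a single-machine stand-in. The class `ToyRDD` below is hypothetical and is not the Spark API; it only records transformations and runs them when the action `collect()` is called.

```python
class ToyRDD:
    """Toy single-machine stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, source, ops=()):
        self._source = source    # original data, never modified
        self._ops = tuple(ops)   # recorded transformations

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD and computes nothing.
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: the recorded transformations are executed only here.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3, 4])
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before = len(calls)   # still 0: no action has been called yet
result = evens.collect()    # the action triggers the computation
print(calls_before, result)
```

The original `numbers` object is untouched by `map` and `filter`; each call returns a new object that remembers how to derive its data.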
Types of RDD Transformation:
• To improve performance, Spark pipelines consecutive narrow
transformations into a single stage, so intermediate results
need not be materialized between them.
• There are two kinds of transformations:
1. Narrow Transformation
2. Wide Transformation
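The idea of pipelining can be sketched with plain Python generators: each element flows through every stage before the next element is read, so no intermediate list is built. The helper `pipelined` and the stage functions are hypothetical names used only for this illustration; this is not how Spark is implemented internally.

```python
def pipelined(partition, *stages):
    """Chain generator stages over one partition in a single pass."""
    items = iter(partition)
    for stage in stages:
        items = stage(items)
    return list(items)


trace = []  # order in which the stages touch elements


def double(items):
    for x in items:
        trace.append(("double", x))
        yield x * 2


def keep_large(items):
    for x in items:
        trace.append(("filter", x))
        if x > 2:
            yield x


out = pipelined([1, 2, 3], double, keep_large)
print(out)
print(trace)
```

The trace shows the two stages interleaved element by element rather than one stage finishing before the other starts.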
Narrow Transformation
• Narrow transformations
result from operations
such as map() and filter().
• The data needed to compute
an output partition comes
from a single partition of
the parent RDD, so no data
moves between partitions.
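A narrow transformation can be modelled in plain Python by treating an RDD as a list of partitions and processing each partition on its own. The helpers `narrow_map` and `narrow_filter` are hypothetical names for this toy model, not Spark functions.

```python
def narrow_map(partitions, fn):
    # Each output partition is built from exactly one input partition.
    return [[fn(x) for x in part] for part in partitions]


def narrow_filter(partitions, pred):
    # Filtering also never looks outside the current partition.
    return [[x for x in part if pred(x)] for part in partitions]


parent = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
doubled = narrow_map(parent, lambda x: x * 2)
large = narrow_filter(doubled, lambda x: x > 10)
print(doubled)
print(large)
```

The number of partitions is unchanged, and an empty partition is still a partition; nothing is exchanged between them.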
Wide Transformation
• Wide transformations
result from operations such
as groupByKey() and
reduceByKey().
• In this case, the data needed
to build one output partition
may come from more than one
partition of the parent RDD.
• They are also known as
shuffle transformations.
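The shuffle behind a wide transformation can be sketched in plain Python. This is a toy model under stated assumptions: `partition_for` is a hypothetical deterministic partitioner for string keys, and `reduce_by_key` is a simplified stand-in that skips the map-side combining Spark's reduceByKey() performs before shuffling.

```python
def partition_for(key, num_partitions):
    # Hypothetical partitioner: deterministic for string keys.
    return sum(key.encode("utf-8")) % num_partitions


def reduce_by_key(partitions, fn, num_partitions):
    # Shuffle step: records with the same key are routed to the same
    # output partition, whichever input partition they came from.
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            shuffled[partition_for(key, num_partitions)].append((key, value))

    # Reduce step: merge the values of each key inside its partition.
    result = []
    for part in shuffled:
        merged = {}
        for key, value in part:
            merged[key] = fn(merged[key], value) if key in merged else value
        result.append(sorted(merged.items()))
    return result


parent = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
totals = reduce_by_key(parent, lambda x, y: x + y, 2)
print(totals)
```

The key "a" appears in all three input partitions, so its output partition has to read from every one of them; that cross-partition movement is the shuffle.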
Thank You
