5. Writing my own RDD? What for?
● To write your own RDD, you need to understand the
internal mechanics of Apache Spark to some extent
● Writing your own RDD proves you understand them well
● When connecting to external storage, it is reasonable to
create your own RDD for it
10. RDD - the definition
RDD stands for resilient distributed dataset
Dataset - the initial data comes from some
distributed storage
Distributed - stored on nodes across
the cluster
Resilient - if data is lost, it can
be recreated
11. Quiz: what is an “RDD”?
A: distributed collection of objects on disk
B: distributed collection of objects in memory
C: distributed collection of objects in Cassandra
Answer: could be any of the above!
12. Scientific Answer: RDD is an Interface!
1. Set of partitions (“splits” in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as
an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s)
for each partition
(Items 1-3 define the RDD's “lineage”; items 4-5 enable optimized execution.)
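The five-part interface above can be sketched as a toy Python class hierarchy. This is an illustrative sketch only, not Spark's actual API; all names (ToyRDD, ToyMapRDD, ToyParallelCollection) are made up for the example.

```python
# Toy sketch of the RDD interface: partitions, dependencies ("lineage"),
# a compute function, and the two optional members. Not Spark code.

class ToyRDD:
    def __init__(self, partitions, dependencies=None, partitioner=None):
        self._partitions = partitions           # 1. set of partitions
        self.dependencies = dependencies or []  # 2. parent RDDs ("lineage")
        self.partitioner = partitioner          # 4. optional partitioner

    def get_partitions(self):
        return self._partitions

    def compute(self, partition):
        # 3. function computing a partition as an iterator, given its parent(s)
        raise NotImplementedError

    def preferred_locations(self, partition):
        # 5. optional preferred locations (data-locality hints)
        return []

class ToyParallelCollection(ToyRDD):
    """Leaf RDD backed by an in-memory list split into partitions."""
    def __init__(self, data, num_partitions=2):
        super().__init__(list(range(num_partitions)))
        self._chunks = [data[i::num_partitions] for i in range(num_partitions)]

    def compute(self, partition):
        return iter(self._chunks[partition])

class ToyMapRDD(ToyRDD):
    """A narrow transformation: each partition depends on one parent partition."""
    def __init__(self, parent, fn):
        super().__init__(parent.get_partitions(), dependencies=[parent])
        self.fn = fn

    def compute(self, partition):
        parent = self.dependencies[0]
        return (self.fn(x) for x in parent.compute(partition))

base = ToyParallelCollection([1, 2, 3, 4], num_partitions=2)
doubled = ToyMapRDD(base, lambda x: x * 2)
# A "collect": evaluate every partition and flatten the results.
result = sorted(x for p in doubled.get_partitions() for x in doubled.compute(p))
```

Note how `ToyMapRDD` stores no data of its own: it only records its parent and the function to apply, which is exactly the lineage information used later for recovery.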
46. RDD Persistence
•Each node stores in memory any partitions of the dataset that it
computes, and reuses them in other actions on that dataset.
•After marking an RDD to be persisted, the first time the dataset is
computed in an action, it will be kept in memory on the nodes.
•Allows future actions to be much faster (often by more than 10x) since
you’re not re-computing some data every time you perform an action.
•If the data is too big to fit in memory, partitions are spilled to disk
or evicted and recomputed later, depending on the storage level
•Evictions follow a Least Recently Used (LRU) replacement policy
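The caching-with-LRU-eviction idea can be shown with a small self-contained sketch. This is not Spark's implementation; `PartitionCache` and its methods are invented for illustration.

```python
# Toy sketch of persistence: computed partitions are kept in a bounded
# in-memory cache and evicted Least Recently Used. Not Spark code.
from collections import OrderedDict

class PartitionCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()  # partition_id -> materialized data
        self.computations = 0        # counts real (non-cached) computations

    def get(self, partition_id, compute_fn):
        if partition_id in self._store:
            self._store.move_to_end(partition_id)  # mark as recently used
            return self._store[partition_id]
        data = list(compute_fn(partition_id))      # the expensive computation
        self.computations += 1
        self._store[partition_id] = data
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)        # evict the LRU partition
        return data

cache = PartitionCache(capacity=2)
expensive = lambda pid: (x * x for x in range(pid, pid + 3))

cache.get(0, expensive)  # computed
cache.get(0, expensive)  # served from cache: no recomputation
cache.get(1, expensive)  # computed
cache.get(2, expensive)  # computed; partition 0 is now evicted (LRU)
```

Asking for partition 0 again would trigger a recomputation, which is exactly the slowdown that persisting a hot dataset avoids.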
48. RDD Persistence APIs
rdd.persist()
rdd.persist(StorageLevel)
•Persist this RDD with the default storage level (MEMORY_ONLY).
•You can pass a StorageLevel for fine-grained control over
persistence
rdd.cache()
•Persists the RDD with the default storage level (MEMORY_ONLY)
rdd.checkpoint()
•The RDD will be saved to files inside the checkpoint directory set with
SparkContext#setCheckpointDir("/path/to/dir")
•Used for RDDs with long lineage chains and wide dependencies, since
they would be expensive to re-compute
rdd.unpersist()
•Marks the RDD as non-persistent and removes all of its blocks from
memory and disk
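The difference between persist and checkpoint can be sketched with a toy class. This is a conceptual sketch only, not Spark's implementation: `ToyPersistable` and its methods are invented names, and a single JSON file stands in for reliable checkpoint storage.

```python
# Toy contrast of persist vs. checkpoint: persist caches in memory and can
# be undone; checkpoint writes the data out so it survives without lineage.
import json
import os
import tempfile

class ToyPersistable:
    def __init__(self, compute_fn):
        self.compute_fn = compute_fn   # the "lineage": how to recompute
        self._cached = None
        self._checkpoint_file = None

    def persist(self):
        self._cached = list(self.compute_fn())   # materialize in memory
        return self

    def unpersist(self):
        self._cached = None                      # drop the cached blocks
        return self

    def checkpoint(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "part-00000.json")
        with open(path, "w") as f:
            json.dump(list(self.compute_fn()), f)
        self._checkpoint_file = path             # lineage no longer needed

    def collect(self):
        if self._cached is not None:
            return self._cached                  # fastest: from memory
        if self._checkpoint_file is not None:
            with open(self._checkpoint_file) as f:
                return json.load(f)              # from "reliable storage"
        return list(self.compute_fn())           # recompute from lineage

rdd = ToyPersistable(lambda: (x + 1 for x in range(3)))
rdd.persist()
rdd.checkpoint(tempfile.mkdtemp())
rdd.unpersist()
print(rdd.collect())   # read back from the checkpoint file: [1, 2, 3]
```

Even after `unpersist()` drops the in-memory copy, `collect()` still avoids recomputation because the checkpoint file survives.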
49. Fault Tolerance
• RDDs record their lineage graph (the coarse-grained transformations that
built them), which is used to rebuild partitions that were lost
• Only the lost partitions of an RDD need to be recomputed upon failure.
• They can be recomputed in parallel on different nodes without having to roll
back the entire app
• Also lets a system tolerate slow nodes (stragglers) by running a backup
copy of the troubled task.
• The original task on the straggling node is killed once the backup copy
completes
• Cached/checkpointed partitions, if still available, are also used to
re-compute lost partitions
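The "recompute only the lost partitions" idea can be shown in a few lines. This is an illustrative sketch with made-up names, not Spark code.

```python
# Toy sketch of lineage-based recovery: when one cached partition is lost,
# only that partition is recomputed from its source, not the whole dataset.

source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}   # stand-in for stored input blocks

def compute_partition(pid):
    # The "lineage": a deterministic transformation of the source partition.
    return [x * 10 for x in source[pid]]

# Materialize (cache) every partition once.
cached = {pid: compute_partition(pid) for pid in source}

# Simulate a node failure that loses one cached partition.
del cached[1]

# Recovery: recompute only the missing partitions; the rest stay untouched.
lost = [pid for pid in source if pid not in cached]
for pid in lost:
    cached[pid] = compute_partition(pid)
```

Because the transformation is deterministic and recorded in the lineage, the recomputed partition is identical to the lost one, and the surviving partitions never need to be touched.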