SlideShare a Scribd company logo
1 of 50
Apache Spark
Syed
Solutions Engineer - Big Data
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368
Writing my own RDD? What for?
● To write your own RDD, you need to understand to
some extent internal mechanics of Apache Spark
● Writing your own RDD will prove you understand them
well
● When connecting to external storage, it is reasonable to
create your own RDD for it
RDD - the definition
RDD - the definition
RDD stands for resilient distributed dataset
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - initial data comes from some
distributed storage
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - initial data comes from some
distributed storage
Distributed - stored in nodes
among the cluster
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - initial data comes from some
distributed storage
Distributed - stored in nodes
among the cluster
Resilient - if data is lost, data can
be recreated
Quiz: what is an “RDD”?
A: distributed collection of objects on disk
B: distributed collection of objects in memory
C: distributed collection of objects in Cassandra
Answer: could be any of the above!
Scientific Answer: RDD is an
Interface!
1. Set of partitions (“splits” in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as
an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s)
for each partition
“lineage”
optimized
execution
RDD Persistence
•Each node stores any partitions of it that it computes in memory and
reuses them in other actions on that dataset.
•After marking an RDD to be persisted, the first time the dataset is
computed in an action, it will be kept in memory on the nodes.
•Allows future actions to be much faster (often by more than 10x) since
you’re not re-computing some data every time you perform an action.
•If data is too big to be cached, then it will spill to disk and memory will
gradually degrade
•Least Recently Used (LRU) replacement policy
RDD Persistence (Storage Levels)
RDD Persistence APIs
rdd.persist()
rdd.persist(StorageLevel)
•Persist this RDD with the default storage level (MEMORY_ONLY).
•You can override the StorageLevel for fine grain control over
persistence
rdd.cache()
•Persists the RDD with the default storage level (MEMORY_ONLY)
rdd.checkpoint()
•RDD will be saved to a file inside the checkpoint directory set with
SparkContext#setCheckpointDir(“/path/to/dir”)
•Used for RDDs with long lineage chains with wide dependencies since
it would be expensive to re-compute
rdd.unpersist()
•Marks it as non-persistent and/or removes all blocks of it from memory
and disk
Fault Tolerance
• RDDs contain lineage graphs (coarse grained updates/transformations) to
help it rebuild partitions that were lost
• Only the lost partitions of an RDD need to be recomputed upon failure.
• They can be recomputed in parallel on different nodes without having to roll
back the entire app
• Also lets a system tolerate slow nodes (stragglers) by running a backup
copy of the troubled task.
• Original process on straggling node will be killed when new process is
complete
• Cached/Check pointed partitions are also used to re-compute lost partitions
if available in shared memory
Thank you!
www.syedacademy.com
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368

More Related Content

What's hot

DevDay: Vault Recycler Right to be Forgotten, R3
DevDay: Vault Recycler Right to be Forgotten, R3DevDay: Vault Recycler Right to be Forgotten, R3
DevDay: Vault Recycler Right to be Forgotten, R3R3
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About ShardingMongoDB
 
Instaclustr Apache Cassandra Best Practices & Toubleshooting
Instaclustr Apache Cassandra Best Practices & ToubleshootingInstaclustr Apache Cassandra Best Practices & Toubleshooting
Instaclustr Apache Cassandra Best Practices & ToubleshootingInstaclustr
 
Raid data recovery Tips
Raid data recovery TipsRaid data recovery Tips
Raid data recovery TipsHone Software
 
Instaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr
 
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...Instaclustr
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
Cassandra tw presentation
Cassandra tw presentationCassandra tw presentation
Cassandra tw presentationOmarFaroque16
 
Seagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDsSeagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDsRed_Hat_Storage
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL storeEdward Capriolo
 
Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage
Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen StoragePros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage
Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen StorageEric Carter
 
Ceph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic CloudCeph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic CloudCeph Community
 
Monitoring Cassandra with Riemann
Monitoring Cassandra with RiemannMonitoring Cassandra with Riemann
Monitoring Cassandra with RiemannPatricia Gorla
 
Implementation of Dense Storage Utilizing HDDs with SSDs and PCIe Flash Acc...
Implementation of Dense Storage Utilizing  HDDs with SSDs and PCIe Flash  Acc...Implementation of Dense Storage Utilizing  HDDs with SSDs and PCIe Flash  Acc...
Implementation of Dense Storage Utilizing HDDs with SSDs and PCIe Flash Acc...Red_Hat_Storage
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBMongoDB
 
Ravi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi namboori
 

What's hot (20)

DevDay: Vault Recycler Right to be Forgotten, R3
DevDay: Vault Recycler Right to be Forgotten, R3DevDay: Vault Recycler Right to be Forgotten, R3
DevDay: Vault Recycler Right to be Forgotten, R3
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About Sharding
 
Instaclustr Apache Cassandra Best Practices & Toubleshooting
Instaclustr Apache Cassandra Best Practices & ToubleshootingInstaclustr Apache Cassandra Best Practices & Toubleshooting
Instaclustr Apache Cassandra Best Practices & Toubleshooting
 
Real world capacity
Real world capacityReal world capacity
Real world capacity
 
Ceph c01
Ceph c01Ceph c01
Ceph c01
 
Raid data recovery Tips
Raid data recovery TipsRaid data recovery Tips
Raid data recovery Tips
 
Instaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandra
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Cassandra tw presentation
Cassandra tw presentationCassandra tw presentation
Cassandra tw presentation
 
Seagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDsSeagate Implementation of Dense Storage Utilizing HDDs and SSDs
Seagate Implementation of Dense Storage Utilizing HDDs and SSDs
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL store
 
Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage
Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen StoragePros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage
Pros and Cons of Erasure Coding & Replication vs. RAID in Next-Gen Storage
 
Redis as database - HashedIn
Redis as database - HashedInRedis as database - HashedIn
Redis as database - HashedIn
 
Ceph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic CloudCeph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic Cloud
 
Monitoring Cassandra with Riemann
Monitoring Cassandra with RiemannMonitoring Cassandra with Riemann
Monitoring Cassandra with Riemann
 
Implementation of Dense Storage Utilizing HDDs with SSDs and PCIe Flash Acc...
Implementation of Dense Storage Utilizing  HDDs with SSDs and PCIe Flash  Acc...Implementation of Dense Storage Utilizing  HDDs with SSDs and PCIe Flash  Acc...
Implementation of Dense Storage Utilizing HDDs with SSDs and PCIe Flash Acc...
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
 
Ravi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS Architecture
 

Similar to Spark_RDD_SyedAcademy

Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache SparkGao Yunzhong
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Some thoughts on apache spark & shark
Some thoughts on apache spark & sharkSome thoughts on apache spark & shark
Some thoughts on apache spark & sharkViet-Trung TRAN
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkDavid Smelker
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureDr. Christian Betz
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 

Similar to Spark_RDD_SyedAcademy (20)

Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark
SparkSpark
Spark
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Some thoughts on apache spark & shark
Some thoughts on apache spark & sharkSome thoughts on apache spark & shark
Some thoughts on apache spark & shark
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Spark learning
Spark learningSpark learning
Spark learning
 
Big data overview
Big data overviewBig data overview
Big data overview
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
 
Spark
SparkSpark
Spark
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 

More from Syed Hadoop

Kafka syed academy_v1_introduction
Kafka syed academy_v1_introductionKafka syed academy_v1_introduction
Kafka syed academy_v1_introductionSyed Hadoop
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
Spark Streaming In Depth - www.syedacademy.com
Spark Streaming In Depth - www.syedacademy.comSpark Streaming In Depth - www.syedacademy.com
Spark Streaming In Depth - www.syedacademy.comSyed Hadoop
 
Spark_Intro_Syed_Academy
Spark_Intro_Syed_AcademySpark_Intro_Syed_Academy
Spark_Intro_Syed_AcademySyed Hadoop
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in DepthSyed Hadoop
 
Hadoop course content Syed Academy
Hadoop course content Syed AcademyHadoop course content Syed Academy
Hadoop course content Syed AcademySyed Hadoop
 

More from Syed Hadoop (6)

Kafka syed academy_v1_introduction
Kafka syed academy_v1_introductionKafka syed academy_v1_introduction
Kafka syed academy_v1_introduction
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Spark Streaming In Depth - www.syedacademy.com
Spark Streaming In Depth - www.syedacademy.comSpark Streaming In Depth - www.syedacademy.com
Spark Streaming In Depth - www.syedacademy.com
 
Spark_Intro_Syed_Academy
Spark_Intro_Syed_AcademySpark_Intro_Syed_Academy
Spark_Intro_Syed_Academy
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in Depth
 
Hadoop course content Syed Academy
Hadoop course content Syed AcademyHadoop course content Syed Academy
Hadoop course content Syed Academy
 

Recently uploaded

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 

Recently uploaded (20)

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

Spark_RDD_SyedAcademy

  • 1. Apache Spark Syed Solutions Engineer - Big Data mail.syed786@gmail.com info.syedacademy@gmail.com +91-9030477368
  • 2.
  • 3.
  • 4.
  • 5. Writing my own RDD? What for? ● To write your own RDD, you need to understand to some extent internal mechanics of Apache Spark ● Writing your own RDD will prove you understand them well ● When connecting to external storage, it is reasonable to create your own RDD for it
  • 6. RDD - the definition
  • 7. RDD - the definition RDD stands for resilient distributed dataset
  • 8. RDD - the definition RDD stands for resilient distributed dataset Dataset - initial data comes from some distributed storage
  • 9. RDD - the definition RDD stands for resilient distributed dataset Dataset - initial data comes from some distributed storage Distributed - stored in nodes among the cluster
  • 10. RDD - the definition RDD stands for resilient distributed dataset Dataset - initial data comes from some distributed storage Distributed - stored in nodes among the cluster Resilient - if data is lost, data can be recreated
  • 11. Quiz: what is an “RDD”? A: distributed collection of objects on disk B: distributed collection of objects in memory C: distributed collection of objects in Cassandra Answer: could be any of the above!
  • 12. Scientific Answer: RDD is an Interface! 1. Set of partitions (“splits” in Hadoop) 2. List of dependencies on parent RDDs 3. Function to compute a partition (as an Iterator) given its parent(s) 4. (Optional) partitioner (hash, range) 5. (Optional) preferred location(s) for each partition “lineage” optimized execution
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46. RDD Persistence •Each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset. •After marking an RDD to be persisted, the first time the dataset is computed in an action, it will be kept in memory on the nodes. •Allows future actions to be much faster (often by more than 10x) since you’re not re-computing some data every time you perform an action. •If data is too big to be cached, then it will spill to disk and memory will gradually degrade •Least Recently Used (LRU) replacement policy
  • 48. RDD Persistence APIs rdd.persist() rdd.persist(StorageLevel) •Persist this RDD with the default storage level (MEMORY_ONLY). •You can override the StorageLevel for fine grain control over persistence rdd.cache() •Persists the RDD with the default storage level (MEMORY_ONLY) rdd.checkpoint() •RDD will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir(“/path/to/dir”) •Used for RDDs with long lineage chains with wide dependencies since it would be expensive to re-compute rdd.unpersist() •Marks it as non-persistent and/or removes all blocks of it from memory and disk
  • 49. Fault Tolerance • RDDs contain lineage graphs (coarse grained updates/transformations) to help it rebuild partitions that were lost • Only the lost partitions of an RDD need to be recomputed upon failure. • They can be recomputed in parallel on different nodes without having to roll back the entire app • Also lets a system tolerate slow nodes (stragglers) by running a backup copy of the troubled task. • Original process on straggling node will be killed when new process is complete • Cached/Check pointed partitions are also used to re-compute lost partitions if available in shared memory