Big Data Training -
Amazon EMR
About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big
data solutions on the AWS cloud.
• 8x AWS certifications + other certifications on
Azure, Snowflake etc.
• You can find me on
• LinkedIn:
https://www.linkedin.com/in/vishal-p-2703a9131/
• Medium:
https://medium.com/@vishalrv1904
Amazon EMR
Agenda
• EMR Overview
• EMR Fundamental blocks
• Launch types of EMR
• EMR Storage
• EMR Managed Scaling
• EMR Security
• EMR Pricing
• Hands-on
What is EMR?
Elastic MapReduce
Managed Hadoop framework on EC2 instances.
Includes Spark, HBase, Presto, Hive & more
Several integration points with AWS.
Basic blocks of
EMR
• Master node:
The master node manages the cluster
and typically runs master components
of distributed applications.
Services such as the Spark History
Server, the YARN ResourceManager, and
the HDFS NameNode run on the master
node; NodeManagers run on core and task nodes.
• Core node:
A node with software components
that run tasks and store data in the
Hadoop Distributed File System (HDFS)
on your cluster.
Multi-node clusters have at least one
core node.
• Task node:
A node with software components
that only runs tasks, and you can use
task nodes to add power to perform
parallel computation tasks on data,
such as Hadoop MapReduce tasks and
Spark executors.
Task nodes do not run the HDFS DataNode
daemon and do not store data in HDFS.
Launch types of
EMR
• EMR on EKS cluster.
• EMR Serverless (announced in preview, November 2021)
• EMR on EC2 instances.
• Instance Group
• Instance Fleets
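The two EC2 layouts differ mainly in how capacity is described. The sketch below assumes the request shape that boto3's `run_job_flow` accepts under `Instances` (`InstanceGroups` versus `InstanceFleets`). The instance types, counts, and helper names are illustrative assumptions, and nothing here calls AWS.

```python
def instance_groups(core_count):
    """Instance groups: one fixed instance type per group, sized by count."""
    return {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": core_count,
             "Market": "ON_DEMAND"},
        ]
    }


def instance_fleets(on_demand_units, spot_units):
    """Instance fleets: several candidate types, sized by target capacity."""
    return {
        "InstanceFleets": [
            {"Name": "master", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"Name": "core", "InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": on_demand_units,
             "TargetSpotCapacity": spot_units,
             # EMR may pick any listed type to reach the target capacity.
             "InstanceTypeConfigs": [
                 {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                 {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
             ]},
        ]
    }


groups = instance_groups(core_count=2)
fleets = instance_fleets(on_demand_units=2, spot_units=4)
```

Either dictionary would be merged into the `Instances` argument of a cluster-creation request; a cluster uses one layout or the other, not both.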
EMR Storage
HDFS
• Hadoop Distributed File System
• Multiple copies stored across cluster instances
for redundancy
• Files stored as blocks (128MB default size)
• Ephemeral – HDFS data is lost when the cluster is
terminated!
• Still useful for caching intermediate results and for
workloads with significant random I/O
• Hadoop tries to process data where it is stored
on HDFS
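To make the block and replication points concrete, here is a small arithmetic sketch. The 128 MB block size comes from the list above; the replication factor is a parameter because the value depends on cluster configuration, and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size quoted above


def hdfs_footprint(file_size_mb, replication=3):
    """Return (number of blocks, raw MB consumed across the cluster)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies the bytes it holds, so raw usage is
    # file size times replication, not block count times block size.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


# A 1000 MB file with two replicas (replication value is an assumption).
blocks, raw_mb = hdfs_footprint(1000, replication=2)
```

The block count also bounds how many tasks can read the file in parallel, which is why very small files are inefficient on HDFS.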
Local file system:
• Suitable only for temporary data (buffers,
caches, etc)
EMRFS:
• Access S3 as if it were HDFS
• Allows persistent storage after cluster
termination
• EMRFS Consistent View – optional feature for S3
consistency
• Uses DynamoDB to track consistency
• May need to tune read/write
capacity on DynamoDB
• Since December 2020, S3 itself is strongly
consistent, so Consistent View is rarely needed
EMR Scaling
EMR Automatic Scaling:
• The older approach
• Custom scaling rules based on CloudWatch
metrics
• Supports instance groups only.
EMR Managed Scaling:
• Supports instance groups and instance fleets
• Scales Spot Instances, On-Demand Instances, and instances covered by a
Savings Plan within the same cluster
• Available for Spark, Hive, and YARN workloads
Scale-up Strategy
• First adds core nodes, then task nodes,
up to the maximum units specified
Scale-down Strategy
• First removes task nodes, then core
nodes, no further than the minimum
constraints
Spot nodes are always removed before On-Demand
instances
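The minimum and maximum limits mentioned above are what a managed scaling policy carries. A minimal sketch, assuming the `ComputeLimits` shape accepted by boto3's `put_managed_scaling_policy`; the helper name and the numbers are hypothetical, and no AWS call is made.

```python
def managed_scaling_policy(min_units, max_units, max_core_units,
                           max_on_demand_units):
    """Build a ManagedScalingPolicy dict after basic sanity checks."""
    if min_units > max_units:
        raise ValueError("minimum capacity cannot exceed maximum capacity")
    if max_core_units > max_units or max_on_demand_units > max_units:
        raise ValueError("core and On-Demand caps cannot exceed the maximum")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Beyond this many core nodes, further growth uses task nodes.
            "MaximumCoreCapacityUnits": max_core_units,
            # Beyond this many On-Demand units, further growth uses Spot.
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
        }
    }


policy = managed_scaling_policy(min_units=2, max_units=10,
                                max_core_units=4, max_on_demand_units=6)
```

The resulting dictionary would be passed as `ManagedScalingPolicy` together with a `ClusterId`.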
EMR
Security
• EMRFS
• S3 encryption (SSE or CSE) at rest
• TLS in transit between EMR nodes and S3
• S3
• SSE-S3, SSE-KMS
• Local disk encryption
• Spark communication between drivers and
executors can be encrypted
• Hive communication between the AWS Glue metastore
and EMR uses TLS
• Force HTTPS (TLS) in S3 bucket policies with the
aws:SecureTransport condition key.
• IAM roles and policies.
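The HTTPS-only rule is usually written as a deny statement on the bucket. The sketch below generates such a policy document; the bucket name and function name are placeholders, and the statement follows the common pattern of denying every S3 action when `aws:SecureTransport` is false.

```python
import json


def deny_insecure_transport_policy(bucket):
    """Return a bucket policy (JSON string) that rejects non-TLS requests."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself and every object in it.
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    return json.dumps(policy, indent=2)


# "example-emr-data-bucket" is a placeholder name.
policy_json = deny_insecure_transport_policy("example-emr-data-bucket")
```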
EMR Pricing
• Amazon EMR on Amazon EC2:
• The Amazon EMR price is added to the Amazon EC2 price (the
price for the underlying servers) and the Amazon Elastic Block
Store (Amazon EBS) price (if you attach Amazon EBS volumes).
These charges are billed per second, with a one-minute minimum.
• Amazon EMR on Amazon EKS:
• The Amazon EMR price is added to the Amazon EKS price and the price of
any other services used with EKS. You can run EKS on AWS
using either EC2 or AWS Fargate.
• Amazon EMR Serverless:
• With EMR Serverless, there are no upfront costs, and you pay
for only the resources you use. You pay for vCPU, memory, and
storage resources consumed by your applications.
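For EMR on EC2, the per-second billing rule can be worked through as simple arithmetic. The hourly rates below are made-up placeholders rather than real AWS prices, and the function name and the flat per-instance EBS rate are simplifying assumptions.

```python
def emr_on_ec2_cost(runtime_seconds, instance_count,
                    ec2_hourly, emr_hourly, ebs_hourly=0.0):
    """Estimate cluster cost with per-second billing and a 60 s minimum."""
    billed_seconds = max(runtime_seconds, 60)  # one-minute minimum
    hourly_per_instance = ec2_hourly + emr_hourly + ebs_hourly
    return instance_count * hourly_per_instance * billed_seconds / 3600


# Hypothetical rates: 0.20 (EC2) + 0.05 (EMR) per instance-hour.
short_run = emr_on_ec2_cost(30, 3, ec2_hourly=0.20, emr_hourly=0.05)
one_minute = emr_on_ec2_cost(60, 3, ec2_hourly=0.20, emr_hourly=0.05)
one_hour = emr_on_ec2_cost(3600, 3, ec2_hourly=0.20, emr_hourly=0.05)
```

A 30-second run costs the same as a 60-second run because of the minimum; after that, cost grows linearly with runtime.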
© Presidio, Inc. All rights reserved. Proprietary and Confidential.
Questions
Amazon EMR
Hands-on
EMR Cluster
Hands-on
• EMR portal overview
• EMR cluster creation overview
• SSH into the Cluster.
• Running application
• Spark shell
• Spark submit option
• EMR step
• EMR Notebook
• Logs overview
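Of the ways to run an application listed above, an EMR step is the easiest to script. The sketch below builds a step definition in the shape accepted by boto3's `add_job_flow_steps`, using `command-runner.jar` to invoke `spark-submit`; the script URI, arguments, and helper name are hypothetical.

```python
def spark_step(name, script_uri, script_args=(), deploy_mode="cluster"):
    """Describe an EMR step that runs spark-submit through command-runner."""
    return {
        "Name": name,
        # Keep the cluster and later steps running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode,
                     script_uri, *script_args],
        },
    }


# Placeholder S3 path and argument, for illustration only.
step = spark_step("daily-load", "s3://example-code-bucket/jobs/load.py",
                  script_args=["--date", "2024-01-01"])
```

The same `spark-submit` arguments work unchanged from an SSH session on the master node, which makes it simple to test a job interactively before turning it into a step.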
Spark Deployment
Modes
Client Mode
Cluster Mode
Spark Memory Allocation
• Storage Memory:
• It’s mainly used to store Spark cache data, such as RDD cache, Unroll data, and so on.
• Execution Memory:
• It’s mainly used to store temporary data in the calculation process of Shuffle, Join,
Sort, Aggregation, etc.
• User Memory:
• It’s mainly used to store data needed for RDD transformations, such as
RDD dependency information.
• Reserved Memory:
• This memory is reserved for the system and is used to store Spark’s internal objects.
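The four regions can be sized with a little arithmetic. This sketch assumes Spark's unified memory manager with its default settings (`spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5, 300 MB reserved); the function name is hypothetical, and memory overhead outside the JVM heap is not modeled.

```python
RESERVED_MB = 300  # fixed reservation for Spark's internal objects


def spark_memory_regions(executor_heap_mb, memory_fraction=0.6,
                         storage_fraction=0.5):
    """Split an executor heap into the four regions described above."""
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction  # shared by storage and execution
    return {
        "reserved_mb": RESERVED_MB,
        "user_mb": usable * (1 - memory_fraction),
        "storage_mb": unified * storage_fraction,
        "execution_mb": unified * (1 - storage_fraction),
    }


# Example: spark.executor.memory=4g, i.e. a 4096 MB heap.
regions = spark_memory_regions(4096)
```

The storage/execution split is a soft boundary: either side can borrow from the other when it is idle, so these figures are starting points rather than hard limits.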
EMR Bootstrap
• Use a bootstrap action to install additional
software or customize the configuration of
cluster instances
• Bootstrap actions are scripts that run on
the cluster after Amazon EMR launches
the instance using the Amazon Linux
Amazon Machine Image (AMI).
• Bootstrap actions run before Amazon EMR
installs the applications that you specify
when you create the cluster and before
cluster nodes begin processing data.
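A bootstrap action is declared as a script location plus arguments. A minimal sketch, assuming the `BootstrapActions` entry shape used by boto3's `run_job_flow`; the script path, arguments, and helper name are placeholders.

```python
def bootstrap_action(name, script_uri, args=()):
    """Describe one bootstrap action: a script in S3 plus its arguments."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_uri,
            "Args": list(args),
        },
    }


# Hypothetical script that installs extra Python packages on every node.
install_libs = bootstrap_action(
    "install-python-libs",
    "s3://example-code-bucket/bootstrap/install_libs.sh",
    args=["pandas", "pyarrow"],
)
bootstrap_actions = [install_libs]
```

Because the script runs on every node before the applications are installed, it suits OS packages and Python libraries better than application-level settings.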
EMR Spark
Configuration
• spark.dynamicAllocation.enabled
• spark.executor.memory
• spark.driver.memory
• spark.driver.memoryOverhead
• spark.executor.memoryOverhead
• spark.driver.cores
• spark.executor.instances
• Spark arguments:
• --num-executors
• --executor-memory
• --executor-cores
• --py-files
• --packages
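The properties and arguments above end up on one `spark-submit` command line. The sketch below assembles that command as a list; the helper name, script path, and chosen values are illustrative assumptions, not tuning advice.

```python
def spark_submit_command(script, conf, num_executors,
                         executor_memory, executor_cores):
    """Build a spark-submit argument list from flags and --conf properties."""
    cmd = ["spark-submit",
           "--num-executors", str(num_executors),
           "--executor-memory", executor_memory,
           "--executor-cores", str(executor_cores)]
    # Properties without a dedicated flag are passed as --conf key=value.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


cmd = spark_submit_command(
    "s3://example-code-bucket/jobs/load.py",  # placeholder path
    conf={
        "spark.dynamicAllocation.enabled": "false",
        "spark.driver.memory": "2g",
        "spark.executor.memoryOverhead": "512m",
    },
    num_executors=4, executor_memory="4g", executor_cores=2,
)
print(" ".join(cmd))
```

Dynamic allocation is switched off here so that the fixed executor count is what actually runs; with it enabled, Spark adjusts the number of executors at runtime.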
EMR Hands-On
Write data to S3 using an EMR Spark application.
Write data to RDS PostgreSQL using an EMR Spark application.
Write data to S3 using an EMR Spark Streaming application that reads from Kinesis.
EMR
Assignments
• Explore different file formats,
• CSV file format
• JSON file format
• Avro file format
• ORC file format
• Parquet file format.
• Explore different compressions,
• ZIP
• GZIP
• BZIP2
• Snappy
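As a starting point for the compression exercise, the sketch below compares two of the listed codecs on a small synthetic CSV using only the Python standard library. Snappy is left out because it needs a third-party package, and the sample data is made up.

```python
import bz2
import gzip

# Synthetic CSV payload, generated only for this comparison.
rows = [f"{i},customer_{i % 10},{i * 1.5}" for i in range(1000)]
sample = ("id,name,amount\n" + "\n".join(rows)).encode("utf-8")

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {"raw": len(sample)}
for name, (compress, decompress) in codecs.items():
    packed = compress(sample)
    # Round-trip check: decompression must restore the original bytes.
    assert decompress(packed) == sample
    sizes[name] = len(packed)

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Size is only one dimension to explore. Splittability matters as well: a gzip-compressed text file has to be read by a single task, whereas bzip2 files can be split across tasks.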
• Create an S3 bucket and configure a Lambda function to be triggered on every new object creation.
• The Lambda function should receive the event from S3 and submit a step to the EMR cluster with the required arguments.
• The EMR Spark application should read the file from S3 and add metadata columns such as the load
datetime.
• After transformation, the output DataFrame should be stored in a target S3 bucket.
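One possible outline of the Lambda side of this assignment is sketched below. It assumes the standard S3 event notification layout and boto3's `add_job_flow_steps`; the cluster ID, S3 paths, and the `--input`/`--output` script arguments are hypothetical. The EMR client is injectable so the logic can be exercised without AWS.

```python
from urllib.parse import unquote_plus

# Hypothetical names for illustration only.
CLUSTER_ID = "j-EXAMPLECLUSTER"
SCRIPT_URI = "s3://example-code-bucket/jobs/add_metadata.py"
TARGET_URI = "s3://example-target-bucket/output/"


def s3_object_from_event(event):
    """Extract (bucket, key) from the first record of an S3 notification."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


def build_step(bucket, key):
    """Describe a spark-submit step that processes the new object."""
    return {
        "Name": f"process {key}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", SCRIPT_URI,
                     "--input", f"s3://{bucket}/{key}",
                     "--output", TARGET_URI],
        },
    }


def handler(event, context, emr_client=None):
    """Lambda entry point; submits one EMR step per invocation."""
    if emr_client is None:
        import boto3  # provided by the Lambda Python runtime
        emr_client = boto3.client("emr")
    bucket, key = s3_object_from_event(event)
    response = emr_client.add_job_flow_steps(
        JobFlowId=CLUSTER_ID, Steps=[build_step(bucket, key)])
    return response["StepIds"]
```

The Spark script itself (reading the input, adding the load datetime column, writing to the target bucket) is left as the exercise intends.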
• Create a Spark Streaming application
with Kinesis as input.
• Perform real-time inserts, updates, and
deletes on the RDS database.
Feedback