This document provides an overview of Amazon EMR (Elastic MapReduce), a managed cluster platform for big data processing using Apache Hadoop and Spark. It discusses the basic architecture including master nodes, core nodes, and task nodes. It also covers launch types, storage options like HDFS, S3, and EMRFS, managed scaling, security features, and pricing. The latter part includes hands-on examples for running Spark jobs on EMR and interacting with the cluster.
2. About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big
data solutions on the AWS cloud.
• 8x AWS certifications + other certifications on
Azure, Snowflake etc.
• You can find me on
• LinkedIn:
https://www.linkedin.com/in/vishal-p-2703a9131/
• Medium:
https://medium.com/@vishalrv1904
5. What is EMR?
Elastic MapReduce
Managed Hadoop framework on EC2 instances.
Includes Spark, HBase, Presto, Hive & more
Several integration points with AWS.
6. Basic blocks of EMR
• Master node:
The master node manages the cluster
and typically runs master components
of distributed applications.
Major services such as the Spark History
Server, the YARN ResourceManager, and the
HDFS NameNode run on the master node.
7. Basic blocks of EMR
• Core node:
A node with software components
that run tasks and store data in the
Hadoop Distributed File System (HDFS)
on your cluster.
Multi-node clusters have at least one
core node.
8. Basic blocks of EMR
• Task node:
A node with software components
that only run tasks. You can use
task nodes to add capacity for
parallel computation on data,
such as Hadoop MapReduce tasks and
Spark executors.
Task nodes do not run the DataNode
daemon and do not store data in HDFS.
9. Launch types of EMR
• EMR on EKS cluster.
• EMR Serverless (announced November 2021).
• EMR on EC2 instances.
• Instance Group
• Instance Fleets
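The difference between the two EC2 capacity layouts is easier to see side by side. Below is a minimal sketch of the dictionaries one might pass as the Instances argument of boto3's EMR run_job_flow call. The instance types, counts, and group names are illustrative assumptions, not recommendations, and no AWS call is made here.

```python
# Sketch only: builds request fragments, does not contact AWS.
# Instance types, counts, and names are illustrative assumptions.

def instance_groups_config(core_count=2, task_count=2):
    """Instance groups: one instance type and one purchasing option per group."""
    return {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": core_count,
             "Market": "ON_DEMAND"},
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": task_count,
             "Market": "SPOT"},
        ]
    }


def instance_fleets_config(task_spot_units=4):
    """Instance fleets: several candidate instance types per fleet,
    sized by target capacity units instead of a fixed instance count."""
    return {
        "InstanceFleets": [
            {"Name": "Master", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"Name": "Core", "InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 2,
             "InstanceTypeConfigs": [
                 {"InstanceType": "m5.xlarge", "WeightedCapacity": 1}]},
            {"Name": "Task", "InstanceFleetType": "TASK",
             "TargetSpotCapacity": task_spot_units,
             "InstanceTypeConfigs": [
                 {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                 {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2}]},
        ]
    }


groups = instance_groups_config()
fleets = instance_fleets_config()
print([g["InstanceRole"] for g in groups["InstanceGroups"]])
print([f["InstanceFleetType"] for f in fleets["InstanceFleets"]])
```

A cluster uses either instance groups or instance fleets, not both, so only one of these dictionaries would go into a given request.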
10. EMR Storage
HDFS
• Hadoop Distributed File System
• Multiple copies stored across cluster instances
for redundancy
• Files stored as blocks (128MB default size)
• Ephemeral – HDFS data is lost when cluster is
terminated!
• But, useful for caching intermediate results or
workloads with significant random I/O
• Hadoop tries to process data where it is stored
on HDFS
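To make the block and replication points concrete, here is a small arithmetic sketch. The 128 MB block size comes from the slide. The replication factor is passed in as a parameter because the actual value depends on cluster configuration, and the 1000 MB file is a made-up example.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size stated on the slide


def hdfs_footprint(file_size_mb, replication=3):
    """Return (number of blocks, raw MB consumed across the cluster).

    replication is an assumed value; check dfs.replication on the
    cluster for the real setting.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000, replication=3)
print(blocks, raw_mb)  # 8 blocks, 3000 MB of raw storage
```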
Local file system:
• Suitable only for temporary data (buffers,
caches, etc)
EMRFS:
• Access S3 as if it were HDFS
• Allows persistent storage after cluster
termination
• EMRFS Consistent View – Optional for S3
consistency
• Uses DynamoDB to track consistency
• May need to tune read/write
capacity on DynamoDB
• Since December 2020, S3 itself is strongly
consistent, so Consistent View is no longer
needed for read-after-write consistency
11. EMR Scaling
EMR Automatic Scaling:
• The old way of doing it
• Custom scaling rules based on CloudWatch
metrics
• Supports instance groups only.
EMR Managed Scaling:
• Supports instance groups and instance fleets
• Scales spot, on-demand, and instances in a
Savings Plan within the same cluster
• Available for Spark, Hive, and YARN workloads
Scale-up Strategy
• First adds core nodes, then task nodes,
up to the maximum units specified
Scale-down Strategy
• First removes task nodes, then core
nodes, no further than minimum
constraints
Spot nodes are always removed before On-Demand
instances
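The scale-up and scale-down ordering can be illustrated with a toy model. The policy dictionary below follows the shape of the ManagedScalingPolicy argument of boto3's put_managed_scaling_policy call. The two planning functions are hypothetical helpers written for this sketch: they only mimic the ordering described on the slide, ignore the master node and the Spot/On-Demand split, and are not how EMR decides capacity internally.

```python
def managed_scaling_policy(min_units, max_units, max_core_units,
                           max_on_demand_units):
    """Build a ManagedScalingPolicy-shaped dict (limits are assumptions)."""
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            "MaximumCoreCapacityUnits": max_core_units,
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
        }
    }


def plan_scale_up(core, task, extra, policy):
    """Toy model: fill core capacity first, then task, up to the maximum."""
    limits = policy["ComputeLimits"]
    room = max(0, limits["MaximumCapacityUnits"] - (core + task))
    extra = min(extra, room)
    add_core = min(extra, max(0, limits["MaximumCoreCapacityUnits"] - core))
    add_task = extra - add_core
    return core + add_core, task + add_task


def plan_scale_down(core, task, remove, policy):
    """Toy model: remove task nodes first, then core, down to the minimum."""
    limits = policy["ComputeLimits"]
    removable = max(0, (core + task) - limits["MinimumCapacityUnits"])
    remove = min(remove, removable)
    drop_task = min(remove, task)
    drop_core = remove - drop_task
    return core - drop_core, task - drop_task


policy = managed_scaling_policy(min_units=2, max_units=10,
                                max_core_units=4, max_on_demand_units=10)
print(plan_scale_up(2, 0, 5, policy))    # (4, 3)
print(plan_scale_down(4, 3, 6, policy))  # (2, 0)
```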
12. EMR Security
• EMRFS
• S3 encryption (SSE or CSE) at rest
• TLS in transit between EMR nodes and S3
• S3
• SSE-S3, SSE-KMS
• Local disk encryption
• Spark communication between drivers &
executors is encrypted
• Hive communication between the Glue metastore
and EMR uses TLS
• Force HTTPS (TLS) in S3 bucket policies with the
aws:SecureTransport condition key.
• IAM roles and policies.
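As an illustration of the aws:SecureTransport point, the following sketch builds a bucket policy that denies any request not sent over TLS. The bucket name is a placeholder and the statement ID is an arbitrary label; the policy is only printed, not applied.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Return an S3 bucket policy that rejects non-HTTPS requests."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # arbitrary label
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # The condition key is "false" when the request is not TLS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-emr-data-bucket" is a placeholder bucket name.
policy_doc = deny_insecure_transport_policy("example-emr-data-bucket")
print(json.dumps(policy_doc, indent=2))
```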
13. EMR Pricing
• Amazon EMR on Amazon EC2:
• The Amazon EMR price is added to the Amazon EC2 price (the
price for the underlying servers) and Amazon Elastic Block
Store (Amazon EBS) price (if attaching Amazon EBS volumes).
All of these are billed per second, with a one-minute minimum.
• Amazon EMR on Amazon EKS:
• The Amazon EMR price is added to the Amazon EKS pricing or
any other services used with EKS. You can run EKS on AWS
using either EC2 or AWS Fargate.
• Amazon EMR Serverless:
• With EMR Serverless, there are no upfront costs, and you pay
for only the resources you use. You pay for vCPU, memory, and
storage resources consumed by your applications.
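The EMR on EC2 pricing rule (EMR price added to the EC2 price, billed per second with a one-minute minimum) can be worked through as simple arithmetic. The hourly rates below are made-up numbers for illustration only; real rates depend on instance type and region.

```python
def emr_on_ec2_cost(runtime_seconds, instance_count,
                    ec2_rate_per_hour, emr_rate_per_hour,
                    ebs_rate_per_hour=0.0):
    """Estimate cost in dollars for a cluster of identical instances.

    All rates are hypothetical inputs. ebs_rate_per_hour is an assumed
    hourly equivalent for any attached EBS volumes.
    """
    billable_seconds = max(runtime_seconds, 60)  # one-minute minimum
    hourly = ec2_rate_per_hour + emr_rate_per_hour + ebs_rate_per_hour
    return instance_count * hourly * billable_seconds / 3600


# A 30-second run is billed as 60 seconds.
print(round(emr_on_ec2_cost(30, 1, 0.20, 0.05), 6))
# Three instances for one hour at a combined 0.25 per hour each.
print(round(emr_on_ec2_cost(3600, 3, 0.20, 0.05), 6))
```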
20. Spark Memory Allocation
• Storage Memory:
• It’s mainly used to store Spark cache data, such as RDD cache, Unroll data, and so on.
• Execution Memory:
• It’s mainly used to store temporary data in the calculation process of Shuffle, Join,
Sort, Aggregation, etc.
• User Memory:
• It’s mainly used to store the data needed for RDD conversion operations, such as the
information for RDD dependency.
• Reserved Memory:
• The memory is reserved for the system and is used to store Spark’s internal objects.
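How the four regions divide an executor heap can be sketched with a little arithmetic. This assumes Spark's unified memory model with its documented defaults (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5) and a fixed 300 MB reservation; the 4396 MB heap is an arbitrary example. The storage/execution split is a soft boundary in practice, since each side can borrow from the other, so treat the numbers as a starting estimate.

```python
RESERVED_MB = 300  # fixed reservation assumed for Spark internals


def spark_memory_regions(executor_heap_mb,
                         memory_fraction=0.6,
                         storage_fraction=0.5):
    """Split an executor heap into the four regions from the slide."""
    usable = executor_heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved memory")
    unified = usable * memory_fraction    # storage + execution together
    storage = unified * storage_fraction  # cached data
    execution = unified - storage         # shuffle, join, sort, aggregation
    user = usable - unified               # user data structures
    return {"reserved": RESERVED_MB, "storage": storage,
            "execution": execution, "user": user}


regions = spark_memory_regions(4396)
for name, mb in regions.items():
    print(f"{name}: {mb:.1f} MB")
```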
21. EMR Bootstrap
• Use a bootstrap action to install additional
software or customize the configuration of
cluster instances
• Bootstrap actions are scripts that run on
the cluster after Amazon EMR launches
the instance using the Amazon Linux
Amazon Machine Image (AMI).
• Bootstrap actions run before Amazon EMR
installs the applications that you specify
when you create the cluster and before
cluster nodes begin processing data.
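A bootstrap action is declared as a script location plus optional arguments. The sketch below builds an entry in the shape expected by the BootstrapActions argument of boto3's run_job_flow call. The helper function, the action name, the S3 path, and the package names are all hypothetical.

```python
def bootstrap_action(name, script_s3_path, args=None):
    """Build one BootstrapActions entry (helper is hypothetical)."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,    # script stored in S3
            "Args": list(args or []),  # passed to the script as arguments
        },
    }


# Placeholder script path and arguments for illustration.
actions = [
    bootstrap_action(
        "install-python-libs",
        "s3://example-bucket/bootstrap/install_libs.sh",
        ["pandas", "pyarrow"],
    )
]
print(actions)
```

The script itself would typically be a short shell script that installs packages or edits configuration on each node before the applications are installed.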
26. EMR Assignments
• Explore different file formats,
• CSV file format
• JSON file format
• Avro file format
• ORC file format
• Parquet file format.
• Explore different compressions,
• ZIP
• GZIP
• BZIP2
• Snappy
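A possible starting point for this exploration, using only the Python standard library: write the same synthetic records as CSV and as JSON lines, then compare raw and compressed sizes with GZIP and BZIP2. The records are made up for the sketch. Avro, ORC, Parquet, and Snappy need third-party libraries or Spark itself, so they are left out here.

```python
import bz2
import csv
import gzip
import io
import json

# Synthetic records, invented for this comparison.
records = [{"id": i, "name": f"user_{i}", "score": i % 10}
           for i in range(1000)]


def to_csv_bytes(rows):
    """Serialize rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def to_json_lines_bytes(rows):
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


payloads = {"csv": to_csv_bytes(records),
            "jsonl": to_json_lines_bytes(records)}

sizes = {}
for fmt, payload in payloads.items():
    sizes[fmt] = {
        "raw": len(payload),
        "gzip": len(gzip.compress(payload)),
        "bz2": len(bz2.compress(payload)),
    }

for fmt, row in sizes.items():
    print(fmt, row)
```

The same comparison can then be repeated on the cluster with Spark writers for the columnar formats.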
27. EMR Assignments
• Create an S3 bucket and configure a Lambda function as a trigger for every new object creation.
• The Lambda function should receive the event from S3 and submit a step to the EMR cluster with the required arguments.
• The EMR Spark application should read the file from S3 and add metadata columns such as the load
datetime.
• After the transformation, the output DataFrame should be stored in a target S3 bucket.
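One way the Lambda side of this assignment might look is sketched below. The cluster ID, script path, target path, and the --input/--output argument names are placeholders invented for the sketch. The handler takes an optional emr_client argument so it can be exercised locally with a stub; in Lambda it would fall back to a boto3 EMR client and call add_job_flow_steps.

```python
import urllib.parse

# Placeholders: replace with real values for an actual cluster.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SCRIPT_PATH = "s3://example-code-bucket/jobs/add_metadata.py"
TARGET_PATH = "s3://example-target-bucket/output/"


def build_spark_step(input_path):
    """Build one EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": f"process {input_path}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                SCRIPT_PATH,
                # Hypothetical arguments parsed by the Spark script.
                "--input", input_path,
                "--output", TARGET_PATH,
            ],
        },
    }


def lambda_handler(event, context, emr_client=None):
    """Submit one Spark step per object in the S3 event notification."""
    if emr_client is None:
        import boto3  # available in the Lambda Python runtime
        emr_client = boto3.client("emr")

    steps = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        steps.append(build_spark_step(f"s3://{bucket}/{key}"))

    response = emr_client.add_job_flow_steps(JobFlowId=CLUSTER_ID,
                                             Steps=steps)
    return response["StepIds"]
```

On the Spark side, the load datetime column could be added with withColumn and current_timestamp before writing to the target path.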
28. EMR Assignments
• Create a Spark Streaming application
with Kinesis as input.
• Perform real-time inserts, updates, and
deletes on an RDS database.
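The insert/update/delete part of this assignment can be prototyped locally before wiring up Kinesis and RDS. The sketch below uses an in-memory SQLite database as a stand-in for RDS and a made-up event shape ({"op": ..., "data": ...}); neither is part of the assignment itself. In a Spark Structured Streaming job, a function like apply_change might be called for each row inside foreachBatch, using a real database driver.

```python
import sqlite3


def apply_change(conn, event):
    """Apply one change event to the users table.

    The event shape is a hypothetical convention for this sketch.
    """
    op, row = event["op"], event["data"]
    if op in ("insert", "update"):
        # Upsert keyed on the primary key.
        conn.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (row["id"], row["name"]),
        )
    elif op == "delete":
        conn.execute("DELETE FROM users WHERE id = ?", (row["id"],))
    else:
        raise ValueError(f"unknown operation: {op}")


# SQLite in memory stands in for the RDS database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

batch = [
    {"op": "insert", "data": {"id": 1, "name": "alice"}},
    {"op": "insert", "data": {"id": 2, "name": "bob"}},
    {"op": "update", "data": {"id": 1, "name": "alicia"}},
    {"op": "delete", "data": {"id": 2}},
]
for event in batch:
    apply_change(conn, event)
conn.commit()

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'alicia')]
```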