Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters on EMR. In this webinar, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
6. EMRFS and HDFS for data management, with analytics languages on Amazon EMR; Amazon S3 and Amazon DynamoDB alongside. (Architecture diagram)
7. The same architecture with Amazon RDS added. (Architecture diagram)
8. The same architecture with Amazon Redshift and AWS Data Pipeline added. (Architecture diagram)
9. Amazon EMR Introduction
Launch clusters of any size in a matter of minutes
Use a variety of instance sizes that match your workload
Don’t get stuck with hardware
Don’t deal with capacity planning
Run multiple clusters with different sizes, specs and node types
11. Elastic MapReduce & Amazon S3
EMR has an optimised driver for Amazon S3
64 MB range-offset reads to increase performance
EMR Consistent View further increases performance and addresses consistency
S3 cost: $0.03/GB, with volume-based price tiering
12. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
15. Pattern #1: Transient Clusters
Cluster lives for the duration of the job
Shut down the cluster when the job is done
Data persists on Amazon S3
Input & output data on Amazon S3
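As a rough sketch of launching a transient cluster (the cluster name, AMI version, bucket, jar path and step arguments below are placeholders, and you may need additional options such as roles or a key pair), the AWS CLI can create a cluster that terminates itself once its steps finish:
aws emr create-cluster --name "nightly-etl" --ami-version 3.3.1 \
  --instance-type m1.xlarge --instance-count 5 --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=etl,Jar=s3://mybucket/jobs/etl.jar,Args=[s3://mybucket/input,s3://mybucket/output]
The --auto-terminate flag is what makes the cluster transient: input and output live on Amazon S3, so nothing is lost when the cluster shuts down.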
16. Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
• Cluster goes away when job is done
3. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
17. Alive Clusters
Very similar to traditional Hadoop deployments
Cluster stays around after the job is done
Data persistence models: Amazon S3; Amazon S3 copied to HDFS; HDFS with Amazon S3 as backup
18. Alive Clusters
Always keep data safe on Amazon S3 even if you’re using HDFS for primary storage
Get in the habit of shutting down your cluster and starting a new one, once a week or month
Design your data processing workflow to account for failure
You can use workflow management tools such as AWS Data Pipeline
20. Core Nodes
Run TaskTrackers (compute)
Run DataNode (HDFS)
(Diagram: master and core instance groups with HDFS, within an Amazon EMR cluster)
21. Core Nodes
Can add core nodes
More HDFS space
More CPU/memory
(Diagram: additional core nodes with HDFS join the core instance group)
22. Core Nodes
Can’t remove core nodes because of HDFS
(Diagram: master and core instance groups within the Amazon EMR cluster)
23. Amazon EMR Task Nodes
Run TaskTrackers
No HDFS
Reads from core node HDFS
(Diagram: a task instance group alongside the master and core instance groups within the Amazon EMR cluster)
24. Amazon EMR Task Nodes
Can add task nodes
(Diagram: additional nodes join the task instance group)
25. Amazon EMR Task Nodes
More CPU power
More memory
(Diagram: the enlarged task instance group within the Amazon EMR cluster)
26. Amazon EMR Task Nodes
You can remove task nodes when processing is completed
(Diagram: task nodes leave the task instance group)
27. Amazon EMR Task Nodes
(Diagram continued: the cluster after the task nodes are removed)
28. Task Node Use-Cases
Speed up job processing using the Spot market
Run task nodes on the Spot market
Get a discount on the hourly price
Nodes can come and go without interruption to your cluster
When you need extra horsepower for a short amount of time
Example: need to pull a large amount of data from Amazon S3
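For illustration only (the job flow ID, instance type, count and bid price are placeholders, and the flag names should be verified against the CLI version you use), task nodes can be added to a running cluster as a Spot instance group with the classic CLI used elsewhere in this deck:
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --add-instance-group task --instance-type m1.xlarge \
  --instance-count 10 --bid-price 0.10
The AWS CLI offers the equivalent aws emr add-instance-groups command. Because these are task nodes, losing them to the Spot market does not put HDFS data at risk.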
30. Option 1: Amazon S3 as HDFS
Use Amazon S3 as your permanent data store
HDFS for temporary storage of data between jobs
No additional step to copy data to HDFS
(Diagram: task and core instance groups in the Amazon EMR cluster reading from and writing to Amazon S3)
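As a minimal sketch (the bucket names and examples jar path are hypothetical), running a job directly against S3 is just a matter of using s3:// URIs for input and output; no copy step to HDFS is involved:
hadoop jar /home/hadoop/hadoop-examples.jar wordcount \
  s3://mybucket/input/ s3://mybucket/wordcount-output/
Hadoop streams the input from S3 to the mappers, using HDFS only as temporary space for intermediate data.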
31. Benefits: Amazon S3 as HDFS
Ability to shut down your cluster
HUGE Benefit!!
Use Amazon S3 as your durable storage
11 9s of durability
32. Benefits: Amazon S3 as HDFS
No need to scale HDFS
Capacity
Replication for durability
Amazon S3 scales with your data
Both in IOPS and data storage
33. Benefits: Amazon S3 as HDFS
Ability to share data between multiple clusters
Hard to do with HDFS
(Diagram: multiple EMR clusters sharing data through the same Amazon S3 bucket)
34. Benefits: Amazon S3 as HDFS
Take advantage of Amazon S3 features
Amazon S3 Server Side Encryption
Amazon S3 Lifecycle Policies
Amazon S3 versioning to protect against corruption
Build elastic clusters
Add nodes to read from Amazon S3
Remove nodes with data safe on Amazon S3
35. EMR Consistent View
Provides a ‘consistent view’ of data on S3 within a cluster
Ensures that all files created by a step are available to subsequent steps
Index of data from S3, managed by DynamoDB
Configurable retry and metastore
New Hadoop config file: emrfs-site.xml
fs.s3.consistent* system properties
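A rough sketch of the kind of settings involved, assuming the property names are verified against the EMR documentation for your AMI version (the table name shown is just a commonly used default):
emrfs-site.xml:
  fs.s3.consistent = true
  fs.s3.consistent.retryCount = 5
  fs.s3.consistent.retryPeriodSeconds = 10
  fs.s3.consistent.metadata.tableName = EmrFSMetadata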
37. EMR Consistent View
Manage data in EMRFS using the emrfs client:
emrfs
describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores
diff - show what in a bucket is missing from the index
delete - remove index entries
sync - ensure that the index is in sync with a bucket
import - import bucket items into the index
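For example (the bucket and prefix are placeholders), a typical check-and-repair sequence from the master node looks like:
emrfs diff s3://mybucket/input/        # show where the metadata and the bucket disagree
emrfs sync s3://mybucket/input/        # bring the metadata in line with the bucket contents
emrfs describe-metadata                # inspect the backing DynamoDB table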
38. What About Data Locality?
Run your job in the same region as your Amazon S3 bucket
Amazon EMR nodes have high-speed connectivity to Amazon S3
If your job is CPU/memory-bound, locality doesn’t make a huge difference
39. Amazon S3 provides near-linear scalability
(Chart: S3 streaming performance, GB/second vs. reader connections)
100 VMs: 9.6 GB/s, $26/hr
350 VMs: 28.7 GB/s, $90/hr
34 secs per terabyte
40. When HDFS is a Better Choice…
Iterative workloads
If you’re processing the same dataset more than once
Disk I/O intensive workloads
42. Option 2: Optimise for Latency with HDFS
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
43. Option 2: Optimise for Latency with HDFS
3. Start processing data on HDFS
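A sketch of that copy step, in the same style as the S3DistCp examples later in this deck (the job flow ID and bucket are placeholders):
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://mybucket/input,--dest,hdfs:///input'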
44. Benefits: HDFS instead of S3
Better pattern for I/O-intensive workloads
Amazon S3 as system of record
Durability
Scalability
Cost
Features: lifecycle policy, security
45. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
46. Amazon EMR Nodes and Size
Use m1.small instances for functional testing
Use xlarge or larger nodes for production workloads
Use CC2/C3 for memory- and CPU-intensive jobs
HS1, HI1, I2 instances for HDFS workloads
Prefer a smaller cluster of larger nodes
48. Instance Resource Allocation
• Hadoop 1 - static number of mappers/reducers configured for the cluster nodes
• Hadoop 2 - variable number of Hadoop applications based on file splits and available memory
• Useful to understand old vs. new sizing
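To make the contrast concrete, these are the kinds of properties involved (the values are illustrative, not recommendations): Hadoop 1 fixes slot counts per node, while Hadoop 2/YARN derives the number of concurrent containers from memory settings.
Hadoop 1 (mapred-site.xml):
  mapred.tasktracker.map.tasks.maximum    = 8
  mapred.tasktracker.reduce.tasks.maximum = 3
Hadoop 2 (yarn-site.xml / mapred-site.xml):
  yarn.nodemanager.resource.memory-mb = 12288
  mapreduce.map.memory.mb             = 1536
  mapreduce.reduce.memory.mb          = 3072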
50. Cluster Sizing Calculation
1. Estimate the number of tasks your job requires.
2. Pick an instance and note down the number of tasks it can run in parallel.
3. Pick some sample data files to run a test workload; the number of sample files should be the same as the number from step #2.
4. Run an Amazon EMR cluster with a single core node and process your sample files from step #3. Note down the amount of time taken to process your sample files.
51. Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)
52. Example: Cluster Sizing Calculation
1. Estimate the number of tasks your job requires
150
2. Pick an instance and note down the number of tasks it can run in parallel
m1.xlarge, with a task capacity of 8 per instance
53. Example: Cluster Sizing Calculation
3. Pick some sample data files to run a test workload; the number of sample files should be the same as the number from step #2.
8 files selected for our sample test
54. Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from step #3. Note down the amount of time taken to process your sample files.
3 min to process 8 files
55. Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks for your job * Time to process sample files) / (Per-instance task capacity * Desired processing time)
= (150 * 3 min) / (8 * 5 min) = 450 / 40 ≈ 11 m1.xlarge
56. File Best Practices
Avoid small files at all costs (smaller than 100 MB)
Use compression
58. Dealing with Small Files
Use S3DistCp to combine smaller files together
S3DistCp takes a pattern and a target file size and combines smaller input files into larger ones
./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
59. Compression
Always compress data files on Amazon S3
Reduces bandwidth between Amazon S3 and Amazon EMR
Speeds up your job
Compress task output
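As an illustration (the codec choices are just examples), final output and intermediate map output compression can be switched on with the standard Hadoop 2 properties; Hadoop 1 uses the older mapred.output.compress* and mapred.compress.map.output names:
  mapreduce.output.fileoutputformat.compress       = true
  mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.GzipCodec
  mapreduce.map.output.compress                    = true
  mapreduce.map.output.compress.codec              = org.apache.hadoop.io.compress.SnappyCodec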
60. Compression
Compression types:
Some are fast but offer less space reduction
Some are space-efficient but slower
Some are splittable and some are not

Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
GZIP      | 13%               | 21 MB/s        | 118 MB/s
LZO       | 20%               | 135 MB/s       | 410 MB/s
Snappy    | 22%               | 172 MB/s       | 409 MB/s
61. Changing Compression Type
You may decide to change compression type
Use S3DistCp to change the compression type of your files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'
62. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
63. M1/C1 Instance Families
Heavily used by EMR customers
However, HDFS utilisation is typically very low
M3/C3 offer better performance per dollar
66. Orc vs Parquet
File formats designed for SQL/data warehousing on Hadoop
Columnar file formats
Compress well
High row count, low cardinality
67. ORC File Format
Optimised Row Columnar format
Zlib or Snappy external compression
250 MB stripe of 1 column and index
Run-length or dictionary encoding
1 output file per container task
68. Parquet File Format
Gzip or Snappy external compression
Array data structures
Limited data type support for Hive
Batch creation
1 GB files
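As a small sketch (the table and column names are made up, and format availability depends on the Hive version on your AMI), creating the same table in both formats from the Hive CLI looks like:
hive -e "CREATE TABLE logs_orc (user_id STRING, ts BIGINT) STORED AS ORC;"
hive -e "CREATE TABLE logs_parquet (user_id STRING, ts BIGINT) STORED AS PARQUET;"
Which one compresses and queries better for your data is workload-dependent, hence the advice on the next slide: test, test, test.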
69. Orc vs Parquet
Depends on the Tool you are using
Consider Future Architecture & Requirements
Test Test Test
70. In Summary
• Practice cloud architecture with transient clusters
• Use S3 as the system of record for durability
• Use task nodes on Spot for increased performance and lower cost
• Move to new instance families for better performance per dollar
• Exciting developments around columnar file formats
Editor's Notes
EMR is a managed Hadoop offering that takes the burden of deploying and maintaining Hadoop clusters away from developers. EMR uses the Apache Hadoop MapReduce engine and integrates with a variety of different tools.
Transient clusters are the type of clusters that are only around for the duration of the job. Once the job is done, the cluster shuts down. This model of running Hadoop clusters is very different from traditional Hadoop deployments. With a traditional Hadoop deployment, the cluster stays up and running regardless of whether there are any jobs for it to process, mostly because the Hadoop cluster also hosts HDFS storage. With HDFS storage, you don’t have the luxury of shutting down the cluster. With transient EMR clusters, your data persists on S3, meaning that you don’t use HDFS to store your data. Once data is safe and secure on S3, you have the ability to shut down the cluster after your job is done, knowing that you won’t lose data.
There are many reasons you want to use EMR transient clusters.
Cost: Shutting down resources you don’t need is the best path towards optimizing your workload for cost efficiency. Don’t pay for what you’re not using. At AWS that’s all we talk about.
2) No maintenance. Obviously, if you’re shutting down the cluster, then you’re only maintaining the cluster for the duration of your job. This helps tremendously in reducing the headache of maintaining long-running clusters.
3) Practice cloud architecture best practices. It’ll also make you a better cloud practitioner. Again, you’re getting in the habit of paying for what you’re using, which is great. You’ll also get into the habit of thinking of your data processing as a workflow where resources come and go as needed.
I hear this a lot: EMR is only good for short-lived/transient clusters, so do I need to run my own Hadoop on EC2 if I need long-running clusters? That’s not true at all. EMR can be designed for longer-running clusters. You would still want to keep your data on S3 for durability; or you can copy data from S3 to HDFS first; or you can use HDFS as your primary storage and use S3 as the backup data store.
Notice that we don’t remove S3 from our design. It’s important to keep your data safe on S3 just in case you experience cluster failures. In fact, I want you to always plan for failure. Just because you’re running a long-running cluster doesn’t mean you won’t see failures. So architecting your data processing workflow to be able to deal with cluster failures is super important. One way to do that is to use a workflow management tool such as AWS Data Pipeline.
EMR has three node types: one master, which runs the NameNode and JobTracker, plus core nodes and task nodes, which are two different types of slave nodes. Let’s review the two different slave node types.
We’ll start with core nodes. Core nodes run TaskTracker and DataNode. Core nodes are very similar to traditional Hadoop slave nodes. They can process data with mappers and reducers and can also store data with HDFS via the DataNode.
However, once you add core nodes to your cluster, you can’t remove them later. That’s the only caveat with core nodes. And there’s a good reason for that: because core nodes hold HDFS data, removing nodes from the cluster can potentially cause data loss.
Task nodes are a bit different from what you usually see in traditional Hadoop deployments. Task nodes run the TaskTracker only. They don’t run the DataNode, which means no HDFS data is stored on task nodes.
Similar to core nodes, you can increase/expand your cluster’s TaskNode capacity by adding more nodes.
example
But unlike core nodes, TaskNodes can be removed from the cluster. And you can probably guess why. Because TaskNodes don’t hold any HDFS data. So you’re free to add/remove them at any given time.
example
Use Tasknodes to speed up your data processing using Spot market. Tasknodes are a great use-case for spot instances. Remember that Tasknodes can be added/removed easily. That ability gives you the peace of mind to use Tasknodes for spot market. And if at some point your spot instance gets taken away because the price went up too much, your cluster can withstand losing nodes.
In the next few slides, we’ll talk about data persistence models with EMR. The first pattern is S3 as HDFS. With this data persistence model, data gets stored on S3. HDFS does not play any role in storing data; as a matter of fact, HDFS is only there for temporary storage. Another common thing I hear is that storing data on S3 instead of HDFS slows my job down a lot because data has to get copied to HDFS/disk first before processing starts. That’s incorrect. If you tell Hadoop that your data is on S3, Hadoop reads directly from S3 and streams data to the mappers without touching the disk. To be completely correct, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more.
One of the biggest and most important benefits of using S3 instead of HDFS is the fact that you can shut down your cluster when your job is done, knowing that your data is safe on S3. That’s a huge plus! Can you imagine doing that with traditional Hadoop deployments? I haven’t come across anyone who could easily do that.
The other important benefit of S3 instead of HDFS is avoiding the HDFS capacity game. You’re dealing with big data problems, which means you have a ton of data coming in. The last thing you want to do is play the guessing game of how much HDFS space you need. With S3 you don’t have to play that game; S3 scales with your data both in terms of I/O and storage space. With HDFS you have to provision 3x the space you want to account for durability. With S3, the space you pay for includes replication and everything else behind the scenes to make your data durable.
Imagine you’re hosting data on HDFS and another team in your company asks to get access to your data. What would you do?
You can either copy data from HDFS to some other storage. Now if you’re dealing with a large amount of data, say 400 TB or 1 PB, this is going to be a nightmare.
Or you can give them access to your cluster to run job. But man it would suck if they run a large job and take over the entire cluster.
With data on S3, you can share data between multiple jobs in parallel without sacrificing storage or cluster resources. S3 can scale with as many jobs as you want it to.
And everything else comes free with S3: features such as SSE, lifecycle policies, etc. And again, keep in mind that S3 as the storage is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
With this pattern, you still store your data on S3 and use S3 as your primary storage, but for processing, data gets copied to HDFS first. Copying data to HDFS can be done with the S3DistCp tool provided by the EMR team. S3DistCp is very similar to the DistCp tool that comes with Hadoop for distributed copy jobs, i.e. copying data between clusters. However, S3DistCp was written with S3 in mind, meaning that it can perform much better than DistCp.
Use-case for this slide
Use-case for this slide
The benefit of this pattern is getting better I/O if we’re dealing with I/O-intensive workloads. Or, as mentioned previously, if data needs to be processed multiple times, copying data to HDFS first is a more optimized approach. And because we’re not using HDFS as the primary storage, we can still take advantage of S3 features.
Do not use smaller nodes for production workloads unless you’re 100% sure you know what you’re doing. The majority of jobs I’ve seen require more CPU and memory than the smaller instances have to offer, and most of the time that causes job failures if the cluster is not fine-tuned. Instead of spending time fine-tuning small nodes, get a larger node and run your workload with peace of mind. Anything m1.xlarge and larger is a good candidate: m1.xlarge, c1.xlarge, m2.4xlarge and all cluster compute instances are good choices.
This is my fav question: given this much data, how many nodes do I need?
Avoid small files at all costs. Small files can cause a lot of issues. I call them the termites of big data.
The reason small files are trouble is that each file, as discussed previously, eventually becomes a mapper, and each mapper is a Java JVM. Smaller files cause a Java JVM to get spawned up, but as soon as the mapper is up and running, the content of the file is processed in a short amount of time and the mapper goes away. That’s a waste of CPU and memory. Ideally you’d like your mapper to do as much processing as possible before shutting down.