AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practices (400)
Join this advanced technical session on Amazon Elastic MapReduce (EMR) for an introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, how you can take advantage of both long and short-lived clusters as well as other Amazon EMR architectural patterns. Learn how to scale your cluster up or down dynamically and about ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.

Transcript of "AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practices (400)"

  1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Amazon Elastic MapReduce: Deep Dive and Best Practices Ian Meyers, AWS (meyersi@) John Telford, Channel 4 (jtelford@) April 30, 2014
  2. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  3. What is EMR? Map-Reduce Engine. Vibrant Ecosystem. Hadoop-as-a-Service. Massively Parallel. Cost-Effective AWS Wrapper. Integrated with AWS services.
  4.–8. [Architecture build-up diagrams: Amazon EMR with HDFS at the core, progressively joined by Amazon S3 and Amazon DynamoDB for data management, analytics languages, Amazon RDS, Amazon Redshift, and AWS Data Pipeline.]
  9. Amazon EMR Introduction • Launch clusters of any size in a matter of minutes • Use a variety of different instance sizes that match your workload
  10. Amazon EMR Introduction • Don’t get stuck with hardware • Don’t deal with capacity planning • Run multiple clusters with different sizes, specs and node types
  11. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  12. Architecting for cost • EC2/EMR pricing models: – On-demand: pay-as-you-go model – Spot: marketplace; bid for instances and get a discount – Reserved Instance: upfront payment (for 1 or 3 years) for a reduction in the overall monthly payment
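As a rough illustration of how these three models trade off, the sketch below compares the cost of a 10-node cluster under each one. All rates here are hypothetical placeholders, not actual AWS prices, and the Spot discount varies constantly in practice:

```python
# Rough cost comparison of EC2/EMR pricing models for a 10-node cluster.
# All rates are hypothetical placeholders, NOT actual AWS prices.

HOURS_PER_MONTH = 730

def on_demand_cost(nodes, hours, rate=0.35):
    """Pay-as-you-go: pay the full hourly rate for every node-hour used."""
    return nodes * hours * rate

def spot_cost(nodes, hours, rate=0.35, discount=0.7):
    """Spot: assume the market clears at a 70% discount (varies widely)."""
    return nodes * hours * rate * (1 - discount)

def reserved_cost(nodes, months, upfront=600.0, hourly=0.12):
    """Reserved: upfront payment per node plus a reduced hourly rate,
    paid for the whole term whether the cluster is busy or not."""
    return nodes * (upfront + months * HOURS_PER_MONTH * hourly)

# A cluster that runs 8 hours a day, 30 days a month:
hours = 8 * 30
print(on_demand_cost(10, hours))   # full price, but only for used hours
print(spot_cost(10, hours))        # cheapest, but instances can be reclaimed
print(reserved_cost(10, 1))        # pays off only for always-on workloads
```

The crossover point depends entirely on utilisation, which is why the next slide maps each model to a workload type rather than picking a single winner.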
  13. Architecting for cost • On-demand – Research & Development, Data Science • Spot – Restartable Tasks – Embarrassingly Parallel Workloads • Reserved Instance – Well-Understood, Frequent and Predictable Workloads
  14. EMR Architecture for Optimal Cost Use Heavy Utilisation RIs for alive and long-running clusters
  15. EMR Architecture for Optimal Cost Use Medium Utilisation RIs for ad-hoc and unpredictable workloads
  16. EMR Architecture for Optimal Cost Supplement with Spot for unpredictable workloads or as a turbo boost
  17. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  18. Amazon EMR Design Patterns Pattern #1: Transient vs. Alive Clusters Pattern #2: Core Nodes and Task Nodes Pattern #3: Amazon S3 & HDFS
  19. Pattern #1: Transient vs. Alive Clusters
  20. Pattern #1: Transient Clusters • Cluster lives for the duration of the job • Shut down the cluster when the job is done • Data persists on Amazon S3 • Input & output data on Amazon S3
  21. Benefits of Transient Clusters 1. Control your cost 2. Minimum maintenance • Cluster goes away when the job is done 3. Practice cloud architecture • Pay for what you use • Data processing as a workflow
  22. Alive Clusters • Very similar to traditional Hadoop deployments • Cluster stays around after the job is done • Data persistence models: • Amazon S3 • Amazon S3 copy to HDFS • HDFS with Amazon S3 as backup
  23. Alive Clusters • Always keep data safe on Amazon S3 even if you’re using HDFS for primary storage • Get in the habit of shutting down your cluster and starting a new one, once a week or month • Design your data processing workflow to account for failure • You can use workflow management tools such as AWS Data Pipeline
  24. Pattern #2: Core & Task Nodes
  25. Core Nodes [Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS] Run TaskTrackers (compute) and DataNodes (HDFS)
  26. Core Nodes You can add core nodes: more HDFS space, more CPU/memory
  27. Core Nodes You can’t remove core nodes, because of HDFS
  28. Amazon EMR Task Nodes Run TaskTrackers; no HDFS; read from core-node HDFS [Diagram adds a task instance group alongside the master and core instance groups]
  29. Amazon EMR Task Nodes You can add task nodes
  30. Amazon EMR Task Nodes More CPU power, more memory
  31.–32. Amazon EMR Task Nodes You can remove task nodes when processing is completed
  33. Task Node Use-Cases • Speed up job processing using the Spot market – Run task nodes on the Spot market • Get a discount on the hourly price – Nodes can come and go without interruption to your cluster • When you need extra horsepower for a short amount of time – Example: need to pull a large amount of data from Amazon S3
  34. Pattern #3: Amazon S3 & HDFS
  35. Option 1: Amazon S3 as HDFS • Use Amazon S3 as your permanent data store • HDFS for temporary storage of data between jobs • No additional step to copy data to HDFS
  36. Benefits: Amazon S3 as HDFS • Ability to shut down your cluster (a HUGE benefit!) • Use Amazon S3 as your durable storage: 11 nines (99.999999999%) of durability
  37. Benefits: Amazon S3 as HDFS • No need to scale HDFS • Capacity • Replication for durability • Amazon S3 scales with your data • Both in IOPS and data storage
  38. Benefits: Amazon S3 as HDFS • Ability to share data between multiple clusters • Hard to do with HDFS
  39. Benefits: Amazon S3 as HDFS • Take advantage of Amazon S3 features • Amazon S3 Server-Side Encryption • Amazon S3 Lifecycle Policies • Amazon S3 versioning to protect against corruption • Build elastic clusters • Add nodes to read from Amazon S3 • Remove nodes with data safe on Amazon S3
  40. What About Data Locality? • Run your job in the same region as your Amazon S3 bucket • Amazon EMR nodes have high-speed connectivity to Amazon S3 • If your job is CPU/memory-bound, locality doesn’t make a huge difference
  41. Anti-Pattern: Amazon S3 as HDFS • Iterative workloads – If you’re processing the same dataset more than once • Disk I/O-intensive workloads
  42. Option 2: Optimise for Latency with HDFS 1. Data persisted on Amazon S3
  43. Option 2: Optimise for Latency with HDFS 2. Launch Amazon EMR and copy data to HDFS with S3DistCp
  44. Option 2: Optimise for Latency with HDFS 3. Start processing data on HDFS
  45. Benefits: HDFS instead of S3 • Better pattern for I/O-intensive workloads • Amazon S3 as the system of record • Durability • Scalability • Cost • Features: lifecycle policies, security
  46. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  47. Amazon EMR Nodes and Size • Use the m1 and c1 families for functional testing • Use m3 and c3 xlarge and larger nodes for production workloads • Use cc2/c3 for memory- and CPU-intensive jobs • hs1, hi1, i2 instances for HDFS workloads • Prefer a smaller cluster of larger nodes
  48. Holy Grail Question How many nodes do I need?
  49. Cluster Sizing Calculation 1. Estimate the number of mappers your job requires.
  50. Cluster Sizing Calculation 2. Pick an instance and note down the number of mappers it can run in parallel. m1.xlarge = 8 mappers in parallel
  51. Resource Capability / Instance Type

      EC2 Instance Type         Mappers  Reducers
      m1.small                  2        1
      m1.large                  3        1
      m1.xlarge                 8        3
      m2.xlarge                 3        1
      m2.2xlarge                6        2
      m2.4xlarge                14       4
      m3.xlarge                 6        1
      m3.2xlarge                12       3
      cc2.8xlarge               24       6
      c3.4xlarge                24       6
      hi1.4xlarge               24       6
      hs1.8xlarge               24       6
      cr1.8xlarge & c3.8xlarge  48       12
  52. Cluster Sizing Calculation 3. Pick some sample data files to run a test workload. The number of sample files should match the mapper count from step #2.
  53. Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
  54. Cluster Sizing Calculation Estimated Number of Nodes = (Total Mappers × Time to Process Sample Files) / (Per-Instance Mapper Capacity × Desired Processing Time)
  55. Example: Cluster Sizing Calculation 1. Estimate the number of mappers your job requires: 150 2. Pick an instance and note down the number of mappers it can run in parallel: m1.xlarge, with a capacity of 8 mappers per instance
  56. Example: Cluster Sizing Calculation 3. Pick sample data files matching the mapper count from step #2: 8 files selected for our sample test
  57. Example: Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3, noting the time taken: 3 min to process the 8 files
  58. Cluster Sizing Calculation Estimated number of nodes = (Total Mappers for Your Job × Time to Process Sample Files) / (Per-Instance Mapper Capacity × Desired Processing Time) = (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
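The arithmetic above can be wrapped in a small helper. This is a sketch of the slide's formula; the rounding choice is ours (the slides round 11.25 down to 11, and rounding up instead would size the cluster more conservatively):

```python
def estimate_nodes(total_mappers, sample_minutes,
                   mappers_per_instance, desired_minutes):
    """Estimated nodes = (total mappers * time to process sample files)
    / (per-instance mapper capacity * desired processing time)."""
    raw = (total_mappers * sample_minutes) / (mappers_per_instance * desired_minutes)
    return round(raw)

# Worked example from the slides: 150 mappers, 3 min to process the
# 8-file sample on one m1.xlarge (8 mappers), 5 min target run time.
print(estimate_nodes(150, 3, 8, 5))  # → 11 m1.xlarge nodes
```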
  59. File Best Practices • Avoid small files (smaller than 100 MB) at all costs • Use compression
  60. Holy Grail Question What if I have small-file issues?
  61. Dealing with Small Files • Use S3DistCp to combine smaller files together • S3DistCp takes a pattern and a target size and combines smaller input files into larger ones: ./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
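The capture group in the --groupBy pattern determines which input files get combined: files whose names yield the same captured text go into one output file. A quick way to sanity-check the pattern before running S3DistCp (the file names below are hypothetical CloudFront-style log names, not from the slides):

```python
import re

# The capture group in --groupBy is the grouping key: files whose names
# capture the same text are combined into a single larger output file.
pattern = re.compile(r'.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*')

# Hypothetical CloudFront-style log object names (date-hour in the name).
files = [
    'cf/XABCD12345678.2014-04-30-10.aaaa.gz',
    'cf/XABCD12345678.2014-04-30-10.bbbb.gz',
    'cf/XABCD12345678.2014-04-30-11.cccc.gz',
]

groups = {}
for name in files:
    m = pattern.match(name)
    if m:
        groups.setdefault(m.group(1), []).append(name)

print(groups)  # two groups: hour -10 (two files combined) and hour -11
```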
  62. Compression • Always compress data files on Amazon S3 • Reduces bandwidth between Amazon S3 and Amazon EMR • Speeds up your job • Compress mapper and reducer output
  63. Compression • Compression types: – Some are fast but offer less space reduction – Some are space-efficient but slower – Some are splittable and some are not

      Algorithm  % Space Remaining  Encoding Speed  Decoding Speed
      GZIP       13%                21 MB/s         118 MB/s
      LZO        20%                135 MB/s        410 MB/s
      Snappy     22%                172 MB/s        409 MB/s
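The bandwidth saving is easy to demonstrate. The sketch below gzips a repetitive, log-like payload in memory; the exact ratio on real clickstream data will differ from both this toy example and the table's figures:

```python
import gzip

# Compress a repetitive, log-like payload in memory and compare sizes.
# Real log data compresses differently; the table above gives rough figures.
sample = b'2014-04-30 10:00:01 GET /index.html 200 hit\n' * 10000
compressed = gzip.compress(sample)

ratio = len(compressed) / len(sample)
print(f'{len(sample)} -> {len(compressed)} bytes '
      f'({ratio:.1%} of original size)')
```

Every byte saved here is a byte that never crosses the wire between Amazon S3 and the cluster, which is where the job speed-up comes from.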
  64. In Summary • Practice cloud architecture with transient clusters • Utilise task nodes on Spot for increased performance and lower cost • Utilise S3 as the system of record for durability bit.ly/1n0hRSr
  65. John Telford, Enterprise Architect, Channel 4 @jtelford1 johntelforduk EMR at C4 1. Who we are. 2. What we’re doing with EMR. 3. Lessons learnt.
  66. Channel 4 • State-owned, public service broadcaster. • Self-funded, mostly by selling advertising (no TV licence fee money!) • Turnover £1B. • 800 employees. • Programmes supplied by 250 independent production companies.
  67. 12 Years A Slave
  68. C4 Virtuous Circle Ad Revenue (£s) = Impacts × Rate [Diagram: Brilliant Programmes → Oodles of Viewers → Massive Ad Revenue → Gigantic Programme Budget → back to Brilliant Programmes]
  69. C4 Viewer Insight Database • Clickstream & ad server behavioural data. • 10M registered viewers. • Viewer panel / survey & 3rd-party data. • Programme metadata. • 60 TB of S3 storage. Google “Channel 4 viewer promise”
  70. Expect to pre-process your data. We want our Data Scientists to enjoy a user-friendly, high-performance system, containing high-quality data. [Pipeline diagram: Acquire → Ingest raw data into AWS S3 storage (smoke test) → Decorate, row by row (drop columns, cleanse data, add flags, look up values) → Derive, multirow/multipass (dwell, last visit hit) → Embellish (segmentations, last activity, summary tables) → analytical outputs via Hive HQL queries.]
  71. Data profiling SELECT SUM (IF (visit_num REGEXP '^[0-9]+$', 0, 1)), SUM (IF (ip REGEXP '^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+$', 0, 1)), SUM (IF (page_url <> '', 0, 1)), COUNT (DISTINCT service) FROM raw_clickstream; Big Data requires Big Data Profiling.
  72. Partitioning CREATE EXTERNAL TABLE web_log ( hit_time_gmt BIGINT, cookie STRING -- and many more columns. ) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://bucket/'; ALTER TABLE web_log ADD PARTITION (month='2010-06') LOCATION '2010-06'; ALTER TABLE web_log ADD PARTITION (month='2010-07') LOCATION '2010-07'; -- etc. Help EMR go direct to the data it needs.
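Adding one partition per month by hand gets tedious over a long history; a small script can emit the ALTER TABLE statements instead. This is a sketch; the table name and per-month locations follow the slide's example:

```python
# Emit ALTER TABLE statements for a run of monthly partitions,
# matching the web_log table defined in the slide above.

def month_partitions(start_year, start_month, count):
    """Yield 'YYYY-MM' strings for `count` consecutive months."""
    year, month = start_year, start_month
    for _ in range(count):
        yield f'{year:04d}-{month:02d}'
        month += 1
        if month > 12:
            month, year = 1, year + 1

statements = [
    f"ALTER TABLE web_log ADD PARTITION (month='{m}') LOCATION '{m}';"
    for m in month_partitions(2010, 6, 4)
]
for s in statements:
    print(s)
```

Running the generated DDL once per ingest window keeps the partition list current, so Hive on EMR reads only the months a query actually touches.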
  73. Connecting data [Diagram: old approach, a single RDS instance serving all slave nodes; new approach, a Redis instance running alongside each slave node.]
  74. Handling large amounts of data • AWS Import/Export. – Consumer-grade USB drives… sent by courier. • AWS Direct Connect. – Dedicated network connection from your premises to AWS. – We have not completed our implementation. • Glacier.
  75. Choosing instances for EMR Source: https://aws.amazon.com/ec2/pricing/ Some instance types omitted from the diagram for clarity. Exchange rate: $1 = £0.61.
  76. Social engineering • Make the Data Scientists aware of EMR costs. • We give them visibility of clusters running, who started them, idle time, etc.
  77. John Telford, Enterprise Architect, Channel 4 @jtelford1 johntelforduk Thanks! YouTube: “Channel 4 Paralympics Meet the Superhumans”
  78. AWS Partner Trail Win a Kindle Fire • 10 in total • Get a code from our sponsors
  79. Please rate this session using the AWS Summits App and help us build better events
  80. #AWSSummit @AWScloud @AWS_UKI bit.ly/1n0hRSr
