Scaling your analytics
with Amazon EMR
Rahul Pathak - Amazon EMR
Agenda
•  EMR: Hadoop on AWS
–  Elastic clusters tailored for your workflows
–  Minimize costs using Spot instances
–  Easy integration with your datastores
•  Leveraging the Hadoop Ecosystem on EMR
–  Batch & real-time
–  Data warehouse on Hadoop
•  A few examples
Thousands of EMR Customers; Over 15 Million Clusters Launched
Why Amazon EMR?
•  Managed services
•  Easy to tune clusters and trim costs by decoupling compute and storage
•  Support for multiple datastores
•  Unique features and ecosystem support
Create a managed Hadoop cluster in just a few clicks
and use easy monitoring and debugging tools
AWS Console, Command Line, or the EMR API
Choose your instance types
Try out different configurations to find your
optimal architecture.
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk / IO: hs1.8xlarge, i2 family
General: m1 family, m3 family
Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and
only pay for what you need.
Resizable clusters
Easy to add and remove compute
capacity on your cluster.
Match compute
demands with
cluster sizing.
Amazon Confidential
Easy to use Spot Instances
Spot for task nodes: up to 90% off EC2 on-demand pricing
On-demand for core nodes: standard EC2 pricing for on-demand capacity
Using Amazon S3 and HDFS
Data Sources
Transient EMR cluster for batch map/reduce jobs for daily reports
Long running EMR cluster holding data in HDFS for Hive interactive queries
Weekly Report
Ad-hoc Query
Data aggregated and stored in Amazon S3
Use the Hadoop Ecosystem
on EMR
Leverage a diverse set of tools to get the most out of your data.
•  Databases
•  Machine learning
•  Metadata stores
•  Exchange formats
•  Diverse query languages
Hadoop 2.x
and much more...
Use bootstrap actions to install whatever
applications you want on your EMR cluster
•  Presto
•  Spark
•  Phoenix
•  Any arbitrary application
HUE: a UI for Hadoop to easily query and browse
through your data
(beta available)
EMR example #1: EMR for processing
GB of logs pushed to S3 hourly
Daily EMR cluster using Hive to process data
Input and output stored in S3
EMR example #2: EMR as long-running database
Sales data pushed to S3
Logs stored in S3
Daily EMR cluster to ETL data into database
24/7 EMR cluster running HBase holds last 2 years of data
Front-end service uses HBase cluster to power dashboard with high concurrency
EMR example #3: EMR for ETL and query engine for investigations which require all raw data
TBs of logs sent daily
Logs stored in S3
Hourly EMR cluster using Spark for ETL
Load subset into Redshift DW
Transient EMR cluster using Spark for ad hoc analysis of entire log set
Leverage Amazon S3
Use S3 as your persistent data store
•  Use Amazon S3 as your persistent data store
•  Eleven 9s of durability (99.999999999%)
•  $0.03/GB/month
•  Lifecycle policies
•  Versioning
•  Access controls
•  Integration w/ Glacier (and other AWS services)
•  Resize and shut down EMR clusters with no data loss
•  Point multiple EMR clusters at same data in S3
•  Use HDFS for temporary storage of data between jobs
•  No additional step to copy data to HDFS
EMRFS makes it easier to leverage S3
•  Better read/write performance and error handling
than open source options (e.g. S3N)
•  Consistent view (new), for read-after-write consistency
•  Server-side encryption
•  Faster listing
•  Support for files > 5 GB
EMRFS anti-patterns
•  Iterative workloads
–  If you’re processing the same dataset more than once
•  Disk I/O intensive workloads
...but still use S3: persist data on S3 and use
s3distcp to copy to HDFS for processing
Real Time
EMR Integration with Kinesis
•  Read data directly into Hive, Pig, Streaming and Cascading from Kinesis streams
•  No intermediate data persistence required
•  Simple way to introduce real-time sources into batch-oriented systems
•  Multi-application support and automatic checkpointing
EMR Kinesis Integration: Hive
CREATE TABLE call_data_records (
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY
'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="MyTestStream");
Run Spark on EMR
•  Ideal for iterative workloads (e.g. machine learning)
•  Bootstrap action:
   aws emr create-cluster --name SparkCluster --ami-version 3.2 \
     --instance-type m3.xlarge --instance-count 3 \
     --service-role EMR_DefaultRole \
     --ec2-attributes KeyName=MYKEY,InstanceProfile=SparkRole \
     --applications Name=Hive \
     --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark
File size and compression
File Size Best Practices
•  Avoid small files at all costs
•  Anything smaller than 100MB
•  Each mapper is a single JVM
•  CPU time is required to spawn JVMs/mappers
•  Fewer files, matching closely to block size
== fewer calls to S3
== fewer network/HDFS requests
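To make the small-file cost concrete, here is a minimal Python sketch that estimates how many map tasks a job would spawn. It assumes a simplified rule: one map task per block-sized split, and at least one task per file. The function name and the file sizes are hypothetical, and real split calculation also depends on the input format and the compression codec.

```python
import math

MB = 1024 * 1024

def estimate_map_tasks(file_sizes, block_size=128 * MB):
    """Rough estimate: one map task per block-sized split, at least one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

# The same 10 GB of input, stored two different ways (illustrative sizes).
small_files = [1 * MB] * 10240   # 10,240 files of 1 MB each
large_files = [128 * MB] * 80    # 80 files of 128 MB each

small_tasks = estimate_map_tasks(small_files)
large_tasks = estimate_map_tasks(large_files)
print(f"1 MB files:   {small_tasks} map tasks")
print(f"128 MB files: {large_tasks} map tasks")
```

Under these assumptions the small-file layout needs over a hundred times as many JVM launches for the same volume of data.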
Dealing with Small Files
•  Reduce HDFS Block Size, e.g. 1MB (default is 128MB)
–  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
•  Better: use S3DistCP to combine smaller files together
–  S3DistCP takes a pattern and target path to combine smaller
input files to larger ones
–  Supply a target size and compression codec
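The grouping idea can be illustrated locally. The Python sketch below is an emulation, not S3DistCP itself: it assumes that files whose names yield the same regex capture-group values end up in the same combined output, and that non-matching files are skipped. The helper name, the object keys, and the pattern are all hypothetical.

```python
import re
from collections import defaultdict

def group_by_pattern(keys, pattern):
    """Group keys by the concatenated capture groups of `pattern`.

    Keys that do not match the whole pattern are skipped (an assumption
    about how a groupBy-style option treats non-matching files).
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.fullmatch(key)
        if match:
            groups["".join(match.groups())].append(key)
    return dict(groups)

# Hypothetical hourly log files; the capture group is the date.
keys = [
    "logs/2014-10-01-00.log",
    "logs/2014-10-01-01.log",
    "logs/2014-10-02-00.log",
    "logs/README.txt",
]
groups = group_by_pattern(keys, r".*/(\d{4}-\d{2}-\d{2})-\d{2}\.log")
for name, members in groups.items():
    print(name, members)
```

Here the hourly files collapse into one group per day, which is the kind of pattern that turns many small inputs into a few larger ones.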
S3DistCP Options
--src,LOCATION
--dest,LOCATION
--srcPattern,PATTERN
--groupBy,PATTERN
--targetSize,SIZE
--appendToLastFile
--outputCodec,CODEC
--s3ServerSideEncryption
--deleteOnSuccess
--disableMultipartUpload
--multipartUploadChunkSize,SIZE
--numberFiles
--startingIndex,INDEX
--outputManifest,FILENAME
--previousManifest,PATH
--requirePreviousManifest
--copyFromManifest
--s3Endpoint ENDPOINT
--storageClass CLASS
•  Most Important Options
•  --src
•  --srcPattern
•  --dest
•  --groupBy
•  --outputCodec
Compression
•  Always Compress Data Files On Amazon S3
•  Reduces Bandwidth Between Amazon S3 and Amazon EMR
•  Speeds Up Your Job
•  Compress mapper and reducer output
•  EMR compresses inter-node traffic with LZO on Hadoop 1 and with Snappy on Hadoop 2
Compression
•  Compression Types:
–  Some are fast BUT offer less space reduction
–  Some are space efficient BUT slower
–  Some are splittable and some are not
Algorithm        Splittable?   Compression ratio   Compress + decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
Compression
•  If you are time sensitive, faster compressions are a
better choice
•  If you have large amount of data, use space efficient
compressions
•  If you don’t care, use gzip
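One way to get a feel for the speed versus size trade-off is to measure it on a sample of your own data. The Python sketch below compares only gzip and bzip2, because those ship with the standard library; LZO and Snappy need third-party packages and are left out. The sample log line and the `measure` helper are hypothetical, and timings will vary by machine.

```python
import bz2
import gzip
import time

def measure(name, compress, decompress, data):
    """Compress `data`, check the round trip, and report ratio and time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    if decompress(packed) != data:
        raise ValueError(f"{name} round trip failed")
    return {"codec": name, "ratio": len(data) / len(packed), "seconds": elapsed}

# Hypothetical, highly repetitive log data; real data compresses less well.
sample = b"2014-10-01T00:00:00Z GET /index.html 200 1234\n" * 20000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for r in results:
    print(f"{r['codec']:6s} ratio={r['ratio']:.1f}x time={r['seconds']:.4f}s")
```

Running the same measurement on a representative slice of production data gives a better basis for choosing a codec than the general table alone.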
Change Compression Type
•  Use S3DistCP to change the compression types of your files
•  Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'
Bootstrap actions
EMR Bootstrap Actions
•  What are they?
–  Bash scripts run on every node prior to joining the
cluster
•  What can they do?
–  Anything
•  Really?
–  Yes
The Hadoop ecosystem runs in Amazon EMR
Optimizing for cost
Cost saving tips
•  Use S3 as your persistent data store (only pay for compute when
you need it!)
•  Use EC2 Spot instances (especially with Task nodes) to save 80%
or more on the EC2 cost
•  Use EC2 Reserved instances if you have steady workloads
•  Create CloudWatch alerts to notify you if a cluster is underutilized
so you can shut it down (e.g. Mappers Running == 0 for more than
N hours)
•  Contact your sales rep about custom pricing options if you are
spending more than $10K per month on EMR
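The Spot saving can be estimated with simple arithmetic. The Python sketch below compares an all on-demand cluster with one that keeps core nodes on-demand and runs task nodes on Spot. The hourly rate, node counts, and the 80% discount are hypothetical inputs chosen for illustration; it covers EC2 cost only and ignores the EMR surcharge and Spot price fluctuation.

```python
def cluster_cost(core_nodes, task_nodes, hours, on_demand_rate, spot_discount):
    """Return (all_on_demand_cost, mixed_cost) in dollars, EC2 only."""
    all_on_demand = (core_nodes + task_nodes) * hours * on_demand_rate
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    mixed = (core_nodes * hours * on_demand_rate
             + task_nodes * hours * spot_rate)
    return all_on_demand, mixed

# Hypothetical cluster: 5 core + 15 task nodes for 10 hours at $0.50/hour.
baseline, mixed = cluster_cost(
    core_nodes=5, task_nodes=15, hours=10,
    on_demand_rate=0.50, spot_discount=0.80,
)
savings = 1.0 - mixed / baseline
print(f"all on-demand: ${baseline:.2f}")
print(f"core on-demand + task Spot: ${mixed:.2f}")
print(f"savings: {savings:.0%}")
```

Because only the task nodes are discounted, the overall saving depends on how much of the cluster runs as task nodes.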
The Climate Corporation
150B soil observations
3M daily weather measurements
850K precision rainfall grids tracked
200 TB of data in S3
Per simulation:
10K unique scenarios generated
5 trillion data points
20 TB of data
5-6k node Hadoop cluster
Business Challenge
Expensive data storage (200 TB!)
Long data import times
Long data processing times
Expensive computing required (5 trillion data points!)
Hadoop cluster setup and management complexity (5-6k cluster nodes!)
The AWS Solution
AWS Import/Export to quickly migrate large amounts of data into S3
Amazon S3 for affordable, unlimited storage
Amazon Elastic MapReduce (EMR) for simplified Hadoop
Transient AWS compute resources
Amazon EC2 Spot Instances for additional capacity at big discounts
[Diagram: temporary EMR cluster (5,000 nodes), 20 TB, 10k scenarios; S3 (200 TB)]

Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaRTTS
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Thierry Lestable
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupCatarinaPereira64715
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...Product School
 

Recently uploaded (20)

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Scaling your analytics with Amazon EMR

  • 1. Scaling your analytics with Amazon EMR Rahul Pathak - Amazon EMR
  • 2. Agenda •  EMR: Hadoop on AWS –  Elastic clusters tailored for your workflows –  Minimize costs using Spot instances –  Easy integration with your datastores •  Leveraging the Hadoop Ecosystem on EMR –  Batch & real-time –  Data warehouse on Hadoop •  A few examples
  • 3. Thousands of EMR Customers; Over 15 Million Clusters Launched
  • 4. Why Amazon EMR? •  Managed services •  Easy to tune clusters and trim costs by dissociating compute and storage •  Support for multiple datastores •  Unique features and ecosystem support
  • 5. Create a managed Hadoop cluster in just a few clicks and use easy monitoring and debugging tools AWS Console, Command Line, or the EMR API
  • 6. Choose your instance types Try out different configurations to find your optimal architecture. CPU c3 family cc1.4xlarge cc2.8xlarge Memory m2 family r3 family Disk / IO hs1.8xlarge i2 family General m1 family m3 family
  • 7. Long running or transient clusters Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need. =
  • 8. Resizable clusters Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing. Amazon Confidential
  • 9. Easy to use Spot Instances •  Spot for task nodes: up to 90% off EC2 on-demand pricing •  On-demand for core nodes: standard EC2 pricing for on-demand capacity
  • 10. Using Amazon S3 and HDFS •  Data sources •  Transient EMR cluster for batch map/reduce jobs for daily reports •  Long-running EMR cluster holding data in HDFS for Hive interactive queries •  Weekly report •  Ad hoc query •  Data aggregated and stored in Amazon S3
  • 11. Use the Hadoop Ecosystem on EMR •  Leverage a diverse set of tools to get the most out of your data.
  • 12. •  Databases •  Machine learning •  Metadata stores •  Exchange formats •  Diverse query languages •  Hadoop 2.x •  and much more...
  • 13. Use bootstrap actions to install whatever applications you want on your EMR cluster •  Presto •  Spark •  Phoenix •  Any arbitrary application
  • 14. HUE: a UI for Hadoop to easily query and browse through your data (beta available)
  • 15. EMR example #1: EMR for processing •  GB of logs pushed to S3 hourly •  Daily EMR cluster using Hive to process data •  Input and output stored in S3
  • 16. EMR example #2: EMR as long-running database •  Sales data pushed to S3 •  Logs stored in S3 •  Daily EMR cluster ETL data into database •  24/7 EMR cluster running HBase holds last 2 years of data •  Front-end service uses HBase cluster to power dashboard with high concurrency
  • 17. EMR example #3: EMR for ETL and query engine for investigations which require all raw data •  TBs of logs sent daily •  Logs stored in S3 •  Hourly EMR cluster using Spark for ETL •  Load subset into Redshift DW •  Transient EMR cluster using Spark for ad hoc analysis of entire log set
  • 19. Use Amazon S3 as your persistent data store •  11 9's durability •  $0.03/GB/month •  Lifecycle policies •  Versioning •  Access controls •  Integration w/ Glacier (and other AWS services) •  Resize and shut down EMR clusters with no data loss •  Point multiple EMR clusters at the same data in S3 •  Use HDFS for temporary storage of data between jobs •  No additional step to copy data to HDFS
  • 20. EMRFS makes it easier to leverage S3 •  Better read/write performance and error handling than open source options (e.g. S3N) •  Consistent View NEW! (for consistent read after write) •  Server-side encryption •  Faster listing •  Support for files > 5 GB
  • 21. EMRFS anti-patterns •  Iterative workloads –  If you’re processing the same dataset more than once •  Disk I/O intensive workloads ...but still use S3: persist data on S3 and use s3distcp to copy to HDFS for processing
  • 23. EMR Integration with Kinesis •  Read data directly into Hive, Pig, Streaming and Cascading from Kinesis streams •  No intermediate data persistence required •  Simple way to introduce real-time sources into batch-oriented systems •  Multi-application support and automatic checkpointing
  • 24. EMR Kinesis Integration: Hive
      CREATE TABLE call_data_records (
        start_time bigint,
        end_time bigint,
        phone_number STRING,
        carrier STRING,
        recorded_duration bigint,
        calculated_duration bigint,
        lat double,
        long double
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ","
      STORED BY
      'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="MyTestStream");
  • 25. Run Spark on EMR •  Ideal for iterative workloads (e.g. machine learning) •  Bootstrap action:
      aws emr create-cluster --name SparkCluster --ami-version 3.2 \
        --instance-type m3.xlarge --instance-count 3 \
        --service-role EMR_DefaultRole \
        --ec2-attributes KeyName=MYKEY,InstanceProfile=SparkRole \
        --applications Name=Hive \
        --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark
  • 26. File size and compression
  • 27. File Size Best Practices •  Avoid small files at all costs •  Anything smaller than 100MB •  Each mapper is a single JVM •  CPU time is required to spawn JVMs/mappers •  Fewer files, matching closely to block size == fewer calls to S3 == fewer network/HDFS requests
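The cost of small files can be made concrete with a little arithmetic. The Python below is only a rough sketch: it assumes one mapper per input split, a 128 MB block size, and that every file gets at least one mapper. The function name and the 1 TB dataset are hypothetical and not taken from the slides.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size


def estimate_mappers(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Estimate the mapper count, assuming one mapper per block-sized
    split and at least one mapper for every file."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


# Hypothetical 1 TB dataset stored two different ways.
total_mb = 1024 * 1024
small_files = [1] * total_mb                # 1 MB files
large_files = [1024] * (total_mb // 1024)   # 1 GB files

print("mappers for 1 MB files:", estimate_mappers(small_files))
print("mappers for 1 GB files:", estimate_mappers(large_files))
```

Under these assumptions the same data needs over a hundred times as many mappers (and JVM start-ups) when it is stored as 1 MB files.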
  • 28. Dealing with Small Files •  Reduce HDFS Block Size, e.g. 1MB (default is 128MB) –  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576" •  Better: use S3DistCP to combine smaller files together –  S3DistCP takes a pattern and target path to combine smaller input files to larger ones –  Supply a target size and compression codec
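To illustrate what combining files by pattern means, here is a small Python simulation. It is not S3DistCp code: it assumes that files whose names produce the same captured text end up in the same combined output, which is how a grouping option such as S3DistCp's `--groupBy` is usually described. The regex, the object keys, and the function name are all hypothetical.

```python
import re
from collections import defaultdict


def group_small_files(keys, pattern):
    """Group object keys by the text captured by `pattern`.

    Keys that do not match are left out. This mimics a pattern-based
    combine step; it is an assumption about the behaviour, not the tool.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.match(key)
        if match:
            groups["".join(match.groups())].append(key)
    return dict(groups)


keys = [
    "logs/2014-06-01-00.log",
    "logs/2014-06-01-01.log",
    "logs/2014-06-02-00.log",
    "logs/README.txt",
]

# Hypothetical pattern: combine all hourly files that belong to one day.
daily = group_small_files(keys, r".*/(\d{4}-\d{2}-\d{2})-\d{2}\.log")
for day, members in sorted(daily.items()):
    print(day, len(members), "file(s)")
```

Each resulting group would correspond to one larger output file, which could then be capped at a target size and written with the chosen codec.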
  • 30. Compression •  Always Compress Data Files On Amazon S3 •  Reduces Bandwidth Between Amazon S3 and Amazon EMR •  Speeds Up Your Job •  Compress Mapper and Reducer Output •  EMR compresses inter-node traffic with LZO on Hadoop 1, and with Snappy on Hadoop 2
  • 31. Compression •  Compression Types: –  Some are fast BUT offer less space reduction –  Some are space efficient BUT slower –  Some are splittable and some are not
      Algorithm        Splittable?   Compression ratio   Compress + decompress speed
      Gzip (DEFLATE)   No            High                Medium
      bzip2            Yes           Very high           Slow
      LZO              Yes           Low                 Fast
      Snappy           No            Low                 Very fast
  • 32. Compression •  If you are time sensitive, faster compression codecs are a better choice •  If you have a large amount of data, use space-efficient compression codecs •  If you don't care, use gzip
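The speed versus space trade-off can be measured on your own data. The sketch below uses only Python's standard library, so it compares gzip and bzip2 only; LZO and Snappy need third-party packages and are left out. The sample data and the `measure` helper are hypothetical, and results on such a repetitive sample will not necessarily match real logs.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Return (name, compressed/original size ratio, round-trip seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return name, len(packed) / len(data), elapsed


# Hypothetical, highly repetitive log-like sample (a few MB).
sample = b"2014-06-01 12:00:00 GET /index.html 200 1234\n" * 80_000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for name, ratio, seconds in results:
    print(f"{name}: {ratio:.4f} of original size, {seconds:.3f}s round trip")
```

Running the same measurement on a representative slice of your input is a cheap way to pick a codec before changing a whole pipeline.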
  • 33. Change Compression Type •  Use S3DistCP to change the compression types of your files •  Example:
      ./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
        --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
        --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'
  • 35. EMR Bootstrap Actions •  What are they? –  Bash scripts run on every node prior to joining the cluster •  What can they do? –  Anything •  Really? –  Yes
  • 37. The Hadoop ecosystem runs in Amazon EMR
  • 39. Cost saving tips •  Use S3 as your persistent data store (only pay for compute when you need it!) •  Use EC2 Spot instances (especially with Task nodes) to save 80% or more on the EC2 cost •  Use EC2 Reserved instances if you have steady workloads •  Create CloudWatch alerts to notify you if a cluster is underutilized so you can shut it down (e.g. Mappers Running == 0 for more than N hours) •  Contact your sales rep about custom pricing options if you are spending more than $10K per month on EMR
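The Spot tip can be sanity-checked with simple arithmetic. This Python sketch assumes core nodes run on-demand and only task nodes receive the Spot discount. The hourly rate, node counts, and function name are hypothetical, and the EMR service charge that is billed on top of EC2 is deliberately left out.

```python
def cluster_cost(core_nodes, task_nodes, hours, on_demand_rate, spot_discount=0.0):
    """Estimate EC2 cost with on-demand core nodes and discounted task nodes."""
    core = core_nodes * hours * on_demand_rate
    task = task_nodes * hours * on_demand_rate * (1.0 - spot_discount)
    return core + task


RATE = 0.28  # hypothetical on-demand price in USD per instance-hour

# 2 core nodes and 8 task nodes running for 10 hours.
all_on_demand = cluster_cost(2, 8, 10, RATE)
with_spot = cluster_cost(2, 8, 10, RATE, spot_discount=0.80)
saving = 1 - with_spot / all_on_demand

print(f"all on-demand: ${all_on_demand:.2f}")
print(f"Spot task nodes: ${with_spot:.2f}")
print(f"overall EC2 saving: {saving:.0%}")
```

Because the core nodes stay on-demand, the overall saving is lower than the discount on the task nodes; the larger the share of task nodes, the closer the two numbers get.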
  • 40. The Climate Corporation •  150B Soil Observations •  3M Daily Weather Measurements •  200 TB of Data in S3 •  850K Precision Rainfall Grids Tracked
  • 41. Per Simulation: •  10K Unique Scenarios Generated •  5 Trillion Datapoints •  20 TB Data •  5-6k Node Hadoop Cluster
  • 42. Business Challenge •  Expensive data storage (200TB!) •  Long data import times •  Long data processing times •  Expensive computing required (5 trillion data points!) •  Hadoop cluster setup and management complexity (5-6k cluster nodes!)
  • 43. The AWS Solution •  AWS Import/Export to quickly migrate large amounts of data into S3 •  AWS S3 for affordable, unlimited storage •  AWS Elastic Map Reduce (EMR) for simplified Hadoop •  Transient AWS compute resources •  Leverage AWS EC2 Spot Instances for additional capacity at big discounts
  • 44. Temporary EMR Cluster (5,000 Nodes) •  20 TB •  10k Scenarios •  S3 (200 TB)