SlideShare a Scribd company logo
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz
Senior Product Manager, Amazon EMR
May 20, 2015
Getting Started with
Amazon EMR
Easy, fast, secure, and cost-effective Hadoop on AWS.
Agenda
• Is Hadoop the answer?
• Amazon EMR 101
• Integration with AWS storage and database services
• Common Amazon EMR design patterns
• Q+A
When leveraging your data to derive new insights,
Big Data problems are everywhere
• Data lacks structure
• Analyzing streams of information
• Processing large datasets
• Warehousing large datasets
• Flexibility for undefined ad hoc analysis
• Speed of queries on large data sets
Hadoop is the right system for Big Data
• Massively parallel
• Scalable and fault tolerant
• Flexibility for multiple languages
and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Storage
S3, HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Pig, Hive, Cascading, Mahout, Giraph
HBase
Presto
Impala
Hadoop 2
Batch
MapReduce
Storage
S3, HDFS
Hadoop 1
Applications
Customers across many verticals
Amazon Elastic MapReduce (EMR) is the
easiest way to run Hadoop in the cloud.
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Manage firewalls
Flexible
Customize the cluster
Easy to Use
Launch a cluster in minutes
Easy to deploy
AWS Management Console AWS Command Line Interface
You can also use the Amazon EMR API with your favorite SDK
or use AWS Data Pipeline to start your clusters.
Try different configurations to find your optimal architecture.
CPU
c3 family
cc1.4xlarge
cc2.8xlarge
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m1 family
m3 family
Choose your instance types
Batch Machine Spark and Large
process learning interactive HDFS
Low Cost
Pay an hourly rate
Spot Instances
for task nodes
Up to 90%
off Amazon EC2
on-demand
pricing
On-demand for
core nodes
Standard
Amazon EC2
pricing for
on-demand
capacity
Mix on-demand and EC2 Spot capacity for low costs
Meet SLA at predictable cost Exceed SLA at lower cost
Use multiple EMR instance groups
Master Node
r3.2xlarge
Example Amazon
EMR Cluster
Slave Group - Core
c3.2xlarge
Slave Group – Task
m3.xlarge (EC2 Spot)
Slave Group – Task
m3.2xlarge (EC2 Spot)
Core nodes run HDFS
(DataNode). Task nodes do
not run HDFS. Core and
Task nodes each run YARN
(NodeManager).
Master node runs
NameNode (HDFS),
ResourceManager (YARN),
and serves as a gateway.
Elastic
Easily add or remove capacity
Easy to add and remove compute
capacity in your cluster from the console, CLI, or API.
Match compute
demands with
cluster sizing.
Resizable clusters
Use S3 instead of HDFS for your data layer to decouple
your compute capacity and storage
Amazon S3
Amazon EMR
Shut down your EMR
clusters when you
are not processing
data, and stop paying
for them!
Reliable
Spend less time monitoring
Easy to monitor and debug
Monitor with Amazon CloudWatch or Ganglia
Cluster, Node, and IO
Monitor Debug
EMR logging to S3 makes logs easily available
Secure
Integrates with AWS
security features
Use Identity and Access Management (IAM) roles with
your Amazon EMR cluster
• IAM roles give AWS services fine grained
control over delegating permissions to AWS
services and access to AWS resources
• EMR uses two IAM roles:
• EMR service role is for the Amazon EMR
control plane
• EC2 instance profile is for the actual
instances in the Amazon EMR cluster
• Default IAM roles can be easily created and
used from the AWS Console and AWS CLI
EMR Security Groups: default and custom
A security group is a virtual firewall which controls
access to the EC2 instances in your Amazon EMR
cluster
• There is a single default master and default
slave security group across all of your clusters
• The master security group has port 22 access
for SSHing to your cluster
You can add additional security groups to the master
and slave groups on a cluster to separate them from
the default master and slave security groups, and
further limit ingress and egress policies.
Slave
Security
Group
Master
Security
Group
Other Amazon EMR security features
EMRFS encryption options
• S3 server-side encryption
• S3 client-side encryption (use AWS Key Management Service keys or custom keys)
CloudTrail integration
• Track Amazon EMR API calls for auditing
Launch your Amazon EMR clusters in a VPC
• Logically isolated portion of the cloud (“Virtual Private Network”)
• Enhanced networking on certain instance types
Flexible
Customize the cluster
Hadoop applications available in EMR
Use Hive on EMR to interact with your data in HDFS
and Amazon S3
• Batch or ad hoc workloads
• Integration with EMRFS for better
performance reading and writing
to S3
• SQL-like query language to make
iterative queries easier
• Schema-on-read to query data
without needing pre-processing
• Use Tez engine for faster queries
Use Pig to easily create ETL workflows
• Uses high-level “Pig Latin” language to
easily script data transformations in
Hadoop
• Strong optimizer for workloads
• Options to create robust user defined
functions
Use HBase on a persistent EMR cluster as a noSQL
scalable database
• Billions of rows and millions
of columns
• Backup to and restore from
Amazon S3
• Flexible datatypes
• Modulate your HBase tables
when adding new data to
your system
Impala: a fast SQL query engine for EMR Clusters
• Low-latency SQL query engine for Hadoop
• Perfect for fast ad hoc, interactive queries on
structured on unstructured data
• Can be easily installed on an EMR cluster,
and queried from the CLI or a 3rd party BI tool
• Perfect for memory optimized instances
• Currently uses HDFS as data layer
Hadoop User Experience (Hue)
Query Editor
Hue
Job Browser
Hue
File Browser: Amazon S3 and the Hadoop Distributed File System (HDFS)
To install anything else, use Bootstrap Actions
https://github.com/awslabs/emr-bootstrap-actions
Spark: an alternative engine to Hadoop with its
own ecosystem of applications
• Does not use map-reduce framework
• In-memory for fast queries
• Great for machine learning or other
iterative queries
• Use Spark SQL to create a low-latency
data warehouse
• Spark Streaming for real-time
workloads
Also use Bootstrap Actions to configure your
applications
--bootstrap-action s3://elasticmapreduce/bootstrap-
actions/configure-hadoop
--keyword-config-file (Merge values in new config to existing)
--keyword-key-value (Override values provided)
Configuration File
Name
Configuration File
Keyword
File Name Shortcut
Key-Value Pair
Shortcut
core-site.xml core C c
hdfs-site.xml hdfs H h
mapred-site.xml mapred M m
yarn-site.xml yarn Y y
EMR Step API
• EMR step can be a map-
reduce job, Hive program, Pig
script, or even an arbitrary
script
• Easily submit Step from
console, CLI, or API
• Submit multiple steps to use
EMR as a sequential workflow
engine
Submit work via the EMR Step API or SSH to the
EMR master node
Connect to Master Node
• Connect to HUE, interact with
application CLIs, or submit
work directly to the Hadoop
APIs
• View the Hadoop UI
• Useful for long-running clusters
and interactive use cases
Let’s see it!
Quick tour of the EMR Console and HUE on an EMR
cluster
Diverse set of partners to use with Amazon EMR
BI / Visualization Business Intelligence BI / Visualization BI / Visualization
Hadoop Distribution Data Transfer Data Transformation
Monitoring Performance Tuning Graphical IDE Graphical IDE
Available on AWS Marketplace Available as a distribution in Amazon EMR
ETL Tool
BI / Visualization
Integration with AWS storage
and database services
Choose your data stores
Amazon S3 as your persistent data store
Amazon S3
• Designed for 99.999999999% durability
• Separate compute and storage
Resize and shut down Amazon EMR clusters
with no data loss
Point multiple Amazon EMR clusters at same
data in Amazon S3 using the EMR File
System (EMRFS)
EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Consistent view
• For consistent list and read-after-write for new puts
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
Amazon S3 EMRFS metadata
in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations
Number
of objects
Without
Consistent
Views
With Consistent
Views
1,000,000 147.72 29.70
100,000 12.70 3.69
Consistent view and fast listing using the optional
EMRFS metadata
*Tested using a single node cluster with a m3.xlarge instance.
EMRFS support for Amazon S3 client-side encryption
Amazon S3
AmazonS3encryptionclients
EMRFSenabledfor
AmazonS3client-sideencryption
Key vendor (AWS KMS or your custom key vendor)
(client-side encrypted objects)
Read data directly into Hive,
Apache Pig, and Hadoop
Streaming and Cascading from
Amazon Kinesis streams
No intermediate data
persistence required
Simple way to introduce real-time sources into
batch-oriented systems
Multi-application support and automatic
checkpointing
Amazon EMR Integration with Amazon Kinesis
Use Hive with EMR to query data DynamoDB
• Export data stored in DynamoDB to
Amazon S3
• Import data in Amazon S3 to
DynamoDB
• Query live DynamoDB data using SQL-
like statements (HiveQL)
• Join data stored in DynamoDB and
export it or query against the joined data
• Load DynamoDB data into HDFS and
use it in your EMR job
Use AWS Data Pipeline and EMR to transform
data and load into Amazon Redshift
Unstructured Data Processed Data
Pipeline orchestrated and scheduled by AWS Data Pipeline
Amazon EMR design patterns
Amazon EMR example #1: Batch processing
GBs of logs pushed
to Amazon S3 hourly
Daily Amazon EMR
cluster using Hive to
process data
Input and output
stored in Amazon S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Using Amazon S3 and HDFS
Data Sources
Transient EMR cluster
for batch map/reduce jobs
for daily reports
Long running EMR cluster
holding data in HDFS for
Hive interactive queries
Weekly Report
Ad-hoc Query
Data aggregated
and stored in
Amazon S3
Amazon Confidential
Multiple EMR workflows using the same S3
dataset
Computations
S3DistCp
CascalogLZO
Input Amazon
S3 bucket
Intermediate
Amazon S3
bucket
Final
Amazon S3
bucket
Final
Amazon S3
bucket
Final
Amazon S3
bucket
Crashlytics (part of Twitter) uses EMR to
analyze data in S3 to power dashboards
on its Answers platform.
Amazon EMR example #2: Long-running cluster
Data pushed to
Amazon S3
Daily Amazon EMR cluster
Extract, Transform, and Load
(ETL) data into database 24/7 Amazon EMR cluster
running HBase holds last 2
years’ worth of data
Front-end service uses
HBase cluster to power
dashboard with high
concurrency
Amazon EMR example #3: Interactive query
TBs of logs sent daily
Logs stored in
Amazon S3
Amazon EMR cluster using Presto for ad hoc
analysis of entire log set
Interactive query using Presto on multipetabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-
data-platform.html
EMR example #4: EMR for ETL and query engine for
investigations which require all raw data
TBs of logs sent
daily
Logs stored in S3
Hourly EMR cluster
using Spark for ETL
Load subset into
Redshift DW
Transient EMR cluster using Spark for ad hoc
analysis of entire log set
Client/Sensor Recording
Service
Aggregator/
Sequencer
Continuous
Processor
Data Warehouse Analytics and
Reporting
EMR Example #5: Streaming Data
Client/Sensor Recording Service Aggregator/
Sequencer
Continuous
Processor
Data Warehouse Analytics and
Reporting
Kafka
Common Tools
Amazon Kinesis
Streaming Data Repository
Amazon Kinesis
Client/ Sensor Recording Service Aggregator/
Sequencer
Continuous
Processor for
Dashboard
Data Warehouse Analytics and
Reporting
Amazon Kinesis Amazon EMR
Streaming Data RepositoryLogging Data Processing
Log4J
Amazon Kinesis + Amazon EMR = Fewer
Moving Parts
Processed
output in real-time
and batch
workflows
Input
push with Log 4J to
Hive
Pig
Cascading
pull from
Spark
Amazon EMR
Amazon Kinesis
Customer Application
Amazon DynamoDB
Real-time processing with Spark Streaming and batch
workloads on Kinesis streams with the Hadoop stack
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new
customers about the AWS platform, best practices and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• @ McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
CTA Script
- If you are interested in learning more about how to navigate the cloud to grow
your business - then attend the AWS Summit Chicago, July 1st.
- Register today to learn from technical sessions led by AWS engineers, hear best
practices from AWS customers and partners, and participate in some of the 30+
paid sessions and labs.
- Simply go to
https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootc
amps&trk=Webinar_slide
to register today.
- Registration is FREE.
TRACKING CODE:
- Listed above.
Thank you!
www.aws.amazon.com/elasticmapreduce
www.blogs.aws.amazon.com/bigdata
jonfritz@amazon.com

More Related Content

What's hot

Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Web Services
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
 
AWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design PatternsAWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design Patterns
Amazon Web Services
 
AWS Cloud Watch
AWS Cloud WatchAWS Cloud Watch
AWS Cloud Watch
zekeLabs Technologies
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
Amazon Web Services Korea
 
AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...
AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...
AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...
Amazon Web Services Korea
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
Amazon Web Services
 
Amazon simple storage service (amazon s3)
Amazon simple storage service (amazon s3)Amazon simple storage service (amazon s3)
Amazon simple storage service (amazon s3)
Faisal Ahmed Farooqui
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
AWS Monitoring & Logging
AWS Monitoring & LoggingAWS Monitoring & Logging
AWS Monitoring & Logging
Jason Poley
 
AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...
AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...
AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...
Simplilearn
 
ABCs of AWS: S3
ABCs of AWS: S3ABCs of AWS: S3
ABCs of AWS: S3
Mark Cohen
 
A Serverless Data Pipeline
A Serverless Data PipelineA Serverless Data Pipeline
A Serverless Data Pipeline
Amazon Web Services
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Amazon Web Services
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
Amazon Web Services
 
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Web Services Korea
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
Scott Leberknight
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
Julian Kleinhans
 

What's hot (20)

Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
AWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design PatternsAWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design Patterns
 
AWS Cloud Watch
AWS Cloud WatchAWS Cloud Watch
AWS Cloud Watch
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...
AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...
AWS로 데이터 마이그레이션을 위한 방안과 옵션 - 박성훈 스토리지 스페셜리스트 테크니컬 어카운트 매니저, AWS :: AWS Summit...
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Amazon simple storage service (amazon s3)
Amazon simple storage service (amazon s3)Amazon simple storage service (amazon s3)
Amazon simple storage service (amazon s3)
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
AWS Monitoring & Logging
AWS Monitoring & LoggingAWS Monitoring & Logging
AWS Monitoring & Logging
 
AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...
AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...
AWS Lambda Tutorial For Beginners | What is AWS Lambda? | AWS Tutorial For Be...
 
ABCs of AWS: S3
ABCs of AWS: S3ABCs of AWS: S3
ABCs of AWS: S3
 
A Serverless Data Pipeline
A Serverless Data PipelineA Serverless Data Pipeline
A Serverless Data Pipeline
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
 

Similar to AWS May Webinar Series - Getting Started with Amazon EMR

(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
Julien SIMON
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Amazon Web Services
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
Amazon Web Services
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
Amazon Web Services Korea
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
Amazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
Amazon Web Services
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
Amazon Web Services
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
Israel AWS User Group
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
Amazon Web Services
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
Amazon Web Services
 

Similar to AWS May Webinar Series - Getting Started with Amazon EMR (20)

(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

AWS May Webinar Series - Getting Started with Amazon EMR

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz Senior Product Manager, Amazon EMR May 20, 2015 Getting Started with Amazon EMR Easy, fast, secure, and cost-effective Hadoop on AWS.
  • 2. Agenda • Is Hadoop the answer? • Amazon EMR 101 • Integration with AWS storage and database services • Common Amazon EMR design patterns • Q+A
  • 3. When leveraging your data to derive new insights, Big Data problems are everywhere • Data lacks structure • Analyzing streams of information • Processing large datasets • Warehousing large datasets • Flexibility for undefined ad hoc analysis • Speed of queries on large data sets
  • 4. Hadoop is the right system for Big Data • Massively parallel • Scalable and fault tolerant • Flexibility for multiple languages and data formats • Open source • Ecosystem of tools • Batch and real-time analytics
  • 5. Storage S3, HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Pig, Hive, Cascading, Mahout, Giraph HBase Presto Impala Hadoop 2 Batch MapReduce Storage S3, HDFS Hadoop 1 Applications
  • 7. Amazon Elastic MapReduce (EMR) is the easiest way to run Hadoop in the cloud.
  • 8. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Customize the cluster
  • 9. Easy to Use Launch a cluster in minutes
  • 10. Easy to deploy AWS Management Console AWS Command Line Interface You can also use the Amazon EMR API with your favorite SDK or use AWS Data Pipeline to start your clusters.
  • 11. Try different configurations to find your optimal architecture. CPU c3 family cc1.4xlarge cc2.8xlarge Memory m2 family r3 family Disk/IO d2 family i2 family General m1 family m3 family Choose your instance types Batch Machine Spark and Large process learning interactive HDFS
  • 12. Low Cost Pay an hourly rate
  • 13. Spot Instances for task nodes Up to 90% off Amazon EC2 on-demand pricing On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Mix on-demand and EC2 Spot capacity for low costs Meet SLA at predictable cost Exceed SLA at lower cost
  • 14. Use multiple EMR instance groups Master Node r3.2xlarge Example Amazon EMR Cluster Slave Group - Core c3.2xlarge Slave Group – Task m3.xlarge (EC2 Spot) Slave Group – Task m3.2xlarge (EC2 Spot) Core nodes run HDFS (DataNode). Task nodes do not run HDFS. Core and Task nodes each run YARN (NodeManager). Master node runs NameNode (HDFS), ResourceManager (YARN), and serves as a gateway.
  • 15. Elastic Easily add or remove capacity
  • 16. Easy to add and remove compute capacity in your cluster from the console, CLI, or API. Match compute demands with cluster sizing. Resizable clusters
  • 17. Use S3 instead of HDFS for your data layer to decouple your compute capacity and storage Amazon S3 Amazon EMR Shut down your EMR clusters when you are not processing data, and stop paying for them!
  • 19. Easy to monitor and debug Monitor with Amazon CloudWatch or Ganglia Cluster, Node, and IO Monitor Debug
  • 20. EMR logging to S3 makes logs easily available
  • 22. Use Identity and Access Management (IAM) roles with your Amazon EMR cluster • IAM roles give AWS services fine grained control over delegating permissions to AWS services and access to AWS resources • EMR uses two IAM roles: • EMR service role is for the Amazon EMR control plane • EC2 instance profile is for the actual instances in the Amazon EMR cluster • Default IAM roles can be easily created and used from the AWS Console and AWS CLI
  • 23. EMR Security Groups: default and custom A security group is a virtual firewall which controls access to the EC2 instances in your Amazon EMR cluster • There is a single default master and default slave security group across all of your clusters • The master security group has port 22 access for SSHing to your cluster You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and further limit ingress and egress policies. Slave Security Group Master Security Group
  • 24. Other Amazon EMR security features EMRFS encryption options • S3 server-side encryption • S3 client-side encryption (use AWS Key Management Service keys or custom keys) CloudTrail integration • Track Amazon EMR API calls for auditing Launch your Amazon EMR clusters in a VPC • Logically isolated portion of the cloud (“Virtual Private Network”) • Enhanced networking on certain instance types
  • 27. Use Hive on EMR to interact with your data in HDFS and Amazon S3 • Batch or ad hoc workloads • Integration with EMRFS for better performance reading and writing to S3 • SQL-like query language to make iterative queries easier • Schema-on-read to query data without needing pre-processing • Use Tez engine for faster queries
  • 28. Use Pig to easily create ETL workflows • Uses high-level “Pig Latin” language to easily script data transformations in Hadoop • Strong optimizer for workloads • Options to create robust user defined functions
  • 29. Use HBase on a persistent EMR cluster as a noSQL scalable database • Billions of rows and millions of columns • Backup to and restore from Amazon S3 • Flexible datatypes • Modulate your HBase tables when adding new data to your system
  • 30. Impala: a fast SQL query engine for EMR Clusters • Low-latency SQL query engine for Hadoop • Perfect for fast ad hoc, interactive queries on structured on unstructured data • Can be easily installed on an EMR cluster, and queried from the CLI or a 3rd party BI tool • Perfect for memory optimized instances • Currently uses HDFS as data layer
  • 31. Hadoop User Experience (Hue) Query Editor
  • 33. Hue File Browser: Amazon S3 and the Hadoop Distributed File System (HDFS)
  • 34. To install anything else, use Bootstrap Actions https://github.com/awslabs/emr-bootstrap-actions
  • 35. Spark: an alternative engine to Hadoop with its own ecosystem of applications • Does not use map-reduce framework • In-memory for fast queries • Great for machine learning or other iterative queries • Use Spark SQL to create a low-latency data warehouse • Spark Streaming for real-time workloads
  • 36. Also use Bootstrap Actions to configure your applications --bootstrap-action s3://elasticmapreduce/bootstrap- actions/configure-hadoop --keyword-config-file (Merge values in new config to existing) --keyword-key-value (Override values provided) Configuration File Name Configuration File Keyword File Name Shortcut Key-Value Pair Shortcut core-site.xml core C c hdfs-site.xml hdfs H h mapred-site.xml mapred M m yarn-site.xml yarn Y y
  • 37. EMR Step API • EMR step can be a map- reduce job, Hive program, Pig script, or even an arbitrary script • Easily submit Step from console, CLI, or API • Submit multiple steps to use EMR as a sequential workflow engine Submit work via the EMR Step API or SSH to the EMR master node Connect to Master Node • Connect to HUE, interact with application CLIs, or submit work directly to the Hadoop APIs • View the Hadoop UI • Useful for long-running clusters and interactive use cases
  • 38. Let’s see it! Quick tour of the EMR Console and HUE on an EMR cluster
  • 39. Diverse set of partners to use with Amazon EMR BI / Visualization Business Intelligence BI / Visualization BI / Visualization Hadoop Distribution Data Transfer Data Transformation Monitoring Performance Tuning Graphical IDE Graphical IDE Available on AWS Marketplace Available as a distribution in Amazon EMR ETL Tool BI / Visualization
  • 40. Integration with AWS storage and database services
  • 42. Amazon S3 as your persistent data store Amazon S3 • Designed for 99.999999999% durability • Separate compute and storage Resize and shut down Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)
  • 43. EMRFS makes it easier to leverage Amazon S3 Better performance and error handling options Transparent to applications – just read/write to “s3://” Consistent view • For consistent list and read-after-write for new puts Support for Amazon S3 server-side and client-side encryption Faster listing using EMRFS metadata
  • 44. Amazon S3 EMRFS metadata in Amazon DynamoDB • List and read-after-write consistency • Faster list operations Number of objects Without Consistent Views With Consistent Views 1,000,000 147.72 29.70 100,000 12.70 3.69 Consistent view and fast listing using the optional EMRFS metadata *Tested using a single node cluster with a m3.xlarge instance.
  • 45. EMRFS support for Amazon S3 client-side encryption Amazon S3 AmazonS3encryptionclients EMRFSenabledfor AmazonS3client-sideencryption Key vendor (AWS KMS or your custom key vendor) (client-side encrypted objects)
  • 46. Read data directly into Hive, Apache Pig, and Hadoop Streaming and Cascading from Amazon Kinesis streams No intermediate data persistence required Simple way to introduce real-time sources into batch-oriented systems Multi-application support and automatic checkpointing Amazon EMR Integration with Amazon Kinesis
  • 47. Use Hive with EMR to query data DynamoDB • Export data stored in DynamoDB to Amazon S3 • Import data in Amazon S3 to DynamoDB • Query live DynamoDB data using SQL- like statements (HiveQL) • Join data stored in DynamoDB and export it or query against the joined data • Load DynamoDB data into HDFS and use it in your EMR job
  • 48. Use AWS Data Pipeline and EMR to transform data and load into Amazon Redshift Unstructured Data Processed Data Pipeline orchestrated and scheduled by AWS Data Pipeline
  • 49. Amazon EMR design patterns
  • 50. Amazon EMR example #1: Batch processing GBs of logs pushed to Amazon S3 hourly Daily Amazon EMR cluster using Hive to process data Input and output stored in Amazon S3 250 Amazon EMR jobs per day, processing 30 TB of data http://aws.amazon.com/solutions/case-studies/yelp/
  • 51. Using Amazon S3 and HDFS Data Sources Transient EMR cluster for batch map/reduce jobs for daily reports Long running EMR cluster holding data in HDFS for Hive interactive queries Weekly Report Ad-hoc Query Data aggregated and stored in Amazon S3 Amazon Confidential
  • 52. Multiple EMR workflows using the same S3 dataset Computations S3DistCp CascalogLZO Input Amazon S3 bucket Intermediate Amazon S3 bucket Final Amazon S3 bucket Final Amazon S3 bucket Final Amazon S3 bucket Crashlytics (part of Twitter) uses EMR to analyze data in S3 to power dashboards on its Answers platform.
  • 53. Amazon EMR example #2: Long-running cluster Data pushed to Amazon S3 Daily Amazon EMR cluster Extract, Transform, and Load (ETL) data into database 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data Front-end service uses HBase cluster to power dashboard with high concurrency
  • 54. Amazon EMR example #3: Interactive query TBs of logs sent daily Logs stored in Amazon S3 Amazon EMR cluster using Presto for ad hoc analysis of entire log set Interactive query using Presto on multipetabyte warehouse http://techblog.netflix.com/2014/10/using-presto-in-our-big- data-platform.html
  • 55. EMR example #4: EMR for ETL and query engine for investigations which require all raw data TBs of logs sent daily Logs stored in S3 Hourly EMR cluster using Spark for ETL Load subset into Redshift DW Transient EMR cluster using Spark for ad hoc analysis of entire log set
  • 57. Client/Sensor Recording Service Aggregator/ Sequencer Continuous Processor Data Warehouse Analytics and Reporting Kafka Common Tools
  • 58. Amazon Kinesis Streaming Data Repository Amazon Kinesis
  • 59. Client/ Sensor Recording Service Aggregator/ Sequencer Continuous Processor for Dashboard Data Warehouse Analytics and Reporting Amazon Kinesis Amazon EMR Streaming Data RepositoryLogging Data Processing Log4J Amazon Kinesis + Amazon EMR = Fewer Moving Parts
  • 60. Processed output in real-time and batch workflows Input push with Log 4J to Hive Pig Cascading pull from Spark Amazon EMR Amazon Kinesis Customer Application Amazon DynamoDB Real-time processing with Spark Streaming and batch workloads on Kinesis streams with the Hadoop stack
  • 61. AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services. Details • July 1, 2015 • Chicago, Illinois • @ McCormick Place Featuring • New product launches • 36+ sessions, labs, and bootcamps • Executive and partner networking Registration is now open • Come and see what AWS and the cloud can do for you.
  • 62. CTA Script - If you are interested in learning more about how to navigate the cloud to grow your business - then attend the AWS Summit Chicago, July 1st. - Register today to learn from technical sessions led by AWS engineers, hear best practices from AWS customers and partners, and participate in some of the 30+ paid sessions and labs. - Simply go to https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootc amps&trk=Webinar_slide to register today. - Registration is FREE. TRACKING CODE: - Listed above.