DATA PROCESSING WITH AMAZON ELASTIC MAPREDUCE,
AMAZON AWS USE CASES

Sergey Sverchkov
Project Manager
Altoros Systems
sergey.sverchkov@altoros.com
Skype: sergey.sverchkov
AMAZON EMR – SCALABLE DATA PROCESSING SERVICE

Amazon EMR service: Amazon EC2 + Amazon S3 + Apache Hadoop
- Cost-effective
- Automated
- Scalable
- Easy to use
MAPREDUCE

• Simple data-parallel programming model designed for
  scalability and fault-tolerance

• Pioneered by Google
   – Processes 20 petabytes of data per day


• Popularized by open-source Hadoop project
   – Used at Yahoo!, Facebook, Amazon, …
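To make the model concrete, here is a minimal word-count sketch in the Hadoop Streaming style that EMR runs later in this deck. It is an illustrative example, not the wordSplitter.py mapper shipped with the EMR samples (which pairs with the built-in "aggregate" reducer); the map phase emits a <word, 1> pair per word, and the reduce phase sums the counts for each word.

#!/usr/bin/env python
# mapper.py - map phase: read raw text from stdin, emit one "<word>\t1" pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t1' % word.lower())

#!/usr/bin/env python
# reducer.py - reduce phase: input arrives sorted by key, so all counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))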
AMAZON EC2 SERVICE

• Elastic – increase or decrease capacity within minutes, not hours or days
• Completely controlled – you have complete control of your instances
• Flexible – multiple instance types (CPU, memory, storage), operating systems, and software packages
• Reliable – 99.95% availability commitment for each Amazon EC2 Region
• Secure – numerous mechanisms for securing your compute resources
• Inexpensive – pay-as-you-go rates, with Reserved Instances and Spot Instances for further savings
• Easy to start
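As a rough illustration of "capacity within minutes", the sketch below starts and later terminates an instance with the boto 2.x Python library (not mentioned in the deck); the AMI ID, key pair name, and region are placeholders.

import boto.ec2

# Credentials are read from the environment (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) or ~/.boto
conn = boto.ec2.connect_to_region('us-west-1')

# Launch one instance; 'ami-12345678' and 'my-ec2-keypair' are placeholders, not values from this deck
reservation = conn.run_instances('ami-12345678',
                                 instance_type='m1.xlarge',
                                 key_name='my-ec2-keypair',
                                 min_count=1, max_count=1)
instance = reservation.instances[0]

# ... run the workload, then release the capacity
conn.terminate_instances(instance_ids=[instance.id])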
AMAZON S3 STORAGE

• Write, read, and delete objects containing from 1 byte to 5 terabytes of data each
• Objects are stored in buckets
• Authentication mechanisms to control access to data
• Options for secure data upload/download and encryption of data at rest
• Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
• Reduced Redundancy Storage (RRS) option for non-critical data at lower cost
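A minimal sketch of these object operations using the boto 2.x Python library; the bucket and file names below are placeholders. (s3cmd, used later in the deck, exposes the same operations from the command line.)

import boto

conn = boto.connect_s3()                                   # credentials from env vars or ~/.boto
bucket = conn.create_bucket('my-emr-input-bucket')         # placeholder bucket name

key = bucket.new_key('wordcount/input/0001')
key.set_contents_from_filename('local-input-0001.txt')     # write (upload); pass reduced_redundancy=True for RRS

key.get_contents_to_filename('downloaded-copy.txt')        # read (download)
key.delete()                                               # delete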
AMAZON EMR FEATURES

• Web-based interface and command-line tools for running Hadoop jobs on Amazon EC2
• Data stored in Amazon S3
• Monitors the job and shuts down the instances after use
• Small extra charge on top of EC2 pricing
• Significantly reduces the complexity of the time-consuming set-up, management, and tuning of Hadoop clusters
GETTING STARTED – SIGN UP

• Sign up for Amazon EMR / AWS at http://aws.amazon.com
• You also need to be signed up for Amazon S3 and Amazon EC2
• Locate and save your AWS credentials:
   – AWS Access Key ID
   – AWS Secret Access Key
   – EC2 Key Pair
• Optionally install on your desktop:
   – EMR command-line client
   – S3 command-line client (e.g. s3cmd)
GETTING STARTED – SECURITY, TOOLS
EMR JOB FLOW – BASIC STEPS

1. Upload input data to S3
2. Create a job flow by defining the Map and Reduce steps
3. Download output data from S3
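The same three steps, sketched with the boto 2.x EMR API (an assumption; the deck itself uses the elastic-mapreduce CLI). The test.emr.bucket name is taken from the word-count example later in the deck, while the local file name is a placeholder.

import boto
import boto.emr
from boto.emr.step import StreamingStep

# 1. Upload input data to S3
s3 = boto.connect_s3()
bucket = s3.get_bucket('test.emr.bucket')
bucket.new_key('wordcount/input/0001').set_contents_from_filename('input-0001.txt')

# 2. Create a job flow with one streaming Map/Reduce step
emr = boto.emr.connect_to_region('us-east-1')
step = StreamingStep(name='word count',
                     mapper='s3://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     reducer='aggregate',
                     input='s3n://test.emr.bucket/wordcount/input',
                     output='s3n://test.emr.bucket/wordcount/output')
jobflow_id = emr.run_jobflow(name='word count via boto',
                             log_uri='s3n://test.emr.bucket/logs',
                             steps=[step],
                             num_instances=1,
                             master_instance_type='m1.small',
                             slave_instance_type='m1.small')

# 3. When the job flow completes, download the output from S3 (see the output slide below)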
EMR WORD COUNT SAMPLE
WORD COUNT – INPUT DATA

• Word count input data size in the sample S3 bucket:
./s3cmd du s3://elasticmapreduce/samples/wordcount/input/
19105856   s3://elasticmapreduce/samples/wordcount/input/

• Word count input data files:
./s3cmd ls s3://elasticmapreduce/samples/wordcount/input/
2009-04-02 02:55   2392524   s3://elasticmapreduce/samples/wordcount/input/0001
2009-04-02 02:55   2396618   s3://elasticmapreduce/samples/wordcount/input/0002
2009-04-02 02:55   1593915   s3://elasticmapreduce/samples/wordcount/input/0003
2009-04-02 02:55   1720885   s3://elasticmapreduce/samples/wordcount/input/0004
2009-04-02 02:55   2216895   s3://elasticmapreduce/samples/wordcount/input/0005
EMR WORD COUNT SAMPLE
• Starting instances, bootstrapping, running job steps:
EMR WORD COUNT SAMPLE
• Start the word count sample job from the EMR command line:

$ ./elastic-mapreduce --create --name "word count commandline test" \
    --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://test.emr.bucket/wordcount/output2

• The output contains the job flow ID:

Created job flow j-317IN1TUMRQ5B
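A small sketch, assuming the boto 2.x Python library, of polling the job flow until it finishes, using the job flow ID printed above; the same information is visible in the AWS Management Console.

import time
import boto.emr

emr = boto.emr.connect_to_region('us-east-1')
jobflow_id = 'j-317IN1TUMRQ5B'   # the ID returned by "elastic-mapreduce --create" above

while True:
    state = emr.describe_jobflow(jobflow_id).state
    print('job flow %s is %s' % (jobflow_id, state))
    if state in ('COMPLETED', 'FAILED', 'TERMINATED'):
        break
    time.sleep(60)   # states move through STARTING, BOOTSTRAPPING, RUNNING, ... until completion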
WORD COUNT – OUTPUT DATA
• Locate and download output data in the specified output S3 bucket:
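For example, with the boto 2.x library the part-* files under the output prefix can be fetched and concatenated locally (a sketch; s3cmd get works just as well):

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('test.emr.bucket')

with open('wordcount-results.txt', 'w') as merged:
    for key in bucket.list(prefix='wordcount/output2/'):
        if 'part-' in key.name:                 # skip _SUCCESS markers and logs
            merged.write(key.get_contents_as_string())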
REAL-WORLD EXAMPLE – GENOTYPING

• Crossbow is a scalable, portable, and automatic cloud-computing tool for finding SNPs from short read data.
• Crossbow is designed to be easy to run (a) in "the cloud" on Amazon's Elastic MapReduce service, (b) on any Hadoop cluster, or (c) on any single computer, without Hadoop.
• Open source and available to anyone:
http://bowtie-bio.sourceforge.net/crossbow/
SINGLE-NUCLEOTIDE POLYMORPHISM

• A single-nucleotide polymorphism (SNP, pronounced "snip") is a DNA sequence variation occurring when a single nucleotide — A, T, C, or G — in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes in an individual.
SNP ANALYSIS IN AMAZON EMR
• Crossbow web interface:
http://bowtie-bio.sourceforge.net/crossbow/ui.html
SNP ANALYSIS – DATA IN AMAZON S3

• Data for SNP analysis is uploaded to an Amazon S3 bucket
• Output of the analysis is placed in S3
SNP ANALYSIS – INPUT / OUTPUT DATA
• Input data – single file, ~1.4 GB
@E201_120801:4:1:1208:14983#ACAGTG/1
GAAGGAATAATGAGACCTNACGTTTCTGNNCNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:14983#ACAGTG/1
gggfggdgfgdgg_e^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@E201_120801:4:1:1208:6966#ACAGTG/1
GCTGGGATTACAGACACANGCCACCACANNTNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:6966#ACAGTG/1
• Output data – multiple files, one line per SNP call
chr1   841900    G   A   3    A   68   2   2   G   0    0   0   2   0   1.00000   1.00000   1
chr1   922615    T   G   2    G   38   3   3   T   67   1   1   4   0   1.00000   1.00000   0
chr1   1011278   A   G   12   G   69   1   1   A   0    0   0   1   0   1.00000   1.00000   1
SNP ANALYSIS - TIME

•   To process 1.4GB on 1 EMR instance – 6 hours
•   To process 1.4GB on 2 EMR instances – 4 hours
•   To process 1.4GB on 4 EMR instances – 2.5 hours
•   Haven’t tried more instances…
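Note that the speedup is sub-linear: going from one to two instances cuts the run from 6 to 4 hours (about 1.5x), and four instances bring it to 2.5 hours (about 2.4x). That pattern is expected when a fixed portion of the job (cluster startup, bootstrapping, the non-parallel parts of the pipeline) does not shrink with more instances.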
AND MORE CASES FOR AMAZON AWS

Customer 1: a successful migration from dedicated hosting to Amazon:
• One EC2 xlarge Linux instance (15 GB RAM, 4 cores, 64-bit) with four 250 GB EBS volumes in the US West (Northern California) region
• Runs one heavily loaded web site with >1K concurrent users
• Tomcat application server and Oracle SE 11.2
• Amazon Elastic IP for the web site
• Continuous Oracle backup to Amazon S3 through Oracle Secure Backup for S3
• And it costs the customer surprisingly little
• Less than 2 days for the live migration, done over a weekend
AND MORE CASES FOR AMAZON AWS

Customer 2: a successful migration from Rackspace to Amazon:
• Rackspace hosting plus service cost $..K, with a very low service level; the Rackspace server capacity was fixed
• Migrated to one Amazon 2xlarge EC2 Windows 2008 R2 instance (34.2 GB RAM, 4 virtual cores) serving >100 web sites for corporate customers, with two EBS volumes of 1.5 TB
• Amazon RDS for Oracle as the backend – a fully automated Oracle database with expandable storage
• 200 GB of user data in RDS
• Full live migration completed in 48 hours, finished with a DNS name switch
• And the budget is significantly lower!
THANK YOU
QUESTIONS AND DISCUSSION ARE WELCOME…

Sergey Sverchkov
sergey.sverchkov@altoros.com
Skype: sergey.sverchkov


Editor's Notes

  1. Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit. Amazon Elastic MapReduce is ideal for problems that necessitate the fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows, and programmatically monitor progress of running job flows. In addition, you can use the simple web interface of the AWS Management Console to launch your job flows and monitor processing-intensive computation on clusters of Amazon EC2 instances. Q: Who can use Amazon Elastic MapReduce? Anyone who requires simple access to powerful data analysis can use Amazon Elastic MapReduce. Customers don't need any software development experience to experiment with several sample applications available in the Developer Guide and in our Resource Center.
  2. To sign up for Amazon Elastic MapReduce, click the “Sign up for This Web Service” button on the Amazon Elastic MapReduce detail page http://aws.amazon.com/elasticmapreduce. You must be signed up for Amazon EC2 and Amazon S3 to access Amazon Elastic MapReduce; if you are not already signed up for these services, you will be prompted to do so during the Amazon Elastic MapReduce sign-up process. After signing up, please refer to the Amazon Elastic MapReduce documentation, which includes our Getting Started Guide – the best place to get going with the service.
  3. If you already have an AWS account, skip to the next procedure. If you don't already have an AWS account, use the following procedure to create one. Note: when you create an account, AWS automatically signs up the account for all services; you are charged only for the services you use. To create an AWS account, go to http://aws.amazon.com, click Sign Up Now, and follow the on-screen instructions. Part of the sign-up procedure involves receiving a phone call and entering a PIN using the phone keypad. Then install the Amazon EMR Command Line Interface (topics: installing Ruby, installing the command line interface, configuring credentials, SSH setup and configuration). AWS security credentials: AWS uses security credentials to help protect your data. This section shows you how to view your security credentials so you can add them to your credentials.json file. AWS assigns you an Access Key ID and a Secret Access Key. You include your Access Key ID in all AWS service requests to identify yourself as the sender of the request. Note: your Secret Access Key is a shared secret between you and AWS. Keep this ID secret; we use it to bill you for the AWS services you use. Never include the ID in your requests to AWS and never email the ID to anyone, even if an inquiry appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key. To locate your AWS Access Key ID and AWS Secret Access Key, go to the AWS web site at http://aws.amazon.com, click My Account to display a list of options, then click Security Credentials and log in to your AWS account. Your Access Key ID is displayed in the Access Credentials section. Your Secret Access Key remains hidden as a further precaution; to display it, click Show in the Your Secret Access Key area.
  4. Upload your data and your processing application into Amazon S3. Amazon S3 provides reliable, scalable, easy-to-use storage for your input and output data. Log in to the AWS Management Console to start an Amazon Elastic MapReduce "job flow." Simply choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and then click the "Create Job Flow" button. Alternatively you can start a job flow by specifying the same information mentioned above via our Command Line Tools or APIs. For more sophisticated workloads you can choose to install additional software or alter configuration of your Amazon EC2 instances using Bootstrap Actions. Monitor the progress of your job flow(s) directly from the AWS Management Console, Command Line Tools or APIs. And, after the job flow is done, retrieve the output from Amazon S3. You can optionally track progress and identify issues in steps, jobs, tasks, or task attempts of your job flows directly from the job flow debug window in the AWS Management Console. Amazon Elastic MapReduce uses Amazon SimpleDB to store job flow state information. Pay only for the resources that you actually consume. Amazon Elastic MapReduce monitors your job flow, and unless you specify otherwise, shuts down your Amazon EC2 instances after the job completes.
  5. The sample input for this job flow is available at s3://elasticmapreduce/samples/wordcount/input. This example uses the built-in reducer called aggregate. This reducer adds up the counts of words being output by the wordSplitter mapper function. It knows to use data type Long from the prefix on the words. To run a streaming job flow, enter the following command from the command-line prompt (Linux and UNIX users):
$ ./elastic-mapreduce --create --stream \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --input s3://elasticmapreduce/samples/wordcount/input \
    --output [a path to a bucket you own on Amazon S3, such as s3n://myawsbucket] \
    --reducer aggregate
  6. Word count input data size in the sample S3 bucket:
[root@ip-10-166-230-67 ~]# /elimbio/s3cmd-1.0.1/s3cmd du s3://elasticmapreduce/samples/wordcount/input/
19105856   s3://elasticmapreduce/samples/wordcount/input/
Word count input data files:
[root@ip-10-166-230-67 ~]# /elimbio/s3cmd-1.0.1/s3cmd ls s3://elasticmapreduce/samples/wordcount/input/
2009-04-02 02:55   2392524   s3://elasticmapreduce/samples/wordcount/input/0001
2009-04-02 02:55   2396618   s3://elasticmapreduce/samples/wordcount/input/0002
2009-04-02 02:55   1593915   s3://elasticmapreduce/samples/wordcount/input/0003
2009-04-02 02:55   1720885   s3://elasticmapreduce/samples/wordcount/input/0004
2009-04-02 02:55   2216895   s3://elasticmapreduce/samples/wordcount/input/0005
2009-04-02 02:55   1906322   s3://elasticmapreduce/samples/wordcount/input/0006
2009-04-02 02:55   1930660   s3://elasticmapreduce/samples/wordcount/input/0007
2009-04-02 02:55   1913444   s3://elasticmapreduce/samples/wordcount/input/0008
2009-04-02 02:55   2707527   s3://elasticmapreduce/samples/wordcount/input/0009
2009-04-02 02:55   327050    s3://elasticmapreduce/samples/wordcount/input/0010
2009-04-02 02:55   8         s3://elasticmapreduce/samples/wordcount/input/0011
2009-04-02 02:55   8         s3://elasticmapreduce/samples/wordcount/input/0012
  7. Q: What is Amazon Elastic MapReduce Bootstrap Actions? Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow. Q: How can I use Bootstrap Actions? You can write a Bootstrap Action script in any language already installed on the job flow instance including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a job flow. Please refer to the “Developer’s Guide”: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ for details on how to use Bootstrap Actions. Q: How do I configure Hadoop settings for my job flow? The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions. Q: Can I modify the number of slave nodes in a running job flow? Yes. Slave nodes can be of two types: (1) core nodes, which both host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks and (2) task nodes, which only run Hadoop tasks. While a job flow is running you may increase the number of core nodes and you may either increase or decrease the number of task nodes. This can be done through the API, Java SDK, or though the command line client. Please refer to the Resizing Running Job Flows section in the Developer’s Guide for details on how to modify the size of your running job flow. Q: When would I want to use core nodes versus task nodes? As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your job flow completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis.
8. The sample input for this job flow is available at s3://elasticmapreduce/samples/wordcount/input. This example uses the built-in reducer called aggregate. This reducer adds up the counts of the words output by the wordSplitter mapper function. It knows to use the Long data type from the prefix on the words (a sketch of a mapper in this style follows this slide).

To run a streaming job flow, enter the following command at the command-line prompt (Linux and UNIX users):

$ ./elastic-mapreduce --create --stream \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --input s3://elasticmapreduce/samples/wordcount/input \
    --output [a path to a bucket you own on Amazon S3, such as s3n://myawsbucket] \
    --reducer aggregate

The output looks similar to the following:
Created job flow j-317IN1TUMRQ5B

By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
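For illustration, the sketch below shows what a streaming mapper compatible with the aggregate reducer might look like. It follows the convention of Hadoop's aggregate package, emitting "LongValueSum:<word><TAB>1" for each word so the reducer knows to sum Long counts per word; it is not the actual wordSplitter.py shipped in the samples bucket.

#!/usr/bin/env python
# Minimal streaming mapper for use with the built-in "aggregate" reducer.
# Reads raw text from stdin and emits one "LongValueSum:<word>\t1" record per
# word; the aggregate reducer recognizes the LongValueSum prefix and adds up
# the counts for each distinct word.
import sys

def main():
    for line in sys.stdin:
        for word in line.split():
            # Lowercasing is an illustrative choice, not a requirement.
            sys.stdout.write("LongValueSum:" + word.lower() + "\t1\n")

if __name__ == "__main__":
    main()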
9. Your job flow results are stored in a text file in the output location. The results file contains a list of all words found, with the number of times each word occurred in the data set. Excerpt from the output data:
abandon 3
abandoned 46
abandoning 3
abandonment 6
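Once the job flow completes, the results can be copied down from the output bucket with whichever S3 tool you installed earlier. A minimal sketch using s3cmd, with the output path from the earlier command-line example (your bucket and part-file names may differ):

$ ./s3cmd ls s3://test.emr.bucket/wordcount/output2/
$ ./s3cmd get s3://test.emr.bucket/wordcount/output2/part-00000 part-00000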
10. Crossbow is a scalable, portable, and automatic Cloud Computing tool for finding SNPs from short read data. Crossbow employs Bowtie and a modified version of SOAPsnp to perform the short read alignment and SNP calling, respectively. Crossbow is designed to be easy to run (a) in "the cloud" (in this case, Amazon's Elastic MapReduce service), (b) on any Hadoop cluster, or (c) on any single computer, without Hadoop. Crossbow exploits the availability of multiple computers and processors where possible.
11. A single-nucleotide polymorphism (SNP, pronounced "snip") is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes in an individual. For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, differ in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles. The genomic distribution of SNPs is not homogeneous: SNPs usually occur more frequently in non-coding regions than in coding regions or, in general, than in regions where natural selection is acting and fixing the allele of the SNP that constitutes the most favorable genetic adaptation.[1] Besides natural selection, other factors such as recombination and mutation rate can also determine SNP density. SNP density can be predicted by the presence of microsatellites: regions of thousands of nucleotides flanking microsatellites have an increased or decreased density of SNPs depending on the microsatellite sequence.[2]
12. Before running Crossbow on EMR, you must have an AWS account with the appropriate features enabled. You may also need to install Amazon's elastic-mapreduce tool. In addition, you may want to install an S3 tool, though most users can simply use Amazon's web interface for S3, which requires no installation. If you plan to run Crossbow exclusively on a single computer or on a Hadoop cluster, you can skip this section.

Create an AWS account by navigating to the AWS page. Click "Sign Up Now" in the upper right-hand corner and follow the instructions. You will be asked to accept the AWS Customer Agreement.

Sign up for EC2 and S3. Navigate to the Amazon EC2 page, click "Sign Up For Amazon EC2" and follow the instructions. This step requires you to enter credit card information. Once this is complete, your AWS account will be permitted to use EC2 and S3, which are required.

Sign up for EMR. Navigate to the Elastic MapReduce page, click "Sign up for Elastic MapReduce" and follow the instructions. Once this is complete, your AWS account will be permitted to use EMR, which is required.

Sign up for SimpleDB. With SimpleDB enabled, you have the option of using the AWS Console's Job Flow Debugging feature. This is a convenient way to monitor your job's progress and diagnose errors.

Optional: request an increase to your instance limit. By default, Amazon allows you to allocate EC2 clusters with up to 20 instances (virtual computers). To be permitted to work with more instances, fill in the form on the Request to Increase page. You may have to speak to an Amazon representative and/or wait several business days before your request is granted.

To see a list of AWS services you've already signed up for, see your Account Activity page. If "Amazon Elastic Compute Cloud", "Amazon Simple Storage Service", "Amazon Elastic MapReduce", and "Amazon SimpleDB" all appear there, you are ready to proceed.

To run:
If the input reads have not yet been preprocessed by Crossbow (i.e. the input is FASTQ or .sra), then first (a) prepare a manifest file with URLs pointing to the read files, and (b) upload it to an S3 bucket that you own (see the sketch after this slide). See your S3 tool's documentation for how to create a bucket and upload a file to it. The URL for the manifest file will be the input URL for your EMR job.

If the input reads have already been preprocessed by Crossbow, make a note of the S3 URL where they're located. This will be the input URL for your EMR job.

If you are using a pre-built reference jar, make a note of its S3 URL. This will be the reference URL for your EMR job. See the Crossbow website for a list of pre-built reference jars and their URLs.

If you are not using a pre-built reference jar, you may need to build the reference jars and/or upload them to an S3 bucket you own. See your S3 tool's documentation for how to create a bucket and upload to it. The URL for the main reference jar will be the reference URL for your EMR job.

In a web browser, go to the Crossbow web interface. Fill in the form according to your job's parameters. We recommend filling in and validating the "AWS ID" and "AWS Secret Key" fields first. Also, when entering S3 URLs (e.g. "Input URL" and "Output URL"), we recommend validating each entered URL by clicking the link below it. This avoids failed jobs due to simple URL issues (e.g. a non-existent "Input URL").

For examples of how to fill in this form, see the E. coli EMR and Mouse chromosome 17 EMR examples. Be sure to make a note of the various numbers and names associated with your accounts, especially your Access Key ID, Secret Access Key, and your EC2 key pair name. You will have to refer to these and other account details in the future.
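As a minimal sketch of the manifest-upload step, using the s3cmd client already shown earlier in this deck (the bucket name, prefix, and file name below are placeholders, not Crossbow conventions):

$ ./s3cmd mb s3://my-crossbow-bucket
$ ./s3cmd put crossbow_manifest.txt s3://my-crossbow-bucket/crossbow/crossbow_manifest.txt
$ ./s3cmd ls s3://my-crossbow-bucket/crossbow/

The resulting URL, s3://my-crossbow-bucket/crossbow/crossbow_manifest.txt (or its s3n:// form, if the Crossbow form expects that scheme), is what you would paste into the "Input URL" field of the Crossbow web interface.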