Masterclass: Amazon Elastic MapReduce
Ian Massingham — Technical Evangelist, Amazon
ianmas@amazon.com | @IanMmmm
Masterclass
1. Intended to educate you on how to get the best from AWS services
2. Show you how things work and how to get things done
3. A technical deep dive that goes beyond the basics
Amazon Elastic MapReduce
Provides a managed Hadoop framework
Quickly & cost-effectively process vast amounts of data
Makes it easy, fast & cost-effective for you to process data

Run other popular distributed frameworks such as Spark
Amazon EMR: Elastic, Easy to Use, Reliable, Flexible, Low Cost, Secure
Amazon EMR: Example Use Cases

Genomics: Amazon EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS.

Clickstream Analysis: Amazon EMR can be used to analyze clickstream data in order to segment users and understand user preferences. Advertisers can also analyze clickstreams and advertising impression logs to deliver more effective ads.

Log Processing: Amazon EMR can be used to process logs generated by web and mobile applications. Amazon EMR helps customers turn petabytes of unstructured or semi-structured data into useful insights about their applications or users.
Agenda
Hadoop Fundamentals
Core Features of Amazon EMR
How to Get Started with Amazon EMR
Supported Hadoop Tools
Additional EMR Features
Third Party Tools
Resources where you can learn more
HADOOP FUNDAMENTALS
Example: very large clickstream logging data (e.g. TBs), containing lots of actions by John Smith.

1. Split the log into many small pieces
2. Process in an EMR cluster
3. Aggregate the results from all the nodes

The result: what John Smith did. Insight in a fraction of the time.
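To make the pattern concrete, here is a toy sketch of the same split / process / aggregate idea in plain Python (purely illustrative; the sample data and field positions are invented, and EMR distributes this work across many nodes rather than a single loop):

# Toy illustration of the MapReduce pattern, not EMR code:
# map each log line to (user, 1), then aggregate the counts per user.
from collections import Counter

log_lines = [
    "2015-06-01 10:01 john.smith /products/widget",
    "2015-06-01 10:02 jane.doe /cart",
    "2015-06-01 10:03 john.smith /checkout",
]

def map_line(line):
    user = line.split()[2]  # extract the user field from the log record
    return (user, 1)

# "Process in an EMR cluster": each node would map its own split of the log
mapped = [map_line(line) for line in log_lines]

# "Aggregate the results from all the nodes"
totals = Counter()
for user, count in mapped:
    totals[user] += count

print(totals)  # Counter({'john.smith': 2, 'jane.doe': 1})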
CORE FEATURES OF AMAZON EMR
ELASTIC
Elastic
Provision as much capacity as you need
Add or remove capacity at any time
Deploy Multiple Clusters
Resize a Running Cluster
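As an illustration of resizing a running cluster programmatically, here is a minimal boto3 sketch (the cluster ID and target instance count are placeholders; the same operation is available from the console and CLI):

# Sketch: resize a running EMR cluster by changing the instance count of
# its task instance group. Cluster ID and count are hypothetical.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{
        "InstanceGroupId": task_group["Id"],
        "InstanceCount": 10,  # grow (or shrink) the task group
    }],
)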
LOW COST
Low Cost
Low Hourly Pricing
Amazon EC2 Spot Integration
Amazon EC2 Reserved Instance Integration
Elasticity
Amazon S3 Integration
Low Cost

Accenture Technology Labs, 'Where to deploy your Hadoop clusters?' (Executive Summary):
http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Technology-Labs-Where-To-Deploy-Your-Hadoop-CLusters-Executive-Summary.pdf

Accenture Hadoop Study: Amazon EMR ‘offers better price-performance’
FLEXIBLE DATA STORES

Amazon EMR works with a range of data stores: Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon Glacier, Amazon Relational Database Service, and the Hadoop Distributed File System.
Amazon S3 + Amazon EMR

Allows you to decouple storage and computing resources
Use Amazon S3 features such as server-side encryption
When you launch your cluster, EMR streams data from S3
Multiple clusters can process the same data concurrently
Data sources such as Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Kinesis and Amazon Redshift integrate with Amazon EMR and HDFS for data management and analytics, with AWS Data Pipeline moving data between them.
GETTING STARTED WITH
AMAZON ELASTIC MAPREDUCE
Develop your data processing application
http://aws.amazon.com/articles/Elastic-MapReduce

Upload your application and data to Amazon S3

Configure and launch your cluster
Configure and launch your cluster (Amazon EMR Cluster):
Start an EMR cluster using the console, CLI tools or an AWS SDK
A Master Instance Group is created that controls the cluster; the master node coordinates the distribution of work and manages cluster state
A Core Instance Group is created for the life of the cluster; core instances run the DataNode and TaskTracker daemons and provide HDFS
Optional task instances (Task Instance Group) can be added or subtracted
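For illustration, a minimal boto3 sketch of launching a cluster with a master and core instance group (the name, instance counts and log bucket are placeholders, not recommendations):

# Sketch: start an EMR cluster with an AWS SDK (boto3).
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="masterclass-example",             # hypothetical cluster name
    AmiVersion="3.8",                       # newer clusters use ReleaseLabel instead
    LogUri="s3://my-log-bucket/emr-logs/",  # hypothetical bucket
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,                 # 1 master + 2 core
        "Ec2KeyName": "myKey",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])                # e.g. j-XXXXXXXXXXXXX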
Develop your data processing application
Upload your application and data to Amazon S3
Configure and launch your cluster
Optionally, monitor the cluster
Retrieve the output
Amazon S3 can be used as the underlying ‘file system’ for input/output data for the Amazon EMR cluster
Retrieve the output
DEMO:
GETTING STARTED WITH
AMAZON EMR USING A SAMPLE
HADOOP STREAMING APPLICATION
Hadoop Streaming

A utility that comes with the Hadoop distribution
Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer
The mapper reads its input from standard input and the reducer emits its output through standard output
By default, each line of input/output represents a record with a tab-separated key/value pair
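Because a streaming mapper is just a script reading stdin and writing stdout, you can test it locally before submitting it to EMR. A small sketch, assuming wordSplitter.py and a sample input file sit in the working directory (both file names are placeholders here):

# Sketch: emulate "cat input | mapper | sort" locally with subprocess.
import subprocess

with open("sample_input.txt", "rb") as infile:   # hypothetical local input file
    mapper = subprocess.run(
        ["python", "wordSplitter.py"],           # the mapper reads from StdIn
        stdin=infile, capture_output=True, check=True,
    )

# Hadoop sorts mapper output by key before the reducer sees it;
# sorted() stands in for that shuffle/sort phase here.
for record in sorted(mapper.stdout.decode().splitlines()):
    key, value = record.split("\t")              # tab-separated key/value
    print(key, value)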
Job Flow for Sample Application
Job Flow: Step 1

JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar

Arguments:
-files s3://eu-west-1.elasticmapreduce/samples/wordcount/wordSplitter.py
-mapper wordSplitter.py
-reducer aggregate
-input s3://eu-west-1.elasticmapreduce/samples/wordcount/input
-output s3://ianmas-aws-emr/intermediate/
Step 1: mapper: wordSplitter.py

#!/usr/bin/python
import sys
import re

def main(argv):
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    # Read words from StdIn line by line
    for line in sys.stdin:
        for word in pattern.findall(line):
            # Output to StdOut tab-delimited records
            # in the format "LongValueSum:abacus<TAB>1"
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
Step 1: reducer: aggregate
Sorts inputs and adds up totals:
“Abacus 1”
“Abacus 1”
“Abacus 1”
becomes
“Abacus 3”
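Expressed in plain Python, the behaviour of the built-in aggregate reducer for LongValueSum keys looks roughly like this (an illustrative sketch only; the real reducer is part of Hadoop's aggregate package):

# Sketch: sum the values for each LongValueSum key read from StdIn.
import sys
from collections import defaultdict

totals = defaultdict(int)
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")   # e.g. "LongValueSum:abacus<TAB>1"
    if key.startswith("LongValueSum:"):
        totals[key[len("LongValueSum:"):]] += int(value)

for word, count in sorted(totals.items()):
    print(word + "\t" + str(count))              # e.g. "abacus<TAB>3"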
Step 1: input/output

The input is all the objects in the S3 bucket/prefix:
s3://eu-west-1.elasticmapreduce/samples/wordcount/input

Output is written to the following S3 bucket/prefix to be used as input for the next step in the job flow:
s3://ianmas-aws-emr/intermediate/

One output object is created for each reducer (generally one per core)
Job Flow: Step 2

JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar

Arguments:
-mapper /bin/cat                                        (accept anything and return it as text)
-reducer org.apache.hadoop.mapred.lib.IdentityReducer   (sort)
-input s3://ianmas-aws-emr/intermediate/                (take the previous output as input)
-output s3://ianmas-aws-emr/output                      (output location)
-jobconf mapred.reduce.tasks=1                          (use a single reduce task to get a single output object)
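The same two steps can be submitted to an existing cluster with an AWS SDK. A sketch using boto3 (the cluster ID is a placeholder, and the ianmas-aws-emr bucket is the presenter's example; substitute your own):

# Sketch: add the two Hadoop streaming steps of this job flow via boto3.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

streaming_jar = "/home/hadoop/contrib/streaming/hadoop-streaming.jar"

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
    Steps=[
        {
            "Name": "Step 1: word count",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": streaming_jar,
                "Args": [
                    "-files", "s3://eu-west-1.elasticmapreduce/samples/wordcount/wordSplitter.py",
                    "-mapper", "wordSplitter.py",
                    "-reducer", "aggregate",
                    "-input", "s3://eu-west-1.elasticmapreduce/samples/wordcount/input",
                    "-output", "s3://ianmas-aws-emr/intermediate/",
                ],
            },
        },
        {
            "Name": "Step 2: merge to a single output object",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": streaming_jar,
                "Args": [
                    "-mapper", "/bin/cat",
                    "-reducer", "org.apache.hadoop.mapred.lib.IdentityReducer",
                    "-input", "s3://ianmas-aws-emr/intermediate/",
                    "-output", "s3://ianmas-aws-emr/output",
                    "-jobconf", "mapred.reduce.tasks=1",
                ],
            },
        },
    ],
)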
SUPPORTED HADOOP TOOLS
Supported Hadoop Tools

Pig: An open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data. Allows processing of complex and unstructured data sources such as text documents and log files.

Hive: An open source data warehouse & analytics package that runs on top of Hadoop. Operated by Hive QL, a SQL-based language which allows users to structure, summarize, and query data.

HBase: Provides you an efficient way of storing large quantities of sparse data using column-based storage. HBase provides fast lookup of data because data is stored in-memory instead of on disk. Optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes.
Supported Hadoop Tools

Presto: An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.

Impala: A tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. It uses a massively parallel processing (MPP) engine similar to that found in a traditional RDBMS. This lends Impala to interactive, low-latency analytics. You can connect to BI tools through ODBC and JDBC drivers.

Hue: An open source user interface for Hadoop that makes it easier to run and develop Hive queries, manage files in HDFS, run and develop Pig scripts, and manage tables.
New
DEMO:
APACHE HUE ON EMR
aws.amazon.com/blogs/aws/new-apache-spark-on-amazon-emr/
New
Create a Cluster with Spark (AWS CLI)

$ aws emr create-cluster --name "Spark cluster" \
    --ami-version 3.8 --applications Name=Spark \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge \
    --instance-count 3 --use-default-roles

$ ssh -i myKey hadoop@masternode

Invoke the Spark shell with:
$ spark-shell
or
$ pyspark
Working with the Spark Shell (pyspark)

Counting the occurrences of a string in a text file stored in Amazon S3 with Spark:

$ pyspark
>>> sc
<pyspark.context.SparkContext object at 0x7fe7e659fa50>
>>> textfile = sc.textFile("s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-13-08-05/ec2-0-51-75-39.amazon.com-2009-04-13-08-05.log")
>>> linesWithCartoonNetwork = textfile.filter(lambda line: "cartoonnetwork.com" in line).count()
15/06/04 17:12:22 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
<snip>
<Spark program continues>
>>> linesWithCartoonNetwork
9

docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark.html
ADDITIONAL EMR FEATURES
CONTROL NETWORK ACCESS TO
YOUR EMR CLUSTER
Using SSH local port forwarding:

ssh -i EMRKeyPair.pem -N \
    -L 8160:ec2-52-16-143-78.eu-west-1.compute.amazonaws.com:8888 \
    hadoop@ec2-52-16-143-78.eu-west-1.compute.amazonaws.com

docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
MANAGE USERS, PERMISSIONS
AND ENCRYPTION
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-access.html
INSTALL ADDITIONAL SOFTWARE
WITH BOOTSTRAP ACTIONS
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
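As a sketch of how a bootstrap action is attached at cluster-launch time with boto3 (the script path, bucket and arguments are hypothetical):

# Sketch: install additional software on every node via a bootstrap action.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

emr.run_job_flow(
    Name="cluster-with-bootstrap-action",
    AmiVersion="3.8",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
    },
    BootstrapActions=[{
        "Name": "Install extra packages",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install-packages.sh",  # hypothetical script
            "Args": ["--verbose"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)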
EFFICIENTLY COPY DATA TO
EMR FROM AMAZON S3
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Run on a cluster master node:

$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -Dmapreduce.job.reduces=30 \
    --src s3://s3bucketname/ \
    --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ \
    --outputCodec 'none'
SCHEDULE RECURRING
WORKFLOWS
aws.amazon.com/datapipeline
MONITOR YOUR CLUSTER
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_ViewingMetrics.html
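EMR clusters publish metrics to Amazon CloudWatch. As one illustrative sketch, reading the IsIdle metric with boto3 to see whether a cluster has been sitting idle (the cluster ID is a placeholder):

# Sketch: fetch the IsIdle CloudWatch metric for an EMR cluster.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])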
DEBUG YOUR APPLICATIONS
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Log files generated by EMR Clusters include:
	 •	 Step logs
	 •	 Hadoop logs
	 •	 Bootstrap action logs
	 •	 Instance state logs
USE THE MAPR DISTRIBUTION
aws.amazon.com/elasticmapreduce/mapr/
TUNE YOUR CLUSTER FOR
COST & PERFORMANCE
Supported EC2 instance types
	 •	 General Purpose
	 •	 Compute Optimized
	 •	 Memory Optimized
	 •	 Storage Optimized - D2 instance family
D2 instances are available in four sizes with 6TB, 12TB, 24TB, and 48TB storage options.
	 •	 GPU Instances
New
TUNE YOUR CLUSTER FOR
COST & PERFORMANCE

Scenario #1: run the job flow on 4 on-demand instances. Duration: 14 hours
Cost without Spot: 4 instances * 14 hrs * $0.50
Total = $28

Scenario #2: allocate 4 on-demand instances and add 5 additional Spot instances. Duration: 7 hours
Cost with Spot:
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%
Cost Savings: ~19%
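A quick check of the arithmetic behind the two scenarios (the $0.50 on-demand and $0.25 Spot rates are the example rates from the slide, not current prices):

# Sketch: verify the cost and time savings of adding Spot capacity.
on_demand_rate = 0.50   # $/instance-hour (example rate)
spot_rate = 0.25        # $/instance-hour (example rate)

cost_without_spot = 4 * 14 * on_demand_rate                   # $28.00
cost_with_spot = 4 * 7 * on_demand_rate + 5 * 7 * spot_rate   # $14.00 + $8.75 = $22.75

time_saving = 1 - 7 / 14.0                                    # 0.50 -> 50%
cost_saving = 1 - cost_with_spot / cost_without_spot          # ~0.19 -> ~19%

print(cost_without_spot, cost_with_spot, time_saving, cost_saving)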
THIRD PARTY TOOLS
RESOURCES YOU CAN USE
TO LEARN MORE
aws.amazon.com/emr
Getting Started with Amazon EMR Tutorial guide:
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-get-started.html

Customer Case Studies for Big Data Use-Cases
aws.amazon.com/solutions/case-studies/big-data/

Amazon EMR Documentation:
aws.amazon.com/documentation/emr/
AWS Training & Certification

Certification: Validate your proven skills and expertise with the AWS platform
aws.amazon.com/certification

Self-Paced Labs: Try products, gain new skills, and get hands-on practice working with AWS technologies
aws.amazon.com/training/self-paced-labs

Training: Build technical expertise to design and operate scalable, efficient applications on AWS
aws.amazon.com/training
Follow us for more events & webinars
@AWScloud for Global AWS News & Announcements
@AWS_UKI for local AWS events & news
@IanMmmm
Ian Massingham — Technical Evangelist

More Related Content

What's hot

Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Amazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
Amazon Web Services
 
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWS
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWSAWS Summit Stockholm 2014 – T3 – disaster recovery on AWS
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWS
Amazon Web Services
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetup
stevemcpherson
 
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
Amazon Web Services
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Amazon Web Services
 
SF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerSF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher Berner
Chester Chen
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
DataWorks Summit
 
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Harish Ganesan
 
AWS Office Hours: Disaster Recovery
AWS Office Hours: Disaster RecoveryAWS Office Hours: Disaster Recovery
AWS Office Hours: Disaster Recovery
Amazon Web Services
 
Disaster Recovery with the AWS Cloud
Disaster Recovery with the AWS CloudDisaster Recovery with the AWS Cloud
Disaster Recovery with the AWS Cloud
Amazon Web Services
 
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data WorkloadsWorkload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Vasu S
 
Introdução ao AWS Data Pipeline
Introdução ao AWS Data PipelineIntrodução ao AWS Data Pipeline
Introdução ao AWS Data Pipeline
Amazon Web Services LATAM
 
Disaster Recovery of on-premises IT infrastructure with AWS
Disaster Recovery of on-premises IT infrastructure with AWSDisaster Recovery of on-premises IT infrastructure with AWS
Disaster Recovery of on-premises IT infrastructure with AWS
Amazon Web Services
 
Disaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency
Disaster Recovery Sites on AWS: Minimal Cost, Maximum EfficiencyDisaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency
Disaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency
Amazon Web Services
 
Hw09 Making Hadoop Easy On Amazon Web Services
Hw09   Making Hadoop Easy On Amazon Web ServicesHw09   Making Hadoop Easy On Amazon Web Services
Hw09 Making Hadoop Easy On Amazon Web Services
Cloudera, Inc.
 
Optimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS CloudOptimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS Cloud
Amazon Web Services
 
Optimising TCO with AWS at Websummit Dublin
Optimising TCO with AWS at Websummit DublinOptimising TCO with AWS at Websummit Dublin
Optimising TCO with AWS at Websummit Dublin
Amazon Web Services
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
Amazon Web Services
 

What's hot (20)

Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWS
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWSAWS Summit Stockholm 2014 – T3 – disaster recovery on AWS
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWS
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetup
 
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
 
SF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerSF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher Berner
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11
 
AWS Office Hours: Disaster Recovery
AWS Office Hours: Disaster RecoveryAWS Office Hours: Disaster Recovery
AWS Office Hours: Disaster Recovery
 
Disaster Recovery with the AWS Cloud
Disaster Recovery with the AWS CloudDisaster Recovery with the AWS Cloud
Disaster Recovery with the AWS Cloud
 
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data WorkloadsWorkload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
 
Introdução ao AWS Data Pipeline
Introdução ao AWS Data PipelineIntrodução ao AWS Data Pipeline
Introdução ao AWS Data Pipeline
 
Disaster Recovery of on-premises IT infrastructure with AWS
Disaster Recovery of on-premises IT infrastructure with AWSDisaster Recovery of on-premises IT infrastructure with AWS
Disaster Recovery of on-premises IT infrastructure with AWS
 
Disaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency
Disaster Recovery Sites on AWS: Minimal Cost, Maximum EfficiencyDisaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency
Disaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency
 
Hw09 Making Hadoop Easy On Amazon Web Services
Hw09   Making Hadoop Easy On Amazon Web ServicesHw09   Making Hadoop Easy On Amazon Web Services
Hw09 Making Hadoop Easy On Amazon Web Services
 
Optimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS CloudOptimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS Cloud
 
Optimising TCO with AWS at Websummit Dublin
Optimising TCO with AWS at Websummit DublinOptimising TCO with AWS at Websummit Dublin
Optimising TCO with AWS at Websummit Dublin
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
 

Similar to Amazon EMR Masterclass

Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS Lambda
Amazon Web Services
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
Israel AWS User Group
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache Storm
Amazon Web Services
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWS
Danilo Poccia
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
Amazon Web Services
 
B3 - Business intelligence apps on aws
B3 - Business intelligence apps on awsB3 - Business intelligence apps on aws
B3 - Business intelligence apps on aws
Amazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Amazon Web Services
 
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
Amazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
Amazon Web Services
 
Get Value From Your Data
Get Value From Your DataGet Value From Your Data
Get Value From Your Data
Danilo Poccia
 

Similar to Amazon EMR Masterclass (20)

Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS Lambda
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache Storm
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWS
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)Masterclass Webinar - Amazon Elastic MapReduce (EMR)
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
B3 - Business intelligence apps on aws
B3 - Business intelligence apps on awsB3 - Business intelligence apps on aws
B3 - Business intelligence apps on aws
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Get Value From Your Data
Get Value From Your DataGet Value From Your Data
Get Value From Your Data
 

More from Ian Massingham

Some thoughts on measuring the impact of developer relations
Some thoughts on measuring the impact of developer relationsSome thoughts on measuring the impact of developer relations
Some thoughts on measuring the impact of developer relations
Ian Massingham
 
Leeds IoT Meetup - Nov 2017
Leeds IoT Meetup - Nov 2017Leeds IoT Meetup - Nov 2017
Leeds IoT Meetup - Nov 2017
Ian Massingham
 
What's New & What's Next from AWS?
What's New & What's Next from AWS?What's New & What's Next from AWS?
What's New & What's Next from AWS?
Ian Massingham
 
DevTalks Romania - Getting Started with AWS Lambda & the Serverless Cloud
DevTalks Romania - Getting Started with AWS Lambda & the Serverless CloudDevTalks Romania - Getting Started with AWS Lambda & the Serverless Cloud
DevTalks Romania - Getting Started with AWS Lambda & the Serverless Cloud
Ian Massingham
 
Getting started with AWS Lambda and the Serverless Cloud
Getting started with AWS Lambda and the Serverless CloudGetting started with AWS Lambda and the Serverless Cloud
Getting started with AWS Lambda and the Serverless Cloud
Ian Massingham
 
AWS AWSome Day - Getting Started Best Practices
AWS AWSome Day - Getting Started Best PracticesAWS AWSome Day - Getting Started Best Practices
AWS AWSome Day - Getting Started Best Practices
Ian Massingham
 
AWS IoT Workshop Keynote
AWS IoT Workshop KeynoteAWS IoT Workshop Keynote
AWS IoT Workshop Keynote
Ian Massingham
 
Security Best Practices: AWS AWSome Day Management Track
Security Best Practices: AWS AWSome Day Management TrackSecurity Best Practices: AWS AWSome Day Management Track
Security Best Practices: AWS AWSome Day Management Track
Ian Massingham
 
AWS re:Invent 2016 Day 2 Keynote re:Cap
AWS re:Invent 2016 Day 2 Keynote re:CapAWS re:Invent 2016 Day 2 Keynote re:Cap
AWS re:Invent 2016 Day 2 Keynote re:Cap
Ian Massingham
 
AWS re:Invent 2016 Day 1 Keynote re:Cap
AWS re:Invent 2016 Day 1 Keynote re:CapAWS re:Invent 2016 Day 1 Keynote re:Cap
AWS re:Invent 2016 Day 1 Keynote re:Cap
Ian Massingham
 
Getting Started with AWS Lambda & Serverless Cloud
Getting Started with AWS Lambda & Serverless CloudGetting Started with AWS Lambda & Serverless Cloud
Getting Started with AWS Lambda & Serverless Cloud
Ian Massingham
 
Building Better IoT Applications without Servers
Building Better IoT Applications without ServersBuilding Better IoT Applications without Servers
Building Better IoT Applications without Servers
Ian Massingham
 
AWS AWSome Day Roadshow
AWS AWSome Day RoadshowAWS AWSome Day Roadshow
AWS AWSome Day Roadshow
Ian Massingham
 
AWS AWSome Day Roadshow Intro
AWS AWSome Day Roadshow IntroAWS AWSome Day Roadshow Intro
AWS AWSome Day Roadshow Intro
Ian Massingham
 
Hashiconf AWS Lambda Breakout
Hashiconf AWS Lambda BreakoutHashiconf AWS Lambda Breakout
Hashiconf AWS Lambda Breakout
Ian Massingham
 
Getting started with AWS IoT on Raspberry Pi
Getting started with AWS IoT on Raspberry PiGetting started with AWS IoT on Raspberry Pi
Getting started with AWS IoT on Raspberry Pi
Ian Massingham
 
AWSome Day Dublin Intro & Closing Slides
AWSome Day Dublin Intro & Closing Slides AWSome Day Dublin Intro & Closing Slides
AWSome Day Dublin Intro & Closing Slides
Ian Massingham
 
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-endGOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end
Ian Massingham
 
What's New at AWS Update for AWS User Groups
What's New at AWS Update for AWS User Groups What's New at AWS Update for AWS User Groups
What's New at AWS Update for AWS User Groups
Ian Massingham
 
Advanced Security Masterclass - Tel Aviv Loft
Advanced Security Masterclass - Tel Aviv LoftAdvanced Security Masterclass - Tel Aviv Loft
Advanced Security Masterclass - Tel Aviv Loft
Ian Massingham
 

More from Ian Massingham (20)

Some thoughts on measuring the impact of developer relations
Some thoughts on measuring the impact of developer relationsSome thoughts on measuring the impact of developer relations
Some thoughts on measuring the impact of developer relations
 
Leeds IoT Meetup - Nov 2017
Leeds IoT Meetup - Nov 2017Leeds IoT Meetup - Nov 2017
Leeds IoT Meetup - Nov 2017
 
What's New & What's Next from AWS?
What's New & What's Next from AWS?What's New & What's Next from AWS?
What's New & What's Next from AWS?
 
DevTalks Romania - Getting Started with AWS Lambda & the Serverless Cloud
DevTalks Romania - Getting Started with AWS Lambda & the Serverless CloudDevTalks Romania - Getting Started with AWS Lambda & the Serverless Cloud
DevTalks Romania - Getting Started with AWS Lambda & the Serverless Cloud
 
Getting started with AWS Lambda and the Serverless Cloud
Getting started with AWS Lambda and the Serverless CloudGetting started with AWS Lambda and the Serverless Cloud
Getting started with AWS Lambda and the Serverless Cloud
 
AWS AWSome Day - Getting Started Best Practices
AWS AWSome Day - Getting Started Best PracticesAWS AWSome Day - Getting Started Best Practices
AWS AWSome Day - Getting Started Best Practices
 
AWS IoT Workshop Keynote
AWS IoT Workshop KeynoteAWS IoT Workshop Keynote
AWS IoT Workshop Keynote
 
Security Best Practices: AWS AWSome Day Management Track
Security Best Practices: AWS AWSome Day Management TrackSecurity Best Practices: AWS AWSome Day Management Track
Security Best Practices: AWS AWSome Day Management Track
 
AWS re:Invent 2016 Day 2 Keynote re:Cap
AWS re:Invent 2016 Day 2 Keynote re:CapAWS re:Invent 2016 Day 2 Keynote re:Cap
AWS re:Invent 2016 Day 2 Keynote re:Cap
 
AWS re:Invent 2016 Day 1 Keynote re:Cap
AWS re:Invent 2016 Day 1 Keynote re:CapAWS re:Invent 2016 Day 1 Keynote re:Cap
AWS re:Invent 2016 Day 1 Keynote re:Cap
 
Getting Started with AWS Lambda & Serverless Cloud
Getting Started with AWS Lambda & Serverless CloudGetting Started with AWS Lambda & Serverless Cloud
Getting Started with AWS Lambda & Serverless Cloud
 
Building Better IoT Applications without Servers
Building Better IoT Applications without ServersBuilding Better IoT Applications without Servers
Building Better IoT Applications without Servers
 
AWS AWSome Day Roadshow
AWS AWSome Day RoadshowAWS AWSome Day Roadshow
AWS AWSome Day Roadshow
 
AWS AWSome Day Roadshow Intro
AWS AWSome Day Roadshow IntroAWS AWSome Day Roadshow Intro
AWS AWSome Day Roadshow Intro
 
Hashiconf AWS Lambda Breakout
Hashiconf AWS Lambda BreakoutHashiconf AWS Lambda Breakout
Hashiconf AWS Lambda Breakout
 
Getting started with AWS IoT on Raspberry Pi
Getting started with AWS IoT on Raspberry PiGetting started with AWS IoT on Raspberry Pi
Getting started with AWS IoT on Raspberry Pi
 
AWSome Day Dublin Intro & Closing Slides
AWSome Day Dublin Intro & Closing Slides AWSome Day Dublin Intro & Closing Slides
AWSome Day Dublin Intro & Closing Slides
 
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-endGOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end
 
What's New at AWS Update for AWS User Groups
What's New at AWS Update for AWS User Groups What's New at AWS Update for AWS User Groups
What's New at AWS Update for AWS User Groups
 
Advanced Security Masterclass - Tel Aviv Loft
Advanced Security Masterclass - Tel Aviv LoftAdvanced Security Masterclass - Tel Aviv Loft
Advanced Security Masterclass - Tel Aviv Loft
 

Recently uploaded

Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
BibashShahi
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 

Recently uploaded (20)

Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 

Amazon EMR Masterclass

  • 1. Masterclass ianmas@amazon.com @IanMmmm Ian Massingham — Technical Evangelist Amazon Elastic MapReduce
  • 2. Masterclass Intended to educate you on how to get the best from AWS services Show you how things work and how to get things done A technical deep dive that goes beyond the basics 1 2 3
  • 3. Amazon Elastic MapReduce Provides a managed Hadoop framework Quickly & cost-effectively process vast amounts of data Makes it easy, fast & cost-effective for you to process data
 Run other popular distributed frameworks such as Spark
  • 4. Elastic Amazon EMR Easy to Use ReliableFlexible Low Cost Secure
  • 5. Amazon EMR: Example Use Cases Amazon EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS. Genomics Amazon EMR can be used to analyze click stream data in order to segment users and understand user preferences. Advertisers can also analyze click streams and advertising impression logs to deliver more effective ads. Clickstream Analysis Amazon EMR can be used to process logs generated by web and mobile applications. Amazon EMR helps customers turn petabytes of un-structured or semi-structured data into useful insights about their applications or users. Log Processing
  • 6. Agenda Hadoop Fundamentals Core Features of Amazon EMR How to Get Started with Amazon EMR Supported Hadoop Tools Additional EMR Features Third Party Tools Resources where you can learn more
  • 9. Lots of actions by John Smith Very large clickstream logging data (e.g TBs)
  • 10. Lots of actions by John Smith Split the log into many small pieces Very large clickstream logging data (e.g TBs)
  • 11. Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Very large clickstream logging data (e.g TBs)
  • 12. Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Aggregate the results from all the nodes Very large clickstream logging data (e.g TBs)
  • 13. Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Aggregate the results from all the nodes Very large clickstream logging data (e.g TBs) What John Smith did
  • 14. Insight in a fraction of the time Very large clickstream logging data (e.g TBs) What John Smith did
  • 15. CORE FEATURES OF AMAZON EMR
  • 17. Elastic Provision as much capacity as you need Add or remove capacity at any time Deploy Multiple Clusters Resize a Running Cluster
  • 19. Low Cost Low Hourly Pricing Amazon EC2 Spot Integration Amazon EC2 Reserved Instance Integration Elasticity Amazon S3 Integration
  • 20. Where to deploy your Hadoop clusters? Executive Summary Accenture Technology Labs Low Cost http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Technology-Labs-Where-To-Deploy-Your-Hadoop-CLusters-Executive-Summary.pdf Accenture Hadoop Study: Amazon EMR ‘offers better price-performance’
  • 23. Allows you to decouple storage and computing resources Use Amazon S3 features such as server-side encryption When you launch your cluster, EMR streams data from S3 Multiple clusters can process the same data concurrently Amazon S3 + Amazon EMR
  • 24. Amazon RDS Hadoop Distributed File System (HDFS) Amazon DynamoDB Amazon Redshift AWS Data Pipeline
  • 25. Analytics languages/enginesData management Amazon Redshift AWS Data Pipeline Amazon Kinesis Amazon S3 Amazon DynamoDB Amazon RDSAmazon EMR Data Sources
  • 26. GETTING STARTED WITH AMAZON ELASTIC MAPREDUCE
  • 27. Develop your data processing application http://aws.amazon.com/articles/Elastic-MapReduce
  • 28. Develop your data processing application Upload your application and data to Amazon S3
  • 29. Develop your data processing application Upload your application and data to Amazon S3
  • 30. Develop your data processing application Upload your application and data to Amazon S3
  • 31. Develop your data processing application Upload your application and data to Amazon S3
  • 32. Develop your data processing application Upload your application and data to Amazon S3 Configure and launch your cluster
  • 33. Configure and launch your cluster: start an EMR cluster using the console, CLI tools or an AWS SDK.
  • 34. Configure and launch your cluster: a master instance group is created, which controls the cluster.
  • 35. Configure and launch your cluster: a core instance group is created for the life of the cluster.
  • 36. Configure and launch your cluster: core instances run the DataNode and TaskTracker daemons and provide HDFS storage.
  • 37. Configure and launch your cluster: optional task instances can be added or subtracted at any time.
  • 38. Configure and launch your cluster: the master node coordinates the distribution of work and manages cluster state.
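To make the instance group structure concrete, here is a hedged sketch of launching a cluster with explicit master, core and task groups from the AWS CLI; the key name, group sizes and instance types are placeholders:

$ aws emr create-cluster --name "Three instance groups" --ami-version 3.8 \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
                      InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
                      InstanceGroupType=TASK,InstanceCount=2,InstanceType=m3.xlarge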
  • 39. Develop your data processing application Upload your application and data to Amazon S3 Configure and launch your cluster Optionally, monitor the cluster
  • 40. Develop your data processing application Upload your application and data to Amazon S3 Configure and launch your cluster Optionally, monitor the cluster Retrieve the output
  • 41. Retrieve the output: Amazon S3 can be used as the underlying ‘file system’ for input/output data instead of HDFS.
  • 42. DEMO: GETTING STARTED WITH AMAZON EMR USING A SAMPLE HADOOP STREAMING APPLICATION
  • 43. Hadoop Streaming: a utility that comes with the Hadoop distribution. It allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. The mapper reads its input from standard input and the reducer writes its output to standard output. By default, each line of input/output represents a record with a tab-separated key/value pair.
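Because a streaming job is just executables wired to stdin/stdout, you can sanity-check a mapper and reducer locally with Unix pipes before submitting anything to EMR. This is a minimal sketch; mapper.py, reducer.py and input.txt are hypothetical stand-ins for your own scripts and data:

# Simulate one map/reduce pass locally: map, shuffle-sort, then reduce
$ cat input.txt | ./mapper.py | sort | ./reducer.py > output.txt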
  • 44. Job Flow for Sample Application
  • 45. Job Flow: Step 1
JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar
Arguments:
-files s3://eu-west-1.elasticmapreduce/samples/wordcount/wordSplitter.py
-mapper wordSplitter.py
-reducer aggregate
-input s3://eu-west-1.elasticmapreduce/samples/wordcount/input
-output s3://ianmas-aws-emr/intermediate/
  • 46. Step 1: mapper: wordSplitter.py
#!/usr/bin/python
import sys
import re

def main(argv):
    # Match words: a letter followed by letters or digits
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    for line in sys.stdin:
        for word in pattern.findall(line):
            # Emit a tab-separated record the aggregate reducer understands
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
  • 47. Step 1: mapper: wordSplitter.py reads words from stdin line by line.
  • 48. Step 1: mapper: wordSplitter.py writes tab-delimited records to stdout in the format “LongValueSum:abacus 1”.
  • 49. Step 1: reducer: aggregate Sorts inputs and adds up totals: “Abacus 1” “Abacus 1” “Abacus 1” becomes “Abacus 3”
  • 50. Step 1: input/output
The input is all the objects in the S3 bucket/prefix: s3://eu-west-1.elasticmapreduce/samples/wordcount/input
Output is written to the following S3 bucket/prefix to be used as input for the next step in the job flow: s3://ianmas-aws-emr/intermediate/
One output object is created for each reducer (generally one per core).
  • 51. Job Flow: Step 2
JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar
Arguments:
-mapper /bin/cat (accept anything and return it as text)
-reducer org.apache.hadoop.mapred.lib.IdentityReducer (pass records through so they come out sorted)
-input s3://ianmas-aws-emr/intermediate/ (take the previous step's output as input)
-output s3://ianmas-aws-emr/output (output location)
-jobconf mapred.reduce.tasks=1 (use a single reduce task to get a single output object)
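For reference, steps like these can also be submitted to a running cluster from the AWS CLI. A rough sketch, assuming a hypothetical cluster ID and the same S3 locations as the first step above:

# Submit the word-count streaming step to an existing cluster (cluster ID is a placeholder)
$ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=STREAMING,Name=WordCount,ActionOnFailure=CONTINUE,Args=[-files,s3://eu-west-1.elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://eu-west-1.elasticmapreduce/samples/wordcount/input,-output,s3://ianmas-aws-emr/intermediate/]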
  • 57. Supported Hadoop Tools
Pig: an open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data. It also allows processing of complex and unstructured data sources such as text documents and log files.
Hive: an open source data warehouse & analytics package that runs on top of Hadoop. Operated by Hive QL, a SQL-based language which allows users to structure, summarize, and query data.
HBase: provides an efficient way of storing large quantities of sparse data using column-based storage. HBase provides fast lookup of data because data is stored in-memory instead of on disk. It is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes.
  • 58. Supported Hadoop Tools
Presto: an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Impala: a tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. It uses a massively parallel processing (MPP) engine similar to that found in a traditional RDBMS, which makes it well suited to interactive, low-latency analytics. You can connect to BI tools through ODBC and JDBC drivers.
Hue (New): an open source user interface for Hadoop that makes it easier to run and develop Hive queries, manage files in HDFS, run and develop Pig scripts, and manage tables.
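Tools like these are selected when the cluster is launched. A hedged sketch using the AWS CLI; the key name, instance settings and AMI version are placeholders, and which applications are available depends on the AMI or release version you choose:

$ aws emr create-cluster --name "Analytics cluster" --ami-version 3.8 \
    --applications Name=Hive Name=Pig Name=Hue \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge \
    --instance-count 3 --use-default-roles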
  • 61. Create a Cluster with Spark (AWS CLI)
$ aws emr create-cluster --name "Spark cluster" --ami-version 3.8 --applications Name=Spark --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --use-default-roles
$ ssh -i myKey hadoop@masternode
Invoke the Spark shell with:
$ spark-shell
or
$ pyspark
  • 62. Working with the Spark Shell (pyspark)
Counting the occurrences of a string in a text file stored in Amazon S3 with Spark:
$ pyspark
>>> sc
<pyspark.context.SparkContext object at 0x7fe7e659fa50>
>>> textfile = sc.textFile("s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-13-08-05/ec2-0-51-75-39.amazon.com-2009-04-13-08-05.log")
>>> linesWithCartoonNetwork = textfile.filter(lambda line: "cartoonnetwork.com" in line).count()
15/06/04 17:12:22 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
<snip>
>>> linesWithCartoonNetwork
9
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark.html
  • 64. CONTROL NETWORK ACCESS TO YOUR EMR CLUSTER
Using SSH local port forwarding:
$ ssh -i EMRKeyPair.pem -N -L 8160:ec2-52-16-143-78.eu-west-1.compute.amazonaws.com:8888 hadoop@ec2-52-16-143-78.eu-west-1.compute.amazonaws.com
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
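An alternative sketch, if you would rather reach all of the cluster web interfaces at once, is SSH dynamic port forwarding: open a SOCKS proxy on an arbitrary local port and point a browser proxy extension at it. The port number below is just an example:

# Open a SOCKS proxy on local port 8157, tunnelled through the master node
$ ssh -i EMRKeyPair.pem -N -D 8157 hadoop@ec2-52-16-143-78.eu-west-1.compute.amazonaws.com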
  • 65. MANAGE USERS, PERMISSIONS AND ENCRYPTION docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-access.html
  • 66. INSTALL ADDITIONAL SOFTWARE WITH BOOTSTRAP ACTIONS docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
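A bootstrap action is a script, usually stored in S3, that EMR runs on every node before Hadoop starts, which is how additional software is typically installed. A minimal sketch, assuming a hypothetical install script already uploaded to your own bucket:

$ aws emr create-cluster --name "Cluster with bootstrap action" --ami-version 3.8 \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://mybucket/install-my-tool.sh,Name=InstallMyTool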
  • 67. EFFICIENTLY COPY DATA TO EMR FROM AMAZON S3
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Run on a cluster master node:
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'
  • 70. DEBUG YOUR APPLICATIONS docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html Log files generated by EMR Clusters include: • Step logs • Hadoop logs • Bootstrap action logs • Instance state logs
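To keep those logs available after the cluster terminates, you can point the cluster at an S3 log location and enable the debugging tool at creation time. A sketch with a hypothetical bucket; other options are placeholders:

$ aws emr create-cluster --name "Cluster with debugging" --ami-version 3.8 \
    --log-uri s3://mybucket/emr-logs/ --enable-debugging \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
    --use-default-roles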
  • 71. USE THE MAPR DISTRIBUTION aws.amazon.com/elasticmapreduce/mapr/
  • 72. TUNE YOUR CLUSTER FOR COST & PERFORMANCE
Supported EC2 instance types:
• General Purpose
• Compute Optimized
• Memory Optimized
• Storage Optimized (D2 instance family, New): D2 instances are available in four sizes with 6 TB, 12 TB, 24 TB, and 48 TB storage options
• GPU Instances
  • 73. TUNE YOUR CLUSTER FOR COST & PERFORMANCE
Scenario #1: allocate 4 on-demand instances; job flow duration 14 hours
Scenario #2: allocate 4 on-demand instances and add 5 Spot instances; job flow duration 7 hours
Cost without Spot: 4 instances * 14 hrs * $0.50 = $28
Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75; total = $22.75
Time savings: 50%; cost savings: ~19%
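The second scenario corresponds to adding a Spot task instance group to the running cluster. A sketch from the AWS CLI, with the cluster ID, instance type and bid price as placeholders:

# Add 5 Spot task instances at a $0.25/hr bid to an existing cluster
$ aws emr add-instance-groups --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups InstanceCount=5,InstanceGroupType=TASK,InstanceType=m3.xlarge,BidPrice=0.25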
  • 76. RESOURCES YOU CAN USE TO LEARN MORE
  • 78. Getting Started with Amazon EMR Tutorial guide: docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-get-started.html Customer Case Studies for Big Data Use-Cases aws.amazon.com/solutions/case-studies/big-data/ Amazon EMR Documentation: aws.amazon.com/documentation/emr/
  • 79. AWS Training & Certification
Self-Paced Labs (aws.amazon.com/training/self-paced-labs): try products, gain new skills, and get hands-on practice working with AWS technologies
Training (aws.amazon.com/training): build technical expertise to design and operate scalable, efficient applications on AWS
Certification (aws.amazon.com/certification): validate your proven skills and expertise with the AWS platform
  • 80. Follow us for more events & webinars @AWScloud for Global AWS News & Announcements @AWS_UKI for local AWS events & news @IanMmmm Ian Massingham — Technical Evangelist