Possible real-world situation
● We have big data and/or very long,
  embarrassingly parallel computation
● Our data may grow fast
● We want to start and try Hadoop asap

● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds
Possible solution
Amazon Elastic MapReduce (EMR)
● Hadoop framework running on Amazon's
  web-scale infrastructure
EMR Benefits
Elastic (scalable)
● Use one, a hundred, or even thousands of
   instances to process even petabytes of data
● Modify the number of instances while the job
   flow is running
● Start computation within minutes
EMR Benefits
Easy to use
● No configuration necessary
  ○ No need to set up hardware and networking, or to
       run, manage and tune the performance of a
       Hadoop cluster
● Easy-to-use tools and plugins available
   ○   AWS Web Management Console
   ○   Command Line Tools by Amazon
   ○   Amazon EMR API, SDK, Libraries
   ○   Plugins for IDEs (e.g. Eclipse & Karmasphere Studio
       for EMR)
EMR Benefits
Reliable
● Built on Amazon's highly available and
  battle-tested infrastructure
● Provision new nodes to replace those that
  fail
● Used by e.g.:
EMR Benefits
Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types to meet
  your requirements
● Possibility to reserve instances for 1 or 3
  years to pay less per hour
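The per-started-hour billing mentioned on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_cost` and the hourly rate used in the example are hypothetical and are not actual AWS prices.

```python
import math

def estimate_cost(instances, runtime_hours, hourly_rate):
    """Estimate the cost of a job flow billed per started instance-hour.

    hourly_rate is a placeholder value, not a real AWS price.
    """
    # Every started hour is charged in full, so round the runtime up.
    billed_hours = math.ceil(runtime_hours)
    return instances * billed_hours * hourly_rate

# 10 instances running for 2.5 hours at a hypothetical $0.25 per hour
# are each billed for 3 full hours.
print(estimate_cost(10, 2.5, 0.25))
```

Under this model a job that finishes just after an hour boundary costs as much as one that uses the whole next hour.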
EMR Overview
Amazon Elastic MapReduce (Amazon EMR)
works in conjunction with
● Amazon EC2 to rent computing instances
  (with Hadoop installed)
● Amazon S3 to store input and output data,
  scripts/applications and logs
EMR Architectural Overview
* image from the Internet
EC2 Instance Types
* image from Big Data University, Course: "Hadoop and the Amazon Cloud"
EMR Pricing - "On-demand" instances
Standard Family Instances (US East Region)
http://aws.amazon.com/elasticmapreduce/pricing/
EC2 & S3 Pricing - Real-world example
The New York Times wanted to host all public
domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw image TIFF input data converted
  to 1.5 TB of PDF documents
● 100 EC2 Instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
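The figures above imply an hourly rate that the slide does not state. A rough back-of-the-envelope check, assuming all 100 instances were billed for the full 24 hours (the slide only gives "< 24 hours", so this is an upper bound on the instance-hours):

```python
instances = 100
max_hours = 24          # assumption: upper bound taken from "< 24 hours"
compute_cost = 240.0    # USD, excluding storage and bandwidth

# Total instance-hours billed under the assumption above.
instance_hours = instances * max_hours

# Price per instance-hour implied by the slide's figures.
implied_rate = compute_cost / instance_hours

print(instance_hours, implied_rate)
```

Under that assumption the job used 2,400 instance-hours at roughly $0.10 per instance-hour.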
EC2 & S3 Pricing - Real-world example
How much did they pay for storage and bandwidth?
S3 Pricing
http://aws.amazon.com/s3/pricing/
EC2 & S3 Pricing Calculator
Simple Monthly Calculator:
http://calculator.s3.amazonaws.com/calc5.html
AWS Free Usage Tier (Per Month)
Available for free to new AWS customers for 12
months following AWS sign-up date e.g.:
● 750 hours of Amazon EC2 Micro Instance
  usage
    ○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage,
  20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across
  all AWS services
EMR - Support for Hadoop Ecosystem
Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R,
  or C++)
● Pig
● Hive

HBase can be easily installed using a set of EC2
scripts
EMR - Featured Users
* logos from http://aws.amazon.com/elasticmapreduce/
EMR - Case Study - Yelp

● help people connect
  with great local businesses
● share reviews and insights

● as of November 2010:
  ○ 39 million monthly unique visitors
  ○ in total, 14 million reviews posted
EMR - Case Study - Yelp
● uses S3 to store daily logs (~100GB/day)
  and photos
● uses EMR to power features like
    ○   People who viewed this also viewed
    ○   Review highlights
    ○   Autocomplete in search box
    ○   Top searches
●   implements jobs in Python and uses their
    own open-source library, mrjob, to run them
    on EMR
mrjob - WordCount example
from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # sum the counts emitted for each word
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
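To see what this mapper and reducer compute without installing mrjob or starting a cluster, the same logic can be simulated in plain Python. This is only a local sketch: `run_word_count` is a hypothetical helper, and the in-memory grouping stands in for Hadoop's shuffle phase.

```python
from collections import defaultdict

def mapper(line):
    # Same logic as MRWordCounter.mapper: one (word, 1) pair per word.
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    # Same logic as MRWordCounter.reducer: sum the counts for one word.
    yield word, sum(occurrences)

def run_word_count(lines):
    # "Shuffle": group all mapper values by key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    # "Reduce": call the reducer once per key.
    result = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            result[key] = total
    return result

print(run_word_count(["to be or not to be", "to do"]))
```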
mrjob - run on EMR
$ python wordcount.py -r emr \
    --ec2-instance-type c1.medium \
    --num-ec2-instances 10 \
    's3://input-bucket/*.txt' > output
Demo
Million Song Dataset
● Contains detailed acoustic and contextual
  data for one million popular songs
● ~300 GB of data
● Publicly available
  ○ for download: http://www.infochimps.com/collections/million-songs
  ○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/
Million Song Dataset
Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability,
  energy, loudness, segments count, preview
  (URL to mp3 file) and so on
● Artist's name and hotness
Million Song Dataset - Song's density
Song's density* can be defined as the average
number of notes or atomic sounds (called
segments) per second in a song.

        density = segmentCnt / duration
 
 
 
* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
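The density formula can be written as a small Python function. This is a minimal sketch: the function name `song_density` and the example values are made up for illustration and do not come from the dataset.

```python
def song_density(segment_count, duration_seconds):
    """Average number of segments per second in a song."""
    if duration_seconds <= 0:
        # A song without a positive duration has no meaningful density.
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds

# A hypothetical 300-second song with 900 segments.
print(song_density(900, 300.0))
```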
Million Song Dataset - Task*
Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density




* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
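One possible reading of "hot songs with similar density" is sketched below. The slides do not define thresholds, so the `tolerance` and `min_hotness` parameters, the dictionary layout and the sample songs are all assumptions made for illustration.

```python
def similar_hot_songs(songs, target_density, tolerance=0.5, min_hotness=0.5):
    """Return hot songs whose density is within `tolerance` of the target."""
    matches = [
        s for s in songs
        if abs(s["density"] - target_density) <= tolerance
        and s["hotness"] >= min_hotness
    ]
    # Hottest songs first.
    return sorted(matches, key=lambda s: s["hotness"], reverse=True)

# Hypothetical sample data.
songs = [
    {"title": "A", "density": 2.0, "hotness": 0.9},
    {"title": "B", "density": 2.25, "hotness": 0.75},
    {"title": "C", "density": 5.0, "hotness": 0.95},
    {"title": "D", "density": 2.5, "hotness": 0.25},
]
print([s["title"] for s in similar_hot_songs(songs, target_density=2.25)])
```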
Million Song Dataset - MapReduce
Input data
● 339 files
● Each file contains ~3,000 songs
● Each song is represented by one line in
   the input file
● Fields are separated by a tab character
Million Song Dataset - MapReduce
Mapper
● Reads a song's data from each line of input
  text
● Calculates the song's density
● Emits the song's density as key with some other
  details as value

<line_offset, song_data> ->
           <density, (artist_name, song_title, song_url)>
public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {

    // parse the tab-separated song record from the input line
    song.parseLine(value.toString());
    // skip records with non-positive tempo or duration
    // (this also avoids division by zero)
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density
        float density = ((float) song.segmentCnt) / song.duration;

        denstyWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);

        // emit density as key and song details as value
        output.collect(denstyWritable, songWritable);
    }
}
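The same mapper logic can be sketched in Python, for example for use with Hadoop Streaming or mrjob. The column positions below are hypothetical, since the slides do not list the field order of the input files, and `map_song` is an illustrative name.

```python
# Hypothetical column positions; the real dataset layout may differ.
ARTIST, TITLE, PREVIEW, TEMPO, DURATION, SEGMENT_CNT = range(6)

def map_song(line):
    """Yield (density, (artist_name, song_title, song_url)) for one input line."""
    fields = line.rstrip("\n").split("\t")
    try:
        tempo = float(fields[TEMPO])
        duration = float(fields[DURATION])
        segment_cnt = int(fields[SEGMENT_CNT])
    except (IndexError, ValueError):
        return  # skip malformed records
    # Same filter as the Java mapper: positive tempo and duration only.
    if tempo > 0 and duration > 0:
        density = segment_cnt / duration
        yield density, (fields[ARTIST], fields[TITLE], fields[PREVIEW])

line = "Some Artist\tSome Title\thttp://example.com/p.mp3\t120.0\t200.0\t500"
print(list(map_song(line)))
```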
Million Song Dataset - MapReduce
Reducer
● Identity Reducer
● Each Reducer gets density values from
  a different range: [i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> ->
               <density, (artist_name, song_title, song_url)>


* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
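The slides do not show the custom Partitioner itself. A minimal Python sketch of the idea, assuming reducer i receives densities in the range [i, i+1) and that out-of-range values are clamped to the last reducer (the clamping is an assumption, as is the name `density_partition`):

```python
import math

def density_partition(density, num_reducers):
    """Map a density to a reducer index so that reducer i gets [i, i+1)."""
    # floor(density) == i for every density in [i, i+1).
    bucket = math.floor(density)
    # Clamp so that very high densities still land in a valid partition.
    return max(0, min(bucket, num_reducers - 1))

print([density_partition(d, 10) for d in (0.3, 2.5, 3.0, 42.0)])
```

Because real densities are unlikely to be spread evenly over these integer ranges, some reducers receive far more records than others, which is the imbalance noted in the second footnote.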
Demo - used software
● Karmasphere Studio for EMR (Eclipse
  plugin)
  ○ graphical environment that supports the complete
    lifecycle for developing for Amazon Elastic
    MapReduce, including prototyping, developing,
    testing, debugging, deploying and optimizing
    Hadoop Jobs
    (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
Demo - used software
● Karmasphere Studio for EMR (Eclipse
  plugin)
images from:
http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html
Video
Please watch the video on the WHUG channel on
YouTube:
http://www.youtube.com/watch?v=Azwilbn8GCs
Thank you!
Join us!
whug.org

Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 

Introduction To Elastic MapReduce at WHUG

  • 1.
  • 2. Possible real-world situation ● We have big data and/or very long, embarrassingly parallel computation ● Our data may grow fast ● We want to start and try Hadoop asap ● We do not have our own infrastructure ● We do not have Hadoop administrators ● We have limited funds
  • 3. Possible solution Amazon Elastic MapReduce (EMR) ● Hadoop framework running on the web scale infrastructure of Amazon
  • 4. EMR Benefits Elastic (scalable) ● Use one, a hundred, or even thousands of instances to process even petabytes of data ● Modify the number of instances while the job flow is running ● Start computation within minutes
  • 5. EMR Benefits Easy to use ● No configuration necessary ○ Do not worry about setting up hardware and networking, running, managing and tuning the performance of Hadoop cluster ● Easy-to-use tools and plugins available ○ AWS Web Management Console ○ Command Line Tools by Amazon ○ Amazon EMR API, SDK, Libraries ○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)
  • 6. EMR Benefits Reliable ● Built on Amazon's highly available and battle-tested infrastructure ● Provision new nodes to replace those that fail ● Used by e.g.:
  • 7. EMR Benefits Cost effective ● Pay for what you use (for each started hour) ● Choose from various instance types to meet your requirements ● Possibility to reserve instances for 1 or 3 years to pay less per hour
  • 8. EMR Overview Amazon Elastic MapReduce (Amazon EMR) works in conjunction with ● Amazon EC2 to rent computing instances (with Hadoop installed) ● Amazon S3 to store input and output data, scripts/applications and logs
  • 9. EMR Architectural Overview * image from the Internet
  • 10. EC2 Instance Types * image from Big Data University, Course: "Hadoop and the Amazon Cloud"
  • 11. EMR Pricing - "On-demand" instances Standard Family Instances (US East Region) http://aws.amazon.com/elasticmapreduce/pricing/
  • 12. EC2 & S3 Pricing - Real-world example New York Times wanted to host all public domain articles from 1851 to 1922. ● 11 million articles ● 4 TB of raw image TIFF input data converted to 1.5 TB of PDF documents ● 100 EC2 Instances rented ● < 24 hours of computation ● $240 paid (not including storage & bandwidth) ● 1 employee assigned to this task
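The figures on this slide imply an average price per instance-hour. The sketch below works that out, assuming (this is not stated on the slide) that all 100 instances were billed for the same number of started hours and that "< 24 hours" rounds up to 24 billed hours. The function name `implied_hourly_rate` is made up for illustration.

```python
import math


def implied_hourly_rate(total_cost, instances, hours):
    """Average cost per instance-hour, billing every started hour in full."""
    # Assumption: each instance is billed for the same number of started hours.
    billed_hours = math.ceil(hours)
    return total_cost / (instances * billed_hours)


if __name__ == "__main__":
    # $240 total, 100 instances, just under 24 hours of computation.
    rate = implied_hourly_rate(240, 100, 23.5)
    print(f"about ${rate:.2f} per instance-hour")
```

Under these assumptions the job comes out at roughly ten cents per instance-hour, excluding storage and bandwidth.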
  • 13.
  • 14. EC2 & S3 Pricing - Real-world example How much did they pay for storage and bandwidth?
  • 16. EC2 & S3 Pricing Calculator Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html
  • 17. AWS Free Usage Tier (Per Month) Available for free to new AWS customers for 12 months following AWS sign-up date e.g.: ● 750 hours of Amazon EC2 Micro Instance usage ○ 613 MB of memory and 32-bit or 64-bit platform ● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests ● 15 GB of bandwidth out aggregated across all AWS services
  • 18. EMR - Support for Hadoop Ecosystem Develop and run MapReduce applications using: ● Java ● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++) ● Pig ● Hive ● HBase can be easily installed using a set of EC2 scripts
  • 19. EMR - Featured Users * logos from http://aws.amazon.com/elasticmapreduce/
  • 20. EMR - Case Study - Yelp ● helps people connect with great local businesses ● lets them share reviews and insights ● as of November 2010: ○ 39 million monthly unique visitors ○ in total, 14 million reviews posted
  • 21. EMR - Case Study - Yelp
  • 22. EMR - Case Study - Yelp ● uses S3 to store daily logs (~100GB/day) and photos ● uses EMR to power features like ○ People who viewed this also viewed ○ Review highlights ○ Autocomplete in search box ○ Top searches ● implements jobs in Python and uses their own open-source library, mrjob, to run them on EMR
  • 23. mrjob - WordCount example
        from mrjob.job import MRJob

        class MRWordCounter(MRJob):
            def mapper(self, key, line):
                for word in line.split():
                    yield word, 1

            def reducer(self, word, occurrences):
                yield word, sum(occurrences)

        if __name__ == '__main__':
            MRWordCounter.run()
  • 24. mrjob - run on EMR $ python wordcount.py --ec2-instance-type c1.medium --num-ec2-instances 10 -r emr 's3://input-bucket/*.txt' > output
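To see what the WordCount mapper and reducer compute without mrjob or an EMR cluster, the same two functions can be driven by a small in-memory shuffle. This is only a sketch in plain Python: `run_local` is a hypothetical helper, not part of mrjob, and it imitates the map, group-by-key, and reduce phases in a single process.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # Sum all the 1s collected for this word.
    yield word, sum(occurrences)


def run_local(lines):
    """Hypothetical single-process stand-in for the map/shuffle/reduce phases."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)  # "shuffle": group values by key
    result = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            result[key] = total
    return result


if __name__ == "__main__":
    print(run_local(["to be or not to be"]))
```

On a real cluster the grouping step is done by Hadoop between the map and reduce phases; here a dictionary plays that role.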
  • 25. Demo
  • 26. Million Song Dataset ● Contains detailed acoustic and contextual data for one million popular songs ● ~300 GB of data ● Publicly available ○ for download: http://www.infochimps.com/collections/million-songs ○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/
  • 27. Million Song Dataset Contains data such as: ● Song's title, year and hotness ● Song's tempo, duration, danceability, energy, loudness, segments count, preview (URL to mp3 file) and so on ● Artist's name and hotness
  • 28. Million Song Dataset - Song's density Song's density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song. density = segmentCnt / duration * based on Paul Lamere's blog - http://bit.ly/qUbLdQ
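As a small illustration of the formula above, the density can be computed as follows. This is a sketch, assuming the duration is given in seconds and that songs with a non-positive duration should be rejected. The function and parameter names are chosen for this example only.

```python
def density(segment_cnt, duration):
    """Average number of segments per second of a song."""
    if duration <= 0:
        # Assumption: a song without a positive duration has no defined density.
        raise ValueError("duration must be positive")
    return float(segment_cnt) / duration


if __name__ == "__main__":
    # A 300-second song with 900 segments has 3 segments per second.
    print(density(900, 300.0))
```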
  • 29. Million Song Dataset - Task* Simple music recommendation system ● Calculate density for each song ● Find hot songs with similar density * based on Paul Lamere's blog - http://bit.ly/qUbLdQ
  • 30. Million Song Dataset - MapReduce Input data ● 339 files ● Each file contains ~3 000 songs ● Each song is represented by one line in input file ● Fields are separated by a tab character
  • 31. Million Song Dataset - MapReduce Mapper ● Reads song's data from each line of input text ● Calculates song's density ● Emits song's density as key with some other details as value <line_offset, song_data> -> <density, (artist_name, song_title, song_url)>
  • 32.
        public void map(LongWritable key, Text value,
                OutputCollector<FloatWritable, TripleTextWritable> output,
                Reporter reporter) throws IOException {
            song.parseLine(value.toString());
            if (song.tempo > 0 && song.duration > 0) {
                // calculate density
                float density = ((float) song.segmentCnt) / song.duration;
                denstyWritable.set(density);
                songWritable.set(song.artistName, song.title, song.preview);
                output.collect(denstyWritable, songWritable);
            }
        }
  • 33. Million Song Dataset - MapReduce Reducer ● Identity Reducer ● Each Reducer gets density values from a different range: [i, i+1)*,** <density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)> * thanks to a custom Partitioner ** not optimal partitioning (partitions are not balanced)
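The custom Partitioner itself is not shown in the deck. The sketch below is one possible way to send every density in a half-open unit range, from i up to but not including i+1, to reducer i. It is an assumption, not the author's code. In particular, clamping out-of-range values to the last reducer is a choice made here for illustration.

```python
import math


def partition(density, num_reducers):
    """Return the reducer index for a density value (hypothetical scheme)."""
    bucket = int(math.floor(density))  # densities from i up to i+1 map to i
    # Assumption: values beyond the last bucket are clamped to the last reducer.
    return max(0, min(bucket, num_reducers - 1))


if __name__ == "__main__":
    for d in (0.5, 3.7, 3.99, 12.2):
        print(d, "->", partition(d, 10))
```

A scheme like this keeps similar densities together but gives unbalanced partitions whenever most songs fall into a few unit ranges, which matches the caveat on the slide.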
  • 34. Demo - used software ● Karmasphere Studio for EMR (Eclipse plugin) ○ graphical environment that supports the complete lifecycle for developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop Jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
  • 35. Demo - used software ● Karmasphere Studio for EMR (Eclipse plugin) images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html
  • 36. Video
  • 37. Please watch video on WHUG channel on YouTube http://www.youtube.com/watch?v=Azwilbn8GCs