Introduction To Elastic MapReduce at WHUG

Elastic MapReduce presentation given at the 2nd meeting of the Warsaw Hadoop User Group.

See also the demonstration at www.youtube.com/watch?v=Azwilbn8GCs. It shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR (a plugin for Eclipse) to launch big calculations quickly and easily.

Published in: Education, Technology, Business

  1. Possible real-world situation
     ● We have big data and/or a very long, embarrassingly parallel computation
     ● Our data may grow fast
     ● We want to start trying Hadoop as soon as possible
     ● We do not have our own infrastructure
     ● We do not have Hadoop administrators
     ● We have limited funds
  2. Possible solution
     Amazon Elastic MapReduce (EMR)
     ● Hadoop framework running on the web-scale infrastructure of Amazon
  3. EMR Benefits
     Elastic (scalable)
     ● Use one, a hundred, or even thousands of instances to process even petabytes of data
     ● Modify the number of instances while the job flow is running
     ● Start computation within minutes
  4. EMR Benefits
     Easy to use
     ● No configuration necessary
       ○ No need to worry about setting up hardware and networking, or about running, managing and tuning the performance of a Hadoop cluster
     ● Easy-to-use tools and plugins available
       ○ AWS Web Management Console
       ○ Command Line Tools by Amazon
       ○ Amazon EMR API, SDK, Libraries
       ○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)
  5. EMR Benefits
     Reliable
     ● Built on Amazon's highly available and battle-tested infrastructure
     ● Provisions new nodes to replace those that fail
     ● Used by e.g.:
  6. EMR Benefits
     Cost effective
     ● Pay for what you use (for each started hour)
     ● Choose from various instance types to meet your requirements
     ● Possibility to reserve instances for 1 or 3 years to pay less per hour
  7. EMR Overview
     Amazon Elastic MapReduce (Amazon EMR) works in conjunction with
     ● Amazon EC2 to rent computing instances (with Hadoop installed)
     ● Amazon S3 to store input and output data, scripts/applications and logs
  8. EMR Architectural Overview
     * image from the Internet
  9. EC2 Instance Types
     * image from Big Data University, Course: "Hadoop and the Amazon Cloud"
  10. EMR Pricing - "On-demand" instances
      Standard Family Instances (US East Region)
      http://aws.amazon.com/elasticmapreduce/pricing/
  11. EC2 & S3 Pricing - Real-world example
      The New York Times wanted to host all public domain articles from 1851 to 1922.
      ● 11 million articles
      ● 4 TB of raw TIFF image input data converted to 1.5 TB of PDF documents
      ● 100 EC2 instances rented
      ● < 24 hours of computation
      ● $240 paid (not including storage & bandwidth)
      ● 1 employee assigned to this task
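The $240 figure is consistent with simple instance-hour arithmetic. The sketch below derives the hourly rate implied by the slide's numbers. It assumes every instance was billed for a full 24 started hours (the slide only says "< 24 hours"), and it does not take the rate from any price list. The function name is illustrative.

```python
def implied_hourly_rate(total_cost_usd, instances, billed_hours):
    """Return the per-instance hourly rate implied by a total bill."""
    instance_hours = instances * billed_hours
    return total_cost_usd / instance_hours


# Cost and instance count come from the slide;
# 24 billed hours per instance is an assumption.
rate = implied_hourly_rate(total_cost_usd=240, instances=100, billed_hours=24)
print(f"{rate:.2f} USD per instance-hour")
```

Because EMR bills for each started hour, a run that finishes slightly under 24 hours is still billed as 24 hours per instance, which is why the assumption is reasonable here.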
  12. EC2 & S3 Pricing - Real-world example
      How much did they pay for storage and bandwidth?
  13. S3 Pricing
      http://aws.amazon.com/s3/pricing/
  14. EC2 & S3 Pricing Calculator
      Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html
  15. AWS Free Usage Tier (Per Month)
      Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
      ● 750 hours of Amazon EC2 Micro Instance usage
        ○ 613 MB of memory and 32-bit or 64-bit platform
      ● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
      ● 15 GB of bandwidth out aggregated across all AWS services
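A quick check shows what 750 free hours per month means in practice: it is enough to keep a single Micro Instance running around the clock, even in a 31-day month. This is plain arithmetic on the number from the slide, and the helper name is illustrative.

```python
FREE_TIER_HOURS = 750  # monthly EC2 Micro Instance allowance from the slide


def hours_in_month(days):
    """Hours one instance accumulates when it runs non-stop for `days` days."""
    return days * 24


longest_month = hours_in_month(31)
print(longest_month, longest_month <= FREE_TIER_HOURS)
```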
  16. EMR - Support for Hadoop Ecosystem
      Develop and run MapReduce applications using:
      ● Java
      ● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++)
      ● Pig
      ● Hive
      HBase can be easily installed using a set of EC2 scripts
  17. EMR - Featured Users
      * logos from http://aws.amazon.com/elasticmapreduce/
  18. EMR - Case Study - Yelp
      ● helps people connect with great local businesses
      ● lets them share reviews and insights
      ● as of November 2010:
        ○ 39 million monthly unique visitors
        ○ in total, 14 million reviews posted
  19. EMR - Case Study - Yelp
  20. EMR - Case Study - Yelp
      ● uses S3 to store daily logs (~100 GB/day) and photos
      ● uses EMR to power features like
        ○ People who viewed this also viewed
        ○ Review highlights
        ○ Autocomplete in search box
        ○ Top searches
      ● implements jobs in Python and uses its own open-source library, mrjob, to run them on EMR
  21. mrjob - WordCount example

      from mrjob.job import MRJob

      class MRWordCounter(MRJob):
          def mapper(self, key, line):
              # emit (word, 1) for every whitespace-separated word in the line
              for word in line.split():
                  yield word, 1

          def reducer(self, word, occurrences):
              # sum all the counts collected for this word
              yield word, sum(occurrences)

      if __name__ == '__main__':
          MRWordCounter.run()
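To see what the framework does with a mapper and a reducer like the ones in the WordCount example, here is a local sketch in plain Python. It uses neither mrjob nor Hadoop. `run_local` is a hypothetical helper written for this illustration, not part of mrjob. It imitates the three phases on a single machine: map, group by key (the "shuffle"), and reduce.

```python
from collections import defaultdict


def mapper(_, line):
    # Same logic as the WordCount mapper: one (word, 1) pair per word.
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # Same logic as the WordCount reducer: total count per word.
    yield word, sum(occurrences)


def run_local(lines):
    """Hypothetical helper: map every line, group values by key, then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)

    results = {}
    for key in sorted(groups):  # Hadoop also hands keys to reducers in sorted order
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


counts = run_local(["to be or not to be", "be quick"])
print(counts)
```

On a real cluster the grouping step is distributed across nodes, but the contract is the same: each reducer call receives one key together with all values emitted for it.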
  22. mrjob - run on EMR
      $ python wordcount.py --ec2-instance-type c1.medium --num-ec2-instances 10 -r emr s3://input-bucket/*.txt > output
  23. Demo
  24. Million Song Dataset
      ● Contains detailed acoustic and contextual data for one million popular songs
      ● ~300 GB of data
      ● Publicly available
        ○ for download: http://www.infochimps.com/collections/million-songs
        ○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/
  25. Million Song Dataset
      Contains data such as:
      ● Song's title, year and hotness
      ● Song's tempo, duration, danceability, energy, loudness, segments count, preview (URL to mp3 file) and so on
      ● Artist's name and hotness
  26. Million Song Dataset - Song's density
      Song's density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song.
      density = segmentCnt / duration
      * based on Paul Lamere's blog - http://bit.ly/qUbLdQ
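The density formula translates directly into code. The function below is a minimal sketch. The guard against non-positive durations and the sample numbers are assumptions made for illustration, not values taken from the dataset.

```python
def song_density(segment_count, duration_seconds):
    """Average number of segments per second of a song."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds


# Hypothetical song: 900 segments in a 300-second track.
print(song_density(900, 300.0))
```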
  27. Million Song Dataset - Task*
      Simple music recommendation system
      ● Calculate density for each song
      ● Find hot songs with similar density
      * based on Paul Lamere's blog - http://bit.ly/qUbLdQ
  28. Million Song Dataset - MapReduce
      Input data
      ● 339 files
      ● Each file contains ~3,000 songs
      ● Each song is represented by one line in the input file
      ● Fields are separated by a tab character
  29. Million Song Dataset - MapReduce
      Mapper
      ● Reads song data from each line of input text
      ● Calculates the song's density
      ● Emits the song's density as the key, with some other details as the value
      <line_offset, song_data> -> <density, (artist_name, song_title, song_url)>
  30. public void map(LongWritable key, Text value,
              OutputCollector<FloatWritable, TripleTextWritable> output,
              Reporter reporter) throws IOException {
          // parse the tab-separated input line into the reusable song object
          song.parseLine(value.toString());
          // skip songs with missing tempo or duration
          if (song.tempo > 0 && song.duration > 0) {
              // calculate density
              float density = ((float) song.segmentCnt) / song.duration;
              densityWritable.set(density);
              songWritable.set(song.artistName, song.title, song.preview);
              // emit <density, (artist_name, song_title, song_url)>
              output.collect(densityWritable, songWritable);
          }
      }
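The same mapper logic can be sketched in Python, the language of the mrjob example. The slides do not give the column layout of the input files, so the `FIELDS` order below is purely hypothetical, as are the function names and the sample line. Only the filtering rule and the emitted key/value shape follow the Java mapper.

```python
# Hypothetical column order; the real layout of the dataset files is not shown on the slides.
FIELDS = ("artist_name", "title", "tempo", "duration", "segment_count", "preview")


def parse_line(line):
    """Split one tab-separated input line into a field-name -> string dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))


def map_song(line):
    """Yield (density, (artist_name, title, preview)) for songs with valid data."""
    song = parse_line(line)
    tempo = float(song["tempo"])
    duration = float(song["duration"])
    # Same filter as the Java mapper: skip songs without tempo or duration.
    if tempo > 0 and duration > 0:
        density = int(song["segment_count"]) / duration
        yield density, (song["artist_name"], song["title"], song["preview"])


sample = "Some Artist\tSome Song\t120.0\t200.0\t500\thttp://example.com/a.mp3"
print(list(map_song(sample)))
```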
  31. Million Song Dataset - MapReduce
      Reducer
      ● Identity Reducer
      ● Each Reducer gets density values from a different range: <i,i+1)*,**
      <density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>
      * thanks to a custom Partitioner
      ** not optimal partitioning (partitions are not balanced)
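The slides do not show the custom Partitioner itself, so the following is only a sketch of one way to route densities in the range <i, i+1) to reducer i. Clamping out-of-range values to the last reducer is an assumption, and the sample densities are made up. The small tally at the end illustrates why such partitions tend to be unbalanced: most songs fall into a few density ranges.

```python
import math
from collections import Counter


def density_partition(density, num_reducers):
    """Return the reducer index for a density: values in <i, i+1) go to reducer i.

    Densities at or above num_reducers are clamped to the last reducer (an assumption).
    """
    bucket = int(math.floor(density))
    return max(0, min(bucket, num_reducers - 1))


# Made-up densities clustered around 2-4 segments per second.
sample_densities = [2.1, 2.9, 3.0, 3.4, 3.7, 3.9, 4.2, 0.5, 25.7]
load = Counter(density_partition(d, num_reducers=10) for d in sample_densities)
print(sorted(load.items()))
```

A more balanced scheme would choose range boundaries from the observed density distribution instead of fixed unit-width ranges.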
  32. Demo - used software
      ● Karmasphere Studio for EMR (Eclipse plugin)
        ○ graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
  33. Demo - used software
      ● Karmasphere Studio for EMR (Eclipse plugin)
      images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html
  34. Video
  35. Please watch the video on the WHUG channel on YouTube: http://www.youtube.com/watch?v=Azwilbn8GCs
  36. Thank you!
  37. Join us! whug.org