
Introduction To Elastic MapReduce at WHUG



Elastic MapReduce presentation given at the 2nd meeting of the Warsaw Hadoop User Group.

Also watch the demonstration, which shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR (a plugin for Eclipse) and launch big calculations quickly and easily.

Published in: Education, Technology, Business


  1. 1. Possible real-world situation
     ● We have big data and/or a very long, embarrassingly parallel computation
     ● Our data may grow fast
     ● We want to start and try Hadoop ASAP
     ● We do not have our own infrastructure
     ● We do not have Hadoop administrators
     ● We have limited funds
  2. 2. Possible solution
     Amazon Elastic MapReduce (EMR)
     ● Hadoop framework running on Amazon's web-scale infrastructure
  3. 3. EMR Benefits
     Elastic (scalable)
     ● Use one, a hundred, or even thousands of instances to process even petabytes of data
     ● Modify the number of instances while the job flow is running
     ● Start computation within minutes
  4. 4. EMR Benefits
     Easy to use
     ● No configuration necessary
       ○ Do not worry about setting up hardware and networking, or about running, managing and tuning the performance of a Hadoop cluster
     ● Easy-to-use tools and plugins available
       ○ AWS Web Management Console
       ○ Command Line Tools by Amazon
       ○ Amazon EMR API, SDK, Libraries
       ○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)
  5. 5. EMR Benefits
     Reliable
     ● Built on Amazon's highly available and battle-tested infrastructure
     ● Provisions new nodes to replace those that fail
     ● Used by e.g.:
  6. 6. EMR Benefits
     Cost effective
     ● Pay for what you use (for each started hour)
     ● Choose from various instance types to meet your requirements
     ● Option to reserve instances for 1 or 3 years to pay less per hour
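A minimal sketch of what "pay for each started hour" means for a bill, assuming every partly used hour is rounded up to a full one. The function name and the $0.10 rate are illustrative assumptions, not quoted AWS prices.

```python
import math

def on_demand_cost(instances, runtime_hours, hourly_rate):
    """Cost when every started hour is billed as a full hour."""
    billed_hours = math.ceil(runtime_hours)
    return instances * billed_hours * hourly_rate

# hypothetical rate of $0.10 per instance-hour, not a quoted AWS price
cost = on_demand_cost(instances=10, runtime_hours=2.25, hourly_rate=0.10)
print(f"${cost:.2f}")  # 2.25 h is billed as 3 h on each of the 10 instances
```

Under this assumption a job that finishes a few minutes past the hour costs as much as one that uses the whole next hour.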
  7. 7. EMR Overview
     Amazon Elastic MapReduce (Amazon EMR) works in conjunction with
     ● Amazon EC2 to rent computing instances (with Hadoop installed)
     ● Amazon S3 to store input and output data, scripts/applications and logs
  8. 8. EMR Architectural Overview
     * image from the Internet
  9. 9. EC2 Instance Types
     * image from Big Data University, Course: "Hadoop and the Amazon Cloud"
  10. 10. EMR Pricing - "On-demand" instances
      Standard Family Instances (US East Region)
  11. 11. EC2 & S3 Pricing - Real-world example
      The New York Times wanted to host all public domain articles from 1851 to 1922.
      ● 11 million articles
      ● 4 TB of raw image TIFF input data converted to 1.5 TB of PDF documents
      ● 100 EC2 Instances rented
      ● < 24 hours of computation
      ● $240 paid (not including storage & bandwidth)
      ● 1 employee assigned to this task
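The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes all 100 instances were billed for the full 24 started hours, which the slide does not state; under that assumption the bill implies the hourly rate shown.

```python
instances = 100
max_billed_hours = 24   # "< 24 hours", so at most 24 started hours per instance
ec2_bill_usd = 240

# assumption: every instance was billed for all 24 hours
implied_rate = ec2_bill_usd / (instances * max_billed_hours)
articles_per_instance = 11_000_000 // instances

print(f"${implied_rate:.2f} per instance-hour")       # → $0.10 per instance-hour
print(f"{articles_per_instance} articles per instance")  # → 110000 articles per instance
```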
  12. 12. EC2 & S3 Pricing - Real-world example
      How much did they pay for storage and bandwidth?
  13. 13. S3 Pricing
  14. 14. EC2 & S3 Pricing Calculator
      Simple Monthly Calculator:
  15. 15. AWS Free Usage Tier (Per Month)
      Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
      ● 750 hours of Amazon EC2 Micro Instance usage
        ○ 613 MB of memory and 32-bit or 64-bit platform
      ● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
      ● 15 GB of bandwidth out aggregated across all AWS services
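As a quick check of the 750-hour allowance, the sketch below compares it with the number of hours in the longest calendar month. The variable names are illustrative.

```python
free_hours_per_month = 750
hours_in_longest_month = 31 * 24   # 744 hours

# one micro instance running non-stop stays within the free allowance
fits_in_free_tier = hours_in_longest_month <= free_hours_per_month
print(hours_in_longest_month, fits_in_free_tier)  # → 744 True
```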
  16. 16. EMR - Support for Hadoop Ecosystem
      Develop and run MapReduce applications using:
      ● Java
      ● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++)
      ● Pig
      ● Hive
      HBase can be easily installed using a set of EC2 scripts
  17. 17. EMR - Featured Users
      * logos from
  18. 18. EMR - Case Study - Yelp
      ● helps people connect with great local businesses
      ● lets people share reviews and insights
      ● as of November 2010:
        ○ 39 million monthly unique visitors
        ○ in total, 14 million reviews posted
  19. 19. EMR - Case Study - Yelp
  20. 20. EMR - Case Study - Yelp
      ● uses S3 to store daily logs (~100 GB/day) and photos
      ● uses EMR to power features like
        ○ People who viewed this also viewed
        ○ Review highlights
        ○ Autocomplete in search box
        ○ Top searches
      ● implements jobs in Python and uses their own open-source library, mrjob, to run them on EMR
  21. 21. mrjob - WordCount example
      from mrjob.job import MRJob

      class MRWordCounter(MRJob):
          # emit (word, 1) for every word in the input line
          def mapper(self, key, line):
              for word in line.split():
                  yield word, 1

          # sum all the 1s emitted for the same word
          def reducer(self, word, occurrences):
              yield word, sum(occurrences)

      if __name__ == '__main__':
          MRWordCounter.run()
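To see what happens between the mapper and the reducer of the word count job above, here is a standard-library-only sketch that imitates the map, shuffle and reduce phases in a single process. It does not use mrjob or Hadoop; `run_local` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    yield word, sum(occurrences)

def run_local(lines):
    # shuffle: group every value emitted by the mappers under its key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # reduce: call the reducer once per key, in sorted key order
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

counts = run_local(["to be or not", "to be"])
print(counts)  # → {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

On a real cluster the grouping step is done by the framework across many machines, which is why only the mapper and reducer have to be written by hand.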
  22. 22. mrjob - run on EMR
      $ python --ec2_instance_type c1.medium --num-ec2-instances 10 -r emr < s3://input-bucket/*.txt > output
  23. 23. Demo
  24. 24. Million Song Dataset
      ● Contains detailed acoustic and contextual data for one million popular songs
      ● ~300 GB of data
      ● Publicly available
        ○ for download: http://www.infochimps.com/collections/million-songs
        ○ for processing using EMR: http://tbmmsd.s3.
  25. 25. Million Song Dataset
      Contains data such as:
      ● Song's title, year and hotness
      ● Song's tempo, duration, danceability, energy, loudness, segments count, preview (URL to mp3 file) and so on
      ● Artist's name and hotness
  26. 26. Million Song Dataset - Song density
      Song density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song.
      density = segmentCnt / duration
      * based on Paul Lamere's blog -
  27. 27. Million Song Dataset - Task*
      Simple music recommendation system
      ● Calculate density for each song
      ● Find hot songs with similar density
      * based on Paul Lamere's blog -
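One way the second step of the task could look, once densities are known: keep the songs whose density is close to a target and return the hottest ones first. This is a sketch under assumptions. The tuple layout, the tolerance and the sample records are made up for illustration and are not taken from the talk or the dataset.

```python
def similar_hot_songs(songs, target_density, tolerance=0.5, limit=3):
    """songs: iterable of (density, hotness, title) tuples (hypothetical layout)."""
    close = [s for s in songs if abs(s[0] - target_density) <= tolerance]
    # hottest songs first
    close.sort(key=lambda s: s[1], reverse=True)
    return [title for _, _, title in close[:limit]]

# made-up sample records, not taken from the dataset
songs = [
    (3.1, 0.90, "Song A"),
    (3.4, 0.50, "Song B"),
    (7.8, 0.95, "Song C"),
    (2.9, 0.70, "Song D"),
]
picks = similar_hot_songs(songs, target_density=3.0)
print(picks)  # → ['Song A', 'Song D', 'Song B']
```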
  28. 28. Million Song Dataset - MapReduce
      Input data
      ● 339 files
      ● Each file contains ~3 000 songs
      ● Each song is represented by one line in the input file
      ● Fields are separated by a tab character
  29. 29. Million Song Dataset - MapReduce
      Mapper
      ● Reads song data from each line of input text
      ● Calculates the song's density
      ● Emits the song's density as key with some other details as value
      <line_offset, song_data> -> <density, (artist_name, song_title, song_url)>
  30. 30. public void map(LongWritable key, Text value,
                  OutputCollector<FloatWritable, TripleTextWritable> output,
                  Reporter reporter) throws IOException {
              // parse the tab-separated song record from the input line
              song.parseLine(value.toString());
              // skip records without a positive tempo and duration
              if (song.tempo > 0 && song.duration > 0) {
                  // calculate density
                  float density = ((float) song.segmentCnt) / song.duration;
                  denstyWritable.set(density);
                  songWritable.set(song.artistName, song.title, song.preview);
                  // emit <density, (artist_name, song_title, song_url)>
                  output.collect(denstyWritable, songWritable);
              }
          }
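The same mapper logic can be sketched in plain Python, which makes the parsing step visible. The column order below is an assumption made for illustration; the slides only say that fields are tab-separated, so the real file layout may differ.

```python
def map_song(line):
    # hypothetical column order; the real file layout is not shown in the slides
    artist, title, preview_url, tempo, duration, segment_count = (
        line.rstrip("\n").split("\t")
    )
    tempo, duration, segment_count = float(tempo), float(duration), int(segment_count)
    # same filter as the Java mapper: skip songs without tempo or duration
    if tempo > 0 and duration > 0:
        yield segment_count / duration, (artist, title, preview_url)

# made-up record used only to exercise the function
line = "Some Artist\tSome Title\thttp://example.com/a.mp3\t120.0\t200.0\t800\n"
pairs = list(map_song(line))
print(pairs)  # → [(4.0, ('Some Artist', 'Some Title', 'http://example.com/a.mp3'))]
```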
  31. 31. Million Song Dataset - MapReduce
      Reducer
      ● Identity Reducer
      ● Each Reducer gets density values from a different range: <i, i+1)*,**
      <density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>
      * thanks to a custom Partitioner
      ** not optimal partitioning (partitions are not balanced)
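The custom Partitioner itself is not shown in the slides. A minimal sketch of the idea, assuming reducer i receives densities in the range <i, i+1) and that out-of-range values are clamped to the first or last reducer (the clamping is an assumption added here), could look like this:

```python
import math
from collections import Counter

def partition(density, num_reducers):
    # reducer i receives densities in <i, i+1)
    index = math.floor(density)
    # assumption: out-of-range values are clamped to the first or last reducer
    return max(0, min(index, num_reducers - 1))

# made-up densities used only to show how uneven the partitions can get
densities = [0.4, 3.2, 3.7, 3.99, 4.0, 25.0]
load = Counter(partition(d, num_reducers=10) for d in densities)
print(sorted(load.items()))  # → [(0, 1), (3, 3), (4, 1), (9, 1)]
```

Because densities cluster around a few values, some reducers receive far more records than others, which is the imbalance the slide's second footnote points out.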
  32. 32. Demo - used software
      ● Karmasphere Studio for EMR (Eclipse plugin)
        ○ graphical environment that supports the complete lifecycle for developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop Jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
  33. 33. Demo - used software
      ● Karmasphere Studio for EMR (Eclipse plugin)
      images from:
  34. 34. Video
  35. 35. Please watch the video on the WHUG channel on YouTube
  36. 36. Thank you!
  37. 37. Join us!