Introduction To Elastic MapReduce at WHUG

Elastic MapReduce presentation given at the 2nd meeting of the Warsaw Hadoop User Group.

See also the demonstration at www.youtube.com/watch?v=Azwilbn8GCs. It shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR (a plugin for Eclipse) to launch big calculations quickly and easily.

Transcript

  • 1. Possible real-world situation
    ● We have big data and/or a very long, embarrassingly parallel computation
    ● Our data may grow fast
    ● We want to start and try Hadoop asap
    ● We do not have our own infrastructure
    ● We do not have Hadoop administrators
    ● We have limited funds
  • 2. Possible solution
    Amazon Elastic MapReduce (EMR)
    ● Hadoop framework running on the web-scale infrastructure of Amazon
  • 3. EMR Benefits
    Elastic (scalable)
    ● Use one, a hundred, or even thousands of instances to process even petabytes of data
    ● Modify the number of instances while the job flow is running
    ● Start computation within minutes
  • 4. EMR Benefits
    Easy to use
    ● No configuration necessary
      ○ Do not worry about setting up hardware and networking, or about running, managing and tuning the performance of a Hadoop cluster
    ● Easy-to-use tools and plugins available
      ○ AWS Web Management Console
      ○ Command Line Tools by Amazon
      ○ Amazon EMR API, SDK, Libraries
      ○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio for EMR)
  • 5. EMR Benefits
    Reliable
    ● Built on Amazon's highly available and battle-tested infrastructure
    ● Provisions new nodes to replace those that fail
    ● Used by e.g.: [company logos]
  • 6. EMR Benefits
    Cost effective
    ● Pay for what you use (for each started hour)
    ● Choose from various instance types to meet your requirements
    ● Possibility to reserve instances for 1 or 3 years to pay less per hour
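The "pay for each started hour" rule on the slide above can be sketched as a small calculation. This is only an illustration of the rounding: the function name and the $0.20 hourly rate are hypothetical and are not taken from the presentation or from an Amazon price list.

```python
import math


def billed_cost(instances, runtime_hours, hourly_rate):
    """Cost of a job flow when every started hour is billed in full.

    hourly_rate is a hypothetical per-instance price in USD.
    """
    billed_hours = math.ceil(runtime_hours)  # 2.5 h of runtime is billed as 3 h
    return instances * billed_hours * hourly_rate


# 10 instances running for 2.5 hours at an assumed $0.20 per instance-hour
cost = billed_cost(10, 2.5, 0.20)
```

Under this rule a job that finishes a few minutes into a new hour costs as much as one that uses the whole hour, which matters when choosing how many instances to start.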
  • 7. EMR Overview
    Amazon Elastic MapReduce (Amazon EMR) works in conjunction with
    ● Amazon EC2 to rent computing instances (with Hadoop installed)
    ● Amazon S3 to store input and output data, scripts/applications and logs
  • 8. EMR Architectural Overview
    * image from the Internet
  • 9. EC2 Instance Types
    * image from Big Data University, Course: "Hadoop and the Amazon Cloud"
  • 10. EMR Pricing - "On-demand" instances
    Standard Family Instances (US East Region)
    http://aws.amazon.com/elasticmapreduce/pricing/
  • 11. EC2 & S3 Pricing - Real-world example
    New York Times wanted to host all public domain articles from 1851 to 1922.
    ● 11 million articles
    ● 4 TB of raw image TIFF input data converted to 1.5 TB of PDF documents
    ● 100 EC2 Instances rented
    ● < 24 hours of computation
    ● $240 paid (not including storage & bandwidth)
    ● 1 employee assigned to this task
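The compute bill in the New York Times example can be checked with simple arithmetic. The sketch below assumes that all 100 instances were billed for a full 24 hours; the slide only says "< 24 hours", and the hourly rate is not stated there, so the rate computed here is merely what those numbers imply.

```python
instances = 100      # EC2 instances rented (from the slide)
hours = 24           # assumed billed hours per instance ("< 24 hours" on the slide)
total_paid = 240.0   # USD, compute only (from the slide)

instance_hours = instances * hours            # 2400 instance-hours
implied_rate = total_paid / instance_hours    # 0.10 USD per instance-hour
```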
  • 12. EC2 & S3 Pricing - Real-world example
    How much did they pay for storage and bandwidth?
  • 13. S3 Pricing
    http://aws.amazon.com/s3/pricing/
  • 14. EC2 & S3 Pricing Calculator
    Simple Monthly Calculator: http://calculator.s3.amazonaws.com/calc5.html
  • 15. AWS Free Usage Tier (Per Month)
    Available for free to new AWS customers for 12 months following the AWS sign-up date, e.g.:
    ● 750 hours of Amazon EC2 Micro Instance usage
      ○ 613 MB of memory and 32-bit or 64-bit platform
    ● 5 GB of Amazon S3 standard storage, 20,000 Get and 2,000 Put Requests
    ● 15 GB of bandwidth out aggregated across all AWS services
  • 16. EMR - Support for Hadoop Ecosystem
    Develop and run MapReduce applications using:
    ● Java
    ● Streaming (e.g. Ruby, Perl, Python, PHP, R, or C++)
    ● Pig
    ● Hive
    HBase can be easily installed using a set of EC2 scripts
  • 17. EMR - Featured Users
    * logos from http://aws.amazon.com/elasticmapreduce/
  • 18. EMR - Case Study - Yelp
    ● helps people connect with great local businesses
    ● share reviews and insights
    ● as of November 2010:
      ○ 39 million monthly unique visitors
      ○ in total, 14 million reviews posted
  • 19. EMR - Case Study - Yelp
  • 20. EMR - Case Study - Yelp
    ● uses S3 to store daily logs (~100 GB/day) and photos
    ● uses EMR to power features like
      ○ People who viewed this also viewed
      ○ Review highlights
      ○ Autocomplete in search box
      ○ Top searches
    ● implements jobs in Python and uses their own open-source library, mrjob, to run them on EMR
  • 21. mrjob - WordCount example
    from mrjob.job import MRJob

    class MRWordCounter(MRJob):
        def mapper(self, key, line):
            # emit (word, 1) for every word in the input line
            for word in line.split():
                yield word, 1

        def reducer(self, word, occurrences):
            # sum the counts emitted for the same word
            yield word, sum(occurrences)

    if __name__ == '__main__':
        MRWordCounter.run()
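To see what the WordCount mapper and reducer above compute without installing mrjob or starting a cluster, the map, shuffle and reduce phases can be simulated in plain Python. This is a local sketch only: `run_local` is a hypothetical helper, not part of mrjob, and the mapper and reducer are rewritten as free functions.

```python
from collections import defaultdict


def mapper(_, line):
    # Emit (word, 1) for every word in the line, as in the slide.
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # Sum all the 1s emitted for the same word.
    yield word, sum(occurrences)


def run_local(lines):
    """Hypothetical helper: map, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(None, line):
            grouped[word].append(count)
    result = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            result[key] = total
    return result


counts = run_local(["to be or not to be", "be quick"])
# counts == {"be": 3, "not": 1, "or": 1, "quick": 1, "to": 2}
```

On a real cluster the grouping step is done by Hadoop between the map and reduce phases, so the job itself only has to define the two functions.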
  • 22. mrjob - run on EMR
    $ python wordcount.py --ec2-instance-type c1.medium --num-ec2-instances 10 -r emr 's3://input-bucket/*.txt' > output
  • 23. Demo
  • 24. Million Song Dataset
    ● Contains detailed acoustic and contextual data for one million popular songs
    ● ~300 GB of data
    ● Publicly available
      ○ for download: http://www.infochimps.com/collections/million-songs
      ○ for processing using EMR: http://tbmmsd.s3.amazonaws.com/
  • 25. Million Song Dataset
    Contains data such as:
    ● Song's title, year and hotness
    ● Song's tempo, duration, danceability, energy, loudness, segments count, preview (URL to mp3 file) and so on
    ● Artist's name and hotness
  • 26. Million Song Dataset - Song density
    Song density* can be defined as the average number of notes or atomic sounds (called segments) per second in a song.
    density = segmentCnt / duration
    * based on Paul Lamere's blog - http://bit.ly/qUbLdQ
  • 27. Million Song Dataset - Task*
    Simple music recommendation system
    ● Calculate density for each song
    ● Find hot songs with similar density
    * based on Paul Lamere's blog - http://bit.ly/qUbLdQ
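The two steps of the task (calculate density, then find hot songs with similar density) can be sketched on a handful of in-memory records. Everything below other than the density formula is an assumption made for illustration: the dictionary keys, the `tolerance` and `min_hotness` thresholds and the sample songs are hypothetical and do not come from the dataset.

```python
def density(segment_cnt, duration):
    """Average number of segments per second (density = segmentCnt / duration)."""
    return segment_cnt / duration


def similar_hot_songs(songs, target_density, tolerance=0.5, min_hotness=0.5):
    """Return hot songs whose density is within `tolerance` of the target,
    hottest first. Both thresholds are hypothetical."""
    matches = [
        s for s in songs
        if s["hotness"] >= min_hotness
        and abs(density(s["segment_cnt"], s["duration"]) - target_density) <= tolerance
    ]
    return sorted(matches, key=lambda s: s["hotness"], reverse=True)


# Made-up sample records, not real dataset entries.
songs = [
    {"title": "A", "segment_cnt": 900, "duration": 300.0, "hotness": 0.9},    # density 3.0
    {"title": "B", "segment_cnt": 640, "duration": 200.0, "hotness": 0.7},    # density 3.2
    {"title": "C", "segment_cnt": 1500, "duration": 250.0, "hotness": 0.95},  # density 6.0
    {"title": "D", "segment_cnt": 600, "duration": 200.0, "hotness": 0.1},    # density 3.0, not hot
]

recommended = [s["title"] for s in similar_hot_songs(songs, target_density=3.0)]
# recommended == ["A", "B"]
```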
  • 28. Million Song Dataset - MapReduce
    Input data
    ● 339 files
    ● Each file contains ~3,000 songs
    ● Each song is represented by one line in the input file
    ● Fields are separated by a tab character
  • 29. Million Song Dataset - MapReduce
    Mapper
    ● Reads a song's data from each line of input text
    ● Calculates the song's density
    ● Emits the song's density as key with some other details as value
    <line_offset, song_data> -> <density, (artist_name, song_title, song_url)>
  • 30.
    public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {
      song.parseLine(value.toString());
      if (song.tempo > 0 && song.duration > 0) {
        // calculate density
        float density = ((float) song.segmentCnt) / song.duration;
        denstyWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);
        output.collect(denstyWritable, songWritable);
      }
    }
  • 31. Million Song Dataset - MapReduce
    Reducer
    ● Identity Reducer
    ● Each Reducer gets density values from a different range: [i, i+1) *, **
    <density, [(artist_name, song_title, song_url)]> -> <density, (artist_name, song_title, song_url)>
    * thanks to a custom Partitioner
    ** not optimal partitioning (partitions are not balanced)
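The custom Partitioner itself is not shown in the slides. A minimal sketch of the idea, assuming that densities in the range [i, i+1) go to partition i and that anything beyond the last partition is clamped into it, could look like this in Python. The function name, the clamping rule and the sample densities are assumptions, not the presenter's implementation.

```python
from collections import Counter


def density_partition(density, num_partitions):
    """Map a density in [i, i+1) to partition i.

    Values past the last partition are clamped into it (an assumption;
    the slides do not say how overflow is handled).
    """
    bucket = int(density)  # floor for non-negative densities
    return min(max(bucket, 0), num_partitions - 1)


# Made-up densities, used only to show how uneven the partitions can get.
sample = [0.8, 2.1, 2.4, 2.9, 3.3, 7.5, 12.0]
load = Counter(density_partition(d, 8) for d in sample)
# load[2] == 3 and load[7] == 2, while several partitions stay empty
```

Fixed-width ranges like these keep the output ordered by density across reducers, but they leave some reducers with far more records than others, which is the imbalance the slide's second footnote points out.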
  • 32. Demo - used software
    ● Karmasphere Studio for EMR (Eclipse plugin)
      ○ graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop Jobs (http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)
  • 33. Demo - used software
    ● Karmasphere Studio for EMR (Eclipse plugin)
    images from: http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html
  • 34. Video
  • 35. Please watch the video on the WHUG channel on YouTube
    http://www.youtube.com/watch?v=Azwilbn8GCs
  • 36. Thank you!
  • 37. Join us!
    whug.org