An Introduction to MapReduce

Slides for workshop presented at International PHP Conference 2009 in Karlsruhe, Germany

Transcript

  • 1. AN INTRODUCTION TO MAPREDUCE
  • 2. David Zülke
  • 3. David Zuelke
  • 4. http://en.wikipedia.org/wiki/File:München_Panorama.JPG
  • 5. Founder
  • 6. awesome
  • 7. Lead Developer
  • 8. @dzuelke
  • 9. BEFORE WE BEGIN... Installing Prerequisites
  • 10. I brought a pre-configured VM
  • 11. (I stole it from the nice folks over at Cloudera)
  • 12. to save some time
  • 13. PLEASE COPY FROM THE HD
        • /cloudera-training-0.3.2/
        • VMware for Windows, Linux (i386 or x86_64) or Mac OS from /vmware/ if you don't have it
        • For Fusion, go to vmware.com and get an evaluation key
        • /php/
  • 14. (but be so kind as to pretend to be still listening)
  • 15. FROM 30,000 FEET Distributed And Parallel Computing
  • 16. we want to process data
  • 17. how much data exactly?
  • 18. SOME NUMBERS
        • Google: data processed per month: 400 PB (in 2007!); average job size: 180 GB
        • Facebook: new data per day: 200 GB (March 2008), 2 TB (April 2009), 4 TB (October 2009)
  • 19. what if you have that much data?
  • 20. what if you have just 1% of that amount?
  • 21. “no problemo”, you say?
  • 22. reading 180 GB sequentially off a disk will take ~45 minutes (that's a sustained ~65-70 MB/s, typical for a single drive)
  • 23. but you only have 16 GB or so of RAM per computer
  • 24. data can be processed much faster than it can be read
  • 25. solution: parallelize your I/O
  • 26. but now you need to coordinate what you’re doing
  • 27. and that’s hard
  • 28. what if a node dies?
  • 29. is data lost? will other nodes in the grid have to re-start? how do you coordinate this?
  • 30. ENTER: OUR HERO Introducing MapReduce
  • 31. in the olden days, the workload was distributed across a grid
  • 32. but the data was shipped around between nodes
  • 33. or even stored centrally on something like a SAN
  • 34. I/O bottleneck
  • 35. Google published a paper in 2004
  • 36. MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html
  • 37. now the data is distributed
  • 38. computing happens on the nodes where the data already is
  • 39. processes are isolated and don’t communicate (share-nothing)
  • 40. BASIC PRINCIPLE: MAPPER
        • A Mapper reads records and emits <key, value> pairs
        • Example: Apache access.log
          • Each line is a record
          • Extract client IP address and number of bytes transferred
          • Emit IP address as key, number of bytes as value
        • For hourly rotating logs, the job can be split across 24 nodes*
        * In practice, it's a lot smarter than that
  • 41. BASIC PRINCIPLE: REDUCER
        • A Reducer is given a key and all values for that specific key
        • Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
        • Example: Apache access.log
          • The Reducer is called once for each client IP (that's our key), with a list of values (transferred bytes)
          • We simply sum up the bytes to get the total traffic per IP!
  • 42. EXAMPLE OF MAPPED INPUT
        IP              Bytes
        212.122.174.13  18271
        212.122.174.13  191726
        212.122.174.13  198
        74.119.8.111    91272
        74.119.8.111    8371
        212.122.174.13  43
  • 43. REDUCER WILL RECEIVE THIS
        IP              Bytes
        212.122.174.13  18271, 191726, 198, 43
        74.119.8.111    91272, 8371
  • 44. AFTER REDUCTION
        IP              Bytes
        212.122.174.13  210238
        74.119.8.111    99643
  • 45. PSEUDOCODE
        function map($line_number, $line_text) {
            $parts = parse_apache_log($line_text);
            emit($parts['ip'], $parts['bytes']);
        }

        function reduce($key, $values) {
            $bytes = array_sum($values);
            emit($key, $bytes);
        }

    Input:
        212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
        212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
        212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
        74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
        74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
        212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

    Output:
        212.122.174.13  210238
        74.119.8.111    99643
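A minimal runnable sketch of the same flow in plain PHP follows. The parse_apache_log() and emit() helpers from the slide are hypothetical, implemented here with a regular expression and an in-memory array that stands in for the framework's shuffle/sort phase; the 'access.log' filename is an assumption:

    <?php
    // Simulates the map -> shuffle/sort -> reduce flow from the slide.
    $grouped = array(); // stands in for Hadoop's shuffle/sort phase

    function parse_apache_log($line_text) {
        // Capture the client IP (first token) and byte count (last token)
        preg_match('/^(\S+) .* (\d+)$/', $line_text, $m);
        return array('ip' => $m[1], 'bytes' => (int) $m[2]);
    }

    function emit($key, $value) {
        global $grouped;
        $grouped[$key][] = $value; // group values under their key
    }

    function map($line_number, $line_text) {
        $parts = parse_apache_log($line_text);
        emit($parts['ip'], $parts['bytes']);
    }

    function reduce($key, $values) {
        echo $key, "\t", array_sum($values), "\n"; // the final emit, printed
    }

    foreach (file('access.log') as $number => $line) {
        map($number, trim($line));
    }
    foreach ($grouped as $ip => $bytes) {
        reduce($ip, $bytes);
    }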
  • 46. FINGER EXERCISE Let’s Try PHP First
  • 47. HANDS-ON Time To Write Some Code!
  • 48. ANOTHER ELEPHANT Introducing Apache Hadoop
  • 49. Hadoop is a MapReduce framework
  • 50. it allows us to focus on writing Mappers, Reducers etc.
  • 51. and it works extremely well
  • 52. how well exactly?
  • 53. HADOOP AT FACEBOOK
        • Predominantly used in combination with Hive (~95%)
        • 4800 cores with 12 TB of storage per node
        • Per day:
          • 4 TB of new data (compressed)
          • 135 TB of data scanned (compressed)
          • 7500+ Hive jobs, ~80k compute hours
        http://www.slideshare.net/cloudera/hw09-rethinking-the-data-warehouse-with-hadoop-and-hive
  • 54. HADOOP AT YAHOO!
        • Over 25,000 computers with over 100,000 CPUs
        • Biggest cluster:
          • 4000 nodes
          • 2x4 CPU cores each
          • 16 GB RAM each
        • Over 40% of jobs run using Pig
        http://wiki.apache.org/hadoop/PoweredBy
  • 55. there’s just one little problem
  • 56. it’s written in Java
  • 57. however, there is hope...
  • 58. STREAMING Hadoop Won’t Force Us To Use Java
  • 59. Hadoop Streaming can use any script as Mapper or Reducer
  • 60. many configuration options (parsers, formats, combining, …)
  • 61. it works using STDIN and STDOUT
  • 62. Mappers are streamed the records (usually one per line: <byteoffset>\t<line>\n) and emit key/value pairs: <key>\t<value>\n
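For the access.log example, a streaming mapper can be as small as this (a sketch, assuming each STDIN line is one raw log line):

    #!/usr/bin/env php
    <?php
    // Streaming mapper: read records from STDIN, emit <key>\t<value>\n on STDOUT
    while (($line = fgets(STDIN)) !== false) {
        // Extract client IP (first token) and byte count (last token)
        if (preg_match('/^(\S+) .* (\d+)$/', trim($line), $m)) {
            echo $m[1], "\t", $m[2], "\n";
        }
    }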
  • 63. Reducers are streamed key/value pairs, one per line:
        <keyA>\t<value1>\n
        <keyA>\t<value2>\n
        <keyA>\t<value3>\n
        <keyB>\t<value4>\n
  • 64. Caution: no separate Reducer processes per key (but keys are sorted)
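Because of that, a streaming reducer has to detect key boundaries itself: it reads the sorted pairs off STDIN and flushes a total whenever the key changes. A sketch, continuing the bytes-per-IP example:

    #!/usr/bin/env php
    <?php
    // Streaming reducer: STDIN delivers <key>\t<value>\n pairs, sorted by key
    $current = null;
    $sum = 0;
    while (($line = fgets(STDIN)) !== false) {
        list($key, $value) = explode("\t", trim($line), 2);
        if ($key !== $current) {
            if ($current !== null) {
                echo $current, "\t", $sum, "\n"; // key changed: emit the total
            }
            $current = $key;
            $sum = 0;
        }
        $sum += (int) $value;
    }
    if ($current !== null) {
        echo $current, "\t", $sum, "\n"; // flush the final key
    }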
  • 65. HANDS-ON Let’s Say Hello To Our Hadoop VM
  • 66. THE HADOOP ECOSYSTEM A Little Tour
  • 67. APACHE AVRO Efficient Data Serialization System With Schemas (compare: Facebook’s Thrift)
  • 68. APACHE CHUKWA Distributed Data Collection System (compare: Facebook’s Scribe)
  • 69. APACHE HBASE Like Google’s BigTable, Only That You Can Have It, Too!
  • 70. HDFS Your Friendly Distributed File System
  • 71. HIVE Data Warehousing Made Simple With An SQL Interface
  • 72. PIG A High-Level Language For Modelling Data Processing Tasks
  • 73. ZOOKEEPER Your Distributed Applications, Coordinated
  • 74. ISABEL DROST Welcome Our Special Guest, Presenting Apache Mahout
  • 75. The End
  • 76. Questions?
  • 77. THANK YOU!
        • http://hadoop.apache.org/ is the Hadoop project website
        • http://www.cloudera.com/hadoop-training has useful resources
        • Send me an e-mail: david.zuelke@bitextender.com
        • Follow @dzuelke on Twitter
        • Slides will be on SlideShare