Large-Scale Data Processing with Hadoop and PHP (IPC11 2011-10-11)



Presentation given at International PHP Conference 2011.

Presentation Transcript

  • David Zuelke
  • David Zülke
  • Founder
  • Lead Developer
  • @dzuelke
  • THE BIG DATA CHALLENGE Distributed And Parallel Computing
  • we want to process data
  • how much data exactly?
  • SOME NUMBERS
    • Facebook
      • New data per day:
        • 200 GB (March 2008)
        • 2 TB (April 2009)
        • 4 TB (October 2009)
        • 12 TB (March 2010)
    • Google
      • Data processed per month: 400 PB (in 2007!)
      • Average job size: 180 GB
  • what if you have that much data?
  • what if you have just 1% of that amount?
  • “No Problemo”, you say?
  • reading 180 GB sequentially off a disk will take ~45 minutes
  • and you only have 16 to 64 GB of RAM per computer
  • so you can’t process everything at once
  • general rule of modern computers:
  • data can be processed much faster than it can be read
  • solution: parallelize your I/O
  • but now you need to coordinate what you’re doing
  • and that’s hard
  • what if a node dies?
  • is data lost? will other nodes in the grid have to re-start? how do you coordinate this?
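The read-time argument above can be put into numbers. This is a rough sketch, not a benchmark: the per-disk throughput is simply derived from the "180 GB in ~45 minutes" figure on the slide, decimal units (1 GB = 1000 MB) are an assumption, and readMinutes() is a hypothetical helper that ignores coordination overhead.

```php
<?php
// Time to read a data set sequentially when it is spread evenly over $disks disks.
// Assumes perfect parallelism and no network or coordination cost.
function readMinutes(float $gigabytes, float $mbPerSecond, int $disks = 1): float
{
    $megabytes = $gigabytes * 1000;
    $seconds = $megabytes / ($mbPerSecond * $disks);
    return $seconds / 60;
}

// Throughput implied by the slide: 180 GB in 45 minutes, roughly 67 MB/s.
$throughput = 180 * 1000 / (45 * 60);

printf("1 disk:   %.1f min\n", readMinutes(180, $throughput));
printf("24 disks: %.1f min\n", readMinutes(180, $throughput, 24));
```

Under these assumptions, spreading the same 180 GB over 24 disks brings the read time down to about two minutes, which is why parallel I/O is worth the coordination effort.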
  • ENTER: OUR HERO Introducing MapReduce
  • in the olden days, the workload was distributed across a grid
  • and the data was shipped around between nodes
  • or even stored centrally on something like an SAN
  • which was fine for small amounts of information
  • but today, on the web, we have big data
  • I/O bottleneck
  • along came a Google publication in 2004
  • MapReduce: Simplified Data Processing on Large Clusters
  • now the data is distributed
  • computing happens on the nodes where the data already is
  • processes are isolated and don’t communicate (share-nothing)
  • BASIC PRINCIPLE: MAPPER
    • A Mapper reads records and emits <key, value> pairs
      • Example: Apache access.log
        • Each line is a record
        • Extract client IP address and number of bytes transferred
        • Emit IP address as key, number of bytes as value
    • For hourly rotating logs, the job can be split across 24 nodes*
    * In practice, it’s a lot smarter than that
  • BASIC PRINCIPLE: REDUCER
    • A Reducer is given a key and all values for this specific key
      • Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
    • Example: Apache access.log
      • The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)
      • We simply sum up the bytes to get the total traffic per IP!
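  • EXAMPLE OF MAPPED INPUT
    IP              Bytes
    212.122.174.13  18271
    212.122.174.13  191726
    212.122.174.13  198
    74.119.8.111    91272
    74.119.8.111    8371
    212.122.174.13  43
  • REDUCER WILL RECEIVE THIS
    IP              Bytes
    212.122.174.13  18271
                    191726
                    198
                    43
    74.119.8.111    91272
                    8371
  • AFTER REDUCTION
    IP              Bytes
    212.122.174.13  210238
    74.119.8.111    99643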
  • PSEUDOCODE
    // called once per record: extract the client IP and the bytes sent
    function map($line_number, $line_text) {
        $parts = parse_apache_log($line_text);
        emit($parts['ip'], $parts['bytes']);
    }

    // called once per key: sum up all bytes for this IP
    function reduce($key, $values) {
        $bytes = array_sum($values);
        emit($key, $bytes);
    }

    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

    212.122.174.13  210238
    74.119.8.111    99643
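The pseudocode can be turned into a small script that runs without Hadoop. This is only a sketch of the programming model: parseApacheLog(), mapRecord(), reduceKey() and runJob() are hypothetical names, the regular expression assumes plain Common Log Format, and runJob() stands in for the grouping by key that the framework would normally do between the map and reduce phases. The sample lines are chosen so that they add up to the per-IP totals shown on the slides.

```php
<?php
// Parse one access.log line (Common Log Format); returns null if it does not match.
function parseApacheLog(string $line): ?array
{
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return ['ip' => $m[1], 'bytes' => $m[2] === '-' ? 0 : (int) $m[2]];
}

// Mapper: one record in, zero or more [key, value] pairs out.
function mapRecord(string $line): array
{
    $parts = parseApacheLog($line);
    return $parts === null ? [] : [[$parts['ip'], $parts['bytes']]];
}

// Reducer: one key plus all of its values in, one [key, total] pair out.
function reduceKey(string $key, array $values): array
{
    return [$key, array_sum($values)];
}

// Stand-in for the framework: run the mapper, group values by key, run the reducer.
function runJob(array $lines): array
{
    $grouped = [];
    foreach ($lines as $line) {
        foreach (mapRecord($line) as [$key, $value]) {
            $grouped[$key][] = $value;
        }
    }
    $result = [];
    foreach ($grouped as $key => $values) {
        [$k, $total] = reduceKey((string) $key, $values);
        $result[$k] = $total;
    }
    return $result;
}

$log = [
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43',
];

$totals = runJob($log);
foreach ($totals as $ip => $bytes) {
    echo $ip, "\t", $bytes, "\n";
}
```

On a real cluster the two loops in runJob() would be spread over many machines; the mapper and reducer themselves would not need to change.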
  • A YELLOW ELEPHANT Introducing Apache Hadoop
  • “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
    Doug Cutting
  • Hadoop is a MapReduce framework
  • it allows us to focus on writing Mappers, Reducers etc.
  • and it works extremely well
  • how well exactly?
  • HADOOP AT FACEBOOK (I)
    • Predominantly used in combination with Hive (~95%)
    • 8400 cores with ~12.5 PB of total storage
    • 8 cores, 12 TB storage and 32 GB RAM per node
    • 1x Gigabit Ethernet for each server in a rack
    • 4x Gigabit Ethernet from rack switch to core
    Hadoop is aware of racks and locality of nodes
  • HADOOP AT FACEBOOK (II)
    • Daily stats:
      • 25 TB logged by Scribe
      • 135 TB of compressed data scanned
      • 7500+ Hive jobs
      • ~80k compute hours
    • New data per day:
      • I/08: 200 GB
      • II/09: 2 TB (compressed)
      • III/09: 4 TB (compressed)
      • I/10: 12 TB (compressed)
  • HADOOP AT YAHOO!
    • Over 25,000 computers with over 100,000 CPUs
    • Biggest cluster:
      • 4000 nodes
      • 2x4 CPU cores each
      • 16 GB RAM each
    • Over 40% of jobs run using Pig
  • OTHER NOTABLE USERS
    • Twitter (storage, logging, analysis; heavy users of Pig)
    • Rackspace (log analysis; data pumped into Lucene/Solr)
    • LinkedIn (friend suggestions)
    • (charts, log analysis, A/B testing)
    • The New York Times (converted 4 TB of scans using EC2)
  • JOB PROCESSING How Hadoop Works
  • Just like I already described! It’s MapReduce! \o/
  • BASIC RULES
    • Uses Input Formats to split up your data into single records
    • You can optimize using combiners to reduce locally on a node
      • Only possible in some cases, e.g. for max(), but not avg()
    • You can control partitioning of map output yourself
      • Rarely useful, the default partitioner (key hash) is enough
    • And a million other things that really don’t matter right now ;)
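Why a combiner works for max() but not for avg() is easiest to see with numbers. The sketch below uses made-up values for a single key held on two nodes; the variable names are illustrative only. The last lines show a common workaround that is not part of the talk: let the combiner emit partial sums and counts, and divide only in the reducer.

```php
<?php
// Two nodes hold different shares of the values for one key (made-up data).
$nodeA = [10, 20, 30, 40];
$nodeB = [100];

// max(): reducing locally first gives the same answer as reducing everything at once.
$maxDirect   = max(array_merge($nodeA, $nodeB));
$maxCombined = max([max($nodeA), max($nodeB)]);

// avg(): averaging the local averages is wrong when nodes hold different numbers of values.
$avg = fn(array $v): float => array_sum($v) / count($v);
$avgDirect   = $avg(array_merge($nodeA, $nodeB));  // 200 / 5
$avgCombined = $avg([$avg($nodeA), $avg($nodeB)]); // (25 + 100) / 2

// Workaround (assumption, not from the talk): combine [sum, count] pairs instead.
$partials = [
    [array_sum($nodeA), count($nodeA)],
    [array_sum($nodeB), count($nodeB)],
];
$avgFromPartials = array_sum(array_column($partials, 0)) / array_sum(array_column($partials, 1));

printf("max: %d vs %d\n", $maxDirect, $maxCombined);
printf("avg: %.1f vs %.1f (from partial sums: %.1f)\n", $avgDirect, $avgCombined, $avgFromPartials);
```

The true average here is 40, while the average of averages comes out at 62.5, so a naive avg() combiner would silently change the result.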
  • HDFS Hadoop Distributed File System
  • HDFS
    • Stores data in blocks (default block size: 64 MB)
    • Designed for very large data sets
    • Designed for streaming rather than random reads
    • Write-once, read-many (although appending is possible)
    • Capable of compression and other cool things
  • HDFS CONCEPTS
    • Large blocks minimize amount of seeks, maximize throughput
    • Blocks are stored redundantly (3 replicas as default)
    • Aware of infrastructure characteristics (nodes, racks, ...)
    • Datanodes hold blocks
    • Namenode holds the metadata
    Critical component for an HDFS cluster (HA, SPOF)
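To get a feeling for what the defaults mean in practice, here is a back-of-the-envelope calculation for a file the size of the 180 GB job mentioned earlier. hdfsFootprint() is a hypothetical helper, and binary units (1 GB = 1024 MB) are an assumption; only the 64 MB block size and the 3 replicas come from the slides.

```php
<?php
// Rough HDFS footprint of one file: number of blocks, block replicas, and raw storage.
function hdfsFootprint(int $fileMb, int $blockMb = 64, int $replicas = 3): array
{
    // Ceiling division: a partially filled last block still counts as a block.
    $blocks = intdiv($fileMb + $blockMb - 1, $blockMb);
    return [
        'blocks'         => $blocks,
        'block_replicas' => $blocks * $replicas,
        // The last block only occupies as much disk space as it actually holds.
        'raw_storage_mb' => $fileMb * $replicas,
    ];
}

$f = hdfsFootprint(180 * 1024);
printf(
    "%d blocks, %d block replicas, %d GB raw storage\n",
    $f['blocks'],
    $f['block_replicas'],
    intdiv($f['raw_storage_mb'], 1024)
);
```

Each of those blocks is a natural unit of work for one Mapper, which is how the block layout and the job's parallelism are connected.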
  • there’s just one little problem
  • you need to write Java code
  • however, there is hope...
  • STREAMING Hadoop Won’t Force Us To Use Java
  • Hadoop Streaming can use any script as Mapper or Reducer
  • many configuration options (parsers, formats, combining, …)
  • it works using STDIN and STDOUT
  • Mappers are streamed the records (usually by line: <line>\n) and emit key/value pairs: <key>\t<value>\n
  • Reducers are streamed key/value pairs:
    <keyA>\t<value1>\n
    <keyA>\t<value2>\n
    <keyA>\t<value3>\n
    <keyB>\t<value4>\n
  • Caution: no separate Reducer processes per key (but keys are sorted)
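The caution above has a practical consequence: a streaming Reducer script sees all of its keys on one sorted stream and has to detect key changes itself. A minimal sketch of that pattern for the traffic-per-IP example follows. The function names are hypothetical, the mapper's regular expression is a deliberately simplified access.log parser, and the default tab separator is assumed; this is plain PHP and does not use HadooPHP.

```php
<?php
// Mapper side: log lines in, "<ip>\t<bytes>\n" out.
function streamMap(iterable $lines): Generator
{
    foreach ($lines as $line) {
        // Simplified parser: first field is the IP, last field is the byte count.
        if (preg_match('/^(\S+) .* (\d+)\s*$/', $line, $m)) {
            yield $m[1] . "\t" . $m[2] . "\n";
        }
    }
}

// Reducer side: input is sorted by key, but every key arrives on the same stream,
// so a total is emitted whenever the key changes, and once more at the end.
function streamReduce(iterable $lines): Generator
{
    $currentKey = null;
    $total = 0;
    foreach ($lines as $line) {
        if (strpos($line, "\t") === false) {
            continue; // skip malformed input
        }
        [$key, $value] = explode("\t", rtrim($line, "\n"), 2);
        if ($currentKey !== null && $key !== $currentKey) {
            yield $currentKey . "\t" . $total . "\n";
            $total = 0;
        }
        $currentKey = $key;
        $total += (int) $value;
    }
    if ($currentKey !== null) {
        yield $currentKey . "\t" . $total . "\n";
    }
}

// Lazily read lines from a stream such as STDIN.
function readLines($stream): Generator
{
    while (($line = fgets($stream)) !== false) {
        yield $line;
    }
}

// A mapper script would then end with something like:
//   foreach (streamMap(readLines(STDIN)) as $out) { echo $out; }
// and a reducer script with:
//   foreach (streamReduce(readLines(STDIN)) as $out) { echo $out; }
```

Keeping the logic in generator functions rather than directly in the STDIN loop makes it possible to try the mapper and reducer on a plain array of lines before submitting a job.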
  • STREAMING WITH PHP Introducing HadooPHP
  • HADOOPHP
    • A little framework to help with writing mapred jobs in PHP
    • Takes care of input splitting, can do basic decoding et cetera
      • Automatically detects and handles Hadoop settings such as key length or field separators
    • Packages jobs as one .phar archive to ease deployment
      • Also creates a ready-to-rock shell script to invoke the job
  • written by
  • DEMO Hadoop Streaming & PHP in Action
  • The End
  • RESOURCES
    • Tom White: Hadoop. The Definitive Guide. O’Reilly, 2009
    • Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …
  • Questions?
  • THANK YOU! This was by @dzuelke. Contact me or hire