Intro to Hadoop

Slide notes


  • Yahoo: 38K nodes, 4K node cluster
  • Facebook: 2K node cluster, 21 PB
  • >60% of jobs at Yahoo are Pig



Transcript

  • 1. Intro to Hadoop. TriHUG, July 2010. Jeff Turner, Bronto Software.
  • 2. Who am I? Director of Platform Engineering at Bronto. Former Googler/FeedBurner(er). Web analytics background. Still working this out in therapy.
  • 3. What is a Hadoop? An open source distributed computing framework built on Java. Named by Doug Cutting (Apache Lucene) after his son’s toy elephant. Main components: HDFS and MapReduce. Heavily used and sponsored by Yahoo; also used by Facebook, Twitter, Rackspace, LinkedIn, and countless others. Tremendous community and growing popularity.
  • 4. What does Hadoop do? Networks nodes together to combine storage and computing power. Scales to petabytes of storage. Manages fault tolerance and data replication automagically. Excels at processing semi-structured and unstructured data. Provides a framework for analyzing data in parallel (MapReduce).
  • 5. What does Hadoop not do? No random access (it’s not a database). Not real-time (it’s batch oriented). Doesn’t make things obvious (there’s a learning curve).
  • 6. Where do we start? 1. HDFS & MapReduce 2. ??? 3. Profit
  • 7. Hadoop’s Filesystem (HDFS). The Hadoop Distributed File System, based on Google’s GFS whitepaper. Data is stored in blocks across the cluster; Hadoop manages replication, node failure, and rebalancing. The Namenode is the master; Datanodes are slaves. Data is stored on disk but is not accessible via the local file system; use the Hadoop API/tools.
  • 8-10. How HDFS stores data. The Hadoop client/API talks to the Namenode, which looks up block locations and returns which Datanodes have the data; the client then talks directly to those Datanodes to read the file data. This is the only way to access HDFS data: the blocks sit on the Datanodes’ local file systems, scattered all over the cluster, and are only meaningful through HDFS. [Diagram: Namenode block map (file001 → Datanodes 1, 2, 3; file002 → 2; file003 → 1, 3; file004 → 3; file005 → 2; file006 → 4) and the four Datanodes holding those blocks.]
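
    Since all access goes through the Hadoop client API, a read looks like the sketch below. This is a minimal example using the Java FileSystem API; the path /logs/access.log is an invented placeholder.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsRead {
            public static void main(String[] args) throws Exception {
                // Picks up fs.default.name (the Namenode address) from core-site.xml
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // open() asks the Namenode for block locations; the stream then
                // pulls the blocks directly from the Datanodes that hold them
                FSDataInputStream in = fs.open(new Path("/logs/access.log"));
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
                reader.close();
            }
        }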
  • 11-13. About that Namenode ... The Namenode manages the filesystem and file metadata; Datanodes store the actual blocks of data. The Namenode keeps track of available Datanodes and file locations across the cluster. The Namenode is a SPOF: if you lose the Namenode metadata, Hadoop has no idea which files are in which blocks. [Diagram: a single Namenode coordinating four Datanodes.]
  • 14. HDFS Tips & Tricks. Write Namenode data to multiple local devices and a remote device (NFS mount). No RAID; use JBOD: more disks == more disk I/O. Mount disks with noatime (skip writing the last-accessed time on file reads). Use LZO compression; it saves space and speeds network transfer. Tweak and test settings with the included JARs: TestDFSIO and the sort example.
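
    A minimal sketch of that first tip, assuming the 0.20-era hdfs-site.xml layout and made-up paths: dfs.name.dir takes a comma-separated list, and the Namenode mirrors its metadata to every directory in it.

        <property>
          <name>dfs.name.dir</name>
          <!-- Namenode metadata is written to each listed directory;
               /mnt/nn-backup here stands in for a remote NFS mount -->
          <value>/hadoop/dfs/name,/mnt/nn-backup</value>
        </property>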
  • 15. Quick break before we move on to MapReduce
  • 16. Hadoop’s MapReduce. A framework for running tasks in parallel, based on Google’s whitepaper. The JobTracker is the master; it schedules tasks on nodes, monitors them, and retries failures. TaskTrackers are the slaves; each runs a specified task against specified bits of data on HDFS. Map/reduce functions operate on smaller parts of the problem, distributed across multiple nodes.
  • 17-20. Oversimplified MapReduce Example. The input is a web server log:

    18.106.61.94 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 354 company.com "-" "User agent"
    77.220.219.58 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 238 company.com "-" "User agent"
    121.41.7.104 - [18/Jul/2010:07:02:42 -0400] "GET /index2 HTTP/1.1" 200 2079 company.com "-" "User agent"
    42.7.64.102 - - [20/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 173 company.com "-" "User agent"

    1. Each line of the log file is input into the map function. The mapper parses the line and emits a key/value pair representing the page, and that it was viewed once:

        map (filename, file-contents):
            for each line in file-contents:
                page = parsePage(line)
                emit(page, 1)

    2. The reducer is given a key and all occurrences of values for that key. The reducer sums the values and outputs a key/value pair that represents the page and a total number of views:

        reduce (key, values):
            int views = 0
            for each value in values:
                views++
            emit(key, views)

    3. The result is a count of how many times each webpage appears in this log file: (index1, 3) (index2, 1)
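
    For reference, a rough translation of that pseudocode into Hadoop’s Java MapReduce API is sketched below. The parsePage helper is an invented, bare-bones stand-in that pulls the request path out of lines shaped like the sample log above; a real job would want a proper log parser.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class PageViews {

            public static class PageViewMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text page = new Text();

                @Override
                protected void map(LongWritable offset, Text line, Context context)
                        throws IOException, InterruptedException {
                    // Emit (page, 1) for every request line
                    page.set(parsePage(line.toString()));
                    context.write(page, ONE);
                }

                // Crude parser: grabs the path out of ... "GET /index1 HTTP/1.1" ...
                private static String parsePage(String line) {
                    int start = line.indexOf("GET ") + 4;
                    int end = line.indexOf(' ', start);
                    return line.substring(start, end);
                }
            }

            public static class PageViewReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text page, Iterable<IntWritable> counts,
                        Context context) throws IOException, InterruptedException {
                    // Sum all the 1s emitted for this page
                    int views = 0;
                    for (IntWritable count : counts) {
                        views += count.get();
                    }
                    context.write(page, new IntWritable(views));
                }
            }
        }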
  • 21. Hadoop MapReduce data flow. The InputFormat controls where data comes from and breaks it into InputSplits. A RecordReader knows how to read an InputSplit and passes data to the map function. Mappers do their thing and write intermediate data to local disk. Hadoop shuffles and sorts the keys in the map output so that all occurrences of the same key are passed to a reducer together. Reducers do their thing and send output to the OutputFormat, which controls where data goes. (Chart from the Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/index.html)
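
    Wiring those pieces together happens in the job driver. Here is a sketch using TextInputFormat and TextOutputFormat with the mapper and reducer from the previous example; the input and output paths are placeholders.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class PageViewsDriver {
            public static void main(String[] args) throws Exception {
                Job job = new Job(new Configuration(), "page views");
                job.setJarByClass(PageViewsDriver.class);

                // Where the data comes from and where it goes
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                FileInputFormat.addInputPath(job, new Path("/logs/access"));
                FileOutputFormat.setOutputPath(job, new Path("/logs/page-views"));

                // Map and reduce phases; the shuffle/sort in between is Hadoop's job
                job.setMapperClass(PageViews.PageViewMapper.class);
                job.setReducerClass(PageViews.PageViewReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }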
  • 22. Input/Output Formats. TextInputFormat reads text files; each line is an input. TextOutputFormat writes output from Hadoop to plain text. DBInputFormat reads JDBC sources; rows map to a custom DBWritable. DBOutputFormat writes to JDBC sources, again using DBWritable. ColumnFamilyInputFormat reads rows from a Cassandra ColumnFamily.
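
    The DBWritable mentioned above is just a record class that knows how to move its fields in and out of JDBC statements. A sketch for the page-view records from the earlier example; the column order is assumed to match whatever table the DB format is configured for, and the package location of DBWritable varies by Hadoop version (this uses the new-API location).

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;

        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.lib.db.DBWritable;

        public class PageViewRecord implements Writable, DBWritable {
            private String page;
            private int views;

            // DBWritable: DBOutputFormat uses this to fill the INSERT statement
            public void write(PreparedStatement stmt) throws SQLException {
                stmt.setString(1, page);
                stmt.setInt(2, views);
            }

            // DBWritable: DBInputFormat uses this to read a row back out
            public void readFields(ResultSet rs) throws SQLException {
                page = rs.getString(1);
                views = rs.getInt(2);
            }

            // Writable: how Hadoop serializes the record between nodes
            public void write(DataOutput out) throws IOException {
                out.writeUTF(page);
                out.writeInt(views);
            }

            public void readFields(DataInput in) throws IOException {
                page = in.readUTF();
                views = in.readInt();
            }
        }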
  • 23. MapReduce Tips & Tricks. You don’t have to do it in Java; current MapReduce abstractions are awesome. Pig and Hive performance is close enough to native MR, with a big productivity boost. Hadoop Streaming passes data through stdin/stdout so you can use any language; Ruby and Python are popular choices. Amazon’s Elastic MapReduce runs on-demand MR jobs on EC2 instances.
  • 24. Hadoop at Bronto. 5 node cluster, adding 8 more; each node has 4x 1TB drives, 16GB memory, 8 cores. Mostly Pig scripts, some Java utility MR jobs. Jobs process raw data/mail logs and store aggregate stats in Cassandra. Ad-hoc scripts analyze internal logs for app monitoring/debugging. Using Cassandra with Hadoop (we’re rolling our own InputFormat).
  • 25. Summary. Hadoop excels at big data, analytics, and batch processing. Not real-time, no random access; not a database. HDFS makes it all possible: a massively scalable, fault-tolerant file system. MapReduce provides the framework for processing data on HDFS. Pig and Hive are easy to use, give a big productivity gain, and are close enough in performance in most cases.
  • 26. Questions? Email: jeff.turner@bronto.com. Twitter: twitter.com/jefft. We’re hiring: http://bronto.com/company/careers