Intro to Hadoop
TriHUG, July 2010


Jeff Turner
Bronto Software
Who am I?

Director of Platform Engineering at Bronto

Former Googler/FeedBurner(er)

Web Analytics background

Still working this out in therapy

What is a Hadoop?
Open source distributed computing framework built on Java

Named by Doug Cutting (Apache Lucene) after his son's toy elephant

Main components: HDFS and MapReduce

Heavily used and sponsored by Yahoo (38K nodes, with a 4K-node cluster)

Also used by Facebook (2K-node cluster, 21 PB), Twitter, Rackspace, LinkedIn, countless others

Tremendous community and growing popularity

What does Hadoop do?
Networks nodes together to combine storage and computing power

Scales to petabytes of storage

Manages fault tolerance and data replication automagically

Excels at processing semi-structured and unstructured data

Provides a framework for analyzing data in parallel (MapReduce)

What does Hadoop not do?
No random access (it’s not a database)

Not real-time (it’s batch oriented)

Make things obvious (there's a learning curve)

Where do we start?
1. HDFS & MapReduce

2. ???

3. Profit
Hadoop’s Filesystem (HDFS)
Hadoop Distributed File System, based on Google’s GFS whitepaper

Data stored in blocks across the cluster

Hadoop manages replication, node failure, rebalancing

Namenode is the master; Datanodes are slaves

Data stored on disk, but not accessible via the local file system; use the Hadoop API/tools

How HDFS stores data

The Hadoop client/API talks to the Namenode; the Namenode looks up block locations and returns which Datanodes have the data

The client then talks to those Datanodes directly to read the file data; this is the only way to access HDFS data

On each node's local filesystem, HDFS data is stored as blocks scattered all over the cluster

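To make that concrete, here is a minimal sketch of a read through Hadoop's Java FileSystem API (the path is hypothetical); fs.open() triggers exactly the Namenode-then-Datanodes dance described above:

    // Minimal sketch: reading a file out of HDFS via the Hadoop API.
    // The path below is hypothetical.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CatHdfsFile {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name from the cluster config on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The client asks the Namenode for block locations, then streams
            // the blocks directly from the Datanodes that hold them
            FSDataInputStream in = fs.open(new Path("/logs/access/part-00000"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }
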
About that Namenode ...

Namenode manages the filesystem and file metadata; Datanodes store the actual blocks of data

Namenode keeps track of available Datanodes and file locations across the cluster

Namenode is a SPOF (single point of failure)

If you lose the Namenode metadata, Hadoop has no idea which files are in which blocks

HDFS Tips & Tricks
Write Namenode metadata to multiple local devices & a remote device (NFS mount)

No RAID, use JBOD. More disks == more disk I/O

Mount disks with noatime (skip writing last-accessed time on file reads)

LZO compression; saves space, speeds network transfer

Tweak and test settings with the included JARs: TestDFSIO, the sort example

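A sketch of the first tip in hdfs-site.xml terms, assuming the 0.20-era property name dfs.name.dir and made-up paths; the Namenode mirrors its image and edit log to every directory listed:

    <!-- hdfs-site.xml: two local disks plus an NFS mount (example paths) -->
    <property>
        <name>dfs.name.dir</name>
        <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
    </property>
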
Quick break before we move on to MapReduce
Hadoop’s MapReduce
Framework for running tasks in parallel, based on Google’s whitepaper

JobTracker is the master; it schedules tasks on nodes, monitors them, and retries failures

TaskTrackers are the slaves; they run a specified task against specified bits of data on HDFS

Map/Reduce functions operate on smaller parts of the problem, distributed across multiple nodes

Oversimplified MapReduce Example

18.106.61.94 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 354 company.com "-" "User agent"
77.220.219.58 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 238 company.com "-" "User agent"
121.41.7.104 - [18/Jul/2010:07:02:42 -0400] "GET /index2 HTTP/1.1" 200 2079 company.com "-" "User agent"
42.7.64.102 - - [20/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 173 company.com "-" "User agent"

1. Each line of the log file is input into the map function. The map parses the line and emits a key/value pair representing the page, and that it was viewed once.

    map(filename, file-contents):
        for each line in file-contents:
            page = parsePage(line)
            emit(page, 1)

2. The reducer is given a key and all occurrences of values for that key. The reducer sums the values and outputs a key/value pair that represents the page and a total # of views.

    reduce(key, values):
        int views = 0
        for each value in values:
            views++
        emit(key, views)

3. The result is a count of how many times each webpage appeared in this log file:

    (index1, 3)
    (index2, 1)

Hadoop MapReduce data flow

InputFormat controls where data comes from, breaks it into InputSplits

RecordReader knows how to read an InputSplit, passes data to the map function

Mappers do their thing, output intermediate data to local disk

Hadoop shuffles and sorts keys in the map output so all occurrences of the same key are passed to the reducer together

Reducers do their thing, send output to the OutputFormat

OutputFormat controls where data goes

(chart from the Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/index.html)

Input/Output Formats

TextInputFormat - Reads text files, each line is an input

TextOutputFormat - Writes output from Hadoop to plain text

DBInputFormat - Reads JDBC sources; rows map to a custom DBWritable

DBOutputFormat - Writes to JDBC sources, again using DBWritable

ColumnFamilyInputFormat - Reads rows from a Cassandra ColumnFamily

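A sketch of a driver wiring these formats into the page-view example above; the paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PageViewsDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "page views");
            job.setJarByClass(PageViewsDriver.class);

            job.setMapperClass(PageViews.PageMapper.class);
            job.setReducerClass(PageViews.PageReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // TextInputFormat hands the mapper one log line at a time;
            // TextOutputFormat writes "page <tab> views" back to HDFS
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/logs/access"));
            FileOutputFormat.setOutputPath(job, new Path("/reports/page-views"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
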
MapReduce Tips & Tricks
You don’t have to do it in Java; current MapReduce abstractions are awesome

Pig, Hive - performance is close enough to native MR, with a big productivity boost (>60% of jobs at Yahoo are Pig)

Hadoop Streaming - passes data through stdin/stdout so you can use any language; Ruby, Python popular choices

Amazon’s Elastic MapReduce - on-demand MR jobs on EC2 instances

Hadoop at Bronto
5 node cluster, adding 8 more; each node 4x 1TB drives, 16GB memory, 8 cores

Mostly Pig scripts, some Java utility MR jobs

Jobs process raw data/mail logs; store aggregate stats in Cassandra

Ad-hoc scripts analyze internal logs for app monitoring/debugging

Using Cassandra with Hadoop (we’re rolling our own InputFormat)

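The InputFormat contract you implement when rolling your own is small. A skeleton of the two required methods (hypothetical class name; the actual Cassandra plumbing is omitted):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class ColumnFamilyInputFormatSketch<K, V> extends InputFormat<K, V> {

        @Override
        public List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException {
            // Decide how to carve the ColumnFamily into chunks, ideally one
            // split per node that owns the data, so tasks run where the data lives
            throw new UnsupportedOperationException("sketch only");
        }

        @Override
        public RecordReader<K, V> createRecordReader(InputSplit split,
                TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Return a reader that iterates the rows in one split and hands
            // each row to the map function as a key/value pair
            throw new UnsupportedOperationException("sketch only");
        }
    }
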
Summary
Hadoop excels at big data, analytics, batch processing

Not real-time, no random access; not a database

HDFS makes it all possible: a massively scalable, fault-tolerant file system

MapReduce provides the framework for processing data on HDFS

Pig, Hive easy to use, big productivity gain, close-enough performance in most cases

Questions?
      email: jeff.turner@bronto.com
    twitter: twitter.com/jefft

We’re hiring: http://bronto.com/company/careers