Intro to Hadoop
TriHUG, July 2010


Jeff Turner
Bronto Software
Who am I?

Director of Platform Engineering at Bronto

Former Googler/FeedBurner(er)

Web Analytics background

Still working this out in therapy
What is a Hadoop?
Open source distributed computing framework built on Java

Named by Doug Cutting (Apache Lucene) after his son’s toy elephant

Main components: HDFS and MapReduce

Heavily used and sponsored by Yahoo

Also used by Facebook, Twitter, Rackspace, LinkedIn, countless others

Tremendous community and growing popularity
What does Hadoop do?
Networks nodes together to combine storage and computing power

Scales to petabytes of storage

Manages fault tolerance and data replication automagically

Excels at processing semi-structured and unstructured data

Provides framework for analyzing data in parallel (MapReduce)
What does Hadoop not do?
No random access (it’s not a database)

Not real-time (it’s batch oriented)

Doesn’t make things obvious (there’s a learning curve)
Where do we start?
1. HDFS & MapReduce

2. ???

3. Profit
Hadoop’s Filesystem (HDFS)
Hadoop Distributed File System, based on Google’s GFS whitepaper

Data stored in blocks across cluster

Hadoop manages replication, node failure, rebalancing

Namenode is the master; Datanodes are slaves

Data is stored on disk, but not accessible via the local file system; use the Hadoop API/tools
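
Not in the original deck, but to make the "use the Hadoop API" point concrete: a minimal sketch of reading an HDFS file with the Java FileSystem API. The namenode URI and file path here are illustrative, not from the talk.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the Namenode; host/port are illustrative
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Opening the file asks the Namenode for block locations,
            // then streams the block data from the Datanodes
            Path file = new Path("/logs/access.log");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }

The bundled hadoop fs shell commands (ls, put, get, cat) go through this same client path.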
How HDFS stores data
Hadoop Client/API talks to the Namenode

Namenode looks up block locations and returns which Datanodes have the data

Hadoop Client/API talks to the Datanodes to read the file data

This is the only way to access HDFS data

HDFS data on the local file system is stored in blocks all over the cluster

[Diagram: a Namenode holding the file-to-block map (e.g. file001 on Datanodes 1, 2, 3; file006 on Datanode 4) and four Datanodes, each storing a subset of the blocks]
About that Namenode ...
Namenode manages filesystem and file metadata; Datanodes store the actual blocks of data

Namenode keeps track of available Datanodes and file locations across the cluster

Namenode is a SPOF (single point of failure)

If you lose the Namenode metadata, Hadoop has no idea which files are in which blocks

[Diagram: one Namenode coordinating four Datanodes]
HDFS Tips & Tricks
Write Namenode metadata to multiple local devices & a remote device (NFS mount)

No RAID, use JBOD. More disks == more disk I/O

Mount disks with noatime (skip writing last accessed time on file reads)

LZO compression; saves space, speeds network transfer

Tweak and test settings with included JARs: TestDFSIO, sort example
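
A sketch of that first tip, assuming a 0.20-era hdfs-site.xml: the Namenode writes its metadata to every directory listed in dfs.name.dir, so listing two local disks plus an NFS mount keeps copies on independent devices. The paths are illustrative.

    <!-- hdfs-site.xml: example paths, adjust to your mounts -->
    <property>
      <name>dfs.name.dir</name>
      <value>/data/1/dfs/name,/data/2/dfs/name,/mnt/nfs/dfs/name</value>
    </property>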
Quick break before we move on to MapReduce
Hadoop’s MapReduce
Framework for running tasks in parallel, based on Google’s whitepaper

JobTracker is the master; it schedules tasks on nodes, monitors tasks, and retries failures

TaskTrackers are the slaves; they run specified tasks against specified bits of data on HDFS

Map/Reduce functions operate on smaller parts of the problem, distributed across multiple nodes
Oversimplified MapReduce Example
18.106.61.94 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 354 company.com "-" "User agent"
77.220.219.58 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 238 company.com "-" "User agent"
121.41.7.104 - [18/Jul/2010:07:02:42 -0400] "GET /index2 HTTP/1.1" 200 2079 company.com "-" "User agent"
42.7.64.102 - - [20/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 173 company.com "-" "User agent"

1. Each line of the log file is input to the map function. The map parses the line and emits a key/value pair representing the page, and that it was viewed once.

    mapper(filename, file-contents):
      for each line in file-contents:
        page = parsePage(line)
        emit(page, 1)

2. The reducer is given a key and all occurrences of values for that key. The reducer sums the values and outputs a key/value pair representing the page and a total number of views.

    reduce(key, values):
      int views = 0
      for each value in values:
        views++
      emit(key, views)

3. The result is a count of how many times each webpage has appeared in this log file.

    (index1, 3)
    (index2, 1)
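
For the curious, here is roughly what that pseudocode looks like in real Hadoop Java (the org.apache.hadoop.mapreduce API). The class names and the regex standing in for parsePage are my own illustration, not from the talk.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageViews {

        // Map: parse the request path out of each log line, emit (page, 1)
        public static class ViewMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final Pattern PAGE =
                    Pattern.compile("\"GET (\\S+) HTTP");
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                Matcher m = PAGE.matcher(line.toString());
                if (m.find()) {
                    context.write(new Text(m.group(1)), ONE);
                }
            }
        }

        // Reduce: sum the 1s for each page to get a total view count
        public static class ViewReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text page, Iterable<IntWritable> counts,
                    Context context) throws IOException, InterruptedException {
                int views = 0;
                for (IntWritable count : counts) {
                    views += count.get();
                }
                context.write(page, new IntWritable(views));
            }
        }
    }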
Hadoop MapReduce data flow


InputFormat controls where data comes from,
breaks into InputSplits

RecordReader knows how to read InputSplit, passes
data to map function

Mappers do their thing, output intermediate data to
local disk

Hadoop shuffles and sorts the keys in the map output so all occurrences of the same key are passed to the reducer together

Reducers do their thing, send output to OutputFormat

OutputFormat controls where data goes

(chart from the Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/index.html)
Input/Output Formats

TextInputFormat - Reads text files, each line is an input

TextOutputFormat - Writes output from Hadoop to plain text

DBInputFormat - Reads JDBC sources, rows map to custom DBWritable

DBOutputFormat - Writes to JDBC sources, again using DBWritable

ColumnFamilyInputFormat - Reads rows from a Cassandra ColumnFamily
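
To show where these formats plug in, here is a hypothetical driver for the page-view job sketched earlier, wired to TextInputFormat and TextOutputFormat; the paths and class names are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PageViewsDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "page views");
            job.setJarByClass(PageViewsDriver.class);

            // Where data comes from, and how it is read: one log line per record
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/logs/access"));

            job.setMapperClass(PageViews.ViewMapper.class);
            job.setReducerClass(PageViews.ViewReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Where results go: plain "page <tab> views" text files on HDFS
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("/output/pageviews"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }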
MapReduce Tips & Tricks
You don’t have to do it in Java; current MapReduce abstractions are
awesome

Pig, Hive - performance is close enough to native MR, with big productivity
boost

Hadoop Streaming - passes data through stdin/stdout so you can use any
language. Ruby, Python popular choices

Amazon’s Elastic MapReduce - on-demand MR jobs on EC2 instances
Hadoop at Bronto
5 node cluster, adding 8 more; each node 4x 1TB drives, 16GB memory, 8
cores

Mostly Pig scripts, some Java utility MR jobs

Jobs process raw data/mail logs; store aggregate stats in Cassandra

Ad-hoc scripts analyze internal logs for app monitoring/debugging

Using Cassandra with Hadoop (we’re rolling our own InputFormat)
Summary
Hadoop excels at big data, analytics, batch processing

Not real-time, no random access; not a database

HDFS makes it all possible: massively scalable, fault tolerant file system

MapReduce provides framework for processing data on HDFS

Pig, Hive easy to use, big productivity gain, close enough performance in
most cases
Questions?
      email: jeff.turner@bronto.com
    twitter: twitter.com/jefft

We’re hiring: http://bronto.com/company/careers


Editor's Notes

  1. Yahoo: 38K nodes, 4K node cluster; Facebook: 2K node cluster, 21 PB
  2. >60% of jobs at Yahoo are Pig