Apache Hadoop
The elephant in the room
C. Aaron Cois, Ph.D.
Me
@aaroncois
www.codehenge.net
Love to chat!
The Problem
Large-Scale Computation
• Traditionally, large computation was
focused on
– Complex, CPU-intensive calculations
– On relatively small data sets
• Examples:
– Calculate complex differential equations
– Calculate digits of Pi
Parallel Processing
• Distributed systems allow scalable
computation (more
processors, working simultaneously)
[Figure labels: INPUT, OUTPUT]
Data Storage
• Data is often stored on a SAN
• Data is copied to each compute node
at compute time
• This works well for small amounts of
data, but requires significant copy
time for large data sets
[Figure: data moves from the SAN to the compute nodes, then calculation starts]
You must first distribute data
each time you run a
computation…
How much data?
over 25 PB of data
over 100 PB of data
The internet
IDC estimates[2] the internet contains at
least:
1 Zettabyte
or
1,000 Exabytes
or
1,000,000 Petabytes
2 http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
How much time?
Disk Transfer Rates:
• Standard 7200 RPM drive
128.75 MB/s
=> 7.7 secs/GB
=> 13 mins/100 GB
=> > 2 hours/TB
=> 90 days/PB
1 http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
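The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 GB = 1,000 MB) and a sustained sequential read at the quoted 128.75 MB/s; the function and variable names are illustrative only.

```python
# Transfer time at a sustained sequential read rate (decimal units assumed).
RATE_MB_PER_S = 128.75  # figure quoted on the slide

SIZES_MB = {"1 GB": 1e3, "100 GB": 1e5, "1 TB": 1e6, "1 PB": 1e9}


def transfer_seconds(size_mb, rate_mb_per_s=RATE_MB_PER_S):
    """Seconds needed to stream size_mb megabytes at the given rate."""
    return size_mb / rate_mb_per_s


for label, size_mb in SIZES_MB.items():
    secs = transfer_seconds(size_mb)
    print(f"{label:>6}: {secs:,.1f} s = {secs / 3600:,.2f} h = {secs / 86400:,.2f} days")
```

Real drives rarely sustain their peak rate, so these times are best read as lower bounds.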
How much time?
Fastest Network Xfer rate:
• iSCSI over 100 Gb Ethernet (theor.)
– 12.5 GB/s => 80 sec/TB, 1333 min/PB
Ok, ignore network bottleneck:
• HyperTransport bus
– 51.2 GB/s => 19 sec/TB, 325 min/PB
1 http://en.wikipedia.org/wiki/List_of_device_bit_rates
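A quick check of the per-TB and per-PB times, as a sketch. It assumes the two rates are in gigabytes per second, which is what the quoted times imply, and uses decimal units; the helper names are made up for illustration.

```python
# Rates in gigabytes per second (assumption: the quoted times imply GB/s).
LINKS_GB_PER_S = {"iSCSI over Ethernet": 12.5, "HyperTransport": 51.2}


def seconds_per_tb(rate_gb_per_s):
    """Seconds to move 1 TB (1,000 GB) at the given rate."""
    return 1_000 / rate_gb_per_s


def minutes_per_pb(rate_gb_per_s):
    """Minutes to move 1 PB (1,000,000 GB) at the given rate."""
    return 1_000_000 / rate_gb_per_s / 60


for name, rate in LINKS_GB_PER_S.items():
    print(f"{name}: {seconds_per_tb(rate):.1f} sec/TB, {minutes_per_pb(rate):.0f} min/PB")
```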
We need a better plan
• Sending data to distributed processors is
the bottleneck
• So what if we sent the processors to the
data?
Core concept:
Pre-distribute and store the data.
Assign compute nodes to operate on local
data.
The Solution
Distributed Data Servers
Distribute the Data
Send computation code to servers
containing relevant data
Hadoop Origin
• Hadoop was modeled after innovative
systems created by Google
• Designed to handle massive (web-
scale) amounts of data
Fun Fact: Hadoop’s creator
named it after his son’s stuffed
elephant
Hadoop Goals
• Store massive data sets
• Enable distributed computation
• Heavy focus on
– Fault tolerance
– Data integrity
– Commodity hardware
Hadoop System
Google system -> Hadoop equivalent
• GFS -> HDFS
• MapReduce -> Hadoop MapReduce
• BigTable -> HBase
Hadoop Components
HDFS
• “Hadoop Distributed File System”
• Sits on top of native filesystem
– ext3, etc
• Stores data in files, replicated and
distributed across data nodes
• Files are “write once”
• Performs best with millions of ~100MB+
files
HDFS
Files are split into blocks for storage
Datanodes
– Data blocks are distributed/replicated
across datanodes
Namenode
– The master node
– Keeps track of location of data blocks
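The split-and-replicate idea can be sketched as a toy model. Everything here is an assumption for illustration: the 128 MB block size, the replication factor of 3, the node names, and the simple round-robin placement. Real HDFS placement is more involved (it considers racks, for example), and this is not the HDFS API.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: block size is configurable in a real cluster
REPLICATION = 3      # assumption: replica count chosen for this sketch


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]}, the kind of map a namenode tracks."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    block_map = {}
    for block in range(n_blocks):
        # Rotate the starting node per block; consecutive nodes hold the replicas,
        # so the replicas of one block always land on distinct datanodes.
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


block_map = place_blocks(300, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in block_map.items():
    print(block, nodes)
```

Losing any single datanode in this model still leaves two copies of every block, which is the point of replication.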
HDFS
[Figure: multi-node cluster with a master and a slave node, showing one NameNode and two DataNodes]
MapReduce
A programming model
– Designed to make programming parallel
computation over large distributed data
sets easy
– Each node processes data already
residing on it (when possible)
– Inspired by functional programming map
and reduce functions
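For readers unfamiliar with the functional primitives mentioned above, here is a minimal single-machine illustration using Python's built-in `map` and `functools.reduce`. The word list is made up; nothing here is Hadoop-specific.

```python
from functools import reduce

words = ["hadoop", "hdfs", "mapreduce"]

# map: apply a function to every element independently (easy to parallelize).
lengths = list(map(len, words))

# reduce: fold the mapped results into a single value.
total = reduce(lambda a, b: a + b, lengths, 0)

print(lengths, total)
```

Because each `map` call depends only on its own input, the calls could run on different machines, which is the property MapReduce exploits.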
MapReduce
JobTracker
– Runs on a master node
– Clients submit jobs to the JobTracker
– Assigns Map and Reduce tasks to slave
nodes
TaskTracker
– Runs on every slave node
– Daemon that instantiates Map or Reduce
tasks and reports results to JobTracker
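The scheduling idea (run each task where its data already lives, when possible) might look like the toy sketch below. The data structures, node names, and the first-fit policy are assumptions for illustration and do not reflect the real JobTracker implementation.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that already stores it.

    block_locations: {block_id: [node, ...]} nodes holding a replica
    free_slots:      {node: int} remaining task slots per node
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has capacity: fall back to any free node,
            # which means the block must be copied over the network.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


blocks = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n1"]}
plan = assign_tasks(blocks, {"n1": 2, "n2": 1, "n3": 1})
print(plan)
```

In this run the third block cannot be scheduled locally because `n1` is full, so it falls back to a remote node.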
MapReduce
[Figure: multi-node cluster with a master and a slave node, showing one JobTracker and two TaskTrackers]
[Figure: the same multi-node cluster (master and slave) in two layers. MapReduce layer: one JobTracker and two TaskTrackers. HDFS layer: one NameNode and two DataNodes.]
HBase
• Hadoop’s Database
• Sits on top of HDFS
• Provides random read/write access to
Very Large™ tables
– Billions of rows, billions of columns
• Access via
Java, Jython, Groovy, Scala, or REST
web service
A Typical Hadoop Cluster
• Consists entirely of commodity ~$5k
servers
• 1 master, 1 to 1000+ slaves
• Scales linearly as more processing
nodes are added
How it works
http://en.wikipedia.org/wiki/MapReduce
Traditional MapReduce
Hadoop MapReduce
Image Credit: http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854
MapReduce Example
function map(Str name, Str document):
    for each word w in document:
        increment_count(w, 1)

function reduce(Str word, Iter partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    return (word, sum)
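A runnable, single-process version of the same word count, as a sketch. The `run_job` helper stands in for the grouping ("shuffle") step that the framework performs between the map and reduce phases; it and the sample documents are assumptions for illustration, not part of Hadoop.

```python
from collections import defaultdict


def map_fn(name, document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, partial_counts):
    """Sum the partial counts collected for one word."""
    return word, sum(partial_counts)


def run_job(documents):
    # Shuffle: group every emitted value by key, as the framework
    # would do between the map and reduce phases.
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


counts = run_job({"a.txt": "the elephant in the room", "b.txt": "The room"})
print(counts)
```

On a cluster, many copies of `map_fn` and `reduce_fn` would run on different nodes; the user-written logic stays this small.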
What didn’t I worry about?
• Data distribution
• Node management
• Concurrency
• Error handling
• Node failure
• Load balancing
• Data replication/integrity
Demo
Try the demo yourself!
Go to:
https://github.com/cacois/vagrant-hadoop-cluster
Follow the instructions in the README


Editor's Notes

  1. Note: This study was from 2007. I don’t know if there’s a Moore’s Law of growth of data on the internet, but I expect this is a much larger number now.
  2. This is not a supercomputer, and it’s not intended to be. Google’s approach was always to use a lot of cheap, expendable commodity servers, rather than be beholden to expensive, custom hardware and vendors. What they knew was software, so they leaned on that expertise to produce a solution.