
Distributed Computing Hadoop Framework Process Large Datasets

  • 1. Federico Cargnelutti / BSkyB: Hadoop & Distributed Computing
  • 2. Distributed computing uses software to divide pieces of a program among several computers. One project in particular has proven that the concept works extremely well.
  • 3. SETI@Home: Search for Extra-Terrestrial Intelligence • Prove the viability of the distributed grid computing concept (succeeded) • Detect intelligent life outside Earth (failed)
  • 4. Distributed Computing: What problem are we trying to solve?
  • 5. Counts of all the distinct words • in a file? • in a directory? • on the Web?
  • 6. We need to process 100TB datasets • On 1 node: scanning @ 50MB/s = 23 days • On a 1000-node cluster: scanning @ 50MB/s per node = 33 min
  • 7. We need a framework for distribution
  • 8. We need a new paradigm
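The scan times on slide 6 can be checked with a short calculation. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a per-node scan rate of 50 MB/s, and perfectly linear scaling with no coordination overhead; the function name `scan_seconds` is hypothetical.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes


def scan_seconds(dataset_bytes, bytes_per_second_per_node, nodes=1):
    """Time to scan a dataset split evenly across `nodes` (idealized)."""
    return dataset_bytes / (bytes_per_second_per_node * nodes)


# One node scanning 100 TB at 50 MB/s.
single_node_days = scan_seconds(100 * TB, 50 * MB) / 86_400

# 1000 nodes, each scanning its share at 50 MB/s.
cluster_minutes = scan_seconds(100 * TB, 50 * MB, nodes=1000) / 60

print(f"1 node:     {single_node_days:.1f} days")    # 23.1 days
print(f"1000 nodes: {cluster_minutes:.1f} minutes")  # 33.3 minutes
```

A real cluster would not scale this cleanly, since scheduling, network transfer and stragglers all add time, so the 33 minutes is best read as a lower bound.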
  • 9.
  • 10. Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware
  • 11. Scalable: Hadoop can reliably store and process petabytes of data. Economical: Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes. Efficient: Hadoop can process the distributed data in parallel on the nodes where the data is located. Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
  • 12. Hadoop Components: Hadoop Distributed File System (HDFS) • Java, Shell, C and HTTP APIs. Hadoop MapReduce • Java and Streaming APIs. Hadoop on Demand • Tools to manage dynamic setup and teardown of Hadoop nodes
  • 13. Other Tools: HBase: table storage on top of HDFS, modeled after Google's Bigtable. Pig: language for dataflow programming. Hive: SQL interface to structured data stored in HDFS
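The Streaming API mentioned on slide 12 lets mappers and reducers be ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with the framework sorting by key between the two phases. The sketch below imitates that contract in plain Python so it can run without a cluster. The function names `mapper` and `reducer` and the sample lines are assumptions for illustration, and `sorted` stands in for the framework's sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which the framework
    guarantees for a streaming reducer.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the sort step between map and reduce.
mapped = sorted(mapper(["in a file", "in a directory"]))
result = list(reducer(mapped))
print(result)  # ['a\t2', 'directory\t1', 'file\t1', 'in\t2']
```

On a real cluster the two functions would live in separate scripts that loop over `sys.stdin` and `print` their output; the job would then be submitted through the Hadoop Streaming jar.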
  • 14. Hadoop MapReduce • Mappers and Reducers are allocated • Code is shipped to nodes • Mappers and Reducers are run on the same machines as DataNodes • Two major daemons: JobTracker and TaskTracker
  • 15. Hadoop MapReduce: JobTracker • Long-lived master daemon which distributes tasks • Maintains a job history of job execution statistics. TaskTrackers • Long-lived client daemon which executes Map and Reduce tasks
  • 16. Hadoop MapReduce • Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) • Create a hierarchical HDFS with directories and files • Use the Hadoop API to store a large text file • Create a MapReduce application
  • 17. Map • Mapper takes an input key/value pair • Does something to its input • Emits intermediate key/value pairs • One call per input record • Fully data-parallel
  • 18. Map (in, 1) (in, 1) (sunt, 1) (in, 1) (elit, 1) (sed, 1) (eiusmod, 1)
  • 19. Reduce • Input is the list of all intermediate values for a given key • Reducer aggregates the list of intermediate values • Returns a final key/value pair for output
  • 20. Reduce (irure, 1) (in, 3) (ea, 1) (enim, 1) (eu, 1) (Duis, 1) (dolore, 2)
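The map and reduce steps on slides 17 to 20 can be sketched as an in-memory word count. This is an illustration of the programming model, not Hadoop's API: the function names and the `shuffle` helper are hypothetical, and the sample text is an assumption (the standard lorem ipsum passage, which appears consistent with the pairs shown on the slides).

```python
import re
from collections import defaultdict

# Assumption: the slides' sample pairs look like they come from this passage.
TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim "
    "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea "
    "commodo consequat. Duis aute irure dolor in reprehenderit in voluptate "
    "velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint "
    "occaecat cupidatat non proident, sunt in culpa qui officia deserunt "
    "mollit anim id est laborum."
)


def map_words(record):
    """Map: called once per input record, emits a (word, 1) pair per word."""
    for word in re.findall(r"[A-Za-z]+", record):
        yield (word, 1)


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: aggregate all intermediate values for one key."""
    return (key, sum(values))


def word_count(records):
    intermediate = [pair for record in records for pair in map_words(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Treat each sentence as one input record.
counts = word_count(TEXT.split(". "))
print(counts["in"], counts["dolore"], counts["Duis"])  # 3 2 1
```

Because `map_words` looks at one record at a time and `reduce_counts` at one key at a time, both phases can be spread across machines, which is the "fully data-parallel" property the Map slide refers to.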
  • 21. Who is using it? • Adobe: used for data storage and processing, 30 nodes • Facebook: used for reporting and analytics, 320 nodes • FOX: used for log analysis and data mining, 140 nodes • Last.fm: used for chart calculation and log analysis, 27 nodes • New York Times: used for large-scale image conversion, 100 nodes • Yahoo!: used for ad systems and Web search, 10,000 nodes
  • 22. Use Cases • Video and image processing • Log analysis • Spam/bot analysis • Behavioral analytics (CRM) • Sequential pattern analysis (e.g. understanding long-term customer buying behavior for cross-selling and target marketing)
  • 23. Recommended Hardware: Commodity servers • 1 RU • 2 x 4-core CPU • 4-8GB of RAM using ECC memory • 4 x 1TB SATA drives • 1-5TB external storage. Typically arranged in a 2-level architecture • 30 to 40 nodes per rack
  • 24. Challenges • No version and dependency management. • Configuration: more than 150 parameters. • No security against accidents. User identification was added after Last.fm deleted a filesystem by accident. • HDFS is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file. • Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially for those who were not familiar with MapReduce.
  • 25. Questions? Images: http://www.flickr.com/photos/labguest/3509303134 http://www.flickr.com/photos/tantrum_dan/3546852841