Federico Cargnelutti / BSkyB
Hadoop & Distributed Computing
Distributed computing uses software to divide pieces of a program among several computers. One project in particular has proven that the concept works extremely well.
SETI@Home
Search for Extra-Terrestrial Intelligence
• Prove the viability of the distributed grid computing concept (succeeded)
• Detect intelligent life outside Earth (failed)
Distributed Computing
What problem are we trying to solve?
Counts of all the distinct words
• in a file?
• in a directory?
• on the Web?
We need to process 100TB datasets
• On 1 node:
o Scanning @ 50MB/s = 23 days
• On a 1000-node cluster:
o Scanning @ 50MB/s = 33 min
(The arithmetic: 100 TB at 50 MB/s is 2,000,000 seconds, roughly 23 days; split across 1,000 nodes it drops to about 2,000 seconds, roughly 33 minutes.)
We need a framework for distribution
We need a new paradigm
Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware
Scalable
Hadoop can reliably store and process petabytes of data.
Economical
Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
Hadoop Components
Hadoop Distributed File System (HDFS)
• Java, Shell, C and HTTP APIs
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop nodes
Other Tools
HBase
Table storage on top of HDFS, modeled after Google's Big Table
Pig
Language for dataflow programming
Hive
SQL interface to structured data stored in HDFS
Hadoop MapReduce
• Mappers and Reducers are allocated
• Code is shipped to the nodes
• Mappers and Reducers are run on the same machines as the DataNodes
• Two major daemons: JobTracker and TaskTracker
Hadoop MapReduce
JobTracker
• Long-lived master daemon which distributes tasks
• Maintains a job history of job execution statistics
TaskTrackers
• Long-lived client daemons which execute Map and Reduce tasks
Hadoop MapReduce
• Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files
• Use the Hadoop API to store a large text file (see the sketch below)
• Create a MapReduce application
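A minimal sketch of the storage step with the FileSystem API. The NameNode address and paths here are hypothetical, and fs.default.name is the JobTracker-era property name (fs.defaultFS in later releases):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: create a directory hierarchy in HDFS and store a large text file.
    public class HdfsStore {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            Path dir = new Path("/user/demo/input"); // hypothetical path
            fs.mkdirs(dir);
            Path dst = new Path(dir, "lorem.txt");
            fs.copyFromLocalFile(new Path("/tmp/lorem.txt"), dst);
            fs.setReplication(dst, (short) 3); // multiple copies, per the "Reliable" slide
            fs.close();
        }
    }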
Map
• Mapper takes an input key/value pair
• Does something to its input
• Emits an intermediate key/value pair
• One call per input record
• Fully data-parallel
Map
(in, 1) (in, 1) (sunt, 1) (in, 1) (elit, 1) (sed, 1) (eiusmod, 1)
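A word-count mapper along these lines, sketched with Hadoop's Java API; the class name and the whitespace tokenizer are our own choices, not from the slides:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One map() call per input record (a line of text); emits an
    // intermediate (word, 1) pair per token, e.g. (in, 1).
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // intermediate key/value pair
                }
            }
        }
    }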
Reduce
• Input is the list of intermediate values for a given key
• Reducer aggregates the list of intermediate values
• Returns a final key/value pair for output
Reduce
(irure, 1) (in, 3) (ea, 1) (enim, 1) (eu, 1) (Duis, 1) (dolore, 2)
Who is using it?
Adobe - Use for data storage and processing - 30 nodes
Facebook - Use for reporting and analytics - 320 nodes
FOX - Use for log analysis and data mining - 140 nodes
Last.fm - Use for chart calculation and log analysis - 27 nodes
New York Times - Use for large scale image conversion - 100 nodes
Yahoo! - Use for Ad systems and Web search - 10,000 nodes
Use Cases
• Video and image processing
• Log analysis
• Spam/BOT analysis
• Behavioral analytics (CRM)
• Sequential pattern analysis (e.g. understanding long-term customer buying behavior for cross-selling and target marketing)
Recommended Hardware
Commodity servers
• 1 RU
• 2 x 4-core CPUs
• 4-8GB of RAM using ECC memory
• 4 x 1TB SATA drives
• 1-5TB external storage
Typically arranged in a 2-level architecture
• 30/40 nodes per rack
Challenges
• No version and dependency management.
• Configuration: more than 150 parameters.
• No security against accidents. User identification was added after Last.fm deleted a filesystem by accident.
• HDFS is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file.
• Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially those who were not familiar with MapReduce.
Questions?
Images:
http://www.flickr.com/photos/labguest/3509303134
http://www.flickr.com/photos/tantrum_dan/3546852841