Hadoop &
BigData
Gabriel
Răileanu
Agenda
BigData
concept
Introduction
Challenges
Hadoop
Basics
Distributed file
system
Map/Reduce
How Hadoop
works?
Q/A
Bibliography
Demo
Hadoop & BigData
Gabriel Răileanu
December 10, 2019
Gabriel Răileanu (AC TUIAȘI) Hadoop & BigData December 10, 2019 1 / 27
Agenda
1 BigData concept
2 Hadoop
3 How Hadoop works?
4 Q/A
5 Bibliography
6 Demo
BigData concept Introduction
BigData:
• More complex data sets than traditional ones, especially
from new data sources. These data sets are so
voluminous that traditional data processing software just
can’t manage them.
• These massive volumes of data can be used to address
business problems you wouldn’t have been able to tackle
before.
BigData concept Challenges
BigData challenges:
• Dealing with data growth
• Generating insights in a timely manner
• Integrating disparate data sources
• Validating data
• Securing BigData
• Recruiting and retaining big data talent
Hadoop Basics
• Open source Apache project
• Hadoop Core includes:
• Distributed File System - distributes data
• Map/Reduce - distributes application
• Runs on Java → cross-platform
So, Hadoop:
• Is an open-source framework for writing & running
distributed applications that process large amounts of
data (≡ BigData volumes).
• Runs on large clusters of commodity machines or on
cloud computing services
While not strictly necessary, machines in a Hadoop cluster
are usually relatively homogeneous x86 boxes, almost always
located in the same data center, often in the same set of
racks.
Hadoop Distributed file system
Hadoop Distributed File System (HDFS)
• Based on the Google File System (GFS) and provides a
distributed file system that is designed to run on
commodity hardware.
• Highly fault-tolerant and is designed to be deployed on
low-cost hardware.
• Provides high throughput access to application data and
is suitable for applications having large datasets.
HDFS:
• HDFS holds very large amounts of data and makes them
easy to access; files are stored across multiple
machines.
• Files are stored redundantly to protect the system
from data loss in case of failure.
• HDFS also makes the data available to applications for
parallel processing.
HDFS - goals:
• Fault detection and recovery
• Huge datasets
• Moving computation to the data: a requested task runs
efficiently when the computation takes place near the
data. Especially with huge datasets, this reduces
network traffic and increases throughput.
HDFS architecture
• Follows the master-slave architecture.
Elements:
1 NameNode
2 DataNode
3 (Data)Block
NameNode:
• Is the master of HDFS that directs the slave DataNode
daemons to perform the low-level I/O tasks.
• Is the bookkeeper of HDFS: it keeps track of how your
files are broken down into blocks, which nodes store
those blocks, and the overall health of the distributed
filesystem.
• Executes file system namespace operations like opening,
closing, and renaming files and directories.
• It also determines the mapping of blocks to DataNodes.
Drawback:
The NameNode is a single point of failure of a Hadoop
cluster (unless NameNode high availability, introduced
in Hadoop 2.x, is configured). If the host of any other
daemon fails for software or hardware reasons, the
cluster will likely continue to function smoothly, or you
can quickly restart the daemon → not so for the NameNode.
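The bookkeeping role described above can be pictured as two lookup tables. The following is a minimal Python sketch, not HDFS code: the class `ToyNameNode`, its methods, the file path, and the node names are all hypothetical and only illustrate the file-to-blocks and block-to-DataNodes mappings.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical in-memory model of NameNode metadata (not the HDFS API)."""

    def __init__(self):
        self.file_blocks = {}                    # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNode names

    def add_file(self, path, block_ids):
        # Record how the file is broken down into blocks.
        self.file_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        # A DataNode reports that it stores a replica of this block.
        self.block_locations[block_id].add(datanode)

    def locate(self, path):
        # Answer a client: which nodes hold each block of the file?
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_file("/logs/app.log", ["blk_1", "blk_2"])
nn.report_block("blk_1", "datanode-a")
nn.report_block("blk_1", "datanode-b")
nn.report_block("blk_2", "datanode-b")
print(nn.locate("/logs/app.log"))
```

In this toy model, losing the single `ToyNameNode` object loses both tables at once, which is one way to see why the NameNode is such a critical component.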
DataNode:
• In addition to the NameNode, there are a number of
DataNodes, usually one per node in the cluster, which
manage storage attached to the nodes that they run on.
• The DataNodes are responsible for serving read and
write requests from the file system’s clients.
• The DataNodes also perform block creation, deletion,
and replication upon instruction from the NameNode.
(Data)Block:
• HDFS splits huge files into small chunks known as data
blocks.
• A data block is the smallest unit of data in HDFS.
Neither the client nor the administrator controls
details such as block location; the NameNode decides
these.
• The default HDFS block size is 128 MB (64 MB in
Hadoop 1.x), and it is configurable. All blocks of a file
are the same size except the last one, which can be the
same size or smaller.
• Once split into blocks, the files are stored in HDFS,
and Hadoop distributes the blocks across multiple nodes.
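The block arithmetic above can be checked with a short calculation. This is a minimal Python sketch under the assumption of a 128 MB block size; the function name `split_into_blocks` is illustrative and not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the rest
    return sizes


MB = 1024 * 1024
sizes = split_into_blocks(300 * MB)  # a 300 MB file
print(len(sizes), [s // MB for s in sizes])  # prints: 3 [128, 128, 44]
```

A 300 MB file therefore occupies two full blocks and one 44 MB block, while a file of exactly 256 MB occupies two full blocks with no smaller last block.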
Hadoop Map/Reduce
Hadoop processes data like a pipeline (e.g. in a functional
programming style).
E.g.: a Linux pipe
• Pipelines help the reuse of processing primitives:
simply chaining existing modules creates new ones.
• Message queues help synchronize processing
primitives.
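The idea of chaining existing primitives into new ones can be sketched in Python. This is only an illustration of the pipeline style, not Hadoop code; the helper `pipeline` and the stage names are hypothetical.

```python
from functools import reduce


def pipeline(*stages):
    """Chain stages left to right, like `a | b | c` in a shell."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


def strip_blank(lines):
    return (line for line in lines if line.strip())


def lowercase(lines):
    return (line.lower() for line in lines)


def grep(word):
    # Returns a new stage; the existing stages are reused unchanged.
    return lambda lines: (line for line in lines if word in line)


# A new processing step built purely by chaining existing ones.
find_errors = pipeline(strip_blank, lowercase, grep("error"))
result = list(find_errors(["INFO start", "", "ERROR disk full", "Error: retry"]))
print(result)  # prints: ['error disk full', 'error: retry']
```

Each stage consumes and produces a stream of lines, so stages can be reordered or reused in other pipelines without modification.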
• Similarly, MapReduce is a programming model for
efficient distributed computing.
• It makes it easy to scale data processing over multiple
computing nodes.
• Under the MapReduce model, the data processing
primitives are called mappers & reducers.
Decomposing a data processing application into mappers
& reducers is sometimes nontrivial.
• Efficiency comes from:
• Streaming through data, reducing seeks.
• Pipelining
• A good fit for a lot of applications, e.g.:
• Log processing
• Web index building
MapReduce dataflow (figure)
Phases of the MapReduce algorithm:
• mapper: MapReduce takes the input data and feeds
each data element to the mapper.
• reducer: the reducer processes all the outputs from the
mapper and arrives at a final result.
In simple terms, the mapper is meant to filter and
transform the input into something that the reducer can
aggregate over.
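The two phases can be illustrated on a small log-processing task. This is a minimal single-process Python sketch, assuming log lines that start with a level such as `INFO` or `ERROR`; the functions `mapper`, `reducer`, and `run` are illustrative names, and a real Hadoop job would run many mappers and reducers in parallel.

```python
from collections import defaultdict


def mapper(line):
    """Filter and transform: emit (level, 1) for well-formed log lines only."""
    parts = line.split(maxsplit=1)
    if parts and parts[0] in {"INFO", "WARN", "ERROR"}:
        yield parts[0], 1


def reducer(key, values):
    """Aggregate all values seen for one key."""
    return key, sum(values)


def run(lines):
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)  # group the mapper output by key
    return dict(reducer(k, v) for k, v in grouped.items())


logs = ["INFO boot", "ERROR disk", "garbage line", "INFO ready", "ERROR net"]
print(run(logs))  # prints: {'INFO': 2, 'ERROR': 2}
```

The mapper drops the malformed line and reduces every other line to a key and a count, so the reducer only has to sum.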
How Hadoop works?
Hadoop runs code across a cluster of computers. This process
includes the following core tasks that Hadoop performs:
1 Data is initially divided into directories and files. Files
are divided into uniform-sized blocks, 128 MB by default
(64 MB in older Hadoop 1.x releases).
2 These blocks are then distributed across various cluster
nodes for further processing.
3 HDFS, sitting on top of the local file system, supervises
the processing.
4 Blocks are replicated to handle hardware failure.
5 Hadoop checks that the code was executed successfully.
6 It performs the sort that takes place between the map
and reduce stages.
7 It sends the sorted data to the computer that runs the
corresponding reducer.
8 It writes the debugging logs for each job.
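Steps 6 and 7 (sorting between map and reduce, then routing data to a computer) can be sketched as follows. This is a simplified Python illustration, not Hadoop's implementation: `zlib.crc32` is used here as a stand-in hash so results are stable across runs, and the function names are hypothetical.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    # Stand-in for a hash partitioner: the same key always maps to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to one reducer, then sort each reducer's input."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]


def grouped(bucket):
    # After sorting, equal keys are adjacent, so each key forms one group.
    return [(k, [v for _, v in g]) for k, g in groupby(bucket, key=lambda kv: kv[0])]


map_output = [("do", 1), ("as", 1), ("i", 1), ("say", 1),
              ("not", 1), ("as", 1), ("i", 1), ("do", 1)]
for reducer_id, bucket in enumerate(shuffle(map_output, num_reducers=2)):
    print(reducer_id, grouped(bucket))
```

Because the partition depends only on the key, all values for a given word end up at the same reducer, which is what lets each reducer compute a complete result for its keys.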
Q/A?
Bibliography
Chuck Lam. Hadoop in Action. Manning Publications Co.,
Greenwich, CT, USA, 2010.
Data Block in HDFS | HDFS Blocks & Data Block Size.
https://data-flair.training/blogs/data-block/
Apache Hadoop documentation.
https://hadoop.apache.org/docs/r1.2.1
Demo
(Offline) Demo
Our exercise is to count the number of times each word
occurs in a set of documents. Let’s suppose that our
document has only one sentence:
Do as I say, not as I do
Word   Count
as     2
do     2
i      2
not    1
say    1
A simple pseudo-code for this particular word counting:
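One way to write it, as a minimal Python sketch (the helper names `tokenize` and `word_count` are illustrative, and the tokenizer simply lowercases the text and keeps alphabetic runs):

```python
import re
from collections import Counter


def tokenize(document):
    # Lowercase first, so "Do" and "do" count as the same word.
    return re.findall(r"[a-z]+", document.lower())


def word_count(documents):
    counts = Counter()
    for document in documents:            # loop over every document...
        for token in tokenize(document):  # ...and over every word in it
            counts[token] += 1
    return counts


counts = word_count(["Do as I say, not as I do"])
for word in sorted(counts):
    print(word, counts[word])
```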
• This program works fine until the set of documents you
want to process becomes large.
• Looping through all the documents using a single
computer will be extremely time consuming → rewrite
the program so that it distributes the work over several
machines.
Gabriel Răileanu (AC TUIAȘI) Hadoop & BigData December 10, 2019 24 / 27
Hadoop &
BigData
Gabriel
Răileanu
Agenda
BigData
concept
Introduction
Challanges
Hadoop
Basics
Distributed file
system
Map/Reduce
How Hadoop
works?
Q/A
Bibliography
Demo
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Demo
• Each machine will process a distinct fraction of the
documents.
• When all the machines have completed this, a second
phase of processing will combine the result of all the
machines.
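The two phases above can be simulated on one computer. In this Python sketch each inner list stands for the documents assigned to one machine; the function names `count_words` and `combine` are illustrative, and nothing here is actually distributed.

```python
from collections import Counter


def count_words(documents):
    """Phase 1: runs independently on each machine over its share of documents."""
    counts = Counter()
    for document in documents:
        counts.update(document.lower().split())
    return counts


def combine(partial_counts):
    """Phase 2: merge the per-machine results into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds the counts key by key
    return total


# Each inner list stands for the documents given to one (simulated) machine.
shares = [["do as i say"], ["not as i do"]]
partials = [count_words(share) for share in shares]
print(combine(partials))
```

Phase 1 needs no communication between machines, so it scales with the number of machines; phase 2 is where the partial results have to be brought together.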
The Hadoop way of writing a mapper & reducer
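A sketch of what such a mapper and reducer might look like, written in Python in the style used with Hadoop Streaming (tab-separated key/value lines). The function names and the local map → sort → reduce simulation are assumptions for illustration; on a cluster the framework performs the sort between the two functions.

```python
import re
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per token, as a streaming mapper would print."""
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each word; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on the example sentence.
mapped = mapper(["Do as I say, not as I do"])
for line in reducer(sorted(mapped)):
    print(line)
```

The mapper knows nothing about totals and the reducer knows nothing about documents, which is what allows the framework to run many copies of each in parallel.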

Hadoop presentation