3. Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
6. What is Hadoop?
Definition + Brief History of Hadoop
7. What Is Apache Hadoop?
▪ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
▪ It is a top-level Apache project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
▪ It provides a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
8. History
▪ Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library.
▪ Hadoop has its origins in Apache Nutch, an open-source web search engine, itself a part of the Lucene project.
10. Evolution (cont.)
▪ Apache Hadoop was inspired by Google’s MapReduce and
Google File System papers and cultivated at Yahoo!
▪ It started as a large-scale distributed batch processing
infrastructure.
▪ Designed to meet the need for an affordable, scalable, and flexible infrastructure for working with very large data sets.
14. Hadoop’s Developers
▪ 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.
▪ The project was funded by Yahoo!.
▪ 2006: Yahoo! gave the project to the Apache Software Foundation.
[Photo: Doug Cutting]
15. The Origin of the Name “Hadoop”
▪ The name Hadoop is not an acronym; it’s a made-up name. The
project’s creator, Doug Cutting, explains how the name came
about:
▪ The name my kid gave a stuffed yellow elephant.
▪ Short, relatively easy to spell and pronounce, meaningless, and
not used elsewhere: those are my naming criteria.
▪ Kids are good at generating such. Googol is a kid’s term.
▪ Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example).
16. Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
17. The Rise of Big Data
18. Big Data
▪ We live in the age of big data.
▪ The data volumes we need to work with on a day-to-day basis have outgrown
the storage and processing capabilities of a single host.
▪ Big data brings with it two fundamental challenges:
▪ how to store and work with voluminous data sizes
▪ how to understand data and turn it into a competitive advantage
19. Examples:
▪ This flood of data is coming from many sources:
▪ The New York Stock Exchange generates about one terabyte of new trade data per
day.
▪ Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
▪ Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
▪ The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20
terabytes per month.
▪ The Large Hadron Collider near Geneva, Switzerland, will produce about 15
petabytes of data per year.
▪ So there’s a lot of data out there…
20. Evolution…
▪ Before Hadoop, big data required a lot of computing power, storage, and
parallelism.
▪ This meant that organizations had to spend a lot of money to build the infrastructure needed to support big data analytics.
▪ Given the large price tag, only the largest Fortune 500 organizations could
afford such an infrastructure.
21. At Last, Hadoop!
▪ It is very difficult to store, compute on, and analyze these massive volumes of data.
▪ Hadoop is designed to answer the question:
▪ “How can we process big data at a reasonable cost and in a reasonable time?”
22. Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture/Anatomy of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
25. Architecture of Hadoop Cluster
26. Anatomy of Hadoop – Basic Modules
▪ Hadoop Common: The common utilities that support the other Hadoop
modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
27. Hadoop Distributed File System (HDFS)
▪ When a dataset outgrows the storage capacity of a single physical machine, it
becomes necessary to partition it across a number of separate machines.
▪ Filesystems that manage the storage across a network of machines are called
distributed filesystems.
▪ Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed File System.
▪ HDFS is designed to run on commodity hardware: it is highly fault-tolerant and is designed to be deployed on low-cost hardware (a minimal usage sketch follows below).
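To make the filesystem concrete, here is a minimal sketch (not from the original slides) of writing a file to HDFS from Java using the FileSystem API. The path and class names are hypothetical; the cluster address is assumed to come from the standard core-site.xml configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the configured filesystem
    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
      out.writeUTF("Hello, HDFS");                 // HDFS splits data into blocks and replicates them
    }
    System.out.println("exists: " + fs.exists(file));
  }
}
```

From the application’s point of view this looks like ordinary file I/O; the block splitting and replication described on the following slides happen transparently underneath.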
28. What HDFS Does
▪ HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
▪ Provides scalable and reliable data storage, and it was designed to span large
clusters of commodity servers.
▪ HDFS has demonstrated production scalability of up to 200 PB of storage and a
single cluster of 4500 servers, supporting close to a billion files and blocks.
▪ HDFS is designed to store very large amounts of information (terabytes or petabytes), spreading the data across a large number of machines.
29. What HDFS Does (cont.)
▪ HDFS should store data reliably. If individual machines in the cluster
malfunction, data should still be available.
▪ HDFS should provide fast, scalable access to this information.
▪ It should be possible to serve a larger number of clients by simply adding more
machines to the cluster.
▪ HDFS should integrate well with Hadoop MapReduce, allowing data to be read
and computed upon locally when possible.
30. How HDFS Works
▪ An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data.
▪ Files and directories are represented on the NameNode by inodes.
▪ Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas (see the sketch below).
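As a hedged illustration of the per-file metadata the NameNode keeps, the FileStatus object returned by the Java FileSystem API exposes these inode-like attributes (the path below is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStat {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/demo/hello.txt")); // hypothetical path
    System.out.println("permissions : " + st.getPermission());      // e.g. rw-r--r--
    System.out.println("modified    : " + st.getModificationTime()); // epoch millis
    System.out.println("replication : " + st.getReplication());     // replicas per block
    System.out.println("block size  : " + st.getBlockSize());       // bytes per block
  }
}
```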
31. HDFS Inside Story
▪ The file content is split into large blocks (typically 128 megabytes),
and each block of the file is independently replicated at multiple
DataNodes.
▪ The blocks are stored on the local file system on the DataNodes.
▪ The NameNode actively monitors the number of replicas of each block.
▪ When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block (see the sketch below for controlling the replica count).
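The target replica count can be changed per file; a minimal sketch, assuming the same hypothetical file as before. The NameNode then schedules replication (or replica deletion) on the DataNodes to match the new target.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt");    // hypothetical path
    boolean ok = fs.setReplication(file, (short) 2); // ask for 2 replicas instead of the cluster default
    System.out.println("replication change accepted: " + ok);
  }
}
```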
32. HDFS Inside Story (cont.)
▪ The NameNode sends instructions to the DataNodes by replying to the heartbeats sent by those DataNodes.
▪ The instructions include commands to:
▪ replicate blocks to other nodes.
▪ remove local block replicas.
▪ re-register and send an immediate block report.
▪ shut down the node.
34. MapReduce
Called “classic MapReduce” in the Hadoop ecosystem
35. History of MapReduce
▪ The actual origins of MapReduce are arguable, but the paper most often cited as the one that started us down this journey is “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat (2004).
▪ This paper described how Google split, processed, and aggregated their data set
of mind-boggling size.
▪ Shortly after the release of the paper, a free and open source software pioneer by
the name of Doug Cutting started working on a MapReduce implementation to
solve scalability in another project he was working on called Nutch.
▪ an effort to build an open source search engine
▪ Over time and with some investment by Yahoo!, Hadoop split out as its own
project and eventually became a top-level Apache Foundation project.
36. What Is It?
▪ MapReduce is a programming model and an associated implementation for processing
and generating large data sets.
▪ It was originally developed by Google and built on well-known principles in
parallel and distributed processing dating back several decades.
▪ MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo! (the canonical WordCount example is sketched below).
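To make the programming model concrete, here is the canonical WordCount example, in the style of the Hadoop MapReduce tutorial: the map function emits (word, 1) for every word in its input, and the reduce function sums the counts for each word. It is shown as an illustrative sketch rather than as part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map: (offset, line) -> (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      result.set(sum);
      ctx.write(key, result);
    }
  }
}
```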
37. What Is It? (cont.)
▪ Hadoop MapReduce is a software framework for easily writing distributed applications.
▪ Such applications process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
38. How MapReduce Works
▪ A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
▪ The framework sorts the outputs of the maps, which are then input to the reduce
tasks.
▪ Typically both the input and the output of the job are stored in a file-system.
▪ The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks (a minimal job driver is sketched below).
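To show how a job is wired together, here is a hedged sketch of a driver for the WordCount classes above, using the standard org.apache.hadoop.mapreduce Job API; the input and output paths are taken from the command line and are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);  // map phase
    job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);   // reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until the job finishes
  }
}
```

The framework handles everything the slide describes: splitting the input, scheduling and monitoring the map and reduce tasks, and sorting the map outputs before they reach the reducers.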
39. How MapReduce Works (cont.)
▪ Typically the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System are running on the
same set of nodes.
▪ This configuration allows the framework to effectively schedule tasks on the
nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
40. How MapReduce Works (cont.)
▪ There are two types of nodes that control the job execution process: a jobtracker
and a number of tasktrackers.
▪ The jobtracker coordinates all the jobs run on the system by scheduling tasks to
run on tasktrackers.
▪ Tasktrackers run tasks and send progress reports to the jobtracker, which keeps
a record of the overall progress of each job.
▪ If a task fails, the jobtracker can reschedule it on a different tasktracker.
41. How MapReduce Works (cont.)
▪ Hadoop does its best to run the map task on a node where the input data resides
in HDFS.
▪ This is called the data locality optimization since it doesn’t use valuable cluster
bandwidth.
▪ Map tasks write their output to the local disk, not to HDFS. Why is this?
▪ Map output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete the map output can be thrown away.
▪ Storing it in HDFS, with replication, would be overkill!
42. How MapReduce Works (cont.)
▪ Hadoop divides the input to a
MapReduce job into fixed-size pieces
called input splits, or just splits.
▪ Hadoop creates one map task for each
split, which runs the user defined
map function for each record in the
split.
▪ Having many splits means the time taken to process each split is small compared to the time to process the whole input (split sizes can be tuned, as sketched below).
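By default one input split corresponds to one HDFS block, but the split size (and therefore the number of map tasks) can be tuned per job. A hedged sketch, assuming a FileInputFormat-based job like the driver above; the helper name and the bounds are hypothetical choices:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  // Constrain split sizes so each map task gets between 64 MB and 256 MB of input.
  static void tune(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}
```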
45. YARN?
▪ For very large clusters in the region of 4000 nodes and higher, the MapReduce
system described previously begins to hit scalability bottlenecks.
▪ In 2010 a group at Yahoo! began to design the next generation of MapReduce.
▪ The result was YARN, short for Yet Another Resource Negotiator.
46. YARN
▪ YARN addresses the scalability shortcomings of “classic” MapReduce by splitting the responsibilities of the jobtracker into separate entities.
▪ In classic MapReduce, the jobtracker takes care of both:
▪ job scheduling: matching tasks with tasktrackers
▪ task progress monitoring: keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter totals
47. YARN
▪ YARN separates these two roles into two independent daemons:
▪ A resource manager to manage the use of resources across the cluster.
▪ An application master to manage the lifecycle of applications running on the cluster.
▪ The idea is that an application master negotiates with the resource manager for cluster resources:
▪ described in terms of a number of containers, each with a certain memory limit
▪ and runs application-specific processes in those containers.
▪ The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
54. What can we do with Hadoop?
▪ http://hortonworks.com/industry/
55. Hands-On Exercise
56. References
1. www.hortonworks.com
2. www.hadoop.apache.org
3. www.cloudera.com
4. Tom White, Hadoop: The Definitive Guide, Third Edition, O’Reilly Media.
5. A comprehensive tutorial on running a Hadoop cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Editor's Notes
An inode is a data structure on a filesystem on Linux and other Unix-like operating systems that stores all the information about a file except its name and its actual data. A data structure is a way of storing data so that it can be used efficiently
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. A combiner combines key/value pairs with the same key. Each combiner may run zero, one, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers is grouped by the key, split among reducers, and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply a custom compare function for sorting and a Partitioner for the data split.
The Partitioner decides which reducer will get a particular key/value pair (a sketch follows below).
A reducer obtains key/[value list] pairs, sorted by the key. The value list contains all values with the same key produced by the mappers. Each reducer emits zero, one, or multiple output key/value pairs for each input pair.
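To illustrate, a hedged sketch of a custom Partitioner; the class name and the two-reducer split are hypothetical, and the job would also need job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys starting with 'a'-'m' to reducer 0 and the rest to reducer 1.
// Assumes non-empty keys, as produced by the WordCount mapper above.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    char first = Character.toLowerCase(key.toString().charAt(0));
    int p = (first >= 'a' && first <= 'm') ? 0 : 1;
    return p % numPartitions; // stay in range even if fewer reducers are configured
  }
}
```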