IN THE NAME OF ALLAH THE MOST
GRACIOUS THE MOST MERCIFUL
Hadoop Fundamentals
Presentation by: Mr. Awais Qureshi
Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
NUST School of Electrical Engineering & Computer Science 4/8/2016 3
What is Hadoop
Definition + Brief History of Hadoop
What Is Apache Hadoop?
▪ The Apache™ Hadoop® project develops open-source software
for reliable, scalable, distributed computing.
▪ Apache top level project, open-source implementation of
frameworks for reliable, scalable, distributed computing and data
storage.
▪ It is a flexible and highly-available architecture for large scale
computation and data processing on a network of commodity
hardware.
History
▪ Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library.
▪ Hadoop has its origins in Apache Nutch, an open source web
search engine, itself a part of the Lucene project.
Evolution of Hadoop!
Evolution (cont.)
▪ Apache Hadoop was inspired by Google’s MapReduce and
Google File System papers and cultivated at Yahoo!
▪ It started as a large-scale distributed batch processing
infrastructure.
▪ It was designed to meet the need for an affordable, scalable, and flexible
platform for working with very large data sets.
Hadoop as Inspired by Google
[Figure: timeline with milestones marked 2003, 2004, and 2006]
Evolution (cont.)
Another View…
Hadoop’s Developers
▪ 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to
support distribution for the Nutch search engine project.
▪ The project was funded by Yahoo.
▪ 2006: Yahoo gave the project to the Apache Software Foundation.
Doug Cutting
The Origin of the Name “Hadoop”
▪ The name Hadoop is not an acronym; it’s a made-up name. The
project’s creator, Doug Cutting, explains how the name came
about:
▪ The name my kid gave a stuffed yellow elephant.
▪ Short, relatively easy to spell and pronounce, meaningless, and
not used elsewhere: those are my naming criteria.
▪ Kids are good at generating such. Googol is a kid’s term.
▪ Subprojects and “contrib” modules in Hadoop also tend to have
names that are unrelated to their function, often with an elephant
or other animal theme (“Pig,” for example).
The Rise of Big Data
Big Data
▪ We live in the age of big data
▪ The data volumes we need to work with on a day-to-day basis have outgrown
the storage and processing capabilities of a single host.
▪ Big data brings with it two fundamental challenges:
▪ how to store and work with voluminous data sizes
▪ how to understand data and turn it into a competitive advantage
Examples:
▪ This flood of data is coming from many sources:
▪ The New York Stock Exchange generates about one terabyte of new trade data per
day.
▪ Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
▪ Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
▪ The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20
terabytes per month.
▪ The Large Hadron Collider near Geneva, Switzerland, will produce about 15
petabytes of data per year.
▪ So there’s a lot of data out there.
Evolution…
▪ Before Hadoop, big data required a lot of computing power, storage, and
parallelism.
▪ This meant that organizations had to spend a lot of money to build the infrastructure
needed to support big data analytics.
▪ Given the large price tag, only the largest Fortune 500 organizations could
afford such an infrastructure.
At Last, Hadoop!
▪ It is very difficult to store, process, and analyze these massive volumes of data.
▪ Hadoop is designed to answer the question:
▪ “How can big data be processed within reasonable cost and time?”
Architecture of Hadoop
Anatomy of Hadoop
Architecture of Hadoop Cluster
Anatomy of Hadoop – Basic Modules
▪ Hadoop Common: The common utilities that support the other Hadoop
modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
Hadoop Distributed File System (HDFS)
▪ When a dataset outgrows the storage capacity of a single physical machine, it
becomes necessary to partition it across a number of separate machines.
▪ Filesystems that manage the storage across a network of machines are called
distributed filesystems.
▪ Hadoop comes with a distributed filesystem called HDFS, which stands for
Hadoop Distributed Filesystem.
▪ The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware.
▪ HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
What Does HDFS Do?
▪ HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
▪ Provides scalable and reliable data storage, and it was designed to span large
clusters of commodity servers.
▪ HDFS has demonstrated production scalability of up to 200 PB of storage and a
single cluster of 4500 servers, supporting close to a billion files and blocks.
▪ HDFS is designed to store a very large amount of information (terabytes or
petabytes).
▪ It therefore spreads the data across a large number of machines.
What Does HDFS Do? (cont.)
▪ HDFS should store data reliably. If individual machines in the cluster
malfunction, data should still be available.
▪ HDFS should provide fast, scalable access to this information.
▪ It should be possible to serve a larger number of clients by simply adding more
machines to the cluster.
▪ HDFS should integrate well with Hadoop MapReduce, allowing data to be read
and computed upon locally when possible.
How Does HDFS Work?
▪ An HDFS cluster consists of a NameNode, which manages the cluster
metadata, and DataNodes that store the data.
▪ Files and directories are represented on the NameNode by inodes.
▪ Inodes record attributes like permissions, modification and access times, or disk
space quotas.
HDFS Inside Story
▪ The file content is split into large blocks (typically 128 megabytes),
and each block of the file is independently replicated at multiple
DataNodes.
▪ The blocks are stored on the local file system on the DataNodes.
▪ The NameNode actively monitors the number of replicas of a
block.
▪ When a replica of a block is lost due to a DataNode failure or disk
failure, the NameNode creates another replica of the block.
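The block splitting and re-replication described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not HDFS code: the function names, the round-robin placement, and the replication factor of 3 are assumptions made for the example (the slide only says "multiple DataNodes").

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size mentioned above
REPLICATION = 3                 # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` DataNodes (simple round-robin placement)."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def handle_datanode_failure(placement, failed, datanodes, replication=REPLICATION):
    """Mimic the NameNode: drop lost replicas, then re-replicate onto live nodes."""
    live = [d for d in datanodes if d != failed]
    for nodes in placement.values():
        nodes[:] = [n for n in nodes if n != failed]
        candidates = [d for d in live if d not in nodes]
        while len(nodes) < replication and candidates:
            nodes.append(candidates.pop(0))
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
    placement = place_replicas(len(blocks), nodes)
    print(handle_datanode_failure(placement, "dn2", nodes))
```

A 300 MB file occupies two full 128 MB blocks plus one 44 MB block; when dn2 fails, every block that lost a replica gets a new one on a surviving node.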
HDFS Inside Story (cont.)
▪ The NameNode sends instructions to the DataNodes by replying to heartbeats sent
by those DataNodes.
▪ The instructions include commands to:
▪ replicate blocks to other nodes.
▪ remove local block replicas.
▪ re-register and send an immediate block report.
▪ shut down the node.
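The point of this slide is that the NameNode never calls a DataNode directly; it piggybacks commands on its heartbeat replies. A minimal sketch of that pattern follows. The class, method names, and command tuples are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    # Commands waiting for each DataNode, keyed by DataNode id (hypothetical structure).
    pending: dict = field(default_factory=dict)

    def queue(self, datanode_id, command):
        """Remember a command until the DataNode next sends a heartbeat."""
        self.pending.setdefault(datanode_id, []).append(command)

    def heartbeat(self, datanode_id):
        """Reply to a heartbeat with every command queued for that DataNode."""
        return self.pending.pop(datanode_id, [])


if __name__ == "__main__":
    nn = NameNode()
    nn.queue("dn1", ("replicate", "blk_1", "dn3"))
    nn.queue("dn1", ("remove", "blk_2"))
    print(nn.heartbeat("dn1"))  # both commands are delivered in one reply
    print(nn.heartbeat("dn1"))  # nothing is left for the next heartbeat
```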
MapReduce
Called “Classic MapReduce” in Hadoop Ecosystem
History of MapReduce
▪ The actual origins of MapReduce are arguable, but the paper that most cite as
the one that started us down this journey is “MapReduce: Simplified Data
Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat in 2004.
▪ This paper described how Google split, processed, and aggregated their data set
of mind-boggling size.
▪ Shortly after the release of the paper, a free and open source software pioneer by
the name of Doug Cutting started working on a MapReduce implementation to
solve scalability in another project he was working on called Nutch,
an effort to build an open source search engine.
▪ Over time and with some investment by Yahoo!, Hadoop split out as its own
project and eventually became a top-level Apache Software Foundation project.
What Is It?
▪ MapReduce is a programming model and an associated implementation for processing
and generating large data sets.
▪ It was originally developed by Google and built on well-known principles in
parallel and distributed processing dating back several decades.
▪ MapReduce has since enjoyed widespread adoption via an open-source
implementation called Hadoop, whose development was led by Yahoo.
What Is It? (cont.)
▪ Hadoop MapReduce is a software framework for easily writing distributed
applications.
▪ These applications process vast amounts of data (multi-terabyte data sets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.
How Does MapReduce Work?
▪ A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
▪ The framework sorts the outputs of the maps, which are then input to the reduce
tasks.
▪ Typically both the input and the output of the job are stored in a file-system.
▪ The framework takes care of scheduling tasks, monitoring them, and re-executing
failed tasks.
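The map, sort, and reduce flow above can be imitated on a single machine with a word count, the usual introductory example. This is a plain-Python sketch of the programming model, not the Hadoop API; the function names are made up for the illustration, and the chunks are processed sequentially where a cluster would run the map tasks in parallel.

```python
from itertools import groupby
from operator import itemgetter


def map_words(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce task: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(chunks):
    # Map phase: chunks are independent, so a cluster could process them in parallel.
    mapped = [pair for chunk in chunks for pair in map_words(chunk)]
    # Sort phase: the framework sorts map output so equal keys sit next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: each group of equal keys is handed to one reduce call.
    return dict(
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )


if __name__ == "__main__":
    print(run_job(["big data", "Big Hadoop data"]))
```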
How Does MapReduce Work? (cont.)
▪ Typically the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System are running on the
same set of nodes.
▪ This configuration allows the framework to effectively schedule tasks on the
nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
How Does MapReduce Work? (cont.)
▪ There are two types of nodes that control the job execution process: a jobtracker
and a number of tasktrackers.
▪ The jobtracker coordinates all the jobs run on the system by scheduling tasks to
run on tasktrackers.
▪ Tasktrackers run tasks and send progress reports to the jobtracker, which keeps
a record of the overall progress of each job.
▪ If a task fails, the jobtracker can reschedule it on a different tasktracker.
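The rescheduling behaviour in the last bullet can be sketched as a small loop. This is a simplified, hypothetical model: `schedule` and `run_task` are invented names, tasks run one after another, and a failure is signalled by a `RuntimeError`.

```python
def schedule(tasks, tasktrackers, run_task):
    """Run each task, moving to a different tasktracker whenever one fails."""
    progress = {}  # the jobtracker's record: task -> tasktracker that completed it
    for task in tasks:
        for tracker in tasktrackers:
            try:
                run_task(task, tracker)
            except RuntimeError:
                continue  # the task failed here: reschedule it on the next tasktracker
            progress[task] = tracker
            break
    return progress


if __name__ == "__main__":
    def flaky(task, tracker):
        # Pretend every task fails on tt1 (for example, because of a bad disk).
        if tracker == "tt1":
            raise RuntimeError(f"{task} failed on {tracker}")

    print(schedule(["map-0", "map-1"], ["tt1", "tt2"], flaky))
```

A task that fails on every tasktracker simply never appears in the progress record.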
How Does MapReduce Work? (cont.)
▪ Hadoop does its best to run the map task on a node where the input data resides
in HDFS.
▪ This is called the data locality optimization since it doesn’t use valuable cluster
bandwidth.
▪ Map tasks write their output to the local disk, not to HDFS. Why is this?
▪ Map output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete the map output can be thrown away.
▪ Storing it in HDFS, with replication, would be overkill!
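Data locality boils down to a preference rule when choosing where a map task runs. The sketch below assumes the scheduler knows which nodes hold a replica of the split's block and which nodes currently have capacity; the function name and its return shape are hypothetical.

```python
def pick_node(block_locations, free_nodes):
    """Prefer a free node that already stores the split's block; else any free node.

    Returns (node, is_local). is_local is False when the input would have to be
    copied over the network, or when no node is free at all (node is None).
    """
    for node in free_nodes:
        if node in block_locations:
            return node, True
    return (free_nodes[0], False) if free_nodes else (None, False)


if __name__ == "__main__":
    print(pick_node({"dn2", "dn3"}, ["dn1", "dn3"]))  # a data-local choice exists
    print(pick_node({"dn2"}, ["dn1"]))                # falls back to a remote node
```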
How Does MapReduce Work? (cont.)
▪ Hadoop divides the input to a
MapReduce job into fixed-size pieces
called input splits, or just splits.
▪ Hadoop creates one map task for each
split, which runs the user defined
map function for each record in the
split.
▪ Having many splits means the time
taken to process each split is small
compared to the time to process the
whole input.
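The arithmetic behind fixed-size splits is simple enough to show directly. The helper below is a hypothetical illustration that returns the byte range of each split; since Hadoop creates one map task per split, the length of the list is also the number of map tasks.

```python
def input_splits(file_size, split_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


if __name__ == "__main__":
    MB = 1024 * 1024
    splits = input_splits(300 * MB, 128 * MB)
    print(len(splits), "map tasks")
    for offset, length in splits:
        print(f"offset={offset // MB} MB, length={length // MB} MB")
```

Only the last split can be shorter than the chosen split size.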
YARN (MapReduce 2)
YARN?
▪ For very large clusters in the region of 4000 nodes and higher, the MapReduce
system described previously begins to hit scalability bottlenecks.
▪ In 2010 a group at Yahoo! began to design the next generation of MapReduce.
▪ The result was YARN, short for Yet Another Resource Negotiator.
YARN
▪ YARN meets the scalability shortcomings of “classic” MapReduce by splitting
the responsibilities of the jobtracker into separate entities.
▪ The jobtracker takes care of both:
▪ job scheduling (matching tasks with tasktrackers)
▪ task progress monitoring (keeping track of tasks, restarting failed or slow tasks,
and doing task bookkeeping such as maintaining counter totals)
YARN
▪ YARN separates these two roles into two independent daemons:
▪ A resource manager to manage the use of resources across the cluster.
▪ An application master to manage the lifecycle of applications running on the cluster.
▪ The idea is that an application master negotiates with the resource manager for cluster
resources, described in terms of a number of containers each with a certain memory limit,
and then runs application-specific processes in those containers.
▪ The containers are overseen by node managers running on cluster nodes, which ensure
that the application does not use more resources than it has been allocated.
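The negotiation between an application master and the resource manager can be pictured as a request for N containers of a given memory size. The toy model below is an assumption-laden sketch, not the YARN API: the class, the `allocate` method, and the first-fit policy are all invented for illustration.

```python
class ResourceManager:
    """Toy resource manager that tracks free memory per node, in MB."""

    def __init__(self, node_memory_mb):
        # Hypothetical cluster description: node name -> free memory in MB.
        self.free = dict(node_memory_mb)

    def allocate(self, num_containers, memory_mb):
        """Grant up to num_containers containers of memory_mb each (first fit)."""
        granted = []
        for node in self.free:
            while len(granted) < num_containers and self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                granted.append((node, memory_mb))
        return granted


if __name__ == "__main__":
    rm = ResourceManager({"node1": 4096, "node2": 2048})
    # An application master asks for three containers of 2 GB each.
    print(rm.allocate(3, 2048))
    # The cluster is now full, so a further request is granted nothing.
    print(rm.allocate(1, 1024))
```

In the real system the node managers would then enforce each container's memory limit; this sketch only models the bookkeeping on the resource manager's side.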
© Hortonworks Inc. 2013 - Confidential
Hadoop 1 Architecture
▪ JobTracker: manages cluster resources and job scheduling
▪ TaskTracker: per-node agent that manages tasks
Apache Hadoop & YARN
▪ Apache Hadoop
–De facto Big Data open source platform
–Running for about 5 years in production at hundreds of companies like Yahoo, eBay and
Facebook
▪ Hadoop 2
–Significant improvements in HDFS distributed storage layer. High Availability, NFS
–YARN – next generation compute framework for Hadoop designed from the ground up
based on experience gained from Hadoop 1
–YARN running in production at Yahoo for about a year
–YARN awarded Best Paper at SoCC 2013
Hadoop 1 Limitations
▪ Lacks support for alternate paradigms and services
–Forces everything to look like MapReduce
–Iterative applications in MapReduce are 10x slower
▪ Scalability
–Max cluster size ~5,000 nodes
–Max concurrent tasks ~40,000
▪ Availability
–Failure kills queued and running jobs
▪ Hard partition of resources into map and reduce slots
–Non-optimal resource utilization
Our Vision: Hadoop as Next-Gen Platform
▪ HADOOP 1.0: Single Use System (Batch Apps)
–HDFS (redundant, reliable storage)
–MapReduce (cluster resource management & data processing)
▪ HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
–HDFS2 (redundant, highly-available & reliable storage)
–YARN (cluster resource management)
–MapReduce (data processing) and Others
Hadoop 2 - YARN Architecture
▪ ResourceManager (RM): central agent that manages and allocates cluster resources
▪ NodeManager (NM): per-node agent that manages and enforces node resource allocations
▪ ApplicationMaster (AM): per-application agent that manages application lifecycle and task scheduling
[Figure: YARN architecture diagram showing a Client, the Resource Manager, Node Managers, a Container, and an App Mstr, with flows labeled Job Submission, MapReduce Status, Node Status, and Resource Request]
YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
▪ BATCH (MapReduce)
▪ INTERACTIVE (Tez)
▪ STREAMING (Storm, S4, …)
▪ GRAPH (Giraph)
▪ IN-MEMORY (Spark)
▪ HPC MPI (OpenMPI)
▪ ONLINE (HBase)
▪ OTHER (Search, Weave, …)
▪ All of these run on YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
What can we do with Hadoop?
▪ http://hortonworks.com/industry/
▪ Hands On Exercise
References:
1. www.hortonworks.com
2. www.hadoop.apache.org
3. www.cloudera.com
4. Hadoop: The Definitive Guide, Third Edition (book)
5. A Comprehensive Tutorial on Hadoop Cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 

Recently uploaded (20)

E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 

Hadoop fundamentals

  • 1. IN THE NAME OF ALLAH THE MOST GRACIOUS THE MOST MERCIFUL
  • 3. Agenda ▪ Introduction ▪ What is Hadoop? ▪ Why we need Hadoop? ▪ Architecture of Hadoop ▪ What can we do with Hadoop? ▪ Hands On Exercise NUST School of Electrical Engineering & Computer Science 4/8/2016 3
  • 6. What is Hadoop: Definition and Brief History of Hadoop
  • 7. What Is Apache Hadoop?
      ▪ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
      ▪ It is an Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
      ▪ It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
  • 8. History
      ▪ Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.
      ▪ Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
  • 9. Evolution of Hadoop
  • 10. Evolution (cont.)
      ▪ Apache Hadoop was inspired by Google's MapReduce and Google File System papers and cultivated at Yahoo!
      ▪ It started as a large-scale distributed batch processing infrastructure.
      ▪ It was designed to meet the need for an affordable, scalable and flexible data structure that could be used for working with very large data sets.
  • 11. Hadoop as Inspired by Google (timeline figure: 2003, 2004, 2006)
  • 12. Evolution (cont.)
  • 13. Another View
  • 14. Hadoop's Developers
      ▪ 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.
      ▪ The project was funded by Yahoo.
      ▪ 2006: Yahoo gave the project to the Apache Software Foundation.
      (Photo: Doug Cutting)
  • 15. The Origin of the Name “Hadoop”
      ▪ The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about:
      ▪ The name my kid gave a stuffed yellow elephant.
      ▪ Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.
      ▪ Kids are good at generating such. Googol is a kid's term.
      ▪ Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example).
  • 16. Agenda ▪ Introduction ▪ What is Hadoop? ▪ Why we need Hadoop? ▪ Architecture of Hadoop ▪ What can we do with Hadoop? ▪ Hands On Exercise
  • 17. The Rise of Big Data
  • 18. Big Data
      ▪ We live in the age of big data.
      ▪ The data volumes we need to work with on a day-to-day basis have outgrown the storage and processing capabilities of a single host.
      ▪ Big data brings with it two fundamental challenges:
      ▪ how to store and work with voluminous data sizes
      ▪ how to understand data and turn it into a competitive advantage
  • 19. Examples
      ▪ This flood of data is coming from many sources:
      ▪ The New York Stock Exchange generates about one terabyte of new trade data per day.
      ▪ Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
      ▪ Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
      ▪ The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
      ▪ The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
      ▪ So there's a lot of data out there.
  • 20. Evolution
      ▪ Before Hadoop, big data required a lot of computing power, storage, and parallelism.
      ▪ This meant that organizations had to spend a lot of money to build the infrastructure needed to support big data analytics.
      ▪ Given the large price tag, only the largest Fortune 500 organizations could afford such an infrastructure.
  • 21. At Last, Hadoop!
      ▪ It is very difficult to store, compute on, and analyze these massive volumes of data.
      ▪ Hadoop is designed to answer the question:
      ▪ “How can big data be processed within reasonable cost and time?”
  • 22. Agenda ▪ Introduction ▪ What is Hadoop? ▪ Why we need Hadoop? ▪ Architecture/Anatomy of Hadoop ▪ What can we do with Hadoop? ▪ Hands On Exercise
  • 23. Architecture of Hadoop: Anatomy of Hadoop
  • 25. Architecture of Hadoop Cluster
  • 26. Anatomy of Hadoop: Basic Modules
      ▪ Hadoop Common: The common utilities that support the other Hadoop modules.
      ▪ Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
      ▪ Hadoop YARN: A framework for job scheduling and cluster resource management.
      ▪ Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • 27. Hadoop Distributed File System (HDFS)
      ▪ When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
      ▪ Filesystems that manage the storage across a network of machines are called distributed filesystems.
      ▪ Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
      ▪ HDFS is designed to run on commodity hardware.
      ▪ HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
  • 28. What Does HDFS Do?
      ▪ HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
      ▪ It provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
      ▪ HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
      ▪ HDFS is designed to store a very large amount of information (terabytes or petabytes).
      ▪ This requires spreading the data across a large number of machines.
  • 29. What Does HDFS Do? (cont.)
      ▪ HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
      ▪ HDFS should provide fast, scalable access to this information.
      ▪ It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
      ▪ HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
  • 30. How Does HDFS Work?
      ▪ An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data.
      ▪ Files and directories are represented on the NameNode by inodes.
      ▪ Inodes record attributes like permissions, modification and access times, and disk space quotas.
  • 31. HDFS Inside Story
      ▪ The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes.
      ▪ The blocks are stored on the local file system on the DataNodes.
      ▪ The NameNode actively monitors the number of replicas of a block.
      ▪ When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
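The block splitting and replication described on this slide can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the replication factor of 3, the round-robin placement, and the names `split_into_blocks` and `place_replicas` are assumptions made for the example, and a real NameNode uses a more elaborate placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size named on the slide
REPLICATION = 3                 # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Hypothetical placement rule; it needs replication <= len(datanodes).
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement)
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block, and each block ends up on three different DataNodes, so losing any single node leaves two replicas of every block.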
  • 32. HDFS Inside Story (cont.)
      ▪ The NameNode sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes.
      ▪ The instructions include commands to:
      ▪ replicate blocks to other nodes.
      ▪ remove local block replicas.
      ▪ re-register and send an immediate block report.
      ▪ shut down the node.
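The heartbeat-reply mechanism above can be illustrated with a toy model. The class `ToyNameNode`, its methods, and the tuple format of the commands are hypothetical and exist only for this sketch; only the "replicate" command from the slide is modelled, and the real protocol is considerably richer.

```python
class ToyNameNode:
    """Toy model: the NameNode only issues commands as replies to heartbeats."""

    def __init__(self, replication=2):
        self.replication = replication  # assumed target replica count
        self.block_locations = {}       # block id -> set of DataNode names
        self.live_nodes = set()

    def register(self, node, blocks):
        self.live_nodes.add(node)
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(node)

    def mark_dead(self, node):
        self.live_nodes.discard(node)
        for holders in self.block_locations.values():
            holders.discard(node)

    def heartbeat(self, node):
        """Reply to a heartbeat with a list of (command, block, target) tuples."""
        commands = []
        for block, holders in sorted(self.block_locations.items()):
            if node in holders and len(holders) < self.replication:
                candidates = sorted(self.live_nodes - holders)
                if candidates:
                    target = candidates[0]
                    commands.append(("replicate", block, target))
                    holders.add(target)  # optimistic bookkeeping for the sketch
        return commands


nn = ToyNameNode(replication=2)
nn.register("dn1", ["b1", "b2"])
nn.register("dn2", ["b1"])
nn.register("dn3", ["b2"])
nn.mark_dead("dn2")               # block b1 is now under-replicated
commands = nn.heartbeat("dn1")    # dn1 still holds b1, so it is asked to copy it
print(commands)
```

After `dn2` is lost, the next heartbeat from `dn1` is answered with an instruction to copy block `b1` to `dn3`, which restores the assumed replication factor of 2.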
  • 34. MapReduce: Called “Classic MapReduce” in the Hadoop Ecosystem
  • 35. History of MapReduce
      ▪ The actual origins of MapReduce are arguable, but the paper that most cite as the one that started us down this journey is “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat in 2004.
      ▪ This paper described how Google split, processed, and aggregated their data set of mind-boggling size.
      ▪ Shortly after the release of the paper, a free and open source software pioneer by the name of Doug Cutting started working on a MapReduce implementation to solve scalability in another project he was working on called Nutch, an effort to build an open source search engine.
      ▪ Over time and with some investment by Yahoo!, Hadoop split out as its own project and eventually became a top-level Apache Software Foundation project.
  • 36. What Is It?
      ▪ MapReduce is a programming model and an associated implementation for processing and generating large data sets.
      ▪ It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades.
      ▪ MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo.
  • 37. What Is It? (cont.)
      ▪ Hadoop MapReduce is a software framework for easily writing distributed applications.
      ▪ These applications process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
  • 38. How Does MapReduce Work?
      ▪ A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
      ▪ The framework sorts the outputs of the maps, which are then input to the reduce tasks.
      ▪ Typically both the input and the output of the job are stored in a file system.
      ▪ The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
  • 39. How Does MapReduce Work? (cont.)
      ▪ Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes.
      ▪ This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
  • 40. How Does MapReduce Work? (cont.)
      ▪ There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
      ▪ The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
      ▪ Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
      ▪ If a task fails, the jobtracker can reschedule it on a different tasktracker.
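The rescheduling behaviour of the jobtracker can be sketched as a retry loop. Everything here is a simplification for illustration: `run_job`, `flaky`, the tracker names, and the limit of three attempts are assumptions, and a real jobtracker schedules tasks concurrently and learns of failures through progress reports rather than exceptions.

```python
def run_job(tasks, trackers, run_task, max_attempts=3):
    """Run each task on a tracker; after a failure, retry on a different one."""
    progress = {}  # task -> (tracker that completed it, result)
    for task in tasks:
        for attempt in range(max_attempts):
            tracker = trackers[attempt % len(trackers)]  # move to the next tracker
            try:
                progress[task] = (tracker, run_task(task, tracker))
                break
            except RuntimeError:
                continue  # the task failed on this tracker; reschedule it
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return progress


def flaky(task, tracker):
    """Hypothetical task runner: map-1 always fails on tasktracker tt1."""
    if tracker == "tt1" and task == "map-1":
        raise RuntimeError("tasktracker lost")
    return "done"


progress = run_job(["map-0", "map-1"], ["tt1", "tt2"], flaky)
print(progress)
```

In this run `map-0` completes on `tt1`, while `map-1` fails there and is completed on `tt2` at the second attempt.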
  • 41. How Does MapReduce Work? (cont.)
      ▪ Hadoop does its best to run the map task on a node where the input data resides in HDFS.
      ▪ This is called the data locality optimization, since it doesn't use valuable cluster bandwidth.
      ▪ Map tasks write their output to the local disk, not to HDFS. Why is this?
      ▪ Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away.
      ▪ Storing it in HDFS, with replication, would be overkill!
  • 42. How Does MapReduce Work? (cont.)
      ▪ Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
      ▪ Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
      ▪ Having many splits means the time taken to process each split is small compared to the time to process the whole input.
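A minimal sketch of input splits and per-split map tasks follows. It assumes that splits are measured in records rather than bytes and that the map function emits word counts; the names `make_splits`, `map_task`, and `word_map` are hypothetical and not part of the Hadoop API.

```python
def make_splits(records, split_size):
    """Cut the input into fixed-size splits (here: a fixed number of records)."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split, map_fn):
    """One map task: run the user-defined map function on every record of a split."""
    output = []
    for record in split:
        output.extend(map_fn(record))
    return output


def word_map(line):
    """User-defined map function: emit (word, 1) for each word in the line."""
    return [(word.lower(), 1) for word in line.split()]


lines = ["Hadoop stores data", "Hadoop processes data", "YARN schedules work"]
splits = make_splits(lines, split_size=2)
map_outputs = [map_task(split, word_map) for split in splits]  # one task per split
print(map_outputs)
```

Because each map task touches only its own split, the tasks are independent and could run on different nodes at the same time.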
  • 44. YARN (MapReduce 2)
  • 45. YARN?
      ▪ For very large clusters in the region of 4000 nodes and higher, the MapReduce system described previously begins to hit scalability bottlenecks.
      ▪ In 2010 a group at Yahoo! began to design the next generation of MapReduce.
      ▪ The result was YARN, short for Yet Another Resource Negotiator.
  • 46. YARN
      ▪ YARN addresses the scalability shortcomings of “classic” MapReduce by splitting the responsibilities of the jobtracker into separate entities.
      ▪ The jobtracker takes care of both:
      ▪ job scheduling (matching tasks with tasktrackers)
      ▪ task progress monitoring (keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter totals)
  • 47. YARN
      ▪ YARN separates these two roles into two independent daemons:
      ▪ A resource manager to manage the use of resources across the cluster.
      ▪ An application master to manage the lifecycle of applications running on the cluster.
      ▪ The idea is that an application master negotiates with the resource manager for cluster resources, described in terms of a number of containers each with a certain memory limit, and then runs application-specific processes in those containers.
      ▪ The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
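The negotiation for containers can be illustrated with a toy resource manager. This is a sketch under strong assumptions: memory is the only resource, allocation is first-fit over node names, and `ToyResourceManager`, `allocate`, and `within_limit` are hypothetical names rather than YARN APIs.

```python
class ToyResourceManager:
    """Toy model of a resource manager that hands out memory-limited containers."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # node name -> free memory in MB

    def allocate(self, count, memory_mb):
        """Grant up to `count` containers of `memory_mb` each (first-fit)."""
        granted = []
        for node in sorted(self.free):
            while len(granted) < count and self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                granted.append({"node": node, "memory_mb": memory_mb})
        return granted


def within_limit(container, used_mb):
    """Node-manager style check: is the process inside its allocation?"""
    return used_mb <= container["memory_mb"]


rm = ToyResourceManager({"node1": 2048, "node2": 1024})
# An application master asks for three containers of 1024 MB each.
containers = rm.allocate(count=3, memory_mb=1024)
print(containers)
print(within_limit(containers[0], used_mb=900))
```

Two containers land on `node1` and one on `node2`, after which the toy cluster has no free memory left. The `within_limit` check stands in for the node manager's job of making sure a process stays within what it was allocated.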
  • 48. Hadoop 1 Architecture (© Hortonworks Inc. 2013 - Confidential)
      ▪ JobTracker: manages cluster resources and job scheduling
      ▪ TaskTracker: per-node agent; manages tasks
  • 49. Apache Hadoop & YARN
      ▪ Apache Hadoop
      ▪ De facto Big Data open source platform
      ▪ Running for about 5 years in production at hundreds of companies like Yahoo, Ebay and Facebook
      ▪ Hadoop 2
      ▪ Significant improvements in the HDFS distributed storage layer: High Availability, NFS
      ▪ YARN: next-generation compute framework for Hadoop, designed from the ground up based on experience gained from Hadoop 1
      ▪ YARN running in production at Yahoo for about a year
      ▪ YARN awarded Best Paper at SOCC 2013
  • 50. Hadoop 1 Limitations
      ▪ Lacks support for alternate paradigms and services: forces everything to look like MapReduce; iterative applications in MapReduce are 10x slower
      ▪ Scalability: max cluster size ~5,000 nodes; max concurrent tasks ~40,000
      ▪ Availability: failure kills queued and running jobs
      ▪ Hard partition of resources into map and reduce slots: non-optimal resource utilization
  • 51. Our Vision: Hadoop as Next-Gen Platform
      ▪ HADOOP 1.0: single use system (batch apps)
      ▪ HDFS (redundant, reliable storage)
      ▪ MapReduce (cluster resource management & data processing)
      ▪ HADOOP 2.0: multi purpose platform (batch, interactive, online, streaming, …)
      ▪ HDFS2 (redundant, highly-available & reliable storage)
      ▪ YARN (cluster resource management)
      ▪ MapReduce (data processing) and others
  • 52. Hadoop 2 - YARN Architecture
      ▪ ResourceManager (RM): central agent; manages and allocates cluster resources
      ▪ NodeManager (NM): per-node agent; manages and enforces node resource allocations
      ▪ ApplicationMaster (AM): per-application; manages application lifecycle and task scheduling
      (Diagram labels: Client, Resource Manager, Node Manager, App Mstr, Container, Job Submission, MapReduce Status, Node Status, Resource Request)
  • 53. YARN: Taking Hadoop Beyond Batch
      ▪ Applications run natively in Hadoop, on YARN (cluster resource management) over HDFS2 (redundant, reliable storage):
      ▪ BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search) (Weave…)
      ▪ Store ALL DATA in one place…
      ▪ Interact with that data in MULTIPLE WAYS with predictable performance and quality of service
  • 54. What can we do with Hadoop? ▪ http://hortonworks.com/industry/
  • 55. Hands On Exercise
  • 56. References:
      1. www.hortonworks.com
      2. www.hadoop.apache.org
      3. www.cloudera.com
      4. Hadoop: The Definitive Guide, Third Edition (book)
      5. A comprehensive tutorial on Hadoop cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Editor's Notes

  1. An inode is a data structure on a filesystem on Linux and other Unix-like operating systems that stores all the information about a file except its name and its actual data. A data structure is a way of storing data so that it can be used efficiently.
  2. The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair. The combine phase is done by combiners. A combiner should combine key/value pairs with the same key. Each combiner may run zero, one, or multiple times. The shuffle and sort phase is done by the framework. Data from all mappers is grouped by key, split among reducers, and sorted by key. Each reducer obtains all values associated with the same key. The programmer may supply a custom compare function for sorting and a Partitioner for the data split. The Partitioner decides which reducer will get a particular key/value pair. Each reducer obtains key/[value list] pairs sorted by key. The value list contains all values with the same key produced by the mappers. Each reducer emits zero, one, or multiple output key/value pairs for each input key/[value list] pair.
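The phases in note 2 can be traced end to end with a small single-process simulation. This is a sketch rather than Hadoop code: word counting is an assumed example job, the byte-sum partitioner is a deterministic stand-in for a hash partitioner, and `mapper`, `combiner`, `partitioner`, `reducer`, and `run` are names chosen for the illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit zero or more (key, value) pairs per input record."""
    for word in line.split():
        yield word.lower(), 1


def combiner(pairs):
    """Combine phase: pre-aggregate one mapper's output by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partitioner(key, num_reducers):
    """Decide which reducer receives a key (assumed byte-sum rule)."""
    return sum(key.encode()) % num_reducers


def reducer(key, values):
    """Reduce phase: receives one key together with all of its values."""
    yield key, sum(values)


def run(mapper_inputs, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for lines in mapper_inputs:                  # one mapper per input chunk
        mapped = [pair for line in lines for pair in mapper(line)]
        for key, value in combiner(mapped):      # combine, then shuffle by key
            partitions[partitioner(key, num_reducers)][key].append(value)
    results = {}
    for partition in partitions:                 # one reducer per partition
        for key in sorted(partition):            # keys arrive sorted
            for out_key, out_value in reducer(key, partition[key]):
                results[out_key] = out_value
    return results


counts = run([["a b a"], ["b c"]])
print(counts)
```

The combiner reduces the first mapper's three pairs to two before the shuffle, and because the partitioner sends every occurrence of a key to the same reducer, each reducer sees the complete value list for its keys.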