BIG DATA & ANALYTICS
Notes prepared by: Dr. S. Pradeep Kumar Kenny
Unit IV
Introduction to Big Data programming with Hadoop, History of Hadoop, The ecosystem and
stack, Components of Hadoop, Hadoop Distributed File System (HDFS), Design of HDFS, Java
interfaces to HDFS, Architecture overview, Development environment, Hadoop distributions
and basic commands, Eclipse development
What is HDFS?
HDFS is a distributed file system for storing very large data files, running on clusters of
commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. Hadoop
comes bundled with HDFS (the Hadoop Distributed File System).
When data exceeds the capacity of storage on a single physical machine, it becomes essential
to divide it across a number of separate machines. A file system that manages storage-specific
operations across a network of machines is called a distributed file system. HDFS is one such
file system.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the
Google File System paper published by Google.
Year Event
2002 Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is
an open source web crawler software project.
2003 Google released the paper, Google File System (GFS).
2004 Google released a white paper on Map Reduce.
2006
o Hadoop introduced.
o Hadoop 0.1.0 released.
o Yahoo deploys 300 machines and within this year reaches 600 machines.
2007
o Yahoo runs 2 clusters of 1000 machines.
o Hadoop includes HBase.
2008
o YARN JIRA opened
o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
o Yahoo clusters loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.
2009
o Yahoo runs 17 clusters of 24,000 machines.
o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subprojects.
2010
o Hadoop added the support for Kerberos.
o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.
2011
o Apache Zookeeper released.
o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012 Apache Hadoop 1.0 version released.
2013 Apache Hadoop 2.2 version released.
2014 Apache Hadoop 2.6 version released.
2015 Apache Hadoop 2.7 version released.
2017 Apache Hadoop 3.0 version released.
2018 Apache Hadoop 3.1 version released.
The Hadoop Ecosystem
The Hadoop ecosystem is a platform or suite that provides various services to solve big data
problems. It includes Apache projects and various commercial tools and solutions. The four
major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a
framework that enables the processing of large data sets which reside in the form of clusters.
Being a framework, Hadoop is made up of several modules that are supported by a large
ecosystem of technologies.
Components that collectively form a Hadoop ecosystem:
1. HDFS: Hadoop Distributed File System
2. YARN: Yet Another Resource Negotiator
3. MapReduce: Programming-based Data Processing
4. Spark: In-Memory data processing
5. PIG, HIVE: Query-based processing of data services
6. HBase: NoSQL Database
7. Mahout, Spark MLLib: Machine Learning algorithm libraries
8. Zookeeper: Managing cluster
9. Oozie: Job Scheduling
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
The Apache Hadoop project develops open-source software for reliable, scalable,
distributed computing.
HDFS: Hadoop Distributed File System
• HDFS is a distributed, scalable, and portable filesystem written in Java for the
Hadoop framework.
• HDFS creates multiple replicas of data blocks and distributes them on compute
nodes throughout a cluster to enable reliable, extremely rapid computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
• HDFS consists of two core components:
o Name Node
o Data Node
Name Node:
• Name Node, a master server, manages the file system namespace and regulates
access to files by clients.
• Maintains and manages the blocks which are present on the data nodes.
• Name Node is the prime node that contains metadata.
• Meta-data in Memory
– The entire metadata is in the main memory
• Types of Metadata
– List of files
– List of Blocks for each file
– List of Data Nodes for each block
– File attributes, example: creation time, replication
• A Transaction Log
– Records file creations, file deletions, etc.
Data Node:
• Data Nodes, one per node in the cluster, manage storage attached to the nodes
that they run on.
• Data Nodes store the actual data. These data nodes are commodity
hardware in the distributed environment.
• A Block Server
o Stores data in the local file system
o Stores meta-data of a block
o Serves data and meta-data to Clients
• Block Report
o Periodically sends a report of all existing blocks to the Name Node
• Facilitates Pipelining of Data
o Forwards data to other specified Data Nodes
HBase: NoSQL Database
Apache HBase is an open-source, distributed, versioned, fault-tolerant, scalable, column-
oriented store modeled after Google’s Bigtable, with random real-time read/write access
to data.
It’s a NoSQL database that runs on top of Hadoop as a distributed and scalable big data
store.
It combines the scalability of Hadoop, by running on the Hadoop Distributed File System
(HDFS), with real-time data access as a key/value store and the deep analytic capabilities of
MapReduce.
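As a brief, hedged illustration of this random real-time read/write access, the sketch below uses the HBase Java client API to write and then read a single cell. The table name "users", the column family "info", and the localhost ZooKeeper quorum are assumptions made for the example and are not part of these notes; on a real cluster the configuration would come from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Configuration is normally read from hbase-site.xml on the classpath;
        // the quorum setting below is only an illustrative override.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random real-time read of the same row.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}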
MapReduce
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key/value pairs). The reduce task takes the output
from a map as input and combines those data tuples into a smaller set of tuples. As the
sequence of the name MapReduce implies, the reduce task is always performed after the map
job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where
the data resides.
• A MapReduce program executes in three stages, namely the map stage, shuffle stage,
and reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of a file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper processes
the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing,
it produces a new set of output, which will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks, which
reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
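To make the map and reduce stages concrete, here is the classic word-count example written against Hadoop's org.apache.hadoop.mapreduce Java API. It is a minimal sketch rather than code from these notes; the input and output HDFS paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words and emitted as (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: the counts for each word (grouped by the shuffle) are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}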
YARN: Yet Another Resource Negotiator
• Apache YARN is Hadoop’s cluster resource management system.
• YARN was introduced in Hadoop 2.0 to improve resource utilization beyond what
the original MapReduce framework offered.
• It handles the cluster of nodes and acts as Hadoop’s resource management
unit. YARN allocates CPU, memory, and other resources to different
applications.
YARN has two components:
Resource Manager
• Global resource scheduler
• Runs on the master node
• Manages other Nodes
o Tracks heartbeats from Node Manager
• Manages Containers
o Handles AM requests for resources
o De-allocates containers when they expire, or the application
completes
• Manages Application Master
o Creates a container from AM and tracks heartbeats
• Manages Security
Node Manager
• Runs on slave node
• Communicates with RM
o Registers and provides info on Node resources
o Sends heartbeats and container status
• Manages processes and container status
o Launches AM on request from RM
o Launches application process on request from AM
o Monitors resource usage by containers.
• Provides logging services to applications
o Aggregates logs for an application and saves them to HDFS
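As a hedged illustration of the Resource Manager's role, the Java sketch below uses the YarnClient API to ask the ResourceManager for a report of the running NodeManagers and their capabilities. It assumes a yarn-site.xml on the classpath that points at a real ResourceManager; the class name is mine, not part of these notes.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml, which points at the ResourceManager.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The ResourceManager reports every NodeManager that is sending heartbeats.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.printf("%s: %d containers, capability %s%n",
                    node.getNodeId(), node.getNumContainers(), node.getCapability());
        }
        yarnClient.stop();
    }
}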
PIG
Pig Hadoop was developed by Yahoo to perform a wide range of data administration
operations. It is a query-based tool built around the Pig Latin language and used with Hadoop.
• It is a platform for structuring the data flow, and processing and analyzing
huge data sets.
• Pig does the work of executing commands and in the background, all the
activities of MapReduce are taken care of. After the processing, PIG stores the
result in HDFS.
• Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
Features of PIG
• Provides support for data types – long, float, char array, schemas, and
functions
• Is extensible and supports User Defined Functions
• Provides common operations like JOIN, GROUP, FILTER, SORT
HIVE
Most data warehouse applications are implemented on relational databases that use SQL as
the query language. Hive is a data warehousing package built on top of
Hadoop that lowers the barrier to moving these applications to Hadoop.
• Hive processes both structured and semi-structured data.
• Internally, a Hive query is executed as a series of automatically generated
MapReduce jobs.
• Structured data is used for data analysis.
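As a hedged illustration of how an application can submit a Hive query, the sketch below connects to HiveServer2 over JDBC and runs a simple aggregate, which Hive compiles into MapReduce jobs behind the scenes. The host and port jdbc:hive2://localhost:10000, the database default, and the table sales are assumptions made for the example, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 is assumed to listen on its default port 10000;
        // adjust host, database, and table name for a real cluster.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // This HiveQL query is executed by Hive as generated MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}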
Mahout
• Mahout provides an environment for creating machine learning applications
that are scalable.
• Mahout adds machine-learning capability to a system or application.
• MLlib is Spark’s open-source distributed machine learning library.
• MLlib provides efficient functionality for a wide range of learning settings and
includes several underlying statistical, optimization, and linear algebra
primitives.
• It allows algorithms to be invoked as needed with the help of its own libraries.
Avro
Avro is an open source project that provides data serialization and data exchange services
for Apache Hadoop. These services can be used together or independently. Avro facilitates
the exchange of big data between programs written in any language. With the serialization
service, programs can efficiently serialize data into files or into messages. The data storage
is compact and efficient. Avro stores both the data definition and the data together in one
message or file.
Avro stores the data definition in JSON format making it easy to read and interpret; the data
itself is stored in binary format making it compact and efficient. Avro files include markers
that can be used to split large data sets into subsets suitable for Apache
MapReduce processing. Some data exchange services use a code generator to interpret the
data definition and produce code to access the data. Avro doesn't require this step, making
it ideal for scripting languages.
A key feature of Avro is robust support for data schemas that change over time — often called
schema evolution. Avro handles schema changes like missing fields, added fields and
changed fields; as a result, old programs can read new data and new programs can read old
data. Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can
be passed from programs written in different languages, even from a compiled language like
C to a scripting language like Apache Pig.
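As a minimal, hedged sketch of these serialization services, the Java example below defines a small record schema in JSON, writes one record to an Avro data file, and reads it back; the schema stored in the file is what makes the data self-describing. The "User" schema and the users.avro file name are assumptions made for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // The schema (data definition) is plain JSON.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize: the schema is stored in the file together with the binary data.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: the reader recovers the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " is " + rec.get("age"));
            }
        }
    }
}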
Sqoop
Sqoop is a command-line interface application for transferring data between relational
databases and Hadoop.
It supports incremental loads of a single table or a free-form SQL query, as well as saved
jobs which can be run multiple times to import updates made to a database since the last
import. Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/
PostgreSQL/Oracle/SQL Server/DB2, and vice versa.
Oozie: Job Scheduling
• Apache Oozie is a clock and alarm service inside the Hadoop ecosystem.
• Oozie is essentially a job scheduler: it combines multiple jobs
sequentially into one logical unit of work.
• Oozie is a workflow scheduler system that allows users to link jobs written
on various platforms like MapReduce, Hive, Pig, etc., to schedule a job in advance,
and to create a pipeline of individual jobs that is executed sequentially or in
parallel to achieve a bigger task.
There are two kinds of Oozie jobs:
Oozie Workflow
Oozie workflow is a sequential set of actions to be executed.
Oozie Coordinator
These are the Oozie jobs that are triggered by time and data availability.
Chukwa
Apache Chukwa is an open source data collection system for monitoring large distributed
systems. Apache Chukwa is built on top of the Hadoop Distributed File System (HDFS) and
Map/Reduce framework and inherits Hadoop’s scalability and robustness. Apache
Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and
analyzing results to make the best use of the collected data.
Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and
transporting large amounts of streaming data, such as log files and events, from various
sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy
streaming data (log data) from various web servers to HDFS.
Zookeeper
Apache Zookeeper is an open source distributed coordination service that helps to manage
a large set of hosts. Management and coordination in a distributed environment is tricky.
Zookeeper automates this process and allows developers to focus on building software
features rather than worry about its distributed nature.
Zookeeper helps you to maintain configuration information, naming, and group services for
distributed applications. It implements different protocols on the cluster so that
applications do not have to implement them on their own. It provides a single coherent view
of multiple machines.
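As a small, hedged sketch of this coordination service, the Java example below connects to a ZooKeeper ensemble, stores a shared configuration value under a znode, and reads it back; every client that connects sees the same coherent value. The connection string localhost:2181 and the znode path /demo-config are assumptions made for the example.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (here a single local server).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of shared configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can now read the same coherent value.
        byte[] data = zk.getData(path, false, null);
        System.out.println("config = " + new String(data));

        zk.close();
    }
}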
Access HDFS using JAVA API
In order to interact with Hadoop’s filesystem programmatically, Hadoop provides several
Java classes. The package org.apache.hadoop.fs contains classes useful for manipulating
files in Hadoop’s filesystem. These operations include open, read, write, and close.
The file API for Hadoop is generic and can be extended to interact with filesystems
other than HDFS.
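As a minimal sketch of this API (not code from these notes), the example below obtains a FileSystem from the cluster configuration and lists a directory; the directory /user/hadoop is an assumed example path. Whether the code talks to HDFS or to the local filesystem is decided by the fs.defaultFS setting in core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/hadoop");   // assumed example directory
        if (fs.exists(dir)) {
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.printf("%s\t%d bytes\treplication=%d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
        }
        fs.close();
    }
}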
Read Operation In HDFS
A data read request is served by HDFS through the NameNode and DataNodes. Let’s call the
reader a ‘client’. The diagram below depicts the file read operation in Hadoop.
1. A client initiates the read request by calling the ‘open()’ method of the FileSystem object;
it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the
locations of the blocks of the file. Please note that these addresses are of the first few
blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each
block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream contains DFSInputStream, which takes care
of interactions with the DataNodes and the NameNode. In step 4 shown in the above diagram, the
client invokes the ‘read()’ method, which causes DFSInputStream to establish a connection
with the first DataNode holding the first block of the file.
5. Data is read in the form of streams wherein the client invokes the ‘read()’ method repeatedly.
This read() process continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on
to locate the next DataNode for the next block.
7. Once the client is done with reading, it calls the close() method.
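The same flow looks like this from application code. This is a hedged, minimal sketch: the file path /user/hadoop/input.txt is an assumed example, and the NameNode/DataNode interaction described above happens inside open() and the subsequent read() calls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when fs.defaultFS is hdfs://

        // open() triggers the NameNode metadata lookup and returns an FSDataInputStream.
        Path file = new Path("/user/hadoop/input.txt");   // assumed example file
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {   // repeated read() calls under the hood
                System.out.println(line);
            }
        }
        fs.close();
    }
}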
Write Operation In HDFS
Here we will understand how data is written into HDFS through files.
1. A client initiates the write operation by calling the ‘create()’ method of the
DistributedFileSystem object, which creates a new file – step 1 in the above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates
new file creation. However, this file create operation does not associate any blocks with the
file. It is the responsibility of the NameNode to verify that the file (which is being created)
does not already exist and that the client has the correct permissions to create a new file. If
the file already exists or the client does not have sufficient permission to create a new file,
then an IOException is thrown to the client. Otherwise, the operation succeeds and a new
record for the file is created by the NameNode.
3. Once a new record in the NameNode is created, an object of type FSDataOutputStream is
returned to the client. The client uses it to write data into HDFS. The data write method is
invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after
communication with the DataNodes and the NameNode. While the client continues writing
data, DFSOutputStream continues creating packets with this data. These packets are
enqueued into a queue called the DataQueue.
5. There is one more component, called the DataStreamer, which consumes this DataQueue.
The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking
desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our case,
we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores each packet it receives and forwards it to the
next DataNode in the pipeline.
9. Another queue, the ‘Ack Queue’, is maintained by DFSOutputStream to store packets which
are waiting for acknowledgment from the DataNodes.
10. Once acknowledgment for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the ‘Ack Queue’. In the event of any DataNode failure, packets
from this queue are used to reinitiate the operation.
11. After the client is done writing data, it calls the close() method (step 9 in the
diagram). The call to close() flushes the remaining data packets to the pipeline and then
waits for acknowledgment.
12. Once the final acknowledgment is received, the NameNode is contacted to tell it that the
file write operation is complete.
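From application code, the same write path is driven through create() and close(). The sketch below is a minimal, hedged example; the path /user/hadoop/output.txt is an assumption, and the packet pipelining and acknowledgments described above are handled internally by FSDataOutputStream.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to record the new file and returns an FSDataOutputStream.
        Path file = new Path("/user/hadoop/output.txt");   // assumed example path
        try (FSDataOutputStream out = fs.create(file, true);   // true = overwrite if it exists
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out))) {
            writer.write("hello hdfs");
            writer.newLine();
        }   // close() flushes the remaining packets and waits for DataNode acknowledgments
        fs.close();
    }
}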
HDFS Architecture
An HDFS cluster primarily consists of a NameNode that manages the file system metadata and
DataNodes that store the actual data.
NameNode: NameNode can be considered as a master of the system. It maintains the file
system tree and the metadata for all the files and directories present in the system. Two files
‘Namespace image’ and the ‘edit log’ are used to store metadata information. The NameNode has
knowledge of all the DataNodes containing data blocks for a given file; however, it does not
store block locations persistently. This information is reconstructed from the DataNodes every
time the system starts.
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide the
actual storage. They are responsible for serving read and write requests from clients.
Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into
block-sized chunks, which are stored as independent units. The default block size is 64 MB
(128 MB in Hadoop 2.x and later).
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are
created and are distributed on nodes throughout a cluster to enable high availability of data
in the event of node failure.
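A hedged Java sketch of how this block and replication metadata can be inspected through the FileSystem API is shown below; the file path /user/hadoop/big-input.txt is an assumed example. Each BlockLocation returned by the NameNode names the DataNodes that hold a replica of that block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/big-input.txt");   // assumed example file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());

        // Each BlockLocation lists the DataNodes that hold a replica of that block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}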
Hadoop basic commands
Commands Description
ls
This command is used to list all the files. Use lsr for recursive
approach. It is useful when we want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains
executables, so bin/hdfs means we want the hdfs executable, particularly its
dfs (Distributed File System) commands.
mkdir
mkdir: To create a directory. In Hadoop dfs there is no home
directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your
computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.
touchz
It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
copyFromLocal
(or) put
To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present
on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on
hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we
want to copy to folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
cat
To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt
copyToLocal (or)
get
To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on
Desktop.
moveFromLocal
This command will move file from local to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
cp
This command is used to copy files within hdfs. Let’s copy the folder geeks
to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
mv
This command is used to move files within hdfs. Let’s cut-paste a file
myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
rmr
This command deletes a file or directory from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside
the directory then the directory itself.
du
It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
dus
This command will give the total size of directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
stat
This command prints statistics about the given file/directory (by default, its
modification time).
Syntax:
bin/hdfs dfs -stat <path>
Example:
bin/hdfs dfs -stat /geeks
setrep
This command is used to change the replication factor of a
file/directory in HDFS. By default it is 3 for anything which is stored
in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored
in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for a directory
geeksInput stored in HDFS.
bin/hdfs dfs -setrep -R 4 /geeksInput
Note: -w means wait till the replication is completed, and -R means recursive; we use it
for directories as they may also contain many files and folders inside them.
Hadoop Distributions
Distro Remarks Free / Premium
Apache
hadoop.apache.org
o The Hadoop Source
o No packaging except TAR
balls
o No extra tools
Completely free and
open source
Cloudera
www.cloudera.com
o Oldest distro
o Very polished
o Comes with good tools to
install and manage a Hadoop
cluster
Free / Premium
model (depending on
cluster size)
HortonWorks
www.hortonworks.com
o Newer distro
o Tracks Apache Hadoop closely
o Comes with tools to manage
and administer a cluster
Completely open
source
MapR
www.mapr.com
o MapR has their own file
system (alternative to HDFS)
o Boasts higher performance
o Nice set of tools to manage
and administer a cluster
o Does not suffer from Single
Point of Failure
o Offer some cool features like
mirroring, snapshots, etc.
Free / Premium
model
Intel
hadoop.intel.com
o Encryption support
o Hardware acceleration added
to some layers of stack to
boost performance
o Admin tools to deploy and
manage Hadoop
Premium
Pivotal HD
gopivotal.com
o Fast SQL on Hadoop
o Software only or appliance
Premium
Working with Hadoop under Eclipse
Here are instructions for setting up a development environment for Hadoop under the
Eclipse IDE.
This document assumes you already have Eclipse downloaded, installed, and configured to
your liking. It also assumes that you are aware of the HowToContribute page and have
given that a read.
Quick Start
We will begin by downloading the Hadoop source. The hadoop-common source tree has
three subprojects underneath it that you will see after you pull down the source code:
hadoop-common, hdfs, and mapreduce.
Let's begin by getting the latest source from Git (note that there is a copy mirrored on GitHub,
but it lags the Apache read-only git repository slightly).
git clone git://git.apache.org/hadoop-common.git
This will create a hadoop-common folder in your current directory, if you "cd" into that
folder you will see all the available subprojects. Now we will build the code to get it ready for
importing into Eclipse.
From this directory you just 'cd'-ed into (Which is also known as the top-level directory of a
branch or a trunk checkout), perform:
$ mvn install -DskipTests
$ mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
Note: This may take a while the first time, as all libraries are fetched from the internet, and
the whole build is performed.
In Eclipse
After the above, do the following to have the projects ready and waiting for you in Eclipse:
For Common
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-common-project directory as the root directory
• Select the hadoop-annotations, hadoop-auth, hadoop-auth-examples, hadoop-nfs and
hadoop-common projects
• Click "Finish"
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-assemblies directory as the root directory
• Select the hadoop-assemblies project
• Click "Finish"
• To get the projects to build cleanly:
o Add target/generated-test-sources/java as a source directory for hadoop-common
o You may have to add then remove the JRE System Library to avoid errors due to
access restrictions
For HDFS
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-hdfs-project directory as the root directory
• Select the hadoop-hdfs project
• Click "Finish"
For MapReduce
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-mapreduce-project directory as the root directory
• Select the hadoop-mapreduce-project project
• Click "Finish"
For YARN
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-yarn-project directory as the root directory
• Select the hadoop-yarn-project project
• Click "Finish"
Note: in the case of MapReduce the testjar package is broken. This is expected since it is a
part of a testcase that checks for incorrect packaging. This is not to be worried about.
To run tests from Eclipse you need to additionally do the following:
• Under project Properties, select Java Build Path, and the Libraries tab
• Click "Add External Class Folder" and select the build directory of the current project
More Related Content

Similar to Big Data Programming and Components

HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptHADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptManiMaran230751
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoopAditi Yadav
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 

Similar to Big Data Programming and Components (20)

HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptHADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoop
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 

Big Data Programming and Components

  • 1. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny Unit IV `Introduction of Big data programming-Hadoop, History of Hadoop, The ecosystem and stack, Components of Hadoop, Hadoop Distributed File System (HDFS), Design of HDFS, Java interfaces to HDFS, Architecture overview, Development Environment, Hadoop distribution and-basic commands, Eclipse development What is HDFS? HDFS is a distributed file system for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. Hadoop comes bundled with HDFS (Hadoop Distributed File Systems). When data exceeds the capacity of storage on a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage specific operations across a network of machines is called a distributed file system. HDFS is one such software. History of Hadoop The Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google. Year Event 2002 Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open source web crawler software project. 2003 Google released the paper, Google File System (GFS). 2004 Google released a white paper on Map Reduce. 2006 o Hadoop introduced.
  • 2. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny o Hadoop 0.1.0 released. o Yahoo deploys 300 machines and within this year reaches 600 machines. 2007 o Yahoo runs 2 clusters of 1000 machines. o Hadoop includes HBase. 2008 o YARN JIRA opened o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node cluster within 209 seconds. o Yahoo clusters loaded with 10 terabytes per day. o Cloudera was founded as a Hadoop distributor. 2009 o Yahoo runs 17 clusters of 24,000 machines. o Hadoop becomes capable enough to sort a petabyte. o MapReduce and HDFS become separate subproject. 2010 o Hadoop added the support for Kerberos. o Hadoop operates 4,000 nodes with 40 petabytes. o Apache Hive and Pig released. 2011 o Apache Zookeeper released. o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage. 2012 Apache Hadoop 1.0 version released. 2013 Apache Hadoop 2.2 version released. 2014 Apache Hadoop 2.6 version released. 2015 Apache Hadoop 2.7 version released.
  • 3. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny 2017 Apache Hadoop 3.0 version released. 2018 Apache Hadoop 3.1 version released. The Hadoop Ecosystem Hadoop Ecosystem is a platform or a suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. 4 major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop was made up of several modules that are supported by a large ecosystem of technologies. Components that collectively form a Hadoop ecosystem: 1. HDFS: Hadoop Distributed File System 2. YARN: Yet Another Resource Negotiator 3. MapReduce: Programming-based Data Processing 4. Spark: In-Memory data processing 5. PIG, HIVE: Query-based processing of data services 6. HBase: NoSQL Database 7. Mahout, Spark MLLib: Machine Learning algorithm libraries
  • 4. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny 8. Zookeeper: Managing cluster 9. Oozie: Job Scheduling What is Hadoop? Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an Open-source Data Management with scale-out storage & distributed processing. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. HDFS: Hadoop Distributed File System • HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. • HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. • HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. • HDFS provides high throughput access to application data and is suitable for applications that have large data sets • HDFS consists of two core components : o Name Node o Data Node Name Node: • Name Node, a master server, manages the file system namespace and regulates access to files by clients. • Maintains and manages the blocks which are present on the data node. • Name Node is the prime node that contains metadata • Meta-data in Memory – The entire metadata is in the main memory • Types of Metadata – List of files – List of Blocks for each file – List of Data Nodes for each block – File attributes, example: creation time, replication • A Transaction Log – Records file creations, and file deletions. Etc.
  • 5. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny Data Node: • Data Nodes, one per node in the cluster, manage storage attached to the nodes that they run on. • data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment. • A Block Server o Stores data in the local file system o Stores meta-data of a block o Serves data and meta-data to Clients o Block Report o Periodically sends a report of all existing blocks to the Name Node • Facilitates Pipelining of Data o Forwards data to other specified Data Nodes HBase: NoSQL Database Apache HBase is an open-source, distributed, versioned, fault-tolerant, scalable, column- oriented store modeled after Google’s Bigtable, with random real-time read/write access to data. It’s a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. It combines the scalability of Hadoop by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and deep analytic capabilities of Map Reduce. MapReduce
  • 6. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny MapReduce is a processing technique and a program model for distributed computing based on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model. The Algorithm • Generally MapReduce paradigm is based on sending the computer to where the data resides! • MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage. o Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS. • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. • Most of the computing takes place on nodes with data on local disks that reduces the network traffic. • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
  • 7. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny YARN: Yet Another Resource Negotiator • Apache YARN is Hadoop’s cluster resource management system. • YARN was introduced in Hadoop 2.0 for improving the MapReduce utilization. • It handles the cluster of nodes and acts as Hadoop’s resource management unit. YARN allocates RAM, memory, and other resources to different applications. YARN has two components : Resource Manager • Global resource scheduler • Runs on the master node • Manages other Nodes o Tracks heartbeats from Node Manager • Manages Containers
  • 8. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny o Handles AM requests for resources o De-allocates containers when they expire, or the application completes • Manages Application Master o Creates a container from AM and tracks heartbeats • Manages Security Node Manager • Runs on slave node • Communicates with RM o Registers and provides info on Node resources o Sends heartbeats and container status • Manages processes and container status o Launches AM on request from RM o Launches application process on request from AM o Monitors resource usage by containers. • Provides logging services to applications o Aggregates logs for an application and saves them to HDFS
  • 9. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny PIG To performed a lot of data administration operation, Pig Hadoop was developed by Yahoo which is Query based language works on a pig Latin language used with hadoop. • It is a platform for structuring the data flow, and processing and analyzing huge data sets. • Pig does the work of executing commands and in the background, all the activities of MapReduce are taken care of. After the processing, PIG stores the result in HDFS. • Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way Java runs on the JVM. Features of PIG • Provides support for data types – long, float, char array, schemas, and functions • Is extensible and supports User Defined Functions • Provides common operations like JOIN, GROUP, FILTER, SORT HIVE Relational databases that use SQL as the query language implemented by most of data Most data warehouse application. Hive is a data warehousing package built on top of Hadoop that lowers the barrier to moving these applications to Hadoop. • Structured and Semi-Structured data Processing by using Hive. • Series of automatically generated Map Reduce jobs is internal execution of Hive query. • Structure data used for data analysis. Mahout • Mahout provides an environment for creating machine learning applications that are scalable. • Mahout allows Machine Learnability to a system or application. • MLlib, Spark’s open-source distributed machine learning library. • MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. • It allows invoking algorithms as per our need with the help of its own libraries. Avro
  • 10. Name of the Subject Notes Prepared by : : BIG DATA & ANALYTICS Dr. S. Pradeep Kumar Kenny Avro is an open source project that provides data serialization and data exchange services for Apache Hadoop. These services can be used together or independently. Avro facilitates the exchange of big data between programs written in any language. With the serialization service, programs can efficiently serialize data into files or into messages. The data storage is compact and efficient. Avro stores both the data definition and the data together in one message or file. Avro stores the data definition in JSON format making it easy to read and interpret; the data itself is stored in binary format making it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for Apache MapReduce processing. Some data exchange services use a code generator to interpret the data definition and produce code to access the data. Avro doesn't require this step, making it ideal for scripting languages. A key feature of Avro is robust support for data schemas that change over time — often called schema evolution. Avro handles schema changes like missing fields, added fields and changed fields; as a result, old programs can read new data and new programs can read old data. Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can be passed from programs written in different languages, even from a compiled language like C to a scripting language like Apache Pig. Sqoop Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free form SQL query as well as saved jobs which can be run multiple times to import updates made to a database since the last import.Using Sqoop, Data can be moved into HDFS/hive/hbase from MySQL/ PostgreSQL/Oracle/SQL Server/DB2 and vise versa. Oozie: Job Scheduling • Apache Oozie is a clock and alarm service inside Hadoop Ecosystem. • Oozie simply performs the task scheduler, it combines multiple jobs sequentially into one logical unit of work. • Oozie is a workflow scheduler system that allows users to link jobs written on various platforms like MapReduce, Hive, Pig, etc. schedule a job in advance and create a pipeline of individual jobs was executed sequentially or in parallel to achieve a bigger task using Oozie. There are two kinds of Oozie jobs: Oozie Workflow
Oozie Workflow
An Oozie workflow is a sequential set of actions to be executed.
Oozie Coordinator
Oozie coordinator jobs are triggered by time and by data availability.

Chukwa
Apache Chukwa is an open source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, to make the best use of the collected data.

Flume
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store. Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.

Zookeeper
Apache ZooKeeper is an open source distributed coordination service that helps to manage a large set of hosts. Management and coordination in a distributed environment is tricky; ZooKeeper automates this process and allows developers to focus on building software features rather than worrying about its distributed nature. ZooKeeper helps you maintain configuration information, naming, and group services for distributed applications. It implements the necessary coordination protocols on the cluster so that applications do not have to implement them on their own, and it provides a single coherent view of multiple machines.
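As a small illustration of the configuration and naming service just described, here is a minimal Java sketch using the ZooKeeper client API. The ensemble address localhost:2181, the znode /app-config, and the stored value are assumptions for illustration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumed ZooKeeper ensemble address; 3000 ms session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait until the session is established

        // Store a piece of configuration under a hypothetical znode path.
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can read back the same configuration value.
        byte[] data = zk.getData(path, false, null);
        System.out.println("config: " + new String(data));

        zk.close();
    }
}

Every client that connects to the ensemble sees the same znode tree, which is what gives the "single coherent view of multiple machines" mentioned above.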
Access HDFS using JAVA API
To interact with Hadoop's filesystem programmatically, Hadoop provides several Java classes. The package org.apache.hadoop.fs contains classes useful for manipulating files in Hadoop's filesystem; the supported operations include open, read, write, and close. The file API in Hadoop is generic and can be extended to interact with filesystems other than HDFS.
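A minimal sketch of obtaining a FileSystem handle from this package and listing a directory is shown below; the NameNode URI hdfs://localhost:9000 and the directory /geeks are assumptions for illustration (on a configured cluster node, FileSystem.get(conf) would pick up the address from core-site.xml).

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path dir = new Path("/geeks");   // hypothetical HDFS directory
        if (fs.exists(dir)) {
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
        fs.close();
    }
}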
Read Operation In HDFS
A data read request is served by HDFS, the NameNode, and the DataNodes. Let us call the reader a 'client'. The following steps (keyed to the HDFS read diagram) describe the file read operation in Hadoop.
1. The client initiates the read request by calling the open() method of the FileSystem object, which is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Note that these addresses are for the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes holding a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream wraps a DFSInputStream, which takes care of the interactions with the DataNodes and the NameNode. The client then invokes the read() method (step 4 in the diagram), which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking read() repeatedly. This read() process continues until the end of the block is reached.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method.
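A minimal sketch of this read path from the client's side is shown below; the FSDataInputStream returned by open() hides the DFSInputStream and DataNode interactions described in the steps above. The NameNode URI and the file /geeks/AI.txt (used again in the command examples later) are assumptions for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf); // assumed NameNode

        // open() returns an FSDataInputStream; block locations come from the NameNode (steps 1-4).
        FSDataInputStream in = fs.open(new Path("/geeks/AI.txt"));
        try {
            // The stream is read until the end and copied to standard output (steps 5-6).
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);   // step 7: the client calls close()
        }
        fs.close();
    }
}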
Write Operation In HDFS
Here we will understand how data is written into HDFS through files.
1. The client initiates the write operation by calling the create() method of the DistributedFileSystem object, which creates a new file (step 1 in the diagram).
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates the creation of a new file. However, this file-create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file being created does not already exist and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission, an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once the new record is created in the NameNode, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS; the data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after the communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream keeps creating packets from this data. These packets are enqueued into a queue called the DataQueue.
5. Another component, the DataStreamer, consumes this DataQueue. The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
6. The replication process then starts by creating a pipeline of DataNodes. In our case, we have chosen a replication level of 3, so there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from the DataNodes.
10. Once the acknowledgment for a packet is received from all DataNodes in the pipeline, the packet is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
11. After the client is done writing data, it calls the close() method (step 9 in the diagram). The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgments.
12. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
A minimal Java sketch of this write path is given after the HDFS Architecture notes below.

HDFS Architecture
An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
NameNode: The NameNode can be considered the master of the system. It maintains the file system tree and the metadata for all the files and directories present in the system. Two files, the 'namespace image' and the 'edit log', are used to store metadata information. The NameNode knows which DataNodes hold the data blocks for a given file; however, it does not store block locations persistently. This information is reconstructed from the DataNodes every time the system starts.
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide the actual storage. They are responsible for serving read and write requests from clients.
Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 64 MB (128 MB in Hadoop 2.x and later). HDFS uses data replication: multiple replicas of each data block are created and distributed on nodes throughout the cluster to keep data highly available in the event of node failure.
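The sketch below writes a small file and then asks the NameNode for the block size and replication factor it maintains for that file, tying together the write path and the architecture notes above. The NameNode URI and the file path /geeks/notes.txt are assumptions for illustration.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf); // assumed NameNode

        Path file = new Path("/geeks/notes.txt");   // hypothetical file path
        // create() contacts the NameNode and returns an FSDataOutputStream (steps 1-3 above).
        FSDataOutputStream out = fs.create(file);
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        out.close();   // flushes the remaining packets and waits for acknowledgments (step 11)

        // Metadata kept by the NameNode for the file: block size and replication factor.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());

        fs.close();
    }
}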
Hadoop basic commands

ls — Lists all the files. Use lsr for a recursive listing; it is useful when we want the hierarchy of a folder.
Syntax: bin/hdfs dfs -ls <path>
Example: bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File System commands.

mkdir — Creates a directory. In HDFS there is no home directory by default, so let us first create it.
Syntax: bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username    (write the username of your computer)
Example:
bin/hdfs dfs -mkdir /geeks    ('/' means an absolute path)
bin/hdfs dfs -mkdir geeks2    (relative path: the folder will be created relative to the home directory)

touchz — Creates an empty file.
Syntax: bin/hdfs dfs -touchz <file_path>
Example: bin/hdfs dfs -touchz /geeks/myfile.txt

copyFromLocal (or) put — Copies files/folders from the local file system to the HDFS store. This is one of the most important commands. Local file system means the files present on the OS.
Syntax: bin/hdfs dfs -copyFromLocal <local file path> <dest (on hdfs)>
Example: Suppose we have a file AI.txt on the Desktop which we want to copy to the folder geeks present on HDFS:
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(or) bin/hdfs dfs -put ../Desktop/AI.txt /geeks

cat — Prints file contents.
Syntax: bin/hdfs dfs -cat <path>
Example (print the content of AI.txt present inside the geeks folder):
bin/hdfs dfs -cat /geeks/AI.txt

copyToLocal (or) get — Copies files/folders from the HDFS store to the local file system.
Syntax: bin/hdfs dfs -copyToLocal <srcfile (on hdfs)> <local file dest>
Example: bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(or) bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from the geeks folder will be copied to the folder hero present on the Desktop.

moveFromLocal — Moves a file from the local file system to HDFS.
Syntax: bin/hdfs dfs -moveFromLocal <local src> <dest (on hdfs)>
Example: bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

cp — Copies files within HDFS. Let us copy the folder geeks to geeks_copied.
Syntax: bin/hdfs dfs -cp <src (on hdfs)> <dest (on hdfs)>
Example: bin/hdfs dfs -cp /geeks /geeks_copied

mv — Moves files within HDFS. Let us cut-paste the file myfile.txt from the geeks folder to geeks_copied.
Syntax: bin/hdfs dfs -mv <src (on hdfs)> <dest (on hdfs)>
Example: bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

rmr — Deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory (in newer Hadoop versions, -rm -r is the preferred form).
Syntax: bin/hdfs dfs -rmr <filename/directoryName>
Example: bin/hdfs dfs -rmr /geeks_copied
It will delete all the content inside the directory and then the directory itself.

du — Gives the size of each file in a directory.
Syntax: bin/hdfs dfs -du <dirName>
Example: bin/hdfs dfs -du /geeks

dus — Gives the total (summarized) size of a directory/file.
Syntax: bin/hdfs dfs -dus <dirName>
Example: bin/hdfs dfs -dus /geeks

stat — Prints statistics about the given file or directory, such as its modification time.
Syntax: bin/hdfs dfs -stat <path>
Example: bin/hdfs dfs -stat /geeks

setrep — Changes the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by the dfs.replication property in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: -w means wait till the replication is completed, and -R means recursively; we use it for directories since they may contain many files and folders inside them.

Hadoop Distributions

Apache (hadoop.apache.org)
• The Hadoop source
• No packaging except TAR balls
• No extra tools
Completely free and open source

Cloudera (www.cloudera.com)
• Oldest distro
• Very polished
• Comes with good tools to install and manage a Hadoop cluster
Free / premium model (depending on cluster size)

HortonWorks (www.hortonworks.com)
• Newer distro
• Tracks Apache Hadoop closely
• Comes with tools to manage and administer a cluster
Completely open source

MapR (www.mapr.com)
• MapR has its own file system (an alternative to HDFS)
• Boasts higher performance
• Nice set of tools to manage and administer a cluster
• Does not suffer from a single point of failure
• Offers features such as mirroring, snapshots, etc.
Free / premium model

Intel (hadoop.intel.com)
• Encryption support
• Hardware acceleration added to some layers of the stack to boost performance
• Admin tools to deploy and manage Hadoop
Premium

Pivotal HD (gopivotal.com)
• Fast SQL on Hadoop
• Software only or appliance
Premium
Working with Hadoop under Eclipse
Here are instructions for setting up a development environment for Hadoop under the Eclipse IDE. This section assumes you already have Eclipse downloaded, installed, and configured to your liking, and that you have read the Apache HowToContribute page.

Quick Start
We will begin by downloading the Hadoop source. The hadoop-common source tree has three subprojects underneath it that you will see after you pull down the source code: hadoop-common, hdfs, and mapreduce. Let us begin by getting the latest source from Git (note that there is a copy mirrored on GitHub, but it lags the Apache read-only Git repository slightly):

git clone git://git.apache.org/hadoop-common.git

This will create a hadoop-common folder in your current directory; if you cd into that folder you will see all the available subprojects. Now we will build the code to get it ready for importing into Eclipse. From the directory you just cd-ed into (also known as the top-level directory of a branch or trunk checkout), perform:

$ mvn install -DskipTests
$ mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true

Note: This may take a while the first time, as all libraries are fetched from the internet and the whole build is performed.

In Eclipse
After the above, do the following to have the projects ready in Eclipse:

For Common
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-common-project directory as the root directory
• Select the hadoop-annotations, hadoop-auth, hadoop-auth-examples, hadoop-nfs and hadoop-common projects
• Click "Finish"
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-assemblies directory as the root directory
• Select the hadoop-assemblies project
• Click "Finish"
To get the projects to build cleanly:
• Add target/generated-test-sources/java as a source directory for hadoop-common
• You may have to add and then remove the JRE System Library to avoid errors due to access restrictions

For HDFS
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-hdfs-project directory as the root directory
• Select the hadoop-hdfs project
• Click "Finish"

For MapReduce
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-mapreduce-project directory as the root directory
• Select the hadoop-mapreduce-project project
• Click "Finish"

For YARN
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-yarn-project directory as the root directory
• Select the hadoop-yarn-project project
• Click "Finish"

Note: in the case of MapReduce the testjar package is broken. This is expected, since it is part of a test case that checks for incorrect packaging, and is not something to worry about.

To run tests from Eclipse you additionally need to do the following:
• Under the project Properties, select Java Build Path, and then the Libraries tab
• Click "Add External Class Folder" and select the build directory of the current project