Name of the Subject : BIG DATA & ANALYTICS
Notes Prepared by : Dr. S. Pradeep Kumar Kenny
Unit IV
Introduction of Big data programming-Hadoop, History of Hadoop, The ecosystem and
stack, Components of Hadoop, Hadoop Distributed File System (HDFS), Design of HDFS, Java
interfaces to HDFS, Architecture overview, Development Environment, Hadoop distribution
and basic commands, Eclipse development
What is HDFS?
HDFS is a distributed file system for storing very large data files, running on clusters of
commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. Hadoop
comes bundled with HDFS (Hadoop Distributed File System).
When data exceeds the capacity of storage on a single physical machine, it becomes essential
to divide it across a number of separate machines. A file system that manages storage-specific
operations across a network of machines is called a distributed file system. HDFS is one such
piece of software.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella; its roots go back to 2002 and the
Google File System paper published by Google.
Year Event
2002 Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is
an open source web crawler software project.
2003 Google released the paper, Google File System (GFS).
2004 Google released a white paper on Map Reduce.
2006
o Hadoop introduced.
o Hadoop 0.1.0 released.
o Yahoo deploys 300 machines and within this year reaches 600 machines.
2007
o Yahoo runs 2 clusters of 1000 machines.
o Hadoop includes HBase.
2008
o YARN JIRA opened
o Hadoop becomes the fastest system to sort 1 terabyte of data, on a 900-node
cluster in 209 seconds.
o Yahoo clusters loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.
2009
o Yahoo runs 17 clusters of 24,000 machines.
o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subprojects.
2010
o Hadoop added the support for Kerberos.
o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.
2011
o Apache Zookeeper released.
o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012 Apache Hadoop 1.0 version released.
2013 Apache Hadoop 2.2 version released.
2014 Apache Hadoop 2.6 version released.
2015 Apache Hadoop 2.7 version released.
2017 Apache Hadoop 3.0 version released.
2018 Apache Hadoop 3.1 version released.
The Hadoop Ecosystem
The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data
problems. It includes Apache projects and various commercial tools and solutions. The 4 major
elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a
framework that enables the processing of large data sets that reside across clusters of machines.
Being a framework, Hadoop is made up of several modules that are supported by a large
ecosystem of technologies.
Components that collectively form a Hadoop ecosystem:
1. HDFS: Hadoop Distributed File System
2. YARN: Yet Another Resource Negotiator
3. MapReduce: Programming-based Data Processing
4. Spark: In-Memory data processing
5. PIG, HIVE: Query-based processing of data services
6. HBase: NoSQL Database
7. Mahout, Spark MLLib: Machine Learning algorithm libraries
8. Zookeeper: Managing cluster
9. Oozie: Job Scheduling
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
The Apache Hadoop project develops open-source software for reliable, scalable,
distributed computing.
HDFS: Hadoop Distributed File System
• HDFS is a distributed, scalable, and portable filesystem written in Java for the
Hadoop framework.
• HDFS creates multiple replicas of data blocks and distributes them on compute
nodes throughout a cluster to enable reliable, extremely rapid computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high-throughput access to application data and is suitable for
applications that have large data sets.
• HDFS consists of two core components:
o Name Node
o Data Node
Name Node:
• Name Node, a master server, manages the file system namespace and regulates
access to files by clients.
• Maintains and manages the blocks which are present on the data nodes.
• Name Node is the prime node that contains the metadata.
• Meta-data in Memory
– The entire metadata is in the main memory
• Types of Metadata
– List of files
– List of Blocks for each file
– List of Data Nodes for each block
– File attributes, example: creation time, replication
• A Transaction Log
– Records file creations, file deletions, etc.
Data Node:
• Data Nodes, one per node in the cluster, manage the storage attached to the nodes
that they run on.
• Data Nodes store the actual data; in the distributed environment they are
commodity hardware.
• A Block Server
o Stores data in the local file system
o Stores meta-data of a block
o Serves data and meta-data to Clients
o Block Report: periodically sends a report of all existing blocks to the Name Node
• Facilitates Pipelining of Data
o Forwards data to other specified Data Nodes
HBase: NoSQL Database
Apache HBase is an open-source, distributed, versioned, fault-tolerant, scalable, column-
oriented store modeled after Google’s Bigtable, with random real-time read/write access
to data.
It’s a NoSQL database that runs on top of Hadoop as a distributed and scalable big data
store.
It combines the scalability of Hadoop, by running on the Hadoop Distributed File System
(HDFS), with real-time data access as a key/value store and the deep analytic capabilities of
MapReduce.
MapReduce
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key/value pairs). Second, the reduce task takes the output
from a map as input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where
the data resides.
• MapReduce program executes in three stages, namely map stage, shuffle stage,
and reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper processes
the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing,
it produces a new set of output, which will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
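The map, shuffle, and reduce stages described above can be sketched in plain Java. This is a conceptual, single-machine model with no Hadoop dependencies; the class and method names below are illustrative and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual word-count MapReduce, modeling the three stages on one machine.
public class WordCountSketch {

    // Map stage: each input line becomes (word, 1) key/value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle stage: group values by key; Reduce stage: sum each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {bear=2, car=3, deer=2, river=2}
        System.out.println(run(List.of("deer bear river", "car car river", "deer car bear")));
    }
}
```

In real Hadoop the map and reduce functions run on different machines and the framework performs the shuffle over the network; this sketch only shows the data flow.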
YARN: Yet Another Resource Negotiator
• Apache YARN is Hadoop’s cluster resource management system.
• YARN was introduced in Hadoop 2.0 to improve resource utilization over classic MapReduce.
• It handles the cluster of nodes and acts as Hadoop's resource management
unit. YARN allocates memory, CPU, and other resources to different
applications.
YARN has two components:
Resource Manager
• Global resource scheduler
• Runs on the master node
• Manages other Nodes
o Tracks heartbeats from Node Manager
• Manages Containers
o Handles Application Master (AM) requests for resources
o De-allocates containers when they expire, or the application
completes
• Manages Application Master
o Creates a container for the AM and tracks its heartbeats
• Manages Security
Node Manager
• Runs on slave node
• Communicates with RM
o Registers and provides info on Node resources
o Sends heartbeats and container status
• Manages processes and container status
o Launches AM on request from RM
o Launches application process on request from AM
o Monitors resource usage by containers.
• Provides logging services to applications
o Aggregates logs for an application and saves them to HDFS
PIG
Pig Hadoop was developed by Yahoo to perform a lot of data administration operations. It is
a query-based system built on Pig Latin, a language used with Hadoop.
• It is a platform for structuring the data flow, and for processing and analyzing
huge data sets.
• Pig does the work of executing commands, and in the background all the
activities of MapReduce are taken care of. After the processing, Pig stores the
result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig
Runtime, just the way Java runs on the JVM.
Features of PIG
• Provides support for data types – long, float, char array, schemas, and
functions
• Is extensible and supports User Defined Functions
• Provides common operations like JOIN, GROUP, FILTER, SORT
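As a small illustration of Pig Latin, the script below loads a delimited file, filters it, and groups it. The relation names and the student data file are hypothetical, not taken from these notes.

```pig
-- Load comma-separated student records (hypothetical file on HDFS)
A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, gpa:float);
-- FILTER and GROUP are built-in Pig operations
B = FILTER A BY gpa > 3.0;
C = GROUP B BY name;
-- DUMP triggers the underlying MapReduce jobs and prints the result
DUMP C;
```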
HIVE
Most data warehouse applications are implemented on top of relational databases that use
SQL as the query language. Hive is a data warehousing package built on top of
Hadoop that lowers the barrier to moving these applications to Hadoop.
• Hive can process both structured and semi-structured data.
• Internally, a Hive query is executed as a series of automatically generated
MapReduce jobs.
• Structured data is typically used for data analysis.
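A short sketch of Hive usage follows; the table and column names are hypothetical. Behind the scenes, Hive compiles the query into MapReduce jobs.

```sql
-- Create a table over delimited data stored in HDFS (names are illustrative)
CREATE TABLE students (name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- This query runs as automatically generated MapReduce jobs
SELECT name, AVG(gpa)
FROM students
GROUP BY name;
```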
Mahout
• Mahout provides an environment for creating scalable machine learning
applications.
• Mahout brings machine learnability to a system or application.
• MLlib is Spark's open-source distributed machine learning library.
• MLlib provides efficient functionality for a wide range of learning settings and
includes several underlying statistical, optimization, and linear algebra
primitives.
• Both allow invoking algorithms as needed with the help of their own libraries.
Avro
Avro is an open source project that provides data serialization and data exchange services
for Apache Hadoop. These services can be used together or independently. Avro facilitates
the exchange of big data between programs written in any language. With the serialization
service, programs can efficiently serialize data into files or into messages. The data storage
is compact and efficient. Avro stores both the data definition and the data together in one
message or file.
Avro stores the data definition in JSON format making it easy to read and interpret; the data
itself is stored in binary format making it compact and efficient. Avro files include markers
that can be used to split large data sets into subsets suitable for Apache
MapReduce processing. Some data exchange services use a code generator to interpret the
data definition and produce code to access the data. Avro doesn't require this step, making
it ideal for scripting languages.
A key feature of Avro is robust support for data schemas that change over time — often called
schema evolution. Avro handles schema changes like missing fields, added fields and
changed fields; as a result, old programs can read new data and new programs can read old
data. Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can
be passed from programs written in different languages, even from a compiled language like
C to a scripting language like Apache Pig.
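The schema evolution described above relies on Avro's JSON schema definitions. A minimal sketch follows; the record and field names are invented for illustration. The union with "null" plus a default value is what allows old programs to read new data and vice versa.

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": ["null", "int"], "default": null}
  ]
}
```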
Sqoop
Sqoop is a command-line interface application for transferring data between relational
databases and Hadoop.
It supports incremental loads of a single table or a free-form SQL query, as well as saved
jobs which can be run multiple times to import updates made to a database since the last
import. Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/
PostgreSQL/Oracle/SQL Server/DB2 and vice versa.
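A typical Sqoop import might look like the following; the connection string, username, table, and target directory are placeholders, not values from these notes.

```shell
# Import a MySQL table into HDFS (all identifiers are illustrative)
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username dbuser \
  --table employees \
  --target-dir /user/hadoop/employees
```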
Oozie: Job Scheduling
• Apache Oozie is a clock and alarm service inside the Hadoop ecosystem.
• Oozie simply performs task scheduling: it combines multiple jobs
sequentially into one logical unit of work.
• Oozie is a workflow scheduler system that allows users to link jobs written
on various platforms like MapReduce, Hive, Pig, etc. Using Oozie, you can
schedule a job in advance and create a pipeline of individual jobs to be
executed sequentially or in parallel to achieve a bigger task.
There are two kinds of Oozie jobs:
Oozie Workflow
Oozie workflow is a sequential set of actions to be executed.
Oozie Coordinator
These are the Oozie jobs that are triggered by time and data availability.
Chukwa
Apache Chukwa is an open source data collection system for monitoring large distributed
systems. Apache Chukwa is built on top of the Hadoop Distributed File System (HDFS) and
Map/Reduce framework and inherits Hadoop’s scalability and robustness. Apache
Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and
analyzing results to make the best use of the collected data.
Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and
transporting large amounts of streaming data, such as log files and events, from various
sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy
streaming data (log data) from various web servers to HDFS.
Zookeeper
Apache Zookeeper is an open source distributed coordination service that helps to manage
a large set of hosts. Management and coordination in a distributed environment is tricky.
Zookeeper automates this process and allows developers to focus on building software
features rather than worrying about its distributed nature.
Zookeeper helps you to maintain configuration information, naming, and group services for
distributed applications. It implements different protocols on the cluster so that the
applications do not have to implement them on their own. It provides a single coherent view
of multiple machines.
Access HDFS using JAVA API
In order to interact with Hadoop's filesystem programmatically, Hadoop provides multiple
Java classes. The package org.apache.hadoop.fs contains the classes useful for manipulating
a file in Hadoop's filesystem. These operations include open, read, write, and close.
In fact, the file API for Hadoop is generic and can be extended to interact with filesystems
other than HDFS.
Read Operation In HDFS
A data read request is served by HDFS, NameNode, and DataNodes. Let's call the reader a
'client'. The diagram below depicts the file read operation in Hadoop.
1. A client initiates a read request by calling the 'open()' method of the FileSystem object; it
is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such as
the locations of the blocks of the file. Please note that these addresses are of the first few
blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each
block are returned.
4. Once addresses of DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream contains DFSInputStream which takes care
of interactions with DataNode and NameNode. In step 4 shown in the above diagram, a
client invokes ‘read()’ method which causes DFSInputStream to establish a connection
with the first DataNode with the first block of a file.
5. Data is read in the form of streams wherein the client invokes the 'read()' method
repeatedly. This process of read() operations continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on
to locate the next DataNode for the next block.
7. Once a client is done with the reading, it calls the close() method.
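The read path above can be sketched with the org.apache.hadoop.fs API. This is a minimal sketch, assuming a Hadoop client library on the classpath, a NameNode listening at the given address, and an existing file at the given path; the URI shown is a placeholder, not a value from these notes.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Opens a file in HDFS and copies its contents to standard output.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost:9000/geeks/AI.txt"; // illustrative path
        Configuration conf = new Configuration();
        // FileSystem.get returns a DistributedFileSystem for an hdfs:// URI
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                  // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // repeated read() calls
        } finally {
            IOUtils.closeStream(in);                      // close() when done
        }
    }
}
```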
Write Operation In HDFS
Here we will understand how data is written into HDFS through files.
1. A client initiates write operation by calling ‘create()’ method of DistributedFileSystem
object which creates a new file – Step no. 1 in the above diagram.
2. DistributedFileSystem object connects to the NameNode using an RPC call and initiates
new file creation. However, this file create operation does not associate any blocks with the
file. It is the responsibility of the NameNode to verify that the file (which is being created)
does not already exist and that the client has the correct permissions to create a new file. If
the file already exists or the client does not have sufficient permission to create a new file,
then an IOException is thrown to the client. Otherwise, the operation succeeds and a new
record for the file is created by the NameNode.
3. Once a new record in NameNode is created, an object of type FSDataOutputStream is
returned to the client. A client uses it to write data into the HDFS. Data write method is
invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after
communication with the DataNodes and NameNode. While the client continues writing
data, DFSOutputStream continues creating packets with this data. These packets are
enqueued into a queue called the DataQueue.
5. There is one more component called DataStreamer which consumes this DataQueue.
DataStreamer also asks NameNode for allocation of new blocks thereby picking desirable
DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our case,
we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores the packet it receives and forwards the same to the
next DataNode in the pipeline.
9. Another queue, ‘Ack Queue’ is maintained by DFSOutputStream to store packets which
are waiting for acknowledgment from DataNodes.
10. Once acknowledgment for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the ‘Ack Queue’. In the event of any DataNode failure, packets
from this queue are used to reinitiate the operation.
11. After a client is done with the writing data, it calls a close() method (step 9 in the
diagram). The call to close() results in flushing the remaining data packets to the pipeline,
followed by waiting for acknowledgment.
12. Once a final acknowledgment is received, NameNode is contacted to tell it that the file
write operation is complete.
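The client-side view of the write path can be sketched with the same API. Again a minimal sketch, assuming a Hadoop client on the classpath and a running NameNode at the given address; the URI is a placeholder.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Creates a new file in HDFS and writes a line of text to it.
public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost:9000/geeks/hello.txt"; // illustrative path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // create() contacts the NameNode and returns an FSDataOutputStream (step 1-3)
        FSDataOutputStream out = fs.create(new Path(uri));
        out.writeBytes("Hello HDFS\n"); // data is packetized into the DataQueue
        out.close();                    // flush the pipeline and wait for acks
    }
}
```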
HDFS Architecture
An HDFS cluster primarily consists of a NameNode that manages the file system metadata and
DataNodes that store the actual data.
NameNode: NameNode can be considered as a master of the system. It maintains the file
system tree and the metadata for all the files and directories present in the system. Two files
‘Namespace image’ and the ‘edit log’ are used to store metadata information. Namenode has
knowledge of all the datanodes containing data blocks for a given file, however, it does not
store block locations persistently. This information is reconstructed every time from
datanodes when the system starts.
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide the
actual storage. They are responsible for serving read and write requests from the clients.
Read/write operations in HDFS operate at a block level. Data files in HDFS are broken into
block-sized chunks, which are stored as independent units. The default block size is 64 MB
(128 MB in Hadoop 2 and later).
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are
created and are distributed on nodes throughout a cluster to enable high availability of data
in the event of node failure.
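A quick worked example of the block arithmetic: with 64 MB blocks, a 200 MB file is split into four blocks (three full 64 MB blocks and one 8 MB block; HDFS does not pad the last block), and a replication factor of 3 triples the raw storage. The helper below is a sketch; the class and method names are made up for illustration.

```java
// Computes how many HDFS blocks a file needs and its raw storage with replication.
public class BlockMath {
    // Ceiling division: a file shorter than one block still occupies one block.
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blocks = blocksFor(200 * mb, 64 * mb); // 200 MB file, 64 MB blocks
        System.out.println(blocks + " blocks");     // 4 blocks (the last holds only 8 MB)
        System.out.println(blocks + " blocks x 3 replicas = " + blocks * 3 + " block replicas");
    }
}
```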
Hadoop basic commands
Commands Description
ls
This command is used to list all the files. Use -ls -R (older versions: -lsr) for a recursive
listing; it is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains
executables, so bin/hdfs means we want the hdfs executable, particularly
the dfs (Distributed File System) commands.
mkdir
mkdir: To create a directory. In Hadoop dfs there is no home
directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
creating home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your
computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.
touchz
It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
copyFromLocal
(or) put
To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present
on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on
hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we
want to copy to folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
cat
To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt
copyToLocal (or)
get
To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on
Desktop.
moveFromLocal
This command will move a file from the local filesystem to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
cp
This command is used to copy files within hdfs. Let's copy the folder geeks
to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
mv
This command is used to move files within hdfs. Let's cut-paste the file
myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
rmr
This command deletes a file from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory. (In newer
Hadoop versions, rmr is deprecated in favor of rm -r.)
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside
the directory then the directory itself.
du
It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
dus
This command will give the total size of directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
stat
This command gives statistics (such as size and modification time) about
the given file/directory.
Syntax:
bin/hdfs dfs -stat <path>
Example:
bin/hdfs dfs -stat /geeks
setrep
This command is used to change the replication factor of a
file/directory in HDFS. By default it is 3 for anything stored
in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored
in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for a directory
geeksInput stored in HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: The -w means wait till the replication is completed. And -R
means recursively, we use it for directories as they may also contain
many files and folders inside them.
Hadoop Distributions
Distro Remarks Free / Premium
Apache
hadoop.apache.org
o The Hadoop Source
o No packaging except tarballs
o No extra tools
Completely free and
open source
Cloudera
www.cloudera.com
o Oldest distro
o Very polished
o Comes with good tools to
install and manage a Hadoop
cluster
Free / Premium
model (depending on
cluster size)
HortonWorks
www.hortonworks.com
o Newer distro
o Tracks Apache Hadoop closely
o Comes with tools to manage
and administer a cluster
Completely open
source
MapR
www.mapr.com
o MapR has its own file system (an alternative to HDFS)
o Boasts higher performance
o Nice set of tools to manage
and administer a cluster
o Does not suffer from Single
Point of Failure
o Offers some cool features like mirroring, snapshots, etc.
Free / Premium
model
Intel
hadoop.intel.com
o Encryption support
o Hardware acceleration added
to some layers of stack to
boost performance
o Admin tools to deploy and
manage Hadoop
Premium
Pivotal HD
gopivotal.com
o Fast SQL on Hadoop
o Software only or appliance
Premium
Working with Hadoop under Eclipse
Here are instructions for setting up a development environment for Hadoop under the
Eclipse IDE. Please feel free to make additions or modifications to this page.
This document assumes you already have Eclipse downloaded, installed, and configured to
your liking. It also assumes that you are aware of the HowToContribute page and have
given that a read.
Quick Start
We will begin by downloading the Hadoop source. The hadoop-common source tree has
three subprojects underneath it that you will see after you pull down the source code:
hadoop-common, hdfs, and mapreduce.
Let's begin by getting the latest source from Git (note there is a copy mirrored on GitHub
but it lags the Apache read-only git repository slightly).
git clone git://git.apache.org/hadoop-common.git
This will create a hadoop-common folder in your current directory, if you "cd" into that
folder you will see all the available subprojects. Now we will build the code to get it ready for
importing into Eclipse.
From this directory you just 'cd'-ed into (which is also known as the top-level directory of a
branch or a trunk checkout), perform:
$ mvn install -DskipTests
$ mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
Note: This may take a while the first time, as all libraries are fetched from the internet, and
the whole build is performed.
In Eclipse
After the above, do the following to finally have projects in Eclipse ready and waiting for you
to go on that scratch-itching development spree:
For Common
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-common-project directory as the root directory
• Select the hadoop-annotations, hadoop-auth, hadoop-auth-examples, hadoop-nfs and
hadoop-common projects
• Click "Finish"
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-assemblies directory as the root directory
• Select the hadoop-assemblies project
• Click "Finish"
• To get the projects to build cleanly:
o Add target/generated-test-sources/java as a source directory for hadoop-common
o You may have to add then remove the JRE System Library to avoid errors due to
access restrictions
For HDFS
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-hdfs-project directory as the root directory
• Select the hadoop-hdfs project
• Click "Finish"
For MapReduce
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-mapreduce-project directory as the root directory
• Select the hadoop-mapreduce-project project
• Click "Finish"
For YARN
• File -> Import...
• Choose "Existing Projects into Workspace"
• Select the hadoop-yarn-project directory as the root directory
• Select the hadoop-yarn-project project
• Click "Finish"
Note: in the case of MapReduce the testjar package is broken. This is expected, since it is
part of a testcase that checks for incorrect packaging, and is nothing to worry about.
To run tests from Eclipse you need to additionally do the following:
• Under project Properties, select Java Build Path, and the Libraries tab
• Click "Add External Class Folder" and select the build directory of the current project