HADOOP - A BRIEF DESCRIPTION:
Hadoop is an open source framework designed to work with large datasets in a
distributed computing environment. It is an Apache project, released under the
Apache License. Hadoop provides fast data processing across nodes, together with
high availability and fault tolerance for Hadoop clusters.
Hadoop is a powerful platform for processing data and coordinating its movement
across various architectural components.
Google originally built GFS (Google File System), which works mainly on part files.
These part files are small chunks, or blocks, of a larger file. Hadoop was designed
after GFS.
In April 2008, Hadoop established itself as a powerful system by sorting a terabyte
of data on a 910-node cluster in less than 4 minutes.
In April 2009, 500 GB of data was sorted in 59 seconds on 1406 Hadoop nodes,
and 1 TB of data was sorted in 62 seconds on the same cluster.
The hardware used for these sorts was:
2 quad-core Xeons at 2.0 GHz per node
4 SATA disks per node
8 GB RAM per node
1 Gb Ethernet on each node
40 nodes per rack
Red Hat Enterprise Linux Server release 5.1
Sun JDK 1.6.0
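To put the April 2009 sort figures quoted above in perspective, the overall and
per-node throughput can be worked out directly. This is only a rough sketch: it
assumes decimal units (1 TB = 1000 GB, 1 GB = 1000 MB), and the function name
sort_throughput is made up for this illustration.

```python
def sort_throughput(data_gb, seconds, nodes):
    """Return (aggregate GB/s, per-node MB/s) for one sort run."""
    aggregate_gb_per_s = data_gb / seconds
    # Assumption: decimal units, so 1 GB = 1000 MB.
    per_node_mb_per_s = aggregate_gb_per_s * 1000 / nodes
    return aggregate_gb_per_s, per_node_mb_per_s


# Figures quoted in the text for April 2009.
runs = {
    "500 GB in 59 s on 1406 nodes": (500, 59, 1406),
    "1 TB in 62 s on 1406 nodes": (1000, 62, 1406),
}

for label, (data_gb, seconds, nodes) in runs.items():
    total, per_node = sort_throughput(data_gb, seconds, nodes)
    print(f"{label}: {total:.1f} GB/s overall, {per_node:.1f} MB/s per node")
```

Under these assumptions the cluster as a whole moves roughly 8 to 16 GB per
second, while each node handles only about 6 to 11 MB per second. The speed
comes from the number of nodes working in parallel.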
A Hadoop distribution comes with the Hadoop kernel, HDFS, and MapReduce. We can
add more components or Hadoop subprojects such as Hive, HBase, ZooKeeper, and Sqoop.
Hadoop uses a file system known as HDFS (Hadoop Distributed File System),
which maintains the physical files in Hadoop.
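The idea of storing a file as blocks, mentioned above for GFS, carries over to
HDFS. Here is a minimal sketch of the splitting step in plain Python. It is not
HDFS code: the function name split_into_blocks and the tiny block size are
assumptions chosen so the result is easy to read.

```python
def split_into_blocks(data, block_size):
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# A toy 10-byte "file" and a toy 4-byte block size.
# Real HDFS blocks are far larger (64 MB was the default in Hadoop 1.x).
blocks = split_into_blocks(b"0123456789", 4)
for number, block in enumerate(blocks):
    print(number, block)
```

The last block is simply shorter than the others, and joining the blocks in
order gives back the original bytes.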
Hadoop internally uses the MapReduce paradigm, which consists of a Mapper section
and a Reducer section. The data is divided into small chunks, stored as part files
in HDFS, and MapReduce performs distributed data processing over those chunks.
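The Mapper and Reducer sections can be illustrated with the classic word count.
The sketch below runs in a single Python process and does not use the Hadoop
API; the function names (mapper, shuffle, reducer, word_count) and the sample
chunks are assumptions made for this example.

```python
from collections import defaultdict


def mapper(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of text chunks."""
    pairs = (pair for chunk in chunks for pair in mapper(chunk))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Two toy chunks standing in for part files.
chunks = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(chunks)
print(counts)
```

In a real cluster each chunk would be mapped on a different node, and the
grouping step would move data across the network. Here everything happens in
memory, to show only the shape of the computation.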
Hadoop runs on commodity hardware. This means Hadoop works on existing
infrastructure, reused machines, and low- or mid-range systems. There is no need
for high-end machines to work with Hadoop. Generally we need two quad-core
processors at 2.25 GHz, 16-24 GB of RAM, and 1 TB SATA hard disks.
When we are working with Hadoop, we need to configure the NameNode, DataNode,
JobTracker, and TaskTracker. How to configure them is covered in how to set up Hadoop.
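As a rough preview of what that configuration looks like, the sketch below
generates the three small XML files that a Hadoop 1.x style setup reads. It
assumes the Hadoop 1.x property names fs.default.name, dfs.replication, and
mapred.job.tracker. The host names, ports, and replication value are
placeholders for a single machine, and the helper build_site_xml is
hypothetical. It is not a replacement for the full setup guide.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of settings in the <configuration> layout Hadoop reads."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed single-machine values; host names and ports are placeholders.
site_files = {
    # Address of the NameNode, which DataNodes and clients connect to.
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    # Number of copies kept of each block.
    "hdfs-site.xml": {"dfs.replication": "1"},
    # Address of the JobTracker, which TaskTrackers report to.
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}

for filename, properties in site_files.items():
    print(filename)
    print(build_site_xml(properties))
```

DataNodes and TaskTrackers need no address of their own in this sketch. They
find their masters through the NameNode and JobTracker addresses.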
Hadoop is written in Java and runs in any environment where a JVM is available.
Most of the time we run Hadoop on Ubuntu or CentOS. To work in a Windows
environment we need the tool Cygwin.
Hadoop distributions and commercial support are provided by Cloudera, MapR,
Hortonworks, and other vendors.
Hadoop comes with its subprojects, which we can call the Hadoop ecosystem:
Data storage: HDFS, HBase
Data processing: MapReduce, Hive, Pig
Data coordination between the components: ZooKeeper
Data import and export: Sqoop
Data serialization: Avro
Data in log files: Flume
Apart from the above, the wider Hadoop ecosystem includes or integrates with more
components: Ambari, HCatalog, Mahout, Oozie, Cassandra, Cascading, Vertica,
Common, etc.
The datasets we work on in Hadoop generally include:
Users' entire browsing history
User interaction logs
User interaction history
User transaction history
User tweets generated from Twitter
Climate sensor data
And we use Hadoop to analyze this data, for example:
Identifying the customers who are most important
Identifying the best time to perform maintenance based on usage patterns
Analyzing brand reputation and social media
Web URL: http://www.beinghadoop.com
Facebook ID: hadoopframework