Hadoop description


Published on

Hadoop defination, Why hadoop, Hadoop framework, Hadoop ecosystem, Apache hadoop

Published in: Education
  • Be the first to comment

  • Be the first to like this

Hadoop description

  1. 1. http://www.Beinghadoop.com hadoopframework@gmail.com HADOOP-A BRIEF DESCRIPTION: Hadoop is an Open source Framework designed to work with large datasets in a distributed computing Environment. It is a part of Apache Project, under license of Apache. Hadoop provides faster data processing between the nodes, Provides high availability, and fault tolerance to Hadoop clusters. Hadoop is a powerful platform for processing, coordinating the movements of data across various architectural components. Initially Google started GFS (Google file system) which mainly works on part files. These part files are like small chunk size files or blocks. Hadoop designed after GFS. In April 2008, Hadoop established as a power full system to sort terabyte of data running on 910 node cluster in less than 4 minutes. In April 2009 500 GB of data sorted in 59 seconds on 1406 Hadoop nodes. And 1Tb of data sorted in 62 seconds on same clusters. The hardware used during sorting is 2 quad core xeons at 2.0 GHz per node 4 sata disks per node, 8 gb ram per node, 1 Gb ethernet on each node, 40 nodes per rack, redhat linux server release 5.1, sun jdk 1.6.0. (http://sort.benchmark.org/yahoohadoop.pdf"). Hadoop distribution comes with Hadoop kernel, Hdfs, MapReduce. We can add more components or Hadoop sub projects like Hive, Hbase, Zookeeper, Sqoop etc.. Hadoop uses a file system known as HDFS (Hadoop distributed file system) which is used maintain physical files in Hadoop. Hadoop internally uses Mapreduce paradigm, which contains Mapper section, Reducer section, which divides data into small sized chunks, stored as part files in Hdfs. It performs distributed data processing, data access patterns.
  2. 2. http://www.Beinghadoop.com hadoopframework@gmail.com Hadoop runs on commodity hardware. It means Hadoop supports existing infrastructure, reusable machines, low or mid range systems. No need to go for high end machines to work with Hadoop. Generally we need a two quad-core processor with 2.25 GHz CPU's, 16-24 GB Ram, 1 Tb Sata hard disks. When we are working with Hadoop, we need to configure namenode, data node, job tracker, task tracker. How to configure I will teach you in how to setup Hadoop cluster notes. Hadoop is written in Java, runs on any environment where JVM is available. Most of the times we use Hadoop to work in Ubuntu, centos. To work in windows environment we need a tool Cygwin. Hadoop distribution, commercial support provided by Cloudera, MapR, Hortonworks, karmasphere. Refer(http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Sup port) for Hadoop commercial support. Hadoop comes with its sub projects, we can say Hadoop ecosystems or Hadoop components. Data storage: Hdfs, Hbase Data processing: MapReduce, Hive, Pig. Data Coordination between the components: zookeeper Data import and Exports: Sqoop Data serializations: Avro Data in log files: Flume Apart from above Hadoop includes more components Ambari, Hcatalog, Mahout, Oozie, Cassandra, Cascading, Vertica, common etc.. The data or Dataset which we work on Hadoop Generally Includes: Users Entire browsing History Weblogs User interaction logs User interaction history User transaction history User tweets list generated from twitter Climate sensor data And we use Hadoop to How to Analyze Data. Identify customers who are most important Identifying the best time to perform maintenance based on usage patterns Analyzing brands reputations, analyzing social media. For comments, queries mail to click here Web Url : http://www.beinghadoop.com Facebook id: hadoopframework