Introduction to Hadoop
Chandra Chikkareddy
What is Hadoop?
Hadoop is an open source project overseen by the Apache Software
Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop committers work at several different organizations – including
Cloudera, Yahoo!, Facebook, and LinkedIn
Hadoop takes a radically new approach to the problem of
distributed computing – distribute the data as it is
initially stored in the system, and have each node work
on the data local to it.
Hadoop details
Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several
thousand
HDFS overview
Distributed file system designed to run on commodity hardware
HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware
HDFS provides high-throughput access to application data and is
suitable for applications that have large data sets
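The slides stay at the conceptual level; as a concrete illustration, the sketch below copies a local file into HDFS through Hadoop's Java FileSystem API. The cluster configuration and all paths (/tmp/input.txt, /user/demo) are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml
        // on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS replicates its blocks
        // across DataNodes
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // List the target directory to confirm the upload
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}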
MapReduce overview
MapReduce is a method for distributing a task across multiple
nodes
Each node processes data stored on that node
– Where possible
Consists of two phases:
– Map
– Reduce
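To make the two phases concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API; the class names and the whitespace tokenization are illustrative choices, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each input line, emit (word, 1) per word
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum all the counts emitted for the same word
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}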
MapReduce overview (continued)
Data is distributed across nodes at load time
Source: http://developer.yahoo.com/hadoop/tutorial/module1.html
Map and reduce tasks run on the nodes where individual
records of data are already present.
Source: http://developer.yahoo.com/hadoop/tutorial/module1.html
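Note that data locality is handled by the framework, not the programmer: a driver only names the mapper, reducer, and input/output paths, and Hadoop schedules each map task on (or near) a node holding that task's input block. A minimal driver sketch, reusing the hypothetical word-count classes above (paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // The framework, not this code, decides which node runs each map
        // task, preferring nodes that already hold the input block
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}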
Hadoop stack
(image-only slide)
In Summary
• Data is growing more rapidly than traditional tools & techniques can handle.
• New tools & techniques are needed.
• A Big Data system such as Hadoop is an inexpensive way to get started in Big Data.
• The Hadoop ecosystem is robust and expanding quickly.
• Big Data can be complicated – plan ahead and understand it before you need it.
