Introduction to Hadoop
Chandra Chikkareddy
What is Hadoop?
Hadoop is an open source project overseen by the Apache Software
Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop committers work at several different organizations – including
Cloudera, Yahoo!, Facebook, and LinkedIn
Hadoop takes a radically new approach to the problem of
distributed computing – distribute the data as it is
initially stored in the system, and have each node work
on the data local to it.
Hadoop details
Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several
thousand
HDFS overview
Distributed file system designed to run on commodity hardware
HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware
HDFS provides high-throughput access to application data and is
suitable for applications that have large data sets
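The slides stay at the conceptual level; as a concrete illustration, the sketch below copies a local file into HDFS through Hadoop's Java FileSystem API. The cluster configuration and all paths (/tmp/input.txt, /user/demo) are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml
        // on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS replicates its blocks
        // across DataNodes
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // List the target directory to confirm the upload
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}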
MapReduce overview
MapReduce is a method for distributing a task across multiple
nodes
Each node processes data stored on that node
– Where possible
Consists of two phases:
– Map
– Reduce
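To make the two phases concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API; the class names and the whitespace tokenization are illustrative choices, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each input line, emit (word, 1) per word
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum all the counts emitted for the same word
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}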
MapReduce overview (continued)
Data is distributed across nodes at load time
Source: http://developer.yahoo.com/hadoop/tutorial/module1.html
Map and reduce tasks run on the nodes where individual
records of data are already present.
Source: http://developer.yahoo.com/hadoop/tutorial/module1.html
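Note that data locality is handled by the framework, not the programmer: a driver only names the mapper, reducer, and input/output paths, and Hadoop schedules each map task on (or near) a node holding that task's input block. A minimal driver sketch, reusing the hypothetical word-count classes above (paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // The framework, not this code, decides which node runs each map
        // task, preferring nodes that already hold the input block
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}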
Hadoop stack
(image-only slide)
In Summary
• Data is growing more rapidly than traditional tools & techniques can handle.
• New tools & techniques are needed.
• A Big Data system such as Hadoop is an inexpensive way to get started in Big Data.
• The Hadoop ecosystem is robust and expanding quickly.
• Big Data can be complicated – plan ahead and understand it before you need it.
