HADOOP DISTRIBUTED FILE
SYSTEM AND MAPREDUCE
BY
Eadara Harsha Siva Sai
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CONTENTS
1. INTRODUCTION TO HADOOP
2. HADOOP ARCHITECTURE
3. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
4. MAPREDUCE
INTRODUCTION TO HADOOP
What is Hadoop?
• Hadoop is an open-source framework for storing and
processing big data sets using distributed computing.
• Hadoop was introduced by Doug Cutting.
• It was designed to answer the question: “How do we
process big data at reasonable cost and in reasonable
time?”
INTRODUCTION TO HADOOP
• In a traditional non-distributed architecture, data is
stored on one server, and every client program accesses
that central server to read the data.
• The non-distributed model has a few issues. In this
model, you mostly scale vertically, by adding more
CPUs, more storage, and so on.
• This architecture is also not reliable: if the main server
fails, you must restore the data from a backup, and
access to huge data sets is slow.
INTRODUCTION TO HADOOP
In a Hadoop distributed architecture:
• Each server offers local computation and storage. That is,
when you run a query against a large data set, every server
in the cluster executes the query on its own machine
against its local portion of the data. Finally, the result sets
from all these local servers are consolidated.
• You don’t need one powerful server. Instead, you use
several inexpensive commodity servers as individual
Hadoop nodes. If any node fails, the cluster can still return
the data set correctly, because Hadoop replicates and
distributes the data across multiple nodes.
• Hadoop is written in Java, so it can run on any platform.
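The scatter-gather idea above can be sketched in plain Python. This is only a simulation of the pattern, not real Hadoop code; the node data and the query are made-up examples:

```python
# Scatter-gather sketch: each "node" holds a local slice of the data set,
# runs the same query locally, and the partial results are consolidated.
# A real Hadoop cluster does this via HDFS and MapReduce, not Python lists.

nodes = [
    [3, 14, 7],    # local data on node 1 (illustrative values)
    [20, 1],       # local data on node 2
    [9, 16, 5],    # local data on node 3
]

def local_query(data, threshold):
    """Each node counts how many of its local records exceed the threshold."""
    return sum(1 for x in data if x > threshold)

# Every node executes the query against its own local data set...
partial_results = [local_query(data, threshold=8) for data in nodes]

# ...and the result sets from all the local servers are consolidated.
total = sum(partial_results)
print(partial_results, total)
```

Note that no node ever ships its raw data to a central server; only the small partial results travel over the network.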
HADOOP ARCHITECTURE
Hadoop’s architecture includes:
1. Name Node
2. Secondary Name Node
3. Job Tracker
4. Data Node
5. Task Tracker
HADOOP ARCHITECTURE
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• HDFS is Hadoop’s system for distributing data across the
data nodes. A typical HDFS block size is 64 MB.
• A single Name Node manages the file system metadata.
It divides the data into 64 MB blocks.
• The Name Node decides which data node each block is
sent to, and it also instructs that data node to store
replicas of the block on two other nodes.
• After storing the data, each data node reports back how
much space it has available.
• Every 3 seconds, each data node sends a heartbeat to the
Name Node. If a data node fails to send a heartbeat, the
Name Node waits for 30 seconds; if no heartbeat arrives
in that time, it declares the data node dead.
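The block-splitting, replication, and heartbeat rules above can be sketched as follows. The node names and the round-robin placement are illustrative only; a real Name Node uses rack-aware placement. The 64 MB block size, 3 copies, and 30-second timeout match the figures stated above:

```python
import math

BLOCK_SIZE_MB = 64   # typical HDFS block size, as stated above
REPLICATION = 3      # one primary copy plus replicas on two other nodes

def split_into_blocks(file_size_mb):
    """Number of 64 MB blocks the Name Node divides a file into."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def place_replicas(block_id, data_nodes):
    """Toy placement: pick 3 distinct data nodes for one block (round-robin).
    A real Name Node uses rack-aware placement; this is only a sketch."""
    n = len(data_nodes)
    return [data_nodes[(block_id + i) % n] for i in range(REPLICATION)]

def node_is_dead(seconds_since_last_heartbeat):
    """Data nodes heartbeat every 3 s; after 30 s of silence the
    Name Node declares the node dead (per the figures above)."""
    return seconds_since_last_heartbeat > 30

data_nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names
print(split_into_blocks(200))               # a 200 MB file becomes 4 blocks
print(place_replicas(0, data_nodes))        # 3 nodes chosen for block 0
print(node_is_dead(31))                     # past the 30 s grace period
```

A 200 MB file therefore occupies 4 blocks, and with 3-way replication it consumes roughly 600 MB of cluster storage in total.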
HADOOP DISTRIBUTED FILE SYSTEM(HDFS)
MAPREDUCE
• MapReduce is the process of obtaining output from
(getting back) your stored data. The following describes
MapReduce using the Hadoop framework.
• When the client wants output from the stored data, the
client writes a program and sends it to the Job Tracker.
• The Job Tracker asks the Name Node whether metadata
has been created for this data. If it has, the Name Node
sends the metadata.
• The Job Tracker then orders the data nodes to process
the data they hold.
• After processing, each data node sends its output to the
Job Tracker. The Job Tracker then sends the collected
outputs to another data node to produce the final output.
MAPREDUCE
• After receiving the final output, the Job Tracker sends it
to the client.
• If a data node fails while processing the data, the Job
Tracker orders another data node, one holding a replica
of the file, to process that data instead.
• After receiving the outputs from the data nodes, the Job
Tracker sees which data node has the least work at that
moment and sends the outputs there to compute the
final output.
• The following diagram explains MapReduce.
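The flow described above is easiest to see with the classic word-count example, sketched here in plain Python. This simulates the map, shuffle, and reduce phases in a single process; in a real Hadoop job these phases run across Task Trackers on many data nodes, and the input lines here are made up:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

splits = ["hadoop stores big data", "hadoop processes big data"]

# Each "data node" maps its own split locally...
mapped = [pair for split in splits for pair in map_phase(split)]
# ...the pairs are grouped by key...
grouped = shuffle(mapped)
# ...and a reducer consolidates the final output, which the
# Job Tracker would then return to the client.
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)
```

Because each map call touches only its own split, the work parallelizes naturally, which is exactly why the Job Tracker can reassign a failed node’s split to any node holding a replica.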
MAPREDUCE
ADVANTAGES AND DISADVANTAGES
ADVANTAGES:
1. Cost effective
2. Flexible
3. Fast
4. Resilient to failure
DISADVANTAGES:
1. Security concerns
2. Not fit for small data
3. Potential stability issues
CONCLUSION
• Facebook, Google, Amazon, Flipkart, and others use
Hadoop.
• Hadoop solves many problems in storing data in the
cloud. Because Hadoop is open source, it is free, and it
can run on any operating system.
ANY QUESTIONS?
THANK YOU!
