HADOOP AND HDFS presented by Vijay Pratap Singh

Hadoop is an open-source software framework.
The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
Supports data-intensive distributed applications.
Licensed under the Apache v2 license.
It enables applications to work with thousands of independent computers and petabytes of data.

Hadoop is a popular open-source implementation of MapReduce.
MapReduce is a programming model for processing large data sets.
MapReduce is typically used for distributed computing on clusters of computers.
MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease transmission of data.
The model is inspired by the map and reduce functions commonly used in functional programming.
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output (see the word-count sketch below).

Highly scalable file system
◦ 6,000 nodes and 120 PB
◦ Add commodity servers and disks to scale storage and I/O bandwidth
Supports parallel reading & processing of data
◦ Optimized for streaming reads/writes of large files (see the client read/write sketch after this list)
◦ Bandwidth scales linearly with the number of nodes and disks
Fault tolerant & easy management
◦ Built-in redundancy
◦ Tolerates disk and node failure
◦ Automatically manages addition/removal of nodes
◦ One operator per 3,000 nodes
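
As a rough illustration of these read/write characteristics, the sketch below uses the HDFS Java client API (org.apache.hadoop.fs.FileSystem) to write a small file and stream it back. The path, the file contents and the replication factor are invented for the example, and the NameNode address is assumed to come from the cluster's core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS (the NameNode URI) is read from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/file.txt");   // hypothetical path

    // Write: the client asks the NameNode for target DataNodes and streams the
    // block to the first one, which pipelines it to the remaining replicas.
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Redundancy is per file: ask for three replicas of this file's blocks.
    fs.setReplication(file, (short) 3);

    // Read: the client asks the NameNode where the blocks live, then streams
    // each block directly from a DataNode that holds a replica.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}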

Very Large Distributed File System
◦ 10,000 nodes, 100 million files, 10 PB
Assumes Commodity Hardware
◦ Files are replicated to handle hardware failure
◦ Detect failures and recover from them
Optimized for Batch Processing
◦ Data locations exposed so that computations can move to where data resides (see the block-location sketch after this list)
◦ Provides very high aggregate bandwidth
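
Exposing data locations is what lets a scheduler move computation to the data. The sketch below asks the NameNode, through the public FileSystem API, which DataNodes hold each block of a file; the file path passed on the command line is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);            // e.g. /tmp/file.txt (hypothetical)
    FileStatus status = fs.getFileStatus(file);

    // The NameNode reports, for every block of the file, which DataNodes hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}

MapReduce uses this same information to try to run each map task on, or close to, a node that already stores the input split it will read.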

HDFS provides a reliable, scalable, and manageable solution for working with huge amounts of data.
Future-proof
HDFS has been deployed in clusters of 10 to 4,000 DataNodes.
◦ Used in production at companies such as Yahoo!, Facebook, Twitter, and eBay.
◦ Many enterprises, including financial companies, use Hadoop.

Transcript

  • 1. Hadoop Distributed File System (HDFS) SEMINAR GUIDE Mr. PRAMOD PAVITHRAN HEAD OF DIVISION COMPUTER SCIENCE & ENGINEERING SCHOOL OF ENGINEERING, CUSAT PRESENTED BY VIJAY PRATAP SINGH REG NO: 12110083 S7, CS-B ROLL NO: 81
  • 2. CONTENTS WHAT IS HADOOP PROJECT COMPONENTS IN HADOOP MAP/REDUCE HDFS ARCHITECTURE GOALS OF HADOOP COMPARISON WITH OTHER SYSTEMS CONCLUSION REFERENCES
  • 3. WHAT IS HADOOP…???
  • 4. WHAT IS HADOOP…???
  • 5. WHAT IS HADOOP…???
  • 6. WHAT IS HADOOP…??? o Hadoop is an open-source software framework. o The Hadoop framework consists of two main layers o Distributed file system (HDFS) o Execution engine (MapReduce) o Supports data-intensive distributed applications. o Licensed under the Apache v2 license. o It enables applications to work with thousands of independent computers and petabytes of data
  • 7. WHY HADOOP…???
  • 8. PROJECT COMPONENTS IN HADOOP
  • 9. MAP/REDUCE o Hadoop is a popular open-source implementation of MapReduce o MapReduce is a programming model for processing large data sets o MapReduce is typically used for distributed computing on clusters of computers o MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease transmission of data. o The model is inspired by the map and reduce functions commonly used in functional programming. o "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node. o "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output
  • 10. HDFS Highly scalable file system ◦ 6,000 nodes and 120 PB ◦ Add commodity servers and disks to scale storage and I/O bandwidth Supports parallel reading & processing of data ◦ Optimized for streaming reads/writes of large files ◦ Bandwidth scales linearly with the number of nodes and disks Fault tolerant & easy management ◦ Built-in redundancy ◦ Tolerates disk and node failure ◦ Automatically manages addition/removal of nodes ◦ One operator per 3,000 nodes Scalable, Reliable & Manageable
  • 11. ISSUES IN CURRENT SYSTEM
  • 12. BIG DATA
  • 13. INCREASING BIG DATA
  • 14. HADOOP’S APPROACH Big Data Computation Computation Computation Computation Combined Result
  • 15. ARCHITECTURE OF HADOOP
  • 16. HADOOP MASTER/SLAVE ARCHITECTURE
  • 17. MAP REDUCE ENGINE
  • 18. MAP REDUCE ENGINE
  • 19. ARCHITECTURE OF HDFS
  • 20. ARCHITECTURE OF HDFS
  • 21. CLIENT INTERACTION TO HADOOP
  • 22. HDFS WRITE [diagram: the client asks the NameNode to write block A of file.txt; using rack awareness, the NameNode replies with target DataNodes 1, 7 and 9, and the client writes the block to them]
  • 23. HDFS WRITE (PIPELINED) [diagram: the block is pipelined from DataNode to DataNode; each replica reports "Block Received", and the NameNode records the metadata file.txt = block A on DataNodes 1, 7, 9]
  • 24. HDFS READ [diagram: the client asks the NameNode to read block A of file.txt; the NameNode replies that it is available at DataNodes 1, 7 and 9, and the client reads it from one of them]
  • 25. GOALS OF HDFS Very Large Distributed File System ◦ 10,000 nodes, 100 million files, 10 PB Assumes Commodity Hardware ◦ Files are replicated to handle hardware failure ◦ Detect failures and recover from them Optimized for Batch Processing ◦ Data locations exposed so that computations can move to where data resides ◦ Provides very high aggregate bandwidth
  • 26. SCALABILITY OF HADOOP
  • 27. EASE TO PROGRAMMERS
  • 28. HADOOP VS. OTHER SYSTEMS
  • 29. HADOOP USERS
  • 30. TO LEARN MORE Source code ◦ http://hadoop.apache.org/version_control.html ◦ http://svn.apache.org/viewvc/hadoop/common/trunk/ Hadoop releases ◦ http://hadoop.apache.org/releases.html Contribute to it ◦ http://wiki.apache.org/hadoop/HowToContribute
  • 31. CONCLUSION HDFS provides a reliable, scalable and manageable solution for working with huge amounts of data Future-proof HDFS has been deployed in clusters of 10 to 4,000 DataNodes ◦ Used in production at companies such as Yahoo!, Facebook, Twitter and eBay ◦ Many enterprises including financial companies use Hadoop
  • 32. REFERENCES [1] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 23–34, 2007. [2] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, Third Edition, May 2012. [3] Jeffrey Shafer, Scott Rixner, and Alan L. Cox, The Hadoop Distributed Filesystem: Balancing Portability and Performance, Rice University, Houston, TX. [4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, The Hadoop Distributed File System, Yahoo!, Sunnyvale, California, USA. [5] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Information Systems Group, Saarland University, Efficient Big Data Processing in Hadoop MapReduce.
  • 33. Thank you…
  • 34. Queries