Apache Hadoop HDFS

A short presentation to describe Apache HDFS

Published in: Technology, Education

  1. Apache Hadoop HDFS
     ● What is it?
     ● What is it for?
     ● Architecture
     ● Resilience
     ● Administration
     ● Data access
     ● Future changes?
  2. HDFS – What is it?
     ● HDFS = Hadoop Distributed File System
     ● It is a distributed file system
     ● Runs on low-cost hardware
     ● It is open source
     ● Written in Java
     ● Fault tolerant
     ● Designed for very large data sets
     ● Tuned for high throughput
  3. HDFS – What is it for?
     ● Designed for batch processing
     ● Streaming access to data
     ● Large data sizes, e.g. terabytes
     ● Highly reliable through data replication
     ● Supports very large node clusters
     ● Supports large files
     ● Supports file counts into the millions
  4. HDFS – Architecture
  5. HDFS – Architecture
     ● Has a master / slave architecture
     ● A master NameNode
       – Controls file system operations
       – Maps data blocks to DataNodes
       – Logs all changes
     ● Slave DataNodes
       – Store file blocks
       – Store replicated data
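
The NameNode's job of mapping data blocks to DataNodes (slide 5) can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the function names, the tiny block size, and the round-robin placement are all assumptions made for illustration, and real HDFS uses a rack-aware placement policy instead.

```python
def split_into_blocks(file_size, block_size):
    """Return the list of block lengths for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication):
    """Map each block index to `replication` distinct DataNodes.

    Round-robin placement is a stand-in chosen for simplicity; it is
    not the placement policy HDFS actually uses.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical numbers: a 300-byte file, 128-byte blocks, four DataNodes.
blocks = split_into_blocks(file_size=300, block_size=128)
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
for block, nodes in block_map.items():
    print(block, blocks[block], nodes)
```

The dictionary plays the role of the NameNode's metadata: it holds no file data, only which DataNodes store each block.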
  6. HDFS – Resilience
     ● Data is replicated across DataNodes
     ● Nodes may fail but data is still available
     ● DataNodes report their state via heartbeat messages
     ● The master NameNode is a single point of failure
     ● Data integrity is verified via checksums
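
The resilience points on slide 6 (heartbeats, surviving replicas, checksums) can be sketched in a few lines. This is an illustrative model only: the timeout value, the function names, and the use of `zlib.crc32` as the checksum are assumptions, not details taken from the slides or from the HDFS implementation.

```python
import zlib

HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value for this sketch


def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the DataNodes whose last heartbeat is recent enough."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout}


def readable_replicas(block_map, alive):
    """For each block, keep only the replicas held by live DataNodes."""
    return {b: [n for n in nodes if n in alive] for b, nodes in block_map.items()}


def verify(data, expected_crc):
    """Check block contents against a stored CRC-32 checksum."""
    return zlib.crc32(data) == expected_crc


# dn3 last reported 65 seconds ago, so it is treated as failed.
last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
alive = live_nodes(last_heartbeat, now=105.0)
replicas = readable_replicas({0: ["dn1", "dn2", "dn3"]}, alive)
print(replicas)  # block 0 is still readable from dn1 and dn2

stored_crc = zlib.crc32(b"block payload")
print(verify(b"block payload", stored_crc))   # intact copy
print(verify(b"block pAyload", stored_crc))   # corrupted copy
```

Because each block has several replicas, losing one DataNode leaves the block readable, and a failed checksum tells a reader to fetch a different replica.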
  7. HDFS – Administration
     ● Access via Java API
     ● FS shell command language
     ● HTTP browser
     ● C wrapper for Java API
     ● Space reclamation
       – Via control of the replication factor
       – Deleted files are moved to a trash folder
       – Trash folder is emptied after a configurable time
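
The space reclamation behaviour on slide 7 can be modelled as follows. This is a simplified sketch under stated assumptions: `ToyTrash`, `raw_bytes`, the paths, and the retention period are hypothetical names and values, and the code does not call any Hadoop API.

```python
def raw_bytes(file_size, replication):
    """Raw cluster space used by a file stored with the given replication factor."""
    return file_size * replication


class ToyTrash:
    """Toy file store in which deletes go to a trash area before being purged."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.files = {}  # live files: path -> data
        self.trash = {}  # trashed files: path -> (data, deleted_at)

    def delete(self, path, now):
        """Move a file to the trash instead of freeing its space immediately."""
        self.trash[path] = (self.files.pop(path), now)

    def restore(self, path):
        """Bring a file back while it is still in the trash."""
        data, _ = self.trash.pop(path)
        self.files[path] = data

    def purge(self, now):
        """Permanently remove trashed files older than the retention period."""
        expired = [p for p, (_, t) in self.trash.items() if now - t >= self.retention]
        for path in expired:
            del self.trash[path]
        return expired


store = ToyTrash(retention_seconds=3600)
store.files["/logs/a.txt"] = b"x" * 10
store.files["/logs/b.txt"] = b"y"
store.delete("/logs/a.txt", now=0)
store.delete("/logs/b.txt", now=3000)
store.restore("/logs/b.txt")      # changed our mind before the purge
purged = store.purge(now=3600)    # only a.txt has been in the trash long enough
print(purged, raw_bytes(10, 3), raw_bytes(10, 2))
```

Lowering the replication factor reduces raw usage proportionally, while the trash delay trades a short period of extra space for the chance to undo a delete.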
  8. HDFS – Future changes
     Things they might consider for HDFS
     ● File append
     ● User quotas
     ● File links
     ● Standby nodes
  9. Other Areas
     ● Want to know about:
       – Big Data
       – Nutch
       – Solr
     ● See my other presentations
  10. Contact Us
     ● Feel free to contact us at
       – www.semtech-solutions.co.nz
       – info@semtech-solutions.co.nz
     ● We offer IT project consultancy
     ● We are happy to hear about your problems
     ● You pay only for the hours you need to solve your problems