HDFS

Johan of last.fm talks about how to use HDFS in production. They do it, so can everyone else.


Published in: Technology, News & Politics
Transcript

  • 1. Johan Oskarsson, developer at Last.fm, Hadoop and Hive committer
  • 2. What is HDFS?
    The Hadoop Distributed FileSystem.
    Two server types:
      Namenode - keeps track of block locations
      Datanode - stores blocks
    Files are commonly split up into 128 MB blocks, replicated to 3 datanodes by default.
    Scales well: ~4000 nodes.
    Write once; large files.
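To make the write-once model concrete, here is a minimal sketch using Hadoop's org.apache.hadoop.fs.FileSystem client API: create a file, write it, close it, then stream it back. The path and the written content are placeholders, not anything from the talk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address (fs.default.name) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/johan/example.txt"); // hypothetical path

        // Write once: create the file, write, close. After close the file is effectively
        // immutable, since appends were not considered stable at the time of this talk.
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("hello hdfs");
        out.close();

        // Read it back: the client asks the namenode for block locations,
        // then streams the data directly from the datanodes.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}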
  • 3. "Can you use HDFS in production?"
  • 4. Yes
    We have used it in production since 2006, but then again we are insane.
  • 5. Who is using HDFS in production?
    Yahoo! - largest cluster 4000 nodes (14 PB raw storage)
    Facebook - 600 nodes (2 PB raw storage)
    Powerset (Microsoft) - "up to 400 instances"
    Last.fm - 31 nodes (110 TB raw storage)
    ... see more at http://wiki.apache.org/hadoop/PoweredBy
  • 6. What do they use Hadoop for?
    Yahoo! - search index, anti-spam, etc.
    Facebook - ad, profile and application monitoring, etc.
    Powerset - search index, heavy HBase users
    Last.fm - charts, A/B testing stats, site metrics and reporting
  • 7. "Does HDFS meet people's needs? If not, what can we do?"
  • 8. Use case - MR batch jobs
    Scenario:
      1. Large source data files are inserted into HDFS
      2. A MapReduce job is run
      3. Output is saved to HDFS
    HDFS is a great choice for this use case:
      Short downtime is acceptable
      Backups for important data
      Permissions + trash to avoid user error
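As an illustration of that pattern, here is a minimal word-count style driver written against the standard org.apache.hadoop.mapreduce API. The input and output paths are placeholders, and the job itself is just the canonical example, not anything Last.fm actually runs.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    // Emits (word, 1) for every whitespace-separated token in the input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word and writes the totals back to HDFS.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf) in later releases
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));     // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}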
  • 9. Use case - Serving files to a website
    Scenario:
      1. A user visits a website to browse photos
      2. Lots of image files are requested from HDFS
    Potential issues and solutions:
      HDFS isn't written for many small files; namenode RAM limits the number of files
        -> use HBase or similar
      Namenode goes down
        -> crazy "double cluster" solution, or standby namenode (HADOOP-4539)
      HDFS isn't really written for low response times
        -> work is being done, but it is not a high priority; use GlusterFS or MogileFS instead
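One common way to sidestep the small-files limit is to pack many small files into a single SequenceFile keyed by filename, so the namenode tracks one large file instead of millions of tiny ones. A rough sketch using the SequenceFile API follows; the local directory and the HDFS output path are made up for the example.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackImages {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One big SequenceFile on HDFS instead of millions of small image files.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/photos/packed.seq"), Text.class, BytesWritable.class);

        for (File img : new File("/local/photos").listFiles()) { // hypothetical local dir
            byte[] bytes = new byte[(int) img.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(img));
            in.readFully(bytes);
            in.close();
            // Key: original filename, value: raw image bytes.
            writer.append(new Text(img.getName()), new BytesWritable(bytes));
        }
        writer.close();
    }
}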
  • 10. Use case - Reliable, realtime log storage
    Scenario:
      1. A stream of logging events is generated
      2. The stream is written directly to HDFS
    Potential issues and solutions:
      Problems with long write sessions (HDFS-200, HADOOP-6099, HDFS-278)
      Namenode goes down
        -> crazy "double cluster" solution, or standby namenode (HADOOP-4539)
      Appends not stable (HDFS-265)
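The shape of that write path is a single output stream held open for the life of the log, flushed every so often. The sketch below assumes FSDataOutputStream's sync() (hflush() in later releases) and a made-up event source; as the slide says, long write sessions and appends were not stable at the time, so treat this as an illustration of the pattern rather than a recommendation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogStreamWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/logs/events.log"), true); // hypothetical path

        int written = 0;
        for (String event : nextEvents()) { // stand-in for a real event source
            out.writeBytes(event + "\n");
            if (++written % 1000 == 0) {
                out.sync(); // push buffered data to the datanodes (hflush() in later releases)
            }
        }
        out.close(); // only on close is the file length final in pre-append HDFS
    }

    // Hypothetical event source, only here to keep the sketch self-contained.
    private static Iterable<String> nextEvents() {
        return java.util.Arrays.asList("user=alice action=play", "user=bob action=scrobble");
    }
}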
  • 11. Potential dealbreakers
    The small files problem™ - use archives, sequencefiles or HBase
    Appends/sync not stable
    Namenode not highly available
    Relatively high latency reads
  • 12. Improvements
    In progress or completed:
      HADOOP-4539 - Streaming edits to a standby NN
      HDFS-265 - Appends
      HDFS-245 - Symbolic links
    Wish list:
      HDFS-209 - Tool to edit namenode metadata files
      HDFS-220 - Transparent data archiving off HDFS
      HDFS-503 - Reduce disk space used with erasure coding
  • 13. Competitors
    Hadoop MapReduce compatible:
      CloudStore - http://kosmosfs.sourceforge.net/
    Low response time:
      MogileFS - http://www.danga.com/mogilefs/
      GlusterFS - http://www.gluster.org/
