HDFS

    HDFS - Presentation Transcript

    1. Johan Oskarsson
        • Developer at Last.fm
        • Hadoop and Hive committer
    2. What is HDFS?
        • Hadoop Distributed FileSystem
        • Two server types
            • Namenode - keeps track of block locations
            • Datanode - stores blocks
        • Files commonly split up into 128 MB blocks
        • Replicated to 3 datanodes by default
        • Scales well: ~4000 nodes
        • Write once
        • Large files
    3. "Can you use HDFS in production?"
    4. Yes. We have used it in production since 2006, but then again we are insane.
    5. Who is using HDFS in production?
        • Yahoo! Largest cluster 4000 nodes (14 PB raw storage)
        • Facebook. 600 nodes (2 PB raw storage)
        • Powerset (Microsoft). "up to 400 instances"
        • Last.fm. 31 nodes (110 TB raw storage)
        • ... see more at http://wiki.apache.org/hadoop/PoweredBy
    6. What do they use Hadoop for?
        • Yahoo! search index, Yahoo! anti spam, etc.
        • Facebook: ad, profile and application monitoring, etc.
        • Powerset: search index, heavy HBase users
        • Last.fm: charts, A/B testing stats, site metrics and reporting
    7. "Does HDFS meet people's needs? If not, what can we do?"
    8. Use case - MR batch jobs
        • Scenario
            1. Large source data files are inserted into HDFS
            2. MapReduce job is run
            3. Output is saved to HDFS
        • HDFS is a great choice for this use case
            • Shorter downtime is acceptable
            • Backups for important data
            • Permissions + trash to avoid user error
    9. Use case - Serving files to a website
        • Scenario
            1. User visits a website to browse photos
            2. Lots of image files are requested from HDFS
        • Potential issues and solutions
            • HDFS isn't written for many small files
                • Namenode RAM limits number of files
                • Use HBase or similar
            • Namenode goes down
                • Crazy "double cluster" solution
                • Standby namenode HADOOP-4539
            • HDFS isn't really written for low response times
                • Work is being done, not high priority
                • Use GlusterFS or MogileFS instead
    10. Use case - Reliable, realtime log storage
        • Scenario
            1. A stream of logging events is generated
            2. The stream is written directly to HDFS
        • Potential issues and solutions
            • Problems with long write sessions
                • HDFS-200, HADOOP-6099, HDFS-278
            • Namenode goes down
                • Crazy "double cluster" solution
                • Standby namenode HADOOP-4539
            • Appends not stable
                • HDFS-265
    11. Potential dealbreakers
        • Small files problem™
            • Use archives, sequencefiles or HBase
        • Appends/sync not stable
        • Namenode not highly available
        • Relatively high latency reads
    12. Improvements
        • In progress or completed
            • HADOOP-4539 - Streaming edits to a standby NN
            • HDFS-265 - Appends
            • HDFS-245 - Symbolic links
        • Wish list
            • HDFS-209 - Tool to edit namenode metadata files
            • HDFS-220 - Transparent data archiving off HDFS
            • HDFS-503 - Reduce disk space used with erasure coding
    13. Competitors
        • Hadoop MapReduce compatible
            • CloudStore - http://kosmosfs.sourceforge.net/
        • Low response time
            • MogileFS - http://www.danga.com/mogilefs/
            • GlusterFS - http://www.gluster.org/
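
The figures on slide 2 (128 MB blocks, replication to 3 datanodes) and the "Namenode RAM limits number of files" point on slide 9 can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the function names are made up for this example, it assumes the namenode tracks one object per file plus one per block, and the 150 bytes per object figure is a commonly quoted rule of thumb, not a number from the slides.

```python
import math

BLOCK_SIZE_MB = 128     # block size quoted on slide 2
REPLICATION = 3         # default replication quoted on slide 2
BYTES_PER_OBJECT = 150  # ASSUMPTION: rough rule of thumb, not from the slides


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw disk space consumed once every block is replicated."""
    return file_size_mb * replication


def namenode_objects(file_sizes_mb):
    """Objects the namenode tracks: one per file plus one per block (simplified)."""
    return sum(1 + block_count(size) for size in file_sizes_mb)


def namenode_bytes(file_sizes_mb, bytes_per_object=BYTES_PER_OBJECT):
    """Very rough namenode memory estimate for a set of files."""
    return namenode_objects(file_sizes_mb) * bytes_per_object


if __name__ == "__main__":
    one_large = [1024]      # a single 1 GB file
    many_small = [1] * 1024 # the same data as 1024 files of 1 MB each

    print("blocks in the 1 GB file:", block_count(1024))
    print("raw storage for 1 GB (MB):", raw_storage_mb(1024))
    print("namenode objects, one large file:", namenode_objects(one_large))
    print("namenode objects, many small files:", namenode_objects(many_small))
    print("estimated namenode bytes, many small files:", namenode_bytes(many_small))
```

Under these assumptions the same gigabyte of data costs the namenode far more objects when stored as many small files than as one large file, which is the reasoning behind the advice on slide 11 to pack small files into archives, sequencefiles or HBase.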