File Context
Presentation Transcript

  • The New File System API:
    FileContext & AbstractFileSystem
    Sanjay Radia
    Cloud Computing
    Yahoo Inc.
  • Agenda
    Overview – old vs new
    Motivation
    The New APIs
    What is next
  • In a Nutshell: Old vs New
    Old: 1 layer
    FileSystem is both the user API and the FS impl API
    Implementations: DistributedFS, LocalFS, S3, …
    New: 2 layers
    FileContext is the user API; AbstractFileSystem is the FS impl API
    Implementations: Hdfs, LocalFs, S3Fs, …
    (a brief code sketch of the layering follows this slide)
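    A hedged sketch of the layering, assuming the 0.21-era signatures (the path is illustrative): users of the old API call FileSystem directly, while in the new design users call FileContext and only implementers of a file system see AbstractFileSystem.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class LayeringDemo {
      public static void main(String[] args) throws Exception {
        // Old, 1 layer: FileSystem is both the user API and the FS impl API.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/projects/foo"));

        // New, 2 layers: users call FileContext; only file system
        // implementers extend AbstractFileSystem underneath.
        FileContext fc = FileContext.getFileContext();
        fc.mkdir(new Path("/projects/foo"), FsPermission.getDefault(), true);
      }
    }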
  • Motivation (1): First-class URI file system namespace
    Old:
    The shell offers a URI namespace view: e.g. cp uri1 uri2
    But at the user-API layer, you create a FileSystem instance for each target scheme-authority
    Incorrect: FileSystem.create(uriPath, …), since the path must lie in that FileSystem instance's namespace
    Correct:
    fs = FileSystem.get(uri1, …)
    fs.create(pathInUri1, …)
    The original symbolic-link patch illustrates the problem:
    FileSystem.open(uriPath, …) is invalid if uriPath is foreign
    But one can fool the FileSystem into following a symlink to that same uriPath.
    Need a layer that provides a first-class URI file namespace (a sketch follows this slide)
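    A hedged sketch of the constraint above, using the old 0.20-era FileSystem API (the name-node URIs and paths are illustrative): each FileSystem instance is bound to one scheme-authority, so a client needs a separate instance per namespace.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerUriFileSystems {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One instance per scheme-authority:
        FileSystem fs1 = FileSystem.get(URI.create("hdfs://nn1/"), conf);
        fs1.create(new Path("/user/alice/data")).close(); // OK: in fs1's namespace

        // Incorrect: handing fs1 a path from a foreign namespace, e.g.
        //   fs1.create(new Path("hdfs://nn3/user/alice/data"));
        // the path must belong to the instance you call.

        // Correct: obtain another instance for the other namespace.
        FileSystem fs3 = FileSystem.get(URI.create("hdfs://nn3/"), conf);
        fs3.create(new Path("/user/alice/data")).close();
      }
    }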
  • Motivation (2): Separate Layers
    Two layers are merged in the old FileSystem:
    the user API, which provides the notion of a default file system and a working dir
    the implementation API for implementers of file systems
    Why separate?
    Simpler to implement file systems:
    an API for implementing file systems (like VFS in Unix)
    it does not need to deal with Slash, wd, umask, etc.
    each file system instance is limited to its own namespace
    The user-API layer provides a natural place for:
    the context: Slash, wd, umask, …
    the URI namespace, which cuts across the namespaces of all file system instances
    hence a natural place to implement symbolic links to foreign namespaces
  • Motivation (3): Cleanup of API and Semantics
    The FileSystem API and some of its semantics are not very good
    We should have adopted the Unix APIs where appropriate
    Ask yourself: are you smarter than Ritchie & Thompson, and do you understand the issues well enough to be different?
    Semantics: recursive parent creation
    This convenience can cause accidental creation of parents
    E.g. a problem for speculative executions (see the sketch after this slide)
    Semantics: the rename method, etc.
    Too many overloaded methods (e.g. create)
    The cache has leaked through: FileSystem.newInstance()
    Ugliness: e.g. copyLocal(), copy(), …
    FileSystem leaked into Path
    Adding InterruptedException
    Some of this could have been fixed in the FileSystem class, but it was getting messy to provide compatibility during the transition
    A clean break made things much easier
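    A hedged sketch of the parent-creation point, assuming the 0.21-era signatures (paths are illustrative): the old create() silently makes missing parent dirs, while the new API creates parents only on request.

    import java.util.EnumSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.CreateFlag;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Options;
    import org.apache.hadoop.fs.Path;

    public class ParentCreationSemantics {
      public static void main(String[] args) throws Exception {
        // Old API: create() recursively creates missing parents. Convenient,
        // but a speculatively re-executed task can accidentally recreate an
        // output dir that was deliberately deleted.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.create(new Path("/out/part-0001")).close();

        // New API: parents are created only when explicitly asked for.
        FileContext fc = FileContext.getFileContext();
        fc.create(new Path("/out/part-0001"),
                  EnumSet.of(CreateFlag.CREATE),
                  Options.CreateOpts.createParent()).close();
      }
    }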
  • Motivation (4): The Config
    The client-side config is too complex:
    The client should only need Slash, wd, and umask; that's it, nothing more
    But Hadoop needs server-side defaults in the client-side config
    An unnecessary burden on the client and admin:
    the cluster admin cannot be expected to copy the config to the desktops
    it does not work for a federated environment, where a client connects to many file systems, each with its own defaults
    Solution: the client grabs the properties it needs from the target server
    Transitioning to this solution from the current config is challenging if one needs to maintain compatibility within the existing APIs
    A common complaint is that the Hadoop config is way too complicated
  • The New File System APIs
    HADOOP-4952, HADOOP-6223
  • First: Some Naming Fundamentals
    Addresses, routes, and names are ALL names (identifiers)
    Numbers, strings, paths, addresses, or routes are chosen based on the audience and how they are processed
    ALL names are relative to some context
    Even absolute or global names have a context in which they are resolved
    Names appear to be global/absolute because you have simply chosen a frame-of-reference and are excluding the world outside that frame-of-reference
    When two worlds that each have "global" names collide,
    names get ambiguous unless you manage the closure/context
    There is always an implicit context; if you make that implicit context explicit by naming it, you need a context for the name of the context
    A more local context makes apps portable across similar environments
    A program can move from one Unix machine to another as long as the names relative to the machine's root refer to the "same" objects
    A Unix process's context: root and working dir,
    plus the default domain, etc.
  • We have URIs, why do we need Slash-relative names?
    Our world:
    a forest of file systems, each referenced by its URI
    Why isn't the URI namespace good enough?
    URIs will bind your application to the very specific servers that provide that URI namespace
    An application may run on cluster 1 today and be moved to cluster 2 in the future
    If you move the data to the second cluster, the app should still work
    Better to let each cluster have its own default fs (i.e. Slash)
    Also need the convenience of a working dir
  • Enter FileContext: A focus point on a forest of file systems
    A FileContext is a focus point on a forest of file systems
    In general, it is set for you in your environment (just like your DNS domain)
    It lets you access the common files in your cluster using location-independent names:
    your home, tmp, your project's data, …
    You can still access the files in other clusters or file systems
    In Unix you had to mount remote file systems
    But we have URIs, which are fully qualified and automatically mounted
    A fully qualified URI is to a Slash-relative name
    as a Slash-relative name is to a wd-relative name
    … it's just contexts …
    [Diagram: a wd-relative name resolves against the working dir, a Slash-relative name such as /foo against the default file system's root, and a fully qualified name such as hdfs://nn3/foo against its own namespace]
  • Examples
    Use the default config, which has your default FS:
    myFC = FileContext.getFileContext();
    Access files in your default file system:
    myFC.create("/foo", ...);
    myFC.setWorkingDir("/foo");
    myFC.open("bar", ...);
    Access files in other clusters:
    myFC.open("hdfs://nn3/bar", ...);
    You can even set your wd to another fs!
    myFC.setWorkingDir("hdfs://nn3/foo");
    Variations on getting your context:
    A specific URI as the default FS:
    myFC = FileContext.getFileContext(URI);
    The local file system as the default FS:
    myFC = FileContext.getLocalFSFileContext();
    Use a specific config, ignoring $HADOOP_CONFIG
    (generally you should not need to use a config unless you are doing something special):
    configX = someConfigPassedToYou;
    myFC = FileContext.getFileContext(configX);
    // configX is not changed but is passed down
    (a complete, runnable version follows this slide)
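    A minimal runnable version of the slide's pseudocode, assuming the 0.21-era FileContext signatures (Path-typed arguments and an EnumSet of CreateFlag for create; the slide's setWorkingDir appears as setWorkingDirectory in the released API, and nn3 is illustrative):

    import java.util.EnumSet;
    import org.apache.hadoop.fs.CreateFlag;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FileContextExamples {
      public static void main(String[] args) throws Exception {
        // The default config supplies the default (Slash) file system.
        FileContext myFC = FileContext.getFileContext();

        // Work in the default file system.
        myFC.mkdir(new Path("/foo"), FsPermission.getDefault(), true);
        myFC.create(new Path("/foo/bar"),
                    EnumSet.of(CreateFlag.CREATE)).close();

        // Relative names resolve against the working dir.
        myFC.setWorkingDirectory(new Path("/foo"));
        myFC.open(new Path("bar")).close();

        // Fully qualified names reach other clusters directly.
        myFC.open(new Path("hdfs://nn3/bar")).close();

        // Variation: the local file system as the default FS.
        FileContext localFC = FileContext.getLocalFSFileContext();
      }
    }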
  • So what is in the FileContext?
    The default file system (Slash) – obtained from the config
    A pointer to the file system object is kept
    The working dir (used for names lacking a leading Slash)
    Stored as a path that is prefixed to relative path names
    Umask – obtained from the config
    Absolute permissions, after applying the mask, are sent to the layer below
    Any other file systems accessed are simply created:
    0.21 – uses the old FileSystem, which has a cache
    0.22 – uses the new AbstractFileSystem
    Do we need to add a cache? HADOOP-6356
    (a sketch of this per-context state follows this slide)
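    For illustration only, the per-context state listed above amounts to roughly the following; the class and field names here are hypothetical, not Hadoop's actual source.

    // Hypothetical sketch of the state a FileContext carries.
    class FileContextState {
      org.apache.hadoop.fs.AbstractFileSystem defaultFs;  // resolves Slash-relative names
      org.apache.hadoop.fs.Path workingDir;               // prefixed to relative names
      org.apache.hadoop.fs.permission.FsPermission umask; // applied before calls go below
    }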
  • HDFS config – client & server side
    Client-side config:
    the default file system
    the umask
    Default values for blocksize, buffersize, and replication are obtained at runtime from the specific file system in question
    Finally, federation can work
    Server-side config:
    what used to be there before (except the above two items)
    + cleaned-up config variables for the server-side defaults for blocksize, etc.
    (a sketch of the minimal client config follows this slide)
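    A hedged sketch of the minimal client-side configuration this implies; the property names follow Hadoop conventions of that era (fs.defaultFS, fs.permissions.umask-mode) but should be treated as assumptions here.

    import org.apache.hadoop.conf.Configuration;

    public class MinimalClientConf {
      public static Configuration build() {
        Configuration conf = new Configuration(false); // no default resources
        conf.set("fs.defaultFS", "hdfs://nn1/");       // Slash: the default FS
        conf.set("fs.permissions.umask-mode", "022");  // the umask
        // Blocksize, buffersize, and replication are deliberately NOT set:
        // the client obtains them at runtime from the target file system's
        // server-side defaults.
        return conf;
      }
    }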
  • Abstract File System (0.22)
    FileContext – the user API
    AbstractFileSystem – the FS impl API; it does not deal with:
    • the default file system, wd
    • URIs
    • umask
    Implementations: Hdfs; LocalFs, a ChecksumFs (i.e. a FilterFs) over RawLocalFs, which uses DelegateToFileSystem to wrap the old RawLocalFileSystem
    (a sketch of the delegation pattern follows this slide)
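    A hedged sketch of that delegation pattern: exposing an old-API FileSystem through the new AbstractFileSystem layer. The class MyFs, the scheme "myfs", and the exact constructor arguments are assumptions modeled on how RawLocalFs wraps RawLocalFileSystem; check DelegateToFileSystem's visibility and signature in your Hadoop version.

    import java.io.IOException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.DelegateToFileSystem;
    import org.apache.hadoop.fs.RawLocalFileSystem;

    // Hypothetical: delegate the new impl API to an existing old-API file system.
    public class MyFs extends DelegateToFileSystem {
      MyFs(final URI theUri, final Configuration conf)
          throws IOException, URISyntaxException {
        // (uri, wrapped old-API fs, conf, scheme, authorityRequired) – the
        // argument order is an assumption based on the delegation pattern.
        super(theUri, new RawLocalFileSystem(), conf, "myfs", false);
      }
    }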
  • The Jira: approx. 9 months
    FileContext: three main discussions
    Use Java's new IO? The main issue is that its default file system and wd are tied to the JVM's file system, in order to provide compatibility with older APIs
    Config – a simpler per-process environment?
    Deferred to Doug's view that a per-process env is not sufficient for a multi-threaded Java application
    Besides, out of scope for this Jira – a radical change
    Server-side config
    What I would have liked to do, but didn't:
    2 APIs: a Java interface and an abstract class
    Not the Hadoop way!
    For example, had we done this in the old FileSystem, it would have facilitated dynamic class loading for protocol compatibility as an interim solution
  • What is next?
    Exceptions:
    adding InterruptedException and declaring the sub-exceptions of IOException
    Issue:
    apps need to manage an interrupt differently than other exceptions (illustrated after this slide)
    IO streams throw IOInterruptedException
    FileContext – three choices:
    IOInterruptedException
    InterruptedException
    an unchecked exception
    The cache:
    Do we need it?
    How do we deal with Facebook's application, which forced the cache to leak through?
    Other AbstractFileSystem impls (these can use the delegator)
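    A hedged illustration of why interrupts deserve distinct handling, using java.io.InterruptedIOException as a stand-in for the IOInterruptedException named above:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InterruptedIOException;

    public class InterruptHandling {
      // An interrupt should stop the operation promptly; an ordinary I/O
      // failure might instead be logged, retried, or reported.
      static int readOnce(InputStream in, byte[] buf) throws IOException {
        try {
          return in.read(buf);
        } catch (InterruptedIOException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt status
          throw e;                            // propagate, do not retry
        } catch (IOException e) {
          throw e; // a normal failure: the caller may retry or fail the task
        }
      }
    }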