File Context
Published in: Technology
Transcript

  • 1. The New File system API: FileContext & AbstractFileSystem
      Sanjay Radia
      Cloud Computing
      Yahoo Inc.
  • 2. Agenda
      • Overview: old vs new
      • Motivation
      • The New APIs
      • What is next
  • 3. In a Nutshell: Old vs New
      • Old: 1 layer
          • FileSystem: both the User API and the FS Impl API
          • FS implementations: S3, DistributedFS, LocalFS
      • New: 2 layers
          • FileContext: User API
          • AbstractFileSystem: FS Impl API
          • FS implementations: S3Fs, LocalFs, Hdfs
  • 4. Motivation (1): First-class URI file system namespace
      • Old:
          • The shell offers a URI namespace view, e.g. cp uri1 uri2
          • But at the user-API layer, you must create a FileSystem instance for each target scheme-authority
          • Incorrect: FileSystem.create(uriPath..), because the path must lie within that FileSystem instance
          • Correct:
              Fs = FileSystem.get(Uri1, …)
              Fs.create(pathInUri1, …)
      • The original patch for symbolic links depicts the problem
          • FileSystem.open(UriPath, …) is invalid if UriPath is foreign
          • But one can fool the FileSystem into following a symlink to the same UriPath
      • Need a layer that provides a first-class URI file namespace
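
The scheme-authority dispatch that the old API leaves to the caller can be sketched in plain Java. This is only an illustration under assumptions: `UriNamespace`, `schemeAuthority`, `register`, and `instanceFor` are hypothetical names, the string values stand in for real file system instances, and nothing here is Hadoop code.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not a Hadoop class. */
public class UriNamespace {
    // One entry per scheme-authority, standing in for one file system instance each.
    private final Map<String, String> instances = new HashMap<>();

    /** Returns the "scheme://authority" key that selects a file system instance. */
    public static String schemeAuthority(URI uri) {
        String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
        return uri.getScheme() + "://" + authority;
    }

    /** Associates a file system root URI with a (hypothetical) instance name. */
    public void register(URI root, String instanceName) {
        instances.put(schemeAuthority(root), instanceName);
    }

    /** Resolves any fully qualified URI to the instance that owns it. */
    public String instanceFor(URI path) {
        String name = instances.get(schemeAuthority(path));
        if (name == null) {
            throw new IllegalArgumentException("No file system registered for " + path);
        }
        return name;
    }
}
```

A layer like this is what lets a caller hand over any URI path without first picking the matching instance by hand.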
  • 5. Motivation (2): Separate Layers
      • Two layers are merged in the old FileSystem
          • User API, which provides the notion of a default file system and working dir
          • Implementation API for implementers of the file systems
      • Why separate?
          • Simpler to implement file systems
              • An API for implementing file systems (like VFS in Unix)
              • It does not need to deal with Slash, wd, umask, etc.
              • Each file system instance is limited to its namespace
          • The User API layer provides a natural place for
              • The context: Slash, wd, umask, …
              • The URI namespace, which cuts across the namespaces of all file system instances
              • Hence a natural place to implement symbolic links to foreign namespaces
  • 6. Motivation (3): Cleanup API and Semantics
      • The FileSystem API and some of its semantics are not very good
          • We should have adopted Unix APIs where appropriate
              • Ask yourself: are you smarter than Ritchie & Thompson, and do you understand the issues well enough to be different?
          • Semantics: the recursive parent creation
              • This convenience can cause accidental creation of parents
              • E.g. a problem for speculative executions
          • Semantics: the rename method, etc.
          • Too many overloaded methods (e.g. create)
          • The cache has leaked through: FileSystem.getNewInstance()
          • Ugliness: e.g. copyLocal(), copy(), …
          • FileSystem leaked into Path
          • Adding InterruptedException
      • Some of these could have been fixed in the FileSystem class, but it was getting messy to provide compatibility in the transition
      • A clean break made things much easier
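
The difference between the two create semantics can be sketched with `java.nio.file` rather than Hadoop. This is a minimal illustration, assuming a local file system; `CreateSemantics`, `createStrict`, and `createWithParents` are hypothetical names, and the second method assumes the path has a parent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Hypothetical helper contrasting the two create semantics; uses java.nio, not Hadoop. */
public class CreateSemantics {
    /** Unix-like create: returns false, creating nothing, if the parent directory is missing. */
    public static boolean createStrict(Path file) {
        try {
            Files.createFile(file);
            return true;
        } catch (NoSuchFileException missingParent) {
            return false; // parent is absent, so nothing was created
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience create: silently creates every missing parent first. */
    public static void createWithParents(Path file) {
        try {
            // Assumes file has a parent component (getParent() is non-null).
            Files.createDirectories(file.getParent());
            Files.createFile(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the strict form, a late or duplicate task that writes into a directory that has already been cleaned up fails visibly instead of quietly recreating the directory tree.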
  • 7. Motivation (4): The Config
      • The client-side config is too complex:
          • The client should only need Slash, wd, and umask; that's it, nothing more
          • But Hadoop needs server-side defaults in the client-side config
              • An unnecessary burden on the client and admin
              • The cluster admin cannot be expected to copy the config to the desktops
              • Does not work for a federated environment where a client connects to many file systems, each with its own defaults
      • Solution: the client grabs and uses the needed properties from the target server
          • A transition to this solution from the current config is challenging if one needs to maintain compatibility within the existing APIs
      • A common complaint is that the Hadoop config is way too complicated
  • 8. The New File System APIs
      HADOOP-4952, HADOOP-6223
  • 9. First: Some Naming Fundamentals
      • Addresses, routes, and names are ALL names (identifiers)
          • Numbers, strings, paths, addresses, or routes are chosen based on the audience or on how they are processed
      • ALL names are relative to some context
          • Even absolute or global names have a context in which they are resolved
          • Names appear to be global/absolute because you have simply chosen a frame of reference and are excluding the world outside that frame of reference
          • When two worlds that each have "global" names collide, names become ambiguous unless you manage the closure/context
          • There is always an implicit context: if you make that implicit context explicit by naming it, you need a context for the name of the context
      • A more local context makes apps portable across similar environments
          • A program can move from one Unix machine to another as long as the names relative to the machine's root refer to the "same" objects
      • A Unix process's context: root and working dir, plus default domain, etc.
  • 10. We have URIs, why do we need Slash-relative names?
      • Our world: a forest of file systems, each referenced by its URI
      • Why isn't the URI namespace good enough?
          • URIs bind your application to the very specific servers that provide that URI namespace
          • An application may run on cluster 1 today and be moved to cluster 2 in the future
          • If you move the data to the second cluster, the app should work
          • Better to let each cluster have its own default fs (i.e. slash)
      • Also need the convenience of a working dir
  • 11. Enter FileContext: A focus point on a forest of file systems
      • A FileContext is a focus point on a forest of file systems
          • In general, it is set for you in your environment (just like your DNS domain)
          • It lets you access the common files in your cluster using location-independent names
              • Your home, tmp, your project's data, …
          • You can still access the files in other clusters or file systems
              • In Unix you had to mount remote file systems
              • But we have URIs, which are fully qualified and automatically mounted
      • A fully qualified URI is to a Slash-relative name as a Slash-relative name is to a wd-relative name
          • … it's just contexts …
      • [Figure: nested name contexts, with labels /foo, /, wd, and hdfs://nn3/foo]
  • 12. Examples
      • Use the default config, which has your default FS
          myFC = FileContext.getFileContext();
      • Access files in your default file system
          myFC.create("/foo", ...);
          myFC.setWorkingDir("/foo")
          myFC.open("bar", ...);
      • Access files in other clusters
          myFC.open("hdfs://nn3/bar", ..)
      • You can even set your wd to another fs!
          myFC.setWorkingDir("hdfs://nn3/foo")
      • Variations on getting your context
          • A specific URI as the default FS
              myFC = FileContext.getFileContext(URI)
          • Local file system as the default FS
              myFC = FileContext.getLocalFSFileContext()
          • Use a specific config, ignoring $HADOOP_CONFIG
              • Generally you should not need to use a config unless you are doing something special
              configX = someConfigPassedToYou;
              myFC = FileContext.getFileContext(configX);
              // configX is not changed but is passed down
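
How the three kinds of names used in these examples could resolve can be sketched with `java.net.URI` alone. This is a simplified model under assumptions, not Hadoop's FileContext: `NameContext` and `qualify` are hypothetical names, and `hdfs://nn1/` is an invented default file system used only for illustration.

```java
import java.net.URI;

/** Hypothetical model of a naming context; not Hadoop's FileContext. */
public class NameContext {
    private final URI defaultFs;   // what "/" means, e.g. hdfs://nn1/ (assumed value)
    private URI workingDir;        // always stored fully qualified, with a trailing slash

    public NameContext(URI defaultFs) {
        this.defaultFs = defaultFs;
        this.workingDir = defaultFs.resolve("/");
    }

    /** Sets the working dir; it may name a directory on another file system. */
    public void setWorkingDir(String dir) {
        // A trailing slash makes later relative names resolve inside the directory.
        String asDir = dir.endsWith("/") ? dir : dir + "/";
        this.workingDir = qualify(asDir);
    }

    /** Turns any of the three name forms into a fully qualified URI. */
    public URI qualify(String name) {
        URI parsed = URI.create(name);
        if (parsed.isAbsolute()) {
            return parsed;                      // fully qualified: hdfs://nn3/foo
        }
        if (name.startsWith("/")) {
            return defaultFs.resolve(name);     // Slash-relative: /foo
        }
        return workingDir.resolve(name);        // wd-relative: bar
    }
}
```

Each form is resolved against a progressively more local context: a wd-relative name against the working dir, a Slash-relative name against the default file system, and a fully qualified URI against nothing at all.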
  • 13. So what is in the FileContext?
      • The default file system (Slash), obtained from config
          • A pointer to the file system object is kept
      • The working dir (for names without a leading Slash)
          • Stored as a path which is prefixed to relative path names
      • Umask, obtained from config
          • Absolute permissions, after applying the mask, are sent to the layer below
      • Any other file systems that are accessed are simply created
          • 0.21: uses the FileSystem, which has a cache
          • 0.22: uses the new AbstractFileSystem
          • Do we need to add a cache? HADOOP-6356
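
The umask step amounts to clearing, in the requested permissions, every bit that is set in the mask. A minimal sketch of that arithmetic follows; `UmaskDemo` and `applyUmask` are hypothetical names, and the octal values in the usage note are illustrative rather than Hadoop defaults.

```java
/** Hypothetical helper; shows the arithmetic only, not Hadoop's permission classes. */
public class UmaskDemo {
    /** Clears from the requested mode every permission bit that is set in the umask. */
    public static int applyUmask(int requestedMode, int umask) {
        // 0777 keeps only the nine rwx permission bits.
        return requestedMode & ~umask & 0777;
    }
}
```

For example, a requested mode of 0666 with a umask of 022 yields 0644, and that absolute value is what would be passed to the layer below.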
  • 14. HDFS config: client and server side
      • Client-side config:
          • Default file system
          • Umask
          • Default values for blocksize, buffersize, and replication are obtained at runtime from the specific file system in question
              • Finally, federation can work
      • Server-side config:
          • What used to be there before (except the above two items)
          • Plus cleaned-up config variables for server-side (SS) defaults for blocksize, etc.
  • 15. Abstract File System (0.22)
      • FileContext: User API
      • AbstractFileSystem: FS Impl API
          • Does not deal with the default file system, wd, URIs, or umask
      • [Figure: implementations under AbstractFileSystem, with labels DelegateToFileSystem, Hdfs, LocalFs, FilterFs, ChecksumFs, RawLocalFs, and RawLocalFileSystem]
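
The filter and checksum boxes in this picture suggest a wrapper pattern: one implementation forwards calls to another and may add behaviour on the way. The sketch below shows that pattern in general terms only. Every name in it (`LayeringSketch`, `FsImpl`, `InMemoryFs`, `FilterImpl`, `ChecksumImpl`) is hypothetical, the two-method interface is invented for brevity, and `String.hashCode()` stands in for a real checksum; it is not the AbstractFileSystem API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of layered implementations; not Hadoop code. */
public class LayeringSketch {
    /** Invented stand-in for an implementation-layer API. */
    public interface FsImpl {
        void write(String path, String data);
        String read(String path);
    }

    /** A trivial base implementation that keeps file contents in memory. */
    public static class InMemoryFs implements FsImpl {
        private final Map<String, String> files = new HashMap<>();
        public void write(String path, String data) { files.put(path, data); }
        public String read(String path) { return files.get(path); }
    }

    /** Pass-through wrapper: forwards every call to the wrapped implementation. */
    public static class FilterImpl implements FsImpl {
        protected final FsImpl inner;
        public FilterImpl(FsImpl inner) { this.inner = inner; }
        public void write(String path, String data) { inner.write(path, data); }
        public String read(String path) { return inner.read(path); }
    }

    /** A filter that records a checksum on write and verifies it on read. */
    public static class ChecksumImpl extends FilterImpl {
        private final Map<String, Integer> sums = new HashMap<>();
        public ChecksumImpl(FsImpl inner) { super(inner); }

        @Override
        public void write(String path, String data) {
            sums.put(path, data.hashCode()); // hashCode() is a placeholder checksum
            inner.write(path, data);
        }

        @Override
        public String read(String path) {
            String data = inner.read(path);
            Integer expected = sums.get(path);
            if (data != null && expected != null && data.hashCode() != expected) {
                throw new IllegalStateException("Checksum mismatch for " + path);
            }
            return data;
        }
    }
}
```

Because the wrapper implements the same interface it wraps, the layer above does not need to know how many filters sit between it and the raw implementation.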
  • 18. The Jira: Approx 9 months
      • FileContext: three main discussions
          • Use Java's new IO? The main issue is that its default and wd are tied to the file system of the JVM in order to provide compatibility with older APIs
          • Config: a simpler per-process environment?
              • Deferred to Doug's view that a per-process env is not sufficient for a multithreaded (MT) Java application
              • Besides, out of scope for this Jira: a radical change
          • Server-side (SS) config
      • What I would have liked to do, but didn't
          • 2 APIs: a Java interface and an abstract class
              • Not the Hadoop way!
              • For example, if we had done this in the old FileSystem, it would have facilitated dynamic class loading for protocol compatibility as an interim solution
  • 19. What is next?
      • Exceptions:
          • Adding InterruptedException and declaring the sub-exceptions of IOException
          • Issue:
              • Apps need to manage an interrupt differently than other exceptions
              • IO streams throw InterruptedIOException
          • FileContext has three choices:
              • InterruptedIOException
              • InterruptedException
              • An unchecked exception
      • The Cache:
          • Do we need it?
          • How do we deal with Facebook's application that forced the cache to leak through?
      • Other AbstractFileSystem impls (can use the delegator)
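
Why the exception choice matters to callers can be seen in a small sketch using the JDK's `java.io.InterruptedIOException`, which is a subclass of `IOException`. The `InterruptHandling` class, its `Operation` interface, and the returned strings are hypothetical; this shows one possible caller-side pattern, not what FileContext does.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical caller-side pattern; the Operation interface is invented for this sketch. */
public class InterruptHandling {
    /** Stand-in for any file system call that can fail with an IOException. */
    public interface Operation {
        void run() throws IOException;
    }

    /** Returns "ok", "interrupted", or "failed" depending on how the call ended. */
    public static String attempt(Operation op) {
        try {
            op.run();
            return "ok";
        } catch (InterruptedIOException e) {
            // Must be caught before IOException, because it is a subclass of it.
            Thread.currentThread().interrupt(); // restore the interrupt status for callers
            return "interrupted";
        } catch (IOException e) {
            return "failed";
        }
    }
}
```

If interrupts arrive only as a subclass of IOException, a caller that catches IOException generically treats an interrupt like any other I/O failure unless it adds the more specific catch first.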
