WebHDFS and HttpFS are a common source of confusion. This slideset highlights the differences and similarities between these two web interfaces for accessing an HDFS cluster.
2. WebHDFS
● WebHDFS is an HTTP REST API that exposes the methods of the FileSystem interface;
● Detailed dictionary available here;
● FileSystem scheme: "webhdfs://";
● Enabled by default via "dfs.webhdfs.enabled";
● Used by WebHdfsFileSystem implementation;
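As a rough illustration of what a WebHDFS REST call looks like, the sketch below builds a request URL from a path and an operation name. The helper `webhdfs_url` and the host `namenode.example.com` are hypothetical; the `/webhdfs/v1` prefix, the `op` and `user.name` parameters, and the NN port 50070 are taken from the curl example and summary in this slideset.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an HDFS path and operation (hypothetical helper)."""
    # The operation is passed as the "op" query parameter, followed by any extras.
    query = urlencode({"op": op, **params})
    # quote() percent-encodes unsafe characters in the path but keeps "/" intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# "user.name" is not a valid Python keyword, so it is passed via dict unpacking.
url = webhdfs_url("namenode.example.com", 50070, "/tmp", "LISTSTATUS",
                  **{"user.name": "root"})
print(url)
```

The same URL shape should apply to the other operations in the dictionary; only the `op` value and its extra parameters change.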
3. WebHDFS - Implementation Details
● Runs embedded within the NN/DN processes, as a Jetty server;
● Runs on the same HTTP server instance as the Web UI;
● HttpServer2 wraps the Jetty-specific initialization logic:
○ Creates ServletHolder and WebAppContext instances;
○ NameNodeWebHdfsMethods defines all javax.ws.rs (JAX-RS) mappings for the related WebHDFS REST API methods;
● The embedded HTTP server is created and started on NN initialisation, before the FS image is loaded;
5. WebHDFS - Client Access
● Clients access the NN and DN hosts directly, since the Jetty servers run embedded within the NN/DN processes;
● It's an "HTTP" layer on top of the client protocol;
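One way to see why clients need to reach the DNs as well: for data operations such as OPEN, the NN typically answers with an HTTP 307 redirect whose Location header names a DN, and the client then talks to that DN. The sketch below only parses such a header offline; the `datanode_from_redirect` helper and the sample Location value are hypothetical, not captured from a real cluster.

```python
from urllib.parse import urlsplit


def datanode_from_redirect(location):
    """Return (host, port) of the DataNode named in a redirect Location header."""
    parts = urlsplit(location)
    return parts.hostname, parts.port


# Hypothetical Location header, shaped like what a NameNode might return for op=OPEN.
location = "http://datanode1.example.com:50075/webhdfs/v1/tmp/json.sample?op=OPEN&offset=0"

dn_host, dn_port = datanode_from_redirect(location)
print(f"client must be able to connect to {dn_host}:{dn_port}")
```

If the DN host in the redirect is not resolvable or reachable from the client network, the read fails even though the NN request succeeded.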
9. HttpFS
● Implements the same REST methods as WebHDFS, so the dictionary is the same;
● Independent Java process from the NN/DN, can (and should) be run on different hosts;
● Listens on port 14000 by default;
● Allows for client access isolation from NNs/DNs;
● Can be deployed on multiple hosts for load balancing (it does not provide a built-in load balancing feature, though);
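Since HttpFS has no built-in load balancing, the spreading of requests has to happen outside it, for example in an external load balancer or in the client. A minimal client-side sketch, assuming a simple round-robin policy, is shown below; the `HttpFSEndpoints` class and the host names are hypothetical, while port 14000 and the `/webhdfs/v1` prefix come from this slideset.

```python
from itertools import cycle


class HttpFSEndpoints:
    """Hand out HttpFS base URLs in round-robin order (hypothetical helper)."""

    def __init__(self, hosts, port=14000):
        # cycle() repeats the base URLs endlessly, in the order given.
        self._bases = cycle([f"http://{host}:{port}/webhdfs/v1" for host in hosts])

    def next_base(self):
        """Return the base URL to use for the next request."""
        return next(self._bases)


pool = HttpFSEndpoints(["httpfs1.example.com", "httpfs2.example.com"])
for _ in range(3):
    print(pool.next_base() + "/tmp?op=LISTSTATUS")
```

This sketch does no health checking or retry; a production setup would more likely put a dedicated HTTP load balancer in front of the HttpFS hosts.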
10. HttpFS - Implementation Details
● Java web application deployed on Tomcat (CDH 5);
● Accesses HDFS using the HDFS Java client API;
● Uses Jersey for the JAX-RS mappings:
○ The ServletContainer and the packages containing classes with JAX-RS annotations are defined in web.xml;
○ HttpFSServerWebApp, a ServletContextListener implementation, creates and initialises the service implementations (FileSystemAccessService, GroupsService, etc.);
○ HttpFSServer handles HTTP requests, performing the related WebHDFS operations using the HDFS Client API;
■ Defines the JAX-RS related annotations;
■ Processed and initialised by the Jersey ServletContainer;
● Once running, HttpFS is deployed on a Tomcat instance running from: /var/lib/hadoop-httpfs/tomcat-deployment/
12. HttpFS - Client Access
● Clients only need access to the HttpFS process host. NNs/DNs are isolated from clients;
● HttpFS uses the HDFS Client API to access HDFS. That translates into RPC calls to the NN, and additional network I/O for file read/write operations;
14. HttpFS - Curl verbose output example
curl -v "http://host-10-17-101-41.coe.cloudera.com:14000/webhdfs/v1/tmp/?op=LISTSTATUS&user.name=root"
* About to connect() to host-10-17-101-41.coe.cloudera.com port 14000 (#0)
* Trying 10.17.101.41... connected
* Connected to host-10-17-101-41.coe.cloudera.com (10.17.101.41) port 14000 (#0)
> GET /webhdfs/v1/tmp/?op=LISTSTATUS&user.name=root HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.21 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: host-10-17-101-41.coe.cloudera.com:14000
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Apache-Coyote/1.1
< Set-Cookie: hadoop.auth="u=root&p=root&t=simple-dt&e=1528928040022&s=4LIdOwldrAiLceRrIuRDTF2D3qs="; Path=/; HttpOnly
< Content-Encoding: UTF-16BE
< Content-Type: application/json;charset=UTF-16BE
< Transfer-Encoding: chunked
< Date: Wed, 13 Jun 2018 12:14:01 GMT
<
{"FileStatuses":{"FileStatus":[{"accessTime":0,"blockSize":0,"childrenNum":1,"fileId":153263,"group":"hdfs","length":0,"modificationTime":1513787571635,"owner":"root","pathSuffix":"hive","permission":"733","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":1513788800994,"blockSize":134217728,"childrenNum":0,"fileId":153418,"group":"hdfs","length":61,"modificationTime":1513788801193,"owner":"root","pathSuffix":"json.sample","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}]}}
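The LISTSTATUS response body above is plain JSON, so a client can read it with any JSON parser. The sketch below is one possible way to do that in Python; `list_entries` is a hypothetical helper, and `response_body` is a trimmed copy of the output above that keeps only a few of the fields.

```python
import json

# Trimmed copy of the LISTSTATUS body shown above (only some fields kept).
response_body = """
{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"hive","type":"DIRECTORY","owner":"root","permission":"733","length":0},
  {"pathSuffix":"json.sample","type":"FILE","owner":"root","permission":"644","length":61}
]}}
"""


def list_entries(body):
    """Return (name, type, length) tuples from a LISTSTATUS JSON body."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"], s["length"]) for s in statuses]


for name, kind, length in list_entries(response_body):
    print(f"{kind:<9} {length:>6} {name}")
```

Because HttpFS implements the same REST methods as WebHDFS, the same parsing should work regardless of which of the two interfaces served the request.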
15. Summary - WebHDFS x HttpFS
WebHDFS
● Runs in an embedded HTTP (Jetty) server in the NN/DN processes;
● Clients need access to the NN and DN hosts;
● Default ports 50070 (NN) / 50075 (DN);
● Can be enabled/disabled by the dfs.webhdfs.enabled property;
● Accesses NN/DN client protocol methods directly;
HttpFS
● Runs as a Java web application deployed on a Tomcat process;
● Isolates client access, clients just need access to the HttpFS hosts;
● Default port 14000;
● Can have multiple instances deployed;
● Uses the HDFS Java client API to access HDFS;