Web Services in
Hadoop
Hadoop Summit 2012
Nicholas Sze and Alan F. Gates
@szetszwo, @alanfgates




REST-ful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat clients in a gateway
• Insulation from interface changes from release to release


                               HCatalog web interfaces

                            MapReduce     Pig      Hive

                                       HCatalog

                            HDFS      HBase      External Store

      © 2012 Hortonworks                                     Page 2
Not Covered in this Talk
•  HttpFS (formerly known as Hoop) – same API as WebHDFS, but proxied
•  Stargate – REST API for HBase




HDFS Clients
• DFSClient: the native client
  – High performance (uses Hadoop RPC)
  – Java binding only


• libhdfs: a C client interface
  – Uses JNI => large overhead
  – Also Java-bound (requires a Hadoop installation)




     Architecting the Future of Big Data     Page 4
HFTP
• Designed for cross-version copying (DistCp)
  – High performance (using HTTP)
  – Read-only
  – The HTTP API is proprietary
  – Clients must use HftpFileSystem (hftp://)


• WebHDFS is a rewrite of HFTP



Design Goals

• Support a public HTTP API

• Support Read and Write

• High Performance

• Cross-version

• Security



WebHDFS features
• HTTP REST API
  – Defines a public API
  – Permits non-Java client implementations
  – Supports common tools like curl/wget


• Wire Compatibility
  – The REST API will be maintained for wire compatibility
  – WebHDFS clients can talk to different Hadoop versions




WebHDFS features (2)

• A Complete HDFS Interface
  – Supports all user operations
     – reading files
     – writing to files
     – mkdir, chmod, chown, mv, rm, …


• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads/writes are redirected to the corresponding datanodes



WebHDFS features (3)

• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop
    delegation tokens
  – Supports proxy users


• An HDFS Built-in Component
  – WebHDFS is a first-class, built-in component of HDFS
  – Runs inside namenodes and datanodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above

WebHDFS URI & URL
• FileSystem scheme:
          webhdfs://

• FileSystem URI:
          webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URL:
  http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..

  – Path prefix:    /webhdfs/v1
  – Query:          ?op=..



URI/URL Examples
•  Suppose we have the following file
     hdfs://namenode:8020/user/szetszwo/w.txt

•  WebHDFS FileSystem URI
    webhdfs://namenode:50070/user/szetszwo/w.txt

•  WebHDFS HTTP URL
   http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..

•  WebHDFS HTTP URL to open the file
   http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
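As a sketch, the URI-to-URL mapping above can be captured in a few lines of Python. The host name and port 50070 (the default namenode HTTP port in Hadoop 1.x) follow the examples; the helper name is ours, not part of any Hadoop API:

```python
# Build a WebHDFS URL of the form
# http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=...
def webhdfs_url(host, http_port, path, op, **params):
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{http_port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode", 50070, "/user/szetszwo/w.txt", "OPEN")
print(url)  # http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
```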

Example: curl
•  Use curl to open a file

$ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)
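The 307 above is the data-locality mechanism in action: the namenode answers with a Location header pointing at a datanode, and the client (here, curl -L) simply follows it. A small sketch of pulling the pieces out of that header, with the values copied from the response above:

```python
from urllib.parse import urlsplit, parse_qs

# Location header from the 307 response above.
location = ("http://192.168.5.2:50075/webhdfs/v1/user/"
            "szetszwo/w.txt?op=OPEN&offset=0")

parts = urlsplit(location)
print(parts.netloc)           # 192.168.5.2:50075 -- a datanode, not the namenode
print(parse_qs(parts.query))  # {'op': ['OPEN'], 'offset': ['0']}
```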




Example: curl (2)

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)

Hello, WebHDFS user!




Example: wget
•  Use wget to open the same file

$ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt

Resolving ...
Connecting to ... connected.
HTTP request sent, awaiting response...
307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]




Example: wget (2)

--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'

100%[=================>] 21                --.-K/s     in 0s

2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]




Example: Firefox

[Screenshot: opening the same WebHDFS URL in Firefox]
HCatalog REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
•  Uses JSON to describe metadata objects
•  Versioned, because we assume we will have to update it:
   http://hadoop.acme.com/templeton/v1/…
•  Runs in a Jetty server
•  Supports security
     –  Authentication done via Kerberos using SPNEGO
•  Included in HDP; runs on the Thrift metastore server machine
•  Not yet checked in, but you can find the code on Apache’s JIRA
   HCATALOG-182




HCatalog REST API
                          Get a list of all tables in the default database:




           GET
           http://…/v1/ddl/database/default/table
                                                                               Hadoop/HCatalog
           {
               "tables": ["counted", "processed"],
               "database": "default"
           }



 Indicate user with URL parameter:
 http://…/v1/ddl/database/default/table?user.name=gates
 Actions authorized as indicated user
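Since the response bodies are plain JSON, any client can consume them directly. A minimal sketch parsing the table-list response shown above, with the body copied from the slide:

```python
import json

# Response body from GET .../v1/ddl/database/default/table
body = '{"tables": ["counted", "processed"], "database": "default"}'

resp = json.loads(body)
print(resp["database"])  # default
print(resp["tables"])    # ['counted', 'processed']
```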

HCatalog REST API
                        Create new table “rawevents”

         PUT
         {"columns": [{ "name": "url", "type": "string" },
                      { "name": "user", "type": "string"}],
          "partitionedBy": [{ "name": "ds", "type": "string" }]}

         http://…/v1/ddl/database/default/table/rawevents

                                                             Hadoop/HCatalog
          {
              "table": "rawevents",
              "database": "default"
          }
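The PUT body is ordinary JSON as well. A sketch of assembling the "rawevents" definition shown above before sending it as the body of the HTTP PUT; the dict layout mirrors the request on the slide, nothing more:

```python
import json

# Table definition for PUT .../v1/ddl/database/default/table/rawevents
table_def = {
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}
payload = json.dumps(table_def)  # serialized request body
```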




HCatalog REST API
                        Describe table “rawevents”




         GET
         http://…/v1/ddl/database/default/table/rawevents
                                                           Hadoop/HCatalog
         {
              "columns": [{"name": "url","type": "string"},
                          {"name": "user","type": "string"}],
              "database": "default",
              "table": "rawevents"
         }




Job Management
•  Includes APIs to submit and monitor jobs
•  Any files needed for a job are first uploaded to HDFS via WebHDFS
   –  Pig and Hive scripts
   –  Jars, Python scripts, or Ruby scripts for UDFs
   –  Pig macros
•  Results from a job are stored in HDFS and can be retrieved via WebHDFS
•  The user is responsible for cleaning up output in HDFS
•  Job state information is stored in ZooKeeper or HDFS




Job Submission
•  Can submit MapReduce, Pig, and Hive jobs
•  POST parameters include
   –  script to run or HDFS file containing script/jar to run
   –  the username to execute the job as
   –  optionally, an HDFS directory to write results to (defaults to the user’s home directory)
   –  optionally, a URL to invoke GET on when the job is done


              POST
              http://hadoop.acme.com/templeton/v1/pig
                                                                             Hadoop/HCatalog
              {"id": "job_201111111311_0012",…}
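The POST body is regular form data. A sketch of building it; the parameter names (execute, statusdir) follow common Templeton conventions but are illustrative here, since the slide does not list them:

```python
from urllib.parse import urlencode

# Form parameters for POST http://.../templeton/v1/pig
params = {
    "user.name": "gates",                      # user to run the job as
    "execute": "A = load 'counted'; dump A;",  # inline Pig script
    "statusdir": "/user/gates/output",         # HDFS dir for results (optional)
}
body = urlencode(params)  # url-encoded request body
```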




Find all Your Jobs
•  GET on queue returns all jobs belonging to the submitting user
•  Pig, Hive, and MapReduce jobs will be returned




              GET
              http://…/templeton/v1/queue?user.name=gates
                                                                     Hadoop/HCatalog
              {"job_201111111311_0008",
               "job_201111111311_0012"}




Get Status of a Job
•  Doing a GET on a jobid returns information about that job
•  Can be used to poll to see if the job is finished
•  Used after the job finishes to get job information
•  Doing a DELETE on a jobid kills the job




              GET
              http://…/templeton/v1/queue/job_201111111311_0012
                                                                       Hadoop/HCatalog
              {…, "percentComplete": "100% complete",
                  "exitValue": 0,…
                  "completed": "done"
               }
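A polling client only needs to inspect a couple of fields in that response. A sketch using the (abridged) body above:

```python
import json

# Abridged response body from GET .../templeton/v1/queue/<jobid>
status = json.loads('{"percentComplete": "100% complete", '
                    '"exitValue": 0, "completed": "done"}')

finished = status.get("completed") == "done"
succeeded = finished and status.get("exitValue") == 0
print(finished, succeeded)  # True True
```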




Future
•  Job management
   –  Job management APIs don’t belong in HCatalog
   –  Only there by historical accident
   –  Need to move them out to MapReduce framework
•  Authentication needs more options than Kerberos
•  Integration with Oozie
•  Need a directory service
   –  Users should not need to connect to different servers for HDFS, HBase, HCatalog,
      Oozie, and job submission





More Related Content

What's hot

Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
MongoDB
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
asterix_smartplatf
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
Hyunsik Choi
 

What's hot (20)

HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
August 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 SecurityAugust 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 Security
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Apache hive
Apache hiveApache hive
Apache hive
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 

Viewers also liked

Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
Adam Kawa
 

Viewers also liked (19)

Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Similar to Web Services Hadoop Summit 2012

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
FIWARE
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
Salil Navgire
 

Similar to Web Services Hadoop Summit 2012 (20)

Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
מיכאל
מיכאלמיכאל
מיכאל
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Windows Azure HDInsight Service
Windows Azure HDInsight ServiceWindows Azure HDInsight Service
Windows Azure HDInsight Service
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 

More from Hortonworks

More from Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
FIDO Alliance
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight

Web Services Hadoop Summit 2012

WebHDFS features (2)
• A Complete HDFS Interface
  – Supports all user operations
    – reading files
    – writing to files
    – mkdir, chmod, chown, mv, rm, …

• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads and writes are redirected to the corresponding datanodes


     Architecting the Future of Big Data     Page 8
WebHDFS features (3)
• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens
  – Supports proxy users

• An HDFS Built-in Component
  – WebHDFS is a first-class built-in component of HDFS
  – Runs inside namenodes and datanodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above


     Architecting the Future of Big Data     Page 9
WebHDFS URI & URL
• FileSystem scheme: webhdfs://

• FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
  – Path prefix: /webhdfs/v1
  – Query: ?op=..


     Architecting the Future of Big Data     Page 10
URI/URL Examples
• Suppose we have the following file
    hdfs://namenode:8020/user/szetszwo/w.txt

• WebHDFS FileSystem URI
    webhdfs://namenode:50070/user/szetszwo/w.txt

• WebHDFS HTTP URL
    http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..

• WebHDFS HTTP URL to open the file
    http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN


     Architecting the Future of Big Data     Page 11
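The mapping from a webhdfs:// FileSystem URI to the REST HTTP URL above is mechanical: same host and port, the /webhdfs/v1 path prefix, and the operation in the ?op= query. A minimal sketch in Python (the helper name is ours, not part of Hadoop):

```python
from urllib.parse import urlsplit, quote

def webhdfs_http_url(fs_uri, op):
    """Map a webhdfs:// FileSystem URI to the corresponding WebHDFS HTTP URL."""
    parts = urlsplit(fs_uri)
    # Same <HOST>:<HTTP_PORT>, with the /webhdfs/v1 prefix prepended to the
    # HDFS path and the operation passed as the ?op= query parameter.
    return "http://%s/webhdfs/v1%s?op=%s" % (parts.netloc, quote(parts.path), op)

print(webhdfs_http_url("webhdfs://namenode:50070/user/szetszwo/w.txt", "OPEN"))
# -> http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
```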
Example: curl
• Use curl to open a file

    $ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Content-Type: application/octet-stream
    Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
    Content-Length: 0
    Server: Jetty(6.1.26)


     Architecting the Future of Big Data     Page 12
Example: curl (2)

    HTTP/1.1 200 OK
    Content-Type: application/octet-stream
    Content-Length: 21
    Server: Jetty(6.1.26)

    Hello, WebHDFS user!


     Architecting the Future of Big Data     Page 13
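The two-step exchange above is how WebHDFS achieves data locality: the namenode answers with a 307 redirect naming the datanode that holds the block, and the client (here, curl -L) repeats the request there. A non-redirect-following client would parse the Location header itself; a small sketch of that step (the function is ours, the sample response is the one shown above):

```python
def redirect_location(raw_response):
    """Pull the datanode URL out of a raw HTTP 307 response from the namenode."""
    for line in raw_response.splitlines():
        if line.lower().startswith("location:"):
            return line.split(":", 1)[1].strip()
    return None

# The 307 response from the namenode, as shown on the previous slide:
response = """HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0"""

# The client would then GET this datanode URL to read the actual file bytes.
print(redirect_location(response))
```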
Example: wget
• Use wget to open the same file

    $ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt

    Resolving ... Connecting to ... connected.
    HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
    Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]


     Architecting the Future of Big Data     Page 14
Example: wget (2)

    --2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
    Connecting to 192.168.5.2:50075... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 21 [application/octet-stream]
    Saving to: `w.txt'

    100%[=================>] 21 --.-K/s in 0s

    2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]


     Architecting the Future of Big Data     Page 15
Example: Firefox
(screenshot: opening the same WebHDFS URL in a browser)


     Architecting the Future of Big Data     Page 16
HCatalog REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
• Uses JSON to describe metadata objects
• Versioned, because we assume we will have to update it:
    http://hadoop.acme.com/templeton/v1/…
• Runs in a Jetty server
• Supports security
  – Authentication done via Kerberos using SPNEGO
• Included in HDP, runs on the Thrift metastore server machine
• Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182


      © 2012 Hortonworks     Page 17
HCatalog REST API
Get a list of all tables in the default database:

    GET http://…/v1/ddl/database/default/table

    { "tables": ["counted", "processed"],
      "database": "default" }

Indicate the user with a URL parameter:
    http://…/v1/ddl/database/default/table?user.name=gates
Actions are authorized as the indicated user.


      © Hortonworks 2012     Page 18
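From a client's point of view this is just URL construction plus JSON parsing. A minimal sketch (the helper name is ours; the base URL and response body follow the slide above):

```python
import json
from urllib.parse import urlencode

def list_tables_url(base, database, user=None):
    """Build the 'list tables' URL, identifying the caller via user.name."""
    url = "%s/v1/ddl/database/%s/table" % (base, database)
    if user:
        url += "?" + urlencode({"user.name": user})
    return url

print(list_tables_url("http://hadoop.acme.com/templeton", "default", user="gates"))

# Response shaped like the one above (the slide's trailing comma inside the
# JSON array is presumably a transcription artifact and is dropped here):
reply = json.loads('{"tables": ["counted", "processed"], "database": "default"}')
print(reply["tables"])
```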
HCatalog REST API
Create a new table "rawevents":

    PUT http://…/v1/ddl/database/default/table/rawevents

    {"columns": [{ "name": "url", "type": "string" },
                 { "name": "user", "type": "string" }],
     "partitionedBy": [{ "name": "ds", "type": "string" }]}

    { "table": "rawevents", "database": "default" }


      © Hortonworks 2012     Page 19
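A client builds that request by serializing the column and partition descriptions to JSON and issuing a PUT. A sketch using only the standard library (the request is constructed but not sent; the host name is the slide's example):

```python
import json
from urllib.request import Request

# The PUT body for creating "rawevents", as shown on the slide above.
payload = {
    "columns": [
        {"name": "url", "type": "string"},
        {"name": "user", "type": "string"},
    ],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}

req = Request(
    "http://hadoop.acme.com/templeton/v1/ddl/database/default/table/rawevents",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req) would actually issue the request.
print(req.get_method(), req.full_url)
```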
HCatalog REST API
Describe table "rawevents":

    GET http://…/v1/ddl/database/default/table/rawevents

    { "columns": [{"name": "url", "type": "string"},
                  {"name": "user", "type": "string"}],
      "database": "default",
      "table": "rawevents" }


      © Hortonworks 2012     Page 20
Job Management
• Includes APIs to submit and monitor jobs
• Any files needed for the job are first uploaded to HDFS via WebHDFS
  – Pig and Hive scripts
  – Jars, Python scripts, or Ruby scripts for UDFs
  – Pig macros
• Results from the job are stored to HDFS and can be retrieved via WebHDFS
• The user is responsible for cleaning up output in HDFS
• Job state information is stored in ZooKeeper or HDFS


      © 2012 Hortonworks     Page 21
Job Submission
• Can submit MapReduce, Pig, and Hive jobs
• POST parameters include
  – the script to run, or an HDFS file containing the script/jar to run
  – the username to execute the job as
  – optionally, an HDFS directory to write results to (defaults to the user's home directory)
  – optionally, a URL to invoke GET on when the job is done

    POST http://hadoop.acme.com/templeton/v1/pig

    {"id": "job_201111111311_0012", …}


      © 2012 Hortonworks     Page 22
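The submission body is an ordinary HTML-form encoding of those parameters. A sketch of building it for the Pig endpoint above; note the field names ("execute", "statusdir", "callback") are assumptions on our part, since the slide lists the parameters only by purpose, not by name:

```python
from urllib.parse import urlencode

# Hypothetical field names -- the slide describes the parameters' purposes
# but does not name them.
fields = {
    "user.name": "gates",                        # execute the job as this user
    "execute": "A = load 'rawevents'; dump A;",  # the script to run (or a
                                                 # file parameter naming an
                                                 # HDFS path instead)
    "statusdir": "/user/gates/pigout",           # optional: results directory
    "callback": "http://client.acme.com/done",   # optional: GET-ed when done
}

# This string would be the body of POST http://hadoop.acme.com/templeton/v1/pig
body = urlencode(fields)
print(body)
```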
Find all Your Jobs
• GET on queue returns all jobs belonging to the submitting user
• Pig, Hive, and MapReduce jobs will be returned

    GET http://…/templeton/v1/queue?user.name=gates

    {"job_201111111311_0008", "job_201111111311_0012"}


      © 2012 Hortonworks     Page 23
Get Status of a Job
• Doing a GET on the jobid gets you information about a particular job
• Can be used to poll to see if the job is finished
• Used after the job is finished to get job information
• Doing a DELETE on the jobid kills the job

    GET http://…/templeton/v1/queue/job_201111111311_0012

    {…,
     "percentComplete": "100% complete",
     "exitValue": 0, …
     "completed": "done" }


      © 2012 Hortonworks     Page 24
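A polling client only needs to parse that JSON reply and check the completion marker. A minimal sketch (the function is ours; the reply shape follows the slide above):

```python
import json

def is_finished(status_json):
    """Decide from a queue/<jobid> reply whether the job has completed."""
    status = json.loads(status_json)
    # Per the example reply, "completed" is set to "done" once the job ends;
    # a real client would also inspect "exitValue" for success or failure.
    return status.get("completed") == "done"

reply = '{"percentComplete": "100% complete", "exitValue": 0, "completed": "done"}'
print(is_finished(reply))
```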
Future
• Job management
  – The job management APIs don't belong in HCatalog
  – They are only there by historical accident
  – Need to move them out to the MapReduce framework
• Authentication needs more options than Kerberos
• Integration with Oozie
• Need a directory service
  – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission


      © 2012 Hortonworks     Page 25