Web Services Hadoop Summit 2012
Presentation Transcript

• Web Services in Hadoop
  Nicholas Sze and Alan F. Gates (@szetszwo, @alanfgates)
• REST-ful API: Front-door for Hadoop
  • Opens the door to languages other than Java
  • Thin clients via web services vs. fat clients in a gateway
  • Insulation from interface changes release to release
  [Architecture diagram: HCatalog web interfaces in front of MapReduce, Pig, Hive, HCatalog, HDFS, HBase, and an external store]
• Not Covered in this Talk
  • HttpFS (fka Hoop) – same API as WebHDFS but proxied
  • Stargate – REST API for HBase
• HDFS Clients
  • DFSClient: the native client
    – High performance (using RPC)
    – Java binding only
  • libhdfs: a C++ client interface
    – Uses JNI => large overhead
    – Still bound to Java (requires a Hadoop installation)
• HFTP
  • Designed for cross-version copying (DistCp)
    – High performance (using HTTP)
    – Read-only
    – The HTTP API is proprietary
    – Clients must use HftpFileSystem (hftp://)
  • WebHDFS is a rewrite of HFTP
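To make the cross-version copying concrete: the usual pattern is to run DistCp on the destination cluster and read from the source over hftp:// on the namenode's HTTP port. A minimal sketch; the host names and paths below are placeholders, not from the slides:

    $ hadoop distcp hftp://source-namenode:50070/user/szetszwo/data \
                    hdfs://dest-namenode:8020/user/szetszwo/data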
• Design Goals
  • Support a public HTTP API
  • Support Read and Write
  • High Performance
  • Cross-version
  • Security
• WebHDFS features
  • HTTP REST API
    – Defines a public API
    – Permits non-Java client implementations
    – Supports common tools like curl/wget
  • Wire Compatibility
    – The REST API will be maintained for wire compatibility
    – WebHDFS clients can talk to different Hadoop versions
• WebHDFS features (2)
  • A Complete HDFS Interface
    – Supports all user operations: reading files, writing to files, mkdir, chmod, chown, mv, rm, …
  • High Performance
    – Uses HTTP redirection to provide data locality
    – File reads and writes are redirected to the corresponding datanodes
• WebHDFS features (3)
  • Secure Authentication
    – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens
    – Supports proxy users
  • An HDFS Built-in Component
    – WebHDFS is a first-class, built-in component of HDFS
    – Runs inside Namenodes and Datanodes
  • Apache Open Source
    – Available in Apache Hadoop 1.0 and above
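As a quick sanity check that the built-in component is answering, a single curl call against the namenode suffices. This is a sketch assuming WebHDFS has been enabled (dfs.webhdfs.enabled=true in hdfs-site.xml) and reusing the placeholder namenode host and default HTTP port from the later slides:

    $ curl -i "http://namenode:50070/webhdfs/v1/?op=GETHOMEDIRECTORY"
    # expect HTTP/1.1 200 OK and a small JSON body such as {"Path":"/user/<username>"}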
• WebHDFS URI & URL
  • FileSystem scheme: webhdfs://
  • FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
  • HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
    – Path prefix: /webhdfs/v1
    – Query: ?op=..
• URI/URL Examples
  • Suppose we have the following file:
    hdfs://namenode:8020/user/szetszwo/w.txt
  • WebHDFS FileSystem URI:
    webhdfs://namenode:50070/user/szetszwo/w.txt
  • WebHDFS HTTP URL:
    http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..
  • WebHDFS HTTP URL to open the file:
    http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
• Example: curl
  Use curl to open a file:
    $ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"
    HTTP/1.1 307 TEMPORARY_REDIRECT
    Content-Type: application/octet-stream
    Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
    Content-Length: 0
    Server: Jetty(6.1.26)
• Example: curl (2)
    HTTP/1.1 200 OK
    Content-Type: application/octet-stream
    Content-Length: 21
    Server: Jetty(6.1.26)

    Hello, WebHDFS user!
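Metadata operations use the same URL scheme but return JSON directly from the namenode instead of redirecting to a datanode. A hedged sketch, reusing the same placeholder host and directory as the example above:

    $ curl -i "http://namenode:50070/webhdfs/v1/user/szetszwo?op=LISTSTATUS"
    # returns a "FileStatuses" JSON object listing w.txt and the directory's other entries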
• Example: wget
  Use wget to open the same file:
    $ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt
    Resolving ...
    Connecting to ... connected.
    HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
    Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]
• Example: wget (2)
    --2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
    Connecting to 192.168.5.2:50075... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 21 [application/octet-stream]
    Saving to: `w.txt'
    100%[=================>] 21  --.-K/s  in 0s
    2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]
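Writes follow the same redirect pattern in reverse: the namenode answers with a 307 pointing at a datanode, and the file body is then sent to that datanode URL. A rough two-step sketch; the file name, hosts, and the exact redirect URL are placeholders, and the real target comes from the Location header of the first response:

    # Step 1: ask the namenode where to write (no data is sent yet)
    $ curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/szetszwo/new.txt?op=CREATE"
    # Step 2: PUT the file body to the datanode URL returned in the Location header
    $ curl -i -X PUT -T new.txt "http://<datanode>:50075/webhdfs/v1/user/szetszwo/new.txt?op=CREATE&…"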
• Example: Firefox
  [Screenshot: the same WebHDFS URL opened in a browser]
• HCatalog REST API
  • REST endpoints: databases, tables, partitions, columns, table properties
  • PUT to create/update, GET to list or describe, DELETE to drop
  • Uses JSON to describe metadata objects
  • Versioned, because we assume we will have to update it: http://hadoop.acme.com/templeton/v1/…
  • Runs in a Jetty server
  • Supports security
    – Authentication done via Kerberos using SPNEGO
  • Included in HDP; runs on the Thrift metastore server machine
  • Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182
• HCatalog REST API
  Get a list of all tables in the default database:
    GET http://…/v1/ddl/database/default/table
  Response from Hadoop/HCatalog:
    { "tables": ["counted","processed"], "database": "default" }
  Indicate the user with a URL parameter:
    http://…/v1/ddl/database/default/table?user.name=gates
  Actions are authorized as the indicated user.
• HCatalog REST API
  Create new table "rawevents":
    PUT http://…/v1/ddl/database/default/table/rawevents
    {"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string" }],
     "partitionedBy": [{ "name": "ds", "type": "string" }]}
  Response from Hadoop/HCatalog:
    { "table": "rawevents", "database": "default" }
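For comparison with the raw request above, the same create-table call could be issued from the command line. This is only a sketch against the placeholder Templeton host used elsewhere in the deck, with the user indicated via user.name as on the previous slide:

    $ curl -X PUT -H "Content-Type: application/json" \
        -d '{"columns": [{"name": "url", "type": "string"}, {"name": "user", "type": "string"}],
             "partitionedBy": [{"name": "ds", "type": "string"}]}' \
        "http://hadoop.acme.com/templeton/v1/ddl/database/default/table/rawevents?user.name=gates"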
• HCatalog REST API
  Describe table "rawevents":
    GET http://…/v1/ddl/database/default/table/rawevents
  Response from Hadoop/HCatalog:
    { "columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}],
      "database": "default", "table": "rawevents" }
• Job Management
  • Includes APIs to submit and monitor jobs
  • Any files needed for the job are first uploaded to HDFS via WebHDFS
    – Pig and Hive scripts
    – Jars, Python scripts, or Ruby scripts for UDFs
    – Pig macros
  • Results from the job are stored to HDFS and can be retrieved via WebHDFS
  • User is responsible for cleaning up output in HDFS
  • Job state information is stored in ZooKeeper or HDFS
• Job Submission
  • Can submit MapReduce, Pig, and Hive jobs
  • POST parameters include:
    – script to run, or HDFS file containing the script/jar to run
    – username to execute the job as
    – optionally, an HDFS directory to write results to (defaults to the user's home directory)
    – optionally, a URL to invoke GET on when the job is done
  Example:
    POST http://hadoop.acme.com/templeton/v1/pig
  Response from Hadoop/HCatalog:
    {"id": "job_201111111311_0012",…}
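A sketch of what such a submission might look like with curl. The parameter names used here (execute for an inline script, statusdir for the results directory) and the Pig script itself are illustrative assumptions rather than details taken from the slides:

    $ curl -d user.name=gates \
           -d execute="A = load 'rawevents'; dump A;" \
           -d statusdir=pig.output \
           "http://hadoop.acme.com/templeton/v1/pig"
    # response resembles the one on the slide: {"id": "job_201111111311_0012", …}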
• Find all Your Jobs
  • GET on queue returns all jobs belonging to the submitting user
  • Pig, Hive, and MapReduce jobs will be returned
  Example:
    GET http://…/templeton/v1/queue?user.name=gates
  Response from Hadoop/HCatalog:
    {"job_201111111311_0008", "job_201111111311_0012"}
• Get Status of a Job
  • Doing a GET on jobid gets you information about a particular job
  • Can be used to poll to see if the job is finished
  • Used after the job is finished to get job information
  • Doing a DELETE on jobid kills the job
  Example:
    GET http://…/templeton/v1/queue/job_201111111311_0012
  Response from Hadoop/HCatalog:
    {…, "percentComplete": "100% complete", "exitValue": 0,… "completed": "done" }
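Putting the two operations together, a hedged command-line sketch using the job id from the slide and the placeholder host from earlier slides:

    # poll the job's status
    $ curl "http://hadoop.acme.com/templeton/v1/queue/job_201111111311_0012?user.name=gates"
    # kill the job
    $ curl -X DELETE "http://hadoop.acme.com/templeton/v1/queue/job_201111111311_0012?user.name=gates"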
• Future
  • Job management
    – Job management APIs don't belong in HCatalog
    – Only there by historical accident
    – Need to move them out to the MapReduce framework
  • Authentication needs more options than Kerberos
  • Integration with Oozie
  • Need a directory service
    – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission