Web Services Hadoop Summit 2012


  1. Web Services in Hadoop
     Nicholas Sze and Alan F. Gates (@szetszwo, @alanfgates)
  2. REST-ful API Front-door for Hadoop
     • Opens the door to languages other than Java
     • Thin clients via web services vs. fat clients in a gateway
     • Insulation from interface changes release to release
     [Diagram: HCatalog web interfaces fronting MapReduce, Pig, Hive, and HCatalog, over HDFS, HBase, and an external store]
     © 2012 Hortonworks
  3. Not Covered in this Talk
     • HttpFS (f.k.a. Hoop) – same API as WebHDFS but proxied
     • Stargate – REST API for HBase
  4. HDFS Clients
     • DFSClient: the native client
       – High performance (uses RPC)
       – Java binding only
     • libhdfs: a C client interface
       – Uses JNI, which adds large overhead
       – Also bound to Java (requires a Hadoop installation)
     Architecting the Future of Big Data
  5. HFTP
     • Designed for cross-version copying (DistCp)
       – High performance (uses HTTP)
       – Read-only
       – The HTTP API is proprietary
       – Clients must use HftpFileSystem (hftp://)
     • WebHDFS is a rewrite of HFTP
  6. Design Goals
     • Support a public HTTP API
     • Support read and write
     • High performance
     • Cross-version compatibility
     • Security
  7. WebHDFS Features
     • HTTP REST API
       – Defines a public API
       – Permits non-Java client implementations
       – Supports common tools like curl and wget
     • Wire compatibility
       – The REST API will be maintained for wire compatibility
       – WebHDFS clients can talk to different Hadoop versions
  8. WebHDFS Features (2)
     • A complete HDFS interface
       – Supports all user operations: reading files, writing to files, mkdir, chmod, chown, mv, rm, …
     • High performance
       – Uses HTTP redirection to provide data locality
       – File reads/writes are redirected to the corresponding datanodes
  9. WebHDFS Features (3)
     • Secure authentication
       – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens
       – Supports proxy users
     • An HDFS built-in component
       – WebHDFS is a first-class, built-in component of HDFS
       – Runs inside namenodes and datanodes
     • Apache open source
       – Available in Apache Hadoop 1.0 and above
  10. WebHDFS URI & URL
     • FileSystem scheme: webhdfs://
     • FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
     • HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
       – Path prefix: /webhdfs/v1
       – Query: ?op=..
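The URI-to-URL mapping above can be sketched as a small helper. This is only a sketch: the hostname and port are the example values used throughout these slides, and percent-encoding of the path is left to the standard library.

```python
from urllib.parse import quote

WEBHDFS_PREFIX = "/webhdfs/v1"

def webhdfs_url(host: str, http_port: int, path: str, op: str) -> str:
    """Map an HDFS path to the corresponding WebHDFS HTTP URL."""
    # quote() percent-encodes the path but leaves '/' intact by default.
    return f"http://{host}:{http_port}{WEBHDFS_PREFIX}{quote(path)}?op={op}"

print(webhdfs_url("namenode", 50070, "/user/szetszwo/w.txt", "OPEN"))
# http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
```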
  11. URI/URL Examples
     • Suppose we have the following file: hdfs://namenode:8020/user/szetszwo/w.txt
     • WebHDFS FileSystem URI: webhdfs://namenode:50070/user/szetszwo/w.txt
     • WebHDFS HTTP URL: http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..
     • WebHDFS HTTP URL to open the file: http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
  12. Example: curl
     • Use curl to open a file:
     $ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"
     HTTP/1.1 307 TEMPORARY_REDIRECT
     Content-Type: application/octet-stream
     Location: http://<DATANODE>:<PORT>/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
     Server: Jetty(6.1.26)
  13. Example: curl (2)
     HTTP/1.1 200 OK
     Content-Type: application/octet-stream
     Content-Length: 21
     Server: Jetty(6.1.26)

     Hello, WebHDFS user!
  14. Example: wget
     • Use wget to open the same file:
     $ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt
     Resolving ...
     Connecting to ... connected.
     HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
     Location: ... [following]
  15. Example: wget (2)
     --2012-06-13 01:42:10-- ... connected.
     HTTP request sent, awaiting response... 200 OK
     Length: 21 [application/octet-stream]
     Saving to: `w.txt'
     100%[=================>] 21 --.-K/s in 0s
     2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]
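Programmatically, the same two-step read can be sketched with nothing but Python's standard library: `urlopen` follows the namenode's 307 redirect to the serving datanode automatically. The host and port are the example values from these slides, and error handling is omitted.

```python
from urllib.request import urlopen

def open_op_url(namenode: str, port: int, path: str) -> str:
    """URL for the WebHDFS OPEN operation on an HDFS path."""
    return f"http://{namenode}:{port}/webhdfs/v1{path}?op=OPEN"

def read_file(namenode: str, port: int, path: str) -> bytes:
    """Fetch file contents; urlopen transparently follows the
    namenode's redirect to the datanode holding the data."""
    with urlopen(open_op_url(namenode, port, path)) as resp:
        return resp.read()

# Against the example cluster this would return the 21 bytes shown above:
# read_file("namenode", 50070, "/user/szetszwo/w.txt")
```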
  16. Example: Firefox [screenshot]
  17. HCatalog REST API
     • REST endpoints: databases, tables, partitions, columns, table properties
     • PUT to create/update, GET to list or describe, DELETE to drop
     • Uses JSON to describe metadata objects
     • Versioned, because we assume we will have to update it: http://hadoop.acme.com/templeton/v1/…
     • Runs in a Jetty server
     • Supports security
       – Authentication done via Kerberos using SPNEGO
     • Included in HDP; runs on the Thrift metastore server machine
     • Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182
  18. HCatalog REST API
     Get a list of all tables in the default database:
     GET http://…/v1/ddl/database/default/table
     Response:
     { "tables": ["counted", "processed"], "database": "default" }
     Indicate the user with a URL parameter:
     http://…/v1/ddl/database/default/table?user.name=gates
     Actions are authorized as the indicated user.
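The listing call above can be sketched with the standard library. The base URL is the hypothetical hadoop.acme.com endpoint from the slides, and the actual network call is left commented out.

```python
import json
from urllib.parse import urlencode

def list_tables_url(base: str, database: str, user: str) -> str:
    """Table-listing URL, with the acting user passed as user.name."""
    query = urlencode({"user.name": user})
    return f"{base}/v1/ddl/database/{database}/table?{query}"

def table_names(body: str) -> list:
    """Pull the table names out of the JSON response."""
    return json.loads(body)["tables"]

url = list_tables_url("http://hadoop.acme.com/templeton", "default", "gates")
# from urllib.request import urlopen
# names = table_names(urlopen(url).read())
names = table_names('{"tables": ["counted", "processed"], "database": "default"}')
```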
  19. HCatalog REST API
     Create a new table "rawevents":
     PUT http://…/v1/ddl/database/default/table/rawevents
     {"columns": [{"name": "url", "type": "string"},
                  {"name": "user", "type": "string"}],
      "partitionedBy": [{"name": "ds", "type": "string"}]}
     Response:
     { "table": "rawevents", "database": "default" }
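The PUT body above can be assembled like this. This is a sketch only: it produces exactly the JSON shape shown on the slide, and sending it would additionally need an HTTP client that supports PUT.

```python
import json

def table_definition(columns, partitioned_by):
    """JSON body for the create-table PUT, in the shape shown above.
    Both arguments are lists of (name, type) pairs."""
    return json.dumps({
        "columns": [{"name": n, "type": t} for n, t in columns],
        "partitionedBy": [{"name": n, "type": t} for n, t in partitioned_by],
    })

body = table_definition([("url", "string"), ("user", "string")],
                        [("ds", "string")])
```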
  20. HCatalog REST API
     Describe table "rawevents":
     GET http://…/v1/ddl/database/default/table/rawevents
     Response:
     { "columns": [{"name": "url", "type": "string"},
                   {"name": "user", "type": "string"}],
       "database": "default",
       "table": "rawevents" }
  21. Job Management
     • Includes APIs to submit and monitor jobs
     • Any files needed for the job are first uploaded to HDFS via WebHDFS
       – Pig and Hive scripts
       – Jars, Python scripts, or Ruby scripts for UDFs
       – Pig macros
     • Results from the job are stored to HDFS and can be retrieved via WebHDFS
     • The user is responsible for cleaning up output in HDFS
     • Job state information is stored in ZooKeeper or HDFS
  22. Job Submission
     • Can submit MapReduce, Pig, and Hive jobs
     • POST parameters include:
       – script to run, or HDFS file containing the script/jar to run
       – username to execute the job as
       – optionally, an HDFS directory to write results to (defaults to the user's home directory)
       – optionally, a URL to invoke GET on when the job is done
     POST http://hadoop.acme.com/templeton/v1/pig
     Response:
     {"id": "job_201111111311_0012", …}
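A sketch of building the form-encoded POST body for the submission. The parameter names used here (file, user.name, statusdir, callback) are illustrative stand-ins for the parameters the slide describes, not a verified list.

```python
from urllib.parse import urlencode

def pig_submission_body(script_file: str, user: str,
                        status_dir: str = None, callback: str = None) -> str:
    """Form-encode the submission parameters described above.
    Optional parameters are simply omitted when not given."""
    params = {"file": script_file, "user.name": user}
    if status_dir is not None:
        params["statusdir"] = status_dir
    if callback is not None:
        params["callback"] = callback
    return urlencode(params)

body = pig_submission_body("/user/gates/wordcount.pig", "gates")
# POST this body to http://hadoop.acme.com/templeton/v1/pig
```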
  23. Find All Your Jobs
     • A GET on the queue returns all jobs belonging to the submitting user
     • Pig, Hive, and MapReduce jobs are all returned
     GET http://…/templeton/v1/queue?user.name=gates
     Response:
     ["job_201111111311_0008", "job_201111111311_0012"]
  24. Get Status of a Job
     • A GET on the jobid returns information about a particular job
     • Can be used to poll to see if the job is finished
     • Used after the job is finished to get job information
     • A DELETE on the jobid kills the job
     GET http://…/templeton/v1/queue/job_201111111311_0012
     Response:
     {…, "percentComplete": "100% complete", "exitValue": 0, …, "completed": "done"}
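Polling the status endpoint then reduces to parsing the response until completion is reported. A sketch assuming only the fields shown on the slide:

```python
import json

def job_finished(status_body: str) -> bool:
    """True once the status response reports the job as done."""
    return json.loads(status_body).get("completed") == "done"

# Example bodies in the shape shown above:
running = '{"percentComplete": "50% complete", "completed": null}'
done = '{"percentComplete": "100% complete", "exitValue": 0, "completed": "done"}'
```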
  25. Future
     • Job management
       – Job management APIs don't belong in HCatalog
       – They are only there by historical accident
       – Need to move them out to the MapReduce framework
     • Authentication needs more options than Kerberos
     • Integration with Oozie
     • Need a directory service
       – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission