Web Services Hadoop Summit 2012


Published on

Published in: Technology, Education
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Web Services in Hadoop
    Nicholas Sze and Alan F. Gates
    @szetszwo, @alanfgates
  • 2. REST-ful API Front-Door for Hadoop
    – Opens the door to languages other than Java
    – Thin clients via web services vs. fat clients in a gateway
    – Insulation from interface changes release to release
    [Architecture diagram: HCatalog web interfaces in front of MapReduce, Pig, Hive, and HCatalog, over HDFS, HBase, and external stores]
  • 3. Not Covered in This Talk
    – HttpFS (formerly known as Hoop): same API as WebHDFS, but proxied
    – Stargate: REST API for HBase
  • 4. HDFS Clients
    – DFSClient: the native client
      – High performance (uses RPC)
      – Java binding only
    – libhdfs: a C client interface
      – Uses JNI, so it carries a large overhead
      – Also tied to Java (requires a Hadoop installation)
  • 5. HFTP
    – Designed for cross-version copying (DistCp)
      – High performance (uses HTTP)
      – Read-only
      – The HTTP API is proprietary
      – Clients must use HftpFileSystem (hftp://)
    – WebHDFS is a rewrite of HFTP
  • 6. Design Goals
    – Support a public HTTP API
    – Support read and write
    – High performance
    – Cross-version
    – Security
  • 7. WebHDFS Features
    – HTTP REST API
      – Defines a public API
      – Permits non-Java client implementations
      – Supports common tools like curl and wget
    – Wire compatibility
      – The REST API will be maintained for wire compatibility
      – WebHDFS clients can talk to different Hadoop versions
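    Because the API is plain HTTP, no Hadoop client code is required at all. A minimal sketch of a directory listing with curl, reusing the placeholder namenode host and port that appear in the later slides:

      # List a directory over WebHDFS; the namenode answers directly with JSON.
      curl -i "http://namenode:50070/webhdfs/v1/user/szetszwo?op=LISTSTATUS"
      # GETFILESTATUS works the same way for a single file:
      curl -i "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=GETFILESTATUS"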
  • 8. WebHDFS Features (2)
    – A complete HDFS interface
      – Supports all user operations: reading files, writing to files, mkdir, chmod, chown, mv, rm, …
    – High performance
      – Uses HTTP redirection to provide data locality
      – File reads and writes are redirected to the corresponding datanodes
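    The redirection is easiest to see on a write. A sketch of the two-step create with curl, assuming the same placeholder namenode as above; the datanode address in step 2 is whatever the Location header from step 1 contains:

      # Step 1: ask the namenode to create the file; it replies 307 and points at a datanode.
      curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/szetszwo/new.txt?op=CREATE"
      # Step 2: send the bytes to the datanode URL returned in the Location header.
      curl -i -X PUT -T local.txt "http://<datanode>:<port>/webhdfs/v1/user/szetszwo/new.txt?op=CREATE&..."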
  • 9. WebHDFS Features (3)
    – Secure authentication
      – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens
      – Supports proxy users
    – An HDFS built-in component
      – WebHDFS is a first-class, built-in component of HDFS
      – Runs inside namenodes and datanodes
    – Apache open source
      – Available in Apache Hadoop 1.0 and above
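    On a secured cluster the same URLs work from curl, provided curl was built with SPNEGO/GSS support and a Kerberos ticket is already in the credential cache; alternatively a delegation token can be passed on the query string. A hedged sketch:

      # SPNEGO authentication (requires a curl build with GSS support and a ticket from kinit).
      kinit szetszwo
      curl -i --negotiate -u : "http://namenode:50070/webhdfs/v1/user/szetszwo?op=GETFILESTATUS"
      # Or reuse a previously obtained delegation token:
      curl -i "http://namenode:50070/webhdfs/v1/user/szetszwo?op=GETFILESTATUS&delegation=<token>"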
  • 10. WebHDFS URI & URL
    – FileSystem scheme: webhdfs://
    – FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
    – HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
      – Path prefix: /webhdfs/v1
      – Query: ?op=..
  • 11. URI/URL Examples
    – Suppose we have the following file:
      hdfs://namenode:8020/user/szetszwo/w.txt
    – WebHDFS FileSystem URI:
      webhdfs://namenode:50070/user/szetszwo/w.txt
    – WebHDFS HTTP URL:
      http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..
    – WebHDFS HTTP URL to open the file:
      http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
  • 12. Example: curl
    – Use curl to open a file:
      $ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"
      HTTP/1.1 307 TEMPORARY_REDIRECT
      Content-Type: application/octet-stream
      Location: http://<datanode>:<port>/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
      Server: Jetty(6.1.26)
  • 13. Example: curl (2)
    – curl follows the redirect to the datanode, which returns the file contents:
      HTTP/1.1 200 OK
      Content-Type: application/octet-stream
      Content-Length: 21
      Server: Jetty(6.1.26)

      Hello, WebHDFS user!
  • 14. Example: wget
    – Use wget to open the same file:
      $ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt
      Resolving ... Connecting to ... connected.
      HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
      Location: ... [following]
  • 15. Example: wget (2)
      --2012-06-13 01:42:10--  ...
      Connecting to ... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 21 [application/octet-stream]
      Saving to: `w.txt'
      100%[=================>] 21  --.-K/s   in 0s
      2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]
  • 16. Example: Firefox
    [Screenshot: opening the same WebHDFS URL in a browser]
  • 17. HCatalog REST API
    – REST endpoints: databases, tables, partitions, columns, table properties
    – PUT to create/update, GET to list or describe, DELETE to drop
    – Uses JSON to describe metadata objects
    – Versioned, because we assume we will have to update it: …
    – Runs in a Jetty server
    – Supports security
      – Authentication done via Kerberos using SPNEGO
    – Included in HDP; runs on the Thrift metastore server machine
    – Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182
  • 18. HCatalog REST API
    Get a list of all tables in the default database:
      GET http://…/v1/ddl/database/default/table
    Response:
      { "tables": ["counted", "processed"], "database": "default" }
    Indicate the user with a URL parameter:
      http://…/v1/ddl/database/default/table?…
    Actions are authorized as the indicated user.
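    The same call from the command line might look like the sketch below; the server address and port are placeholders, user.name is assumed to be the user-identifying query parameter the slide elides, and the /templeton/v1 base path from the job slides is assumed to apply to the DDL calls as well:

      # List tables in the default database, acting as user "alan" (user.name assumed).
      curl -s "http://<server>:<port>/templeton/v1/ddl/database/default/table?user.name=alan"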
  • 19. HCatalog REST API
    Create a new table "rawevents":
      PUT http://…/v1/ddl/database/default/table/rawevents
      {"columns": [{ "name": "url", "type": "string" },
                   { "name": "user", "type": "string" }],
       "partitionedBy": [{ "name": "ds", "type": "string" }]}
    Response:
      { "table": "rawevents", "database": "default" }
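    A corresponding curl sketch for the create, with the JSON body sent inline (server, port, and user.name are the same placeholder assumptions as above):

      # Create the "rawevents" table with two columns and one partition key.
      curl -s -X PUT -H "Content-Type: application/json" \
           -d '{"columns": [{"name": "url", "type": "string"},
                            {"name": "user", "type": "string"}],
                "partitionedBy": [{"name": "ds", "type": "string"}]}' \
           "http://<server>:<port>/templeton/v1/ddl/database/default/table/rawevents?user.name=alan"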
  • 20. HCatalog REST API
    Describe table "rawevents":
      GET http://…/v1/ddl/database/default/table/rawevents
    Response:
      { "columns": [{"name": "url", "type": "string"},
                    {"name": "user", "type": "string"}],
        "database": "default",
        "table": "rawevents" }
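    Since the API is symmetric, describing and dropping the table differ only in the HTTP verb; a sketch under the same placeholder assumptions:

      # Describe the table:
      curl -s "http://<server>:<port>/templeton/v1/ddl/database/default/table/rawevents?user.name=alan"
      # Drop it with DELETE on the same URL:
      curl -s -X DELETE "http://<server>:<port>/templeton/v1/ddl/database/default/table/rawevents?user.name=alan"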
  • 21. Job Management
    – Includes APIs to submit and monitor jobs
    – Any files needed for the job are first uploaded to HDFS via WebHDFS
      – Pig and Hive scripts
      – Jars, Python scripts, or Ruby scripts for UDFs
      – Pig macros
    – Results from the job are stored to HDFS and can be retrieved via WebHDFS
    – The user is responsible for cleaning up output in HDFS
    – Job state information is stored in ZooKeeper or HDFS
  • 22. Job Submission
    – Can submit MapReduce, Pig, and Hive jobs
    – POST parameters include:
      – script to run, or HDFS file containing the script/jar to run
      – username to execute the job as
      – optionally an HDFS directory to write results to (defaults to the user's home directory)
      – optionally a URL to invoke GET on when the job is done
    – Submitting returns the job id:
      POST http://…/templeton/v1/…
      {"id": "job_201111111311_0012", …}
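    As an illustration only, a Pig submission might look like the sketch below; the /templeton/v1/pig endpoint and the file/statusdir parameter names are assumptions based on the Templeton code, and the HDFS paths are placeholders:

      # Submit a Pig job; script.pig was first uploaded to HDFS via WebHDFS.
      curl -s -d user.name=alan \
           -d file=/user/alan/script.pig \
           -d statusdir=/user/alan/results \
           "http://<server>:<port>/templeton/v1/pig"
      # The response carries the job id, e.g. {"id": "job_201111111311_0012", …}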
  • 23. Find All Your Jobs
    – A GET on queue returns all jobs belonging to the submitting user
    – Pig, Hive, and MapReduce jobs are all returned
      GET http://…/templeton/v1/queue?…
    Response:
      {"job_201111111311_0008", "job_201111111311_0012"}
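    A one-line sketch of the queue listing, with the same placeholder server and the assumed user.name parameter:

      # List all jobs submitted by user "alan".
      curl -s "http://<server>:<port>/templeton/v1/queue?user.name=alan"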
  • 24. Get Status of a Job
    – A GET on the job id returns information about that particular job
    – Can be used to poll whether the job is finished
    – Used after the job is finished to get job information
    – A DELETE on the job id kills the job
      GET http://…/templeton/v1/queue/job_201111111311_0012
    Response:
      {…, "percentComplete": "100% complete", "exitValue": 0, …, "completed": "done"}
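    Polling and killing use the same URL with different verbs; a sketch under the same placeholder assumptions:

      # Poll the status of a job:
      curl -s "http://<server>:<port>/templeton/v1/queue/job_201111111311_0012?user.name=alan"
      # Kill it with DELETE:
      curl -s -X DELETE "http://<server>:<port>/templeton/v1/queue/job_201111111311_0012?user.name=alan"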
  • 25. Future
    – Job management
      – The job management APIs don't belong in HCatalog
      – They are only there by historical accident
      – They need to move out to the MapReduce framework
    – Authentication needs more options than Kerberos
    – Integration with Oozie
    – Need a directory service
      – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission