Hadoop World 2011: Hadoop Gateway - Konstantin Shvachko, eBay



Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) has several drawbacks: as shared multitenant resources, portal nodes create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users’ workplace computers through corporate firewalls; the ability to fail over to active clusters during scheduled or unscheduled downtime, and to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.


  1. Gateway: Cluster Virtualization Framework
     Konstantin V. Shvachko, Po Cheung, Priyo Mustafi
     Hadoop Platform Team, eBay
     Hadoop World Conference, November 9, 2011
  2. Hadoop Cluster Components
     • HDFS – a distributed file system
       – NameNode – namespace and block management
       – DataNodes – block replica container
       – BackupNode – checkpointer
     • MapReduce – a framework for distributed computations
       – JobTracker – job scheduling, resource management, lifecycle coordination
       – TaskTracker – task execution module
     [Diagram: NameNode, JobTracker, and three TaskTracker/DataNode nodes]
     eBay Inc. confidential
  3. Cluster Access via Portal Nodes
     • Users access Hadoop clusters via dedicated portal nodes located behind corporate firewalls
       – Login (ssh) to the portal node: authentication and authorization
       – Access clusters: run HDFS commands, submit jobs
     [Diagram: Firewall, Portal Node(s), NameNode, JobTracker, and three TaskTracker/DataNode nodes]
  4. Use Case #1: Development of New Applications
     • Developers of new applications fall into a cycle of moving programs, input and output data between their dev boxes, the portal, and the Hadoop clusters.

         Develop an application;
         while( my manager is unsatisfied ) {
           build application on your desktop;
           scp myapp.jar or in.mydata to the portal node;
           Run application on the cluster (data in HDFS);
           Verify job results with the manager;
           Fix application bugs or develop more;
         }
         Offload output data from the cluster;
  5. Use Case #2: Access to Public Datasets
     • Scientific data:
       – Genomics datasets
       – Fundamental physics experiments (LHC in Nebraska)
       – Astronomical images
     • Data is public, but not the servers used to store and process data
     • Geographically separated datacenters
     • Users should be able to access and analyze data via internet
     • Implies direct login to the clusters for everybody
       – Complex security issues
  6. Problem: Portal Nodes as Shared Resources
     • Developers hate transferring programs to portal nodes
     • Input data must first be transferred to the portal, then to HDFS
     • Developers tend to use portals as their dev nodes
       – Set up development environments
       – Connect to git repositories
     • Portals are shared multi-tenant resources
       – Community property
     • Portal nodes become yet another cluster component
       – Maintenance overhead for cluster administrators
     • Public datasets: need access without direct login to cluster portals
  7. Gateway Project: Main Objective
     • Gateway is a cluster virtualization framework that provides unified and seamless access to Hadoop clusters from users’ workplace computers through corporate firewalls.
     [Diagram: Gateway Server(s), NameNode, JobTracker, and three TaskTracker/DataNode nodes]
  8. Gateway Project: Principal Benefits
     1. Unified access to multiple Hadoop clusters through the corporate firewalls
        – Multiple clusters within the same datacenter
          “HDFS Scalability: The limits to growth”, USENIX ;login:, 2010
          Connotations with Federation in implementation (ViewFS) and purpose
        – Clusters in different datacenters
     2. Service availability: failover to active clusters when one has scheduled/unscheduled downtime
     3. Flexible cluster upgrades: redirect traffic to other clusters when one is upgrading
     4. Versioning: access to clusters running different versions of Hadoop
     5. Load balancing: smart job submission based on cluster workloads
  9. Network Requirements
     • Gateway Servers are positioned on the boundary between the corporate and “public” networks
     • Gateway Servers can communicate with user desktops/laptops residing on the public network and with Hadoop clusters running in different data centers within the corporate network
     • Due to firewalls, there is no direct connectivity from the public network to Hadoop clusters and vice versa other than via the Gateway Servers
     • Gateway plays the role of a proxy between users and Hadoop clusters
       – Users delegate execution of their jobs and HDFS commands to the Gateway servers
       – The servers talk to the actual clusters and return the replies to the users
  10. Functional Requirements
     • The cluster virtualization framework needs to support
       – current Java and command-line user-facing Hadoop APIs
       – existing Hadoop applications and jobs, which should continue to run from user boxes the same way they used to run from portal nodes
     • Transparent use of client-side libraries:
       – Pig, Hive, Cascading, Hadoop shell commands
     • Authorization and Authentication
       – As a replacement for existing portal nodes, Gateway should provide adequate user authentication and authorization
     • Unified web UI combining the UIs of the serviced clusters
  11. Gateway Architecture
     Gateway Virtualization Framework has two main components:
     • Job Submission system, represented by
       – Gateway MapReduce Server (GWMRServer) on the server side, and
       – regular Hadoop job submission and status tracking tools contacting GWMRServer via the standard Hadoop JobClient.
     • Virtualization of File System Access is represented by
       – GatewayFileSystem on the client side and
       – Gateway File System Server (GWFSServer) on the server side.
     [Diagram: JobClient, GWMR Server, gwfs://, GWFS Server]
  12. Job Submission
     • Hadoop uses JobClient to submit jobs
       – Job is defined by its configuration file and the job jar
       – JobClient loads these two files along with other user-specified files required for the job to HDFS and submits the job to the JobTracker
       – the job is then scheduled for execution
     • GWMRServer is the only component needed to virtualize job submission. No specialized gateway client is required
     • Regular Hadoop JobClients are configured to send submissions to GWMRServer instead of a JobTracker
     • Job Submission Virtualization allows submitting jobs to multiple MR clusters via GWMRServer
     • GWMRServer selects one of the clusters and further submits the job to the respective JobTracker
  13. HDFS Access
     • File System Access is virtualized via GatewayFileSystem and GWFSServer
     • GatewayFileSystem is a new specialized client for accessing HDFS clusters via GWFSServer
       – The client is instantiated automatically based on configuration parameters set up to access the gateway server instead of HDFS
       – GatewayFileSystem passes the client request to GWFSServer
       – The gateway server instantiates a traditional HDFS client (DistributedFileSystem) pointing to the requested cluster
       – The server executes the request on the cluster and returns the result to the gateway client
     • Unlike Job Submission, the virtualized File System Access is always cluster aware
       – A user accessing a file must explicitly specify which HDFS cluster the file belongs to
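The slides do not say how a user names the target cluster in a gateway path. As a minimal sketch, assume the first path component of a `gwfs://` URI selects the cluster; the cluster names, NameNode addresses, and the `resolve` helper below are all hypothetical and only illustrate the cluster-aware lookup a gateway server would need before pointing an HDFS client at the requested cluster.

```python
from urllib.parse import urlparse

# Hypothetical registry: cluster name -> NameNode URI (invented addresses).
CLUSTERS = {
    "alpha": "hdfs://nn-alpha.example.com:8020",
    "beta": "hdfs://nn-beta.example.com:8020",
}

def resolve(gwfs_uri):
    """Map a gateway URI to the URI on the backing HDFS cluster.

    Assumption (not from the slides): the first path component
    names the cluster, e.g. gwfs://<gateway>/alpha/user/x.
    """
    parsed = urlparse(gwfs_uri)
    if parsed.scheme != "gwfs":
        raise ValueError("not a gateway URI: " + gwfs_uri)
    parts = parsed.path.lstrip("/").split("/", 1)
    cluster = parts[0]
    if cluster not in CLUSTERS:
        # Access is always cluster aware, so a missing or unknown
        # cluster name is an error rather than a default.
        raise KeyError("unknown or unspecified cluster: " + repr(cluster))
    rest = parts[1] if len(parts) > 1 else ""
    return CLUSTERS[cluster] + "/" + rest
```

Rejecting an unknown cluster instead of falling back to a default mirrors the requirement that the user states explicitly which cluster a file belongs to.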
  14. GWMR: Implementation
     • GWMRServer is a subclass of
       – mapred.JobSubmissionProtocol (H-0.20)
       – mapreduce.ClientProtocol (> H-0.20)
     • GWMRServer can be accessed via regular Hadoop command-line-interface and Java interface
     • MR clients communicate (submit jobs and obtain job information) directly with GWMRServer as if they talk to a real JobTracker via hadoop.RPC
     • GWMRServer redirects the job to one of the clusters, based on
       – Data location
       – Cluster workload
       – User group information
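The three redirection criteria above can be combined in many ways, and the slides do not give the actual policy. The sketch below assumes one possible ordering: filter by user group, prefer clusters that hold the input data, then take the least loaded. The `Cluster` record, its fields, and `select_cluster` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    """Hypothetical view of a cluster as seen by the gateway."""
    name: str
    load: float                                  # fraction of busy slots, 0.0 to 1.0
    datasets: set = field(default_factory=set)   # path prefixes stored on the cluster
    groups: set = field(default_factory=set)     # user groups allowed to submit

def select_cluster(clusters, input_path, user_group):
    """Pick a cluster by user group, then data location, then workload."""
    # 1. User group information: keep only clusters the group may use.
    allowed = [c for c in clusters if user_group in c.groups]
    if not allowed:
        raise LookupError("no cluster accepts group " + repr(user_group))
    # 2. Data location: prefer clusters that already hold the input.
    local = [c for c in allowed
             if any(input_path.startswith(prefix) for prefix in c.datasets)]
    candidates = local or allowed
    # 3. Cluster workload: among the remaining candidates, take the least loaded.
    return min(candidates, key=lambda c: c.load)
```

In this assumed ordering, data location outranks workload, so a busy cluster that holds the input still wins over an idle one that does not; a real policy might weigh the two instead.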
  15. GWMR: Implementation Continued
     • GWMRServer is stateless (or keeps a very lightweight state)
       – allows setting up pools of Gateway servers in order to avoid a single point of failure
     • On startup GWMRServer reads its configuration from “gateway-site.xml”, which determines the Hadoop MR clusters it must serve
     • GWMRServer has a web UI, similar to the JobTracker UI, which aggregates data from available JobTrackers
     • GWMRServer supports job sequencing, so that chained MR jobs initiated by a single Pig or Hive job are scheduled to the same cluster
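Job sequencing is one place where a "very lightweight state" seems unavoidable: the server has to remember which cluster a chain of jobs was first sent to. The slides do not describe how chained jobs are recognized, so the sketch below assumes each submission may carry a hypothetical sequence identifier; `SequenceRouter` and `pick_cluster` are illustrative names, not part of Gateway.

```python
class SequenceRouter:
    """Pin every job of one sequence to the cluster chosen for its first job."""

    def __init__(self, pick_cluster):
        # pick_cluster: zero-argument callable returning a cluster name;
        # it stands in for whatever selection policy the server uses.
        self._pick = pick_cluster
        self._pinned = {}  # sequence id -> cluster name (the lightweight state)

    def route(self, sequence_id=None):
        # Jobs outside any sequence are routed independently.
        if sequence_id is None:
            return self._pick()
        # The first job of a sequence picks the cluster; later jobs reuse it.
        if sequence_id not in self._pinned:
            self._pinned[sequence_id] = self._pick()
        return self._pinned[sequence_id]
```

With a pool of gateway servers, this mapping would have to be shared or derivable by every server in the pool, which is presumably why the state is kept so small.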
  16. GatewayFileSystem: Implementation
     • GatewayFileSystem is a subclass of the FileSystem abstract class
       Similar to LocalFileSystem, HFTPFileSystem, S3FileSystem
       – gwfs:// - GatewayFileSystem
       – file:// - LocalFileSystem
       – hdfs:// - DistributedFileSystem
       – har:// - HarFileSystem
       – hftp:// - HFTPFileSystem
       – s3:// - S3FileSystem
       – kfs:// - KFSFileSystem
     • GatewayFileSystem is instantiated based on the URI scheme listed in the fs.default.name (fs.defaultFS) field of core-site.xml
       – fs.default.name = gwfs://<GWFSServer-address>
       – fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
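The scheme-based instantiation can be illustrated with a short sketch of the lookup: take the scheme of the default file system URI and read the implementation class from the matching `fs.<scheme>.impl` property. This is a simplified Python model of the lookup, not Hadoop's Java code; the gateway host and port are placeholders, and `resolve_filesystem_class` is a hypothetical helper.

```python
from urllib.parse import urlparse

def resolve_filesystem_class(conf):
    """Return the class name configured for the default file system's scheme."""
    # fs.defaultFS is the newer name of fs.default.name; accept either.
    default_uri = conf.get("fs.defaultFS") or conf.get("fs.default.name", "file:///")
    scheme = urlparse(default_uri).scheme
    impl_key = "fs.%s.impl" % scheme
    if impl_key not in conf:
        raise KeyError("no implementation configured for scheme " + repr(scheme))
    return conf[impl_key]

# Client-side settings from the slide; the address is a placeholder.
conf = {
    "fs.default.name": "gwfs://gateway.example.com:9000",
    "fs.gwfs.impl": "org.apache.hadoop.gateway.fs.GatewayFileSystem",
}
print(resolve_filesystem_class(conf))
```

Because the choice is driven entirely by configuration, existing applications that use the generic FileSystem API pick up the gateway client without code changes.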
  17. GWFSServer: Implementation
     • GatewayFileSystem passes client requests to GWFSServer using
       – a new RPC protocol – GWFSProtocol, and
       – a new binary data transfer protocol – DataTProtocol
     • The Gateway server processes GWFSProtocol requests
       – It instantiates a real DistributedFileSystem pointing to the required cluster,
       – executes the request and returns results back to the gateway client
     • DataTProtocol transfers data between the Gateway clients and the server
       – The data transfer is a direct pipeline between a gateway client and HDFS
       – GWFSServer reads data from HDFS and pipelines it to the gateway client via DataTProtocol, and vice versa for write
     • GWFSServer is stateless. This will allow setting up pools of servers in order to avoid a single point of failure and to provide load balancing
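The wire format of DataTProtocol is not described in the slides, so the following only sketches the pipelining idea: the server relays data in fixed-size chunks from one stream to the other without buffering the whole file. The `pipeline` function and the chunk size are assumptions; in-memory streams stand in for the HDFS input stream and the client connection.

```python
import io

def pipeline(source, sink, chunk_size=64 * 1024):
    """Relay bytes from source to sink in chunks; return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty read signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total

# Stand-ins: `hdfs_in` plays the HDFS stream, `client_out` the client socket.
hdfs_in = io.BytesIO(b"block data " * 10000)
client_out = io.BytesIO()
copied = pipeline(hdfs_in, client_out)
```

The same loop serves the write path with the roles of the two streams swapped, and since it holds at most one chunk at a time, the server needs no per-file state.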
  18. Versioning
     • GWMRServer can serve JobClients of a specific version only. Incompatible versions of Hadoop require different implementations of GWMRServer
       – The service will run multiple versions of GWMRServer so that client requests can be redirected to a server serving the compatible version
     • The same instance of GWMRServer can submit jobs to and query map-reduce clusters running different versions of Hadoop
       – GWMRServer discovers the Hadoop version of a particular cluster, and uses the respective Hadoop jars to instantiate an appropriate version of the JobClient
     • GatewayFileSystem-to-GWFSServer communication is independent of HDFS
       – No need to implement a new GWFSServer for every new Hadoop release
     • The same instance of GWFSServer can access clusters running different versions of HDFS
       – GWFSServer discovers the HDFS cluster version and uses the respective jars to instantiate an appropriate version of the DistributedFileSystem
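The "discover the version, then use the respective jars" step amounts to a lookup from a cluster's reported version to a set of client libraries. The sketch below assumes jar sets are registered per release line and matched by version prefix; the directory paths and the `client_jars_for` helper are hypothetical, and the matching rule is an illustration rather than the Gateway's actual one.

```python
def client_jars_for(cluster_version, jar_sets):
    """Return the jar set registered for the release line of cluster_version."""
    # A release line such as "0.20" matches "0.20" itself and "0.20.x",
    # but not "0.200.x", because the next character must be a dot.
    matches = [line for line in jar_sets
               if cluster_version == line or cluster_version.startswith(line + ".")]
    if not matches:
        raise LookupError("no client jars for Hadoop " + cluster_version)
    # Prefer the most specific (longest) matching release line.
    return jar_sets[max(matches, key=len)]

# Hypothetical layout of client libraries on the gateway server.
JAR_SETS = {
    "0.20": "lib/hadoop-0.20/*",
    "0.22": "lib/hadoop-0.22/*",
}
```

Loading each jar set in isolation (in Java, typically through a separate class loader) is what would let a single server instance talk to clusters of different versions at once.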
  19. Project Status
     • Support for Hadoop 0.20.xxx
       – Plan for 0.22
     • Authorization & Authentication
     • Job Chaining
     • Packaging
       – It is convenient for users to have the entire Hadoop client suite installed, configured, and packaged as a VM
     • Plan to open-source soon
     • Developers wanted
  20. Thank You!