Gateway
Cluster Virtualization Framework




Konstantin V Shvachko
Po Cheung
Priyo Mustafi
Hadoop Platform Team, eBay




Hadoop World Conference
November 9, 2011
Hadoop Cluster Components


• HDFS – a distributed file system
      – NameNode – namespace and block management
      – DataNodes – block replica container
      – BackupNode – checkpointer

• MapReduce – a framework for distributed computations
      – JobTracker – job scheduling, resource management, lifecycle coordination
      – TaskTracker – task execution module


[Figure: a NameNode and a JobTracker above three worker nodes, each running a TaskTracker and a DataNode]



2   eBay Inc. confidential
Cluster Access via Portal Nodes


• Users access Hadoop clusters via dedicated portal nodes located behind
  corporate firewalls
      – Login (ssh) to the portal node: authentication and authorization
      – Access clusters: run HDFS commands, submit jobs



[Figure: portal node(s) behind a firewall, in front of a cluster with a NameNode, a JobTracker, and three worker nodes, each running a TaskTracker and a DataNode]



Use Case #1: Development of New Applications


• Developers of new applications fall into a cycle of
  moving programs, input and output data
  between their dev boxes, the portal, and the Hadoop clusters.


    Develop an application;
    while( my manager is unsatisfied ) {
       build application on your desktop;
       scp myapp.jar or in.mydata to the portal node;
       Run application on the cluster (data in HDFS);
       Verify job results with the manager;
       Fix application bugs or develop more;
    }
    Offload output data from the cluster;



Use Case #2: Access to Public Datasets


• Scientific data:
      – Genomics datasets
      – Fundamental physics experiments (LHC in Nebraska)
      – Astronomical images

• Data is public, but not the servers used to store and process data
• Geographically separated datacenters
• Users should be able to access and analyze data via the internet
• Implies direct login to the clusters for everybody
      – Complex security issues




Problem: Portal Nodes as Shared Resources


• Developers hate transferring programs to portal nodes
• Input data should be first transferred to the portal, then to HDFS
• Developers tend to use portals as their dev nodes
      – Setup development environments
      – Connect to git repositories

• Portals are shared multi-tenant resources
      – Community property

• Portal nodes become yet another cluster component
      – Maintenance overhead for cluster administrators

• Public datasets: need access without direct login to cluster portals




Gateway Project: Main Objective


• Gateway is a cluster virtualization framework that
  provides unified and seamless access to Hadoop clusters
  from users’ workplace computers through corporate firewalls.
[Figure: Gateway Server(s) in front of a cluster with a NameNode, a JobTracker, and three worker nodes, each running a TaskTracker and a DataNode]




Gateway Project: Principal Benefits


1. Unified access to multiple Hadoop clusters through the corporate firewalls
      – Multiple clusters within the same datacenter
        “HDFS Scalability: The limits to growth” USENIX ;login: 2010
        Parallels with Federation (ViewFS) in both implementation and purpose
      – Clusters in different datacenters

2. Service availability:
   failover to active clusters when one has scheduled/unscheduled downtime
3. Flexible cluster upgrades:
   redirect traffic to other clusters when one is upgrading
4. Versioning:
   access to clusters running different versions of Hadoop
5. Load balancing:
   smart job submission based on cluster workloads


Network Requirements


• Gateway Servers are positioned on the boundary between the corporate
  and “public” networks
• Gateway Servers can
      – communicate with user desktops/laptops residing on the public network and with
        Hadoop clusters running in different data centers within the corporate network.

• Due to firewalls there is no direct connectivity from the public network to
  Hadoop clusters and vice versa other than via the Gateway Servers.
• Gateway plays the role of a proxy between users and Hadoop clusters
      – Users delegate execution of their jobs and HDFS commands to the Gateway
        servers.
      – The servers talk to the actual clusters and return the replies back to the users.




Functional Requirements


• The cluster virtualization framework needs to support
       – current Java and command-line user-facing Hadoop APIs
       – existing Hadoop applications and jobs, which should continue to run from user
         boxes the same way as they used to from portal nodes

• Transparent use of client side libraries:
       – Pig, Hive, Cascading, Hadoop shell commands

• Authorization and Authentication
       – As a replacement for existing portal nodes, Gateway should provide adequate
         user authentication and authorization

• Unified web UI combining the UIs of the serviced clusters




Gateway Architecture


Gateway Virtualization Framework has two main components:
• Job Submission system, represented by
       – Gateway MapReduce Server (GWMRServer) on the server side, and
       – regular Hadoop job submission and status tracking tools
         contacting GWMRServer via the standard Hadoop JobClient.

• Virtualization of File System Access is represented by
       – GatewayFileSystem on the client side and
       – Gateway File System Server (GWFSServer) on the server side.

[Figure: a JobClient connects to the GWMR Server; gwfs:// clients connect to the GWFS Server]



Job Submission


• Hadoop uses JobClient to submit jobs
       – Job is defined by its configuration file and the job jar
       – JobClient loads these two files along with other user-specified files required for
         the job to HDFS and submits the job to the JobTracker
       – the job is then scheduled for execution

• GWMRServer is the only component needed to virtualize job submission.
  No specialized gateway client is required
• Regular Hadoop JobClients are configured to send submissions to
  GWMRServer instead of a JobTracker
• Job Submission Virtualization allows submitting jobs to multiple MR clusters
  via GWMRServer
• GWMRServer selects one of the clusters and further submits the job to the
  respective JobTracker
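
The redirection described above can be sketched in a few lines. This is an illustration in Python rather than Hadoop's Java API: the names JobTrackerStub, GatewayMRServer, submit_job, and run_client are hypothetical stand-ins, and the default selection policy (always the first cluster) is an assumption made only to keep the sketch short.

```python
class JobTrackerStub:
    """Stand-in for a real JobTracker (hypothetical, for illustration only)."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.jobs = []

    def submit_job(self, job_conf, job_jar):
        # Record the job and return an identifier that names the cluster.
        self.jobs.append((job_conf, job_jar))
        return f"{self.cluster_name}-job-{len(self.jobs)}"


class GatewayMRServer:
    """Exposes the same submit_job() call as a JobTracker and forwards it."""

    def __init__(self, trackers, select=None):
        self.trackers = list(trackers)
        # Default policy (an assumption): always pick the first cluster.
        self.select = select or (lambda trackers, job_conf: trackers[0])

    def submit_job(self, job_conf, job_jar):
        tracker = self.select(self.trackers, job_conf)
        return tracker.submit_job(job_conf, job_jar)


def run_client(job_tracker):
    # The client code is identical whether it talks to a JobTracker or the gateway.
    return job_tracker.submit_job({"mapred.job.name": "wordcount"}, "myapp.jar")
```

Because the gateway object offers the same call as the tracker stub, run_client needs no change when it is pointed at the gateway, which is the sense in which no specialized gateway client is required.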


HDFS Access


• File System Access is virtualized via GatewayFileSystem and GWFSServer
• GatewayFileSystem is a new specialized client for accessing HDFS clusters
  via GWFSServer
       – The client is instantiated automatically based on configuration parameters set up
         to access the gateway server instead of HDFS
       – GatewayFileSystem passes the client request to GWFSServer
       – The gateway server instantiates a traditional HDFS client (DistributedFileSystem)
         pointing to the requested cluster
       – The server executes the request on the cluster and returns the result to the
         gateway client

• Unlike Job Submission, the virtualized File System Access is always cluster
  aware
       – A user accessing a file must explicitly specify which HDFS cluster the file
         belongs to
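
The slides do not say how a user names the target cluster. As one possible convention, the sketch below assumes the first path component of a gwfs:// URI is the cluster name. The CLUSTERS table, the host names, the ports, and the resolve function are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from cluster names to NameNode URIs; addresses are placeholders.
CLUSTERS = {
    "clusterA": "hdfs://nn-a.example.com:8020",
    "clusterB": "hdfs://nn-b.example.com:8020",
}


def resolve(gwfs_uri, clusters=CLUSTERS):
    """Translate a gwfs:// URI into (cluster name, hdfs:// URI).

    Assumes the first path component names the cluster; the slides do not
    say how the cluster is actually encoded.
    """
    parts = urlsplit(gwfs_uri)
    if parts.scheme != "gwfs":
        raise ValueError(f"not a gateway URI: {gwfs_uri}")
    cluster, _, rest = parts.path.lstrip("/").partition("/")
    if cluster not in clusters:
        raise KeyError(f"unknown cluster: {cluster!r}")
    return cluster, f"{clusters[cluster]}/{rest}"
```

Under this assumption, a request for gwfs://gw.example.com:9000/clusterA/user/alice/in.mydata would be executed on the server side against /user/alice/in.mydata on clusterA.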



GWMR: Implementation



• GWMRServer implements mapred.JobSubmissionProtocol (H-0.20) or
  mapreduce.ClientProtocol (> H-0.20)
• GWMRServer can be accessed via regular Hadoop command-line-interface
  and Java interface
• MR clients communicate (submit jobs and obtain job information) directly
  with GWMRServer as if they talk to a real JobTracker via hadoop.RPC
• GWMRServer redirects the job to one of the clusters, based on
       – Data location
       – Cluster workload
       – User group information




GWMR: Implementation Continued



• GWMRServer is stateless (or keeps a very lightweight state)
       – allows setting up pools of Gateway servers in order to avoid single point of failure

• On startup GWMRServer reads configuration from “gateway-site.xml”,
  which determines the Hadoop MR clusters it must serve
• GWMRServer has a web UI, similar to the JobTracker UI, which aggregates
  data from available JobTrackers
• GWMRServer supports job sequencing, so that chained MR jobs initiated
  by a single Pig or Hive job are scheduled on the same cluster
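
Job sequencing and statelessness can coexist if the target cluster is derived from the chain itself rather than remembered by a server. The sketch below shows one such approach, which is not confirmed by the slides: chain_id is a hypothetical identifier shared by all MR jobs of one Pig or Hive script, and a stable hash maps it to a cluster so that any server in the pool reaches the same answer.

```python
import hashlib


def cluster_for_chain(chain_id, cluster_names):
    """Map a chain identifier to one cluster, deterministically.

    chain_id is a hypothetical identifier shared by all MR jobs that one
    Pig or Hive script launches.
    """
    names = sorted(cluster_names)  # same order on every server in the pool
    digest = hashlib.sha256(chain_id.encode("utf-8")).digest()
    return names[int.from_bytes(digest[:8], "big") % len(names)]
```

A limitation of this sketch is that the mapping can change if a cluster is added or removed while a chain is still running, so a real implementation would need some lightweight state or a more careful scheme.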




GatewayFileSystem: Implementation


• GatewayFileSystem is a subclass of the FileSystem abstract class,
  similar to LocalFileSystem, HftpFileSystem, and S3FileSystem
       – gwfs://              - GatewayFileSystem
       – file://              - LocalFileSystem
       – hdfs://              - DistributedFileSystem
       – har://               - HarFileSystem
       – hftp://              - HftpFileSystem
       – s3://                - S3FileSystem
       – kfs://               - KosmosFileSystem

• GatewayFileSystem is instantiated based on the URI scheme listed in
  fs.default.name (fs.defaultFS) field of core-site.xml
       – fs.default.name = gwfs://<GWFSServer-address>
       – fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
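
The scheme-based lookup can be illustrated with a small model. This is a simplified Python sketch of the idea, not Hadoop's actual FileSystem code: the conf dictionary mirrors the two properties above plus the stock fs.hdfs.impl entry, gw.example.com:9000 is a placeholder for the GWFSServer address, and filesystem_class_for is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Values from the slide; gw.example.com:9000 stands in for <GWFSServer-address>.
conf = {
    "fs.default.name": "gwfs://gw.example.com:9000",
    "fs.gwfs.impl": "org.apache.hadoop.gateway.fs.GatewayFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}


def filesystem_class_for(path, conf):
    """Return the configured implementation class name for a path or URI."""
    scheme = urlsplit(path).scheme
    if not scheme:
        # No scheme on the path: fall back to the default file system.
        scheme = urlsplit(conf["fs.default.name"]).scheme
    return conf[f"fs.{scheme}.impl"]
```

With this configuration, a plain path such as /user/alice/in.mydata resolves to GatewayFileSystem, while an explicit hdfs:// URI still resolves to DistributedFileSystem.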




GWFSServer: Implementation


• GatewayFileSystem passes client requests to GWFSServer using
       – a new RPC protocol – GWFSProtocol, and
       – a new binary data transfer protocol – DataTProtocol

• The Gateway server processes GWFSProtocol requests
       – It instantiates a real DistributedFileSystem pointing to the required cluster,
       – executes the request and returns results back to the gateway client

• DataTProtocol transfers data between the Gateway clients and the server
       – The data transfer is a direct pipeline between a gateway client and HDFS
       – GWFSServer reads data from HDFS and pipelines it to gateway client via
         DataTProtocol, and vice versa for write

• GWFSServer is stateless. This will allow setting up pools of servers in order
  to avoid single point of failure and to provide load balancing
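
The pipelining idea behind the data transfer can be sketched as a chunked relay. The wire format of DataTProtocol is not described in the slides, so this only models the copy loop: relay is a hypothetical helper, the 64 KB default chunk size is an arbitrary choice, and in-memory streams stand in for the HDFS stream and the client connection.

```python
import io


def relay(source, sink, chunk_size=64 * 1024):
    """Copy bytes from source to sink in fixed-size chunks; return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


hdfs_stream = io.BytesIO(b"block data " * 1000)  # stands in for an HDFS input stream
client_stream = io.BytesIO()                     # stands in for the client connection
copied = relay(hdfs_stream, client_stream, chunk_size=4096)
```

Because the relay holds only one chunk at a time, the server never buffers a whole file, which fits the stateless design described above. For writes, the same loop would run with the two streams swapped.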




Versioning


• GWMRServer can serve JobClients of a specific version only. Incompatible
  versions of Hadoop require different implementations of GWMRServer
       – The service will run multiple versions of GWMRServer so that client requests
         can be redirected to a server serving the compatible version

• The same instance of GWMRServer can submit jobs to and query MapReduce
  clusters running different versions of Hadoop
       – GWMRServer discovers the Hadoop version of a particular cluster, and uses the
         respective Hadoop jars to instantiate an appropriate version of the JobClient

• GatewayFileSystem-to-GWFSServer communication is independent of
  HDFS
       – No need to implement a new GWFSServer for every new Hadoop release.

• The same instance of GWFSServer can access clusters running different
  versions of HDFS
       – GWFSServer discovers the HDFS cluster version and uses the respective jars to
         instantiate an appropriate version of the DistributedFileSystem
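
The version-to-jars step might look like the following lookup. The directory layout, the CLIENT_JARS table, and both helper functions are hypothetical; the slides state only that the server discovers the cluster version and uses the respective jars.

```python
# Hypothetical layout: one directory of client jars per supported Hadoop release line.
CLIENT_JARS = {
    "0.20": "/opt/gateway/lib/hadoop-0.20",
    "0.22": "/opt/gateway/lib/hadoop-0.22",
}


def release_line(version):
    """Reduce a full version string such as '0.20.2' to its release line '0.20'."""
    return ".".join(version.split(".")[:2])


def jars_for_cluster(reported_version, client_jars=CLIENT_JARS):
    """Return the jar directory to load for a cluster reporting this version."""
    line = release_line(reported_version)
    try:
        return client_jars[line]
    except KeyError:
        raise LookupError(f"no client jars installed for Hadoop {line}") from None
```

In a Java server, the returned directory would presumably feed a separate class loader per cluster, so that clients for different Hadoop versions can coexist in one process; that detail is an assumption and is not modeled here.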

Project Status


• Support for Hadoop 0.20.xxx
  Plan for 0.22
• Authorization & Authentication
• Job Chaining
• Packaging
       – It is convenient for users to have
         the entire Hadoop client suite
         installed, configured,
         and packaged as a VM

• Plan to open-source soon
• Developers wanted




Thank You!





 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 

Recently uploaded (20)

Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Artificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic WarfareArtificial Intelligence and Electronic Warfare
Artificial Intelligence and Electronic Warfare
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 

Hadoop World 2011: Hadoop Gateway - Konstantin Shvachko, eBay

  • 1. Gateway: Cluster Virtualization Framework
      Konstantin V Shvachko, Po Cheung, Priyo Mustafi
      Hadoop Platform Team, eBay
      Hadoop World Conference, November 9, 2011
  • 2. Hadoop Cluster Components
      • HDFS – a distributed file system
          – NameNode – namespace and block management
          – DataNodes – block replica container
          – BackupNode – checkpointer
      • MapReduce – a framework for distributed computations
          – JobTracker – job scheduling, resource management, lifecycle coordination
          – TaskTracker – task execution module
      [Diagram: NameNode and JobTracker above three TaskTracker/DataNode pairs]
  • 3. Cluster Access via Portal Nodes
      • Users access Hadoop clusters via dedicated portal nodes located behind corporate firewalls
          – Login (ssh) to the portal node: authentication and authorization
          – Access clusters: run HDFS commands, submit jobs
      [Diagram: Firewall, Portal Node(s), then NameNode and JobTracker above three TaskTracker/DataNode pairs]
  • 4. Use Case #1: Development of New Applications
      • Developers of new applications fall into a cycle of moving programs, input and output data between their dev boxes, the portal, and the Hadoop clusters.

        Develop an application;
        while( my manager is unsatisfied ) {
           build application on your desktop;
           scp myapp.jar or in.mydata to the portal node;
           Run application on the cluster (data in HDFS);
           Verify job results with the manager;
           Fix application bugs or develop more;
        }
        Offload output data from the cluster;
  • 5. Use Case #2: Access to Public Datasets
      • Scientific data:
          – Genomics datasets
          – Fundamental physics experiments (LHC in Nebraska)
          – Astronomical images
      • Data is public, but not the servers used to store and process data
      • Geographically separated datacenters
      • Users should be able to access and analyze data via internet
      • Implies direct login to the clusters for everybody
          – Complex security issues
  • 6. Problem: Portal Nodes as Shared Resources
      • Developers hate transferring programs to portal nodes
      • Input data must first be transferred to the portal, then to HDFS
      • Developers tend to use portals as their dev nodes
          – Set up development environments
          – Connect to git repositories
      • Portals are shared multi-tenant resources
          – Community property
      • Portal nodes become yet another cluster component
          – Maintenance overhead for cluster administrators
      • Public datasets: need access without direct login to cluster portals
  • 7. Gateway Project: Main Objective
      • Gateway is a cluster virtualization framework that provides unified and seamless access to Hadoop clusters from users’ workplace computers through corporate firewalls.
      [Diagram: Gateway Server(s) in front of NameNode and JobTracker above three TaskTracker/DataNode pairs]
  • 8. Gateway Project: Principal Benefits
      1. Unified access to multiple Hadoop clusters through the corporate firewalls
          – Multiple clusters within the same datacenter
            “HDFS Scalability: The limits to growth” USENIX ;login: 2010
            Connotations with Federation in implementation (ViewFS) and purpose
          – Clusters in different datacenters
      2. Service availability: failover to active clusters when one has scheduled/unscheduled downtime
      3. Flexible cluster upgrades: redirect traffic to other clusters when one is upgrading
      4. Versioning: access to clusters running different versions of Hadoop
      5. Load balancing: smart job submission based on cluster workloads
  • 9. Network Requirements
      • Gateway Servers are positioned on the boundary between the corporate and “public” networks
      • Gateway Servers can
          – communicate with the user desktops/laptops residing on the public network and with Hadoop clusters running in different data centers within the corporate network.
      • Due to firewalls, there is no direct connectivity from the public network to Hadoop clusters, or vice versa, other than via the Gateway Servers.
      • Gateway plays the role of a proxy between users and Hadoop clusters
          – Users delegate execution of their jobs and HDFS commands to the Gateway servers.
          – The servers talk to the actual clusters and return the replies to the users.
  • 10. Functional Requirements
      • The cluster virtualization framework needs to support
          – current Java and command-line user-facing Hadoop APIs
          – existing Hadoop applications and jobs, which should continue to run from user boxes the same way as they used to from portal nodes
      • Transparent use of client-side libraries:
          – Pig, Hive, Cascading, Hadoop shell commands
      • Authorization and Authentication
          – As a replacement for existing portal nodes, Gateway should provide adequate user authentication and authorization
      • Unified web UI combining the UIs of the serviced clusters
  • 11. Gateway Architecture
      Gateway Virtualization Framework has two main components:
      • Job Submission system, represented by
          – Gateway MapReduce Server (GWMRServer) on the server side, and
          – regular Hadoop job submission and status tracking tools contacting GWMRServer via the standard Hadoop JobClient.
      • Virtualization of File System Access is represented by
          – GatewayFileSystem on the client side and
          – Gateway File System Server (GWFSServer) on the server side.
      [Diagram labels: JobClient, GWMR Server, GWFS Server, gwfs://]
  • 12. Job Submission
      • Hadoop uses JobClient to submit jobs
          – Job is defined by its configuration file and the job jar
          – JobClient loads these two files along with other user-specified files required for the job to HDFS and submits the job to the JobTracker
          – the job is then scheduled for execution
      • GWMRServer is the only component needed to virtualize job submission. No specialized gateway client is required
      • Regular Hadoop JobClients are configured to send submissions to GWMRServer instead of a JobTracker
      • Job Submission Virtualization allows submitting jobs to multiple MR clusters via GWMRServer
      • GWMRServer selects one of the clusters and further submits the job to the respective JobTracker
  • 13. HDFS Access
      • File System Access is virtualized via GatewayFileSystem and GWFSServer
      • GatewayFileSystem is a new specialized client for accessing HDFS clusters via GWFSServer
          – The client is instantiated automatically based on configuration parameters set up to access the gateway server instead of HDFS
          – GatewayFileSystem passes the client request to GWFSServer
          – The gateway server instantiates a traditional HDFS client (DistributedFileSystem) pointing to the requested cluster
          – It executes the request on the cluster and returns the result to the gateway client
      • Unlike Job Submission, the virtualized File System Access is always cluster aware
          – A user accessing a file must explicitly specify which HDFS cluster the file belongs to
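The cluster-aware forwarding described on this slide can be sketched as follows. This is only an illustration, not the Gateway code: the slides do not say how a user names the target cluster, so encoding it as the first path component is an assumption, and `split_cluster_path`, `GatewayFsServerSketch`, and the per-cluster listing functions are hypothetical stand-ins for GWFSServer and its DistributedFileSystem clients.

```python
from typing import Callable, Dict, Tuple


def split_cluster_path(gw_path: str) -> Tuple[str, str]:
    """Split '/<cluster>/<path...>' into (cluster, '/<path...>').

    Encoding the cluster as the first path component is an assumption
    made for this sketch; the slides do not say how it is specified.
    """
    parts = gw_path.strip("/").split("/", 1)
    if not parts[0]:
        raise ValueError("path must name a cluster")
    rest = "/" + parts[1] if len(parts) > 1 else "/"
    return parts[0], rest


class GatewayFsServerSketch:
    """Hypothetical stand-in for GWFSServer: forwards each request to one cluster."""

    def __init__(self, clients: Dict[str, Callable[[str], list]]):
        # Maps a cluster name to a function that lists a directory there,
        # standing in for a per-cluster DistributedFileSystem client.
        self.clients = clients

    def list_status(self, gw_path: str) -> list:
        cluster, path = split_cluster_path(gw_path)
        if cluster not in self.clients:
            raise KeyError(f"unknown cluster '{cluster}'")
        # Execute the request on the named cluster and hand the result back.
        return self.clients[cluster](path)
```

A caller would construct the server with one listing function per cluster and pass paths such as `/cluster1/user/me`; a path that names no known cluster is rejected rather than routed by guesswork, which matches the "always cluster aware" rule.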
  • 14. GWMR: Implementation
      • GWMRServer is a subclass of
          – mapred.JobSubmissionProtocol (H-0.20)
          – mapreduce.ClientProtocol (> H-0.20)
      • GWMRServer can be accessed via regular Hadoop command-line-interface and Java interface
      • MR clients communicate (submit jobs and obtain job information) directly with GWMRServer as if they talk to a real JobTracker via hadoop.RPC
      • GWMRServer redirects the job to one of the clusters, based on
          – Data location
          – Cluster workload
          – User group information
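One way the three routing inputs above (data location, cluster workload, user group information) could be combined is sketched below. The slides do not describe the actual GWMRServer policy, so the ordering of the criteria, the `Cluster` record, and `select_cluster` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class Cluster:
    """Hypothetical view of one MapReduce cluster known to the gateway."""
    name: str
    datasets: set = field(default_factory=set)        # input paths stored on this cluster
    allowed_groups: set = field(default_factory=set)  # user groups permitted to submit
    running_jobs: int = 0                             # crude workload measure
    available: bool = True                            # False during downtime or upgrade


def select_cluster(clusters: Iterable[Cluster], input_path: str,
                   user_groups: set) -> Optional[Cluster]:
    """Pick a target cluster for a job, or None if no cluster qualifies."""
    # Keep only clusters that are up and that the user may submit to.
    candidates = [c for c in clusters
                  if c.available and c.allowed_groups & user_groups]
    if not candidates:
        return None
    # Prefer clusters that already hold the input data.
    local = [c for c in candidates if input_path in c.datasets]
    pool = local or candidates
    # Among those, take the least loaded one.
    return min(pool, key=lambda c: c.running_jobs)
```

Marking a cluster as unavailable removes it from consideration, which is also how the failover and upgrade scenarios from the benefits slide would fall out of the same selection step under these assumptions.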
  • 15. GWMR: Implementation Continued
      • GWMRServer is stateless (or keeps a very lightweight state)
          – allows setting up pools of Gateway servers in order to avoid a single point of failure
      • On startup GWMRServer reads its configuration from “gateway-site.xml”, which determines the Hadoop MR clusters it must serve
      • GWMRServer has a web UI, similar to the JobTracker UI, which aggregates data from available JobTrackers
      • GWMRServer supports job sequencing, so that chained MR jobs initiated by a single Pig or Hive job are scheduled to the same cluster
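Job sequencing as described here needs only a small amount of state: a record of which cluster the first job of a sequence went to. The sketch below assumes each chained job carries some sequence identifier; how the real GWMRServer recognizes jobs of the same Pig or Hive run is not stated in the slides, and `JobSequencer` is a hypothetical name.

```python
from typing import Callable, Dict


class JobSequencer:
    """Keep all jobs of one sequence (e.g. one Pig script) on one cluster."""

    def __init__(self, choose_cluster: Callable[[], str]):
        self._choose_cluster = choose_cluster   # placement policy for new sequences
        self._assigned: Dict[str, str] = {}     # sequence id -> cluster name

    def cluster_for(self, sequence_id: str) -> str:
        # The first job of a sequence picks a cluster; later jobs reuse it.
        if sequence_id not in self._assigned:
            self._assigned[sequence_id] = self._choose_cluster()
        return self._assigned[sequence_id]

    def finish(self, sequence_id: str) -> None:
        # Drop the entry once the sequence completes to keep the state small.
        self._assigned.pop(sequence_id, None)
```

In a pool of gateway servers this mapping would have to be shared or derivable by every server, a detail the sketch leaves out.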
  • 16. GatewayFileSystem: Implementation
      • GatewayFileSystem is a subclass of the FileSystem abstract class, similar to LocalFileSystem, HFTPFileSystem, S3FileSystem
          – gwfs:// - GatewayFileSystem
          – file:// - LocalFileSystem
          – hdfs:// - DistributedFileSystem
          – har:// - HarFileSystem
          – hftp:// - HFTPFileSystem
          – s3:// - S3FileSystem
          – kfs:// - KFSFileSystem
      • GatewayFileSystem is instantiated based on the URI scheme listed in the fs.default.name (fs.defaultFS) field of core-site.xml
          – fs.default.name = gwfs://<GWFSServer-address>
          – fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
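The scheme-based lookup above can be illustrated with a small model of how a path is mapped to an implementation class through the `fs.<scheme>.impl` setting. This is a Python imitation of the lookup, not Hadoop code; the gateway host and port are made-up placeholders, and `resolve_filesystem_class` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Client-side settings as they would appear in core-site.xml.
# The gateway host and port are made-up placeholders.
conf = {
    "fs.default.name": "gwfs://gateway.example.com:9000",
    "fs.gwfs.impl": "org.apache.hadoop.gateway.fs.GatewayFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}


def resolve_filesystem_class(conf: dict, path: str) -> str:
    """Return the implementation class name that would serve `path`."""
    scheme = urlparse(path).scheme
    if not scheme:
        # Paths without a scheme fall back to the default file system.
        scheme = urlparse(conf["fs.default.name"]).scheme
    key = f"fs.{scheme}.impl"
    if key not in conf:
        raise KeyError(f"no file system configured for scheme '{scheme}'")
    return conf[key]
```

With this configuration a plain path such as `/user/me/in.mydata` resolves to GatewayFileSystem, which is why existing applications need no code changes: only the default file system setting differs from a portal-node setup.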
  • 17. GWFSServer: Implementation
      • GatewayFileSystem passes client requests to GWFSServer using
          – a new RPC protocol – GWFSProtocol, and
          – a new binary data transfer protocol – DataTProtocol
      • The Gateway server processes GWFSProtocol requests
          – It instantiates a real DistributedFileSystem pointing to the required cluster,
          – executes the request and returns results back to the gateway client
      • DataTProtocol transfers data between the Gateway clients and the server
          – The data transfer is a direct pipeline between a gateway client and HDFS
          – GWFSServer reads data from HDFS and pipelines it to the gateway client via DataTProtocol, and vice versa for write
      • GWFSServer is stateless. This will allow setting up pools of servers in order to avoid a single point of failure and to provide load balancing
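The pipelining idea behind the data transfer can be shown with a chunked relay between two streams: the server never holds a whole file, only the chunk currently in flight. The wire format of DataTProtocol is not described in the slides, so this sketch covers only the streaming pattern; `relay`, the buffer size, and the in-memory streams are assumptions.

```python
import io

CHUNK_SIZE = 64 * 1024  # arbitrary buffer size chosen for the sketch


def relay(source, sink, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy bytes from source to sink chunk by chunk; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# In-memory streams stand in for the HDFS read side and the client connection.
hdfs_stream = io.BytesIO(b"block data " * 10_000)
client_stream = io.BytesIO()
copied = relay(hdfs_stream, client_stream)
```

For a write, the same function would be called with the roles of the two streams swapped.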
  • 18. Versioning
      • GWMRServer can serve JobClients of a specific version only. Incompatible versions of Hadoop will require different implementations of GWMRServer
          – The service will run multiple versions of GWMRServer so that client requests can be redirected to a server serving the compatible version
      • The same instance of GWMRServer can submit jobs to and query map-reduce clusters running different versions of Hadoop
          – GWMRServer discovers the Hadoop version of a particular cluster, and uses the respective Hadoop jars to instantiate an appropriate version of the JobClient
      • GatewayFileSystem-to-GWFSServer communication is independent of HDFS
          – No need to implement a new GWFSServer for every new Hadoop release.
      • The same instance of GWFSServer can access clusters running different versions of HDFS
          – GWFSServer discovers the HDFS cluster version and uses the respective jars to instantiate an appropriate version of the DistributedFileSystem
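The "discover the version, then instantiate the matching client" step can be sketched as a registry keyed by version. In the real servers this would involve loading the respective Hadoop jars, which the sketch replaces with plain constructor functions; `VersionedClientFactory` and the prefix-matching rule are assumptions for illustration.

```python
from typing import Callable, Dict


class VersionedClientFactory:
    """Pick a client implementation matching a cluster's Hadoop version."""

    def __init__(self):
        # Maps a version prefix such as '0.20' to a client constructor.
        self._builders: Dict[str, Callable[[str], object]] = {}

    def register(self, version_prefix: str, builder: Callable[[str], object]) -> None:
        self._builders[version_prefix] = builder

    def client_for(self, cluster_address: str, cluster_version: str):
        # Longest registered prefix wins, so '0.20.205' can override '0.20'.
        matches = [p for p in self._builders
                   if cluster_version == p or cluster_version.startswith(p + ".")]
        if not matches:
            raise LookupError(f"no client for Hadoop {cluster_version}")
        return self._builders[max(matches, key=len)](cluster_address)
```

A server would register one builder per supported release line at startup and call `client_for` with the version reported by each cluster; an unsupported version fails loudly instead of falling back to an incompatible client.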
  • 19. Project Status
      • Support for Hadoop 0.20.xxx
          – Plan for 0.22
      • Authorization & Authentication
      • Job Chaining
      • Packaging
          – It is convenient for users to have the entire Hadoop client suite installed, configured, and packaged as a VM
      • Plan to open-source soon
      • Developers wanted
  • 20. Thank You!