SlideShare a Scribd company logo
1 of 20
Making Apache HadoopSecure Devaraj Dasddas@yahoo-inc.comYahoo’s Hadoop Team Apache Hadoop India Summit 2011
Who am I Principal Engineer at Yahoo! Sunnyvale Working on Hadoop and related projects Apache Hadoop Committer/PMC member Before Yahoo!, Sunnyvale – Yahoo! Bangalore Before Yahoo! – HP, Bangalore
What is Hadoop? HDFS – Distributed File System Combines cluster’s local storage into a single namespace. All data is replicated to multiple machines. Provides locality information to clients MapReduce Batch computation framework Jobs divided into tasks. Tasks re-executed on failure Optimizes for data locality of input
Problem Different yahoos need different data. ,[object Object]
Need assurance that only the right people can see data.
Need to log who looked at the data.Yahoo! has more yahoos than clusters. ,[object Object]
Security improves ability to share clusters between groups4
Why is Security Hard? Hadoop is Distributed runs on a cluster of computers. Can’t determine the user on client computer. OS doesn’t tell you, must be done by application Client needs to authenticate to each computer Client needs to protect against fake servers
Need Delegation Not just client-server, the servers access other services on behalf of others. MapReduce need to have user’s permissions Even if the user logs out MapReduce jobs need to: Get and keep the necessary credentials Renew them while the job is running Destroy them when the job finishes
Solution Prevent unauthorized HDFS access ,[object Object]
Including tasks running as part of MapReduce jobs
And jobs submitted through Oozie.Users must also authenticate servers ,[object Object],Integrate Hadoop with Kerberos ,[object Object],7
Requirements Security must be optional. Not all clusters are shared between users. Hadoop must not prompt for passwords Makes it easy to make trojan horse versions. Must have single sign on. Must handle the launch of a MapReduce job on 4,000 Nodes Performance / Reliability must not be compromised
Security Definitions Authentication – Determining the user Hadoop 0.20 completely trusted the user Sent user and groups over wire We need it on both RPC and Web UI. Authorization – What can that user do? HDFS had owners and permissions since 0.16. Auditing – Who did that?
Authentication Changes low-level transport RPC authentication using SASL Kerberos (GSSAPI) Token Simple Browser HTTP secured via plugin Configurable translation from Kerberos principals to user names
Authorization HDFS Command line and semantics unchanged MapReduce added Access Control Lists Lists of users and groups that have access. mapreduce.job.acl-view-job – view job mapreduce.job.acl-modify-job – kill or modify job Code for determining group membership is pluggable. Checked on the masters.
Auditing HDFS can track access to files MapReduce can track who ran each job Provides fine grain logs of who did what With strong authentication, logs provide audit trails
Delegation Tokens To prevent authentication flood at the start of a job, NameNode creates delegation tokens. Allows user to authenticate once and pass credentials to all tasks of a job. JobTracker automatically renews tokens while job is running. Max lifetime of delegation tokens is 7 days. Cancels tokens when job finishes.
Primary Communication Paths 14
Kerberos and Single Sign-on Kerberos allows user to sign in once Obtains Ticket Granting Ticket (TGT) kinit – get a new Kerberos ticket klist – list your Kerberos tickets kdestroy – destroy your Kerberos ticket TGT’s last for 10 hours, renewable for 7 days by default Once you have a TGT, Hadoop commands just work hadoop fs –ls / hadoop jar wordcount.jar in-dir out-dir 15

More Related Content

What's hot

Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2benjaminwootton
 
Implementing Advanced Caching and Replication Techniques in ...
Implementing Advanced Caching and Replication Techniques in ...Implementing Advanced Caching and Replication Techniques in ...
Implementing Advanced Caching and Replication Techniques in ...webhostingguy
 
Server and Its Types - Presentation
Server and Its Types - PresentationServer and Its Types - Presentation
Server and Its Types - PresentationShakeel Haider
 
Presentation about servers
Presentation about serversPresentation about servers
Presentation about serversSasin Prabu
 
Big Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VMBig Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VMJeffrey Breen
 
Caching objects-in-memory
Caching objects-in-memoryCaching objects-in-memory
Caching objects-in-memoryMauro Cassani
 
Hbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jarsHbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jarsJinith Joseph
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationVlad Ponomarev
 
Configuring the Apache Web Server
Configuring the Apache Web ServerConfiguring the Apache Web Server
Configuring the Apache Web Serverwebhostingguy
 
WebSphere : High Performance Extensible Logging
WebSphere : High Performance Extensible LoggingWebSphere : High Performance Extensible Logging
WebSphere : High Performance Extensible LoggingJoseph's WebSphere Library
 
What is Server? (Web Server vs Application Server)
What is Server? (Web Server vs Application Server)What is Server? (Web Server vs Application Server)
What is Server? (Web Server vs Application Server)Amit Nirala
 
Cache Security- Configuring a Secure Environment
Cache Security- Configuring a Secure EnvironmentCache Security- Configuring a Secure Environment
Cache Security- Configuring a Secure EnvironmentInterSystems Corporation
 
Apache Hadoop & Hive installation with movie rating exercise
Apache Hadoop & Hive installation with movie rating exerciseApache Hadoop & Hive installation with movie rating exercise
Apache Hadoop & Hive installation with movie rating exerciseShiva Rama Krishna Dasharathi
 
MySQL and memcached Guide
MySQL and memcached GuideMySQL and memcached Guide
MySQL and memcached Guidewebhostingguy
 

What's hot (18)

Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2
 
Implementing Advanced Caching and Replication Techniques in ...
Implementing Advanced Caching and Replication Techniques in ...Implementing Advanced Caching and Replication Techniques in ...
Implementing Advanced Caching and Replication Techniques in ...
 
Web server
Web serverWeb server
Web server
 
Server and Its Types - Presentation
Server and Its Types - PresentationServer and Its Types - Presentation
Server and Its Types - Presentation
 
Presentation about servers
Presentation about serversPresentation about servers
Presentation about servers
 
Big Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VMBig Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VM
 
Caching objects-in-memory
Caching objects-in-memoryCaching objects-in-memory
Caching objects-in-memory
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
web server
web serverweb server
web server
 
Hbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jarsHbase coprocessor with Oozie WF referencing 3rd Party jars
Hbase coprocessor with Oozie WF referencing 3rd Party jars
 
Cache Security- The Basics
Cache Security- The BasicsCache Security- The Basics
Cache Security- The Basics
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
 
Configuring the Apache Web Server
Configuring the Apache Web ServerConfiguring the Apache Web Server
Configuring the Apache Web Server
 
WebSphere : High Performance Extensible Logging
WebSphere : High Performance Extensible LoggingWebSphere : High Performance Extensible Logging
WebSphere : High Performance Extensible Logging
 
What is Server? (Web Server vs Application Server)
What is Server? (Web Server vs Application Server)What is Server? (Web Server vs Application Server)
What is Server? (Web Server vs Application Server)
 
Cache Security- Configuring a Secure Environment
Cache Security- Configuring a Secure EnvironmentCache Security- Configuring a Secure Environment
Cache Security- Configuring a Secure Environment
 
Apache Hadoop & Hive installation with movie rating exercise
Apache Hadoop & Hive installation with movie rating exerciseApache Hadoop & Hive installation with movie rating exercise
Apache Hadoop & Hive installation with movie rating exercise
 
MySQL and memcached Guide
MySQL and memcached GuideMySQL and memcached Guide
MySQL and memcached Guide
 

Similar to Apache Hadoop India Summit 2011 talk "Making Apache Hadoop Secure" by Devaraj Das

Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBayHadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBayCloudera, Inc.
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialRoxycodone Online
 
Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop EcosystemDataWorks Summit
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Hive contributors meetup apache sentry
Hive contributors meetup   apache sentryHive contributors meetup   apache sentry
Hive contributors meetup apache sentryBrock Noland
 
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...Impetus Technologies
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityChris Nauroth
 
A glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyA glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyQA or the Highway
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterShivraj Raj
 

Similar to Apache Hadoop India Summit 2011 talk "Making Apache Hadoop Secure" by Devaraj Das (20)

Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Unit 5
Unit  5Unit  5
Unit 5
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBayHadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
 
Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop Ecosystem
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Hive contributors meetup apache sentry
Hive contributors meetup   apache sentryHive contributors meetup   apache sentry
Hive contributors meetup apache sentry
 
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
A glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika AcharyA glimpse of test automation in hadoop ecosystem by Deepika Achary
A glimpse of test automation in hadoop ecosystem by Deepika Achary
 
Schedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop clusterSchedulers optimization to handle multiple jobs in hadoop cluster
Schedulers optimization to handle multiple jobs in hadoop cluster
 

More from Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

More from Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Apache Hadoop India Summit 2011 talk "Making Apache Hadoop Secure" by Devaraj Das

  • 1. Making Apache HadoopSecure Devaraj Dasddas@yahoo-inc.comYahoo’s Hadoop Team Apache Hadoop India Summit 2011
  • 2. Who am I Principal Engineer at Yahoo! Sunnyvale Working on Hadoop and related projects Apache Hadoop Committer/PMC member Before Yahoo!, Sunnyvale – Yahoo! Bangalore Before Yahoo! – HP, Bangalore
  • 3. What is Hadoop? HDFS – Distributed File System Combines cluster’s local storage into a single namespace. All data is replicated to multiple machines. Provides locality information to clients MapReduce Batch computation framework Jobs divided into tasks. Tasks re-executed on failure Optimizes for data locality of input
  • 4.
  • 5. Need assurance that only the right people can see data.
  • 6.
  • 7. Security improves ability to share clusters between groups4
  • 8. Why is Security Hard? Hadoop is Distributed runs on a cluster of computers. Can’t determine the user on client computer. OS doesn’t tell you, must be done by application Client needs to authenticate to each computer Client needs to protect against fake servers
  • 9. Need Delegation Not just client-server, the servers access other services on behalf of others. MapReduce need to have user’s permissions Even if the user logs out MapReduce jobs need to: Get and keep the necessary credentials Renew them while the job is running Destroy them when the job finishes
  • 10.
  • 11. Including tasks running as part of MapReduce jobs
  • 12.
  • 13. Requirements Security must be optional. Not all clusters are shared between users. Hadoop must not prompt for passwords Makes it easy to make trojan horse versions. Must have single sign on. Must handle the launch of a MapReduce job on 4,000 Nodes Performance / Reliability must not be compromised
  • 14. Security Definitions Authentication – Determining the user Hadoop 0.20 completely trusted the user Sent user and groups over wire We need it on both RPC and Web UI. Authorization – What can that user do? HDFS had owners and permissions since 0.16. Auditing – Who did that?
  • 15. Authentication Changes low-level transport RPC authentication using SASL Kerberos (GSSAPI) Token Simple Browser HTTP secured via plugin Configurable translation from Kerberos principals to user names
  • 16. Authorization HDFS Command line and semantics unchanged MapReduce added Access Control Lists Lists of users and groups that have access. mapreduce.job.acl-view-job – view job mapreduce.job.acl-modify-job – kill or modify job Code for determining group membership is pluggable. Checked on the masters.
  • 17. Auditing HDFS can track access to files MapReduce can track who ran each job Provides fine grain logs of who did what With strong authentication, logs provide audit trails
  • 18. Delegation Tokens To prevent authentication flood at the start of a job, NameNode creates delegation tokens. Allows user to authenticate once and pass credentials to all tasks of a job. JobTracker automatically renews tokens while job is running. Max lifetime of delegation tokens is 7 days. Cancels tokens when job finishes.
  • 20. Kerberos and Single Sign-on Kerberos allows user to sign in once Obtains Ticket Granting Ticket (TGT) kinit – get a new Kerberos ticket klist – list your Kerberos tickets kdestroy – destroy your Kerberos ticket TGT’s last for 10 hours, renewable for 7 days by default Once you have a TGT, Hadoop commands just work hadoop fs –ls / hadoop jar wordcount.jar in-dir out-dir 15
  • 22. Task Isolation Tasks now run as the user. Via a small setuid program Can’t signal other user’s tasks or TaskTracker Can’t read other tasks jobconf, files, outputs, or logs Distributed cache Public files shared between jobs and users Private files shared between jobs
  • 23. Web UIs Hadoop relies on the Web UIs. These need to be authenticated also… Web UI authentication is pluggable. Yahoo uses an internal package We have written a very simple static auth plug-in SPNEGO plugin being developed All servlets enforce permissions.
  • 24. Proxy-Users Oozie (and other trusted services) run as headless user on behalf of other users Configure HDFS and MapReduce with the oozie user as a proxy: Group of users that the proxy can impersonate Which hosts they can impersonate from 19
  • 25. Questions? Questions should be sent to: common/hdfs/mapreduce-user@hadoop.apache.org Security holes should be sent to: security@hadoop.apache.org Available from (in production at Yahoo!) Hadoop Common http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security/ Thanks!