Nowadays a typical Hadoop deployment consists of the core Hadoop components – HDFS and MapReduce – several other components such as HBase, HttpFS, Oozie, Pig, Hive, Sqoop, and Flume, plus programmatic integration with external systems and applications. This effectively creates a complex, heterogeneous distributed environment that runs across several machines, uses different protocols for communication, and is used concurrently by several users and applications. When a Hadoop deployment and its ecosystem are used to process sensitive data (such as financial records, payment transactions, or healthcare records), several security requirements arise. These requirements may be dictated by internal policies and/or government regulations, and may call for strong authentication, selective authorization for access to data and resources, and data confidentiality. This session covers in detail how the different components in the Hadoop ecosystem and external applications can interact with each other securely, providing authentication, authorization, and confidentiality when accessing services and when transferring data to, from, and between services. Topics include Kerberos authentication, Web UI authentication, file system permissions, delegation tokens, access control lists, ProxyUser impersonation, and network encryption.
The Future of Hadoop Security - Hadoop Summit 2014 (Cloudera, Inc.)
Hadoop deployments are rapidly moving from pilots to production, enabling unprecedented opportunity to build big data applications that deliver faster access to more information to more users than ever before possible. Yet without the ability to address data security and compliance regulations, Hadoop will be limited to being just another data silo.
In this talk, Matt Brandwein and David Tishgart discuss the requirements for securing Hadoop and how Cloudera (now with Gazzang) and Intel are collaborating in the open to deliver comprehensive, transparent, compliance-ready security to unlock the potential of the Hadoop ecosystem and enable innovation without compromise.
Deploying Enterprise-grade Security for Hadoop (Cloudera, Inc.)
Deploying enterprise-grade security for Hadoop, or six security problems with Apache Hive. In this talk we discuss the security problems with Hive and then secure Hive with Apache Sentry. Additional topics include Hadoop security and role-based access control (RBAC).
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su... (Abhiraj Butala)
The talk covers limitations of current Hadoop eco-system components in handling security (Authentication, Authorization, Auditing) in multi-tenant, multi-application environments. Then it proposes how we can use Apache Ranger and HDFS super-user connections to enforce correct HDFS authorization policies and achieve the required auditing.
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Hadoop Security Features that make your risk officer happy (Anurag Shrivastava)
This talk was delivered by Anurag Shrivastava at Hadoop Summit 2015 Brussels. It covers how Apache Ranger, Apache Sentry, Apache Knox and Project Rhino can help you pass IT risk assessment in Hadoop projects.
Overview of Hadoop security (revised from a presentation at Hadoop in Taiwan, 2012). Covers detailed configuration of the security infrastructure leveraging Kerberos, along with extensive LDAP integration aimed at fast exchange of cluster information. The Etu Appliance is introduced at the end of the slides.
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera (Caserta)
In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.
As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance capabilities to critical data that is highly regulated and subject to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enable data governance, compliance, and security on Big Data.
For more information, visit www.casertaconcepts.com
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In... (Cloudera, Inc.)
This webinar discusses how you can use Navigator capabilities such as Encrypt and Key Trustee to secure data and enable compliance. Additionally, we will discuss our joint work with Intel on Project Rhino (an initiative to improve data security in Hadoop). We also hear from a security architect at a financial services company that is using encryption and key management to meet financial regulatory requirements.
NL HUG Feb 2016: Hadoop security from the trenches (Bolke de Bruin)
Setting up a secure Hadoop cluster involves a magic combination of Kerberos, Sentry, Ranger, Knox, Atlas, LDAP and possibly PAM. Add encryption on the wire and at rest to the mix and you have, at the very least, an interesting configuration and installation task.
Nonetheless, the fact that there are a lot of knobs to turn doesn't excuse you from the responsibility of taking proper care of your customers' data. In this talk, we detail how the different security components in Hadoop interact and how easy it can actually be to set things up correctly once you understand the concepts and tools. We outline a successful secure Hadoop setup with an example.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering Authentication, Authorization and Auditing. Hadoop takes a “defense in depth” approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases and design for authorization improvements in HDFS, Hive and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches to Hive authorization. We also show how our approach lends itself to a particular initial implementation choice that has the limitation that the Hive Server owns the data, but where a more general implementation is also possible down the road. In the case of HBase, we describe cell-level authorization. The talk is fairly detailed, targeting a technical audience, including Hadoop contributors.
Big-Data-as-a-Service (BDaaS) in an enterprise environment requires meeting the often contradictory goals of (1) providing your data scientists, analysts, and data engineers with a self-service consumption model; (2) delivering agile and scalable on-demand infrastructure for the rapidly evolving ecosystem of big data frameworks and application software; while (3) ensuring enterprise-grade capabilities for isolation, security, monitoring, etc.
In this presentation at our BDaaS meetup in Santa Clara, Tom Phelan (chief architect and co-founder of BlueData) reviewed these goals and how to resolve the potential contradictions. He also discussed the infrastructure, application, user experience, security, and maintainability considerations required before selecting (or designing and building) a Big-Data-as-a-Service platform for an enterprise big data deployment.
More info on this BDaaS meetup can be found at: http://www.meetup.com/Big-Data-as-a-Service/events/233999817
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge (CloudIDSummit)
Aaron T. Myers (ATM), Software Engineer, Cloudera, Inc.
The era of “Big Data for the masses” is upon us. Despite the mindshare Big Data has been receiving – driven by the development and distribution of Apache Hadoop – the first commercialized release was only in December of 2011, by Cloudera, Inc. Cloudera remains the leading Hadoop platform provider in the market today. Now, with a diverse enterprise and government early adopter customer list, through Cloudera we can get a bird’s eye view of the leading authentication issues beginning to emerge from these companies headed out of the sandbox and into full production.
Speaker Aaron T. Myers (ATM) was one of Cloudera’s earliest engineers and maintains a core focus on Apache Hadoop core, specifically focused on HDFS and Hadoop’s security features. ATM is an Apache Hadoop PMC Member and Committer.
Risk Management for Data: Secured and Governed (Cloudera, Inc.)
Cloudera Tech Day Presentation by Eddie Garcia, Chief Security Architect, Cloudera. Protecting enterprise data is an increasingly complex challenge given the diversity and sophistication of threat actors and their cyber-tactics. In this session, participants will hear a comprehensive introduction to Hadoop Security, including the “three A’s” for secure operating environments: Authentication, Authorization, and Audit. In addition, the presenter will cover strategies to orchestrate data security, encryption, and compliance, and will explain the Cloudera Security Maturity Model for Hadoop. Attendees will leave with a greater understanding of how effective INFOSEC relies on an enterprise big data governance and risk management approach.
Project Rhino: Enhancing Data Protection for Hadoop (Cloudera, Inc.)
Learn the history of Project Rhino and its importance, the progress that’s been made so far (including a deep dive into the new security features announced with CDH 5.3), and what’s next for Hadoop security.
Hadoop Security: Overview
1. Private Property: No Trespassing
Hadoop Security Explained
Aaron T. Myers
atm@cloudera.com
@atm
2. Who am I?
• Aaron T. Myers – Software Engineer, Cloudera
• Hadoop HDFS, Common Committer
• Master’s thesis on security sandboxing in the Linux kernel
• Primarily works on the Core Platform Team
3. Outline
• Hadoop Security Overview
• Hadoop Security pre CDH3
• Hadoop Security with CDH3
• Details of Deploying Secure Hadoop
• Summary
5. Why do we care about security?
• SecureCommerceWebSite, Inc. has a product that has both
paid ads and search
• “Payment Fraud” team needs logs of all credit card
payments
• “Search Quality” team needs all search logs and click
history
• “Ads Fraud” team needs to access both search logs and
payment info
• So we can't segregate these datasets to different clusters
• If they can share a cluster, we also get better utilization!
6. Security pre CDH3: User Authentication
• Authentication is by vigorous assertion
• Trivial to impersonate other user:
• Just set property “hadoop.job.ugi” when
running job or command
• Group resolution is done client side
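The impersonation hole above is easy to make concrete. A sketch, assuming a pre-CDH3 (insecure) cluster; the path, user, and group below are hypothetical:

```shell
# Pre-CDH3: servers trust whatever identity the client asserts.
# Setting hadoop.job.ugi to "user,group[,group...]" makes the cluster
# treat this shell as that user -- no password, no ticket, no proof:
hadoop fs -Dhadoop.job.ugi=hdfs,supergroup -rmr /user/alice/private
```

Any user with shell access to a client machine could run this, which is why file permissions alone (next slide) are not a real defense.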
8. Security pre CDH3: HDFS
• Unix-like file permissions were introduced in
Hadoop 0.16.1
• Provides standard user/group/other r/w/x
• Protects well-meaning users from accidents
• Does nothing to prevent malicious users from
causing harm (weak authentication)
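The permission model works like its POSIX counterpart; a sketch with hypothetical paths, users, and groups:

```shell
# Standard user/group/other r/w/x bits, managed like local Unix files:
hadoop fs -chmod 750 /user/alice/reports          # rwx owner, r-x group, --- other
hadoop fs -chown alice:analysts /user/alice/reports
hadoop fs -ls /user/alice                          # listing shows e.g. drwxr-x--- alice analysts
```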
9. Security pre CDH3: Job Control
• ACLs per job queue for job submission / killing
• No ACLs for viewing counters / logs
• Does nothing to prevent malicious users from
causing harm (weak authentication)
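Queue-level ACLs were configured per queue; a minimal sketch for the default queue in mapred-queue-acls.xml (user and group names hypothetical), using the "user1,user2 group1,group2" ACL value format:

```xml
<!-- mapred-queue-acls.xml: who may submit to / administer the default queue -->
<property>
  <name>mapred.queue.default.acl-submit-job</name>
  <value>alice,bob analysts</value>
</property>
<property>
  <name>mapred.queue.default.acl-administer-jobs</name>
  <value>alice hadoop-admins</value>
</property>
```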
10. Security pre CDH3: Tasks
• Individual tasks all run as the same user
• Whoever the TT is running as (usually 'hadoop')
• Tasks not isolated from each other
• Tasks which read/write from local storage can
interfere with each other
• Malicious tasks can kill each other
• Hadoop is designed to execute arbitrary code
12. Security with CDH3: User Authentication
• Authentication is secured by Kerberos v5
• RPC connections secured with SASL “GSSAPI”
mechanism
• Provides proven, strong authentication and
single-sign-on
• Hadoop servers can ensure that users are who
they say they are
• Group resolution is done on the server side
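The day-to-day client workflow under Kerberos looks roughly like this (realm and username are hypothetical examples):

```shell
# Obtain a TGT before talking to HDFS or the JobTracker:
kinit alice@FOOCORP.COM
# RPC connections are now authenticated via SASL/GSSAPI:
hadoop fs -ls /user/alice
# Inspect the ticket cache:
klist
```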
13. Security with CDH3: Server Authentication
• Kerberos authentication is bi-directional
• Users can be sure that they are communicating
with the Hadoop server they think they are
14. Security with CDH3: HDFS
• Same general permissions model
• Added sticky bit for directories (e.g. /tmp)
• But, a user can no longer trivially impersonate
other users (strong authentication)
15. Security with CDH3: Job Control
• A job now has its own ACLs, including a view ACL
• Job can now specify who can view logs, counters,
configuration, and who can modify (kill) it
• The JobTracker (JT) enforces these ACLs (strong authentication)
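A sketch of the per-job ACLs described above, set in the job configuration. The property names are the CDH3-era MapReduce ones as best I recall; the value syntax is "comma-separated-users space comma-separated-groups", and the users/groups shown are hypothetical:

```xml
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>alice,bob payfraud</value>
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value>alice</value>
</property>
```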
16. Security with CDH3: Tasks
• Tasks now run as the user who launched the job
• Probably the most complex part of Hadoop's
security implementation
• Ensures isolation of tasks which run on the same TT
• Local file permissions enforced
• Local system permissions enforced (e.g. signals)
• Can take advantage of per-user system limits
• e.g. Linux ulimits
17. Security with CDH3: Web Interfaces
• Out-of-the-box support for Kerberized SSL
• Pluggable servlet filters (more on this later)
18. Security with CDH3: Threat Model
• The Hadoop security system assumes that:
• Users do not have root access to cluster
machines
• Users do not have root access to shared user
machines (e.g. bastion box)
• Users cannot read or inject packets on the
network
21. Requirements: Kerberos Infrastructure
• A Kerberos realm with a KDC
• e.g. MIT Kerberos 5 (shipped with RHEL) or MS Active Directory
• Kerberos service principals (SPNs) for every daemon
• hdfs/hostname@REALM for the NameNode (NN), Secondary
NameNode (2NN), and DataNodes (DN)
• mapred/hostname@REALM for the TaskTrackers (TT) and
JobTracker (JT)
• host/hostname@REALM for web UIs
• Keytabs for service principals distributed to
correct hosts
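With MIT Kerberos, creating the service principals and keytabs sketched above looks roughly like this (hostnames, realm, and keytab path are hypothetical; the kadmin commands are standard MIT Kerberos):

```shell
# On the KDC: create randomized-key principals for a DataNode host
kadmin.local -q "addprinc -randkey hdfs/dn1.cluster.foocorp.com@CLUSTER.FOOCORP.COM"
kadmin.local -q "addprinc -randkey host/dn1.cluster.foocorp.com@CLUSTER.FOOCORP.COM"
# Export both keys into one keytab for that host
kadmin.local -q "ktadd -k /etc/hadoop/conf/hdfs.keytab \
    hdfs/dn1.cluster.foocorp.com host/dn1.cluster.foocorp.com"
# Copy the keytab to dn1 only, readable solely by the hdfs user:
chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
chmod 400 /etc/hadoop/conf/hdfs.keytab
```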
22. Configuring daemons for security
• Most daemons have two configs:
• Keytab location (e.g. dfs.datanode.keytab.file)
• Kerberos principal (e.g. dfs.datanode.kerberos.principal)
• Principal can use the special token '_HOST' to substitute the
hostname of the daemon (e.g. 'hdfs/_HOST@MYREALM')
• Several other configs to enable security in the first place
• See example-confs/conf.secure in CDH3
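Putting the two per-daemon configs together, a DataNode's hdfs-site.xml would contain something like the following (property names are from the slide; the keytab path and realm are assumptions):

```xml
<!-- hdfs-site.xml on every DataNode -->
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <!-- _HOST is replaced with the daemon's own hostname -->
  <value>hdfs/_HOST@CLUSTER.FOOCORP.COM</value>
</property>
```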
23. Setting up users
• Each user must have a Kerberos principal
• May want some shared accounts:
• The sharedaccount/alice and sharedaccount/bob
principals both act as user 'sharedaccount' on HDFS - you
can use this!
• An hdfs/alice principal is also useful for letting alice act as the HDFS superuser
• Users running MR jobs must also have unix accounts on
each of the slaves
• Centralized user database (eg LDAP) is a practical
necessity
24. Installing Secure Hadoop
• MapReduce and HDFS services should run as
separate users (e.g. 'hdfs' and 'mapred')
• New setuid task-controller executable allows
tasks to run as the submitting user
• New JNI code in libhadoop.so to plug subtle
security holes
• Install CDH3 with hadoop-0.20-sbin and hadoop-
0.20-native packages to get this all set up
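On a RHEL-style system (an assumption; the package names are from the slide), pulling in the setuid task-controller and the native security code is a one-liner:

```shell
sudo yum install hadoop-0.20-sbin hadoop-0.20-native
```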
25. Securing higher-level services
• Many “middle tier” applications need to act on
behalf of their clients when interacting with
Hadoop
• e.g: Oozie, Hive Server, Hue/Beeswax
• “Proxy User” feature provides secure
impersonation (think sudo).
• hadoop.proxyuser.oozie.hosts - IPs where
“oozie” may act as an impersonator
• hadoop.proxyuser.oozie.groups - groups whose
users “oozie” may impersonate
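The two proxy-user properties named above go in core-site.xml on the NameNode and JobTracker; the host and group values below are examples only:

```xml
<!-- Allow "oozie" to impersonate members of these groups,
     but only when connecting from these hosts -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>10.2.0.15</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>engstaff,searchquality</value>
</property>
```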
26. Customizing Security
• Current plug-in points:
• hadoop.http.filter.initializers - may configure a
custom ServletFilter to integrate with existing
enterprise web SSO
• hadoop.security.group.mapping - map a
kerberos principal (alice@FOOCORP.COM) to a
set of groups
(users,engstaff,searchquality,adsdata)
• hadoop.security.auth_to_local - regex
mappings of Kerberos principals to usernames
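A sketch of an auth_to_local setting (realm is hypothetical; the RULE syntax is the standard Kerberos/Hadoop rule format: build a string from the principal's components, match it against a regex, then apply a sed-style substitution):

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <!-- First matching rule wins. The RULE strips the realm from
       one-component FOOCORP.COM principals; DEFAULT handles
       principals in the cluster's own default realm. -->
  <value>
    RULE:[1:$1@$0](.*@FOOCORP\.COM)s/@.*//
    DEFAULT
  </value>
</property>
```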
27. Deployment Gotchas
• The MIT Kerberos 1.8.1 ticket cache (Ubuntu, RHEL 5.6+) is
incompatible with the Java Krb5 implementation
• Run "kinit -R" after kinit to work around this
• Enable allow_weak_crypto in /etc/krb5.conf -
necessary for kerberized SSL
• Must deploy the JCE "Unlimited Strength Jurisdiction Policy"
JARs in JAVA_HOME/jre/lib/security (required for AES-256)
• Lifesaver: HADOOP_OPTS="-Dsun.security.krb5.debug=true" hadoop ...
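The two workarounds above, spelled out (user and realm are hypothetical):

```shell
# MIT krb5 1.8.1's ticket cache isn't readable by Java as written;
# an immediate renew rewrites it in a compatible form:
kinit alice@FOOCORP.COM
kinit -R

# Verbose JVM-side Kerberos tracing when things still fail:
HADOOP_OPTS="-Dsun.security.krb5.debug=true" hadoop fs -ls /
```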
28. Best Practices for AD Integration
• MIT Kerberos realm inside cluster:
• CLUSTER.FOOCORP.COM
• Existing Active Directory domain:
• FOOCORP.COM or maybe AD.FOOCORP.COM
• Set up one-way cross-realm trust
• Cluster realm must trust corporate AD realm
• See “Step by Step Guide to Kerberos 5
Interoperability” in Windows Server docs
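A sketch of the client-side pieces of the cross-realm setup described above (KDC hostnames are assumptions). The trust itself is established by creating a krbtgt/CLUSTER.FOOCORP.COM@FOOCORP.COM principal with matching passwords and enctypes in both the AD domain and the cluster KDC:

```
# /etc/krb5.conf on cluster hosts
[realms]
  CLUSTER.FOOCORP.COM = { kdc = kdc1.cluster.foocorp.com }
  FOOCORP.COM = { kdc = dc1.foocorp.com }

[domain_realm]
  .cluster.foocorp.com = CLUSTER.FOOCORP.COM
  .foocorp.com = FOOCORP.COM
```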
30. What Hadoop Security Is
• Strong authentication
• Malicious impersonation is no longer trivial
• Better authorization
• More control over who can view/control jobs
• Ensure isolation between running tasks
• An ongoing development priority
31. What Hadoop Security Is Not
• Encryption on the wire
• Encryption on disk
• Protection against DOS attacks
• Enabled by default
32. Security Beyond Core Hadoop
• Comprehensive documentation and best
practices
• https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide
• All components of CDH3 are capable of
interacting with a secure Hadoop cluster
• Hive 0.7 (included in CDH3) added a rich set of
access controls
• Much easier deployment if you use Cloudera
Enterprise
33. Security Roadmap
• Pluggable “edge authentication” (e.g. PKI, SAML)
• More authorization features across CDH
components
• e.g. HBase access controls
• Data encryption support