A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Alongside the Hive Metastore, these table formats aim to solve long-standing problems of traditional data lakes through declared features such as ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
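To make those features concrete, here is a minimal PySpark sketch, assuming a SparkSession started with the delta-spark package; the table path and column names are illustrative. It shows an ACID upsert via MERGE and a time-travel read with Delta Lake; Hudi and Iceberg expose analogous operations through their own options.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes Spark is launched with the delta-spark package and the Delta SQL extensions.
spark = (SparkSession.builder
         .appName("table-format-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Initial write creates version 0 of the table.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]).write \
    .format("delta").mode("overwrite").save("/tmp/events")

# Upsert: merge new records into the existing table as a single ACID commit.
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "val"])
(DeltaTable.forPath(spark, "/tmp/events").alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of the first commit.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events").show()
```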
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity in the last few years, in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
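As a small illustration of the SQL-on-anything model, the sketch below uses the presto-python-client package to run a single query that joins tables from two different catalogs; the host, catalogs, and table names are placeholder assumptions, not details from the talk.

```python
import prestodb  # presto-python-client

# Placeholder connection details; adjust host/port/catalog/schema for your cluster.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# A single query can join tables exposed by different connectors (here: hive and mysql catalogs).
cur.execute("""
    SELECT o.region, count(*) AS orders
    FROM hive.sales.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```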
Apache Spark on Kubernetes - Anirudh Ramanathan and Tim Chen, Databricks
Kubernetes is a fast-growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first-class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
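For orientation, here is a hedged sketch of pointing a PySpark application at a Kubernetes cluster; the API server URL, namespace, container image, and service account are placeholders, and in practice these properties are usually passed to spark-submit rather than set in application code.

```python
from pyspark.sql import SparkSession

# Placeholder cluster endpoint and image; executor pods are created via the Kubernetes API.
spark = (SparkSession.builder
         .appName("spark-on-k8s-demo")
         .master("k8s://https://kubernetes.example.com:6443")
         .config("spark.kubernetes.namespace", "spark-jobs")
         .config("spark.kubernetes.container.image", "example/spark:3.5.0")
         .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
         .config("spark.executor.instances", "4")
         .getOrCreate())

# Trivial job to exercise the executors.
spark.range(1_000_000).selectExpr("sum(id)").show()
```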
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... - Databricks
Uber has a real need to provide faster, fresher data to data consumers and products, which run hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of Hudi, a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and how to build notebooks and dashboards on top using Spark SQL.
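As a hedged sketch of the Datasource path just mentioned, the snippet below writes an upsert into a Hudi table from PySpark and reads it back with Spark SQL; the table name, key fields, and paths are illustrative placeholders, and it assumes Spark was launched with the Hudi Spark bundle on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-ingest").getOrCreate()

df = spark.createDataFrame(
    [("r1", "2019-01-01", 10.0), ("r2", "2019-01-01", 12.5)],
    ["uuid", "ds", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",                       # placeholder table name
    "hoodie.datasource.write.recordkey.field": "uuid",  # record key used for upserts
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "fare",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/hudi/trips"))

# Query back with Spark SQL, as one would from a notebook or dashboard.
spark.read.format("hudi").load("/tmp/hudi/trips").createOrReplaceTempView("trips")
spark.sql("SELECT ds, count(*) AS n FROM trips GROUP BY ds").show()
```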
Data in Hadoop is getting bigger every day, the number of data consumers is growing, and organizations are now looking to make their Hadoop clusters compliant with federal regulations and commercial demands. Apache Ranger simplifies the management of security policies across all components in Hadoop and provides granular access controls to data.
The deck describes the security tools available in Hadoop and their purpose, then moves on to discuss Apache Ranger in detail.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, a different Hadoop file system, or an internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using the Ambari Metrics System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
Building robust CDC pipeline with Apache Hudi and Debezium - Tathastu.ai
We cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We also cover the architecture and the challenges we faced while running this system in production. Finally, we conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
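A hedged sketch of one way such a pipeline can be wired together, assuming Debezium publishes row-level change events to a Kafka topic as JSON and Spark Structured Streaming upserts the "after" image into a Hudi table; the topic, schema, and paths are placeholders rather than details from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

# Simplified schema for the "after" image of a Debezium change event (placeholder fields).
after_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("updated_at", StringType()),
])
envelope = StructType([StructField("after", after_schema)])

# Read the raw change events from Kafka and unwrap the row image.
changes = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder brokers
           .option("subscribe", "dbserver1.inventory.users")  # placeholder topic
           .load()
           .select(from_json(col("value").cast("string"), envelope).alias("e"))
           .select("e.after.*"))

# Continuously upsert the changes into a Hudi table keyed by id.
query = (changes.writeStream.format("hudi")
         .option("hoodie.table.name", "users_cdc")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "updated_at")
         .option("checkpointLocation", "/tmp/checkpoints/users_cdc")
         .outputMode("append")
         .start("/tmp/hudi/users_cdc"))
query.awaitTermination()
```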
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 - mumrah
Apache Kafka is a new breed of messaging system built for the "big data" world. Coming out of LinkedIn (and donated to Apache), it is a distributed pub/sub system built in Scala. It has been an Apache TLP now for several months with the first Apache release imminent. Built for speed, scalability, and robustness, Kafka should definitely be one of the data tools you consider when designing distributed data-oriented applications.
The talk will cover a general overview of the project and technology, with some use cases, and a demo.
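For a feel of the pub/sub model described above, here is a minimal sketch using the kafka-python client; the broker address and topic name are placeholders.

```python
from kafka import KafkaConsumer, KafkaProducer

# Placeholder broker and topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# Consumers in the same group share partitions; separate groups each see every message.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break  # just demonstrate one record
```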
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will discuss how to format your data and which options to use to maximize your read performance, including when and how to use ORC's schema evolution, bloom filters, and predicate push down. It will also show how to use the tools to translate ORC files into human-readable formats such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
How to build a streaming Lakehouse with Flink, Kafka, and Hudi - Flink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
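A rough PyFlink SQL sketch of the Kafka-to-Hudi flow described above; the topics, schemas, and connector properties are illustrative assumptions (the exact Hudi options depend on the bundle version), and the Kafka and Hudi connector jars are assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic of order events (placeholder topic and brokers).
t_env.execute_sql("""
    CREATE TABLE orders_kafka (
        order_id STRING,
        amount DOUBLE,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: a Hudi merge-on-read table keyed by order_id (placeholder path).
t_env.execute_sql("""
    CREATE TABLE orders_hudi (
        order_id STRING,
        amount DOUBLE,
        ts TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'file:///tmp/hudi/orders',
        'table.type' = 'MERGE_ON_READ'
    )
""")

# Continuously upsert the Kafka stream into the Hudi table.
t_env.execute_sql("INSERT INTO orders_hudi SELECT order_id, amount, ts FROM orders_kafka")
```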
by
Ethan Guo & Kyle Weller
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Iceberg: A modern table format for big data (Strata NY 2018) - Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout which addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, listed below (a minimal usage sketch follows the list):
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
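The sketch below is a minimal PySpark illustration of these properties, assuming Spark is launched with the iceberg-spark-runtime package; the catalog name, warehouse path, and table are placeholders.

```python
from pyspark.sql import SparkSession

# A Hadoop-type Iceberg catalog named "demo" pointed at a local warehouse directory.
spark = (SparkSession.builder
         .appName("iceberg-demo")
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
         .getOrCreate())

spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")

# Every commit produces a snapshot; readers get snapshot isolation without directory listings.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```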
It introduces and illustrates use cases, benefits, and problems for Kerberos deployment on Hadoop, and how Token support and TokenPreauth can help solve those problems. It also briefly introduces the Haox project, a Java client library for Kerberos.
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users - DataWorks Summit
Apache Knox Gateway is a proxy for interacting with Apache Hadoop clusters in a secure way, providing authentication, service-level authorization, and many other extensions to secure any HTTP interactions in your cluster. One main feature of Apache Knox Gateway is the ability to extend the reach of your REST APIs to the internet while still securing your cluster and working with Kerberos. Recent contributions to the Apache Knox community have added support for Single Sign On (SSO) based on Pac4j 1.8.9, a very powerful security engine which provides SSO support through SAML2, OAuth, OpenID, and CAS. In addition, through recent community contributions, Apache Ambari and Apache Ranger can now also provide SSO authentication through Knox. This paper will discuss the architecture of Knox SSO, explain how enterprise users can benefit from this feature, and present enterprise use cases for Knox SSO as well as integration with open source Shibboleth, ADFS Windows Server IdP support, and the Okta cloud IdP.
Big Data and Security - Where are we now? (2015) - Peter Wood
Peter Wood started looking at Big Data as a solution for Advanced Threat Protection in 2013. This presentation examines how Big Data is being used for security in 2015, how this market is developing and how realistic vendor offerings are.
Abstract:
As organizations start to roll out or migrate data-driven applications to Apache Hadoop, there are times when they have conflicting needs: to leverage their full co-mingled data sets in Hadoop while providing isolation of sections of such co-mingled data to a specific customer. Serving multiple customers in this manner is a typical multi-tenant use case, and one that can be challenging in Apache Hadoop.
This presentation walks through a number of patterns that can be leveraged for providing isolation of tenants based on the composability of Apache Knox for:
* Authentication/Federation Providers
* KnoxSSO
* Identity Assertion
* Tenant specific topologies
With these patterns, Knox can provide an infrastructure for robust tenant isolation and access control for application UIs and REST APIs for your data landscape, when suitably coupled with a cluster that has carefully considered infrastructure including:
* Kerberos
* Tenant specific user accounts, OUs and Groups within LDAP
* Authorization policy that is aware of the tenant-specific groups
Summary:
We will walk through some of the patterns that have been used to enable such a multi-tenant environment as well as the specific considerations for topology, access control and user accounts involved with creating such an environment.
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
Speaker:
Jeff Sposetti, Product Management, Hortonworks
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ... - Kevin Minder
Securing Hadoop's REST APIs with Apache Knox Gateway
Presented at Hadoop Summit on June 6th, 2014
Describes the overall roles the Apache Knox Gateway plays in Hadoop security and briefly covers its primary features.
As Hadoop becomes a critical part of Enterprise data infrastructure, securing Hadoop has become critically important. Enterprises want assurance that all their data is protected and that only authorized users have access to the relevant bits of information. In this session we will cover all aspects of Hadoop security including authentication, authorization, audit and data protection. We will also provide demonstration and detailed instructions for implementing comprehensive Hadoop security.
Troubleshooting Kerberos in Hadoop: Taming the Beast - DataWorks Summit
Kerberos is the ubiquitous authentication mechanism when it comes to securing any Hadoop service. With recent updates in Hadoop core and various Apache Hadoop components, inherent Kerberos support has matured and has come a long way.
Understanding and configuring Kerberos is still a challenge, but even more painful and frustrating is troubleshooting a Kerberos issue. There are a lot of things (small and big) that can go wrong (and will go wrong!). This talk covers the Kerberos debugging part in detail and discusses the tools and tricks that can be used to narrow down any Kerberos issue.
Rather than discussing individual issues and their resolutions, we will focus on how to approach a Kerberos problem and the do's and don'ts of the Kerberos scene. This talk will provide a step-by-step guide that will equip the audience for troubleshooting future Kerberos problems.
Agenda is to discuss:
- Systematic approach to Kerberos troubleshooting
- Kerberos Tools available in Hadoop arsenal
- Tips & Tricks to narrow down Kerberos issues quickly
- Some nasty Kerberos issues from Support trenches
Some prior knowledge of Kerberos basics will be appreciated but is not a prerequisite.
Speaker:
Vipin Rathor, Sr. Product Specialist (HDP Security), Hortonworks
Treat your enterprise data lake indigestion: Enterprise ready security and go... - DataWorks Summit
Most enterprises with large data lakes today are flying blind when it comes to understanding how the data in their data lakes is organized, accessed, and utilized to create real business value. Coupled with the need to democratize data, enterprises often realize they have created a data swamp loaded with all kinds of data assets, without any curation and without appropriate security controls, hoping that developers and analysts can responsibly collaborate to generate insights. In this talk we will provide a broad overview of how organizations can use open source frameworks such as Apache Ranger and Apache Knox to secure their data lakes, and Apache Atlas to effectively provide open metadata and governance services for the Hadoop ecosystem. We will provide an overview of the new features that have been added to each of these Apache projects recently and how enterprises can leverage these new features to build a robust security and governance model for their data lakes.
Speaker
Owen O'Malley, Co-Founder & Technical Fellow, Hortonworks
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
http://www.justin.tv/hackertv/49975/Tech_Talk_1_Leah_Culver_on_OAuth
Tech talk about OAuth, an open standard for API authentication. Originally broadcast on Justin.tv.
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
Nowadays a typical Hadoop deployment consists of core Hadoop components – HDFS and MapReduce – several other components such as HBase, HttpFS, Oozie, Pig, Hive, Sqoop, Flume, plus programmatic integration from external systems and applications. This effectively creates a complex and heterogeneous distributed environment that runs across several machines and uses different protocols to communicate with each other; all of which is used concurrently by several users and applications. When a Hadoop deployment and its ecosystem is used to process sensitive data (such as financial records, payment transactions, healthcare records), several security requirements arise. These security requirements may be dictated by internal policies and/or government regulations. They may require strong authentication, selective authorization to access data/resources, and data confidentiality. This session covers in detail how different components in the Hadoop ecosystem and external applications can interact with each other in a secure manner providing authentication, authorization, and confidentiality when accessing services and transferring data to/from/between services. The session will cover topics like Kerberos authentication, Web UI authentication, File System permissions, delegation tokens, Access Control Lists, ProxyUser impersonation and network encryption.
Historically, security hasn't been a high priority in regards to Hadoop (a reflection of the type of data and organizations using Hadoop), but now Hadoop is being used by more traditional firms with heightened security requirements. MapR's Senior Principal Technologist, Keys Botzum, gives a talk on how you can build a secure cluster.
Hadoop World 2011: Hadoop Gateway - Konstantin Schvako, eBay - Cloudera, Inc.
Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks -- as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users’ workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera - Caserta
In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.
As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance capabilities for critical data that is highly regulated and adhere to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enables data governance, compliance and security on Big Data.
For more information, visit www.casertaconcepts.com
Securing Hadoop by MapR's Senior Principal Technologist Keys Botzum - MapR Technologies
Historically, security hasn't been a high priority in regards to Hadoop (a reflection of the type of data and organizations using Hadoop), but now Hadoop is being used by more traditional firms with heightened security requirements. MapR's Senior Principal Technologist, Keys Botzum, gives a talk on how you can build a more secure cluster.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering authentication, authorization, and auditing. Hadoop takes a "defense in depth" approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases, and design for authorization improvements in HDFS, Hive, and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches to Hive authorization. Further, we show how our approach lends itself to a particular initial implementation choice that has the limitation that the Hive Server owns the data, but where an alternate, more general implementation is also possible down the road. In the case of HBase, cell-level authorization is explained. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.
Securing Hadoop by Sr. Principal Technologist Keys Botzum - MapR Technologies
Presentation given at the Hadoop DC meetup on July of 2014. Keys presented on why Hadoop should be secured, what it takes, and how to get security beyond the core.
Keys Botzum - Senior Principal Technologist at MapR
Keys Botzum has over 15 years of experience in large scale distributed system design. At MapR his primary responsibility is working with customers as a consultant, but he also teaches classes, contributes to documentation, and works with MapR engineering. Previously he was a Senior Technical Staff Member with IBM and a respected author of many articles on WebSphere Application Server as well as a book. He holds a Masters degree in Computer Science from Stanford University and a B.S. in Applied Mathematics/Computer Science from Carnegie Mellon University.
Protect your private data with ORC column encryption - Owen O'Malley
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly.
Owen O’Malley dives into the progress the Apache community made for adding fine-grained column-level encryption natively into ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.
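As a rough illustration of the column-level configuration involved, the sketch below issues a Hive-style DDL through Spark SQL using the table-property names from the Apache ORC column-encryption documentation; the table, columns, and the "pii" master key are assumptions for the example, and in practice the master key is managed by the Hadoop KMS configured for the cluster.

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes a Hive metastore and a KMS that holds the 'pii' master key.
spark = (SparkSession.builder
         .appName("orc-column-encryption-demo")
         .enableHiveSupport()
         .getOrCreate())

# Encrypt the ssn and email columns with the 'pii' master key; readers without access to
# that key see a masked (here: nullified) view of ssn instead of the real values.
spark.sql("""
    CREATE TABLE customers (
        id BIGINT,
        name STRING,
        ssn STRING,
        email STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
        'orc.encrypt' = 'pii:ssn,email',
        'orc.mask' = 'nullify:ssn'
    )
""")
```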
Fine Grain Access Control for Big Data: ORC Column Encryption - Owen O'Malley
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads, but with integrated support for finding required rows quickly. In this talk, we will outline the progress made in Apache community for adding fine-grained column level encryption natively into ORC format that will also provide capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys providing the additional flexibility to use and manage keys per column centrally.
Fast Access to Your Data - Avro, JSON, ORC, and Parquet - Owen O'Malley
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
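A hedged PySpark sketch of the three read patterns examined, written so that ORC and Parquet can be swapped by changing only the format string; the file paths and column names are placeholders, not the benchmark's actual dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("format-read-patterns").getOrCreate()

for fmt, path in [("orc", "/data/events_orc"), ("parquet", "/data/events_parquet")]:
    df = spark.read.format(fmt).load(path)

    # 1. Read all of the columns.
    df.count()

    # 2. Read a few of the columns; columnar formats skip the rest on disk.
    df.select("event_type", "ts").count()

    # 3. Filter with a predicate that can be pushed down to the reader, letting
    #    ORC/Parquet skip stripes or row groups using their min/max statistics.
    df.filter(col("event_type") == "purchase").select("ts").count()
```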
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout which addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Hive, and Presto.
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet - Owen O'Malley
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.
To provide better security, ORC files are adding column encryption. Column encryption provides the ability to grant access to different columns within the same file. All of the encryption is handled transparently to the user.
File Format Benchmarks - Avro, JSON, ORC, & Parquet - Owen O'Malley
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the Github logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.
Protecting Enterprise Data in Apache Hadoop - Owen O'Malley
From Hadoop Summit 2015, San Jose
From Apache BigData 2016, Vancouver
Hadoop has long had strong authentication via integration with Kerberos, authorization via User/Group/Other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC columnar format will also be covered.
Hadoop has long had strong authentication via integration with Kerberos, authorization via user/group/other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC file format will also be covered.
Structor - Automated Building of Virtual Hadoop Clusters - Owen O'Malley
Discusses vagrant scripts to setup and deploy a working Hadoop multiple node cluster with or without security. All source code is available on https://github.com/hortonworks/structor .
ORC File and Vectorization - Hadoop Summit 2013 - Owen O'Malley
Eric Hanson and I gave this presentation at Hadoop Summit 2013:
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding -- resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Finally, ORC works together with the upcoming query vectorization work providing a high bandwidth reader/writer interface.
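A small PySpark sketch of the writer- and reader-side choices described above: pick a generic codec on top of ORC's lightweight encodings when writing, then rely on projection and predicate pushdown when reading back; the paths and column layout are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orc-write-read").getOrCreate()

df = spark.range(100_000).selectExpr("id", "id % 7 AS bucket", "cast(id AS string) AS payload")

# Write ORC with a generic codec (none/snappy/zlib/lzo) layered over ORC's
# dictionary, run-length, and delta encodings.
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_demo")

# Reading a single column touches only that column's streams, and the equality predicate
# can be pushed down so stripes whose min/max statistics exclude the value are skipped.
spark.read.orc("/tmp/orc_demo").filter(col("bucket") == 3).select("id").count()
```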
Andrew Ryan describes how Facebook operates Hadoop to provide access as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
The next generation of Hadoop MapReduce
Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility, and hardware utilization.
More information and video available at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud or on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
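To illustrate the Python binding mentioned above, here is a minimal pypowsybl sketch that loads a bundled IEEE 14-bus test network and runs an AC load flow; it assumes the pypowsybl package is installed and uses only example data shipped with the library.

```python
import pypowsybl as pp

# Load one of the bundled example networks (IEEE 14-bus) rather than a real grid model.
network = pp.network.create_ieee14()

# Run an AC load flow and inspect the per-component convergence results.
results = pp.loadflow.run_ac(network)
for result in results:
    print(result.status)

# Network data is exposed as pandas DataFrames, e.g. bus voltages after the solve.
print(network.get_buses()[["v_mag", "v_angle"]].head())
```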
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients' needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lively workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
3. Problem Statement
• The fundamental goal of adding Hadoop security is that Yahoo's data stored in HDFS must be secure from unauthorized access. Furthermore, it must do so without adding significant effort to operating or using the Grid. Based on that goal, there are a few implications.
  – All HDFS clients must be authenticated to ensure that the user is who they claim to be. That implies that all map/reduce users, including services such as Oozie, must also be authenticated, and that tasks must run with the privileges and identity of the submitting user.
  – Since DataNodes and TaskTrackers are entrusted with user data and credentials, they must authenticate themselves to ensure they are running as part of the Grid and are not trojan horses.
  – Kerberos will be the underlying authentication service so that users can be authenticated using their system credentials (a minimal login sketch follows this list).
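To make the Kerberos requirement concrete, here is a minimal, hedged sketch of how a Hadoop client process could log in from a keytab before talking to HDFS. The principal name and keytab path are placeholders, not values from the slides, and the configuration key shown assumes a Kerberos-enabled cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell Hadoop's security layer to use Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Placeholder principal and keytab path; in practice these come from
        // the site's Kerberos setup.
        UserGroupInformation.loginUserFromKeytab(
                "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

        // Once logged in, ordinary HDFS calls carry the Kerberos identity.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("exists(/user/alice): " + fs.exists(new Path("/user/alice")));
    }
}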
5. Security Threats in Hadoop
• User to Service Authentication
  – No User Authentication on NameNode or JobTracker
    • Client code supplies user and group names
  – No User Authorization on DataNode – Fixed in 0.21
    • Users can read/write any block
  – No User Authorization on JobTracker
    • Users can modify or kill other users' jobs
    • Users can modify the persistent state of the JobTracker
• Service to Service Authentication
  – No Authentication of DataNodes and TaskTrackers
    • Users can start fake DataNodes and TaskTrackers
• No Encryption on Wire or Disk
6. Solutions to Threats
• Add Kerberos-based authentication to the NameNode and JobTracker.
• Add delegation tokens to HDFS and support for them in MapReduce.
• Determine the user's group membership on the NameNode and JobTracker.
• Protect the MapReduce system directory from users.
• Add an authorization model to MapReduce so that only the submitting user can modify or kill a job.
• Add Backyard authentication to the Web UIs.
7. Out of Scope for 0.20.100
• Protecting against root on slave nodes:
  – Encryption of RPC messages
  – Encryption of the block transfer protocol
  – Encryption of MapReduce transient files
  – Encryption of HDFS block files
• Passing Kerberos tickets to MapReduce tasks for third-party Kerberized services.
8. HDFS Security
• Users authenticate via Kerberos.
• MapReduce jobs can obtain delegation tokens for later use (see the sketch after this list).
• When clients are reading or writing an HDFS file, the NameNode generates a block access token that will be verified by the DataNode.
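As a hedged illustration of the delegation-token bullet, the sketch below shows how a client that has already logged in via Kerberos could ask the NameNode for a delegation token. The renewer principal is a placeholder, and FileSystem#getDelegationToken is one form of the method the deck later describes as "HDFS adds a method to get delegation tokens"; the exact API differs across Hadoop releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

public class DelegationTokenFetchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode for a delegation token, naming the JobTracker
        // principal as the party allowed to renew it (placeholder value).
        Token<?> token = fs.getDelegationToken("mapred/jt.example.com@EXAMPLE.COM");

        // The job client would place this token in the job's credentials so that
        // tasks can talk to HDFS without holding Kerberos tickets.
        System.out.println("token kind=" + token.getKind() + " service=" + token.getService());
    }
}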
10. What does this *really* look like?
• Need a Kerberos ticket to work
  – kinit -l 7d oom@DS.CORP.YAHOO.COM
  – hadoop fs -ls
  – hadoop jar my.jar in-dir out-dir
• Works using the ticket cache!
  – Can display the ticket cache with klist.
12. Delegation Token
• Advantages over using Kerberos directly:
  – Don't trust the JobTracker with credentials
  – Avoid an authentication flood from MapReduce tasks
  – Renewable by a third party (i.e. the JobTracker)
  – Revocable when the job finishes
• tokenId = {owner prin, renewer prin, issueDate, maxDate}
• tokenAuthenticator = HMAC(masterKey, tokenId)
• Token = {tokenId, tokenAuthenticator}
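A minimal sketch of the authenticator construction above, assuming HMAC-SHA1 (the hash named later on the interfaces slide). The tokenId serialization and key bytes here are placeholders; real Hadoop tokens use a compact Writable encoding.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class DelegationTokenAuthenticatorSketch {
    // tokenAuthenticator = HMAC-SHA1(masterKey, tokenId)
    static byte[] authenticator(byte[] masterKey, byte[] tokenId) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(masterKey, "HmacSHA1"));
        return mac.doFinal(tokenId);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder encoding of {owner prin, renewer prin, issueDate, maxDate}.
        byte[] tokenId = "alice@EXAMPLE.COM|mapred/jt@EXAMPLE.COM|1275000000|1275604800"
                .getBytes(StandardCharsets.UTF_8);
        byte[] masterKey = "namenode-master-key".getBytes(StandardCharsets.UTF_8);

        byte[] auth = authenticator(masterKey, tokenId);
        System.out.println("authenticator bytes: " + auth.length); // 20 for SHA-1
    }
}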
13. Block Access Token
• Only the NameNode knows the set of users allowed to access a specific block, so the NameNode gives authorized clients a block access token.
• Capabilities include read, write, copy, or replace.
• The NameNode and DataNodes share a dynamically rolled secret key to secure the tokens.
• tokenId = {expiration, keygen, owner, block, access}
• tokenAuthenticator = HMAC(blockKey, tokenId)
• token = {tokenId, tokenAuthenticator}
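Because the DataNode shares the rolled block key with the NameNode, it can verify a presented token simply by recomputing the HMAC. A hedged sketch of that check follows; the method and parameter names are illustrative, not Hadoop's actual classes.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

public class BlockTokenVerifySketch {
    // DataNode-side check: recompute HMAC-SHA1(blockKey, tokenId) and compare it
    // with the authenticator the client presented.
    static boolean verify(byte[] blockKey, byte[] tokenId, byte[] presentedAuthenticator)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(blockKey, "HmacSHA1"));
        byte[] expected = mac.doFinal(tokenId);
        // Constant-time comparison avoids leaking how many leading bytes matched.
        return MessageDigest.isEqual(expected, presentedAuthenticator);
    }
}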
14. MapReduce Security
• Require Kerberos authentication from the client.
• Secure the information about pending and running jobs
  – Store the job configuration and input splits in HDFS under ~user/.staging/$jobid
  – Store the job's location and secrets in a private directory
• The JobTracker creates a random job token. It is used for:
  – Connecting to the TaskTracker's RPC
  – Authorizing HTTP GETs for the shuffle
• HMAC(job token, URL) is sent from reduce tasks to the TaskTracker (see the sketch after this list)
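A hedged sketch of the shuffle authorization step: the reduce task proves knowledge of the job token by sending an HMAC of the shuffle URL alongside its HTTP GET, and the TaskTracker recomputes it with the same token. The URL, encoding, and header handling here are illustrative placeholders, not the exact wire format.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ShuffleUrlHashSketch {
    // hash = Base64(HMAC-SHA1(jobToken, shuffleUrl)); sent with the shuffle request.
    static String urlHash(byte[] jobToken, String shuffleUrl) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(jobToken, "HmacSHA1"));
        return Base64.getEncoder()
                .encodeToString(mac.doFinal(shuffleUrl.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        byte[] jobToken = "random-per-job-secret".getBytes(StandardCharsets.UTF_8);
        String url = "http://tt1.example.com:50060/mapOutput?job=job_0001&map=m_000003&reduce=2";
        System.out.println("UrlHash: " + urlHash(jobToken, url));
    }
}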
16. MapReduce Task Security
• Users have separate task directories with permissions set to 700.
• Distributed cache is now divided based on the source's visibility
  – Global – shared with other users
  – Private – protected from other users
17. Web UI
• MapReduce makes heavy use of the Web UI for displaying the state of the cluster and running jobs.
• HDFS also has a web browsing interface.
• Use Backyard to authenticate Web UI users.
• Only allow the submitting user of a job to view the stdout and stderr of the job's tasks.
• The HDFS web browser checks the user's authorization.
18. Oozie
• Client authenticates to Oozie
  – Custom auth for Yahoo!
• Oozie authenticates to HDFS and MapReduce as the "oozie" principal.
• "oozie" is configured as a super-user for HDFS and MapReduce and may act as other users (see the proxy-user sketch after this list).
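The "act as other users" behaviour corresponds to what the deck later calls the RPC doAs method. As a hedged sketch, a service holding the oozie credentials could impersonate an end user through Hadoop's UserGroupInformation proxy-user API roughly as follows; user names and paths are placeholders, and the cluster must also be configured to trust the proxying principal.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class OozieProxyUserSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The service ("oozie") has already authenticated, e.g. via a keytab login.
        UserGroupInformation oozieUser = UserGroupInformation.getLoginUser();

        // Build a proxy identity for the end user and run the HDFS call as them,
        // so files are created and authorized under the real user's identity.
        UserGroupInformation proxied =
                UserGroupInformation.createProxyUser("alice", oozieUser);

        proxied.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(conf);
            fs.mkdirs(new Path("/user/alice/oozie-output"));
            return null;
        });
    }
}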
19. Proxy Services Trust Model
• Requires trust that the service (e.g. Oozie) principal is secure.
• Explored and rejected:
  – Having user headless principals stored on the Oozie machine ("x/oozie" for user "x")
  – Passing the user's headless principal keytab to Oozie
  – Generalizing delegation tokens to have token-granting tokens
20. Protocols
• RPC
  – Change RPC to use SASL and either:
    • Kerberos authentication (GSSAPI)
    • Tokens (DIGEST-MD5) – see the sketch after this list
  – User's Kerberos tickets obtained at login are used automatically.
  – Changes the RPC format.
  – Can easily add encryption later.
• Block transfer protocol
  – Block access tokens in the data stream
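A hedged sketch of the token path through the standard Java SASL API: the client proves possession of a token by using the encoded token identifier as the SASL user name and the token authenticator as the password, which is the general idea behind "Tokens (DIGEST-MD5)" above. The protocol and server names, the encoding, and the callback wiring are illustrative, not Hadoop's exact implementation.

import java.util.Base64;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.sasl.RealmCallback;
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslClient;

public class TokenSaslClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder token material: identifier and HMAC authenticator bytes.
        String user = Base64.getEncoder().encodeToString("tokenId-bytes".getBytes());
        char[] password = Base64.getEncoder()
                .encodeToString("tokenAuthenticator-bytes".getBytes()).toCharArray();

        CallbackHandler handler = (Callback[] callbacks) -> {
            for (Callback cb : callbacks) {
                if (cb instanceof NameCallback) {
                    ((NameCallback) cb).setName(user);
                } else if (cb instanceof PasswordCallback) {
                    ((PasswordCallback) cb).setPassword(password);
                } else if (cb instanceof RealmCallback) {
                    ((RealmCallback) cb).setText(((RealmCallback) cb).getDefaultText());
                }
            }
        };

        // DIGEST-MD5 challenge/response; in the real protocol the challenge
        // arrives over the RPC connection from the NameNode or JobTracker.
        SaslClient client = Sasl.createSaslClient(
                new String[] {"DIGEST-MD5"}, null, "hdfs", "default", null, handler);
        System.out.println("mechanism = " + client.getMechanismName()
                + ", hasInitialResponse = " + client.hasInitialResponse());
    }
}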
21. Protocols
• HTTP
– User/Browser facing
• Yahoo – Custom Authentication
• External – SPNEGO or Kerberos login module
– Web Services
• HFTP – Hadoop File Transfer Protocol
• Others later
• SPNEGO or Delegation Token via RPC
– Shuffle
• Use HMAC of URL hashed with Job Token
22. Summary
• RPC
– Kerberos
• Application to NameNode, JobTracker
• DataNode to NameNode
• TaskTracker to JobTracker
– Digest-MD5
• MapReduce task to NameNode, TaskTracker
• Block Access Token
• Backyard
– User to Web UI
23. [Diagram: authentication mechanism per connection. Labels include: the NameNode is reached via RPC (Kerberos for the user's initial access, the Secondary NameNode and the Balancer; DIGEST-MD5 delegation tokens for tasks accessing HDFS as the user) and via HTTP (Backyard for the browser, Secondary NameNode and fsck; HFTP with HTTP-DIGEST delegation tokens for distcp tasks accessing as the user); DataNodes are reached via HTTP/HFTP (user, DataNode, Balancer, task) and via the data-transfer socket with block access tokens; the browser forwards delegation tokens.]
26. Interfaces (and their scope and stability)
• Imported Interfaces
  – SASL – Standard for supporting token and Kerberos authentication
  – GSSAPI – Kerberos part of SASL authentication
  – HMAC-SHA1 – Shared secret authentication
  – JAAS – Java API for supporting authentication
  – SPNEGO – Use Kerberos tickets over HTTP
• Exported Interfaces – Both Limited Private
  – HDFS adds a method to get delegation tokens
  – RPC adds a doAs method
• Major Internal (Inter-system) Interfaces
  – MapReduce Shuffle uses HMAC-SHA1
  – RPC uses Kerberos and DIGEST-MD5
27. Pluggability
• Pluggability in Hadoop supports different environments
• HTTP browser user authentication
– Yahoo – Backyard
– External – SPNEGO or Kerberos login module
• RPC transport
– SASL supports DIGEST-MD5, Kerberos, and others
• Acquiring credentials
– JAAS supports Kerberos, and others
28. Performance
• The authentication should not introduce substantial performance penalties.
• The delegation token design avoids an authentication flood by MapReduce tasks.
• The overhead is required to be less than 3% on GridMix.
29. Reliability and Availability
• The Kerberos KDC cannot be a single point of failure.
  – Kerberos clients automatically fail over to secondary KDCs.
  – Secondary KDCs can be synced automatically from the primary, since the data rarely changes.
• The cluster must remain stable when Kerberos fails.
  – The slaves (TaskTrackers and DataNodes) lose their ability to reconnect to the master only when their RPC socket closes, their service ticket has expired, and both the primary and secondary KDCs have failed.
  – Decided not to use special tokens to handle this case.
• Once a MapReduce job is submitted, the KDC is not required for the job to continue running.
30. Operations and Monitoring
• The number of Kerberos authorizations will be logged on the NameNode and JobTracker.
• Authorization failures will be logged.
• Authentication failures will be logged.
• The authorization logs will use a separate log4j logger, so they can be directed to a separate file.