Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.
What are the secret settings for getting maximum performance from queries against data living in cloud object stores? The answers lie at the filesystem client, file format and query engine layers, and even in how you lay out the files: the directory structure and the names you give them.
We know these things from our work across all of these layers, from the benchmarking we've done, and from the support calls we get when people run into problems. And now we'll show you.
This talk will start from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and so causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better with object stores, at optimizations which have been done to enable this, and at the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
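As a hedged illustration (not taken from the talk itself), a few of the Hadoop S3A and Spark settings that commonly matter when querying data in an object store; the bucket path is a placeholder and the values are illustrative only:

from pyspark.sql import SparkSession

# A minimal sketch of Spark settings often tuned for S3A object-store access.
spark = (SparkSession.builder
    .appName("object-store-query")
    # more HTTP connections to the store for parallel reads
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # random-IO read policy helps columnar formats such as ORC and Parquet
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # avoid the slowest rename-based commit path where possible
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate())

# s3a:// path is a placeholder
df = spark.read.parquet("s3a://my-bucket/events/")
df.groupBy("event_type").count().show()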
Speaker:
Sanjay Radia, Founder and Chief Architect, Hortonworks
Apache Ambari is an extensible framework that simplifies provisioning, managing and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer, thereby providing a consistent approach for managing and monitoring the services. This also provides a natural extension point for operators and the community to bring in their own add-on services and "plug in" the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been a strong one-to-one coupling between entities. For instance, a cluster is tied to a single stack and a Hadoop operator can only deploy services defined in that stack; a cluster can have only a single instance of a service; and a host can have only a single instance of a component. Taking into consideration the various use cases that these limitations rule out, there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that have not been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package is a self-contained deployment artifact that includes all the details for deploying, managing and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services; this eliminates the need to plug their services into a stack and lets them define their own upgrade story for those custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NiFi from HDF.
Further, we will discuss the architectural changes needed to enable a multi-instance architecture in future Ambari releases, supporting multiple instances of a service in a cluster and multiple instances of a component on a host, as well as future-proofing the Ambari architecture to leverage advancements happening in the Hadoop community such as YARN services (YARN-4692). We will wrap up the conversation with a brief overview of other improvements planned for future releases of Ambari.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
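As a hedged sketch of the kind of metric monitoring the session points to (not taken from the session), the NameNode exposes FSNamesystem counters over its standard JMX JSON endpoint; host and port below are placeholders (9870 is the default HTTP port in Hadoop 3, 50070 in Hadoop 2):

import requests

# Query the NameNode's JMX servlet for FSNamesystem metrics.
url = "http://namenode.example.com:9870/jmx"
resp = requests.get(url, params={"qry": "Hadoop:service=NameNode,name=FSNamesystem"}, timeout=10)
beans = resp.json().get("beans", [])
if beans:
    fs = beans[0]
    # A few counters worth watching in production clusters.
    print("MissingBlocks:        ", fs.get("MissingBlocks"))
    print("UnderReplicatedBlocks:", fs.get("UnderReplicatedBlocks"))
    print("CapacityUsedGB:       ", fs.get("CapacityUsedGB"))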
Apache Ambari is now the preferred way of provisioning, managing and monitoring Hadoop clusters. Ambari helps users manage Hadoop clusters, simplifying actions such as upgrades, configuration management, service management, etc. From release 2.0, Ambari started supporting automated Rolling Upgrades. This was further enhanced in release 2.2.0.0 with support for Express Upgrades, which allow users to upgrade large-scale clusters faster but require cluster downtime.
This talk will cover planning and execution of Hadoop cluster upgrades from an operational perspective. The talk will also cover the internals of the upgrade process, including the various stages such as pre-upgrade, backup, service checks, configuration upgrades, and finalization. Finally, the talk will cover troubleshooting upgrade failures, monitoring services during upgrades, and post-upgrade actions. The presentation will conclude with a case study covering how the upgrade process works on a large cluster (including aspects such as planning the upgrade, the amount of time required for the various stages, and troubleshooting).
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and discuss how to provide row/column-level access controls with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If any of the rules are changed, all engines are controlled consistently in near real-time. Technically, we enable the Spark Thrift Server to work with the identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control.
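A hedged sketch of the setup described above, assuming a Spark Thrift Server reachable over the Hive protocol (host, port, user and table names are placeholders): because the query runs under the caller's identity, Ranger row-filter and masking policies are applied server-side before any rows are returned.

from pyhive import hive

# Connect to the Spark Thrift Server as a specific user; Ranger policies
# are evaluated server-side against this identity (all names are placeholders).
conn = hive.Connection(host="sts.example.com", port=10015, username="analyst1")
cur = conn.cursor()

# The same statement returns different rows or masked columns per user,
# depending on the row-filter and masking policies defined in Ranger.
cur.execute("SELECT customer_id, ssn, region FROM sales.customers LIMIT 10")
for row in cur.fetchall():
    print(row)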
Handling Kernel Upgrades at Scale - The Dirty Cow StoryDataWorks Summit
Apache Hadoop at Yahoo is a massive platform, with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips and tricks for dealing with large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege escalation vulnerability found in the Linux kernel in late 2016).
We will dive deep into the three-phase approach that led to the eventual success of the program: pre-work, the kernel upgrade itself, and post-work/cleanup. We will share details on the automation tools, UIs, and reporting tools developed and used to achieve the stated objective of 800+ server upgrades per hour, track the upgrade progress, validate and report data blocks, and recover quickly from bad blocks encountered. Throughout the talk, we will highlight the importance of process management, communicating with hundreds of customer teams to ensure they are on board and aware, and successful coordination tactics with SREs and Site Operations. We will also touch upon some of the unique challenges we faced along the way, such as BIOS updates needed on over 20,000 hosts, and explain the rolling upgrade support we added to HBase and Storm to avoid service disruption to low-latency customers during these upgrades.
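As a hedged, simplified sketch of the batching idea behind 800+ upgrades per hour (the three helper functions are stubs standing in for the real decommission/upgrade/recommission automation described in the talk):

import concurrent.futures

# Hedged sketch: these helpers are stubs, not Yahoo's actual tooling.
def decommission_host(host):
    print("draining work and blocks from", host)

def upgrade_kernel(host):
    print("applying kernel/BIOS update and rebooting", host)

def recommission_host(host):
    print("rejoining", host, "and verifying block reports")

def rolling_upgrade(hosts, batch_size=200):
    # Upgrade in bounded batches so replication can absorb each wave.
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            decommission_host(host)
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(batch)) as pool:
            list(pool.map(upgrade_kernel, batch))
        for host in batch:
            recommission_host(host)

rolling_upgrade(["host%03d.example.com" % n for n in range(1000)])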
Today, almost any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today, and some that may be available tomorrow, including:
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
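As a hedged sketch using one of the orchestrators listed above (Kubernetes), this is roughly what pointing Spark at a container scheduler looks like; the API server URL, namespace and image name are placeholders, not BlueData's actual configuration:

from pyspark.sql import SparkSession

# Hedged sketch of running Spark executors as containers on Kubernetes.
spark = (SparkSession.builder
    .appName("spark-on-containers")
    .master("k8s://https://k8s-apiserver.example.com:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "example/spark:3.4.1")
    .config("spark.executor.instances", "4")
    .getOrCreate())

print(spark.range(1000).count())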
Speaker
Thomas Phelan, Chief Architect, Blue Data, Inc
Sharing metadata across the data lake and streamsDataWorks Summit
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
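A hedged sketch of what sharing the metastore looks like from a second engine: a Hive-enabled Spark session sees the same table definitions that Hive (or Presto, Impala, or Pig via HCatalog) created. The database name is a placeholder.

from pyspark.sql import SparkSession

# Hedged sketch: enabling Hive support points Spark at the shared metastore,
# so tables defined by other engines become visible here too.
spark = (SparkSession.builder
    .appName("shared-metastore")
    .enableHiveSupport()
    .getOrCreate())

for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)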
Speaker
Alan Gates, Co-Founder, Hortonworks
The Unbearable Lightness of Ephemeral ProcessingDataWorks Summit
Ephemeral clusters can be launched quickly (in minutes), are pre-configured for a specific processing purpose, and can be brought down as soon as their usefulness has expired. The ability to launch ephemeral clusters for on-demand processing, quickly and efficiently, is transforming how organizations design, deploy and manage applications. The velocity and elasticity of fast cluster deployment enables seamless peak-demand provisioning, enables cost optimization by leveraging significantly lower cloud spot pricing, and maximizes utilization of existing compute capacity. Additionally, being able to launch bespoke clusters for specific compute needs, in a repeatable fashion and within a shared infrastructure, provides flexibility for special-purpose processing. Organizations can leverage ephemeral clusters for parallel compute-intensive applications which require short bursts of power but are short-lived. In this session we will explore how to design ephemeral clusters, how to launch, modify and bring them down, and application design considerations that maximize ephemeral cluster usability.
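As a hedged sketch using one cloud provider's API (Amazon EMR via boto3), launching an ephemeral cluster that runs a single job on spot instances and tears itself down; the release label, instance types, roles and bucket paths are placeholders:

import boto3

# Hedged sketch of an ephemeral cluster: run one step, then terminate.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="ephemeral-etl",
    ReleaseLabel="emr-6.10.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT"},  # spot pricing for cost optimization
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # bring the cluster down when done
    },
    Steps=[{
        "Name": "etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": "command-runner.jar",
                          "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"]},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("launched cluster", response["JobFlowId"])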
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 around April 2018, loaded with many features and improvements. One of the biggest challenges for any new major release of a software platform is compatibility. The Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are many challenges for admins to address while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session will dive deep into these upgrade aspects and provide a detailed preview of migration strategies, with information on what works and what might not. The talk will focus on the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
Speaker
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
The recently launched HDP 2.3 is a major advancement of Open Enterprise Hadoop. It represents the best of community-led development, with innovations spanning Apache Hadoop, Apache Ambari, Ranger, HBase, Spark and Storm. In this session we will provide an in-depth overview of the new functionality and discuss its impact on new and ongoing big data initiatives.
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...DataWorks Summit
Successfully adopting a data analytics platform inside a large organization critically depends on integrating the platform within the technology fabric of the organization’s enterprise IT systems. Inside large organizations, enterprise security requirements, diverse analytics needs, and the exploratory nature of analytics can complicate adoption. To maximize success, we must make architectural and implementation choices that foster user flexibility, increase data connectedness, and respect enterprise security.
Inside Northrop Grumman, a global security company, our team leads our company’s enterprise-wide big data and analytics initiative. In the past several years, our platform technology group has developed, operated, and managed our company’s Hadoop-based data analytics platform. We believe the key design principles of a successful platform include self-service, multitenant security, managed infrastructure, seamless connectivity of data, and seamless connectivity of tools. Keeping to these design principles, our on-premises enterprise analytics platform is built using Hadoop, business intelligence and visualization tools, and relational database tools. Northrop Grumman data science teams can onboard onto the platform with integrated authentication and automatic configuration of proper Hadoop and operating system authorizations, create managed ingest jobs that transfer big datasets, share data in a governed manner via an enterprise data catalog, provision big data and relational databases, run interactive and scheduled jobs, and publish production-grade visualizations.
In this presentation, we present technology and architecture lessons learned from designing, building and operating Hadoop-based enterprise data analytics platforms. We discuss critical tradeoffs when choosing an authentication strategy while integrating Hadoop with an existing IT environment, the practical implications of interfacing between authorization models, and how to achieve seamless connectivity of multiple COTS tools while maintaining self-service and multitenant security.
Leon Li, Software Architect, Northrop Grumman
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
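A hedged sketch of one of the built-in building blocks the talk refers to: taking an HDFS snapshot and copying it to a second cluster with DistCp. Paths and NameNode addresses are placeholders, and this alone does not address the open-file or HBase atomicity caveats above.

import subprocess

# Hedged sketch: snapshot a directory, then replicate the snapshot with DistCp.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", "/data/important"], check=True)
subprocess.run(["hdfs", "dfs", "-createSnapshot", "/data/important", "nightly-backup"], check=True)
subprocess.run([
    "hadoop", "distcp",
    "hdfs://prod-nn.example.com:8020/data/important/.snapshot/nightly-backup",
    "hdfs://dr-nn.example.com:8020/backups/important/nightly-backup",
], check=True)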
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionCloudera, Inc.
Walk through some of the best practices to keep in mind when it comes to upgrading your cluster, and learn how to leverage new Upgrade Wizard features in Cloudera Enterprise 5.3.
For most mission critical workloads, downtime is never an option. Any downtime can have a direct impact on revenue and lead to frantic calls in the middle of the night. For this reason, upgrading the software that powers these workloads can often be a daunting task. It can cause unpredictable issues without access to support. That’s why an enterprise-grade administration tool is crucial for running Hadoop in production. Hadoop consists of dozens of components, running across multiple machines, all with their own configurations. That can lead to a lot of complexity and uncertainty - especially when taking the upgrade plunge.
Cloudera Manager makes it easy and is the only production-ready administration tool for Hadoop. Not only does Cloudera Manager feature zero-downtime rolling upgrades, but it also has a built-in Upgrade Wizard to make upgrades simple and predictable.
Zeppelin has become a popular way to unlock the value of the data lake due to its user interface and appeal to business users. These business users ask their IT department for access to Zeppelin. Enterprise IT departments want to help their business users, but they have several enterprise concerns, such as enterprise security, integration with their corporate LDAP/AD, scalability and multi-user environments, and integration with Ranger and Kerberos. This session will walk through these enterprise concerns and how they can be handled with Zeppelin.
Predictive Analytics and Machine Learning…with SAS and Apache HadoopHortonworks
In this interactive webinar, we'll walk through use cases showing how you can use advanced analytics like SAS Visual Statistics and SAS In-Memory Statistics with the Hortonworks Data Platform (HDP) to reveal insights in your big data and redefine how your organization solves complex problems.
The data and analytics of the new life sciences marketplace handoutFrank Wartenberg
Trends in the global healthcare market. Development of pharmaceuticals, market data and insights.
Presentation delivered at the 9th International Pharmaceutical Compliance Congress and Best Practices Forum, Brussels, 2015
IBM analytics across the healthcare ecosystem Heather Fraser
Analytics is a key enabler for life sciences and healthcare organizations to create better outcomes for patients, customers and other stakeholders across the entire healthcare ecosystem. While almost two-thirds of organizations across the healthcare ecosystem have analytics strategies in place, our research shows that only a fifth are driving analytics adoption across the enterprise. The key barriers are a lack of data management capabilities and skilled analysts, as well as poor organizational change management. To develop and translate insights into actions that enhance outcomes, organizations will need to collaborate across an expanding ecosystem.
IBM Insight 2014 session (4152) - Accelerating Insights in Healthcare with “B...Alex Zeltov
Accelerating Insights in Healthcare with “Big Data” with Hadoop: a use case description of Hadoop at IBC (Independence Blue Cross); Alex Zeltov and Darwin Leung, speakers for IBC.
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)SironaHealth
Starting in 2012, the Centers for Medicare and Medicaid Services (CMS) will begin withholding payments for potentially avoidable readmissions. This presentation reviews these new regulations, what causes excessive readmissions, and how hospitals can positively impact patient health by reaching out 24-72 hours after discharge.
Predicting Hospital Readmission Using CascadingCascading
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
Big Data, CEP and IoT : Redefining Healthcare Information Systems and AnalyticsTauseef Naquishbandi
Big Data is a term encompassing the use of techniques to capture, process, analyze and visualize potentially large datasets within time frames not achievable with standard technologies.
It refers to the ability to crunch vast collections of information, analyze it instantly, and draw from it sometimes profoundly surprising conclusions.
Big data solutions can help stakeholders personalize care, engage patients, reduce variability and costs, and improve quality of health delivery.
Big data analytics can also contribute to providing a rich context to shape many areas of health care like analysis of effects, side-effects of drugs, genome analysis etc.
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...Ryan Squire
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Medicine as presented at the Ohio State University Medical Center Personalized Health Care National Conference.
Leroy Hood, MD, PhD, is the president and founder of the Institute for Systems Biology. Dr. Hood is a member of the National Academy of Sciences, the American Philosophical Society, the American Academy of Arts and Sciences, the Institute of Medicine and the National Academy of Engineering. His professional career began at Caltech where he and his colleagues pioneered four instruments — the DNA gene sequencer and synthesizer and the protein synthesizer and sequencer — which comprise the technological foundation for contemporary molecular biology. In particular, the DNA sequencer played a crucial role in contributing to the successful mapping of the human genome during the 1990s.
http://www.systemsbiology.org/Scientists_and_Research
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Cloudera, Inc.
Apache Hadoop, an open-source platform, is increasingly gaining adoption within organizations trying to draw insight from all the big data being generated. Hadoop, and a handful of open-source tools that complement it, are promising to make gigantic and diverse datasets easily and economically available for quick analysis. A burgeoning partner ecosystem is also essential to helping organizations turn big data into business value.
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...WSO2
In this webinar, Srinath Perera, director of research at WSO2, will discuss:
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
This is the presentation I gave to the HIMSS Management Engineering and Process Improvement (ME-PI) Community on the use of predictive analytics in healthcare.
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
Mike Gualtieri, Principal Analyst, Forrester Research, presents at the Big Analytics Roadshow, 2012 in New York City on December 12, 2012
Presentation title: Evaluating Big Data Predictive Analytics Platforms
Abstract: Great. You have Big Data. Now what? You have to analyze it to find game-changing predictive models that you can use to make smart decisions, reduce risk, or deliver breakthrough customer experiences. Big Data Predictive Analytics solutions are software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources. In this session, Forrester Principal Analyst Mike Gualtieri will discuss the key criteria you should use to evaluate Big Data Predictive Analytics platforms to meet your specific needs.
What do big data and advanced analytics mean for healthcare? This question was answered during the Georgia Society of CPAs (GSCPA) 2015 Healthcare Conference, February 6, at the Cobb Galleria Centre in Atlanta, GA. PYA Principal Marty Brown and PYA Analytics President & CEO Brian Worley presented “Big Data Applications in Healthcare.”
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...Damo Consulting Inc.
Implementing population health management in transitional care settings is challenging because of: 1) data interoperability and other bottlenecks; 2) complex workflows designed for reactive rather than proactive processes; and 3) difficulty integrating them into clinical workflows.
This presentation discusses a use case demonstrating a practical, real-world solution to these challenges.
Three audience takeaways from the presentation:
1. Learn about the big data bottlenecks in healthcare
2. Learn how Sutter Health is using its E.H.R. data in a readmission risk predictive model
3. See how those predictive models are integrated into clinical operations to improve care
Scaling Storage and Computation with Hadoopyaevents
Hadoop provides distributed storage and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. Hadoop partitions data and computation across thousands of hosts and executes application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop is an Apache Software Foundation project; it unites hundreds of developers, and hundreds of organizations worldwide report using Hadoop. This presentation will give an overview of the Hadoop family of projects, with a focus on its distributed storage solutions.
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses taught by real-time faculty in Bangalore.
If you are searching for the best engineering college in India, you can trust RCE (Roorkee College of Engineering) services and facilities. They provide the best education facilities, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a top computerized library, great placement opportunities and more, at an affordable fee.
We provide Hadoop training in Hyderabad and Bangalore, with corporate training delivered by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume Preparation by expert Professionals
Lab exercises
Interview Preparation
Expert advice
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We concluded with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share the foundational concepts to build on.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Konstantin Shvachko, Yahoo! - Scaling Storage and Computation with Hadoop
1. Scaling Storage and Computation with Apache Hadoop
Konstantin V. Shvachko
1 October 2010
2. What is Hadoop
• Hadoop is an ecosystem of tools for processing “Big Data”
• Hadoop is an open source project
• Yahoo! a primary developer of Hadoop since 2006
3. Big Data
• Big Data management, storage and analytics
• Large datasets (PBs) do not fit one computer
– Internal (memory) sort
– External (disk) sort
– Distributed sort
• Computations that need a lot of compute power
4. Big Data: Examples
• Search Webmap as of 2008 @ Y!
– Raw disk used 5 PB
– 1500 nodes
• Large Hadron Collider: PBs of events
– 1 PB of data per sec, most filtered out
• 2 quadrillionth (10^15) digit of π is 0
– Tsz-Wo (Nicholas) Sze
– 23 days vs 2 years before
– No data, pure CPU workload
5. Hadoop is the Solution
• Architecture principles:
– Linear scaling
– Reliability and Availability
– Using unreliable commodity hardware
– Computation is shipped to data
No expensive data transfers
– High performance
6. Hadoop Components
Avro – Data serialization
Zookeeper – Distributed coordination
HDFS – Distributed file system
MapReduce – Distributed computation
HBase – Column store
Pig – Dataflow language
Hive – Data warehouse
Chukwa – Data collection
7. MapReduce
• MapReduce – distributed computation framework
– Invented by Google researchers
• Two stages of a MR job
– Map: {<Key,Value>} -> {<K’,V’>}
– Reduce: {<K’,V’>} -> {<K’’,V’’>}
• Map – a truly distributed stage
Reduce – an aggregation, may not be distributed
• Shuffle – sort and merge; the transition from Map to Reduce, invisible to the user
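A minimal word-count sketch of the two stages, in Python rather than the deck's notation (not part of the deck); the shuffle is simulated here by sorting the intermediate pairs:

import sys
from itertools import groupby

def map_stage(lines):
    # Map: {<Key,Value>} -> {<K',V'>}; here each input line yields (word, 1) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_stage(pairs):
    # Shuffle simulated by sorting; Reduce then aggregates counts per key
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reduce_stage(map_stage(sys.stdin)):
        print(word, total)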
9. Hadoop Distributed File System
HDFS
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Lots of fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store block replicas as files on local drives
• Blocks are replicated on 3 DataNodes for redundancy
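A hedged sketch of the metadata side: asking the NameNode about a file over the standard WebHDFS REST API (host, port and path are placeholders; 9870 is the Hadoop 3 default HTTP port, 50070 in Hadoop 2):

import requests

# Hedged sketch: the NameNode answers metadata queries; DataNodes serve the blocks.
url = "http://namenode.example.com:9870/webhdfs/v1/data/events/part-00000"
status = requests.get(url, params={"op": "GETFILESTATUS"}, timeout=10).json()
info = status["FileStatus"]
print("length:", info["length"], "blockSize:", info["blockSize"], "replication:", info["replication"])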
10. HDFS Read
• To read a block, the client requests the list of replica locations from the NameNode
• Then pulls data from a replica on one of the DataNodes
11. HDFS Write
• To write a block of a file, the client requests a list of candidate DataNodes from the NameNode, and organizes a write pipeline.
12. Replica Location Awareness
• MapReduce schedules a task assigned to process block B to a DataNode possessing a replica of B
• Data are large, programs are small
• Local access to data
13. ZooKeeper
• A distributed coordination service for distributed apps
– Event coordination and notification
– Leader election
– Distributed locking
• ZooKeeper can help build HA systems
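A hedged sketch of distributed locking with the kazoo Python client (the ZooKeeper ensemble address and lock path are placeholders):

from kazoo.client import KazooClient

# Hedged sketch: acquire a ZooKeeper-backed lock so only one client runs at a time.
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

lock = zk.Lock("/locks/compaction", "worker-1")
with lock:  # blocks until this client holds the lock
    print("doing work that must not run concurrently")

zk.stop()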
14. HBase
• Distributed table store on top of HDFS
– An implementation of Google’s BigTable
• A big table is Big Data: it cannot be stored on a single node
• Tables: big, sparse, loosely structured.
– Consist of rows with unique row keys
– Have an arbitrary number of columns, grouped into a small number of column families
– Dynamic column creation
• Table is partitioned into regions
– Horizontally across rows; vertically across column families
• HBase provides structured yet flexible access to data
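A hedged sketch of the row/column-family model through the HBase Thrift gateway using happybase (host, table and column names are placeholders; the Thrift server must be running):

import happybase

# Hedged sketch: structured-yet-flexible access via the HBase Thrift gateway.
conn = happybase.Connection("hbase-thrift.example.com")
table = conn.table("users")

# Columns are created dynamically within the 'info' column family.
table.put(b"row-42", {b"info:name": b"Ada", b"info:city": b"London"})

for key, data in table.scan(row_prefix=b"row-"):
    print(key, data)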
15. Pig
• A language on top of MapReduce, designed to simplify it
• Pig speaks Pig Latin
• SQL-like language
• Pig programs are translated into a series of MapReduce jobs
16. Hive
• Serves the same purpose as Pig
• Closely follows SQL standards
• Keeps metadata about Hive tables in a MySQL RDBMS