Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.
What are the secret settings for getting maximum performance from queries against data living in cloud object stores? The answers lie at the filesystem client, file format and query engine layers, and even in how you lay out the files: the directory structure and the names you give them.
We know these things from our work across all of these layers, from the benchmarking we've done, and from the support calls we get when people run into problems. And now we'll show you.
This talk will start from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and so causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better with object stores, at optimizations which have been done to enable this, and at the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
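As a hedged illustration (not taken from the talk itself), a few of the Hadoop S3A and Spark settings that commonly matter when querying data in an object store; the bucket path is a placeholder and the values are illustrative only:

from pyspark.sql import SparkSession

# A minimal sketch of Spark settings often tuned for S3A object-store access.
spark = (SparkSession.builder
    .appName("object-store-query")
    # more HTTP connections to the store for parallel reads
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # random-IO read policy helps columnar formats such as ORC and Parquet
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # avoid the slowest rename-based commit path where possible
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate())

# s3a:// path is a placeholder
df = spark.read.parquet("s3a://my-bucket/events/")
df.groupBy("event_type").count().show()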
Speaker:
Sanjay Radia, Founder and Chief Architect, Hortonworks
Apache Ambari is an extensible framework that simplifies provisioning, managing and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer, thereby providing a consistent approach for managing and monitoring the services. This also provides a natural extension point for operators and the community to bring in their own add-on services and "plug in" the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been a strong one-to-one coupling between entities. For instance, a cluster is tied to a single stack and a Hadoop operator can only deploy services defined in that stack; a cluster can have only a single instance of a service; and a host can have only a single instance of a component. Taking into consideration the various use cases that these limitations rule out, there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that have not been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package is a self-contained deployment artifact that includes all the details for deploying, managing and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services; this eliminates the need to plug their services into a stack and lets them define their own upgrade story for those custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NiFi from HDF.
Further, we will discuss the architectural changes needed to enable a multi-instance architecture in future Ambari releases, supporting multiple instances of a service in a cluster and multiple instances of a component on a host, as well as future-proofing the Ambari architecture to leverage advancements happening in the Hadoop community such as YARN services (YARN-4692). We will wrap up the conversation with a brief overview of other improvements planned for future releases of Ambari.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
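As a hedged sketch of the kind of metric monitoring the session points to (not taken from the session), the NameNode exposes FSNamesystem counters over its standard JMX JSON endpoint; host and port below are placeholders (9870 is the default HTTP port in Hadoop 3, 50070 in Hadoop 2):

import requests

# Query the NameNode's JMX servlet for FSNamesystem metrics.
url = "http://namenode.example.com:9870/jmx"
resp = requests.get(url, params={"qry": "Hadoop:service=NameNode,name=FSNamesystem"}, timeout=10)
beans = resp.json().get("beans", [])
if beans:
    fs = beans[0]
    # A few counters worth watching in production clusters.
    print("MissingBlocks:        ", fs.get("MissingBlocks"))
    print("UnderReplicatedBlocks:", fs.get("UnderReplicatedBlocks"))
    print("CapacityUsedGB:       ", fs.get("CapacityUsedGB"))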
Apache Ambari is now the preferred way of provisioning, managing and monitoring Hadoop clusters. Ambari helps users manage Hadoop clusters, simplifying actions such as upgrades, configuration management, service management, etc. From release 2.0, Ambari started supporting automated Rolling Upgrades. This was further enhanced in release 2.2.0.0 with support for Express Upgrades, which allow users to upgrade large-scale clusters faster but require cluster downtime.
This talk will cover planning and execution of Hadoop cluster upgrades from an operational perspective. The talk will also cover the internals of the upgrade process, including the various stages such as pre-upgrade, backup, service checks, configuration upgrades, and finalization. Finally, the talk will cover troubleshooting upgrade failures, monitoring services during upgrades, and post-upgrade actions. The presentation will conclude with a case study covering how the upgrade process works on a large cluster (including aspects such as planning the upgrade, the amount of time required for the various stages, and troubleshooting).
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and discuss how to provide row/column-level access controls with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If any of the rules are changed, all engines are controlled consistently in near real-time. Technically, we enable the Spark Thrift Server to work with the identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control.
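A hedged sketch of the setup described above, assuming a Spark Thrift Server reachable over the Hive protocol (host, port, user and table names are placeholders): because the query runs under the caller's identity, Ranger row-filter and masking policies are applied server-side before any rows are returned.

from pyhive import hive

# Connect to the Spark Thrift Server as a specific user; Ranger policies
# are evaluated server-side against this identity (all names are placeholders).
conn = hive.Connection(host="sts.example.com", port=10015, username="analyst1")
cur = conn.cursor()

# The same statement returns different rows or masked columns per user,
# depending on the row-filter and masking policies defined in Ranger.
cur.execute("SELECT customer_id, ssn, region FROM sales.customers LIMIT 10")
for row in cur.fetchall():
    print(row)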
Handling Kernel Upgrades at Scale - The Dirty Cow StoryDataWorks Summit
Apache Hadoop at Yahoo is a massive platform, with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips and tricks for dealing with large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege escalation vulnerability found in the Linux kernel in late 2016).
We will dive deep into the three-phase approach that led to the eventual success of the program: pre-work, the kernel upgrade itself, and post-work/cleanup. We will share details on the automation tools, UIs, and reporting tools developed and used to achieve the stated objective of 800+ server upgrades per hour, track the upgrade progress, validate and report data blocks, and recover quickly from bad blocks encountered. Throughout the talk, we will highlight the importance of process management, communicating with hundreds of customer teams to ensure they are on board and aware, and successful coordination tactics with SREs and Site Operations. We will also touch upon some of the unique challenges we faced along the way, such as BIOS updates needed on over 20,000 hosts, and explain the rolling upgrade support we added to HBase and Storm to avoid service disruption to low-latency customers during these upgrades.
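As a hedged, simplified sketch of the batching idea behind 800+ upgrades per hour (the three helper functions are stubs standing in for the real decommission/upgrade/recommission automation described in the talk):

import concurrent.futures

# Hedged sketch: these helpers are stubs, not Yahoo's actual tooling.
def decommission_host(host):
    print("draining work and blocks from", host)

def upgrade_kernel(host):
    print("applying kernel/BIOS update and rebooting", host)

def recommission_host(host):
    print("rejoining", host, "and verifying block reports")

def rolling_upgrade(hosts, batch_size=200):
    # Upgrade in bounded batches so replication can absorb each wave.
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            decommission_host(host)
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(batch)) as pool:
            list(pool.map(upgrade_kernel, batch))
        for host in batch:
            recommission_host(host)

rolling_upgrade(["host%03d.example.com" % n for n in range(1000)])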
Today, almost any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today, and some that may be available tomorrow, including:
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
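As a hedged sketch using one of the orchestrators listed above (Kubernetes), this is roughly what pointing Spark at a container scheduler looks like; the API server URL, namespace and image name are placeholders, not BlueData's actual configuration:

from pyspark.sql import SparkSession

# Hedged sketch of running Spark executors as containers on Kubernetes.
spark = (SparkSession.builder
    .appName("spark-on-containers")
    .master("k8s://https://k8s-apiserver.example.com:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "example/spark:3.4.1")
    .config("spark.executor.instances", "4")
    .getOrCreate())

print(spark.range(1000).count())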
Speaker
Thomas Phelan, Chief Architect, Blue Data, Inc
Sharing metadata across the data lake and streamsDataWorks Summit
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
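A hedged sketch of what sharing the metastore looks like from a second engine: a Hive-enabled Spark session sees the same table definitions that Hive (or Presto, Impala, or Pig via HCatalog) created. The database name is a placeholder.

from pyspark.sql import SparkSession

# Hedged sketch: enabling Hive support points Spark at the shared metastore,
# so tables defined by other engines become visible here too.
spark = (SparkSession.builder
    .appName("shared-metastore")
    .enableHiveSupport()
    .getOrCreate())

for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)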
Speaker
Alan Gates, Co-Founder, Hortonworks
The Unbearable Lightness of Ephemeral ProcessingDataWorks Summit
Ephemeral clusters can be launched quickly (in minutes), are pre-configured for a specific processing purpose, and can be brought down as soon as their usefulness has expired. The ability to launch ephemeral clusters for on-demand processing, quickly and efficiently, is transforming how organizations design, deploy and manage applications. The velocity and elasticity of fast cluster deployment enables seamless peak-demand provisioning, enables cost optimization by leveraging significantly lower cloud spot pricing, and maximizes utilization of existing compute capacity. Additionally, being able to launch bespoke clusters for specific compute needs, in a repeatable fashion and within a shared infrastructure, provides flexibility for special-purpose processing. Organizations can leverage ephemeral clusters for parallel compute-intensive applications which require short bursts of power but are short-lived. In this session we will explore how to design ephemeral clusters, how to launch, modify and bring them down, and application design considerations that maximize ephemeral cluster usability.
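As a hedged sketch using one cloud provider's API (Amazon EMR via boto3), launching an ephemeral cluster that runs a single job on spot instances and tears itself down; the release label, instance types, roles and bucket paths are placeholders:

import boto3

# Hedged sketch of an ephemeral cluster: run one step, then terminate.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="ephemeral-etl",
    ReleaseLabel="emr-6.10.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT"},  # spot pricing for cost optimization
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # bring the cluster down when done
    },
    Steps=[{
        "Name": "etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": "command-runner.jar",
                          "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"]},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("launched cluster", response["JobFlowId"])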
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 around April 2018, loaded with many features and improvements. One of the biggest challenges for any new major release of a software platform is compatibility. The Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are many challenges for admins to address while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session will dive deep into these upgrade aspects and provide a detailed preview of migration strategies, with information on what works and what might not. The talk will focus on the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
Speaker
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
The recently launched HDP 2.3 is a major advancement of Open Enterprise Hadoop. It represents the best of community-led development, with innovations spanning Apache Hadoop, Apache Ambari, Ranger, HBase, Spark and Storm. In this session we will provide an in-depth overview of the new functionality and discuss its impact on new and ongoing big data initiatives.
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...DataWorks Summit
Successfully adopting a data analytics platform inside a large organization critically depends on integrating the platform within the technology fabric of the organization’s enterprise IT systems. Inside large organizations, enterprise security requirements, diverse analytics needs, and the exploratory nature of analytics can complicate adoption. To maximize success, we must make architectural and implementation choices that foster user flexibility, increase data connectedness, and respect enterprise security.
Inside Northrop Grumman, a global security company, our team leads our company’s enterprise-wide big data and analytics initiative. In the past several years, our platform technology group has developed, operated, and managed our company’s Hadoop-based data analytics platform. We believe the key design principles of a successful platform include self-service, multitenant security, managed infrastructure, seamless connectivity of data, and seamless connectivity of tools. Keeping to these design principles, our on-premises enterprise analytics platform is built using Hadoop, business intelligence and visualization tools, and relational database tools. Northrop Grumman data science teams can onboard onto the platform with integrated authentication and automatic configuration of proper Hadoop and operating system authorizations, create managed ingest jobs that transfer big datasets, share data in a governed manner via an enterprise data catalog, provision big data and relational databases, run interactive and scheduled jobs, and publish production-grade visualizations.
In this presentation, we present technology and architecture lessons learned from designing, building and operating Hadoop-based enterprise data analytics platforms. We discuss critical tradeoffs when choosing an authentication strategy while integrating Hadoop with an existing IT environment, the practical implications of interfacing between authorization models, and how to achieve seamless connectivity of multiple COTS tools while maintaining self-service and multitenant security.
Leon Li, Software Architect, Northrop Grumman
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
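A hedged sketch of one of the built-in building blocks the talk refers to: taking an HDFS snapshot and copying it to a second cluster with DistCp. Paths and NameNode addresses are placeholders, and this alone does not address the open-file or HBase atomicity caveats above.

import subprocess

# Hedged sketch: snapshot a directory, then replicate the snapshot with DistCp.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", "/data/important"], check=True)
subprocess.run(["hdfs", "dfs", "-createSnapshot", "/data/important", "nightly-backup"], check=True)
subprocess.run([
    "hadoop", "distcp",
    "hdfs://prod-nn.example.com:8020/data/important/.snapshot/nightly-backup",
    "hdfs://dr-nn.example.com:8020/backups/important/nightly-backup",
], check=True)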
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionCloudera, Inc.
Walk through some of the best practices to keep in mind when it comes to upgrading your cluster, and learn how to leverage new Upgrade Wizard features in Cloudera Enterprise 5.3.
For most mission critical workloads, downtime is never an option. Any downtime can have a direct impact on revenue and lead to frantic calls in the middle of the night. For this reason, upgrading the software that powers these workloads can often be a daunting task. It can cause unpredictable issues without access to support. That’s why an enterprise-grade administration tool is crucial for running Hadoop in production. Hadoop consists of dozens of components, running across multiple machines, all with their own configurations. That can lead to a lot of complexity and uncertainty - especially when taking the upgrade plunge.
Cloudera Manager makes it easy and is the only production-ready administration tool for Hadoop. Not only does Cloudera Manager feature zero-downtime rolling upgrades, but it also has a built-in Upgrade Wizard to make upgrades simple and predictable.
Zeppelin has become a popular way to unlock the value of the data lake due to its user interface and appeal to business users. These business users ask their IT department for access to Zeppelin. Enterprise IT departments want to help their business users, but they have several enterprise concerns, such as enterprise security, integration with their corporate LDAP/AD, scalability and multi-user environments, and integration with Ranger and Kerberos. This session will walk through these enterprise concerns and how they can be handled with Zeppelin.
Predictive Analytics and Machine Learning…with SAS and Apache HadoopHortonworks
In this interactive webinar, we'll walk through use cases showing how you can use advanced analytics like SAS Visual Statistics and SAS In-Memory Statistics with the Hortonworks Data Platform (HDP) to reveal insights in your big data and redefine how your organization solves complex problems.
The data and analytics of the new life sciences marketplace handoutFrank Wartenberg
Trends in the global healthcare market. Development of pharmaceuticals, market data and insights.
Presentation delivered at the 9th International Pharmaceutical Compliance Congress and Best Practices Forum, Brussels, 2015
IBM analytics across the healthcare ecosystem Heather Fraser
Analytics is a key enabler for life sciences and healthcare organizations to create better outcomes for patients, customers and other stakeholders across the entire healthcare ecosystem. While almost two-thirds of organizations across the healthcare ecosystem have analytics strategies in place, our research shows that only a fifth are driving analytics adoption across the enterprise. The key barriers are a lack of data management capabilities and skilled analysts, as well as poor organizational change management. To develop and translate insights into actions that enhance outcomes, organizations will need to collaborate across an expanding ecosystem.
IBM Insight 2014 session (4152) - Accelerating Insights in Healthcare with “B...Alex Zeltov
Accelerating Insights in Healthcare with “Big Data” with Hadoop: a use case description of Hadoop at IBC (Independence Blue Cross); Alex Zeltov and Darwin Leung, speakers for IBC.
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)SironaHealth
Starting in 2012, the Centers for Medicare and Medicaid Services (CMS) will begin withholding payments for potentially avoidable readmissions. This presentation reviews these new regulations, what causes excessive readmissions, and how hospitals can positively impact patient health by reaching out 24-72 hours after discharge.
Predicting Hospital Readmission Using CascadingCascading
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
Big Data, CEP and IoT : Redefining Healthcare Information Systems and AnalyticsTauseef Naquishbandi
Big Data is a term encompassing the use of techniques to capture, process, analyze and visualize potentially large datasets within time frames not achievable with standard technologies.
It refers to the ability to crunch vast collections of information, analyze it instantly, and draw from it sometimes profoundly surprising conclusions.
Big data solutions can help stakeholders personalize care, engage patients, reduce variability and costs, and improve quality of health delivery.
Big data analytics can also contribute to providing a rich context to shape many areas of health care like analysis of effects, side-effects of drugs, genome analysis etc.
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...Ryan Squire
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Medicine as presented at the Ohio State University Medical Center Personalized Health Care National Conference.
Leroy Hood, MD, PhD, is the president and founder of the Institute for Systems Biology. Dr. Hood is a member of the National Academy of Sciences, the American Philosophical Society, the American Academy of Arts and Sciences, the Institute of Medicine and the National Academy of Engineering. His professional career began at Caltech where he and his colleagues pioneered four instruments — the DNA gene sequencer and synthesizer and the protein synthesizer and sequencer — which comprise the technological foundation for contemporary molecular biology. In particular, the DNA sequencer played a crucial role in contributing to the successful mapping of the human genome during the 1990s.
http://www.systemsbiology.org/Scientists_and_Research
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Cloudera, Inc.
Apache Hadoop, an open-source platform, is increasingly gaining adoption within organizations trying to draw insight from all the big data being generated. Hadoop, and a handful of open-source tools that complement it, are promising to make gigantic and diverse datasets easily and economically available for quick analysis. A burgeoning partner ecosystem is also essential to helping organizations turn big data into business value.
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...WSO2
In this webinar, Srinath Perera, director of research at WSO2, will discuss:
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
This is the presentation I gave to the HIMSS Management Engineering and Process Improvement (ME-PI) Community on the use of predictive analytics in healthcare.
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
Mike Gualtieri, Principal Analyst, Forrester Research, presents at the Big Analytics Roadshow, 2012 in New York City on December 12, 2012
Presentation title: Evaluating Big Data Predictive Analytics Platforms
Abstract: Great. You have Big Data. Now what? You have to analyze it to find game-changing predictive models that you can use to make smart decisions, reduce risk, or deliver breakthrough customer experiences. Big Data Predictive Analytics solutions are software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources. In this session, Forrester Principal Analyst Mike Gualtieri will discuss the key criteria you should use to evaluate Big Data Predictive Analytics platforms to meet your specific needs.
What do big data and advanced analytics mean for healthcare? This question was answered during the Georgia Society of CPAs (GSCPA) 2015 Healthcare Conference, February 6, at the Cobb Galleria Centre in Atlanta, GA. PYA Principal Marty Brown and PYA Analytics President & CEO Brian Worley presented “Big Data Applications in Healthcare.”
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...Damo Consulting Inc.
Implementing population health management in transitional care settings is challenging because of: 1) data interoperability and other bottlenecks; 2) complex workflows designed for reactive rather than proactive processes; and 3) difficulty integrating them into clinical workflows.
This presentation discusses a use case demonstrating a practical, real-world solution to these challenges.
Three audience takeaways from the presentation:
1. Learn about the big data bottlenecks in healthcare
2. Learn how Sutter Health is using its E.H.R. data in a readmission risk predictive model
3. See how those predictive models are integrated into clinical operations to improve care
Scaling Storage and Computation with Hadoopyaevents
Hadoop provides distributed storage and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. Hadoop partitions data and computation across thousands of hosts and executes application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop is an Apache Software Foundation project; it unites hundreds of developers, and hundreds of organizations worldwide report using Hadoop. This presentation will give an overview of the Hadoop family of projects, with a focus on its distributed storage solutions.
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses taught by real-time faculty in Bangalore.
If you are searching for the best engineering college in India, you can trust RCE (Roorkee College of Engineering) services and facilities. They provide the best education facilities, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a top computerized library, great placement opportunities and more, at an affordable fee.
We provide Hadoop training in Hyderabad and Bangalore, with corporate training delivered by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume Preparation by expert Professionals
Lab exercises
Interview Preparation
Expert advice
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We concluded with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share the foundational concepts to build on.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Konstantin Shvachko, Yahoo! - Scaling Storage and Computation with Hadoop
1. Scaling Storage and Computation with Apache Hadoop
Konstantin V. Shvachko
1 October 2010
2. What is Hadoop
• Hadoop is an ecosystem of tools for processing “Big Data”
• Hadoop is an open source project
• Yahoo! a primary developer of Hadoop since 2006
3. Big Data
• Big Data management, storage and analytics
• Large datasets (PBs) do not fit one computer
– Internal (memory) sort
– External (disk) sort
– Distributed sort
• Computations that need a lot of compute power
4. Big Data: Examples
• Search Webmap as of 2008 @ Y!
– Raw disk used 5 PB
– 1500 nodes
• Large Hadron Collider: PBs of events
– 1 PB of data per sec, most filtered out
• 2 quadrillionth (10^15) digit of π is 0
– Tsz-Wo (Nicholas) Sze
– 23 days vs 2 years before
– No data, pure CPU workload
5. Hadoop is the Solution
• Architecture principles:
– Linear scaling
– Reliability and Availability
– Using unreliable commodity hardware
– Computation is shipped to data
No expensive data transfers
– High performance
6. Hadoop Components
Avro – Data serialization
Zookeeper – Distributed coordination
HDFS – Distributed file system
MapReduce – Distributed computation
HBase – Column store
Pig – Dataflow language
Hive – Data warehouse
Chukwa – Data collection
7. MapReduce
• MapReduce – distributed computation framework
– Invented by Google researchers
• Two stages of a MR job
– Map: {<Key,Value>} -> {<K’,V’>}
– Reduce: {<K’,V’>} -> {<K’’,V’’>}
• Map – a truly distributed stage
Reduce – an aggregation, may not be distributed
• Shuffle – sort and merge; the transition from Map to Reduce, invisible to the user
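A minimal word-count sketch of the two stages, in Python rather than the deck's notation (not part of the deck); the shuffle is simulated here by sorting the intermediate pairs:

import sys
from itertools import groupby

def map_stage(lines):
    # Map: {<Key,Value>} -> {<K',V'>}; here each input line yields (word, 1) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_stage(pairs):
    # Shuffle simulated by sorting; Reduce then aggregates counts per key
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reduce_stage(map_stage(sys.stdin)):
        print(word, total)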
9. Hadoop Distributed File System
HDFS
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Lots of fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store block replicas as files on local drives
• Blocks are replicated on 3 DataNodes for redundancy
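A hedged sketch of the metadata side: asking the NameNode about a file over the standard WebHDFS REST API (host, port and path are placeholders; 9870 is the Hadoop 3 default HTTP port, 50070 in Hadoop 2):

import requests

# Hedged sketch: the NameNode answers metadata queries; DataNodes serve the blocks.
url = "http://namenode.example.com:9870/webhdfs/v1/data/events/part-00000"
status = requests.get(url, params={"op": "GETFILESTATUS"}, timeout=10).json()
info = status["FileStatus"]
print("length:", info["length"], "blockSize:", info["blockSize"], "replication:", info["replication"])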
10. HDFS Read
• To read a block, the client requests the list of replica locations from the NameNode
• Then pulls data from a replica on one of the DataNodes
11. HDFS Write
• To write a block of a file, the client requests a list of candidate DataNodes from the NameNode, and organizes a write pipeline.
12. Replica Location Awareness
• MapReduce schedules a task assigned to process block B to a DataNode possessing a replica of B
• Data are large, programs are small
• Local access to data
13. ZooKeeper
• A distributed coordination service for distributed apps
– Event coordination and notification
– Leader election
– Distributed locking
• ZooKeeper can help build HA systems
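A hedged sketch of distributed locking with the kazoo Python client (the ZooKeeper ensemble address and lock path are placeholders):

from kazoo.client import KazooClient

# Hedged sketch: acquire a ZooKeeper-backed lock so only one client runs at a time.
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

lock = zk.Lock("/locks/compaction", "worker-1")
with lock:  # blocks until this client holds the lock
    print("doing work that must not run concurrently")

zk.stop()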
14. HBase
• Distributed table store on top of HDFS
– An implementation of Google’s BigTable
• A big table is Big Data: it cannot be stored on a single node
• Tables: big, sparse, loosely structured.
– Consist of rows with unique row keys
– Have an arbitrary number of columns, grouped into a small number of column families
– Dynamic column creation
• Table is partitioned into regions
– Horizontally across rows; vertically across column families
• HBase provides structured yet flexible access to data
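A hedged sketch of the row/column-family model through the HBase Thrift gateway using happybase (host, table and column names are placeholders; the Thrift server must be running):

import happybase

# Hedged sketch: structured-yet-flexible access via the HBase Thrift gateway.
conn = happybase.Connection("hbase-thrift.example.com")
table = conn.table("users")

# Columns are created dynamically within the 'info' column family.
table.put(b"row-42", {b"info:name": b"Ada", b"info:city": b"London"})

for key, data in table.scan(row_prefix=b"row-"):
    print(key, data)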
15. Pig
• A language on top of MapReduce, designed to simplify it
• Pig speaks Pig Latin
• SQL-like language
• Pig programs are translated into a series of MapReduce jobs
16. Hive
• Serves the same purpose as Pig
• Closely follows SQL standards
• Keeps metadata about Hive tables in a MySQL RDBMS