Apache Ambari is now the preferred way of provisioning, managing and monitoring Hadoop clusters. Ambari helps users manage Hadoop clusters by simplifying actions such as upgrades, configuration management and service management. From release 2.0, Ambari started supporting automated Rolling Upgrades. This was further enhanced in release 2.2.0.0 to include support for Express Upgrades, which allow users to upgrade large-scale clusters faster but require cluster downtime.
This talk will cover planning and execution of Hadoop cluster upgrades from an operational perspective. The talk will also cover the internals of the upgrade process, including the various stages such as pre-upgrade, backup, service checks, configuration upgrades, and finalization. Finally, the talk will cover troubleshooting upgrade failures, monitoring services during upgrades, and post-upgrade actions. The presentation will conclude with a case study that covers how the upgrade process works on a large cluster (including aspects such as planning the upgrade, the amount of time required for the various stages, and troubleshooting).
Zeppelin has become a popular way to unlock the value of a data lake due to its user interface and appeal to business users. These business users ask their IT department for access to Zeppelin. Enterprise IT departments want to help their business users, but they have several enterprise concerns such as enterprise security, integration with their corporate LDAP/AD, scalability and multi-user environments, and integration with Ranger and Kerberos. This session will walk through these enterprise concerns and how they can be handled with Zeppelin.
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and talk about how to provide row/column-level access controls with common access control rules throughout the whole cluster with various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some of the rules are changed, all engines are controlled consistently in near real-time. Technically, we enable Spark Thrift Server to work with an identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single security control center.
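To make the terms concrete, here is a minimal Python sketch of what row-level filtering and column masking do to a result set. It is an illustration only: the `Policy` class, `mask_last4`, and `apply_policy` are hypothetical names invented for this example and are not the Apache Ranger, Spark, or Hive API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical policy: one row predicate plus per-column masking functions."""
    row_filter: Callable[[Row], bool]
    column_masks: Dict[str, Callable[[object], object]] = field(default_factory=dict)


def mask_last4(value: object) -> str:
    """Replace every character except the last four with 'x'."""
    text = str(value)
    return "x" * max(len(text) - 4, 0) + text[-4:]


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Drop rows the policy hides, then mask the configured columns."""
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row-level filtering
        masked = {}
        for column, value in row.items():
            mask = policy.column_masks.get(column)
            masked[column] = mask(value) if mask else value  # column masking
        visible.append(masked)
    return visible
```

In this sketch, a policy for a European analyst group might keep only rows where `region == "EU"` and mask the `card` column with `mask_last4`. The point of a central policy store is that every SQL engine applies the same rule.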
An Overview on Optimization in Apache Hive: Past, Present, Future - DataWorks Summit
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new optimizer features aimed at generating better plans and improving query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized view support, and complex query decorrelation.
Hortonworks Technical Workshop: What's New in HDP 2.3 - Hortonworks
The recently launched HDP 2.3 is a major advancement of Open Enterprise Hadoop. It represents the best of community-led development, with innovations spanning Apache Hadoop, Apache Ambari, Ranger, HBase, Spark and Storm. In this session we will provide an in-depth overview of new functionality and discuss its impact on new and ongoing big data initiatives.
Over the last eighteen months, we have seen significant adoption of Hadoop-ecosystem-centric big data processing in Microsoft Azure and Amazon AWS. In this talk we present some of the lessons learned and architectural considerations for cloud-based deployments, including security, fault tolerance and auto-scaling.
We look at how Hortonworks Data Cloud and Cloudbreak can automate the scaling of Hadoop clusters, showing how they can react dynamically to workloads and what that can deliver in cost-effective Hadoop-in-cloud deployments.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full-powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Speaker
Alan Gates, Co-founder, Hortonworks
Many organizations are currently processing various types of data in different formats. Most often this data is in free form. As the number of consumers of this data grows, it is imperative that this free-flowing data adhere to a schema. A schema tells data consumers what type of data to expect and shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without being affected if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, etc.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
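As a rough illustration of what "evolving a schema without impacting consumers" can mean, the Python sketch below checks one simplified compatibility rule: a reader on the new schema can still decode old data if every shared field keeps its type and every added field carries a default. The dict-based schema layout and the function name `is_backward_compatible` are assumptions made for this example. They are not the SchemaRegistry API, and real formats such as Avro apply richer resolution rules.

```python
from typing import Any, Dict

# Hypothetical schema layout: field name -> {"type": ..., optional "default": ...}
Schema = Dict[str, Dict[str, Any]]


def is_backward_compatible(old: Schema, new: Schema) -> bool:
    """Return True if a reader using `new` can decode data written with `old`.

    Simplified rule (an assumption of this sketch):
    - a field present in both schemas must keep the same type;
    - a field that exists only in `new` must declare a default value.
    Fields dropped from `old` are simply ignored by the new reader.
    """
    for name, spec in new.items():
        if name in old:
            if old[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True
```

A registry could run a check of this kind when a new version is registered and reject versions that would break existing consumers.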
Double Your Hadoop Hardware Performance with SmartSense - Hortonworks
Hortonworks SmartSense provides proactive recommendations that improve cluster performance, security and operations. And since 30% of issues are configuration related, Hortonworks SmartSense makes an immediate impact on Hadoop system performance and availability, in some cases boosting hardware performance by two times. Learn how SmartSense can help you increase the efficiency of your Hadoop hardware, through customized cluster recommendations.
View the on-demand webinar: https://hortonworks.com/webinar/boosts-hadoop-hardware-performance-2x-smartsense/
Demand for cloud is through the roof. Cloud is turbocharging the enterprise IT landscape with agility and flexibility, and discussions of cloud architecture now dominate enterprise IT. Cloud is enabling many ephemeral, on-demand use cases, which is a game-changing opportunity for analytic workloads. But all of this comes with the challenge of running enterprise workloads in the cloud securely and with ease.
In this session, we will take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) of running enterprise workloads in the cloud and will go through how the latest from Cloudbreak enables enterprises to easily and securely run big data workloads. This includes a deep-dive discussion on autoscaling, Ambari Blueprints, recipes, custom images, and enabling Kerberos, all of which are key capabilities for enterprise deployments.
As a last topic, we will discuss how we deployed and operate Cloudbreak as a Service internally, which enables rapid cluster deployment for prototyping and testing purposes.
Speakers
Peter Darvasi, Cloudbreak Partner Engineer, Hortonworks
Richard Doktorics, Staff Engineer, Hortonworks
Apache Ambari: Managing Hadoop and YARN - Hortonworks
Part of the Hortonworks YARN Ready Webinar Series, this session is about management of Apache Hadoop and YARN using Apache Ambari. This series targets developers and we will feature a demo on Ambari.
Hortonworks Tech Workshop: In-Memory Processing with Spark - Hortonworks
Apache Spark offers unique in-memory capabilities and is well suited to a wide variety of data processing workloads including machine learning and micro-batch processing. With HDP 2.2, Apache Spark is a fully supported component of the Hortonworks Data Platform. In this session we will cover the key fundamentals of Apache Spark and operational best practices for executing Spark jobs along with the rest of Big Data workloads. We will also provide a working example to showcase micro-batch and machine learning processing using Apache Spark.
It's Finally Here! Building Complex Streaming Analytics Apps in under 10 min w... - DataWorks Summit
Imagine if you could build and deploy an end-to-end complex streaming analytics app on a streaming engine like Storm or Flink that did the following:
1. Joining Streams
2. Aggregations over Windows (Time or Count based)
3. Complex Event Processing
4. Pattern Matching
5. Model scoring.
Now imagine implementing and deploying this in under 10 minutes without writing a single line of code.
Imagine no more; it is indeed here. In this talk, we will discuss an exciting open source project led by Hortonworks on building and deploying streaming applications using a drag and drop paradigm.
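For readers unfamiliar with item 2 in the list above, the following Python sketch shows what a hand-written time-based (tumbling) window aggregation looks like in its simplest form. It is only a conceptual illustration under assumed inputs: the function name and the `(timestamp, value)` event shape are hypothetical, and it is not code from Storm, Flink, or the project discussed in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_sums(
    events: Iterable[Tuple[int, float]], window_seconds: int
) -> Dict[int, float]:
    """Sum event values per fixed, non-overlapping time window.

    Each event is a (timestamp_in_seconds, value) pair. The key of the
    result is the start timestamp of the window the event falls into.
    """
    sums: Dict[int, float] = defaultdict(float)
    for timestamp, value in events:
        # Align the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)
```

A real streaming engine additionally has to deal with unbounded input, late or out-of-order events, and state recovery, which is the kind of complexity a drag-and-drop tool aims to hide.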
Hortonworks Technical Workshop: Operations with Ambari - Hortonworks
Ambari continues on its journey of provisioning, monitoring and managing enterprise Hadoop deployments. With 2.0, Apache Ambari brings a host of new capabilities, including updated metric collections, Kerberos setup automation, and developer views for Big Data developers. In this Hortonworks Technical Workshop session we will provide an in-depth look into Apache Ambari 2.0 and showcase security setup automation using Ambari 2.0. View the recording at https://www.brighttalk.com/webcast/9573/155575. View the github demo work at https://github.com/abajwa-hw/ambari-workshops/blob/master/blueprints-demo-security.md. Recorded May 28, 2015.
Learn how Hortonworks Data Flow (HDF), powered by Apache NiFi, enables organizations to harness IoAT data streams to drive business and operational insights. We will use the session to provide an overview of HDF, including a detailed hands-on lab to build HDF pipelines for capture and analysis of streaming data.
Recording and labs available at:
http://hortonworks.com/partners/learn/#hdf
Many enterprises are implementing Hadoop projects to manage and process large datasets. The big question is how to configure Hadoop clusters to connect to an enterprise directory containing 100k+ users and groups for access management. Several large enterprises have complex directory servers for managing users and groups. Many advanced features have recently been added to Hadoop user management in order to support various complex directory server structures.
In this session attendees will learn about setting up a Hadoop node with users from Active Directory for executing Hadoop jobs, setting up authentication for enterprise users, and setting up authorization for users and groups using Apache Ranger. Attendees will also learn about the common challenges faced in enterprise environments while interacting with Active Directory, including filtering the users to be brought into Hadoop from Active Directory, restricting access to a set of users from Active Directory, handling users from nested group structures, etc.
Speakers
Sailaja Polavarapu, Staff Software Engineer, Hortonworks
Velmurugan Periasamy, Director - Engineering, Hortonworks
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3 - DataWorks Summit
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 around April 2018, loaded with a lot of features and improvements. One of the biggest challenges for any new major release of a software platform is its compatibility. The Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are many challenges to be addressed by admins while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session will dive deep into upgrade aspects and provide a detailed preview of migration strategies, with information on what works and what might not work. This talk will focus on the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
Speakers
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
Check out the great new features in Helix Core 2017.1 and Helix Swarm 2017 to see why it’s never been easier to collaborate and improve rapid release cycles.
This presentation will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2018. It is the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been long in development, some of which include rewritten region assignment, perf improvements (RPC, rewritten write pipeline, etc.), async clients and WAL, a C++ client, offheaping of memstore and other buffers, and shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical details on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next big release because of its integration with HBase 2.0 and a lot of performance improvements in support of secondary indexes. It has a lot of cool features such as encoded columns, Kafka and Hive integration, and many other performance improvements.
We presented this at DataWorks Summit 2018 in San Jose.
Streamline Apache Hadoop Operations with Apache Ambari and SmartSense - Hortonworks
Apache Ambari 2.5 helps customers simplify the experience for provisioning, managing, monitoring, securing and troubleshooting Hadoop deployments. Find out how the combination of Ambari and SmartSense delivers a path to success to help IT get Hadoop up and running effectively. The end result – you get the full business impact management and benefits of Big Data for your organization.
https://hortonworks.com/webinar/streamline-apache-hadoop-operations-apache-ambari-smartsense/
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2018. It is the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been long in development, some of which include rewritten region assignment, perf improvements (RPC, rewritten write pipeline, etc.), async clients and WAL, a C++ client, offheaping of memstore and other buffers, and shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical details on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next big release because of its integration with HBase 2.0 and a lot of performance improvements in support of secondary indexes. It has a lot of cool features such as encoded columns, Kafka and Hive integration, and many other performance improvements.
Speakers
Ankit Singhal, Senior Software Engineer, Hortonworks Inc.
Rajeshbabu Chintaguntl, Staff Software Engineer, Hortonworks
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017. It is the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been long in development, some of which include rewritten region assignment, perf improvements (RPC, rewritten write pipeline, etc.), async clients, a C++ client, offheaping of memstore and other buffers, Spark integration, and shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical details on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next big and most exciting milestone release because of Phoenix's integration with Apache Calcite, which adds a lot of performance benefits through a new query optimizer and helps to integrate with other data sources, especially those also based on Calcite. It has a lot of cool features such as encoded columns, Kafka and Hive integration, improvements in secondary index rebuilding, and many performance improvements.
Many organizations currently process various types of data in different formats. Most often this data is free-form. As the number of consumers of this data grows, it is imperative that this free-flowing data adhere to a schema. A schema tells data consumers what type of data to expect, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without being affected if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, and so on.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted, the model should take into consideration users who are watching a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams are commonly believed to be more restricted and hence less accurate than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current DL research treats as a stepchild. As the demo shows, LSTM networks can learn very complex system behavior, in this case from data produced by a physical model simulating bearing vibration. One drawback of deep learning is that normally a very large labeled training data set is required. That makes this talk particularly interesting: we show how unsupervised machine learning can be used in conjunction with deep learning, so no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10 fold confidence. All examples and all code will be made publicly available as open source. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases that cut across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying that data is a tedious job. Given the multiple levels of indirection, the rate of false positives among reported defects is higher, and chasing them is generally wasteful.
At Hortonworks, we’ve designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline, which feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, to identify the root cause of errors across multiple components. The system works in unsupervised mode with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file tickets or open past ones, and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, through determining the number of servers needed, and leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you may be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, then looks at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly, and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure that stores all of an organization's data. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS: the centralized scheme within the NameNode becomes a main bottleneck that limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it handles large numbers of small files inefficiently under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I was wondering, as an “infrastructure container kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies that could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already got working for real.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
stack.upgrade.auto.retry.timeout.mins : Number of mins to retry for. Ideally, this would be between 15-20 mins. Default is 0 since this feature is turned off.
stack.upgrade.auto.retry.check.interval.secs : Thread sleep interval in seconds, defaults to 20 secs.
stack.upgrade.auto.retry.command.names.to.ignore : Don't auto-retry commands whose names are in this list. Default value is each name enclosed in quotes and separated by commas, "ComponentVersionCheckAction","FinalizeUpgradeAction"
stack.upgrade.auto.retry.command.details.to.ignore : Don't auto-retry commands whose details are in this list. Default value is each name enclosed in quotes and separated by commas, "Execute HDFS Finalize"
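As a rough illustration of how the four settings above interact, the sketch below parses them from properties-style text and decides whether a failed command is still eligible for an automatic retry. The helper names (`parse_properties`, `quoted_list`, `should_auto_retry`), the 15-minute sample value, the sample command names in the usage note, and the matching rules (exact match on command names, substring match on command details) are assumptions made for illustration; this is not Ambari's own implementation.

```python
import csv

# Sample values only: a 15-minute retry window, with the two ignore lists
# set to the defaults quoted in the text above.
SAMPLE_PROPERTIES = '''
stack.upgrade.auto.retry.timeout.mins=15
stack.upgrade.auto.retry.check.interval.secs=20
stack.upgrade.auto.retry.command.names.to.ignore="ComponentVersionCheckAction","FinalizeUpgradeAction"
stack.upgrade.auto.retry.command.details.to.ignore="Execute HDFS Finalize"
'''

PREFIX = "stack.upgrade.auto.retry."


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def quoted_list(value):
    """Split a value such as '"A","B"' into ['A', 'B']."""
    if not value:
        return []
    return [item.strip() for item in next(csv.reader([value]))]


def should_auto_retry(props, command_name, command_detail, elapsed_secs):
    """Return True if a failed command may still be retried automatically.

    Assumed rules: a timeout of 0 disables the feature, the retry window
    closes after timeout.mins minutes, and ignored names/details never retry.
    """
    timeout_mins = int(props.get(PREFIX + "timeout.mins", "0"))
    if timeout_mins <= 0:
        return False  # feature is turned off by default
    if elapsed_secs >= timeout_mins * 60:
        return False  # retry window has expired
    ignored_names = quoted_list(props.get(PREFIX + "command.names.to.ignore", ""))
    ignored_details = quoted_list(props.get(PREFIX + "command.details.to.ignore", ""))
    if command_name in ignored_names:
        return False
    if any(detail in command_detail for detail in ignored_details):
        return False
    return True
```

With the sample values, `should_auto_retry(parse_properties(SAMPLE_PROPERTIES), "SomeRestartCommand", "Restarting DataNode", 60)` is eligible for retry, while the same call with `"FinalizeUpgradeAction"` as the command name is not.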
When an upgrade item is in the “HOLDING_FAILED” status, the “Retry” button can be used to change the status to “PENDING”. Alternatively, the following API call can be used:
PUT http://vpamb2010.novalocal:8080/api/v1/clusters/Ambari21/upgrades/441/upgrade_groups/106/upgrade_items/1
{"UpgradeItem": { "status" : "PENDING" } }
Then refresh the Ambari server page to continue the upgrade / downgrade.
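The PUT call above could be issued from a script roughly as follows. This is a minimal sketch using only the Python standard library: the function name `build_retry_request` and the credentials in the usage note are placeholders, and the use of HTTP basic authentication is an assumption about the deployment. Ambari expects an `X-Requested-By` header on modifying requests. The function only builds the request; nothing is sent until `urllib.request.urlopen` is called.

```python
import base64
import json
import urllib.request


def build_retry_request(base_url, cluster, upgrade_id, group_id, item_id,
                        user, password):
    """Build (but do not send) the PUT that sets an upgrade item to PENDING."""
    url = (f"{base_url}/api/v1/clusters/{cluster}/upgrades/{upgrade_id}"
           f"/upgrade_groups/{group_id}/upgrade_items/{item_id}")
    body = json.dumps({"UpgradeItem": {"status": "PENDING"}}).encode("utf-8")
    # Assumption: the server accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("X-Requested-By", "ambari")
    request.add_header("Content-Type", "application/json")
    return request
```

For the item in the example, the request would be built with `build_retry_request("http://vpamb2010.novalocal:8080", "Ambari21", 441, 106, 1, "admin", "admin")` (placeholder credentials) and sent with `urllib.request.urlopen(request)`.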
/**
 * Not queued for a host.
 */
PENDING,
/**
 * Queued for a host, or has already been sent to host, but host did not answer yet.
 */
QUEUED,
/**
 * Host reported it is working, received an IN_PROGRESS command status from host.
 */
IN_PROGRESS,
/**
 * Task is holding, waiting for command to proceed to completion.
 */
HOLDING,
/**
 * Host reported success
 */
COMPLETED,
/**
 * Failed
 */
FAILED,
/**
 * Task is holding after a failure, waiting for command to skip or retry.
 */
HOLDING_FAILED,
/**
 * Host did not respond in time
 */
TIMEDOUT,
/**
 * Task is holding after a time-out, waiting for command to skip or retry.
 */
HOLDING_TIMEDOUT,
/**
 * Operation was abandoned
 */
ABORTED,
/**
 * The operation failed and was automatically skipped.
 */
SKIPPED_FAILED;
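To make the role of the holding statuses concrete, the statuses listed above can be mirrored in a small Python sketch. The names `CommandStatus`, `is_holding` and `status_after_retry` are chosen for illustration only. The HOLDING_FAILED to PENDING transition is the retry described in this document; treating HOLDING_TIMEDOUT the same way is an assumption based on its comment ("waiting for command to skip or retry").

```python
from enum import Enum


class CommandStatus(Enum):
    """Python mirror of the statuses listed above (illustrative name)."""
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    HOLDING = "HOLDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    HOLDING_FAILED = "HOLDING_FAILED"
    TIMEDOUT = "TIMEDOUT"
    HOLDING_TIMEDOUT = "HOLDING_TIMEDOUT"
    ABORTED = "ABORTED"
    SKIPPED_FAILED = "SKIPPED_FAILED"


# Statuses in which a task is paused and waiting for a decision.
HOLDING_STATES = frozenset({
    CommandStatus.HOLDING,
    CommandStatus.HOLDING_FAILED,
    CommandStatus.HOLDING_TIMEDOUT,
})

# Statuses from which a retry makes sense (HOLDING_TIMEDOUT is an assumption).
RETRYABLE_STATES = frozenset({
    CommandStatus.HOLDING_FAILED,
    CommandStatus.HOLDING_TIMEDOUT,
})


def is_holding(status):
    """Return True if the task is paused and waiting for operator input."""
    return status in HOLDING_STATES


def status_after_retry(status):
    """Return the status a retry moves the task to, or raise ValueError."""
    if status in RETRYABLE_STATES:
        return CommandStatus.PENDING
    raise ValueError(f"{status.name} cannot be retried")
```

For example, `status_after_retry(CommandStatus.HOLDING_FAILED)` yields `CommandStatus.PENDING`, while calling it with `CommandStatus.COMPLETED` raises `ValueError`.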