This document discusses various tools and techniques for diagnosing slow Hadoop jobs, including metrics and monitoring, logging and correlation, and tracing and analysis. It describes how to use Ambari metrics and Grafana dashboards to monitor cluster health and performance. It also explains how to leverage Hadoop audit logs and caller context to correlate job activity. Techniques for application tracing using YARN timeline service and tools like Zeppelin and Tez analyzer are presented to enable deep performance analysis of Hadoop jobs.
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes both from the perspective of external regulatory compliance as well as business value extraction internally within an enterprise. This session will introduce Apache Atlas, a project that was incubated by Hortonworks along with a group of industry leaders across several verticals including financial services, healthcare, pharma, oil and gas, retail and insurance to help address data governance and metadata needs with an open extensible platform governed under the aegis of Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem, govern and curate data lakes by applying consistent data classification with a centralized metadata catalog.
In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of governance capabilities within Apache Atlas as we showcase various features for open metadata modeling, data classification, visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines such as Apache Hive, Apache Storm, Apache Kafka and capabilities to effectively classify, discover datasets.
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with an experimental evaluation highlighting the performant and powerful integration of these projects.
Speaker
Jesus Camancho Rodriquez, Hortonworks
As more and more enterprises are adapting to utilize hadoop, integrating the enterprise users into the hadoop environment and enabling these users to access hadoop services and clusters becomes critical. Typically, enterprises manage the users and groups through directory servers and Hadoop services need to integrate with these enterprise directory servers (like Active Directory, LDAP etc...). This process presents its own unique set of challenges.
In this session, the speakers review the background of user and group management for Hadoop as well as covers various scenarios supported in Apache Ranger while interacting with AD or LDAP servers. This session also examines various use cases and some of the common challenges faced in the enterprise environments while interacting with Active Directory or LDAP server including filtering out users and/or groups to be brought into Hadoop from AD or LDAP servers, restricting access to a set of users and/or groups from AD, etc…
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...DataWorks Summit
In this talk I will show data engineers and architects how to run real-time TensorFlow Inception Image Recognition on images captured by remote sensors and images in tweets and facebook posts.
In the same flow I will also demonstrate how to apply real-time sentiment analysis and intelligent routing of data to Phoenix, Email and Slack.
I will elaborate on a number of different sentiment analysis frameworks available for use within Apache NiFi including Python NLTK, Stanford CoreNLP, Python SpaCy and Python TextBlob.
This talk will be a deep dive into how to manage complex dataflow pipelines ingesting from multiple streaming sources including social, public open data feeds, logs, drones, RDBMS and IoT with transformations, deep learning, machine learning and business rules.
Data engineers will be shown the power of Apache NiFi for loading diverse sources of data, applying transformations in-stream, routing based on attributes, adding sentiment data to workflows, running deep learning algorithms in stream and
storing data into Apache Phoenix on HBase and Apache Hive as ORC tables.
In this talk, I will walk through each step in the process from ingest of each source, applying filters, performing transformations, converting types, picking and converting fields and finally storing data to Apache Phoenix on HBase.
A quick data analysis to show streaming updates to data will be done in Apache Zeppelin running on HDP 2.x.
This is based on a few talks I have given at the Future of Data - Princeton meetup on various ingestion and processing patterns with Apache NiFi.
Speaker:
TImothy Spann, Solutions Engineer, Hortonworks
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes both from the perspective of external regulatory compliance as well as business value extraction internally within an enterprise. This session will introduce Apache Atlas, a project that was incubated by Hortonworks along with a group of industry leaders across several verticals including financial services, healthcare, pharma, oil and gas, retail and insurance to help address data governance and metadata needs with an open extensible platform governed under the aegis of Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem, govern and curate data lakes by applying consistent data classification with a centralized metadata catalog.
In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of governance capabilities within Apache Atlas as we showcase various features for open metadata modeling, data classification, visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines such as Apache Hive, Apache Storm, Apache Kafka and capabilities to effectively classify, discover datasets.
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with an experimental evaluation highlighting the performant and powerful integration of these projects.
Speaker
Jesus Camancho Rodriquez, Hortonworks
As more and more enterprises are adapting to utilize hadoop, integrating the enterprise users into the hadoop environment and enabling these users to access hadoop services and clusters becomes critical. Typically, enterprises manage the users and groups through directory servers and Hadoop services need to integrate with these enterprise directory servers (like Active Directory, LDAP etc...). This process presents its own unique set of challenges.
In this session, the speakers review the background of user and group management for Hadoop as well as covers various scenarios supported in Apache Ranger while interacting with AD or LDAP servers. This session also examines various use cases and some of the common challenges faced in the enterprise environments while interacting with Active Directory or LDAP server including filtering out users and/or groups to be brought into Hadoop from AD or LDAP servers, restricting access to a set of users and/or groups from AD, etc…
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...DataWorks Summit
In this talk I will show data engineers and architects how to run real-time TensorFlow Inception Image Recognition on images captured by remote sensors and images in tweets and facebook posts.
In the same flow I will also demonstrate how to apply real-time sentiment analysis and intelligent routing of data to Phoenix, Email and Slack.
I will elaborate on a number of different sentiment analysis frameworks available for use within Apache NiFi including Python NLTK, Stanford CoreNLP, Python SpaCy and Python TextBlob.
This talk will be a deep dive into how to manage complex dataflow pipelines ingesting from multiple streaming sources including social, public open data feeds, logs, drones, RDBMS and IoT with transformations, deep learning, machine learning and business rules.
Data engineers will be shown the power of Apache NiFi for loading diverse sources of data, applying transformations in-stream, routing based on attributes, adding sentiment data to workflows, running deep learning algorithms in stream and
storing data into Apache Phoenix on HBase and Apache Hive as ORC tables.
In this talk, I will walk through each step in the process from ingest of each source, applying filters, performing transformations, converting types, picking and converting fields and finally storing data to Apache Phoenix on HBase.
A quick data analysis to show streaming updates to data will be done in Apache Zeppelin running on HDP 2.x.
This is based on a few talks I have given at the Future of Data - Princeton meetup on various ingestion and processing patterns with Apache NiFi.
Speaker:
TImothy Spann, Solutions Engineer, Hortonworks
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we’ll first hit the ground with the current status of Apache Hadoop YARN – how it is faring today in deployments large and small. We will cover different types of YARN deployments, in different environments and scale.
We'll then move on to the exciting present & future of YARN – features that are further strengthening YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We’ll discuss the current status as well as the future promise of features and initiatives like – 10x scheduler throughput improvements, docker containers support on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using CGroups on disk & network resources, powerful scheduling features like application priorities, intra-queue preemption across applications and operational enhancements including insights through Timeline Service V2, a new web UI and better queue management.
Speaker:
Sunil Govindan, Senior Software Engineer, Hortonworks
Rohith Sharma K S, Senior Software Engineer, Hortonworks
Security is one of fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups via different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and talk about how to provide row/column-level access controls with common access control rules throughout the whole cluster with various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some of rules are changed, all engines are controlled consistently in near real-time. Technically, we enables Spark Thrift Server to work with an identify given by JDBC connection and take advantage of Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control center.
Learn more: http://hortonworks.com/hdf/
Log data can be complex to capture, typically collected in limited amounts and difficult to operationalize at scale. HDF expands the capabilities of log analytics integration options for easy and secure edge analytics of log files in the following ways:
More efficient collection and movement of log data by prioritizing, enriching and/or transforming data at the edge to dynamically separate critical data. The relevant data is then delivered into log analytics systems in a real-time, prioritized and secure manner.
Cost-effective expansion of existing log analytics infrastructure by improving error detection and troubleshooting through more comprehensive data sets.
Intelligent edge analytics to support real-time content-based routing, prioritization, and simultaneous delivery of data into Connected Data Platforms, log analytics and reporting systems for comprehensive coverage and retention of Internet of Anything data.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
When interacting with analytics dashboards in order to achieve a smooth user experience, two major key requirements are sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/Hbase work well for storing large volumes of data, although they are not optimized for ingesting streaming data and making it available for queries in realtime. Also, long query latencies make these systems sub-optimal choices for powering interactive dashboards and BI use-cases.
In this talk we will present Druid as a complementary solution to existing hadoop based technologies. Druid is an open-source analytics data store, designed from scratch, for OLAP and business intelligence queries over massive data streams. It provides low latency realtime data ingestion and fast sub-second adhoc flexible data exploration queries.
Many large companies are switching to Druid for analytics, and we will cover how druid is able to handle massive data streams and why it is a good fit for BI use cases.
Agenda -
1) Introduction and Ideal Use cases for Druid
2) Data Architecture
3) Streaming Ingestion with Kafka
4) Demo using Druid, Kafka and Superset.
5) Recent Improvements in Druid moving from lambda architecture to Exactly once Ingestion
6) Future Work
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through the Apache Apex applications? Together, we will review Apache Apex, and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex applications metrics and data, including open-source options like REST and PubSub mechanisms in StrAM, as well as features available in the RTS Console like real-time Dashboards and Widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
An Overview on Optimization in Apache Hive: Past, Present, FutureDataWorks Summit
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Speaker
Alan Gates, Co-founder, Hortonworks
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsHortonworks
Customers are preparing themselves to analyze and manage an increasing quantity of structured and unstructured data. Business leaders introduce new analytical workloads faster than what IT departments can handle. Legacy IT infrastructure needs to evolve to deliver operational improvements and cost containment, while increasing flexibility to meet future requirements. By providing HDP on IBM Power Systems, Hortonworks and IBM are giving customers have more choice in selecting the appropriate architectural platform that is right for them. In this webinar, we’ll discuss some of the challenges with deploying big data platforms, and how choosing solutions built with HDP on IBM Power Systems can offer tangible benefits and flexibility to accommodate changing needs.
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...DataWorks Summit
The energy industry is well known to be laggard adopters of new technology. However, industry challenges such as aging assets & workforce, increased regulatory scrutiny, renewable energy sources, depressed commodity prices, changing customer expectations, and growing data volumes are pushing companies to explore new technologies to help solve these problems. Learn how energy companies are leveraging Hortonworks Open and Connected Data Platforms to provide the predictive analysis and data insights to optimize performance for the energy industry.
Speaker
Kenneth Smith, General Manager, Energy, Hortonworks
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3DataWorks Summit
Deep learning is useful for enterprises tasks in the field of speech recognition, image classification, AI chatbots, and machine translation, just to name a few.
In order to train deep learning/machine learning models, applications such as TensorFlow, MXNet, Caffe, and XGBoost can be leveraged. And sometimes these applications will be used together to solve different problems.
To make distributed deep learning/machine learning applications easily launched, managed, and monitored, we introduced, in Apache Hadoop 3.x, YARN native services along with other improvements such as first-class GPU support, container-DNS support, scheduling improvements, etc. These improvements make distributed deep learning/machine learning applications run on YARN as simple as running it locally, which can let machine learning engineers focus on algorithms instead of worrying about underlying infrastructure. Also, YARN can better manage a shared cluster which runs deep learning/machine learning and other services and ETL jobs with these improvements.
In this session, we will take a closer look at these improvements and show how to run these applications on YARN with demos. Audiences can start trying running these applications on YARN after this talk.
Speakers
Wanga Tan, Staff Software Engineer, Hortonworks
Sunil Govindan, Staff Engineer, Hortonworks
Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues, etc.
In this talk, we’ll start with the current status of Apache Hadoop YARN—how it is used today in deployments large and small. We'll then move on to the exciting present and future of YARN—features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.
We’ll discuss the current status as well as the future promise of features and initiatives like: powerful container placement, global scheduling, support for machine learning and deep learning workloads through GPU and FPGA support, extreme scale with YARN federation, containerized apps on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speakers
Wangda Tan, Staff Software Engineer, Hortonworks
Billie Rinaldi, Principal Software Engineer I, Hortonworks
Unlocking a fully integrated Spark experience within your enterprise Hadoop environment that is manageable, secure and deployable anywhere.
Presented at the Spark Summit by Arun C Murthy (co-Founder, Hortonworks) on Monday, June 15, 2015.
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
Pivotal Big Data Suite: A Technical OverviewVMware Tanzu
How and why are companies like Uber, Netflix and Airbnb so successful, what you need to in order to become successful in the same way that they are and how Pivotal can help you with that.
Speaker: Les Klein, EMEA CTO Data, Pivotal
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we’ll first hit the ground with the current status of Apache Hadoop YARN – how it is faring today in deployments large and small. We will cover different types of YARN deployments, in different environments and scale.
We'll then move on to the exciting present & future of YARN – features that are further strengthening YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We’ll discuss the current status as well as the future promise of features and initiatives like – 10x scheduler throughput improvements, docker containers support on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using CGroups on disk & network resources, powerful scheduling features like application priorities, intra-queue preemption across applications and operational enhancements including insights through Timeline Service V2, a new web UI and better queue management.
Speaker:
Sunil Govindan, Senior Software Engineer, Hortonworks
Rohith Sharma K S, Senior Software Engineer, Hortonworks
Security is one of fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups via different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and talk about how to provide row/column-level access controls with common access control rules throughout the whole cluster with various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some of rules are changed, all engines are controlled consistently in near real-time. Technically, we enables Spark Thrift Server to work with an identify given by JDBC connection and take advantage of Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control center.
Learn more: http://hortonworks.com/hdf/
Log data can be complex to capture, typically collected in limited amounts and difficult to operationalize at scale. HDF expands the capabilities of log analytics integration options for easy and secure edge analytics of log files in the following ways:
More efficient collection and movement of log data by prioritizing, enriching and/or transforming data at the edge to dynamically separate critical data. The relevant data is then delivered into log analytics systems in a real-time, prioritized and secure manner.
Cost-effective expansion of existing log analytics infrastructure by improving error detection and troubleshooting through more comprehensive data sets.
Intelligent edge analytics to support real-time content-based routing, prioritization, and simultaneous delivery of data into Connected Data Platforms, log analytics and reporting systems for comprehensive coverage and retention of Internet of Anything data.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
When interacting with analytics dashboards in order to achieve a smooth user experience, two major key requirements are sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/Hbase work well for storing large volumes of data, although they are not optimized for ingesting streaming data and making it available for queries in realtime. Also, long query latencies make these systems sub-optimal choices for powering interactive dashboards and BI use-cases.
In this talk we will present Druid as a complementary solution to existing hadoop based technologies. Druid is an open-source analytics data store, designed from scratch, for OLAP and business intelligence queries over massive data streams. It provides low latency realtime data ingestion and fast sub-second adhoc flexible data exploration queries.
Many large companies are switching to Druid for analytics, and we will cover how druid is able to handle massive data streams and why it is a good fit for BI use cases.
Agenda -
1) Introduction and Ideal Use cases for Druid
2) Data Architecture
3) Streaming Ingestion with Kafka
4) Demo using Druid, Kafka and Superset.
5) Recent Improvements in Druid moving from lambda architecture to Exactly once Ingestion
6) Future Work
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through the Apache Apex applications? Together, we will review Apache Apex, and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex applications metrics and data, including open-source options like REST and PubSub mechanisms in StrAM, as well as features available in the RTS Console like real-time Dashboards and Widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
An Overview on Optimization in Apache Hive: Past, Present, FutureDataWorks Summit
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Speaker
Alan Gates, Co-founder, Hortonworks
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsHortonworks
Customers are preparing themselves to analyze and manage an increasing quantity of structured and unstructured data. Business leaders introduce new analytical workloads faster than what IT departments can handle. Legacy IT infrastructure needs to evolve to deliver operational improvements and cost containment, while increasing flexibility to meet future requirements. By providing HDP on IBM Power Systems, Hortonworks and IBM are giving customers have more choice in selecting the appropriate architectural platform that is right for them. In this webinar, we’ll discuss some of the challenges with deploying big data platforms, and how choosing solutions built with HDP on IBM Power Systems can offer tangible benefits and flexibility to accommodate changing needs.
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...DataWorks Summit
The energy industry is well known to be laggard adopters of new technology. However, industry challenges such as aging assets & workforce, increased regulatory scrutiny, renewable energy sources, depressed commodity prices, changing customer expectations, and growing data volumes are pushing companies to explore new technologies to help solve these problems. Learn how energy companies are leveraging Hortonworks Open and Connected Data Platforms to provide the predictive analysis and data insights to optimize performance for the energy industry.
Speaker
Kenneth Smith, General Manager, Energy, Hortonworks
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3DataWorks Summit
Deep learning is useful for enterprises tasks in the field of speech recognition, image classification, AI chatbots, and machine translation, just to name a few.
In order to train deep learning/machine learning models, applications such as TensorFlow, MXNet, Caffe, and XGBoost can be leveraged. And sometimes these applications will be used together to solve different problems.
To make distributed deep learning/machine learning applications easily launched, managed, and monitored, we introduced, in Apache Hadoop 3.x, YARN native services along with other improvements such as first-class GPU support, container-DNS support, scheduling improvements, etc. These improvements make distributed deep learning/machine learning applications run on YARN as simple as running it locally, which can let machine learning engineers focus on algorithms instead of worrying about underlying infrastructure. Also, YARN can better manage a shared cluster which runs deep learning/machine learning and other services and ETL jobs with these improvements.
In this session, we will take a closer look at these improvements and show how to run these applications on YARN with demos. Audiences can start trying running these applications on YARN after this talk.
Speakers
Wanga Tan, Staff Software Engineer, Hortonworks
Sunil Govindan, Staff Engineer, Hortonworks
Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues, etc.
In this talk, we’ll start with the current status of Apache Hadoop YARN—how it is used today in deployments large and small. We'll then move on to the exciting present and future of YARN—features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.
We’ll discuss the current status as well as the future promise of features and initiatives like: powerful container placement, global scheduling, support for machine learning and deep learning workloads through GPU and FPGA support, extreme scale with YARN federation, containerized apps on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speakers
Wangda Tan, Staff Software Engineer, Hortonworks
Billie Rinaldi, Principal Software Engineer I, Hortonworks
Unlocking a fully integrated Spark experience within your enterprise Hadoop environment that is manageable, secure and deployable anywhere.
Presented at the Spark Summit by Arun C Murthy (co-Founder, Hortonworks) on Monday, June 15, 2015.
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
Pivotal Big Data Suite: A Technical OverviewVMware Tanzu
How and why are companies like Uber, Netflix and Airbnb so successful, what you need to in order to become successful in the same way that they are and how Pivotal can help you with that.
Speaker: Les Klein, EMEA CTO Data, Pivotal
Driving Real Insights Through Data ScienceVMware Tanzu
Major changes in industries have been brought about by the emergence of data-driven discoveries and applications. Many organizations are bringing together their data, and looking to drive change. But the ability to generate new insights in real time from a massive sets of data is still far from commonplace.
At this event, data technology experts and data scientists from Pivotal provided the latest business perspective on how data science and engineering can be used to accelerate the generation of new insights.
For information about upcoming Pivotal events, please visit: http://pivotal.io/news-events/#events
Troubleshooting App Health and Performance with PCF Metrics 1.2VMware Tanzu
Join Allen Duet and Pieter Humphrey from Pivotal, to learn how PCF Metrics enhances the developer experience on Pivotal Cloud Foundry, with a simple and powerful way to troubleshoot app health and performance issues. You will see how, with a single, unified interface for events, logs, and metrics, app devs can easily navigate graphs to identify problems and then view logs for that time slice.
My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production - specific technical challenges and best practices for engineering SQL in Hadoop solutions. Highlighting the next generation engineering approaches to the secret sauce we have implemented in the Actian VectorH database.
Data technology experts from Pivotal give the latest perspective on how big data analytics and applications are transforming organizations across industries.
This event provides an opportunity to learn about new developments in the rapidly-changing world of big data and understand best practices in creating Internet of Things (IoT) applications.
Learn more about the Pivotal Big Data Roadshow: http://pivotal.io/big-data/data-roadshow
Why Domain-Driven Design and Reactive Programming?VMware Tanzu
Enterprise software development is hard.
A poorly designed enterprise software application can result in exorbitant costs and overall project failure. Traditional approaches have had difficulty with promoting good design practices, resulting in applications that don’t meet the needs of the business and are costly and difficult to change. Ultimately, this severely limits the value of these applications.
Domain-Driven Design (DDD) and Reactive Programming are design patterns that address these issues head on. Both approaches address application development complexity by breaking your big problems into smaller problems.
DDD puts the focus on the core business domain ensuring that the highest business value areas are addressed first. DDD operates on the premise that your business needs will change, and your applications need to change accordingly. Working closely together, your business domain experts and technical team can deliver apps that evolve with your business.
Reactive Programming promotes simplicity by focusing on only a few important concepts. It reduces the complexity of building a big application by viewing it as a collection of smaller applications that respond to events. The stream of events that occur as part of your business operations can instantly trigger responses from the application, making Reactive Programming real-time, interactive, and engaging.
In this webinar, we will answer five key questions:
What causes software projects to lack well-designed domains?
What is a good domain model and how does it help with reducing complexity?
What is the Reactive model and how does it help developers solve complex application and integration problems?
How can you use these techniques to reduce time-to-market and improve quality as you build software that is more flexible, more scalable, and more tightly aligned to business goals?
How can in-memory data grids like open source Apache Geode and GemFire (Pivotal’s product based on Apache Geode) fit with these modern concepts?
How do you grapple with a legacy portfolio? What strategies do you employ to get an application to cloud native?
How do you grapple with a legacy portfolio? What strategies do you employ to get an application to cloud native?
This talk will cover tools, process and techniques for decomposing monolithic applications to Cloud Native applications running on Pivotal Cloud Foundry (PCF). The webinar will build on ideas from seminal works in this area: Working Effectively With Legacy Code and The Mikado Method. We will begin with an overview of the technology constraints of porting existing applications to the cloud, sharing approaches to migrate applications to PCF. Architects & Developers will come away from this webinar with prescriptive replatforming and decomposition techniques. These techniques offer a scientific approach for an application migration funnel and how to implement patterns like Anti-Corruption Layer, Strangler, Backends For Frontend, Seams etc., plus recipes and tools to refactor and replatform enterprise apps to the cloud. Go beyond the 12 factors and see WHY Cloud Foundry is the best place to run any app - cloud native or non-cloud native.
Speakers: Pieter Humphrey, Principal Product Manager; Pivotal
Rohit Kelapure, PCF Advisory Solutions Architect; Pivotal
Hungry for more? Check out this blog from Kenny Bastani:
http://www.kennybastani.com/2016/08/strangling-legacy-microservices-spring-cloud.html
Are you being asked to put more cloud in your strategy? If you’re like most people, the answer is a definite yes. The word “cloud” can mean so many things, however, that making an actionable strategy is impossible. At Pivotal, we divide cloud into two distinct parts: migrating as many legacy applications into SaaS as possible and focusing on perfecting the software you build in-house that runs your business. Gartner is predicting that by 2020, 75% of applications used to support digital businesses will be built in-house. If you’re one of these companies, you’ll need to quickly evaluate how you develop and run your custom written software.
We believe that soon, every company will either be a software company or losing to a competitor who is. It’s time to focus on the craft of managing the software development life-cycle, and this brief, but dense webinar will help launch your efforts to become a software defined business.
Join us in the last installment in our series: Organization Transformation - to get the full benefit of a cloud native approach, you'll likely need to change how your organization functions and behaves: you'll have to change its culture. When software is thought of more as ongoing products instead of discrete projects, the way the IT department is managed and run changes accordingly. This last part covers the motivations for those changes and outlines how to start transforming everyday management, strategy, staffing, and operations to become a cloud native enterprise.
Presenter: Michael Coté
Hadoop is used to run large scale jobs that are sub-divided into many tasks that are executed over multiple machines. There are complex dependencies between these tasks. And at scale, there can be 10’s to 1000’s of tasks running over 100 to 1000’s of machines which increases the challenge of making sense of their performance. Add to that pipelines of such jobs that logically run a business workflow as another level of complexity. This does not even account for other users competing for resources on the same Hadoop cluster which can have even a more drastic impact on your job performance and your SLAs. No wonder that the question of why Hadoop jobs run slower than expected, remains a perennial source of grief for developers. In this talk, we will draw on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this and present current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
Hadoop is used to run large scale jobs that are sub-divided into many tasks that are executed over multiple machines. There are complex dependencies between these tasks. And at scale, there can be 10’s to 1000’s of tasks running over 100 to 1000’s of machines which increases the challenge of making sense of their performance. Add to that pipelines of such jobs that logically run a business workflow as another level of complexity. This does not even account for other users competing for resources on the same Hadoop cluster which can have even a more drastic impact on your job performance and your SLAs. No wonder that the question of why Hadoop jobs run slower than expected, remains a perennial source of grief for developers. In this talk, we will draw on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this and present current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Data Con LA
This talk draws on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this and present current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Dynamic Column Masking and Row-Level Filtering in HDPHortonworks
As enterprises around the world bring more of their sensitive data into Hadoop data lakes, balancing the need for democratization of access to data without sacrificing strong security principles becomes paramount. In this webinar, Srikanth Venkat, director of product management for security & governance will demonstrate two new data protection capabilities in Apache Ranger – dynamic column masking and row level filtering of data stored in Apache Hive. These features have been introduced as part of HDP 2.5 platform release.
The Atlas/ Ranger integration represents a paradigm shift for big data governance and security. Enterprises can now implement dynamic classification-based security policies, in addition to role-based security. Ranger’s centralized platform empowers data administrators to define security policy based on Atlas metadata tags or attributes and apply this policy in real-time to the entire hierarchy of data assets including databases, tables and columns.
Future of Data New Jersey - HDF 3.0 Deep DiveAldrin Piri
Presentation on new features of HDF 3.0 presented on August 8, 2017 to the Future of Data: New Jersey Meetup group. This event was hosted by Honeywell in Morris Plains, NJ.
https://www.meetup.com/futureofdata-princeton/events/240972326/
Curb your insecurity with HDP - Tips for a Secure Clusterahortonworks
NOTE: Slides contains gifs which may appear as dark pics.
You got your cluster installed and configured. You celebrate, until the party is ruined by your company's Security officer stamping a big "Deny" on your Hadoop cluster. And oops!! You cannot place any data onto the cluster until you can demonstrate it is secure. In this session you will learn the tips and tricks to fully secure your cluster for data at rest, data in motion and all the apps including Spark. Your Security officer can then join your Hadoop revelry (unless you don't authorize him to, with your newly acquired admin rights)
This is the presentation from the "Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS" webinar on May 28, 2014. Rohit Bahkshi, a senior product manager at Hortonworks, and Vinod Vavilapalli, PMC for Apache Hadoop, discuss an overview of YARN in HDFS and new features in HDP 2.1. Those new features include: HDFS extended ACLs, HTTPs wire encryption, HDFS DataNode caching, resource manager high availability, application timeline server, and capacity scheduler pre-emption.
Sharing metadata across the data lake and streamsDataWorks Summit
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
Speaker
Alan Gates, Co-Founder, Hortonworks
Stream processing has become the defacto standard for building real-time ETL and Stream Analytics applications. We see batch workloads move into Stream processing to to act on the data and derive insights faster. With the explosion of data with "Perishable Insights" such IoT and machine-generated data, Stream Processing + Predictive Analytics is driving tremendous business value. This is evidenced by the explosion of Stream Processing frameworks like proven and evolving Apache Storm and newer frameworks such as Apache Flink, Apache Apex, and Spark Streaming.
Today, users have to choose and try to understand the benefits of each of these frameworks and not only that they have to learn the new APIs and also operationalize their applications. To create value faster, we are introducing new open source tool - Streamline. It is a self-service tool that will ease building streaming application and deploy the streaming application across multiple frameworks/engines that users prefer in a snap. It simplifies integration with Machine Learning models for scoring and classification of data for Predictive Analytics. It provides an elegant way to build Analytics dashboards to derive business insights out of the streaming data and to allow the business users to consume it easily.
In this talk, we will outline the fundamentals of real-time stream processing and demonstrate Streamline capabilities to show how it simplifies building real-time streaming analytics applications.
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
Hortonworks DataFlow (HDF) is the complete solution that addresses the most complex streaming architectures of today’s enterprises. More than 20 billion IoT devices are active on the planet today and thousands of use cases across IIOT, Healthcare and Manufacturing warrant capturing data-in-motion and delivering actionable intelligence right NOW. “Data decay” happens in a matter of seconds in today’s digital enterprises.
To meet all the needs of such fast-moving businesses, we have made significant enhancements and new streaming features in HDF 3.1.
https://hortonworks.com/webinar/series-hdf-3-1-technical-deep-dive-new-streaming-features/
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful.
At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
- We will be looking at 3 aspects mainly metrics and monitoring, logging and co-relation and tracing &analysis
Metrics are at the bottom level where you have counters and system level metrics gathered at runtime and exposed to high level for processing
Logs are mainly at the service level, for example, HDFS, YARN, HIVE, HBASE etc
Tracing and Analysis is at the application level where you are running a specific hive job or hadoop job.
When we say about metrics and monitoring, we have heard saying “there is no smoke without fire” and metrics is something equivalent to the smoke. And that is something which provides info on where the problem could be. Usually metrics do not go wrong unless something is very broken.
So what do I mean by metrics as high level pointers. Let me start with a simple example of one of the issues we dealt with. <customer story>
Similarly there is YARN dashboard. YARN is a resource management layer where you arrange a bunch of queues in the cluster which have a certain amount of memory and cpu allocated to it and you run multiple jobs,and it arbitrates the cpu/memory among these jobs.
HDFS and YARN provides the audit logs to track of the various state changes in the system.
You can use log search directly query directly .So here we are search for give me top 5 files which are being used.
You have lots of data distributed across systems about the metrics, audit logs, lineage etc and it is possible to make use big data tools itself to mine these logs.
You could potentially fetch data from ambari metrics system, from audit logs, from yarn application time serivce etc and start mining it via the notebooks.
Summarize:
There are many kinds of data that are available for debugging. Like counters and system counters which are emitted at the stack level. There are audit log details to provide information about the kind of operation that are happening in the system
And Also the application time line service which captures the metadata about the events. They all have ways in which each of them can be co-related to get meaningful information.