Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
Hortonworks Data Platform is a key component of Modern Data Architecture. Organizations rely on HDP for mission critical business functions and expects for the system to be constantly available and performant. In this session we will cover the operational best practices for administering the Hortonworks Data Platform including the initial setup and ongoing maintenance.
The document discusses evolving HDFS to support generalized storage containers in order to better scale the number of files and blocks. It proposes using block containers and a partial namespace approach to initially scale to billions of files and blocks, and eventually much higher numbers. The storage layer is being restructured to support various container types for use cases beyond HDFS like object storage and HBase.
Hadoop Operations - Best Practices from the FieldDataWorks Summit
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings - but also in more traditional, on premise deployments - applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports (a)synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and active development is ongoing at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
The document discusses evolving HDFS to better support large scale deployments. It summarizes HDFS's strengths in scaling to large clusters and data sizes. However, scaling the large number of small files and blocks is challenging. The solution involves using partial namespaces to store only recently used metadata in memory, and block containers to group blocks together. This will generalize the storage layer to support different container types beyond HDFS blocks. Initial goals are to scale to billions of files and blocks per volume, with the ability to add more volumes for further scaling. The changes will also enable new use cases like block storage and caching data in cloud storage.
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure or Amazon S3, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 has added a PROVIDED storage tier to mount external storage systems in the HDFS NameNode. Building on this functionality, we can now allow remote namespaces to be synchronized with HDFS, enabling asynchronous writes to the remote storage and the possibility to synchronously and transparently read data back to a local application wanting to access file data which is stored remotely. In this talk, which corresponds to the work in progress under HDFS-12090, we will present how the Hadoop admin can manage storage tiering between clusters and how that is then handled inside HDFS through the snapshotting mechanism and asynchronously satisfying the storage policy.
Speakers
Chris Douglas, Microsoft, Principal Research Software Engineer
Thomas Denmoor, Western Digital, Object Storage Architect
The document discusses Ozone, which is designed to address HDFS scalability limitations and enable trillions of file system objects. It was created as HDFS struggles with hundreds of millions of files. Ozone uses a microservices architecture of Ozone Manager, Storage Container Managers, and Recon Server to divide responsibilities and scale independently. It provides seamless transition for applications like YARN, MapReduce, Hive and Spark, and supports Kubernetes deployments. The document outlines Ozone's architecture, deployment options, write and read paths, usage similarities to HDFS/S3, enterprise-grade features around security, high availability and roadmap.
The document summarizes recommendations for efficiently and effectively managing Apache Hadoop based on observations from analyzing over 1,000 customer bundles. It covers common operational mistakes like inconsistent operating system configurations involving locale, transparent huge pages, NTP, and legacy kernel issues. It also provides recommendations for optimizing configurations involving HDFS name node and data node settings, YARN resource manager and node manager memory settings, and YARN ATS timeline storage. The presentation encourages adopting recommendations built into the SmartSense analytics product to improve cluster operations and prevent issues.
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
Hortonworks Data Platform is a key component of Modern Data Architecture. Organizations rely on HDP for mission critical business functions and expects for the system to be constantly available and performant. In this session we will cover the operational best practices for administering the Hortonworks Data Platform including the initial setup and ongoing maintenance.
The document discusses evolving HDFS to support generalized storage containers in order to better scale the number of files and blocks. It proposes using block containers and a partial namespace approach to initially scale to billions of files and blocks, and eventually much higher numbers. The storage layer is being restructured to support various container types for use cases beyond HDFS like object storage and HBase.
Hadoop Operations - Best Practices from the FieldDataWorks Summit
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings - but also in more traditional, on premise deployments - applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports (a)synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and active development is ongoing at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
The document discusses evolving HDFS to better support large scale deployments. It summarizes HDFS's strengths in scaling to large clusters and data sizes. However, scaling the large number of small files and blocks is challenging. The solution involves using partial namespaces to store only recently used metadata in memory, and block containers to group blocks together. This will generalize the storage layer to support different container types beyond HDFS blocks. Initial goals are to scale to billions of files and blocks per volume, with the ability to add more volumes for further scaling. The changes will also enable new use cases like block storage and caching data in cloud storage.
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure or Amazon S3, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 has added a PROVIDED storage tier to mount external storage systems in the HDFS NameNode. Building on this functionality, we can now allow remote namespaces to be synchronized with HDFS, enabling asynchronous writes to the remote storage and the possibility to synchronously and transparently read data back to a local application wanting to access file data which is stored remotely. In this talk, which corresponds to the work in progress under HDFS-12090, we will present how the Hadoop admin can manage storage tiering between clusters and how that is then handled inside HDFS through the snapshotting mechanism and asynchronously satisfying the storage policy.
Speakers
Chris Douglas, Microsoft, Principal Research Software Engineer
Thomas Denmoor, Western Digital, Object Storage Architect
The document discusses Ozone, which is designed to address HDFS scalability limitations and enable trillions of file system objects. It was created as HDFS struggles with hundreds of millions of files. Ozone uses a microservices architecture of Ozone Manager, Storage Container Managers, and Recon Server to divide responsibilities and scale independently. It provides seamless transition for applications like YARN, MapReduce, Hive and Spark, and supports Kubernetes deployments. The document outlines Ozone's architecture, deployment options, write and read paths, usage similarities to HDFS/S3, enterprise-grade features around security, high availability and roadmap.
The document summarizes recommendations for efficiently and effectively managing Apache Hadoop based on observations from analyzing over 1,000 customer bundles. It covers common operational mistakes like inconsistent operating system configurations involving locale, transparent huge pages, NTP, and legacy kernel issues. It also provides recommendations for optimizing configurations involving HDFS name node and data node settings, YARN resource manager and node manager memory settings, and YARN ATS timeline storage. The presentation encourages adopting recommendations built into the SmartSense analytics product to improve cluster operations and prevent issues.
Data is the fuel for the idea economy, and being data-driven is essential for businesses to be competitive. HPE works with all the Hadoop partners to deliver packaged solutions to become data driven. Join us in this session and you’ll hear about HPE’s Enterprise-grade Hadoop solution which encompasses the following
-Infrastructure – Two industrialized solutions optimized for Hadoop; a standard solution with co-located storage and compute and an elastic solution which lets you scale storage and compute independently to enable data sharing and prevent Hadoop cluster sprawl.
-Software – A choice of all popular Hadoop distributions, and Hadoop ecosystem components like Spark and more. And a comprehensive utility to manage your Hadoop cluster infrastructure.
-Services – HPE’s data center experts have designed some of the largest Hadoop clusters in the world and can help you design the right Hadoop infrastructure to avoid performance issues and future proof you against Hadoop cluster sprawl.
-Add-on solutions – Hadoop needs more to fill in the gaps. HPE partners with the right ecosystem partners to bring you solutions such an industrial grade SQL on Hadoop with Vertica, data encryption with SecureData, SAP ecosystem with SAP HANA VORA, Multitenancy with Blue Data, Object storage with Scality and more.
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
- The document discusses how time series data from sensors can be ingested and analyzed at large scales. It describes how traditional internet architecture concentrates resources at the core while sensors and devices reside at the edge, producing large amounts of time series data.
- It then summarizes techniques for ingesting and analyzing time series data at rates of millions to hundreds of millions of data points per second using technologies like OpenTSDB, HBase, and MapR databases. This involves batching data at the edge and optimized storage designs.
- The document concludes by discussing the advantages of MapR for time series use cases due to its high ingestion rates and integration with query engines like Drill for flexible analysis of large time series datasets
The document discusses architectural considerations for implementing clickstream analytics using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data modeling including file formats and partitioning, data ingestion methods like Flume and Sqoop, available processing engines like MapReduce, Hive, Spark and Impala, and the need to sessionize clickstream data to analyze metrics like bounce rates and attribution.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
Kudu is a new column-oriented storage system for Apache Hadoop that is designed to address the gaps in transactional processing and analytics in Hadoop. It aims to provide high throughput for large scans, low latency for individual rows, and database semantics like ACID transactions. Kudu is motivated by the changing hardware landscape with faster SSDs and more memory, and aims to take advantage of these advances. It uses a distributed table design partitioned into tablets replicated across servers, with a centralized metadata service for coordination.
This document discusses Apache Kudu, an open source column-oriented storage system that provides fast analytics on fast data. It describes Kudu's design goals of high throughput for large scans, low latency for short accesses, and database-like semantics. The document outlines Kudu's architecture, including its use of columnar storage, replication for fault tolerance, and integrations with Spark, Impala and other frameworks. It provides examples of using Kudu for IoT and real-time analytics use cases. Performance comparisons show Kudu outperforming other NoSQL systems on analytics and operational workloads.
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
This document provides an overview of Apache Tez, a framework for building data processing applications on Hadoop YARN. It describes how Tez allows applications to define complex data flows as directed acyclic graphs (DAGs) and handles distributed execution, fault tolerance, and resource management. Tez has improved the performance of Apache Hive and Pig by an order of magnitude by enabling more flexible DAG definitions and runtime optimizations. It also supports integration with other data processing engines like Spark, Storm and interactive SQL queries. The document outlines how Tez works and provides guidance on how developers can contribute to the open source project.
HDFS Tiered Storage: Mounting Object Stores in HDFSDataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings- but also in more traditional, on premise deployments- applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
CBlocks - Posix compliant files systems for HDFSDataWorks Summit
With YARN running Docker containers, it is possible to run applications that are not HDFS aware inside these containers. It is hard to customize these applications since most of them assume a Posix file system with rewrite capabilities. In this talk, we will dive into how we created a block storage, how it is being tested internally and the storage containers which makes it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This is talk will also explore the current state of Ozone along with CBlocks. This talk will explore architecture of storage containers, how replication is handled, scaling to millions of volumes and I/O performance optimizations.
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
Dremio is a startup founded in 2015 by experts in big data and open source. It aims to provide a platform for interactive analysis across disparate data sources through a storage-agnostic and client-agnostic approach leveraging Apache Arrow for high performance in-memory columnar execution. Dremio uses Apache Drill as its query engine, allowing users to query data across different systems like HDFS, S3, MongoDB as if it was a single relational database through SQL. It has an extensible architecture that allows new data sources to be easily added via plugins.
The document discusses LLAP (Live Long and Process), a new execution layer for Hive that enables sub-second analytical queries. LLAP uses daemons running on worker nodes to cache data in memory and keep query fragments executing between queries for faster performance. It allows for highly concurrent queries without specialized YARN queues. Benchmarks show LLAP providing up to 90% faster performance over Hive for queries against large datasets. LLAP also aims to serve as a unified data access layer for other systems like Spark SQL.
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
Apache Hadoop 3.0 is coming! As the next major release, it attracts everyone's attention as show case several bleeding-edge technologies and significant features across all components of Apache Hadoop, include: Erasure Coding in HDFS, Multiple Standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and Service Support as First Class Citizen, Hadoop library updates and client-side class path isolation, etc.
In this talk, we will update the status of Hadoop 3 especially the releasing work in community and then go deep diving on new features included in Hadoop 3.0. As a new major release, Hadoop 3 would also include some incompatible changes - we will go through most of these changes and explore its impact to existing Hadoop users and operators. In the last part of this session, we will continue to discuss ongoing efforts in Hadoop 3 age and show the big picture that how big data landscape could be largely influenced by Hadoop 3.
This document summarizes Richard Xu's presentation on tuning Yarn, Hive, and queries on a Hadoop cluster. The initial issues with the cluster included jobs taking hours to finish when they were supposed to take minutes. Initial tuning focused on cluster configuration best practices and increasing Yarn capacity. Further tuning involved limiting user capacity, increasing resources for application masters, and tuning memory settings for MapReduce and Tez. Specific Hive query issues addressed were full table scans, non-deterministic functions, join orders, and data type mismatches. Tools discussed for analysis included Tez visualization and Lipwig. Lessons learned emphasized a holistic tuning approach and understanding data structures and explain plans. Long-lived execution (LLAP) was presented as providing in
This document provides guidance on sizing and configuring Apache Hadoop clusters. It recommends separating master nodes, which run processes like the NameNode and JobTracker, from slave nodes, which run DataNodes, TaskTrackers and RegionServers. For medium to large clusters it suggests 4 master nodes and the remaining nodes as slaves. The document outlines factors to consider for optimizing performance and cost like selecting balanced CPU, memory and disk configurations and using a "shared nothing" architecture with 1GbE or 10GbE networking. Redundancy is more important for master than slave nodes.
This document discusses loading data from Hadoop into Oracle databases using Oracle connectors. It describes how the Oracle Loader for Hadoop and Oracle SQL Connector for HDFS can load data from HDFS into Oracle tables much faster than traditional methods like Sqoop by leveraging parallel processing in Hadoop. The connectors optimize the loading process by automatically partitioning, sorting, and formatting the data into Oracle blocks to achieve high performance loads. Measuring the CPU time needed per gigabyte loaded allows estimating how long full loads will take based on available resources.
In this session, learn how to build an Apache Spark or Spark Streaming application that can interact with HBase. In addition, you'll walk through how to implement common, real-world batch design patterns to optimize for performance and scale.
Fundamentals of Big Data, Hadoop project design and case study or Use case
General planning consideration and most necessaries in Hadoop ecosystem and Hadoop projects
This will provide the basis for choosing the right Hadoop implementation, Hadoop technologies integration, adoption and creating an infrastructure.
Building applications using Apache Hadoop with a use-case of WI-FI log analysis has real life example.
Data is the fuel for the idea economy, and being data-driven is essential for businesses to be competitive. HPE works with all the Hadoop partners to deliver packaged solutions to become data driven. Join us in this session and you’ll hear about HPE’s Enterprise-grade Hadoop solution which encompasses the following
-Infrastructure – Two industrialized solutions optimized for Hadoop; a standard solution with co-located storage and compute and an elastic solution which lets you scale storage and compute independently to enable data sharing and prevent Hadoop cluster sprawl.
-Software – A choice of all popular Hadoop distributions, and Hadoop ecosystem components like Spark and more. And a comprehensive utility to manage your Hadoop cluster infrastructure.
-Services – HPE’s data center experts have designed some of the largest Hadoop clusters in the world and can help you design the right Hadoop infrastructure to avoid performance issues and future proof you against Hadoop cluster sprawl.
-Add-on solutions – Hadoop needs more to fill in the gaps. HPE partners with the right ecosystem partners to bring you solutions such an industrial grade SQL on Hadoop with Vertica, data encryption with SecureData, SAP ecosystem with SAP HANA VORA, Multitenancy with Blue Data, Object storage with Scality and more.
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
- The document discusses how time series data from sensors can be ingested and analyzed at large scales. It describes how traditional internet architecture concentrates resources at the core while sensors and devices reside at the edge, producing large amounts of time series data.
- It then summarizes techniques for ingesting and analyzing time series data at rates of millions to hundreds of millions of data points per second using technologies like OpenTSDB, HBase, and MapR databases. This involves batching data at the edge and optimized storage designs.
- The document concludes by discussing the advantages of MapR for time series use cases due to its high ingestion rates and integration with query engines like Drill for flexible analysis of large time series datasets
The document discusses architectural considerations for implementing clickstream analytics using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data modeling including file formats and partitioning, data ingestion methods like Flume and Sqoop, available processing engines like MapReduce, Hive, Spark and Impala, and the need to sessionize clickstream data to analyze metrics like bounce rates and attribution.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
Kudu is a new column-oriented storage system for Apache Hadoop that is designed to address the gaps in transactional processing and analytics in Hadoop. It aims to provide high throughput for large scans, low latency for individual rows, and database semantics like ACID transactions. Kudu is motivated by the changing hardware landscape with faster SSDs and more memory, and aims to take advantage of these advances. It uses a distributed table design partitioned into tablets replicated across servers, with a centralized metadata service for coordination.
This document discusses Apache Kudu, an open source column-oriented storage system that provides fast analytics on fast data. It describes Kudu's design goals of high throughput for large scans, low latency for short accesses, and database-like semantics. The document outlines Kudu's architecture, including its use of columnar storage, replication for fault tolerance, and integrations with Spark, Impala and other frameworks. It provides examples of using Kudu for IoT and real-time analytics use cases. Performance comparisons show Kudu outperforming other NoSQL systems on analytics and operational workloads.
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
This document provides an overview of Apache Tez, a framework for building data processing applications on Hadoop YARN. It describes how Tez allows applications to define complex data flows as directed acyclic graphs (DAGs) and handles distributed execution, fault tolerance, and resource management. Tez has improved the performance of Apache Hive and Pig by an order of magnitude by enabling more flexible DAG definitions and runtime optimizations. It also supports integration with other data processing engines like Spark, Storm and interactive SQL queries. The document outlines how Tez works and provides guidance on how developers can contribute to the open source project.
HDFS Tiered Storage: Mounting Object Stores in HDFSDataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings- but also in more traditional, on premise deployments- applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
CBlocks - Posix compliant files systems for HDFSDataWorks Summit
With YARN running Docker containers, it is possible to run applications that are not HDFS aware inside these containers. It is hard to customize these applications since most of them assume a Posix file system with rewrite capabilities. In this talk, we will dive into how we created a block storage, how it is being tested internally and the storage containers which makes it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This is talk will also explore the current state of Ozone along with CBlocks. This talk will explore architecture of storage containers, how replication is handled, scaling to millions of volumes and I/O performance optimizations.
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
Dremio is a startup founded in 2015 by experts in big data and open source. It aims to provide a platform for interactive analysis across disparate data sources through a storage-agnostic and client-agnostic approach leveraging Apache Arrow for high performance in-memory columnar execution. Dremio uses Apache Drill as its query engine, allowing users to query data across different systems like HDFS, S3, MongoDB as if it was a single relational database through SQL. It has an extensible architecture that allows new data sources to be easily added via plugins.
The document discusses LLAP (Live Long and Process), a new execution layer for Hive that enables sub-second analytical queries. LLAP uses daemons running on worker nodes to cache data in memory and keep query fragments executing between queries for faster performance. It allows for highly concurrent queries without specialized YARN queues. Benchmarks show LLAP providing up to 90% faster performance over Hive for queries against large datasets. LLAP also aims to serve as a unified data access layer for other systems like Spark SQL.
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
Apache Hadoop 3.0 is coming! As the next major release, it attracts everyone's attention as show case several bleeding-edge technologies and significant features across all components of Apache Hadoop, include: Erasure Coding in HDFS, Multiple Standby NameNodes, YARN Timeline Service v2, JNI-based shuffle in MapReduce, Apache Slider integration and Service Support as First Class Citizen, Hadoop library updates and client-side class path isolation, etc.
In this talk, we will update the status of Hadoop 3 especially the releasing work in community and then go deep diving on new features included in Hadoop 3.0. As a new major release, Hadoop 3 would also include some incompatible changes - we will go through most of these changes and explore its impact to existing Hadoop users and operators. In the last part of this session, we will continue to discuss ongoing efforts in Hadoop 3 age and show the big picture that how big data landscape could be largely influenced by Hadoop 3.
This document summarizes Richard Xu's presentation on tuning Yarn, Hive, and queries on a Hadoop cluster. The initial issues with the cluster included jobs taking hours to finish when they were supposed to take minutes. Initial tuning focused on cluster configuration best practices and increasing Yarn capacity. Further tuning involved limiting user capacity, increasing resources for application masters, and tuning memory settings for MapReduce and Tez. Specific Hive query issues addressed were full table scans, non-deterministic functions, join orders, and data type mismatches. Tools discussed for analysis included Tez visualization and Lipwig. Lessons learned emphasized a holistic tuning approach and understanding data structures and explain plans. Long-lived execution (LLAP) was presented as providing in
This document provides guidance on sizing and configuring Apache Hadoop clusters. It recommends separating master nodes, which run processes like the NameNode and JobTracker, from slave nodes, which run DataNodes, TaskTrackers and RegionServers. For medium to large clusters it suggests 4 master nodes and the remaining nodes as slaves. The document outlines factors to consider for optimizing performance and cost like selecting balanced CPU, memory and disk configurations and using a "shared nothing" architecture with 1GbE or 10GbE networking. Redundancy is more important for master than slave nodes.
This document discusses loading data from Hadoop into Oracle databases using Oracle connectors. It describes how the Oracle Loader for Hadoop and Oracle SQL Connector for HDFS can load data from HDFS into Oracle tables much faster than traditional methods like Sqoop by leveraging parallel processing in Hadoop. The connectors optimize the loading process by automatically partitioning, sorting, and formatting the data into Oracle blocks to achieve high performance loads. Measuring the CPU time needed per gigabyte loaded allows estimating how long full loads will take based on available resources.
In this session, learn how to build an Apache Spark or Spark Streaming application that can interact with HBase. In addition, you'll walk through how to implement common, real-world batch design patterns to optimize for performance and scale.
Fundamentals of Big Data, Hadoop project design and case study or Use case
General planning consideration and most necessaries in Hadoop ecosystem and Hadoop projects
This will provide the basis for choosing the right Hadoop implementation, Hadoop technologies integration, adoption and creating an infrastructure.
Building applications using Apache Hadoop with a use-case of WI-FI log analysis has real life example.
We Provide Hadoop training institute in Hyderabad and Bangalore with corporate training by 12+ Experience faculty.
Real-time industry experts from MNCs
Resume Preparation by expert Professionals
Lab exercises
Interview Preparation
Experts advice
From: DataWorks Summit Munich 2017 - 20170406
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
Why is everyone interested in Big Data and Hadoop?
Why you should use Hadoop?
Read this to and you as well you quickly and easily be the proud owner of a Hadoop kit of your own, using Cloudera Free Edition.
************************NOTE**********************
This presentation is still being edited and new slides added every day. Stay tuned...
****************************************************
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
This document provides an overview of Cloudera's "Data Analyst Training: Using Pig, Hive, and Impala with Hadoop" course. The course teaches data analysts how to use Pig, Hive, and Impala for large-scale data analysis on Hadoop. It covers loading and analyzing data with these tools, choosing the best tool for different jobs, and includes hands-on exercises. The target audience is data analysts and others interested in using Pig, Hive and Impala for big data analytics.
If you are search Best Engineering college in India, Then you can trust RCE (Roorkee College of Engineering) services and facilities. They provide the best education facility, highly educated and experienced faculty, well furnished hostels for both boys and girls, top computerized Library, great placement opportunity and more at affordable fee.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
Hadoop security has improved with additions such as HDFS ACLs, Hive column-level ACLs, HBase cell-level ACLs, and Knox for perimeter security. Data encryption has also been enhanced, with support for encrypting data in transit using SSL and data at rest through file encryption or the upcoming native HDFS encryption. Authentication is provided by Kerberos/AD with token-based authorization, and auditing tracks who accessed what data.
This document provides an overview of Oracle Big Data Cloud Service. It discusses Oracle's deployment models for big data including on-premises, cloud at customer, and public cloud options. It then describes Oracle Big Data Cloud Service, covering its core principles of efficient cluster management, security, and versatility. Administrative tasks and data management tools available with Oracle Big Data Cloud Service are also summarized.
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
This document summarizes Syncsort's high performance data integration solutions for Hadoop contexts. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.
Alluxio is a virtual distributed file system that serves as a data access layer between applications and storage systems. It provides a unified interface, improved performance through caching, and enables transparent migration between storage systems. Alluxio deployed with Presto on cloud storage like S3 can provide 5x faster query performance through caching query data in Alluxio workers located with compute. Case studies show how Alluxio improved response times for analytics workloads at large companies by eliminating remote data access and enabling data locality.
Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
Alluxio Tech Talk
Oct 17, 2019
Speaker:
Alex Ma, Alluxio
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges – network latency impacts performance, copying data via DistCP means maintaining duplicate data, and you may have to make application changes to accomodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
Data Orchestration Platform for the CloudAlluxio, Inc.
This document discusses using a hybrid cloud approach with data orchestration to enable analytics workloads on data stored both on-premises and in the cloud. It outlines reasons for a hybrid approach including reducing time to production and leveraging cloud flexibility. It then describes alternatives like lift-and-shift or compute-driven approaches and their issues. Finally, it introduces a data orchestration platform that can cache and tier data intelligently while enabling analytics frameworks to access both on-premises and cloud-based data with low latency.
Similar to Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2 (20)
Growth hacking tips and tricks that you can trySpringPeople
The term growth hacking has been gaining popularity in the tech space.In these slides, we will talk about tips and tricks that help a skilled growth hacker to grow their company.
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
This document discusses top big data analytics tools and emerging trends in big data analytics. It defines big data analytics as examining large data sets to find patterns and business insights. The document then covers several open source and commercial big data analytics tools, including Jaspersoft and Talend for reporting, Skytree for machine learning, Tableau for visualization, and Pentaho and Splunk for reporting. It emphasizes that tool selection is just one part of a big data project and that evaluating business value is also important.
Every day we roughly create 2.5 Quintillion bytes of data; 90% of the worlds collected data has been generated only in the last 2 years. In this slide, learn the all about big data
in a simple and easiest way.
In this slide, learn how selenium WebDriver tool supply a well-designed object-oriented API that provides improved support for modern, advanced web-app testing problems.
Introduction to Open stack - An Overview SpringPeople
OpenStack is a free & open-source software platform for cloud computing, mostly deployed as an IaaS. In this Slide, we will cover:
- Evolution of Openstack
- Cloud, its types and advantages
- Importance and overview of Openstack
- Openstack course syllabus
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorialsSpringPeople
The document discusses MongoDB, an open-source document database. It provides an overview of MongoDB, including what it is, why it is used, its basic concepts like databases, collections, and documents, and how it compares to a relational database. It also covers MongoDB commands for creating and dropping collections, inserting, querying, and updating documents.
Mastering Test Automation: How To Use Selenium SuccessfullySpringPeople
In this slide, identify what to test and choose the best language for automation. Learn to write maintainable and reusable Selenium tests and add UI layout test as part of automation using Galen framework. This slide will also guide you in reporting structure by using external plugin's, an illustration covering cross browser testing (Running selenium grid with Docker) and explain Code repository (Git) and Jenkins CI tool.
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...SpringPeople
Technologies such as Hadoop and Apache Spark have brought a dynamic change in the ways of analyzing big data. It is increasingly used by companies across the globe. Data Scientist has been regarded as the hottest job of 2016. In this Slide, you will be taken through the basics of Big data and its future. You will also be exposed to Hadoop and Apache Spark.
SpringPeople - Introduction to Cloud ComputingSpringPeople
Cloud computing is no longer a fad that is going around. It is for real and is perhaps the most talked about subject. Various players in the cloud eco-system have provided a definition that is closely aligned to their sweet spot –let it be infrastructure, platforms or applications.
This presentation will provide an exposure of a variety of cloud computing techniques, architecture, technology options to the participants and in general will familiarize cloud fundamentals in a holistic manner spanning all dimensions such as cost, operations, technology etc
SpringPeople - Devops skills - Do you have what it takes?SpringPeople
Whether you are a Developer, QA or a IT Operations personnel, with organizations adapting devops practices you need to skill up with the latest and the greatest of the devops tools, relevant to you. And its not the same basket of tools that dev and ops both opt for. This webinar is about the essential devops skills required to transform yourself to be a next gen devops professional. And this is based on real data, a devops skills report by Initcron.
ELK Stack workshop covers real-world use cases and works with the participants to - implement them. This includes Elastic overview, Logstash configuration, creation of dashboards in Kibana, guidelines and tips on processing custom log formats, designing a system to scale, choosing hardware, and managing the lifecycle of your logs.
To transform your organization and unlock the value of your data, you need a way to ingest, store and analyze every type of data in your organization.
This presentation covers the Data Access Layer of the Hadoop Ecosystem which enables you to achieve this.
We will use the HDP (Hortonworks Data Platform) reference architecture to walk through the Hadoop core and its ecosystem with focus on the data access layer.
We will cover some of the prominent tools of the ecosystem such as Pig, Hive, Sqoop, Flume and Oozie and how they are used for ingesting data into Hadoop from structured, unstructured and streaming sources.
Talk to us at +91 80 6567 9700 or send an email to training@springpeople.com for more information.
Introduction To Hadoop Administration - SpringPeopleSpringPeople
The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The popularity of Hadoop is juts increasing exponentially.
Introduction To Cloud Foundry - SpringPeopleSpringPeople
Cloud Foundry - Streamline application development, deployment and operations on a centrally-managed Platform as a Service for public and private cloud.
Introduction To Spring Enterprise Integration - SpringPeopleSpringPeople
The document provides an introduction to Spring Enterprise Integration. It discusses that Spring Enterprise Integration extends the Spring programming model to messaging and builds on existing enterprise integration support. It provides a higher level of abstraction and supports message-driven architectures, routing, and transformation of messages to integrate different transports and data formats. The document also summarizes the goals of Spring Integration as providing a simple model for complex integration solutions, facilitating asynchronous messaging in Spring applications, and promoting intuitive adoption for Spring users.
Introduction To Groovy And Grails - SpringPeopleSpringPeople
Groovy is a dynamic language that runs on the Java Virtual Machine. Grails is a web application framework that uses Groovy. It allows building web applications quickly by generating the necessary configuration automatically and integrating with existing Java code. The document discusses Groovy testing, Grails architecture, controllers, services, managing databases and data migration in Grails. It also advertises a 4-day training on mastering Groovy and Grails.
Introduction To Jenkins - SpringPeopleSpringPeople
Jenkins CI is the leading open-source continuous integration server. Built with Java, it provides 1057 plugins to support building and testing virtually any project.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
2. We have INSTRUCTOR LED - both Online LIVE & Classroom Session
Present for classroom sessions in Bangalore & Delhi (NCR)
We are the ONLY Education delivery partners for Mulesoft, Elastic, Pivotal & Lightbend in India
We have delivered more than 5000 trainings and have over 400 courses and a vast pool of over 200 experts to make YOU the
EXPERT!
FOLLOW US ON SOCIAL MEDIA TO STAY UPDATED ON THE UPCOMING WEBINARS
3. Online and Classroom Training on Technology Courses
at SpringPeople
Certified Partners
Non-Certified Courses
…and many more
…NOW
6. Covered Till Now
1. Use Ambari – Cluster Management Tool
2. More of WebHDFS Access
3. WebHDFS
4. Use More of HDFS Access Control Lists
5. Use HDFS Quotas
6. Understanding of YARN Components
7. Adding, Deleting, or Replacing Worker Nodes
8. Rack Awareness
9. NameNode High Availability
10. ResourceManager High Availability
11. Ambari Metrics System
12. What to Backup?
7. 13 – Setting appropriate Directory Space Quota
• Best practice is to also set space limits on home directory To set a
12TB limit:
$ hdfs dfsadmin –setSpaceQuota 12t /user/username
• Includes space for replication
• This is the actual use of space
• Example:
• If storing 1TB and replication factor is 3
• 3TB is needed
• Quota can be set on any directory
8. 14 - Configuring Trash
• Enable by setting time delay for trash's checkpoint removal:
In core-site.xml
• fs.trash.interval
• Delay is set in minutes (24 hours would be 1440 minutes)
• Recommendation is to set to 360 minutes (6 hours)
• Setting the value to 0 disables Trash
• Files deleted programmatically are deleted immediately
• Files can be immediately deleted from the command line using -skipTrash
9. 15 - Compression Needs and Tradeoffs
Compressing data can speed up data-intensive I/O operations
• MapReduce jobs are almost always I/O bound
Compressed data can save storage space and speed up data transfers across the network
• Capital allocation for hardware can go further
Reduced I/O and network load can result in significant performance improvements
• MapReduce jobs can finish faster overall
But, CPU utilization and processing time increase during compression and decompression
• Understanding the tradeoffs is important for MapReduce pipeline’s overall performance
10. 16 - Sqoop Security
• Database Authentication:
• Sqoop needs to authenticate to the RDBMS
• How?
• Usually this involves a username/password
(Oracle Wallet is the exception)
• Can hard code password in scripts (not recommended/used)
• Password usually stored in plaintext in a file protected by the filesystem
• Hadoop Credential Management Framework added in HDP 2.2
• Not a keystore, but a way to interface with keystore backends
• Passwords can be stored in a keystore and not in plain text
• Can help with “no passwords in plaintext” requirements
11. 17 - distcp Configurations
• If Distcp runs out of memory before copying:
• Possible Cause: Number of files/directories being copied from source
path(s) is extremely large (e.g. 100,000 paths)
• Change: HEAP Size
- Export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m”
• Map Sizing
• If -m is not specified: Default to 20 maps max
• Tuning the number of maps to:
- Size of the source and destination cluster
- The size of the copy
- Available bandwidth
12. 18 - Falcon
Centrally manages data lifecycle
• Centralized definition & management of pipelines for data ingest,
process and export
Supports Business continuity and Disaster
Recovery
• Out of the box policies for data replication
and retention
• End-to-end monitoring of data pipelines
Addresses basic audit & compliance requirements
• Visualize data pipeline lineage
• Track data pipeline audit logs
• Tag data with business metadata
13. 19 - Running Balancer
• Can be run periodically as a batch job
• Examples: every 24 hours or weekly
• Run after new nodes have been added to the cluster
• To run balancer:
hdfs balancer [-threshold <threshold>] [-policy <policy>]]
• Runs until there are no blocks to move
or
Until it has lost contact with the NameNode
• Can be stopped with a Ctrl+C
14. 20 - HDFS Snapshots
Create HDFS directory snapshots
Fast operation - only metadata affected
Results in .snapshot/ directory in the HDFS directory
Snapshots are named or default to timestamp
Directories must be made snapshottable
Snapshot Steps:
– Allow snapshot on directory
hdfs dfsadmin -allowSnapshot foo/bar/
– Create snapshot for directory and optionally provide snapshot name
hdfs dfs -createSnapshot foo/bar/ mysnapshot_today
– Verify snapshot
hdfs dfs -ls foo/bar/.snapshot
15. 21 - HDFS Data – Automate & Restore
• Use Falcon/Oozie to automate backups
• Falcon utilizes Oozie as a workflow scheduler
• distcp is an Oozie action
- use -update and -prbugp
• Restoring is the reverse process of backups
1. On your backup cluster choose which snapshot to restore
2. Remove/move target directory on production system
3. Run distcp without -update options
Feel free to spend a lot of time on this slide. Many of these frameworks are not discussed later in the course, so now is likely your only chance to explain them. Let the students ask questions and make the discussion interactive.
So what is Hortonworks Data Platform(HDP)? It is an open enterprise version of Hadoop distributed by Hortonworks. It includes a single installation utility that installs many of the Apache Hadoop software frameworks. Even the installer is pure Hadoop.
The primary benefit is that Hortonworks has put HDP through a rigorous set of system, functional, and regression tests to ensure that versions of any frameworks included in the distribution work seamlessly together in a secure and reliable manner.
Because HDP is an open enterprise version of Hadoop, it is imperative that it uses the best combination of the most stable, reliable, secure, and current frameworks.
There is one more property which is having relation with the above property called fs.trash.checkpoint.interval. It is the number of minutes between trash checkpoints. This should be smaller or equal to fs.trash.interval. Everytime the checkpointer runs, it creates a new checkpoint out of current and removes checkpoints created more than fs.trash.interval minutes ago.The default value of this property is zero.