To get the most out of the heaps of data your company is sitting on, you’ll need a platform such as Apache Spark to sort through the noise and get meaningful conclusions you can use to improve your services. If you need to get results from such an intense workload in a reasonable amount of time, your company should invest in a solution whose power matches your level of work.
This proof of concept has introduced you to a new solution based on the AMD EPYC line of processors. Based on the new Zen architecture, the AMD EPYC line of processors offers resources and features worth considering. In the Principled Technologies datacenter, we set up a big data solution consisting of whitebox servers powered by the AMD EPYC 7601—the top-of-the-line offering from AMD. We ran an Apache Spark workload and tested the solution with three components of the HiBench benchmarking suite. The AMD systems maintained a consistent level of performance across these tests.
HPC and cloud distributed computing, as a journey
Peter Clapham
Introducing an internal cloud brings new paradigms, tools and infrastructure management. When placed alongside traditional HPC, the new opportunities are significant. But getting to the new world of micro-services, autoscaling and autodialing is a journey that cannot be achieved in a single step.
Hadoop & cloud storage: object store integration in production (final)
Chris Nauroth
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Storage and GCS, which is designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the object store FileSystem connectors, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users who choose to integrate Hadoop with cloud storage. We use S3 and the S3A connector as a case study.
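As a concrete reference point, the S3A connector is driven by `fs.s3a.*` properties in the Hadoop configuration. A minimal sketch follows; the credentials, endpoint and values are placeholders, and in practice credential providers are preferable to inline keys:

```xml
<!-- core-site.xml: minimal S3A setup (all values are placeholders) -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-west-2.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>100</value> <!-- pool size for concurrent S3 connections -->
</property>
```

With this in place, jobs address data via `s3a://bucket/path` URIs instead of `hdfs://` paths.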
The document discusses how EMC Isilon scale-out NAS storage improves Hadoop resiliency and operational efficiency. It analyzes the impact of DataNode and TaskTracker failures on Hadoop jobs. EMC Isilon provides high availability, independent scalability of storage and compute, data protection features, and support for multiple Hadoop distributions and protocols like HDFS, NFS, SMB. This allows using existing data for analysis without replication and reduces time-to-results for Hadoop jobs.
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes to YARN, it now supports running Docker containers alongside process containers. Couple this with the newly added support for running services on YARN and it allows a host of new possibilities. In this talk, we'll present how to run a potential container cloud on YARN. Leveraging the support in YARN for Docker and services, we can allow users to spin up a bunch of Docker containers for their applications. These containers can be self contained or wired up to form more complex applications(using the Assemblies support in YARN). We will go over some of the lessons we learned as part of our experiences handling issues such as resource management, debugging application failures, running Docker, etc.
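In current Hadoop releases, the Docker runtime is enabled on the NodeManager side and then selected per application through environment variables. A rough sketch, with the image and command illustrative:

```shell
# yarn-site.xml (NodeManager side): allow the Docker runtime
#   yarn.nodemanager.runtime.linux.allowed-runtimes = default,docker
#   yarn.nodemanager.runtime.linux.docker.allowed-container-networks = host,bridge
# (trusted registries are configured in container-executor.cfg)

# At submit time, request a Docker container via environment variables,
# e.g. with the distributed shell application:
yarn jar hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command "cat /etc/os-release" \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=library/centos:7 \
  -num_containers 1
```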
This document discusses the performance metrics and capabilities of an enterprise-grade streaming platform called Onyx, which can process streaming data on Hadoop clusters with latencies under 2 ms. Its key targets are latencies under 16 ms, throughput of 2,000 events/second, 99.5% uptime, and the ability to scale resources while maintaining latency, along with open-source components, extensible rules, and transparent integration with existing systems. Testing showed it can process over 70,000 records/second with an average latency of 0.19 ms and meet stringent reliability targets.
Apache Ambari is an extensible framework that simplifies provisioning, managing and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer; thereby providing a consistent approach for managing and monitoring the services. This also provided a natural extension point for operators and the community to bring in their own add-on services and “plug-in” the new services into the stack.
However, one fundamental limitation of the current Apache Ambari architecture is the strong one-to-one coupling between entities. For instance, a cluster is tied to a single stack and a Hadoop operator can only deploy services defined in that stack, a cluster can have only a single instance of a service, and a host can have only a single instance of a component. Given the many use cases these limitations rule out, there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that wouldn't have been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package is a self-contained deployment artifact that includes all the details for deploying, managing and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services. This eliminates the need to plug their services into a stack and lets them define their own upgrade story for those custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NiFi from HDF.
Further, we will discuss the architectural changes needed to enable a multi-instance architecture in future Ambari releases: supporting multiple instances of a service in a cluster, multiple instances of a component on a host, and future-proofing the Ambari architecture to leverage advancements in the Hadoop community such as YARN services (YARN-4692). We will wrap up the conversation with a brief overview of other improvements planned for future releases of Ambari.
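To make the packaging idea concrete, a management package is described by a descriptor that lists the bundled service definitions. The exact schema is still evolving, so the following mpack.json is purely illustrative:

```json
{
  "type": "full-release",
  "name": "custom-services-mpack",
  "version": "1.0.0",
  "description": "Illustrative management package bundling add-on services",
  "artifacts": [
    {
      "name": "NIFI-service-definition",
      "type": "service-definitions",
      "source_dir": "custom-services/NIFI"
    }
  ]
}
```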
HDFS Tiered Storage: Mounting Object Stores in HDFS
DataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as in Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 around April 2018, loaded with features and improvements. One of the biggest challenges for any new major release of a software platform is compatibility. The Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
Admins face many challenges when upgrading to a major release of Hadoop, and users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session dives into the upgrade aspects in detail and provides a preview of migration strategies, with information on what works and what might not. The talk will cover the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
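On the HDFS side, the documented rolling-upgrade flow gives a flavor of what an admin-driven migration involves. This sketch assumes an HA cluster and omits the YARN and component-specific steps:

```shell
# 1. Prepare: ask the NameNode to create a rollback fsimage
hdfs dfsadmin -rollingUpgrade prepare
# 2. Poll until the rollback image is reported as ready
hdfs dfsadmin -rollingUpgrade query
# 3. Upgrade and restart NameNodes and DataNodes with the new binaries
#    (standby first, then failover, then the other NameNode)
# 4. Finalize only after the cluster and workloads are verified;
#    rollback is no longer possible afterwards
hdfs dfsadmin -rollingUpgrade finalize
```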
Speaker
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
1) Enterprises struggle to manage big data with existing technologies due to more systems, complexity, and data to handle.
2) HPE proposes a new "Sparkitecture" called the HPE Elastic Platform for Analytics to address these issues. It uses a data-centric foundation to consolidate all data and applications on a single, elastic platform for analytics workloads.
3) The platform offers workload-optimized systems that provide better performance, scalability, and economics than traditional Hadoop architectures.
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, a different Hadoop file system, or an internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using the Ambari Metrics System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
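For reference, an Ambari service definition starts from a metainfo.xml that declares the service and its components; the service, component and script names below are illustrative:

```xml
<!-- metainfo.xml: skeleton of an Ambari add-on service definition -->
<metainfo>
  <schemaVersion>2.0</schemaVersion>
  <services>
    <service>
      <name>MYTOOL</name>
      <displayName>My Internal Tool</displayName>
      <version>1.0.0</version>
      <components>
        <component>
          <name>MYTOOL_SERVER</name>
          <category>MASTER</category>
          <cardinality>1</cardinality>
          <commandScript>
            <!-- lifecycle commands (install/start/stop/status)
                 are implemented in this script -->
            <script>scripts/master.py</script>
            <scriptType>PYTHON</scriptType>
          </commandScript>
        </component>
      </components>
    </service>
  </services>
</metainfo>
```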
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe the Satellite Cluster project, which allowed us to double the number of objects stored on one logical cluster by splitting an HDFS cluster into two partitions without federation and with practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
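The NameNode and balancer tuning areas above correspond to hdfs-site.xml properties such as the following; the values are illustrative starting points, not recommendations:

```xml
<!-- hdfs-site.xml: NameNode and balancer knobs of the kind discussed -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>200</value> <!-- RPC handler threads on a large NameNode -->
</property>
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value> <!-- split block reports per volume above this size -->
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>104857600</value> <!-- cap balancer traffic at 100 MB/s per DataNode -->
</property>
```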
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
The document discusses best practices for operating and supporting Apache HBase. It outlines tools like the HBase UI and HBCK that can be used to debug issues. The top categories of issues covered are region server stability problems, read/write performance, and inconsistencies. SmartSense is introduced as a tool that can help detect configuration issues proactively.
Running secured Spark job in Kubernetes compute cluster and integrating with ...
DataWorks Summit
This presentation will provide technical design and development insights to run a secured Spark job in Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and to access data from HDFS using Kerberos Principal and Delegation token.
The first part of this presentation will cover the design and best practices to deploy and run Spark in Kubernetes integrated with HDFS: creating an on-demand multi-node Spark cluster during job submission, installing and resolving software dependencies (packages), executing and monitoring the workload, and finally disposing of the resources on job completion. The second part covers the design and development details to set up a Spark-on-Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by creating and renewing Kerberos delegation tokens seamlessly from the end user's Kerberos principal.
All the techniques covered in this presentation are essential in order to set up a Spark+Kubernetes compute cluster that accesses data securely from distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
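A sketch of what such a submission can look like with stock Spark on Kubernetes; the image, principal, paths and API server address are placeholders, and the exact Kerberos-related options vary by Spark version:

```shell
spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=myrepo/spark:latest \
  --conf spark.kerberos.principal=etl-user@EXAMPLE.COM \
  --conf spark.kerberos.keytab=/secrets/etl-user.keytab \
  --conf spark.kubernetes.kerberos.krb5.configMapName=krb5-conf \
  hdfs://secure-cluster/jobs/etl.py
```

With a principal and keytab supplied, Spark can obtain and renew HDFS delegation tokens for long-running jobs rather than relying on a ticket that expires.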
Speaker
Joy Chakraborty, Data Architect
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Spark Summit
Alluxio, formerly Tachyon, is a memory-speed virtual distributed storage system that leverages memory for storing data and accelerating access to data in different storage systems. Many organizations and deployments use Alluxio with Apache Spark, and some of them scale out to petabytes of data. Alluxio can make Spark even more effective, in both on-premise and public cloud deployments. Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. In this talk, we briefly introduce Alluxio and present different ways Alluxio can help Spark jobs. We discuss best practices for using Alluxio with Spark, including RDDs and DataFrames, as well as on-premise and public cloud deployments.
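In the simplest integration, Spark jobs address data through an `alluxio://` URI instead of the backing store, so repeated reads are served from Alluxio's memory tier. The host, port and paths below are placeholders (19998 is Alluxio's default master port):

```shell
# Ship the Alluxio client jar with the job and read through Alluxio:
spark-submit \
  --jars /opt/alluxio/client/alluxio-client.jar \
  my_job.py alluxio://alluxio-master:19998/datasets/events.parquet
```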
Hortonworks provides best practices for system testing Hadoop clusters. It recommends testing across different operating systems, configurations, workloads and hardware to mimic a production environment. The document outlines automating the testing process through continuous integration to test over 15,000 configurations. It provides guidance on test planning, including identifying requirements, selecting hardware and workloads to test upgrades, migrations and changes to security settings.
Dancing elephants - efficiently working with object stores from Apache Spark ...
DataWorks Summit
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.
What are the secret settings to get maximum performance from queries against data living in cloud object stores? They span the filesystem client, the file format and the query engine layers, and even extend to how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk starts from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, covering the optimizations already done to enable this and the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
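One of those broken assumptions can be sketched in a few lines: a real filesystem renames a directory with a single metadata operation, while an object store has only a flat key space, so "renaming" a directory prefix (as job commit protocols do) degenerates into a copy-and-delete per object. A toy model:

```python
# Toy model: "renaming a directory" on an object store is a per-object
# copy + delete over a flat key space, not one atomic metadata operation.
def rename_prefix(store: dict, src: str, dst: str) -> int:
    """Simulate an object-store rename; return the number of copies made."""
    copies = 0
    for key in [k for k in store if k.startswith(src)]:
        # each object is copied to the new key, then the old key deleted
        store[dst + key[len(src):]] = store.pop(key)
        copies += 1
    return copies

store = {f"tmp/part-{i:04d}": b"data" for i in range(1000)}
copies = rename_prefix(store, "tmp/", "final/")
print(copies)  # 1000 operations for what HDFS does as a single rename
```

Besides the O(n) cost, a failure halfway through leaves the "directory" half-moved, which is exactly why naive rename-based commit is unsafe on object stores.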
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
Hadoop Operations - Best Practices from the Field
DataWorks Summit
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
Rich placement constraints: Who said YARN cannot schedule services?
DataWorks Summit
The rise in popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications require precise control of their placements by means of complex constraints. Examples of such scenarios are the following:
• Deep learning applications need to run on GPU machines with specific GPU models and driver/kernel versions.
• Hive or Spark applications benefit from being collocated on the same rack to reduce network cost and thus speed up their execution. At the same time, it is desirable to limit the number of allocations per machine to minimize resource interference.
• Low-latency services such as HBase need to be allocated across failure domains to improve their availability.
• A DNS service might need to run on machines with public IP addresses.
In this talk we present the brand new addition of expressive placement constraints in YARN. We show how applications can leverage such constraints to achieve complex placements, such as collocating their allocations on the same node/rack (affinity), spreading their allocations across nodes/racks (anti-affinity), or allowing up to a specific number of allocations per node group (cardinality) to strike a balance between the two. We describe real use cases from production clusters and show the benefits of placement constraints on large clusters using popular applications in both on-prem and cloud settings.
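For a feel of the syntax, YARN's distributed shell exposes these constraints through a placement specification; the tag names and counts below are illustrative, following the shape documented for the feature:

```shell
# zk=3,NOTIN,NODE,zk          -> 3 "zk" containers, at most one per node (anti-affinity)
# hbase=5,IN,RACK,zk          -> 5 "hbase" containers on racks that already run zk (affinity)
# spark=7,CARDINALITY,NODE,hbase,1,3 -> only on nodes with 1..3 hbase containers
yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -shell_command sleep -shell_args 600 \
  -placement_spec zk=3,NOTIN,NODE,zk:hbase=5,IN,RACK,zk:spark=7,CARDINALITY,NODE,hbase,1,3
```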
Speakers
Konstantinos Karanasos, Senior Scientist, Microsoft
Wangda Tan, Staff Software Engineer, Hortonworks
The document outlines Renault's big data initiatives from 2014-2016, including:
1. Starting with a big data sandbox in 2014 using an old HPC infrastructure for data exploration.
2. Implementing a DataLab in 2015 with a new HP infrastructure and establishing a first level of industrialization while improving data protection.
3. Creating a big data platform in 2016 to industrialize hosting both proofs of concept and production projects while ensuring data protection.
This document discusses security features in Apache Kafka including SSL for encryption, SASL/Kerberos for authentication, authorization controls using an authorizer, and securing Zookeeper. It provides details on how these security components work, such as how SSL establishes an encrypted channel and SASL performs authentication. The authorizer implementation stores ACLs in Zookeeper and caches them for performance. Securing Zookeeper involves setting ACLs on Zookeeper nodes and migrating security configurations. Future plans include moving more functionality to the broker side and adding new authorization features.
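On the broker side, these pieces come together in server.properties. Hostnames, paths and principals below are placeholders, and the authorizer class shown is the classic ZooKeeper-backed one from this era of Kafka:

```properties
# Listener and inter-broker traffic over Kerberos-authenticated TLS
listeners=SASL_SSL://broker1.example.com:9093
security.inter.broker.protocol=SASL_SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
# SASL/Kerberos authentication
sasl.enabled.mechanisms=GSSAPI
sasl.kerberos.service.name=kafka
# Authorization: ACLs stored in ZooKeeper, cached on the broker
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
super.users=User:kafka
# Secure the broker's ZooKeeper nodes with ACLs
zookeeper.set.acl=true
```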
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row- and column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance consistently. In this talk, we focus on SQL users and discuss how to provide row- and column-level access controls with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some rules are changed, all engines are controlled consistently in near real time. Technically, we enable the Spark Thrift Server to work with an identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger, using Ranger as the single point of security control.
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
DataWorks Summit
As Apache Hadoop clusters become central to an organization's operations, organizations increasingly run clusters in more than one data center. Historically, this has been largely driven by requirements of business continuity planning or geo-localization. It has also recently been gaining interest from a hybrid cloud perspective, wherein people augment their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities: how replication works, what it looks like at Hive scale, what challenges it poses, and what we have done to solve those issues. We will also cover what to be aware of in a given use case to make replication optimal.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently supports only MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
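The price-performance angle of a study like this reduces to a simple calculation: hourly cluster price times benchmark runtime. A minimal sketch, with made-up figures rather than the study's measurements:

```python
def cost_to_complete(hourly_price_usd, runtime_hours):
    """Cost to finish the benchmark on one offering: lower is better."""
    return hourly_price_usd * runtime_hours

def rank_by_price_performance(offerings):
    """offerings: {name: (hourly_price_usd, runtime_hours)}; cheapest run first."""
    return sorted(offerings, key=lambda name: cost_to_complete(*offerings[name]))

# Illustrative numbers only: a pricier cluster that finishes faster can still
# win on price-performance.
offerings = {"A": (10.0, 2.0), "B": (4.0, 6.0), "C": (8.0, 2.0)}
```

Here offering C (cost 16) ranks ahead of A (20) and B (24), even though B has the lowest hourly price.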
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Data proliferation and machine learning: The case for upgrading your servers ...Principled Technologies
Principled Technologies examined the performance improvements and cost savings associated with upgrading to the 16th Generation Dell PowerEdge R7625 for machine learning algorithms
Conclusion
As data proliferates and the sizes of databases grow, the potential to unlock valuable insights from them becomes increasingly dependent on fast architectures that can handle compute-intensive machine learning workloads such as k-means clustering and Bayesian inference. By upgrading to the latest servers, organizations can scale their processing power to meet the growing demands of their databases.
Larger databases and more powerful algorithms have the potential to give organizations a competitive edge. Faster servers can improve the accuracy of data-driven decisions by allowing organizations to use more complex algorithms and update ML models more frequently. To consider just two examples, improved performance could allow an e-commerce company to make better recommendations to customers and a financial services company to assess risks more accurately.
When we compared the machine learning performance of a 16G Dell PowerEdge R7625 server powered by 4th Gen AMD EPYC 64-core processors with Broadcom NICs and PERC 11 storage controllers to a previous-generation PowerEdge server, we found performance enhancements in terms of throughput and speed, whether running k-means clustering or Bayesian workloads. These findings suggest that organizations that rely on machine learning algorithms might gain performance advantages by upgrading to the latest generation of these Dell servers.
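The k-means clustering workload mentioned above is easy to sketch. This is a minimal, stdlib-only version of Lloyd's algorithm for illustration only, not the benchmark's implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroids[i] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dim))
    return centroids
```

The assignment step dominates the cost (every point against every centroid each iteration), which is why this workload scales so directly with CPU throughput.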
The Apache Spark config behind the industry's first 100TB Spark SQL benchmarkLenovo Data Center
Some configurations deserve their own SlideShare entry: this is one of them. When the industry's first 100TB Spark SQL benchmark was reached, the media took notice. For good reason.
Intel, Mellanox, Lenovo and IBM came together to investigate a topology that leveraged advances in CPU, memory, storage and networking to assess the readiness of Spark SQL to harness new capabilities -- and speeds.
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 in April 2018, loaded with features and improvements. One of the biggest challenges for any new major release of a software platform is compatibility. The Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are many challenges for admins to address while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session dives into the upgrade aspects in detail and provides a preview of migration strategies, with information on what works and what might not. The talk will cover the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
Speaker
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
1) Enterprises struggle to manage big data with existing technologies due to more systems, complexity, and data to handle.
2) HPE proposes a new "Sparkitecture" called the HPE Elastic Platform for Analytics to address these issues. It uses a data-centric foundation to consolidate all data and applications on a single, elastic platform for analytics workloads.
3) The platform offers workload-optimized systems that provide better performance, scalability, and economics than traditional Hadoop architectures.
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, different Hadoop file system, or internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using Ambari Metric System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
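The lifecycle commands described above follow a simple dispatch pattern. Real Ambari custom services subclass `Script` from Ambari's `resource_management` Python library; this stdlib-only sketch, with invented names, only illustrates the idea:

```python
# Conceptual sketch of an Ambari-style custom service: a script class exposes
# lifecycle hooks, and the agent dispatches whichever command the server sends.
class ServiceScript:
    def install(self, env):
        env["installed"] = True        # e.g. install packages from a repo

    def configure(self, env):
        env["configured"] = True       # e.g. write config files from templates

    def start(self, env):
        if not env.get("configured"):
            self.configure(env)        # start typically implies configure
        env["running"] = True

    def stop(self, env):
        env["running"] = False

    def status(self, env):
        return "STARTED" if env.get("running") else "INSTALLED"

def execute(script, command, env):
    """Dispatch a lifecycle command by name, as the agent would."""
    return getattr(script, command)(env)
```

A service definition then maps each command name to such a script, plus configs, dependencies, and alert definitions.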
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
The document discusses best practices for operating and supporting Apache HBase. It outlines tools like the HBase UI and HBCK that can be used to debug issues. The top categories of issues covered are region server stability problems, read/write performance, and inconsistencies. SmartSense is introduced as a tool that can help detect configuration issues proactively.
Running secured Spark job in Kubernetes compute cluster and integrating with ...DataWorks Summit
This presentation will provide technical design and development insights to run a secured Spark job in Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and to access data from HDFS using Kerberos Principal and Delegation token.
The first part of this presentation will present the design and best practices to deploy and run Spark in Kubernetes integrated with HDFS: creating an on-demand multi-node Spark cluster during job submission, installing and resolving software dependencies (packages), executing and monitoring the workload, and finally disposing of the resources on job completion. The second part covers the design and development details to set up a Spark-on-Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by creating and renewing Kerberos delegation tokens seamlessly from the end user's Kerberos principal.
All the techniques covered in this presentation are essential in order to set up a Spark+Kubernetes compute cluster that accesses data securely from distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
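The token-renewal scheduling at the heart of the second part can be sketched in a few lines. The 80% renewal heuristic and the function names are illustrative assumptions, not Spark's or Hadoop's actual renewal code:

```python
# Renew a delegation token well before it expires; a common heuristic is to
# renew at ~80% of the renew interval, and obtain a brand-new token with the
# Kerberos principal once the token's maximum lifetime is exhausted.
RENEW_FRACTION = 0.8   # assumption for illustration

def next_renewal(issued_at, lifetime_seconds):
    """Epoch time at which to renew a token issued at `issued_at`."""
    return issued_at + RENEW_FRACTION * lifetime_seconds

def renewal_times(issued_at, renew_interval, max_lifetime):
    """All renewal instants until the maximum lifetime is reached, after
    which renewal is no longer possible and a fresh token is needed."""
    times, t = [], issued_at + renew_interval
    while t < issued_at + max_lifetime:
        times.append(t)
        t += renew_interval
    return times
```

With HDFS-like defaults (24-hour renew interval, 7-day max lifetime), a long-running job must renew a handful of times and then re-authenticate, which is exactly the loop a Kubernetes-hosted driver has to automate.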
Speaker
Joy Chakraborty, Data Architect
Best Practices for Using Alluxio with Apache Spark with Gene PangSpark Summit
Alluxio, formerly Tachyon, is a memory-speed virtual distributed storage system that leverages memory for storing data and accelerating access to data in different storage systems. Many organizations and deployments use Alluxio with Apache Spark, and some of them scale out to petabytes of data. Alluxio can enable Spark to be even more effective, in both on-premises and public cloud deployments. Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. In this talk, we briefly introduce Alluxio and present different ways Alluxio can help Spark jobs. We discuss best practices for using Alluxio with Spark, including RDDs and DataFrames, as well as on-premises and public cloud deployments.
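The acceleration Alluxio provides can be illustrated with a toy read-through cache: repeated reads are served from a memory-speed tier, and only a miss touches the slow under-store. This sketch models the idea only and is not Alluxio's API:

```python
# Stdlib-only model of a read-through cache in front of a slow under-store
# (S3, HDFS, ...): first read of a path is a miss that promotes the data
# into the memory tier; subsequent reads are hits.
class ReadThroughCache:
    def __init__(self, understore):
        self.understore = understore   # dict-like stand-in for the slow store
        self.cache = {}                # the "memory tier"
        self.hits = self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        self.misses += 1
        data = self.understore[path]   # slow path: fetch from the under-store
        self.cache[path] = data        # promote into the memory tier
        return data
```

A Spark job that scans the same DataFrame input repeatedly benefits the same way: only the first pass pays the under-store's latency.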
Hortonworks provides best practices for system testing Hadoop clusters. It recommends testing across different operating systems, configurations, workloads and hardware to mimic a production environment. The document outlines automating the testing process through continuous integration to test over 15,000 configurations. It provides guidance on test planning, including identifying requirements, selecting hardware and workloads to test upgrades, migrations and changes to security settings.
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.
What are the secret settings to get maximum performance from queries against data living in cloud object stores? They live at the filesystem client, the file format, and the query engine layers. Even how you lay out the files matters: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk will start from the ground up "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code, and so causes performance issues which you don't get when working with HDFS. We'll look at the ways to get Apache Hive and Spark to work better, looking at optimizations which have been done to enable this —and what work is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
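One of the fundamental assumptions that breaks is rename: a filesystem renames a directory in O(1) metadata operations, while a flat object store must copy every byte and then delete the source. A toy model of that cost, for illustration only (not the s3a connector):

```python
# Model of an object store with a flat key space: "directory rename" is
# emulated by copying each object under the prefix and deleting the source,
# so its cost grows with the amount of data, not the number of entries.
class ObjectStore:
    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0   # proxy for the real cost of a "rename"

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src_prefix, dst_prefix):
        for key in [k for k in self.objects if k.startswith(src_prefix)]:
            data = self.objects.pop(key)
            self.objects[dst_prefix + key[len(src_prefix):]] = data
            self.bytes_copied += len(data)   # every byte is rewritten
```

Commit protocols that rely on cheap atomic renames (as many Hadoop output committers do) therefore behave very differently against such a store.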
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
Hadoop Operations - Best Practices from the FieldDataWorks Summit
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as in Azure HDInsight and Amazon EMR. In these settings - but also in more traditional, on-premises deployments - applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports (a)synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and active development is ongoing at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Rich placement constraints: Who said YARN cannot schedule services?DataWorks Summit
The rise in popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications require precise control of their placements by means of complex constraints. Examples of such scenarios are the following:
• Deep learning applications need to run on GPU machines with specific GPU models and driver/kernel versions.
• Hive or Spark applications benefit from being collocated on the same rack to reduce network cost and thus speed up their execution. At the same time, it is desirable to limit the number of allocations per machine to minimize resource interference.
• Low-latency services such as HBase need to be allocated across failure domains to improve their availability.
• A DNS service might need to run on machines with public IP addresses.
In this talk we present the brand new addition of expressive placement constraints in YARN. We show how applications can leverage such constraints to achieve complex placements, such as collocating their allocations on the same node/rack (affinity), spreading their allocations across nodes/racks (anti-affinity), or allowing up to a specific number of allocations per node group (cardinality) to strike a balance between the two. We describe real use cases from production clusters and show the benefits of placement constraints on large clusters using popular applications in both on-prem and cloud settings.
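The three constraint kinds above - affinity, anti-affinity, and cardinality - reduce to simple checks against an application's current allocations per node. A toy sketch with an invented representation, not YARN's actual API:

```python
# Evaluate a placement constraint for a candidate node, given how many of the
# application's containers each node already holds.
def satisfies(constraint, node_allocations, node):
    """constraint: (kind, limit); node_allocations: {node: container_count}."""
    kind, limit = constraint
    count = node_allocations.get(node, 0)
    if kind == "affinity":        # collocate: node must already host the app
        return count > 0
    if kind == "anti_affinity":   # spread: node must host none of the app
        return count == 0
    if kind == "cardinality":     # balance: allow at most `limit` per node
        return count < limit
    raise ValueError("unknown constraint kind: " + kind)
```

Cardinality is the middle ground the abstract mentions: with `limit=1` it degenerates to anti-affinity, and with a large limit it approaches unconstrained placement.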
Speakers
Konstantinos Karanasos, Senior Scientist, Microsoft
Wangda Tan, Staff Software Engineer, Hortonworks
The document outlines Renault's big data initiatives from 2014-2016, including:
1. Starting with a big data sandbox in 2014 using an old HPC infrastructure for data exploration.
2. Implementing a DataLab in 2015 with a new HP infrastructure and establishing a first level of industrialization while improving data protection.
3. Creating a big data platform in 2016 to industrialize hosting both proofs of concept and production projects while ensuring data protection.
This document discusses security features in Apache Kafka including SSL for encryption, SASL/Kerberos for authentication, authorization controls using an authorizer, and securing Zookeeper. It provides details on how these security components work, such as how SSL establishes an encrypted channel and SASL performs authentication. The authorizer implementation stores ACLs in Zookeeper and caches them for performance. Securing Zookeeper involves setting ACLs on Zookeeper nodes and migrating security configurations. Future plans include moving more functionality to the broker side and adding new authorization features.
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and discuss how to provide row/column-level access controls with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6, and Apache Hive 2.1. If some of the rules are changed, all engines are controlled consistently in near real time. Technically, we enable Spark Thrift Server to work with an identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering, and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control.
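Row filtering and column masking of the kind demonstrated can be sketched as policy functions applied per row. The rule format here is invented for illustration and is not Ranger's policy model:

```python
# Apply a row-level filter (a predicate that decides which rows a user may
# see) and per-column masking functions to a list of records.
def apply_policies(rows, row_filter=None, masks=None):
    masks = masks or {}
    out = []
    for row in rows:
        if row_filter and not row_filter(row):
            continue                      # row-level filter: drop the row
        out.append({col: (masks[col](val) if col in masks else val)
                    for col, val in row.items()})
    return out

def mask_last4(value):
    """Show only the last four characters, e.g. for IDs or card numbers."""
    return "*" * (len(value) - 4) + value[-4:]
```

The point of centralizing such rules in one control point is that every engine (Spark, Hive, LLAP) applies the same `row_filter` and `masks` for a given user, rather than each engine enforcing its own copy.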
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
Fundamentals of big data and Hadoop project design, with a case study and use case.
Covers general planning considerations and the essentials of the Hadoop ecosystem and Hadoop projects.
This provides the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with Wi-Fi log analysis as a real-life use case.
IBM POWER - An ideal platform for scale-out deploymentsthinkASG
IBM Power Systems is the ideal platform for scale-out deployments such as big data, SAP HANA, and anything else that requires heavy compute to achieve business goals, faster.
Azure Data Platform Services
HDInsight Clusters in Azure
Data Storage: Apache Hive, Apache HBase, Azure Data Catalog
Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
Healthcare / Life Sciences Use Cases
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...Megha Shah
This presentation aims to compare the performance of ETL pipeline using Spark and Hive under Azure. We will examine the features, strengths, and weaknesses of each tool, and provide recommendations on which one to use based on specific use cases.
If you're like most of the world, you're on an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winners in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from Petabytes to Exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long term view.
HPE plans to deliver a DAOS storage solution targeting version 2.0 to enable initial customer deployments. The reference implementation will include HPE servers, Intel Ice Lake CPUs with DCPMM and NVMe SSDs, Mellanox or HPE Slingshot switches, and customized Cray management software. HPE will host potential customers in their CTO lab to run proofs of concept and collect feedback to inform full productization planned with Sapphire Rapids.
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...Principled Technologies
The PowerEdge C6620 with PERC 12 delivered lower latency and higher throughput than an HPE ProLiant XL170r Gen9 server with an HPE Smart Array P440ar controller
Conclusion
Data proliferation today is rapid, and its growth shows no signs of stopping. For businesses that can take advantage of that data, there is tremendous potential value. One recent McKinsey study notes that “companies that are using data-driven B2B sales-growth engines report above-market growth and EBITDA increases in the range of 15 to 25 percent.” With data flooding in so quickly and in so many different forms, however, companies need high-performing big data solutions to have a chance at utilizing that data effectively.
We tested the performance of two platforms with a read-intensive Apache Cassandra big data workload to assess which might be better suited to speedily deliver the insights decision makers need. Compared to an older HPE ProLiant XL170r Gen9 server with an HPE Smart Array P440ar controller, the new Dell PowerEdge C6620 with Broadcom-based PERC 12 RAID controller delivered lower read and update latencies and more than twice the throughput. This improvement in performance can help you glean more value from your unstructured data more quickly. If you’re watching your stores of unstructured data grow but are still leaning on older servers for your critical Cassandra workloads, it may be time for an upgrade.
Application Report: Big Data - Big Cluster InterconnectsIT Brand Pulse
As a leading analytics platform that runs on industry-standard hardware and integrates industry-standard database tools and applications, one of ParAccel’s biggest challenges is to architect and test hardware (servers, storage, interconnects) that makes their software perform at its peak. In this case, they achieved their mission to eliminate a cluster bottleneck by implementing 10GbE NICs, providing the bandwidth needed today and well into the future.
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017Amazon Web Services
Researchers and IT professionals who use high-performance computing (HPC) and high-throughput computing (HTC) need a large scale infrastructure to move their research forward. This session provides reference architectures for running your workloads on AWS, which enable you to achieve scale on-demand and reduce your time to science. We debunk myths about HPC in the cloud and demonstrate techniques for running common on-premises workloads in the cloud.
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Principled Technologies
Efficiently running compute-heavy, Apache Hadoop big data workloads today might not translate to continued quality performance for growing data sets. Moving compute-intensive, Hadoop big data workloads to current-generation Dell EMC PowerEdge R640 servers powered by 2nd Generation Intel Xeon Scalable processors could allow your organization to better meet the data analysis challenges of today and have the resources to support growth. In our data center, a Hadoop cluster of PowerEdge R640 servers completed three big data workloads in less time by delivering greater throughput than a cluster of previous-generation Dell EMC PowerEdge R630 servers. Faster analysis of large data sets means getting insight into your organization, products, and services sooner, which could help your organization grow and beat its competition.
The Best Infrastructure for OpenStack: VMware vSphere and Virtual SANEMC
Side-by-side OpenStack benchmarking conducted by a third-party technical validation firm shows that infrastructure based on vSphere with VSAN performs better than a similar Red Hat stack built upon KVM and Red Hat Storage (GlusterFS). OpenStack workloads run faster on vSphere than on KVM and cost less - benefits of the hyperconverged Virtual SAN architecture.
Cost and performance comparison for OpenStack compute and storage infrastructurePrincipled Technologies
To be competitive in the cloud, businesses need more efficient hardware and software for their OpenStack environment and solutions that can pool physical resources efficiently for consumption and management through OpenStack. Software-defined storage has made it easier to create storage resource pools that spread across your datacenter, but external or separate distributed storage systems are still a challenge for many. VMware vSphere with Virtual SAN converges storage resources with the hypervisor, which can allow for management of an entire infrastructure through a single vCenter Server, increase performance, save space, and reduce costs.
Both solutions used the same twelve drive bays per storage node for virtual disk storage; however, the tiered design of VMware Virtual SAN allowed for greater performance in our tests. We used a mix of two SSDs and 10 rotational drives for the VMware Virtual SAN solution, while we used 12 rotational drives behind a battery-backed RAID controller for the Red Hat Storage Server solution, the Red Hat recommended approach.
In our testing, the VMware vSphere with Virtual SAN solution performed better than the Red Hat Storage solution in both real world and raw performance testing by providing 53 percent more database OPS and 159 percent more IOPS. In addition, the vSphere with Virtual SAN solution can occupy less datacenter space, which can result in lower costs associated with density. A three-year cost projection for the two solutions showed that VMware vSphere with Virtual SAN could save your business up to 26 percent in hardware and software costs when compared to the Red Hat Storage solution we tested.
MySQL and Spark machine learning performance on Azure VMs based on 3rd Gen AMD... - Principled Technologies
If your organization is one of the many that are shifting critical applications to the cloud, you know that cloud service providers offer a staggering number of virtual machine options. In your quest for the best performance, an important factor to consider is the processor that powers the VMs.
By upgrading from the legacy solution we tested to the new Intel processor-based Dell and VMware solution, you could do 18 times the work in the same amount of space. Imagine what that performance could mean to your business: Consolidate workloads from across your company, lower your power and cooling bills, and limit datacenter expansion in the future, all while maintaining a consistent user experience—the list of potential benefits is huge.
Try running DPACK, which can help you identify bottlenecks in your environment and inform you about your current performance needs. Then consider how the consolidation ratio we proved could be helpful for your company. The Intel processor-powered Dell PowerEdge R730 solution with VMware vSphere and Dell Storage SC4020, also powered by Intel, could be the right destination for your upgrade journey.
Do more Apache Cassandra distributed database work with AMD EPYC 7601 processors - Principled Technologies
Private clouds require an investment in hardware that can often be costly—a cost that grows along with the size of distributed database workloads a business deploys. The new AMD EPYC processor architecture can help ease that burden by increasing the available number of cores per socket, which could let businesses get more distributed database work done per server compared to a previous-generation Intel Xeon E5-2699 v4 processor architecture. In our tests, we found that a cluster based on 32-core AMD EPYC 7601 processors increased the operations per second an Apache Cassandra distributed database could process by 50 percent over a same-sized cluster based on 22-core Intel Xeon E5-2699 v4 processors. This means that businesses seeking to run these reliable, elastic databases on a private cloud setup could do so on an AMD EPYC 7601 processor-based server platform and experience faster updates and shorter data load times.
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution - Principled Technologies
A Dell EMC XC Series cluster featuring Nutanix software and powered by Toshiba PX05S SAS SSDs delivered strong database performance with a blend of structured and unstructured data
Kumar Ramaswamy provides services related to developing highly scalable and secure distributed systems using technologies like PostgreSQL, HDFS, Spark and Kafka. He has extensive experience architecting fault tolerant systems and has developed distributed systems for tasks like product tracking and advertisement data processing. His background includes work with technologies such as Unix, Java, distributed databases and big data platforms.
Performance of persistent apps on Container-Native Storage for Red Hat OpenSh... - Principled Technologies
The document summarizes benchmark testing of Container-Native Storage (CNS) on Red Hat OpenShift Container Platform using two different storage media: solid-state drives (SSDs) and hard disk drives (HDDs). It tested CNS under IO-intensive and CPU-intensive workloads. For the IO-intensive workload, the SSD configuration significantly outperformed the HDD configuration, achieving over 5 times the maximum performance. However, for the CPU-intensive workload, the two configurations achieved similar maximum performance levels, with SSDs having less than a 5% improvement. The testing demonstrated that CNS can provide scalable storage and that storage performance depends on understanding how it matches the workload characteristics.
Similar to Deploying Apache Spark and testing big data applications on servers powered by the AMD EPYC 7601 processor (20)
Help skilled workers succeed with Dell Latitude 7030 and 7230 Rugged Extreme ... - Principled Technologies
Instead of equipping consumer-grade tablets with rugged cases
Conclusion
In our hands-on testing, the Dell Latitude 7030 and 7230 Rugged Extreme Tablets showed that they are better equipped to help skilled workers than consumer-grade Apple iPad Pro and Samsung Galaxy Tab S9 tablets in multiple ways. They provide more built-in capabilities and features than the consumer-grade tablets we tested. And, while they were more expensive than the rugged-case fortified consumer-grade options we tested, their rugged claims were more than skin deep.
In our performance and durability tests, the Dell Latitude 7030 and 7230 Rugged Extreme Tablets performed better in demanding manufacturing, logistics, and field service environments than consumer-grade tablets with rugged cases. Both Rugged Extreme Tablets, with their greater thermal range, suffered less performance degradation in extreme temperatures, never failed and were merely scuffed after 26 hard drops, survived a 10-minute drenching with no ill effects, and were easier to view in direct sunlight than Apple iPad Pro and Samsung Galaxy Tab S9 tablets.
Bring ideas to life with the HP Z2 G9 Tower Workstation - Infographic - Principled Technologies
We compared CPU performance and noise output of an HP Z2 G9 Tower Workstation in High Performance Mode to a similarly configured Dell Precision 3660 Tower Workstation in its out-of-box performance mode
Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs.... - Principled Technologies
Conclusion
Diving into the world of GenAI has the potential to yield a great many benefits for your organization, but it first requires considering how best to implement those GenAI workloads. Whether your AI goals are to create a chatbot for online visitors, generate marketing materials, aid troubleshooting, or something else, implementing an AI solution requires careful planning and decision-making. A major decision is whether to host GenAI in the cloud or keep your data on premises. Traditional on-premises solutions can provide superior security and control, a substantial concern when dealing with large amounts of potentially sensitive data. But will supporting a GenAI solution on site be a drain on an organization’s IT budget?
In our research, we found that the value proposition is just the opposite: Hosting GenAI workloads on premises, either in a traditional Dell solution or using a managed Dell APEX pay-per-use solution, could significantly lower your GenAI costs over 3 years compared to hosting these workloads in the cloud. In fact, we found that a comparable AWS SageMaker solution would cost up to 3.8 times as much and an Azure ML solution would cost up to 3.6 times as much as GenAI on a Dell APEX pay-per-use solution. These results show that organizations looking to implement GenAI and reap the business benefits to come can find many advantages in an on-premises Dell solution, whether they opt to purchase and manage it themselves or choose a subscription-based Dell APEX pay-per-use solution. Choosing an on-premises Dell solution could save your organization significantly over hosting GenAI in the cloud, while giving you control over the security and privacy of your data as well as any updates and changes to the environment, and while ensuring your environment is managed consistently.
Workstations powered by Intel can play a vital role in CPU-intensive AI devel... - Principled Technologies
In three AI development workflows, Intel processor-powered workstations delivered strong performance, without using their GPUs, making them a good choice for this part of the AI process
Conclusion
We executed three AI development workflows on tower workstations and mobile workstations from three vendors, with each workflow utilizing only the Intel CPU cores, and found that these platforms were suitable for carrying out various AI tasks. For two of the workflows, we learned that completing the tasks on the tower workstations took roughly half as much time as on the mobile workstations. This supports the idea that the tower workstations would be appropriate for a development environment for more complex models with a greater volume of data and that the mobile workstations would be well-suited for data scientists fine-tuning simpler models. In the third workflow, we explored tower workstation performance with different precision levels and learned that using 16-bit floating point precision allowed the workstations to execute the workflow in less time and also reduced memory usage dramatically. For all three AI workflows we executed, we consider the time the workstations needed to complete the tasks to be acceptable, and believe that these workstations can be appropriate, cost-effective choices for these kinds of activities.
Enable security features with no impact to OLTP performance with Dell PowerEd... - Principled Technologies
Get comparable online transaction processing (OLTP) performance with or without enabling AMD Secure Memory Encryption and AMD Secure Encrypted Virtualization - Encrypted State
Conclusion
You’ve likely already implemented many security measures for your servers, which may include physical security for the data center, hardware-level security, and software-level security. With the cost of data breaches high and still growing, however, wise IT teams will consider what additional security measures they may be able to implement.
AMD SME and SEV-ES are technologies that are already available within your AMD processor-powered 16th Generation Dell PowerEdge servers—and in our testing, we saw that they can offer extra layers of security without affecting performance. We compared the online transaction processing performance of a Dell PowerEdge R7625 server, powered by AMD EPYC 9274F processors, with and without these two security features enabled. We found that enabling AMD Secure Memory Encryption and Secure Encrypted Virtualization-Encrypted State did not impact performance at all.
If your team is assessing areas where you might be able to enhance security without paying a large performance cost, consider enabling AMD SME and AMD SEV-ES in your Dell PowerEdge servers.
Improving energy efficiency in the data center: Endure higher temperatures wi... - Principled Technologies
In high-temperature test scenarios, a Dell PowerEdge HS5620 server continued running an intensive workload without component warnings or failures, while a Supermicro SYS‑621C-TN12R server failed
Conclusion: Remain resilient in high temperatures with the Dell PowerEdge HS5620 to help increase efficiency
Increasing your data center’s temperature can help your organization make strides in energy efficiency and cooling cost savings. With servers that can hold up to these higher everyday temperatures—as well as high temperatures due to unforeseen circumstances—your business can continue to deliver the performance your apps and clients require.
When we ran an intensive floating-point workload on a Dell PowerEdge HS5620 and a Supermicro SYS-621C-TN12R in three scenario types simulating typical operations at 25°C, a fan failure, and an HVAC malfunction, the Dell server experienced no component warnings or failures. In contrast, the Supermicro server experienced warnings in all three scenario types and experienced component failures in the latter two tests, rendering the system unusable. When we inspected and analyzed each system, we found that the Dell PowerEdge HS5620 server’s motherboard layout, fans, and chassis offered cooling design advantages.
For businesses aiming to meet sustainability goals by running hotter data centers, as well as those concerned with server cooling design, the Dell PowerEdge HS5620 is a strong contender to take on higher temperatures during day-to-day operations and unexpected malfunctions.
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe... - Principled Technologies
The 4th Generation Intel Xeon Scalable processor‑powered solution deployed in less than two hours and ran a Kubernetes container-based generative AI workload effectively
Conclusion
The appeal of incorporating GenAI into your organization’s operations is likely great. Getting started with an efficient solution for your next LLM workload or application can seem daunting because of the changing hardware and software landscape, but Dell APEX Cloud Platform for Red Hat OpenShift powered by 4th Gen Intel Xeon Scalable processors could provide the solution you need. We started with a Dell Validated Design as a reference, and then went on to modify the deployment as necessary for our Llama 2 workload. The Dell APEX Cloud Platform for Red Hat OpenShift solution worked well for our LLM, and by using this deployment guide in conjunction with numerous Dell documents and some flexibility, you could be well on your way to innovating your next GenAI breakthrough.
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware... - Principled Technologies
Compared to a cluster of PowerEdge R750 servers running VMware Cloud Foundation (VCF)
For organizations running clusters of moderately configured, older Dell PowerEdge servers with a previous version of VCF, upgrading to better-configured modern servers can provide a significant performance boost and more.
Realize 2.1X the performance with 20% less power with AMD EPYC processor-back... - Principled Technologies
Three AMD EPYC processor-based two-processor solutions outshined comparable Intel Xeon Scalable processor-based solutions by handling more Redis workload transactions and requests while consuming less power
Conclusion
Performance and energy efficiency are significant factors in processor selection for servers running data-intensive workloads, such as Redis. We compared the Redis performance and energy consumption of a server cluster in three AMD EPYC two-processor configurations against that of a server cluster in two Intel Xeon Scalable two-processor configurations. In each of our three test scenarios, the server cluster backed by AMD EPYC processors outperformed the server cluster backed by Intel Xeon Scalable processors. In addition, one of the AMD EPYC processor-based clusters consumed 20 percent less power than its Intel Xeon Scalable processor-based counterpart. Combining these measurements gave us power efficiency metrics that demonstrate how valuable AMD EPYC processor-based servers could be—you could see better performance per watt with these AMD EPYC processor-based server clusters and potentially get more from your Redis or other data-intensive applications and workloads while reducing data center power costs.
Improve performance and gain room to grow by easily migrating to a modern Ope... - Principled Technologies
We deployed this modern environment, then migrated database VMs from legacy servers and saw performance improvements that support consolidation
Conclusion
If your organization’s transactional databases are running on gear that is several years old, you have much to gain by upgrading to modern servers with new processors and networking components and an OpenShift environment. In our testing, a modern OpenShift environment with a cluster of three Dell PowerEdge R7615 servers with 4th Generation AMD EPYC processors and high-speed 100Gb Broadcom NICs outperformed a legacy environment with MySQL VMs running on a cluster of three Dell PowerEdge R7515 servers with 3rd Generation AMD EPYC processors and 25Gb Broadcom NICs. We also easily migrated a VM from the legacy environment to the modern environment, with only a few steps required to set up and less than ten minutes of hands-on time. The performance advantage of the modern servers would allow a company to reduce the number of servers necessary to perform a given amount of database work, thus lowering operational expenditures such as power and cooling and IT staff time for maintenance. The high-speed 100Gb Broadcom NICs in this solution also give companies better network performance and networking capacity to grow as they embrace emerging technologies such as AI that put great demands on networks.
Boost PC performance: How more available memory can improve productivity - Principled Technologies
With more memory available, system performance of three Dell devices increased, which can translate to a better user experience
Conclusion
When your system has plenty of RAM to meet your needs, you can efficiently access the applications and data you need to finish projects and to-do lists without sacrificing time and focus. Our test results show that with more memory available, three Dell PCs delivered better performance and took less time to complete the Procyon Office Productivity benchmark. These advantages translate to users being able to complete workflows more quickly and multitask more easily. Whether you need the mobility of the Latitude 5440, the creative capabilities of the Precision 3470, or the high performance of the OptiPlex Tower Plus 7010, configuring your system with more RAM can help keep processes running smoothly, enabling you to do more without compromising performance.
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg... - Principled Technologies
A Principled Technologies deployment guide
Conclusion
Deploying VMware Cloud Foundation 5.1 on next gen Dell PowerEdge servers brings together critical virtualization capabilities and high-performing hardware infrastructure. Relying on our hands-on experience, this deployment guide offers a comprehensive roadmap that can guide your organization through the seamless integration of advanced VMware cloud solutions with the performance and reliability of Dell PowerEdge servers. In addition to the deployment efficiency, the Cloud Foundation 5.1 and PowerEdge solution delivered strong performance while running a MySQL database workload. By leveraging VMware Cloud Foundation 5.1 and PowerEdge servers, you could help your organization embrace cloud computing with confidence, potentially unlocking a new level of agility, scalability, and efficiency in your data center operations.
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware... - Principled Technologies
Compared to a cluster of PowerEdge R750 servers running VMware Cloud Foundation 4.5
Conclusion
If your company is struggling with underperforming infrastructure, upgrading to 16th Generation Dell PowerEdge servers running VCF 5.1 could be just what you need to handle more database throughput and reduce vSAN latencies. We found that a Dell PowerEdge R760 server cluster running VCF 5.1 processed over 78 percent more TPM and 79 percent more NOPM than a Dell PowerEdge R750 server cluster running VCF 4.5. It’s also worth noting that the PowerEdge R750 cluster bottlenecked on vSAN storage, with max write latency at 8.9ms. For reference, the PowerEdge R760 cluster clocked in at 3.8ms max write latency. This higher latency is due in part to the single disk group per host on the moderately configured PowerEdge R750 cluster, while the better-configured PowerEdge R760 cluster supported four disk groups per host. As an additional benefit to IT admins, we also found that the embedded VMware Aria Operations adapter provided useful infrastructure insights.
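For readers who want to see how their own hosts compare to the single versus four disk-group layouts described above, vSAN exposes this through esxcli. A minimal sketch, assuming shell access to an ESXi host in the cluster (the grep pattern matches a field name in the command's output; your exact output formatting may vary by vSphere version):

```shell
# List all vSAN claimed devices on this host. Devices reporting
# "Is Capacity Tier: false" are cache-tier devices, and each cache
# device heads exactly one disk group, so counting them gives the
# number of disk groups this host contributes.
esxcli vsan storage list | grep -i "Is Capacity Tier"

# Show this host's vSAN cluster membership and state summary.
esxcli vsan cluster get
```

Running this across hosts before and after an upgrade is a quick way to confirm the storage layout change that drove the latency difference reported above.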
Based on our research using publicly available materials, it appears that Dell supports nine of the ten PC security features we investigated, HP supports six of them, and Lenovo supports three features.
Increase security, sustainability, and efficiency with robust Dell server man... - Principled Technologies
Compared to the Supermicro management portfolio
Conclusion
Choosing a vendor for server purchases is about more than just the hardware platform. Decision-makers must also consider more long-term concerns, including system/data security, energy efficiency, and ease of management. These concerns make the systems management tools a vendor offers as important as the hardware.
We investigated the features and capabilities of server management tools from Dell and Supermicro, comparing Dell iDRAC9 against Supermicro IPMI for embedded server management and Dell OpenManage Enterprise and CloudIQ against Supermicro Server Manager for one-to-many device and console management and monitoring. We found that the Dell management tools provided more comprehensive security, sustainability, and management/monitoring features and capabilities than the Supermicro tools did. In addition, Dell tools automated more tasks to ease server management, resulting in significant time savings for administrators versus having to do the same tasks manually with Supermicro tools.
When making a server purchase, a vendor’s associated management products are critical to protect data, support a more sustainable environment, and to ease the maintenance of systems. Our tests and research showed that the Dell management portfolio for PowerEdge servers offered more features to help organizations meet these goals than the comparable Supermicro management products.
Scale up your storage with higher-performing Dell APEX Block Storage for AWS ... - Principled Technologies
In our tests, Dell APEX Block Storage for AWS outperformed similarly configured solutions from Vendor A, achieving more IOPS, better throughput, and more consistent performance on both NVMe-supported configurations and configurations backed by Elastic Block Store (EBS) alone.
Dell APEX Block Storage for AWS supports a full NVMe backed configuration, but Vendor A doesn’t—its solution uses EBS for storage capacity and NVMe as an extended read cache—which means APEX Block Storage for AWS can deliver faster storage performance.
Scale up your storage with higher-performing Dell APEX Block Storage for AWS - Principled Technologies
Dell APEX Block Storage for AWS offered stronger and more consistent storage performance for better business agility than a Vendor A solution
Conclusion
Enterprises desiring the flexibility and convenience of the cloud for their block storage workloads can find fast-performing solutions with the enterprise storage features they’re used to in on-premises infrastructure by selecting Dell APEX Block Storage for AWS.
Our hands-on tests showed that compared to the Vendor A solution, Dell APEX Block Storage for AWS offered stronger, more consistent storage performance in both NVMe-supported and EBS-backed configurations. Using NVMe-supported configurations, Dell APEX Block Storage for AWS achieved 4.7x the random read IOPS and 5.1x the throughput on sequential read operations per node vs. Vendor A. In our EBS-backed comparison, Dell APEX Block Storage for AWS offered 2.2x the throughput per node on sequential read operations vs. Vendor A.
Plus, the ability to scale beyond three nodes—up to 512 storage nodes and up to 8 PB of capacity—enables Dell APEX Block Storage for AWS to help ensure performance and capacity as your team plans for the future.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
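As a concrete sketch of the SBOM and policy-check bullets above, Anchore's open-source tools syft and grype can be wired into a pipeline stage; this fragment is illustrative only (the image name is a placeholder, and the severity threshold is an example, not a value from the webinar):

```shell
# Generate an SBOM for a container image in SPDX JSON format.
# (registry.example.com/myapp:1.0 is a placeholder image name.)
syft registry.example.com/myapp:1.0 -o spdx-json > sbom.spdx.json

# Scan the SBOM for known vulnerabilities; exit non-zero (failing
# the pipeline) if anything at or above "high" severity is found.
grype sbom:./sbom.spdx.json --fail-on high
```

Because grype's exit code reflects the policy gate, the same two commands double as both the security-artifact producer (the SBOM and vulnerability report) and the automated policy check in a CI stage.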
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discussed the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint: a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests, and test automation can be used to speed up testing.
Building RAG with self-deployed Milvus vector database and Snowpark Container... - Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case share much more than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate for free software and for standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training efforts. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager; when not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (the origin of her nickname, deneb_alpha).
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Deploying Apache Spark and testing big data applications on servers powered by the AMD EPYC 7601 processor
Commissioned by AMD | November 2017
Your company has access to troves of data on your customers and how
they interact with your services. To take advantage of that raw data and
turn it into meaningful returns for your business requires the right big
data hardware and software solution. You may have begun looking for
servers capable of running intense big data software solutions, such as Apache Spark™.
Earlier this year, the AMD EPYC™ series of server processors entered
the market. We conducted a proof-of-concept study using one
of these servers with Apache Spark running the HiBench big data
benchmarking tool. In this document, we lead you through the process
of setting up such a solution, share the results of our HiBench testing,
and look at the AMD Zen architecture.
• Processed a 210GB Bayesian classification problem in 9m 13s
• Processed a 224GB Kmeans dataset in 3m 22s
• Counted the words in a 1,530GB database in 5m 47s
A Principled Technologies proof-of-concept study: Hands-on work. Real-world results.
Apache Spark helps you reap the value of big data
As the IT marketplace, and especially the cloud sector, continues to evolve rapidly, the volume of data in
existence has exploded and will continue to expand for the foreseeable future. Alongside traditional devices and
databases, a host of new Internet of things (IoT) platforms, mobile technologies, and apps are also generating
information. In fact, the International Data Corporation (IDC) predicts “digital data will grow at a compound
annual growth rate (CAGR) of 42% through 2020.” Put differently, the world’s digital data “will grow from about
1ZB in 2010 to about 50ZB in 2020.”2
Alongside this data growth, increasing disk, networking, and compute speeds for servers have converged
to enable hardware technologies that can quickly process much larger amounts of data than was previously
possible. A variety of software tools have emerged to help analysts make sense of this big data. Among these is
Apache Spark, a distributed, massively parallelized data processing engine that data scientists can use to query
and analyze large amounts of data.
While Apache Spark is often paired with traditional Hadoop® components, such as HDFS for file system storage,
it performs its real work in memory, which shortens analysis time and accelerates value for customers. Companies
across the industry now use Apache Spark in applications ranging from real-time monitoring and analytics to
consumer-focused recommendations on ecommerce sites.
In this study, we used Apache Spark with AMD EPYC-powered servers to demonstrate data analysis performance
using several subtests in the industry-standard HiBench tool.
The configuration we used for this proof-of-concept study
Hardware
We set up a Hortonworks Data Platform (HDP) cluster for Apache Spark using six servers. We set up one server as
an infrastructure or utility server, two servers as name nodes, and the remaining three servers (configured as data
nodes) as the systems under test. We networked all systems together using 25GbE LinkX™ cables, connecting to a Mellanox® SN2700 Open Ethernet switch.
The infrastructure server ran VMware® vSphere® and hosted the Ambari host VM and the Apache Spark client driver VM. We used a pair of two-socket AMD servers for name nodes. Each of these three supporting servers had one 25GbE connection to the network switch via a Mellanox ConnectX®-4 Lx NIC.
About the Mellanox SN2700 Open Ethernet switch
The Mellanox SN2700 100GbE switch we used is a Spectrum-based, 32-port, ONIE (Open Network Install
Environment)-based platform on which you can mount a variety of operating systems. According to Mellanox,
the switch lets you use 25, 40, 50 and 100GbE in large scale without changing power infrastructure facilities.
Learn more at http://www.mellanox.com/related-docs/prod_eth_switches/PB_SN2700.pdf
Our three systems under test (the data nodes in the cluster) were dual-socket AMD EPYC processor-powered
servers. They each had two AMD EPYC 7601 processors, 512 GB of DDR4-2400 RAM, and 24 Samsung® PM863
solid-state drives. We connected each server to the network switch with one 25GbE connection via a Mellanox
ConnectX-4 Lx NIC. We tuned the Mellanox cards using the high-throughput setting. We left BIOS settings as
default, and we placed all disks in HBA mode.
The following diagram shows the physical setup of our testbed:
Software
Operating system and prerequisites
We installed Red Hat® Enterprise Linux 7.3 on our infrastructure, name node, and data node systems, and updated the software through June 15, 2017. We disabled SELinux and the firewalls on all servers, and set the tuned profile to network-latency (a high-performance profile focused on low network latency). We also installed the Mellanox OFED 4.0-2.0.0.1-rhel7.3 driver. Finally, we installed OpenJDK 1.8.0 on all hosts.
Assembling the solution
We configured Red Hat Enterprise Linux to run HDP and HiBench. We set up passwordless SSH among all hosts
in the testbed and configured each server’s NTP client to use a common NTP server. Additionally, we configured
a hosts file for all hosts in the testbed and copied it to all systems. On the Ambari VM, we installed Ambari server
and used it to distribute HDP across all servers. On the client host, we installed HiBench. Finally, we ran the
HiBench tests on the cluster.
[Diagram: physical testbed — a Mellanox SN2700 100Gb switch connecting three AMD EPYC data nodes (3x LinkX 25GbE), two name nodes (2x LinkX 25GbE), and an ESXi™ infrastructure server hosting the Ambari host VM and the client VM]
About Samsung PM863a solid-state drives
Samsung PM863a 2.5-inch SATA SSDs leverage the company’s 3D (V-NAND) technology to provide
storage capacities of up to 3.84 TB without increasing the physical footprint of the drive. Note that the
server we tested used the earlier version, the PM863.
Learn more at http://www.samsung.com/semiconductor/minisite/ssd/product/enterprise/pm863a.html
Using HiBench to examine big data capabilities
Basic overview
HiBench is a benchmark suite that helps evaluate big data frameworks in terms of speed, throughput, and
system resource utilization. It includes Hadoop, Apache Spark, and streaming workloads.3
For our tests in this
study, we ran the following three tests:
• Bayes (big data dataset)
• Kmeans (big data dataset)
• Wordcount (big data dataset)
Running the workload
While HiBench allows users to select which of its multiple workloads to run, we used a general format that
runs any of them. We executed the tests against the cluster from the client system. We ran our tests using the
following general process:
1. Clear the PageCache, dentries, and inodes on all systems under test.
2. Navigate to the HiBench directory for the current test (e.g., ~/HiBench/bin/workloads/kmeans).
3. Run the prepare script to initialize the data used in the test.
4. Run the Apache Spark run script to perform the test (in Kmeans, that would be to sort the data we created
using the Kmeans grouping algorithm).
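The steps above can be sketched as a small driver script. HIBENCH_HOME, the workload path layout, and the amd1–amd3 host names are assumptions (they mirror the directory the text mentions and the data-node names used later in this report); adjust them for your installation.

```shell
#!/bin/bash
# Sketch of one HiBench run; paths and host names are assumptions.
HIBENCH_HOME=~/HiBench
WORKLOAD=kmeans              # bayes, kmeans, or wordcount

# Step 1: drop the page cache, dentries, and inodes on each system under test.
for host in amd1 amd2 amd3; do
  ssh "$host" 'sync; echo 3 > /proc/sys/vm/drop_caches'
done

# Steps 2 and 3: move to the workload directory and generate its input data.
cd "$HIBENCH_HOME/bin/workloads/$WORKLOAD"
./prepare/prepare.sh

# Step 4: run the Spark flavor of the workload and record the wall time.
time ./spark/run.sh
```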
Tuning information
On the data nodes, we found the following HiBench settings yielded the best results for the Apache Spark
workloads we tested:
• hibench.conf
  - hibench.default.map.parallelism 1080
  - hibench.default.shuffle.parallelism 1080
• spark.conf
  - hibench.yarn.executor.num 90
  - hibench.yarn.executor.cores 4
  - spark.executor.memory 13g
  - spark.driver.memory 8g
  - hibench.streambench.spark.storageLevel 0
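Collected into HiBench's two configuration files, the settings above look like this (the file locations follow HiBench's conf/ directory convention):

```
# conf/hibench.conf
hibench.default.map.parallelism        1080
hibench.default.shuffle.parallelism    1080

# conf/spark.conf
hibench.yarn.executor.num              90
hibench.yarn.executor.cores            4
spark.executor.memory                  13g
spark.driver.memory                    8g
hibench.streambench.spark.storageLevel 0
```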
Sizing information
During testing, we observed that AMD EPYC processor-powered servers operated at their peak when the workload
neared or exceeded the memory in the server. The large number of memory channels, combined with the SSDs in
the system directly controlled by the processors, can potentially help the EPYC processors perform well in high-memory-usage situations.
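In HiBench, the dataset size that produces this memory pressure is chosen with the scale profile; a minimal example (the profile names come from HiBench's stock hibench.conf, and the big data class matches the datasets used in the tests above):

```
# conf/hibench.conf -- dataset sizing
# Stock profiles: tiny, small, large, huge, gigantic, bigdata
hibench.scale.profile    bigdata
```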
Results from testing
We ran three of the more popular HiBench Spark tests with the settings we mentioned previously. In the table below,
we summarize the time and throughput results for each test. We then discuss each set of results in greater detail.
Benchmark    Time to complete (seconds)    Throughput (MB per second)
Bayes        552.873                       408.702
Kmeans       207.245                       1,162.787
Wordcount    346.843                       4,735.449
Bayes
The Bayesian classification test, which HiBench refers to as Bayes, uses the Naïve Bayesian trainer to classify
data. Like many Apache Spark workloads, it has one large disk-usage spike in the beginning and another at the
end, but primarily taxes a server’s CPU. It is also RAM intensive, which means that as database size increases,
the large number of AMD EPYC memory channels comes into play. The graph below, from the HiBench report,
shows the CPU usage during the test.
[Graph: Summarized CPU utilization (idle, user, system, iowait, others; percent) during the Bayes test, 0:00–9:10]
The graph below shows the disk usage during the test.
[Graph: Summarized disk throughput (MB read/sec and MB written/sec) during the Bayes test, 0:00–9:10]
Kmeans
Many companies use the K-means clustering algorithm, which HiBench refers to as Kmeans, to help separate
data into useful categories. Companies can use this algorithm to help segment the market for targeted
advertising, or to generate product suggestions on websites based on a user’s habits. Like the Bayes test, it is
very CPU intensive. Unlike Bayes, after performing the initial analysis, Kmeans runs subsequent, smaller checks to
eventually arrive at its conclusion, as the graph below shows.
Kmeans has even less disk activity than Bayes, involving only a very large spike in the beginning as the workload
distributes itself among the nodes. The graph below illustrates this.
[Graph: Summarized CPU utilization (idle, user, system, iowait, others; percent) during the Kmeans test, 0:00–3:20]
[Graph: Summarized disk throughput (MB read/sec and MB written/sec) during the Kmeans test, 0:00–3:20]
Wordcount
Wordcount is the simplest HiBench test we ran. One real-world application of Wordcount is to create word
clouds. These are a way of visually representing the number of times a given body of text, such as a website or
set of documents, uses terms. The more frequently a term occurs in the text, the larger and more prominently
the term appears in the word cloud. The Wordcount benchmark counts the number of times each word appears
in a randomized dataset. While Wordcount remains CPU intensive, when the database size exceeds the RAM on
the systems, the number of memory channels and the storage bandwidth of the AMD systems can come into
play. The chart below shows CPU utilization during the Wordcount run.
As the chart below shows, while the Wordcount benchmark runs, a large amount of read I/O occurs in the system.
[Graph: Summarized CPU utilization (idle, user, system, iowait, others; percent) during the Wordcount test, 0:00–5:25]
[Graph: Summarized disk throughput (MB read/sec and MB written/sec) during the Wordcount test, 0:00–5:25]
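Conceptually, the counting this benchmark performs is the classic word count, which can be sketched as a single-machine shell pipeline (this illustrates the idea only, not HiBench itself; sample.txt is a placeholder file name):

```shell
# Count word occurrences in a text file and show the most frequent ones.
tr -s '[:space:]' '\n' < sample.txt | sort | uniq -c | sort -rn | head
```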
AMD EPYC architecture overview
The AMD EPYC processor has up to 32 cores and 64 threads, and supports up to 2 TB of DDR4 memory spread across eight channels. This high core density contributed to the ability to process data quickly in Apache Spark.
To better understand this high core density, let’s look at the components that go into it. Note that the information in
this section comes from publicly available materials from the AMD website.
Building blocks of the EPYC 7601
The primary building block is the Zen core. Each Zen core has a 64KB L1 instruction cache and a 32KB L1 data cache, as well as a private 512KB instruction and data (I+D) L2 cache and access to a shared L3 cache.
Four Zen cores, along with 8 MB of L3 caching, make up a single CPU Complex (CCX).1
A Zeppelin die comprises two CCXs, meaning that each Zeppelin die has eight Zen cores.
[Diagram: a CPU Complex (CCX) — four Zen cores (Core 0–3), each with 512KB of L2 cache, sharing 8 MB of L3 (eight 1MB slices)]
[Diagram: a Zeppelin die — two CCXes (eight Zen cores), with two DDR4 channels, 2x16 PCIe lanes, and coherent links]
On a 7000-series processor, four Zeppelin dies connect to each other with AMD Infinity Fabric (IF) interconnect to
form the AMD EPYC processor. As we show below, one EPYC processor comprises eight CCXes, 32 cores, and
64 threads.
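As a sanity check, the package arithmetic above can be worked through and compared against what a running system reports (the lscpu field names may vary slightly by distribution):

```shell
#!/bin/bash
# Core arithmetic for one EPYC 7601 package, using the figures above.
cores_per_ccx=4
ccx_per_die=2
dies_per_package=4
threads_per_core=2

cores=$(( cores_per_ccx * ccx_per_die * dies_per_package ))
threads=$(( cores * threads_per_core ))
echo "$cores cores, $threads threads"    # prints: 32 cores, 64 threads

# On a live server, compare against the OS view of the topology:
# lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|NUMA node\(s\)'
```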
Single socket
In a single-socket configuration, the EPYC chip has eight
DDR4 channels and 128 PCIe lanes.
Dual socket
In a dual-socket configuration, half of the PCIe lanes are
for the IF between both sockets. This configuration has
16 DDR4 channels and 128 PCIe lanes.
[Diagram: single-socket and dual-socket EPYC packages — four Zeppelin dies per package joined by coherent Infinity Fabric links; each die exposes two DDR4 channels and 2x16 PCIe lanes. In the dual-socket layout, x16 links between the two packages carry the socket-to-socket coherent traffic]
Performance and power management
EPYC chips have several features for performance and power management. Performance Determinism sets
consistent performance across all CPUs, though the power draw of each processor may vary slightly with
manufacturing variance. Power Determinism sets consistent power usage across all processors, though the
performance of each processor may also vary. Being able to manage power usage can be useful in datacenters
where it is necessary to control the power draw of servers.
Security and encrypted memory
Though we did not test them in our datacenter, the EPYC family of processors contains several security features:
• AMD Secure Root-of-Trust technology prevents booting any software that is not cryptographically signed.
• AMD Secure Run Technology encrypts software and data in memory.
• AMD Secure Move technology lets you securely migrate virtual machines and containers between EPYC
processor-based servers.
The EPYC family supports encrypted memory features:
• Transparent mode encrypts all memory pages. You can enable transparent mode in the BIOS without having
to change the OS or individual applications.
• In Secure Encrypted Virtualization (SEV) mode, you can encrypt specific virtual machines without having to
change individual applications.
Memory capacity
Each AMD EPYC processor supports 2 TB of memory, equivalent to 4 TB per two-socket server. The closest
competitor supports 1.5 TB of memory per processor, equivalent to 3 TB per two-socket server. This means that
supporting 12 TB of memory would require three AMD EPYC processor-powered two-socket servers and four of the
competitor’s two-socket servers.
Conclusion
To get the most out of the heaps of data your company is sitting on, you’ll need a platform such as Apache Spark
to sort through the noise and get meaningful conclusions you can use to improve your services. If you need to get
results from such an intense workload in a reasonable amount of time, your company should invest in a solution
whose power matches your level of work.
This proof of concept has introduced you to a new solution based on the AMD EPYC line of processors. Based
on the new Zen architecture, the AMD EPYC line of processors offers resources and features worth considering. In
the Principled Technologies datacenter, we set up a big data solution consisting of whitebox servers powered by
the AMD EPYC 7601—the top-of-the-line offering from AMD. We ran an Apache Spark workload and tested the
solution with three components of the HiBench benchmarking suite. The AMD systems maintained a consistent
level of performance across these tests.
1 Core count per CCX decreases for lower-bin processors in the EPYC line.
2 Source: “Digital Data Storage is Undergoing Mind-Boggling Growth,” accessed July 25, 2017, http://www.eetimes.com/author.asp?section_id=36&doc_id=1330462
3 Source: “HiBench benchmark,” accessed July 26, 2017, https://github.com/intel-hadoop/HiBench
On July 5, 2017, we finalized the hardware and software configurations we tested. Updates for current and
recently released hardware and software appear often, so unavoidably these configurations may not represent
the latest versions available when this report appears. For older systems, we chose configurations representative
of typical purchases of those systems. We concluded hands-on testing on July 25, 2017.
Appendix A: System configuration information
Server configuration information
BIOS name and version AMD WSW7802N_PLAT-
Non-default BIOS settings NA
Operating system name and version/build number Red Hat Enterprise Server 7.3 / 3.10.0-514.16.1.el7.x86_64
Date of last OS updates/patches applied 06/15/2017
Power management policy High Performance
Processor
Number of processors 2
Vendor and model AMD EPYC 7601
Core count (per processor) 32
Core frequency (GHz) 2.20
Stepping 1
Memory module(s)
Total memory in system (GB) 512
Number of memory modules 32
Vendor and model Samsung M393A2K40BB1-CRC0Q
Size (GB) 16
Type PC4-2400T
Speed (MHz) 2,400
Speed running in the server (MHz) 2,400
Storage controller
Vendor and model AMD FCH SATA Controller
Driver version 3.0
Local storage
Number of drives 24
Drive vendor and model Samsung PM863
Drive size (GB) 120
Drive information (speed, interface, type) 6Gb, SATA, SSD
Server configuration information
Network adapter
Vendor and model Mellanox ConnectX-4 Lx
Number and type of ports 2 x 25GbE
Driver version 4.0-2.0.0.1
Cooling fans
Vendor and model Delta Electronics GFC0812DS
Number of cooling fans 4
Power supplies
Vendor and model LITEON PS-2112-5Q
Number of power supplies 2
Wattage of each (W) 1200
Appendix B: How we tested
We performed the following steps to install and configure the testbed for Hortonworks Data Platform and HiBench. Your environment may
differ slightly, so keep that in mind when setting IPs and naming hosts.
Installing the operating system
1. Insert the Red Hat Enterprise Linux 7.3 installation media, and power on the system.
2. Select Install Red Hat Enterprise Linux 7.3, and press Enter.
3. Select English, and click Continue.
4. Click NETWORK & HOST NAME.
5. Click ON for the 25Gb Ethernet connection you are using.
6. Select the 25Gb Ethernet connection you are using, and click Configure.
7. Click IPv4 Settings.
8. Change Method to Manual.
9. Click Add, and type your IP address, netmask, and gateway.
10. Click Save.
11. Change your host name to an appropriate host name based on the role of your system, then click Apply.
12. Click Done to exit the network configuration.
13. Click INSTALLATION DESTINATION.
14. Select I will configure partitioning, and click Done.
15. Select Standard Partition, and then click Click here to create them automatically.
16. Modify the swap to 4 GB, then click Done.
17. When the summary of changes appears, click Accept Changes.
18. Click Begin Installation.
19. Click ROOT PASSWORD.
20. Type the root password, and confirm it. Click Done.
21. When the installation finishes, click Reboot.
22. Repeat steps 1–21 for the remaining servers in the test bed.
Configuring the operating system
1. Log in to the Red Hat console.
2. Disable selinux by typing the following commands:
setenforce 0
sed -i 's/SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
3. Disable the firewall by typing the following commands:
systemctl stop firewalld
systemctl disable firewalld
4. Configure default file permissions by typing the following command:
echo umask 0022 >> /etc/profile
5. Either register your system with Red Hat or configure your yum patch repositories to operate off of a locally accessible repository.
6. Remove chrony from your system by typing the following command:
yum remove -y chrony
7. Update the system by typing the following command:
yum update -y
8. Reboot the system.
9. Install the prerequisites for Ambari and HDP by typing the following command:
yum install -y ntp time xfsprogs tuned wget vim nfs-utils openssh-clients man zip unzip numactl
sysstat bc lzop xz libhugetlbfs python numpy blas64 lapack64 gtk2 atk cairo gcc-gfortran tcsh lsof
tcl tk java-1.8.0-openjdk-devel perl-Data-Dumper.x86_64
10. Enable NTP by typing the following commands:
sed -i '/^server [^ ]* iburst/d' /etc/ntp.conf
echo "server [your NTP server IP address] iburst" >> /etc/ntp.conf
systemctl enable ntpd
systemctl start ntpd
11. Reboot the system.
12. Download the latest version of the Mellanox OFED from the Mellanox website.
13. Install the Mellanox OFED by typing the following commands:
tar -xf MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.3-x86_64.tgz
cd MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.3-x86_64
./mlnxofedinstall --all --force
14. Clear the SSH settings by typing the following commands:
mkdir -p /root/.ssh
chmod 700 /root/.ssh
cd /root/.ssh
for f in *; do mv "$f" "$f.orig"; done
15. Create an SSH private key for all hosts by typing the following commands:
ssh-keygen -t rsa -q
cp id_rsa.pub authorized_keys
echo "StrictHostKeyChecking=no" > config
16. Copy the key to the other hosts.
17. Create a hosts file containing a FQDN, nickname, and IP address of every host you plan to use in the cluster. Copy the file out to all of
your hosts.
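A hosts file for this testbed might look like the following; every IP address and the hdp.local domain here are illustrative placeholders (the host names mirror the ones used during the Ambari install later in this appendix):

```shell
# Append cluster entries to /etc/hosts; all addresses below are placeholders.
cat >> /etc/hosts <<'EOF'
192.168.1.11  namenode1.hdp.local  namenode1
192.168.1.12  namenode2.hdp.local  namenode2
192.168.1.21  amd1.hdp.local       amd1
192.168.1.22  amd2.hdp.local       amd2
192.168.1.23  amd3.hdp.local       amd3
192.168.1.31  client1.hdp.local    client1
EOF

# Copy the same file to every other host, for example:
for h in namenode1 namenode2 amd1 amd2 amd3 client1; do
  scp /etc/hosts "$h:/etc/hosts"
done
```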
18. On the hosts that will be used as data nodes, run the following script to format and prepare the drives you are using for HDFS, making
sure to modify the SKIP_LETTER variable for the drive your operating system is running on:
#!/bin/bash
FS=xfs
COUNT=0
SKIP_LETTER="w"
umount /grid/*
rmdir /grid/*
rmdir /grid
mkdir -p /grid
for i in {a..x}; do
  if [ "$SKIP_LETTER" = "$i" ]; then
    continue
  fi
  DEV=/dev/sd$i
  GRID=/grid/$COUNT
  echo $DEV $COUNT
  mkdir -p $GRID
  #dd if=/dev/zero of=$DEV bs=1M count=10 oflag=direct
  sync
  mkfs.${FS} $DEV
  echo -e "`blkid -p $DEV | awk '{print $2}'`\t${GRID}\t${FS}\tdefaults,noatime,nodiratime,nofail,x-systemd.device-timeout=60\t0 0" >> /etc/fstab
  ((COUNT++))
done
19. On the data nodes, run vim /etc/default/grub and modify the GRUB_CMDLINE_LINUX setting by adding iommu=pt to the end of
the line.
20. Configure the data nodes to run in high throughput performance mode by typing the following command:
tuned-adm profile throughput-performance
21. Configure the data nodes to run the Mellanox cards in high performance mode by typing the following command (you will need to
repeat this each time you reboot):
mlnx_tune -p HIGH_THROUGHPUT
Installing Ambari
22. Log into your Ambari host.
23. Add the Ambari repository to your yum repos by typing the following command:
wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.5.0.3/ambari.repo -O /etc/yum.repos.d/ambari.repo
24. Navigate to your OpenJDK directory and start the Ambari server setup by typing the following command:
ambari-server setup
25. To accept the default of not customizing a user account for the ambari-server daemon, press Enter.
26. When prompted, choose a custom JDK by typing 3 and pressing Enter.
27. To accept the default of not entering advanced database configuration, press Enter.
28. Type ambari-server start to start up the Ambari host.
Installing Hortonworks Data Platform on the cluster
1. Open a web browser, and navigate to the Ambari host’s website.
2. Log into the Ambari host website.
3. In Welcome to Apache Ambari, click Launch Install Wizard.
4. In the Get Started page, name your cluster, and click Next.
5. In Select Version, select the latest version, check Use Public Repository, and click Next.
6. In Install Options, enter the hosts you plan to use. For example:
namenode[1-2].hdp.local
amd[1-3].hdp.local
client1.hdp.local
7. Copy the contents of the SSH private key you created previously into the host registration information. Write root as the User Account
and 22 as the port number.
8. Click Register and Confirm.
9. When the Host name pattern expressions window appears, verify that all of your hosts are there, and click OK.
10. In the Confirm Hosts window, verify that all of the hosts you intend to install on are there, and click Next.
11. In Choose Services, uncheck the following services:
• Accumulo
• Atlas
• Falcon
• Flume
• HBase
• Sqoop
• Oozie
Leave the rest of the services at their defaults, and click Next.
12. In Assign Masters, assign services to the following master servers:
• namenode1
◦ NameNode
◦ ZooKeeper Server
◦ DRPC Server
◦ Storm UI Server
◦ Nimbus
◦ Infra Solr Instance
◦ Metrics Collector
◦ Grafana
◦ Kafka Broker
◦ Knox Gateway
◦ HST Server
◦ Activity Explorer
◦ Activity Analyzer
◦ Spark History Server
◦ Spark2 History Server
• namenode2
◦ SNameNode
◦ App Timeline Server
◦ ResourceManager
◦ History Server
◦ WebHCat Server
◦ Hive Metastore
◦ HiveServer2
◦ ZooKeeper Server
13. In Assign Slaves and Clients, apply the slave and client components to the following hosts:
• amd1-3
◦ DataNode
◦ NodeManager
◦ Supervisor
• client1
◦ Client
14. In Customize Services, perform the following actions:
a. Enter a database password for the Hive Metastore.
b. Enter a password for the Grafana Admin.
c. Enter a password for the Knox Master Secret.
d. Enter a password for the admin in SmartSense.
e. In the HDFS section, add the following line to the DataNode directories textbox:
/grid/0,/grid/1,/grid/2,/grid/3,/grid/4,/grid/5,/grid/6,/grid/7,/grid/8,/grid/9,/grid/10,/grid/11,/grid/12,/grid/13,/grid/14,/grid/15,/grid/16,/grid/17,/grid/18,/grid/19,/grid/20,/grid/21,/grid/22
15. When warned in the Configurations window, click Proceed Anyway.
16. In Review, verify your settings and click Deploy.
17. In Install, Start and Test, click Next when the deployment completes.
18. In Summary, click Complete.
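The DataNode directories value entered in step 14e can be generated instead of typed by hand; a sketch assuming the 23 HDFS mounts created earlier at /grid/0 through /grid/22:

```shell
# Build the comma-separated DataNode directories string from the mount numbering.
DIRS=$(seq -s, 0 22 | sed 's|[0-9][0-9]*|/grid/&|g')
echo "$DIRS"
```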
Installing HiBench
1. Log into your client host.
2. Install the prerequisite packages to the client by typing the following commands:
yum install -y maven git vim numpy blas64 lapack64
echo "/usr/hdp/current/hadoop-client/lib/native" > /etc/ld.so.conf.d/hadoop-client-native.conf
3. Change to the HDFS user, create the HiBench directory in the dfs, and set permissions by typing the following commands:
su - hdfs
hdfs dfs -mkdir /HiBench
hdfs dfs -chown -R root:hadoop /HiBench
hdfs dfs -mkdir /user/root
hdfs dfs -chown root /user/root
exit
4. Edit the bash profile to automatically include the HiBench prerequisites by typing the following commands:
vim ~/.bash_profile
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export SPARK_HOME=/usr/hdp/current/spark2-client
export KAFKA_HOME=/usr/hdp/current/kafka-broker
export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native
5. Download the HiBench benchmark from GitHub by typing the following commands:
cd /root
git clone https://github.com/intel-hadoop/HiBench.git
6. Change to the newly downloaded HiBench directory and type the following command:
mvn -Dspark=2.1 -Dscala=2.11 clean package | tee hibench_build.log
7. Modify the 99-user_defined_properties.conf file by changing the following lines:
hibench.hadoop.home /usr/hdp/current/hadoop-client
hibench.spark.home /usr/hdp/current/spark2-client
hibench.hadoop.mapreduce.home /usr/hdp/current/hadoop-mapreduce-client
Configuring Hortonworks Data Platform
We made the following changes to the service settings via the Ambari console:
• HDFS
◦ NameNode
▪ NameNode Java heap size - 5.25 GB
▪ NameNode Server threads - 1800
◦ DataNode
▪ DataNode directories - each of the drives on the server
▪ DataNode failed disk tolerance - 2
◦ Advanced
▪ NameNode new generation size - 672 MB
▪ NameNode maximum new generation size - 672 MB
• YARN
◦ Memory
▪ Memory allocated for all YARN containers on a node - 480 GB
▪ Maximum Container Size (Memory) - 491520 MB
◦ CPU
▪ Percentage of physical CPU allocated for all containers on a node - 100%
▪ Number of virtual cores - 128
▪ Maximum Container Size (VCores) - 16
Configuring HiBench
We used the following settings across the tests we ran:
hibench.conf
hibench.default.map.parallelism 1080
hibench.default.shuffle.parallelism 1080
hibench.scale.profile bigdata
spark.conf
hibench.yarn.executor.num 90
hibench.yarn.executor.cores 4
spark.executor.memory 13g
spark.driver.memory 8g
hibench.streambench.spark.storageLevel 0
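A quick way to sanity-check these values: 90 executors at 4 cores each need 360 vcores, which fits within the 3 data nodes × 128 vcores configured in YARN, and the executor memory footprint should fit within the 480 GB allocated per node. The overhead figure below is an assumption (YARN adds roughly 10% per executor on top of spark.executor.memory):

```shell
# Rough resource check for the executor settings above; OVERHEAD_GB is an assumed value.
EXECUTORS=90; CORES=4; MEM_GB=13; OVERHEAD_GB=2
NODES=3; NODE_MEM_GB=480; NODE_VCORES=128
echo "vcores needed: $(( EXECUTORS * CORES )) of $(( NODES * NODE_VCORES ))"
echo "memory needed: $(( EXECUTORS * (MEM_GB + OVERHEAD_GB) )) GB of $(( NODES * NODE_MEM_GB )) GB"
```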
We modified the Bayes test to better take advantage of the memory in the AMD servers by increasing the size of the workload to 60 million
pages and 60,000 classes.
hibench.bayes.bigdata.pages 60000000
hibench.bayes.bigdata.classes 60000
Principled Technologies is a registered trademark of Principled Technologies, Inc.
All other product names are the trademarks of their respective owners.
DISCLAIMER OF WARRANTIES; LIMITATION OF LIABILITY:
Principled Technologies, Inc. has made reasonable efforts to ensure the accuracy and validity of its testing; however, Principled Technologies, Inc. specifically disclaims
any warranty, expressed or implied, relating to the test results and analysis, their accuracy, completeness or quality, including any implied warranty of fitness for any
particular purpose. All persons or entities relying on the results of any testing do so at their own risk, and agree that Principled Technologies, Inc., its employees and its
subcontractors shall have no liability whatsoever from any claim of loss or damage on account of any alleged error or defect in any testing procedure or result.
In no event shall Principled Technologies, Inc. be liable for indirect, special, incidental, or consequential damages in connection with its testing, even if advised of the
possibility of such damages. In no event shall Principled Technologies, Inc.’s liability, including for direct damages, exceed the amounts paid in connection with Principled
Technologies, Inc.’s testing. Customer’s sole and exclusive remedies are as set forth herein.
This project was commissioned by AMD.
Principled Technologies®
Facts matter.®
Running HiBench
1. Log into the first data node.
2. Clear the PageCache, dentries, and inodes by typing the following command:
echo 3 > /proc/sys/vm/drop_caches
3. Repeat steps 1–2 for the remaining data nodes.
4. Log into the client host.
5. Navigate to the bin/workload/[benchmark] directory in the HiBench directory.
6. Initialize the data by typing the following command:
./prepare/prepare.sh
7. After initializing the data, wait 5 minutes.
8. Run the benchmark by typing the following command:
./spark/run.sh
9. When the test completes, record the results.
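Steps 1 through 3 can be scripted from the client rather than logging into each data node in turn. The sketch below prints the cache-clearing command for each node so you can review it before piping to sh; the hostnames are assumptions matching the example cluster layout above.

```shell
# Hypothetical data-node names; adjust to your cluster.
nodes="amd1 amd2 amd3"
# Print the cache-drop command for each data node.
for node in $nodes; do
  echo "ssh root@${node}.hdp.local 'sync; echo 3 > /proc/sys/vm/drop_caches'"
done
```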