Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and Omni-Path. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe)-based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to a 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr... – ScyllaDB
To maximize the benefits of ScyllaDB, you must adapt the structure of your data. Data modeling for ScyllaDB should be query-driven based on your access patterns – a very different approach than normalization for SQL tables. In this session, you will learn how tools can help you migrate your existing SQL structures to accelerate your digital transformation and application modernization.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of JVM, but Python is one of the officially supported languages. But how does it actually work? How can Python communicate with Java / Scala? In this talk, we’ll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
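For orientation, a minimal PySpark sketch of the setup the talk dissects: the Python driver holds a SparkSession that talks to the JVM through a Py4J gateway, while the actual data processing runs on the JVM executors. The input path below is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The Python driver starts (or attaches to) a JVM and talks to it through Py4J.
spark = SparkSession.builder.appName("pyspark-internals-sketch").getOrCreate()

# DataFrame operations are just logical plans sent to the JVM; Catalyst/Tungsten
# execute there, so no data is shipped back to Python until an action like show().
df = spark.read.json("events.json")  # illustrative input path
summary = df.groupBy("event_type").agg(F.count("*").alias("n"))
summary.show()

# Python UDFs, by contrast, serialize rows out to Python worker processes and back,
# which is why built-in functions usually outperform them.
spark.stop()
```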
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk will cover a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka – Guido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, ranging from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical Big Data processing). In the past few years, new tools have emerged that are especially suited to handling the process of integrating data from outside, often called data ingestion. From the outside they look very similar to a traditional Enterprise Service Bus infrastructure, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, can handle high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets, and the Kafka ecosystem and show how they handle data ingestion in a Big Data solution architecture.
How to Build a Scylla Database Cluster that Fits Your Needs – ScyllaDB
Sizing a database cluster makes or breaks your application. Too small and you won't be able to sustain spikes in usage or recover from a node loss or an operational slowdown. Too big and your cluster will cost more and waste valuable human resources.
Since different workloads have different requirements, successful sizing of your application should be optimized for both throughput and latency. However, in many cases these requirements contradict each other.
In this webinar, we explain how to reconcile these contradicting forces and build a sustainable cluster that meets both performance and resiliency requirements.
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... – HostedbyConfluent
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture to data lakes, forming what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to today's cloud warehouses. Viewed through a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into practical concurrency control pitfalls in data lakes and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level, near-real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management, such as cleaning older file versions, compacting delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
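As a rough sketch of that incremental model (not code from the talk): a PySpark job upserting a small batch of changed records into a Hudi table, so only the affected file groups are rewritten. The table name, path, and key fields are illustrative, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Illustrative micro-batch of new and updated records.
updates = spark.createDataFrame(
    [(1, "2022-10-01 10:00:00", 42.0), (2, "2022-10-01 10:05:00", 17.5)],
    ["record_id", "ts", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",                        # illustrative table name
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.precombine.field": "ts",     # latest ts wins on upsert
    "hoodie.datasource.write.operation": "upsert",
}

# Only the affected file groups are rewritten; the rest of the table is untouched.
updates.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/orders")
```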
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang – Databricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it to Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
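The V2 interfaces themselves are implemented in Scala/Java, but the point of the abstraction is that every source looks the same to the consumer. A hedged PySpark sketch of that consumer-side view (the path is illustrative); a custom V2 source with filter and column pushdown would be addressed exactly the same way, with the pushdown happening behind this API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dsv2-consumer-sketch").getOrCreate()

# Any data source, V1 or V2, is addressed by a short format name plus options;
# the engine can push filters and column pruning down to sources that support it.
df = (
    spark.read.format("parquet")   # built-in file source; a custom V2 source would
    .load("/data/events")          # be loaded the same way, just with its own name
)
df.filter("event_type = 'click'").select("user_id", "ts").show()

spark.stop()
```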
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra... – Edureka!
This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
Large Scale Lakehouse Implementation Using Structured Streaming – Databricks
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, Auto Loader, and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion's technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion's data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in a production data lake on AWS.
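For readers unfamiliar with the building blocks named above, a minimal sketch (assuming a Databricks environment where Auto Loader and Delta are available; the paths, schema location, and table name are illustrative) of a streaming file ingest into a Delta table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("autoloader-delta-sketch").getOrCreate()

# Auto Loader ("cloudFiles") incrementally discovers new files landing in cloud storage.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/claims_schema")  # illustrative
    .load("/mnt/landing/claims")                                            # illustrative
)

# Write the stream to a Delta table; the checkpoint makes the job restartable.
query = (
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/claims")
    .trigger(availableNow=True)  # or processingTime="1 minute" for a continuous job
    .toTable("bronze.claims")
)
```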
Iceberg: A modern table format for big data (Strata NY 2018) – Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including the following (a brief usage sketch follows the list):
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
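As promised above, a brief usage sketch using today's Iceberg Spark SQL integration (which postdates this 2018 talk); the catalog, warehouse path, and table names are illustrative, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# The "demo" catalog is illustrative; a Hadoop catalog keeps table metadata under the warehouse path.
spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Planning is driven by table metadata (manifests), not directory listings,
# and every read sees a consistent snapshot.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP, level STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'INFO')")

# Schema evolution is a metadata operation; existing data files are untouched.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")
spark.sql("SELECT id, ts, level, source FROM demo.db.events").show()
```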
We will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history. Moreover, we will learn why Spark is needed. Afterward, we will cover the fundamentals of Spark's components. Furthermore, we will learn about Spark's core abstraction, the RDD. For more detailed insights, we will also cover Spark features, Spark limitations, and Spark use cases.
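To make the core abstraction concrete, a minimal RDD word-count sketch (the input path is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-wordcount-sketch")

# An RDD is an immutable, partitioned collection; transformations are lazy,
# and only an action (here, take) triggers execution on the cluster.
counts = (
    sc.textFile("input.txt")                      # illustrative path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
for word, n in counts.take(10):
    print(word, n)

sc.stop()
```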
Big data processing meets non-volatile memory: opportunities and challenges – DataWorks Summit
Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of remote direct memory access (RDMA) over InfiniBand and RoCE. However, with the introduction of non-volatile memory (NVM), these designs, along with the default execution models like MapReduce and Directed Acyclic Graph (DAG), need to be re-assessed to discover the possibilities of further enhanced performance.
In this context, we propose an accelerated execution framework (NVMD) for MapReduce and DAG that leverages the benefits of NVM and RDMA. NVMD introduces novel features for MapReduce and DAG, such as a hybrid push-and-pull shuffle mechanism and dynamic adaptation to network congestion. The design has been incorporated into Apache Hadoop and Tez. Performance results illustrate that NVMD can achieve up to 3.65x and 3.18x improvement for Hadoop and Tez, respectively. In this talk, we will also present an NVM-aware HDFS design and its benefits for MapReduce, Spark, and HBase.
Speaker: Shashank Gugnani, PhD Student, Ohio State University
High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSD – inside-BigData.com
In this deck from the 2018 OpenFabrics Workshop, Xiaoyi Lu from OSU presents: High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSD.
"The convergence of Big Data and HPC has been pushing the innovation of accelerating Big Data analytics and management on modern HPC clusters. Recent studies have shown that the performance of Apache Hadoop, Spark, and Memcached can be significantly improved by leveraging the high-performance networking technologies, such as Remote Direct Memory Access (RDMA). Most of these studies are based on `DRAM+RDMA' schemes. On the other hand, Non-Volatile Memory (NVM) and NVMe-SSD technologies can support RDMA access with low-latency, high-throughput, and persistence on HPC clusters. NVMs and NVMe-SSDs provide the opportunity to build novel high-performance and QoS-aware communication and I/O subsystems for data-intensive applications. In this talk, we propose new communication and I/O schemes for these data analytics stacks, which are designed with RDMA over NVM and NVMe-SSD. Our studies show that the proposed designs can significantly improve the communication, I/O, and application performance for Big Data analytics and management middleware, such as Hadoop, Spark, Memcached, etc. In addition, we will also discuss how to design QoS-aware schemes in these frameworks with NVMe-SSD."
Watch the video: https://wp.me/p3RLHQ-iyB
Learn more: http://web.cse.ohio-state.edu/~lu.932/
and
https://www.openfabrics.org/index.php/2018-ofa-workshop-presentations.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... – inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dyn delivers exceptional Internet performance. Enabling high-quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple data centers to enable sub-50 ms query responses for hundreds of billions of data points. From granular DNS traffic data to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks, and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements that led them to choose DSE as their go-to Big Data solution, the path that led to Spark, and the lessons they've learned in the process.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp... – Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security, and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and networking, and what it takes to offer hosted Hadoop services to a large customer base.
Storage is one of the three main pillars of any data center, along with compute and networking.
OpenStack provides flexibility and automation for storage provisioning, whether one uses iSCSI integrated with Cinder or Ceph for block and object storage.
But what about performance? How can one enjoy storage flexibility without compromising on state of the art, low-latency, high-throughput storage, that is required by today’s applications?
In this session, we will present three storage solutions for OpenStack and how they can be accelerated natively in OpenStack with Remote Direct Memory Access (RDMA) technology.
Join us to learn how RDMA boosts storage performance in the cloud.
Accelerating Hadoop, Spark, and Memcached with HPC Technologies – inside-BigData.com
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car... – StampedeCon
This session will begin with an overview of current non-volatile memory (NVM, aka persistent memory) architectures and its relationship between several levels of memory and storage hierarchy, both near- and far-processor. A discussion on its significant impact on computing analytic workloads now and in the near future will ensue, including use cases and the concept of very large persistent memory surfaces as applied to both analytic computation and storage for big data workflows. The presentation will end with ‘why you should care’ about such technologies which inevitably will completely change the way we think about solving data-intensive problems.
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics... – Debraj GuhaThakurta
Event: TDWI Accelerate, Seattle, Oct 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Tags: R, Spark, SQL Server
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ... – Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-database analytics using R in Spark and SQL Server
Accelerate Big Data Processing with High-Performance Computing Technologies – Intel® Software
Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise – Databricks
The ever-growing continuous influx of data causes every component in a system to burst at its seams. GPUs and ASICs are helping on the compute side, whereas in-memory and flash storage devices are utilized to keep up with those local IOPS. All of those can perform extremely well in smaller setups and under contained workloads. However, today's workloads require more and more power that directly translates into higher scale. Training major AI models can no longer fit into humble setups. Streaming ingestion systems are barely keeping up with the load. These are just a few examples of why enterprises require a massive versatile infrastructure, that continuously grows and scales. The problems start when workloads are then scaled out to reveal the hardships of traditional network infrastructures in coping with those bandwidth hungry and latency sensitive applications. In this talk, we are going to dive into how intelligent hardware offloads can mitigate network bottlenecks in Big Data and AI platforms, and compare the offering and performance of what's available in major public clouds, as well as a la carte on-premise solutions.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hr in). Basic knowledge of Python is highly recommended.
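As a flavor of the hands-on portion, a short sketch of the train/evaluate loop with scikit-learn on one of its built-in datasets (the model choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a small, well-known dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a supervised model and evaluate it on unseen data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```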
Floating on a RAFT: HBase Durability with Apache Ratis – DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi – DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
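Purely as an illustrative sketch (not part of the original write-up): one way to read such a Phoenix table from Python is through the Phoenix Query Server using the phoenixdb driver; the server URL, table, and column names below are hypothetical, and the Spring Boot service mentioned above would use JDBC instead.

```python
import phoenixdb

# Connect through the Phoenix Query Server (Avatica protocol); the URL is illustrative.
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

# Phoenix exposes the HBase-backed table through standard SQL.
cur.execute(
    "SELECT dc_key, dispatch_date, text_general_code FROM PHILLY_CRIME "
    "WHERE dc_dist = ? LIMIT 10",
    ["18"],  # illustrative police district code
)
for row in cur.fetchall():
    print(row)

conn.close()
```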
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... – DataWorks Summit
While HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. Since it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... – DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered in scaling to support the world catalog, and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber – DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Data organization for efficient writing involves factoring in the nature of the input data - whether it is append-only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where it should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, which leads to duplication of data and breaks data correctness and user queries. This component is key to scaling our jobs now that we are handling more than 500 billion writes a day in our current ingestion systems. It needs to provide strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, a critical piece that allows us to scale our jobs now that we are handling more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how it helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems; how and why we came up with a creative solution to load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints; as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
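Purely as a sketch of the lookup-or-register pattern described above (not Uber's actual implementation): a small Python example using the happybase Thrift client; the host, table, column family, and key layout are all illustrative.

```python
import happybase

# Connect via the HBase Thrift gateway; host and table name are illustrative.
conn = happybase.Connection("hbase-thrift.example.com")
index = conn.table("trips_global_index")

def locate_or_register(record_key: str, default_file_id: str) -> str:
    """Return the data-lake file group for a record key, registering it on first insert."""
    row = index.row(record_key.encode())
    existing = row.get(b"loc:file_id")
    if existing is not None:
        # Update path: the record already lives in this file group.
        return existing.decode()
    # Insert path: remember where the new record will be written.
    index.put(record_key.encode(), {b"loc:file_id": default_file_id.encode()})
    return default_file_id

print(locate_or_register("trip-0001", "file-group-07"))
```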
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix – DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi – DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows, with 95% doing real-time ETL, and handles over 20 billion events per day. In this session, learn from Aetna's experience building an edge-to-AI, high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine – DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... – DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.), and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects, and Models components with Azure Machine Learning (AML) services and show you how easy it is to get started with MLflow on-prem or in the cloud.
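A minimal sketch of the "few lines of code" mentioned above, logging parameters, a metric, and a deployable model from a training script (the dataset and model choice are illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)

    # Everything below is recorded automatically against this run.
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")  # deployable packaging of the model
```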
Extending Twitter's Data Platform to Google Cloud – DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries to help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and we present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi – DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger – DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premise and in cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: a deep learning system attached to a camera stream can identify various storefront situations, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
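As a hedged sketch of that object-detection building block (not the speakers' code): a pretrained torchvision detector run on a single frame; the image path and confidence threshold are illustrative, and a real deployment would consume a camera stream and fine-tune on retail-specific classes.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained COCO detector; in a store this would be fine-tuned on shelf/product classes.
model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

image = Image.open("shelf.jpg").convert("RGB")   # illustrative frame from a camera feed
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    detections = model([tensor])[0]

# Keep confident detections; boxes could feed stock counts or shelf-organization checks.
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.8:
        print(int(label), round(score.item(), 3), [round(v) for v in box.tolist()])
```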
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark – DataWorks Summit
Whole genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ... – DataWorks Summit
The Census Bureau is the U.S. government's largest statistical agency, with a mission to provide current facts and figures about America's people, places, and economy. The Bureau operates a large number of surveys to collect this data, the best known being the decennial population census. Data is being collected in increasing volumes, and the analytics solutions must be able to scale to meet ever-increasing needs while maintaining the confidentiality of the data. Past data analytics have occurred in processing silos, inhibiting the sharing of information, and common reference data is replicated across multiple systems. The use of the Hortonworks Data Platform, Hortonworks DataFlow, and other open-source technologies is enabling the creation of a cloud-based enterprise data lake and analytics platform. Cloud object stores are used to provide scalable data storage, and cloud compute supports permanent and transient clusters. Data governance tools are used to track data lineage and to provide access controls to sensitive data.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs, while dismantling the artificial separation between work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory (NVM)
1. Big Data Meets NVM: Accelerating Big Data Processing
with Non-Volatile Memory (NVM)
DataWorks Summit 2019 | Washington, DC
by
Xiaoyi Lu
The Ohio State University
luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
Dhabaleswar K. (DK) Panda
The Ohio State University
panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Dipti Shankar
The Ohio State University
shankard@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~shankar.50
2. DataWorks Summit, 2019 | Network Based Computing Laboratory
Big Data Management and Processing on Modern Clusters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline): HDFS, MapReduce, Spark
3. DataWorks Summit, 2019 | Network Based Computing Laboratory
Big Data Processing with Apache Big Data Analytics Stacks
• Major components included:
– MapReduce (Batch)
– Spark (Iterative and Interactive)
– HBase (Query)
– HDFS (Storage)
– RPC (Inter-process communication)
• Underlying Hadoop Distributed File System (HDFS) used by MapReduce, Spark, HBase, and many others
• Model scales, but the high amount of communication and I/O can be further optimized!
[Figure: Apache Big Data Analytics stack; User Applications over MapReduce, Spark, and HBase, on top of Hadoop Common (RPC) and HDFS]
4. DataWorks Summit, 2019 | Network Based Computing Laboratory
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• NVM and NVMe-SSD
• Accelerators (NVIDIA GPGPUs and FPGAs)
[Figure: multi-/many-core processors; high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth); accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, and NVRAM; cloud systems such as SDSC Comet and TACC Stampede]
5. DataWorks Summit, 2019 | Network Based Computing Laboratory
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 305 organizations from 35 countries
• More than 29,750 downloads from the project site (April ‘19)
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Significant performance improvement with ‘RDMA+DRAM’ compared to default Sockets-based designs; how about RDMA+NVRAM?
6. DataWorks Summit, 2019 | Network Based Computing Laboratory
Non-Volatile Memory (NVM) and NVMe-SSD
[Images: 3D XPoint from Intel & Micron; Samsung NVMe SSD; performance of PMC Flashtec NVRAM [*]]
• Non-Volatile Memory (NVM) provides byte-addressability with persistence
• The huge explosion of data in diverse fields requires fast analysis and storage
• NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications
• Storage technology is moving rapidly towards NVM
[*] http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/
7. DataWorks Summit, 2019 | Network Based Computing Laboratory
NVRAM Emulation based on DRAM
• Popular methods employed by recent works to emulate an NVRAM performance model over DRAM
• Two ways:
– Emulate byte-addressable NVRAM over DRAM
– Emulate a block-based NVM device over DRAM
[Figure: two DRAM-based emulation stacks. Block-based path: application, via open/read/write/close or mmap/memcpy/msync (DAX) with load/store, to a virtual file system and a block device PCMDisk (RAM-disk + delay) over DRAM. Byte-addressable path: application to a persistent memory library, via load/store and pmem_memcpy_persist (DAX) with clflush + delay, over DRAM]
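The byte-addressable emulation path above can be made concrete with a few lines of C. The following is a minimal, hypothetical sketch (not the emulator used in this work): it maps a DRAM-backed file as if it were DAX-mapped persistent memory, flushes written cache lines with clflushopt + sfence, and injects an artificial delay to model slower NVRAM writes. The file path, region size, and delay value are assumptions for illustration only.

```c
/* Hypothetical sketch: emulating byte-addressable NVRAM over DRAM.
 * Compile on x86 with -mclflushopt; error handling omitted for brevity. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>
#include <immintrin.h>               /* _mm_clflushopt, _mm_sfence */

#define CACHELINE 64
#define EMULATED_WRITE_DELAY_NS 300  /* assumed NVRAM write slowdown vs. DRAM */

static void emulated_pmem_persist(void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clflushopt((void *)p);   /* flush each dirty cache line */
    _mm_sfence();                    /* order the flushes */

    struct timespec d = { 0, EMULATED_WRITE_DELAY_NS };
    nanosleep(&d, NULL);             /* inject the emulated NVRAM write latency */
}

int main(void)
{
    int fd = open("/dev/shm/nvram_emulated", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 1 << 20);
    char *pmem = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);   /* DRAM-backed "NVRAM" region */

    const char *msg = "persist me";
    memcpy(pmem, msg, strlen(msg) + 1);     /* regular store */
    emulated_pmem_persist(pmem, strlen(msg) + 1);

    munmap(pmem, 1 << 20);
    close(fd);
    return 0;
}
```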
8. DataWorks Summit, 2019 | Network Based Computing Laboratory
Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
9. DataWorks Summit, 2019 | Network Based Computing Laboratory
Design Scope (NVM for RDMA)
• D-to-D over RDMA: communication buffers for client and server are allocated in DRAM (common case)
• D-to-N over RDMA: communication buffers for the client are allocated in DRAM; the server uses NVM
• N-to-D over RDMA: communication buffers for the client are allocated in NVM; the server uses DRAM
• N-to-N over RDMA: communication buffers for client and server are allocated in NVM
[Figure: three client/server HDFS-RDMA configurations (RDMADFSClient and RDMADFSServer), with client and server CPUs connected through PCIe and NICs, and communication buffers placed in DRAM or NVM on each side for the D-to-N, N-to-D, and N-to-N cases]
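From the RDMA stack's point of view, the main difference between these cases is where the registered communication buffer lives. The sketch below, written against libibverbs, registers either a DRAM buffer or a DAX-mapped NVM buffer as an RDMA memory region. It is illustrative only: the /mnt/pmem path and buffer size are placeholders, error handling is omitted, and a real design such as NRCIO would additionally handle persistence ordering on the NVM side.

```c
/* Hypothetical sketch: registering DRAM vs. NVM-backed buffers for RDMA. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

static struct ibv_mr *register_dram_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = aligned_alloc(4096, len);            /* ordinary DRAM buffer */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
}

static struct ibv_mr *register_nvm_buffer(struct ibv_pd *pd,
                                          const char *dax_path, size_t len)
{
    int fd = open(dax_path, O_RDWR);                  /* e.g. a file on fsdax */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return ibv_reg_mr(pd, buf, len,                   /* NVM-backed region */
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
}

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    struct ibv_mr *dram_mr = register_dram_buffer(pd, 1 << 20);
    struct ibv_mr *nvm_mr  = register_nvm_buffer(pd, "/mnt/pmem/rdma_buf",
                                                 1 << 20);
    /* dram_mr->lkey / nvm_mr->lkey would now be used in RDMA work requests. */
    (void)dram_mr; (void)nvm_mr;
    return 0;
}
```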
10. DataWorks Summit, 2019 | Network Based Computing Laboratory
NVRAM-aware RDMA-based Communication in NRCIO
[Figures: NRCIO RDMA Write over NVRAM; NRCIO RDMA Read over NVRAM]
11. DataWorks Summit, 2019 | Network Based Computing Laboratory
DRAM-TO-NVRAM RDMA-Aware Communication with NRCIO
• Comparison of communication latency using NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA, with DRAM as source and NVRAM as destination
• {NxDRAM} NVRAM emulation mode = Nx NVRAM write slowdown vs. DRAM, with clflushopt (emulated) + sfence
• Smaller impact of time-for-persistence on the end-to-end latencies for small messages vs. large messages => larger number of cache lines to flush
[Charts: latency (us) for 256 B to 16 KB messages and latency (ms) for 256 KB to 4 MB messages, for NRCIO-RW and NRCIO-RR under the 1xDRAM, 2xDRAM, and 5xDRAM emulation modes]
12. DataWorks Summit, 2019 | Network Based Computing Laboratory
NVRAM-TO-NVRAM RDMA-Aware Communication with NRCIO
• Comparison of communication latency using NRCIO RDMA read and write communication protocols over an InfiniBand EDR HCA vs. DRAM
• {Ax, Bx} NVRAM emulation mode = Ax NVRAM read slowdown and Bx NVRAM write slowdown vs. DRAM
• High end-to-end latencies due to slower writes to non-volatile persistent memory
– E.g., 3.9x for {1x,2x} and 8x for {2x,5x}
[Charts: latency (us) for 64 B to 16 KB messages and latency (ms) for 256 KB to 4 MB messages, for NRCIO-RW and NRCIO-RR under No Persist (D2D), {1x,2x}, and {2x,5x} emulation modes]
13. DataWorks Summit, 2019 | Network Based Computing Laboratory
Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
14. DataWorks Summit, 2019 | Network Based Computing Laboratory
Opportunities of Using NVRAM+RDMA in HDFS
• Files are divided into fixed-sized blocks
– Blocks are divided into packets
• NameNode: stores the file system namespace
• DataNode: stores data blocks in local storage devices
• Uses block replication for fault tolerance
– Replication enhances data-locality and read throughput
• Communication and I/O intensive
• Java Sockets based communication
• Data needs to be persistent, typically on SSD/HDD
[Figure: HDFS architecture with a Client, the NameNode, and DataNodes]
15. DataWorks Summit, 2019 | Network Based Computing Laboratory
Design Overview of NVM and RDMA-aware HDFS (NVFS)
• Design features
– RDMA over NVM
– HDFS I/O with NVM: block access and memory access
– Hybrid design: NVM with SSD as a hybrid storage for HDFS I/O
– Co-design with Spark and HBase (cost-effectiveness, use-case)
[Figure: NVM and RDMA-aware HDFS (NVFS) architecture. Applications and benchmarks (Hadoop MapReduce, Spark, HBase) co-designed with a DFSClient using RDMA Sender, Receiver, and Replicator components; the DataNode provides NVFS-BlkIO (writer/reader) and NVFS-MemIO over NVM, alongside SSDs]
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, 24th International Conference on Supercomputing (ICS), June 2016
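Since NVM is scarce and expensive, the hybrid NVFS design relies on an NVM-aware placement policy that sends only performance-critical data to NVM and everything else to SSD. The snippet below is a minimal sketch of such a policy in C; the structure fields and the specific predicate are assumptions for illustration, not the actual NVFS implementation.

```c
/* Hypothetical NVM-aware block placement policy for a hybrid NVM+SSD store. */
typedef enum { PLACE_NVM, PLACE_SSD } placement_t;

typedef struct {
    int is_performance_critical;  /* e.g. data on the job's critical path */
    int replica_index;            /* 0 = primary replica */
} block_hint_t;

static placement_t choose_storage(const block_hint_t *hint)
{
    /* Assumed policy: only the primary replica of performance-critical
     * blocks uses the high-IOPS (but expensive) NVM; all others go to SSD. */
    if (hint->is_performance_critical && hint->replica_index == 0)
        return PLACE_NVM;
    return PLACE_SSD;
}
```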
16. DataWorks Summit, 2019 | Network Based Computing Laboratory
Evaluation with Hadoop MapReduce (TestDFSIO)
• TestDFSIO on SDSC Comet (32 nodes: 80 GB, SATA-SSDs)
– Write: NVFS-MemIO gains 4x over HDFS
– Read: NVFS-MemIO gains 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes: 8 GB, NVMe-SSDs)
– Write: NVFS-MemIO gains 4x over HDFS
– Read: NVFS-MemIO gains 2x over HDFS
[Charts: average write/read throughput (MBps) for HDFS (56 Gbps), NVFS-BlkIO (56 Gbps), and NVFS-MemIO (56 Gbps) on SDSC Comet and OSU Nowlab]
17. DataWorks Summit, 2019 | Network Based Computing Laboratory
Evaluation with HBase
• YCSB 100% Insert on SDSC Comet (32 nodes)
– NVFS-BlkIO gains 21% by storing only WALs to NVM
• YCSB 50% Read, 50% Update on SDSC Comet (32 nodes)
– NVFS-BlkIO gains 20% by storing only WALs to NVM
[Charts: throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K) for HDFS (56 Gbps) and NVFS (56 Gbps), for the 100% insert and the 50% read / 50% update workloads]
18. DataWorks Summit, 2019 | Network Based Computing Laboratory
Opportunities to Use NVRAM+RDMA in MapReduce
• Map and Reduce tasks carry out the total job execution
– Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk (persistent)
– Reduce tasks get these data by shuffle from NodeManagers, operate on them, and write to HDFS (persistent)
• Communication and I/O intensive; the shuffle phase uses HTTP over Java Sockets; I/O operations typically take place on SSD/HDD
[Figure: MapReduce data flow highlighting disk operations and bulk data transfer]
19. DataWorks Summit, 2019 | Network Based Computing Laboratory
Opportunities to Use NVRAM in MapReduce-RDMA Design
• All operations are in-memory
• Opportunities exist to improve the performance with NVRAM
[Figure: RDMA-MapReduce data flow from input files through map tasks (read, map, spill, merge) and intermediate data, shuffled over RDMA to reduce tasks (shuffle, in-memory merge, reduce) and written to output files]
20. DataWorks Summit, 2019 | Network Based Computing Laboratory
NVRAM-Assisted Map Spilling in MapReduce-RDMA
• Minimizes the disk operations in the Spill phase
[Figure: the same RDMA-MapReduce data flow, with map-task spills directed to NVRAM instead of local disk]
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, PDSW-DISCS, held with SC, 2016.
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems, IEEE BigData 2017.
21. DataWorks Summit, 2019 | Network Based Computing Laboratory
Comparison with Sort and TeraSort
• RMR-NVM achieves 2.37x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit 55% compared to MR-IPoIB and 28% compared to RMR
• RMR-NVM achieves 2.48x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit 51% compared to MR-IPoIB and 31% compared to RMR
[Charts: execution-time comparison for Sort and TeraSort, annotated with the 2.37x/55% and 2.48x/51% improvements]
22. DataWorks Summit, 2019 | Network Based Computing Laboratory
Evaluation of Intel HiBench Workloads
• We evaluate different HiBench workloads with Huge data sets on 8 nodes
• Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:
– Sort: 42% (25 GB)
– TeraSort: 39% (32 GB)
– PageRank: 21% (5 million pages)
• Other workloads:
– WordCount: 18% (25 GB)
– KMeans: 11% (100 million samples)
23. DataWorks Summit, 2019 | Network Based Computing Laboratory
Evaluation of PUMA Workloads
• We evaluate different PUMA workloads on 8 nodes with a 30 GB data size
• Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:
– AdjList: 39%
– SelfJoin: 58%
– RankedInvIndex: 39%
• Other workloads:
– SeqCount: 32%
– InvIndex: 18%
24. DataWorks Summit, 2019 | Network Based Computing Laboratory
Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
25. DataWorks Summit, 2019 | Network Based Computing Laboratory
Overview of NVMe Standard
• NVMe is the standardized interface for PCIe SSDs
• Built on ‘RDMA’ principles
– Submission and completion I/O queues
– Similar semantics to RDMA send/recv queues
– Asynchronous command processing
• Up to 64K I/O queues, with up to 64K commands per queue
• Efficient small random I/O operations
• MSI/MSI-X and interrupt aggregation
[Figure: NVMe command processing. Source: NVMExpress.org]
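The submission/completion queue semantics can be made concrete with a short fragment against the SPDK user-space NVMe driver. This is a hedged, illustrative sketch rather than the evaluation code behind this talk: the SPDK calls are written from memory and may differ slightly across SPDK versions, and it assumes a controller and namespace have already been attached (for example via spdk_nvme_probe during SPDK environment initialization).

```c
/* Hedged sketch of asynchronous NVMe I/O with SPDK (API names from memory;
 * check your SPDK version). Assumes ctrlr/ns are already attached. */
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static volatile int g_done;

static void write_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = 1;                                   /* completion callback */
}

static void one_async_write(struct spdk_nvme_ctrlr *ctrlr,
                            struct spdk_nvme_ns *ns)
{
    /* Each qpair maps onto a hardware submission/completion queue pair. */
    struct spdk_nvme_qpair *qp =
        spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);

    uint32_t sz = spdk_nvme_ns_get_sector_size(ns);
    void *buf = spdk_dma_zmalloc(sz, sz, NULL);   /* DMA-able buffer */
    snprintf((char *)buf, sz, "hello nvme");

    /* Asynchronous submission: returns immediately, completion via callback. */
    spdk_nvme_ns_cmd_write(ns, qp, buf, 0 /* lba */, 1 /* lba_count */,
                           write_done, NULL, 0);

    while (!g_done)                               /* poll; no interrupts */
        spdk_nvme_qpair_process_completions(qp, 0);

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qp);
}
```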
26. DataWorks Summit, 2019 | Network Based Computing Laboratory
Overview of NVMe-over-Fabric
• Remote access to flash with NVMe over the network
• An RDMA fabric is of most importance
– Low latency makes remote access feasible
– 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
• Low latency overhead compared to local I/O
[Figure: NVMf architecture; I/O submission and completion queues mapped onto RDMA send (SQ) and receive (RQ) queues across the RDMA fabric to the NVMe device]
27. DataWorks Summit, 2019 | Network Based Computing Laboratory
Design Challenges with NVMe-SSD
• QoS
– Hardware-assisted QoS
• Persistence
– Flushing buffered data
• Performance
– Consider flash related design aspects
– Read/Write performance skew
– Garbage collection
• Virtualization
– SR-IOV hardware support
– Namespace isolation
• New software systems
– Disaggregated Storage with NVMf
– Persistent Caches
Co-design
28. DataWorks Summit, 2019 | Network Based Computing Laboratory
Evaluation with RocksDB (latency)
• 20%, 33%, 61% improvement for Insert, Write Sync, and Read Write
• Overwrite: compaction and flushing in background; low potential for improvement
• Read: performance much worse; additional tuning/optimization required
[Charts: latency (us) for Insert, Overwrite, and Random Read, and for Write Sync and Read Write, comparing the POSIX and SPDK backends]
29. DataWorks Summit, 2019 | Network Based Computing Laboratory
Evaluation with RocksDB (throughput)
• 25%, 50%, 160% improvement for Insert, Write Sync, and Read Write
• Overwrite: compaction and flushing in background; low potential for improvement
• Read: performance much worse; additional tuning/optimization required
[Charts: throughput (ops/sec) for Insert, Overwrite, and Random Read, and for Write Sync and Read Write, comparing the POSIX and SPDK backends]
30. DataWorks Summit, 2019 | Network Based Computing Laboratory
QoS-aware SPDK Design
• Synthetic application scenarios with different QoS requirements
– Comparison using SPDK with Weighted Round Robin (WRR) NVMe arbitration
• Near-desired job bandwidth ratios
• Stable and consistent bandwidth
[Charts: Scenario 1 bandwidth (MB/s) over time for high- and medium-priority jobs under WRR vs. the OSU design; job bandwidth ratios for Scenarios 2 through 5 for SPDK-WRR, the OSU design, and the desired ratio]
S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs, 11th IEEE/ACM International Conference on Utility and Cloud Computing (UCC), Dec 2018
31. DataWorks Summit, 2019 | Network Based Computing Laboratory
Conclusion and Future Work
• Big Data analytics needs high-performance NVM-aware RDMA-based communication and I/O schemes
• Proposed a new library, NRCIO (work-in-progress)
• Re-design of the HDFS storage architecture with NVRAM
• Re-design of RDMA-MapReduce with NVRAM
• Design of Big Data analytics stacks with NVMe and NVMf protocols
• Results are promising
• Further optimizations in NRCIO
• Co-design with more Big Data analytics frameworks
– TensorFlow, Object Storage, Database, etc.
32. DataWorks Summit, 2019 | Network Based Computing Laboratory
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
shankard@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~shankar.50
Editor's Notes
How we can combine current HPC tech with emerging NVM tech like NVMe and NVRAM/PMEM to accelerate Big Data processing on the latest compute systems.
We all know that as a step towards handling today’s Big Data challenges, we need faster and more efficient system software or data processing stacks.
This means low-latency data access at the front-end tier, low-latency inter-process communication and data shuffling, and high-throughput I/O.
The key here is that this model enables high productivity.
It is easy for, say, data scientists to design and deploy analytical applications.
The drawback is that it requires handling tons of I/O and communication, but it currently employs gener
loads of technologies that can be harnessed for better performance.
persistence, higher throughput and closer-to-DRAM performance.
1. Modern processors have hardware-based virtualization support
2. Multi-core processors and large memory nodes have enabled a large number of VMs to be deployed on a single node
3. HPC Clouds are often deployed with InfiniBand with SR-IOV support
4. They also have SSDs and Object Storage Clusters such as OpenStack Swift which often use SSDs for backend storage
5. Many large-scale cloud deployments such as Microsoft Azure, Softlayer (an IBM company), Oracle Cloud, and Chameleon Cloud provide support for InfiniBand and SR-IOV
6. In fact, all our evaluations are done on Chameleon Cloud
A way to enable native performance is to use the SR-IOV (Single Root I/O Virtualization) mechanism, which bypasses the hypervisor and enables a direct link between the VM and the I/O adapter.
msync: persists the whole region
If multiple CLFLUSH flushes different cache lines and these multiple CLFLUSH come from different threads (in other words, different logical processors' instruction streams), then these CLFLUSH should be able to run in parallel. If multiple CLFLUSH come from the same thread, then they cannot run in parallel. The point of having CLFLUSHOPT is to allow flushing multiple cache lines in parallel within a single logical processor's instruction stream.
D-to-N and N-to-D over RDMA have similar performance characteristics. D-to-N does not need NVM to be present in the client side
NVMs are expensive. Therefore, for data-intensive applications, it is not feasible to store all the data in NVM.
We propose to use NVM with SSD as a hybrid storage for HDFS I/O. In our design, NVM can replace or co-exist with SSD through a configuration parameter. As a result, cost-effective, NVM-aware placement policies are needed to identify the appropriate data to go to NVMs. The idea behind this is to take advantage of the high IOPS of NVMs for performance-critical data; all others can go to SSD.
80 GB test
MSI/MSI-X: Message Signaled Interrupts
Read Sequential/Random: 20/115 us
Write Sequential/Random: 20/25 us?
50 Million Keys, Key size is 64 bytes, Value size is 1K.
Benchmark: DBBench (a part of RocksDB, Facebook)
Intel DC P3700
All scenarios run 2 simultaneous jobs with back-to-back requests
Priority Weights: High Priority = 4, Medium Priority = 2, Low Priority = 1
Scenario 1: one high priority job with 4k requests and one medium priority job with 8k requests
Scenario 2: two high priority jobs, one with 4k and the other with 8k requests
Scenario 3: 1 high priority job with 4k requests and 1 low priority job with 8k requests
Scenario 4: same as Scenario 3 with the priorities exchanged
Scenario 5: two high priority jobs, one submitting 4k and 8k requests and the other 8k and 16k requests
Deficit Round Robin (DRR) as a hardware-based arbitration scheme is more suited for providing bandwidth guarantees for NVMe SSDs.
Schemes like deficit round robin (DRR) and weighted fair queuing (WFQ) are popular models widely used in networking. Both DRR and WFQ can provide bandwidth guarantees. However, WFQ requires O(log(n)) time to process each request, while DRR only requires O(1), where n is the number of priority classes.
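To make the O(1) claim concrete, the following is a minimal, hypothetical deficit round robin sketch over per-priority NVMe request queues. The structure names, the per-class quantum, and the dispatch hook are assumptions for illustration; this is not the arbitration logic evaluated above.

```c
/* Illustrative O(1)-per-request deficit round robin over priority classes. */
#include <stddef.h>

#define NCLASSES 3

typedef struct request { size_t bytes; struct request *next; } request_t;

typedef struct {
    request_t *head;        /* FIFO of pending requests for this class */
    size_t quantum;         /* bytes added to the deficit each round   */
    size_t deficit;         /* unused allowance carried across rounds  */
} drr_class_t;

/* Serve one round across all classes; dispatch() would submit to the SSD. */
static void drr_round(drr_class_t cls[NCLASSES],
                      void (*dispatch)(request_t *))
{
    for (int c = 0; c < NCLASSES; c++) {
        cls[c].deficit += cls[c].quantum;           /* constant work per class */
        while (cls[c].head && cls[c].head->bytes <= cls[c].deficit) {
            request_t *r = cls[c].head;
            cls[c].head = r->next;
            cls[c].deficit -= r->bytes;
            dispatch(r);                            /* O(1) per dispatched request */
        }
        if (!cls[c].head)
            cls[c].deficit = 0;                     /* idle class resets its deficit */
    }
}
```

Bandwidth shares are set by the per-class quantum values (e.g., 4:2:1 for high, medium, and low priority), which is what lets a DRR-style arbiter provide the bandwidth guarantees discussed above without the O(log n) bookkeeping of WFQ.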