- GD2L is a cost-aware buffer pool management algorithm that keeps two priority queues, one for pages stored on the SSD and one for pages that are not, so that eviction decisions reflect the different retrieval costs. CAC is an anticipatory, cost-based technique for deciding which pages are placed on the SSD.
- Experiments show that GD2L and CAC together provide roughly 2x better TPC-C throughput than the LRU+CC baseline by lowering total I/O costs on both the SSD and the HDD for a large 30GB database. For smaller databases the gains were less significant.
- CAC makes better decisions than a non-anticipatory technique about which pages should remain on the SSD, because it anticipates how a page's I/O activity will change once the page is moved.
2. Hybrid Storage Management for Database Systems
} Goal: use a hybrid storage system consisting of SSDs and HDDs for database management
} Contributions
} GD2L: a cost-aware buffer pool management algorithm for hybrid storage
} Based on the GreedyDual algorithm [1], extended to account for the performance difference between SSD and HDD
} CAC: an anticipatory (predictive), cost-based SSD management technique
} Experimental performance evaluation of GD2L and CAC
} The proposed mechanisms are implemented in MySQL's InnoDB storage engine
Session 1: Emerging Hardware / Presented by: Takuma Wakamori (NTT)
System architecture (from Figure 1)
… as illustrated in Figure 1. When writing data to storage, the DBMS chooses which type of device to write it to.
Figure 1: System Architecture (a buffer pool and buffer pool manager in RAM, an SSD manager, an SSD, and an HDD; read/write requests are served through the buffer pool)
Previous work has considered how a DBMS should place data in a hybrid storage system [11, 1, 2, 16, 5]. We provide a summary of such work in Section 7. In this paper, we take a broader view of the problem than is used by most of this work. Our view includes the DBMS buffer pool as well as the two types of storage devices. We consider two related problems. The first is determining which data should be retained in the DBMS buffer pool. The answer to this question is affected by the presence of hybrid storage because blocks evicted from the buffer cache to an SSD are much cheaper to retrieve later than blocks evicted to the HDD. Thus, we consider cost-aware buffer management, which can take this distinction into account. Second, assuming that the SSD is not large enough to hold the entire database, we face the problem of deciding which data should be placed on the SSD. This should depend on the physical access pattern …
[1] N. Young. The k-server dual and loose competitiveness for paging. Algorithmica, 11:525–541, 1994.
• All database pages are stored on the HDD; copies of some pages are also held on the SSD, in the buffer pool, or in both.
• When the buffer pool becomes full, the buffer pool manager evicts pages from the buffer according to its page replacement policy.
• The SSD manager:
• admits pages evicted from the buffer pool to the SSD according to its SSD admission policy
• removes data from the SSD according to its SSD replacement policy
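To make this flow concrete, here is a minimal, hypothetical Python sketch of the read path and of the SSD admission decision on eviction. The names (HybridStorageDBMS, should_admit, the capacity constant) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the hybrid-storage read path described above.
# All names and the capacity constant are illustrative assumptions.
BUFFER_POOL_FRAMES = 1024  # illustrative buffer pool capacity (in pages)

class HybridStorageDBMS:
    def __init__(self, ssd_manager):
        self.buffer_pool = {}   # page_id -> page contents (cached copies)
        self.ssd = {}           # page_id -> page contents (copies of some pages)
        self.hdd = {}           # page_id -> page contents (every page lives here)
        self.ssd_manager = ssd_manager  # decides SSD admission / replacement

    def read(self, page_id):
        """Logical read: served from the buffer pool if possible, else a physical read."""
        if page_id in self.buffer_pool:
            return self.buffer_pool[page_id]      # buffer pool hit, no physical I/O
        source = self.ssd if page_id in self.ssd else self.hdd
        page = source[page_id]                    # physical read (cost R_S or R_D)
        self._admit_to_buffer_pool(page_id, page)
        return page

    def _admit_to_buffer_pool(self, page_id, page):
        if len(self.buffer_pool) >= BUFFER_POOL_FRAMES:
            # Page replacement policy elided here (GD2L is sketched on the next slide).
            victim_id, victim = self.buffer_pool.popitem()
            # SSD admission policy: a page evicted from the buffer pool may be
            # copied onto the SSD; its HDD copy always remains.
            if victim_id not in self.ssd and self.ssd_manager.should_admit(victim_id):
                self.ssd[victim_id] = victim
        self.buffer_pool[page_id] = page
```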
3. GD2L: a cost-aware buffer pool management algorithm
} Assumptions
} Each page p has a cost H
} H is computed using an inflation value L
} Two priority queues are used
} QS: manages pages that are on the SSD
} QD: manages pages that are not on the SSD
} Algorithm (a minimal code sketch appears at the end of this slide)
} When page p is not in the cache:
} compare the cost H of the LRU page of QS with that of the LRU page of QD, and evict the page q with the smaller H
} set L ← H(q), and place p in the cache
} If p is on the SSD:
} H(p) ← L + RS, and place p at the MRU end of QS
} Otherwise (p is on the HDD only):
} H(p) ← L + RD, and place p at the MRU end of QD
Figure 3: Buffer pool managed by GD2L (pages on the SSD are held in QS, pages not on the SSD in QD; a page moves between the queues when it is flushed to the SSD or evicted from the SSD)
Storage device parameters:
RD: read time from the HDD
WD: write time to the HDD
RS: read time from the SSD
WS: write time to the SSD
When free space in the buffer pool runs low, the page cleaner flushes dirty buffers to storage according to a cost-aware policy.
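The following is a minimal sketch of the GD2L policy just described, assuming fixed per-device read costs RS and RD and a callback that reports whether a page currently resides on the SSD; it is an illustration, not the InnoDB implementation.

```python
# Hypothetical sketch of GD2L: GreedyDual with two queues and SSD/HDD read costs.
from collections import OrderedDict

class GD2LBufferPool:
    def __init__(self, capacity, r_s, r_d, on_ssd):
        self.capacity = capacity   # number of buffer pool frames
        self.r_s = r_s             # R_S: read cost from the SSD
        self.r_d = r_d             # R_D: read cost from the HDD
        self.on_ssd = on_ssd       # callable: page_id -> True if the page is on the SSD
        self.L = 0.0               # inflation value
        self.q_s = OrderedDict()   # Q_S: pages on the SSD, page_id -> cost H
        self.q_d = OrderedDict()   # Q_D: pages not on the SSD, page_id -> cost H

    def _evict(self):
        # Compare the LRU (front) entries of Q_S and Q_D and evict the one with smaller H.
        candidates = []
        for q in (self.q_s, self.q_d):
            if q:
                page_id, h = next(iter(q.items()))
                candidates.append((h, page_id, q))
        h, page_id, q = min(candidates, key=lambda c: c[0])
        del q[page_id]
        self.L = h                 # L <- H(q) of the evicted page

    def access(self, page_id):
        if page_id in self.q_s or page_id in self.q_d:
            # Hit: remove so that re-insertion moves the page to the MRU end of its queue.
            (self.q_s if page_id in self.q_s else self.q_d).pop(page_id)
        elif len(self.q_s) + len(self.q_d) >= self.capacity:
            self._evict()
        # (Re)insert with inflated cost, into the queue matching the page's location.
        if self.on_ssd(page_id):
            self.q_s[page_id] = self.L + self.r_s   # H(p) <- L + R_S
        else:
            self.q_d[page_id] = self.L + self.r_d   # H(p) <- L + R_D
```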
4. The effect of cost-aware caching
} When a page is cached on the SSD, its buffer pool miss rate and physical write rate increase
} Reason
} once a page is placed on the SSD, its read cost becomes low, so it is evicted from the buffer pool more readily
} the page cleaner must flush dirty pages before they can be evicted
The page cleaners issue two types of writes: replacement writes and recoverability writes. Replacement writes are issued when dirty pages are identified as eviction candidates. To remove the latency associated with synchronous writes, the page cleaners try to ensure that pages that are likely to be replaced are clean at the time of the replacement. In contrast, recoverability writes are those that are used to limit failure recovery time. InnoDB uses write-ahead logging to ensure that committed database updates survive failures. The failure recovery time depends on the age of the oldest changes in the buffer pool. The page cleaners issue recoverability writes of the least recently modified pages to ensure that a configured recovery time threshold will not be exceeded. In InnoDB, when the free space of the buffer pool is below …
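As a rough illustration of the two write types (a conceptual sketch only; the Page fields and the recovery-age threshold are assumptions, not InnoDB's actual structures), one cleaning pass might look like this:

```python
# Conceptual sketch only: replacement writes vs. recoverability writes.
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    dirty: bool
    oldest_modification_lsn: int   # LSN of the oldest unflushed change to this page
    eviction_candidate: bool       # likely to be replaced soon (e.g., low GD2L cost H)

def clean_pages(buffer_pool: list[Page], current_lsn: int, max_recovery_age: int) -> list[int]:
    """Return the ids of pages the cleaner would flush in one pass."""
    to_flush = []
    for p in buffer_pool:
        if not p.dirty:
            continue
        if p.eviction_candidate:
            # Replacement write: flush dirty eviction candidates ahead of time so the
            # eventual eviction does not require a synchronous write.
            to_flush.append(p.page_id)
        elif current_lsn - p.oldest_modification_lsn > max_recovery_age:
            # Recoverability write: flush pages whose oldest change is old enough to
            # push failure-recovery time past the configured threshold.
            to_flush.append(p.page_id)
    return to_flush
```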
Figure 4: Miss rate/write rate while on the HDD (only) vs. miss rate/write rate while on the SSD. Each point represents one page.
HDD cache miss/write rate vs. SSD cache miss/write rate (from Figure 4)
5. CAC: Cost-Adjusted Caching
} To decide whether page p should be placed on the SSD, CAC estimates the benefit B(p) of doing so
} If a page p' already in the SSD cache satisfies B(p') < B(p), p is admitted to the SSD cache and the page with the smallest B is evicted
} Miss rate expansion factor (α)
} introduced to estimate how a page's physical read/write rates change depending on whether or not the page is on the SSD
Figure 5: Summary of Notation
Symbol      Description
rD, wD      Measured physical read/write count while not on the SSD
rS, wS      Measured physical read/write count while on the SSD
r̂D, ŵD      Estimated physical read/write count if never on the SSD
r̂S, ŵS      Estimated physical read/write count if always on the SSD
mS          Buffer cache miss rate for pages on the SSD
mD          Buffer cache miss rate for pages not on the SSD
α           Miss rate expansion factor
Figure 6: Miss rate expansion factor for pages from three TPC-C tables.
… r̂S(p) and ŵS(p) are the physical read and write counts p would experience if it were placed on the SSD, and r̂D(p) and ŵD(p) are the physical read and write counts p would experience if it were not. Using these hypothetical physical read and write counts, we can write our desired estimate of the benefit of placing p on the SSD as follows:

B(p) = (r̂D(p)·RD − r̂S(p)·RS) + (ŵD(p)·WD − ŵS(p)·WS)   (2)

Thus, the problem of estimating benefit reduces to the problem of estimating values for r̂D(p), r̂S(p), ŵD(p), and ŵS(p).

To estimate r̂S(p), CAC uses two measured read counts: rS(p) and rD(p). (In the following, we will drop the explicit page reference from our notation as long as the page is clear from context.) In general, p may spend some time on the SSD and some time not on the SSD. rS is the count of the number of physical reads experienced by p while p is on the SSD. rD is the number of physical reads experienced by p while it is not on the SSD. To estimate what p's physical read count would be if it were on the SSD full time (r̂S), CAC uses …
The Miss Rate Expansion Factor
The purpose of the miss rate expansion factor (α) is to estimate how much a page's physical read and write rates will change if the page is admitted to the SSD. A simple way to estimate α is to compare the overall miss rate of pages on the SSD to that of pages that are not on the SSD. Suppose that mS represents the overall miss rate of logical read requests for pages that are on the SSD, i.e., the total number of physical reads from the SSD divided by the total number of logical reads of pages on the SSD. Similarly, let mD represent the overall miss rate of logical read requests for pages that are not located on the SSD. Both mS and mD are easily measured. Using mS and mD, we can define the miss rate expansion factor as:

α = mS / mD   (7)

For example, α = 3 means that the miss rate is three times higher for pages on the SSD than for pages that are not on the SSD.
… scaling it up to account for any time in which the page was not on the SSD. While this might be effective, it will work only if the page has actually spent time on the SSD, so that rS can be observed. We still require a way to estimate rS for pages that have not been observed on the SSD. In contrast, estimation using Equation 3 will work even if rS or rD are zero due to lack of observations.

We track reference counts for all pages in the buffer pool and all pages in the SSD. In addition, we maintain an outqueue to track reference counts for a fixed number (Noutq) of additional pages. When a page is evicted from the SSD, …
Figure 6 shows expansion factors of pages grouped by table and by logical read rate. The three lines represent pages from the TPC-C STOCK, CUSTOMER, and ORDER tables. Since different pages may have substantially different miss rate expansion factors, we use different expansion factors for different groups of pages. Specifically, we group pages based on the database object (e.g., table or index) in which they store data and on their logical read rate, and we maintain a different expansion factor for each group. We divide the range of possible logical read rates into subranges of fixed size, and define a group as pages that store data for the same database object and whose logical read rates fall into the same subrange. …
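A small, hypothetical sketch of the per-group expansion factor described above; the subrange width and counter names are assumptions, not taken from the paper's code.

```python
# Hypothetical sketch: per-group miss rate expansion factor (alpha = m_S / m_D, Eq. 7),
# with pages grouped by database object and by logical-read-rate subrange.
from collections import defaultdict
from dataclasses import dataclass

READ_RATE_SUBRANGE = 100.0   # width of each logical-read-rate subrange (illustrative)

@dataclass
class GroupCounters:
    logical_reads_on_ssd: int = 0
    physical_reads_on_ssd: int = 0
    logical_reads_off_ssd: int = 0
    physical_reads_off_ssd: int = 0

    def alpha(self) -> float:
        """alpha = m_S / m_D, where each m is physical reads / logical reads."""
        m_s = self.physical_reads_on_ssd / max(self.logical_reads_on_ssd, 1)
        m_d = self.physical_reads_off_ssd / max(self.logical_reads_off_ssd, 1)
        return m_s / max(m_d, 1e-9)

groups: dict[tuple[str, int], GroupCounters] = defaultdict(GroupCounters)

def group_key(db_object: str, logical_read_rate: float) -> tuple[str, int]:
    """Pages of the same object whose logical read rates fall in the same
    subrange share one expansion factor."""
    return (db_object, int(logical_read_rate // READ_RATE_SUBRANGE))
```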
r̂D, ŵD : estimated physical read/write counts if the page were never on the SSD
r̂S, ŵS : estimated physical read/write counts if the page were always on the SSD
mS : buffer cache miss rate for pages on the SSD
mD : buffer cache miss rate for pages not on the SSD
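Combining the pieces, the benefit estimate of Equation 2 and CAC's admission rule can be sketched as follows; the device costs and the α-adjusted read/write estimates are assumed inputs, and the names are illustrative rather than the paper's implementation.

```python
# Hypothetical sketch of CAC's benefit estimate (Eq. 2) and SSD admission decision.
from dataclasses import dataclass

# Device costs (time per physical I/O); the numbers are illustrative placeholders.
R_D, W_D = 10.0, 10.0     # HDD read / write cost (ms)
R_S, W_S = 0.1, 0.2       # SSD read / write cost (ms)

@dataclass
class PageEstimates:
    reads_if_on_hdd: float    # r̂_D(p): estimated physical reads if never on the SSD
    reads_if_on_ssd: float    # r̂_S(p): estimated physical reads if always on the SSD
    writes_if_on_hdd: float   # ŵ_D(p): estimated physical writes if never on the SSD
    writes_if_on_ssd: float   # ŵ_S(p): estimated physical writes if always on the SSD

def benefit(p: PageEstimates) -> float:
    """B(p) = (r̂_D R_D - r̂_S R_S) + (ŵ_D W_D - ŵ_S W_S): I/O time saved by caching p."""
    return ((p.reads_if_on_hdd * R_D - p.reads_if_on_ssd * R_S)
            + (p.writes_if_on_hdd * W_D - p.writes_if_on_ssd * W_S))

def should_admit(candidate: PageEstimates, ssd_resident: list[PageEstimates]) -> bool:
    """Admit the candidate if some resident page has a smaller benefit; that
    minimum-benefit page would then be evicted to make room."""
    return (not ssd_resident) or min(benefit(q) for q in ssd_resident) < benefit(candidate)
```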
6. Performance evaluation of GD2L and CAC
Next, we consider the impact of switching from a non-anticipatory cost-based SSD manager (CC) to an anticipatory one (CAC). Figure 7 shows that GD2L+CAC provides additional performance gains above and beyond those achieved by GD2L+CC in the case of the large (30GB) database. Together, GD2L and CAC provide a TPC-C performance improvement of about a factor of two relative to the LRU+CC baseline in our 30GB tests. The performance gain was less significant in the 15GB database tests and non-existent in the 8GB database tests.

Figure 8 shows that GD2L+CAC results in lower total I/O costs on both the SSD and HDD devices, relative to GD2L+CC, in the 30GB experiments. Both policies result in similar buffer pool hit ratios, so the lower I/O cost achieved by GD2L+CAC is attributable to better decisions about which pages to retain on the SSD. To better understand the reasons for the lower total I/O cost achieved by CAC, we analyzed logs of system activity to try to identify specific situations in which GD2L+CC and GD2L+CAC make different placement decisions. One interesting situation we encounter is one in which a very hot page that is in the buffer pool is placed in the SSD. This may occur, for example, when the page is cleaned by the buffer manager and there is free space in the SSD, either during cold start or because of invalidations. When this occurs, I/O activity for the hot page will spike because GD2L will consider the page to be a good eviction candidate. Under the CC policy, such a page will tend to remain in the SSD because CC prefers to keep pages with high I/O activity in the SSD. In contrast, CAC is much more likely to evict such a page from the SSD, since it can (correctly) estimate that moving the page will result in a substantial drop in I/O activity. Thus, we find that GD2L+CAC tends to keep very hot pages in the buffer pool and out of the SSD, while with GD2L+CC such pages tend to remain in the SSD and bounce into and out of the buffer pool. Such dynamics illustrate why it is important to use an anticipatory SSD manager (like CAC) if the buffer pool manager is cost-aware.

For the experiments with smaller databases (15GB and 8GB), there is little difference in performance between GD2L+CC and GD2L+CAC. Both policies result in similar per-transaction I/O costs and similar TPC-C throughput. This is not surprising, since in these settings most or all of the hot part of the database can fit into the SSD, i.e., there is no need to be …

… marginally faster than LRU+LRU2, and for the smallest database (8GB) they were essentially indistinguishable. LRU+MV-FIFO performed much worse than the other two algorithms in all of the scenarios we tested.
Figure 10: TPC-C throughput under LRU+MV-FIFO, LRU+LRU2, GD2L+LRU2, and GD2L+CAC for various database and DBMS …
} GD2L+CAC achieves the best performance
Alg & BP size (GB)   HDD util (%)   HDD I/O (ms)   SSD util (%)   SSD I/O (ms)   Total I/O (ms)   BP miss rate (%)
GD2L+CAC
  1G                       85             39.6            19             8.8            48.4              7.4
  2G                       83             31.8            20             7.8            39.6              6.3
  4G                       82             23.5            20             5.8            29.3              4.8
LRU+LRU2
  1G                       85             49.4            15             8.6            58.0              6.7
  2G                       87             50.3            12             6.7            57.0              4.4
  4G                       90             48.5             9             4.7            53.2              2.4
GD2L+LRU2
  1G                       73             41.1            43            24.6            65.8             10.7
  2G                       79             46.5            32            18.9            65.3              8.8
  4G                       79             34.4            32            13.8            48.2              7.6
LRU+FIFO
  1G                       91            101.3             8             9.2           110.5              6.6
  2G                       92             92.3             6             5.8            98.1              4.4
  4G                       92             62.9             5             3.7            66.6              2.5
Figure 11: Device Utilizations, Buffer Pool Miss Rate, and Normalized I/O Time (DB size = 30GB)
… to 400M. We tested k set to 1%, 2%, 5%, and … of the SSD size. Our results showed that k values in this range had little impact on TPC-C throughput. In InnoDB, a page identifier is eight bytes and the size of each page is …; the hash map for a 400M SSD fits into ten pages. We measured the rate with which the SSD hash map was flushed and found that even with k = 1%, the highest rate of change of the SSD hash map experienced by any of the three SSD management algorithms (CAC, CC, and LRU2) is less than … per second. Thus, the overhead imposed by checkpointing the hash map is negligible.
7. RELATED WORK
Placing hot data in fast storage and cold data in slow storage (e.g., tapes) is not new. Hierarchical storage management (HSM) is a technique which automatically moves data between high-cost and low-cost storage media. It uses …
TPC-C throughput (from Figure 10)
Device utilization, buffer pool miss rate, and normalized I/O time (from Figure 11)