Query-Aware Database
Generators
Background and
Demands
How to simulate clients’ business?
● Ideal circumstance: Database Schema + Real Data + SQL
○ Sometimes Real Data is not available
● Alternatives: Database Schema + SQL + (Statistics?)
● Mock distributions/domains for an attribute
○ pk -> uniform distribution
○ pk + fk -> Integrity Constraint
○ Zipfian
○ Enum Type (very low cardinality)
● Some SQL queries are too complex to understand:
○ TPC… SSB…
○ Heavy Analytical Tasks
● We need a query-aware generator
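The per-attribute mocking rules above can be sketched as small value generators. A minimal sketch in Python; all function names here are illustrative, not from any library:

```python
import random

def uniform_pk(n):
    """Primary key column: a dense unique range (uniform)."""
    return list(range(1, n + 1))

def zipfian(domain, n, s=1.5, seed=0):
    """Skewed column: Zipf-like weights over a ranked domain."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(domain) + 1)]
    return rng.choices(domain, weights=weights, k=n)

def enum_col(values, n, seed=0):
    """Low-cardinality enum column: uniform picks from a small set."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(n)]

def fk_col(pks, n, seed=0):
    """Foreign keys drawn only from existing pks, so referential
    integrity holds by construction."""
    rng = random.Random(seed)
    return [rng.choice(pks) for _ in range(n)]
```

Such rules work per column in isolation; as the next slides argue, they cannot express cross-column correlation or query-specific cardinalities.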
Hassles of Making a Generator
● For a simple query, the solution is simple and direct
○ select [cols...] from t1 join t2 on t1.pk = t2.fk
● What if we need to simulate some correlation?
○ where a < ‘xx’ and b > ‘yy’
● Or process complex projections
○ select case when cond1 then ***, case when cond2
● And combinations of the above circumstances…
Two papers with two different methods
● QAGen
● DGL
QAGen: Generating query-aware test databases
Symbolic Execution
● Volcano-model iterator
● Data are symbols instead of real data
● Generate constraints during execution
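A minimal sketch of what "data as symbols" can look like (the names here are mine; the paper's actual representation is richer): each tuple holds fresh symbols instead of values, and operators append constraints over those symbols as they "execute".

```python
import itertools

_counter = itertools.count()

def fresh(attr):
    """A fresh symbol for one attribute of one tuple."""
    return f"{attr}#{next(_counter)}"

class SymbolicTable:
    def __init__(self, name, attrs, cardinality):
        self.name = name
        # Each row is a dict attr -> symbol; no concrete data yet.
        self.rows = [{a: fresh(a) for a in attrs} for _ in range(cardinality)]

constraints = []  # accumulated as operators consume symbolic tuples

t1 = SymbolicTable("t1", ["a", "b"], cardinality=3)
for row in t1.rows:
    constraints.append((row["a"], "<", "xx"))  # e.g. from WHERE a < 'xx'
```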
Agenda of QAGen
● Analyze the Query and Find the Available “Knobs”
○ A knob of an operator is the cardinality of its output,
○ or its value distribution,
○ or whether its input is pre-grouped
● Do the Symbolic Execution (and Get the Constraint of the Data)
● Instantiating the Data
○ Always expensive
Analyze the Query
Symbolic Execution - Selection
● GetNext for child, read tuple “t”
● [Positive Tuple Annotation] if the output has not yet reached cardinality c
○ for each symbol s in t, insert <s, p>
○ return t to parent
● [Negative Tuple Processing]
○ insert <s, !p>
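The selection algorithm above, as a rough Python sketch — the knob `c` and predicate `p` follow the slide; the class and method names are mine:

```python
class SymbolicSelect:
    """Selection with cardinality knob c: the first c input tuples are
    constrained to satisfy predicate p, the rest to violate it."""
    def __init__(self, child, p, c, constraints):
        self.child, self.p, self.c = child, p, c
        self.constraints = constraints   # shared constraint store
        self.emitted = 0

    def get_next(self):
        for t in self.child:                  # GetNext on the child
            if self.emitted < self.c:         # positive tuple annotation
                for s in t.values():
                    self.constraints.append((s, self.p))
                self.emitted += 1
                yield t                       # return t to the parent
            else:                             # negative tuple processing
                for s in t.values():
                    self.constraints.append((s, "NOT(" + self.p + ")"))

constraints = []
child = [{"a": f"a#{i}", "b": f"b#{i}"} for i in range(4)]
sel = SymbolicSelect(child, "a < 'xx'", c=2, constraints=constraints)
output = list(sel.get_next())   # 2 tuples reach the parent
```

Note that negative tuples still cost work: their symbols must be constrained to *fail* the predicate, or the instantiated table would change the output cardinality.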
Symbolic Execution - Join
● Generate Join Distribution
● GetNext for (outer) child, read tuple “t”
● [Positive Tuple Annotation] replace the fk symbol with the pk symbol
● [Negative Tuple Processing]
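A sketch of the fk-to-pk substitution step (a hypothetical helper; the real operator also handles negative tuples and pre-grouped inputs): the join result is fixed by construction, because each outer tuple's fk symbol becomes the pk symbol of the inner tuple chosen by the desired join distribution.

```python
import random

def symbolic_join(outer_rows, inner_rows, fk, pk, fanout_dist, seed=0):
    """Wire the join by symbol substitution, following the chosen
    join distribution (fanout_dist weights the inner tuples)."""
    rng = random.Random(seed)
    joined = []
    for t in outer_rows:
        partner = rng.choices(inner_rows, weights=fanout_dist, k=1)[0]
        t[fk] = partner[pk]              # replace the fk with the pk symbol
        joined.append({**t, **partner})
    return joined

inner = [{"pk": f"pk#{i}", "name": f"n#{i}"} for i in range(2)]
outer = [{"fk": f"fk#{i}", "qty": f"q#{i}"} for i in range(5)]
joined = symbolic_join(outer, inner, "fk", "pk", fanout_dist=[1, 3])
```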
Instantiating
● A typical process of model checking.
● Low efficiency
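To see why instantiation is the expensive step, here is a deliberately naive, model-checking-style sketch (illustrative only, not the paper's algorithm): each symbol is resolved by scanning the domain against all of its constraints.

```python
def instantiate(symbols, constraints, domain):
    """Assign a concrete value to every symbol by brute-force search.
    Cost grows with |symbols| * |domain| * |constraints| -- hence slow."""
    assignment = {}
    for s in symbols:
        preds = [p for (sym, p) in constraints if sym == s]
        for v in domain:
            if all(p(v) for p in preds):
                assignment[s] = v
                break
        else:
            raise ValueError(f"unsatisfiable symbol {s}")
    return assignment

# e.g. symbol a#0 must be < 50, symbol b#0 must be > 90
cs = [("a#0", lambda v: v < 50), ("b#0", lambda v: v > 90)]
print(instantiate(["a#0", "b#0"], cs, range(100)))  # {'a#0': 0, 'b#0': 91}
```

Real systems hand the constraints to a constraint solver instead, but the cost per symbol remains the dominant factor.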
Do we need a state-of-the-art generator?
● Pros:
○ without manual intervention, we are guaranteed to get valid data
○ works for complex queries that are hard to understand
○ perfectly guarantees constraints
■ Cardinality constraints
■ Integrity constraints
● Cons:
○ low efficiency
○ some cases still can’t be processed
● How about introducing an intermediate DSL?
DGL: Flexible Database
Generators
Another alternative
● What if we have understood the workload pretty well?
● Then use a more flexible DSL instead of coding for every workload?
● DGL (Data Generation Language):
○ Scalar
○ Rows
○ Iterator
■ Distribution
■ Real SQL (Query / Persist)
○ Tables
○ Expression / Functions
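A toy illustration of the flavor of such a DSL, embedded in Python — the combinator names are mine, not DGL's actual syntax: scalars and distribution iterators compose into table generators.

```python
import random

def scalar(value):
    """A constant column source."""
    while True:
        yield value

def uniform_int(lo, hi, seed=0):
    """An iterator over a uniform integer distribution."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(lo, hi)

def sequence(start=1):
    """A dense key sequence."""
    i = start
    while True:
        yield i
        i += 1

def table(n, **columns):
    """Materialize n rows by pulling one value from each column iterator."""
    return [{name: next(it) for name, it in columns.items()} for _ in range(n)]

rows = table(5, id=sequence(), price=uniform_int(1, 100), status=scalar("OPEN"))
```

The point of the DSL is exactly this compositionality: once the workload is understood, a table spec is a short expression rather than a bespoke program.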
An example of TPC-H
Thanks
