The document summarizes a research presentation on distributed storage and processing. It discusses two papers: 1) PABIRS, an integrated data access middleware for distributed file systems that efficiently processes mixed query workloads, and 2) a system for scalable distributed transactions across heterogeneous stores.
It then provides details on PABIRS, which uses a hybrid index with a bitmap index and LSM (log-structured merge) tree index. The bitmap index is used for low selectivity keys, while the LSM index is built for "hot" values with selectivity above a threshold. The system aims to efficiently support data retrieval and insertion for various query workloads on distributed file systems.
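As a rough illustration of the hybrid-index idea (hypothetical class and threshold names, not the PABIRS implementation), the sketch below routes lookups to an LSM-style structure once a key becomes "hot" and falls back to a bitmap-style index otherwise:

```python
# Illustrative sketch only: route lookups between a bitmap-style index for
# ordinary keys and an LSM-style index for keys whose observed lookup
# frequency crosses a threshold. Names and policy are assumptions.

class HybridIndex:
    def __init__(self, hot_threshold=3):
        self.bitmap = {}    # key -> set of record positions (stands in for a bitmap index)
        self.hot = {}       # key -> positions (stands in for the LSM-tree index)
        self.freq = {}      # observed lookup frequency per key
        self.hot_threshold = hot_threshold

    def insert(self, key, pos):
        self.bitmap.setdefault(key, set()).add(pos)
        if key in self.hot:
            self.hot[key].add(pos)      # keep the hot index in sync

    def lookup(self, key):
        self.freq[key] = self.freq.get(key, 0) + 1
        if self.freq[key] >= self.hot_threshold and key not in self.hot:
            self.hot[key] = set(self.bitmap.get(key, set()))  # promote hot key
        if key in self.hot:
            return self.hot[key]            # served by the LSM-style index
        return self.bitmap.get(key, set())  # served by the bitmap index
```

The point of the dispatch is that frequently queried keys get a structure optimized for fast point lookups and inserts, while the bitmap handles the long tail of keys cheaply.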
The document discusses Spark, a system for large-scale data processing. It provides an example of using Spark to perform a text search over logs stored in HDFS to find and extract error messages related to "HDFS" and their time fields. The example shows defining RDDs from files, applying filters and maps as transformations, and using actions like count and collect. It explains how Spark operations are lazily evaluated and how caching improves performance of repeated queries.
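The lazy-evaluation behaviour described above can be imitated in a few lines of plain Python (a toy `ToyRDD` class, not the real Spark API): transformations like `filter` and `map` only record a plan, actions like `count` and `collect` force it, and `cache` keeps a materialized result for reuse:

```python
# Toy illustration of Spark-style lazy evaluation (hypothetical ToyRDD class,
# not the real Spark API).

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute      # no-arg function producing the data
        self._cached = None
        self._do_cache = False

    def _data(self):
        if self._cached is not None:
            return self._cached      # reuse the cached result
        data = self._compute()
        if self._do_cache:
            self._cached = data
        return data

    # transformations: lazy, nothing is computed here
    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._data() if pred(x)])

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._data()])

    def cache(self):
        self._do_cache = True
        return self

    # actions: these force the computation
    def count(self):
        return len(self._data())

    def collect(self):
        return list(self._data())

# Mirror of the log-search example: find HDFS error messages and their times.
lines = ToyRDD(lambda: ["ERROR HDFS t=1", "INFO all good", "ERROR HDFS t=2"])
errors = lines.filter(lambda l: l.startswith("ERROR")).cache()
hdfs_times = errors.filter(lambda l: "HDFS" in l).map(lambda l: l.split()[-1])
# errors.count() == 2; hdfs_times.collect() == ["t=1", "t=2"]
```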
This document discusses data structures in C++. It begins by introducing structures as a data type that can store different data types together under a single name. It then covers:
- Declaring and defining single structures with examples.
- Using arrays of structures to organize related data, such as employee records, together rather than across multiple arrays.
- Passing structures as arguments to functions, either as the entire structure or as individual members. When passed by value, changes to the local copy inside the function do not affect the original variable.
- Declaring structures at the global scope for use across multiple functions.
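The array-of-structures point can be sketched outside C++ as well; the following Python snippet (hypothetical `Employee` fields) contrasts keeping related data in parallel arrays with keeping it together in one list of records:

```python
# Sketch of the array-of-structures idea (the document's examples are in C++;
# the concept is the same): one list of records instead of several parallel
# arrays that must be kept in sync by index.
from dataclasses import dataclass

@dataclass
class Employee:
    name: str
    department: str
    salary: int

# Parallel-array style: related fields scattered across three lists.
names = ["Ada", "Grace"]
departments = ["R&D", "Ops"]
salaries = [120, 110]

# Array-of-structures style: each record keeps its fields together.
employees = [Employee("Ada", "R&D", 120), Employee("Grace", "Ops", 110)]
rnd_salaries = [e.salary for e in employees if e.department == "R&D"]
```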
Deep Packet Inspection with Regular Expression Matching - Editor IJCATR
Deep packet inspection directs, persists, filters, and logs IP-based application and Web services traffic based on content encapsulated in a packet's header or payload, regardless of the protocol or application type. In content scanning, the packet payload is compared against a set of patterns specified as regular expressions. With deep packet inspection in place through a single intelligent network device, companies can boost performance without buying expensive servers or additional security products. These patterns are typically matched through deterministic finite automata (DFAs), but large rule sets require an amount of memory that is too large for practical implementation. Many recent works have proposed improvements to address this issue, but they increase the number of transitions (and thus of memory accesses) per character. This paper presents a new representation for DFAs, orthogonal to most of the previous solutions, called delta finite automata (δFA), which considerably reduces states and transitions while preserving only one transition per character, thus allowing fast matching. A further optimization exploits Nth-order relationships within the DFA by adopting the concept of temporary transitions.
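A minimal table-driven DFA makes the "one transition per character" property concrete (a toy three-state automaton, not the paper's δFA construction):

```python
# Minimal DFA sketch: matching consumes exactly one transition per input
# character, which is the property delta-FA works to preserve. This toy
# automaton accepts strings containing the substring "ab".

# states: 0 = start, 1 = just saw 'a', 2 = saw "ab" (accepting, absorbing)
def step(state, ch):
    if state == 2:
        return 2          # accepting state absorbs the rest of the input
    if ch == 'a':
        return 1
    if state == 1 and ch == 'b':
        return 2
    return 0

def matches(text):
    state = 0
    for ch in text:       # exactly one transition per character
        state = step(state, ch)
    return state == 2
```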
1) The document discusses how hierarchical file systems are organized using tree data structures, with directories and files represented as nodes.
2) It provides examples and explanations of key tree terminology like root, leaf, height, and level.
3) Binary trees are discussed in more detail, including their properties and different representations like linked and sequential structures.
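The two binary-tree representations mentioned can be sketched side by side (a small hypothetical tree; Python used for brevity):

```python
# Sketch of the two binary-tree representations the document compares:
# a linked structure with explicit child references, and a sequential
# (array) structure where node i's children live at indexes 2i+1 and 2i+2.

class Node:                     # linked representation
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

root = Node('A', Node('B', Node('D')), Node('C'))

# sequential representation of the same tree (None marks absent nodes)
seq = ['A', 'B', 'C', 'D', None, None, None]

def left(i):  return 2 * i + 1   # index of left child
def right(i): return 2 * i + 2   # index of right child

assert seq[left(0)] == root.left.value              # both say B is A's left child
assert seq[left(left(0))] == root.left.left.value   # ... and D is B's left child
```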
AES cryptography algorithm based on intelligent Blum Blum-Shub PRNGs publica... - zaidinvisible
This document summarizes a study that proposes enhancing the Advanced Encryption Standard (AES) algorithm by using an intelligent Blum-Blum-Shub (BBS) pseudo-random number generator to generate the initial encryption key. The AES algorithm is described along with its standard steps of sub-bytes, shift rows, mix columns, and add round key. Issues with the security of AES's public key are discussed. The study then introduces BBS and Iterated Local Search (ILS) metaheuristics and describes how combining them can generate strong cryptographic keys. An example is provided to demonstrate encrypting a message with the enhanced AES approach using an intelligent BBS-generated key. The study concludes the method increases encryption efficiency.
Query Processing and Optimisation - Lecture 10 - Introduction to Databases (1... - Beat Signer
This document discusses query processing and optimization in databases. It covers the basic steps of query processing including parsing, optimization, and evaluation. It also describes different algorithms for query operations like selection, join, and sorting that are used to process queries efficiently. The goals of query optimization are to select the most efficient query execution plan based on the given data and minimize the number of disk accesses.
23. Advanced Datatypes and New Application in DBMS - koolkampus
This document discusses advanced data types and new applications in databases, including temporal data, spatial and geographic data, and multimedia data. It covers topics such as representing time in databases, temporal query languages, representing geometric information and spatial queries, indexing spatial data using structures like k-d trees and quadtrees, and applications of geographic data like in vehicle navigation systems.
The document discusses the MapReduce framework. It covers topics like the MapReduce programming model which divides work into map and reduce phases, data flow in MapReduce, and key concepts like input splits, mappers, reducers, and the shuffle process. It also provides examples of word count implementation and explains the relationship between input splits and HDFS blocks.
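The word-count example can be sketched in pure Python, with each MapReduce phase as a separate function (a single-process simulation, not Hadoop itself):

```python
# Pure-Python sketch of the MapReduce word-count flow: map emits (word, 1)
# pairs, shuffle groups the pairs by key, reduce sums each group.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)   # all values for a key end up together
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or", "not to be"])))
# counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map calls run on the nodes holding each input split, and the shuffle moves each key's pairs over the network to the reducer responsible for it.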
Tutorial on Parallel Computing and Message Passing Model - C4 - Marcirio Chaves
This document provides a tutorial on communicating non-contiguous data and mixed data types in parallel computing using MPI (Message Passing Interface). It discusses several strategies for sending this type of complex data, including sending multiple messages, buffering using pack/unpack, and defining derived datatypes. It also covers collective communication operations like broadcast, scatter/gather, and reductions.
A microprocessor is an electronic component that is used by a computer to do its work. It is a central processing unit on a single integrated circuit chip containing millions of very small components including transistors, resistors, and diodes that work together. Some microprocessors in the 20th century required several chips. Microprocessors help to do everything from controlling elevators to searching the Web. Everything a computer does is described by instructions of computer programs, and microprocessors carry out these instructions many millions of times a second. [1]
Microprocessors were invented in the 1970s for use in embedded systems. The majority are still used that way, in such things as mobile phones, cars, military weapons, and home appliances. Some microprocessors are microcontrollers, so small and inexpensive that they are used to control very simple products like flashlights and greeting cards that play music when you open them. A few especially powerful microprocessors are used in personal computers.
The document discusses distributed query processing and optimization in distributed database systems. It covers topics like query decomposition, distributed query optimization techniques including cost models, statistics collection and use, and algorithms for query optimization. Specifically, it describes the process of optimizing queries distributed across multiple database fragments or sites including generating the search space of possible query execution plans, using cost functions and statistics to pick the best plan, and examples of deterministic and randomized search strategies used.
Positional Data Organization and Compression in Web Inverted Indexes - Leonidas Akritidis
The conference presentation of the article:
L. Akritidis, P. Bozanis, "Positional Data Organization and Compression in Web Inverted Indexes", In Proceedings of the 23rd International Conference on Database and Expert Systems Applications (DEXA), Lecture Notes in Computer Science (LNCS), vol. 7446, pp. 422-429, 2012.
which was presented in Vienna, Austria in September 2012.
A data structure is a way of collecting and organising data so that operations can be performed on it effectively. Data structures render data elements in terms of some relationship, for better organization and storage. For example, consider a player's name "Virat" and age 26: "Virat" is of String data type and 26 is of integer data type.
We can organize this data as a Player record, and then collect and store players' records in a file or database as a data structure, for example: "Dhoni" 30, "Gambhir" 31, "Sehwag" 33.
In simple terms, data structures are structures programmed to store ordered data so that various operations can be performed on it easily.
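The player records above can be sketched as a small record type (a Python namedtuple, used purely for illustration):

```python
# The Player record described above, as a namedtuple: each record pairs a
# name (string) with an age (integer), and a list of records is the
# collection we can run operations over.
from collections import namedtuple

Player = namedtuple("Player", ["name", "age"])

players = [Player("Virat", 26), Player("Dhoni", 30),
           Player("Gambhir", 31), Player("Sehwag", 33)]

oldest = max(players, key=lambda p: p.age)   # an operation over the records
# oldest == Player(name='Sehwag', age=33)
```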
The document provides 29 sample questions on data structures and algorithms along with their answers. Some of the key questions covered include:
1. What is a data structure and examples of areas where they are applied extensively such as compiler design, operating systems, etc.
2. Major data structures used in relational databases, network and hierarchical data models.
3. Data structures used to perform recursion and evaluate arithmetic expressions.
4. Sorting algorithms like quicksort illustrated through an example.
5. Properties of different trees including the number of possible trees with a given number of nodes and number of null branches in a binary tree.
This document provides an overview of query processing costs, selection operations, join operations, and concurrency control in database systems. It discusses how the costs of queries are estimated based on factors like disk accesses and seeks. It then describes algorithms for common operations like selection, join, and concurrency control protocols. Selection algorithms include file scan, binary search, and using indexes. Join algorithms include nested loops, block nested loops, indexed nested loops, merge join, and hash join. Concurrency control protocols help manage concurrent transaction executions and maintain consistency.
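Of the join algorithms listed, hash join is straightforward to sketch: build a hash table on one input's join key, then probe it with the other input (hypothetical employee/department rows):

```python
# Sketch of the hash-join algorithm: build a hash table on the (usually
# smaller) build input's join key, then probe it with the other input.
# The row contents here are illustrative.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:                 # build phase
        table.setdefault(row[build_key], []).append(row)
    result = []
    for row in probe_rows:                 # probe phase
        for match in table.get(row[probe_key], []):
            result.append({**match, **row})
    return result

depts = [{"dept_id": 1, "dept": "R&D"}, {"dept_id": 2, "dept": "Ops"}]
emps  = [{"emp": "Ada", "dept_id": 1}, {"emp": "Grace", "dept_id": 2},
         {"emp": "Alan", "dept_id": 1}]
joined = hash_join(depts, emps, "dept_id", "dept_id")
# three joined rows, e.g. {"dept_id": 1, "dept": "R&D", "emp": "Ada"}
```

Each probe row costs one hash lookup instead of a scan of the build input, which is why hash join usually beats nested loops when no useful index exists.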
Big data analytics - K.Kiruthika, II-M.Sc. Computer Science, Bonsecours college f... - Kiruthikak14
MapReduce is a programming model used to process large datasets in a distributed system. It involves three main steps: Map, Shuffle, and Reduce. Map processes the input data and produces intermediate key-value pairs. Shuffle redistributes the data to reduce nodes based on the keys. Reduce aggregates the intermediate data with the same key. Serialization converts object containers into byte streams for transferring and storing data, and is commonly used in Big Data systems for its benefits like splittability and portability. Popular serialization formats include JSON, XML, YAML, and binary formats like HDF and netCDF.
The document provides information about a class presentation on bus structures. It discusses parallel and serial communication, synchronous and asynchronous buses, basic protocol concepts, and bus arbitration. Specifically, it defines parallel and serial communication, explains the differences between synchronous and asynchronous buses, describes the basic components of a bus transaction including requests and data transfer, and outlines different approaches to bus arbitration including daisy chain, centralized parallel arbitration, and polling. The presentation aims to provide both a review of key bus topics and a practical exposure to the concepts through examples and diagrams.
MapReduce is a programming model used for processing and generating large data sets in a parallel, distributed manner. It involves three main steps: Map, Shuffle, and Reduce. In the Map step, data is processed by individual nodes. In the Shuffle step, data is redistributed based on keys. In the Reduce step, processed data with the same key is grouped and aggregated. Serialization is the process of converting data into a byte stream for storage or transmission. It allows data to be transferred between systems and formats like JSON, XML, and binary formats are commonly used. Schema control is important for big data serialization to validate data structure.
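The serialization step can be sketched with a JSON round trip (one of the text formats named above):

```python
# Serialization as described above: convert a record to a byte stream
# (here via JSON) so it can be stored or sent between systems, then
# convert it back. The record contents are illustrative.
import json

record = {"word": "error", "count": 42}
stream = json.dumps(record).encode("utf-8")    # object -> byte stream
assert isinstance(stream, bytes)
restored = json.loads(stream.decode("utf-8"))  # byte stream -> object
assert restored == record
```

Binary formats trade JSON's human readability for compactness, and schema-aware formats additionally validate the structure of each record on the way in.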
This document provides an introduction to coalesced hashing, a technique for storing and retrieving records from a database. Coalesced hashing stores records in two areas: an address region and a cellar. It is sensitive to the relative sizes of these areas, represented by an address factor. The document analyzes how to optimize this factor to minimize search time. It finds the compromise address factor of 0.86 works well. Coalesced hashing with this optimized factor outperforms other methods like separate chaining and linear probing.
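A toy version of coalesced hashing with a cellar might look like the following (the table sizes, and hence the address factor, are illustrative, not the document's exact 0.86 analysis):

```python
# Toy sketch of coalesced hashing with a cellar: keys hash into the address
# region only; a colliding key goes to the highest-numbered free slot (so
# the cellar fills first) and is linked into the chain it collided with.

M = 11                  # total table size (illustrative)
ADDR = 9                # address region; slots 9..10 form the cellar
slots = [None] * M      # each used slot holds [key, next_slot_or_None]

def insert(key):
    h = hash(key) % ADDR
    if slots[h] is None:
        slots[h] = [key, None]
        return
    i = h
    while slots[i][1] is not None:          # walk to the end of the chain
        i = slots[i][1]
    j = M - 1
    while j >= 0 and slots[j] is not None:  # highest free slot: cellar first
        j -= 1
    if j < 0:
        raise RuntimeError("table full")
    slots[j] = [key, None]
    slots[i][1] = j                         # coalesce into the existing chain

def search(key):
    i = hash(key) % ADDR
    while i is not None and slots[i] is not None:
        if slots[i][0] == key:
            return True
        i = slots[i][1]
    return False

for k in (2, 11, 20):   # 11 and 20 collide with 2: all hash to slot 2
    insert(k)
```

A larger cellar absorbs collisions without stealing address-region slots, but shrinks the address region itself; tuning that split is exactly the address-factor optimization the document analyzes.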
DB2 FAQs provides questions and answers about DB2 concepts including what DB2 is, what an access path is, what a plan and bind are, what buffer pools and storage groups are used for, and what information can be found in DB2 catalog tables.
Furnish an Index Using the Works of Tree Structures - ijceronline
This document discusses tree-based indexing schemes, specifically B-trees and B+-trees. It provides definitions and descriptions of the key components and properties of B-trees and B+-trees, including their nodes, keys, pointers, operations like search, insertion and deletion. Examples and figures are used to illustrate the concepts. The capacity and performance of B-trees and B+-trees are also analyzed and compared.
This document discusses storage management in database systems. It describes the storage device hierarchy from fastest but smallest (cache) to slowest but largest (magnetic tapes). It covers main memory, hard disks, solid state drives and tertiary storage. The document also discusses RAID configurations and how the relational model is represented on secondary storage through records, blocks, files and indexes.
Interpolation is a method for constructing new data points within the range of a discrete set of known data points. There are various interpolation methods, including linear interpolation which uses a straight line between points, polynomial interpolation which uses higher degree polynomials, and spline interpolation which uses smooth piecewise polynomials. The goal is to accurately estimate values between the known data points. The accuracy and smoothness of the estimate varies between methods.
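Linear interpolation, the simplest of the methods listed, can be sketched directly (hypothetical sample points):

```python
# Linear interpolation: estimate y at x by following the straight line
# between the two known points surrounding x. Sample points are illustrative.

def lerp(points, x):
    """points: list of (x, y) pairs sorted by x; x must lie within their range."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)      # fraction of the way along the segment
            return y0 + t * (y1 - y0)
    raise ValueError("x outside the range of known points")

pts = [(0.0, 0.0), (1.0, 10.0), (2.0, 14.0)]
assert lerp(pts, 0.5) == 5.0    # halfway along the first segment
assert lerp(pts, 1.5) == 12.0   # halfway along the second segment
```

Polynomial and spline interpolation replace each straight segment with a smoother curve, at the cost of more computation per estimate.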
A Novel Approach of Caching Direct Mapping using Cubic Approach - Kartik Asati
I worked on this research paper during my master's degree (MCA) and successfully presented it at the 6th International Conference on "Science Engineering Technology (SET)" in May 2013 at VIT University.
- The document discusses a paper on Distributed Interactive Cube Exploration (DICE), a system that allows interactive exploration of large data cubes with response times of seconds for queries involving billions of tuples.
- DICE improves query response times in distributed environments for data cube exploration through speculative query execution and online data sampling combined with a cost-based framework.
- It proposes a faceted cube exploration model to limit the search space of speculative queries by considering consecutive queries as query sessions.
- GD2L is a cost-aware buffer pool management algorithm that uses two priority queues to manage pages on SSD vs HDD. CAC is a predictive cost-based technique for managing which pages are placed on SSD.
- Experiments show that using GD2L and CAC together provides up to 2x better TPC-C performance compared to LRU baseline, by lowering total I/O costs on both SSD and HDD for large 30GB databases. For smaller databases the gains were less significant.
- CAC is able to make better decisions than a non-anticipatory technique about which pages should remain on SSD long-term in order to reduce I/O costs.
This document summarizes information about a person named Takeshi Arabiki. It includes:
1. Their Twitter handle is @a_bicky and ID is id:a_bicky.
2. A link to their blog on Hatena is provided.
3. They have written books and slides about using R and SciPy.
4. Links are provided to their slideshare presentations about using Twitter and R.
Low complexity low-latency architecture for matching - Bhavya Venkatesh
This document discusses architectures for matching data encoded with error-correcting codes to reduce latency and complexity. It proposes a new architecture that parallelizes comparison of the data and parity portions of systematic codes. It also introduces a butterfly-formed weight accumulator to efficiently compute Hamming distance. Evaluation shows the proposed architecture reduces latency and hardware complexity compared to conventional decode-and-compare and encode-and-compare architectures.
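At the heart of such matchers is a Hamming-distance check between a received word and a stored codeword. A minimal software sketch of that comparison (illustrative names; this is not the paper's butterfly hardware, which parallelizes the same computation):

```python
def hamming_distance(a: int, b: int, width: int = 16) -> int:
    """XOR the words, then count the 1 bits -- the weight the accumulator sums."""
    x = a ^ b
    return sum((x >> i) & 1 for i in range(width))

def is_match(a: int, b: int, radius: int = 1, width: int = 16) -> bool:
    """Declare a match when the distance is within the code's correcting radius."""
    return hamming_distance(a, b, width) <= radius
```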
DeepSort is a 'scalable and efficiency-optimized distributed general sorting engine.' DeepSort enables a fluent data flow that shares the limited memory space and minimizes data movement, which makes it highly efficient at large scale.
XML is a standard for data exchange between web applications such as e-commerce, e-learning, and other web portals. The volume of data on the web has grown substantially, and to retrieve and store these data effectively it is recommended that they be physically or virtually fragmented and distributed across different nodes. Fragmentation design consists of two parts: the fragmentation operation and the fragmentation method. The three kinds of fragmentation operation (Horizontal, Vertical, and Hybrid) determine how the XML should be fragmented. The aim of this paper is to give an overview of fragmentation design considerations.
for SBI SO: DS, C, C++, Unix, RDBMS, SQL, CN, OS (alisha230390)
This document contains 35 questions related to data structures and algorithms. It covers data structures used in different areas such as databases, networks, and hierarchies. Other topics covered include trees, graphs, sorting, hashing, and file structures. Sample problems on these topics are given to test understanding.
Counting and sorting are basic tasks that distributed systems rely on. The document discusses different approaches for distributed counting and sorting, including software combining trees, counting networks, and sorting networks. Counting networks like bitonic and periodic networks have depth O(log² w), where w is the network width. Sorting networks can sort in the same time complexity by exploiting an isomorphism between counting and sorting networks. Sample sorting is also discussed as a way to sort large datasets across multiple threads.
GEN: A Database Interface Generator for HPC Programs (Tanu Malik)
GEN is a database interface generator that takes user-supplied C declarations and provides an interface to load scientific array data into databases without requiring changes to source code. It works by wrapping POSIX I/O calls at runtime to generate database schema definitions and load data. Experiments show it can reduce the time needed to reorganize data in the database compared to loading data from files and reorganizing outside the database. Current work aims to relax GEN's assumptions and improve data loading performance.
ADBS_parallel Databases in Advanced DBMS (chandugoswami)
This document discusses parallel database architecture. It covers various types of parallelism including I/O parallelism, inter-query parallelism, and intra-query parallelism. It describes techniques for partitioning relations across multiple disks to enable I/O parallelism, including round robin, hash, and range partitioning. It also addresses issues like skew in partitioning and techniques to handle skew like virtual processor partitioning and histograms.
OMT: A DYNAMIC AUTHENTICATED DATA STRUCTURE FOR SECURITY KERNELS (IJCNCJournal)
We introduce a family of authenticated data structures, Ordered Merkle Trees (OMT), and illustrate their utility in security kernels for a wide variety of sub-systems. Specifically, the utility of two types of OMTs, a) the index ordered merkle tree (IOMT) and b) the range ordered merkle tree (ROMT), is investigated for their suitability in security kernels for various sub-systems of the Border Gateway Protocol (BGP), the Internet's inter-autonomous-system routing infrastructure. We outline simple generic security kernel functions to maintain OMTs, and sub-system-specific security kernel functionality for BGP sub-systems (like registries, autonomous system owners, and BGP speakers/routers) that take advantage of OMTs.
PREFIX-BASED LABELING ANNOTATION FOR EFFECTIVE XML FRAGMENTATION (ijcsit)
XML has gradually been employed as a standard of data exchange in the web environment since its inception in the 90s. It serves as a data-exchange format between systems and other applications. Meanwhile, the data volume on the web has grown substantially, so effective methods of storing and retrieving these data are essential. One recommended way is to physically or virtually fragment the large chunk of data and distribute the fragments onto different nodes. Fragmentation design of an XML document consists of two parts: the fragmentation operation and the fragmentation method. The three fragmentation operations, Horizontal, Vertical, and Hybrid, determine how the XML should be fragmented. This paper aims to give an overview of fragmentation design considerations and subsequently proposes a fragmentation technique using number addressing.
Network Flow Pattern Extraction by Clustering (Eugine Kang)
This document discusses a study that uses clustering techniques to analyze network flow data and extract patterns of torrent usage at Korea University. The study transforms network flow data by time blocks. It then uses k-means clustering and principal component analysis to identify optimal cluster numbers and visualize cluster formations. The study finds that seven clusters best capture distinct torrent usage patterns. It analyzes the stability of the clusters and identifies two clusters that show regular torrent usage during working hours and heavy overall usage. The goal is to help network administrators identify times of heavy bandwidth usage.
Modifications in LSB-Based Steganography (Aslesha Niki)
This document discusses steganography techniques for hiding secret information in digital images. It describes the Least Significant Bit (LSB) substitution method, where bits of the secret message are embedded in the LSBs of pixel values. However, this can be detected through statistical analysis of the image histogram. To address this, later techniques aim to preserve the cover image histogram by embedding extra bits as needed. The document also discusses quantizing audio signals for embedding and using linear feedback shift registers to generate encryption keystreams.
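The basic LSB substitution step can be sketched in a few lines (illustrative helper names; grayscale pixel values are assumed, and none of the histogram-preserving refinements are shown):

```python
def embed_lsb(pixels, bits):
    """Replace each pixel's least significant bit with one message bit."""
    assert len(bits) <= len(pixels), "message longer than cover"
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit   # clear the LSB, then set it to the bit
    return out

def extract_lsb(pixels, n):
    """Read the message back: the low bit of each of the first n pixels."""
    return [p & 1 for p in pixels[:n]]
```

Because each pixel value changes by at most 1, the image looks unchanged, but as the summary notes, the pairing of adjacent values is exactly what histogram analysis detects.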
The document discusses porting a seismic inversion code to run in parallel using standard message passing libraries. It describes three options considered for distributing the large 3D seismic data across processors: mapping the data to a processor grid, treating it as a sparse matrix problem, or distributing the data as 1D vectors assigned to each processor. The third option was chosen as it best preserved the code structure, had regular dependencies, and simplified communications. The parallel code was implemented using the Distributed Data Library (DDL) for data management and the Message Passing Interface (MPI) for basic point-to-point communication between processors. Initial tests on clusters showed near linear speedup on up to 30 processors.
The document summarizes research on performing spatio-textual similarity joins. It discusses:
1) Developing a filter-and-refine framework to efficiently find similar object pairs from two datasets using signatures.
2) Generating spatial and textual signatures for objects and building inverted indexes on the signatures to find candidate pairs.
3) Refining the candidate pairs to obtain the final result pairs that satisfy spatial and textual similarity thresholds.
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc... (IRJET Journal)
This document discusses techniques for clustering hierarchical documents based on their structural similarity. It summarizes several existing approaches:
1) A tree edit distance-based method that represents trees as paths and computes the distance between subtrees. However, it requires trees to have a pre-specified structure.
2) Chawathe's algorithm that uses pre-order tree traversal and transforms trees into sequences of node labels and depths to calculate distances. It allows efficient assignment of new documents to clusters.
3) The XCLSC algorithm that clusters documents in two phases - grouping structurally similar documents and then searching to further improve clustering results and performance. However, it has high computational requirements.
4) The XPattern and PathXP
1. The document discusses tree-based indexing schemes like B-trees and B+-trees that are commonly used to organize data for efficient retrieval. It defines the structure of B+-trees, including that internal nodes contain keys and pointers, while leaf nodes contain keys and pointers to data.
2. Searching, insertion, and deletion operations on B+-trees are described. A search follows pointers down the tree until a leaf node is reached, where the target key is either found or shown to be absent. Insertion may cause node splits and can increase the tree height. Deletion removes keys from leaf nodes and merges nodes if they become too empty.
3. Examples and figures demonstrate searching, insertion, and deletion on sample B+-trees. The capacity
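The root-to-leaf search path described in point 2 can be sketched with a toy two-level B+-tree (the keys, record ids, and flat two-level layout are illustrative assumptions, not the document's examples):

```python
import bisect

# Root holds separator keys and implicit child pointers; each leaf holds
# sorted keys alongside pointers to their records.
root_keys = [10, 20]
leaves = [
    ([1, 5, 9],    ["r1", "r5", "r9"]),     # keys < 10
    ([10, 15],     ["r10", "r15"]),         # 10 <= keys < 20
    ([20, 25, 30], ["r20", "r25", "r30"]),  # keys >= 20
]

def bptree_search(key):
    child = bisect.bisect_right(root_keys, key)  # choose the leaf to descend into
    keys, recs = leaves[child]
    i = bisect.bisect_left(keys, key)            # binary search within the leaf
    if i < len(keys) and keys[i] == key:
        return recs[i]
    return None                                  # key absent from the leaf
```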
Advanced Non-Relational Schemas For Big Data (Victor Smirnov)
This is the presentation from barcamp in Altoros where I was explaining how various advanced non-relational schemas (or, simply, data structures) can be modelled on top of Key/Value storage. The set of covered schemas includes Dynamic Vector, File System, Searchable Bitmap, LOUDS Tree, Wavelet Tree and Inverted Index.
See https://bitbucket.org/vsmirnov/memoria/wiki/MemoriaForBigData
for additional details.
"Designing Advanced Non-Relational Schemas for Big Data" (Olga Lavrentieva)
Victor Smirnov (Java Tech Lead at Klika Technologies)
Talk: "Designing Advanced Non-Relational Schemas for Big Data"
About: Victor introduces examples of advanced non-relational data schemas and shows how they can be used to solve problems related to storing and processing big data.
Linked lists are linear data structures where elements are linked using pointers. Unlike arrays, the elements of a linked list are not stored at contiguous memory locations. Linked lists allow for dynamic sizes and easier insertion/deletion of elements compared to arrays but have disadvantages like non-sequential access of elements and extra memory usage for pointers. A linked list node contains a data field and a pointer to the next node. A doubly linked list node also contains a pointer to the previous node, allowing traversal in both directions.
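The node structure and the pointer-rewiring insert/delete operations described above can be sketched for a singly linked list (a minimal illustration; the helper names are my own):

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None        # pointer to the next node; None at the tail

def push_front(head, data):
    """Insert at the front: O(1), no shifting as an array would need."""
    node = Node(data)
    node.next = head
    return node                 # the new node becomes the head

def delete_value(head, data):
    """Unlink the first node holding `data` by rewiring one pointer."""
    if head and head.data == data:
        return head.next
    cur = head
    while cur and cur.next:
        if cur.next.data == data:
            cur.next = cur.next.next
            break
        cur = cur.next
    return head

def to_list(head):
    out = []
    while head:
        out.append(head.data)
        head = head.next        # sequential traversal -- no random access
    return out

head = None
for value in [3, 2, 1]:
    head = push_front(head, value)   # builds 1 -> 2 -> 3
```

A doubly linked node would simply add a `prev` pointer, enabling backward traversal at the cost of extra memory.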
With the enormous growth of distributed networks like P2P, social networks, overlay networks, and cloud computing, the world has become closer and faster. These distributed networks are represented as graphs, and the fundamental component of a distributed network is the relationship defined by linkages among the units or nodes in the network. A major concern for computer experts is how to store such an enormous amount of data, especially in the form of graphs; the data structure used should provide an efficient format for fast retrieval of data as and when required in these networks. Although an adjacency matrix is an effective technique to represent a graph with few or many nodes and vertices, it cannot cope with the analysis of the huge amounts of data from sites like Facebook or Twitter. In this paper, we study the existing applications of a special kind of data structure, the skip graph, with its various versions, which can be used efficiently for storing such data, resulting in optimal storage, space utilization, retrieval, and concurrency.
The document discusses fractal tree indexes, which are a data structure that can be used in databases like MySQL and MongoDB for indexing and retrieving data. Fractal tree indexes execute the same operations as B-trees but have faster insertion and deletion performance due to buffering techniques. They are highly optimized for large writes by scheduling disk writes to perform many operations at once. Fractal tree indexes also have better performance than B-trees due to lower fragmentation and faster searching enabled by forward pointers between index rows.
Similar to ICDE2015 Research 3: Distributed Storage and Processing (20)
2. Papers presented
1. PABIRS: A Data Access Middleware for Distributed File Systems
– S. Wu (Zhejiang Univ.), G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett (UIUC)
2. Scalable Distributed Transactions across Heterogeneous Stores
– A. Dey, A. Fekete, and U. Röhm (Univ. of Sydney)
R3: Distributed Storage and Processing (Presenter: Wakamori, NTT)
3. • Goal
– Efficiently process workloads that mix highly selective queries and analytical queries on a distributed file system (DFS)
• Challenges
– Preprocessing such as sorting and indexing helps, but complex preprocessing lowers insertion throughput
– Index design is also difficult for real data with a power-law distribution (figure below)
• Contribution
– PABIRS, an integrated data access middleware that efficiently processes workloads mixing complex queries
Example: searching billions of call records for the most recent few months' worth (a few thousand records)
PABIRS: A Data Access Middleware for Distributed File Systems
Fig. 1. Distribution of call frequency / Fig. 2. Number of blocks per key (figures from the original paper)
Example: 1,000 phone numbers (caller IDs) randomly sampled from a telephone company's call-log data.
・Calls from 1% of the IDs account for more than half of the total: a power-law distribution
・Analyses aggregate the records by attributes such as location and call count
・Records of the frequent IDs appear in almost every DFS block
4. PABIRS = Bitmap + LSM index
• Access paths to (semi-)structured data on a DFS
– A GET interface to the DFS
– MapReduce processing: an InputFormat for map tasks
– KVS transactions: a secondary index
• DFS wrapper: a hybrid index
– Bitmap index: for low-selectivity keys/tuples
– LSM index: built only for hot values
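The dispatch idea, bitmap-filtered scans for ordinary keys and an LSM-style index for hot values, can be sketched as follows. The data layout, the `hot_keys` map, and the signature-as-set representation are illustrative assumptions, not the paper's implementation:

```python
# Toy DFS: three data blocks of string keys.
blocks = [["a", "b", "a"], ["c", "a"], ["d", "e"]]
signatures = [set(b) for b in blocks]        # one signature per block
hot_keys = {"a": [(0, 0), (0, 2), (1, 1)]}   # LSM stand-in: key -> (block, offset)

def lookup(key):
    if key in hot_keys:                      # hot value: direct index hit
        return hot_keys[key]
    hits = []
    for bid, sig in enumerate(signatures):   # cold value: scan only blocks
        if key in sig:                       # whose signature may contain it
            hits.extend((bid, i) for i, v in enumerate(blocks[bid]) if v == key)
    return hits
```

The point of the split is visible even in the toy: the hot key never touches the blocks, while rare keys skip every block whose signature excludes them.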
Fig. 3. Architecture of PABIRS (interfaces: InputFormat, Insert(key, value), Lookup(key); figure from the original paper)
Fig. 4. Bitmap example
・A block signature is kept for every data block
・DAG-based hierarchy (directory vertices → data vertices)
III. HYBRID INDEXING SCHEME
The general idea behind our hybrid indexing scheme is to build bitmap signatures for all data blocks and select certain hot keys for LSM index. Bitmap signature is created for multiple attributes without re-ordering the records. To facilitate efficient parallel search, we design a hierarchical model based on a virtual Directed Acyclic Graph (DAG) structure, in which each intermediate vertex is a summary of the signatures accessible on its descendants. We present an example DAG structure in Figure 5 as a virtual index structure on two different attributes, using 8 bits and 5 bits for these two attributes respectively.
with entries taken from the leaf level of the C0 tree, thus decreasing the size of C0, and creates a newly merged leaf node of the C1 tree.
The buffered multi-page block containing old C1 tree nodes prior to merge is called the emptying block, and new leaf nodes are written to a different buffered multi-page block called the filling block. When this filling block has been packed full with newly merged leaf nodes of C1, the block is written to a new free area on disk. The new multi-page block containing merged results is pictured in Figure 2.2 as lying on the right of the former nodes. Subsequent merge steps bring together increasing index value segments of the C0 and C1 components until the maximum values are reached and the rolling merge starts again from the smallest values.
Figure 2.2. Conceptual picture of rolling merge steps, with result written back to disk (C1 tree on disk, C0 tree in memory)
Newly merged blocks are written to new disk positions, so that the old blocks will not be overwritten and will be available for recovery in case of a crash. The parent directory nodes in C1, also buffered in memory, are updated to reflect this new leaf structure, but usually remain in buffer for longer periods to minimize I/O; the old leaf nodes from the C1 component are invalidated after the merge step is complete and are then deleted from the C1 directory. In general, there will be leftover leaf-level entries for the merged C1 component following each merge step, since a merge step is unlikely to result in a new node just as the old leaf node empties. The same consideration holds for multi-page blocks, since in general when the filling block has filled with newly merged nodes, there will be numerous nodes containing entries still
LSM Tree [O'Neil+, '96]
・Feature: high write throughput
・Writes go to the in-memory C0 component (an AVL tree)
・When the size of C0 exceeds a threshold, a rolling merge moves entries into the on-disk C1 component (a B-tree)
Cost estimation
(Figure from the original LSM-tree paper)
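The C0/C1 behavior can be sketched with a toy in-memory model: a dict stands in for the AVL tree, a sorted list for the on-disk B-tree, and the merge below is a full merge rather than the paper's true rolling merge:

```python
C0_LIMIT = 4
c0, c1 = {}, []          # c0: in-memory writes; c1: sorted (key, value) pairs

def insert(key, value):
    c0[key] = value                  # all writes land in memory first
    if len(c0) >= C0_LIMIT:          # size threshold exceeded
        rolling_merge()

def rolling_merge():
    """Merge C0 into C1 and empty C0 (simplified: merges everything at once)."""
    global c1
    merged = dict(c1)
    merged.update(c0)                # newer C0 entries overwrite C1 entries
    c1 = sorted(merged.items())
    c0.clear()

def get(key):
    if key in c0:                    # check the memory component first
        return c0[key]
    return dict(c1).get(key)

for i in range(5):
    insert(i, i * 10)                # the 4th insert triggers a merge into C1
```

This is why insertion throughput stays high: individual writes never touch disk, and the periodic merge writes sequentially.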
5. Optimizing the hybrid index
1. Cost model and optimization of the bitmap signature
– Based on the fanout parameter F, high-level vertices are generated from the low-level vertices
– A cost model is defined, and the F that minimizes the cost is estimated
– The graph is traversed with the BSP model of Pregel [Malewicz+, '10]
2. Optimization with LSM
– An LSM index is created when the selectivity of a key exceeds a threshold
Fig. 7. Search cost of bitmap vs. LSM
• The LSM cost is constant regardless of selectivity
• The bitmap is faster when selectivity is below 0.1%
• In practice, more than 90% of queries have selectivity below 0.1%
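The selection rule between the two indexes (flat LSM cost vs. selectivity-proportional bitmap cost, as in Fig. 7) can be sketched as follows. This is our own simplified cost model with hypothetical names and constants, not the paper's exact formula:

```java
// Our own simplified sketch of the index-selection rule, not the paper's exact
// cost model: the bitmap cost grows with selectivity (more blocks to load),
// while the LSM cost is a near-constant B-tree descent (the flat line in Fig. 7).
public class IndexChooser {
    // Hypothetical bitmap cost: proportional to the blocks a query touches.
    static double bitmapCost(double theta, int totalBlocks, double blockReadCost) {
        return theta * totalBlocks * blockReadCost;
    }

    // Hypothetical LSM cost: height of a B-tree with fanout E over n entries.
    static double lsmCost(int fanoutE, long entries, double blockReadCost) {
        return Math.ceil(Math.log(entries) / Math.log(fanoutE)) * blockReadCost;
    }

    // Insert the key into the LSM index only when its estimated cost is no
    // larger than the bitmap's, as the paper's inequality prescribes.
    static boolean useLsm(double theta, int totalBlocks, int fanoutE, long entries) {
        return lsmCost(fanoutE, entries, 1.0) <= bitmapCost(theta, totalBlocks, 1.0);
    }

    public static void main(String[] args) {
        // 40960 blocks, as in the paper's 160 GB dataset (Table III discussion).
        System.out.println(useLsm(0.0000001, 40960, 100, 1_000_000L)); // very low selectivity: bitmap wins
        System.out.println(useLsm(0.01, 40960, 100, 1_000_000L));      // hot key: LSM wins
    }
}
```

With these toy numbers, a key crosses over to the LSM index once its selectivity makes the bitmap scan touch more blocks than a B-tree descent costs, mirroring the hot-key threshold on the slide.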
Fig. 7. Search Cost of Bitmap and LSM
where E is the fanout of the B-tree. We try to insert the key into the LSM index only when the estimated cost is no larger than the cost of the bitmap index.
Based on the inequality above, we are able to calculate
the minimal selectivity, which makes LSM a more attractive
selection than the bitmap. In Figure 7, we apply the theoretical
Fig. 8. Index Update (new data is appended to the DFS data stream; the Index Manager uses data statistics and a MapReduce algorithm to update the Bitmap Signature and LSM Index)
C. Update on the Indices
PABIRS is specifically designed for applications that require fast data insertion. In PABIRS, the bitmap index is a lightweight index that can be built in batches, while the LSM index is intentionally designed to support fast insertion.
Fig. 8. Index updates
• New data is appended to the DFS
• An offline MapReduce job updates the Bitmap signature and the hot keys of the LSM
for multiple attributes without re-ordering the records. To
facilitate efficient parallel search, we design a hierarchical
model based on a virtual Directed Acyclic Graph (DAG)
structure, in which each intermediate vertex is a summary of
the signatures accessible on its descendants. We present an
example DAG structure in Figure 5 as a virtual index structure
on two different attributes, using 8 bits and 5 bits for these two
attributes respectively.
Generally speaking, the DAG structure consists of three
layers. The retrieval layer contains individual signatures cor-
responding to the data blocks, while each intermediate vertex
in the index layer is associated with a summary signature by
merging signatures of its children vertices. Data layer refers
to the physical data blocks stored in the DFS. Signatures and
their corresponding graph vertices are randomly distributed to
multiple DFS nodes. In the rest of the paper, we refer to the vertices in the retrieval layer as data vertices and to the vertices in the index layer as directory vertices.
On the other hand, LSM index replicates the records with
hot keys and sorts them in its B-trees. For each indexing
attribute, we independently create an LSM index to maintain
its sorted replicas of hot data. In the rest of the section, we
first introduce the tuning approaches used on the bitmap-based
index, followed by the selection strategy between these two
indices. For better readability, the notations used in the section
are summarized in Table I.
A. Optimizations on Bitmap Signature
Suppose the signature of each data block follows the same distribution {p1, p2, . . . , pk}, in which each pj indicates the probability of having "1" on the j-th bit. Because of the exclusiveness between the values, the signature is a sparse vector, i.e. Σj pj ≪ k. Given two signatures s1 and s2, the expected number of common "1"s on the j-th bits of both s1 and s2 is Σj (pj)². It is much smaller than the expected number of "1"s in either s1 or s2, i.e. Σj pj, unless there exists a pj dominating the distribution.
When records are randomly assigned to the data blocks,
each probability pj is supposed to be a small positive number.
This leads to the phenomenon of Weak Locality in PABIRS. It
TABLE I. NOTATIONS
N      total number of records/tuples
Bp     size of a data block
Bt     size of a tuple
θ      query selectivity
k      number of distinct values of the attributes
m      number of values mapped to the same bit
F      fanout of the directory vertex
W      number of virtual machines (workers)
rl     computation cost of a directory vertex
rn     network delay between any pair of vertices
rd     the overhead of reading a data block
θ      selectivity of a particular queried key
fmin   minimum frequency of any value in a domain
p(θ)   pdf of distribution on selectivity θ of queries
Fig. 5. Demonstration of Signature Graph
is thus not helpful to group similar signatures when building
high-level directory vertex in the index layer, because such
merging only generates a new signature with a union of “1”s
from the signatures of its children vertices. Although it is
unlikely to optimize by better grouping, the fanout of the
abstract tree structure, i.e. the number of children vertices for
every directory vertex, remains tunable and turns out to be
crucial to the searching efficiency.
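The hierarchy and pruning described above can be sketched as follows. This is our own minimal code, not PABIRS' implementation: leaves hold per-block signatures, directory vertices hold the OR-union of their children with fanout F, and a search descends only into subtrees whose summary contains the queried bit.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Minimal sketch (ours, not PABIRS' code) of the signature hierarchy: directory
// vertices summarize children by OR-ing their signatures, and a search prunes
// any subtree whose summary lacks the queried bit.
public class SignatureDag {
    final BitSet sig;                  // signature (leaf) or summary (directory)
    final List<SignatureDag> children = new ArrayList<>();
    final int blockId;                 // data-block id for leaves, -1 for directories

    SignatureDag(BitSet sig, int blockId) { this.sig = sig; this.blockId = blockId; }

    // Build directory levels bottom-up, grouping F children per vertex.
    static SignatureDag build(List<SignatureDag> level, int fanoutF) {
        while (level.size() > 1) {
            List<SignatureDag> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += fanoutF) {
                SignatureDag dir = new SignatureDag(new BitSet(), -1);
                for (SignatureDag c : level.subList(i, Math.min(i + fanoutF, level.size()))) {
                    dir.sig.or(c.sig);     // summary = union of children's "1" bits
                    dir.children.add(c);
                }
                next.add(dir);
            }
            level = next;
        }
        return level.get(0);
    }

    // Collect ids of data blocks that may contain the queried bit.
    static void search(SignatureDag v, int bit, List<Integer> out) {
        if (!v.sig.get(bit)) return;               // prune the whole subtree
        if (v.children.isEmpty()) out.add(v.blockId);
        else for (SignatureDag c : v.children) search(c, bit, out);
    }

    public static void main(String[] args) {
        List<SignatureDag> leaves = new ArrayList<>();
        for (int i = 0; i < 8; i++) {              // 8 blocks, one distinct bit each
            BitSet b = new BitSet();
            b.set(i);
            leaves.add(new SignatureDag(b, i));
        }
        List<Integer> hits = new ArrayList<>();
        search(build(leaves, 2), 3, hits);         // fanout F = 2
        System.out.println(hits);                  // only block 3 survives pruning
    }
}
```

Because signatures have weak locality, the only tuning knob in this structure is the fanout F, which trades tree height against the number of children inspected per directory vertex.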
1) Cost Model and Fanout Optimization: Instead of picking
up similar signatures during bitmap construction, PABIRS
simply groups the low-level vertices to generate a high-level
vertex, based on a pre-specified fanout parameter F. Specifi-
Fig. 5. Signature graph
(For other optimizations, e.g. for analytical workloads, see the original paper. Figures quoted from the original paper.)
6. Experiments and Evaluation
• Environment
 – Implemented on Hadoop 1.0.4 + GPS [Salihoglu+, '12] (an open-source implementation of Pregel)
 – Experiments on a 32-node Hadoop cluster (4-core CPU, 8 GB RAM per node)
• Items
 A. High-selectivity queries: three select queries over call-history data, compared with HBase Phoenix, Impala, and BIDS [Lu+, 13]
 B. Analytical queries: TPC-H Q3, Q5, and Q6 on synthetic data generated with tpcdskew [Bruno+, '05], compared with Hive
[Fig. 13. Queries: response time (seconds) of HBase, PABIRS, BIDS, Impala-8G, and Impala-4G on Q1-Q3]
[Fig. 14. Effect of Data Size: average response time (seconds, log scale) of Q1-Q3 for data sizes 80-400 GB]
[Fig. 19. Throughput of Concurrent Queries (Q1): single vs. quad processor threads over query batch sizes 0-32]
[Fig. 20. Response Time of Concurrent Queries (Q1): average response time (seconds) over query batch sizes 0-32, single vs. quad processor threads]
[Fig. 21. Performance of TPC-H Query (Skew): response time (seconds, log scale) vs. skew factor 1-4 for q3/q5/q6 on Hive (h) and PABIRS (p)]
[Fig. 22. Performance of TPC-H Query (Selectivity): response time (seconds, log scale) vs. query selectivity 3.5% (3 months) to 14.2% (12 months) for q3/q5/q6 on Hive (h) and PABIRS (p)]
…memory, which leads to the "Memory Limit Exceeded" exception. For Q3 and Q6, we build an index for the column shipdate … If we increase ε to a larger value (e.g., 365), PABIRS finds that index-based access is even worse than scan-based access. It will automatically switch to the disk scan, which generates better performance.
station. To avoid queries with empty results, we intentionally select a number with at least one record under the base station.
PABIRS can effectively handle queries with a high selectiv-
ity but still involving numerous tuples. As shown in Table III,
in our 160G dataset, we have 40960 blocks in total. Although
the selectivities of the queries are as low as 0.00001%, the
records related to Q1, Q2 and Q3 cover 477, 28863 and 343
data blocks respectively. The involved data blocks, especially for Q1 and Q3, are nowhere close to the total number of data blocks, yet the overhead of loading hundreds of data blocks from disk remains high.
In the experiments, PABIRS, Phoenix and BIDS are allowed to use 4 GB of main memory on each node of the cluster, while Impala is tested under two settings with 4 GB and 8 GB main memory respectively. The results in Figure 13 show that
Impala-4G is unable to finish the queries in reasonable time
(i.e. 1,000 seconds), as it incurs high I/O cost on memory-
disk data swap. It reveals the limitation of Impala on memory
usage efficiency. Moreover, Impala and BIDS show a similar
performance for all queries, because both approaches adopt
the scan-based techniques (memory scan and disk scan). In the
rest of the experiments, we only report the results of Impala-8G, abbreviated as Impala. The results also imply
that PABIRS significantly outperforms the other systems on
all queries. When the selectivity of the query is high, such as
Q1 and Q3, HBase Phoenix is the only alternative with close
performance to PABIRS, because of its adoption of secondary
index. But for the query involving a large portion of data like
Q2, HBase Phoenix is slow as it incurs many random I/Os to
retrieve all results.
TABLE III. PROCESSING TIME OF PABIRS
QID  selectivity  index time  disk time  total time
Q1   1.2%         1.03s       1.47s      2.50s
Q2   70%          2.11s       137.63s    139.74s
Q3   0.8%         1.04s       1.28s      2.32s
To gain better insight into the scalability of PABIRS, we
the performances of PABIRS and HBase Phoenix degrade
slightly when more insertions are conducted, because they
need to build and query indexes for the new tuples. Finally,
we implement a simple transaction module as discussed in
Section 2. Our test transaction retrieves all records of a specific
phone number (normally hundreds to thousands of records) and
updates the values of NeID in those records to a new value.
We vary the number of concurrent transactions and, as shown in Figure 18, PABIRS provides good throughput for this test transaction.
In PABIRS, queries can be grouped into a batch and share the index searching process. In Figure 19 and Figure 20, we show
the throughput and response time for varied batch size. As each
node in the cluster is equipped with a 4-core CPU, we start
four concurrent I/O threads at the same time. For comparison purposes, we also show the result when a single I/O thread runs. The throughput of four I/O threads is almost three times
higher than the single thread case. The throughput improves
dramatically for a larger query batch, since we can share
more signature and data scans among the queries. However,
the results imply that the throughput gain shrinks with the
increase of the query batch size. It is thus important to choose an appropriate batch size in real applications. The response time is
also affected by the batch size. Figure 20 illustrates that the
response time is generally proportional to the batch size. If a
strict real-time requirement is imposed, the system must choose the batch size carefully, in order to strike a balance between throughput and response time.
C. Analytic Query Performance
In this group of experiments, we evaluate the performance
of PABIRS on data and queries generated by TPC-H bench-
mark. Specifically, we generate 320 GB data with different
skew factors using the TPC-H Skew Generator. We deploy
Hive on top of PABIRS and compare the performances of
PABIRS against the original Hive on query Q3, Q5 and
Q6 in TPC-H. We also include Impala in the experiment.
However, Impala requires buffering all intermediate join results
A. (Fig. 13) B. Fig. 21. TPC-H (skew), Fig. 22. TPC-H (selectivity)
• Small skew: performance on par with Hive
• Large skew: the index improves performance
• No benefit for Q5 (because the index exists only on orders)
(Figures quoted from the original paper)
7. Scalable Distributed Transactions across Heterogeneous Stores
• Goal
 – Perform transactions over multiple items across different data stores
• Challenges
 – Handling transactions in the application:
  • Error-prone for programmers; risks losing availability and scalability
 – Introducing coordinator middleware:
  • All applications must be under its control
• Contributions
 – Cherry Garcia (CG): a client library supporting multi-item transactions across different data stores
 – Implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a high-throughput KVS)
 – Evaluated with YCSB+T [Dey+, '14] (a web-scale transaction benchmark)
BEGIN TRANSACTION
  SET item1 of Store1
  SET item2 of Store2
COMMIT TRANSACTION
8. Transactions across Different Data Stores
public void UserTransaction() {
    Datastore cds = Datastore.create("credentials.xml");
    Datastore gds = Datastore.create("goog_creds.xml");
    Datastore wds = Datastore.create("msft_creds.xml");
    Transaction tx = new Transaction(cds);
    try {
        tx.start();
        Record saving = tx.read(gds, "saving");
        Record checking = tx.read(wds, "checking");
        int s = saving.get("amount");
        int c = checking.get("amount");
        saving.set("amount", s - 5);
        checking.set("amount", c + 5);
        tx.write(gds, "saving", saving);
        tx.write(wds, "checking", checking);
        tx.commit();
    } catch (Exception e) {
        tx.abort();
    }
}
Listing 1. Example code that uses the API to access two data stores
Listing 1. A transaction across two data stores written with the CG API
• Datastore: an instance of a data store
• Transaction: the transaction coordinator
• The code reads 'saving' from the Google Cloud Storage Datastore (gds) and 'checking' from the Windows Azure Storage Datastore (wds), then updates each
• A Datastore acting as the Coordinating Data Store (CDS) is also used
(Figure quoted from the original paper)
9. Cherry Garcia (CG): the Client Library
• Platform assumptions
 – Strong consistency when reading a single record
 – Atomic single-item update and delete (Test-and-Set)
 – User-defined metadata can be stored within an item
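The Test-and-Set assumption in the list above can be illustrated with a small sketch. This is our own code, not Cherry Garcia's API: an item carries a version tag, and an update succeeds only if the caller still holds the latest version, mimicking the conditional-write primitive (e.g. an ETag-style precondition) that such stores expose.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch (ours, not Cherry Garcia's code) of the atomic single-item
// Test-and-Set update the library assumes each store provides.
public class TestAndSetStore {
    static final class Item {
        final String value;
        final long version;
        Item(String value, long version) { this.value = value; this.version = version; }
    }

    private final AtomicReference<Item> cell =
            new AtomicReference<>(new Item("v1", 1));

    Item read() { return cell.get(); }

    // Succeeds only if the item is still at the version the caller read.
    boolean testAndSet(long expectedVersion, String newValue) {
        Item cur = cell.get();
        if (cur.version != expectedVersion) return false;
        return cell.compareAndSet(cur, new Item(newValue, expectedVersion + 1));
    }

    public static void main(String[] args) {
        TestAndSetStore store = new TestAndSetStore();
        System.out.println(store.testAndSet(1, "v2"));  // true: version matches
        System.out.println(store.testAndSet(1, "v3"));  // false: stale version
    }
}
```

The item and its version are swapped as one unit, which is exactly what lets a client detect that another transaction has touched the record since it was read.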
II. SYSTEM DESIGN
In this section, we describe the design of our client-coordinated transaction processing protocol that enables transactions involving multiple data items that span multiple heterogeneous data store instances. The protocol is to be implemented in a library whose API abstracts data store instances in a class called Datastore, and these are accessed via a transaction coordinator abstraction, a class called Transaction. Each record is addressable using a string key and its data is accessed using an object of a class called Record. Listing 1 shows an example of an application that uses the API to access two data records, one ("saving") residing in an instance of Google Cloud Storage, abstracted by the Datastore gds, while the other is stored in Windows Azure Storage represented as the Datastore wds. The example also uses a third store (explained later) that acts as the Coordinating Data Store (CDS).
Fig. 1. Library architecture (Applications 1-3 each embed a Transaction and the Cherry Garcia library with Datastore abstractions for Tora, WAS, and GCS, accessing Tora, Windows Azure Storage, and Google Cloud Storage through store-specific REST APIs; the Coordinating Storage holds the TSR)
2) Overview: In essence, the protocol calls for each data
item to maintain the last committed and perhaps also the
currently active version, for the data and relevant meta-
data. Each version is tagged with meta-data pertaining to
the transaction that created it. This includes the transaction
commit time and transaction identifier that created it, pointing
to a globally visible transaction status record (TSR) using a
Universal Resource Identifier (URI). The TSR is used by the
client to determine which version of the data item to use when
reading it, and so that transaction commit can happen just
by updating (in one step) the TSR. The transaction identifier,
(Fig. 1. Library architecture; figure quoted from the original paper)
• Overview of client-side transaction coordination
 – Each record is treated like a single-item database
 – Transaction coordination via 2PC
 – No central coordinator
 – Transaction state is stored with the data and coordinated by the client
10. Timeline of a CG Transaction
• 2PC
 – Each data item carries both its current state and its previous state
 – PREPARED flags are set in order of key hash
 – A Transaction Status Record (TSR) is written to the Coordinating Data Store (CDS), and COMMITTED flags are then set (in parallel)
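The commit steps above can be sketched as follows. This is our own simplified code with hypothetical names, not the library's implementation: PREPARE each written item in key-hash order, write the TSR to the coordinating store as the single commit point, then flip the flags to COMMITTED, a step the slide notes can run in parallel.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified sketch (ours, hypothetical names) of the 2PC commit path:
// PREPARE items in key-hash order, write the TSR as the commit point,
// then mark items COMMITTED (parallelizable, since the TSR already decides).
public class CommitSketch {
    enum State { PREPARED, COMMITTED }

    static final class Version {
        final String value;
        final String txId;
        volatile State state;
        Version(String value, String txId) {
            this.value = value; this.txId = txId; this.state = State.PREPARED;
        }
    }

    final Map<String, Version> store = new HashMap<>(); // stand-in for WAS/GCS/Tora items
    final Set<String> tsr = new HashSet<>();            // stand-in for TSRs in the CDS

    void commit(String txId, Map<String, String> writes) {
        // Phase 1: PREPARE in key-hash order, so concurrent transactions
        // touching the same keys prepare them in a consistent order.
        TreeMap<Integer, String> byHash = new TreeMap<>();
        for (String key : writes.keySet()) byHash.put(key.hashCode(), key);
        for (String key : byHash.values())
            store.put(key, new Version(writes.get(key), txId));
        // Commit point: one write of the transaction status record.
        tsr.add(txId);
        // Phase 2: flip flags; safe to parallelize once the TSR exists.
        writes.keySet().parallelStream().forEach(k -> store.get(k).state = State.COMMITTED);
    }

    public static void main(String[] args) {
        CommitSketch cds = new CommitSketch();
        Map<String, String> writes = new HashMap<>();
        writes.put("saving", "95");
        writes.put("checking", "105");
        cds.commit("tx1", writes);
        System.out.println(cds.tsr.contains("tx1"));       // true
        System.out.println(cds.store.get("saving").state); // COMMITTED
    }
}
```

The key point the sketch preserves is that the TSR write is the only atomic commit point; a reader finding a PREPARED version can consult the TSR to decide whether that version is in fact committed.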
Fig. 2. The timeline describing 3 transactions running on 2 client hosts to access records in 2 data stores using a third data store as a CDS (each client runs application logic over a transaction cache; reads return v1, writes stage v2, and commit() drives PREPARE on each store, a TSR write and COMMIT on the CDS, then COMMITTED flags and TSR DELETE)
In the rest of this section we go deeper in detail on the components of the library and the algorithms. Pseudocode for
(Fig. 2; figure quoted from the original paper)
11. Implementation and Experiments
• Cherry Garcia implementation
 – A Java library (JDK 1.6)
 – The Datastore abstraction is implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a KVS running on the WiredTiger storage engine)
• Experiments
(Figures quoted from the original paper)
Fig. 6. Aborts measured varying theta with 1 YCSB+T client against a 1-node Tora cluster:
theta               0.1     0.3     0.5     0.7     0.9     0.99
aborts per million  1885.4  1888.6  1862.2  1911.6  5898.4  33810
Fig. 7. Throughput of 4 YCSB+T client hosts each with 1 through 64 threads against a 4-node Tora cluster
Fig. 8. Throughput of YCSB+T with 16 through 128 threads on 8 client hosts against a 4-node Tora cluster
Fig. 8. Throughput when running 16 to 128 transaction threads from 8 client hosts against a 4-node Tora cluster
increased linearly until 16 threads and the average latency for
each request stayed within the 500µs mark. As the number of threads was increased beyond 16, the latency increased until it reached 4.5 ms at 64 threads. This increased latency
suggests that there is a performance bottleneck somewhere in
the system.
We ran a further test with 4 client hosts and a cluster of
4 Tora servers and repeated the previous test and varied the
number of threads from 1 through to 64 threads across all 4
client hosts and measured the throughput. The graph in Figure
7 shows that the performance on each host scales linearly until
16 threads (an aggregate of 64 threads across 4 client hosts)
and then flattens out. We observed that the socket send buffers
on the servers were full suggesting a network bottleneck at the
client.
G. Experiment 4: abort rates vary with contention
We set up one EC2 m3.2xlarge server each as a YCSB+T client and Tora server in AWS, and ran the client with 16 threads with a read to read-write ratio of 50:50 over 1 million transactions. We used the Zipfian access key pattern, and varied the theta value over 0.1, 0.3, 0.5, 0.7, 0.9 and 0.99. Figure 6 shows that the aborts increase as the contention increases, though aborts are infrequent even with extreme contention.
H. Experiment 5: Scale-out test
We ran YCSB+T with a mix of 90:10 read to read-modify-
write operations in a Zipfian data access pattern with theta set
to 0.99 across 1 to 8 client hosts each with 16 threads, running
against a 4-node Tora cluster. We collected the throughput
Fig. 9. Overhead of transactions and the effect of 1-phase optimization (throughput in transactions per second for 1-16 client threads: 1- and 2-record transactional vs. non-transactional workloads, and 3-record transactions with serial vs. parallel phase 2)
Fig. 9. Transaction overhead and the effect of the 1-phase optimization (*)
(*) The PREPARE phase is skipped for transactions limited to a single item
• Throughput scales linearly (up to 23,288 transactions/sec)
• With the 1-phase optimization, the transaction overhead is small
• Parallelizing phase 2 improves throughput