Column-oriented query processing techniques can significantly improve the performance of row-oriented database systems when applied properly. The authors introduce new operators (index merge, index merge join, and index hash join) that take advantage of parallel processing and the unique characteristics of database indexes. An experimental study shows that their approach, applied with these operators to a commercial row-oriented database system, approaches the performance of a column-oriented database system.
A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.
2. INTRODUCTION OBJECTIVE AND SCOPE
Column-oriented DBMSs have gained increasing interest due to their superior performance for analytical workloads. Prior efforts tried to determine whether the query processing techniques of column-oriented systems could be simulated in row-oriented databases, in the hope of improving their performance, especially for OLAP and data warehousing applications.
We show that column-oriented query processing can significantly improve the performance of row-oriented DBMSs. We introduce new operators that take into account the unique characteristics of data obtained from indexes, and exploit new technologies such as flash SSDs and multi-core processors to boost performance. We demonstrate our approach with an experimental study using a prototype built on a commercial row-oriented DBMS.
Recently, column-oriented database systems (also known as column stores) have been receiving a lot of attention. The main difference between column stores and traditional row-oriented database systems (also known as row stores) is in the way the data is physically stored. In row stores, the DBMS stores all the attribute values of a single row in contiguous space on disk. Column-oriented DBMSs, on the other hand, store the data in a column-wise manner (the values of a column are stored contiguously).
This difference in data storage has an implication for data access. In row stores, a whole row of data has to be read from disk, even if only a few columns of that row are needed to answer a query. This means that the system may have to read much more data than it actually needs. In column stores, only the required columns are read from disk, although some extra processing is needed to construct the output tuples from these individual columns.
However, when it comes to data updates, column stores tend to be at a disadvantage. Since data updates (or inserts) usually happen at row granularity, multiple accesses are needed to insert or update the relevant columns of each updated row. In row stores, the whole row is written in a single operation. This difference in access patterns and performance has made column stores more suitable for workloads that are read-intensive with few or no updates, e.g., analytical processing workloads such as those found in data warehouses or decision support systems.
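The access-pattern difference described above can be illustrated with a small sketch. The three-column table, its column names, and the fixed-width binary encoding are all assumptions made for illustration; real systems use pages, headers, and compression that are not modelled here.

```python
import struct

# Hypothetical table with columns (order_id, quantity, price).
rows = [(1, 5, 9.5), (2, 3, 20.0), (3, 7, 1.25), (4, 2, 8.0)]
ROW_FMT = "<iid"  # int32, int32, float64: 16 bytes per row, no padding

# Row store: all attribute values of one row are stored contiguously.
row_store = b"".join(struct.pack(ROW_FMT, *r) for r in rows)

# Column store: all values of one column are stored contiguously.
n = len(rows)
col_store = {
    "order_id": struct.pack(f"<{n}i", *(r[0] for r in rows)),
    "quantity": struct.pack(f"<{n}i", *(r[1] for r in rows)),
    "price": struct.pack(f"<{n}d", *(r[2] for r in rows)),
}


def sum_quantity_row_store(buf):
    """Read every full row, although only `quantity` is needed."""
    total = 0
    bytes_read = 0
    for _order_id, quantity, _price in struct.iter_unpack(ROW_FMT, buf):
        total += quantity
        bytes_read += struct.calcsize(ROW_FMT)
    return total, bytes_read


def sum_quantity_col_store(cols):
    """Read only the `quantity` column."""
    buf = cols["quantity"]
    values = struct.unpack(f"<{len(buf) // 4}i", buf)
    return sum(values), len(buf)


row_result = sum_quantity_row_store(row_store)
col_result = sum_quantity_col_store(col_store)
print("row store (sum, bytes read):", row_result)
print("column store (sum, bytes read):", col_result)
```

Both functions return the same sum, but the row-store scan touches every byte of every row while the column-store scan touches only one column. The reverse holds for an insert: the row layout appends one record, whereas the column layout must append to each of the three column buffers.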
3. Related Area
We describe a column-oriented query processing mechanism for row stores based on index-only plans operating on single-column indexes. In our approach, only the indexes relevant to the query are read, without reading anything from the underlying base tables. This avoids the cost of reading entire table rows when only a few columns are referenced in the query.
This seminar relates to DBMS and data warehousing.
4. Process Description
The column-oriented physical designs that prior work simulated in row stores include:
- Vertical partitioning: the tables in the database are split into a set of two-column tables, each consisting of (key, attribute) pairs, so that only the necessary columns need to be read to answer a query. The key is used to join these tables to reconstruct the output tuples.
- Index-only plans: by creating a collection of indexes that cover all of the columns used in a query, it is possible for the DBMS to answer a query without ever going to the underlying (row-oriented) tables.
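As a rough illustration of the first design, the sketch below splits a small table into two-column (key, attribute) tables and rebuilds output tuples by joining on the key. The table contents and the function names `vertically_partition` and `reconstruct` are hypothetical; a single-column index plays a similar role in an index-only plan, with the RID acting as the key.

```python
# Hypothetical base table: row key -> attribute values.
table = {
    1: {"quantity": 5, "price": 9.5, "status": "O"},
    2: {"quantity": 3, "price": 20.0, "status": "F"},
    3: {"quantity": 7, "price": 1.25, "status": "O"},
}


def vertically_partition(table):
    """Split a table into one (key, attribute) table per column."""
    partitions = {}
    for key, row in table.items():
        for column, value in row.items():
            partitions.setdefault(column, []).append((key, value))
    return partitions


def reconstruct(partitions, columns):
    """Join the requested two-column tables on the key to rebuild tuples."""
    first, *rest = columns
    result = {key: [value] for key, value in partitions[first]}
    for column in rest:
        lookup = dict(partitions[column])
        for key in list(result):
            if key in lookup:
                result[key].append(lookup[key])
            else:
                del result[key]  # the key is missing from one partition
    return [(key, *values) for key, values in sorted(result.items())]


partitions = vertically_partition(table)
# Only the "quantity" and "price" partitions are touched; "status" is never read.
print(reconstruct(partitions, ["quantity", "price"]))
```

The join on the key is the extra work that the row store would not otherwise do, which is why the choice of join operator matters so much for this design.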
In the results reported by prior work, such column-oriented physical designs did not improve the performance of row stores. In fact, these techniques resulted in worse performance than baseline row-oriented processing. The figure above depicts the results of those experiments (normalized relative to the row store time). This behaviour was mainly because the techniques were employed without any changes to the underlying systems, which caused the optimizer to make some bad decisions and use very expensive operations. However, operators that are specially designed for column-oriented processing can lead to a significant performance improvement.
We demonstrate the potential of column-oriented query processing in row stores, focusing on using index-only plans, but with query operators that take advantage of the characteristics of the data obtained from the indexes. Our prototype is based on a commercial row store (which we shall refer to as System A). We compare the performance of our technique to the baseline performance of System A, as well as to that of a leading open-source column store (which we shall refer to as Col-DB). Our results show that column-oriented processing can significantly improve the performance of row stores for analytical workloads, and even approach the performance of a column store. In summary, we make the following contributions:
1. We show that column-oriented query processing techniques, when employed using proper operators and mechanisms, can significantly improve the performance of a row store, and even approach the performance of a column store. Our main focus is index-only plans.
2. We introduce new ways of performing index intersection that leverage parallel processing, to be used in query execution using index-only plans; namely the index-merge, index-merge-join, and index-hash-join operators. We describe the algorithms, and present a cost analysis for each operator.
3. We highlight the advantage of exploiting new hardware technologies, namely flash SSDs and multi-core processors, to reduce query processing time.
4. We demonstrate the effectiveness of our approach with an extensive experimental study using a prototype based on System A, and we compare the performance of our approach to those of a baseline row store (System A) and a column store (Col-DB).
We describe a column-oriented query processing mechanism for row stores based on index-only plans operating on single-column indexes. In our approach, only the indexes relevant to the query are read, without reading anything from the underlying base tables. This avoids the cost of reading entire table rows when only a few columns are referenced in the query.
We introduce new specialized ways of combining the data obtained from the indexes that take advantage of the unique characteristics of database indexes, instead of relying completely on the existing database operators. Our algorithms are designed specifically to take advantage of parallel processing whenever possible, which leads to better performance.
The new operators are:
Index Merge: Merges the sorted RID-lists associated with individual values from the index, producing a (RID, value) list for the whole index, in RID order.
Index Merge Join: Performs an N-way merge join operation between the (RID, value) lists coming from several indexes. The input lists have to be in RID order (e.g., produced by the Index Merge operator). The output is a (RID, data) list, also in RID order, where the data is a collection of values.
Index Hash Join: Performs an N-way hash join operation between N (RID, data) lists. The input lists can be the output of any of the previous two operators, another Index Hash Join operator, or the traditional Index Scan operator that exists in most row stores. The input lists do not need to be in any order. One of the input lists is used to build the hash table while the others are used for probing. The output is a (RID, data) list.
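The three operators described above can be sketched in a few lines, under several simplifying assumptions that are not taken from the source: an index is modelled as a dictionary from value to a sorted RID list, each RID appears at most once per input list, everything runs in memory on a single thread, and the sample data is invented. This is a minimal single-threaded illustration of the input/output contracts, not the parallel algorithms of the prototype.

```python
import heapq
from itertools import groupby


def _rid_value_stream(value, rids):
    """Turn one index entry into a stream of (rid, value) pairs."""
    for rid in rids:
        yield rid, value


def index_merge(index):
    """Merge the sorted RID-lists of one index into a (rid, value) list in RID order."""
    streams = [_rid_value_stream(value, rids) for value, rids in index.items()]
    return list(heapq.merge(*streams, key=lambda pair: pair[0]))


def index_merge_join(*lists):
    """N-way merge join of RID-ordered (rid, value) lists; keeps RIDs present in all inputs."""
    n = len(lists)
    tagged = [[(rid, i, value) for rid, value in lst] for i, lst in enumerate(lists)]
    merged = heapq.merge(*tagged, key=lambda item: item[0])
    output = []
    for rid, group in groupby(merged, key=lambda item: item[0]):
        members = sorted(group, key=lambda item: item[1])  # order by input position
        if len(members) == n:
            output.append((rid, tuple(value for _, _, value in members)))
    return output


def index_hash_join(build, *probes):
    """N-way hash join of unordered (rid, data) lists: one build input, the rest probe."""
    table = {rid: [data] for rid, data in build}
    for probe in probes:
        for rid, data in probe:
            slot = table.get(rid)
            if slot is not None:
                slot.append(data)
    n = len(probes) + 1
    return [(rid, tuple(parts)) for rid, parts in table.items() if len(parts) == n]


# Hypothetical single-column indexes: value -> sorted RID list.
year_index = {"1994": [0, 3], "1995": [1, 2, 4]}
qty_index = {5: [1, 4], 7: [0, 2]}  # RID 3 assumed filtered out by a predicate

year_list = index_merge(year_index)
qty_list = index_merge(qty_index)
joined = index_merge_join(year_list, qty_list)
flags = [(4, "late"), (0, "early")]  # unordered input, e.g. from an index scan
print(index_hash_join(flags, joined))
```

In this sketch the hash join builds a single hash table regardless of how many probe inputs there are, whereas chaining 2-way hash joins would build a new table at every step.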
6. Testing Technologies
INDEX MERGE (IXMG)
INDEX MERGE JOIN (IXMGJN)
Parallel Processing
INDEX HASH JOIN (IXHSJN)
In-place update (IPU)
Linked List (LL)
Linked List with Partitioning (LLP)
7. Resources and Limitations
All of our experiments were run on an IBM System x3400 (Model 7974) server with a single 1.6 GHz dual-core Intel Xeon CPU and 4 GB of RAM, running Fedora Linux. The server has a 1 TB Seagate SATA disk, with a 105 MB/sec sustained data rate at 7.2K RPM. The server also has an 80 GB Fusion-IO IO-Disk SSD with a read bandwidth of 700 MB/sec and a write bandwidth of 550 MB/sec.
The numbers we report are the average of several runs, and are based on a cold buffer pool. We built a prototype using System A. We used two TPC-H [2] data sets for our experiments, with scale factors 10 and 30. The lineitem table (the fact table, with 16 columns) contains 60 million rows in the 10 GB database, and 180 million rows in the 30 GB database. A typical data warehousing query involves the fact table (lineitem) and one or more dimension tables, with predicates on the fact and/or dimension tables.
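To put the hardware figures in perspective, the back-of-envelope calculation below estimates how long one sequential pass over 10 GB would take at the quoted sustained rates. It is only a rough lower bound under stated assumptions (10 GB taken as 10,000 MB, purely sequential reads, no seeks, no CPU cost, no caching); it is not a measurement from the study.

```python
# Figures quoted in the text.
DISK_MB_PER_S = 105      # SATA disk sustained data rate
SSD_READ_MB_PER_S = 700  # SSD read bandwidth
ROWS_SF10 = 60_000_000   # lineitem rows at scale factor 10
ROWS_SF30 = 180_000_000  # lineitem rows at scale factor 30

# Assumption: 1 GB = 1,000 MB for this estimate.
DATABASE_MB_SF10 = 10 * 1000


def scan_seconds(size_mb, bandwidth_mb_per_s):
    """Time for one purely sequential pass over `size_mb` megabytes."""
    return size_mb / bandwidth_mb_per_s


rows_per_scale_factor = ROWS_SF10 // 10
disk_time = scan_seconds(DATABASE_MB_SF10, DISK_MB_PER_S)
ssd_time = scan_seconds(DATABASE_MB_SF10, SSD_READ_MB_PER_S)

print(f"lineitem rows per unit of scale factor: {rows_per_scale_factor:,}")
print(f"sequential pass on disk: {disk_time:.1f} s")
print(f"sequential pass on SSD:  {ssd_time:.1f} s")
print(f"bandwidth ratio SSD/disk: {SSD_READ_MB_PER_S / DISK_MB_PER_S:.2f}")
```

The row counts scale linearly with the scale factor (6 million lineitem rows per unit), and under these assumptions the SSD shortens a full sequential pass by the ratio of the two bandwidths.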
8. Future scope and further enhancement
Several recent papers [3, 5, 8] have tackled the issue of comparing the performance of row stores and column stores, and of proposing optimizations for row stores so that they can compete with column stores for analytical workloads.
The work in [5] compares row stores and column stores. In particular, the authors try to determine whether column store performance can be achieved in row stores using column-oriented query processing. They simulate column-oriented processing by using both vertical partitioning and index-only plans in a commercial database system (referred to in the paper as "System X"), and compare the performance of this system to the performance of C-Store [22]. In the figure above, System X performed very poorly with vertical partitioning, and even worse with index-only plans. The reason for such poor performance is that System X had to join the (RID, value) lists coming from each index using a series of 2-way hash joins, each of which builds a new hash table, which can be very expensive.
Bruno [8] argues that it is possible to achieve the performance of column stores (or very close
to it) within row stores without any changes to the DBMS. The author proposes a method to
store data in a compressed form, using a separate table for each column in the original schema
(similar to the vertical partitioning idea), but with these tables (called “c-tables”) storing
data in a format similar to run-length encoding.
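As a rough illustration of run-length encoding applied to a single column, the sketch below turns a column into (first position, value, run length) rows. The exact c-table layout in [8] may differ. The row format and function names here are assumptions made for the example.

```python
from itertools import groupby
from typing import List, Tuple

Run = Tuple[int, str, int]  # (first_position, value, run_length), assumed layout


def to_runs(column: List[str]) -> List[Run]:
    """Collapse consecutive equal values into one row per run."""
    runs: List[Run] = []
    position = 0
    for value, group in groupby(column):
        length = sum(1 for _ in group)
        runs.append((position, value, length))
        position += length
    return runs


def from_runs(runs: List[Run]) -> List[str]:
    """Expand the runs back into the original column."""
    column: List[str] = []
    for _, value, length in runs:
        column.extend([value] * length)
    return column


shipmode = ["AIR", "AIR", "AIR", "RAIL", "RAIL", "SHIP"]
runs = to_runs(shipmode)
print(runs)
```

In this sketch each run's starting position depends on the lengths of all runs before it, so adding a value in the middle of the column would change every later row.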
The results show great performance improvement. However, this scheme requires creating a
c-table for every column, and creating multiple indexes on every c-table, thus the storage
requirements can grow rapidly. Additionally, the dependency between the tuples in the c-table
causes inserting new rows to become very expensive, even for workloads with infrequent
updates.
Several optimizations have been implemented or proposed for row-stores to approach
column-store performance, such as super-tuples, using a column-based layout within each data
page, operating on compressed data, block processing, mirroring, invisible joins, late
materialization, and column-store indexes [4, 12, 13, 15, 19, 20, 23, 24].
The idea of using multiple indexes to access data has been studied for some time. Mohan et al.
[16] introduced methods for using multiple indexes together to access base tables.
Raman et al. [21] presented two algorithms to perform the RID-list intersection operation,
which are comparable to the algorithms we use for the low-cardinality indexes.
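For readers unfamiliar with the operation, RID-list intersection can be sketched as a plain merge over sorted lists. This is a textbook merge intersection, not the lazy, adaptive algorithms of [21] and not the operators used in this work. The function names and the shortest-list-first ordering are assumptions for illustration.

```python
from typing import List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Merge-style intersection of two ascending RID lists."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(rid_lists: List[List[int]]) -> List[int]:
    """Intersect several RID lists, starting from the shortest one."""
    ordered = sorted(rid_lists, key=len)
    result = ordered[0]
    for rids in ordered[1:]:
        result = intersect_sorted(result, rids)
    return result


# RIDs satisfying three separate predicates, one list per index.
matches = intersect_all([[1, 4, 7, 9, 12], [4, 9, 10, 12], [2, 4, 12]])
print(matches)
```

Starting from the shortest list keeps the intermediate result small, so later passes compare fewer RIDs.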
Flash memory has received increasing interest as a stable storage medium that can overcome the
access bottlenecks of magnetic disks. Researchers have considered modifying existing
algorithms and data structures to make use of flash memory. FlashDB [17] is a self-tuning
flash-based index that dynamically adapts to the mix of reads and writes in the
workload. FlashLogging [10] exploits flash in append mode for synchronous
logging.
9. Conclusion
Column-oriented processing can indeed improve query performance in row stores
significantly. Such performance improvements can only be seen when changes are made to
the database system. These changes should also make use of newly emerging technologies. For
example, multi-core processors provide a level of parallelism that can achieve better
performance, and new media such as flash memory can be particularly used as
temporary storage, as it provides fast random access. Our work exploits these technologies
to bring the performance of a row store close to that of a column store while retaining
row-oriented capabilities (such as fast single-row lookups). This extends the scope for which
a single database system can be used, where one does not need separate software for the
warehouse and the OLTP system.
10. References
[1] MySQL, http://www.mysql.com/.
[2] TPC-H Benchmark, http://www.tpc.org/tpch.
[3] D. J. Abadi, P. A. Boncz, and S. Harizopoulos. Column-oriented Database Systems.
PVLDB, 2(2):1664–1665, 2009.
[4] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in
Column-oriented Database Systems. In SIGMOD, 2006.
[5] D. J. Abadi, S. R. Madden, and N. Hachem. Column-Stores vs. Row-Stores: How
Different Are They Really? In SIGMOD, 2008.
[6] P. A. Boncz and M. L. Kersten. MIL Primitives for Querying a Fragmented World. VLDB
Journal, 8(2):101–119, 1999.
[7] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query
Execution. In CIDR, 2005.
[8] N. Bruno. Teaching an Old Elephant New Tricks. In CIDR, 2009.
[9] M. Canim, G. A. Mihaila, B. Bhattacharjee, C. A. Lang, and K. A. Ross. Buffered Bloom
Filters on Solid State Storage. In ADMS, 2010.
[10] S. Chen. FlashLogging: Exploiting Flash Devices for Synchronous Logging
Performance. In SIGMOD, 2009.
[11] B. Debnath, S. Sengupta, and J. Li. FlashStore: High Throughput Persistent Key-Value
Store. In VLDB, 2010.
[12] G. Graefe. Efficient Columnar Storage in B-trees. SIGMOD Record, 36(1):3–6, 2007.
[13] A. Halverson, J. L. Beckmann, J. F. Naughton, and D. J. DeWitt. A Comparison of
C-Store and Row-Store in a Common Framework. Technical Report TR1570, University of
Wisconsin-Madison, 2006.
[14] Y.-R. Kim, K.-Y. Whang, and I.-Y. Song. Page-Differential Logging: An Efficient and
DBMS-Independent Approach for Storing Data into Flash Memory. In SIGMOD, 2010.
[15] P.-Å. Larson, C. Clinciu, E. N. Hanson, A. Oks, S. L. Price, S. Rangarajan, A. Surna,
and Q. Zhou. SQL Server Column Store Indexes. In SIGMOD, 2011.
[16] C. Mohan, D. J. Haderle, Y. Wang, and J. M. Cheng. Single Table Access Using
Multiple Indexes: Optimization, Execution, and Concurrency Control Techniques. In EDBT,
1990.
[17] S. Nath and A. Kansal. FlashDB: Dynamic Self-Tuning Database for NAND Flash. In
IPSN, 2007.
[18] S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras.
Multi-Dimensional Clustering: A New Data Layout Scheme in DB2. In SIGMOD, 2003.
[19] S. Padmanabhan, T. Malkemus, R. C. Agarwal, and A. Jhingran. Block Oriented
Processing of Relational Database Operations in Modern Computer Architectures. In ICDE,
2001.
[20] R. Ramamurthy, D. J. DeWitt, and Q. Su. A Case for Fractured Mirrors. In VLDB, 2002.
[21] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy,
Adaptive RID-List Intersection, and Its Application to Index Anding. In SIGMOD, 2007.
[22] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A.
Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A
Column-Oriented DBMS. In VLDB, 2005.
[23] D. Tsirogiannis, S. Harizopoulos, M. A. Shah, J. L. Wiener, and G. Graefe. Query
Processing Techniques for Solid State Drives. In SIGMOD, 2009.