Improving Performance of Decision Support
Queries
in Columnar Cloud Database Systems
Serkan Ozal and Ahmet Cosar
Department of Computer Engineering, Middle East Technical University
{serkan.ozal,cosar}@metu.edu.tr
Abstract—MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large scale data clusters. Besides the execution model, the storage model also has a significant effect on query performance. Storage model improvements generally focus on reducing I/O operation count and latency through storage format design. There is no single best storage format for all cases and queries, since each format has its own advantages and disadvantages. In this study, we compare the performance of existing storage formats on Hive and propose a new variation of them, called DWRF++, which performs better for some cases and queries. Remaining optimization opportunities for the proposed format are discussed in the conclusions and future work.
Index Terms—Hadoop, Hive, Columnar Storage, Query Optimization, ORC, Data Warehouse
1 INTRODUCTION
HADOOP [1] is an open source framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is based on the MapReduce paradigm, which divides computation tasks into sub-tasks that are distributed across the cluster to provide horizontal scalability [2], [7], [8], [9], [10]. Hadoop has two main components: HDFS and MapReduce.
Fig. 1. HDFS
HDFS (Hadoop Distributed File System) is the distributed file system component of Hadoop [3]. HDFS stores file system metadata on a node named "Name Node" and application data on nodes named "Data Node" separately. While the name node manages the file system namespace and regulates client access to files, the data nodes store data as replicated blocks within files.
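As a small, hedged illustration (not part of the paper), the Java snippet below uses the standard HDFS client API to read a file: the client asks the name node for metadata and then streams block data from the data nodes. The name node address and file path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical name node address; the client contacts the name node
    // for metadata and then streams block data directly from data nodes.
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
         BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```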
MapReduce is the distributed processing component of Hadoop [6], [4]. Hadoop's MapReduce functions use (key, value) pairs as their data format. The input is retrieved in chunks from HDFS and assigned to mappers that process the data in parallel and produce (k1, v1) pairs for the reduce step. Each (k1, v1) pair then goes through the shuffle phase, which assigns pairs with the same k1 to the same reducer. The reducers gather the pairs with the same k1 values into groups and perform aggregation operations (see Figure 2) [2].
Fig. 2. MapReduce tasks.
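A minimal sketch of this (key, value) flow, assuming the classic word-count example written against the Hadoop MapReduce Java API (not code from this paper): the mapper emits (word, 1) pairs, the shuffle groups them by word, and the reducer sums each group.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emits a (word, 1) pair for every token in its input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        word.set(token);
        ctx.write(word, ONE); // (k1, v1) pairs handed to the shuffle phase
      }
    }
  }

  // Reducer: receives all values that share the same key and aggregates them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}
```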
Querying data with Hadoop as in a traditional RDBMS infrastructure is one of the most common problems that Hadoop users face: it requires programming MapReduce jobs, which is very different from querying a traditional RDBMS [11], [12]. This affects the majority of users, who are not familiar with the internal details of MapReduce jobs, when they try to extract information from their data warehouses [2].
Hive is an open source SQL-based distributed warehouse system that was proposed to solve the problems mentioned above by providing an SQL-like abstraction on top of the Hadoop framework [2]. As shown in Figure 3, Hive translates queries into MapReduce jobs to be run on Hadoop. Hive has its own query language called HiveQL [13]. HiveQL supports select, project, join, aggregate, union all, and sub-queries in the from clause. It also supports data definition (DDL) statements to create tables with specific serialization formats, and partitioning and bucketing columns [5].
Fig. 3. Hive Architecture
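To make this concrete, here is a hedged sketch (our illustration, not from the paper) of submitting HiveQL DDL and a query to HiveServer2 over JDBC; the endpoint, credentials, and table layout are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint; adjust host/port/database as needed.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // DDL: create a table with an explicit columnar serialization format.
      stmt.execute("CREATE TABLE IF NOT EXISTS lineitem_orc ("
          + " l_orderkey BIGINT, l_quantity DOUBLE, l_shipdate DATE)"
          + " STORED AS ORC");

      // HiveQL query: Hive translates it into MapReduce (or Tez) jobs.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT COUNT(*) FROM lineitem_orc"
               + " WHERE l_shipdate <= DATE '1998-09-02'")) {
        while (rs.next()) {
          System.out.println("rows: " + rs.getLong(1));
        }
      }
    }
  }
}
```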
Since most data warehouse applications are implemented using SQL-based RDBMSs, Hive lowers the barrier to moving these applications to Hadoop: people who already know SQL can easily use Hive, and developers can more easily port their SQL-based applications to it [2].
There are many completed and ongoing studies on improving indexing and query plans to increase query performance [2], [34], [35], [36], [37]. In addition to these areas, the storage format is another important factor in query performance [28], [24], [29], [30], [31], [32], [21].
Columnar storage is a well-known storage model, especially in the NoSQL ecosystem [30]. With columnar storage, disk I/O can be reduced dramatically and less data is loaded from disk [23]. Figure 4 shows how records from database tables are typically stored into disk blocks by row.
Fig. 4. Row-Wise Storage
In row-wise storage, data blocks store the values of each consecutive column sequentially, making up the entire row. If the block size is smaller than the size of a record, a record may be saved across multiple blocks. If the block size is larger than the size of a record, disk space is wasted through internal fragmentation [23].
Figure 5 shows how, with columnar storage, the values of each column are stored sequentially into disk blocks.
Fig. 5. Columnar Storage
In columnar storage, each data block stores a single column for multiple rows [31]. In this simplified example, each data block holds column values for up to three times as many records as row-based storage.
The biggest advantage of this layout is that it reduces the number of I/O operations required to read column values compared to row-wise storage. Since many operations only need to access or operate on one or a small number of columns at a time, columnar storage saves I/O and memory by retrieving only the blocks of the columns actually needed by a query. In the example above, columnar storage requires a third of the I/O operations of row-wise storage [23].
Another advantage is that, since each block holds the same type of data, a compression scheme can be selected for each individual column based on its data type, reducing both disk space and I/O.
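A rough sketch of the difference (illustrative only; the two-column table and its values are made up): the same table is serialized row-by-row and column-by-column. In the columnar layout, a scan of one column touches a single contiguous region, which could also be compressed with a codec suited to that column's type.

```java
import java.nio.ByteBuffer;

public class LayoutSketch {
  public static void main(String[] args) {
    long[] ids = {1, 2, 3};
    double[] prices = {9.5, 7.0, 3.25};

    // Row-wise: each record's fields sit next to each other, so a scan
    // of only "price" still drags the "id" bytes through I/O.
    ByteBuffer rowWise = ByteBuffer.allocate(3 * (Long.BYTES + Double.BYTES));
    for (int i = 0; i < 3; i++) {
      rowWise.putLong(ids[i]).putDouble(prices[i]);
    }

    // Columnar: all values of one column are contiguous, so a query on
    // "price" reads only the price region, and that region could use a
    // compression scheme chosen specifically for doubles.
    ByteBuffer idColumn = ByteBuffer.allocate(3 * Long.BYTES);
    ByteBuffer priceColumn = ByteBuffer.allocate(3 * Double.BYTES);
    for (long id : ids) idColumn.putLong(id);
    for (double p : prices) priceColumn.putDouble(p);
  }
}
```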
There are several columnar storage formats, such as RC (Record Columnar), Parquet, and ORC (Optimized Row Columnar). ORC is an improved version of RC, and in our experiments we use the ORC storage format as the reference point.
1.1 Record Columnar - RC
RCFile [33] is a hybrid columnar storage format that partitions Hive table data first horizontally into row groups and then vertically by columns, so that the columns are written out one after another as contiguous chunks, as shown in Figure 6. Each column can be compressed individually using a codec such as Zlib or LZO. When querying over a subset of columns, decompression is performed lazily on demand, so columns that are not needed are neither decompressed nor deserialized.
Fig. 6. RC Storage Format
1.2 Optimized Row Columnar - ORC
RCFile has a limitation: it treats each column as a binary blob without semantics. To handle this limitation, the Optimized Row Columnar (ORC) format was introduced, which uses and retains the type information from the table definition. In the ORC format, the data is partitioned into a set of stripes (typically 256 MB). A stripe is similar to a row group in RCFile, as shown in Figure 7. Each stripe can have its own compression codec, such as Zlib or Snappy [24].
ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes (typically one per 10,000 rows). The statistics include the minimum and maximum values of each column; this allows query optimizations such as skipping over a set of rows when applying column filters. This approach is called "predicate pushdown" [24]. Predicate pushdown reduces the amount of data that must be read from disk or across the network by skipping data that cannot match the query.
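As a hedged sketch (not from the paper; details may vary across ORC library versions), the Java snippet below reads the file-level column statistics that predicate pushdown compares filters against, using the org.apache.orc API; the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.IntegerColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcStatsSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical ORC file; in Hive this would be one file of a table.
    Reader reader = OrcFile.createReader(
        new Path("/warehouse/lineitem/part-00000.orc"),
        OrcFile.readerOptions(new Configuration()));

    // File-level statistics: one entry per column in the schema.
    for (ColumnStatistics cs : reader.getStatistics()) {
      if (cs instanceof IntegerColumnStatistics) {
        IntegerColumnStatistics ics = (IntegerColumnStatistics) cs;
        // Min/max are what predicate pushdown compares query filters against.
        System.out.println("min=" + ics.getMinimum()
            + " max=" + ics.getMaximum());
      }
    }
  }
}
```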
1.3 Introducing DWRF++
ORC in Hive does not support "lazy read" (lazy decompression) of columns. With lazy read, only the required columns are decompressed, which saves CPU time by eliminating the redundant decompression of columns for rows that do not match the filter. To remove this overhead, we propose the "DWRF++" storage format, which is based on the "DWRF" [25], [26], [27] storage format and supports lazy reads as well as basic column-statistics-based filtering per file block.
Fig. 7. ORC Storage Format
In Section 2, we give brief information about related work on columnar storage formats (such as DWRF) that eliminate redundant reads in Hive. Section 3 explains our proposed "DWRF++" storage format, which is based on the regular "DWRF" format previously open-sourced by Facebook. Section 4 discusses the experiments conducted to evaluate the "DWRF++" storage format. Finally, our concluding remarks are given in Section 5.
2 RELATED WORK
Hive currently supports the ORC format through its "OrcInputFormat" implementation, so it has "columnar read" and "predicate pushdown" abilities. Predicate pushdown is very effective when it works, but for many data sets it does not help at all: if the data has a large number of distinct values and is well shuffled, the minimum and maximum statistics will cover almost the entire range of values, rendering predicate pushdown ineffective [24].
Besides these features, the Hive ORC reader does not support lazy reads. With lazy reads, the query engine first inspects only the columns needed to evaluate the query filter, and only then reads the other columns for the segments that match the filter (if any are found). For most workloads, this feature saves more CPU than predicate pushdown.
Facebook worked on this feature and open-sourced [27] its in-house "DWRF" storage format, which is a variation of the regular ORC format [25]. The key advantage of DWRF is that it supports "lazy read". Consider a query that selects many columns while applying a very selective filter on one column. Without lazy decompression, all the data of every column read by the query is decompressed. Ideally, only the data for the rows that pass the filter would be decompressed and decoded, so that the query does not spend the majority of its time decompressing and decoding data that never gets used. To support this, DWRF provides lazy decompression and lazy decoding by leveraging the index strides that already exist in the format. In the case described above, all the rows of the column the filter operates on are decompressed and decoded. For the other columns, and only for the rows that pass the filter, the DWRF reader seeks to the appropriate index stride in the stripe (a metadata operation) and decompresses and decodes only the values preceding and including the target row in that index stride.
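The sketch below distills this lazy-read idea over a hypothetical column-reader interface (our simplification, not DWRF's actual reader): the filter column is decoded first, and the remaining columns are decoded only at row positions that pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

public class LazyReadSketch {
  // Hypothetical column reader: decompression/decoding happens inside read().
  interface ColumnReader {
    long read(int row); // decode a single value at the given row position
  }

  // Evaluate the filter on one column, then materialize the other columns
  // only for matching rows, skipping decode work everywhere else.
  static List<long[]> scan(ColumnReader filterCol, ColumnReader[] otherCols,
                           int rowCount, LongPredicate filter) {
    List<long[]> result = new ArrayList<>();
    for (int row = 0; row < rowCount; row++) {
      if (!filter.test(filterCol.read(row))) {
        continue; // non-matching row: other columns are never decoded
      }
      long[] record = new long[otherCols.length];
      for (int c = 0; c < otherCols.length; c++) {
        record[c] = otherCols[c].read(row); // lazy decode on demand
      }
      result.add(record);
    }
    return result;
  }
}
```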
However, DWRF does not support the "columnar read" and "predicate pushdown" features [26]. Facebook added all three critical features ("columnar read", "predicate pushdown", and "lazy read") to Presto, an open source distributed SQL query engine for running interactive analytic queries, but they are not currently supported by Hive.
3 IMPROVED DWRF - DWRF++
In this study, we propose an improved variation of regular DWRF, called DWRF++, with a partial "predicate pushdown" feature, and compare it with the TEXT, regular ORC, and DWRF formats.
Like the ORC format shown in Figure 7, the DWRF format (Figure 8) has a footer which contains information about the stripes and the column statistics of that file block. The column statistics contain metadata that depends on the column data type. For numeric and string typed columns, the statistics contain the minimum and maximum values of that particular file block.
In our DWRF++ approach, these minimum and maximum values are used at query time to determine whether the current file block is promising, that is, whether the block can contain records the query is interested in.
For equality-based conditional queries, if the conditional value is smaller than the minimum or bigger than the maximum value of a particular file block, that block is simply skipped.
Fig. 8. DWRF Storage Format
For range-based conditional queries (smaller and/or greater), if the conditional range and the min/max range of a particular file block do not overlap, that block is simply skipped.
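A minimal sketch of this block-level filter (our reconstruction for illustration; in DWRF++ the check runs against the min/max values read from the file block footer):

```java
public class BlockSkipSketch {
  // Equality predicate: skip the block if the constant lies outside
  // the [min, max] range recorded in the file block's footer.
  static boolean skipForEquality(long value, long blockMin, long blockMax) {
    return value < blockMin || value > blockMax;
  }

  // Range predicate lo <= col <= hi: skip the block if the query range
  // and the block's [min, max] range do not overlap.
  static boolean skipForRange(long lo, long hi, long blockMin, long blockMax) {
    return hi < blockMin || lo > blockMax;
  }

  public static void main(String[] args) {
    // Footer stats say the block's values lie between 100 and 200.
    System.out.println(skipForEquality(50, 100, 200));   // true: skip block
    System.out.println(skipForRange(150, 300, 100, 200)); // false: must read
  }
}
```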
4 EXPERIMENTAL SETUP AND RESULTS
In this section, the experimental setup and the performance evaluation of the DWRF++ storage format are presented. TPC-H is chosen as our benchmark database, together with its decision support queries, because they process high volumes of data. We believe this is a good match for our experiments, since Hadoop is also designed to process large amounts of data. We run our benchmarks on AWS. The hardware/software environment of the machines is shown in Table 1.
TABLE 1
Hardware/Software environment information
Instance Type m3.xlarge
CPU Intel Xeon E5-2670 v2, 2.6 GHz, 4 virtual core
Memory 15 GB
Storage (SSD) 2 x 40 GB
Operating System AWS Linux based on Debian 6.0.5
Hadoop Version 2.7.3
Hive Version 2.1.1
In our experiments, the DAG (Directed Acyclic Graph) based "Apache Tez" framework is used as the execution engine, and vectorization, predicate pushdown, and cost-based query optimization are enabled. The Hive configurations are shown in Table 2.
TABLE 2
Hive configurations
Configuration Value
hive.execution.engine tez
hive.vectorized.execution.enabled true
hive.vectorized.execution.reduce.enabled true
hive.vectorized.execution.reduce.groupby.enabled true
hive.cbo.enable true
hive.compute.query.using.stats true
hive.stats.fetch.column.stats true
hive.stats.fetch.partition.stats true
hive.optimize.ppd true
hive.optimize.ppd.storage true
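For reference, these settings can also be applied per session rather than in hive-site.xml; a hedged sketch over the same hypothetical JDBC connection as before, showing a subset of the Table 2 configuration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSessionConfig {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Session-level equivalents of (part of) the Table 2 configuration.
      stmt.execute("SET hive.execution.engine=tez");
      stmt.execute("SET hive.vectorized.execution.enabled=true");
      stmt.execute("SET hive.cbo.enable=true");
      stmt.execute("SET hive.optimize.ppd=true");
      stmt.execute("SET hive.optimize.ppd.storage=true");
    }
  }
}
```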
Since our DWRF++ currently has only basic "predicate pushdown" support, we selected the TPC-H Query 1 and Query 6 benchmarks because of their simple predicate conditions. File blocks are filtered according to the minimum and maximum values recorded in the column statistics of their footers for the columns referenced in the query condition.
• Every test is run 10 times for more accurate results
• 3 different TPC-H datasets are queried:
– Small TPC-H dataset: 1GB
– Medium TPC-H dataset: 10GB
– Big TPC-H dataset: 100GB
• 3 different sized clusters are formed:
– Small cluster: 1 master + 2 workers
– Medium cluster: 1 master + 4 workers
– Big cluster: 1 master + 8 workers
4.1 TPC-H Query 1
In this benchmark, we used TPC-H Query 1, which has a "less than or equal" (≤) condition on the "L_SHIPDATE" column of the "lineitem" table.
TABLE 3
1 Master + 2 Workers
Storage / Data Size 1GB 10GB 100GB
Text 53 seconds 114 seconds 595 seconds
ORC 28 seconds 39 seconds 88 seconds
DWRF 29 seconds 41 seconds 107 seconds
DWRF++ 21 seconds 28 seconds 52 seconds
TABLE 4
1 Master + 4 Workers
Storage / Data Size 1GB 10GB 100GB
Text 46 seconds 86 seconds 312 seconds
ORC 25 seconds 34 seconds 70 seconds
DWRF 26 seconds 40 seconds 84 seconds
DWRF++ 23 seconds 27 seconds 49 seconds
TABLE 5
1 Master + 8 Workers
Storage / Data Size 1GB 10GB 100GB
Text 44 seconds 71 seconds 190 seconds
ORC 24 seconds 33 seconds 64 seconds
DWRF 25 seconds 40 seconds 74 seconds
DWRF++ 21 seconds 24 seconds 48 seconds
4.2 TPC-H Query 6
In this benchmark, we used TPC-H Query 6, which has various "greater than" (>), "greater than or equal" (≥), "less than" (<), and "less than or equal" (≤) conditions on the "L_SHIPDATE", "L_DISCOUNT", and "L_QUANTITY" columns of the "lineitem" table.
TABLE 6
1 Master + 2 Workers
Storage / Data Size 1GB 10GB 100GB
Text 42 seconds 143 seconds 531 seconds
ORC 20 seconds 26 seconds 46 seconds
DWRF 22 seconds 32 seconds 88 seconds
DWRF++ 22 seconds 28 seconds 34 seconds
TABLE 7
1 Master + 4 Workers
Storage / Data Size 1GB 10GB 100GB
Text 37 seconds 73 seconds 290 seconds
ORC 15 seconds 23 seconds 44 seconds
DWRF 21 seconds 29 seconds 68 seconds
DWRF++ 20 seconds 25 seconds 31 seconds
TABLE 8
1 Master + 8 Workers
Storage / Data Size 1GB 10GB 100GB
Text 35 seconds 60 seconds 175 seconds
ORC 15 seconds 21 seconds 39 seconds
DWRF 21 seconds 29 seconds 60 seconds
DWRF++ 19 seconds 21 seconds 28 seconds
4.3 The Effect of DWRF++
As seen from the benchmark results, the regular ORC format performs better than the raw text format, and the difference between the two becomes more significant as the data size grows. The regular DWRF format also performs much better than raw text, but ORC is still faster than DWRF. The benchmark results show that, for queries such as TPC-H 1 and TPC-H 6 that have simple conditional predicates, our improved DWRF++ format performs about 30% better than the regular ORC format on large data sets.
5 CONCLUSION AND FUTURE WORK
In this study, we propose an improved version of the DWRF storage format, which we call DWRF++. DWRF was developed by Facebook as a variation of the open-source ORC format; our goal is to improve the performance of MapReduce-based Hive data warehouse queries. To our knowledge, this is the first work that adds a "predicate pushdown" feature to the DWRF storage format and tests it on TPC-H HiveQL queries. With predicate pushdown, non-promising blocks can be skipped directly while executing queries. We have shown experimentally that significant performance improvements can be achieved by ignoring non-promising record blocks, reducing both I/O and computation.
Currently, the "predicate pushdown" technique is applied on a per-file-block basis. As future work, we plan to apply "predicate pushdown" filtering per stripe. In this way, it will be possible to eliminate non-promising records at a finer granularity.
In addition, bloom filters can be used on a per-file-block or per-stripe basis, which can be very useful especially for equality-based conditional queries: if the bloom filter reports that a key does not exist in a file block or stripe, that block or stripe can be skipped silently, as sketched below.
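A hedged sketch of this idea using Guava's BloomFilter (the per-block wiring and the numbers are our illustration, not DWRF++ code):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomSkipSketch {
  public static void main(String[] args) {
    // Built once per file block (or stripe) over the column's values,
    // e.g. while writing the block, and stored alongside its statistics.
    BloomFilter<Long> blockFilter =
        BloomFilter.create(Funnels.longFunnel(), 10_000, 0.01);
    blockFilter.put(42L);
    blockFilter.put(7L);

    long queryKey = 1234L;
    // mightContain() can return false positives but never false negatives,
    // so a "false" answer lets the reader skip the block safely.
    if (!blockFilter.mightContain(queryKey)) {
      System.out.println("block skipped for key " + queryKey);
    }
  }
}
```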
As a final improvement, we are planning to combine these columnar storage improvements with the multiple-query optimization (MQO) technique of sharing correlated query parts, which was introduced in SharedHive [2]. We think this combined improvement can be very useful, especially for decision support queries in data mining, by improving query performance with less CPU power and fewer I/O operations.
REFERENCES
[1] Hadoop project. http://hadoop.apache.org/.
[2] Dokeroglu T, Ozal S, Bayir M A, et al. Improving the performance
of Hadoop Hive by sharing scan and computation tasks[J]. Journal
of Cloud Computing, 2014, 3(1):1-11.
[3] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert
Chansler, The Hadoop Distributed File System, Proceedings of
the 2010 IEEE 26th Symposium on Mass Storage Systems and
Technologies (MSST), p.1-10, May 03-07, 2010
[4] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Yahoo! Press, June 2009.
[5] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao,
Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff,
Raghotham Murthy, Hive: a warehousing solution over a map-
reduce framework, Proceedings of the VLDB Endowment, v.2 n.2,
August 2009
[6] Dean, J., and Ghemawat, S. (2008). MapReduce: simplified data
processing on large clusters. Communications of the ACM, 51(1),
107-113.
[7] Condie, T., et al. (2010). MapReduce online. In Proceedings of
the 7th USENIX conference on Networked systems design and
implementation.
[8] Stonebraker, M., et al. (2010). MapReduce and parallel DBMSs:
friends or foes. Communications of the ACM, 53(1), 64-71.
[9] DeWitt, D., and Stonebraker, M. (2008). MapReduce: A major step
backwards. The Database Column,1.
[10] Lee, K. H., et al. (2012). Parallel data processing with MapReduce:
a survey. ACM SIGMOD Record, 40(4), 11-20.
[11] Abouzeid, A., et al. (2009). HadoopDB: an architectural hybrid
of MapReduce and DBMS technologies for analytical workloads.
Proceedings of the VLDB Endowment, 2(1), 922-933.
[12] Ordonez, C., Song, I. Y., and Garcia-Alvarado, C. (2010). Relational
versus non-relational database systems for data warehousing. In
Proceedings of the ACM 13th international workshop on Data
warehousing and OLAP (67-68).
[13] Thusoo, A., et al. (2010). Hive-a petabyte scale data warehouse
using hadoop. ICDE,(996-1005).
[14] Chaiken, R., et al. (2008). SCOPE: easy and efficient parallel pro-
cessing of massive data sets. Proceedings of the VLDB Endowment,
1(2), 1265-1276.
[15] Cohen, J., et al. (2009). MAD skills: new analysis practices for big
data. VLDB, 2(2), 1481-1492.
[16] He, Y., et al. (2011). Rcfile: A fast and space-efficient data place-
ment structure in mapreduce-based warehouse systems. ICDE
(1199-1208).
[17] Lee, R., et al. (2011). Ysmart: Yet another sql-to-mapreduce trans-
lator. ICDCS (25-36).
[18] Beynon, M., et al. (2002). Processing large-scale multi-dimensional
data in parallel and distributed environments. Parallel Computing,
28(5), 827-859.
[19] Bajda-Pawlikowski, K., Abadi, D. J., Silberschatz, A., and Paulson,
E. (2011). Efficient processing of data warehousing queries in a split
execution environment. In Proceedings of international conference
on Management of data (1165-1176).
[20] Friedman, E., Pawlowski, P., and Cieslewicz, J. (2009).
SQL/MapReduce: A practical approach to self-describing,
polymorphic, and parallelizable user-defined functions. VLDB,
2(2), 1402-1413.
[21] Gruenheid, A., Omiecinski, E., and Mark, L. (2011). Query opti-
mization using column statistics in hive. In Proceedings of the 15th
Symposium on International Database Engineering and Applica-
tions (97-105).
[22] Running TPC-H queries on Hive.
http://issues.apache.org/jira/browse/HIVE-600.
[23] http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
[24] Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner,
Eric N. Hanson, Owen O’Malley, Jitendra Pandey, Yuan Yuan,
Rubao Lee, Xiaodong Zhang, Major technical advancements in
apache hive, Proceedings of the 2014 ACM SIGMOD International
Conference on Management of Data, June 22-27, 2014, Snowbird,
Utah, USA
[25] https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[26] https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/
[27] https://github.com/facebookarchive/hive-dwrf
[28] Avrilia Floratou, Umar Farooq Minhas, Fatma Özcan, SQL-on-Hadoop: full circle back to shared-nothing database architectures, Proceedings of the VLDB Endowment, v.7 n.12, p.1295-1306, August 2014.
[29] Kumar, R. and Kumar, N., Improved join operations using ORC in HIVE. CSI Transactions on ICT, pp. 1-7.
[30] Mehmood A, Iqbal M, Khaleeq M, Khaliq Y. Performance analysis
of shared-nothing SQL-on-Hadoop frameworks based on columnar
database systems. InInnovative Computing Technology (INTECH),
2016 Sixth International Conference on 2016 Aug 24 (pp. 128-133).
IEEE.
[31] Plase D, Niedrite L, Taranovs R. Accelerating data queries on
Hadoop framework by using compact data formats. InAdvances
in Information, Electronic and Electrical Engineering (AIEEE), 2016
IEEE 4th Workshop on 2016 Nov 10 (pp. 1-7). IEEE.
[32] Liu Y, Guo S, Hu S, Rabl T, Jacobsen HA, Li J, Wang J. Performance
Evaluation and Optimization of Multi-dimensional Indexes in
Hive. IEEE Transactions on Services Computing. 2016 Jul 27.
[33] He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z. RCFile: A fast
and space-efficient data placement structure in MapReduce-based
warehouse systems. InData Engineering (ICDE), 2011 IEEE 27th
International Conference on 2011 Apr 11 (pp. 1199-1208). IEEE.
[34] Bayir, M. A., Toroslu, I. H., and Cosar, A. (2007). Genetic algorithm for the multiple-query optimization problem. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(1), 147-153.
[35] Cosar, A., Lim, E. P., and Srivastava, J. (1993). Multiple query op-
timization with depth-first branch-and-bound and dynamic query
ordering. In Proceedings of the second international conference on
Information and knowledge management (433-438).
[36] Zhou, J., Larson, P. A., Freytag, J. C., and Lehner, W. (2007). Effi-
cient exploitation of similar subexpressions for query processing.
In Proceedings of ACM SIGMOD (533-544).
[37] Silva, Y. N., Larson, P., and Zhou, J. (2012). Exploiting Common
Subexpressions for Cloud Query Processing. ICDE (1337-1348).

More Related Content

What's hot

Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Amazon Web Services
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017Amazon Web Services
 
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Introduction to AWS Database Services
Introduction to AWS Database ServicesIntroduction to AWS Database Services
Introduction to AWS Database ServicesAmazon Web Services
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioAlluxio, Inc.
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesGetting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesAmazon Web Services
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRCost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRProvectus
 
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora Datavail
 
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Amazon Web Services
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduceAmazon Web Services
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...Amazon Web Services
 
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Amazon Web Services
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseAmazon Web Services
 

What's hot (20)

Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
 
Beyond EC2 and S3
Beyond EC2 and S3Beyond EC2 and S3
Beyond EC2 and S3
 
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
 
Introduction to AWS Database Services
Introduction to AWS Database ServicesIntroduction to AWS Database Services
Introduction to AWS Database Services
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
 
Amazon Aurora: Under the Hood
Amazon Aurora: Under the HoodAmazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
 
Getting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesGetting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute Services
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRCost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Masterclass - Redshift
Masterclass - RedshiftMasterclass - Redshift
Masterclass - Redshift
 
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
Data Replication Options in AWS (ARC302) | AWS re:Invent 2013
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 

Similar to Improving Performance of Decision Support Queries in Columnar Cloud Database Systems

Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniquesIJDKP
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentFei Dong
 
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...IRJET Journal
 
Db2 Important questions to read
Db2 Important questions to readDb2 Important questions to read
Db2 Important questions to readPrasanth Dusi
 
Presentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMSPresentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMSabdurrobsoyon
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...IJCERT JOURNAL
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisliang chen
 
What is Hive Optimization_ - Bahaa Al Zubaidi.pdf
What is Hive Optimization_ - Bahaa Al Zubaidi.pdfWhat is Hive Optimization_ - Bahaa Al Zubaidi.pdf
What is Hive Optimization_ - Bahaa Al Zubaidi.pdfBahaa Al Zubaidi
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architectureAjeet Singh
 
Database Performance Management in Cloud
Database Performance Management in CloudDatabase Performance Management in Cloud
Database Performance Management in CloudDr. Amarjeet Singh
 
Getting to know oracle database objects iot, mviews, clusters and more…
Getting to know oracle database objects iot, mviews, clusters and more…Getting to know oracle database objects iot, mviews, clusters and more…
Getting to know oracle database objects iot, mviews, clusters and more…Aaron Shilo
 
The design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesThe design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesTilak Patidar
 

Similar to Improving Performance of Decision Support Queries in Columnar Cloud Database Systems (20)

Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniques
 
Lecture3.ppt
Lecture3.pptLecture3.ppt
Lecture3.ppt
 
Column oriented Transactions
Column oriented TransactionsColumn oriented Transactions
Column oriented Transactions
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud Environment
 
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
 
Db2 Important questions to read
Db2 Important questions to readDb2 Important questions to read
Db2 Important questions to read
 
Ceph
CephCeph
Ceph
 
Presentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMSPresentation on NoSQL Database related RDBMS
Presentation on NoSQL Database related RDBMS
 
hive hadoop sql
hive hadoop sqlhive hadoop sql
hive hadoop sql
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
 
What is Hive Optimization_ - Bahaa Al Zubaidi.pdf
What is Hive Optimization_ - Bahaa Al Zubaidi.pdfWhat is Hive Optimization_ - Bahaa Al Zubaidi.pdf
What is Hive Optimization_ - Bahaa Al Zubaidi.pdf
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architecture
 
Database Performance Management in Cloud
Database Performance Management in CloudDatabase Performance Management in Cloud
Database Performance Management in Cloud
 
Getting to know oracle database objects iot, mviews, clusters and more…
Getting to know oracle database objects iot, mviews, clusters and more…Getting to know oracle database objects iot, mviews, clusters and more…
Getting to know oracle database objects iot, mviews, clusters and more…
 
The design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesThe design and implementation of modern column oriented databases
The design and implementation of modern column oriented databases
 
Oracle
OracleOracle
Oracle
 
3 OLAP.pptx
3 OLAP.pptx3 OLAP.pptx
3 OLAP.pptx
 

More from Serkan Özal

Flying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaFlying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaSerkan Özal
 
JVM Under the Hood
JVM Under the HoodJVM Under the Hood
JVM Under the HoodSerkan Özal
 
Ankara JUG Big Data Presentation
Ankara JUG Big Data PresentationAnkara JUG Big Data Presentation
Ankara JUG Big Data PresentationSerkan Özal
 
AWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceAWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceSerkan Özal
 

More from Serkan Özal (7)

Flying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaFlying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS Lambda
 
MySafe
MySafeMySafe
MySafe
 
JVM Under the Hood
JVM Under the HoodJVM Under the Hood
JVM Under the Hood
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Ankara JUG Big Data Presentation
Ankara JUG Big Data PresentationAnkara JUG Big Data Presentation
Ankara JUG Big Data Presentation
 
AWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceAWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map Reduce
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 

Recently uploaded

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 

Recently uploaded (20)

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 

Improving Performance of Decision Support Queries in Columnar Cloud Database Systems

  • 1. 1 Improving Performance of Decision Support Queries in Columnar Cloud Database Systems Serkan Ozal 1 and Ahmet Cosar 1 1 Department of Computer Engineering, Middle East Technical University {serkan.ozal,cosar}@metu.edu.tr Abstract—MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large scale data clusters. Besides execution model, storage model has also significant effect on performance of queries. Storage model improvements are generally focused on reducing I/O operation count and latency by designing storage formats. There is no best storage format for all cases and queries because each of them has their own advantages and disadvantages. In this study, we compare performances of existing storage formats on Hive and propose a new variation of them called DWRF++ which performs better in some cases and queries. There are still possible optimization points for new proposed format and they are discussed as conclusions and future work. Index Terms—Hadoop, Hive, Columnar Storage, Query Optimization, ORC, Data Warehouse ! 1 INTRODUCTION HADOOP [1] is an open source framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is based on the MapReduce paradigm to divide the computation tasks into parts that can be distributed across the cluster to provide horizontal scalability [2]. Hadoop is based on the MapReduce paradigm which divides the computation tasks into sub-tasks which are distributed across the cluster to provide horizontal scalability. [7], [8], [9], [10]. Hadoop has two main components: HDFS and MapReduce. Fig. 1. HDFS HDFS (Hadoop Distributed File System) is the distributed file system component of Hadoop [3]. HDFS stores file system metadata on node named ”Name Node” and application data on nodes named ”Data Node” separately. While name node manages the file system namespace and regulates client access to files, data nodes store data in blocks within files as replicated. MapReduce is the distributed processing component of Hadoop [6], [4]. The MapReduce functions of Hadoop uses (key,value) pairs as data format. The input is retrieved in chunks from Hadoop Distributed File System (HDFS) and assigned to one of the mappers that will process data in parallel and produce the (k1,v1) pairs for the reduce step. Then, (k1,v1) pair goes through shuffle phase that assigns the same k1 pairs to the same reducer. The reducers gather the pairs with the same k1 values into groups and perform aggregation operations (see Figure 2) [2]. Fig. 2. MapReduce tasks.
  • 2. 2 Querying the data with Hadoop as in a traditional RDBMS infrastructure is one of the most common problems that Hadoop users face. This affects a majority of users who are not familiar with the internal details of MapReduce jobs to extract information from their data warehouses [11]. Querying the data with Hadoop is required programming map-reduce jobs. This is very different way from querying over traditional RDBMS [12]. This affects a majority of users who are not familiar with the internal details of MapReduce jobs to extract information from their data warehouses [2]. Hive is an open source SQL-based distributed warehouse system which is proposed to solve the problems mentioned above by providing an SQL-like abstraction on top of Hadoop framework [2]. As shown in the Figure 3, Hive translates queries to MapReduce jobs to be run on Hadoop. Hive has its own query language called HiveQL, [13] HiveQL supports select, project, join, aggregate, union all and sub-queries in the from clause. HiveQL supports data definition (DDL) statements to create tables with specific serialization formats, and partitioning and bucketing columns [5]. Fig. 3. Hive Architecture Since most data warehouse applications are implemented using SQL-based RDBMSs, Hive lowers the barrier to moving these applications to Hadoop, thus, people who already know SQL can easily use Hive. Similarly, Hive makes it easier for developers to port SQL-based applications to Hadoop [2]. There are many completed and ongoing studies for improving indexing and query plans to improve query performance [2], [34], [35], [36], [37]. In addition to these study areas, storage format is also one of the important points to improve query performance [28], [24], [29], [30], [31], [32], [21]. Columnar storages are very known storage format especially for NoSQL ecosystem [30]. By columnar storage, disk I/O can be reduced dramatically and less data is loaded from disk [23]. The following figure 4 shows how records from database tables are typically stored into disk blocks by row. Fig. 4. Row-Wise Storage In row-wise storage, data blocks store values sequentially for each consecutive column making up the entire row. If block size is smaller than the size of a record, record might be saved as multiple blocks. If block size is larger than the size of a record, this leads to internal disk fragmentation by wasting disk space [23]. The next figure 5 shows how with columnar storage, the values for each column are stored sequentially into disk blocks. Fig. 5. Columnar Storage In columnar storage, single column for multiple rows are stored in each data block [31]. In this simplified example, using columnar storage, each data block holds column field values for as many as three times as many records as row-based storage. The biggest advantage of this storage is that it reduces the required I/O operations count compared to row-wise storage while getting column values. Since many operations only need to access or operate on one or a small number of columns at a time, columnar storage saves memory space by only retrieving blocks for columns actually needed for a query. In the example above, columnar storage requires a third of the I/O operations compared to row-wise storage [23]. Another advantage is that since each block holds the same type of data, specific compression schema can be selected for each individual column based on column data type for reducing disk space and I/O.
  • 3. 3 There are some columnar storage formats such as RC (Record Columnar), Parquet and ORC (Optimized Row Columnar). ORC is the improved version of RC and in our experiments, we will use ORC (Optimized Row Columnar) storage format as reference point. 1.1 Record Columnar - RC RCFile [33] is a hybrid columnar storage format that partitions Hive table data first horizontally into row groups then vertically by columns so that the columns are written out one after the other as contiguous chunks as shown in the figure 6. Each column can be compressed individually using a Codec like Zlib/Lzo. While querying over subset of columns, decompression of columns are done on demand as lazy so columns, that are not needed, are not decompressed and deserialized. Fig. 6. RC Storage Format 1.2 Optimized Row Columnar - ORC RCFile has limitations because it treats each column as a binary blob without semantics, For handling this limitation, Optimized Row Columnar (ORC) format was introduced which uses and retains the type information from the table definition. In ORCFile format, the data is partitioned into a set of stripes (typically 256MB). A stripe is similar to a row group in RCFile as shown in the figure 7. Each stripe can have its own compression codec such as Zlib, Snappy, etc ... [24] ORC files have three levels of indexing: file statis- tics, stripe statistics and row group indexes (typically 10,000). Statistics includes corresponding minimum and maximum values per column; this allows for query optimizations such as skipping over a set of rows when applying column filters. This approach is called ”pred- icate pushdown” [24]. Predicate pushdown reduces the amount of data that must be read from disk or across the network by skipping non-promising data for the query. 1.3 Introducing DWRF++ ORC doesn’t support ”lazy read” (lazy decompression) of columns. By lazy read, only required columns are decompressed and saves CPU time by eliminating redundant column decompression which doesn’t match the filter. In order to handle this overhead, we propose ”DWRF++” storage format which is based on ”DWRF” [25], [26], [27] storage format which supports lazy reads and also basic column statistics based filtering per file block. Fig. 7. ORC Storage Format In Section 2, we give brief information about the related work on columnar storage formats (such as DWRF) to eliminate redundant reads in Hive. Section 3 explains our proposed ”DWRF++” storage format which is based on regular ”DWRF” format which was open-sourced by Facebook before. Section 4 discusses the experiments conducted for evaluating ”DWRF++” storage format. Finally our concluding remarks are given in Section 5. 2 RELATED WORK Hive currently supports ORC format by its ”OrcInputFormat” implementation. So It has ”columnar read” and ”predicate pushdown” abilities. Predicate pushdown is amazing when it works, but for a lot of data sets, it doesn’t work at all. If the data has a large number of distinct values and is well-shuffled, the minimum and maximum stats will cover almost the entire range of values, rendering predicate pushdown ineffective [24]. Besides these features, Hive ORC reader doesnt support lazy reads. With lazy reads, the query engine always inspects the columns needed to evaluate the query filter, and only then reads other columns for segments that match the filter (if any are found). For most workloads, this feature saves more CPU than
3 IMPROVED DWRF - DWRF++

In this study, we propose an improved variation of regular DWRF, called DWRF++, with a partial "predicate pushdown" feature, and compare it with the TEXT, regular ORC and DWRF formats.

Like the ORC format (Figure 7), the DWRF format (Figure 8) has a footer which contains information about the stripes and column statistics for that file block. The column statistics contain metadata that depends on the column data types; for numeric and string typed columns, they contain the minimum and maximum values for that particular file block. In our DWRF++ approach, these minimum and maximum values are used while querying to determine whether the current file block is promising. Here, promising means that the file block may contain records that the query is interested in. For equality-based conditional queries, if the conditional value is smaller than the minimum or bigger than the maximum value of a particular file block, that block is simply skipped.

Fig. 8. DWRF Storage Format

For range-based conditional queries (smaller and/or greater), if the conditional range and the min/max range of a particular file block do not overlap, that block is simply skipped.
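A minimal sketch of this footer check is given below; it is ours, with illustrative names (ColumnStats, canSkipEquals, canSkipRange) rather than the actual DWRF++ classes, and it assumes numeric column statistics.

```java
// Illustrative per-file-block skip checks based on footer min/max statistics.
public class MinMaxSkipSketch {
    // Column statistics kept in the footer for one file block.
    static class ColumnStats {
        final long min, max;
        ColumnStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Equality predicate: skip the block when the constant lies outside [min, max].
    static boolean canSkipEquals(ColumnStats s, long value) {
        return value < s.min || value > s.max;
    }

    // Range predicate [lo, hi]: skip the block when the ranges do not overlap.
    static boolean canSkipRange(ColumnStats s, long lo, long hi) {
        return hi < s.min || lo > s.max;
    }

    public static void main(String[] args) {
        ColumnStats stats = new ColumnStats(100, 200);
        System.out.println(canSkipEquals(stats, 250));      // true: 250 > max
        System.out.println(canSkipRange(stats, 150, 300));  // false: ranges overlap
    }
}
```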
4 EXPERIMENTAL SETUP AND RESULTS

In this section, the experimental setup and the performance evaluation of the DWRF++ storage format are presented. TPC-H is chosen as our benchmark database, together with its decision support queries, because they process large amounts of data. We believe this is a good match for our experiments, since Hadoop is also designed to process large amounts of data. We run our benchmarks on AWS. The hardware/software environment of the machines is shown in Table 1.

TABLE 1
Hardware/Software environment information

Instance Type: m3.xlarge
CPU: Intel Xeon E5-2670 v2, 2.6 GHz, 4 virtual cores
Memory: 15 GB
Storage (SSD): 2 x 40 GB
Operating System: AWS Linux based on Debian 6.0.5
Hadoop Version: 2.7.3
Hive Version: 2.1.1

In our experiments, the DAG (Directed Acyclic Graph) based "Apache Tez" framework is used as the execution engine, and the vectorization, predicate pushdown and cost-based query optimization features are enabled. The Hive configurations are shown in Table 2.

TABLE 2
Hive configurations

hive.execution.engine: tez
hive.vectorized.execution.enabled: true
hive.vectorized.execution.reduce.enabled: true
hive.vectorized.execution.reduce.groupby.enabled: true
hive.cbo.enable: true
hive.compute.query.using.stats: true
hive.stats.fetch.column.stats: true
hive.stats.fetch.partition.stats: true
hive.optimize.ppd: true
hive.optimize.ppd.storage: true

Since DWRF++ currently has only basic "predicate pushdown" support, we select the TPC-H 1 and TPC-H 6 queries for the benchmarks because their query conditions are simple enough to predicate on. File blocks are filtered using the minimum and maximum values kept in the column statistics of their footers for the columns referenced in the query condition.

  • Every test is run 10 times for more accurate results.
  • 3 different TPC-H datasets are queried:
    – Small TPC-H dataset: 1GB
    – Medium TPC-H dataset: 10GB
    – Big TPC-H dataset: 100GB
  • 3 different sized clusters are formed:
    – Small cluster: 1 master + 2 workers
    – Medium cluster: 1 master + 4 workers
    – Big cluster: 1 master + 8 workers

4.1 TPC-H Query 1

In this benchmark, we used TPC-H Query 1. It has a "less than or equal" (≤) condition on the "L_SHIPDATE" column of the "lineitem" table.

TABLE 3
1 Master + 2 Workers

Storage / Data Size | 1GB         | 10GB         | 100GB
Text                | 53 seconds  | 114 seconds  | 595 seconds
ORC                 | 28 seconds  | 39 seconds   | 88 seconds
DWRF                | 29 seconds  | 41 seconds   | 107 seconds
DWRF++              | 21 seconds  | 28 seconds   | 52 seconds

TABLE 4
1 Master + 4 Workers

Storage / Data Size | 1GB         | 10GB         | 100GB
Text                | 46 seconds  | 86 seconds   | 312 seconds
ORC                 | 25 seconds  | 34 seconds   | 70 seconds
DWRF                | 26 seconds  | 40 seconds   | 84 seconds
DWRF++              | 23 seconds  | 27 seconds   | 49 seconds

TABLE 5
1 Master + 8 Workers

Storage / Data Size | 1GB         | 10GB         | 100GB
Text                | 44 seconds  | 71 seconds   | 190 seconds
ORC                 | 24 seconds  | 33 seconds   | 64 seconds
DWRF                | 25 seconds  | 40 seconds   | 74 seconds
DWRF++              | 21 seconds  | 24 seconds   | 48 seconds

4.2 TPC-H Query 6

In this benchmark, we used TPC-H Query 6. It has various "greater than" (>), "greater than or equal" (≥), "less than" (<) and "less than or equal" (≤) conditions on the "L_SHIPDATE", "L_DISCOUNT" and "L_QUANTITY" columns of the "lineitem" table.

TABLE 6
1 Master + 2 Workers

Storage / Data Size | 1GB         | 10GB         | 100GB
Text                | 42 seconds  | 143 seconds  | 531 seconds
ORC                 | 20 seconds  | 26 seconds   | 46 seconds
DWRF                | 22 seconds  | 32 seconds   | 88 seconds
DWRF++              | 22 seconds  | 28 seconds   | 34 seconds

TABLE 7
1 Master + 4 Workers

Storage / Data Size | 1GB         | 10GB         | 100GB
Text                | 37 seconds  | 73 seconds   | 290 seconds
ORC                 | 15 seconds  | 23 seconds   | 44 seconds
DWRF                | 21 seconds  | 29 seconds   | 68 seconds
DWRF++              | 20 seconds  | 25 seconds   | 31 seconds

TABLE 8
1 Master + 8 Workers

Storage / Data Size | 1GB         | 10GB         | 100GB
Text                | 35 seconds  | 60 seconds   | 175 seconds
ORC                 | 15 seconds  | 21 seconds   | 39 seconds
DWRF                | 21 seconds  | 29 seconds   | 60 seconds
DWRF++              | 19 seconds  | 21 seconds   | 28 seconds

4.3 The Effect of DWRF++

As seen from the benchmark results, the regular ORC format performs better than the raw text format, and the difference becomes more significant as the data size grows. The regular DWRF format also performs much better than the raw text format, but ORC is still faster than DWRF. The benchmark results show that for queries such as TPC-H 1 and TPC-H 6, which have simple conditional predicates, our improved DWRF++ format performs about 30% better than the regular ORC format on large data sets.
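As a quick sanity check of this figure, the snippet below recomputes the runtime reduction of DWRF++ over ORC from the 100GB columns of Tables 3 and 6 (1 master + 2 workers): roughly 41% for Query 1 and 26% for Query 6, which averages out near the 30% quoted above.

```java
// Recomputes the DWRF++ vs ORC runtime reduction from the 100GB results
// in Tables 3 and 6 (1 master + 2 workers).
public class SpeedupCheck {
    public static void main(String[] args) {
        System.out.printf("TPC-H Q1: %.0f%%%n", reduction(88, 52)); // ~41%
        System.out.printf("TPC-H Q6: %.0f%%%n", reduction(46, 34)); // ~26%
    }
    static double reduction(double orcSeconds, double dwrfppSeconds) {
        return 100.0 * (orcSeconds - dwrfppSeconds) / orcSeconds;
    }
}
```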
5 CONCLUSION AND FUTURE WORK

In this study, we propose DWRF++, an improved version of the DWRF storage format, which was developed by Facebook as an open-source variation of the ORC format, in order to improve the performance of MapReduce-based Hive data warehouse queries. To our knowledge, this is the first work that adds a "predicate pushdown" feature to the DWRF storage format and evaluates it on TPC-H HiveQL queries. With the "predicate pushdown" feature, non-promising blocks can be skipped directly while executing queries. Our experiments show that significant performance improvements can be achieved by ignoring non-promising record blocks, which reduces both I/O and computation.

Currently, the "predicate pushdown" technique is applied per file block. As future work, we plan to apply "predicate pushdown" filtering per stripe and per row group, which will make it possible to eliminate non-promising records at a finer granularity. In addition, bloom filters can be used per file block or per stripe, which can be very useful especially for equality-based conditional queries: if the bloom filter indicates that a key does not exist in a file block or stripe, that block or stripe can be silently skipped.

As a final improvement, we plan to combine these columnar storage improvements with the MQO technique of sharing correlated query parts, which was introduced in SharedHive [2]. We believe this combination can be very useful for decision support queries in data mining applications, improving query performance with less CPU power and fewer I/O operations.
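To make the bloom-filter idea above concrete, a minimal sketch is shown below (ours, with made-up filter size and hash functions; DWRF++ does not currently implement this). A per-block bloom filter can only prove absence, never presence, so a negative answer safely lets the reader skip the block.

```java
// Toy per-block bloom filter: a false answer means the key is definitely
// absent and the block can be skipped; a true answer means "must read".
import java.util.BitSet;

public class BloomSkipSketch {
    static class BloomFilter {
        final BitSet bits = new BitSet(1 << 16); // illustrative size

        void add(long key) {
            for (int seed = 1; seed <= 3; seed++) bits.set(hash(key, seed));
        }
        boolean mightContain(long key) {
            for (int seed = 1; seed <= 3; seed++)
                if (!bits.get(hash(key, seed))) return false; // definitely absent
            return true; // possibly present: the block must be read
        }
        private int hash(long key, int seed) {
            long h = key * 0x9E3779B97F4A7C15L + seed;
            return (int) ((h ^ (h >>> 32)) & 0xFFFF);
        }
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter();
        f.add(42L);
        System.out.println(f.mightContain(42L)); // true
        System.out.println(f.mightContain(43L)); // very likely false => skip block
    }
}
```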
REFERENCES

[1] Hadoop project. http://hadoop.apache.org/
[2] Dokeroglu, T., Ozal, S., Bayir, M. A., et al. (2014). Improving the performance of Hadoop Hive by sharing scan and computation tasks. Journal of Cloud Computing, 3(1), 1-11.
[3] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 1-10.
[4] White, T. (2009). Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press.
[5] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. (2009). Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2).
[6] Dean, J., and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[7] Condie, T., et al. (2010). MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems design and implementation.
[8] Stonebraker, M., et al. (2010). MapReduce and parallel DBMSs: friends or foes. Communications of the ACM, 53(1), 64-71.
[9] DeWitt, D., and Stonebraker, M. (2008). MapReduce: A major step backwards. The Database Column, 1.
[10] Lee, K. H., et al. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20.
[11] Abouzeid, A., et al. (2009). HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1), 922-933.
[12] Ordonez, C., Song, I. Y., and Garcia-Alvarado, C. (2010). Relational versus non-relational database systems for data warehousing. In Proceedings of the ACM 13th international workshop on Data warehousing and OLAP (67-68).
[13] Thusoo, A., et al. (2010). Hive: a petabyte scale data warehouse using Hadoop. ICDE (996-1005).
[14] Chaiken, R., et al. (2008). SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2), 1265-1276.
[15] Cohen, J., et al. (2009). MAD skills: new analysis practices for big data. VLDB, 2(2), 1481-1492.
[16] He, Y., et al. (2011). RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE (1199-1208).
[17] Lee, R., et al. (2011). YSmart: Yet another SQL-to-MapReduce translator. ICDCS (25-36).
[18] Beynon, M., et al. (2002). Processing large-scale multi-dimensional data in parallel and distributed environments. Parallel Computing, 28(5), 827-859.
[19] Bajda-Pawlikowski, K., Abadi, D. J., Silberschatz, A., and Paulson, E. (2011). Efficient processing of data warehousing queries in a split execution environment. In Proceedings of the international conference on Management of data (1165-1176).
[20] Friedman, E., Pawlowski, P., and Cieslewicz, J. (2009). SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. VLDB, 2(2), 1402-1413.
[21] Gruenheid, A., Omiecinski, E., and Mark, L. (2011). Query optimization using column statistics in Hive. In Proceedings of the 15th Symposium on International Database Engineering and Applications (97-105).
[22] Running TPC-H queries on Hive. http://issues.apache.org/jira/browse/HIVE-600
[23] http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
[24] Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E. N., O'Malley, O., Pandey, J., Yuan, Y., Lee, R., and Zhang, X. (2014). Major technical advancements in Apache Hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, Utah, USA.
[25] https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb
[26] https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc
[27] https://github.com/facebookarchive/hive-dwrf
[28] Floratou, A., Minhas, U. F., and Özcan, F. (2014). SQL-on-Hadoop: full circle back to shared-nothing database architectures. Proceedings of the VLDB Endowment, 7(12), 1295-1306.
[29] Kumar, R., and Kumar, N. Improved join operations using ORC in Hive. CSI Transactions on ICT, 1-7.
[30] Mehmood, A., Iqbal, M., Khaleeq, M., and Khaliq, Y. (2016). Performance analysis of shared-nothing SQL-on-Hadoop frameworks based on columnar database systems. In Innovative Computing Technology (INTECH), 2016 Sixth International Conference on (128-133). IEEE.
[31] Plase, D., Niedrite, L., and Taranovs, R. (2016). Accelerating data queries on Hadoop framework by using compact data formats. In Advances in Information, Electronic and Electrical Engineering (AIEEE), 2016 IEEE 4th Workshop on (1-7). IEEE.
[32] Liu, Y., Guo, S., Hu, S., Rabl, T., Jacobsen, H. A., Li, J., and Wang, J. (2016). Performance Evaluation and Optimization of Multi-dimensional Indexes in Hive. IEEE Transactions on Services Computing.
[33] He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., and Xu, Z. (2011). RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on (1199-1208). IEEE.
[34] Bayir, M. A., Toroslu, I. H., and Cosar, A. (2007). Genetic algorithm for the multiple-query optimization problem. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(1), 147-153.
[35] Cosar, A., Lim, E. P., and Srivastava, J. (1993). Multiple query optimization with depth-first branch-and-bound and dynamic query ordering. In Proceedings of the second international conference on Information and knowledge management (433-438).
[36] Zhou, J., Larson, P. A., Freytag, J. C., and Lehner, W. (2007). Efficient exploitation of similar subexpressions for query processing. In Proceedings of ACM SIGMOD (533-544).
[37] Silva, Y. N., Larson, P., and Zhou, J. (2012). Exploiting Common Subexpressions for Cloud Query Processing. ICDE (1337-1348).