Cheetah: A High Performance,
      Custom DWH on
     Top of MapReduce


             Tilani Gunawardena
Content
•   Introduction
•   Background
•   Schema Design, Query Language
•   Architecture
•   Data Storage and Compression
•   Query plan and execution
•   Query Optimization
•   Integration
•   Performance Evaluation
•   Conclusions & Future Work
Introduction
• Hadoop - impressive scalability & flexibility to
  handle structured as well as unstructured data
• A shared-nothing MPP architecture built upon
  cheap, commodity hardware
  – MPP relational data warehouses
     • Aster-Data, DATAllegro, Infobright, Greenplum,
       ParAccel, Vertica
  – MapReduce system
• Relational data warehouses
   – Highly optimized for storing and querying relational data.
   – Hard to scale to 100s or 1000s of nodes

• MapReduce
   – Handles failures & scales to 1000s of nodes
   – Lacks a declarative query interface
       • Users have to write code to access the data
       • May result in redundant code
       • Requires a lot of effort & technical skills.

• Recently we can see a convergence of these systems
   – AsterData, Greenplum - extended with some MapReduce capabilities.
   – A number of data analytics tools have been built on top of Hadoop
       • Pig - translates a high-level data flow language into MapReduce jobs
       • HBase - similar to BigTable; provides random read and write access to a huge distributed
         (key, value) store
       • Hive & HadoopDB - support SQL-like query languages.
Background and Motivation
• www.turn.com - clients use the platform to more
  effectively scale, optimize performance & centrally
  manage campaigns
• Data management challenges
  – Data: schema changes, non-relational data, data size
  – Simple yet Powerful Query Language
  – Data Mining Applications: need a simple yet efficient
    data access method.
  – Performance
Cheetah
• A custom data warehouse system, built on top of
  Hadoop.
  – Succinct Query Language
     • Virtual views
      • Clients with no prior SQL knowledge are able to quickly
        grasp it
  – High Performance
  – Seamless integration of MapReduce and Data
    Warehouse
     • Advantage of the power of both MapReduce (massive
       parallelism & scalability) & data warehouse (easy &
       efficient data access) technologies
Similar Works
• HadoopDB - uses PostgreSQL as the underlying
  storage and query processor for Hadoop.
• Cheetah-new data warehouse on top of
  Hadoop
  – Special purpose
  – Most data are not relational and the schema evolves
    fast

• Hive - the most similar work to Cheetah
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Virtual View over Warehouse Schema
   Virtual view on top of the star or snowflake schema


• Virtual view
   – contains attributes from fact
     tables, dimension tables
   – are exposed to the users for
     query.
• At runtime, only the tables
  that the query refers to are
  accessed & joined
• Users no longer have to
  understand the underlying
  schema design
• There may be multiple fact
  tables and consequently
  multiple virtual views
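
A minimal illustration of the expansion described above, with assumed table and column names (impressions_fact, advertiser_dim, advertiser_name); it is not Cheetah's actual schema or plan, only the shape of the rewrite:

// Illustration only: table and column names are assumptions, not Cheetah's actual schema.
public class VirtualViewExample {
  // What the user writes: a flat query against the "Impressions" virtual view.
  static final String USER_QUERY =
      "SELECT advertiser_name, count(*) FROM Impressions GROUP BY advertiser_name";

  // Roughly the plan the runtime evaluates: only the dimension the query touches is joined in.
  static final String EXPANDED_PLAN =
      "SELECT a.name, count(*) "
          + "FROM impressions_fact f JOIN advertiser_dim a ON f.advertiser_id = a.id "
          + "GROUP BY a.name";
}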
Handle Big Dimension Tables
• Filtering, grouping and aggregation operators are easily & efficiently
  supported by Hadoop
• Implementation of the JOIN operator on top of MapReduce is not as
  straightforward
• 2 ways
   – Reduce-phase join: more general & applicable to all scenarios
       • Partitions the input tables in the Map phase & performs the actual join in the Reduce phase
   – Map-phase join:
       • When one of the join tables is small (joining m small tables with one big table)
           – Load the small tables into memory & perform a hash lookup while scanning the
             big table (see the sketch below)
       • When the input tables are already partitioned on the join column
           – Read the same partitions from both tables and perform the join
               » Problem: HDFS has no facility to force two data blocks with the same
                 partition key to be stored on the same node
           – So one of the join tables still has to be transferred over the network (network cost)
• Cheetah's approach: denormalize big dimension tables.
   – Assumes big dimension tables are either insertion-only or slowly
     changing dimensions
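
A minimal sketch (in Hadoop's Java API) of the map-side hash join discussed above; the dimension file name, CSV layout, and column positions are assumptions, and a real job would typically ship the small table to each node via the distributed cache:

// Sketch only: joins a big fact table with one small dimension table in the map phase.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Map<String, String> advertiserById = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the small dimension table into memory once per task.
    try (BufferedReader in = new BufferedReader(new FileReader("advertiser_dim.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] cols = line.split(",");
        if (cols.length >= 2) {
          advertiserById.put(cols[0], cols[1]);   // id -> name
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Scan the big fact table and join via an in-memory hash lookup.
    String[] cols = value.toString().split(",");
    if (cols.length < 3) {
      return;
    }
    String advertiserName = advertiserById.get(cols[2]);   // assumed join-key position
    if (advertiserName != null) {
      context.write(new Text(advertiserName), new LongWritable(1));
    }
  }
}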
• Handle Schema Changes
  – Schema-versioned table - to efficiently support schema
    changes on the fact table


• Handle Nested Tables
  –   A single user may have different types of events

  – Cheetah supports fact tables with a nested relational
    data model.
  – Define a nested relational virtual view
  – Query language is much simpler than a standard
    nested relational query language
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Query Language
• Cheetah supports single block, SQL-like query




   – Fact tables / virtual views - Impressions, Clicks, and Actions.

• Cheetah supports Multi-Block Query

• Query language
   – Fairly simple
       • Users do not have to understand the underlying schema design
       • They do not have to repetitively write any join predicates.
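
Two illustrative Cheetah-style queries, embedded here as Java strings; the column names, date format, and the exact multi-block syntax are assumptions rather than the documented grammar:

// Illustration only: hypothetical query text, not Cheetah's exact grammar.
public class CheetahQueryExamples {
  // Single-block query over the Impressions virtual view; no join predicates are needed.
  static final String SINGLE_BLOCK =
      "SELECT advertiser_name, sum(impression), sum(click) "
          + "FROM Impressions "
          + "DATES ['2010-05-01', '2010-05-07'] "
          + "WHERE site_name = 'news' "
          + "GROUP BY advertiser_name";

  // Multi-block query: an outer block over the result of an inner single-block query.
  static final String MULTI_BLOCK =
      "SELECT count(*) FROM ("
          + "  SELECT user_id, count(*) AS imps FROM Impressions "
          + "  DATES ['2010-05-01', '2010-05-07'] GROUP BY user_id"
          + ") WHERE imps > 10";
}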
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Security and Anonymity
• Supports row level security based on virtual
  views.
• Provides ways to anonymize data
Architecture
• Simple yet efficient
• Open: also provides a simple, non-SQL interface
Query MR Job
• Issue queries through the Web UI, CLI, or Java code
  via JDBC
• The query is sent to the node that runs the Query Driver
• Query Driver: translates the query into a MapReduce job
• Each node in the Hadoop cluster provides a data
  access primitive (DAP) interface

Ad-hoc MR Job
• Can issue a similar API call for fine-grained data
  access
Data Storage & Compression
• Storage Format
• Columnar Compression
Storage Format
• Text (in CSV format)
  – Simplest storage format & commonly used in web access
    logs.
• Serialized Java object
• Row-based binary array
  – Commonly used in row-oriented database systems
• Columnar binary array

• Storage format has a huge impact on both
  compression ratio and query performance
• In Cheetah, data is stored in columnar format
  whenever possible
Data Storage & Compression
• Storage Format
• Columnar Compression
Columnar Compression
• Compression type for each column set is
  dynamically determined based on data in each cell
• During the ETL phase, the best compression method is chosen
• After one cell is created, it is further compressed
  using GZIP.
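
A minimal sketch of the per-column-set compression idea above, assuming a simple dictionary-versus-plain choice (the slides do not give the real method list or thresholds) and omitting the dictionary header a real decoder would need; the GZIP step at the end mirrors the cell-level compression described above:

// Sketch only: per-column encoding choice followed by GZIP over the whole cell.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.GZIPOutputStream;

public class ColumnCellWriter {

  // Choose the compression scheme for one column based on the values in this cell.
  static byte[] encodeColumn(List<String> values) {
    Set<String> distinct = new LinkedHashSet<>(values);
    return distinct.size() <= 256
        ? dictionaryEncode(values, distinct)   // low cardinality: one byte id per value
        : plainEncode(values);                 // otherwise: store the raw values
  }

  // After all columns of a cell are encoded, the whole cell is further compressed with GZIP.
  static byte[] gzipCell(byte[] cell) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
      gzip.write(cell);
    }
    return bytes.toByteArray();
  }

  private static byte[] dictionaryEncode(List<String> values, Set<String> dict) {
    List<String> ids = new ArrayList<>(dict);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String v : values) {
      out.write(ids.indexOf(v));   // sketch only: real code would also store the dictionary
    }
    return out.toByteArray();
  }

  private static byte[] plainEncode(List<String> values) {
    return String.join("\n", values).getBytes(StandardCharsets.UTF_8);
  }
}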
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Input Files
• Input files to the MapReduce job are always
  fact tables.
• Fact tables are partitioned by date
• Fact tables are further partitioned by a
  dimension key attribute, referred to as DID
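
A minimal sketch of feeding date/DID fact-table partitions to a MapReduce job; the path layout /warehouse/impressions/<date>/<did> is a hypothetical convention, not the actual directory structure:

// Sketch only: add only the partitions selected by the query as job input.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputSelection {
  public static void addPartitions(Job job, String[] dates, String did) throws IOException {
    for (String date : dates) {
      // Only the date partitions (and optionally one DID sub-partition) named by the
      // query's DATES clause are added as input; all other partitions are skipped.
      Path p = (did == null)
          ? new Path("/warehouse/impressions/" + date)
          : new Path("/warehouse/impressions/" + date + "/" + did);
      FileInputFormat.addInputPath(job, p);
    }
  }
}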
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Map Phase Plan
• Each node in the cluster stores some portion of the fact table
  data blocks and (small) dimension files
• The query plan contains two operators: scanner and aggregation.
• The scanner operator has an interface which resembles a SELECT
  followed by a PROJECT operator over the virtual view
• The scanner operator translates the request into an equivalent SPJ
  (select-project-join) query to pick up the attributes on the dimension tables.
• As an optimization, dimensions are loaded into in-memory hash
  tables only once if different map tasks share the same JVM.
• A hash-based implementation of the group-by operator is the default
  (see the sketch below)
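
A minimal sketch of the map-phase plan above: a scanner-like select/project step followed by a hash-based partial group-by that is flushed in cleanup(); the column positions and the count aggregate are illustrative assumptions:

// Sketch only: scanner-style projection plus hash-based partial aggregation in the mapper.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapPhasePlan extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Map<String, Long> partialCounts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // "Scanner": select + project over the virtual view (here, pick one group-by column
    // and apply a trivial filter); dimension hash lookups would also happen here.
    String[] cols = value.toString().split(",");
    if (cols.length < 2 || cols[1].isEmpty()) {
      return;
    }
    // Hash-based group-by: aggregate locally instead of emitting one record per row.
    partialCounts.merge(cols[1], 1L, Long::sum);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit one partial aggregate per group at the end of the map task.
    for (Map.Entry<String, Long> e : partialCounts.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}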
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Reduce Phase Plan
• First performs global aggregation over the results from the map phase.
• Then it evaluates any residual expressions over the aggregate values and/or
  the HAVING clause (see the sketch below)
• If the ORDER BY columns are group-by columns
    – They are already sorted by the Hadoop framework during the reduce phase.
• If the ORDER BY columns are aggregation columns
    – Then we sort the results within each reduce task & merge the final results
      after the MapReduce job completes.
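
A minimal sketch of the reduce-phase plan above: global aggregation over the map-side partial results followed by a residual HAVING-style filter; the threshold is arbitrary and ORDER BY handling is omitted:

// Sketch only: global aggregation plus a HAVING-style filter in the reducer.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducePhasePlan extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text group, Iterable<LongWritable> partials, Context context)
      throws IOException, InterruptedException {
    // Global aggregation: combine the partial aggregates produced by each map task.
    long total = 0;
    for (LongWritable p : partials) {
      total += p.get();
    }
    // Residual expression / HAVING clause evaluated on the final aggregate.
    if (total >= 1000) {
      context.write(group, new LongWritable(total));
    }
  }
}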
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
MapReduce Job Configuration
• # of map tasks - based on the # of input files & number of
  blocks per file.
• # of reduce tasks - supplied by the job itself & has a big
  impact on performance.

• Query output
   – Small: map phase dominates the total cost.
   – Large: it is mandatory to have a sufficient number of reducers to
     partition the work.

• Heuristics (see the sketch below)
   – # of reducers is proportional to the number of group-by
     columns in the query.
   – If the group-by columns include a column with very large
     cardinality, increase the # of reducers as well.
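
A minimal sketch of the reducer-count heuristic above; the base count, scaling factor, and "very large cardinality" threshold are assumptions:

// Sketch only: pick a reducer count from the query's group-by characteristics.
public class ReducerHeuristic {
  static int chooseNumReducers(int numGroupByColumns, long maxGroupByCardinality) {
    int reducers = Math.max(1, 4 * numGroupByColumns);   // proportional to # of group-by columns
    if (maxGroupByCardinality > 1_000_000L) {            // a very-large-cardinality group-by column
      reducers *= 4;
    }
    return reducers;
  }
}

The chosen value would then be passed to Job.setNumReduceTasks(...) when the query MapReduce job is configured.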
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
MultiQuery Optimization
• Cheetah allows users to simultaneously
  submit multiple queries & execute them in a
  single batch, as long as these queries have the
  same FROM and DATES clauses
Map Phase
• Shared scanner - shares the scan of the fact tables
  & the joins to the dimension tables
• Scanner will attach a query ID to each output row
• Output from different aggregation operators will
  be merged into a single output stream.
Reduce Phase
• Split the input rows based on their query IDs
• Send them to the corresponding query operators.
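
A minimal sketch of the shared-scanner idea from the two slides above: one pass over a fact row serves every query in the batch, each output row carries its query ID, and the reduce side can split rows back out by that ID; the two hard-coded queries and column positions are illustrative:

// Sketch only: tag every scanner output row with the ID of the query it belongs to.
import java.util.ArrayList;
import java.util.List;

public class SharedScanner {

  // One output row of the shared scanner: (query ID, group key, value).
  static final class TaggedRow {
    final int queryId;
    final String groupKey;
    final long value;
    TaggedRow(int queryId, String groupKey, long value) {
      this.queryId = queryId;
      this.groupKey = groupKey;
      this.value = value;
    }
  }

  // A single scan of one fact row evaluates every query in the batch.
  static List<TaggedRow> scan(String[] factRow) {
    List<TaggedRow> out = new ArrayList<>();
    out.add(new TaggedRow(1, factRow[1], 1L));                            // query 1: count(*) per col 1
    if (!factRow[2].isEmpty()) {                                          // query 2 has an extra filter
      out.add(new TaggedRow(2, factRow[2], Long.parseLong(factRow[3])));  // query 2: sum(col 3) per col 2
    }
    return out;
  }
}

On the reduce side, rows would be routed to each query's own aggregation operator by queryId, matching the split step described on the Reduce Phase slide.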
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
Exploiting Materialized Views(1)
• Definition of Materialized Views
    – Each materialized view only includes the columns in
      the fact table, i.e., excludes those on the dimension
      tables.
   – It is partitioned by date


• Both columns referred to in the query reside on the fact
  table, Impressions
• The resulting virtual view has two types of columns: group-by
  columns & aggregate columns.
Exploiting Materialized Views(2)
• View Matching and Query Rewriting
  – To make use of a materialized view


• The query must refer to the virtual view that corresponds to the
  same fact table the materialized view is defined upon.
• Non-aggregate columns referred to in the SELECT and
  WHERE clauses of the query must be a subset of
  the materialized view’s group-by columns
• Aggregate columns must be computable from the
  materialized view’s aggregate columns.
• Rewrite: replace the virtual view in the query with the
  matching materialized view (see the sketch below)
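
A minimal sketch of the matching test above, using plain string sets of column names; the real matcher works on parsed query and view definitions, and "computable from" is simplified here to set containment:

// Sketch only: can this query be answered from this materialized view?
import java.util.Set;

public class ViewMatcher {
  static boolean matches(String queryFactTable,
                         Set<String> queryNonAggColumns,
                         Set<String> queryAggColumns,
                         String viewFactTable,
                         Set<String> viewGroupByColumns,
                         Set<String> viewAggColumns) {
    return queryFactTable.equals(viewFactTable)                // same fact table / virtual view
        && viewGroupByColumns.containsAll(queryNonAggColumns)  // SELECT/WHERE columns covered
        && viewAggColumns.containsAll(queryAggColumns);        // aggregates computable from the view
  }
}

When matches(...) returns true, the query is rewritten to read the (much smaller) materialized view partitions instead of the raw fact table.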
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
Low-Latency Query Optimization
• The current Hadoop implementation has some non-trivial
  overhead itself
  – e.g., job start time, JVM start time

• Problem: for small queries, this becomes
  significant extra overhead.
  – Solution: in the query translation phase, if the size of the input files is
    small, Cheetah may choose to directly read the files from HDFS
    and process the query locally (see the sketch below)
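
A minimal sketch of the small-query shortcut above: if the selected input falls below some threshold, skip the MapReduce job and process the query locally after reading the files straight from HDFS; the 64 MB threshold and the class layout are assumptions:

// Sketch only: decide whether a query is small enough to run without a MapReduce job.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowLatencyShortcut {
  static final long LOCAL_EXECUTION_THRESHOLD = 64L * 1024 * 1024;   // assumed threshold

  static boolean runLocally(Configuration conf, Path input) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long inputBytes = fs.getContentSummary(input).getLength();
    // Small input: avoid the job/JVM start-up overhead by processing the query in-process.
    return inputBytes < LOCAL_EXECUTION_THRESHOLD;
  }
}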
Integration
• Cheetah provides a JDBC interface - a user program
  can submit a query & iterate through the output
  results (see the JDBC sketch below)
• If the query results are too big for a single program to
  consume, the user can write a MapReduce job to
  analyze the query output files, which are stored on
  HDFS
• Cheetah provides a non-SQL interface that can be
  easily integrated into any ad-hoc MapReduce jobs
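
A minimal sketch of submitting a query through the JDBC interface mentioned above; the driver class name, JDBC URL, and query text are assumptions for illustration:

// Sketch only: standard JDBC calls against a hypothetical Cheetah driver and endpoint.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcClient {
  public static void main(String[] args) throws Exception {
    Class.forName("com.turn.cheetah.jdbc.Driver");   // hypothetical driver class
    try (Connection conn = DriverManager.getConnection("jdbc:cheetah://query-driver-host:port");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT advertiser_name, count(*) FROM Impressions "
                 + "DATES ['2010-05-01', '2010-05-07'] GROUP BY advertiser_name")) {
      while (rs.next()) {
        // Iterate through the query output rows.
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}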
• To write an ad-hoc MapReduce job (see the sketch below):
  – Specify the input files, which can be one or more fact
    table partitions.
  – In the Map program, include a scanner to access the
    individual raw records.
  – After that, the user has complete freedom to decide what
    to do with the data.
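
A minimal sketch of an ad-hoc Map program following the steps above; the Scanner class here is only a toy stand-in for Cheetah's real virtual-view scanner, and the column names are assumptions:

// Sketch only: an ad-hoc mapper that reads raw fact records through a scanner-like helper.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdHocMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Toy stand-in for the scanner provided by Cheetah; it only splits CSV text here.
  static final class Scanner {
    private final java.util.List<String> columns;
    Scanner(String... columns) { this.columns = java.util.Arrays.asList(columns); }
    String[] decode(String rawRecord) { return rawRecord.split(","); }
    int columnIndex(String name) { return columns.indexOf(name); }
  }

  private final Scanner scanner =
      new Scanner("event_time", "advertiser_name", "site_name");

  @Override
  protected void map(LongWritable key, Text rawRecord, Context context)
      throws IOException, InterruptedException {
    // Step 1: the scanner exposes the virtual-view columns of the raw record.
    String[] row = scanner.decode(rawRecord.toString());
    // Step 2: from here the user has complete freedom, e.g., feed a mining model
    // or emit arbitrary key/value pairs.
    context.write(new Text(row[scanner.columnIndex("advertiser_name")]), new LongWritable(1));
  }
}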
• Advantages of having a low-level, non-SQL interface open
  to external applications
  – It provides ad-hoc MapReduce programs efficient and, more
    importantly, local data access
  – The virtual view-based scanner hides most of the schema
    complexity from MapReduce developers
  – The data is well compressed and the access method is fairly
    optimized inside the scanner operator

   Summary:
• Ad-hoc MapReduce programs can now
  automatically take full advantage of both
  MapReduce (massive parallelism and scalability)
  and data warehouse (easy and efficient data
  access) technologies.
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Implementation
Important :
• Query engine must have low CPU overhead.
  – choosing the right data format
  – efficient implementation of various components on
    the data processing path.
     • Efficient hashing method
• All the experiments are performed on a cluster
  with 10 nodes.
• Each node has two quad-core CPUs, 8 GB of memory,
  and 4x1 TB 7200 RPM hard disks
• Cloudera’s Hadoop distribution version 0.20.1.
• The size of the data blocks for fact tables is
  512 MB.
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Storage Format
• We store one fact table partition in four file
  formats:
  – Text (in CSV format)
  – Java object (equivalent to row-based binary array),
  – Column-based binary array,
  – Column-based binary array with compressions


• Each file is further compressed by Hadoop’s
  GZIP library at block level
• We run a simple aggregate query over three different data
  formats, namely, Text, Java Object and Columnar (with
  compression).
• We use iostat to monitor the CPU utilization and IO throughput at
  each node
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Small vs. Big Queries
• Impact of query complexity on query performance.
• We create a test query
    – two joins
    – one predicate
    – 3 group-by columns
    – 7 aggregate functions
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
MultiQuery Optimization
• Randomly pick 40 queries from our query
  workload.
• The number of output rows for these queries
  ranges from a few hundred to 1M.
• We compare the time for executing these queries
  in a single batch to the time for executing them
  separately.
Conclusions
• Cheetah - a data warehouse system built on top
  of MapReduce technology.
• The virtual view abstraction plays a central
  role in designing the Cheetah system.
• Multi-query optimization.
Future Work
• The current IO throughput of 130 MB/s (Figure 12)
  has not reached the maximum possible speed
  of the hard disks.
• Current multi-query optimization only exploits
  shared data scan and shared joins.
  – further explore predicate sharing and aggregation
    sharing
• Reuse previous query results to answer similar
  queries later.
Thank You!


Editor's Notes

  • #5 Users have to write code in order to access the data. MapReduce is just an execution model; the underlying data storage and access methods are completely left to users to implement.