Cheetah: A High Performance,
      Custom DWH on
     Top of MapReduce


             Tilani Gunawardena
Content
•   Introduction
•   Background
•   Schema Design, Query Language
•   Architecture
•   Data Storage and Compression
•   Query plan and execution
•   Query Optimization
•   Integration
•   Performance Evaluation
•   Conclusions & Future Work
Introduction
• Hadoop - impressive scalability & flexibility to
  handle structured as well as unstructured data
• A shared-nothing MPP architecture built upon
  cheap, commodity hardware
  – MPP relational data warehouses
     • Aster-Data, DATAllegro, Infobright, Greenplum,
       ParAccel, Vertica
  – MapReduce system
• Relational data warehouses
   – Highly optimized for storing and querying relational data.
   – Hard to scale to 100s or 1000s of nodes

• MapReduce
   – Handles failures & scales to 1000s of nodes
   – Lacks a declarative query interface
       • Users have to write code to access the data
       • May result in redundant code
       • Requires a lot of effort & technical skills.

• Recently we can see a convergence of these systems
   – AsterData, Greenplum - extended with some MapReduce capabilities.
   – A number of data analytics tools have been built on top of Hadoop
       • Pig - translates a high-level data flow language into MapReduce jobs
       • HBase - similar to BigTable; provides random read and write access to a huge distributed
         (key, value) store
       • Hive & HadoopDB - support SQL-like query languages.
Background and Motivation
• www.turn.com - clients use the platform to more
  effectively scale, optimize performance & centrally
  manage campaigns
• Data management challenges
  – Data: schema changes, non-relational data, data size
  – Simple yet Powerful Query Language
  – Data Mining Applications: need a simple yet efficient
    data access method.
  – Performance
Cheetah
• A custom data warehouse system, built on top of
  Hadoop.
  – Succinct Query Language
     • Virtual views
      • Clients with no prior SQL knowledge are able to quickly
        grasp it
  – High Performance
  – Seamless integration of MapReduce and Data
    Warehouse
     • Advantage of the power of both MapReduce (massive
       parallelism & scalability) & data warehouse (easy &
       efficient data access) technologies
Similar Works
• HadoopDB - uses PostgreSQL as the underlying
  storage and query processor for Hadoop.
• Cheetah-new data warehouse on top of
  Hadoop
  – Special purpose
  – Most data are not relational and the schema evolves
    fast

• Hive - the most similar work to Cheetah
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Virtual View over Warehouse Schema
   Virtual view on top of the star or snowflake schema


• Virtual view
   – contains attributes from fact
     tables, dimension tables
   – are exposed to the users for
     query.
• At runtime, only the tables
  that the query refers to are
  accessed & joined
• Users no longer have to
  understand the underlying
  schema design
• There may be multiple fact
  tables and consequently
  multiple virtual views
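
A minimal illustration of the expansion described above, with assumed table and column names (impressions_fact, advertiser_dim, advertiser_name); it is not Cheetah's actual schema or plan, only the shape of the rewrite:

// Illustration only: table and column names are assumptions, not Cheetah's actual schema.
public class VirtualViewExample {
  // What the user writes: a flat query against the "Impressions" virtual view.
  static final String USER_QUERY =
      "SELECT advertiser_name, count(*) FROM Impressions GROUP BY advertiser_name";

  // Roughly the plan the runtime evaluates: only the dimension the query touches is joined in.
  static final String EXPANDED_PLAN =
      "SELECT a.name, count(*) "
          + "FROM impressions_fact f JOIN advertiser_dim a ON f.advertiser_id = a.id "
          + "GROUP BY a.name";
}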
Handle Big Dimension Tables
• Filtering, grouping and aggregation operators are easily & efficiently
  supported by Hadoop
• Implementation of the JOIN operator on top of MapReduce is not as
  straightforward
• 2 ways
   – Reduce-phase join: more general & applicable to all scenarios
       • Partitions the input tables in the Map phase & performs the actual join in the Reduce phase
   – Map-phase join:
       • When one of the join tables is small (joining m small tables with one big table)
           – Load the small tables into memory & perform a hash lookup while scanning the
             big table (see the sketch below)
       • When the input tables are already partitioned on the join column
           – Read the same partitions from both tables and perform the join
               » Problem: HDFS has no facility to force two data blocks with the same
                 partition key to be stored on the same node
           – So one of the join tables still has to be transferred over the network (network cost)
• Cheetah's approach: denormalize big dimension tables.
   – Assumes big dimension tables are either insertion-only or slowly
     changing dimensions
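
A minimal sketch (in Hadoop's Java API) of the map-side hash join discussed above; the dimension file name, CSV layout, and column positions are assumptions, and a real job would typically ship the small table to each node via the distributed cache:

// Sketch only: joins a big fact table with one small dimension table in the map phase.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Map<String, String> advertiserById = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the small dimension table into memory once per task.
    try (BufferedReader in = new BufferedReader(new FileReader("advertiser_dim.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] cols = line.split(",");
        if (cols.length >= 2) {
          advertiserById.put(cols[0], cols[1]);   // id -> name
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Scan the big fact table and join via an in-memory hash lookup.
    String[] cols = value.toString().split(",");
    if (cols.length < 3) {
      return;
    }
    String advertiserName = advertiserById.get(cols[2]);   // assumed join-key position
    if (advertiserName != null) {
      context.write(new Text(advertiserName), new LongWritable(1));
    }
  }
}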
• Handle Schema Changes
  – Schema-versioned table - to efficiently support schema
    changes on the fact table


• Handle Nested Tables
  –   A single user may have different types of events

  – Cheetah supports fact tables with a nested relational
    data model.
  – Define a nested relational virtual view
  – Query language is much simpler than a standard
    nested relational query language
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Query Language
• Cheetah supports single block, SQL-like query




   – Fact tables / virtual views - Impressions, Clicks, and Actions.

• Cheetah supports Multi-Block Query

• Query language
   – Fairly simple
       • Users do not have to understand the underlying schema design
       • They do not have to repetitively write any join predicates.
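
Two illustrative Cheetah-style queries, embedded here as Java strings; the column names, date format, and the exact multi-block syntax are assumptions rather than the documented grammar:

// Illustration only: hypothetical query text, not Cheetah's exact grammar.
public class CheetahQueryExamples {
  // Single-block query over the Impressions virtual view; no join predicates are needed.
  static final String SINGLE_BLOCK =
      "SELECT advertiser_name, sum(impression), sum(click) "
          + "FROM Impressions "
          + "DATES ['2010-05-01', '2010-05-07'] "
          + "WHERE site_name = 'news' "
          + "GROUP BY advertiser_name";

  // Multi-block query: an outer block over the result of an inner single-block query.
  static final String MULTI_BLOCK =
      "SELECT count(*) FROM ("
          + "  SELECT user_id, count(*) AS imps FROM Impressions "
          + "  DATES ['2010-05-01', '2010-05-07'] GROUP BY user_id"
          + ") WHERE imps > 10";
}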
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Security and Anonymity
• Supports row level security based on virtual
  views.
• Provides ways to anonymize data
Architecture
• Simple yet efficient
• Open: also provides a simple, non-SQL interface
Query MR Job
• Issue queries through the Web UI, CLI, or Java code
  via JDBC
• The query is sent to the node that runs the Query Driver
• Query Driver: translates the query into a MapReduce job
• Each node in the Hadoop cluster provides a data
  access primitive (DAP) interface

Ad-hoc MR Job
• Can issue a similar API call for fine-grained data
  access
Data Storage & Compression
• Storage Format
• Columnar Compression
Storage Format
• Text (in CSV format)
  – Simplest storage format & commonly used in web access
    logs.
• Serialized Java object
• Row-based binary array
  – Commonly used in row-oriented database systems
• Columnar binary array

• Storage format has a huge impact on both
  compression ratio and query performance
• In Cheetah, data is stored in columnar format
  whenever possible
Data Storage & Compression
• Storage Format
• Columnar Compression
Columnar Compression
• Compression type for each column set is
  dynamically determined based on data in each cell
• During the ETL phase, the best compression method is chosen
• After one cell is created, it is further compressed
  using GZIP.
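
A minimal sketch of the per-column-set compression idea above, assuming a simple dictionary-versus-plain choice (the slides do not give the real method list or thresholds) and omitting the dictionary header a real decoder would need; the GZIP step at the end mirrors the cell-level compression described above:

// Sketch only: per-column encoding choice followed by GZIP over the whole cell.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.GZIPOutputStream;

public class ColumnCellWriter {

  // Choose the compression scheme for one column based on the values in this cell.
  static byte[] encodeColumn(List<String> values) {
    Set<String> distinct = new LinkedHashSet<>(values);
    return distinct.size() <= 256
        ? dictionaryEncode(values, distinct)   // low cardinality: one byte id per value
        : plainEncode(values);                 // otherwise: store the raw values
  }

  // After all columns of a cell are encoded, the whole cell is further compressed with GZIP.
  static byte[] gzipCell(byte[] cell) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
      gzip.write(cell);
    }
    return bytes.toByteArray();
  }

  private static byte[] dictionaryEncode(List<String> values, Set<String> dict) {
    List<String> ids = new ArrayList<>(dict);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String v : values) {
      out.write(ids.indexOf(v));   // sketch only: real code would also store the dictionary
    }
    return out.toByteArray();
  }

  private static byte[] plainEncode(List<String> values) {
    return String.join("\n", values).getBytes(StandardCharsets.UTF_8);
  }
}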
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Input Files
• Input files to the MapReduce job are always
  fact tables.
• Fact tables are partitioned by date
• Fact tables are further partitioned by a
  dimension key attribute, referred to as DID
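
A minimal sketch of feeding date/DID fact-table partitions to a MapReduce job; the path layout /warehouse/impressions/<date>/<did> is a hypothetical convention, not the actual directory structure:

// Sketch only: add only the partitions selected by the query as job input.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputSelection {
  public static void addPartitions(Job job, String[] dates, String did) throws IOException {
    for (String date : dates) {
      // Only the date partitions (and optionally one DID sub-partition) named by the
      // query's DATES clause are added as input; all other partitions are skipped.
      Path p = (did == null)
          ? new Path("/warehouse/impressions/" + date)
          : new Path("/warehouse/impressions/" + date + "/" + did);
      FileInputFormat.addInputPath(job, p);
    }
  }
}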
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Map Phase Plan
• Each node in the cluster stores some portion of the fact table
  data blocks and (small) dimension files
• The query plan contains two operators: scanner and aggregation.
• The scanner operator has an interface which resembles a SELECT
  followed by a PROJECT operator over the virtual view
• The scanner operator translates the request into an equivalent SPJ
  (select-project-join) query to pick up the attributes on the dimension tables.
• As an optimization, dimensions are loaded into in-memory hash
  tables only once if different map tasks share the same JVM.
• A hash-based implementation of the group-by operator is the default
  (see the sketch below)
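
A minimal sketch of the map-phase plan above: a scanner-like select/project step followed by a hash-based partial group-by that is flushed in cleanup(); the column positions and the count aggregate are illustrative assumptions:

// Sketch only: scanner-style projection plus hash-based partial aggregation in the mapper.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapPhasePlan extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Map<String, Long> partialCounts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // "Scanner": select + project over the virtual view (here, pick one group-by column
    // and apply a trivial filter); dimension hash lookups would also happen here.
    String[] cols = value.toString().split(",");
    if (cols.length < 2 || cols[1].isEmpty()) {
      return;
    }
    // Hash-based group-by: aggregate locally instead of emitting one record per row.
    partialCounts.merge(cols[1], 1L, Long::sum);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit one partial aggregate per group at the end of the map task.
    for (Map.Entry<String, Long> e : partialCounts.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}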
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Reduce Phase Plan
• First performs global aggregation over the results from the map phase.
• Then it evaluates any residual expressions over the aggregate values and/or
  the HAVING clause (see the sketch below)
• If the ORDER BY columns are group-by columns
    – They are already sorted by the Hadoop framework during the reduce phase.
• If the ORDER BY columns are aggregation columns
    – Then we sort the results within each reduce task & merge the final results
      after the MapReduce job completes.
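
A minimal sketch of the reduce-phase plan above: global aggregation over the map-side partial results followed by a residual HAVING-style filter; the threshold is arbitrary and ORDER BY handling is omitted:

// Sketch only: global aggregation plus a HAVING-style filter in the reducer.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducePhasePlan extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text group, Iterable<LongWritable> partials, Context context)
      throws IOException, InterruptedException {
    // Global aggregation: combine the partial aggregates produced by each map task.
    long total = 0;
    for (LongWritable p : partials) {
      total += p.get();
    }
    // Residual expression / HAVING clause evaluated on the final aggregate.
    if (total >= 1000) {
      context.write(group, new LongWritable(total));
    }
  }
}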
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
MapReduce Job Configuration
• # of map tasks - based on the # of input files & number of
  blocks per file.
• # of reduce tasks - supplied by the job itself & has a big
  impact on performance.

• Query output
   – Small: map phase dominates the total cost.
   – Large: it is mandatory to have a sufficient number of reducers to
     partition the work.

• Heuristics (see the sketch below)
   – # of reducers is proportional to the number of group-by
     columns in the query.
   – If the group-by columns include a column with very large
     cardinality, increase the # of reducers as well.
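
A minimal sketch of the reducer-count heuristic above; the base count, scaling factor, and "very large cardinality" threshold are assumptions:

// Sketch only: pick a reducer count from the query's group-by characteristics.
public class ReducerHeuristic {
  static int chooseNumReducers(int numGroupByColumns, long maxGroupByCardinality) {
    int reducers = Math.max(1, 4 * numGroupByColumns);   // proportional to # of group-by columns
    if (maxGroupByCardinality > 1_000_000L) {            // a very-large-cardinality group-by column
      reducers *= 4;
    }
    return reducers;
  }
}

The chosen value would then be passed to Job.setNumReduceTasks(...) when the query MapReduce job is configured.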
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
MultiQuery Optimization
• Cheetah allows users to simultaneously
  submit multiple queries & execute them in a
  single batch, as long as these queries have the
  same FROM and DATES clauses
Map Phase
• Shared scanner - shares the scan of the fact tables
  & the joins to the dimension tables
• Scanner will attach a query ID to each output row
• Output from different aggregation operators will
  be merged into a single output stream.
Reduce Phase
• Split the input rows based on their query IDs
• Send them to the corresponding query operators.
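
A minimal sketch of the shared-scanner idea from the two slides above: one pass over a fact row serves every query in the batch, each output row carries its query ID, and the reduce side can split rows back out by that ID; the two hard-coded queries and column positions are illustrative:

// Sketch only: tag every scanner output row with the ID of the query it belongs to.
import java.util.ArrayList;
import java.util.List;

public class SharedScanner {

  // One output row of the shared scanner: (query ID, group key, value).
  static final class TaggedRow {
    final int queryId;
    final String groupKey;
    final long value;
    TaggedRow(int queryId, String groupKey, long value) {
      this.queryId = queryId;
      this.groupKey = groupKey;
      this.value = value;
    }
  }

  // A single scan of one fact row evaluates every query in the batch.
  static List<TaggedRow> scan(String[] factRow) {
    List<TaggedRow> out = new ArrayList<>();
    out.add(new TaggedRow(1, factRow[1], 1L));                            // query 1: count(*) per col 1
    if (!factRow[2].isEmpty()) {                                          // query 2 has an extra filter
      out.add(new TaggedRow(2, factRow[2], Long.parseLong(factRow[3])));  // query 2: sum(col 3) per col 2
    }
    return out;
  }
}

On the reduce side, rows would be routed to each query's own aggregation operator by queryId, matching the split step described on the Reduce Phase slide.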
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
Exploiting Materialized Views(1)
• Definition of Materialized Views
    – Each materialized view only includes the columns in
      the fact table, i.e., excludes those on the dimension
      tables.
   – It is partitioned by date


• Both columns referred to in the query reside on the fact
  table, Impressions
• The resulting virtual view has two types of columns: group-by
  columns & aggregate columns.
Exploiting Materialized Views(2)
• View Matching and Query Rewriting
  – To make use of a materialized view


• The query must refer to the virtual view that corresponds to the
  same fact table the materialized view is defined upon.
• Non-aggregate columns referred to in the SELECT and
  WHERE clauses of the query must be a subset of
  the materialized view’s group-by columns
• Aggregate columns must be computable from the
  materialized view’s aggregate columns.
• Rewrite: replace the virtual view in the query with the
  matching materialized view (see the sketch below)
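
A minimal sketch of the matching test above, using plain string sets of column names; the real matcher works on parsed query and view definitions, and "computable from" is simplified here to set containment:

// Sketch only: can this query be answered from this materialized view?
import java.util.Set;

public class ViewMatcher {
  static boolean matches(String queryFactTable,
                         Set<String> queryNonAggColumns,
                         Set<String> queryAggColumns,
                         String viewFactTable,
                         Set<String> viewGroupByColumns,
                         Set<String> viewAggColumns) {
    return queryFactTable.equals(viewFactTable)                // same fact table / virtual view
        && viewGroupByColumns.containsAll(queryNonAggColumns)  // SELECT/WHERE columns covered
        && viewAggColumns.containsAll(queryAggColumns);        // aggregates computable from the view
  }
}

When matches(...) returns true, the query is rewritten to read the (much smaller) materialized view partitions instead of the raw fact table.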
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low-Latency Query Optimization
Low-Latency Query Optimization
• The current Hadoop implementation has some non-trivial
  overhead itself
  – e.g., job start time, JVM start time

• Problem: for small queries, this becomes
  significant extra overhead.
  – Solution: in the query translation phase, if the size of the input files is
    small, Cheetah may choose to directly read the files from HDFS
    and process the query locally (see the sketch below)
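
A minimal sketch of the small-query shortcut above: if the selected input falls below some threshold, skip the MapReduce job and process the query locally after reading the files straight from HDFS; the 64 MB threshold and the class layout are assumptions:

// Sketch only: decide whether a query is small enough to run without a MapReduce job.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowLatencyShortcut {
  static final long LOCAL_EXECUTION_THRESHOLD = 64L * 1024 * 1024;   // assumed threshold

  static boolean runLocally(Configuration conf, Path input) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long inputBytes = fs.getContentSummary(input).getLength();
    // Small input: avoid the job/JVM start-up overhead by processing the query in-process.
    return inputBytes < LOCAL_EXECUTION_THRESHOLD;
  }
}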
Integration
• Cheetah provides a JDBC interface - a user program
  can submit a query & iterate through the output
  results (see the JDBC sketch below)
• If the query results are too big for a single program to
  consume, the user can write a MapReduce job to
  analyze the query output files, which are stored on
  HDFS
• Cheetah provides a non-SQL interface that can be
  easily integrated into any ad-hoc MapReduce jobs
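
A minimal sketch of submitting a query through the JDBC interface mentioned above; the driver class name, JDBC URL, and query text are assumptions for illustration:

// Sketch only: standard JDBC calls against a hypothetical Cheetah driver and endpoint.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcClient {
  public static void main(String[] args) throws Exception {
    Class.forName("com.turn.cheetah.jdbc.Driver");   // hypothetical driver class
    try (Connection conn = DriverManager.getConnection("jdbc:cheetah://query-driver-host:port");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT advertiser_name, count(*) FROM Impressions "
                 + "DATES ['2010-05-01', '2010-05-07'] GROUP BY advertiser_name")) {
      while (rs.next()) {
        // Iterate through the query output rows.
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}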
• To write an ad-hoc MapReduce job (see the sketch below):
  – Specify the input files, which can be one or more fact
    table partitions.
  – In the Map program, include a scanner to access the
    individual raw records.
  – After that, the user has complete freedom to decide what
    to do with the data.
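
A minimal sketch of an ad-hoc Map program following the steps above; the Scanner class here is only a toy stand-in for Cheetah's real virtual-view scanner, and the column names are assumptions:

// Sketch only: an ad-hoc mapper that reads raw fact records through a scanner-like helper.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdHocMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Toy stand-in for the scanner provided by Cheetah; it only splits CSV text here.
  static final class Scanner {
    private final java.util.List<String> columns;
    Scanner(String... columns) { this.columns = java.util.Arrays.asList(columns); }
    String[] decode(String rawRecord) { return rawRecord.split(","); }
    int columnIndex(String name) { return columns.indexOf(name); }
  }

  private final Scanner scanner =
      new Scanner("event_time", "advertiser_name", "site_name");

  @Override
  protected void map(LongWritable key, Text rawRecord, Context context)
      throws IOException, InterruptedException {
    // Step 1: the scanner exposes the virtual-view columns of the raw record.
    String[] row = scanner.decode(rawRecord.toString());
    // Step 2: from here the user has complete freedom, e.g., feed a mining model
    // or emit arbitrary key/value pairs.
    context.write(new Text(row[scanner.columnIndex("advertiser_name")]), new LongWritable(1));
  }
}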
• Advantages of having a low-level, non-SQL interface open
  to external applications
  – It provides ad-hoc MapReduce programs efficient and, more
    importantly, local data access
  – The virtual view-based scanner hides most of the schema
    complexity from MapReduce developers
  – The data is well compressed and the access method is fairly
    optimized inside the scanner operator

   Summary:
• Ad-hoc MapReduce programs can now
  automatically take full advantage of both
  MapReduce (massive parallelism and scalability)
  and data warehouse (easy and efficient data
  access) technologies.
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Implementation
Important :
• Query engine must have low CPU overhead.
  – choosing the right data format
  – efficient implementation of various components on
    the data processing path.
     • Efficient hashing method
• All the experiments are performed on a cluster
  with 10 nodes.
• Each node has two quad-core CPUs, 8 GB of memory,
  and 4x1 TB 7200 RPM hard disks
• Cloudera’s Hadoop distribution version 0.20.1.
• The size of the data blocks for fact tables is
  512 MB.
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Storage Format
• We store one fact table partition in four file
  formats:
  – Text (in CSV format)
  – Java object (equivalent to row-based binary array),
  – Column-based binary array,
  – Column-based binary array with compressions


• Each file is further compressed by Hadoop’s
  GZIP library at block level
• We run a simple aggregate query over three different data
  formats, namely, Text, Java Object and Columnar (with
  compression).
• We use iostat to monitor the CPU utilization and IO throughput at
  each node
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Small vs. Big Queries
• Impact of query complexity on query performance.
• We create a test query
    – two joins
    – one predicate
    – 3 group-by columns
    – 7 aggregate functions
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
MultiQuery Optimization
• Randomly pick 40 queries from our query
  workload.
• The number of output rows for these queries
  ranges from a few hundred to 1M.
• We compare the time for executing these queries
  in a single batch to the time for executing them
  separately.
Conclusions
• Cheetah - a data warehouse system built on top
  of MapReduce technology.
• The virtual view abstraction plays a central
  role in designing the Cheetah system.
• Multi-query optimization.
Future Work
• The current IO throughput of 130 MB/s (Figure 12)
  has not reached the maximum possible speed
  of the hard disks.
• Current multi-query optimization only exploits
  shared data scan and shared joins.
  – further explore predicate sharing and aggregation
    sharing
• Reuse previous query results to answer similar
  queries later.
Thank You!


Editor's Notes

  • #5 Users have to write code in order to access the data. MapReduce is just an execution model; the underlying data storage and access methods are completely left to users to implement.