12: MapReduce and DBMS Hybrids
Zubair Nabi
zubair.nabi@itu.edu.pk
May 26, 2013
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
Hive
Introduction
Data warehousing solution built atop Hadoop by Facebook
Now an Apache open source project
Queries are expressed in HiveQL, a SQL-like language, and are compiled
into map-reduce jobs
Also contains a type system for describing RDBMS-like tables
A system catalog, the Hive-Metastore, contains schemas and statistics
and is used for data exploration and query optimization
Stores 2PB of uncompressed data at Facebook and is heavily used for
simple summarization, business intelligence, and machine learning,
among many other applications [1]
Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.
[1] https://www.facebook.com/note.php?note_id=89508453919
Data Model
Tables:
Similar to RDBMS tables
Each table has a corresponding HDFS directory
The contents of the table are serialized and stored in files within that
directory
Serialization can be either system-provided or user-defined
Serialization information of each table is also stored in the
Hive-Metastore for query optimization
Tables can also be defined for data stored in external sources such as
HDFS, NFS, and local FS
Data Model (2)
Partitions:
Determine the distribution of data within sub-directories of the main
table directory
For instance, for a table T stored in /wh/T and partitioned on columns
ds and ctry, data with ds value 20090101 and ctry value US will be
stored in files within /wh/T/ds=20090101/ctry=US
Buckets:
Data within partitions is divided into buckets
Buckets are calculated based on the hash of a column within the
partition
Each bucket is stored within a file in the partition directory
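The partition and bucket layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names partition_dir and bucket_index are hypothetical, and zlib.crc32 is a stand-in for whatever hash function Hive actually applies to the bucketing column.

```python
import zlib


def partition_dir(table_dir, partition_values):
    """Build the partition sub-directory from ordered (column, value) pairs."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_index(value, num_buckets):
    """Assumed bucketing rule: stable hash of the column value, modulo the bucket count."""
    # crc32 is used only because it is deterministic across runs;
    # it is not claimed to be the hash that Hive uses.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Reproduces the path from the slide for ds=20090101, ctry=US
print(partition_dir("/wh/T", [("ds", "20090101"), ("ctry", "US")]))
```

Rows with the same bucketing-column value always map to the same bucket index, which is what lets each bucket live in its own file inside the partition directory.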
Column Data Types
Primitive types: integers, floats, strings, dates, and booleans
Nestable collection types: arrays and maps
Custom types: user-defined
HiveQL
Supports select, project, join, aggregate, union all, and sub-queries
Tables are created using data definition statements with specific
serialization formats, partitioning, and bucketing
Data is loaded from external sources and inserted into tables
Support for multi-table insert: multiple queries on the same input data
using a single HiveQL statement
User-defined column transformation and aggregation functions in Java
Custom map-reduce scripts written in any language can be embedded
Example: Facebook Status
Status updates are stored in flat files in an NFS directory
/logs/status_updates
This data is loaded on a daily basis to a Hive table:
status_updates(userid int, status string, ds string)
Using:
LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2013-05-26')
Detailed profile information, such as gender and academic institution, is
present in the table: profiles(userid int, school string, gender int)
Example: Facebook Status (2)
Query to work out the frequency of status updates based on gender and
academic institution
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND
          a.ds='2013-05-26')
     ) subq1
INSERT OVERWRITE TABLE gender_summary
    PARTITION(ds='2013-05-26')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
    PARTITION(ds='2013-05-26')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school
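To make the intent of the multi-table insert concrete, here is a rough Python analogue of the query above. It is a sketch rather than a description of how Hive executes the statement: the function name summarize is hypothetical, rows are modelled as dictionaries, and userid is assumed to be unique in profiles. The point it illustrates is that the joined input is scanned once and feeds both aggregates.

```python
from collections import Counter


def summarize(status_updates, profiles, ds):
    """Single pass over the join result, producing both summaries at once."""
    # Assumption: at most one profile per userid
    profile_by_user = {p["userid"]: p for p in profiles}
    gender_summary, school_summary = Counter(), Counter()
    for update in status_updates:
        if update["ds"] != ds:
            continue  # the a.ds = '...' condition in the ON clause
        profile = profile_by_user.get(update["userid"])
        if profile is None:
            continue  # inner join: updates without a profile are dropped
        gender_summary[profile["gender"]] += 1   # COUNT(1) GROUP BY gender
        school_summary[profile["school"]] += 1   # COUNT(1) GROUP BY school
    return dict(gender_summary), dict(school_summary)
```

Writing the two aggregations as separate queries would read and join the same input twice; sharing the scan is the saving that the single HiveQL statement is after.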
Metastore
Similar to the metastore maintained by traditional warehousing
solutions such as Oracle and IBM DB2 (this distinguishes Hive from Pig
and Cascading, which have no such store)
Stored in either a traditional DB such as MySQL or an FS such as NFS
Contains the following objects:
Database: namespace for tables
Table: metadata for a table including columns and their types, owner,
storage, and serialization information
Partition: metadata for a partition; similar to the information for a table
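One way to picture the three metastore objects is as nested records. The Python dataclasses below are only a mental model: every field name is an assumption made for illustration and does not reflect the actual metastore schema.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    values: dict    # partition column -> value, e.g. {"ds": "2013-05-26"}
    location: str   # directory holding the partition's files
    serde: str      # serialization information (hypothetical label)


@dataclass
class Table:
    name: str
    owner: str
    columns: dict   # column name -> type name
    location: str   # storage information
    serde: str      # serialization information (hypothetical label)
    partitions: list = field(default_factory=list)


@dataclass
class Database:
    name: str       # namespace for tables
    tables: dict = field(default_factory=dict)
```

A query planner holding such records can resolve a table name to its columns, storage location, and partitions without touching the data files themselves.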
HadoopDB
Introduction
Two options for data analytics on shared-nothing clusters:
1 Parallel databases, such as Teradata and Oracle, but they:
Assume that failures are rare events
Assume that hardware is homogeneous
Were never tested in deployments with more than a few dozen nodes
2 MapReduce, but it:
Has all the shortcomings pointed out by DeWitt and Stonebraker, as
discussed before
Is at times an order of magnitude slower than parallel DBs
Hybrid
Combine the scalability and non-existent monetary cost of MapReduce
with the performance of parallel DBs
HadoopDB is such a hybrid
Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and
interface-level hybrids, HadoopDB is a systems-level hybrid
Uses MapReduce as the communication layer atop a cluster of nodes
running single-node DBMS instances
PostgreSQL as the database layer, Hadoop as the communication
layer, and Hive as the translation layer
Commercialized through the startup Hadapt [2]
[2] http://hadapt.com/
HadoopDB
Consists of four components:
1 Database Connector: Interface between per-node database systems
and Hadoop TaskTrackers
2 Catalog: Meta-information about per-node databases
3 Data Loader: Data partitioning across single-node databases
4 SQL to MapReduce to SQL (SMS) Planner: Translation between
SQL and MapReduce
HadoopDB Architecture
[Figure: HadoopDB architecture diagram]
Database Connector
Uses a Java Database Connectivity (JDBC)-compliant Hadoop
InputFormat
The MapReduce job supplies the connector with the SQL query and other
information
The connector connects to the DB, executes the SQL query, and
returns the results in the form of key/value pairs
Hadoop in essence sees the DB as just another data source
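The connector's role can be approximated as follows. This is a sketch under loud assumptions: Python's built-in sqlite3 module stands in for the JDBC connection to a node-local PostgreSQL instance, and read_records is a hypothetical name, not part of HadoopDB or Hadoop.

```python
import sqlite3


def read_records(conn, sql, key_column=0):
    """Run the query on the node-local database and yield (key, value) pairs."""
    cursor = conn.execute(sql)
    for row in cursor:
        key = row[key_column]
        # Remaining columns form the value, as a tuple
        value = row[:key_column] + row[key_column + 1:]
        yield key, value


# Stand-in for one node-local database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (userid INTEGER, status TEXT)")
conn.executemany("INSERT INTO status_updates VALUES (?, ?)",
                 [(1, "hello"), (2, "world")])

for key, value in read_records(conn, "SELECT userid, status FROM status_updates"):
    print(key, value)
```

From the point of view of the map task consuming these pairs, nothing distinguishes them from records read out of a file, which is the sense in which the database is just another data source.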
Catalog
Contains information, such as:
1 Connection parameters, such as DB location, format, and any
credentials
2 Metadata about the datasets, replica locations, and partitioning scheme
Stored as an XML file in HDFS
Data Loader
Consists of two key components:
1 Global Hasher: Executes a custom Hadoop job to repartition raw data
files from HDFS into n parts, where n is the number of nodes in the
cluster
2 Local Hasher: Copies a partition from HDFS to the node-local DB
of each node and further partitions it into smaller chunks
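The two hashing stages can be sketched as two small functions. All names here are hypothetical and zlib.crc32 is only a stand-in hash; the real Global Hasher runs as a Hadoop job and the real Local Hasher loads chunks into the node-local database, neither of which is modelled.

```python
import zlib


def stable_hash(text):
    """Deterministic stand-in hash (not the one HadoopDB uses)."""
    return zlib.crc32(text.encode("utf-8"))


def global_hash(records, key_fn, num_nodes):
    """Global Hasher step: split the raw records into one partition per node."""
    partitions = [[] for _ in range(num_nodes)]
    for record in records:
        node = stable_hash(str(key_fn(record))) % num_nodes
        partitions[node].append(record)
    return partitions


def local_hash(partition, key_fn, num_chunks):
    """Local Hasher step: split one node's partition into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for record in partition:
        # A salted key keeps chunk assignment independent of node assignment
        chunk = stable_hash("chunk:" + str(key_fn(record))) % num_chunks
        chunks[chunk].append(record)
    return chunks
```

Because both steps hash on the same key, all records sharing a key value end up on the same node and in the same chunk.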
SQL to MapReduce to SQL (SMS) Planner
Extends Hive in two key ways:
1 Before query execution, the Hive Metastore is updated with references
to HadoopDB tables, table schemas, formats, and serialization
information
2 All operators whose partitioning keys match the partitioning of the
node-local databases are converted into SQL queries and pushed to the
database layer
nCluster
Introduction
The declarative nature of SQL is too limiting for describing most big
data computations
The underlying subsystems are also suboptimal as they do not
consider domain-specific optimizations
nCluster makes use of SQL/MR, a framework that inserts user-defined
functions written in any programming language into SQL queries
By itself, nCluster is a shared-nothing parallel database geared
towards analytic workloads
Originally designed by Aster Data Systems, which was later acquired by
Teradata
Used by Barnes and Noble, LinkedIn, SAS, etc.
SQL/MR Functions
Dynamically polymorphic: input and output schemas are decided at
runtime
Parallelizable across cores and machines
Composable because their input and output behaviour is identical to
that of SQL subqueries
Amenable to static and dynamic optimizations, just like a SQL subquery
or a relation
Can be implemented in a number of languages including Java, C#,
C++, Python, etc. and can thus make use of third-party libraries
Executed within separate processes to provide sandboxing and resource
allocation
Syntax
SELECT ...
FROM functionname(
    ON table-or-query
    [PARTITION BY expr, ...]
    [ORDER BY expr, ...]
    [clausename(arg, ...) ...]
)
...
The SQL/MR function appears in the FROM clause
ON is the only required clause; it specifies the input to the function
PARTITION BY partitions the input to the function on one or more
attributes from the schema
Syntax (2)
ORDER BY sorts the input to the function and can only be used after a
PARTITION BY clause
Any number of custom clauses can also be defined; their names and
arguments are passed to the function as a key/value map
SQL/MR functions are implemented as relations and are therefore easily
nestable
Execution Model
Functions are equivalent to either map functions (row functions) or
reduce functions (partition functions)
As in MapReduce, these functions are executed across many
nodes and machines
Contracts identical to MapReduce functions
Each row of the input table is operated on by exactly one row function
Each group of rows defined by the PARTITION BY clause is operated on by
exactly one partition function, in the order specified by the ORDER BY
clause
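The two contracts can be mimicked on a single machine with a short Python sketch. It says nothing about how nCluster schedules work across nodes, and the helper names run_row_function and run_partition_function are hypothetical; the sketch only shows which rows each kind of function gets to see.

```python
from itertools import groupby


def run_row_function(rows, row_fn):
    """Map-like contract: row_fn sees each input row exactly once."""
    out = []
    for row in rows:
        out.extend(row_fn(row))
    return out


def run_partition_function(rows, partition_key, order_key, partition_fn):
    """Reduce-like contract: each PARTITION BY group goes to one call, in ORDER BY order."""
    out = []
    # Sort by partition first, then by the ordering expression within it
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for key, group in groupby(ordered, key=partition_key):
        out.extend(partition_fn(key, list(group)))
    return out


clicks = [
    {"user": "a", "ts": 3},
    {"user": "b", "ts": 1},
    {"user": "a", "ts": 1},
]
# Earliest timestamp per user: PARTITION BY user ORDER BY ts
print(run_partition_function(
    clicks,
    partition_key=lambda r: r["user"],
    order_key=lambda r: r["ts"],
    partition_fn=lambda user, group: [(user, group[0]["ts"])],
))
```

In a parallel setting the row calls and the per-group calls are independent of one another, which is what allows them to be spread across machines.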
Programming Interface
A Runtime Contract is passed by the query planner to the
function; it contains the names and types of the input columns and
the names and values of the argument clauses
The function then completes this contract by filling in the output
schema and making a call to complete()
Row and partition functions are implemented through the
operateOnSomeRows and operateOnPartition methods,
respectively
These methods are passed an iterator over their input rows and an
emitter object for returning output rows to the database
A partition function can also optionally implement the combiner
interface
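The handshake described above can be sketched with stand-in types. Everything in this block is an assumption made for illustration: the classes ContractSketch, Contract, and CountRows, the Emitter interface, and the row_count column are hypothetical and do not reproduce the real SQL/MR SDK. Only the overall flow (read the contract, fill in the output schema, call complete(), then consume an iterator and write to an emitter) comes from the slide.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the SQL/MR interfaces; names and shapes are assumptions.
final class ContractSketch {

    // What the planner hands to the function: input schema plus argument clauses.
    static final class Contract {
        final Map<String, String> inputColumns;    // column name -> type
        final Map<String, String> argumentClauses; // clause name -> value
        final Map<String, String> outputColumns = new LinkedHashMap<>();
        private boolean completed = false;

        Contract(Map<String, String> inputColumns, Map<String, String> argumentClauses) {
            this.inputColumns = inputColumns;
            this.argumentClauses = argumentClauses;
        }

        // The function calls this once it has declared its output schema.
        void complete() {
            if (outputColumns.isEmpty()) {
                throw new IllegalStateException("output schema not set");
            }
            completed = true;
        }

        boolean isCompleted() {
            return completed;
        }
    }

    // Receives output rows on behalf of the database.
    interface Emitter {
        void emit(List<Object> row);
    }

    // A toy partition function that emits the number of rows in its partition.
    static final class CountRows {
        CountRows(Contract contract) {
            contract.outputColumns.put("row_count", "int");
            contract.complete();
        }

        void operateOnPartition(Iterator<List<Object>> input, Emitter out) {
            int n = 0;
            while (input.hasNext()) {
                input.next();
                n++;
            }
            List<Object> row = new ArrayList<>();
            row.add(n);
            out.emit(row);
        }
    }
}
```

Completing the contract in the constructor mirrors the slide's description: the output schema is fixed before any rows flow, so the planner can type-check the enclosing query.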
Installation
Functions must be installed before they can be used
Can be supplied as a .zip along with third-party libraries
Install-time examination also enables static analysis of properties, such
as row function or partition function, support for combining, etc.
Arbitrary files, such as configuration files and binaries, can also be
installed; they are replicated to all workers
Each function is provided with a temporary directory which is garbage
collected after execution
Architecture
One or more Queen nodes process queries and hash partition them
across Worker nodes
The query planner honours the Runtime Contract with the
function and invokes its initializer (constructor in the case of Java)
Functions are executed within the Worker databases as separate
processes for isolation, security, resource allocation, forced
termination, etc.
The worker database implements a "bridge" which manages its
communication with the SQL/MR function
The SQL/MR function process contains a "runner" which manages its
communication with the worker database
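Hash partitioning across workers can be illustrated with a small sketch. The slide does not say what exactly is hashed, so this block assumes rows are assigned to workers by hashing a key column; the class HashPartitionSketch and its methods workerFor and partition are hypothetical names, not part of nCluster.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hash partitioning rows across workers; not nCluster code.
final class HashPartitionSketch {

    // Map a partition key to a worker index in [0, workers).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int workerFor(String key, int workers) {
        return Math.floorMod(key.hashCode(), workers);
    }

    // Distribute rows into one bucket per worker, keyed on column keyCol.
    static List<List<String[]>> partition(List<String[]> rows, int keyCol, int workers) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            buckets.get(workerFor(row[keyCol], workers)).add(row);
        }
        return buckets;
    }
}
```

Because the worker index depends only on the key, all rows sharing a key land on the same worker, which is what lets a partition function see its whole group locally.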
Architecture (2)
Example: Wordcount
SELECT token, COUNT(*)
FROM tokenizer(
    ON input-table
    DELIMITER(' ')
)
GROUP BY token;
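What the query above computes can be mimicked in plain Java. This is a sketch under stated assumptions rather than the real tokenizer function: WordcountSketch, tokenize, and wordcount are hypothetical names, tokenize plays the part of the row function, and the map in wordcount stands in for GROUP BY token with COUNT(*).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory equivalent of the wordcount query; not nCluster code.
final class WordcountSketch {

    // Row function: split one input row into tokens on a literal delimiter.
    static List<String> tokenize(String line, String delimiter) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.split(Pattern.quote(delimiter))) {
            if (!t.isEmpty()) {
                tokens.add(t); // skip empty strings produced by repeated delimiters
            }
        }
        return tokens;
    }

    // GROUP BY token with COUNT(*): count how often each token was emitted.
    static Map<String, Integer> wordcount(List<String> lines, String delimiter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokenize(line, delimiter)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The split mirrors the query: the row function knows nothing about counting, and the aggregation is ordinary SQL applied to the function's output relation.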
Example: Clickstream Sessionization
Divide a user's clicks on a website into sessions
A session includes the user's clicks within a specified time period

Input
Timestamp   User ID
10:00:00    238909
00:58:24    7656
10:00:24    238909
02:30:33    7656
10:01:23    238909
10:02:40    238909

Output
Timestamp   User ID   Session ID
10:00:00    238909    0
10:00:24    238909    0
10:01:23    238909    0
10:02:40    238909    1
00:58:24    7656      0
02:30:33    7656      1
Example: Clickstream Sessionization (2)
SELECT ts, userid, session
FROM sessionize (
    ON clicks
    PARTITION BY userid
    ORDER BY ts
    TIMECOLUMN ('ts')
    TIMEOUT (60)
);
Example: Clickstream Sessionization (3)
public class Sessionize implements PartitionFunction {

    // Index of the timestamp column named by the TIMECOLUMN clause
    private int timeColumnIndex;
    // Session timeout taken from the TIMEOUT clause
    private int timeout;

    public Sessionize(RuntimeContract contract) {
        // Get time column and timeout from contract
        // Define output schema
        contract.complete();
    }

    public void operateOnPartition(
            PartitionDefinition partition,
            RowIterator inputIterator,
            RowEmitter outputEmitter) {
        // Implement the partition function logic
        // Emit output rows
    }

}
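The logic left as comments in the skeleton can be sketched without the SQL/MR SDK. The block below is a self-contained approximation under stated assumptions: SessionizeSketch, toSeconds, and sessionize are hypothetical names, timestamps are assumed to be HH:mm:ss strings within a single day, the timeout is assumed to be in seconds, and a new session is assumed to start when the gap to the previous click is strictly greater than the timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone version of the sessionization logic; not the SQL/MR SDK.
final class SessionizeSketch {

    // Parse "HH:mm:ss" into seconds since midnight.
    static int toSeconds(String ts) {
        String[] parts = ts.split(":");
        return Integer.parseInt(parts[0]) * 3600
                + Integer.parseInt(parts[1]) * 60
                + Integer.parseInt(parts[2]);
    }

    // Input: one user's click timestamps, already sorted (PARTITION BY userid, ORDER BY ts).
    // Output: one session id per click, starting at 0 and incremented whenever
    // the gap to the previous click exceeds timeoutSeconds.
    static List<Integer> sessionize(List<String> sortedTimestamps, int timeoutSeconds) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        int previous = -1; // no previous click yet
        for (String ts : sortedTimestamps) {
            int now = toSeconds(ts);
            if (previous >= 0 && now - previous > timeoutSeconds) {
                session++;
            }
            sessions.add(session);
            previous = now;
        }
        return sessions;
    }
}
```

With a 60 second timeout, the four clicks of user 238909 in the example table have gaps of 24, 59, and 77 seconds, so only the last click opens a new session, matching the Session ID column.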
Summary
Hive, HadoopDB, and nCluster explore three different points in the design
space
1 Hive uses MapReduce to give DBMS-like functionality
2 HadoopDB uses MapReduce and DBMS side-by-side
3 nCluster implements MapReduce within a DBMS
References
1 Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham
Murthy. 2009. Hive: a warehousing solution over a map-reduce
framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629.
2 Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi
Silberschatz, and Alexander Rasin. 2009. HadoopDB: an architectural
hybrid of MapReduce and DBMS technologies for analytical workloads.
Proc. VLDB Endow. 2, 1 (August 2009), 922-933.
3 Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009.
SQL/MapReduce: a practical approach to self-describing, polymorphic,
and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2
(August 2009), 1402-1413.
Exploring the Future Potential of AI-Enabled Smartphone Processors
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Ā 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Ā 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
Ā 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Ā 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Ā 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
Ā 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Ā 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
Ā 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Ā 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
Ā 
Scaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationScaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organization
Ā 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Ā 
Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Ā 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Ā 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Ā 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Ā 

MapReduce and DBMS Hybrids

  • 1. 12: MapReduce and DBMS Hybrids Zubair Nabi zubair.nabi@itu.edu.pk May 26, 2013 Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
  • 2. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 10. Introduction
      • Data warehousing solution built atop Hadoop by Facebook
      • Now an Apache open source project
      • Queries are expressed in SQL-like HiveQL and compiled into map-reduce jobs
      • Also contains a type system for describing RDBMS-like tables
      • A system catalog, the Hive-Metastore, contains schemas and statistics and is used for data exploration and query optimization
      • Stores 2PB of uncompressed data at Facebook and is heavily used for simple summarization, business intelligence, and machine learning, among many other applications [1]
      • Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.
      [1] https://www.facebook.com/note.php?note_id=89508453919
  • 16. Data Model
      • Tables: Similar to RDBMS tables
      • Each table has a corresponding HDFS directory
      • The contents of the table are serialized and stored in files within that directory
      • Serialization can be either system-provided or user-defined
      • Serialization information for each table is also stored in the Hive-Metastore for query optimization
      • Tables can also be defined for data stored in external sources such as HDFS, NFS, and local FS
  • 23. Data Model (2)
      • Partitions: Determine the distribution of data within sub-directories of the main table directory
          • For instance, for a table T stored in /wh/T and partitioned on columns ds and ctry, data with ds value 20090101 and ctry value US will be stored in files within /wh/T/ds=20090101/ctry=US
      • Buckets: Data within partitions is divided into buckets
          • Buckets are calculated based on the hash of a column within the partition
          • Each bucket is stored within a file in the partition directory
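The directory layout and bucket assignment described on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `partition_dir` and `bucket_for` are made up for this example, and `zlib.crc32` is an assumed stand-in for a stable hash, not the hash function Hive itself uses.

```python
import zlib


def partition_dir(table_dir, partition_values):
    """Build the sub-directory for one partition.

    partition_values is an ordered list of (column, value) pairs,
    e.g. [("ds", "20090101"), ("ctry", "US")].
    """
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket by hashing a column value.

    crc32 is used only because it is deterministic across runs;
    it is an assumption, not Hive's own hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


path = partition_dir("/wh/T", [("ds", "20090101"), ("ctry", "US")])
print(path)  # /wh/T/ds=20090101/ctry=US
print(bucket_for(42, 32))  # some bucket index in the range 0..31
```

Rows with the same value in the bucketing column always land in the same bucket file, which is what makes sampling and bucket-wise joins possible.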
  • 26. Column Data Types
      • Primitive types: integers, floats, strings, dates, and booleans
      • Nestable collection types: arrays and maps
      • Custom types: user-defined
  • 32. HiveQL
      • Supports select, project, join, aggregate, union all, and sub-queries
      • Tables are created using data definition statements with specific serialization formats, partitioning, and bucketing
      • Data is loaded from external sources and inserted into tables
      • Support for multi-table insert: multiple queries on the same input data using a single HiveQL statement
      • User-defined column transformation and aggregation functions in Java
      • Custom map-reduce scripts written in any language can be embedded
  • 36. Example: Facebook Status
      • Status updates are stored in flat files in an NFS directory /logs/status_updates
      • This data is loaded on a daily basis into a Hive table: status_updates(userid int, status string, ds string)
      • Using:
          LOAD DATA LOCAL INPATH '/logs/status_updates'
          INTO TABLE status_updates PARTITION (ds='2013-05-26')
      • Detailed profile information, such as gender and academic institution, is present in the table: profiles(userid int, school string, gender int)
  • 38. Example: Facebook Status (2)
      • Query to work out the frequency of status updates based on gender and academic institution:
          FROM (SELECT a.status, b.school, b.gender
                FROM status_updates a JOIN profiles b
                ON (a.userid = b.userid and
                    a.ds='2013-05-26')
               ) subq1
          INSERT OVERWRITE TABLE gender_summary
            PARTITION(ds='2013-05-26')
            SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
          INSERT OVERWRITE TABLE school_summary
            PARTITION(ds='2013-05-26')
            SELECT subq1.school, COUNT(1) GROUP BY subq1.school
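The point of the multi-table insert above is that the join is computed once and both summaries are fed from that single pass. A rough Python analogue, using made-up sample rows and a hypothetical `summarize` helper (and assuming `userid` is unique in `profiles`), might look like this:

```python
from collections import Counter

# Hypothetical sample rows standing in for the two Hive tables.
status_updates = [
    {"userid": 1, "status": "hello", "ds": "2013-05-26"},
    {"userid": 2, "status": "exam week", "ds": "2013-05-26"},
    {"userid": 1, "status": "old post", "ds": "2013-05-25"},
]
profiles = {
    1: {"school": "A", "gender": 0},
    2: {"school": "B", "gender": 1},
}


def summarize(updates, profiles_by_user, ds):
    """Join once, then update both aggregates from the same pass."""
    gender_summary, school_summary = Counter(), Counter()
    for row in updates:
        if row["ds"] != ds:
            continue  # mirrors the a.ds = '...' condition
        profile = profiles_by_user.get(row["userid"])
        if profile is None:
            continue  # inner join: skip updates without a profile
        gender_summary[profile["gender"]] += 1
        school_summary[profile["school"]] += 1
    return gender_summary, school_summary


by_gender, by_school = summarize(status_updates, profiles, "2013-05-26")
print(by_gender)
print(by_school)
```

Running two separate queries would scan and join the input twice; sharing the scan is what the single HiveQL statement buys.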
  • 43. Metastore
      • Similar to the metastore maintained by traditional warehousing solutions such as Oracle and IBM DB2 (distinguishes Hive from Pig or Cascading, which have no such store)
      • Stored in either a traditional DB such as MySQL or an FS such as NFS
      • Contains the following objects:
          • Database: namespace for tables
          • Table: metadata for a table including columns and their types, owner, storage, and serialization information
          • Partition: metadata for a partition; similar to the information for a table
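One way to picture the three Metastore objects is as nested records. The sketch below uses Python dataclasses; the field names and the sample values (owner, locations, serialization label) are assumptions made for illustration and do not reflect the actual Metastore schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Partition:
    values: Dict[str, str]   # partition column -> value, e.g. {"ds": "2013-05-26"}
    location: str            # directory holding this partition's files
    serde: str               # serialization information


@dataclass
class Table:
    name: str
    columns: Dict[str, str]  # column name -> type
    owner: str
    location: str            # storage information
    serde: str               # serialization information
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str                # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


# Hypothetical entries for the status_updates table from the example slides.
db = Database("default")
table = Table(
    name="status_updates",
    columns={"userid": "int", "status": "string", "ds": "string"},
    owner="etl",
    location="/wh/status_updates",
    serde="text",
)
table.partitions.append(
    Partition({"ds": "2013-05-26"}, "/wh/status_updates/ds=2013-05-26", "text")
)
db.tables[table.name] = table
print(db.tables["status_updates"].partitions[0].location)
```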
  • 44. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 51. Introduction
      • Two options for data analytics on shared-nothing clusters:
          1. Parallel databases, such as Teradata and Oracle, but they:
              • Assume that failures are a rare event
              • Assume that hardware is homogeneous
              • Have never been tested in deployments with more than a few dozen nodes
          2. MapReduce, but it:
              • Has all the shortcomings pointed out by DeWitt and Stonebraker, as discussed before
              • Is at times an order of magnitude slower than parallel DBs
  • 57. Hybrid
      • Combine the scalability and non-existent monetary cost of MapReduce with the performance of parallel DBs
      • HadoopDB is such a hybrid
      • Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and interface-level hybrids, HadoopDB is a systems-level hybrid
      • Uses MapReduce as the communication layer atop a cluster of nodes running single-node DBMS instances
      • PostgreSQL as the database layer, Hadoop as the communication layer, and Hive as the translation layer
      • Commercialized through the startup Hadapt [2]
      [2] http://hadapt.com/
  • 61. HadoopDB
      • Consists of four components:
          1. Database Connector: Interface between per-node database systems and Hadoop TaskTrackers
          2. Catalog: Meta-information about per-node databases
          3. Data Loader: Data partitioning across single-node databases
          4. SQL to MapReduce to SQL (SMS) Planner: Translation between SQL and MapReduce
  • 62. HadoopDB Architecture
  • 66. Database Connector
      • Uses a Java Database Connectivity (JDBC)-compliant Hadoop InputFormat
      • The connector is given the SQL query and other information by the MapReduce job
      • The connector connects to the DB, executes the SQL query, and returns results in the form of key/value pairs
      • Hadoop in essence sees the DB as just another data source
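The connector's job, running a SQL query against a node-local database and handing the rows back as key/value pairs, can be imitated in Python. This sketch uses the standard-library `sqlite3` module purely as a stand-in for a JDBC connection to PostgreSQL; the function name `run_query_as_key_values` and the choice of the first column as the key are assumptions for illustration, not HadoopDB's interface.

```python
import sqlite3


def run_query_as_key_values(conn, sql, key_column=0):
    """Execute sql and yield one (key, value) pair per result row.

    The key is the column at index key_column; the value is a tuple
    of the remaining columns.
    """
    for row in conn.execute(sql):
        key = row[key_column]
        value = row[:key_column] + row[key_column + 1:]
        yield key, value


# In-memory database with made-up rows, standing in for a node-local DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (userid INTEGER, school TEXT, gender INTEGER)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?)", [(1, "A", 0), (2, "B", 1)]
)

pairs = list(
    run_query_as_key_values(
        conn, "SELECT userid, school, gender FROM profiles ORDER BY userid"
    )
)
print(pairs)  # [(1, ('A', 0)), (2, ('B', 1))]
```

From the consumer's side nothing distinguishes these pairs from records read out of a file split, which is the sense in which the database is "just another data source".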
  • 69. Catalog
      • Contains information such as:
          1. Connection parameters, such as DB location, format, and any credentials
          2. Metadata about the datasets, replica locations, and partitioning scheme
      • Stored as an XML file on the HDFS
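To make the catalog's contents concrete, here is a small Python sketch that reads a catalog-like XML document. The element and attribute names (`catalog`, `node`, `partition`, `host`, `url`, `user`, `dataset`, `id`) are invented for this example; the slides do not give the real file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog contents; the real HadoopDB schema is not shown in the slides.
CATALOG_XML = """
<catalog>
  <node host="node1" url="jdbc:postgresql://node1/db" user="hadoopdb">
    <partition dataset="profiles" id="0"/>
    <partition dataset="status_updates" id="0"/>
  </node>
  <node host="node2" url="jdbc:postgresql://node2/db" user="hadoopdb">
    <partition dataset="profiles" id="1"/>
  </node>
</catalog>
"""


def load_catalog(xml_text):
    """Return {host: {"url", "user", "partitions"}} parsed from the XML text."""
    root = ET.fromstring(xml_text.strip())
    catalog = {}
    for node in root.findall("node"):
        catalog[node.get("host")] = {
            "url": node.get("url"),
            "user": node.get("user"),
            "partitions": [
                (p.get("dataset"), int(p.get("id")))
                for p in node.findall("partition")
            ],
        }
    return catalog


catalog = load_catalog(CATALOG_XML)
print(catalog["node2"]["partitions"])  # [('profiles', 1)]
```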
  • 71. Data Loader
      • Consists of two key components:
          1. Global Hasher: Executes a custom Hadoop job to repartition raw data files from the HDFS into n parts, where n is the number of nodes in the cluster
          2. Local Hasher: Copies a partition from the HDFS to the node-local DB of each node and further partitions it into smaller chunks
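The two-stage loading can be sketched as two hash-partitioning passes. The Python below is a simplified, in-memory illustration: the function names, the salted `zlib.crc32` hash, and the idea that the local step also partitions by hash are assumptions, and the real components run as a Hadoop job and a per-node copy step.

```python
import zlib


def stable_hash(key, salt):
    """Deterministic hash; crc32 is an assumed stand-in, not HadoopDB's hash."""
    return zlib.crc32(f"{salt}:{key}".encode("utf-8"))


def global_hash(records, key_fn, num_nodes):
    """Repartition records into num_nodes parts, one per cluster node."""
    parts = [[] for _ in range(num_nodes)]
    for rec in records:
        parts[stable_hash(key_fn(rec), "global") % num_nodes].append(rec)
    return parts


def local_hash(partition, key_fn, num_chunks):
    """Split one node's partition into smaller chunks.

    A different salt is used so chunk assignment is independent of the
    node assignment made by the global step.
    """
    chunks = [[] for _ in range(num_chunks)]
    for rec in partition:
        chunks[stable_hash(key_fn(rec), "local") % num_chunks].append(rec)
    return chunks


records = [{"userid": i} for i in range(100)]  # made-up input
parts = global_hash(records, lambda r: r["userid"], num_nodes=4)
chunks = local_hash(parts[0], lambda r: r["userid"], num_chunks=3)
print([len(p) for p in parts], [len(c) for c in chunks])
```

Because both steps hash on the same key, all rows with a given key end up on one node and in one chunk, which is what later allows joins and aggregations on that key to run inside the node-local database.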
  • 73. SQL to MapReduce to SQL (SMS) Planner
      • Extends HiveQL in two key ways:
          1. Before query execution, the Hive Metastore is updated with references to HadoopDB tables, table schemas, formats, and serialization information
          2. All operators with partitioning keys similar to the node-local database are converted into SQL queries and pushed to the database layer
  • 74. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 80. Introduction
    - The declarative nature of SQL is too limiting for describing most big data computation
    - The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations
    - nCluster makes use of SQL/MR, a framework that inserts user-defined functions written in any programming language into SQL queries
    - By itself, nCluster is a shared-nothing parallel database geared towards analytic workloads
    - Originally developed by Aster Data Systems, which was later acquired by Teradata
    - Used by Barnes and Noble, LinkedIn, SAS, etc.
  • 86. SQL/MR Functions
    - Dynamically polymorphic: input and output schemas are decided at runtime
    - Parallelizable across cores and machines
    - Composable because their input and output behaviour is identical to that of SQL subqueries
    - Amenable to static and dynamic optimizations, just like SQL subqueries or relations
    - Can be implemented in a number of languages, including Java, C#, C++, and Python, and can thus make use of third-party libraries
    - Executed within separate processes to provide sandboxing and resource allocation
  • 89. Syntax
        SELECT ...
        FROM functionname(
          ON table-or-query
          [PARTITION BY expr, ...]
          [ORDER BY expr, ...]
          [clausename(arg, ...) ...]
        )
        ...
    - The SQL/MR function appears in the FROM clause
    - ON is the only required clause; it specifies the input to the function
    - PARTITION BY partitions the input to the function on one or more attributes from the schema
  • 92. Syntax (2)
    - ORDER BY sorts the input to the function and can only be used after a PARTITION BY clause
    - Any number of custom clauses can also be defined; their names and arguments are passed to the function as a key/value map
    - SQL/MR functions are implemented as relations, so they are easily nested
  • 96. Execution Model
    - Functions are equivalent to either map (row function) or reduce (partition function) functions
    - As in MapReduce, these functions are executed across many nodes and machines
    - Contracts identical to those of MapReduce functions:
      - Each row of the input table is processed by exactly one row function instance
      - Each group of rows defined by the PARTITION BY clause is processed by exactly one partition function instance, in the order specified by the ORDER BY clause
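The two contracts can be modelled in a few lines of plain Java. This is a single-process sketch under stated assumptions, not nCluster code: `applyRowFunction` and `applyPartitionFunction` are hypothetical helper names, and the parallel distribution across workers is deliberately left out so that only the per-row and per-partition invocation rules are visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical single-process model of the row/partition function contracts.
class ExecutionModelSketch {

    /** Row function (map): invoked once per input row; may emit any number of rows. */
    static <I, O> List<O> applyRowFunction(List<I> input, Function<I, List<O>> rowFunction) {
        List<O> out = new ArrayList<>();
        for (I row : input) {
            out.addAll(rowFunction.apply(row));
        }
        return out;
    }

    /** Partition function (reduce): invoked once per group, with rows in ORDER BY order. */
    static <I, K, O> List<O> applyPartitionFunction(
            List<I> input,
            Function<I, K> partitionBy,
            Comparator<I> orderBy,
            Function<List<I>, List<O>> partitionFunction) {
        // Group rows by the PARTITION BY key, keeping first-seen key order.
        Map<K, List<I>> groups = new LinkedHashMap<>();
        for (I row : input) {
            groups.computeIfAbsent(partitionBy.apply(row), k -> new ArrayList<>()).add(row);
        }
        List<O> out = new ArrayList<>();
        for (List<I> group : groups.values()) {
            group.sort(orderBy);                      // ORDER BY within the partition
            out.addAll(partitionFunction.apply(group));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 12, 7, 10, 3);

        // Row function: double every value.
        List<Integer> doubled = applyRowFunction(input, (Integer x) -> Arrays.asList(x * 2));

        // Partition function: partition by parity, order ascending, emit each group's sum.
        List<Integer> sums = applyPartitionFunction(
            input,
            (Integer x) -> x % 2,
            Comparator.<Integer>naturalOrder(),
            (List<Integer> group) -> {
                int sum = 0;
                for (int v : group) {
                    sum += v;
                }
                return Arrays.asList(sum);
            });

        System.out.println(doubled + " " + sums);
    }
}
```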
  • 101. Programming Interface
    - The query planner passes the function a Runtime Contract, which contains the names and types of the input columns and the names and values of the argument clauses
    - The function then completes this contract by filling in the output schema and calling complete()
    - Row and partition functions are implemented through the operateOnSomeRows and operateOnPartition methods, respectively
    - These methods are passed an iterator over their input rows and an emitter object for returning output rows to the database
    - Partition functions can also optionally implement the combiner interface
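The handshake described above might look roughly as follows. Everything here is a stand-in: `ContractSketch`, `UpperCaseFunction`, `ProgrammingInterfaceDemo`, and the `COLUMN` clause are hypothetical and do not reproduce the real nCluster classes. Only the method names `complete()` and `operateOnSomeRows` are taken from the slides. The sketch shows a function reading its argument clauses from the contract, filling in the output schema, completing the contract, and then consuming an iterator and writing to an emitter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for the Runtime Contract; not the real nCluster class.
class ContractSketch {
    final Map<String, String> inputColumns = new LinkedHashMap<>();          // name -> type
    final Map<String, List<String>> argumentClauses = new LinkedHashMap<>(); // clause -> args
    final Map<String, String> outputColumns = new LinkedHashMap<>();         // set by the function
    private boolean completed = false;

    void complete() {
        if (outputColumns.isEmpty()) {
            throw new IllegalStateException("output schema must be set before complete()");
        }
        completed = true;
    }

    boolean isCompleted() {
        return completed;
    }
}

// A hypothetical row function that upper-cases the column named by COLUMN('...').
class UpperCaseFunction {
    private final int columnIndex;

    UpperCaseFunction(ContractSketch contract) {
        // Read the custom clause from the contract.
        String column = contract.argumentClauses.get("COLUMN").get(0);
        columnIndex = new ArrayList<>(contract.inputColumns.keySet()).indexOf(column);
        if (columnIndex < 0) {
            throw new IllegalArgumentException("unknown column: " + column);
        }
        // In this sketch the output schema equals the input schema.
        contract.outputColumns.putAll(contract.inputColumns);
        contract.complete();
    }

    // Mirrors the shape of operateOnSomeRows: an iterator in, an emitter out.
    void operateOnSomeRows(Iterator<String[]> input, Consumer<String[]> emitter) {
        while (input.hasNext()) {
            String[] row = input.next().clone();
            row[columnIndex] = row[columnIndex].toUpperCase();
            emitter.accept(row);
        }
    }
}

class ProgrammingInterfaceDemo {
    /** Plays the role of the planner and database for a two-column table (id, name). */
    static List<String> run(List<String[]> rows, String column) {
        ContractSketch contract = new ContractSketch();
        contract.inputColumns.put("id", "int");
        contract.inputColumns.put("name", "varchar");
        contract.argumentClauses.put("COLUMN", Arrays.asList(column));
        UpperCaseFunction fn = new UpperCaseFunction(contract);
        List<String> out = new ArrayList<>();
        fn.operateOnSomeRows(rows.iterator(), row -> out.add(String.join(",", row)));
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "ada"},
            new String[] {"2", "grace"});
        System.out.println(run(rows, "name"));
    }
}
```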
  • 106. Installation
    - Functions must be installed before they can be used
    - Can be supplied as a .zip along with third-party libraries
    - Install-time examination also enables static analysis of properties, such as whether the function is a row function or a partition function, whether it supports combining, etc.
    - Any arbitrary file, such as a configuration file or a binary, can also be installed; it is replicated to all workers
    - Each function is provided with a temporary directory, which is garbage collected after execution
  • 111. Architecture
    - One or more Queen nodes process queries and hash partition them across Worker nodes
    - The query planner honours the Runtime Contract with the function and invokes its initializer (the constructor, in the case of Java)
    - Functions are executed within the Worker databases as separate processes for isolation, security, resource allocation, forced termination, etc.
    - The worker database implements a "bridge", which manages its communication with the SQL/MR function
    - The SQL/MR function process contains a "runner", which manages its communication with the worker database
  • 112. Architecture (2)
  • 113. Example: Wordcount
        SELECT token, COUNT(*)
        FROM tokenizer(
          ON input-table
          DELIMITER(' ')
        )
        GROUP BY token;
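To make the division of labour in this query concrete, the following sketch emulates it in memory. It is not the nCluster API: `WordCountSketch`, `tokenizer`, and `wordCount` are hypothetical names. The `tokenizer` method plays the role of the row function (one input row in, zero or more token rows out), while `wordCount` stands in for the ordinary SQL `GROUP BY token` with `COUNT(*)` that the database applies to the function's output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory model of the word-count query; not the nCluster API.
class WordCountSketch {

    /** Row function: splits one input row on the delimiter and emits its tokens. */
    static List<String> tokenizer(String row, String delimiter) {
        List<String> tokens = new ArrayList<>();
        for (String token : row.split(Pattern.quote(delimiter))) {
            if (!token.isEmpty()) {   // skip empty strings from repeated delimiters
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Models SELECT token, COUNT(*) ... GROUP BY token over the emitted rows. */
    static Map<String, Integer> wordCount(List<String> rows, String delimiter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            for (String token : tokenizer(row, delimiter)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("to be or", "not to be");
        System.out.println(wordCount(rows, " "));
    }
}
```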
  • 116. Example: Clickstream Sessionization
    - Divide a user's clicks on a website into sessions
    - A session is a sequence of a user's clicks in which successive clicks are no more than a specified time period apart

    Input:
        Timestamp   User ID
        10:00:00    238909
        00:58:24    7656
        10:00:24    238909
        02:30:33    7656
        10:01:23    238909
        10:02:40    238909

    Output:
        Timestamp   User ID   Session ID
        10:00:00    238909    0
        10:00:24    238909    0
        10:01:23    238909    0
        10:02:40    238909    1
        00:58:24    7656      0
        02:30:33    7656      1
  • 117. Example: Clickstream Sessionization (2)
        SELECT ts, userid, session
        FROM sessionize (
          ON clicks
          PARTITION BY userid
          ORDER BY ts
          TIMECOLUMN ('ts')
          TIMEOUT (60)
        );
  • 118. Example: Clickstream Sessionization (3)
        public class Sessionize implements PartitionFunction {

          private int timeColumnIndex;
          private int timeout;

          // Constructor: invoked with the Runtime Contract before execution
          public Sessionize(RuntimeContract contract) {
            // Get time column and timeout from contract
            // Define output schema
            contract.complete();
          }

          // Invoked once per partition (one user's clicks, ordered by ts)
          public void operateOnPartition(
              PartitionDefinition partition,
              RowIterator inputIterator,
              RowEmitter outputEmitter) {
            // Implement the partition function logic
            // Emit output rows
          }

        }
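The skeleton above leaves the partition logic as a comment. One way to fill it in is sketched below in plain Java, without the nCluster interfaces. `SessionizeSketch`, `Click`, `seconds`, `sessionizePartition`, and `sessionize` are hypothetical names, and timestamps are assumed to be seconds since midnight. The rule is the one implied by the example tables: walking one user's clicks in time order, a new session starts whenever the gap to the previous click exceeds the timeout. Run on the six clicks from the example with a 60-second timeout, it yields session IDs 0, 0, 0, 1 for user 238909 and 0, 1 for user 7656.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of the sessionize partition function;
// it does not use the real nCluster SQL/MR API.
class SessionizeSketch {

    /** One click: a user id and a timestamp in seconds since midnight. */
    static final class Click {
        final int userId;
        final int ts;

        Click(int userId, int ts) {
            this.userId = userId;
            this.ts = ts;
        }
    }

    /** Converts "HH:MM:SS" to seconds since midnight. */
    static int seconds(String hms) {
        String[] p = hms.split(":");
        return Integer.parseInt(p[0]) * 3600
             + Integer.parseInt(p[1]) * 60
             + Integer.parseInt(p[2]);
    }

    /**
     * Models the partition function body: receives one user's clicks already
     * sorted by timestamp and returns one session id per click. A new session
     * starts when the gap to the previous click exceeds the timeout.
     */
    static List<Integer> sessionizePartition(List<Click> sortedClicks, int timeoutSeconds) {
        List<Integer> sessionIds = new ArrayList<>();
        int session = 0;
        Integer previousTs = null;
        for (Click c : sortedClicks) {
            if (previousTs != null && c.ts - previousTs > timeoutSeconds) {
                session++;
            }
            sessionIds.add(session);
            previousTs = c.ts;
        }
        return sessionIds;
    }

    /** Models the database side: PARTITION BY userid, ORDER BY ts, then call the function. */
    static Map<Integer, List<Integer>> sessionize(List<Click> clicks, int timeoutSeconds) {
        Map<Integer, List<Click>> partitions = new LinkedHashMap<>();
        for (Click c : clicks) {
            partitions.computeIfAbsent(c.userId, k -> new ArrayList<>()).add(c);
        }
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<Click>> e : partitions.entrySet()) {
            e.getValue().sort((a, b) -> Integer.compare(a.ts, b.ts));
            result.put(e.getKey(), sessionizePartition(e.getValue(), timeoutSeconds));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Click> clicks = Arrays.asList(
            new Click(238909, seconds("10:00:00")),
            new Click(7656, seconds("00:58:24")),
            new Click(238909, seconds("10:00:24")),
            new Click(7656, seconds("02:30:33")),
            new Click(238909, seconds("10:01:23")),
            new Click(238909, seconds("10:02:40")));
        System.out.println(sessionize(clicks, 60));
    }
}
```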
  • 119. Outline: 1 Hive, 2 HadoopDB, 3 nCluster, 4 Summary
  • 123. Summary
    Hive, HadoopDB, and nCluster explore three different points in the design space:
    1. Hive uses MapReduce to give DBMS-like functionality
    2. HadoopDB uses MapReduce and DBMS side-by-side
    3. nCluster implements MapReduce within a DBMS
  • 124. References
    1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629.
    2. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. 2009. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2, 1 (August 2009), 922-933.
    3. Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2 (August 2009), 1402-1413.