12: MapReduce and DBMS Hybrids
Zubair Nabi
zubair.nabi@itu.edu.pk
May 26, 2013
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
1 Hive
Introduction
Data warehousing solution built atop Hadoop by Facebook
Now an Apache open source project
Queries are expressed in the SQL-like language HiveQL and are compiled into
map-reduce jobs
Also contains a type system for describing RDBMS-like tables
A system catalog, the Hive-Metastore, which contains schemas and
statistics, is used for data exploration and query optimization
Stores 2PB of uncompressed data at Facebook and is heavily used for
simple summarization, business intelligence, and machine learning, among
many other applications [1]
Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.
[1] https://www.facebook.com/note.php?note_id=89508453919
Data Model
Tables:
Similar to RDBMS tables
Each table has a corresponding HDFS directory
The contents of the table are serialized and stored in files within that
directory
Serialization can be either system-provided or user-defined
Serialization information of each table is also stored in the
Hive-Metastore for query optimization
Tables can also be defined for data stored in external sources such as
HDFS, NFS, and local FS
Data Model (2)
Partitions:
Determine the distribution of data within sub-directories of the main
table directory
For instance, for a table T stored in /wh/T and partitioned on columns
ds and ctry, data with ds value 20090101 and ctry value US
is stored in files within /wh/T/ds=20090101/ctry=US
Buckets:
Data within partitions is divided into buckets
Buckets are calculated based on the hash of a column within the
partition
Each bucket is stored within a file in the partition directory
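The partition and bucket layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the part-NNNNN file naming, and the use of CRC32 as the hash are hypothetical and are not taken from Hive.

```python
import zlib


def partition_dir(table_dir, partition_spec):
    """Build the sub-directory for one partition, e.g. /wh/T/ds=.../ctry=..."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_index(value, num_buckets):
    """Pick a bucket from the hash of the bucketing column.

    CRC32 is an assumption made for this sketch, chosen because it is stable
    across runs; it is not claimed to be the hash that Hive uses.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def bucket_file(table_dir, partition_spec, value, num_buckets):
    """Path of the file that would hold a row (file name pattern is hypothetical)."""
    idx = bucket_index(value, num_buckets)
    return f"{partition_dir(table_dir, partition_spec)}/part-{idx:05d}"
```

For the table T from the slide, partition_dir("/wh/T", [("ds", "20090101"), ("ctry", "US")]) yields the directory /wh/T/ds=20090101/ctry=US, and every row with the same bucketing-column value lands in the same file inside it.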
Column Data Types
Primitive types: integers, floats, strings, dates, and booleans
Nestable collection types: arrays and maps
Custom types: user-defined
HiveQL
Supports select, project, join, aggregate, union all, and sub-queries
Tables are created using data definition statements with specific
serialization formats, partitioning, and bucketing
Data is loaded from external sources and inserted into tables
Support for multi-table insert – multiple queries on the same input data
using a single HiveQL statement
User-defined column transformation and aggregation functions in Java
Custom map-reduce scripts written in any language can be embedded
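As a rough illustration of an embedded custom script, the sketch below shows a Python mapper of the kind that could be plugged into a query through HiveQL's TRANSFORM ... USING clause. It assumes rows arrive as tab-separated lines on standard input and leave the same way on standard output. The column layout (userid, status) and the word-count logic are made up for the example.

```python
import sys


def transform(lines):
    """Map each input row (userid, status) to (userid, number of words in status)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows instead of failing the whole task
        userid, status = fields[0], fields[1]
        yield f"{userid}\t{len(status.split())}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming script would call: read rows, write rows."""
    for out in transform(stdin):
        stdout.write(out + "\n")
```

Keeping the row logic in transform() and the I/O in main() makes the script easy to test without a cluster.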
Example: Facebook Status
Status updates are stored in flat files in an NFS directory
/logs/status_updates
This data is loaded on a daily basis to a Hive table:
status_updates(userid int, status string, ds string)
Using:
LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2013-05-26')
Detailed profile information, such as gender and academic institution, is
present in the table: profiles(userid int, school string, gender int)
Example: Facebook Status (2)
Query to work out the frequency of status updates based on gender and
academic institution:
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND
          a.ds='2013-05-26')
     ) subq1
INSERT OVERWRITE TABLE gender_summary
    PARTITION(ds='2013-05-26')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
    PARTITION(ds='2013-05-26')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school
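To make the semantics of the multi-table insert concrete, the following Python sketch mimics the query on in-memory lists: the join is computed once and its rows feed both aggregations. It only models the logic and does not reflect how Hive executes the query. The function name and the dictionary-based row format are assumptions for the example.

```python
from collections import Counter


def multi_table_insert(status_updates, profiles, ds):
    """Join once, then feed the same joined rows to two GROUP BY counts.

    Assumes userid is unique in profiles, so each update matches at most
    one profile row.
    """
    profile_by_user = {p["userid"]: p for p in profiles}
    gender_summary, school_summary = Counter(), Counter()
    for row in status_updates:
        if row["ds"] != ds:
            continue  # partition filter: a.ds = '<ds>'
        profile = profile_by_user.get(row["userid"])
        if profile is None:
            continue  # inner join: updates without a profile are dropped
        gender_summary[profile["gender"]] += 1  # COUNT(1) GROUP BY gender
        school_summary[profile["school"]] += 1  # COUNT(1) GROUP BY school
    return dict(gender_summary), dict(school_summary)
```

The point of the single statement is visible in the loop: the input is scanned and joined once, and each joined row updates both summaries.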
Metastore
Similar to the metastore maintained by traditional warehousing
solutions such as Oracle and IBM DB2 (this distinguishes Hive from Pig and
Cascading, which have no such store)
Stored in either a traditional DB such as MySQL or an FS such as NFS
Contains the following objects:
Database: namespace for tables
Table: metadata for a table including columns and their types, owner,
storage, and serialization information
Partition: metadata for a partition; similar to the information for a table
2 HadoopDB
Introduction
Two options for data analytics on shared nothing clusters:
1 Parallel databases, such as Teradata and Oracle, but they:
Assume that failures are a rare event
Assume that hardware is homogeneous
Have never been tested in deployments with more than a few dozen nodes
2 MapReduce, but it:
Has all the shortcomings pointed out by DeWitt and Stonebraker, as discussed
before
Is at times an order of magnitude slower than parallel DBs
Hybrid
Combine the scalability and non-existent monetary cost of MapReduce
with the performance of parallel DBs
HadoopDB is such a hybrid
Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and
interface-level hybrids, HadoopDB is a systems-level hybrid
Uses MapReduce as the communication layer atop a cluster of nodes
running single-node DBMS instances
PostgreSQL as the database layer, Hadoop as the communication
layer, and Hive as the translation layer
Commercialized through the startup Hadapt [2]
[2] http://hadapt.com/
HadoopDB
Consists of four components:
1 Database Connector: Interface between per-node database systems
and Hadoop TaskTrackers
2 Catalog: Meta-information about per-node databases
3 Data Loader: Data partitioning across single-node databases
4 SQL to MapReduce to SQL (SMS) Planner: Translation between
SQL and MapReduce
HadoopDB Architecture
[Figure: HadoopDB architecture diagram]
Database Connector
Uses the Java Database Connectivity (JDBC)-compliant Hadoop
InputFormat
The MapReduce job passes the SQL query and other information to the
connector
The connector connects to the DB, executes the SQL query, and
returns results in the form of key/value pairs
Hadoop in essence sees the DB as just another data source
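The connector's job of running a query and handing rows back as key/value pairs can be sketched as follows. This is a stand-in, not HadoopDB code: it uses Python's built-in sqlite3 module in place of PostgreSQL over JDBC. The function names, the sample table contents, and the choice of the first column as the key are assumptions for the example.

```python
import sqlite3


def run_query_as_pairs(conn, sql, params=()):
    """Execute sql and yield (key, value) pairs.

    The first result column is used as the key and the remaining columns
    as the value tuple; this split is an assumption of the sketch.
    """
    cursor = conn.execute(sql, params)
    for row in cursor:
        yield row[0], tuple(row[1:])


def demo():
    """Create a tiny node-local database and read it the way a map task might."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE profiles (userid INTEGER, school TEXT, gender INTEGER)")
    conn.executemany(
        "INSERT INTO profiles VALUES (?, ?, ?)",
        [(1, "A", 0), (2, "B", 1)],  # sample rows, made up for illustration
    )
    pairs = list(run_query_as_pairs(
        conn, "SELECT userid, school FROM profiles ORDER BY userid"))
    conn.close()
    return pairs
```

From the caller's point of view the database is just an iterator of records, which is the sense in which Hadoop sees it as another data source.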
Catalog
Contains information, such as:
1 Connection parameters, such as DB location, format, and any
credentials
2 Metadata about the datasets, replica locations, and partitioning scheme
Stored as an XML file on the HDFS
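A catalog of this kind might look like the sketch below, which parses a small XML document into a per-node dictionary. The element and attribute names (catalog, node, partition, host, url, driver) are invented for illustration. The slides do not give HadoopDB's actual catalog schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout; only the kinds of information come from the slides.
CATALOG_XML = """
<catalog>
  <node host="node1" driver="org.postgresql.Driver" url="jdbc:postgresql://node1/db0">
    <partition table="status_updates" id="0"/>
  </node>
  <node host="node2" driver="org.postgresql.Driver" url="jdbc:postgresql://node2/db0">
    <partition table="status_updates" id="1"/>
    <partition table="profiles" id="0"/>
  </node>
</catalog>
"""


def load_catalog(xml_text):
    """Return {host: {"url", "driver", "partitions"}} from the XML text."""
    root = ET.fromstring(xml_text.strip())
    catalog = {}
    for node in root.findall("node"):
        catalog[node.get("host")] = {
            "url": node.get("url"),        # connection parameter
            "driver": node.get("driver"),  # connection parameter
            "partitions": [
                (p.get("table"), int(p.get("id")))  # dataset metadata
                for p in node.findall("partition")
            ],
        }
    return catalog
```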
Data Loader
Consists of two key components:
1 Global Hasher: Executes a custom Hadoop job to repartition raw data
files from the HDFS into n parts, where n is the number of nodes in the
cluster
2 Local Hasher: Copies a partition from the HDFS to the node-local DB
of each node and further partitions it into smaller size chunks
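The two hashing stages can be sketched in Python as below. This is a simplified single-process model. The CRC32 hash, the salt strings, and the fixed number of chunks per node are assumptions for the example rather than details of HadoopDB's loader.

```python
import zlib


def stable_hash(key, salt=""):
    """Deterministic hash of a key; the salt keeps the two stages independent."""
    return zlib.crc32((salt + str(key)).encode("utf-8"))


def global_hash(rows, key_fn, n_nodes):
    """Global Hasher: split the raw rows into n_nodes parts, one per node."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[stable_hash(key_fn(row), "global") % n_nodes].append(row)
    return parts


def local_hash(partition, key_fn, n_chunks):
    """Local Hasher: split one node's part further into smaller chunks."""
    chunks = [[] for _ in range(n_chunks)]
    for row in partition:
        chunks[stable_hash(key_fn(row), "local") % n_chunks].append(row)
    return chunks
```

Using a different salt in the second stage avoids reusing the same hash bits that already selected the node, which could otherwise leave some chunks empty when the two counts share a factor.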
SQL to MapReduce to SQL (SMS) Planner
Extends HiveQL in two key ways:
1 Before query execution, the Hive Metastore is updated with references
to HadoopDB tables, table schemas, formats, and serialization
information
2 All operators with partitioning keys similar to the node-local database
are converted into SQL queries and pushed to the database layer
3 nCluster
Introduction
The declarative nature of SQL is too limiting for describing most big
data computation
The underlying subsystems are also suboptimal as they do not
consider domain-specific optimizations
nCluster makes use of SQL/MR, a framework that inserts user-defined
functions written in any programming language into SQL queries
By itself, nCluster is a shared-nothing parallel database geared
towards analytic workloads
Originally designed by Aster Data Systems, which was later acquired by
Teradata
Used by Barnes and Noble, LinkedIn, SAS, etc.
SQL/MR Functions
Dynamically polymorphic: input and output schemas are decided at
runtime
Parallelizable across cores and machines
Composable because their input and output behaviour is identical to
SQL subqueries
Amenable to static and dynamic optimizations just like SQL subqueries
or a relation
Can be implemented in a number of languages including Java, C#,
C++, Python, etc. and can thus make use of third-party libraries
Executed within separate processes to provide sandboxing and resource
allocation
Syntax
SELECT ...
FROM functionname(
    ON table-or-query
    [PARTITION BY expr, ...]
    [ORDER BY expr, ...]
    [clausename(arg, ...) ...]
)
...
The SQL/MR function appears in the FROM clause
ON, the only required clause, specifies the input to the function
PARTITION BY partitions the input to the function on one or more
attributes from the schema
Syntax (2)
ORDER BY sorts the input to the function and can only be used after a
PARTITION BY clause
Any number of custom clauses can also be defined whose names and
arguments are passed as a key/value map to the function
SQL/MR functions are implemented as relations, so they are easily nestable
Execution Model
Functions are equivalent to either map (row function) or reduce
(partition function) functions
As in MapReduce, these functions are executed in parallel across many
nodes and machines
Contracts are identical to those of MapReduce functions:
Each row of the input table is operated on by exactly one row function instance
Each group of rows defined by the PARTITION BY clause is operated on by
exactly one partition function instance, in the order specified by the ORDER BY
clause
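The two contracts can be illustrated with a small sequential Python model. A real system would run these calls in parallel on many machines. The helper names and the tuple-based row format are assumptions for the sketch.

```python
from itertools import groupby


def run_row_function(rows, row_fn):
    """Row function (map): every input row is seen by exactly one call."""
    out = []
    for row in rows:
        out.extend(row_fn(row))
    return out


def run_partition_function(rows, partition_key, order_key, partition_fn):
    """Partition function (reduce): one call per PARTITION BY group,
    with the group's rows delivered in ORDER BY order."""
    out = []
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for key, group in groupby(ordered, key=partition_key):
        out.extend(partition_fn(key, list(group)))
    return out
```

For example, with rows of the form (userid, timestamp, page), partitioning on userid and ordering on timestamp hands each user's page visits to a single call in chronological order.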
Programming Interface
The query planner passes the function a Runtime Contract, which
contains the names and types of the input columns and
the names and values of the argument clauses
The function then completes this contract by filling in the output
schema and making a call to complete()
Row and partition functions are implemented through the
operateOnSomeRows and operateOnPartition methods,
respectively
These methods are passed an iterator over their input rows and an
emitter object for returning output rows to the database
Partition functions can also optionally implement the combiner
interface
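The contract handshake described above might look roughly like the following sketch. The slides do not show the real `RuntimeContract` class, so `SketchContract`, its methods, and `SessionizeInitSketch` are hypothetical stand-ins introduced only for illustration; column types are plain strings.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the Runtime Contract; not the real nCluster class.
final class SketchContract {
    private final Map<String, String> inputColumns;     // name -> type, set by the planner
    private final Map<String, String> argumentClauses;  // clause name -> value
    private final Map<String, String> outputColumns = new LinkedHashMap<>();
    private boolean completed = false;

    SketchContract(Map<String, String> inputColumns, Map<String, String> argumentClauses) {
        this.inputColumns = new LinkedHashMap<>(inputColumns);
        this.argumentClauses = new LinkedHashMap<>(argumentClauses);
    }

    Map<String, String> inputColumns() {
        return Collections.unmodifiableMap(inputColumns);
    }

    String argument(String name) {
        String value = argumentClauses.get(name);
        if (value == null) {
            throw new IllegalArgumentException("missing argument clause: " + name);
        }
        return value;
    }

    void addOutputColumn(String name, String type) {
        if (completed) {
            throw new IllegalStateException("contract already completed");
        }
        outputColumns.put(name, type);
    }

    // The function calls this once it has filled in the output schema.
    void complete() {
        if (outputColumns.isEmpty()) {
            throw new IllegalStateException("output schema not defined");
        }
        completed = true;
    }

    boolean isCompleted() {
        return completed;
    }

    Map<String, String> outputSchema() {
        return Collections.unmodifiableMap(outputColumns);
    }
}

// The constructor plays the role of the function's initializer.
final class SessionizeInitSketch {
    final String timeColumn;
    final int timeoutSeconds;

    SessionizeInitSketch(SketchContract contract) {
        timeColumn = contract.argument("TIMECOLUMN");
        timeoutSeconds = Integer.parseInt(contract.argument("TIMEOUT"));
        if (!contract.inputColumns().containsKey(timeColumn)) {
            throw new IllegalArgumentException("unknown time column: " + timeColumn);
        }
        // Output schema: every input column plus a new session column.
        for (Map.Entry<String, String> column : contract.inputColumns().entrySet()) {
            contract.addOutputColumn(column.getKey(), column.getValue());
        }
        contract.addOutputColumn("session", "int");
        contract.complete();
    }
}
```

Because the output schema is fixed at planning time, the planner can type-check the rest of the query before any row is processed.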
Installation
Functions must be installed before they can be used
Can be supplied as a .zip along with third-party libraries
Install-time examination also enables static analysis of properties, such
as whether a function is a row function or a partition function, whether it
supports combining, etc.
Arbitrary files, such as configuration files and binaries, can also be
installed and are replicated to all workers
Each function is provided with a temporary directory, which is garbage
collected after execution
Architecture
One or more Queen nodes process queries and hash partition them
across Worker nodes
The query planner honours the Runtime Contract with the
function and invokes its initializer (the constructor, in the case of Java)
Functions are executed within the Worker databases as separate
processes for isolation, security, resource allocation, forced
termination, etc.
The worker database implements a “bridge” which manages its
communication with the SQL/MR function
The SQL/MR function process contains a “runner” which manages its
communication with the worker database
Example: Wordcount
SELECT token, COUNT(*)
FROM tokenizer(
    ON input-table
    DELIMITER(' ')
)
GROUP BY token;
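As a rough illustration of what this query computes, the sketch below mimics the `tokenizer` row function and the `GROUP BY token` aggregation in plain Java. The class `WordcountSketch` and its methods are hypothetical; skipping empty tokens is an assumption made here, since the slides do not say how the real tokenizer treats repeated delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory equivalent of the wordcount query; not the nCluster API.
final class WordcountSketch {

    // Row function: one input row (a line of text) in, zero or more token rows out.
    static List<String> tokenize(String row, String delimiter) {
        List<String> tokens = new ArrayList<>();
        for (String token : row.split(Pattern.quote(delimiter))) {
            if (!token.isEmpty()) {  // assumption: empty tokens are dropped
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Plays the role of SELECT token, COUNT(*) ... GROUP BY token.
    static Map<String, Long> countTokens(List<String> rows, String delimiter) {
        Map<String, Long> counts = new TreeMap<>();
        for (String row : rows) {
            for (String token : tokenize(row, delimiter)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

Only the tokenizer is user code in SQL/MR; the counting is ordinary SQL applied to the relation the function returns.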
Example: Clickstream Sessionization
Divide a user’s clicks on a website into sessions
A session includes the user’s clicks within a specified time period
Input (clicks):

Timestamp   User ID
10:00:00    238909
00:58:24    7656
10:00:24    238909
02:30:33    7656
10:01:23    238909
10:02:40    238909

Output (sessionized):

Timestamp   User ID   Session ID
10:00:00    238909    0
10:00:24    238909    0
10:01:23    238909    0
10:02:40    238909    1
00:58:24    7656      0
02:30:33    7656      1
Example: Clickstream Sessionization (2)
SELECT ts, userid, session
FROM sessionize (
    ON clicks
    PARTITION BY userid
    ORDER BY ts
    TIMECOLUMN ('ts')
    TIMEOUT (60)
);
Example: Clickstream Sessionization (3)
public class Sessionize implements PartitionFunction {

    private int timeColumnIndex;
    private int timeout;

    // Initializer: receives the Runtime Contract from the query planner
    public Sessionize(RuntimeContract contract) {
        // Get time column and timeout from contract
        // Define output schema
        contract.complete();
    }

    // Invoked once per partition (here, once per userid)
    public void operateOnPartition(
            PartitionDefinition partition,
            RowIterator inputIterator,
            RowEmitter outputEmitter) {
        // Implement the partition function logic
        // Emit output rows
    }

}
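The partition logic left as comments above could be filled in along the following lines. This is a self-contained sketch that does not use the nCluster classes: `Click`, `SessionizedClick`, and `SessionizeSketch` are hypothetical, timestamps are parsed as seconds since midnight, and a new session is assumed to start when the gap between consecutive clicks exceeds the timeout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row types; the real function would read and emit database rows.
final class Click {
    final int timestampSeconds;  // seconds since midnight
    final String userId;

    Click(String hms, String userId) {
        String[] parts = hms.split(":");
        this.timestampSeconds = Integer.parseInt(parts[0]) * 3600
                + Integer.parseInt(parts[1]) * 60
                + Integer.parseInt(parts[2]);
        this.userId = userId;
    }
}

final class SessionizedClick {
    final Click click;
    final int sessionId;

    SessionizedClick(Click click, int sessionId) {
        this.click = click;
        this.sessionId = sessionId;
    }
}

final class SessionizeSketch {

    // Body of the partition function: input is one user's clicks, already ordered by time.
    static List<SessionizedClick> operateOnPartition(List<Click> partition, int timeoutSeconds) {
        List<SessionizedClick> out = new ArrayList<>();
        int session = 0;
        Integer previous = null;
        for (Click click : partition) {
            // Assumption: a gap strictly greater than the timeout opens a new session.
            if (previous != null && click.timestampSeconds - previous > timeoutSeconds) {
                session++;
            }
            out.add(new SessionizedClick(click, session));
            previous = click.timestampSeconds;
        }
        return out;
    }

    // Stands in for PARTITION BY userid ORDER BY ts, which the database performs.
    static List<SessionizedClick> sessionize(List<Click> clicks, int timeoutSeconds) {
        Map<String, List<Click>> byUser = new LinkedHashMap<>();
        for (Click click : clicks) {
            byUser.computeIfAbsent(click.userId, k -> new ArrayList<>()).add(click);
        }
        List<SessionizedClick> out = new ArrayList<>();
        for (List<Click> partition : byUser.values()) {
            partition.sort(Comparator.comparingInt((Click c) -> c.timestampSeconds));
            out.addAll(operateOnPartition(partition, timeoutSeconds));
        }
        return out;
    }
}
```

With a timeout of 60 seconds, the 77-second gap between 10:01:23 and 10:02:40 is what moves user 238909's last click into a second session.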
Summary
Hive, HadoopDB, and nCluster explore three different points in the design
space
1 Hive uses MapReduce to give DBMS-like functionality
2 HadoopDB uses MapReduce and DBMS side-by-side
3 nCluster implements MapReduce within a DBMS
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

MapReduce and DBMS Hybrids

  • 1. 12: MapReduce and DBMS Hybrids Zubair Nabi zubair.nabi@itu.edu.pk May 26, 2013 Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
  • 2. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 10. Introduction
      • Data warehousing solution built atop Hadoop by Facebook
      • Now an Apache open source project
      • Queries are expressed in SQL-like HiveQL and compiled into map-reduce jobs
      • Also contains a type system for describing RDBMS-like tables
      • A system catalog, the Hive-Metastore, contains schemas and statistics and is used for data exploration and query optimization
      • Stores 2PB of uncompressed data at Facebook and is heavily used for simple summarization, business intelligence, and machine learning, among many other applications [1]
      • Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.
      [1] https://www.facebook.com/note.php?note_id=89508453919
  • 16. Data Model
      • Tables: Similar to RDBMS tables
      • Each table has a corresponding HDFS directory
      • The contents of the table are serialized and stored in files within that directory
      • Serialization can be either system-provided or user-defined
      • Serialization information for each table is also stored in the Hive-Metastore for query optimization
      • Tables can also be defined for data stored in external sources such as HDFS, NFS, and local FS
  • 23. Data Model (2)
      • Partitions: Determine the distribution of data within sub-directories of the main table directory
          • For instance, for a table T stored in /wh/T and partitioned on columns ds and ctry, data with ds value 20090101 and ctry value US will be stored in files within /wh/T/ds=20090101/ctry=US
      • Buckets: Data within partitions is divided into buckets
          • Buckets are calculated based on the hash of a column within the partition
          • Each bucket is stored within a file in the partition directory
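The partition and bucket layout described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `partition_dir` and `bucket_for` are hypothetical, the bucket count is arbitrary, and CRC32 stands in for whatever hash function Hive actually applies to the bucketing column.

```python
import zlib


def partition_dir(table_dir, partition_values):
    """Build the sub-directory that holds one partition of a table.

    partition_values is an ordered list of (column, value) pairs.
    """
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket index from the hash of a column value.

    CRC32 is an assumption made for this sketch, not Hive's own hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# The example from the slide: table T in /wh/T, partitioned on ds and ctry.
path = partition_dir("/wh/T", [("ds", "20090101"), ("ctry", "US")])
print(path)  # /wh/T/ds=20090101/ctry=US

# Rows of that partition would then be spread over bucket files by userid.
print(bucket_for(42, num_buckets=32))
```

Because the partition values are encoded in the directory name, a query that filters on ds and ctry only needs to read the files under the matching directory.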
  • 26. Column Data Types
      • Primitive types: integers, floats, strings, dates, and booleans
      • Nestable collection types: arrays and maps
      • Custom types: user-defined
  • 32. HiveQL
      • Supports select, project, join, aggregate, union all, and sub-queries
      • Tables are created using data definition statements with specific serialization formats, partitioning, and bucketing
      • Data is loaded from external sources and inserted into tables
      • Support for multi-table insert: multiple queries on the same input data using a single HiveQL statement
      • User-defined column transformation and aggregation functions in Java
      • Custom map-reduce scripts written in any language can be embedded
  • 36. Example: Facebook Status
      • Status updates are stored in flat files in an NFS directory, /logs/status_updates
      • This data is loaded on a daily basis into a Hive table: status_updates(userid int, status string, ds string)
      • Using:
          LOAD DATA LOCAL INPATH '/logs/status_updates'
          INTO TABLE status_updates PARTITION (ds='2013-05-26')
      • Detailed profile information, such as gender and academic institution, is present in the table: profiles(userid int, school string, gender int)
  • 38. Example: Facebook Status (2)
      • Query to work out the frequency of status updates based on gender and academic institution:
          FROM (SELECT a.status, b.school, b.gender
                FROM status_updates a JOIN profiles b
                ON (a.userid = b.userid AND
                    a.ds = '2013-05-26')
               ) subq1
          INSERT OVERWRITE TABLE gender_summary
            PARTITION (ds = '2013-05-26')
            SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
          INSERT OVERWRITE TABLE school_summary
            PARTITION (ds = '2013-05-26')
            SELECT subq1.school, COUNT(1) GROUP BY subq1.school
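To make the multi-table insert easier to follow, here is a rough Python sketch of what the query computes: one pass over the joined rows feeds two independent group-by counts. The sample rows, school names, and the `summarize` helper are invented for illustration; this mirrors the query's logic only and says nothing about how Hive plans or executes it.

```python
from collections import Counter

# Made-up sample data in the shape of the two tables from the slides.
status_updates = [  # (userid, status, ds)
    (1, "hello", "2013-05-26"),
    (2, "at work", "2013-05-26"),
    (1, "lunch", "2013-05-26"),
    (3, "old post", "2013-05-25"),
]
profiles = {  # userid -> (school, gender)
    1: ("school_a", 0),
    2: ("school_b", 1),
    3: ("school_a", 1),
}


def summarize(updates, profiles, ds):
    """Join updates to profiles for one day and count by gender and by school."""
    gender_summary, school_summary = Counter(), Counter()
    for userid, _status, row_ds in updates:
        # Inner join on userid, restricted to the requested partition.
        if row_ds != ds or userid not in profiles:
            continue
        school, gender = profiles[userid]
        # Both aggregates are updated from the same joined row,
        # which is the point of the multi-table insert.
        gender_summary[gender] += 1
        school_summary[school] += 1
    return dict(gender_summary), dict(school_summary)


by_gender, by_school = summarize(status_updates, profiles, "2013-05-26")
print(by_gender)  # {0: 2, 1: 1}
print(by_school)  # {'school_a': 2, 'school_b': 1}
```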
  • 43. Metastore
      • Similar to the metastore maintained by traditional warehousing solutions such as Oracle and IBM DB2 (distinguishes Hive from Pig or Cascading, which have no such store)
      • Stored in either a traditional DB such as MySQL or an FS such as NFS
      • Contains the following objects:
          • Database: namespace for tables
          • Table: metadata for a table including columns and their types, owner, storage, and serialization information
          • Partition: metadata for a partition; similar to the information for a table
  • 44. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 51. Introduction
      • Two options for data analytics on shared-nothing clusters:
          1. Parallel databases, such as Teradata and Oracle, but these:
              • Assume that failures are a rare event
              • Assume that hardware is homogeneous
              • Have never been tested in deployments with more than a few dozen nodes
          2. MapReduce, but it:
              • Has all the shortcomings pointed out by DeWitt and Stonebraker, as discussed before
              • Is at times an order of magnitude slower than parallel DBs
  • 57. Hybrid
      • Combine the scalability and non-existent monetary cost of MapReduce with the performance of parallel DBs
      • HadoopDB is such a hybrid
      • Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and interface-level hybrids, HadoopDB is a systems-level hybrid
      • Uses MapReduce as the communication layer atop a cluster of nodes running single-node DBMS instances
      • PostgreSQL as the database layer, Hadoop as the communication layer, and Hive as the translation layer
      • Commercialized through the startup Hadapt [2]
      [2] http://hadapt.com/
  • 61. HadoopDB
      • Consists of four components:
          1. Database Connector: Interface between per-node database systems and Hadoop TaskTrackers
          2. Catalog: Meta-information about per-node databases
          3. Data Loader: Data partitioning across single-node databases
          4. SQL to MapReduce to SQL (SMS) Planner: Translation between SQL and MapReduce
  • 62. HadoopDB Architecture
  • 66. Database Connector
      • Uses the Java Database Connectivity (JDBC)-compliant Hadoop InputFormat
      • The MapReduce job passes the SQL query and other information to the connector
      • The connector connects to the DB, executes the SQL query, and returns results in the form of key/value pairs
      • Hadoop in essence sees the DB as just another data source
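The connector's role (take a SQL query, run it on the node-local database, hand back key/value pairs) can be sketched as below. This is a loose Python analogy, not HadoopDB code: `sqlite3` stands in for the per-node PostgreSQL instance reached over JDBC, the function name `run_on_node` is hypothetical, and treating the first result column as the key is an assumption made for the example.

```python
import sqlite3


def run_on_node(conn, sql, params=()):
    """Execute a SQL query on one node-local DB and yield (key, value) pairs.

    The first column of each row is used as the key and the remaining
    columns as the value; that split is an assumption of this sketch.
    """
    cursor = conn.execute(sql, params)
    for row in cursor:
        yield row[0], row[1:]


# An in-memory database plays the part of a single node's local DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (userid INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [(1, "hello"), (2, "at work")],
)

# The "job" supplies the query; downstream map code only ever sees pairs.
pairs = list(
    run_on_node(conn, "SELECT userid, status FROM status_updates ORDER BY userid")
)
print(pairs)  # [(1, ('hello',)), (2, ('at work',))]
```

Seen this way, the map tasks do not need to know a database is involved at all, which is what lets Hadoop treat it as just another data source.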
  • 69. Catalog
      • Contains information such as:
          1. Connection parameters, such as DB location, format, and any credentials
          2. Metadata about the datasets, replica locations, and partitioning scheme
      • Stored as an XML file on the HDFS
  • 71. Data Loader
      • Consists of two key components:
          1. Global Hasher: Executes a custom Hadoop job to repartition raw data files from the HDFS into n parts, where n is the number of nodes in the cluster
          2. Local Hasher: Copies a partition from the HDFS to the node-local DB of each node and further partitions it into smaller chunks
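The two-level partitioning performed by the Data Loader can be illustrated with a small Python sketch. It assumes hash partitioning on a single key at both levels; the function names, the CRC32 hash, the salt used for the local level, and the chunk count are all choices made for this example rather than details taken from HadoopDB.

```python
import zlib
from collections import defaultdict


def stable_hash(key):
    """Deterministic hash of a key (CRC32 is a stand-in for this sketch)."""
    return zlib.crc32(str(key).encode("utf-8"))


def global_hash(records, key_of, num_nodes):
    """Split all records into num_nodes parts, one per cluster node."""
    parts = defaultdict(list)
    for rec in records:
        parts[stable_hash(key_of(rec)) % num_nodes].append(rec)
    return parts


def local_hash(partition, key_of, num_chunks):
    """Split one node's partition further into smaller chunks.

    The key is salted so the chunk index is not correlated with the
    node index chosen by global_hash.
    """
    chunks = defaultdict(list)
    for rec in partition:
        chunks[stable_hash(f"local:{key_of(rec)}") % num_chunks].append(rec)
    return chunks


def userid_of(rec):
    return rec[0]


# Made-up records: (userid, payload).
records = [(i, f"row{i}") for i in range(100)]
parts = global_hash(records, userid_of, num_nodes=4)
node_chunks = {
    node: local_hash(part, userid_of, num_chunks=3) for node, part in parts.items()
}
print({node: len(part) for node, part in sorted(parts.items())})
```

Every record lands in exactly one chunk on exactly one node, and records with the same key always land together, which is what later allows joins and aggregations on that key to be pushed down to the node-local databases.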
  • 73. SQL to MapReduce to SQL (SMS) Planner
      • Extends HiveQL in two key ways:
          1. Before query execution, the Hive Metastore is updated with references to HadoopDB tables, table schemas, formats, and serialization information
          2. All operators with partitioning keys similar to the node-local database are converted into SQL queries and pushed to the database layer
  • 74. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 75. Introduction The declarative nature of SQL is too limiting for describing most big data computation Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 22 / 37
  • 76. Introduction The declarative nature of SQL is too limiting for describing most big data computation The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 22 / 37
  • 77. Introduction The declarative nature of SQL is too limiting for describing most big data computation The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations nCluster makes use of SQL/MR, a framework that inserts user-defined functions in any programming language into SQL queries Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 22 / 37
  • 78. Introduction The declarative nature of SQL is too limiting for describing most big data computation The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations nCluster makes use of SQL/MR, a framework that inserts user-defined functions in any programming language into SQL queries By itself, nCluster is a shared-nothing parallel database geared towards analytic workloads Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 22 / 37
  • 79. Introduction The declarative nature of SQL is too limiting for describing most big data computation The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations nCluster makes use of SQL/MR, a framework that inserts user-defined functions in any programming language into SQL queries By itself, nCluster is a shared-nothing parallel database geared towards analytic workloads Originally designed by Aster Data Systems and later acquired by Teradata Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 22 / 37
  • 80. Introduction
      - The declarative nature of SQL is too limiting for describing most big data computations
      - The underlying subsystems are also suboptimal, as they do not consider domain-specific optimizations
      - nCluster makes use of SQL/MR, a framework for embedding user-defined functions, written in any programming language, in SQL queries
      - By itself, nCluster is a shared-nothing parallel database geared towards analytic workloads
      - Originally developed by Aster Data Systems, which was later acquired by Teradata
      - Used by Barnes and Noble, LinkedIn, SAS, etc.
  • 86. SQL/MR Functions
      - Dynamically polymorphic: input and output schemas are decided at runtime
      - Parallelizable across cores and machines
      - Composable, because their input and output behaviour is identical to that of SQL subqueries
      - Amenable to static and dynamic optimizations, just like a SQL subquery or a relation
      - Can be implemented in a number of languages, including Java, C#, C++, and Python, and can thus make use of third-party libraries
      - Executed within separate processes to provide sandboxing and resource allocation
  • 89. Syntax
        SELECT ...
        FROM functionname(
          ON table-or-query
          [PARTITION BY expr, ...]
          [ORDER BY expr, ...]
          [clausename(arg, ...) ...]
        )
        ...
      - The SQL/MR function appears in the FROM clause
      - ON is the only required clause; it specifies the input to the function
      - PARTITION BY partitions the input to the function on one or more attributes from the schema
  • 92. Syntax (2)
      - ORDER BY sorts the input to the function and can only be used after a PARTITION BY clause
      - Any number of custom clauses can also be defined; their names and arguments are passed to the function as a key/value map
      - SQL/MR functions are implemented as relations, so they nest easily
  • 96. Execution Model
      - Functions are equivalent to either map functions (row functions) or reduce functions (partition functions)
      - As in MapReduce, these functions are executed across many nodes and machines
      - Their contracts are identical to those of MapReduce functions:
          - Exactly one row function invocation operates on each row of the input table
          - Exactly one partition function invocation operates on each group of rows defined by the PARTITION BY clause, in the order specified by the ORDER BY clause
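The two contracts above can be illustrated with a small single-process simulation. This is only a sketch in plain Java: the class and method names (`ExecutionModelSketch`, `applyRowFunction`, `applyPartitionFunction`) are made up for illustration and are not part of nCluster, and the real system would run the invocations in parallel across workers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: simulates the row/partition contracts in one process.
final class ExecutionModelSketch {

    // Row function contract (map-like): one invocation per input row,
    // independent of every other row.
    static <R, O> List<O> applyRowFunction(List<R> rows, Function<R, List<O>> rowFn) {
        List<O> out = new ArrayList<>();
        for (R row : rows) {
            out.addAll(rowFn.apply(row));
        }
        return out;
    }

    // Partition function contract (reduce-like): rows are grouped by the
    // PARTITION BY key, each group is sorted by the ORDER BY comparator,
    // and the function is invoked exactly once per group.
    static <R, K, O> List<O> applyPartitionFunction(
            List<R> rows,
            Function<R, K> partitionBy,
            Comparator<R> orderBy,
            Function<List<R>, List<O>> partitionFn) {
        Map<K, List<R>> groups = new LinkedHashMap<>();
        for (R row : rows) {
            groups.computeIfAbsent(partitionBy.apply(row), k -> new ArrayList<>()).add(row);
        }
        List<O> out = new ArrayList<>();
        for (List<R> group : groups.values()) {
            group.sort(orderBy);
            out.addAll(partitionFn.apply(group));
        }
        return out;
    }
}
```

Because a row function never sees more than one row, its invocations could be scheduled on any worker; a partition function needs all rows of its group in one place, which is why the input has to be partitioned first.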
  • 101. Programming Interface
      - The query planner passes the function a Runtime Contract, which contains the names and types of the input columns and the names and values of the argument clauses
      - The function then completes this contract by filling in the output schema and calling complete()
      - Row and partition functions are implemented through the operateOnSomeRows and operateOnPartition methods, respectively
      - These methods are passed an iterator over their input rows and an emitter object for returning output rows to the database
      - Partition functions can also optionally implement the combiner interface
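The contract handshake described above might look roughly like the following. This is a sketch under stated assumptions: the `RuntimeContract` class here is a hypothetical stand-in written for this example (the real nCluster class and its method names are not shown in the slides, apart from complete()), and `SessionizeSetup`, the clause keys, and the `"integer"` type label are likewise invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ContractSketch {

    /** Hypothetical stand-in for the Runtime Contract (not the real nCluster class). */
    static final class RuntimeContract {
        private final Map<String, String> inputSchema;    // column name -> type, in column order
        private final Map<String, List<String>> clauses;  // custom clause name -> argument values
        private Map<String, String> outputSchema;
        private boolean completed;

        RuntimeContract(Map<String, String> inputSchema, Map<String, List<String>> clauses) {
            this.inputSchema = new LinkedHashMap<>(inputSchema);
            this.clauses = clauses;
        }

        Map<String, String> getInputSchema() { return inputSchema; }
        List<String> getClause(String name) { return clauses.get(name); }
        void setOutputSchema(Map<String, String> schema) { outputSchema = schema; }
        Map<String, String> getOutputSchema() { return outputSchema; }
        boolean isCompleted() { return completed; }

        /** Called by the function once it has filled in its output schema. */
        void complete() {
            if (outputSchema == null) {
                throw new IllegalStateException("output schema not set");
            }
            completed = true;
        }
    }

    /** Hypothetical function setup: reads clauses, declares output, completes the contract. */
    static final class SessionizeSetup {
        final int timeColumnIndex;
        final int timeoutSeconds;

        SessionizeSetup(RuntimeContract contract) {
            String timeColumn = contract.getClause("TIMECOLUMN").get(0);
            List<String> columns = new ArrayList<>(contract.getInputSchema().keySet());
            timeColumnIndex = columns.indexOf(timeColumn);
            if (timeColumnIndex < 0) {
                throw new IllegalArgumentException("unknown time column: " + timeColumn);
            }
            timeoutSeconds = Integer.parseInt(contract.getClause("TIMEOUT").get(0));

            // Output schema = input columns plus one new column, decided at run time.
            Map<String, String> output = new LinkedHashMap<>(contract.getInputSchema());
            output.put("session", "integer");
            contract.setOutputSchema(output);
            contract.complete();
        }
    }
}
```

Deriving the output schema from whatever input schema arrives is what makes such a function dynamically polymorphic: the same code can be applied to any table that has the named time column.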
  • 106. Installation
      - Functions must be installed before they can be used
      - Can be supplied as a .zip along with third-party libraries
      - Install-time examination also enables static analysis of a function's properties, such as whether it is a row function or a partition function and whether it supports combining
      - Arbitrary files, such as configuration files and binaries, can also be installed; they are replicated to all workers
      - Each function is provided with a temporary directory, which is garbage collected after execution
  • 111. Architecture
      - One or more Queen nodes process queries and hash partition them across Worker nodes
      - The query planner honours the Runtime Contract with the function and invokes its initializer (the constructor in the case of Java)
      - Functions are executed within the Worker databases as separate processes for isolation, security, resource allocation, forced termination, etc.
      - The worker database implements a "bridge" which manages its communication with the SQL/MR function
      - The SQL/MR function process contains a "runner" which manages its communication with the worker database
  • 112. Architecture (2)
  • 113. Example: Wordcount
        SELECT token, COUNT(*)
        FROM tokenizer(
          ON input-table
          DELIMITER(' ')
        )
        GROUP BY token;
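The slides do not show the body of the tokenizer function, so the following is only a guess at what such a row function computes, written as plain Java with the GROUP BY/COUNT(*) step emulated in memory. The class name `WordcountSketch` and both method names are invented for this sketch; dropping empty tokens is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class WordcountSketch {

    // Row function: split one input row on the delimiter and emit one
    // output row per non-empty token.
    static List<String> tokenize(String row, String delimiter) {
        List<String> tokens = new ArrayList<>();
        for (String token : row.split(Pattern.quote(delimiter))) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for the surrounding SQL: GROUP BY token with COUNT(*).
    static Map<String, Long> wordcount(List<String> rows, String delimiter) {
        Map<String, Long> counts = new TreeMap<>();
        for (String row : rows) {
            for (String token : tokenize(row, delimiter)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

In the query above only the tokenizing step is user code; the aggregation stays in ordinary SQL, which is the point of placing the function in the FROM clause.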
  • 116. Example: Clickstream Sessionization
      - Divide a user's clicks on a website into sessions
      - A session includes the user's clicks within a specified time period

      Input:
        Timestamp   User ID
        10:00:00    238909
        00:58:24    7656
        10:00:24    238909
        02:30:33    7656
        10:01:23    238909
        10:02:40    238909

      Output:
        Timestamp   User ID   Session ID
        10:00:00    238909    0
        10:00:24    238909    0
        10:01:23    238909    0
        10:02:40    238909    1
        00:58:24    7656      0
        02:30:33    7656      1
  • 117. Example: Clickstream Sessionization (2)
        SELECT ts, userid, session
        FROM sessionize (
          ON clicks
          PARTITION BY userid
          ORDER BY ts
          TIMECOLUMN ('ts')
          TIMEOUT (60)
        );
  • 118. Example: Clickstream Sessionization (3)
        public class Sessionize implements PartitionFunction {

            private int timeColumnIndex;
            private int timeout;

            public Sessionize(RuntimeContract contract) {
                // Get time column and timeout from contract
                // Define output schema
                contract.complete();
            }

            public void operateOnPartition(
                    PartitionDefinition partition,
                    RowIterator inputIterator,
                    RowEmitter outputEmitter) {
                // Implement the partition function logic
                // Emit output rows
            }

        }
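The partition function logic left as a comment in the skeleton could be filled in along these lines. This is a standalone sketch that does not use the nCluster API: `SessionizeSketch` and `assignSessions` are invented names, timestamps are assumed to be `HH:mm:ss` strings within a single day, and a new session is assumed to start when the gap to the previous click exceeds the timeout.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SessionizeSketch {

    // Input: the clicks of ONE user (one partition), already sorted by time.
    // Output: one session ID per click, starting at 0.
    static List<Integer> assignSessions(List<String> orderedTimestamps, int timeoutSeconds) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        int previous = -1; // second-of-day of the previous click; -1 means none yet
        for (String ts : orderedTimestamps) {
            int current = LocalTime.parse(ts).toSecondOfDay();
            // Assumption: a gap strictly larger than the timeout opens a new session.
            if (previous >= 0 && current - previous > timeoutSeconds) {
                session++;
            }
            sessions.add(session);
            previous = current;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Clicks of user 238909 from the example table, with TIMEOUT(60).
        System.out.println(assignSessions(
                Arrays.asList("10:00:00", "10:00:24", "10:01:23", "10:02:40"), 60));
        // prints [0, 0, 0, 1]
    }
}
```

The function only works because of the query's PARTITION BY userid and ORDER BY ts clauses: it relies on seeing a single user's clicks, in time order, within one invocation.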
  • 119. Outline
      1. Hive
      2. HadoopDB
      3. nCluster
      4. Summary
  • 123. Summary
      - Hive, HadoopDB, and nCluster explore three different points in the design space:
          1. Hive uses MapReduce to give DBMS-like functionality
          2. HadoopDB uses MapReduce and DBMS side-by-side
          3. nCluster implements MapReduce within a DBMS
  • 124. References
      1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629.
      2. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. 2009. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2, 1 (August 2009), 922-933.
      3. Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2 (August 2009), 1402-1413.