Big Data MapReduce vs. RDBMS Arjen P. de Vries [email_address] Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.
Context Business ‘best practices’: decisions based on data and hard facts rather than instinct and theory MapReduce, though originally designed for text processing, is increasingly (ab)used for structured data, at tremendous scale Hadoop is used to manage Facebook’s 2.5-petabyte data warehouse
Shared-nothing Architecture A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network Possible trade-off: large number of low-end servers instead of small number of high-end ones
@CWI – 2011
 
Programming Model Parallel DBMS Claimed best at ad-hoc analytical queries Substantially faster once data is loaded, but loading the data takes considerably longer Who wants to program parallel joins etc.?! MapReduce Well suited for extract-transform-load tasks Ease of use for complex analytics tasks Hybrid Best of both worlds?
Parallel DBMS Horizontal partitioning of relational tables Partitioned execution of SQL operators Select, aggregate, join, project and update New shuffle operator; dynamically repartition rows of intermediate results (usually by hash) Partitioning techniques: Hash, range, round-robin
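The three partitioning techniques named above can be illustrated with a small in-memory sketch (toy (key, value) rows and a fixed node count; not the code of any particular DBMS):

```python
def hash_partition(rows, n_nodes):
    """Assign each row to a node by hashing its key; equal keys co-locate."""
    parts = [[] for _ in range(n_nodes)]
    for key, value in rows:
        parts[hash(key) % n_nodes].append((key, value))
    return parts

def range_partition(rows, boundaries):
    """Assign rows by comparing the key against sorted range boundaries."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, value in rows:
        node = sum(1 for b in boundaries if key >= b)
        parts[node].append((key, value))
    return parts

def round_robin_partition(rows, n_nodes):
    """Deal rows to nodes in turn; balances load but ignores keys."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts
```

Hash partitioning co-locates equal keys (useful for joins and GROUP BY), range partitioning supports range predicates, and round-robin only balances load.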
Parallel DBMS DBMS automatically manages the various alternative strategies for the tables involved in the query, transparently to user and application program
Parallel DBMS Many Map and Reduce operations can be expressed as plain SQL; the reshuffle between Map and Reduce is equivalent to a GROUP BY operation in SQL Map operations not easily expressed in SQL can be implemented as UDFs (not trivial) Reduce operations not easily expressed in SQL can be implemented as user-defined aggregates (even less trivial)
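The shuffle-equals-GROUP-BY observation is easy to see in a minimal in-memory MapReduce sketch (hypothetical emp table; the reduce is a SUM aggregate per group):

```python
from collections import defaultdict

def map_reduce(rows, map_fn, reduce_fn):
    """Minimal in-memory MapReduce: map, shuffle by key, reduce per group."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)   # the shuffle: group values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# SELECT dept, SUM(salary) FROM emp GROUP BY dept, as Map + Reduce:
emp = [("sales", 10), ("eng", 20), ("sales", 5)]
result = map_reduce(emp,
                    map_fn=lambda row: [(row[0], row[1])],
                    reduce_fn=lambda key, values: sum(values))
# result == {"sales": 15, "eng": 20}
```

The grouping step plays the role of the shuffle; the per-key reduce is the aggregate of the GROUP BY.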
Comparison (on 100-node cluster) http://database.cs.brown.edu/projects/mapreduce-vs-dbms/

          Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
Grep      284s     194s     108s      1.5             2.6
Web Log   >1Ks     740s     268s      1.6             4.3
Join      >1Ks     32s      55s       36.3            21
Details Comparison Study Avoid repetitive record parsing Default HDFS setting stores data in the textual format in which it was generated Even when using SequenceFiles, user code is necessary to parse out multiple attributes Don’t write intermediate results to disk Push vs. pull model Note: DBMSs now offer ‘restart operators’ to improve fault tolerance; trade-off between runtime penalty and amount of work lost on failure
Details Comparison Study Column-oriented storage Compression Column-stores have focused on cheap (and even no) de-compression (some operations executed in compressed domains) Column-by-column compression may achieve better compression (?)
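As an illustration of operating in the compressed domain, here is a minimal sketch of run-length encoding, one of the lightweight schemes column-stores favor (toy data; not the actual scheme of Vertica or Vectorwise): a SUM can be computed directly on the runs, without decompressing.

```python
def rle_encode(column):
    """Run-length encode a sorted/clustered column as [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def rle_sum(runs):
    """Aggregate directly on the compressed form: sum = Σ value * run_length."""
    return sum(v * n for v, n in runs)
```

RLE works best on columns that are sorted or heavily clustered, which is exactly the column-by-column situation the slide alludes to.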
Parallel DBMS Bad “out-of-the-box” experience Reported difference in execution time from days to minutes, after tuning by the vendor However, tuning Hadoop to maximum performance is also an arduous task! Big difference in cost of loading data Performance gains from faster queries offset the upfront costs Most DBMSs don’t work on “in situ” data Only expensive, commercial offerings No low-cost open-source alternative
Ease-of-Use Push programmers to a higher level of abstraction Pig, Hive, … SQL code substantially easier to write than MR code in their study Just for them? Is the benchmark the right model of real-life tasks?
 
Ease-of-Use Getting a MapReduce program up and running generally takes less effort than the Parallel DBMS alternative No schema definition No UDF registration Modifying a MapReduce program afterwards, though…
Parallel DBMS Not used on 100s or 1000s of nodes Assumes a homogeneous array of machines Designed with the assumption that failures are rare events Combine MapReduce’s proven scalability with the Parallel DBMS’s proven efficiency?
Hybrid Solution? HadoopDB: Hadoop as communication layer above multiple nodes running single-node DBMS instances Full open-source solution: PostgreSQL as DB layer Hadoop as communication layer Hive as translation layer … and the rest is “HadoopDB” A shared-nothing version of PostgreSQL as side-effect
Desiderata Performance Fault tolerance Ability to run in heterogeneous environment Slowest compute node should not determine completion time Flexible query interface ODBC/JDBC UDFs
HadoopDB RDBMS Careful layout of data Indexing Sorting Shared I/O, buffer management Compression Query optimization Hadoop Job scheduling Task coordination Parallelization
HadoopDB Database connection JDBC, by extending Hadoop’s InputFormat Catalog XML file in HDFS
Data Loader Globally repartitions data on a given partitioning key upon loading Breaks apart single-node data into multiple smaller partitions (chunks) Bulk-loads chunks into single-node databases Chunk size ~1GB in experiments
Planner (SMS) SQL → MapReduce → SQL Extends Hive All operators in the DAG, bottom-up, until the first repartition operator with a partitioning key different from the database’s key, are converted into one or more SQL queries for that database Takes advantage of the relative quality of the RDBMS query optimizer as opposed to the ‘normal’ Hive query optimizer
SELECT YEAR(saleDate), SUM(revenue) FROM SALES GROUP BY YEAR(saleDate)
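As plain MapReduce (rather than SQL pushed into the per-node databases by SMS), the same query is a Map that emits (YEAR(saleDate), revenue) and a Reduce that sums per year; a minimal in-memory sketch with made-up rows:

```python
from collections import defaultdict

# Toy SALES rows: (saleDate, revenue), with dates as ISO strings.
sales = [("2009-03-01", 10.0), ("2009-07-15", 5.0), ("2010-01-02", 7.5)]

# Map: emit (YEAR(saleDate), revenue).
mapped = [(date[:4], revenue) for date, revenue in sales]

# Shuffle + Reduce: SUM(revenue) per year, i.e. the GROUP BY.
totals = defaultdict(float)
for year, revenue in mapped:
    totals[year] += revenue
# totals == {"2009": 15.0, "2010": 7.5}
```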
Planner (SMS) Join queries Hive assumes tables are never collocated SMS pushes entire join into database layer where possible (i.e., whenever join key matches database partitioning key)
Comparison HadoopDB up to an order of magnitude faster than Hadoop and Hive But… also 10x longer load time (for the join benchmark query, amortized in 1 query) Outperformed by Vertica Even for fault-tolerance tests, in spite of larger slow-down Main performance difference attributed to efficiency of column-store and lack of compression
Hadoop / Hive Shortcomings: Data storage layer No use of hash partitioning on join keys for co-location of related tables No data statistics in the catalog No cost-based optimization Lack of native indexing Most jobs heavy on I/O BTW: Hive is catching up on some of these!
Hadapt Two heuristics to guide optimizations: Maximize single-node DBMS use DBMS processes data at faster rate than Hadoop Minimize # jobs per SQL query Each MapReduce job involves much I/O, both to disk and over network
Two orders of magnitude Three key database ideas at the basis: Column-store relational back-ends Referential partitioning to maximize the number of single-node joins Integrating semi-joins into the Hadoop Map phase
Dutch Database History!!! Vectorwise = MonetDB/X100 – Peter Boncz and Martin Kersten (CWI/UvA) Semi-joins in distributed relational query processing – Peter Apers (UT) Peter M.G. Apers, Alan R. Hevner, S. Bing Yao: Optimization Algorithms for Distributed Queries. IEEE Trans. Software Eng. 9(1):57-68 (1983)
Vectorwise Vectorized operations on in-cache data Directly attacks the memory wall Efficient I/O PFor and PForDelta lightweight compression algorithms – extremely high decompression rates, by design for modern CPUs (including tricks like predication)
Improved Query Plans Join plans including data re-distribution before computing the join Extended Database Connector, giving access to multiple database tables in the Map phase of a single job After repartitioning on the join key, related records are sent to the Reduce phase for the actual join computation
Improved Query Plans Referential Partitioning HadoopDB/Hadapt performs ‘aggressive’ hash-partitioning on foreign-key attributes ~ Jimmy’s secondary (value) sort trick During data load, involves extra step of joining to parent table to enable partitioning
Join in Hadoop Outline algorithm If tables are not already co-partitioned… Mappers read table partitions and emit tuples keyed on the join attribute, re-partitioning the tables Reducers process the tuples with the same join key, i.e., perform the join on that partition BTW… the symmetric hash-join is a UT invention! Wilschut & Apers, Dataflow query execution in a parallel main-memory environment, PDIS 1991
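The outline above, as a minimal in-memory sketch of a reduce-side (repartition) join (toy tables of (join_key, row) pairs; a real Hadoop job would tag each tuple with its table of origin and shuffle over the network):

```python
from collections import defaultdict

def repartition_join(left, right):
    """Reduce-side join: shuffle both tables on the join key, then pair
    rows within each key group (a toy, single-process illustration)."""
    groups = defaultdict(lambda: ([], []))
    for key, row in left:                 # 'map' phase: emit by join key
        groups[key][0].append(row)
    for key, row in right:
        groups[key][1].append(row)
    joined = []
    for key, (ls, rs) in groups.items():  # 'reduce' phase: join per key
        for l in ls:
            for r in rs:
                joined.append((key, l, r))
    return joined
```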
Improved Query Plans Alternatives for the fully partitioned hash-join Directed join Re-partition only one table, when the other argument is already partitioned on the join key Broadcast join Ship the entire smaller table to all nodes holding the larger table
Broadcast & Directed Joins Non-trivial in Hadoop HDFS does not guarantee to maintain co-partitioning between jobs; datasets using same hash may end up on different nodes Requires join in Map phase; hard to do well when multiple passes required (unless both tables already sorted by join key)
Broadcast Join Mapper reads the smaller table from HDFS into an in-memory hash table, followed by a sequential scan of the larger table Map-side join ~ Jimmy’s in-mapper combiner Provided low-cost database support for temporary tables exists, the join can in HadoopDB be pushed into the DBMS for (usually) more efficient execution
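A minimal in-memory sketch of the broadcast (map-side) join described above (toy tables of (join_key, row) pairs; in Hadoop the small table would be shipped to every mapper, e.g. via HDFS):

```python
def broadcast_join(small, large):
    """Map-side join: build an in-memory hash table from the (broadcast)
    smaller table, then stream the larger table past it."""
    lookup = {}
    for key, row in small:                 # one pass over the small table
        lookup.setdefault(key, []).append(row)
    joined = []
    for key, row in large:                 # sequential scan of the large table
        for s in lookup.get(key, []):
            joined.append((key, s, row))
    return joined
```

No shuffle is needed: each mapper can join its slice of the large table locally, which is what makes the strategy attractive when one table is small.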
Directed Join The OutputFormat feature of Hadoop writes output of a repartitioning mapper, reading catalog data for other table, directly into DBMSs, circumventing HDFS
Semi-join Hadoop: Mapper performs selection and projection of the join attribute on the first table The resulting column is replicated for a “map-side join” HadoopDB: If the projected column is small (e.g., a list of countries, …), transform to SELECT … WHERE foreignKey IN (list-of-values) – completely skips the temporary-table costs ~ Jimmy’s stripes
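A minimal sketch of the semi-join idea: project the join column of the small table, then use it to filter the other table (the IN-list rewrite above; toy (key, row) data):

```python
def semi_join_filter(fact_rows, dim_rows):
    """Semi-join: project the join column of one table into a small set,
    then keep only the other table's rows whose key appears in it."""
    keys = {key for key, _ in dim_rows}    # the small projected column
    return [(key, row) for key, row in fact_rows if key in keys]
```

Only the tiny key set travels, not the whole dimension table; in HadoopDB the same set becomes the literal IN (list-of-values) predicate pushed into the per-node databases.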
Results TPC-H 3TB on 45-node cluster Loading time: DBMS-X 33h3m Hive and Hadoop 49m HadoopDB 11h4m (of which 6h42m for referential partitioning) Vectorwise 3h47m (includes clustering index creation)
Results DBMS-X >> Hive Lack of partitioning and indexing Switching HadoopDB from PostgreSQL to Vectorwise results in a factor of 7 improvement on average Generally, the map-side join optimization improves efficiency by a factor of 2 to 3 when using a column-store Semi-join improves by a factor of 2 over the map-side join and a factor of 3.6 over the reduce-side join
Conclusion Hybrid is good MapReduce takes care of the “rack to cluster” scale RDBMS takes care of the within-rack processing Not sure how good it is for text-analytical tasks RDBMSs often have problems with data skew The Hadapt whitepaper suggests they handle unstructured data with MapReduce and structured data with HadoopDB
Conclusion Never a free lunch… If your problem involves non-text data types, consider working with a hybrid solution If your problem involves primarily textual data, the question remains open whether a hybrid will actually be of any help
Information Science “Search for the fundamental knowledge which will allow us to postulate and utilize the most efficient combination of [human and machine] resources” M.E. Senko. Information systems: records, relations, sets, entities, and things. Information Systems, 1(1):3–13, 1975.
References Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?, CACM 53(1), Jan 2010: 64-70 Bajda-Pawlikowski et al., Efficient Processing of Data Warehousing Queries in a Split Execution Environment, SIGMOD 2011 Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009 Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009 Thusoo et al., Data Warehousing and Analytics Infrastructure at Facebook, SIGMOD 2010 Wilschut, Flokstra & Apers, Parallelism in a Main-Memory DBMS: The Performance of PRISMA/DB, VLDB 1992 Wilschut, Apers & Flokstra, Parallel Query Execution in PRISMA/DB, LNCS 503, 1990 Daniel Abadi’s blog, http://dbmsmusings.blogspot.com/
