Big Data: MapReduce vs. RDBMS



SIKS course on Hadoop, discussing the Stonebraker debate, HadoopDB, Hadapt, and RDBMS roots



  1. 1. Big Data MapReduce vs. RDBMS Arjen P. de Vries [email_address] Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.
  2. 2. Context <ul><li>Business ‘best practices’: decisions based on data and hard facts rather than instinct and theory </li></ul><ul><li>MapReduce, though originally designed for text, is more and more “ab”-used for structured data, at tremendous scale </li></ul><ul><ul><li>Hadoop is used to manage Facebook’s 2.5 petabyte data warehouse </li></ul></ul>
  3. 3. Shared-nothing Architecture <ul><li>A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network </li></ul><ul><ul><li>Possible trade-off: large number of low-end servers instead of small number of high-end ones </li></ul></ul>
  4. 4. @CWI – 2011
  5. 6. Programming Model <ul><li>Parallel DBMS </li></ul><ul><ul><li>Claimed best at ad-hoc analytical queries </li></ul></ul><ul><ul><ul><li>Substantially faster once data is loaded, but loading the data takes considerably longer </li></ul></ul></ul><ul><ul><ul><li>Who wants to program parallel joins etc.?! </li></ul></ul></ul><ul><li>Map-Reduce </li></ul><ul><ul><li>Well suited for extract-transform-load tasks </li></ul></ul><ul><ul><ul><li>Ease-of-use for complex analytics tasks </li></ul></ul></ul><ul><li>Hybrid </li></ul><ul><ul><li>Best of both worlds? </li></ul></ul>
  6. 7. Parallel DBMS <ul><li>Horizontal partitioning of relational tables </li></ul><ul><li>Partitioned execution of SQL operators </li></ul><ul><ul><li>Select, aggregate, join, project and update </li></ul></ul><ul><ul><li>New shuffle operator dynamically repartitions rows of intermediate results (usually by hash) </li></ul></ul><ul><li>Partitioning techniques: </li></ul><ul><ul><li>Hash, range, round-robin </li></ul></ul>
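The three partitioning techniques just listed can be sketched in a few lines. This is an illustrative Python sketch, not taken from any particular DBMS; the function names and node numbering are assumptions.

```python
# Sketch of the three partitioning schemes: hash, range, round-robin.
# Nodes are numbered 0..n_nodes-1.

def hash_partition(key, n_nodes):
    """Same key always lands on the same node -- good for joins and group-by."""
    return hash(key) % n_nodes

def range_partition(key, boundaries):
    """`boundaries` are sorted split points; good for range scans."""
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)  # last node takes the tail of the range

def round_robin_partition(row_index, n_nodes):
    """Ignores the value entirely -- perfect load balance, no locality."""
    return row_index % n_nodes
```

Hash partitioning is what the shuffle operator above typically uses; range partitioning helps range predicates; round-robin only balances load.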
  7. 8. Parallel DBMS <ul><li>DBMS automatically manages the various alternative strategies for the tables involved in the query, transparently to user and application program </li></ul>
  8. 9. Parallel DBMS <ul><li>Many Map and Reduce operations can be expressed as plain SQL; the reshuffle between Map and Reduce is equivalent to a GROUP BY operation in SQL </li></ul><ul><li>Map operations not easily expressed in SQL can be implemented as UDFs (not trivial) </li></ul><ul><li>Reduce operations not easily expressed in SQL can be implemented as user-defined aggregates (even less trivial) </li></ul>
  9. 10. Comparison (on 100-node cluster) <table><tr><th>Task</th><th>Hadoop</th><th>DBMS-X</th><th>Vertica</th><th>Hadoop/DBMS-X</th><th>Hadoop/Vertica</th></tr><tr><td>Grep</td><td>284s</td><td>194s</td><td>108s</td><td>1.5</td><td>2.6</td></tr><tr><td>Web Log</td><td>>1Ks</td><td>740s</td><td>268s</td><td>1.6</td><td>4.3</td></tr><tr><td>Join</td><td>>1Ks</td><td>32s</td><td>55s</td><td>36.3</td><td>21</td></tr></table>
  10. 11. Details Comparison Study <ul><li>Avoid repetitive record parsing </li></ul><ul><ul><li>By default, HDFS stores data in the textual format in which it was generated </li></ul></ul><ul><ul><li>Even when using SequenceFiles, user code is necessary to parse out multiple attributes </li></ul></ul><ul><li>Don’t write intermediate results to disk </li></ul><ul><ul><li>Push vs. pull model </li></ul></ul><ul><ul><li>Note: </li></ul></ul><ul><ul><ul><li>DBMSs now offer ‘restart operators’ to improve fault tolerance; trade-off between runtime penalty and the amount of work lost on failure </li></ul></ul></ul>
  11. 12. Details Comparison Study <ul><li>Column-oriented storage </li></ul><ul><li>Compression </li></ul><ul><ul><li>Column-stores have focused on cheap (and even no) de-compression (some operations executed in compressed domains) </li></ul></ul><ul><ul><li>Column-by-column compression may achieve better compression (?) </li></ul></ul>
  12. 13. Parallel DBMS <ul><li>Bad “out-of-the-box” experience </li></ul><ul><ul><li>Reported execution times dropped from days to minutes after tuning by the vendor </li></ul></ul><ul><ul><li>However, tuning Hadoop for maximum performance is also an arduous task! </li></ul></ul><ul><li>Big difference in cost of loading data </li></ul><ul><ul><li>Performance gains from faster queries offset upfront costs </li></ul></ul><ul><ul><li>Most DBMSs don’t work on “in situ” data </li></ul></ul><ul><li>Only expensive, commercial offerings </li></ul><ul><ul><li>No low-cost open-source alternative </li></ul></ul>
  13. 14. Ease-of-Use <ul><li>Push programmers to a higher level of abstraction </li></ul><ul><ul><li>Pig, Hive, … </li></ul></ul><ul><ul><li>SQL code substantially easier to write than MR code in their study </li></ul></ul><ul><ul><li>Just for them? Right model of real-life tasks in benchmark? </li></ul></ul>
  14. 16. Ease-of-Use <ul><li>Getting a MapReduce program up and running generally takes less effort than the Parallel DBMS alternative </li></ul><ul><ul><li>No schema definition </li></ul></ul><ul><ul><li>No UDF registration </li></ul></ul><ul><li>Modifying a MapReduce program, though… </li></ul>
  15. 17. Parallel DBMS <ul><li>Not used on 100s or 1000s of nodes </li></ul><ul><ul><li>Assume a homogeneous array of machines </li></ul></ul><ul><ul><li>Designed with the assumption that failures are a rare event </li></ul></ul><ul><li>Combine MapReduce’s proven scalability with the Parallel DBMS’s proven efficiency? </li></ul>
  16. 18. Hybrid Solution? <ul><li>HadoopDB: Hadoop as communication layer above multiple nodes running single-node DBMS instances </li></ul><ul><li>Full open-source solution: </li></ul><ul><ul><li>PostgreSQL as DB layer </li></ul></ul><ul><ul><li>Hadoop as communication layer </li></ul></ul><ul><ul><li>Hive as translation layer </li></ul></ul><ul><ul><li>… and the rest is “HadoopDB” </li></ul></ul><ul><li>Shared-nothing version of PostgreSQL as side-effect </li></ul>
  17. 19. Desiderata <ul><li>Performance </li></ul><ul><li>Fault tolerance </li></ul><ul><li>Ability to run in heterogeneous environment </li></ul><ul><ul><li>Slowest compute node should not determine completion time </li></ul></ul><ul><li>Flexible query interface </li></ul><ul><ul><li>ODBC/JDBC </li></ul></ul><ul><ul><li>UDFs </li></ul></ul>
  18. 20. HadoopDB <ul><li>RDBMS </li></ul><ul><li>Careful layout of data </li></ul><ul><li>Indexing </li></ul><ul><li>Sorting </li></ul><ul><li>Shared I/O, buffer management </li></ul><ul><li>Compression </li></ul><ul><li>Query Optimization </li></ul><ul><li>Hadoop </li></ul><ul><li>Job scheduling </li></ul><ul><li>Task coordination </li></ul><ul><li>Parallelization </li></ul>
  19. 21. HadoopDB <ul><li>Database connection </li></ul><ul><ul><li>JDBC, by extending Hadoop’s InputFormat </li></ul></ul><ul><li>Catalog </li></ul><ul><ul><li>XML file in HDFS </li></ul></ul>
  20. 22. Data Loader <ul><li>Globally repartitions data on a given partitioning key upon loading </li></ul><ul><li>Breaks single-node data apart into multiple smaller partitions (chunks) </li></ul><ul><li>Bulk-loads chunks into single-node databases </li></ul><ul><ul><li>Chunk size ~1GB in experiments </li></ul></ul>
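The loading pipeline above can be condensed into a short sketch. This is illustrative Python under stated assumptions: `MAX_CHUNK` stands in for the ~1GB chunk size from the experiments, and the row layout and `load` function are invented for the example.

```python
from collections import defaultdict

MAX_CHUNK = 3  # rows per chunk; stand-in for the ~1GB chunk size

def load(rows, key, n_nodes):
    # Step 1: globally repartition rows on the partitioning key.
    per_node = defaultdict(list)
    for row in rows:
        per_node[hash(row[key]) % n_nodes].append(row)
    # Step 2: break each node's data into fixed-size chunks.
    chunks = {}
    for node, node_rows in per_node.items():
        chunks[node] = [node_rows[i:i + MAX_CHUNK]
                        for i in range(0, len(node_rows), MAX_CHUNK)]
    # Step 3 (not shown): each chunk would be bulk-loaded into that
    # node's single-node database.
    return chunks
```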
  21. 23. Planner (SMS) <ul><li>SQL → MapReduce → SQL </li></ul><ul><li>Extends Hive </li></ul><ul><ul><li>All operators in the DAG are converted, bottom-up, into one or more SQL queries for that database, until the first repartition operator whose partitioning key differs from the database’s partitioning key </li></ul></ul><ul><ul><li>Takes advantage of the relative quality of the RDBMS query optimizer compared to the ‘normal’ Hive query optimizer </li></ul></ul>
  22. 24. SELECT YEAR(saleDate), SUM(revenue) FROM SALES GROUP BY YEAR(saleDate)
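The SMS planner would push this aggregation entirely into the single-node databases; for contrast, here is roughly how plain MapReduce evaluates the same query. An illustrative Python sketch with invented sales data; the field layout is an assumption.

```python
from collections import defaultdict

# sales rows: (saleDate, revenue) -- invented example data
sales = [("2009-03-01", 10.0), ("2009-07-15", 5.0), ("2010-01-02", 7.5)]

def mapper(sale_date, revenue):
    yield sale_date[:4], revenue        # key = YEAR(saleDate)

def reducer(year, revenues):
    return year, sum(revenues)          # SUM(revenue) per group

shuffled = defaultdict(list)
for row in sales:
    for year, rev in mapper(*row):
        shuffled[year].append(rev)      # shuffle plays the GROUP BY role

result = dict(reducer(y, rs) for y, rs in shuffled.items())
# result == {"2009": 15.0, "2010": 7.5}
```

The shuffle between Map and Reduce is exactly the GROUP BY of the SQL above, illustrating the equivalence claimed earlier.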
  23. 25. Planner (SMS) <ul><li>Join queries </li></ul><ul><ul><li>Hive assumes tables are never collocated </li></ul></ul><ul><ul><li>SMS pushes entire join into database layer where possible (i.e., whenever join key matches database partitioning key) </li></ul></ul>
  24. 26. Comparison <ul><li>HadoopDB up to an order of magnitude faster than Hadoop and Hive </li></ul><ul><ul><li>But… also a 10x longer load time (for the join benchmark query; amortized within a single query) </li></ul></ul><ul><li>Outperformed by Vertica </li></ul><ul><ul><li>Even in the fault-tolerance tests, in spite of Vertica’s larger slow-down </li></ul></ul><ul><li>Main performance difference attributed to the efficiency of Vertica’s column store and HadoopDB’s lack of compression </li></ul>
  25. 27. Hadoop / Hive <ul><li>Shortcomings: </li></ul><ul><ul><li>Data storage layer </li></ul></ul><ul><ul><ul><li>No use of hash partitioning on join keys for co-location of related tables </li></ul></ul></ul><ul><ul><li>No statistics about the data in the catalog </li></ul></ul><ul><ul><ul><li>No cost-based optimization </li></ul></ul></ul><ul><ul><li>Lack of native indexing </li></ul></ul><ul><ul><ul><li>Most jobs heavy on I/O </li></ul></ul></ul><ul><li>BTW: Hive is catching up on some of these! </li></ul>
  26. 28. Hadapt <ul><li>Two heuristics to guide optimizations: </li></ul><ul><ul><li>Maximize single-node DBMS use </li></ul></ul><ul><ul><ul><li>DBMS processes data at faster rate than Hadoop </li></ul></ul></ul><ul><ul><li>Minimize # jobs per SQL query </li></ul></ul><ul><ul><ul><li>Each MapReduce job involves much I/O, both to disk and over network </li></ul></ul></ul>
  27. 29. Two orders of magnitude <ul><li>Three key database ideas at its basis: </li></ul><ul><ul><li>Column-store relational back-ends </li></ul></ul><ul><ul><li>Referential partitioning to maximize the number of single-node joins </li></ul></ul><ul><ul><li>Integrate semi-joins in the Hadoop Map phase </li></ul></ul>
  28. 30. Dutch Database History!!! <ul><li>Vectorwise = MonetDB/X100 – Peter Boncz and Martin Kersten (CWI/UvA) </li></ul><ul><li>Semi-joins in distributed relational query processing – Peter Apers (UT) </li></ul><ul><ul><li>Peter M. G. Apers, Alan R. Hevner, S. Bing Yao: Optimization Algorithms for Distributed Queries. IEEE Trans. Software Eng. 9(1): 57-68 (1983) </li></ul></ul>
  29. 31. Vectorwise <ul><li>Vectorized operations on in-cache data </li></ul><ul><ul><li>Attacks the memory wall directly </li></ul></ul><ul><li>Efficient I/O </li></ul><ul><ul><li>PFor and PForDelta lightweight compression algorithms – extremely high decompression rates, designed for modern CPUs (including tricks like predication) </li></ul></ul>
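A toy sketch of the delta-decoding idea underlying PForDelta: values are stored as a base plus small deltas, and decoding is a tight accumulation loop. This omits the bit-packed layout and the exception (patching) mechanism of real PForDelta; the data is invented.

```python
def delta_decode(base, deltas):
    """Reconstruct a sorted sequence from a base value and small deltas.
    Real PForDelta packs the deltas into fixed-width bit fields so this
    loop becomes branch-free and vectorizes well on modern CPUs."""
    out, current = [], base
    for d in deltas:
        current += d
        out.append(current)
    return out

# an encoded sorted posting list: base docid plus gaps between docids
docids = delta_decode(100, [0, 2, 3, 1])
# docids == [100, 102, 105, 106]
```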
  30. 32. Improved Query Plans <ul><li>Join plans that include data re-distribution before computing the join </li></ul><ul><ul><li>Extended Database Connector, giving access to multiple database tables in the Map phase of a single job </li></ul></ul><ul><ul><li>After repartitioning on the join key, related records are sent to the Reduce phase for the actual join computation </li></ul></ul>
  31. 33. Improved Query Plans <ul><li>Referential Partitioning </li></ul><ul><ul><li>HadoopDB/Hadapt performs ‘aggressive’ hash-partitioning on foreign-key attributes </li></ul></ul><ul><ul><ul><li>~ Jimmy’s secondary (value) sort trick </li></ul></ul></ul><ul><ul><li>During data load, this involves an extra step of joining to the parent table to enable the partitioning </li></ul></ul>
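Referential partitioning as described above can be sketched like this. Column names follow TPC-H; the tiny tables and the `node_of` function are invented for illustration.

```python
from collections import defaultdict

# Child rows (lineitem) carry only an order key, so the loader first joins
# each child row to its parent (orders) to pick up the customer key, then
# hashes BOTH tables on that key so parent and child land on the same node.
orders   = {1: {"custkey": 7}, 2: {"custkey": 9}}            # orderkey -> row
lineitem = [{"orderkey": 1, "qty": 3}, {"orderkey": 2, "qty": 5}]

def node_of(custkey, n_nodes=4):
    return custkey % n_nodes

orders_parts, lineitem_parts = defaultdict(list), defaultdict(list)
for okey, row in orders.items():
    orders_parts[node_of(row["custkey"])].append(okey)
for item in lineitem:
    parent = orders[item["orderkey"]]      # the extra join step during load
    lineitem_parts[node_of(parent["custkey"])].append(item)

# Every lineitem row now sits on the same node as its order, so the
# orderkey join becomes a purely local, single-node join.
```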
  32. 34. Join in Hadoop <ul><li>Outline algorithm </li></ul><ul><ul><li>If tables not already co-partitioned… </li></ul></ul><ul><ul><li>Mappers read table partitions and emit tuples keyed on the join attribute, re-partitioning the tables </li></ul></ul><ul><ul><li>Reducer processes the tuples with the same join key, i.e., performs the join on that partition </li></ul></ul><ul><li>BTW… the symmetric hash-join is a UT invention! </li></ul><ul><ul><ul><li>Wilschut & Apers, Dataflow execution in a parallel main memory environment, PDIS 1991 </li></ul></ul></ul>
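The outlined repartition (reduce-side) join, condensed into a Python sketch. The two tables and their payloads are invented; tagging tuples by source table is modeled with a pair of lists per join key.

```python
from collections import defaultdict

R = [(1, "a"), (2, "b")]           # (join key, payload) -- invented data
S = [(1, "x"), (1, "y"), (3, "z")]

# "Map" phase: emit each tuple under its join key, tagged by source table;
# the shuffle brings all tuples with the same key to one reducer.
shuffled = defaultdict(lambda: ([], []))
for key, val in R:
    shuffled[key][0].append(val)   # tagged as coming from R
for key, val in S:
    shuffled[key][1].append(val)   # tagged as coming from S

# "Reduce" phase: per key, form the cross product of the two sides.
joined = [(key, r, s)
          for key, (rs, ss) in shuffled.items()
          for r in rs for s in ss]
# joined == [(1, "a", "x"), (1, "a", "y")]
```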
  33. 35. Improved Query Plans <ul><li>Alternatives for fully partitioned hash-join </li></ul><ul><li>Directed join </li></ul><ul><ul><li>Re-partition only one table, when other argument already partitioned on join key </li></ul></ul><ul><li>Broadcast join </li></ul><ul><ul><li>Ship entire smaller table to all nodes with larger table </li></ul></ul>
  34. 36. Broadcast & Directed Joins <ul><li>Non-trivial in Hadoop </li></ul><ul><ul><li>HDFS does not guarantee to maintain co-partitioning between jobs; datasets using same hash may end up on different nodes </li></ul></ul><ul><ul><li>Requires join in Map phase; hard to do well when multiple passes required (unless both tables already sorted by join key) </li></ul></ul>
  35. 37. Broadcast Join <ul><li>Mapper reads the smaller table from HDFS into an in-memory hashtable, followed by a sequential scan of the larger table </li></ul><ul><ul><li>Map-side join </li></ul></ul><ul><ul><li>~ Jimmy’s in-mapper combiner </li></ul></ul><ul><li>Provided the database offers low-cost support for temporary tables, HadoopDB can push the join into the DBMS for (usually) more efficient execution </li></ul>
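A minimal sketch of the broadcast (map-side) join just described: the small table goes into an in-memory hashtable on every node, and each mapper streams the large table past it, with no shuffle and no Reduce phase. The data is invented; the hashtable is a plain dict.

```python
small = [(1, "nl"), (2, "uk")]                 # fits in memory on every node
large = [(1, 10.0), (2, 5.0), (1, 2.5), (3, 9.9)]

lookup = dict(small)                           # built once per mapper

joined = [(key, lookup[key], val)              # sequential scan of `large`
          for key, val in large
          if key in lookup]                    # unmatched keys dropped
# joined == [(1, "nl", 10.0), (2, "uk", 5.0), (1, "nl", 2.5)]
```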
  36. 38. Directed Join <ul><li>Hadoop’s OutputFormat feature writes the output of a repartitioning mapper directly into the DBMSs, circumventing HDFS; the catalog provides the location of the other table’s partitions </li></ul>
  37. 39. Semi-join <ul><li>Hadoop: </li></ul><ul><ul><li>Mapper performs selection and projection of the join attribute on the first table </li></ul></ul><ul><ul><li>Resulting column replicated as a “map-side join” </li></ul></ul><ul><li>HadoopDB: </li></ul><ul><ul><li>If the projected column is small (e.g., a list of countries, …), transform to </li></ul></ul><ul><ul><ul><li>SELECT … WHERE foreignKey IN (list-of-Values) – completely skips the temporary-table costs </li></ul></ul></ul><ul><ul><ul><li>~ Jimmy’s stripes </li></ul></ul></ul>
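The IN-list rewrite above can be sketched as follows; the table and column names in the generated SQL are illustrative, not from the HadoopDB sources.

```python
# Semi-join rewrite: project the (small) join column from one table, then
# splice it into an IN (...) predicate pushed to each node's DBMS, instead
# of materialising a temporary table there.
countries = ["NL", "UK", "DE"]     # small projected join column

def semi_join_sql(values):
    in_list = ", ".join("'%s'" % v for v in values)
    return "SELECT * FROM sales WHERE countryCode IN (%s)" % in_list

sql = semi_join_sql(countries)
# sql == "SELECT * FROM sales WHERE countryCode IN ('NL', 'UK', 'DE')"
```

(In production one would bind the values as query parameters rather than splice strings; the string form just makes the rewrite visible.)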
  38. 40. Results <ul><li>TPC-H 3TB on 45-node cluster </li></ul><ul><li>Loading time: </li></ul><ul><ul><li>DBMS-X 33h3m </li></ul></ul><ul><ul><li>Hive and Hadoop 49m </li></ul></ul><ul><ul><li>HadoopDB 11h4m (w/ 6h42m for referential partitioning) </li></ul></ul><ul><ul><li>VectorWise 3h47m (includes clustering index creation) </li></ul></ul>
  39. 41. Results <ul><li>DBMS-X >> Hive </li></ul><ul><ul><li>Lack of partitioning and indexing </li></ul></ul><ul><li>Switching HadoopDB from PostgreSQL to Vectorwise results in a factor of 7 improvement on average </li></ul><ul><li>Generally, the map-side join optimization improves efficiency by a factor of 2 to 3 when using the column store </li></ul><ul><li>Semi-join improves by a factor of 2 over the map-side join and a factor of 3.6 over the reduce-side join </li></ul>
  40. 42. Conclusion <ul><li>Hybrid is good </li></ul><ul><ul><li>MapReduce takes care of the “rack to cluster” level </li></ul></ul><ul><ul><li>RDBMS takes care of the within-rack level </li></ul></ul><ul><li>Not sure how good it is for text analytical tasks </li></ul><ul><ul><li>RDBMSs often have problems with data skew </li></ul></ul><ul><ul><li>The Hadapt whitepaper suggests they handle unstructured data with MapReduce and structured data with HadoopDB </li></ul></ul>
  41. 43. Conclusion <ul><li>Never a free lunch… </li></ul><ul><li>If your problem involves non-text data types, consider working with a hybrid solution </li></ul><ul><li>If your problem involves primarily textual data, the question remains open whether a hybrid will actually be of any help </li></ul>
  42. 44. Information Science <ul><li>“Search for the fundamental knowledge which will allow us to postulate and utilize the most efficient combination of [human and machine] resources” </li></ul><ul><li>M.E. Senko. Information systems: records, relations, sets, entities, and things. Information Systems, 1(1):3–13, 1975. </li></ul>
  43. 45. References <ul><li>Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes? CACM 53(1), Jan 2010: 64-70 </li></ul><ul><li>Bajda-Pawlikowski et al., Efficient Processing of Data Warehousing Queries in a Split Execution Environment. SIGMOD 2011 </li></ul><ul><li>Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009 </li></ul><ul><li>Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009 </li></ul><ul><li>Thusoo et al., Data Warehousing and Analytics Infrastructure at Facebook. SIGMOD 2010 </li></ul><ul><li>Wilschut, Flokstra, Apers, Parallelism in a Main-Memory DBMS: The Performance of PRISMA/DB. VLDB 1992 </li></ul><ul><li>Wilschut, Apers, Flokstra, Parallel Query Execution in PRISMA/DB. LNCS 503, 1990 </li></ul><ul><li>Daniel Abadi’s blog </li></ul>