Big Data: MapReduce vs. RDBMS

SIKS course on Hadoop, discussing the Stonebraker debate, HadoopDB, Hadapt, and RDBMS roots

Transcript

  • 1. Big Data MapReduce vs. RDBMS Arjen P. de Vries [email_address] Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.
  • 2. Context
    • Business ‘best practices’: decisions based on data and hard facts rather than instinct and theory
    • MapReduce, though originally designed for text, is increasingly (ab)used for structured data, at tremendous scale
      • Hadoop is used to manage Facebook’s 2.5 petabyte data warehouse
  • 3. Shared-nothing Architecture
    • A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network
      • Possible trade-off: large number of low-end servers instead of small number of high-end ones
  • 4. @CWI – 2011
  • 5.  
  • 6. Programming Model
    • Parallel DBMS
      • Claimed best at ad-hoc analytical queries
        • Substantially faster once data is loaded, but loading the data takes considerably longer
        • Who wants to program parallel joins etc.?!
    • Map-Reduce
      • Very suited for extract-transform-load tasks
        • Ease-of-use for complex analytics tasks
    • Hybrid
      • Best of both worlds?
  • 7. Parallel DBMS
    • Horizontal partitioning of relational tables
    • Partitioned execution of SQL operators
      • Select, aggregate, join, project and update
      • New shuffle operator; dynamically repartition rows of intermediate results (usually by hash)
    • Partitioning techniques:
      • Hash, range, round-robin
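The three partitioning schemes named above can be sketched in a few lines. This is an illustrative sketch, not actual DBMS code; the function names and the node-assignment conventions are assumptions for the example.

```python
# Sketch of the three partitioning schemes: each assigns a row (a dict)
# to one of n worker nodes.

def hash_partition(row, key, n):
    # Rows with equal keys land on the same node: good for joins/group-by.
    return hash(row[key]) % n

def range_partition(row, key, boundaries):
    # boundaries is a sorted list of upper bounds; the last node is open-ended.
    for node, upper in enumerate(boundaries):
        if row[key] < upper:
            return node
    return len(boundaries)

def round_robin_partition(row_index, n):
    # Ignores the data entirely: perfect balance, but no locality.
    return row_index % n
```

Hash partitioning is the scheme the shuffle operator above typically uses, since it co-locates equal keys.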
  • 8. Parallel DBMS
    • DBMS automatically manages the various alternative strategies for the tables involved in the query, transparently to user and application program
  • 9. Parallel DBMS
    • Many Map and Reduce operations can be expressed in plain SQL; the reshuffle between Map and Reduce is equivalent to a GROUP BY operation in SQL
    • Map operations not easily expressed in SQL can be implemented as UDFs (not trivial)
    • Reduce operations not easily expressed in SQL can be implemented as user-defined aggregates (even less trivial)
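The equivalence claimed above can be made concrete with a minimal pure-Python pipeline: the shuffle between Map and Reduce groups values by key, which is exactly what GROUP BY does. This word-count sketch computes the same result as `SELECT word, COUNT(*) FROM docs GROUP BY word`.

```python
from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word, 1)            # Map emits key/value pairs

def shuffle(pairs):
    groups = defaultdict(list)         # the shuffle groups values by key --
    for key, value in pairs:           # exactly what GROUP BY does
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(vals) for key, vals in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
```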
  • 10. Comparison (on 100-node cluster) http://database.cs.brown.edu/projects/mapreduce-vs-dbms/

    Task    | Hadoop | DBMS-X | Vertica | Hadoop/DBMS-X | Hadoop/Vertica
    Grep    | 284s   | 194s   | 108s    | 1.5           | 2.6
    Web Log | >1Ks   | 740s   | 268s    | 1.6           | 4.3
    Join    | >1Ks   | 32s    | 55s     | 36.3          | 21
  • 11. Details Comparison Study
    • Avoid repetitive record parsing
      • Default HDFS setting stores data in the textual format in which it was generated
      • Even when using SequenceFiles, user code necessary to parse out multiple attributes
    • Don’t write intermediate results to disk
      • Push vs. pull model
      • Note:
        • DBMSs now offer ‘restart operators’ to improve fault tolerance; trade-off between runtime penalty and # work lost on failure
  • 12. Details Comparison Study
    • Column-oriented storage
    • Compression
      • Column-stores have focused on cheap (and even no) de-compression (some operations executed in compressed domains)
      • Column-by-column compression may achieve better compression (?)
  • 13. Parallel DBMS
    • Bad “out-of-the-box” experience
      • Reported difference of execution time from days to minutes, after tuning by vendor
      • However, tuning Hadoop to maximum performance also an arduous task!
    • Big difference in cost of loading data
      • Performance gains from faster queries offset upfront costs
      • Most DBMSs don’t work on “in situ” data
    • Only expensive, commercial offerings
      • No low-cost O/S alternative
  • 14. Ease-of-Use
    • Push programmers to a higher level of abstraction
      • Pig, Hive, …
      • SQL code substantially easier to write than MR code in their study
      • Just for them? Right model of real-life tasks in benchmark?
  • 15.  
  • 16. Ease-of-Use
    • Getting a MapReduce program up and running generally takes less effort than the Parallel DBMS alternative
      • No schema definition
      • No UDF registration
    • Modifying a MapReduce program, though…
  • 17. Parallel DBMS
    • Not used on 100s or 1000s of nodes
      • Assume homogeneous array of machines
      • Designed with the assumption that failures are a rare event
    • Combine MapReduce proven scalability with Parallel DBMS proven efficiency?
  • 18. Hybrid Solution?
    • HadoopDB: Hadoop as communication layer above multiple nodes running single-node DBMS instances
    • Full open-source solution:
      • PostgreSQL as DB layer
      • Hadoop as communication layer
      • Hive as translation layer
      • … and the rest is “HadoopDB”
    • Shared-nothing version of PostgreSQL as side-effect
  • 19. Desiderata
    • Performance
    • Fault tolerance
    • Ability to run in heterogeneous environment
      • Slowest compute node should not determine completion time
    • Flexible query interface
      • ODBC/JDBC
      • UDFs
  • 20. HadoopDB
    • RDBMS
    • Careful layout of data
    • Indexing
    • Sorting
    • Shared I/O, buffer management
    • Compression
    • Query Optimization
    • Hadoop
    • Job scheduling
    • Task coordination
    • Parallelization
  • 21. HadoopDB
    • Database connection
      • JDBC, by extending Hadoop’s InputFormat
    • Catalog
      • XML file in HDFS
  • 22. Data Loader
    • Globally repartitioning data on a given partitioning key upon loading
    • Breaks apart single node data into multiple smaller partitions ( chunks )
    • Bulk-loads chunks in single-node databases
      • Chunk size ~1GB in experiments
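The two loader steps above can be sketched as follows. This is a hedged sketch of the logic, not HadoopDB's actual Data Loader code; `chunk_rows` stands in for the ~1 GB chunk size used in the experiments.

```python
# Sketch of the Data Loader: first globally repartition rows across nodes
# by a partitioning key, then break each node's data into fixed-size
# chunks for bulk-loading into the local single-node databases.

def repartition(rows, key, n_nodes):
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes

def chunk(node_rows, chunk_rows):
    return [node_rows[i:i + chunk_rows]
            for i in range(0, len(node_rows), chunk_rows)]
```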
  • 23. Planner (SMS)
    • SQL  MapReduce  SQL
    • Extends Hive
      • All operators in the DAG, bottom-up, until the first repartition operator whose partitioning key differs from the database's key, are converted into one or more SQL queries for that database
      • Takes advantage of relative quality of RDBMS query optimizer as opposed to ‘normal’ Hive query optimizer
  • 24. SELECT YEAR(saleDate), SUM(revenue) FROM SALES GROUP BY YEAR(saleDate)
  • 25. Planner (SMS)
    • Join queries
      • Hive assumes tables are never collocated
      • SMS pushes entire join into database layer where possible (i.e., whenever join key matches database partitioning key)
  • 26. Comparison
    • HadoopDB up to an order of magnitude faster than Hadoop and Hive
      • But… also 10x longer load time (for the join benchmark, already amortized after a single query)
    • Outperformed by Vertica
      • Even for fault-tolerance tests, in spite of larger slow-down
    • Main performance difference attributed to efficiency of column-store and lack of compression
  • 27. Hadoop / Hive
    • Shortcomings:
      • Data storage layer
        • No use of hash partitioning on join keys for co-location of related tables
      • No statistics of data in catalog
        • No cost based optimization
      • Lack of native indexing
        • Most jobs heavy on I/O
    • BTW: Hive is catching up on some of these!
  • 28. Hadapt
    • Two heuristics to guide optimizations:
      • Maximize single-node DBMS use
        • DBMS processes data at faster rate than Hadoop
      • Minimize # jobs per SQL query
        • Each MapReduce job involves much I/O, both to disk and over network
  • 29. Two orders of magnitude
    • Three key database ideas at basis:
      • Column store relational back-ends
      • Referential partitioning to maximize number of single node joins
      • Integrate semi-joins in Hadoop Map phase
  • 30. Dutch Database History!!!
    • Vectorwise = MonetDB/X100 – Peter Boncz and Martin Kersten (CWI/UvA)
    • Semi-joins in distributed relational query processing – Peter Apers (UT)
      • Peter M. G. Apers, Alan R. Hevner, S. Bing Yao: Optimization Algorithms for Distributed Queries. IEEE Trans. Software Eng. 9(1): 57-68 (1983)
  • 31. Vectorwise
    • Vectorized operations on in-cache data
      • Attacks directly the memory wall
    • Efficient I/O
      • PFor and PForDelta lightweight compression algorithms – extremely high decompression rates, by design, on modern CPUs (including tricks like predication)
  • 32. Improved Query Plans
    • Join plans including data re-distribution before computing
      • Extended Database Connector, giving access to multiple database tables in Map phase of single job
      • After repartitioning on the join key, related records sent to Reduce phase for actual join computation
  • 33. Improved Query Plans
    • Referential Partitioning
      • HadoopDB/Hadapt performs ‘aggressive’ hash-partitioning on foreign-key attributes
        • ~ Jimmy’s secondary (value) sort trick
      • During data load, involves extra step of joining to parent table to enable partitioning
  • 34. Join in Hadoop
    • Outline algorithm
      • If tables not already co-partitioned…
      • Mappers read the table partitions and emit tuples keyed on the join attribute, re-partitioning the tables
      • Each reducer processes the tuples sharing a join key, i.e., performs the join on that partition
    • BTW… symmetric hash-join is a UT invention!
        • Wilschut & Apers, Dataflow execution in a parallel main memory environment, PDIS 1991
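The repartition ("reduce-side") join outlined above can be sketched in plain Python. This is an illustrative sketch, not Hadoop code; the tagging convention ("R"/"S") is an assumption for the example.

```python
from collections import defaultdict

# Reduce-side join sketch: mappers tag each tuple with its table of
# origin and emit it keyed on the join attribute; the reducer joins
# all tuples that share a key.

def tag_and_emit(table_r, table_s, key):
    for row in table_r:
        yield (row[key], ("R", row))
    for row in table_s:
        yield (row[key], ("S", row))

def reduce_join(tagged):
    buckets = defaultdict(lambda: ([], []))   # (R-rows, S-rows) per key
    for k, (tag, row) in tagged:
        buckets[k][0 if tag == "R" else 1].append(row)
    out = []
    for k, (rs, ss) in buckets.items():
        for r in rs:
            for s in ss:
                out.append({**r, **s})        # emit the joined tuple
    return out
```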
  • 35. Improved Query Plans
    • Alternatives for fully partitioned hash-join
    • Directed join
      • Re-partition only one table, when other argument already partitioned on join key
    • Broadcast join
      • Ship entire smaller table to all nodes with larger table
  • 36. Broadcast & Directed Joins
    • Non-trivial in Hadoop
      • HDFS does not guarantee to maintain co-partitioning between jobs; datasets using same hash may end up on different nodes
      • Requires join in Map phase; hard to do well when multiple passes required (unless both tables already sorted by join key)
  • 37. Broadcast Join
    • Mapper reads smaller table from HDFS into in-memory hashtable, followed by sequential scan of larger table
      • Map-side join
      • ~ Jimmy’s in-mapper combiner
    • Provided the database offers low-cost support for temporary tables, HadoopDB can push the join into the DBMS for (usually) more efficient execution
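The map-side broadcast join above reduces to a hash build plus a sequential probe, and needs no shuffle or Reduce phase at all. A minimal sketch (illustrative only):

```python
# Broadcast ("map-side") join sketch: each mapper loads the smaller
# table into an in-memory hash table, then streams the larger table
# past it, emitting matches.

def broadcast_join(small_table, large_table, key):
    lookup = {}
    for row in small_table:                 # build the hash table once
        lookup.setdefault(row[key], []).append(row)
    for big_row in large_table:             # sequential scan of big table
        for small_row in lookup.get(big_row[key], []):
            yield {**big_row, **small_row}
```

This only pays off when the smaller table fits in each mapper's memory, which is exactly the precondition for shipping it to all nodes.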
  • 38. Directed Join
    • Hadoop's OutputFormat feature writes the output of a repartitioning mapper directly into the DBMSs (reading catalog data for the other table), circumventing HDFS
  • 39. Semi-join
    • Hadoop:
      • Mapper performs selection and projection of join attribute on first table
      • Resulting column replicated as “map-side join”
    • HadoopDB:
      • If projected column is small (e.g., list of countries, …), transform to
        • SELECT … WHERE foreignKey IN (list-of-Values) – skips completely the temporary table costs
        • ~ Jimmy’s stripes
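The IN-list rewrite above can be sketched end-to-end. This is a hedged sketch with sqlite3 standing in for the node-local DBMS (an assumption for portability; HadoopDB used PostgreSQL), and a hard-coded projected column standing in for the mapper's output.

```python
import sqlite3

# Semi-join rewrite sketch: step 1 projects the (small) join column from
# the filtered first table; step 2 ships that list to each node as an
# IN (...) predicate, skipping temporary-table costs entirely.

countries = ["NL", "DE"]                   # projected join column (small)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (country TEXT, revenue INT)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("NL", 10), ("FR", 20), ("DE", 30)])

placeholders = ",".join("?" * len(countries))
rows = db.execute(
    f"SELECT country, revenue FROM sales WHERE country IN ({placeholders})",
    countries).fetchall()
```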
  • 40. Results
    • TPC-H 3TB on 45-node cluster
    • Loading time:
      • DBMS-X 33h3m
      • Hive and Hadoop 49m
      • HadoopDB 11h4m (w/ 6h42m for referential partitioning)
      • VectorWise 3h47m (includes clustering index creation)
  • 41. Results
    • DBMS-X >> Hive
      • Lack of partitioning and indexing
    • Switching HadoopDB from PostgreSQL to Vectorwise results in a factor of 7 improvement on average
    • Generally, the map-side join optimization improves efficiency by a factor of 2 to 3 when using a column-store
    • Semi-join improves by a factor of 2 over the map-side join and a factor of 3.6 over the reduce-side join
  • 42. Conclusion
    • Hybrid is good
      • MapReduce takes care of the “rack to cluster” level
      • The RDBMS takes care of the within-rack level
    • Not sure how good it is for text analytical tasks
      • RDBMS often problems with data skew
      • Hadapt whitepaper suggests they do unstructured data with MapReduce and structured data with HadoopDB
  • 43. Conclusion
    • Never a free lunch…
    • If your problem involves non-text data types, consider working with hybrid solution
    • If your problem involves primarily textual data, question still open whether hybrid will actually be of any help
  • 44. Information Science
    • “Search for the fundamental knowledge which will allow us to postulate and utilize the most efficient combination of [human and machine] resources”
    • M.E. Senko. Information systems: records, relations, sets, entities, and things. Information Systems, 1(1):3–13, 1975.
  • 45. References
    • Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?, in CACM 53, 1 (Jan 2010):64-70
    • Bajda-Pawlikowski et al., Efficient Processing of Data Warehousing Queries in a Split Execution Environment, SIGMOD 2011
    • Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009
    • Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
    • Thusoo et al., Data Warehousing and Analytics Infrastructure at Facebook, SIGMOD 2010
    • Wilschut, Flokstra, Apers, Parallelism in a Main-Memory DBMS: The performance of PRISMA/DB, VLDB 1992
    • Wilschut, Apers & Flokstra, Parallel Query Execution in PRISMA/DB. LNCS 503 (1990)
    • Daniel Abadi’s blog, http://dbmsmusings.blogspot.com/