Big data hadoop rdbms
SIKS course on Hadoop, discussing the Stonebraker debate, HadoopDB, Hadapt, and RDBMS roots

    Presentation Transcript

    • Big Data: MapReduce vs. RDBMS
      • Arjen P. de Vries [email_address]
      • Centrum Wiskunde & Informatica
      • Delft University of Technology
      • Spinque B.V.
    • Context
      • Business ‘best practices’: decisions based on data and hard facts rather than instinct and theory
      • MapReduce, though originally designed for text, is more and more “ab”-used for structured data, at tremendous scale
        • Hadoop is used to manage Facebook’s 2.5 petabyte data warehouse
    • Shared-nothing Architecture
      • A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network
        • Possible trade-off: large number of low-end servers instead of small number of high-end ones
    • @CWI – 2011
    • Programming Model
      • Parallel DBMS
        • Claimed best at ad-hoc analytical queries
          • Substantially faster once data is loaded, but loading the data takes considerably longer
          • Who wants to program parallel joins etc.?!
      • Map-Reduce
        • Very suited for extract-transform-load tasks
          • Ease-of-use for complex analytics tasks
      • Hybrid
        • Best of both worlds?
    • Parallel DBMS
      • Horizontal partitioning of relational tables
      • Partitioned execution of SQL operators
        • Select, aggregate, join, project and update
        • New shuffle operator; dynamically repartition rows of intermediate results (usually by hash)
      • Partitioning techniques:
        • Hash, range, round-robin
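The three partitioning techniques can be illustrated with a minimal Python sketch. The function names, the MD5-based hash, and the sample rows are assumptions made for illustration; a real parallel DBMS would use its own hash function and catalog-defined range boundaries.

```python
import hashlib
from bisect import bisect_right


def hash_partition(key, n_nodes):
    """Hash partitioning: a stable hash of the key, modulo the node count."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes


def range_partition(key, boundaries):
    """Range partitioning: `boundaries` is a sorted list of split points.

    Node 0 holds keys below boundaries[0]; node i holds keys in
    [boundaries[i-1], boundaries[i]); the last node holds the rest.
    """
    return bisect_right(boundaries, key)


def round_robin_partition(row_number, n_nodes):
    """Round-robin partitioning: ignore the key, deal rows out in turn."""
    return row_number % n_nodes


# Hypothetical (year, revenue) rows, partitioned on the year.
rows = [(1998, 10.0), (2003, 7.5), (2011, 3.0), (2003, 1.5)]
hash_nodes = [hash_partition(year, 4) for year, _ in rows]
range_nodes = [range_partition(year, [2000, 2010]) for year, _ in rows]
rr_nodes = [round_robin_partition(i, 4) for i, _ in enumerate(rows)]
```

Hash and range partitioning send equal keys to the same node, which is what makes partitioned joins and aggregates possible; round-robin only balances load.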
    • Parallel DBMS
      • DBMS automatically manages the various alternative strategies for the tables involved in the query, transparently to user and application program
    • Parallel DBMS
      • Many Map and Reduce operations can be expressed as plain SQL; the reshuffle between Map and Reduce is equivalent to a GROUP BY operation in SQL
      • Map operations not easily expressed in SQL can be implemented as UDFs (not trivial)
      • Reduce operations not easily expressed in SQL can be implemented as user-defined aggregates (even less trivial)
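The equivalence between the Map/shuffle/Reduce pipeline and GROUP BY can be checked on a toy example. The sketch below uses Python with the standard-library sqlite3 module; the `sales` table, its columns, and the `map_fn`/`reduce_fn` names are invented for illustration and do not come from the comparison study.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input rows: (country, revenue).
sales = [("NL", 10.0), ("US", 4.0), ("NL", 2.5), ("DE", 1.0), ("US", 6.0)]


def map_fn(row):
    """Map: emit (key, value) pairs; here the key is the country."""
    country, revenue = row
    yield country, revenue


def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


# Shuffle: group the mapper output by key (the GROUP BY step).
groups = defaultdict(list)
for row in sales:
    for key, value in map_fn(row):
        groups[key].append(value)
mr_result = sorted(reduce_fn(k, vs) for k, vs in groups.items())

# The same computation as a single SQL statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_result = conn.execute(
    "SELECT country, SUM(revenue) FROM sales GROUP BY country ORDER BY country"
).fetchall()
```

Both `mr_result` and `sql_result` hold one (country, total) pair per country.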
    • Comparison (on 100-node cluster)
      • http://database.cs.brown.edu/projects/mapreduce-vs-dbms/

      Task     | Hadoop | DBMS-X | Vertica | Hadoop/DBMS-X | Hadoop/Vertica
      Grep     | 284s   | 194s   | 108s    | 1.5           | 2.6
      Web Log  | >1Ks   | 740s   | 268s    | 1.6           | 4.3
      Join     | >1Ks   | 32s    | 55s     | 36.3          | 21
    • Details Comparison Study
      • Avoid repetitive record parsing
        • By default, HDFS stores data in the textual format in which it was generated
        • Even when using SequenceFiles, user code necessary to parse out multiple attributes
      • Don’t write intermediate results to disk
        • Push vs. pull model
        • Note:
          • DBMSs now offer ‘restart operators’ to improve fault tolerance; trade-off between the runtime penalty and the amount of work lost on failure
    • Details Comparison Study
      • Column-oriented storage
      • Compression
        • Column-stores have focused on cheap (or even no) decompression (some operations are executed directly in the compressed domain)
        • Column-by-column compression may achieve better compression (?)
    • Parallel DBMS
      • Bad “out-of-the-box” experience
        • Reported difference of execution time from days to minutes, after tuning by vendor
        • However, tuning Hadoop to maximum performance also an arduous task!
      • Big difference in cost of loading data
        • Performance gains from faster queries offset upfront costs
        • Most DBMSs don’t work on “in situ” data
      • Only expensive, commercial offerings
        • No low-cost open-source alternative
    • Ease-of-Use
      • Push programmers to a higher level of abstraction
        • Pig, Hive, …
        • SQL code substantially easier to write than MR code in their study
        • Just for them? Right model of real-life tasks in benchmark?
    • Ease-of-Use
      • Getting a MapReduce program up and running generally takes less effort than the Parallel DBMS alternative
        • No schema definition
        • No UDF registration
      • Modifying MapReduce program though…
    • Parallel DBMS
      • Not used on 100s or 1000s of nodes
        • Assume homogeneous array of machines
        • Designed with the assumption that failures are a rare event
      • Combine MapReduce proven scalability with Parallel DBMS proven efficiency?
    • Hybrid Solution?
      • HadoopDB: Hadoop as communication layer above multiple nodes running single-node DBMS instances
      • Full open-source solution:
        • PostgreSQL as DB layer
        • Hadoop as communication layer
        • Hive as translation layer
        • … and the rest is “HadoopDB”
      • Shared-nothing version of PostgreSQL as side-effect
    • Desiderata
      • Performance
      • Fault tolerance
      • Ability to run in heterogeneous environment
        • Slowest compute node should not determine completion time
      • Flexible query interface
        • ODBC/JDBC
        • UDFs
    • HadoopDB
      • RDBMS
      • Careful layout of data
      • Indexing
      • Sorting
      • Shared I/O, buffer management
      • Compression
      • Query Optimization
      • Hadoop
      • Job scheduling
      • Task coordination
      • Parallelization
    • HadoopDB
      • Database connection
        • JDBC, by extending Hadoop’s InputFormat
      • Catalog
        • XML file in HDFS
    • Data Loader
      • Globally repartitioning data on a given partitioning key upon loading
      • Breaks apart single-node data into multiple smaller partitions (chunks)
      • Bulk-loads chunks in single-node databases
        • Chunk size ~1GB in experiments
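The two loader steps (global repartitioning across nodes, then splitting each node's data into chunks) might look as follows in Python. The use of CRC32, the salt for the second-level hash, and the `load` function are assumptions for illustration; the slides do not say how chunks are assigned within a node.

```python
import zlib
from collections import defaultdict


def stable_hash(key, salt=""):
    """Deterministic hash (Python's built-in hash() is randomized for strings)."""
    return zlib.crc32(f"{salt}{key}".encode("utf-8"))


def load(rows, key_of, n_nodes, chunks_per_node):
    """Assign every row to a (node, chunk) pair based on its partitioning key."""
    layout = defaultdict(list)
    for row in rows:
        key = key_of(row)
        node = stable_hash(key) % n_nodes  # global repartitioning
        chunk = stable_hash(key, salt="local") % chunks_per_node  # local split
        layout[(node, chunk)].append(row)
    return layout


# Hypothetical (customer_id, order) rows, partitioned on customer_id.
rows = [(cust_id, f"order-{i}") for i, cust_id in enumerate([7, 3, 7, 9, 3, 7])]
layout = load(rows, key_of=lambda r: r[0], n_nodes=3, chunks_per_node=2)
```

Each `(node, chunk)` bucket would then be bulk-loaded into its own single-node database.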
    • Planner (SMS)
      • SQL → MapReduce → SQL
      • Extends Hive
        • Walking the operator DAG bottom-up, all operators up to the first repartition operator whose partitioning key differs from the database’s key are converted into one or more SQL queries for that database
        • Takes advantage of relative quality of RDBMS query optimizer as opposed to ‘normal’ Hive query optimizer
    • SELECT YEAR(saleDate), SUM(revenue) FROM SALES GROUP BY YEAR(saleDate)
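A rough sketch of how such a query could be split when SALES is not partitioned by year: every node's database computes a partial aggregate in SQL, and a Reduce step merges the partial sums. The sketch uses Python with sqlite3 standing in for the per-node PostgreSQL instances; SQLite has no YEAR() function, so `strftime('%Y', ...)` is substituted, and the sample data and `make_node` helper are invented.

```python
import sqlite3
from collections import defaultdict

# SQL pushed into each single-node database (YEAR() rewritten for SQLite).
LOCAL_SQL = (
    "SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS yr, SUM(revenue) "
    "FROM sales GROUP BY yr"
)


def make_node(rows):
    """Create one in-memory database holding this node's partition of SALES."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


nodes = [
    make_node([("2009-01-15", 100.0), ("2010-03-02", 50.0)]),
    make_node([("2009-07-30", 25.0), ("2010-11-11", 75.0), ("2010-12-01", 5.0)]),
]

# Map phase: each task runs the SQL against its local database.
partials = []
for conn in nodes:
    partials.extend(conn.execute(LOCAL_SQL).fetchall())

# Reduce phase: merge the per-node partial sums by year.
totals = defaultdict(float)
for year, partial_sum in partials:
    totals[year] += partial_sum
result = sorted(totals.items())
```

If the table were already partitioned on the year, the Reduce step would have nothing left to merge and the whole query could run inside the databases.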
    • Planner (SMS)
      • Join queries
        • Hive assumes tables are never collocated
        • SMS pushes entire join into database layer where possible (i.e., whenever join key matches database partitioning key)
    • Comparison
      • HadoopDB up to an order of magnitude faster than Hadoop and Hive
        • But… also 10x longer load time (for the join benchmark query, amortized in 1 query)
      • Outperformed by Vertica
        • Even for fault-tolerance tests, in spite of larger slow-down
      • Main performance difference attributed to the efficiency of Vertica’s column-store and the lack of compression in HadoopDB’s database layer
    • Hadoop / Hive
      • Shortcomings:
        • Data storage layer
          • No use of hash partitioning on join keys for co-location of related tables
        • No statistics of data in catalog
          • No cost based optimization
        • Lack of native indexing
          • Most jobs heavy on I/O
      • BTW: Hive is catching up on some of these!
    • Hadapt
      • Two heuristics to guide optimizations:
        • Maximize single-node DBMS use
          • DBMS processes data at faster rate than Hadoop
        • Minimize # jobs per SQL query
          • Each MapReduce job involves much I/O, both to disk and over network
    • Two orders of magnitude
      • Three key database ideas at basis:
        • Column store relational back-ends
        • Referential partitioning to maximize number of single node joins
        • Integrate semi-joins in Hadoop Map phase
    • Dutch Database History!!!
      • Vectorwise = MonetDB/X100 – Peter Boncz and Martin Kersten (CWI/UvA)
      • Semi-joins in distributed relational query processing – Peter Apers (UT)
        • Peter M. G. Apers, Alan R. Hevner, S. Bing Yao: Optimization Algorithms for Distributed Queries. IEEE Trans. Software Eng. 9(1): 57-68 (1983)
    • Vectorwise
      • Vectorized operations on in-cache data
        • Attacks directly the memory wall
      • Efficient I/O
        • PFor and PForDelta lightweight compression algorithms – extremely high decompression rates by design for modern CPUs (including tricks like predication)
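The idea behind patched frame-of-reference coding can be shown with a deliberately simplified Python sketch. This is not the actual PFor storage layout: the real scheme bit-packs the codes and chains exceptions through the code slots. The function names and the dictionary of exceptions are assumptions for illustration.

```python
def pfor_encode(values, bits):
    """Encode values as small offsets from a base; outliers become exceptions."""
    base = min(values)
    limit = 1 << bits  # offsets must fit in `bits` bits
    codes, exceptions = [], {}
    for i, value in enumerate(values):
        offset = value - base
        if offset < limit:
            codes.append(offset)
        else:
            codes.append(0)  # placeholder, patched during decoding
            exceptions[i] = value
    return base, codes, exceptions


def pfor_decode(base, codes, exceptions):
    """Decode in two passes so that the main loop contains no branches."""
    out = [base + code for code in codes]  # tight loop over all codes
    for i, value in exceptions.items():  # patch the few outliers afterwards
        out[i] = value
    return out


values = [100, 103, 101, 9000, 102]
base, codes, exceptions = pfor_encode(values, bits=2)
decoded = pfor_decode(base, codes, exceptions)
```

Keeping the outlier handling out of the main decoding loop is what lets that loop run without unpredictable branches.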
    • Improved Query Plans
      • Join plans including data re-distribution before computing
        • Extended Database Connector, giving access to multiple database tables in Map phase of single job
        • After repartitioning on the join key, related records sent to Reduce phase for actual join computation
    • Improved Query Plans
      • Referential Partitioning
        • HadoopDB/Hadapt performs ‘aggressive’ hash-partitioning on foreign-key attributes
          • ~ Jimmy’s secondary (value) sort trick
        • During data load, involves extra step of joining to parent table to enable partitioning
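A minimal Python sketch of referential partitioning, assuming a hypothetical schema in which line items reference orders and orders reference customers. The line items carry no customer key of their own, so the loader joins them to the parent table to find the key to hash on.

```python
import zlib


def node_for(key, n_nodes):
    """Deterministic hash partitioning on a key."""
    return zlib.crc32(str(key).encode("utf-8")) % n_nodes


# Hypothetical tables: orders(order_id, customer_id), lineitems(order_id, item).
orders = [(100, 1), (101, 2), (102, 1)]
lineitems = [(100, "widget"), (100, "gadget"), (101, "gizmo"), (102, "doohickey")]

N_NODES = 4

# Extra load-time step: join line items to their parent order to obtain
# the customer_id they should be partitioned on.
order_to_customer = dict(orders)

order_node = {oid: node_for(cid, N_NODES) for oid, cid in orders}
lineitem_node = [
    (oid, item, node_for(order_to_customer[oid], N_NODES))
    for oid, item in lineitems
]
```

Every line item ends up on the node that holds its order (and that order's customer), so the join between them needs no data movement at query time.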
    • Join in Hadoop
      • Outline algorithm
        • If tables not already co-partitioned…
        • Mappers read table partitions and output join attributes intended to re-partition the tables
        • Reducer processes the tuples with the same join key, i.e., does the join on the partition
      • BTW… symmetric hash-join is a UT invention!
          • Wilschut & Apers, Dataflow execution in a parallel main memory environment, PDIS 1991
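The outlined repartition join can be simulated in a few lines of Python. The tagging of rows with their source table ("L"/"R"), the sample tables, and the function names are assumptions for illustration; the in-memory dictionary stands in for Hadoop's shuffle.

```python
from collections import defaultdict


def map_table(tag, rows, key_index):
    """Mapper: emit (join key, (source tag, row)) for every row of a table."""
    for row in rows:
        yield row[key_index], (tag, row)


def reduce_join(key, tagged_rows):
    """Reducer: all rows sharing a join key arrive together; pair them up."""
    left = [row for tag, row in tagged_rows if tag == "L"]
    right = [row for tag, row in tagged_rows if tag == "R"]
    for l_row in left:
        for r_row in right:
            yield l_row + r_row


# Hypothetical tables: customers(id, name), orders(order_id, customer_id, total).
customers = [(1, "Ada"), (2, "Bob"), (3, "Cy")]
orders = [(10, 1, 5.0), (11, 1, 7.0), (12, 3, 2.0)]

# Shuffle: repartition both tables on the join key.
shuffled = defaultdict(list)
for key, value in map_table("L", customers, 0):
    shuffled[key].append(value)
for key, value in map_table("R", orders, 1):
    shuffled[key].append(value)

joined = sorted(
    row for key, values in shuffled.items() for row in reduce_join(key, values)
)
```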
    • Improved Query Plans
      • Alternatives for fully partitioned hash-join
      • Directed join
        • Re-partition only one table, when other argument already partitioned on join key
      • Broadcast join
        • Ship entire smaller table to all nodes with larger table
    • Broadcast & Directed Joins
      • Non-trivial in Hadoop
        • HDFS does not guarantee to maintain co-partitioning between jobs; datasets using same hash may end up on different nodes
        • Requires join in Map phase; hard to do well when multiple passes required (unless both tables already sorted by join key)
    • Broadcast Join
      • Mapper reads smaller table from HDFS into in-memory hashtable, followed by sequential scan of larger table
        • Map-side join
        • ~ Jimmy’s in-mapper combiner
      • Provided the database offers low-cost support for temporary tables, HadoopDB can push the join into the DBMS for (usually) more efficient execution
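A map-side broadcast join, sketched in Python under the assumption that the smaller table fits in memory. The table contents and the `broadcast_join` name are invented; each call plays the role of one mapper working on one partition of the larger table.

```python
def broadcast_join(small_table, large_partition, small_key, large_key):
    """Build a hash table on the small table, then scan the large partition."""
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)
    for row in large_partition:  # sequential scan, no shuffle needed
        for match in lookup.get(row[large_key], []):
            yield match + row


# Hypothetical tables: nations(id, code) is small; customers(id, nation_id)
# is large and split over two partitions.
nations = [(1, "NL"), (2, "US")]
partition_a = [(100, 1), (101, 2)]
partition_b = [(102, 1), (103, 9)]

result = []
for partition in (partition_a, partition_b):
    result.extend(broadcast_join(nations, partition, 0, 1))
```

No Reduce phase is involved: every mapper holds a full copy of the small table and joins its own partition independently.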
    • Directed Join
      • A repartitioning mapper reads the catalog data for the other table and uses Hadoop’s OutputFormat feature to write its output directly into the DBMSs, circumventing HDFS
    • Semi-join
      • Hadoop:
        • Mapper performs selection and projection of join attribute on first table
        • Resulting column replicated as “map-side join”
      • HadoopDB:
        • If projected column is small (e.g., list of countries, …), transform to
          • SELECT … WHERE foreignKey IN (list-of-Values) – skips completely the temporary table costs
          • ~ Jimmy’s stripes
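The rewrite of a semi-join into an IN list can be sketched with Python and sqlite3. The `nation` and `customer` tables and their columns are hypothetical, and sqlite3 stands in for the per-node database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (nationKey INTEGER, region TEXT);
    CREATE TABLE customer (custKey INTEGER, nationKey INTEGER);
    INSERT INTO nation VALUES (1, 'EUROPE'), (2, 'ASIA'), (3, 'EUROPE');
    INSERT INTO customer VALUES (10, 1), (11, 2), (12, 3), (13, 2);
    """
)

# Step 1: selection and projection of the join attribute on the first table.
keys = [
    key
    for (key,) in conn.execute(
        "SELECT nationKey FROM nation WHERE region = ? ORDER BY nationKey",
        ("EUROPE",),
    )
]

# Step 2: the projected column is small, so inline it as an IN list
# instead of materializing a temporary table and joining against it.
placeholders = ", ".join("?" for _ in keys)
sql = (
    f"SELECT custKey FROM customer WHERE nationKey IN ({placeholders}) "
    "ORDER BY custKey"
)
matching = [cust for (cust,) in conn.execute(sql, keys)]
```

The second query touches only the `customer` table, which is the point of the transformation.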
    • Results
      • TPC-H 3TB on 45-node cluster
      • Loading time:
        • DBMS-X 33h3m
        • Hive and Hadoop 49m
        • HadoopDB 11h4m (w/ 6h42m for referential partitioning)
        • VectorWise 3h47m (includes clustering index creation)
    • Results
      • DBMS-X >> Hive
        • Lack of partitioning and indexing
      • Switching HadoopDB from PostgreSQL to Vectorwise results in a factor of 7 improvement on average
      • Generally, map-side join optimization improves efficiency by a factor of 2 to 3 when using column-store
      • Semi-join improves by factor of 2 over map-side join and factor of 3.6 over reduce join
    • Conclusion
      • Hybrid is good
        • MapReduce takes care of “rack to cluster”
        • RDBMS takes care of the within-rack
      • Not sure how good it is for text analytical tasks
        • RDBMSs often have problems with data skew
        • Hadapt whitepaper suggests they do unstructured data with MapReduce and structured data with HadoopDB
    • Conclusion
      • Never a free lunch…
      • If your problem involves non-text data types, consider working with hybrid solution
      • If your problem involves primarily textual data, question still open whether hybrid will actually be of any help
    • Information Science
      • “Search for the fundamental knowledge which will allow us to postulate and utilize the most efficient combination of [human and machine] resources”
      • M.E. Senko. Information systems: records, relations, sets, entities, and things. Information Systems, 1(1):3–13, 1975.
    • References
      • Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?, in CACM 53, 1 (Jan 2010):64-70
      • Bajda-Pawlikowski et al., Efficient Processing of Data Warehousing Queries in a Split Execution Environment, SIGMOD 2011
      • Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009
      • Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
      • Thusoo et al., Data Warehousing and Analytics Infrastructure at Facebook, SIGMOD 2010
      • Wilschut, Flokstra, Apers, Parallelism in a Main-Memory DBMS: The performance of PRISMA/DB, VLDB 1992
      • Wilschut, Apers & Flokstra, Parallel Query Execution in PRISMA/DB. LNCS 503 (1990)
      • Daniel Abadi’s blog, http://dbmsmusings.blogspot.com/