#Technology
MapReduce and Parallel DBMS: Friends or Foes?
Reference: Communications of the ACM, Jan. 2010, Vol. 53, No. 1
Organization
1. Question: are MapReduce systems making parallel DBMSs seem like legacy systems?
2. The MapReduce paradigm
3. Parallel database systems (query execution)
4. Mapping parallel DBMSs onto MapReduce
5. Possible applications
6. The DBMS “sweet spot”
7. Architectural differences
8. Learning from each other
9. Conclusions
MapReduce Paradigm
• Large numbers of processors work in parallel to solve computing problems.
• Low-end commodity servers.
• Proliferation of programming tools.
• A simple model for expressing sophisticated distributed programs.
• It is possible to write almost any parallel-processing task as either a set of database queries or a set of MR jobs (a minimal MR sketch follows below).
• Questions:
  – Which is better?
  – How should the two be benchmarked?
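The MR model above can be summarized with a toy, single-process sketch (an illustration of the programming model only, not taken from the paper): the user writes just a Map and a Reduce function, and the framework handles partitioning, shuffling, and parallel execution.

```python
from collections import defaultdict

def map_fn(record):
    # Map: emit a (word, 1) pair for every word in an input record.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: combine all values observed for one key.
    return key, sum(values)

def run_job(records, map_fn, reduce_fn):
    # Stand-in for the MR framework: group map output by key ("shuffle"),
    # then apply the reduce function to each group.
    groups = defaultdict(list)
    for record in records:                       # map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

print(run_job(["to be or not to be"], map_fn, reduce_fn))
```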
Parallel Database System
Salient points:
• Horizontal partitioning of relational tables.
• Partitioned execution of SQL queries.
• Runs on a cluster of commodity servers, each with its own CPU, memory, and disks.
Parallel DBMS Paradigm
• Provides a high-level programming environment (SQL).
• Queries are inherently parallelizable.
• Commercial systems are available from roughly a dozen vendors.
• The complexity of parallelizing queries is abstracted away from the programmer.
Partitioning Strategies
• Hash (or range) partitioning: a hash function is applied to one or more attributes of each row (or the attribute values are compared against range boundaries) to determine the target node and disk where the row should be stored.
• Round-robin partitioning: target node = row number modulo number of machines.
A small sketch of both placement strategies follows below.
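The two strategies can be sketched as follows (an illustrative toy with a made-up four-node cluster; the attribute names are hypothetical):

```python
import hashlib

NUM_NODES = 4  # assumed cluster size

def hash_partition(row, key_attrs):
    # Route a row by hashing one or more of its attributes.
    key = "|".join(str(row[a]) for a in key_attrs)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_NODES

def round_robin_partition(row_number):
    # Route a row purely by its position in the load order.
    return row_number % NUM_NODES

rows = [{"custId": 17, "amount": 9.5}, {"custId": 42, "amount": 3.0}]
for i, row in enumerate(rows):
    print(i, hash_partition(row, ["custId"]), round_robin_partition(i))
```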
Partitioned Query Execution
• SQL operators: selection, aggregation, join, projection, and update.
• Example 1 [Selection]: the selection predicate is applied in parallel on each partition of the table, and the qualifying rows from all nodes are merged into the result (a minimal sketch follows below).
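A minimal sketch of partitioned selection (illustrative only; the table layout and predicate are made up): the same predicate runs independently against every partition, and the qualifying rows are simply concatenated.

```python
# A hypothetical Sales table split across three partitions (one per node).
partitions = [
    [{"custId": 1, "amount": 10.0}, {"custId": 2, "amount": 55.0}],
    [{"custId": 3, "amount": 70.0}],
    [{"custId": 4, "amount": 5.0},  {"custId": 5, "amount": 90.0}],
]

def select_local(partition, predicate):
    # Each node evaluates the predicate over its own partition only.
    return [row for row in partition if predicate(row)]

# Coordinator: run the selection on every partition, then merge the results.
result = []
for p in partitions:
    result.extend(select_local(p, lambda r: r["amount"] > 50))
print(result)
```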
• Example 2 [Aggregation]: Sales table round-robin partitioned. Each node computes a partial aggregate over the rows it holds; the partial results are then repartitioned on the grouping attribute and merged into the final answer (see the sketch below).
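The two-phase pattern for the aggregation example might look like this (a sketch under assumed data and a SUM(amount) GROUP BY custId query; not the paper's code):

```python
from collections import defaultdict

# Round-robin partitioned Sales: rows for the same custId sit on many nodes.
partitions = [
    [{"custId": 1, "amount": 10.0}, {"custId": 2, "amount": 55.0}],
    [{"custId": 1, "amount": 70.0}, {"custId": 2, "amount": 5.0}],
]

def partial_aggregate(partition):
    # Phase 1: each node aggregates whatever rows it happens to hold.
    sums = defaultdict(float)
    for row in partition:
        sums[row["custId"]] += row["amount"]
    return sums

def final_aggregate(partials, num_nodes=2):
    # Phase 2: partial sums are repartitioned on custId (the "shuffle")
    # and each node merges the partials for the keys it owns.
    buckets = [defaultdict(float) for _ in range(num_nodes)]
    for partial in partials:
        for cust_id, s in partial.items():
            buckets[hash(cust_id) % num_nodes][cust_id] += s
    return [dict(b) for b in buckets]

print(final_aggregate([partial_aggregate(p) for p in partitions]))
```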
• Example 3 [Join]: Sales table round-robin partitioned; Customers table hash-partitioned on the Customers.custId attribute. Because Sales is not partitioned on the join attribute, its rows must first be shuffled on custId so that they reach the node holding the matching Customers partition; each node then performs a local join (see the sketch below).
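The join example can be sketched as a repartition-then-local-join (illustrative, with made-up rows and a two-node cluster):

```python
NUM_NODES = 2
node_of = lambda cust_id: hash(cust_id) % NUM_NODES

# Customers is already hash-partitioned on custId.
customers_parts = [[], []]
for c in [{"custId": 1, "name": "A"}, {"custId": 2, "name": "B"}]:
    customers_parts[node_of(c["custId"])].append(c)

# Sales is round-robin partitioned, so it must be shuffled on custId first.
sales_rr = [[{"custId": 2, "amount": 55.0}], [{"custId": 1, "amount": 10.0}]]
sales_parts = [[], []]
for partition in sales_rr:
    for row in partition:
        sales_parts[node_of(row["custId"])].append(row)

# Local join on each node; no further data movement is needed.
for node in range(NUM_NODES):
    index = {c["custId"]: c for c in customers_parts[node]}
    for s in sales_parts[node]:
        print(node, {**s, **index[s["custId"]]})
```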
• The DBMS manages partitioning strategies automatically. Example (a plan-selection sketch follows below):
  – If Sales and Customers are each hash-partitioned on their custId attribute, the query optimizer recognizes that the two tables are both hash-partitioned on the joining attribute and omits the shuffle operator from the compiled query plan.
  – If both tables are round-robin partitioned, the optimizer inserts shuffle operators for both tables so that tuples that join with one another end up on the same node.
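The optimizer's decision reduces to a co-location check, roughly like this (a sketch of the rule described above, not an actual optimizer):

```python
def plan_join(left, right, join_attr):
    # Add a shuffle for a table only if it is not already hash-partitioned
    # on the join attribute; otherwise its partitions are already co-located.
    plan = []
    for table in (left, right):
        co_located = (table["strategy"] == "hash"
                      and table["partition_attr"] == join_attr)
        if not co_located:
            plan.append(f"SHUFFLE {table['name']} ON {join_attr}")
        plan.append(f"LOCAL SCAN {table['name']}")
    plan.append(f"LOCAL JOIN ON {join_attr}")
    return plan

sales = {"name": "Sales", "strategy": "round-robin", "partition_attr": None}
customers = {"name": "Customers", "strategy": "hash", "partition_attr": "custId"}
print(plan_join(sales, customers, "custId"))
```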
Mapping Parallel DBMSs onto MapReduce
• An MR program consists of only two functions, Map and Reduce, written by the user to process key/value data pairs.
• The input data set is stored in a collection of partitions in a distributed file system deployed on each node of the cluster.
• Map and Reduce operations: SQL aggregates augmented with UDFs and user-defined aggregates give DBMS users the same MR-style reduce functionality.
• The reshuffle between the Map and Reduce tasks is equivalent to a GROUP BY in SQL (illustrated below).
• Parallel DBMSs therefore provide the same computing model as MR, with the added benefit of a declarative language (SQL).
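As a rough illustration of the correspondence (an assumed word-count example run against SQLite purely for self-containment): the shuffle between Map and Reduce plays the role of the GROUP BY, and the Reduce function plays the role of the aggregate.

```python
import sqlite3

# The map output (word, 1) pairs become rows of a table ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE map_output (word TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO map_output VALUES (?, ?)",
    [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)],
)

# ... and the reduce step collapses to a GROUP BY with an aggregate.
for row in conn.execute("SELECT word, SUM(cnt) FROM map_output GROUP BY word"):
    print(row)
```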
MR: Applications
• ETL and "read-once" data sets: parse and clean the data, then feed it to downstream databases.
• Complex analytics: multiple passes over the data, complex dataflows.
• Semi-structured data: key/value pairs in which the number of attributes present in any given record varies (no schema required).
• Quick-and-dirty analyses: fast start-up time for one-off analyses.
• Limited budget: open-source software.
DBMS “Sweet Spot” [Benchmarks]
• Grep: scan through a data set of 100-byte records looking for a three-character pattern. Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value. The search pattern is found in the last 90 bytes only once in every 10,000 records. (A sketch of this task under both models follows after this list.)
• Weblog: calculate the total ad revenue generated for each visited IP address from the logs.
• Join: two subtasks that perform a complex calculation on two data sets. First, each system must find the IP address that generated the most revenue within a particular date range in the user-visits data. Once these intermediate records are generated, the system must then calculate the average PageRank of all pages visited during that interval.
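Under the two models the Grep task looks roughly as follows (a sketch with an assumed record layout and an assumed pattern "XYZ"; neither the schema nor the query comes from the benchmark code):

```python
PATTERN = "XYZ"  # assumed three-character search pattern

# MR view: a map-only job; each map task scans its input split and emits the
# records whose 90-byte value contains the pattern.
def grep_map(record):
    key, value = record[:10], record[10:]
    if PATTERN in value:
        yield key, value

# DBMS view: the same task is a single declarative query, e.g.
GREP_SQL = "SELECT key, field FROM data WHERE field LIKE '%XYZ%'"

sample = "0000000001" + "a" * 40 + "XYZ" + "a" * 47   # one 100-byte record
print(list(grep_map(sample)))
print(GREP_SQL)
```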
Architectural Differences
Record parsing
• MapReduce: the burden of parsing the fields of each record falls on user code; each Map and Reduce task must repeatedly parse and convert string fields into the appropriate types (sketched after this slide).
• DBMS: records are parsed once, when the data is initially loaded; the storage manager lays out records so that attributes can be addressed directly at runtime in their most efficient storage representation.
Compression
• MapReduce: Hadoop often executes more slowly when compression is used on its input files.
• DBMS: carefully tuned compression algorithms ensure that the cost of decompressing tuples does not offset the performance gains from the reduced I/O cost of reading compressed data.
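The record-parsing difference can be sketched like this (an illustration with made-up records; the binary layout is only a stand-in for a DBMS storage manager):

```python
import struct

raw_lines = ["17,42.5,2009-12-01"] * 100_000   # text records, as MR reads them

def mr_scan(lines):
    # MR style: every task re-splits the line and re-converts the string
    # fields on every pass over the data.
    total = 0.0
    for line in lines:
        cust_id, amount, date = line.split(",")
        total += float(amount)
    return total

# DBMS style: fields are converted once, at load time, into a fixed binary
# layout; at query time the attribute is read directly with no conversion.
loaded = [struct.pack("<if", 17, 42.5) for _ in raw_lines]   # done at load

def dbms_scan(records):
    return sum(struct.unpack_from("<f", rec, 4)[0] for rec in records)

print(mr_scan(raw_lines), round(dbms_scan(loaded), 1))
```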
Pipelining
• MapReduce: the producer writes intermediate results to local data structures (and files), and the consumer subsequently "pulls" the data. Checkpointing these intermediate results gives greater fault tolerance, at the expense of runtime performance.
• DBMS: the query plan is distributed to the nodes at execution time; each operator in the plan must send data to the next operator, regardless of whether that operator runs on the same node or a different one. Qualifying data is "pushed" by the first operator to the second and streamed from producer to consumer; intermediate data is never written to disk, and the resulting "back-pressure" in the runtime system stalls the producer before it can overrun the consumer. (A toy sketch of the contrast follows below.)
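The contrast between materialize-then-pull and streaming can be caricatured with plain Python (only an analogy; a generator's on-demand evaluation stands in for the push/back-pressure behaviour of a DBMS runtime):

```python
# MR style: the producer fully materializes its intermediate result
# (conceptually, writes it out) before the consumer pulls it.
def mr_producer(rows):
    return [r * 2 for r in rows]          # whole intermediate result exists

def mr_consumer(intermediate):
    return sum(intermediate)

# DBMS style: tuples flow from producer to consumer one at a time and are
# never materialized; the producer only advances when the consumer is ready.
def dbms_producer(rows):
    for r in rows:
        yield r * 2

def dbms_consumer(stream):
    return sum(stream)

rows = range(5)
print(mr_consumer(mr_producer(rows)), dbms_consumer(dbms_producer(rows)))
```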
Scheduling
• MapReduce: tasks are scheduled onto processing nodes at runtime, one storage block at a time. Such runtime work scheduling at the granularity of storage blocks is much more expensive than the DBMS's compile-time scheduling. The argument in its favor: it lets the MR scheduler adapt to workload skew and to performance differences between nodes.
• DBMS: each node knows exactly what it must do and when, according to the distributed query plan. Because the operations are known in advance, the system can optimize the execution plan to minimize data transmission between nodes.
Column stores
• MapReduce: Hadoop/HDFS is a row store, which is slower for aggregate queries over a single attribute.
• DBMS: Vertica is a column store that reads only the attributes needed to answer the query; this limited need for reading data is a considerable performance advantage over traditional row-stored databases for aggregation queries (sketched below).
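The column-store advantage for aggregates comes from the data layout, sketched here with in-memory structures (illustrative only; real systems do this on disk, page by page):

```python
# Row store: whole rows are stored (and read) together, so a SUM over one
# attribute still drags in custId, date, and comment.
rows = [
    {"custId": 1, "amount": 10.0, "date": "2009-12-01", "comment": "..."},
    {"custId": 2, "amount": 55.0, "date": "2009-12-02", "comment": "..."},
]
row_total = sum(r["amount"] for r in rows)

# Column store: each attribute is stored separately, so the same query only
# touches the 'amount' column.
columns = {
    "custId":  [1, 2],
    "amount":  [10.0, 55.0],
    "date":    ["2009-12-01", "2009-12-02"],
    "comment": ["...", "..."],
}
col_total = sum(columns["amount"])
print(row_total, col_total)
```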
Learning from Each Other
• MapReduce advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution. Higher-level languages are invariably a good idea for any data-processing system, so work on higher-level interfaces on top of MR/Hadoop (Hive, Pig, SCOPE, DryadLINQ) should be accelerated.
• Parallel DBMS vendors should learn from MapReduce: commercial DBMS products must move toward one-button installs, automatic tuning that works correctly, better web sites with example code, better query generators, and better documentation. DBMSs also cannot deal with in-situ data; although some database systems (such as PostgreSQL, DB2, and SQL Server) have capabilities in this area, further flexibility is needed.
Conclusions:
• Parallel DBMSs excel at efficient querying of large data sets.
• MR-style systems excel at complex analytics and ETL tasks.
• The two are complementary.
• Complex analytical problems require the capabilities provided by both kinds of systems.
Thanks
Saurav Basu