Presentation of the paper:
Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1): 64-71 (2010)
MapReduceand ParallelDBMSs:Friendsor Foes?Spyros Eleftheriadis – email@example.comHarokopio University of AthensDepartment of Informatics and TelematicsPostgraduate Programme "Informatics and Telematics“Large Scale Data Management SystemsTuesday 23 May 2013
Conclusion before presentationMapReduce complementsDBMSs sincedatabases are not designed forextract-transform-load (ETL)tasks, a MapReduce specialty.
MR style systems excel at complexanalytics and ETL tasks.BUTParallel DBMSs excel at efficientquerying of large data sets while MRis significantly slower.
They conducted a benchmarkstudy using a popular opensourceMR implementation (Hadoop) and twoparallel DBMSs. The results showthat the DBMSs are substantially fasterthan the MR system once the datais loaded, but that loading the datatakes considerably longer in the databasesystems.
Parallel Database Systems• Mid-1980s• Teradata and Gamma projects• New Parallel database systems• Based on a cluster of “shared-nothing nodes” (orseparate CPU, memory, and disks) connected through ahigh-speed interconnect• Every parallel database system built since thenessentially uses the same two techniques• Horizontal partitioning of relational tables• Partitioned execution of SQL queries
Horizontal partitioning• 10-million-row table• Cluster of 50 nodes, each withfour disks• Places 50.000 rows on each ofthe 200 disks.
Partitioned executionof SQL queries (1)SELECT custId, amount FROM SalesWHERE date BETWEEN“12/1/2009” AND “12/25/2009”;• execute a SELECT operator against the Salesrecords with the specified date predicate on eachnode of the cluster.• intermediate results from each node are thensent to a single node• that node performs a MERGE operation to returnthe final result.
Partitioned executionof SQL queries (2a)SELECT custId, SUM(amount)FROM SalesWHERE date BETWEEN “12/1/2009” AND “12/25/2009”GROUP BY custId;• Sales table is round-robin partitioned across thenodes in the cluster• So the rows corresponding to any single customerare spread across multiple nodes.
Partitioned executionof SQL queries (2b)SELECT custId, SUM(amount)FROM SalesWHERE date BETWEEN “12/1/2009” AND“12/25/2009”GROUP BY custId;Three-operator pipeline: SELECTEach SELECT operator scans the fragment ofthe Sales table stored at that node.
Partitioned executionof SQL queries (2c)SELECT custId, SUM(amount)FROM SalesWHERE date BETWEEN “12/1/2009” AND“12/25/2009”GROUP BY custId;Three-operator pipeline: SHUFFLEAny rows satisfying the date predicate arepassed to a SHUFFLE operator thatdynamically repartitions the rows by applyinga hash function on the value of the custIdattribute of each row to map them to aparticular node.
Partitioned executionof SQL queries (2d)SELECT custId, SUM(amount)FROM SalesWHERE date BETWEEN “12/1/2009” AND“12/25/2009”GROUP BY custId;Three-operator pipeline: SUMSince the same hash function is used for theSHUFFLE operation on all nodes, rows for thesame customer are routed to the single nodewhere they are aggregated to compute thefinal total for each customer.
Partitioned executionof SQL queries (3a)SELECT C.name, C.email FROMCustomers C, Sales SWHERE C.custId = S.custIdAND S.amount > 1000AND S.date BETWEEN“12/1/2009” AND“12/25/2009”;• Sales table is round-robin partitioned• Customers table is hash-partitioned on theCustomer.custId attribute.(records with same Customer.custId are on thesame node)
Partitioned executionof SQL queries (3b)SELECT C.name, C.email FROMCustomers C, Sales SWHERE C.custId = S.custIdAND S.amount > 1000AND S.date BETWEEN“12/1/2009” AND“12/25/2009”;Three-operator pipeline: SELECTEach SELECT operator scans its fragment ofthe Sales table looking for rows that satisfythe predicate S.amount > 1000 and S.dateBETWEEN “12/1/2009” and “12/25/2009.”.
Partitioned executionof SQL queries (3c)SELECT C.name, C.email FROMCustomers C, Sales SWHERE C.custId = S.custIdAND S.amount > 1000AND S.date BETWEEN“12/1/2009” AND“12/25/2009”;Three-operator pipeline: SHUFFLEQualifying rows are pipelined into a shuffleoperator that repartitions its input rows by hashingon the Sales.custId attribute.By using the same hash function that was usedwhen loading rows of the Customer table(hash partitioned on the Customer.custIdattribute), the shuffle operators route eachqualifying Sales row to the node where thematching Customer tuple is stored.
Key benefit of parallel DBMSsThe system automatically manages the various alternativepartitioning strategies for the tables involved in the query.For example:If Sales and Customers are each hash-partitioned on theircustId attribute, the query optimizer will recognize that thetwo tables are both hash-partitioned on the joining attributesand omit the shuffle operator from the compiled query plan.Likewise, if both tables are round-robin partitioned, then theoptimizer will insert shuffle operators for both tables so tuplesthat join with one another end up on the same node.All this happens transparently to the user and to applicationprograms.
Mapping Parallel DBMSsonto MapReduceAn attractive quality of the MR programming model issimplicity; an MR program consists of only two functions Mapand Reduce—written by a user to process key/value data pairsMapReduce Parallel DBMSsFiltering and transformation ofindividual data items (tuples intables)SQL queriesMap operations that are noteasily expressedin SQLUDF (User Defined Functions)MR-style reduce functionalitySQL + UDFs = user-definedaggregatesReshuffle Map and Reduce GROUP BY in SQL
ETL and “read once” data setsThe canonical use of MR is characterized by the following template offive operations:•Read logs of information from several different sources;•Parse and clean the log data;•Perform complex transformations (such as “sessionalization”);•Decide what attribute data to store;•Load the information into a DBMS or other storage engine.These steps are analogous to the extract, transform, and loadphases in ETL systems•The MR system can be considered a general-purpose parallel ETLsystem.•For parallel DBMSs, many products perform ETL.•DBMSs do not try to do ETL, and ETL systems do not try to doDBMS services.•An ETL system is typically upstream from a DBMS, as the loadphase usually feeds data directly into a DBMS.
Complex analytics• In many data-mining and data-clustering applications, theprogram must make multiple passes over the data. Suchapplications cannot be structured as single SQL aggregatequeries, requiring instead a complex dataflow programwhere the output of one part of the application is theinput of another.
Semi-structured data• Unlike a DBMS, MR systems do not require users to definea schema for their data. Thus, MR-style systems easilystore and process what is known as “semi-structured”data. Such data often looks like key-value pairs, where thenumber of attributes present in any given record varies;this style of data is typical of Web traffic logs derived fromdisparate sources.• With a relational DBMS, one way to model such data is touse a wide table with many attributes to accommodatemultiple record types. Each unrequired attribute usesNULLs for the values that are not present for a givenrecord.• Column-based DBMSs (such as Vertica) mitigate theproblem by reading only the relevant attributes for anyquery and automatically suppressing the NULL values.
Quick-and-dirty analyses• Many current parallel DBMSs are difficult to install andconfigure properly, as users are often faced with a myriadof tuning parameters that must be set correctly for thesystem to operate effectively.• Compared to two commercial parallel systems, Hadoopprovides the best “out-of-the-box” experience.
Limited-budget operations• MR systems are open source projects available for free.• DBMSs, and in particular parallel DBMSs, are expensive.• Though there are good single-node open source solutions,• there are no robust, community supported parallel DBMSs.
The Benchmark• Hadoop MR framework• Vertica, a commercial column-store relational database• DBMS-X (probably Oracle Exadata), a row-based databasefrom a large commercial vendor.• a simple benchmark presented in the original MR paperfrom Google• four other analytical tasks of increasing complexity wethink are common processing tasks that could be doneusing either class of systems.• Run all experiments a 100-node shared-nothing cluster atthe University of Wisconsin-Madison.
Original MR Grep task• The task: each system must scan through a data set of100B records looking for a three-character pattern.• Each record consists of a unique key in the first 10B,followed by a 90B random value.• The search pattern is found only in the last 90B once inevery 10,000 records.• 1TB data set spread over the 100 nodes (10GB/node). Thedata set consists of 10 billion records, each 100B, whichprovides a simple measurement of how quickly a softwaresystem can scan through a large collection of records.• The task cannot take advantage of any sorting or indexingand is easy to specify in both MR and SQL.• The database systems are about two times faster thanHadoop
Web log task• The task is a conventional SQL aggregation with a GROUPBY clause on a table of user visits in a Web server log.• Such data is fairly typical of Web logs, and the query iscommonly used in traffic analytics.• For this experiment, we used a 2TB data set consisting of155 million records spread over the 100 nodes(20GB/node).• Each system must calculate the total ad revenue generatedfor each visited IP address from the logs.• Like the previous task, the records must all be read, andthus there is no indexing opportunity for the DBMSs.• Hadoop is beaten by the databases by a larger marginthan in the Grep task.
Join task• The task is a fairly complex join operation over two tablesrequiring an additional aggregation and filtering operation.• The user-visit data set from the previous task is joined withan additional 100GB table of PageRank values for 18million URLs (1GB/node).• The join task consists of two subtasks that perform acomplex calculation on the two data sets. In the first partof the task, each system must find the IP address thatgenerated the most revenue within a particular date rangein the user visits.• Once these intermediate records are generated, thesystem must then calculate the average PageRank of allpages visited during this interval.• DBMSs ought to be good at analytical queries involvingcomplex join operations.• The DBMSs are a factor of 36 and 21 respectively fasterthan Hadoop.
Architectural Differences• The performance differences between Hadoop and theDBMSs can be explained by a variety of factors.• For example, the MR processing model is independent ofthe underlying storage system, so data could theoreticallybe massaged, indexed, compressed, and carefully laid outon storage during a load phase, just like a DBMS.• The goal of our study was to compare the real-lifedifferences in performance of representative realizationsof the two models.
Architectural DifferencesRepetitive record parsing• Hadoop stores data in the accompanying distributed filesystem (HDFS), in the same textual format in which thedata was generated.• This default storage method places the burden of parsingthe fields of each record on user code. This parsing taskrequires each Map and Reduce task repeatedly parse andconvert string fields into the appropriate type.
Architectural DifferencesCompression• We found that enabling data compression in the DBMSsdelivered a significant performance gain.• On the other hand, Hadoop often executed slower whenwe used compression on its input files; at most,compression improved performance by 15%;• We postulate that commercial DBMSs use carefully tunedcompression algorithms to ensure that the cost ofdecompressing tuples does not offset the performancegains from the reduced I/O cost of reading compresseddata.
Architectural DifferencesPipelining• Parallel DBMSs operate by creating a query plan that isdistributed to the appropriate nodes at execution time.When one operator in this plan must send data to the nextoperator, regardless of whether that operator is running onthe same or a different node, the qualifying data is“pushed” by the first operator to the second operator.Hence, data is streamed from producer to consumer; theintermediate data is never written to disk• In Hadoop, the producer writes the intermediate results tolocal data structures, and the consumer subsequently“pulls” the data. These data structures are often quitelarge, so the system must write them out to disk,introducing a potential bottleneck.• Though writing data structures to disk gives Hadoop aconvenient way to checkpoint the output of intermediatemap jobs, thereby improving fault tolerance, it addssignificant performance overhead.
Architectural DifferencesScheduling• In a parallel DBMS, each node knows exactly what it mustdo and when it must do it according to the distributedquery plan. Because the operations are known in advance,the system is able to optimize the execution plan tominimize data transmission between nodes.• In contrast, each task in an MR system is scheduled onprocessing nodes one storage block at a time. Suchruntime work scheduling at a granularity of storage blocksis much more expensive than the DBMS compile-timescheduling. The former has the advantage, as some haveargued, of allowing the MR scheduler to adapt to workloadskew and performance differences between nodes.
Architectural DifferencesColumn-oriented storage• In a column store-based database (such as Vertica), thesystem reads only the attributes necessary for solving theuser query.• This limited need for reading data represents aconsiderable performance advantage over traditional, row-stored databases, where the system reads all attributes offthe disk. DBMS-X and Hadoop/HDFS are both essentiallyrow stores, while Vertica is a column store, giving Vertica asignificant advantage over the other two systems in ourWeb log benchmark task.
In general…• The Hadoop community will presumably fix thecompression problem in a future release.• Some of the other performance advantages of paralleldatabases (such as column-storage and operating directlyon compressed data) can be implemented in an MR systemwith user code.• Other implementations of the MR framework (such asGoogle’s proprietary implementation) may well have adifferent performance envelope.• The scheduling mechanism and pull model of datatransmission are fundamental to the MR block-level fault-tolerance model and thus unlikely to be changed.• Meanwhile, DBMSs offer transaction-level fault tolerance.DBMS researchers often point out that as databases getbigger and the number of nodes increases, the need forfiner granularity fault tolerance increases as well.
Learning from Each OtherWhat can MR learn from DBMSs?• MR advocates should learn from parallel DBMS thetechnologies and techniques for efficient query parallelexecution.• Higher-level languages are invariably a good idea for anydata-processing system. We found that writing the SQLcode for each task was substantially easier than writing MRcode.• Efforts to build higher-level interfaces on top ofMR/Hadoop should be accelerated (Hive, Pig etc.)
Learning from Each OtherWhat can DBMSs learn from MR?• The out-of-the-box experience for most DBMSs is less thanideal for being able to quickly set up and begin runningqueries. The commercial DBMS products must movetoward one-button installs, automatic tuning that workscorrectly, better Web sites with example code, betterquery generators, and better documentation.• Most database systems cannot deal with tables stored inthe file system (in situ data). Though some databasesystems (such as PostgreSQL, DB2, and SQL Server) havecapabilities in this area, further flexibility is needed.
Conclusion• Most of the architectural differences discussed here arethe result of the different focuses of the two classes ofsystem.• Parallel DBMSs excel at efficient querying of large datasets; MR style systems excel at complex analytics and ETLtasks. Neither is good at what the other does well. Hence,the two technologies are complementary, and we expectMR-style systems performing ETL to live directly upstreamfrom DBMSs.
Notes & ReferencesThe presentation is based on:• Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, SamuelMadden, Erik Paulson, Andrew Pavlo, Alexander Rasin:MapReduce and parallel DBMSs: friends or foes? Commun.ACM 53(1): 64-71 (2010)• Many people maintain that the benchmark’s results do notgive the full image mostly because the tasks were rathersimple for Hadoop and because the sample data was too lowfor most web-scale apps today.http://www.netuality.ro/hadoop_map_reduce_vs_dbms_benchmarks/articles/20100103/• Since then many major DBMS vendors integrate Hadoop datain their DBMS suites.• Oracle Big Data Connectorshttp://www.oracle.com/technetwork/bdc/big-data-connectors/overview/index.html• Microsoft Parallel Data Warehouse (PDW)http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx