MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi, Yale University, October 12th, 2010
Brief History of MapReduce: pre-2004, used at Google for many data processing apps, including Web indexing; 2004, paper in academic conference, not written in traditional academic style; 2004-2006, implemented in Nutch; 2006-2008, split off into Hadoop, with significant usage at Yahoo and increasing buzz.
 
 
Controversy: The vast majority of the outrage was about the comparison of the systems. BUT: the line between MapReduce and Hadoop (which comes with HDFS) was blurring, and Hadoop can be used as an alternative to traditional DW implementations built using DBMS software.
 
SIGMOD 2009 Paper: Benchmarked Hadoop vs. 2 parallel database systems. Compared across a variety of dimensions, including performance and ease of use. Measured differences in load and query time for some common data processing tasks. Used a Web analytics benchmark whose goal was to be representative of tasks that both should excel at, tasks that Hadoop should excel at, and tasks that databases should excel at.
Hardware Setup: 100-node cluster. Each node: 2.4 GHz Core 2 Duo Processors, 4 GB RAM, two 250 GB SATA HDs (74 MB/sec sequential I/O). Dual GigE switches, each with 50 nodes and a 128 Gbit/sec fabric, connected by a 64 Gbit/sec ring.
Join Task
UDF Task: Calculate PageRank over a set of HTML documents, performed via a UDF. DBMS clearly doesn’t scale.
Benchmark Conclusions: Hadoop has many advantages: load time is much faster, it is significantly easier to install and use, and it parallelizes UDFs better. Hadoop is consistently less efficient for structured, relational data, for reasons both fundamental and non-fundamental: it needs better support for compression and direct operation on compressed data, for indexing, and for co-partitioning of datasets.
Overall Conclusion: MapReduce/Hadoop and parallel databases are clearly complementary. Use MapReduce for ETL, unstructured data processing, and deep analysis that is hard to express in SQL. Use parallel databases for traditional data warehousing / data marts and structured data processing expressible in SQL. Cloudera agrees!
 
 
 
 
 
 
We’re all in agreement, right?
But Wait! Hadoop can do everything a parallel database can do. Hadoop has (something resembling) a SQL interface (Hive). Many of Hadoop’s performance deficiencies are not fundamental; they are a result of its initial design for unstructured data. Over 20 research papers in the last two years address improving Hadoop performance for DBMS workloads. Hadoop is free and open source (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary).
People are using Hadoop as a DW: Facebook has a 12 PB data warehouse in Hadoop/Hive, adding 10 TB per day. Yahoo’s warehouse is the same order of magnitude (recently switched to Hadoop).
Fault Tolerance and Cluster Heterogeneity Results: Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly.
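To make the cost of the two recovery strategies concrete, here is a minimal sketch that counts task executions when one task fails. It is an illustration only, not the benchmark's methodology: the function names are hypothetical, tasks are assumed to run one after another, and the retried work is assumed to succeed.

```python
def run_with_task_retry(num_tasks, failing_task):
    """MapReduce-style recovery: only the failed task is re-executed."""
    executions = 0
    for task in range(num_tasks):
        executions += 1              # first attempt of this task
        if task == failing_task:
            executions += 1          # rerun just this task (assumed to succeed)
    return executions


def run_with_query_restart(num_tasks, failing_task):
    """Restart-style recovery: a failure discards all work done so far."""
    executions = 0
    for task in range(num_tasks):
        executions += 1
        if task == failing_task:
            # Query aborted here; the whole query is run again from scratch
            # (the rerun is assumed to succeed).
            return executions + num_tasks
    return executions


# Hypothetical example: 100 tasks, with a failure at task index 89.
retry_cost = run_with_task_retry(100, 89)
restart_cost = run_with_query_restart(100, 89)
print(retry_cost, restart_cost)
```

Under these assumptions the later the failure occurs, the more work a full restart throws away, while per-task retry always costs one extra task.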
So … Hadoop can do everything that parallel databases can do, but it has better fault tolerance, adjusts better to runtime performance fluctuations, is more open / cheaper, and has at least as good scalability (if not better). If only we could fix those performance problems on structured data. HadoopDB!
HadoopDB: Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems. Flexible query interface (accepts both SQL and MapReduce). Open source (built using open source components).
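The idea of coordinating independent single-node databases can be sketched in a few lines. This is a toy illustration, not HadoopDB's code: in-memory SQLite databases stand in for the per-node database systems, plain Python functions stand in for Hadoop's map and reduce phases, and the `visits` table, its columns, and all function names are assumptions made for the example.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one independent single-node database holding a data partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_phase(node, sql):
    """Push the SQL fragment down to one node; emit (key, value) pairs."""
    return node.execute(sql).fetchall()


def reduce_phase(pairs):
    """Merge the partial aggregates from all nodes, as a reducer would."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_query(nodes, sql):
    """Coordinator: run the local SQL on every node, then combine results."""
    pairs = []
    for node in nodes:
        pairs.extend(map_phase(node, sql))
    return reduce_phase(pairs)


# Two hypothetical partitions of the same table.
nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0)]),
    make_node([("a.com", 3.0), ("c.com", 4.0)]),
]
LOCAL_SQL = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
result = run_query(nodes, LOCAL_SQL)
print(result)
```

Each node does as much work as it can inside its own database (here, a local GROUP BY), so the coordinating layer only has to merge small partial results.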
HadoopDB Architecture
TPC-H Benchmark Results
Fault Tolerance and Cluster Heterogeneity Results
HadoopDB: Current Status: Initial open source release was over a year ago. A bunch of new code has been written since then, but it has not yet been put up online; this new code is available by request. Expect the next release to be in mid-2011. Money is available for people who want to help with development (e-mail justin.borgman@yale.edu).
Invisible Loading: Data starts in HDFS. Data is immediately available for processing (immediate gratification paradigm). Each MapReduce job causes data movement from HDFS to database systems. Data is incrementally loaded, sorted, and indexed. Query performance improves “invisibly”.
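The incremental behavior described on this slide can be illustrated with a small sketch. It is a toy model under stated assumptions, not the actual implementation: lists of text lines stand in for files in HDFS, an in-memory SQLite database stands in for the database systems, each job migrates exactly one chunk, and the class and method names are hypothetical. The sketch indexes the loaded rows but does not sort them.

```python
import sqlite3


class InvisibleLoader:
    def __init__(self, raw_chunks):
        self.raw_chunks = raw_chunks   # stands in for raw files in HDFS
        self.loaded = 0                # number of chunks already migrated
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (k TEXT, v INTEGER)")
        self.db.execute("CREATE INDEX t_k ON t (k)")

    def run_job(self, key):
        """Sum v for `key`; as a side effect, migrate one more raw chunk.

        Returns (total, raw_rows_scanned)."""
        # Rows already loaded are answered from the indexed table.
        total = self.db.execute(
            "SELECT COALESCE(SUM(v), 0) FROM t WHERE k = ?", (key,)
        ).fetchone()[0]
        start = self.loaded
        scanned = 0
        # Everything not yet loaded must still be parsed from raw text.
        for i in range(start, len(self.raw_chunks)):
            rows = [line.split(",") for line in self.raw_chunks[i]]
            rows = [(k, int(v)) for k, v in rows]
            scanned += len(rows)
            total += sum(v for k, v in rows if k == key)
            if i == start:
                # Piggyback on the scan: load this chunk into the database.
                self.db.executemany("INSERT INTO t VALUES (?, ?)", rows)
        if start < len(self.raw_chunks):
            self.loaded += 1
        return total, scanned


chunks = [["a,1", "b,2"], ["a,3", "c,4"], ["a,5", "b,6"]]
loader = InvisibleLoader(chunks)
history = [loader.run_job("a") for _ in range(4)]
print(history)
```

Every job returns the same answer, but the number of raw rows it has to parse shrinks from job to job as more data ends up in the database.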
Conclusions: MapReduce and parallel databases are definitely complementary. MapReduce and parallel databases are definitely competitive. HadoopDB is awesome.

Daniel Abadi HadoopWorld 2010
