Daniel Abadi HadoopWorld 2010


Daniel Abadi's HadoopWorld 2010 Slides

Published in: Technology

  1. 1. MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi, Yale University, October 12th, 2010
  2. 2. Brief History of MapReduce
     • Pre-2004: used at Google for many data processing apps, including Web indexing
     • 2004: paper in academic conference not written in traditional academic style
     • 2004-2006: Implemented in Nutch
     • 2006-2008: Split off into Hadoop; significant usage at Yahoo; buzz increases
  3. 5. Controversy
     • Vast majority of the outrage was about the comparison of the systems
     • BUT:
        • The line between MapReduce and Hadoop (which comes with HDFS) was blurring
        • Hadoop can be used as an alternative to traditional DW implementations built using DBMS software
  4. 7. SIGMOD 2009 Paper
     • Benchmarked Hadoop vs. 2 parallel database systems
        • Compared across a variety of dimensions including performance and ease of use
        • Measured differences in load and query time for some common data processing tasks
        • Used Web analytics benchmark whose goal was to be representative of tasks that:
           • Both should excel at
           • Hadoop should excel at
           • Databases should excel at
  5. 8. Hardware Setup
     • 100 node cluster
     • Each node
        • 2.4 GHz Core 2 Duo Processors
        • 4 GB RAM
        • 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
     • Dual GigE switches, each with 50 nodes
        • 128 Gbit/sec fabric
     • Connected by a 64 Gbit/sec ring
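
As a rough back-of-the-envelope check on what this cluster can scan, the sketch below multiplies out the disk figures from the slide. It assumes the 74 MB/sec figure is per disk and that all disks can be read in parallel at the full sequential rate; neither assumption is stated on the slide, and the function names are illustrative only.

```python
# Back-of-the-envelope scan rate for the benchmark cluster.
# Assumptions (not stated on the slide): 74 MB/s is a per-disk figure,
# and every disk can be read in parallel at that full rate.
NODES = 100
DISKS_PER_NODE = 2
SEQ_MB_PER_SEC_PER_DISK = 74


def aggregate_scan_rate_mb_s(nodes=NODES, disks=DISKS_PER_NODE,
                             rate=SEQ_MB_PER_SEC_PER_DISK):
    """Best-case combined sequential read rate of the whole cluster, in MB/s."""
    return nodes * disks * rate


def scan_seconds(dataset_mb):
    """Best-case time to read dataset_mb once, spread evenly over all disks."""
    return dataset_mb / aggregate_scan_rate_mb_s()


cluster_rate = aggregate_scan_rate_mb_s()
one_tb_scan = scan_seconds(1_000_000)  # 1 TB expressed in decimal MB
```

Under these assumptions the cluster reads at most 14,800 MB/sec in aggregate, so a full scan of 1 TB takes a little over a minute at best. Real load and query times also depend on CPU, network, and storage layout.
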
  6. 9. Join Task
  7. 10. UDF Task (DBMS clearly doesn’t scale)
     • Calculate PageRank over a set of HTML documents
     • Performed via a UDF
  8. 11. Benchmark Conclusions
     • Hadoop has many advantages
        • Load time much faster
        • Significantly easier to install, use
        • Better parallelization of UDFs
     • Hadoop is consistently less efficient for structured, relational data
        • Reasons both fundamental and non-fundamental
        • Needs better support for compression and direct operation on compressed data
        • Needs better support for indexing
        • Needs better support for co-partitioning of datasets
  9. 12. Overall Conclusion
     • MapReduce/Hadoop and parallel databases are clearly complementary
     • Use MapReduce if you want to do:
        • ETL
        • Unstructured data processing
        • Deep analysis that is hard to express in SQL
     • Use parallel databases for:
        • Traditional data warehousing / data marts
        • Structured data processing expressible in SQL
     • Cloudera agrees!
  10. 19. We’re all in agreement, right?
  11. 20. But Wait!
     • Hadoop can do everything a parallel database can do
     • Hadoop has (something resembling) a SQL interface (Hive)
     • Many of Hadoop’s performance deficiencies not fundamental
        • Result of initial design for unstructured data
        • Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads
     • Hadoop is free and open source
        • (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)
  12. 21. People are using Hadoop as a DW
     • Facebook has 12PB data warehouse in Hadoop/Hive
        • Adding 10TB per day
     • Yahoo’s warehouse is the same order of magnitude
        • Recently switched to Hadoop
  13. 22. Fault Tolerance and Cluster Heterogeneity Results
     • Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly
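
To illustrate why restarting a whole query is costly, here is a toy model (not taken from the talk) that counts task executions after one node fails. It assumes equal-cost tasks, an even split of tasks across nodes, and that with task-level recovery only the failed node's completed tasks must be redone; the function names and parameters are hypothetical.

```python
# Toy cost model for a single node failure. All names and the equal-cost,
# evenly-spread task assumption are illustrative, not from the slides.

def executions_with_query_restart(nodes, tasks_per_node, done_per_node):
    """Whole-query restart: all finished work is thrown away, then everything reruns."""
    _check(tasks_per_node, done_per_node)
    wasted = nodes * done_per_node
    return wasted + nodes * tasks_per_node


def executions_with_task_retry(nodes, tasks_per_node, done_per_node):
    """Task-level recovery: only the failed node's finished tasks are rerun elsewhere."""
    _check(tasks_per_node, done_per_node)
    redone = done_per_node
    return nodes * tasks_per_node + redone


def _check(tasks_per_node, done_per_node):
    if not 0 <= done_per_node <= tasks_per_node:
        raise ValueError("done_per_node must be between 0 and tasks_per_node")


# Example: 10 nodes, 10 tasks each, failure once every node has finished 9 tasks.
restart_cost = executions_with_query_restart(10, 10, 9)
retry_cost = executions_with_task_retry(10, 10, 9)
```

In this example the restart model runs 190 task executions against 109 for task-level retry. The gap grows with cluster size and with how late in the query the failure happens.
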
  14. 23. So …
     • Hadoop can do everything that parallel databases can do, but:
        • Has better fault tolerance
        • Adjusts better to runtime performance fluctuations
        • Is more open / cheaper
        • Has at least as good scalability (if not better)
     • If only we could fix those performance problems on structured data
        • HadoopDB!
  15. 24. HadoopDB
     • Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems
        • Flexible query interface (accepts both SQL and MapReduce)
        • Open source (built using open source components)
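
The coordination idea on this slide can be sketched in a few lines. This is a minimal illustration, not HadoopDB's code: in-memory `sqlite3` databases stand in for the independent single-node database systems, plain Python functions stand in for Hadoop's map and reduce phases, and the `visits` table and its columns are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Illustrative stand-ins only: each sqlite3 connection plays the role of one
# independent single-node database; the table and data are hypothetical.


def make_node(rows):
    """Create one single-node database holding a partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_phase(node, sql):
    """Push the SQL down to one node and return its partial result."""
    return node.execute(sql).fetchall()


def reduce_phase(partials):
    """Merge per-node partial sums into a global aggregate."""
    totals = defaultdict(float)
    for rows in partials:
        for key, value in rows:
            totals[key] += value
    return dict(totals)


def run_query(nodes):
    sql = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
    return reduce_phase(map_phase(node, sql) for node in nodes)


nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0)]),
    make_node([("a.com", 3.0)]),
]
result = run_query(nodes)
```

Each node does its share of the aggregation inside its own database engine, and the coordinating layer only combines the small partial results. That split is what lets the per-node work benefit from database features such as indexing.
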
  16. 25. HadoopDB Architecture
  17. 26. TPC-H Benchmark Results
  18. 27. Fault Tolerance and Cluster Heterogeneity Results
  19. 28. HadoopDB: Current Status
     • Initial open source release over a year ago
        • A bunch of new code since then, but not yet put up online
        • This new code is available by request
     • Expect the next release to be in mid-2011
     • Money available for people who want to help with development (e-mail
  20. 29. Invisible Loading
     • Data starts in HDFS
     • Data is immediately available for processing (immediate gratification paradigm)
     • Each MapReduce job causes data movement from HDFS to database systems
     • Data is incrementally loaded, sorted, and indexed
     • Query performance improves “invisibly”
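
The bullets above can be sketched as a small simulation. This is a hedged illustration, not the authors' implementation: the `InvisibleLoader` class, the `blocks_per_job` parameter, and the use of Python lists and dicts in place of HDFS and a database are all assumptions made for the example.

```python
class InvisibleLoader:
    """Toy model: each job migrates a few more raw blocks into a 'database'.

    hdfs_blocks stands in for files in HDFS; db_blocks stands in for data
    already loaded (and sorted) in the database systems. Names are hypothetical.
    """

    def __init__(self, hdfs_blocks, blocks_per_job=2):
        self.hdfs_blocks = [list(block) for block in hdfs_blocks]
        self.db_blocks = {}
        self.blocks_per_job = blocks_per_job

    def run_job(self, fn):
        """Apply fn to every record; return (results, number of raw-block reads)."""
        results, raw_reads, migrated = [], 0, 0
        for block_id, records in enumerate(self.hdfs_blocks):
            if block_id in self.db_blocks:
                data = self.db_blocks[block_id]      # fast path: already loaded
            else:
                data = records                       # slow path: read raw data
                raw_reads += 1
                if migrated < self.blocks_per_job:   # side effect of the job
                    self.db_blocks[block_id] = sorted(records)
                    migrated += 1
            results.extend(fn(record) for record in data)
        return results, raw_reads


loader = InvisibleLoader([[3, 1], [2], [5, 4], [6]], blocks_per_job=2)
first_results, first_raw = loader.run_job(lambda x: x * 2)
second_results, second_raw = loader.run_job(lambda x: x * 2)
third_results, third_raw = loader.run_job(lambda x: x * 2)
```

Every job returns the full answer from the start, while the number of raw-block reads drops from one job to the next as more data ends up in the database. No separate load step is ever run, which is the "invisible" part.
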
  21. 30. Conclusions
     • MapReduce and parallel databases are definitely complementary
     • MapReduce and parallel databases are definitely competitive
     • HadoopDB is awesome