Daniel Abadi HadoopWorld 2010

Daniel Abadi's HadoopWorld 2010 Slides

Published in: Technology

  1. 1. MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi, Yale University, October 12th, 2010
  2. 2. Brief History of MapReduce
      • Pre-2004: used at Google for many data processing apps, including Web indexing
      • 2004: paper in academic conference not written in traditional academic style
      • 2004-2006: Implemented in Nutch
      • 2006-2008: Split off into Hadoop; significant usage at Yahoo; buzz increases
  3. 5. Controversy
      • Vast majority of the outrage was about the comparison of the systems
      • BUT:
          • The line between MapReduce and Hadoop (which comes with HDFS) was blurring
          • Hadoop can be used as an alternative to traditional DW implementations built using DBMS software
  4. 7. SIGMOD 2009 Paper
      • Benchmarked Hadoop vs. 2 parallel database systems
          • Compared across a variety of dimensions including performance and ease of use
          • Measured differences in load and query time for some common data processing tasks
          • Used Web analytics benchmark whose goal was to be representative of tasks that:
              • Both should excel at
              • Hadoop should excel at
              • Databases should excel at
  5. 8. Hardware Setup
      • 100 node cluster
      • Each node
          • 2.4 GHz Core 2 Duo Processors
          • 4 GB RAM
          • 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
      • Dual GigE switches, each with 50 nodes
          • 128 Gbit/sec fabric
      • Connected by a 64 Gbit/sec ring
  6. 9. Join Task
  7. 10. UDF Task: DBMS clearly doesn’t scale
      • Calculate PageRank over a set of HTML documents
      • Performed via a UDF
  8. 11. Benchmark Conclusions
      • Hadoop has many advantages
          • Load time much faster
          • Significantly easier to install, use
          • Better parallelization of UDFs
      • Hadoop is consistently less efficient for structured, relational data
          • Reasons both fundamental and non-fundamental
          • Needs better support for compression and direct operation on compressed data
          • Needs better support for indexing
          • Needs better support for co-partitioning of datasets
  9. 12. Overall Conclusion
      • MapReduce/Hadoop and parallel databases are clearly complementary
      • Use MapReduce if you want to do:
          • ETL
          • Unstructured data processing
          • Deep analysis that is hard to express in SQL
      • Use parallel databases for:
          • Traditional data warehousing / data marts
          • Structured data processing expressible in SQL
      • Cloudera agrees!
  10. 19. We’re all in agreement, right?
  11. 20. But Wait!
      • Hadoop can do everything a parallel database can do
      • Hadoop has (something resembling) a SQL interface (Hive)
      • Many of Hadoop’s performance deficiencies not fundamental
          • Result of initial design for unstructured data
          • Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads
      • Hadoop is free and open source
          • (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)
  12. 21. People are using Hadoop as a DW
      • Facebook has 12PB data warehouse in Hadoop/Hive
          • Adding 10TB per day
      • Yahoo’s warehouse is the same order of magnitude
          • Recently switched to Hadoop
  13. 22. Fault Tolerance and Cluster Heterogeneity Results
      • Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly
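
The contrast on the slide above (whole-query restart versus re-running only the failed piece) can be made concrete with a rough work-counting sketch. This is a toy model under stated assumptions, not a measurement from the talk: every task costs one unit of work, tasks are counted sequentially rather than in parallel, exactly one failure occurs, and slow-node adaptation is not modeled. The function names are hypothetical.

```python
def run_with_query_restart(num_tasks, failing_task):
    """Parallel-database style: a single failure aborts and restarts the query."""
    work = 0
    failed_once = False
    while True:
        for task in range(num_tasks):
            work += 1  # one unit of work per task attempt
            if task == failing_task and not failed_once:
                failed_once = True
                break  # abort: everything done so far is thrown away
        else:
            return work  # a full pass completed without failure


def run_with_task_retry(num_tasks, failing_task):
    """MapReduce style: only the failed task is executed again."""
    work = 0
    for task in range(num_tasks):
        work += 1  # first attempt of this task
        if task == failing_task:
            work += 1  # re-execute just this task
    return work


restart_work = run_with_query_restart(num_tasks=10, failing_task=7)
retry_work = run_with_task_retry(num_tasks=10, failing_task=7)
print(restart_work, retry_work)  # prints "18 11"
```

In this model the restart strategy wastes more work the later the failure happens, while the retry strategy always pays one extra task.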
  14. 23. So …
      • Hadoop can do everything that parallel databases can do, but:
          • Has better fault tolerance
          • Adjusts better to runtime performance fluctuations
          • Is more open / cheaper
          • Has at least as good scalability (if not better)
      • If only we could fix those performance problems on structured data
          • HadoopDB!
  15. 24. HadoopDB
      • Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems
          • Flexible query interface (accepts both SQL and MapReduce)
          • Open source (built using open source components)
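
The idea on the HadoopDB slide, pushing SQL down to independent single-node databases and combining their partial results in a MapReduce-style step, can be sketched roughly as follows. This is a toy illustration, not HadoopDB's actual code: in-memory SQLite databases stand in for the per-node database systems, and the `visits` table, the function names, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one independent single-node database (in-memory SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Push the SQL down: each node aggregates only its own partition."""
    query = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
    return node.execute(query).fetchall()


def reduce_phase(partials):
    """Merge the per-node partial sums into the global answer."""
    totals = defaultdict(float)
    for url, subtotal in partials:
        totals[url] += subtotal
    return dict(totals)


def run_query(nodes):
    """Coordinator: run the map phase on every node, then reduce."""
    partials = [pair for node in nodes for pair in map_phase(node)]
    return reduce_phase(partials)


# Two "nodes", each holding a different partition of the same table.
nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0)]),
    make_node([("a.com", 3.0)]),
]
result = run_query(nodes)
print(result)
```

The point of the split is that each database does the structured-data work it is good at (here, a grouped aggregate), while the coordinating layer only has to combine small partial results.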
  16. 25. HadoopDB Architecture
  17. 26. TPC-H Benchmark Results
  18. 27. Fault Tolerance and Cluster Heterogeneity Results
  19. 28. HadoopDB: Current Status
      • Initial open source release over a year ago
          • A bunch of new code since then, but not yet put up online
          • This new code is available by request
      • Expect the next release to be in mid-2011
      • Money available for people who want to help with development (e-mail justin.borgman@yale.edu)
  20. 29. Invisible Loading
      • Data starts in HDFS
      • Data is immediately available for processing (immediate gratification paradigm)
      • Each MapReduce job causes data movement from HDFS to database systems
      • Data is incrementally loaded, sorted, and indexed
      • Query performance improves “invisibly”
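
The invisible-loading steps above can be sketched as a small simulation. It is only an illustration under assumptions, not the real implementation: a Python list stands in for a file in HDFS, an in-memory SQLite table stands in for the database system, and the class name, `chunk_size` parameter, and sample records are hypothetical. The sketch loads and indexes data incrementally; the sorting step named on the slide is not modeled.

```python
import sqlite3


class InvisibleLoader:
    def __init__(self, hdfs_records, chunk_size):
        self.raw = list(hdfs_records)  # stand-in for a raw file in HDFS
        self.chunk_size = chunk_size   # records moved to the database per job
        self.loaded = 0                # how many raw records are in the database
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (k INTEGER, v TEXT)")
        self.db.execute("CREATE INDEX t_k ON t (k)")

    def run_job(self, key):
        """Answer a lookup job; as a side effect, load one more chunk."""
        # Records already loaded are answered from the indexed table.
        cursor = self.db.execute("SELECT v FROM t WHERE k = ?", (key,))
        hits = [v for (v,) in cursor]
        # Records not yet loaded are still scanned from the raw data.
        rest = self.raw[self.loaded:]
        hits += [v for k, v in rest if k == key]
        # Side effect of the job: move the next chunk into the database.
        chunk = rest[:self.chunk_size]
        self.db.executemany("INSERT INTO t VALUES (?, ?)", chunk)
        self.loaded += len(chunk)
        return sorted(hits)


records = [(i % 3, f"row{i}") for i in range(6)]
loader = InvisibleLoader(records, chunk_size=2)
first = loader.run_job(0)   # answered entirely by scanning raw data
second = loader.run_job(0)  # partly answered from the indexed table
print(first, second, loader.loaded)
```

Both jobs return the same answer, but each one leaves less raw data to scan for the next, which is the sense in which performance improves without an explicit load step.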
  21. 30. Conclusions
      • MapReduce and parallel databases are definitely complementary
      • MapReduce and parallel databases are definitely competitive
      • HadoopDB is awesome
