Daniel Abadi HadoopWorld 2010

Daniel Abadi's HadoopWorld 2010 Slides

Published in: Technology


  • 1. MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi, Yale University, October 12th, 2010
  • 2. Brief History of MapReduce
    • Pre-2004: used at Google for many data processing apps, including Web indexing
    • 2004: paper in academic conference not written in traditional academic style
    • 2004-2006: Implemented in Nutch
    • 2006-2008: Split off into Hadoop; significant usage at Yahoo; buzz increases
  • 3.  
  • 4.  
  • 5. Controversy
    • Vast majority of the outrage was about the comparison of the systems
    • BUT:
      • The line between MapReduce and Hadoop (which comes with HDFS) was blurring
      • Hadoop can be used as an alternative to traditional DW implementations built using DBMS software
  • 6.  
  • 7. SIGMOD 2009 Paper
    • Benchmarked Hadoop vs. 2 parallel database systems
      • Compared across a variety of dimensions including performance and ease of use
      • Measured differences in load and query time for some common data processing tasks
      • Used Web analytics benchmark whose goal was to be representative of tasks that:
        • Both should excel at
        • Hadoop should excel at
        • Databases should excel at
  • 8. Hardware Setup
    • 100 node cluster
    • Each node
      • 2.4 GHz Core 2 Duo processors
      • 4 GB RAM
      • 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
    • Dual GigE switches, each with 50 nodes
      • 128 Gbit/sec fabric
    • Connected by a 64 Gbit/sec ring
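
To put the hardware figures in proportion, here is a rough back-of-the-envelope calculation. It assumes the 74 MB/sec figure applies to each disk individually and uses decimal units (1 Gbit = 1000 Mbit); neither assumption is stated on the slide, and the constant names are illustrative only.

```python
# Rough arithmetic on the cluster figures from the slide.
# Assumptions (not from the slide): 74 MB/sec is per disk; decimal units.
NODES = 100
DISKS_PER_NODE = 2
DISK_MB_PER_SEC = 74
RING_GBIT_PER_SEC = 64

# Peak sequential read rate if every disk in the cluster streams at once.
aggregate_disk_mb_per_sec = NODES * DISKS_PER_NODE * DISK_MB_PER_SEC

# Capacity of the ring joining the two switches, converted to MB/sec.
ring_mb_per_sec = RING_GBIT_PER_SEC * 1000 / 8

print(aggregate_disk_mb_per_sec, ring_mb_per_sec)
```

Under these assumptions the disks can collectively stream more data than the inter-switch ring can carry, which is one reason tasks that shuffle large volumes between nodes are sensitive to data placement.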
  • 9. Join Task
  • 10. UDF Task: DBMS clearly doesn’t scale
    • Calculate PageRank over a set of HTML documents
    • Performed via a UDF
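
As an illustration of why this kind of task maps naturally onto MapReduce, the following is a minimal sketch of one common way to express PageRank as repeated map and reduce steps. It is not the benchmark's implementation: the link graph is given directly instead of being parsed from HTML, and the function names, damping factor, and iteration count are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(links, ranks):
    """Map step: each page splits its current rank evenly across its outlinks."""
    for page, outlinks in links.items():
        for target in outlinks:
            yield target, ranks[page] / len(outlinks)

def reduce_phase(pairs, pages, damping=0.85):
    """Reduce step: sum the contributions per page and apply the damping factor."""
    sums = defaultdict(float)
    for target, contribution in pairs:
        sums[target] += contribution
    n = len(pages)
    return {p: (1 - damping) / n + damping * sums[p] for p in pages}

def pagerank(links, iterations=20):
    """Run a fixed number of map/reduce rounds, starting from uniform ranks."""
    pages = sorted(links)
    ranks = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(links, ranks), pages)
    return ranks

# Hypothetical three-page link graph standing in for parsed HTML documents.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

Page "c" receives links from both other pages, so it ends up ranked above "b", which is linked only from "a".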
  • 11. Benchmark Conclusions
    • Hadoop has many advantages
      • Load time much faster
      • Significantly easier to install, use
      • Better parallelization of UDFs
    • Hadoop is consistently less efficient for structured, relational data
      • Reasons both fundamental and non-fundamental
      • Needs better support for compression and direct operation on compressed data
      • Needs better support for indexing
      • Needs better support for co-partitioning of datasets
  • 12. Overall Conclusion
    • MapReduce/Hadoop and parallel databases are clearly complementary
    • Use MapReduce if you want to do:
      • ETL
      • Unstructured data processing
      • Deep analysis that is hard to express in SQL
    • Use parallel databases for:
      • Traditional data warehousing / data marts
      • Structured data processing expressible in SQL
    • Cloudera agrees!
  • 13.  
  • 14.  
  • 15.  
  • 16.  
  • 17.  
  • 18.  
  • 19. We’re all in agreement, right?
  • 20. But Wait!
    • Hadoop can do everything a parallel database can do
    • Hadoop has (something resembling) a SQL interface (Hive)
    • Many of Hadoop’s performance deficiencies not fundamental
      • Result of initial design for unstructured data
      • Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads
    • Hadoop is free and open source
      • (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)
  • 21. People are using Hadoop as a DW
    • Facebook has 12PB data warehouse in Hadoop/Hive
      • Adding 10TB per day
    • Yahoo’s warehouse is the same order of magnitude
      • Recently switched to Hadoop
  • 22. Fault Tolerance and Cluster Heterogeneity Results
    • Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly
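
The cost difference between restarting a whole query and retrying a single task can be sketched with a toy model. This is a deliberately simplified, sequential simulation that counts units of work, not wall-clock time on a parallel cluster; the function name `run_query` and its parameters are hypothetical.

```python
def run_query(tasks, fails_once, restart_whole_query):
    """Count task executions when each task in `fails_once` fails on its first attempt.

    restart_whole_query=True models restarting the query from the beginning;
    restart_whole_query=False models rerunning only the failed task.
    """
    remaining_failures = set(fails_once)
    executed = 0
    i = 0
    while i < len(tasks):
        executed += 1
        if tasks[i] in remaining_failures:
            remaining_failures.discard(tasks[i])
            if restart_whole_query:
                i = 0  # throw away all progress and start over
            continue   # otherwise retry the same task
        i += 1
    return executed

tasks = list(range(10))
query_restart_work = run_query(tasks, fails_once={6}, restart_whole_query=True)
task_retry_work = run_query(tasks, fails_once={6}, restart_whole_query=False)
print(query_restart_work, task_retry_work)
```

In this model the wasted work under query-level restart grows with how late in the query the failure occurs, while task-level retry wastes one task execution per failure.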
  • 23. So …
    • Hadoop can do everything that parallel databases can do, but:
      • Has better fault tolerance
      • Adjusts better to runtime performance fluctuations
      • Is more open / cheaper
      • Has at least as good scalability (if not better)
    • If only we could fix those performance problems on structured data
      • HadoopDB!
  • 24. HadoopDB
    • Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems
      • Flexible query interface (accepts both SQL and MapReduce)
      • Open source (built using open source components)
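
The idea of coordinating several independent single-node databases can be sketched as follows. This is only an illustration of the general pattern, not HadoopDB's code: in-memory SQLite databases stand in for the per-node database systems, plain Python functions stand in for Hadoop map and reduce tasks, and the table name `visits`, its columns, and the sample rows are invented for the example.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create one independent single-node database holding one partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn

def map_task(conn):
    """Push the SQL down to the local database and emit its partial aggregates."""
    return conn.execute(
        "SELECT url, SUM(revenue) FROM visits GROUP BY url"
    ).fetchall()

def reduce_task(partials):
    """Combine the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for url, subtotal in partials:
        totals[url] += subtotal
    return dict(totals)

# Hypothetical data, partitioned across three "nodes".
partitions = [
    [("a.com", 1.0), ("b.com", 2.0)],
    [("a.com", 3.0)],
    [("b.com", 4.0), ("c.com", 5.0)],
]
nodes = [make_node(rows) for rows in partitions]
partials = [pair for conn in nodes for pair in map_task(conn)]
totals = reduce_task(partials)
print(totals)
```

Each node does the filtering and grouping it is good at, and the coordinating layer only has to merge small partial results.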
  • 25. HadoopDB Architecture
  • 26. TPC-H Benchmark Results
  • 27. Fault Tolerance and Cluster Heterogeneity Results
  • 28. HadoopDB: Current Status
    • Initial open source release over a year ago
      • A bunch of new code since then, but not yet put up online
      • This new code is available by request
    • Expect the next release to be in mid-2011
    • Money available for people who want to help with development (e-mail
  • 29. Invisible Loading
    • Data starts in HDFS
    • Data is immediately available for processing (immediate gratification paradigm)
    • Each MapReduce job causes data movement from HDFS to database systems
    • Data is incrementally loaded, sorted, and indexed
    • Query performance improves “invisibly”
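
The loading behavior described above can be sketched in a few lines. This is a toy model under stated assumptions, not the actual mechanism: a list of Python lists stands in for data files in HDFS, an in-memory SQLite table with an index stands in for the database systems, the class name `InvisibleLoader` and the `chunks_per_job` parameter are hypothetical, and the sorting step is omitted.

```python
import sqlite3

class InvisibleLoader:
    def __init__(self, raw_chunks, chunks_per_job=1):
        self.raw_chunks = raw_chunks          # stand-in for files in HDFS
        self.loaded = 0                       # chunks already moved into the database
        self.chunks_per_job = chunks_per_job  # how much each job migrates
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (value INTEGER)")
        self.db.execute("CREATE INDEX t_value ON t (value)")

    def run_job(self, threshold):
        """Count values above `threshold`; as a side effect, load a few more chunks."""
        # The portion already loaded is answered by the indexed database.
        (count,) = self.db.execute(
            "SELECT COUNT(*) FROM t WHERE value > ?", (threshold,)
        ).fetchone()
        # The rest is scanned from the raw data, so results are complete from the first job.
        pending = self.raw_chunks[self.loaded:]
        for i, chunk in enumerate(pending):
            count += sum(1 for v in chunk if v > threshold)
            if i < self.chunks_per_job:
                self.db.executemany(
                    "INSERT INTO t VALUES (?)", [(v,) for v in chunk]
                )
        self.loaded += min(self.chunks_per_job, len(pending))
        return count

loader = InvisibleLoader([[1, 5, 9], [2, 6], [7, 8, 3]])
results = [loader.run_job(4) for _ in range(4)]
print(results, loader.loaded)
```

Every job returns the same answer, but each one leaves less raw data to scan for the next, which is the sense in which performance improves without an explicit load step.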
  • 30. Conclusions
    • MapReduce and parallel databases are definitely complementary
    • MapReduce and parallel databases are definitely competitive
    • HadoopDB is awesome