Daniel Abadi HadoopWorld 2010


Daniel Abadi's HadoopWorld 2010 Slides


Transcript

  • 1. MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi, Yale University, October 12th, 2010
  • 2. Brief History of MapReduce
    • Pre-2004: used at Google for many data processing apps, including Web indexing
    • 2004: paper in academic conference not written in traditional academic style
    • 2004-2006: Implemented in Nutch
    • 2006-2008: Split off into Hadoop; significant usage at Yahoo; buzz increases
  • 3.  
  • 4.  
  • 5. Controversy
    • Vast majority of the outrage was about the comparison of the systems
    • BUT:
      • The line between MapReduce and Hadoop (which comes with HDFS) was blurring
      • Hadoop can be used as an alternative to traditional DW implementations built using DBMS software
  • 6.  
  • 7. SIGMOD 2009 Paper
    • Benchmarked Hadoop vs. 2 parallel database systems
      • Compared across a variety of dimensions including performance and ease of use
      • Measured differences in load and query time for some common data processing tasks
      • Used Web analytics benchmark whose goal was to be representative of tasks that:
        • Both should excel at
        • Hadoop should excel at
        • Databases should excel at
  • 8. Hardware Setup
    • 100 node cluster
    • Each node
      • 2.4 GHz Core 2 Duo Processors
      • 4 GB RAM
      • 2 x 250 GB SATA HDs (74 MB/sec sequential I/O)
    • Dual GigE switches, each with 50 nodes
      • 128 Gbit/sec fabric
    • Connected by a 64 Gbit/sec ring
  • 9. Join Task
  • 10. UDF Task (DBMS clearly doesn’t scale)
    • Calculate PageRank over a set of HTML documents
    • Performed via a UDF
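
Slide 10 gives no code. As a rough illustration of why this kind of task fits a UDF or MapReduce job better than plain SQL, the link-counting step that PageRank-style computations build on could be sketched as a map function and a reduce function. The toy corpus, the regular expression, and every function name below are assumptions made for illustration; this is not the benchmark's code and it runs in memory rather than on Hadoop.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus: URL -> HTML body (not the benchmark's data).
DOCS = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a>',
    "c.html": '<a href="a.html">a</a>',
}

# Simplistic link extractor; real HTML would need a proper parser.
HREF = re.compile(r'href="([^"]+)"')


def map_links(url, html):
    """Map step: emit (target_url, 1) for every link found in one document."""
    for target in HREF.findall(html):
        yield target, 1


def reduce_counts(target, ones):
    """Reduce step: sum the 1s emitted for one target URL."""
    return target, sum(ones)


def count_inlinks(docs):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    groups = defaultdict(list)
    for url, html in docs.items():
        for target, one in map_links(url, html):
            groups[target].append(one)
    return dict(reduce_counts(t, ones) for t, ones in groups.items())


inlinks = count_inlinks(DOCS)
```

The parsing logic lives in ordinary procedural code, which is the part that is awkward to express in SQL and that a parallel database would have to run as a UDF.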
  • 11. Benchmark Conclusions
    • Hadoop has many advantages
      • Load time much faster
      • Significantly easier to install, use
      • Better parallelization of UDFs
    • Hadoop is consistently less efficient for structured, relational data
      • Reasons both fundamental and non-fundamental
      • Needs better support for compression and direct operation on compressed data
      • Needs better support for indexing
      • Needs better support for co-partitioning of datasets
  • 12. Overall Conclusion
    • MapReduce/Hadoop and parallel databases are clearly complementary
    • Use MapReduce if you want to do:
      • ETL
      • Unstructured data processing
      • Deep analysis that is hard to express in SQL
    • Use parallel databases for:
      • Traditional data warehousing / data marts
      • Structured data processing expressible in SQL
    • Cloudera agrees!
  • 13.  
  • 14.  
  • 15.  
  • 16.  
  • 17.  
  • 18.  
  • 19. We’re all in agreement, right?
  • 20. But Wait!
    • Hadoop can do everything a parallel database can do
    • Hadoop has (something resembling) a SQL interface (Hive)
    • Many of Hadoop’s performance deficiencies not fundamental
      • Result of initial design for unstructured data
      • Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads
    • Hadoop is free and open source
      • (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)
  • 21. People are using Hadoop as a DW
    • Facebook has 12PB data warehouse in Hadoop/Hive
      • Adding 10TB per day
    • Yahoo’s warehouse is the same order of magnitude
      • Recently switched to Hadoop
  • 22. Fault Tolerance and Cluster Heterogeneity Results
    • Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
  • 23. So …
    • Hadoop can do everything that parallel databases can do, but:
      • Has better fault tolerance
      • Adjusts better to runtime performance fluctuations
      • Is more open / cheaper
      • Has at least as good scalability (if not better)
    • If only we could fix those performance problems on structured data
      • HadoopDB!
  • 24. HadoopDB
    • Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems
      • Flexible query interface (accepts both SQL and MapReduce)
      • Open source (built using open source components)
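
The idea on slide 24 can be pictured with a small sketch: each node runs its own independent database over one partition of the data, a map phase pushes the same SQL to every node, and a reduce phase merges the partial results. The sketch below uses in-memory SQLite databases as stand-ins for the single-node systems; the `visits` table, its columns, and all function names are hypothetical, and none of this is HadoopDB's actual code or API.

```python
import sqlite3


def make_node(rows):
    """Create one independent single-node database holding a data partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


# Hypothetical cluster: two nodes, each with its own partition.
NODES = [
    make_node([("a.html", 1.0), ("b.html", 2.0)]),
    make_node([("a.html", 3.0), ("c.html", 4.0)]),
]

# The SQL that is pushed down and executed locally on every node.
LOCAL_SQL = "SELECT url, SUM(revenue) FROM visits GROUP BY url"


def map_phase(nodes, sql):
    """Map: run the same SQL on each node and stream its partial rows."""
    for conn in nodes:
        yield from conn.execute(sql)


def reduce_phase(partials):
    """Reduce: merge per-node partial sums into global sums per URL."""
    totals = {}
    for url, subtotal in partials:
        totals[url] = totals.get(url, 0.0) + subtotal
    return totals


totals = reduce_phase(map_phase(NODES, LOCAL_SQL))
```

In this toy version the coordinator is a plain Python loop; in the design the slide describes, that role is played by Hadoop, which is what supplies the scheduling and fault tolerance.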
  • 25. HadoopDB Architecture
  • 26. TPC-H Benchmark Results
  • 27. Fault Tolerance and Cluster Heterogeneity Results
  • 28. HadoopDB: Current Status
    • Initial open source release over a year ago
      • A bunch of new code since then, but not yet put up online
      • This new code is available by request
    • Expect the next release to be in mid-2011
    • Money available for people who want to help with development (e-mail justin.borgman@yale.edu)
  • 29. Invisible Loading
    • Data starts in HDFS
    • Data is immediately available for processing (immediate gratification paradigm)
    • Each MapReduce job causes data movement from HDFS to database systems
    • Data is incrementally loaded, sorted, and indexed
    • Query performance improves “invisibly”
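
The steps on slide 29 can be modeled with a toy sketch, assuming the raw data is split into chunks and that each job migrates exactly one more chunk into an indexed database as a side effect. The class name, the one-chunk-per-job policy, and the use of SQLite are all assumptions for illustration; the slides do not describe the real mechanism at this level of detail.

```python
import sqlite3


class InvisibleLoader:
    """Toy model: each job moves one more raw chunk into an indexed database."""

    def __init__(self, raw_chunks):
        self.raw_chunks = list(raw_chunks)  # stand-in for files sitting in HDFS
        self.loaded = 0                     # number of chunks already in the database
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE data (value INTEGER)")
        self.db.execute("CREATE INDEX idx_value ON data (value)")

    def run_job(self, threshold):
        """Count values above threshold; load one more chunk as a side effect."""
        # Part of the answer comes from the indexed database ...
        (count,) = self.db.execute(
            "SELECT COUNT(*) FROM data WHERE value > ?", (threshold,)
        ).fetchone()
        # ... and the rest from scanning the chunks that are still raw.
        for chunk in self.raw_chunks[self.loaded:]:
            count += sum(1 for v in chunk if v > threshold)
        # Side effect: migrate the next raw chunk into the database.
        if self.loaded < len(self.raw_chunks):
            chunk = self.raw_chunks[self.loaded]
            self.db.executemany(
                "INSERT INTO data VALUES (?)", [(v,) for v in chunk]
            )
            self.loaded += 1
        return count


loader = InvisibleLoader([[1, 5, 9], [2, 6], [7, 8]])
results = [loader.run_job(4) for _ in range(4)]
```

Every job returns the same answer, but each one scans fewer raw chunks than the one before, which is the sense in which performance improves without an explicit load step.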
  • 30. Conclusions
    • MapReduce and parallel databases are definitely complementary
    • MapReduce and parallel databases are definitely competitive
    • HadoopDB is awesome