Daniel Abadi: VLDB 2009 Panel


Panel presentation at VLDB 2009 by Daniel Abadi on "How best to build web-scale data managers?".


  1. A Proposed Answer to Phil’s Question: What Does This Say About the Database Field? (Daniel Abadi)
  2. We’re Addicts
     • Addict (verb): “to devote or surrender (oneself) to something habitually or obsessively”
     • Mounting evidence that relational database technology is unsuitable for Web-scale data management
     • Yet we cling to our RDBMS technology, refusing to acknowledge this evidence
     • Addiction is a very serious matter
       • Puts one at a disadvantage: we’re being left behind
         • Highest-impact research on Web-scale data management is being published outside of SIGMOD/VLDB
  3. What should we do?
     • There are lots of resources for addicts
     • Many programs work in steps to help addicts gradually kick the addiction
     • Stepwise programs are generally designed for individuals, but are straightforward to extend to entire research communities
  4. Step 1: Admit You Have a Problem
     • Case study: Facebook
       • 2.5 petabyte enterprise data warehouse
       • Adding 15 TB of new data a day
       • RDBMSs should theoretically scale to this amount of data (esp. Gamma-style parallel DBMSs)
       • They use Hadoop instead
         • But their analysts don’t speak MapReduce!
           • So they allocate a team of superstar developers to build an SQL layer on top of Hadoop: Hive
     • Entire companies are being started that specialize in using Hadoop to create data warehouses
     • But data warehousing has always been the domain of relational database systems!
  5. Step 2: Believe in a Higher Power Greater Than Yourself
     • The higher power is …
       • Google / systems community
     • MapReduce published in OSDI
     • Dynamo published in SOSP
     • BigTable published in OSDI
     • Dryad published in EuroSys
  6. Step 3: Make a Searching and Fearless Inventory of Yourself
     • People who chose not to use database systems aren’t dumb
     • There must be a reason
       • We’re too expensive
         • Free / open source databases like MySQL/PostgreSQL/Ingres don’t scale out of the box
         • Proprietary solutions price by the TB
       • We’re too hard to use
       • We don’t scale
         • Seriously, we don’t scale
           • Yes, I know we should scale in theory. But in practice we don’t scale. Even the expensive solutions.
  7. Step 4: Admit the Exact Nature of Our Wrongs
     • Admitting all of our wrongs is too overwhelming
       • For now, let’s focus on our wrongs for analytical workloads
     • Parallel databases should be able to scale indefinitely
     • Current implementations have limitations
       • Sometimes caused by first-order effects like hard limits required by various system components
       • More often caused by second-order effects
         • Systems are designed assuming failures are a rare event (not true at scale!)
         • Systems are designed assuming each node has predictable performance (not true at scale!)
  8. Step 5: Remove Our Shortcomings
     • Need more focus on fault-tolerant systems research
     • Need more focus on runtime scheduling
     • Need better parallelization of UDFs
     • Need to convince one of the parallel DBMS upstarts to release their code as open source
  9. Bottom Line
     • Addictions are hard to kick
     • Need to work hard to remove our shortcomings
     • Need to reclaim our leadership in the data management arena
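
Step 4's remark that failures are not a rare event at scale can be made concrete with a small calculation. The sketch below is only an illustration: it assumes every node fails independently with the same probability during a single query, and the function name `prob_any_failure` and the figure of 0.1% per node are hypothetical choices, not numbers from the talk.

```python
def prob_any_failure(num_nodes: int, p_node: float) -> float:
    """Probability that at least one of num_nodes fails during a query.

    Assumes (for illustration only) that node failures are independent
    and that every node has the same failure probability p_node.
    """
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    # P(at least one failure) = 1 - P(no node fails)
    return 1.0 - (1.0 - p_node) ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability for one long-running query.
    p = 0.001
    for n in (10, 100, 1000, 10000):
        print(f"{n:>6} nodes: P(at least one failure) = {prob_any_failure(n, p):.4f}")
```

Under these assumptions a per-node failure chance that looks negligible on 10 nodes grows to roughly 63% on 1000 nodes, which is why a system that restarts the whole query on any failure struggles as the cluster grows.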