Daniel Abadi: VLDB 2009 Panel


Panel presentation at VLDB 2009 by Daniel Abadi on "How best to build web-scale data managers?".

Published in: Technology

  1. A Proposed Answer to Phil’s Question: What Does This Say About the Database Field?
     Daniel Abadi
  2. We’re Addicts
     • Addict (verb): “to devote or surrender (oneself) to something habitually or obsessively”
     • Mounting evidence that relational database technology is unsuitable for Web-scale data management
     • Yet we cling to our RDBMS technology, refusing to acknowledge this evidence
     • Addiction is a very serious matter
       • Puts one at a disadvantage: we’re being left behind
         • Highest-impact research on Web-scale data management is being published outside of SIGMOD/VLDB
  3. What should we do?
     • There are lots of resources for addicts
     • Many programs work in steps to help addicts gradually kick the addiction
     • Stepwise programs are generally designed for individuals, but are straightforward to extend to entire research communities
  4. Step 1: Admit You Have a Problem
     • Case study: Facebook
       • 2.5 petabyte enterprise data warehouse
       • Adding 15 TB of new data a day
       • RDBMSs should theoretically scale to this amount of data (esp. Gamma-style parallel DBMSs)
       • They use Hadoop instead
         • But their analysts don’t speak MapReduce!
           • So they allocate a team of superstar developers to build an SQL layer on top of Hadoop: Hive
     • Entire companies are being started that specialize in using Hadoop to create data warehouses
     • But data warehousing has always been the domain of relational database systems!
  5. Step 2: Believe in a Higher Power Greater Than Yourself
     • The higher power is …
       • Google / systems community
     • MapReduce published in OSDI
     • Dynamo published in SOSP
     • BigTable published in OSDI
     • Dryad published in EuroSys
  6. Step 3: Make a Searching and Fearless Inventory of Yourself
     • People who chose not to use database systems aren’t dumb
     • There must be a reason
       • We’re too expensive
         • Free / open source databases like MySQL/PostgreSQL/Ingres don’t scale out of the box
         • Proprietary solutions price by the TB
       • We’re too hard to use
       • We don’t scale
         • Seriously, we don’t scale
           • Yes, I know we should scale in theory. But in practice we don’t scale. Even the expensive solutions.
  7. Step 4: Admit the Exact Nature of Our Wrongs
     • Admitting all of our wrongs is too overwhelming
       • For now, let’s focus on our wrongs for analytical workloads
     • Parallel databases should be able to scale indefinitely
     • Current implementations have limitations
       • Sometimes caused by first-order effects like hard limits imposed by various system components
       • More often caused by second-order effects
         • Systems are designed assuming failures are a rare event (not true at scale!)
         • Systems are designed assuming each node has predictable performance (not true at scale!)
  8. Step 5: Remove Our Shortcomings
     • Need more focus on fault-tolerant systems research
     • Need more focus on runtime scheduling
     • Need better parallelization of UDFs
     • Need to convince one of the parallel DBMS upstarts to release their code as open source
  9. Bottom Line
     • Addictions are hard to kick
     • Need to work hard to remove our shortcomings
     • Need to reclaim our leadership in the data management arena