From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments


Published on

VLDB 2013 Early Career Research Contribution Award Presentation

Abstract: Four years ago at VLDB 2009, a paper was published about a research prototype, called HadoopDB, that attempted to transform Hadoop --- a batch-oriented scalable system designed for processing unstructured data --- into a full-fledged parallel database system that can achieve real-time (interactive) query responses across both structured and unstructured data. In 2010 it was commercialized by Hadapt, a start-up that was formed to accelerate the engineering of the HadoopDB ideas, and to harden the codebase for deployment in real-world, mission-critical applications. In this talk I will give an overview of HadoopDB, and how it combines ideas from the Hadoop and database system communities. I will then describe how the project transitioned from a research prototype written by PhD students at Yale University into enterprise-ready software written by a team of experienced engineers. We will examine particular technical features that are required in enterprise Hadoop deployments, and technical challenges that we ran into while making HadoopDB robust enough to be deployed in the real world. The talk will conclude with an analysis of how starting a company impacts the tenure process, and some thoughts for graduate students and junior faculty considering a similar path.

Published in: Technology
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments

  1. 1. From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments Daniel Abadi Yale University August 28th, 2013 Twitter: @daniel_abadi
  2. 2. Overview of Talk Motivation for HadoopDB Overview of HadoopDB Overview of the commercialization process Technical features missing from HadoopDB that Hadapt needed to implement What does this mean for tenure?
  3. 3. Situation in 2008 Hadoop starting to take off as a “Big Data” processing platform Parallel database startups such as Netezza, Vertica, and Greenplum gaining traction for “Big Data” analysis 2 Schools of Thought – School 1: They are on a collision course – School 2: They are complementary technologies
  4. 4. From 10,000 feet Hadoop and Parallel Database Systems are Quite Similar Both are suitable for large-scale data processing – I.e. analytical processing workloads – Bulk loads – Not optimized for transactional workloads – Queries over large amounts of data – Both can handle both relational and nonrelational queries (DBMS via UDFs)
  5. 5. SIGMOD 2009 Paper Benchmarked Hadoop vs. 2 parallel database systems – Mostly focused on performance differences – Measured differences in load and query time for some common data processing tasks – Used Web analytics benchmark whose goal was to be representative of tasks that: Both should excel at Hadoop should excel at Databases should excel at
  6. 6. Hardware Setup 100 node cluster Each node – 2.4 GHz Code 2 Duo Processors – 4 GB RAM – 2 250 GB SATA HDs (74 MB/Sec sequential I/O) Dual GigE switches, each with 50 nodes – 128 Gbit/sec fabric Connected by a 64 Gbit/sec ring
  7. 7. Join Task 0 200 400 600 800 1000 1200 1400 1600 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) Vertica DBMS-X Hadoop
  8. 8. UDF Task 0 200 400 600 800 1000 1200 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) DBMS Hadoop DBMS clearly doesn’t scaleCalculate PageRank over a set of HTML documents Performed via a UDF
  9. 9. Scalability Except for UDFs all systems scale near linearly BUT: only ran on 100 nodes As nodes approach 1000, other effects come into play – Faults go from being rare, to not so rare – It is nearly impossible to maintain homogeneity at scale
  10. 10. Fault Tolerance and Cluster Heterogeneity Results 0 20 40 60 80 100 120 140 160 180 200 Fault tolerance Slowdown tolerance PercentageSlowdown DBMS Hadoop Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
  11. 11. Benchmark Conclusions Hadoop had scalability advantages – Checkpointing allows for better fault tolerance – Runtime scheduling allows for better tolerance of unexpectedly slow nodes – Better parallelization of UDFs Hadoop was consistently less efficient for structured, relational data – Reasons mostly non-fundamental – Needed better support for compression and direct operation on compressed data – Needed better support for indexing – Needed better support for co-partitioning of datasets
  12. 12. Best of Both Worlds Possible? Connector
  13. 13. Problems With the Connector Approach Network delays and bandwidth limitations Data silos Multiple vendors Fundamentally wasteful – Very similar architectures Both partition data across a cluster Both parallelize processing across the cluster Both optimize for local data processing (to minimize network costs)
  14. 14. Unified System Two options: – Bring Hadoop technology to a parallel database system Problem: Hadoop is more than just technology – Bring parallel database system technology to Hadoop Far more likely to have impact
  15. 15. Adding DBMS Technology to Hadoop Option 1: Keep Hadoop’s storage and build parallel executor on top of it Cloudera Impala (which is sort of a combination of Hadoop++ and NoDB research projects) Need better Storage Formats (Trevni and Parquet are promising) Updates and Deletes are hard (Impala doesn’t support them) Option 2: Use relational storage on each node Accelerates “time to complete system” We chose this option for HadoopDB
  16. 16. HadoopDB Architecture
  17. 17. SMS Planner
  18. 18. TPC-H Benchmark Results
  19. 19. UDF Task 0 100 200 300 400 500 600 700 800 10 nodes 25 nodes 50 nodes Time(seconds) DBMS Hadoop HadoopDB
  20. 20. Fault Tolerance and Cluster Heterogeneity Results 0 20 40 60 80 100 120 140 160 180 200 Fault tolerance Slowdown tolerance PercentageSlowdown DBMS Hadoop HadoopDB
  21. 21. HadoopDB Commercialization Wanted to build a real system Released initial prototype open source Blog post about HadoopDB got slashdotted, led to VC interest – Initially reluctant to take VC money Posted a job for an engineer to help build out open source codebase – Low quality of applicants – Not enough government funding for more than 1 engineer
  22. 22. HadoopDB Commercialization VC money only route to building a complete system – Launched with $1.5 million in seed money in 2010 – Raised an additional $8 million in 2011 – Raised an additional $6.75 million in 2012
  23. 23. Commercializing HadoopDB: Where does development time go? Work we expected to transition from research prototype to commercial product – SQL coverage – Failover for high availability – Authorization / authentication – Error codes / messages for every situation – Installer – Documentation But what about unexpected work?
  24. 24. Infrastructure Tools Distributed systems are unwieldy – For a cluster of size n, many things need to be done n times Automated tools are critical Just to try some new code, the following needs to happen: – Build product – Provision a cluster – Deploy build to cluster – Install dependencies (Hadoop distro, libraries, etc) – Install Hadapt with correct configuration parameters for that cluster – Generate data or copy data files to cluster for load
  25. 25. Upgrader Start-ups need to move fast Hadapt delivers a new release every couple of months Upgrade process must be easy Downgrade (!) process must be easy Changes in storage layout or APIs add complexity to the process
  26. 26. UDF Support HadoopDB supported both MapReduce and SQL as interfaces MapReduce was not a sufficient replacement for database UDFs Hadapt provides an “HDK” that enables analysts to create functions that are invokable from SQL – Integrates with 3rd party tools
  27. 27. Search Hadoop is increasingly used as a data landfill – Granular data – Messy data – Unprocessed data Database for Hadoop cannot assume all data fits in rows and columns Search support was the first thing we built after our A round of financing
  28. 28. Is doing a start-up pre-tenure a good idea? Spinning off a company takes a ton of time – At first, you are the ONLY person who can give a complete description of the technical vision, so You’re talking to all the VCs to fundraise You’re talking to all the prospective customers You’re talking to all the prospective employees – Lots of travel – Eventually, others can help with the above, but a good CEO will not let you escape Ups and downs can be mentally draining
  29. 29. If you do a start-up you will: Publish less Advise fewer students Pursue fewer grants Avoid university committees as much as possible Skip faculty meetings (usually because of travel) Attend fewer academic conferences
  30. 30. At the end of the day Unless there are changes (see SIGMOD panel from June): – Publishing a lot is the best way to get tenure – Spinning off a company necessarily detracts from university measurable objectives Doing a start-up is putting all your eggs in one basket – If successful, you have a lot of impact you can point to – If not successful, you have nothing – A lot of market forces that you have no control over determine success