Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Shared slides-edbt-keynote-03-19-13


Published on

Daniel Abadi Keynote at EDBT 2013
This talk discusses: (1) Motivation for HadoopDB (2) Overview of HadoopDB (3) Lessons learned from commercializing HadoopDB into Hadapt (4) Ideas for overcoming the loading challenge (Invisible Loading)

Published in: Technology
  • Be the first to comment

Shared slides-edbt-keynote-03-19-13

  1. 1. Data Loading Challenges whenTransforming Hadoop into aParallel Database SystemDaniel AbadiYale UniversityMarch 19th, 2013Twitter: @daniel_abadi
  2. 2. Overview of TalkMotivation for HadoopDBOverview of HadoopDBLessons learned from commercializingHadoopDB into HadaptIdeas for overcoming the loading challenge(invisible loading)EDBT 2013 paper on invisible loading that ispublished alongside this talk can be found at:
  3. 3. Big DataData is growing (currently) faster than Moore’s Law– Scale becoming a bigger concern– Cost becoming a bigger concern– Unstructured data “variety” becoming a bigger concernIncreasing desire for incremental scale out oncommodity hardware– Fault tolerance a bigger concern– Dealing with unpredictable performance a bigger concernParallel database systems are the traditional solutionA lot of excitement around Hadoop as a tool to help withthese problems (initiated by Yahoo and Facebook)
  4. 4. Hadoop/MapReduceData is partitioned across N machines– Typically stored in a distributed file system (GFS/HDFS)On each machine n, apply a function, Map, to each data item d– Map(d)  {(key1,value1)} “map job”– Sort output of all map jobs on n by key– Send (key1,value1) pairs with same key value to same machine(using e.g., hashing)On each machine m, apply reduce function to (key1, {value1})pairs mapped to it– Reduce(key1,{value1})  (key2,value2) “reduce job”Optionally, collect output and present to user
  5. 5. Map WorkersExampleCount occurrences of the word “Italy” and “DBMS” in all documentsmap(d):words = split(d,’ ‘)foreach w in words:if w == ‘Italy’emit (‘Italy’, 1)if w == ‘DBMS’emit (‘DBMS’, 1)reduce(key, valueSet):count = 0for each v in valueSet:count += vemit (key,count)Data PartitionsReduce WorkersItaly reducer DBMS reducer(DBMS,1)(Italy, 1)
  6. 6. Relational Operators In MRStraightforward to implement relationaloperators in MapReduce– Select: simple filter in Map Phase– Project: project function in Map Phase– Join: Map produces tuples with join key askey; Reduce performs the joinQuery plans can be implemented as asequence of MapReduce jobs (e.g Hive)
  7. 7. Similarities Between Hadoopand Parallel Database SystemsBoth are suitable for large-scale dataprocessing– I.e. analytical processing workloads– Bulk loads– Not optimized for transactional workloads– Queries over large amounts of data– Both can handle both relational and nonrelationalqueries (DBMS via UDFs)
  8. 8. DifferencesHadoop can operate on in-situ data, withoutrequiring transformation or loadingSchemas:– Hadoop doesn’t require them, DBMSs do– Easy to write simple MR programs over raw dataIndexes– Hadoop’s built in support is primitiveDeclarative vs imperative programmingHadoop uses a run-time scheduler for fine-grainedload balancingHadoop checkpoints intermediate results for faulttolerance
  9. 9. SIGMOD 2009 PaperBenchmarked Hadoop vs. 2 paralleldatabase systems– Mostly focused on performance differences– Measured differences in load and query timefor some common data processing tasks– Used Web analytics benchmark whose goalwas to be representative of tasks that:Both should excel atHadoop should excel atDatabases should excel at
  10. 10. Hardware Setup100 node clusterEach node– 2.4 GHz Code 2 Duo Processors– 4 GB RAM– 2 250 GB SATA HDs (74 MB/Sec sequential I/O)Dual GigE switches, each with 50 nodes– 128 Gbit/sec fabricConnected by a 64 Gbit/sec ring
  11. 11. Loading0500010000150002000025000300003500040000450005000010 nodes 25 nodes 50 nodes 100nodesTime(seconds)VerticaDBMS-XHadoop
  12. 12. Join Task0200400600800100012001400160010 nodes 25 nodes 50 nodes 100 nodesTime(seconds)VerticaDBMS-XHadoop
  13. 13. UDF Task02004006008001000120010 nodes 25 nodes 50 nodes 100nodesTime(seconds)DBMSHadoopDBMS clearly doesn’t scaleCalculatePageRankover a set ofHTMLdocumentsPerformedvia a UDF
  14. 14. ScalabilityExcept for DBMS-X load time and UDFsall systems scale near linearlyBUT: only ran on 100 nodesAs nodes approach 1000, other effectscome into play– Faults go from being rare, to not so rare– It is nearly impossible to maintainhomogeneity at scale
  15. 15. Fault Tolerance and ClusterHeterogeneity Results020406080100120140160180200Fault tolerance Slowdown tolerancePercentageSlowdownDBMSHadoopDatabase systems restart entirequery upon a single node failure,and do not adapt if a node isrunning slowly
  16. 16. Benchmark ConclusionsHadoop is consistently more scalable– Checkpointing allows for better fault tolerance– Runtime scheduling allows for better tolerance ofunexpectedly slow nodes– Better parallelization of UDFsHadoop is consistently less efficient forstructured, relational data– Reasons mostly non-fundamental– Needs better support for compression and directoperation on compressed data– Needs better support for indexing– Needs better support for co-partitioning of datasets
  17. 17. Best of Both Worlds Possible?Connector
  18. 18. Problems With the ConnectorApproachNetwork delays and bandwidth limitationsData silosMultiple vendorsFundamentally wasteful– Very similar architecturesBoth partition data across a clusterBoth parallelize processing across the clusterBoth optimize for local data processing (tominimize network costs)
  19. 19. Unified SystemTwo options:– Bring Hadoop technology to a paralleldatabase systemProblem: Hadoop is more than just technology– Bring parallel database system technology toHadoopFar more likely to have impact
  20. 20. Adding DBMS Technology toHadoopOption 1:– Keep Hadoop’s storage and build parallel executor ontop of itCloudera Impala (which is sort of a combination of Hadoop++and NoDB research projects)Need better Storage Formats (Trevni and Parquet arepromising)Updates and Deletes are hard (Impala doesn’t support them)– Use relational storage on each nodeBetter support for traditional database management(including updates and deletes)Accelerates “time to complete system”We chose this option for HadoopDB
  21. 21. HadoopDB Architecture
  22. 22. SMS Planner
  23. 23. HadoopDB ExperimentsVLDB 2009 paper ran same Web analyticsbenchmark as SIGMOD 2009 paperUsed PostgreSQL as the DBMS storagelayer
  24. 24. Join Task0200400600800100012001400160010 nodes 25 nodes 50 nodesVerticaDBMS-XHadoopHadoopDB•HadoopDB much faster than Hadoop•In 2009, didn’t quite match the databasesystems in performance (but see next slide)•Hadoop start-up costs•PostgreSQL join performance•Fault tolerance/run-time scheduling
  25. 25. TPC-H Benchmark Results•More advanced storage and betterjoin algorithms can lead to betterperformance (even beating databasesystems for some queries)
  26. 26. UDF Task010020030040050060070080010 nodes 25 nodes 50 nodesTime(seconds)DBMSHadoopHadoopDB
  27. 27. Fault Tolerance and ClusterHeterogeneity Results020406080100120140160180200Fault tolerance Slowdown tolerancePercentageSlowdownDBMSHadoopHadoopDB
  28. 28. HadaptGoal: Turn HadoopDB into enterprise-grade software so it can be used for realanalytical applications– Launched with $1.5 million in seed money in2010– Raised an additional $8 million in 2011– Raised an additional $6.75 million in 2012Speaker has a significant financial interest in Hadapt
  29. 29. Hadapt: Lessons LearnedLocation matters– Impossible to scale quickly in New Haven,Connecticut– Company was forced to move to Cambridge(near other successful DB companies)Personally painful for me
  30. 30. Hadapt: Lessons LearnedBusiness side matters– Need intelligence and competence across thewhole company (not just in engineering)Decisions have to made all the time that haveenormous effects on the company– Technical founders often find themselvesdoing non-technical tasks more than half thetime
  31. 31. Hadapt: Lessons LearnedGood enterprise software: more time onminutia than innovative new features– SQL coverage– Failover for high availability– Authorization / authentication– Disaster Recovery– Installer, upgrader, downgrader (!)– Documentation
  32. 32. Hadapt: Lessons LearnedCustomer input to the design processcritical– Hadapt had no plans to integrate text searchinto the product until request from a customer– Today ~75% of Hadapt customers use thetext search feature
  33. 33. Hadapt: Lessons LearnedInstallation and load processes are critical– They are the customer’s first two interactionswith your product– Lead to the invisible loading ideaDiscussed in the remaining slides of thispresentation
  34. 34. If only we could make loadinginvisible …
  35. 35. Review: Analytic WorkflowsAssume data already in file system (HDFS)Two workflows– Workflow #1: In-situ queryinge.g. Map Reduce, HIVE / Pig, Impala, etcLatency optimization (time to first query answer)FileSystemQueryEngineData RetrievalQuery ResultSubmit Query
  36. 36. Review: Analytic WorkflowsAssume data already in file system (HDFS)Two workflows– Workflow #2: Querying after bulk loadE.g. RDBMS, HadaptStorage and retrieval optimization for throughput(repeated queries)Other benefits: data validation and quality controlFileSystemQueryEngine1. Bulk LoadQuery Result2. Submit QueryStorageEngine
  37. 37. Immediate Versus DeferredGratification
  38. 38. Use Case ExampleContext-Dependent Schema AwarenessDifferent analysts know the schema ofwhat they are looking for and don’t careabout other log messagesMessage fieldvaries dependingon application
  39. 39. Workload-Based Data LoadingInvisible loading makes workload-directeddecisions on:– When to bulk load which data– By piggy-backing on queriesFileSystemQueryEnginePartial Load 1Query 1 ResultSubmit Query 1StorageEngine
  40. 40. Workload-Based Data LoadingInvisible loading makes workload-directeddecisions on:– When to bulk load which data– By piggy-backing on queriesFileSystemQueryEnginePartial Load 2Query 2 ResultSubmit Query 2StorageEngine
  41. 41. Workload-Based Data LoadingInvisible loading makes workload-directeddecisions on:– When to bulk load which data– By piggy-backing on queriesBenefits– Eliminates manual workflows ondata modeling and bulk load scripting– Optimizes query performance over time– Boosts productivity and performance
  42. 42. Interface and User ExperienceDesignStructured parser interface for mapper– getAttribute(int index)– Similar to PigLatinMinimize load’s impact on query performance– Load a subset of rows (based on HDFS file splits)– Load size tunable based on query performanceimpactOptional: load a subset of columns– Useful for columnar file formats
  43. 43. Physical Tuning: IncrementalReorganization
  44. 44. Performance Study
  45. 45. ConclusionInvisible loading: Immediate gratification+ long term performance benefitsApplicability beyond HDFSFuture work– Automatic data schema / format recognition– Schema evolution support– Decouple load from queryWorkload-sensitive loading strategies