Hw09 Hadoop Db
 

Hw09 Hadoop Db Presentation Transcript

  • HadoopDB: An open source hybrid of MapReduce and DBMS technologies. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz. Yale University (http://hadoopdb.sourceforge.net). October 2, 2009. Reference: HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
  • Introduction Candidates Differences HadoopDB Evaluation Conclusion. Motivation: Major Trends. (1) Data explosion: automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse; Yahoo! Everest has 10 PB. (2) Analysis over raw data. Bottom line: analyzing massive structured data on 1000s of shared-nothing nodes. Yale University, HadoopWorld 2009 HadoopDB 2/24
  • Motivation: Sales Record Example. Consider a large data set of sales log records, each consisting of sales information including (1) a date of sale and (2) a price. We would like to take the log records and generate a report showing the total sales for each year. Question: How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines?
  • MapReduce (Hadoop). MapReduce is a programming model which specifies: a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Hadoop is a MapReduce implementation for processing large data sets over 1000s of nodes. Maps (and Reduces) run independently of each other over blocks of data distributed across a cluster.
  • Sales Record Example using Hadoop. Query: Calculate total sales for each year. We write a MapReduce program. Map: Takes log records and extracts a key-value pair of year and sale price in dollars, then outputs the key-value pairs. Shuffle: Hadoop automatically partitions the key-value pairs by year to the nodes executing the Reduce function. Reduce: Simply sums up all the dollar values for a year.
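The three phases on this slide can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the record format ("YYYY-MM-DD,price") and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Assumed log format for illustration: "YYYY-MM-DD,price"
def map_record(line):
    """Map: extract a (year, price) key-value pair from one log record."""
    date, price = line.split(",")
    return date[:4], float(price)

def shuffle(pairs):
    """Shuffle: group all values that share the same key (year)."""
    groups = defaultdict(list)
    for year, price in pairs:
        groups[year].append(price)
    return groups

def reduce_year(year, prices):
    """Reduce: sum all dollar values for one year."""
    return year, sum(prices)

def total_sales_by_year(lines):
    pairs = [map_record(line) for line in lines]
    return dict(reduce_year(y, p) for y, p in shuffle(pairs).items())

records = ["2008-03-01,10.0", "2008-07-15,5.5", "2009-01-02,20.0"]
totals = total_sales_by_year(records)
print(totals)
```

In a real Hadoop job the shuffle is performed by the framework across nodes; here it is a dictionary grouping so the data flow is visible end to end.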
  • Relational Databases. If the data is stored in a relational database system, the sales record example can be expressed in SQL as: SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year; The execution plan is: projection(year,price) → hash aggregation(year,price). Question: How do we process this efficiently if the data is very large?
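To try the query from this slide on a single node, a minimal sketch using Python's built-in sqlite3 module is shown here. SQLite has no YEAR() function, so strftime('%Y', ...) is substituted; the table contents are made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2008-03-01", 10.0), ("2008-07-15", 5.5), ("2009-01-02", 20.0)],
)

# strftime('%Y', date) stands in for YEAR(date), which SQLite lacks.
rows = conn.execute(
    "SELECT strftime('%Y', date) AS year, SUM(price) "
    "FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(rows)
```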
  • Parallel Databases. Parallel databases are like single-node databases except: data is partitioned across nodes, and individual relational operations can be executed in parallel. SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year; Execution plan for the query: projection(year,price) → partial hash aggregation(year,price) → partitioning(year) → final aggregation(year,price). Note that the execution plan resembles the map and reduce phases of Hadoop.
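The parallel execution plan above can be traced step by step with a small simulation. The sketch below models two nodes as Python lists; the function names and the modulo-based routing of years to nodes are assumptions for illustration, not how any particular parallel database does it.

```python
from collections import defaultdict

def partial_aggregate(rows):
    """Projection plus partial hash aggregation on one node's (date, price) rows."""
    partial = defaultdict(float)
    for date, price in rows:
        partial[date[:4]] += price
    return dict(partial)

def partition(partials, num_nodes):
    """Partitioning(year): route each (year, subtotal) to the node owning that year."""
    buckets = [defaultdict(list) for _ in range(num_nodes)]
    for partial in partials:
        for year, subtotal in partial.items():
            buckets[int(year) % num_nodes][year].append(subtotal)
    return buckets

def final_aggregate(bucket):
    """Final aggregation: combine the subtotals received for each year."""
    return {year: sum(subtotals) for year, subtotals in bucket.items()}

node_data = [
    [("2008-03-01", 10.0), ("2009-01-02", 20.0)],
    [("2008-07-15", 5.0), ("2009-05-05", 1.0)],
]
partials = [partial_aggregate(rows) for rows in node_data]
result = {}
for bucket in partition(partials, num_nodes=2):
    result.update(final_aggregate(bucket))
print(result)
```

The partial aggregation plays the role of a map with a combiner, the partitioning step corresponds to the shuffle, and the final aggregation corresponds to the reduce.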
  • Differences between Parallel Databases and Hadoop
  • To summarize
  • At Yale, we looked beyond the differences ...
  • ... and we discovered ... that they complete each other. Basic design idea: Multiple, independent, single-node databases coordinated by Hadoop. (Image: http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg)
  • Hadoop Basics
  • Architecture
  • SQL-MR-SQL. SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
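One plausible reading of the SQL-MR-SQL title is that the SQL query is turned into a MapReduce job whose map tasks run SQL against the single-node databases. The sketch below illustrates that idea under loud assumptions: each "node" is an in-memory SQLite database, strftime('%Y', ...) stands in for YEAR(), and the function names are invented for this example. It is not the SMS planner.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every single-node database (SQLite dialect, an assumption).
PUSHED_DOWN_SQL = (
    "SELECT strftime('%Y', saleDate) AS yr, SUM(revenue) FROM sales GROUP BY yr"
)

def make_node(rows):
    """Create one stand-in single-node database holding a partition of sales."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

def map_phase(node):
    """Map task: run the pushed-down SQL locally, emit (year, subtotal) pairs."""
    return node.execute(PUSHED_DOWN_SQL).fetchall()

def reduce_phase(pairs):
    """Reduce task: merge per-node subtotals into one total per year."""
    totals = defaultdict(float)
    for yr, subtotal in pairs:
        totals[yr] += subtotal
    return dict(totals)

nodes = [
    make_node([("2008-03-01", 10.0), ("2009-01-02", 20.0)]),
    make_node([("2008-07-15", 5.0), ("2009-05-05", 1.0)]),
]
pairs = [pair for node in nodes for pair in map_phase(node)]
totals = reduce_phase(pairs)
print(totals)
```

The reduce step is needed here because rows for the same year live on both nodes; if the data were already partitioned by year, each node's local answer would be final.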
  • Evaluating HadoopDB. Compare HadoopDB to (1) Hadoop and (2) parallel databases (Vertica, DBMS-X). Features: (1) Performance: We expected HadoopDB to approach the performance of parallel databases. (2) Scalability: We expected HadoopDB to scale as well as Hadoop. We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes.
  • Load. [Figure: load time in seconds on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop. Panels: Random Unstructured Data (535MB/node); Structured data (20GB/node).]
  • Performance: Grep Task. (1) Full table scan, highly selective filter. (2) Random data, no room for indexing. (3) Hadoop overhead outweighs query processing time in single-node databases. Query: SELECT * FROM grep WHERE field LIKE '%xyz%'; [Figure: query time in seconds on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop.]
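The grep query is a pure filter, so it needs only a scan of every partition and no reduce step. A minimal sketch follows, with made-up records and a hypothetical grep_scan helper; note that LIKE may be case-insensitive in some databases, whereas the Python substring test used here is case-sensitive.

```python
def grep_scan(partition, pattern="xyz"):
    """Scan one partition and keep records whose field contains the pattern."""
    return [record for record in partition if pattern in record["field"]]

# Two partitions of sample records (illustrative data only).
partitions = [
    [{"key": "a1", "field": "..xyz.."}, {"key": "a2", "field": "abc"}],
    [{"key": "b1", "field": "xy z"}, {"key": "b2", "field": "wxyz"}],
]

# Each partition is scanned independently; results are simply concatenated.
matches = [record for part in partitions for record in grep_scan(part)]
matched_keys = [record["key"] for record in matches]
print(matched_keys)
```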
  • Performance: Join Task. (1) No full table scan due to clustered indexing. (2) Hash partitioning and efficient join algorithm. Query: SELECT sourceIP, AVG(pageRank), SUM(adRevenue) FROM rankings, uservisits WHERE pageURL=destURL AND visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY sourceIP ORDER BY SUM(adRevenue) DESC LIMIT 1; [Figure: query time in seconds on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop.]
  • Performance: Bottom Line. (1) Unstructured data: HadoopDB’s performance matches Hadoop. (2) Structured data: HadoopDB’s performance is close to parallel databases.
  • Scalability: Setup. (1) Simple aggregation task: full table scan. (2) Data replicated across 10 nodes. (3) Fault-tolerance: Kill a node halfway. (4) Fluctuation-tolerance: Slow down a node for the entire experiment.
  • Scalability: Results. (1) HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks or blocks. (2) Parallel databases restart the entire query on node failure or wait for the slowest node. [Figure: percentage slowdown (0% to 200%) for Vertica, HadoopDB, and Hadoop under the fault-tolerance and fluctuation-tolerance tests.]
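The benefit of block-level runtime scheduling can be illustrated with a toy simulation. This is a deliberately simplified sketch, not Hadoop's scheduler: blocks are handed out round-robin, the named node fails while running its first block, and only that block is put back on the queue. All names are hypothetical.

```python
from collections import deque

def schedule_blocks(num_blocks, nodes, failed_node=None):
    """Return ({block: node}, executions) for a round-robin block scheduler.

    If `failed_node` is given, it dies while running its first block;
    that block alone is rescheduled on a surviving node.
    """
    pending = deque(range(num_blocks))
    alive = list(nodes)
    done = {}
    executions = 0
    turn = 0
    while pending:
        block = pending.popleft()
        node = alive[turn % len(alive)]
        turn += 1
        executions += 1
        if node == failed_node:
            alive.remove(node)     # the node dies during this block
            pending.append(block)  # only this block is rerun elsewhere
            continue
        done[block] = node
    return done, executions

healthy_done, healthy_runs = schedule_blocks(6, ["a", "b", "c"])
failed_done, failed_runs = schedule_blocks(6, ["a", "b", "c"], failed_node="b")
print(healthy_runs, failed_runs)
```

With six blocks, the failure costs one extra block execution, while a system that restarts the whole query would have to redo every block.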
  • To summarize, HadoopDB ... (1) is a hybrid of DBMS and MapReduce; (2) scales better than commercial parallel databases; (3) is as fault-tolerant as Hadoop; (4) approaches the performance of parallel databases; (5) is free and open-source (http://hadoopdb.sourceforge.net).
  • Future work. Engineering work: (1) Full SQL support in SMS. (2) Data compression. (3) Integration with other open source databases. (4) Full automation of the loading and replication process. (5) Out-of-the-box deployment. (6) We’re hiring! Research work: incremental loading and on-the-fly repartitioning; dynamically adjusting fault-tolerance levels based on failure rate.
  • Thank You ... We welcome all thoughts on how to raise HadoopDB ... http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg