HadoopDB
A brief introduction to a new approach to handling large amounts of data

HadoopDB
Miguel Angel Pastor Olivar (miguelinlas3 at gmail dot com)
http://miguelinlas3.blogspot.com
http://twitter.com/miguelinlas3

Contents
- Introduction, objectives and background
- HadoopDB architecture
- Results
- Conclusions

Introduction

General
- Analytics are important today
- Data volumes are exploding
- Both points push systems toward shared-nothing architectures
- Two main approaches:
  - Parallel databases
  - Map/Reduce systems

Desired properties
- Performance
  - Cheaper upgrades
  - Pricing model (in the cloud)
- Fault tolerance
  - Transactional workloads: recover after a failure
  - Analytics environments: do not restart long queries
  - Harder and harder as systems scale

Desired properties
- Heterogeneous environments
  - Increasing number of nodes
  - Keeping the hardware homogeneous is difficult
- Flexible query interface
  - BI tools usually connect over JDBC or ODBC
  - A UDF mechanism
  - Both SQL and non-SQL interfaces are desirable

Background: parallel databases
- Standard relational tables and SQL
  - Indexing, compression, caching, I/O sharing
- Tables partitioned over the nodes
  - Transparent to the user
  - Optimizer tailored to the partitioning
- Meet the performance requirement
  - But a highly skilled DBA is needed

Background: parallel databases
- Flexible query interfaces
  - UDF support varies across implementations
- Fault tolerance
  - Parallel databases do not score so well here
  - Assumption: failures are rare
  - Assumption: clusters of only dozens of nodes
  - These shortcomings are engineering decisions driven by the assumptions above

Background: Map/Reduce
- Satisfies the fault tolerance requirement
- Works in heterogeneous environments
- Drawback: performance
  - No prior modeling of the data
  - None of the usual performance-enhancing techniques
- Interfaces
  - M/R jobs can be written in multiple languages (a minimal job sketch follows)
  - SQL is not supported directly (that is what Hive adds)

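To make the programming model concrete, below is a minimal word-count job written against the pre-0.20 org.apache.hadoop.mapred API, the generation of Hadoop that HadoopDB was built on. This sketch is illustrative and not taken from the slides.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Map phase: emit (word, 1) for every token in the input line.
      public static class TokenMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tok = new StringTokenizer(line.toString());
          while (tok.hasMoreTokens()) out.collect(new Text(tok.nextToken()), ONE);
        }
      }

      // Reduce phase: sum the counts collected for each word.
      public static class SumReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (counts.hasNext()) sum += counts.next().get();
          out.collect(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenMapper.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // blocks until the job finishes
      }
    }

The framework splits the input, schedules map and reduce tasks on the TaskTrackers, and transparently re-runs failed tasks, which is where the fault tolerance above comes from.
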
HadoopDB

Ideas
- Main goal: achieve all of the properties described above
- Connect multiple single-node database systems
  - Hadoop is responsible for task coordination and the network layer
  - Queries are parallelized across the nodes
- Fault tolerant, and works on heterogeneous nodes
- Parallel database performance
  - Query processing happens inside the database engine

Architecture background
- Hadoop Distributed File System (HDFS)
  - Block-structured file system managed by a central node
  - Files are broken into blocks and distributed
- Processing layer (Map/Reduce framework)
  - Master/slave architecture
  - A JobTracker (master) and many TaskTrackers (slaves)

Architecture

Database connector module
- Interface between the databases and the TaskTrackers
- Responsibilities (see the sketch below):
  - Connect to the database
  - Execute the SQL query
  - Return the results as key-value pairs
- Achieved goal
  - Data sources behave just like data blocks in HDFS

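A hedged sketch of the connector's core loop: connect, run the query, and stream the rows back as key-value pairs. The class and method names here are hypothetical; the deck only specifies the three responsibilities above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DatabaseConnectorSketch {
      // Runs `query` against the node-local database and emits each row as a
      // (rowId, row) pair, mimicking the (offset, line) pairs a Mapper sees
      // when it reads a plain HDFS block.
      public static void run(String jdbcUrl, String user, String password,
                             String query) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
          long rowId = 0;
          while (rs.next()) {
            System.out.println(rowId++ + "\t" + rs.getString(1));
          }
        }
      }
    }

In the real system this logic sits behind Hadoop's input-format machinery, which is what makes a database indistinguishable from an HDFS block to the rest of the framework.
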
Catalog module
- Metadata about the databases (the fields are sketched below)
  - Database location, driver class, credentials
  - Data sets in the cluster, replicas, partitioning
- The catalog is stored as an XML file in HDFS
- Plan: deploy it as a separate service

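Purely for illustration, the kind of per-chunk record the catalog has to hold so the connector can do its job. Every field name below is hypothetical; the deck only lists the categories of metadata.

    // Hypothetical shape of one catalog entry; not HadoopDB's actual schema.
    public class CatalogEntrySketch {
      String nodeAddress;   // where the database for this chunk lives
      String driverClass;   // e.g. "org.postgresql.Driver"
      String username;      // credentials for the connector
      String password;
      String dataSet;       // logical data set the chunk belongs to
      int    chunkId;       // which partition of the data set this is
      boolean isReplica;    // whether the chunk is a replica of another
    }
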
Data loader module
- Responsibilities:
  - Globally repartitioning the data
  - Breaking each node's data into chunks
  - Bulk-loading the chunks into the single-node databases
- Two main components (the hashing rule is sketched after this list):
  - Global Hasher: a Map/Reduce job that reads from HDFS and repartitions
  - Local Hasher: copies from HDFS to the local file system

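A minimal sketch of the repartitioning rule such a hasher can apply: each record goes to the chunk selected by hashing its partition key. The deck does not name the actual hash function, so the one below is only an assumption.

    public final class PartitionSketch {
      // Assign a record to one of numChunks chunks by its partition key.
      // Masking the sign bit keeps the result of the modulo non-negative.
      static int chunkFor(String partitionKey, int numChunks) {
        return (partitionKey.hashCode() & Integer.MAX_VALUE) % numChunks;
      }

      public static void main(String[] args) {
        // e.g. route a UserVisits row by its sourceIP key into 10 chunks
        System.out.println(chunkFor("158.37.4.24", 10));
      }
    }
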
SMS Planner module
- SQL interface for the analyst, based on Hive
- Steps (a worked example follows the list):
  - Build the AST
  - The semantic analyzer connects to the catalog
  - Build a DAG of relational operators
  - Restructuring by the optimizer
  - Convert the plan into M/R jobs
  - Serialize the DAG of M/R jobs into an XML plan

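To see what the planner buys, take the larger aggregation query used later in the benchmark, SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP. If UserVisits is partitioned on sourceIP, the planner can push the entire query into each node's database and the job only has to collect the results; if it is not, the planner pushes a partial aggregation into the databases and completes the GROUP BY in the Reduce phase. This mirrors the running example in the HadoopDB paper.
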
SMS Planner extensions
- Update the metastore with references to the database tables
- Two phases before execution:
  - Retrieve the data fields to determine the partitioning keys
  - Traverse the DAG bottom-up with a rule-based SQL generator

Benchmarking

Environment
- Amazon EC2 "large" instances
- Each instance:
  - 7.5 GB of memory
  - 2 virtual cores
  - 850 GB of storage
  - 64-bit Linux (Fedora 8)

Benchmarked systems
- Hadoop
  - 256 MB data blocks
  - 1024 MB heap size
  - 200 MB sort buffer
- HadoopDB
  - Configuration similar to Hadoop's
  - PostgreSQL 8.2.5
  - No data compression

Benchmarked systems
- Vertica
  - A new parallel database (column store)
  - A cloud edition was used
  - All data is compressed
- DBMS-X
  - Commercial parallel row store
  - Run on EC2 (no cloud edition available)

Used data
- HTTP log files, HTML pages, rankings
- Sizes (per node):
  - 155 million user visits (~20 GB)
  - 18 million rankings (~1 GB)
- Stored as plain text in HDFS

Loading data

Grep task

Selection task
- Query executed:
  SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;

Aggregation task
- Smaller query:
  SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
- Larger query:
  SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

Join task
- Query:
  SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue) FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY UV.sourceIP;
- Not exactly the same query on every system

UDF aggregation task

Summary
- HadoopDB approaches parallel database performance in the absence of failures
  - PostgreSQL is not a column store
  - The DBMS-X numbers are 15% overly optimistic
  - The PostgreSQL data was not compressed
- It outperforms Hadoop

Fault tolerance and heterogeneous environments

Benchmarks

Discussion
- Vertica is faster
- Reducing the number of nodes still keeps performance within the same order of magnitude
- Fault tolerance is important

Conclusions

Conclusion
- Approaches parallel database performance while keeping fault tolerance
- PostgreSQL is not a column store
- Hadoop and Hive are relatively new open source projects
- HadoopDB is flexible and extensible

References
- Hadoop web page
- HadoopDB article
- HadoopDB project
- Vertica
- Apache Hive

That's all!