HadoopDB

A brief introduction to a new approach for handling large amounts of data

Transcript

  • 1. HadoopDB Miguel Angel Pastor Olivar miguelinlas3 at gmail dot com http://miguelinlas3.blogspot.com http://twitter.com/miguelinlas3
  • 2. Contents
    • Introduction, objectives and background
    • HadoopDB architecture
    • Results
    • Conclusions
  • 6. Introduction
  • 7. General
    • Analytics are important today
    • The amount of data is exploding
    • The previous problem leads to shared-nothing architectures
    • Approaches:
      • Parallel databases
      • Map/Reduce systems
  • 12. Desired properties
    • Performance
      • Cheaper upgrades
      • Pricing model (cloud)
    • Fault tolerance
      • Transactional workloads: recovery
      • Analytics environments: do not restart queries
      • Problems at scale
  • 16. Desired properties
    • Heterogeneous environments
      • Increasing number of nodes
      • Homogeneity is difficult to maintain
    • Flexible query interface
      • BI tools usually use JDBC or ODBC
      • UDF mechanism
      • Both SQL and non-SQL interfaces are desirable
  • 20. Background: parallel databases
    • Standard relational tables and SQL
      • Indexing, compression, caching, I/O sharing
    • Tables partitioned over the nodes
      • Transparent to the user
      • Tailored query optimizer
    • Meet performance requirements
      • Highly skilled DBAs are needed
  • 22. Background: parallel databases
    • Flexible query interfaces
      • UDF support varies across implementations
    • Fault tolerance
      • Does not score so well
      • Assumption: failures are rare
      • Assumption: clusters have dozens of nodes
      • Engineering decisions
  • 26. Background: Map/Reduce
  • 27. Background: Map/Reduce
    • Satisfies fault tolerance
    • Works on heterogeneous environments
    • Drawback: performance
      • No previous modeling
      • No performance-enhancing techniques
    • Interfaces
      • Write M/R jobs in multiple languages
      • SQL not supported directly ( Hive )
  • 32. HadoopDB
  • 33. Ideas
    • Main goal: achieve the properties described before
    • Connect multiple single-node database systems
      • Hadoop is responsible for task coordination and the network layer
      • Queries are parallelized across the nodes
    • Fault tolerant and works on heterogeneous nodes
    • Parallel database performance
      • Query processing happens in the database engine
  • 37. Architecture background
    • Hadoop Distributed File System (HDFS)
      • Block-structured file system managed by a central node
      • Files are broken into blocks and distributed
    • Processing layer (Map/Reduce framework)
      • Master/slave architecture
      • Job and Task trackers
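The HDFS behavior described above can be sketched in a few lines of Python. The block size and round-robin placement here are simplifications: real HDFS also replicates each block and takes rack locality into account.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks, as HDFS does
    before distributing them across data nodes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_round_robin(blocks, nodes):
    """Toy placement policy: spread the blocks over the data nodes."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"0123456789", 4)   # 3 blocks: 4 + 4 + 2 bytes
placement = assign_round_robin(blocks, ["node-a", "node-b"])
```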
  • 40. Architecture
  • 41. Database connector module
    • Interface between the database and the task tracker
    • Responsibilities
      • Connect to the database
      • Execute the SQL query
      • Return the results as key-value pairs
    • Achieved goal
      • Data sources behave like data blocks in HDFS
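A minimal sketch of the connector's contract, using sqlite3 as a stand-in for the node-local PostgreSQL instance (the real connector is a Hadoop InputFormat implemented in Java):

```python
import sqlite3

def connector_records(conn, sql, key_col=0):
    """Run a SQL query and yield (key, value) pairs, the form in
    which HadoopDB's connector hands rows to a Map task."""
    for row in conn.execute(sql):
        yield row[key_col], row   # key plus the full tuple as the value

# Toy node-local database standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rankings (pageURL TEXT, pageRank INT)")
conn.executemany("INSERT INTO Rankings VALUES (?, ?)",
                 [("a.html", 3), ("b.html", 15), ("c.html", 42)])

pairs = list(connector_records(
    conn, "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10"))
```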
  • 45. Catalog module
    • Metadata about the databases
      • Database location, driver class, credentials
      • Datasets in the cluster, replication and partitioning
    • The catalog is stored as an XML file in HDFS
    • Plan: deploy it as a separate service
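Reading such a catalog can be sketched with the standard library. The element and attribute names below are assumptions for illustration, not the actual HadoopDB catalog schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout -- the real HadoopDB schema may differ.
CATALOG_XML = """
<catalog>
  <node host="10.0.0.1" port="5432" driver="org.postgresql.Driver">
    <partition table="Rankings" chunk="0"/>
  </node>
  <node host="10.0.0.2" port="5432" driver="org.postgresql.Driver">
    <partition table="Rankings" chunk="1"/>
  </node>
</catalog>
"""

def chunks_for(table, xml_text):
    """Return the (host, chunk) pairs holding partitions of a table."""
    root = ET.fromstring(xml_text)
    out = []
    for node in root.findall("node"):
        for part in node.findall("partition"):
            if part.get("table") == table:
                out.append((node.get("host"), int(part.get("chunk"))))
    return out

locations = chunks_for("Rankings", CATALOG_XML)
```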
  • 48. Data loader module
    • Responsibilities:
      • Globally repartitioning the data
      • Breaking single-node data into chunks
      • Bulk-loading the data into the single-node chunks
    • Two main components:
      • Global Hasher
        • A Map/Reduce job that reads from HDFS and repartitions
      • Local Hasher
        • Copies from HDFS to the local file system
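The Global Hasher's core operation can be sketched as stable hash partitioning. The real component is a full Map/Reduce job; this shows only the partitioning rule it would apply:

```python
import hashlib

def partition(key, n_nodes):
    """Deterministic hash partitioning of a key to a node chunk."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_nodes

def repartition(rows, key_of, n_nodes):
    """Group rows into per-node chunks by hashing their partition key."""
    chunks = {i: [] for i in range(n_nodes)}
    for row in rows:
        chunks[partition(key_of(row), n_nodes)].append(row)
    return chunks

rows = [("a.html", 1), ("b.html", 2), ("a.html", 3)]
chunks = repartition(rows, lambda r: r[0], 4)
```

Because the hash is deterministic, rows sharing a key always land in the same chunk, which is what later makes node-local joins on that key possible.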
  • 51. SMS Planner module
    • SQL interface for the analyst, based on Hive
    • Steps
      • AST construction
      • The semantic analyzer connects to the catalog
      • DAG of relational operators
      • Optimizer restructuring
      • Convert the plan into M/R jobs
      • The M/R DAG is serialized into an XML plan
  • 58. SMS Planner extensions
    • Update the metastore with table references
    • Two phases before execution
      • Retrieve data fields to determine partitioning keys
      • Traverse the DAG bottom-up with a rule-based SQL generator
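One of the planner's push-down rules can be illustrated with a toy decision function (an illustration of the idea, not the actual Hive/HadoopDB code): when a query groups on the partitioning key, the whole aggregation can run inside the node-local databases; otherwise each node computes partial aggregates and a Reduce phase merges them.

```python
def plan_aggregation(group_key, partition_key, table="UserVisits"):
    """Toy SMS-planner rule: push the full GROUP BY into SQL when it
    matches the partitioning key; otherwise add a merging Reduce phase."""
    node_sql = (f"SELECT {group_key}, SUM(adRevenue) "
                f"FROM {table} GROUP BY {group_key}")
    if group_key == partition_key:
        return {"map_sql": node_sql, "reduce": None}   # map-only job
    return {"map_sql": node_sql,                       # partial aggregates
            "reduce": f"merge partial sums by {group_key}"}

plan = plan_aggregation("sourceIP", "sourceIP")
```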
  • 61. Benchmarking
  • 62. Environment
    • Amazon EC2 “large” instances
    • Each instance:
      • 7.5 GB of memory
      • 2 virtual cores
      • 850 GB of storage
      • 64-bit Linux Fedora 8
  • 67. Benchmarked systems
    • Hadoop
      • 256 MB data blocks
      • 1024 MB heap size
      • 200 MB sort buffer
    • HadoopDB
      • Configuration similar to Hadoop's
      • PostgreSQL 8.2.5
      • Data not compressed
  • 72. Benchmarked systems
    • Vertica
      • New parallel database (column store)
      • Used a cloud edition
      • All data is compressed
    • DBMS-X
      • Commercial parallel row store
      • Run on EC2 (no cloud edition available)
  • 76. Used data
    • HTTP log files, HTML pages, rankings
    • Sizes (per node):
      • 155 million user visits (~20 GB)
      • 18 million rankings (~1 GB)
      • Stored as plain text in HDFS
  • 80. Loading data
  • 81. Grep Task
  • 82. Selection Task
    • Query executed
      • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10
  • 83. Aggregation Task
    • Smaller query
    SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
    • Larger query
    SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP
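Executed Hadoop-style, the smaller aggregation boils down to a map that emits the IP prefix with the revenue and a reduce that sums per key; a minimal sketch over toy rows:

```python
from collections import defaultdict

# Toy UserVisits rows: (sourceIP, adRevenue).
visits = [("10.1.2.3", 1.5), ("10.1.2.4", 2.0), ("77.0.0.1", 0.5)]

def map_phase(rows):
    """Emit (SUBSTR(sourceIP, 1, 7), adRevenue) pairs."""
    for ip, revenue in rows:
        yield ip[:7], revenue        # first seven characters of the IP

def reduce_phase(pairs):
    """SUM(adRevenue) grouped by the emitted key."""
    totals = defaultdict(float)
    for key, revenue in pairs:
        totals[key] += revenue
    return dict(totals)

result = reduce_phase(map_phase(visits))
```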
  • 84. Join Task
    • Query
    SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue) FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY UV.sourceIP;
    • Not exactly the same query in every system
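The join query amounts to a hash join on pageURL = destURL followed by a per-sourceIP aggregation; a sketch over toy rows (the date filter is omitted for brevity):

```python
from collections import defaultdict

# Toy tables: Rankings(pageURL, pageRank), UserVisits(sourceIP, destURL, adRevenue).
rankings = [("a.html", 10), ("b.html", 3)]
visits = [("10.0.0.1", "a.html", 1.0), ("10.0.0.2", "a.html", 2.5)]

def join_and_aggregate(rankings, visits):
    """Hash-join on pageURL = destURL, then aggregate per sourceIP:
    [COUNT(pageRank), SUM(pageRank), SUM(adRevenue)]."""
    rank_by_url = dict(rankings)                # build side
    per_ip = defaultdict(lambda: [0, 0, 0.0])
    for ip, url, revenue in visits:             # probe side
        if url in rank_by_url:
            agg = per_ip[ip]
            agg[0] += 1
            agg[1] += rank_by_url[url]
            agg[2] += revenue
    return dict(per_ip)

result = join_and_aggregate(rankings, visits)
```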
  • 85. UDF Aggregation Task
  • 86. Summary
    • HadoopDB approaches parallel databases in the absence of failures
      • PostgreSQL is not a column store
      • The DBMS-X numbers are 15% overly optimistic
      • No compression of the PostgreSQL data
    • Outperforms Hadoop
  • 89. Fault tolerance and heterogeneous environments
  • 90. Benchmarks
  • 91. Discussion
    • Vertica is faster
    • Reduce the number of nodes needed to achieve the same order of magnitude
    • Fault tolerance is important
  • 94. Conclusions
  • 95. Conclusion
    • Approaches parallel database performance with fault tolerance
    • PostgreSQL is not a column store
    • Hadoop and Hive are relatively new open source projects
    • HadoopDB is flexible and extensible
  • 99. References
  • 100. References
  • 105. That's all!