HadoopDB
Brief introduction to a new approach to handling large amounts of data

Published in: Technology

Transcript

  • 1. HadoopDB Miguel Angel Pastor Olivar miguelinlas3 at gmail dot com http://miguelinlas3.blogspot.com http://twitter.com/miguelinlas3
  • 2. Contents
    • Introduction, objectives and background
    • 3. HadoopDB Architecture
    • 4. Results
    • 5. Conclusions
  • 6. Introduction
  • 7. General
    • Analytics are important today
    • 8. Data amount is exploding
    • 9. Previous problem -> shared-nothing architectures
    • 10. Approaches:
      • Parallel databases
      • 11. Map/Reduce systems
  • 12. Desired properties
    • Performance
      • Cheaper upgrades
      • 13. Pricing model (cloud)
    • Fault tolerance
      • Transactional workloads: recover
      • 14. Analytics environments: do not restart queries
      • 15. Problem at scaling
  • 16. Desired properties
    • Heterogeneous environments
      • Increasing number of nodes
      • 17. Difficult to keep homogeneous
    • Flexible query interface
      • BI usually JDBC or ODBC
      • 18. UDF mechanism
      • 19. Desirable SQL and no SQL interfaces
  • 20. Background: parallel databases
    • Standard relational tables and SQL
      • Indexing, compression, caching, I/O sharing
    • Tables partitioned over nodes
      • Transparent to the user
      • 21. Optimizer tailored
    • Meet performance
      • Needed highly skilled DBA
  • 22. Background: parallel databases
    • Flexible query interfaces
      • UDFs vary across implementations
    • Fault tolerance
      • Do not score so well
      • 23. Assumption: failures are rare
      • 24. Assumption: dozens of nodes in clusters
      • 25. Engineering decisions
  • 26. Background: Map/Reduce
  • 27. Background: Map/Reduce
    • Satisfies fault tolerance
    • 28. Works in heterogeneous environments
    • 29. Drawback: performance
      • No prior data modeling
      • 30. No performance-enhancing techniques
    • Interfaces
      • Write M/R jobs in multiple languages
      • 31. SQL not supported directly ( Hive )
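The Map/Reduce model described above can be sketched in a few lines. This is a toy in-memory version for illustration only; real Hadoop jobs implement Mapper and Reducer classes in Java and run distributed over HDFS.

```python
# Minimal in-memory sketch of the Map/Reduce programming model.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and group intermediate pairs by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums counts.
lines = ["hadoop db", "hadoop hive"]
counts = reduce_phase(
    map_phase(lines, lambda line: [(w, 1) for w in line.split()]),
    lambda key, values: sum(values),
)
```

Fault tolerance in the real framework comes from re-running individual map or reduce tasks, which is exactly the property HadoopDB wants to inherit.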
  • 32. HadoopDB
  • 33. Ideas
    • Main goal: achieve the properties described before
    • 34. Connect multiple single-node database systems
      • Hadoop responsible for task coordination and network layer
      • 35. Queries parallelized across the nodes
    • Fault tolerant and works in heterogeneous nodes
    • 36. Parallel databases performance
      • Query processing in database engine
  • 37. Architecture background
    • Hadoop distributed file system (HDFS)
      • Block-structured file system managed by a central node
      • 38. Files broken into blocks and distributed
    • Processing layer (Map/Reduce framework)
      • Master/slave architecture
      • 39. Job and Task trackers
  • 40. Architecture
  • 41. Database connector module
    • Interface between database and task tracker
    • 42. Responsibilities
      • Connect to the database
      • 43. Execute the SQL query
      • 44. Return the results as key-value pairs
    • Achieved goal
      • Data sources are similar to data blocks in HDFS
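The connector's job can be sketched as "run a SQL query against a local database and stream the results back as key-value pairs". In the sketch below, sqlite3 stands in for the node-local PostgreSQL instance, and the function name is illustrative rather than HadoopDB's actual API (the real connector extends Hadoop's InputFormat and talks JDBC).

```python
# Sketch of the Database Connector idea: execute SQL locally and yield
# results as key-value pairs, the way the connector feeds a TaskTracker.
import sqlite3

def run_query_as_key_values(conn, sql, key_column=0):
    """Execute sql and yield (key, row) pairs, mimicking InputFormat records."""
    for row in conn.execute(sql):
        yield row[key_column], row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rankings (pageURL TEXT, pageRank INTEGER)")
conn.executemany("INSERT INTO Rankings VALUES (?, ?)",
                 [("a.html", 3), ("b.html", 12), ("c.html", 25)])
pairs = list(run_query_as_key_values(
    conn, "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10"))
```

Because each database behaves like an HDFS data block, Hadoop's scheduler can treat the local query executions as ordinary map tasks.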
  • 45. Catalog module
    • Metadata about databases
      • Database location, driver class, credentials
      • 46. Datasets in cluster, replica or partitioning
    • Catalog stored as an XML file in HDFS
    • 47. Plan to deploy as a separate service
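A plausible shape for that catalog file, and how a planner might read it, is sketched below. The XML schema here is invented for illustration; the paper only says the catalog holds connection metadata and dataset/partition information per node.

```python
# Hypothetical catalog XML and a loader for it (schema is illustrative).
import xml.etree.ElementTree as ET

CATALOG_XML = """
<catalog>
  <node host="10.0.0.1" port="5432" driver="org.postgresql.Driver">
    <partition dataset="UserVisits" chunk="0"/>
    <partition dataset="UserVisits" chunk="1"/>
  </node>
</catalog>
"""

def load_catalog(xml_text):
    """Return {host: [(dataset, chunk), ...]} from the catalog file."""
    root = ET.fromstring(xml_text)
    catalog = {}
    for node in root.findall("node"):
        parts = [(p.get("dataset"), int(p.get("chunk")))
                 for p in node.findall("partition")]
        catalog[node.get("host")] = parts
    return catalog

catalog = load_catalog(CATALOG_XML)
```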
  • 48. Data loader module
    • Responsibilities:
      • Globally repartitioning data
      • 49. Breaking single-node data into chunks
      • 50. Bulk-loading data into single-node chunks
    • Two main components:
      • Global hasher
        • Map/Reduce job reads from HDFS and repartitions
      • Local Hasher
        • Copies from HDFS to local file system
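The two hashing steps can be sketched as follows. The Global Hasher assigns each record to a node by hashing the partition key; the Local Hasher then splits one node's data into chunks. Real HadoopDB does this as a Map/Reduce job over HDFS; this toy version just partitions Python lists, and the function names are illustrative.

```python
# Sketch of the Data Loader's Global Hasher / Local Hasher steps.
import hashlib

def stable_hash(key):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def global_partition(records, key_fn, num_nodes):
    """Global Hasher: route each record to a node by its partition key."""
    nodes = [[] for _ in range(num_nodes)]
    for rec in records:
        nodes[stable_hash(key_fn(rec)) % num_nodes].append(rec)
    return nodes

def local_chunks(node_records, num_chunks):
    """Local Hasher: split one node's records into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for rec in node_records:
        chunks[stable_hash(rec) % num_chunks].append(rec)
    return chunks

records = ["apple", "banana", "cherry", "damson"]
nodes = global_partition(records, lambda r: r, 2)
chunks = local_chunks(nodes[0], 2)
```

Chunking matters because each chunk becomes one bulk-loadable unit for a node-local database, keeping individual task inputs small and restartable.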
  • 51. SMS Planner module
    • SQL interface for analysts, based on Hive
    • 52. Steps
      • AST building
      • 53. Semantic analyzer connects to catalog
      • 54. DAG of relational operators
      • 55. Optimizer restructuring
      • 56. Convert plan to M/R jobs
      • 57. DAG of M/R jobs serialized into an XML plan
  • 58. SMS Planner extensions
    • Update metastore with table references
    • 59. Two phases before execution
      • Retrieve data fields to determine partitioning keys
      • 60. Traverse DAG (bottom up). Rule based SQL generator
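The bottom-up, rule-based SQL generation can be sketched as a recursive walk over a tiny operator tree: each rule renders one relational operator as SQL that can be pushed into the node-local database. The operator representation below is invented for illustration; Hive's real DAG nodes are far richer.

```python
# Toy rule-based SQL generator: walk an operator tree from the scan upward.
def generate_sql(node):
    """Recursively render an operator tree as a single SQL string."""
    kind = node["op"]
    if kind == "scan":
        return node["table"]
    if kind == "filter":
        child = generate_sql(node["child"])
        return f"SELECT * FROM {child} WHERE {node['predicate']}"
    if kind == "project":
        child = generate_sql(node["child"])
        return f"SELECT {', '.join(node['columns'])} FROM ({child}) AS t"
    raise ValueError(f"unknown operator: {kind}")

plan = {"op": "project", "columns": ["pageURL", "pageRank"],
        "child": {"op": "filter", "predicate": "pageRank > 10",
                  "child": {"op": "scan", "table": "Rankings"}}}
sql = generate_sql(plan)
```

The traversal stops (and a Map/Reduce boundary is inserted) once it reaches an operator that needs data from more than one node, such as a repartitioned aggregation.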
  • 61. Benchmarking
  • 62. Environment
    • Amazon EC2 “large” instances
    • 63. Each instance
      • 7.5 GB memory
      • 64. 2 virtual cores
      • 65. 850 GB storage
      • 66. 64 bits Linux Fedora 8
  • 67. Benchmarked systems
    • Hadoop
      • 256MB data blocks
      • 68. 1024 MB heap size
      • 69. 200 MB sort buffer
    • HadoopDB
      • Similar to Hadoop configuration
      • 70. PostgreSQL 8.2.5
      • 71. No data compression
  • 72. Benchmarked systems
    • Vertica
      • New parallel database (column store)
      • 73. Used a cloud edition
      • 74. All data is compressed
    • DBMS-X
      • Commercial parallel row store
      • 75. Run on EC2 (not cloud edition available)
  • 76. Used data
    • HTTP log files, HTML pages, rankings
    • 77. Sizes (per node):
      • 155 million user visits (~20 GB)
      • 78. 18 million rankings (~1 GB)
      • 79. Stored as plain text in HDFS
  • 80. Loading data
  • 81. Grep Task
  • 82. Selection Task
    • Executed query
      • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10
  • 83. Aggregation Task
    • Smaller query
    SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
    • Larger query
    SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP
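The smaller aggregation query above maps naturally onto Map/Reduce: the map phase emits (SUBSTR(sourceIP, 1, 7), adRevenue) pairs and the reduce phase sums revenue per prefix. In HadoopDB the SMS planner would instead push this GROUP BY into each node's PostgreSQL; the sketch below (with made-up rows) shows the pure-Hadoop execution for comparison.

```python
# Map/Reduce view of: SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
#                     FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7)
from collections import defaultdict

visits = [("10.0.0.1", 5.0), ("10.0.0.2", 3.0), ("10.1.9.9", 2.5)]

groups = defaultdict(float)
for source_ip, ad_revenue in visits:      # map: emit (7-char prefix, revenue)
    groups[source_ip[:7]] += ad_revenue   # reduce: sum revenue per prefix
```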
  • 84. Join Task
    • Query
    SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue) FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY UV.sourceIP;
    • Not the same query on all systems
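When the join cannot be pushed entirely into the local databases, it runs as a standard repartition join: both tables are re-keyed on the join attribute in the map phase and matched per key in the reduce phase. The sketch below uses made-up rows and skips the date filter for brevity.

```python
# Repartition-join sketch for the Rankings/UserVisits query above.
from collections import defaultdict

rankings = [("a.html", 10), ("b.html", 20)]                        # (pageURL, pageRank)
visits = [("1.2.3.4", "a.html", 5.0), ("1.2.3.4", "b.html", 1.0)]  # (sourceIP, destURL, adRevenue)

# Map: tag each record with its join key (the URL).
buckets = defaultdict(lambda: {"R": [], "UV": []})
for url, rank in rankings:
    buckets[url]["R"].append(rank)
for ip, url, revenue in visits:
    buckets[url]["UV"].append((ip, revenue))

# Reduce: join within each bucket, then aggregate per sourceIP.
totals = defaultdict(lambda: [0, 0, 0.0])  # sourceIP -> [count, sum(rank), sum(revenue)]
for url, sides in buckets.items():
    for rank in sides["R"]:
        for ip, revenue in sides["UV"]:
            totals[ip][0] += 1
            totals[ip][1] += rank
            totals[ip][2] += revenue
```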
  • 85. UDF Aggregation Task
  • 86. Summary
    • HadoopDB approaches parallel databases in the absence of failures
      • PostgreSQL is not a column store
      • 87. DBMS-X numbers 15% overly optimistic
      • 88. No compression of PostgreSQL data
    • Outperforms Hadoop
  • 89. Fault tolerance and heterogeneous environments
  • 90. Benchmarks
  • 91. Discussion
    • Vertica is faster
    • 92. Reduce the number of nodes to achieve the same order of magnitude
    • 93. Fault tolerance is important
  • 94. Conclusions
  • 95. Conclusion
    • Approaches parallel database performance with fault tolerance
    • 96. PostgreSQL is not a column store
    • 97. Hadoop and Hive are relatively new open source projects
    • 98. HadoopDB is flexible and extensible
  • 99. References
  • 100. References
  • 105. That's all!
