
Maintainable Cloud Architecture of Hadoop

3,683 views


The architecture around Hadoop at Treasure Data, built on cloud services.

Published in: Software


  1. 1. Maintainable Cloud Architecture of Hadoop Kai Sasaki Treasure Data Inc.
  2. 2. Who am I? • Kai Sasaki (佐々木 海) • @Lewuathe on Twitter and GitHub • Software Engineer at Treasure Data Inc. • Contributor to Hadoop and Spark
  3. 3. Hadoop in Treasure Data
  4. 4. Cloud-based Data warehousing service
  5. 5. Hadoop is the core of Treasure Data
  6. 6. Hadoop on Cloud 1. Features provided by AWS, IDCF, Heroku, etc. 2. Fast-growing reliability and integrity
  7. 7. Hadoop on Cloud 1. Features provided by AWS, IDCF, Heroku, etc. 2. Fast-growing reliability and integrity • Maintainability of Middleware
  8. 8. Agenda • Maintainability of Distributed System • Our Challenges • Stateless Hive Metastore • Cloud Storage for Hadoop • Multiple Hadoop Version Management • Regression Test for Hive Queries • REST API for Hadoop • Workflow Integration • What we should keep in mind
  9. 9. Maintainability We think high maintainability is achieved by… • Stateless
  10. 10. Maintainability We think high maintainability is achieved by… • Stateless • Mobility
  11. 11. Maintainability We think high maintainability is achieved by… • Stateless • Mobility • Queueing
  12. 12. Stateless • Stateless Hive metastore • Cloud Storage for Hadoop
  13. 13. Stateless Hive MS
  14. 14. Stateful Hive MS MySQL
  15. 15. Stateful Hive MS Driver Metastore MySQL
  16. 16. Stateful Hive MS Driver Metastore MySQL • Requires maintaining an RDBMS only for the metastore
  17. 17. Stateless Hive MS Driver Metastore
  18. 18. Stateless Hive MS Driver Metastore Derby
  19. 19. Stateless Hive MS Driver Metastore Derby Worker Submit DDL request
  20. 20. Stateless Hive MS Driver Metastore Derby Worker Submit DDL request Aggregate Stateful points Treasure Data API
  21. 21. Cloud Storage for Hadoop
  22. 22. PlazmaDB • Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more • SDK: iOS, Android, JavaScript, Unity • Bulk Import • td client • ...
  23. 23. PlazmaDB • Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more • SDK: iOS, Android, JavaScript, Unity • Bulk Import • td client • ... • msgpack
  24. 24. PlazmaDB • Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more • SDK: iOS, Android, JavaScript, Unity • Bulk Import • td client • ... • msgpack • Hadoop
  25. 25. PlazmaDB • Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more • SDK: iOS, Android, JavaScript, Unity • Bulk Import • td client • ... • msgpack • Hadoop • Stateful
  26. 26. PlazmaDB PostgreSQL S3 or Riak S3 or Riak S3 or Riak S3 or Riak msgpack Amazon RDS
  27. 27. PlazmaDB PostgreSQL S3 or Riak S3 or Riak S3 or Riak S3 or Riak msgpack Amazon RDS Transaction Immutable
  28. 28. Mobility • Multiple Hadoop Version Management • Regression Test for Hive Queries
  29. 29. Multiple Hadoop Version Management
  30. 30. Multiple Version Management CDH HDP Apache
  31. 31. Multiple Version Management CDH HDP Apache client client client
  32. 32. Multiple Version Management CDH HDP Apache client client client Tough Operation
  33. 33. Multiple Version Management CDH HDP Apache Worker
  34. 34. Multiple Version Management CDH HDP Apache Worker switching
  35. 35. Multiple Version Management CDH HDP Apache Worker switching
  36. 36. Multiple Version Management CDH HDP Apache Worker CDH package HDP package Apache package switching
  37. 37. Multiple Version Management CDH HDP Apache Worker CDH package HDP package Apache package S3 switching
  38. 38. Multiple Version Management S3 /test /stable ...
  39. 39. Multiple Version Management CDH package HDP package Apache package S3 /test /stable ...
  40. 40. Multiple Version Management CDH package HDP package Apache package S3 /test /stable ... CDH HDP Apache Worker download
  41. 41. Regression Test for Hive • New features, version upgrades, and migrations must be introduced without regressions • We run integration system tests and regression tests for Hive queries
  42. 42. CDH HDP Apache Worker http://blog.circleci.com/meet-our-new-logo/ System Test Repository
  43. 43. CDH HDP Apache Worker http://blog.circleci.com/meet-our-new-logo/ System Test Repository
  44. 44. CDH HDP Apache Worker http://blog.circleci.com/meet-our-new-logo/ System Test Repository S3 Hadoop Repository
  45. 45. CDH HDP Apache Worker http://blog.circleci.com/meet-our-new-logo/ System Test Repository S3 Apache package Hadoop Repository
  46. 46. CDH HDP Apache Worker http://blog.circleci.com/meet-our-new-logo/ System Test Repository S3 Apache package Hadoop Repository
  47. 47. Queueing • REST API for Hadoop • RDS-based queue management system
  48. 48. REST API for Hadoop
  49. 49. REST API for Hadoop CDH HDP Apache Worker
  50. 50. REST API for Hadoop CDH HDP Apache Worker PerfectQueue Hadoop Job Server REST API
  51. 51. REST API for Hadoop CDH HDP Apache Worker PerfectQueue Hadoop Job Server REST API Presto API
  52. 52. RDBMS-based Queue Management System
  53. 53. RDBMS-based queue management CDH HDP Apache Worker Client Client Client PerfectQueue Hadoop Job Server
  54. 54. PerfectQueue • Highly available distributed queue built on an RDBMS • Amazon SQS-like API • Resource scheduling for multi-tenancy • Graceful and live restarts https://github.com/treasure-data/perfectqueue
  55. 55. What we should keep in mind • Stateless: delegate responsibility to cloud systems • Mobility: plan ahead for version upgrades and migrations • Queueing: make each request persistent
  56. 56. Recap • Maintainability of Distributed System • Our Challenges • Stateless Hive Metastore • Cloud Storage for Hadoop • Multiple Hadoop version management • Regression Test for Hive queries • REST API for Hadoop • Workflow Integration • What we should keep in mind
  57. 57. https://www.treasuredata.com/
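
The queueing idea from slides 52 to 55 (a queue built on an RDBMS, with an SQS-like API, where each request is made persistent) can be sketched roughly as follows. This is a minimal illustration, not PerfectQueue's actual API or schema: the `RdbmsQueue` class, the `tasks` table, and the `visibility_timeout` parameter are all assumptions, and SQLite stands in for the RDS database so the example is self-contained.

```python
import sqlite3
import time


class RdbmsQueue:
    """Hypothetical single-connection sketch of an RDBMS-backed task queue."""

    def __init__(self, conn, visibility_timeout=30.0):
        self.conn = conn
        self.visibility_timeout = visibility_timeout
        # Every submitted request becomes a row, so it survives worker restarts.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " payload TEXT NOT NULL,"
            " visible_at REAL NOT NULL)"
        )
        conn.commit()

    def submit(self, payload, now=None):
        """Persist a request and return its task id."""
        now = time.time() if now is None else now
        with self.conn:
            cur = self.conn.execute(
                "INSERT INTO tasks (payload, visible_at) VALUES (?, ?)",
                (payload, now),
            )
        return cur.lastrowid

    def acquire(self, now=None):
        """Hand out the oldest visible task and hide it for the timeout.

        If the worker dies before calling finish(), the task becomes visible
        again, similar in spirit to an SQS visibility timeout. A multi-worker
        deployment would additionally need row locking in the database.
        """
        now = time.time() if now is None else now
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM tasks"
                " WHERE visible_at <= ? ORDER BY id LIMIT 1",
                (now,),
            ).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "UPDATE tasks SET visible_at = ? WHERE id = ?",
                (now + self.visibility_timeout, row[0]),
            )
        return row

    def finish(self, task_id):
        """Delete a task once the worker has completed it."""
        with self.conn:
            self.conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
```

A worker would loop over `acquire()`, run the job, and call `finish()` only on success, so a request that was accepted is never lost when a worker is restarted.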

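The version-management layout from slides 38 to 40 (distribution packages stored under `/test` and `/stable` prefixes on S3, downloaded by a worker) can be sketched as a small resolver. The slides show only the distributions and the two prefixes; the key format, file name, and every identifier below are assumptions made for illustration.

```python
from dataclasses import dataclass

# Taken from the slides: three distributions and two S3 prefixes.
DISTRIBUTIONS = {"cdh", "hdp", "apache"}
CHANNELS = {"test", "stable"}


@dataclass(frozen=True)
class PackageRef:
    """Hypothetical reference to one Hadoop package stored on S3."""

    distribution: str
    channel: str

    def s3_key(self) -> str:
        # Assumed key layout: <channel>/<distribution>-package.tar.gz
        return f"{self.channel}/{self.distribution}-package.tar.gz"


def resolve_package(distribution: str, channel: str = "stable") -> PackageRef:
    """Validate the requested distribution and channel, then build a reference."""
    dist = distribution.lower()
    if dist not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution!r}")
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel!r}")
    return PackageRef(dist, channel)
```

Under these assumptions, switching a worker between versions amounts to resolving a different key and downloading that package, which keeps the worker itself free of any installed distribution.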