
Self-Adapting, Energy-Conserving Distributed File Systems



Overview of self-adapting, energy-conserving distributed file systems. Case study: GreenHDFS.

Published in: Technology, Business

  1. EEDC 34330 - Execution Environments for Distributed Computing
     Self-Adapting, Energy-Conserving Distributed File Systems
     European Master in Distributed Computing - EMDC
     EEDC Presentation
     Mário Almeida - 4knahs[@]
  2. Outline
     ● Introduction
       ○ Green Computing
       ○ Distributed File Systems
       ○ DFS issues
     ● Hadoop Distributed File System
       ○ Overview
       ○ Evaluation
     ● Green HDFS
       ○ Overview
       ○ Design
       ○ Goal
       ○ Energy-management policies
       ○ Machine learning
       ○ Evaluation
     ● Conclusions
     ● References
  3. Introduction - Green Computing
     ● Environmentally sustainable computing with minimal impact on the environment.
     ● Reduction of energy consumption, greenhouse gas emissions and operational costs.
  4. Introduction - Distributed FS
     ● A Distributed File System (DFS) is any file system that allows multiple hosts to access and share files over a computer network.
     ● May include facilities for transparent replication and fault tolerance.
  5. Introduction - DFS Issues
     ● Distributed File Systems are often built to run on a large number of commodity servers.
     ● As a result:
       ○ The cluster generates heat and consumes large amounts of energy.
       ○ Costs depend on initial acquisition as well as on power, cooling, etc.
  6. Introduction - DFS Issues
     ● Common approach:
       ○ Scale-down: transitioning servers into low-power states.
       ○ Other approaches, not exclusive to DFS, might include renewable energy, free cooling, etc.
  7. HDFS Overview
     ● The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
     ● HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
  8. HDFS Evaluation
     In 2010, a detailed analysis of files was done in a production Yahoo! Hadoop cluster with the following characteristics:
     ● 2600 servers
     ● 34 million files
     ● Over 5 PB of data
     ● 3 months of observation
  9. HDFS Evaluation
     Key observations:
     ● Files are heterogeneous in access and lifespan patterns.
     ● 60% of data is "cold" or dormant.
     ● 95-98% of files have a very short "hotness" lifespan of less than 3 days.
     ● 90% of files were dormant or "cold" for more than 18 days.
     ● The majority of the data had a news-server-like access pattern.
  10. GHDFS Overview
     ● Self-adaptive: depends only on HDFS and file access patterns.
     ● Applies data-classification techniques.
     ● Energy-aware placement of data.
     ● Trades off cost, performance and power by separating the cluster into logical zones.
  11. GHDFS Design
     ● Hot Zone
       ○ Files currently accessed and newly created
       ○ High energy usage and performance
     ● Cold Zone
       ○ Files with low to rare access
       ○ Low energy use
       ○ Sleeping mode
  12. GHDFS - Management Policies
     GreenHDFS uses three different management policies:
     ● FMP - File Migration Policy
     ● SCP - Server Power Conserver Policy
     ● FRP - File Reversal Policy
     [Diagram: Hot Zone and Cold Zone]
  13. GHDFS - File Migration Policy
     ● FMP monitors the dormancy of files.
     ● Runs in the Hot Zone.
     ● Gives higher storage efficiency for the Hot Zone, as less-accessed files are moved to the Cold Zone.
     [Diagram: files move from the Hot Zone to the Cold Zone when Coldness > Threshold, and from the Cold Zone to the Hot Zone when Hotness > Threshold]
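The dormancy check described for FMP can be sketched as below. This is a minimal illustration, not the GreenHDFS implementation: `FileInfo`, `run_file_migration`, the day-number timestamps and the 18-day threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    last_access_day: int  # hypothetical: day number of the most recent access
    zone: str = "hot"


def run_file_migration(files, today, coldness_threshold_days):
    """Move Hot Zone files whose dormancy exceeds the threshold to the Cold Zone."""
    moved = []
    for f in files:
        dormancy = today - f.last_access_day  # days since the last access
        if f.zone == "hot" and dormancy > coldness_threshold_days:
            f.zone = "cold"
            moved.append(f.path)
    return moved


# Hypothetical sample data: one long-dormant file and one recently read file.
files = [
    FileInfo("/logs/2010-01-01/part-0", last_access_day=1),
    FileInfo("/logs/2010-01-20/part-0", last_access_day=20),
]
moved = run_file_migration(files, today=21, coldness_threshold_days=18)
```

Only the first file is migrated here, because its dormancy (20 days) exceeds the assumed threshold while the second file was read the previous day.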
  14. GHDFS - Power Conserver Policy
     ● SCP runs in the Cold Zone.
     ● Determines which servers can go into standby/sleep mode.
     ● Uses hardware techniques to transition the CPU, disks and DRAM into a low-power state.
     ● Wakes the server up only if:
       ○ Data on that server is accessed
       ○ New data needs to be placed on that server
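The sleep and wake-up rules above can be sketched as a small state machine. This is an assumption-laden illustration: `ColdServer`, `apply_power_policy` and `on_request` are hypothetical names, and "idle" is simplified to "no pending requests", whereas the slides do not say how idleness is measured.

```python
class ColdServer:
    """Hypothetical model of one Cold Zone server."""

    def __init__(self, name):
        self.name = name
        self.asleep = False
        self.pending_requests = 0  # outstanding reads or new-data placements


def apply_power_policy(servers):
    """Put idle Cold Zone servers to sleep and return the names transitioned."""
    slept = []
    for s in servers:
        if not s.asleep and s.pending_requests == 0:
            s.asleep = True
            slept.append(s.name)
    return slept


def on_request(server):
    """Wake the server on a data access or data placement; return True if it was asleep."""
    was_asleep = server.asleep
    server.asleep = False
    server.pending_requests += 1
    return was_asleep


idle, busy = ColdServer("cold-a"), ColdServer("cold-b")
busy.pending_requests = 2
slept = apply_power_policy([idle, busy])
woke = on_request(idle)
```

In this run only `cold-a` is put to sleep, and the later request wakes it again.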
  15. GHDFS - File Reversal Policy
     ● FRP runs in the Cold Zone.
     ● Ensures QoS, bandwidth and response time are well managed in case a file becomes popular.
     [Diagram: files move from the Cold Zone to the Hot Zone when #accesses > Threshold]
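The access-count rule for FRP can be sketched as follows. The function name `run_file_reversal`, the flat access log and the threshold of 3 are hypothetical; the slides only state that a file returns to the Hot Zone once its number of accesses exceeds a threshold.

```python
from collections import Counter


def run_file_reversal(access_log, zone_of, access_threshold):
    """Promote Cold Zone files accessed more than access_threshold times."""
    counts = Counter(access_log)  # path -> number of accesses in the window
    promoted = []
    for path, n in counts.items():
        if zone_of.get(path) == "cold" and n > access_threshold:
            zone_of[path] = "hot"
            promoted.append(path)
    return sorted(promoted)


# Hypothetical window of accesses and current zone assignments.
access_log = ["/a"] * 5 + ["/b"] * 2 + ["/c"] * 9
zone_of = {"/a": "cold", "/b": "cold", "/c": "hot"}
promoted = run_file_reversal(access_log, zone_of, access_threshold=3)
```

Here `/a` is promoted, `/b` stays cold because it was read only twice, and `/c` is ignored because it is already in the Hot Zone.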
  16. GHDFS - Machine Learning
     ● Designing and developing algorithms that allow computers to evolve behaviors based on empirical data.
     ● Recognize patterns and make decisions based on data.
  17. GHDFS - Machine Learning
     GHDFS uses:
     ● Supervised machine learning.
     ● A variant of Multiple Linear Regression to find the statistical correlation between directory and file attributes.
     ● Training data preparation: audit logs and metadata.
     ● Predicts a file's lifespan, size and heat upon creation of the file.
     It works because there is a high correlation between the directory hierarchy and file attributes in a well-laid-out and partitioned namespace!
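To illustrate why the directory hierarchy is a useful predictor, here is a deliberately simplified sketch. It is not the regression variant GreenHDFS uses: it assumes a single categorical feature (the parent directory) and predicts only lifespan. A least-squares fit with one indicator variable per directory and no intercept reduces to the mean of each directory, which is what the code computes. All names and numbers are hypothetical.

```python
from collections import defaultdict


def fit_directory_model(training):
    """training: iterable of (directory, lifespan_days) taken from past files.

    Returns the per-directory mean lifespan, i.e. the least-squares solution
    when each directory is encoded as its own indicator variable.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for directory, lifespan in training:
        totals[directory] += lifespan
        counts[directory] += 1
    return {d: totals[d] / counts[d] for d in totals}


def predict_lifespan(model, path, default):
    """Predict the lifespan of a new file from its parent directory."""
    directory = path.rsplit("/", 1)[0]
    return model.get(directory, default)


# Hypothetical training rows, as might be derived from audit logs and metadata.
training = [
    ("/data/clicks", 2.0),
    ("/data/clicks", 4.0),
    ("/data/archive", 90.0),
]
model = fit_directory_model(training)
pred = predict_lifespan(model, "/data/clicks/part-00042", default=30.0)
unseen = predict_lifespan(model, "/tmp/x", default=30.0)
```

A file created under `/data/clicks` inherits the average lifespan of its siblings, while a file in an unseen directory falls back to the assumed default.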
  18. GHDFS - Machine Learning
  19. GHDFS - Evaluation
  20. GHDFS - Evaluation
  21. GHDFS - Evaluation
  22. GHDFS - Evaluation
  23. GHDFS - Evaluation
     ● Energy consumption reduced by 24%, saving $2.1 million in energy costs per annum (38000 servers).
     ● Maximizes the usage of the power budget by allowing the infrastructure to expand. More Hot Zone servers offer more availability and performance.
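As a rough back-of-the-envelope check on these figures, the snippet below derives a per-server saving and an implied baseline cost. It assumes, which the slides do not state, that the $2.1 million corresponds exactly to the 24% reduction and that savings are spread evenly over the 38000 servers.

```python
annual_savings_usd = 2.1e6  # from the slide
servers = 38000             # from the slide
reduction = 0.24            # from the slide

# Assumption: savings are spread evenly across all servers.
savings_per_server = annual_savings_usd / servers

# Assumption: the savings equal exactly 24% of the original annual energy cost.
implied_baseline_usd = annual_savings_usd / reduction
```

Under these assumptions the saving is roughly $55 per server per year, and the implied annual energy cost before GreenHDFS is about $8.75 million.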
  24. Conclusions
     ● Machine learning can be applied to build a predictive, self-managed energy control system that achieves better results than reactive approaches.
     ● Good energy management policies can result in high savings in energy consumption.
     ● Data-classification techniques can help achieve a better energy-aware placement of data in Distributed File Systems.
     ● The presented techniques, applied in conjunction with other more common green computing technologies, can significantly reduce the maintenance costs of the cluster.
  25. References
     ● GreenHDFS: Towards an Energy-Conserving, Storage-Efficient, Hybrid Hadoop Compute Cluster
     ● Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System
     ● Predictive Data and Energy Management in GreenHDFS
     ● The Hadoop Distributed File System
     ● Introduction to Machine Learning (Adaptive Computation and Machine Learning)