EEDC 34330
Execution Environments for Distributed Computing
European Master in Distributed Computing - EMDC

Self-Adapting, Energy-Conserving Distributed File Systems

EEDC Presentation
Mário Almeida - 4knahs[@]gmail.com
www.marioalmeida.eu
Outline
●   Introduction
     ○ Green Computing
     ○ Distributed File Systems
     ○ DFS issues
●   Hadoop Distributed File System
     ○ Overview
     ○ Evaluation
●   Green HDFS
     ○ Overview
     ○ Design
     ○ Goal
     ○ Energy-management policies
     ○ Machine learning
     ○ Evaluation
●   Conclusions
●   References
                             *
Introduction - Green Computing
●   Environmentally sustainable computing with minimal
    impact on the environment.
●   Reduction of energy consumption, greenhouse gas
    emissions and operational costs.




                            *
Introduction - Distributed FS
●   A Distributed File System (DFS) is any file system that
    allows multiple hosts to access and share files over a
    computer network.
●   May include facilities for transparent replication and
    fault tolerance.




                               *
Introduction - DFS Issues
●   Distributed File Systems are often built to run on a large
    number of commodity servers.
●   This means that:
     ○ the cluster generates heat and consumes large
       amounts of energy.
     ○ costs depend not only on initial acquisition but also
       on power, cooling, etc.




                                *
Introduction - DFS Issues
●   Common approach:
     ○ Scale-down: transitioning servers into low-power
       states.
     ○ Other approaches not exclusive to DFS might
       include renewable energy, free cooling, etc.




                             *
HDFS Overview
●   Hadoop Distributed File System (HDFS) is the primary
    storage system used by Hadoop applications.

●   HDFS creates multiple replicas of data blocks and
    distributes them on compute nodes throughout a cluster
    to enable reliable, extremely rapid computations.




                              *
HDFS Evaluation

In 2010, a detailed analysis of files was done in a
production Yahoo! Hadoop cluster with the following
characteristics:

●   2600 servers
●   34 million files
●   Over 5 PB of data
●   3 months of observation




                              *
HDFS Evaluation

Key observations:
●   Files are heterogeneous in access and lifespan patterns.
●   60% of data is "cold" or dormant.
●   95-98% of files have a very short "hotness" lifespan of
    less than 3 days.
●   90% of files were dormant or "cold" for more than 18
    days.
●   Majority of the data had a news-server-like access
    pattern.




                              *
GHDFS Overview

●   Self-Adaptive - depends only on HDFS and file access
    patterns
●   Applies Data-Classification techniques
●   Energy-Aware placement of data
●   Trades off cost, performance and power by separating
    the cluster into logical zones.




                             *
GHDFS Design


    Hot Zone                       Cold Zone

    Files currently accessed       Files with low to rare access
    and newly created

    High energy usage and          Low energy use and
    performance                    sleeping mode


                    *
GHDFS - Management Policies

GreenHDFS uses three different management
policies:

●   FMP - File Migration Policy (Hot Zone)
●   SCP - Server Power Conserver Policy (Cold Zone)
●   FRP - File Reversal Policy (Cold Zone)




                              *
GHDFS - File Migration Policy
●   FMP monitors the dormancy of files
●   Runs in the Hot Zone

●   Gives higher storage efficiency for the Hot Zone, as less
    accessed files are moved to the Cold Zone

    Hot Zone  --- Coldness > Threshold --->  Cold Zone
    Hot Zone  <--- Hotness > Threshold ----  Cold Zone
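
The migration rule on this slide can be illustrated with a minimal sketch. It assumes that "coldness" is measured as days since last access and that the threshold is a fixed number of days; the `FileInfo` class, the field names and the 18-day threshold are hypothetical and are not taken from GreenHDFS itself.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    days_since_last_access: float  # assumed proxy for "coldness"
    zone: str = "hot"


def run_file_migration(files, coldness_threshold_days):
    """Move Hot Zone files whose dormancy exceeds the threshold to the Cold Zone."""
    migrated = []
    for f in files:
        if f.zone == "hot" and f.days_since_last_access > coldness_threshold_days:
            f.zone = "cold"
            migrated.append(f.path)
    return migrated


# Hypothetical files; paths and ages are made up for illustration.
files = [
    FileInfo("/logs/2010-01-01.gz", days_since_last_access=40),
    FileInfo("/tmp/job-output", days_since_last_access=1),
]
migrated = run_file_migration(files, coldness_threshold_days=18)
```

Only the dormant file changes zone; the recently accessed one stays in the Hot Zone.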



                                *
GHDFS - Power Conserver Policy
 ●   SCP runs in the Cold Zone
 ●   Determines which servers can go into standby/sleep mode.

 ●   Uses hardware techniques to transition the CPU, disks and
     DRAM into a low-power state.

 ●   Wakes the server up only if:
     ○ Data on that server is accessed
     ○ New data needs to be placed on that server
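
The two wake-up conditions above can be sketched as follows. This is an illustrative model only: `ColdServer`, `read_file` and `place_file` are hypothetical names, and real power-state transitions happen in hardware rather than through a boolean flag.

```python
from dataclasses import dataclass, field


@dataclass
class ColdServer:
    name: str
    files: set = field(default_factory=set)
    asleep: bool = True  # Cold Zone servers are assumed to sleep by default

    def wake(self):
        self.asleep = False


def read_file(servers, path):
    """Wake only the server that holds `path`; all others stay asleep."""
    for server in servers:
        if path in server.files:
            server.wake()
            return server.name
    return None


def place_file(server, path):
    """Placing new data on a server also requires waking it."""
    server.wake()
    server.files.add(path)


server_a = ColdServer("cold-a", files={"/archive/old.log"})
server_b = ColdServer("cold-b")
served_by = read_file([server_a, server_b], "/archive/old.log")
```

After the read, `cold-a` is awake and `cold-b` is still asleep, which is where the energy saving comes from.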





                             *
GHDFS - File Reversal Policy
 ●   FRP runs in the Cold Zone.
 ●   Ensures QoS, bandwidth and response time are well
     managed in case a file becomes popular.




 Hot Zone  <--- #accesses > Threshold ---  Cold Zone
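
As a rough sketch of the reversal rule, the class below counts accesses to Cold Zone files and moves a file back to the Hot Zone once the count exceeds a threshold. The class name, method names and the threshold of 2 are assumptions made for illustration.

```python
from collections import Counter


class FileReversal:
    def __init__(self, access_threshold):
        self.access_threshold = access_threshold
        self.accesses = Counter()  # per-file access count while in the Cold Zone
        self.zone = {}             # path -> "hot" or "cold"

    def add_cold_file(self, path):
        self.zone[path] = "cold"

    def record_access(self, path):
        """Count an access and return the zone the file is in afterwards."""
        if self.zone.get(path) != "cold":
            return self.zone.get(path)
        self.accesses[path] += 1
        if self.accesses[path] > self.access_threshold:
            self.zone[path] = "hot"  # the file became popular again
            del self.accesses[path]
        return self.zone[path]


policy = FileReversal(access_threshold=2)
policy.add_cold_file("/archive/report.csv")
zones_seen = [policy.record_access("/archive/report.csv") for _ in range(3)]
```

With a threshold of 2, the file stays cold for the first two accesses and is reversed on the third.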



                               *
GHDFS - Machine Learning
●   Designing and developing algorithms that allow
    computers to evolve behaviors based on empirical
    data.
●   Recognize patterns and make decisions based on data.




                             *
GHDFS - Machine Learning
GHDFS uses:
 ● Supervised machine learning.
 ● A variant of Multiple Linear Regression to find the
   statistical correlation between directory and file
   attributes.
 ● Training data prepared from audit logs and metadata.
 ● Predicts a file's lifespan, size and heat when the file is
   created.

It works because there is a high correlation between the
directory hierarchy and file attributes in a well laid-out and
partitioned namespace.
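
To make the regression idea concrete, here is a minimal sketch of plain multiple linear regression (ordinary least squares), not the specific variant GHDFS uses. The features (directory depth, a log-directory flag) and all training values are hypothetical; they only stand in for directory attributes that might be extracted from audit logs and metadata.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def predict(weights, features):
    """Linear prediction: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical training set: [intercept, directory depth, is_log_directory]
# mapped to an observed lifespan in days. All values are made up.
training_rows = [[1, 1, 0], [1, 2, 0], [1, 3, 1], [1, 1, 1], [1, 2, 1]]
lifespans_days = [5, 8, 16, 10, 13]

weights = fit_linear(training_rows, lifespans_days)
predicted_lifespan = predict(weights, [1, 4, 0])  # new file, depth 4, not a log
```

The same fit could be repeated with size or heat as the target, so that each attribute is predicted from the directory a new file is created in.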



                                *
GHDFS - Evaluation

●   Energy consumption was reduced by 24%, saving $2.1
    million in energy costs per annum (38,000 servers).

●   Maximizes the usage of the power budget by allowing
    the infrastructure to expand. More Hot Zone servers
    offer more availability and performance.
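
A quick back-of-the-envelope check of the figures above, assuming the 24% reduction and the $2.1 million saving refer to the same annual energy bill for the same 38,000 servers. The derived numbers are implied by that assumption, not reported in the slides.

```python
savings_per_year_usd = 2.1e6   # from the slide
reduction_fraction = 0.24      # from the slide
server_count = 38_000          # from the slide

# Implied annual energy bill before the reduction (assumption: same baseline).
implied_baseline_usd = savings_per_year_usd / reduction_fraction

# Implied average saving per server per year.
savings_per_server_usd = savings_per_year_usd / server_count

print(f"Implied baseline: ${implied_baseline_usd:,.0f} per year")
print(f"Saving per server: ${savings_per_server_usd:.2f} per year")
```

Under that assumption the baseline is about $8.75 million per year, or roughly $55 saved per server per year.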




                             *
Conclusions
●   Machine learning can be used to build a predictive,
    self-managed energy control system that achieves better
    results than reactive approaches.

●   Good energy management policies can result in high
    savings in energy consumption.

●   Data-classification techniques can help achieve a
    better energy-aware placement of data in Distributed
    File Systems.

●   The presented techniques, applied in conjunction with
    other more common green computing technologies, can
    significantly reduce the maintenance costs of the cluster.
                               *
References

●   GreenHDFS: Towards an Energy-Conserving,
    Storage-Efficient, Hybrid Hadoop Compute Cluster
●   Evaluation and Analysis of GreenHDFS: A Self-
    Adaptive, Energy-Conserving Variant of the Hadoop
    Distributed File System
●   Predictive Data and Energy Management in
    GreenHDFS
●   The Hadoop Distributed File System
●   Introduction to Machine Learning (Adaptive Computation
    and Machine Learning)




                             *