3. Introduction - Green Computing
● Environmentally sustainable computing with minimal
impact on the environment.
● Reduction of energy consumption, greenhouse gas
emissions and operational costs.
*
4. Introduction - Distributed FS
● A Distributed File System (DFS) is any file system that
allows multiple hosts to access shared files over a
computer network.
● May include facilities for transparent replication and
fault tolerance.
*
5. Introduction - DFS Issues
● Distributed File Systems are often built to run on a large
number of commodity servers.
● This means that:
○ the cluster generates heat and consumes large amounts
of energy.
○ costs depend on the initial acquisition as well as on
power, cooling, etc.
*
6. Introduction - DFS Issues
● Common approach:
○ Scale-Down - Transitioning servers into low power
consumption states.
○ Other approaches not exclusive to DFS might
include renewable energy, free cooling, etc.
*
7. HDFS Overview
● Hadoop Distributed File System (HDFS) is the primary
storage system used by Hadoop applications.
● HDFS creates multiple replicas of data blocks and
distributes them on compute nodes throughout a cluster
to enable reliable, extremely rapid computations.
*
8. HDFS Evaluation
In 2010, a detailed analysis of files was done in a
production Yahoo! Hadoop cluster with the following
characteristics:
● 2600 servers
● 34 million files
● Over 5 PB of data
● 3 months of observation
*
9. HDFS Evaluation
Key observations:
● Files are heterogeneous in access and lifespan patterns.
● 60% of data is "cold" or dormant.
● 95-98% of files have a very short "hotness" lifespan of
less than 3 days.
● 90% of files were dormant or "cold" for more than 18
days.
● Majority of the data had a news-server-like access
pattern.
*
10. GHDFS Overview
● Self-Adaptive - depends only on HDFS and file access
patterns
● Applies Data-Classification techniques
● Energy-Aware placement of data
● Trades cost, performance and power by separating
cluster into logical zones.
*
11. GHDFS Design
● Hot Zone
○ Files currently accessed and newly created
○ High energy usage and performance
● Cold Zone
○ Files with low to rare access
○ Low energy use and sleeping mode
*
12. GHDFS - Management Policies
GreenHDFS uses three different management
policies:
● FMP - File Migration Policy (Hot Zone)
● SCP - Server Power Conserver Policy (Cold Zone)
● FRP - File Reversal Policy (Cold Zone)
*
13. GHDFS - File Migration Policy
● FMP monitors the dormancy of files
● Runs in the Hot Zone
● Gives higher storage efficiency for the Hot Zone as less
accessed files are moved to the Cold Zone
Hot Zone → Cold Zone: Coldness > Threshold
Cold Zone → Hot Zone: Hotness > Threshold
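A minimal sketch of the migration check described above, not the GreenHDFS implementation. The `FileInfo` record, the `run_file_migration` function and the threshold value are hypothetical names chosen for illustration; coldness is assumed here to mean days since last access.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class FileInfo:
    # Hypothetical per-file record; a real system would read this from metadata.
    path: str
    last_access: float  # seconds since some epoch
    zone: str = "hot"


def run_file_migration(files, now, coldness_threshold_days):
    """Move Hot Zone files whose dormancy exceeds the threshold to the Cold Zone."""
    migrated = []
    for f in files:
        dormancy_days = (now - f.last_access) / DAY
        if f.zone == "hot" and dormancy_days > coldness_threshold_days:
            f.zone = "cold"
            migrated.append(f.path)
    return migrated


# Usage: one file untouched for 30 days, one accessed yesterday.
files = [FileInfo("/a", last_access=0.0), FileInfo("/b", last_access=29 * DAY)]
print(run_file_migration(files, now=30 * DAY, coldness_threshold_days=10))
```

Running the policy periodically, rather than on every access, keeps the bookkeeping off the read path; how often to run it is a tuning choice this sketch leaves open.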
*
14. GHDFS - Power Conserver Policy
● SCP runs in the ColdZone
● Determines which servers can go to standby/sleep mode.
● Uses hardware techniques to transition CPU, disks and
DRAM into a low power state.
● Wakes the server up only if:
○ Data on that server is accessed
○ New data needs to be placed on that server
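The wake-up rule above can be sketched as follows. This is an illustrative model only: `ColdServer`, `PowerState` and `apply_power_policy` are hypothetical names, and the actual hardware power transitions are outside the scope of the sketch.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "active"
    SLEEP = "sleep"


class ColdServer:
    # Hypothetical model of a Cold Zone server and its outstanding work.
    def __init__(self, name):
        self.name = name
        self.state = PowerState.ACTIVE
        self.pending_reads = 0   # accesses to data stored on this server
        self.pending_writes = 0  # new data waiting to be placed here


def apply_power_policy(servers):
    """Keep a server awake only while it has pending reads or writes."""
    for s in servers:
        busy = s.pending_reads > 0 or s.pending_writes > 0
        s.state = PowerState.ACTIVE if busy else PowerState.SLEEP
    return {s.name: s.state for s in servers}


# Usage: server "b" has a pending read, server "a" is idle.
a, b = ColdServer("a"), ColdServer("b")
b.pending_reads = 1
print(apply_power_policy([a, b]))
```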
*
15. GHDFS - File Reversal Policy
● FRP runs in the ColdZone.
● Ensures QoS, bandwidth and response time are well
managed in case a file becomes popular.
Cold Zone → Hot Zone: #accesses > Threshold
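A small sketch of the reversal rule, assuming a simple per-file access counter for Cold Zone files. The `FileReversal` class and its threshold are hypothetical; the slides do not say how accesses are counted or over what time window.

```python
from collections import Counter


class FileReversal:
    # Hypothetical tracker: counts accesses to Cold Zone files only.
    def __init__(self, access_threshold):
        self.access_threshold = access_threshold
        self.cold_accesses = Counter()
        self.zone = {}  # path -> "hot" or "cold"

    def record_access(self, path):
        """Count an access; return True if the file moved back to the Hot Zone."""
        if self.zone.get(path) != "cold":
            return False
        self.cold_accesses[path] += 1
        if self.cold_accesses[path] > self.access_threshold:
            self.zone[path] = "hot"
            del self.cold_accesses[path]  # counter is no longer needed
            return True
        return False


# Usage: with a threshold of 2, the third access triggers the reversal.
frp = FileReversal(access_threshold=2)
frp.zone["/logs/x"] = "cold"
print([frp.record_access("/logs/x") for _ in range(3)])
```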
*
16. GHDFS - Machine Learning
● Designing and developing algorithms that allow
computers to evolve behaviors based on empirical
data.
● Recognize patterns and make decisions based on data.
*
17. GHDFS - Machine Learning
GHDFS uses:
● Supervised machine learning.
● A variant of Multiple Linear Regression to find the
statistical correlation between directory and file
attributes.
● Training data preparation - audit logs and metadata.
● Predicts the file's lifespan, size and heat upon file
creation.
It works because there is a high correlation between the
directory hierarchy and file attributes in a well-laid-out and
partitioned name space.
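To illustrate the idea, here is a sketch using plain ordinary least squares rather than the GHDFS variant, which the slides do not specify. The features are hypothetical directory indicators (for example "under /logs", "under /tmp") and the training lifespans are made-up numbers standing in for values derived from audit logs.

```python
def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the system is non-singular (no check is made).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    """Predicted target for one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical training set: [under_logs, under_tmp] -> lifespan in days.
rows = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
lifespans = [30, 34, 2, 4, 10, 12]
weights = fit(rows, lifespans)
print(predict(weights, [1, 0]))  # estimate for a new file under /logs
```

With indicator features like these, the fitted model reduces to the mean lifespan per directory, which is why a well-partitioned name space makes the prediction useful.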
*
23. GHDFS - Evaluation
● Energy consumption reduced by 24%, saving $2.1
million in energy costs per annum (38,000
servers).
● Maximizes the usage of the power budget by allowing
the infrastructure to expand. More Hot Zone servers
offer more availability and performance.
*
24. Conclusions
● Machine learning can be applied for a predictive self-
managed energy control system that achieves better
results than reactive approaches.
● Good Energy Management Policies can result in high
savings in energy consumption.
● Data-Classification techniques can help achieve a
better energy-aware placement of data in Distributed
File Systems.
● The presented techniques, applied in conjunction with
other more common green computing technologies, can
significantly impact the maintenance costs of the cluster.
*
25. References
● GreenHDFS: Towards an Energy-Conserving, Storage-
Efficient, Hybrid Hadoop Compute Cluster
● Evaluation and Analysis of GreenHDFS: A Self-
Adaptive, Energy-Conserving Variant of the Hadoop
Distributed File System
● Predictive Data and Energy Management in
GreenHDFS
● The Hadoop Distributed File System
● Introduction to Machine Learning (Adaptive Computation
and Machine Learning)
*
26. EEDC
34330 Execution Environments for Distributed Computing
Self-Adapting, Energy-Conserving Distributed File Systems
European Master in Distributed Computing - EMDC
EEDC Presentation
Mário Almeida – 4knahs[@]gmail.com
www.marioalmeida.eu