the Hadoop clusters and renders usage of inactive power modes infeasible [26].

Recent research on scale-down in GFS and HDFS managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on. However, these solutions suffer from degraded write-performance as they rely on a write-offloading technique [31] to avoid server wakeups at the time of writes. Write-performance is an important consideration in Hadoop, and even more so in a production Hadoop cluster, as discussed in Section 3.1.

We took a different approach and proposed GreenHDFS, an energy-conserving, self-adaptive, hybrid, logical multi-zoned variant of HDFS, in our paper [23]. Instead of an energy-efficient placement of computations or the use of a small covering set for primary replicas as done in earlier research, GreenHDFS focuses on data-classification techniques to extract energy savings by doing energy-aware placement of data.

GreenHDFS trades cost, performance and power by separating the cluster into logical zones of servers. Each cluster zone has a different temperature characteristic, where temperature is measured by the power consumption and the performance requirements of the zone. GreenHDFS relies on the inherent heterogeneity in the access patterns of the data stored in HDFS to differentiate the data and to come up with an energy-conserving data layout and data placement onto the zones. Since computations exhibit high data locality in the Hadoop framework, the computations then flow naturally to the data in the right temperature zones.

The contribution of this paper lies in showing that the energy-aware, data-differentiation-based data placement in GreenHDFS is able to meet all the effective scale-down mandates (i.e., it generates significant idleness, results in few power state transitions, and doesn't degrade write performance) despite the significant challenges a Hadoop cluster poses to scale-down. We do a detailed evaluation and sensitivity analysis of the policy thresholds in use in GreenHDFS with a trace-driven simulator and real-world HDFS traces from a production Hadoop cluster at Yahoo!. While some aspects of GreenHDFS are sensitive to the policy thresholds, we found that energy-conservation is minimally sensitive to the policy thresholds in GreenHDFS.

The remainder of the paper is structured as follows. In Section 2, we list some of the key observations from our analysis of the production Hadoop cluster at Yahoo!. In Section 3, we provide background on HDFS and discuss scale-down mandates. In Section 4, we give an overview of the energy management policies of GreenHDFS. In Section 5, we present an analysis of the Yahoo! cluster. In Section 6, we include experimental results demonstrating the effectiveness and robustness of our design and algorithms in a simulation environment. In Section 7, we discuss related work and conclude.

2 Key observations

We did a detailed analysis of the evolution and lifespan of the files in a production Yahoo! Hadoop cluster using one-month-long HDFS traces and Namespace metadata checkpoints. We analyzed each top-level directory separately in the production multi-tenant Yahoo! Hadoop cluster, as each top-level directory in the namespace exhibited different access patterns and lifespan distributions. The key observations from the analysis are:

- There is significant heterogeneity in the access patterns and the lifespan distributions across the various top-level directories in the production Hadoop cluster, and one-size-fits-all energy-management policies don't suffice across all directories.

- A significant amount of data, amounting to 60% of used capacity, is cold (i.e., is lying dormant in the system without getting accessed) in the production Hadoop cluster. A majority of this cold data needs to exist for regulatory and historical trend analysis purposes.

- We found that 95-98% of the files in the majority of the top-level directories had a very short hotness lifespan of less than 3 days. Only one directory had files with a longer hotness lifespan. Even in that directory, 80% of the files were hot for less than 8 days.

- We found that 90% of the files, amounting to 80.1% of the total used capacity in the most storage-heavy top-level directory, were dormant and hence cold for more than 18 days. Dormancy periods were much shorter in the rest of the directories, and only 20% of the files were dormant beyond 1 day.

- The majority of the data in the production Hadoop cluster has a news-server-like access pattern whereby most of the computations on the data happen soon after the data's creation.

3 Background

Map-reduce is a programming model designed to simplify data processing [13]. Google, Yahoo!, Facebook, Twitter etc. use Map-reduce to process massive amounts of data on large-scale commodity clusters. Hadoop is an open-source cluster-based Map-reduce implementation written in Java [1]. It is logically separated into two subsystems: a highly resilient and scalable Hadoop Distributed File System (HDFS), and a Map-reduce task execution framework. HDFS runs on clusters of commodity hardware and is an
object-based distributed file system. The namespace and the metadata (modification and access times, permissions, and quotas) are stored on a dedicated server called the NameNode and are decoupled from the actual data, which is stored on servers called the DataNodes. Each file in HDFS is replicated for resiliency and split into blocks of typically 128MB, and the individual blocks and replicas are placed on the DataNodes for fine-grained load-balancing.
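As a concrete illustration of this split between metadata and data, the fragment below writes a file through the standard HDFS Java client API. It is only a usage sketch: the NameNode address, path, block size and replication factor are illustrative values chosen by us, not settings of the cluster studied in this paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative NameNode address; normally supplied by core-site.xml.
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/events.log");

            // The client asks the NameNode only for metadata; the blocks and their
            // replicas are streamed directly to the DataNodes.
            short replication = 3;
            long blockSize = 128L * 1024 * 1024;   // 128MB, matching the typical block size
            FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
            out.writeUTF("example record");
            out.close();
        }
    }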
3.1 Importance of Write-Performance in a Production Hadoop Cluster

The Reduce phase of a Map-reduce task writes intermediate computation results back to the Hadoop cluster and relies on high write performance for the overall performance of a Map-reduce task. Furthermore, we observed that the majority of the data in a production Hadoop cluster has a news-server-like access pattern. A predominant number of computations happen on newly created data, thereby mandating good read and write performance for the newly created data.

3.2 Scale-down Mandates

Scale-down, in which server components such as CPU, disks, and DRAM are transitioned to an inactive, low-power-consuming mode, is a popular energy-conservation technique. However, scale-down cannot be applied naively. Energy is expended and a transition time penalty is incurred when the components are transitioned back to an active power mode. For example, the transition time of components such as the disks can be as high as 10 seconds. Hence, an effective scale-down technique mandates the following:

- Sufficient idleness to ensure that energy savings are higher than the energy spent in the transition.

- A small number of power state transitions, as some components (e.g., disks) have a limited number of start/stop cycles and too-frequent transitions may adversely impact the lifetime of the disks.

- No performance degradation. Steps need to be taken to amortize the performance penalty of power state transitions and to ensure that load concentration on the remaining active-state servers doesn't adversely impact overall performance of the system.

4 GreenHDFS Design

GreenHDFS is a variant of the Hadoop Distributed File System (HDFS) that logically organizes the servers in the datacenter into multiple dynamically provisioned Hot and Cold zones. Each zone has a distinct performance, cost, and power characteristic. Each zone is managed by the power and data placement policies most conducive to the class of data residing in that zone. Differentiating the zones in terms of power is crucial towards attaining our energy-conservation goal.

The Hot zone consists of files that are being accessed currently and the newly created files. This zone has strict SLA (Service Level Agreement) requirements and hence, performance is of the greatest importance. We trade off energy savings in the interest of very high performance in this zone. In this paper, GreenHDFS employs data chunking, placement and replication policies similar to the policies in baseline HDFS or GFS.

The Cold zone consists of files with low to rare accesses. Files are moved by the File Migration policy from the Hot zones to the Cold zone as their temperature decreases beyond a certain threshold. Performance and SLA requirements are not as critical for this zone, and GreenHDFS employs aggressive energy-management schemes and policies in this zone to transition servers to a low-power inactive state. Hence, GreenHDFS trades off performance for high energy savings in the Cold zone.

For optimal energy savings, it is important to increase the idle times of the servers and limit the wakeups of servers that have transitioned to the power-saving mode. Keeping this rationale in mind, and recognizing the low performance needs and the infrequency of data accesses in the Cold zone, this zone does not chunk the data. This ensures that upon a future access only the server containing the data will be woken up.

By default, the servers in the Cold zone are in a sleeping mode. A server is woken up when either new data needs to be placed on it or data already residing on the server is accessed. GreenHDFS tries to avoid powering on a server in the Cold zone and maximizes the use of the existing powered-on servers in its server allocation decisions, in the interest of maximizing the energy savings. One server is woken up and filled completely to its capacity before the next server is chosen to be transitioned to an active power state from an ordered list of servers in the Cold zone.
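This fill-one-server-before-waking-the-next rule can be sketched as follows. The sketch is an illustrative reconstruction under our own names (ColdZoneAllocator, ColdServer); it is not GreenHDFS code.

    import java.util.List;

    /** Illustrative sketch of the Cold-zone server allocation rule described above. */
    public class ColdZoneAllocator {
        public static class ColdServer {
            final String id;
            final long capacityBytes;
            long usedBytes;
            boolean active;   // false = sleeping, the default state in the Cold zone

            ColdServer(String id, long capacityBytes) {
                this.id = id;
                this.capacityBytes = capacityBytes;
            }
            boolean hasRoomFor(long bytes) { return usedBytes + bytes <= capacityBytes; }
        }

        private final List<ColdServer> orderedServers;   // fixed order; filled one at a time

        public ColdZoneAllocator(List<ColdServer> orderedServers) {
            this.orderedServers = orderedServers;
        }

        /** Prefer an already powered-on server; wake the next one only when unavoidable. */
        public ColdServer placeFile(long fileSizeBytes) {
            for (ColdServer s : orderedServers) {
                if (s.active && s.hasRoomFor(fileSizeBytes)) {
                    s.usedBytes += fileSizeBytes;
                    return s;
                }
            }
            for (ColdServer s : orderedServers) {
                if (s.hasRoomFor(fileSizeBytes)) {
                    s.active = true;               // Wake-on-LAN in the real system
                    s.usedBytes += fileSizeBytes;
                    return s;
                }
            }
            throw new IllegalStateException("Cold zone is full");
        }
    }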
The goal of GreenHDFS is to maximize the allocation of servers to the Hot zone, to minimize the performance impact of zoning, and to minimize the number of servers allocated to the Cold zone. We introduced a hybrid, storage-heavy cluster model in our paper [23], whereby servers in the Cold zone are storage-heavy and have 12 1TB disks per server. We argue that zoning in GreenHDFS will not affect the Hot zone's performance adversely and that the computational workload can be consolidated on the servers in the Hot zone without pushing CPU utilization above the provisioning guidelines. A study of 5000 Google compute servers showed that most of the time is spent within the 10%-50% CPU utilization range [4]. Hence, significant opportunities exist in workload consolidation. And the compute capacity of the Cold zone can always be harnessed under peak load
scenarios.

4.1 Energy-management Policies

Files are moved from the Hot zones to the Cold zone as their temperature changes over time, as shown in Figure 1. In this paper, we use the dormancy of a file, defined as the elapsed time since the last access to the file, as the measure of the temperature of the file. The higher the dormancy, the lower the temperature of the file and hence, the higher its coldness. Conversely, the lower the dormancy, the hotter the file. GreenHDFS uses the existing mechanism in baseline HDFS to record and update the last access time of a file upon every file read.

[Figure 1. State Diagram of a File's Zone Allocation based on Migration Policies: a file moves from the Hot zone to the Cold zone when its coldness exceeds Threshold_FMP, and back from the Cold zone to the Hot zone when its hotness exceeds Threshold_FRP.]

4.1.1 File Migration Policy

The File Migration Policy runs in the Hot zone, monitors the dormancy of the files as shown in Algorithm 1, and moves dormant, i.e., cold, files to the Cold zone. The advantages of this policy are two-fold: 1) it leads to higher space-efficiency, as space is freed up in the Hot zone for files which have higher SLA requirements by moving rarely accessed files out of the servers in these zones, and 2) it allows significant energy-conservation. Data-locality is an important consideration in the Map-reduce framework and computations are co-located with data. Thus, computations naturally happen on the data residing in the Hot zone. This results in significant idleness in all the components of the servers in the Cold zone (i.e., CPU, DRAM and disks), allowing effective scale-down of these servers.

Algorithm 1. File Migration Policy, which classifies and migrates cold data to the Cold zone from the Hot zones
  {For every file i in the Hot zone}
  for i = 1 to n do
    dormancy_i ← current time − last access time_i
    if dormancy_i ≥ Threshold_FMP then
      {Cold Zone} ← {Cold Zone} ∪ {f_i}
      {Hot Zone} ← {Hot Zone} \ {f_i}   // filesystem metadata structures are changed to the Cold zone
    end if
  end for
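For concreteness, Algorithm 1 can be rendered in Java roughly as follows. The FileMeta interface and the class name are our own illustrative scaffolding, not HDFS or GreenHDFS identifiers.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Iterator;
    import java.util.Set;

    /** Sketch of the File Migration Policy (Algorithm 1): move dormant files Hot -> Cold. */
    public class FileMigrationPolicy {
        public interface FileMeta {
            Instant lastAccessTime();   // updated by HDFS on every file read
        }

        private final Duration thresholdFmp;   // e.g. Duration.ofDays(10)

        public FileMigrationPolicy(Duration thresholdFmp) {
            this.thresholdFmp = thresholdFmp;
        }

        /** Runs once per Interval_FMP over all files currently in the Hot zone. */
        public <F extends FileMeta> void run(Set<F> hotZone, Set<F> coldZone, Instant now) {
            Iterator<F> it = hotZone.iterator();
            while (it.hasNext()) {
                F file = it.next();
                Duration dormancy = Duration.between(file.lastAccessTime(), now);
                if (dormancy.compareTo(thresholdFmp) >= 0) {
                    coldZone.add(file);   // filesystem metadata now points at the Cold zone
                    it.remove();
                }
            }
        }
    }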
4.1.2 Server Power Conserver Policy

The Server Power Conserver Policy runs in the Cold zone and determines the servers which can be transitioned into a power-saving standby/sleep mode, as shown in Algorithm 2. The current trend in internet-scale data warehouses and Hadoop clusters is to use commodity servers with 4-6 directly attached disks instead of expensive RAID controllers. In such systems, disks constitute just 10% of the entire power usage, as illustrated in a study performed at Google [21], while CPU and DRAM constitute 63% of the total power usage. Hence, power management of any one component is not sufficient. We leverage energy cost savings at the entire server granularity (CPU, disks, and DRAM) in the Cold zone.

GreenHDFS uses hardware techniques similar to [28] to transition the processors, disks and the DRAM into a low power state. In the Cold zone, GreenHDFS uses the disk Sleep mode [Footnote 1: In the Sleep mode the drive buffer is disabled, the heads are parked and the spindle is at rest.], the CPU's ACPI S3 Sleep state, as it consumes minimal power and requires only 30 us to transition from sleep back to active execution, and DRAM's self-refresh operating mode, in which transitions into and out of self-refresh can be completed in less than a microsecond.

The servers are transitioned back to an active power mode under three conditions: 1) data residing on the server is accessed, 2) additional data needs to be placed on the server, or 3) the block scanner needs to run on the server to ensure the integrity of the data residing on the Cold zone servers. GreenHDFS relies on Wake-on-LAN in the NICs to send a magic packet to transition a server back to an active power state.

[Figure 2. Triggering events leading to Power State Transitions in the Cold Zone: wake-up events (file access, bit-rot integrity checker, file placement, file deletion) transition a server from the inactive to the active state; the Server Power Conserver Policy (coldness > Threshold_SPC) transitions it back to the inactive state.]

Algorithm 2. Server Power Conserver Policy
  {For every server i in the Cold zone}
  for i = 1 to n do
    coldness_i ← current time − max_{0≤j≤n} (last access time_j)
    if coldness_i ≥ Threshold_SPC then
      S_i ← INACTIVE STATE
    end if
  end for
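Algorithm 2 admits a similarly small sketch, assuming each Cold-zone server can report the most recent access to any file it hosts; again, the interface and names below are ours, not GreenHDFS code.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    /** Sketch of the Server Power Conserver Policy (Algorithm 2) for the Cold zone. */
    public class ServerPowerConserver {
        public interface ColdServer {
            Instant latestAccessOnServer();   // most recent last-access time over resident files
            boolean isActive();
            void transitionToInactive();      // CPU S3 sleep, disk sleep, DRAM self-refresh
        }

        private final Duration thresholdSpc;  // e.g. Duration.ofDays(4)

        public ServerPowerConserver(Duration thresholdSpc) {
            this.thresholdSpc = thresholdSpc;
        }

        /** Runs once per Interval_SPC over the powered-on servers of the Cold zone. */
        public void run(List<? extends ColdServer> coldZoneServers, Instant now) {
            for (ColdServer server : coldZoneServers) {
                if (!server.isActive()) {
                    continue;
                }
                Duration coldness = Duration.between(server.latestAccessOnServer(), now);
                if (coldness.compareTo(thresholdSpc) >= 0) {
                    server.transitionToInactive();
                }
            }
        }
    }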
4.1.3 File Reversal Policy

The File Reversal Policy runs in the Cold zone and ensures that the QoS, bandwidth and response time of files that become popular again after a period of dormancy are not impacted. If the number of accesses to a file residing in the Cold zone becomes higher than the threshold Threshold_FRP, the file is moved back to the Hot zone as shown in Algorithm 3. The file is chunked and placed onto the servers in the Hot zone in congruence with the policies of the Hot zone.

Algorithm 3. File Reversal Policy, which monitors the temperature of the cold files in the Cold zone and moves files back to the Hot zones if their temperature changes
  {For every file i in the Cold zone}
  for i = 1 to n do
    if num accesses_i ≥ Threshold_FRP then
      {Hot Zone} ← {Hot Zone} ∪ {f_i}
      {Cold Zone} ← {Cold Zone} \ {f_i}   // filesystem metadata are changed to the Hot zone
    end if
  end for
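As with the other two policies, Algorithm 3 maps onto a short Java sketch; the access-count map is our own stand-in for whatever bookkeeping the real system keeps.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    /** Sketch of the File Reversal Policy (Algorithm 3): move re-heated files Cold -> Hot. */
    public class FileReversalPolicy {
        private final int thresholdFrp;   // e.g. 10 accesses

        public FileReversalPolicy(int thresholdFrp) {
            this.thresholdFrp = thresholdFrp;
        }

        /** Runs once per Interval_FRP; accessCounts holds reads observed in the Cold zone. */
        public <F> void run(Set<F> coldZone, Set<F> hotZone, Map<F, Integer> accessCounts) {
            Iterator<F> it = coldZone.iterator();
            while (it.hasNext()) {
                F file = it.next();
                if (accessCounts.getOrDefault(file, 0) >= thresholdFrp) {
                    hotZone.add(file);        // the file is re-chunked per the Hot-zone policies
                    it.remove();
                    accessCounts.remove(file);
                }
            }
        }
    }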
4.1.4 Policy Thresholds Discussion

A good data migration scheme should result in maximal energy savings, minimal data oscillations between GreenHDFS zones and minimal performance degradation. Minimizing the accesses to the Cold zone files results in maximal energy savings and minimal performance impact. For this, policy thresholds should be chosen in a way that minimizes the number of accesses to the files residing in the Cold zone while maximizing the movement of the dormant data to the Cold zone. Results from our detailed sensitivity analysis of the thresholds used in GreenHDFS are covered in Section 6.3.5.

Threshold_FMP: A low (i.e., aggressive) value of Threshold_FMP results in an ultra-greedy selection of files as potential candidates for migration to the Cold zone. While there are several advantages of an aggressive Threshold_FMP, such as higher space-savings in the Cold zone, there are disadvantages as well. If files have intermittent periods of dormancy, the files may incorrectly get labeled as cold and get moved to the Cold zone. There is a high probability that such files will get accessed in the near future. Such accesses may suffer performance degradation, as the accesses may be subject to a power transition penalty and may trigger data oscillations because of file reversals back to the Hot zone.

A higher value of Threshold_FMP results in a higher accuracy in determining the really cold files. Hence, the number of reversals, server wakeups and associated performance degradation decreases as the threshold is increased. On the other hand, a higher value of Threshold_FMP signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be an overkill for files with a very short FileLifeSpanCLR (hotness lifespan), as such files will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: A high Threshold_SPC increases the number of days the servers in the Cold zone remain in the active power state and hence, lowers the energy savings. On the other hand, it results in a reduction in the power state transitions, which results in improved performance of the accesses to the Cold zone. Thus, a trade-off needs to be made between energy-conservation and data access performance in the selection of the value for Threshold_SPC.

Threshold_FRP: A relatively high value of Threshold_FRP ensures that files are accurately classified as hot-again files before they are moved back to the Hot zone from the Cold zone. This reduces data oscillations in the system and reduces unnecessary file reversals.

5 Analysis of a production Hadoop cluster at Yahoo!

We analyzed one month of HDFS logs and namespace checkpoints in a multi-tenant cluster at Yahoo!. [Footnote 2: The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image is called a checkpoint. HDFS has the ability to log all file system access requests, which is required for auditing purposes in enterprises. The audit logging is implemented using log4j and, once enabled, logs every HDFS event in the NameNode's log [37]. We used the above-mentioned checkpoints and HDFS logs for our analysis.] The cluster had 2600 servers, hosted 34 million files in the namespace, and the data set size was 6 Petabytes. There were 425 million entries in the HDFS logs and each namespace checkpoint contained 30-40 million files. The cluster namespace was divided into six main top-level directories, whereby each directory addresses different workloads and access patterns. We only considered 4 main directories and refer to them as d, p, u, and m in our analysis instead of referring to them by their real names. The total number of unique files seen in the HDFS logs in the one-month duration was 70 million (d-1.8 million, p-30 million, u-23 million, and m-2 million).

The logs and the metadata checkpoints were huge in size and we used a large-scale research Hadoop cluster at Yahoo! extensively for our analysis. We wrote the analysis scripts in Pig. We considered several cases in our analysis as shown below:

- Files created before the analysis period and which were not read or deleted subsequently at all. We classify these files as long-living cold files.

- Files created before the analysis period and which were read during the analysis period.
- Files created before the analysis period and which were both read and deleted during the analysis period.

- Files created during the analysis period and which were not read during the analysis period or deleted.

- Files created during the analysis period and which were not read during the analysis period, but were deleted.

- Files created during the analysis period and which were read and deleted during the analysis period.

To accurately account for the file lifespan and lifetime, we handled the following cases: (a) Filename reuse: we appended a timestamp to each file create to accurately track the audit log entries following the file create entry in the audit log. (b) File renames: we used a unique id per file to accurately track its lifetime across create, rename and delete. (c) Renames and deletes at a higher level in the path hierarchy had to be translated to leaf-level renames and deletes for our analysis. (d) HDFS logs do not have file size information and hence, we did a join of the dataset found in the HDFS logs and the namespace checkpoint to get the file size information.

5.1 File Lifespan Analysis of the Yahoo! Hadoop Cluster

A file goes through several stages in its lifetime: 1) file creation, 2) a hot period during which the file is frequently accessed, 3) a dormant period during which the file is not accessed, and 4) deletion. We introduced and considered various lifespan metrics in our analysis to characterize a file's evolution, as shown in the sketch after this list. A study of the various lifespan distributions helps in deciding the energy-management policy thresholds that need to be in place in GreenHDFS.

- The FileLifeSpanCFR metric is defined as the file lifespan between the file creation and the first read access. This metric is used to find the clustering of the read accesses around the file creation.

- The FileLifeSpanCLR metric is defined as the file lifespan between creation and the last read access. This metric is used to determine the hotness profile of the files.

- The FileLifeSpanLRD metric is defined as the file lifespan between the last read access and file deletion. This metric helps in determining the coldness profile of the files, as this is the period for which files are dormant in the system.

- The FileLifeSpanFLR metric is defined as the file lifespan between the first read access and the last read access. This metric helps in determining another dimension of the hotness profile of the files.

- FileLifetime: this metric helps in determining the lifetime of the file between its creation and its deletion.
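The metrics above follow directly from four per-file timestamps (create, first read, last read, delete). The fragment below shows that computation; it is illustrative only, since our actual analysis scripts were written in Pig over the audit logs and checkpoints.

    import java.time.Duration;
    import java.time.Instant;

    /** Illustrative computation of the lifespan metrics from per-file timestamps. */
    public class LifespanMetrics {
        public final Duration spanCFR;   // creation -> first read
        public final Duration spanCLR;   // creation -> last read   (hotness lifespan)
        public final Duration spanLRD;   // last read -> deletion   (dormancy before deletion)
        public final Duration spanFLR;   // first read -> last read
        public final Duration lifetime;  // creation -> deletion    (FileLifetime)

        public LifespanMetrics(Instant created, Instant firstRead,
                               Instant lastRead, Instant deleted) {
            this.spanCFR = Duration.between(created, firstRead);
            this.spanCLR = Duration.between(created, lastRead);
            this.spanLRD = Duration.between(lastRead, deleted);
            this.spanFLR = Duration.between(firstRead, lastRead);
            this.lifetime = Duration.between(created, deleted);
        }
    }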
5.1.1 FileLifeSpanCFR

The FileLifeSpanCFR distribution throws light on the clustering of the file reads around the file creation. As shown in Figure 3, 99% of the files have a FileLifeSpanCFR of less than 2 days.

5.1.2 FileLifeSpanCLR

Figure 4 shows the distribution of FileLifeSpanCLR in the cluster. In directory d, 80% of the files are hot for less than 8 days and 90% of the files, amounting to 94.62% of storage, are hot for less than 24 days. The FileLifeSpanCLR of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days, and the FileLifeSpanCLR of 100% of the files in directory m and 98% of the files in directory a is as small as 2 days. In directory u, 98% of the files have a FileLifeSpanCLR of less than 1 day. Thus, the majority of the files in the cluster have a short hotness lifespan.

5.1.3 FileLifeSpanLRD

FileLifeSpanLRD indicates the time for which a file stays in a dormant state in the system. The longer the dormancy period, the higher the coldness of the file and hence, the higher the suitability of the file for migration to the Cold zone. Figure 5 shows the distribution of FileLifeSpanLRD in the cluster. In directory d, 90% of the files are dormant beyond 1 day and 80% of the files, amounting to 80.1% of storage, exist in a dormant state past 20 days. In directory p, only 25% of the files are dormant beyond 1 day and only 20% of the files remain dormant in the system beyond 10 days. In directory m, only 0.02% of the files are dormant for more than 1 day, and in directory u, 20% of the files are dormant beyond 10 days. The FileLifeSpanLRD needs to be considered to find the true migration suitability of a file. For example, given the extremely short dormancy period of the files in directory m, there is no point in exercising the File Migration Policy on directory m. For directories p and u, a Threshold_FMP of less than 5 days will result in unnecessary movement of files to the Cold zone, as these files are due for deletion in any case. On the other hand, given the short FileLifeSpanCLR in these directories, a high value of Threshold_FMP won't do justice to space-efficiency in the Cold zone, as discussed in Section 4.1.4.
[Figure 3. FileLifeSpanCFR distribution (x-axis: FileLifeSpanCFR in days; y-axes: % of total file count and % of total used capacity, per directory d, p, m, u). 99% of the files in directory d and 98% of the files in directory p were accessed for the first time within 2 days of creation.]

[Figure 4. FileLifeSpanCLR distribution in the four main top-level directories in the Yahoo! production cluster. FileLifeSpanCLR characterizes the lifespan for which files are hot. In directory d, 80% of the files were hot for less than 8 days and 90% of the files, amounting to 94.62% of storage, are hot for less than 24 days. The hotness lifespan of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days, as is the hotness lifespan of 100% of the files in directory m; in directory u, 98% of the files are hot for less than 1 day.]

[Figure 5. FileLifeSpanLRD distribution of the top-level directories in the Yahoo! production cluster. FileLifeSpanLRD characterizes the coldness in the cluster and is indicative of the time a file stays in a dormant state in the system. 80% of the files, amounting to 80.1% of storage, in directory d have a dormancy period of higher than 20 days. 20% of the files, amounting to 28.6% of storage, in directory p are dormant beyond 10 days. 0.02% of the files in directory m are dormant beyond 1 day.]

[Figure 6. FileLifetime distribution. 67% of the files in directory p are deleted within one day of their creation. Only 23% of the files live beyond 20 days. On the other hand, in directory d, 80% of the files have a FileLifetime of more than 30 days.]
[Figure 7. File size and file count percentage of long-living cold files (y-axes: % of total file count and % of total used storage, per directory). The cold files are defined as the files that were created prior to the start of the one-month observation period and were not accessed during the period of observation at all. In the case of directory d, 13% of the total file count in the cluster, which amounts to 33% of total used capacity, is cold. In the case of directory p, 37% of the total file count in the cluster, which amounts to 16% of total used capacity, is cold. Overall, 63.16% of the total file count and 56.23% of the total used capacity is cold in the system.]

[Figure 8. Dormant-period analysis of the file count distribution and histogram in one namespace checkpoint (x-axis: dormancy greater than a given number of days; y-axes: % of total file count, file count in millions, % of total used storage, and used storage capacity in TB). Dormancy of a file is defined as the elapsed time between the last access time recorded in the checkpoint and the day of observation. 34% of the files in directory p and 58% of the files in directory d were not accessed in the last 40 days.]
5.1.4 File Lifetime Analysis

Knowledge of the FileLifetime further assists in the migration file candidate selection and needs to be accounted for in addition to the FileLifeSpanLRD and FileLifeSpanCLR metrics covered earlier. As shown in Figure 6, directory p only has 23% of files that live beyond 20 days. On the other hand, 80% of the files in directory d live for more than 30 days, while 80% of its files have a hotness lifespan of less than 8 days. Thus, directory d is a very good candidate for invoking the File Migration Policy.

5.2 Coldness Characterization of the Files

In this section, we show the file count and the storage capacity used by the long-living cold files. The long-living cold files are defined as the files that were created prior to the start of the observation period and were not accessed during the one-month period of observation at all. As shown in Figure 13, 63.16% of the files, amounting to 56.23% of the total used capacity, are cold in the system. Such long-living cold files present a significant opportunity to conserve energy in GreenHDFS.

5.3 Dormancy Characterization of the Files

The HDFS trace analysis gives information only about the files that were accessed in the one-month duration. To get a better picture, we analyzed the namespace checkpoints for historical data on the file temperatures and periods of dormancy. The namespace checkpoints contain the last access time information of the files, and we used this information to calculate the dormancy of the files. The Dormancy metric defines the elapsed time between the last noted access time of the file and the day of observation. Figure 8 contains the frequency histograms and distributions of the dormancy. 34% of the files, amounting to 37% of storage, in directory p present in the namespace checkpoint were not accessed in the last 40 days. 58% of the files, amounting to 53% of storage, in directory d were not accessed in the last 40 days. The extent of dormancy exhibited in the system again shows the viability of the GreenHDFS solution. [Footnote 3: The number of files present in the namespace checkpoints was less than half the number of files seen in the one-month trace.]

6 Evaluation

In this section, we first present our experimental platform and methodology, followed by a description of the workloads used, and then we give our experimental results. Our goal is to answer seven high-level sets of questions:

- How much energy is GreenHDFS able to conserve compared to a baseline HDFS with no energy management?

- What is the penalty of the energy management on average response time?

- What is the sensitivity of the energy savings results to the various policy thresholds used in GreenHDFS?

- How many power state transitions does a server go through on average in the Cold zone?

- What is the number of accesses that happen to the files in the Cold zone, the number of days servers are powered on, and the number of migrations and reversals observed in the system?

- How many migrations happen daily?

- How many power state transitions occur during the simulation run?

The following evaluation sections answer these questions, beginning with a description of our methodology and the trace workloads we use as inputs to the experiments.

6.1 Evaluation methodology

We evaluated GreenHDFS using a trace-driven simulator. The simulator was driven by real-world HDFS traces generated by a production Hadoop cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and the data set size was 6 Petabytes.

We focused our analysis on directory d as this directory constituted 60% of the used storage capacity in the cluster (4PB out of the 6PB total used capacity). Focusing our analysis on directory d cut down on our simulation time significantly and reduced our analysis time, an important consideration given the massive scale of the traces. We used 60% of the total cluster nodes in our analysis to make the results realistic for a directory-d-only analysis. The total number of unique files seen in the HDFS traces for directory d in the one-month duration was 0.9 million. In our experiments, we compare GreenHDFS to the baseline case (HDFS without energy management). The baseline results give us the upper bound for energy consumption and the lower bound for average response time.

Simulation Platform: We used a trace-driven simulator for GreenHDFS to perform our experiments. We used models for the power levels, power state transition times and access times of the disk, processor and DRAM in the simulator. The GreenHDFS simulator was implemented in Java and MySQL distribution 5.1.41 and executed using Java 2 SDK, version 1.6.0-17. Table 1 lists the various power values, latencies, transition times, etc. used in the simulator. [Footnote 5: Both performance and energy statistics were calculated based on the information extracted from the datasheet of the Seagate Barracuda ES.2, which is a 1TB SATA hard drive, and a Quad-core Intel Xeon X5400 processor.] The simulator was run on 10 nodes in a development cluster at Yahoo!.
Table 1. Power and power-on penalties used in the simulator

  Component                                      Active Power (W)   Idle Power (W)   Sleep Power (W)   Power-up time
  CPU (Quad core, Intel Xeon X5400 [22])         80-150             12.0-20.0        3.4               30 us
  DRAM DIMM [29]                                 3.5-5              1.8-2.5          0.2               1 us
  NIC [35]                                       0.7                0.3              0.3               NA
  SATA HDD (Seagate Barracuda ES.2 1TB [16])     11.16              9.29             0.99              10 sec
  PSU [2]                                        50-60              25-35            0.5               300 us
  Hot server (2 CPU, 8 DRAM DIMM, 4 1TB HDD)     445.34             132.46           13.16             -
  Cold server (2 CPU, 8 DRAM DIMM, 12 1TB HDD)   534.62             206.78           21.08             -
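The per-server rows in Table 1 appear to be the sum of their component rows (two CPUs, eight DIMMs, the disks, the NIC and the PSU, taken at the upper end of their ranges). The fragment below reproduces that arithmetic and, purely as an illustration, converts an idle Hot server's draw into a monthly electricity cost at the $0.063/KWh rate assumed in Section 6.3.1.

    /** Reconstructs the per-server power figures of Table 1 from the component figures. */
    public class ServerPowerModel {
        // Active-power upper bounds from Table 1 (Watts).
        static final double CPU = 150.0, DIMM = 5.0, HDD = 11.16, NIC = 0.7, PSU = 60.0;

        static double serverActivePower(int cpus, int dimms, int disks) {
            return cpus * CPU + dimms * DIMM + disks * HDD + NIC + PSU;
        }

        public static void main(String[] args) {
            // 2 CPUs, 8 DIMMs, 4 disks  -> 445.34 W (Hot server row of Table 1)
            System.out.println(serverActivePower(2, 8, 4));
            // 2 CPUs, 8 DIMMs, 12 disks -> 534.62 W (Cold server row of Table 1)
            System.out.println(serverActivePower(2, 8, 12));

            // Illustration: an idle Hot server (132.46 W) left on for a 30-day month draws
            // 132.46 W * 720 h = 95.4 kWh, i.e. roughly $6 at $0.063/KWh.
            double idleKwh = 132.46 * 720 / 1000.0;
            System.out.printf("%.1f kWh, $%.2f%n", idleKwh, idleKwh * 0.063);
        }
    }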
6.2 Simulator Parameters

The default simulation parameters used in this paper are shown in Table 2.

Table 2. Simulator Parameters

  Parameter            Value
  NumServer            1560
  NumZones             2
  Interval_FMP         1 Day
  Threshold_FMP        5, 10, 15, 20 Days
  Interval_SPC         1 Day
  Threshold_SPC        2, 4, 6, 8 Days
  Interval_FRP         1 Day
  Threshold_FRP        1, 5, 10 Accesses
  NumServersPerZone    Hot 1170, Cold 390

6.3 Simulation results

6.3.1 Energy-Conservation

In this section, we show the energy savings made possible by GreenHDFS, compared to the baseline, in one month, simply by doing power management in one of the main tenant directories of the Hadoop cluster. The cost of electricity was assumed to be $0.063/KWh. Figure 9 (left) shows a 24% reduction in the energy consumption of a 1560-server datacenter with 80% capacity utilization. Extrapolating, $2.1 million can be saved in energy costs if the GreenHDFS technique is applied to all the Hadoop clusters at Yahoo! (upwards of 38000 servers). Energy savings from powered-down servers will be further compounded in the cooling system of a real datacenter: for every Watt of power consumed by the compute infrastructure, a modern data center expends another one-half to one Watt to power the cooling infrastructure [32]. The energy-saving results underscore the importance of supporting access time recording in Hadoop compute clusters.

6.3.2 Storage-Efficiency

In this section, we show the increased storage efficiency of the Hot zone compared to the baseline. Figure 10 shows that in the baseline case, the average capacity utilization of the 1560 servers is higher than that of GreenHDFS, which has just 1170 of the 1560 servers provisioned to the Hot zone. GreenHDFS has a much higher amount of free space available in the Hot zone, which tremendously increases the potential for better data placement techniques in the Hot zone. The more aggressive the policy threshold, the more space is available in the Hot zone for truly hot data, as more data is migrated out to the Cold zone.

6.3.3 File Migrations and Reversals

Figure 10 (right-most) shows the number and total size of the files which were migrated to the Cold zone daily with a Threshold_FMP value of 10 days. Every day, on average, 6.38TB worth of data and 28.9 thousand files are migrated to the Cold zone. Since we have assumed storage-heavy servers in the Cold zone, where each server has 12 1TB disks, and assuming 80MB/sec of disk bandwidth, 6.38TB of data can be absorbed in less than 2 hours by one server (6.38TB spread across 12 disks at 80MB/sec per disk takes roughly 1.9 hours). The migration policy can be run during off-peak hours to minimize any performance impact.

6.3.4 Impact of Power Management on Response Time

We examined the impact of server power management on the response time of a file which was moved to the Cold zone following a period of dormancy and was accessed again for some reason. The files residing in the Cold zone may suffer performance degradation in two ways: 1) if the file resides on a server that is not currently powered ON, the access incurs a server wakeup time penalty; 2) transfer time degradation, courtesy of no striping on the lower zones. The file is moved back to the Hot zone and chunked again by the File Reversal policy. Figure 11 shows the impact on the average response time. 97.8% of the total read requests are not impacted by the power management. An impact is seen only by 2.1% of the reads. With a less aggressive Threshold_FMP (15, 20 days), the impact on the response time will reduce much further.

6.3.5 Sensitivity Analysis

We tried different values of the thresholds for the File Migration policy and the Server Power Conserver policy to understand the sensitivity of these thresholds on storage-efficiency, energy-conservation and the number of power state transitions. A discussion on the impact of the various thresholds is given in Section 4.1.4.
[Figure 9. (Left) Energy savings with GreenHDFS and (Middle) days servers in the Cold zone were ON, compared to the baseline; (Right) number of migrations and reversals in GreenHDFS with different values of the Threshold_FMP threshold. Energy cost savings are minimally sensitive to the policy threshold values: GreenHDFS achieves 24% savings in energy costs in one month simply by doing power management in one of the main tenant directories of the Hadoop cluster.]

[Figure 10. Capacity growth and utilization in the Hot and Cold zones compared to the baseline, and daily migrations. GreenHDFS substantially increases the free space in the Hot zones by migrating cold data to the Cold zones. The left and middle charts consider only the new data introduced in the data directory and the old data accessed during the one-month period. The right chart shows the number and total size of the files migrated daily to the Cold zone with a Threshold_FMP value of 10 days.]

Threshold_FMP: We found that the energy costs are minimally sensitive to the Threshold_FMP threshold value. As shown in Figure 9 (left), the energy cost savings varied minimally when Threshold_FMP was changed to 5, 10, 15 and 20 days.

The performance impact and the number of file reversals are minimally sensitive to the Threshold_FMP value as well. This behavior can be explained by the observation that the majority of the data in the production Hadoop cluster at Yahoo! has a news-server-like access pattern. This implies that once data is deemed cold, there is a low probability of the data getting accessed again.

Figure 9 (right-most) shows the total number of migrations of the files which were deemed cold by the File Migration policy, and the reversals of the moved files in case they were later accessed by a client, in the one-month simulation run. There were more instances (40,170, i.e., 4% of the overall file count) of file reversals with the most aggressive Threshold_FMP of 5 days. With a less aggressive Threshold_FMP of 15 days, the number of reversals in the system went down to 6,548 (i.e., 0.7% of the file count). These experiments were done with a Threshold_FRP value of 1. The number of file reversals is substantially reduced by increasing the Threshold_FRP value. With a Threshold_FRP value of 10, zero reversals happen in the system.

The storage-efficiency is sensitive to the value of the Threshold_FMP threshold, as shown in Figure 10 (left). An increase in the Threshold_FMP value results in less efficient capacity utilization of the Hot zones. A higher value of the Threshold_FMP threshold signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be an overkill for files with a very short FileLifeSpanCLR, as they will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

Threshold_SPC: As Figure 12 (right) illustrates, increasing the Threshold_SPC value minimally increases the number of days the servers in the Cold zone remain ON and hence, minimally lowers the energy savings. On the other hand, increasing the Threshold_SPC value results in a reduction in the power state transitions, which improves