For more Details, Feel free to contact us at any time.
Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/
Mail Id: tsysglobalsolutions2014@gmail.com.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 2016 TOPICS
Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource
Provisioning in Virtualized Clouds
Clouds are becoming an important platform for scientific workflow applications. However, with
many nodes being deployed in clouds, managing reliability of resources becomes a critical issue,
especially for the real-time scientific workflow execution where deadlines should be satisfied.
Fault tolerance in clouds is therefore essential. Primary-backup (PB) based scheduling is a
popular fault-tolerance technique that has been used effectively in cluster and grid computing.
However, applying this technique to real-time workflows in a virtualized cloud is much more
complicated and has rarely been studied. In this paper, we address this
problem. We first establish a real-time workflow fault-tolerant model that extends the traditional
PB model by incorporating the cloud characteristics. Based on this model, we develop
approaches for task allocation and message transmission to ensure faults can be tolerated during
the workflow execution. Finally, we propose a dynamic fault-tolerant scheduling algorithm,
FASTER, for real-time workflows in the virtualized cloud. FASTER has three key features: 1) it
employs a backward shifting method to make full use of the idle resources and incorporates task
overlapping and VM migration for high resource utilization, 2) it applies the vertical/horizontal
scaling-up technique to quickly provision resources for a burst of workflows, and 3) it uses the
vertical scaling-down scheme to avoid unnecessary and ineffective resource changes caused by
fluctuating workflow requests. We evaluate our FASTER algorithm with synthetic workflows and
workflows collected from real scientific and business applications, and compare it with six
baseline algorithms. The experimental results demonstrate that FASTER can effectively improve
the resource utilization and schedulability even in the presence of node failures in virtualized
clouds.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
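The primary-backup idea the abstract builds on can be illustrated with a minimal sketch: every task is placed twice, on two different hosts, so a single host failure can be tolerated, and both copies must meet the deadline. The data model (`(name, exec_time, deadline)` tuples) and the earliest-available-host policy are illustrative assumptions; FASTER's backward shifting, task overlapping, and VM scaling are not modeled.

```python
def pb_schedule(tasks, num_hosts):
    """Toy primary-backup (PB) scheduler.

    tasks: list of (name, exec_time, deadline) tuples (hypothetical model).
    Returns a dict mapping task name to its primary/backup host, or None
    if some task cannot meet its deadline under this simple policy.
    """
    ready = [0.0] * num_hosts          # time at which each host becomes free
    schedule = {}
    for name, exec_time, deadline in tasks:
        # place the primary copy on the earliest-available host
        p = min(range(num_hosts), key=lambda h: ready[h])
        p_finish = ready[p] + exec_time
        # place the backup copy on the earliest-available *different* host
        b = min((h for h in range(num_hosts) if h != p),
                key=lambda h: ready[h])
        b_finish = ready[b] + exec_time
        # both copies must finish by the deadline for a fault to be tolerable
        if max(p_finish, b_finish) > deadline:
            return None
        ready[p], ready[b] = p_finish, b_finish
        schedule[name] = {"primary": p, "backup": b}
    return schedule
```

Because primary and backup never share a host, any single host failure leaves one live copy of every task.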
Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length
of a DAG with Random Weights
Coping with uncertainties when scheduling task graphs on parallel machines requires performing
non-trivial evaluations. When considering that each computation and communication duration is
a random variable, evaluating the distribution of the critical path length of such graphs involves
computing maximums and sums of possibly dependent random variables. The discrete version of
this evaluation problem is known to be #P-hard. Here, we propose two heuristics, CorLCA and
Cordyn, to compute such lengths. They approximate the input random variables and the
intermediate ones as normal random variables, and they precisely take into account correlations
with two distinct mechanisms: through lowest common ancestor queries for CorLCA and with a
dynamic programming approach for Cordyn. Moreover, we empirically compare some classical
methods from the literature against our solutions. Simulations on a large set of
cases indicate that CorLCA and Cordyn each constitute a relevant new trade-off between
speed and precision.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
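The normal-approximation step that such heuristics rely on is classical: at each "max" node of the DAG, the maximum of two correlated normal variables is matched to a new normal distribution using Clark's moment formulas. A sketch of that single building block (the standard formulas, not the authors' code):

```python
import math

def normal_max_moments(mu1, var1, mu2, var2, rho):
    """Clark's moment matching for max(X, Y) of two jointly normal
    variables X ~ N(mu1, var1), Y ~ N(mu2, var2) with correlation rho.
    Returns (mean, variance) of the approximating normal."""
    a2 = var1 + var2 - 2.0 * rho * math.sqrt(var1 * var2)
    if a2 <= 0.0:                      # degenerate case: X - Y is constant
        return (max(mu1, mu2), var1)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    # standard normal pdf and cdf at alpha
    phi = math.exp(-alpha * alpha / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    mean = mu1 * Phi + mu2 * (1.0 - Phi) + a * phi
    second = ((mu1 * mu1 + var1) * Phi
              + (mu2 * mu2 + var2) * (1.0 - Phi)
              + (mu1 + mu2) * a * phi)
    return (mean, second - mean * mean)
```

Sums of normals are handled exactly (means and variances add), so a heuristic only needs a way to estimate the correlation fed into each max, which is where CorLCA's lowest-common-ancestor queries and Cordyn's dynamic programming differ.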
A Hybrid Static-Dynamic Classification for Dual-Consistency Cache Coherence
Traditional cache coherence protocols manage all memory accesses equally and ensure the
strongest memory model, namely, sequential consistency. Recent cache coherence protocols
based on self-invalidation instead advocate a weaker model, sequential consistency for data-race-free,
which enables powerful optimizations for race-free code. However, for racy code these cache
coherence protocols provide sub-optimal performance compared to traditional protocols. This
paper proposes SPEL++, a dual-consistency cache coherence protocol that supports two
execution modes: a traditional sequential-consistent protocol and a protocol that provides weak
consistency (or sequential consistency for data-race-free). SPEL++ exploits a static-dynamic
hybrid classification of memory accesses based on (i) a compile-time identification of extended
data-race-free code regions for OpenMP applications and (ii) a runtime classification of accesses
based on the operating system’s memory page management. By executing racy code under the
sequential-consistent protocol and race-free code under the cache coherence protocol that
provides sequential consistency for data-race-free, the end result is an efficient execution of the
applications while still providing sequential consistency. Compared to a traditional protocol, we
show improvements in performance from 19% to 38% and reductions in energy consumption
from 47% to 53%, on average for different benchmark suites, on a 64-core chip multiprocessor.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
REFRESH: REDEFINE for Face Recognition using SURE Homogeneous Cores
In this paper we present the design and analysis of a scalable real-time Face Recognition (FR)
module that performs 450 recognitions per second. We introduce an algorithm for FR that
combines Weighted Modular Principal Component Analysis with Radial Basis Function
Neural Networks. This algorithm offers better recognition accuracy in various practical
conditions than algorithms used in existing architectures for real-time FR. To meet real-time
requirements, a Scalable Parallel Pipelined Architecture (SPPA) is developed by realizing the
above FR algorithm as independent parallel streams and sub-streams of computations. SPPA is
capable of supporting large databases maintained in external (DDR) memory. By casting the
computations in a stream into hardware, we present the design of a Scalable Unit for Region
Evaluation (SURE) core. Using SURE cores as compute elements in a massively parallel
CGRA such as REDEFINE, we obtain an FR system on REDEFINE called REFRESH. We report
FPGA and ASIC synthesis results for SPPA and REFRESH. Through analysis using these
results, we show that the excellent scalability and added programmability of REFRESH make it a
flexible and favorable solution for real-time FR.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints
As systems scale toward exascale, many resources will become increasingly constrained. While
some of these resources have historically been explicitly allocated, many—such as network
bandwidth, I/O bandwidth, or power—have not. As systems continue to evolve, we expect many
such resources to become explicitly managed. This change will pose critical challenges to
resource management and job scheduling. In this paper, we explore the potential of relaxing
network allocation constraints for Blue Gene systems. Our objective is to improve the batch
scheduling performance, where the partition-based interconnect architecture provides a unique
opportunity to explicitly allocate network resources to jobs. This paper makes three major
contributions. The first is substantial benchmarking of parallel applications, focusing on
assessing application sensitivity to communication bandwidth at large scale. The second is three
new scheduling schemes using relaxed network allocation and targeted at balancing individual
job performance with overall system performance. The third is a comparative study of our
scheduling schemes versus the existing scheduler on Mira, a 48-rack Blue Gene/Q system at
Argonne National Laboratory. Specifically, we use job traces collected from this production
system.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Transparent and optimized distributed processing on GPUs
DistributedCL is middleware that enables transparent parallel processing on distributed
GPUs. With its support, an application designed for the OpenCL API can run in a distributed
manner and transparently use remote GPUs without any change to, or rebuild of, its code.
The proposed architecture for the DistributedCL middleware is
modular, with well-defined layers. A prototype was built according to this architecture,
incorporating several optimizations, including batched data transfers, asynchronous network
communication, and asynchronous requests to the OpenCL API. The prototype was evaluated
using available benchmarks, and a dedicated benchmark, CLBench, was developed to facilitate
evaluation as a function of the amount of processed data. The prototype showed good
performance, higher than that of similar proposals that also provide transparent use of
remote GPUs. The size of the data to be transmitted over the network was the major limiting factor.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
Distributed Control for Charging Multiple Electric Vehicles with Overload Limitation
Severe pollution caused by traditional fossil fuels has drawn great attention to plug-in
electric vehicles (PEVs) and renewable energy. However, large-scale penetration of PEVs,
combined with other kinds of appliances, tends to place an excessive or even disastrous burden on
the power grid, especially during peak hours. This paper focuses on scheduling the charging of
PEVs among different charging stations, where each station can be supplied by both
renewable energy generators and a distribution network. The distribution network also powers
some uncontrollable loads. In order to minimize the on-grid energy cost with local renewable
energy and non-ideal storage while avoiding the overload risk of the distribution network, an
online algorithm consisting of scheduling the charging of PEVs and energy management of
charging stations is developed based on Lyapunov optimization and Lagrange dual
decomposition techniques. The algorithm can satisfy the random charging requests from PEVs
with provable performance. Simulation results with real data demonstrate that the proposed
algorithm can decrease the time-average cost of stations while avoiding overload in the
distribution network in the presence of random uncontrollable loads.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
CoRE: Cooperative End-to-End Traffic Redundancy Elimination for Reducing Cloud
Bandwidth Cost
The pay-as-you-go service model impels cloud customers to reduce the usage cost of bandwidth.
Traffic Redundancy Elimination (TRE) has been shown to be an effective solution for reducing
bandwidth costs, and thus has recently captured significant attention in the cloud environment.
By studying TRE techniques in a trace-driven manner, we found that both short-term (time
span of seconds) and long-term (time span of hours or days) data redundancy can appear
concurrently in the traffic, and that using either sender-based or receiver-based TRE alone
cannot capture both types of redundancy simultaneously. Moreover, the efficiency of existing
receiver-based TRE solutions is susceptible to changes in the data relative to the historical data in the
cache. In this paper, we propose a Cooperative end-to-end TRE solution (CoRE) that can detect
and remove both short-term and long-term redundancy through a two-layer TRE design with
cooperative operations between layers. An adaptive prediction algorithm is further proposed to
improve TRE efficiency through dynamically adjusting the prediction window size based on the
hit ratio of historical predictions. In addition, we enhance CoRE to adapt to the different
traffic redundancy characteristics of cloud applications, reducing its operating cost. Extensive
evaluation with several real traces shows that CoRE is capable of effectively identifying both
short-term and long-term redundancy with low additional cost while keeping TRE efficient
under data changes.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
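The basic sender-side TRE mechanism can be sketched with fixed-size chunking and a chunk cache shared (logically) by sender and receiver: a chunk already seen is replaced by its short digest. Content-defined chunking, CoRE's two-layer design, the prediction window, and receiver cooperation are deliberately omitted, and the chunk size is an arbitrary choice.

```python
import hashlib

CHUNK = 64  # bytes per chunk; real TRE systems use content-defined chunking

def tre_encode(data, cache):
    """Replace already-seen chunks by a 20-byte SHA-1 reference."""
    out = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha1(chunk).digest()
        if h in cache:
            out.append(("ref", h))       # send the digest instead of the data
        else:
            cache[h] = chunk
            out.append(("raw", chunk))
    return out

def tre_decode(stream, cache):
    """Rebuild the byte stream, learning new chunks as they arrive."""
    data = b""
    for kind, payload in stream:
        chunk = payload if kind == "raw" else cache[payload]
        if kind == "raw":
            cache[hashlib.sha1(chunk).digest()] = chunk
        data += chunk
    return data
```

On a second transmission of the same data, every chunk is sent as a reference, which is the bandwidth saving TRE targets.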
Clustering-based Task Scheduling in a Large Number of Heterogeneous Processors
Parallelization paradigms for the effective execution of a Directed Acyclic Graph (DAG)
application have been widely studied in the area of task scheduling. Schedule length can vary
depending on task assignment policies, scheduling policies, and the heterogeneity of processors
and communication bandwidths in a heterogeneous system. One disadvantage of existing
task scheduling algorithms is that the schedule length cannot be reduced for data-intensive
applications. In this paper, we propose a clustering-based task scheduling algorithm called
Clustering for Minimizing the Worst Schedule Length (CMWSL) to minimize the schedule
length on a large number of heterogeneous processors. First, the proposed method derives the
lower bound of the total execution time for each processor by taking both the system and
application characteristics into account. As a result, the number of processors used for actual
execution is regulated to minimize the Worst Schedule Length (WSL). Then, the actual task
assignment and task clustering are performed to minimize the schedule length until the total
execution time in a task cluster exceeds the lower bound. Experimental results indicate that
CMWSL outperforms both existing list-based and clustering-based task scheduling algorithms in
terms of the schedule length and efficiency, especially in data-intensive applications.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint
Model
The traditional single-level checkpointing method suffers from significant overhead on large-
scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in
recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set
(each with different checkpoint overheads and recovery abilities), in order to further improve the
fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint
intervals for each level, however, is an extremely difficult problem. In this paper, we construct
an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low
checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals
with hardware crashes such as node failures. Compared with previous optimization work, our
new optimal checkpoint solution offers two improvements: (1) it is an online solution without
requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are
optimal and determines the best pattern. We evaluate the proposed solution and compare it with
the most up-to-date related approaches on an extreme-scale simulation testbed constructed based
on a real HPC application execution. Simulation results show that our proposed solution
outperforms other optimized solutions and can improve the performance significantly in some
cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3%
over that of other state-of-the-art approaches. Finally, a brute-force comparison with all possible
patterns shows that our solution is always within 1% of the best pattern in the experiments.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
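For context, the classical single-level baseline that multilevel models such as this two-level solution generalize is Young's first-order interval W = sqrt(2·C·M), for checkpoint cost C and mean time between failures M. A sketch of the textbook formula and its waste estimate (background material, not the paper's optimization):

```python
import math

def young_daly_interval(ckpt_cost, mtbf):
    """First-order optimal checkpoint interval W = sqrt(2 * C * MTBF)
    for single-level checkpointing (Young's formula)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)

def expected_waste(interval, ckpt_cost, mtbf):
    """First-order fraction of time lost: checkpointing overhead C/W plus
    expected re-execution after a failure, roughly W / (2 * MTBF)."""
    return ckpt_cost / interval + interval / (2.0 * mtbf)
```

With C = 50 s and MTBF = 10,000 s the formula gives W = 1,000 s; neighboring intervals incur strictly more waste, which is the local-optimality property the formula captures.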
Shadow/Puppet Synthesis: A Stepwise Method for the Design of Self-Stabilization
This paper presents a novel two-step method for the automated design of self-stabilization. The
first step enables the specification of legitimate states and an intuitive (but imprecise)
specification of the desired functional behaviors in the set of legitimate states (hence the term
"shadow"). After
creating the shadow specifications, we systematically introduce the main variables and the
topology of the desired self-stabilizing system. Subsequently, we devise a parallel and complete
backtracking search towards finding a self-stabilizing solution that implements a precise version
of the shadow behaviors, and guarantees recovery to legitimate states from any state. To the best
of our knowledge, the shadow/puppet synthesis is the first sound and complete method that
exploits parallelism and randomization along with the expansion of the state space towards
generating self-stabilizing systems that cannot be synthesized with existing methods. We have
validated the proposed method by creating both a sequential and a parallel implementation in the
context of a software tool, called Protocon. Moreover, we have used Protocon to automatically
design three new self-stabilizing protocols that we conjecture to require the minimal number of
states per process to achieve stabilization (when processes are deterministic): 2-state maximal
matching on bidirectional rings, 5-state token passing on unidirectional rings, and 3-state token
passing on bidirectional chains.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Optimizing End-to-End Big Data Transfers over Terabits Network Infrastructure
While future terabit networks hold the promise of significantly improving big-data motion
among geographically distributed data centers, significant challenges must be overcome even on
today’s 100 gigabit networks to realize end-to-end performance. Multiple bottlenecks exist along
the end-to-end path from source to sink; for instance, the data storage infrastructure at the
source and sink, and its interplay with the wide-area network, is increasingly the bottleneck to
achieving high performance. In this paper, we identify the issues that lead to congestion on the
path of an end-to-end data transfer in the terabit network environment, and we present a new
bulk data movement framework for terabit networks, called LADS. LADS exploits the
underlying storage layout at each endpoint to maximize throughput without negatively impacting
the performance of shared storage resources for other users. LADS also uses the Common
Communication Interface (CCI) in lieu of the sockets interface to benefit from hardware-level
zero-copy and operating-system-bypass capabilities when available. It can further improve data
transfer performance under congestion on the end systems by buffering data at the source in
flash storage. Our evaluations show that LADS can avoid congested storage elements
within the shared storage resource, improving input/output bandwidth and data transfer rates
across high-speed networks. We also investigate the performance degradation of
LADS due to I/O contention on the parallel file system (PFS), when multiple LADS tools share
the PFS. We design and evaluate a meta-scheduler that coordinates multiple I/O streams
sharing the PFS, minimizing I/O contention. We observe that LADS with meta-scheduling
further improves performance by up to 14% relative to LADS without it.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
Analysis of parallel computing strategies to accelerate ultrasound imaging processes
This work analyses the use of parallel processing techniques in synthetic aperture ultrasonic
imaging applications. In particular, the Total Focusing Method, an O(N²P) problem, is
studied. This work presents different parallelization strategies for multicore CPU and GPU
architectures. The parallelization processes on both platforms are discussed and optimized in
order to achieve real-time performance.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
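The O(N²P) structure comes from summing, for each of P image pixels, one delayed sample per transmit/receive pair of an N-element array. A toy per-pixel kernel under assumed geometry, with a uniform sound speed, a linear array on the x-axis, and nearest-sample lookup instead of interpolation (illustrative, not the paper's implementation):

```python
import math

def tfm_pixel(fmc, elements, px, py, c, fs):
    """Total Focusing Method intensity of one pixel at (px, py).

    fmc: full matrix capture, fmc[tx][rx] is the A-scan (list of samples)
         for transmitter tx and receiver rx.
    elements: x-coordinates of the array elements (assumed layout).
    c: sound speed; fs: sampling frequency.
    The double loop over element pairs is the O(N^2) inner kernel that,
    repeated over P pixels, yields the O(N^2 P) cost cited above.
    """
    total = 0.0
    n = len(elements)
    for tx in range(n):
        d_tx = math.hypot(px - elements[tx], py)       # transmit path length
        for rx in range(n):
            d_rx = math.hypot(px - elements[rx], py)   # receive path length
            sample = int(round((d_tx + d_rx) / c * fs))
            if 0 <= sample < len(fmc[tx][rx]):
                total += fmc[tx][rx][sample]
    return total
```

The pixel loop around this kernel is embarrassingly parallel, which is what makes the method a natural fit for multicore CPUs and GPUs.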
Improving Performance of Parallel I/O Systems through Selective and Layout-Aware SSD
Cache
Parallel file systems (PFS) are widely used to ease the I/O bottleneck of modern high-performance
computing systems. However, PFSs do not work well for small requests, especially
small random requests. Solid State Drives (SSD) offer excellent performance on small
random data accesses but incur a high monetary cost. In this study, we propose SLA-Cache,
a Selective and Layout-Aware Cache system that employs a small set of SSD-based file servers
as a cache of conventional HDD-based file servers. SLA-Cache uses a novel scheme to identify
performance-critical data, and conducts a selective cache admission (SCA) policy to fully utilize
SSD-based file servers. Moreover, since data layout of the cache system can also largely
influence its access performance, SLA-Cache applies a layout-aware cache placement scheme
(LCP) to store data on SSD-based file servers. By storing data with an optimal layout requiring
the lowest access cost among three typical layout candidates, LCP can further improve system
performance. We have implemented SLA-Cache under the MPICH2 I/O library. Experimental
results show that SLA-Cache can significantly improve I/O throughput, and is a promising
approach for parallel applications.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing
Modern-day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at
delivering highly reliable cloud computing services. Although offering equal reliability to all
users benefits everyone at once, users may find such an approach either inadequate or
too expensive for their individual requirements, which may vary dramatically. In this paper, we
propose a novel method for providing elastic reliability optimization in cloud computing. Our
scheme makes use of peer-to-peer checkpointing and allows user reliability levels to be jointly
optimized based on an assessment of their individual requirements and total available resources
in the data center. We show that the joint optimization can be efficiently solved by a distributed
algorithm using dual decomposition. The solution improves resource utilization and presents an
additional source of revenue to data center operators. Our validation results suggest a significant
improvement of reliability over existing schemes.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
A Taxonomy of Job Scheduling on Distributed Computing Systems
Hundreds of papers on job scheduling for distributed systems are published every year and it
becomes increasingly difficult to classify them. Our analysis revealed that half of these papers
are barely cited. This paper presents a general taxonomy for scheduling problems and solutions
in distributed systems. This taxonomy was used to classify and make publicly available the
classification of 109 scheduling problems and their solutions. These 109 problems were further
clustered into ten groups based on the features of the taxonomy. The proposed taxonomy will
help researchers build on prior art, increase the visibility of new research, and minimize
redundant effort.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because it leaves no indication that errors occurred during execution. We propose an
adaptive impact-driven method that can detect SDCs dynamically. The key contributions are
threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss the
runtime data features, as well as the impact of the SDCs on their execution results. (2) We
propose an impact-driven detection model that does not blindly improve the prediction accuracy,
but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our
solution can adapt to dynamic prediction errors based on local runtime data and can
automatically tune detection ranges to guarantee low false-alarm rates. Experiments show that
our detector can detect 80-99.99% of SDCs with a false alarm rate of less than 1% of iterations in
most cases. The memory cost and detection overhead are reduced to 15% and 6.3%, respectively,
for a large majority of applications.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
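A range-based detector of the kind described can be sketched as: predict each value by extrapolating from recent history, keep an adaptive bound on normal prediction error, and flag values far outside that bound. The last-two-points extrapolation, the decay factor, and the threshold multiplier here are all illustrative assumptions, not the paper's impact model.

```python
def sdc_detector(series, k=3.0):
    """Return indices of suspected silent data corruptions in a time series.

    Each value is predicted by linear extrapolation from the previous two;
    a value whose prediction error exceeds k times the adaptive error bound
    is flagged. The bound tracks recent normal errors and decays slowly.
    """
    flags = []
    err_bound = 0.0
    for i in range(2, len(series)):
        predicted = 2.0 * series[i - 1] - series[i - 2]  # linear extrapolation
        err = abs(series[i] - predicted)
        if err_bound > 0.0 and err > k * err_bound:
            flags.append(i)                    # far outside the normal range
        else:
            err_bound = max(err_bound * 0.9, err)   # adapt to normal errors
    return flags
```

Note the trade-off the abstract describes: a smaller k catches more corruptions but raises the false-alarm rate on noisy data.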
Time Series-Oriented Load Prediction Model and Migration Policies for Distributed
Simulation Systems
HLA-based simulation systems are prone to load imbalances due to a lack of management of
shared resources in distributed environments. Such imbalances cause these simulations to lose
performance in terms of execution time. As a result, many dynamic load balancing systems
have been introduced to manage distributed load. These systems use specific methods, depending
on load or application characteristics, to perform the required balancing. Load prediction is a
technique that has been used extensively to enhance load redistribution heuristics towards
preventing load imbalances. In this paper, several efficient Time Series model variants are
presented and used to enhance prediction precision for large-scale distributed simulation-based
systems. These variants are proposed to extend and correct the issues originating from the
implementation of Holt’s model for time series in the predictive module of a dynamic load
balancing system for HLA-based distributed simulations. A set of migration decision-making
techniques is also proposed to enable a prediction-based load balancing system to be independent
of any prediction model, promoting a more modular construction.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
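Holt's linear model mentioned above smooths a level term and a trend term and forecasts their sum. A minimal one-step-ahead version (the smoothing constants are illustrative, not the values used in the paper's predictive module):

```python
def holt_forecast(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecast with Holt's linear (double) exponential
    smoothing. alpha smooths the level, beta smooths the trend."""
    level = series[0]
    trend = series[1] - series[0]          # initial trend estimate
    for x in series[1:]:
        forecast = level + trend           # prediction for this observation
        new_level = alpha * x + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return level + trend                   # prediction for the next observation
```

On a perfectly linear load trace the forecast is exact, which is the baseline behavior the paper's variants extend to handle the irregularities of distributed simulation load.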
Enabling Parallel Simulation of Large-Scale HPC Network Systems
With the increasing complexity of today’s high-performance computing (HPC) architectures,
simulation has become an indispensable tool for exploring the design space of HPC systems—in
particular, networks. In order to make effective design decisions, simulations of these systems
must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in
a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-
the-art HPC network simulation frameworks, however, are constrained in one or more of these
areas. In this work, we present a simulation framework for modeling two important classes of
networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use
the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to
simulate these network topologies at a flit-level detail using the Rensselaer Optimistic
Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework
meets all the requirements of a practical network simulation and can assist network designers in
design space exploration. First, it uses validated and detailed flit-level network models to provide
an accurate and high-fidelity network simulation. Second, instead of relying on serial time-
stepped or traditional conservative discrete-event simulations that limit simulation scalability and
efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and
scalable HPC network simulations on today’s high-performance cluster systems. Third, our
models give network designers a choice in simulating a broad range of network workloads,
including HPC application workloads using detailed network traces, an ability rarely offered
alongside high-fidelity network simulations.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
An Evolutionary Optimal Fuzzy System with Information Fusion of Heterogeneous
Distributed Computing and Polar-Space Dynamic Model for Online Motion Control of
Swedish Redundant Robots
This paper presents an evolutionary optimal fuzzy system with information fusion of
heterogeneous distributed computing and polar-space dynamic model for online motion control
of Swedish redundant robots. The intelligent fuzzy system is combined with the parallel
metaheuristic BFO (Bacterial Foraging Optimization)-AIS (Artificial Immune System), called
FS-PBFOAIS, and realized on a field-programmable gate array (FPGA) for optimal polar-space
online motion control of four-wheeled redundant mobile robots. This hybrid paradigm gains the
benefits of the Taguchi quality method, BFO, AIS, distributed processing, and FPGA techniques.
Experiments demonstrate the effective optimization and high accuracy of the proposed
FPGA-based FS-PBFOAIS tracking controller. Finally, comparative studies demonstrate the
superiority of the FPGA-based FS-PBFOAIS polar-space redundant controller over other
conventional control methods.
IEEE Transactions on Industrial Electronics (May 2016)
Cache Line Aware Algorithm Design for Cache-Coherent Architectures
The increase in the number of cores per processor and the complexity of memory hierarchies
make cache coherence key for programmability of current shared memory systems. However,
ignoring its detailed architectural characteristics can harm performance significantly. In order to
assist performance-centric programming, we propose a methodology to allow semi-automatic
performance tuning with the systematic translation from an algorithm to an analytic performance
model for cache line transfers. For this, we design a simple interface for cache line aware
optimization, a translation methodology, and a full performance model that exposes the block-
based design of caches to middleware designers. We investigate two different architectures to
show the applicability of our techniques and methods: the many-core accelerator Intel Xeon Phi
and a multi-core processor with a NUMA configuration (Intel Sandy Bridge). We use
mathematical optimization techniques to tune synchronization algorithms to the
microarchitectures, identifying three techniques to design and optimize data transfers in our
model: single-use, single-step broadcast, and private cache lines.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-Wide
Alignment in GPU Clusters
This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal
alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman
(SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix
is computed by multiple GPUs, which asynchronously communicate border elements to the right
neighbor in order to find the optimal score. After that, the traceback phase of SW is executed.
The efficient parallelization of the traceback phase is very challenging because of the high
amount of data dependency, which particularly impacts the performance and limits the
application scalability. In order to obtain a multi-GPU highly parallel traceback phase, we
propose and evaluate a new parallel traceback algorithm called Incremental Speculative
Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values
calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate
SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all Human and
Chimpanzee homologous chromosomes, whose sizes range from 26 Million Base Pairs
(MBP) up to 249 MBP. As far as we know, this is the first time such a comparison has been made with
the SW exact method. We also show that the IST algorithm is able to reduce the traceback time
from 2.15x up to 21.03x when compared with the baseline traceback algorithm. The human ×
chimpanzee chromosome 5 comparison (180 MBP x 183 MBP) attained 10,370.00 GCUPS
(Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2%.
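At the heart of CUDAlign is the exact Smith-Waterman recurrence. A minimal, sequential, score-only sketch of the first phase (the function name and the match/mismatch/gap values are illustrative choices, not CUDAlign's actual kernels or parameters) might look like:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Compute the optimal local alignment score of strings a and b
    with the exact Smith-Waterman dynamic-programming recurrence."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores are clamped at zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The paper's contribution lies in distributing this matrix across GPUs and speculating on the traceback, which this sketch deliberately omits.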
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Xscale: Online X-code RAID-6 Scaling Using Lightweight Data Reorganization
Disk additions to a RAID-6 storage system can simultaneously increase the I/O parallelism and
expand the storage capacity. To regain a balanced load among both old and new disks, RAID-6
scaling requires moving certain data blocks onto newly added disks. Existing approaches to
RAID-6 scaling are restricted by preserving a round-robin data distribution, and require
migrating all the data, resulting in an expensive cost for RAID-6 scaling. In this paper, we
propose Xscale, a new approach to accelerating X-code RAID-6 scaling by using lightweight
data reorganization. Xscale minimizes the number of data blocks that need to be moved, while
maintaining a uniform data distribution across all disks. Furthermore, Xscale eliminates metadata
updates while guaranteeing data consistency and data reliability. Compared with the round-robin
approach, Xscale reduces the number of blocks to be moved by 63.6–89.5%, decreases the
reorganization time by 35.62–37.26%, and reduces the I/O latency by 23.29–37.74% while the
scaling programs are running in the background. In addition, there is no penalty in the
performance of the data layout after scaling using Xscale, compared with the layouts maintained
by other existing scaling approaches.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
The Importance of Worker Reputation Information in Microtask-Based Crowd Work
Systems
This paper presents the first systematic investigation of the potential performance gains for
crowd work systems that derive from information available to the requester about individual
worker reputation. In particular, we first formalize the optimal task assignment problem when
workers’ reputation estimates are available, as the maximization of a monotone (submodular)
function subject to matroid constraints. Then, since the optimal problem is NP-hard, we propose a
simple but efficient greedy heuristic task allocation algorithm. We also propose a simple
"maximum a-posteriori" decision rule and a decision algorithm based on message passing.
Finally, we test and compare different solutions, showing that system performance can greatly
benefit from information about workers’ reputation. Our main findings are that: i) even largely
inaccurate estimates of workers’ reputation can be effectively exploited in the task assignment to
greatly improve system performance; ii) the performance of the maximum a-posteriori decision
rule quickly degrades as worker reputation estimates become inaccurate; iii) when workers’
reputation estimates are significantly inaccurate, the best performance can be obtained by
combining our proposed task assignment algorithm with the message-passing decision algorithm.
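As a rough illustration of a greedy heuristic under a partition-matroid constraint, the sketch below assumes each worker takes at most `capacity` tasks and uses an illustrative monotone submodular objective, the expected number of tasks answered correctly by at least one assigned worker; the objective and all names are assumptions, not the paper's exact formulation:

```python
def greedy_assign(reputations, num_tasks, capacity):
    """Greedy task assignment: repeatedly pick the (worker, task) pair
    with the largest marginal gain in
    f = sum_t [1 - prod_{w assigned to t} (1 - reputations[w])]."""
    assigned = {t: [] for t in range(num_tasks)}
    load = {w: 0 for w in reputations}
    miss = {t: 1.0 for t in range(num_tasks)}  # P(no assigned worker is correct)
    for _ in range(capacity * len(reputations)):
        best_gain, best = 0.0, None
        for w, r in reputations.items():
            if load[w] >= capacity:          # partition-matroid constraint
                continue
            for t in range(num_tasks):
                if w in assigned[t]:
                    continue
                gain = miss[t] * r           # marginal increase in f
                if gain > best_gain:
                    best_gain, best = gain, (w, t)
        if best is None:
            break
        w, t = best
        assigned[t].append(w)
        load[w] += 1
        miss[t] *= 1 - reputations[w]
    return assigned
```

Because f is monotone submodular, the classic result for greedy maximization under a matroid constraint gives a 1/2-approximation guarantee for this kind of heuristic.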
IEEE Transactions on Parallel and Distributed Systems (May 2016)
On Data Integrity Attacks against Real-time Pricing in Energy-based Cyber-Physical
Systems
In this paper, we investigate a novel real-time pricing scheme, which considers both renewable
energy resources and traditional power resources and could effectively guide the participants to
achieve individual welfare maximization in the system. To be specific, we develop a Lagrangian-
based approach to transform the global optimization conducted by the power company into
distributed optimization problems to obtain explicit energy consumption, supply, and price
decisions for individual participants. Also, we show that these distributed problems derived from
the global optimization by the power company are consistent with individual welfare
maximization problems for end-users and traditional power plants. We also investigate and
formalize the vulnerabilities of the real-time pricing scheme by considering two types of data
integrity attacks: Ex-ante attacks and Ex-post attacks, which are launched by the adversary
before or after the decision-making process. We systematically analyze the welfare impacts of
these attacks on the real-time pricing scheme. Through a combination of theoretical analysis and
performance evaluation, our data shows that the proposed real-time pricing scheme could
effectively guide the participants to achieve welfare maximization, while cyber-attacks could
significantly disrupt the results of real-time pricing decisions, imposing welfare reduction on the
participants.
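A toy dual-decomposition sketch of such a Lagrangian pricing loop: the power company adjusts the price by the supply-demand imbalance, and each user independently maximizes an illustrative concave utility. The utility form, step size, and all names are assumptions for illustration, not the paper's model:

```python
def real_time_pricing(utilities, supply, steps=500, lr=0.01):
    """Price iteration: each user maximizes u*log(1+x) - price*x,
    whose optimum is x = max(0, u/price - 1); the company then raises
    the price when demand exceeds supply and lowers it otherwise."""
    price = 1.0
    for _ in range(steps):
        demands = [max(0.0, u / price - 1.0) for u in utilities]
        price += lr * (sum(demands) - supply)  # dual (price) update
        price = max(price, 1e-6)               # keep the price positive
    return price, demands
```

At the fixed point, total demand equals supply, which mirrors how the distributed per-participant problems stay consistent with the company's global optimization.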
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A Fast and Accurate Hardware String Matching Module with Bloom Filters
Many fields of computing such as Deep Packet Inspection (DPI) employ string matching
modules (SMM) that search for a given set of positive strings in their input. An SMM is expected
to produce correct outcomes while scanning the input data at high rates. Furthermore, the string
sets that are searched for are usually large and their sizes increase steadily. Bloom Filters (BFs)
are fast hashing data structures, but their false-positive results require further
processing. That is, their speed can be exploited for Standard Bloom Filter SMMs (SBFs) as long
as the positive probability is low. Multiple BFs in parallel can further increase the throughput. In
this paper, we propose the Double Bloom Filter SMM (DBF) which achieves a higher
throughput than the SBF and maintains a high throughput even for large positive probabilities.
The second Bloom Filter of DBF stores a small enough subset of the positive strings such that its
false positive probability is approximately zero. We develop an analytical model of the DBF and
show that the throughput advantage of DBF over SBF becomes more prominent if the positive
probability and the fraction of matches in the second Bloom Filter increase. Accordingly, we
propose a heuristic algorithm that stores the strings that are more frequently matched in the
second Bloom Filter according to localities identified in the input. Our numerical results are
obtained using realistic values from an FPGA implementation and are validated by SystemC
simulations.
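The two-filter idea can be sketched in software (the paper targets FPGAs; class names, sizes, and the exact-match fallback below are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

class DoubleBloomSMM:
    """Sketch of the double-filter SMM: the first filter screens every
    input; the second holds a small hot subset of positive strings,
    sized so its false-positive probability is negligible."""
    def __init__(self, strings, hot):
        self.first, self.second = BloomFilter(), BloomFilter(m=8192)
        for s in strings:
            self.first.add(s)
        for s in hot:
            self.second.add(s)
        self.exact = set(strings)  # slow exact store for verification

    def match(self, s):
        if s not in self.first:
            return False           # fast negative path, no verification
        if s in self.second:
            return True            # hot strings resolved without exact check
        return s in self.exact     # verify the remaining (rare) positives
```

The throughput gain comes from the middle branch: frequently matched strings skip the expensive exact verification that ordinary Bloom-filter positives require.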
IEEE Transactions on Parallel and Distributed Systems (June 2016)
Seer Grid: Privacy and Utility Implications of Two-Level Load Prediction in Smart Grids
We propose "Seer Grid", a novel two-level energy consumption prediction framework for smart
grids, aimed to decrease the trade-off between privacy requirements (of the customer) and data
utility requirements (of the energy company (EC)). The first-level prediction at the household
level is performed by each smart meter (SM), and the predicted energy consumption pattern
(instead of the actual energy usage data) is reported to a cluster head (CH). Then, a second-level
prediction at the neighborhood level is done by the CH which predicts the energy spikes in the
neighborhood or cluster and shares it with the EC. Our two-level prediction mechanism is
designed such that it preserves the correlation between the predicted and actual energy
consumption patterns at the cluster level and removes this correlation in the predicted data
communicated by each SM to the CH. This maintains the usefulness of the cluster-level energy
consumption data communicated to the EC, while preserving the privacy of the household-level
energy consumption data against the CH (and thus the EC). Our evaluation results show that Seer
Grid is successful in hiding private consumption patterns at the household-level while still being
able to accurately predict energy consumption at the neighborhood-level.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
Many-Core Real-Time Task Scheduling with Scratchpad Memory
This work is motivated by the demand for scheduling tasks upon the increasingly popular island-
based many-core architectures. On such an architecture, homogeneous cores are grouped into
islands, each of which is equipped with a scratchpad memory module (referred to as local
memory). We first show the NP-hardness and the inapproximability of the scheduling problem.
Despite the inapproximability, positive results can still be found when different cases of the
problem are investigated. A (3 − 1 F )- approximation algorithm is proposed for the minimization
of the maximum system utilization, where F is the number of cores in the platform. When the
technique of resource augmentation is considered, this paper further develops a (γ + 1)-memory
2γ−1 γ−1 - approximation algorithm, where γ represents the trade-off between CPU utilization
and local memory space. On the other hand, a special case is also considered when the ratio of
the worst-case execution time of a task without and with using the local memory is bounded by a
constant. The capabilities of the proposed algorithms are then evaluated with benchmarks from
MRTC, UTDSP, NetBench and DSPstone, where the maximum system utilization can be
significantly reduced even when the local memory size is only 5% of the total footprint of all of
the tasks.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-
Grafting
It is difficult to obtain high performance when computing matchings on parallel processors
because matching algorithms explicitly or implicitly search for paths in the graph, and when
these paths become long, there is little concurrency. In spite of this limitation, we present a new
algorithm and its shared-memory parallelization that achieves good performance and scalability
in computing maximum cardinality matchings in bipartite graphs. Our algorithm searches for
augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices,
hence creating more parallelism than single source algorithms. Algorithms that employ multiple-
source searches cannot discard a search tree once no augmenting path is discovered from the
tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting
method that eliminates most of the redundant edge traversals resulting from this property of
multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a
subroutine to discover augmenting paths faster. Our algorithm compares favorably with the
current best algorithms in terms of the number of edges traversed, the average augmenting path
length, and the number of iterations. We provide a proof of correctness for our algorithm. Our
NUMA-aware implementation is scalable to 80 threads on an Intel multiprocessor and to 240
threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order
of magnitude faster than the fastest algorithms available. The performance improvement is more
significant on graphs with small matching number.
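The sequential baseline that the paper parallelizes is repeated augmenting-path search. A minimal single-source sketch (Kuhn's algorithm; the paper's multi-source BFS with tree grafting is considerably more involved) is:

```python
def max_bipartite_matching(adj, n_left, n_right):
    """Maximum-cardinality bipartite matching via repeated augmenting-path
    searches. adj[u] lists the right-side neighbours of left vertex u."""
    match_r = [-1] * n_right  # match_r[v] = left vertex matched to v

    def try_augment(u, seen):
        for v in adj[u]:
            if seen[v]:
                continue
            seen[v] = True
            # v is free, or its current partner can be re-routed elsewhere
            if match_r[v] == -1 or try_augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    matching = 0
    for u in range(n_left):
        if try_augment(u, [False] * n_right):
            matching += 1
    return matching
```

Each successful search flips one augmenting path; the concurrency problem the abstract describes arises because these paths can grow long, which is what multi-source searches and tree grafting mitigate.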
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A Group-Ordered Fast Iterative Method for Eikonal Equations
In the past decade, many numerical algorithms for the Eikonal equation have been proposed.
Recently, research on Eikonal equation solvers has focused more on developing efficient
parallel algorithms in order to leverage the computing power of parallel systems, such as multi-
core CPUs and GPUs (Graphics Processing Units). In this paper, we introduce an efficient
parallel algorithm that extends Jeong et al.’s FIM (Fast Iterative Method, [1]), originally
developed for the GPU, for multi-core shared memory systems. First, we propose a parallel
implementation of FIM using a lock-free local queue approach and provide an in-depth analysis
of the parallel performance of the method. Second, we propose a new parallel algorithm, Group-
Ordered Fast Iterative Method (GO-FIM), that exploits causality of grid blocks to reduce
redundant computations, which was the main drawback of the original FIM. In addition, the
proposed GO-FIM method employs clustering of blocks based on the updating order where each
cluster can be updated in parallel using multi-core parallel architectures. We discuss the
performance of GO-FIM and compare with the state-of-the-art parallel Eikonal equation solvers.
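A minimal sequential FIM sketch for |∇T| = 1 on a uniform 2D grid (unit spacing and the active-set bookkeeping below are simplifying assumptions; the papers update the active list in parallel and, in GO-FIM, order blocks by causality):

```python
import math

def fim_eikonal(nx, ny, sources, tol=1e-9):
    """Fast Iterative Method for |grad T| = 1: nodes on the active list
    are re-solved until convergence; converged nodes wake up any
    neighbour whose value can still improve."""
    INF = float("inf")
    T = [[INF] * ny for _ in range(nx)]
    for (i, j) in sources:
        T[i][j] = 0.0

    def solve(i, j):
        # Godunov upwind update with unit grid spacing
        a = min(T[i - 1][j] if i > 0 else INF, T[i + 1][j] if i < nx - 1 else INF)
        b = min(T[i][j - 1] if j > 0 else INF, T[i][j + 1] if j < ny - 1 else INF)
        if abs(a - b) >= 1:
            return min(a, b) + 1
        return (a + b + math.sqrt(2 - (a - b) ** 2)) / 2

    active = set()
    for (si, sj) in sources:
        for (i, j) in [(si - 1, sj), (si + 1, sj), (si, sj - 1), (si, sj + 1)]:
            if 0 <= i < nx and 0 <= j < ny:
                active.add((i, j))
    while active:
        nxt = set()
        for (i, j) in active:
            new = solve(i, j)
            if abs(new - T[i][j]) > tol:
                T[i][j] = new
                nxt.add((i, j))      # not converged yet: stay active
            else:
                for (p, q) in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]:
                    if 0 <= p < nx and 0 <= q < ny and solve(p, q) < T[p][q] - tol:
                        nxt.add((p, q))
        active = nxt
    return T
```

The active list is what makes the method parallel-friendly: every node on it can be re-solved independently within a sweep, which is exactly the structure GO-FIM reorders by block causality.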
IEEE Transactions on Parallel and Distributed Systems (May 2016)
Optimal Reconfiguration of High-Performance VLSI Subarrays with Network Flow
A two-dimensional mesh-connected processor array is an extensively investigated architecture
used in parallel processing. Numerous studies have addressed the use of reconfiguration algorithms
for processor arrays with faults. However, the subarray generated by previous algorithms
contains a large number of long interconnects, which in turn leads to more communication costs,
capacitance and dynamic power dissipation. In this paper, we propose novel techniques, making
use of the idea of network flow, to construct the high-performance subarray, which has the
minimum number of long interconnects. Firstly, we construct a network flow model according to
the host array under a specific constraint. Secondly, we show that the reconfiguration problem of
high-performance subarray can be optimally solved in polynomial time by using efficient
minimum-cost flow algorithms. Finally, we prove that the geometric properties of the resulting
subarray meet the system requirements. Simulations based on several random and clustered fault
scenarios clearly reveal the advantage of the proposed technique for reducing the number of long
interconnects. It is shown that, for a host array of size 512 × 512, the number of long interconnects
in the subarray can be reduced by up to 70.05% for clustered faults and by up to 55.28% for
random faults with a density of 1%, as compared to the state-of-the-art.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
DREAM-(L)G: A Distributed Grouping-based Algorithm for Resource Assignment for
Bandwidth-Intensive Applications in the Cloud
Increasingly, many bandwidth-intensive applications have been ported to the cloud platform. In
practice, however, some disadvantages including equipment failures, bandwidth overload and
long-distance transmission often damage the QoS in terms of data availability, bandwidth provision
and access locality, respectively. Some recent solutions have been proposed to cope with
one or two of these disadvantages, but not all. Moreover, as the number of data objects scales, most of
the current offline algorithms solving a constraint optimization problem suffer from low
computational efficiency. To overcome these problems, in this paper we propose an approach
that aims to make fully efficient use of the cloud resources to enable bandwidth-intensive
applications to achieve the desirable level of SLA-specified QoS mentioned above cost-
effectively and in a timely manner. First, we devise a constraint-based model that describes the relationship
among data object placement, user cells bandwidth allocation, operating costs and QoS
constraints. Second, we propose a distributed heuristic algorithm, called DREAM-L, that solves the
model and produces a budget solution to meet SLA-specified QoS. Third, we propose an object-
grouping technique that is integrated into DREAM-L, called DREAM-LG, to significantly
improve the computational efficiency of our algorithm. The results of hundreds of thousands of
simulation-based experiments demonstrate that DREAM-LG provides much better data
availability, bandwidth provision and access locality than the state-of-the-art solutions at modest
cloud operating costs and within a small and acceptable range of time.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal System of Linear
Equations
Quasi-tridiagonal systems of linear equations arise from numerical simulations, and existing
solvers face great challenges as such problems grow to millions of dimensions and beyond. We
present a method that mixes direct and iterative techniques and needs less storage space during
the computing process. A quasi-tridiagonal matrix is split into a tridiagonal matrix
and a sparse matrix using our method and then the tridiagonal equation can be solved by the
direct methods in the iteration processes. Because the approximate solutions obtained by the
direct methods are closer to the exact solutions, the convergence speed of solving the quasi-
tridiagonal system of linear equations can be improved. Furthermore, we present an improved
cyclic reduction algorithm using a partition strategy to solve tridiagonal equations on GPU, and
the intermediate data in computing are stored in shared memory so as to significantly reduce the
latency of memory access. According to our experiments on 10 test cases, the average number of
iterations is reduced significantly by our method compared with Jacobi, GS, GMRES, and
BiCG, and is close to those of BiCGSTAB, BiCRSTAB, and TFQMR. In parallel
mode, the computing efficiency of our method is raised by the partition strategy, and its
performance is better than that of the commonly used iterative and direct
methods because of the smaller amount of calculation per iteration.
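The split-and-iterate idea can be sketched as follows: a direct O(n) tridiagonal solve (the Thomas algorithm, standing in for the paper's GPU cyclic reduction) inside a splitting iteration that moves the sparse off-tridiagonal entries to the right-hand side. The sparse-map representation and parameter names are illustrative assumptions:

```python
def thomas(a, b, c, d):
    """Direct O(n) tridiagonal solver: a is the sub-diagonal, b the
    diagonal, c the super-diagonal, d the right-hand side."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def quasi_tridiag_solve(a, b, c, sparse, d, iters=50):
    """Splitting iteration for (T + S) x = d: each sweep solves
    T x_{k+1} = d - S x_k with the direct solver above. `sparse` maps
    (row, col) -> value for the few off-tridiagonal entries of S."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        rhs = list(d)
        for (i, j), v in sparse.items():
            rhs[i] -= v * x[j]   # move S x_k to the right-hand side
        x = thomas(a, b, c, rhs)
    return x
```

Because each iterate comes from an exact tridiagonal solve, it stays close to the true solution whenever S is small relative to T, which is the convergence-speed argument the abstract makes.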
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors
The increasing number of cores on a chip has made the Network on Chip (NoC) concept the
standard communication paradigm for Chip Multiprocessors. A fault in an NoC leads to
undesirable ramifications that can severely impact the performance of a chip. Therefore, it is
vital to design fault tolerant NoCs. In this paper, we present Shield, a reliable NoC router
architecture that has the unique ability to tolerate both hard and soft errors in the routing pipeline
using techniques such as spatial redundancy, exploitation of idle cycles, bypassing of faulty
resources and selective hardening. Using Mean Time to Failure and Silicon Protection Factor
metrics, we show that Shield is six times more reliable than the baseline-unprotected router and
is at least 1.5 times more reliable than existing fault tolerant router architectures. We introduce a
new metric called Soft Error Improvement Factor and show that the soft error tolerance of Shield
has improved by three times in comparison to the baseline-unprotected router. This reliability
improvement is accomplished by incurring an area and power overhead of 34% and 31%
respectively. Latency analysis using SPLASH-2 and PARSEC reveals that in the presence of
faults, latency increases by a modest 13% and 10% respectively.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
An Energy-Efficient Directory Based Multicore Architecture with Wireless Routers to
Minimize the Communication Latency
Multicore architectures suffer from high core-to-core communication latency primarily due to
the cache’s dynamic behavior. Studies suggest that a directory-approach can be helpful to reduce
communication latency by storing the cached block information. Recent studies also indicate that
a wireless router has potential to help decrease communication latency in multicore architectures.
In this work, we propose a directory based multicore architecture with wireless routers to
minimize communication latency. We simulate systems with mesh (used in the Standford
Directory Architecture for SHared memory (DASH) architecture), wireless network-on-chip
(WNoC), and the proposed directory based architecture with wireless routers. According to the
experimental results, our proposed architecture outperforms the WNoC and the mesh
architectures. It is observed that the proposed architecture helps decrease the communication
delay by up to 15.71% and the total power consumption by up to 67.58% when compared with
the mesh architecture. Similarly, the proposed architecture helps decrease the communication
delay by up to 10.00% and the total power consumption by up to 58.10% when compared with
the WNoC architecture. This is due to the fact that the proposed directory based mechanism
helps reduce the number of core-to-core communications and the wireless routers help reduce the
total number of hops.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
Trajectory Pattern Mining for Urban Computing in the Cloud
The increasing pervasiveness of mobile devices along with the use of technologies like GPS,
Wi-Fi networks, RFID, and sensors, allows for the collection of large amounts of movement data.
These data can be analyzed to extract descriptive and predictive models that can be
properly exploited to improve urban life. From a technological viewpoint, Cloud computing can
play an essential role by helping city administrators to quickly acquire new capabilities and
reducing initial capital costs by means of a comprehensive pay-as-you-go solution. This paper
presents a workflow-based parallel approach for discovering patterns and rules from trajectory
data, in a Cloud-based framework. Experimental evaluation has been carried out on both real-
world and synthetic trajectory data, up to one million trajectories. The results show that, due
to the high complexity and large volumes of data involved in the application scenario, the
trajectory pattern mining process takes advantage from the scalable execution environment
offered by a Cloud architecture in terms of both execution time, speed-up and scale-up.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally
partitioning data among a group of computing nodes. We start this study by discovering a serious
performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large
dataset, data partitioning strategies in the existing solutions suffer high communication and
mining overhead induced by redundant transactions transmitted among computing nodes. We
address this problem by developing a data partitioning approach called FiDoop-DP using the
MapReduce programming model. The overarching goal of FiDoop-DP is to boost the
performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP
is the Voronoi diagram-based data partitioning technique, which exploits correlations among
transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique,
FiDoop-DP places highly similar transactions into a data partition to improve locality without
creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node
Hadoop cluster, driven by a wide range of datasets created by IBM Quest Market-Basket
Synthetic Data Generator. Experimental results reveal that FiDoop-DP is conducive to reducing
network and computing loads by virtue of eliminating redundant transactions on Hadoop
nodes. FiDoop-DP significantly improves the performance of the existing parallel frequent-
pattern scheme by up to 31% with an average of 18%.
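The similarity-driven grouping can be illustrated with MinHash signatures and LSH banding, a common stand-in for this kind of locality-sensitive partitioning (the paper combines LSH with a Voronoi-diagram scheme; the signature length, band count, and hash construction below are assumptions):

```python
import hashlib

def minhash_signature(items, num_hashes=8):
    """MinHash signature of a transaction (a set of items): for each
    seeded hash function keep the minimum item hash. Transactions with
    high Jaccard similarity agree on many signature positions."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return tuple(sig)

def lsh_partition(transactions, bands=4, num_hashes=8):
    """Bucket transactions whose signatures agree on at least one band,
    so highly similar transactions land in the same partition."""
    rows = num_hashes // bands
    buckets = {}
    for tid, items in transactions.items():
        sig = minhash_signature(items, num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(tid)
    return buckets
```

Placing similar transactions together is what lets FiDoop-DP avoid shipping the same transaction redundantly to many mining nodes.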
IEEE Transactions on Parallel and Distributed Systems (April 2016)
DistR: A Distributed Method for the Reachability Query over Large Uncertain Graphs
Among uncertain graph queries, reachability, i.e., the probability that one vertex is reachable
from another, is likely the most fundamental one. Although this problem has been studied within
the field of network reliability, solutions are implemented on a single computer and can only
handle small graphs. However, as the size of graph applications continually increases, the
corresponding graph data can no longer fit within a single computer’s memory and must
therefore be distributed across several machines. Furthermore, the computation of probabilistic
reachability queries is #P-complete making it very expensive even on small graphs. In this paper,
we develop an efficient distributed strategy, called DistR, to solve the problem of reachability
query over large uncertain graphs. Specifically, we perform the task in two steps: distributed
graph reduction and distributed consolidation. In the distributed graph reduction step, we find all
of the maximal subgraphs of the original graph, whose reachability probabilities can be
calculated in polynomial time, compute them and reduce the graph accordingly. After this step,
only a small graph remains. In the distributed consolidation step, we transform the problem into
a relational join process and provide an approximate answer to the #P-complete reachability
query. Extensive experimental studies show that our distributed approach is efficient in terms of
both computational and communication costs, and has high accuracy.
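Since exact probabilistic reachability is #P-complete, a naive single-machine Monte Carlo baseline (sampling possible worlds, not the paper's DistR reduction-and-consolidation method) makes the problem concrete:

```python
import random

def reachability_prob(edges, src, dst, samples=20000, seed=1):
    """Monte Carlo estimate of P(src reaches dst) in an uncertain graph.
    `edges` maps a directed edge (u, v) to its existence probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # sample one possible world, then do a plain DFS inside it
        adj = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            if u == dst:
                hits += 1
                break
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return hits / samples
```

The cost of sampling enough worlds on one machine is exactly what motivates DistR's polynomial-time subgraph reduction before any expensive estimation.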
IEEE Transactions on Parallel and Distributed Systems (February 2016)
A Constraint Programming Scheduler for Heterogeneous High-Performance Computing
Machines
Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the
key role of mapping jobs to the available resources, trying to maximize performance and
Quality-of-Service (QoS). Allocation and Scheduling in the general case are well-known NP-
hard problems, forcing commercial schedulers to adopt greedy approaches to improve
performance and QoS. Search-based approaches featuring the exploration of the solution space
have seldom been employed in this setting, but mostly applied in off-line scenarios. In this paper,
we present the first search-based approach to job allocation and scheduling for HPC machines,
working in a production environment. The scheduler is based on Constraint Programming, an
effective programming technique for optimization problems. The resulting scheduler is flexible,
as it can be easily customized for dealing with heterogeneous resources, user-defined constraints
and different metrics. We evaluate our solution both on virtual machines using synthetic
workloads, and on the Eurora HPC with production workloads. Tests on a wide range of
operating conditions show significant improvements in waiting times and QoS on mid-tier HPC
machines w.r.t. state-of-the-art commercial rule-based dispatchers. Furthermore, we analyze the
conditions under which our approach outperforms commercial approaches, to create a portfolio
of scheduling algorithms that ensures robustness, flexibility and scalability.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Enabling data-centric distribution technology for partitioned embedded systems
Modern complex embedded systems are evolving into mixed-criticality systems in order to
satisfy a wide set of non-functional requirements such as security, cost, weight, timing or power
consumption. Partitioning is an enabling technology for this purpose, as it provides an
environment with strong temporal and spatial isolation which allows the integration of
applications with different requirements into a common hardware platform. At the same time,
embedded systems are increasingly networked (e.g., cyber-physical systems) and they even
might require global connectivity in open environments, so enhanced communication
mechanisms are needed to develop distributed partitioned systems. To this end, this work
proposes an architecture to enable the use of data-centric real-time distribution middleware in
partitioned embedded systems based on a hypervisor. This architecture relies on distribution
middleware and a set of virtual devices to provide mixed-criticality partitions with a
homogeneous and interoperable communication subsystem. The results obtained show that this
approach provides low overhead and a reasonable trade-off between temporal isolation and
performance.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency
Costs Simultaneously
Intelligent partitioning models are commonly used for efficient parallelization of irregular
applications on distributed systems. These models usually aim to minimize a single
communication cost metric, which is either related to communication volume or message count.
However, both volume- and message-related metrics should be taken into account during
partitioning for a more efficient parallelization. There are only a few works that consider both of
them and they usually address each in separate phases of a two-phase approach. In this work, we
propose a recursive hypergraph bipartitioning framework that reduces the total volume and total
message count in a single phase. In this framework, the standard hypergraph models, nets of
which already capture the bandwidth cost, are augmented with message nets. The message nets
encode the message count so that minimizing conventional cutsize captures the minimization of
bandwidth and latency costs together. Our model provides a more accurate representation of the
overall communication cost by incorporating both the bandwidth and the latency components
into the partitioning objective. The use of the widely-adopted successful recursive bipartitioning
framework provides the flexibility of using any existing hypergraph partitioner. The experiments
on instances from different domains show that our model on average achieves up to a 52%
reduction in total message count and hence a 29% reduction in parallel running time
compared to the model that considers only the total volume.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
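As a toy illustration of the cutsize objective described above (the nets, pins, and weights below are invented, not taken from the paper), the connectivity-1 metric charges each net for every extra part its pins span; adding unit-weight message nets makes the same metric count messages as well:

```python
# Hypothetical sketch: connectivity-1 cutsize of a partitioned hypergraph.
# Volume nets carry data-size weights; message nets carry unit (per-message)
# weights, so one cutsize captures bandwidth and latency costs together.

def cutsize(nets, part):
    """Each net contributes weight * (number of parts its pins span - 1)."""
    total = 0
    for pins, weight in nets:
        spanned = {part[v] for v in pins}
        total += weight * (len(spanned) - 1)
    return total

# Volume nets: (pins, data size communicated).
volume_nets = [((0, 1, 2), 4), ((2, 3), 2)]
# Message nets: (sender plus receivers, per-message cost); spanning an
# extra part corresponds to one extra message.
message_nets = [((0, 1), 1), ((1, 2, 3), 1)]

part = {0: 0, 1: 0, 2: 1, 3: 1}
print(cutsize(volume_nets + message_nets, part))  # → 5 (volume 4 + 1 message)
```

Any off-the-shelf hypergraph partitioner that minimizes this cutsize would then trade volume against message count through the net weights.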
Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data
Processing Frameworks
The shift to the in-memory data processing paradigm has had a major influence on the
development of cluster data processing frameworks. Numerous frameworks from the industry,
open source community and academia are adopting the in-memory paradigm to achieve
functionalities and performance breakthroughs. However, despite the advantages of these
in-memory frameworks, in practice they are susceptible to memory-pressure-related performance
collapse and failures. The contributions of this paper are two-fold. Firstly, we conduct a detailed
diagnosis of the memory pressure problem and identify three preconditions for the performance
collapse. These preconditions not only explain the problem but also shed light on the possible
solution strategies. Secondly, we propose a novel programming abstraction called the leaky
buffer that eliminates one of the preconditions, thereby addressing the underlying problem. We
have implemented a leaky-buffer-enabled hashtable in Spark, and we believe it can also
replace hashtables that perform similar hash aggregation operations in other programs
or data processing frameworks. Experiments on a range of memory-intensive aggregation
operations show that the leaky buffer abstraction can drastically reduce the occurrence of
memory-related failures, improve performance by up to 507% and reduce memory usage by up
to 87.5%.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
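The leaky-buffer idea above can be caricatured in a few lines (a sketch under assumed semantics: the capacity, spill target, and merge step are invented, and the Spark hashtable in the paper is far more involved):

```python
# Minimal sketch of a "leaky buffer": an aggregation buffer that caps its
# in-memory footprint and leaks (spills) entries to secondary storage when
# the cap is exceeded, so memory use stays bounded for any number of keys.

class LeakyBuffer:
    def __init__(self, capacity, combine):
        self.capacity = capacity      # max in-memory entries (hypothetical)
        self.combine = combine        # aggregation function, e.g. operator.add
        self.mem = {}                 # in-memory partial aggregates
        self.spilled = []             # stand-in for on-disk spill runs

    def insert(self, key, value):
        if key in self.mem:
            self.mem[key] = self.combine(self.mem[key], value)
        else:
            if len(self.mem) >= self.capacity:
                # Leak the whole buffer instead of growing it.
                self.spilled.append(self.mem)
                self.mem = {}
            self.mem[key] = value

    def result(self):
        # Merge spilled runs back in (a real system would stream from disk).
        out = {}
        for run in self.spilled + [self.mem]:
            for k, v in run.items():
                out[k] = self.combine(out[k], v) if k in out else v
        return out

import operator
buf = LeakyBuffer(capacity=2, combine=operator.add)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]:
    buf.insert(k, v)
print(buf.result())  # → {'a': 4, 'b': 7, 'c': 4}
```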
Parity-Switched Data Placement: Optimizing Partial Stripe Writes in XOR-Coded Storage
Systems
Erasure codes tolerate disk failures by pre-storing a low degree of data redundancy, and have
been commonly adopted in current storage systems. However, the attached requirement on data
consistency amplifies partial stripe write operations and thus seriously degrades system
performance. Previous works to optimize partial stripe writes are relatively limited, and a general
mechanism is still absent. In this paper, we propose a Parity-Switched Data Placement (PDP) to
optimize partial stripe writes for any XOR-coded storage system. PDP first reduces the write
operations by arranging continuous data elements to join a common parity element’s generation.
To achieve a deeper optimization, PDP further explores the generation orders of parity elements
and makes any two continuous data elements associate with a common parity element. Intensive
evaluations show that for tested erasure codes, PDP reduces up to 31.9% of write operations and
further increases the write speed by up to 59.8% when compared with two state-of-the-art data
placement methods.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
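To see why placement matters for partial stripe writes, consider this illustrative count (the placements below are invented; PDP's actual construction applies to any XOR code):

```python
# Illustrative sketch, not PDP itself: writing w consecutive data elements
# updates each element plus every distinct parity element those elements
# participate in, so a placement mapping neighbors to a common parity
# needs fewer write operations.

def write_ops(parity_of, start, w):
    """Total element writes for updating data elements start..start+w-1."""
    parities = {parity_of[i] for i in range(start, start + w)}
    return w + len(parities)

striped   = [0, 1, 0, 1]   # neighboring data elements alternate parities
clustered = [0, 0, 1, 1]   # neighboring data elements share a parity
print(write_ops(striped, 0, 2), write_ops(clustered, 0, 2))  # → 4 3
```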
VINEA: An Architecture for Virtual Network Embedding Policy Programmability
Network virtualization has enabled new business models by allowing infrastructure providers to
lease or share their physical network. A fundamental management problem that cloud providers
face to support customized virtual network (VN) services is the virtual network embedding. This
requires solving the (NP-hard) problem of matching constrained virtual networks onto the
physical network. In this paper we present VINEA, a policy-based virtual network embedding
architecture, and its system implementation. VINEA leverages our previous results on VN
embedding optimality and convergence guarantees, and it is based on a network utility
maximization approach that separates policies (i.e., high-level goals) from underlying
embedding mechanisms: resource discovery, virtual network mapping, and allocation on the
physical infrastructure. We show how VINEA can subsume existing embedding approaches, and
how it can be used to design novel solutions that adapt to different scenarios, by merely
instantiating different policies. We describe the VINEA architecture, as well as our object model:
our VINO protocol and the API to program the embedding policies; we then analyze key
representative tradeoffs among novel and existing VN embedding policy configurations, via
event-driven simulations, and with our prototype implementation. Among our findings, our
evaluation shows how, in contrast to existing solutions, simultaneously embedding nodes and
links may lead to lower providers’ revenue. We release our implementation on a testbed that uses
a Linux system architecture to reserve virtual node and link capacities. Our prototype can be also
used to augment existing open-source "Networking as a Service" architectures such as
OpenStack Neutron, which currently lacks a VN embedding protocol, and as a policy-
programmable solution to the "slice stitching" problem within wide-area virtual network
testbeds.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Application control configurations for parallel connection of single-phase energy
conversion units operating in island mode
This paper presents the design and implementation of controllers for the parallel connection
of single-phase energy conversion units operating in island mode. To meet this objective, two
control configurations are implemented, both taking as reference the output voltage of a
droop scheme: a two-degrees-of-freedom controller plus a repetitive controller, and a
proportional-integral/proportional controller plus a resonant controller. These control
configurations are intended to maintain the amplitude, waveform, and frequency of the voltage
signal and to handle linear and nonlinear load increases during island-mode operation of a
single-phase energy conversion unit. In other words, with these control strategies, several
inverters connected in parallel to a microgrid can operate as voltage sources, sharing the
active and reactive power demanded by the load.
IEEE Latin America Transactions (March 2016)
PathGraph: A Path Centric Graph Processing System
Large scale iterative graph computation presents an interesting systems challenge due to two
well known problems: (1) the lack of access locality and (2) the lack of storage efficiency. This
paper presents PathGraph, a system for improving iterative graph computation on graphs with
billions of edges. First, we improve the memory and disk access locality for iterative
computation algorithms on large graphs by modeling a large graph using a collection of tree-
based partitions. This enables us to use path-centric computation rather than vertex-centric or
edge-centric computation. For each tree partition, we re-label vertices using DFS in order to
preserve consistency between the order of vertex ids and vertex order in the paths. Second, a
compact storage that is optimized for iterative graph parallel computation is developed in the
PathGraph system. Concretely, we employ delta-compression and store tree-based partitions in a
DFS order. By clustering highly correlated paths together as tree based partitions, we maximize
sequential access and minimize random access on storage media. Third but not least, our
path-centric computation model is implemented using a scatter/gather programming model. We
parallelize the iterative computation at the partition-tree level and perform sequential local updates for
vertices in each tree partition to improve the convergence speed. To provide well balanced
workloads among parallel threads at tree partition level, we introduce the concept of multiple
stealing points based task queue to allow work stealings from multiple points in the task queue.
We evaluate the effectiveness of PathGraph by comparing with recent representative graph
processing systems such as GraphChi and X-Stream etc. Our experimental results show that our
approach outperforms the two systems on a number of graph algorithms for both in-memory and
out-of-core graphs. While our approach achieves better data balance and load balance, it also
shows better speedup than the two systems as the number of threads grows.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
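The DFS relabeling step mentioned above can be sketched as follows (the tree partition is a made-up example):

```python
# Sketch of DFS relabeling within a tree partition: vertices are renumbered
# in DFS preorder so that vertex ids increase along any root-to-leaf path,
# keeping the order of ids consistent with vertex order in the paths.

def dfs_relabel(tree, root):
    """Return {old_id: new_id} with new ids assigned in DFS preorder."""
    label, order, stack = {}, 0, [root]
    while stack:
        v = stack.pop()
        label[v] = order
        order += 1
        # Push children in reverse so they are visited left-to-right.
        stack.extend(reversed(tree.get(v, [])))
    return label

tree = {10: [7, 42], 7: [3], 42: [5, 8]}
print(dfs_relabel(tree, 10))  # → {10: 0, 7: 1, 3: 2, 42: 3, 5: 4, 8: 5}
```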
CIACP: A Correlation- and Iteration-Aware Cache Partitioning Mechanism to Improve
Performance of Multiple Coarse-Grained Reconfigurable Arrays
Multiple coarse-grained reconfigurable arrays (CGRA), which are organized in parallel or
pipeline to complete applications, have become a productive solution to balance the performance
with the flexibility. One of the keys to obtain high performance from multiple CGRAs is to
manage the shared on-chip cache efficiently to reduce off-chip memory bandwidth requirements.
Cache partitioning has been viewed as a promising technique to enhance the efficiency of a
shared cache. However, the majority of prior partitioning techniques were developed for multi-
core platforms and aimed at multi-programmed workloads. They cannot directly address the
adverse impacts of data correlation and computation imbalance among competing CGRAs on a
multi-CGRA platform. This paper proposes a correlation- and iteration-aware cache partitioning
(CIACP) mechanism for shared cache partitioning in multi-CGRA systems. This mechanism
employs correlation monitors (CMONs) to trace the amount of overlapping data among parallel
CGRAs, and iteration monitors (IMONs) to track the computation load of each CGRA. Using the
information collected by CMONs and IMONs, the CIACP mechanism can eliminate redundant
cache utilization of the overlapping data and can also shorten the total execution time of
pipelined CGRAs. Experimental results showed that CIACP outperformed state-of-the-art utility-
based cache partitioning techniques by up to 16% in performance.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
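For context, the utility-based baseline that CIACP is compared against can be sketched as a greedy way-allocation loop (the hit curves below are invented; CIACP itself additionally folds in the CMON/IMON correlation and iteration information, which this sketch omits):

```python
# Simplified utility-based cache-way partitioning: each consumer (here, a
# CGRA) reports hits(w) = hits achieved with w ways, and ways are handed
# out one at a time to whichever consumer gains the most from the next way.

def partition_ways(hit_curves, total_ways):
    alloc = [0] * len(hit_curves)
    for _ in range(total_ways):
        # Marginal utility of one more way for consumer i.
        best = max(range(len(hit_curves)),
                   key=lambda i: hit_curves[i][alloc[i] + 1] - hit_curves[i][alloc[i]])
        alloc[best] += 1
    return alloc

# Hypothetical hit curves: hit_curves[i][w] = hits of CGRA i with w ways.
curves = [[0, 50, 60, 65, 67], [0, 30, 55, 75, 80]]
print(partition_ways(curves, 4))  # → [1, 3]
```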
Failure Diagnosis for Distributed Systems using Targeted Fault Injection
This paper introduces a novel approach to automating failure diagnostics in distributed systems
by combining fault injection and data analytics. We use fault injection to populate the database
of failures for a target distributed system. When a failure is reported from a production
environment, the database is queried to find "matched" failures generated by fault injections.
Relying on the assumption that similar faults generate similar failures, we use information from
the matched failures as hints to locate the actual root cause of the reported failures. In order to
implement this approach, we introduce techniques for (i) reconstructing end-to-end execution
flows of distributed software components, (ii) computing the similarity of the reconstructed
flows, and (iii) performing precise fault injection at pre-specified executing points in distributed
systems. We have evaluated our approach using an OpenStack cloud platform, a popular cloud
infrastructure management system. Our experimental results showed that this approach is
effective in determining the root causes, e.g., fault types and affected components, for 71-100%
of tested failures. Furthermore, it can provide fault locations close to actual ones and can easily
be used to find and fix actual root causes. We have also validated this technique by localizing
real bugs that occurred in OpenStack.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
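A highly simplified illustration of the matching step above (the event names and the Jaccard measure are assumptions for illustration; the paper's flow reconstruction and similarity computation are more elaborate):

```python
# Hypothetical sketch: look up the injected failure whose reconstructed
# execution flow is most similar to a reported failure's flow, using
# Jaccard similarity over event sets as a stand-in similarity measure.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Fault-injection database: injected fault -> observed event flow (invented).
database = {
    "net-timeout": ["rpc.send", "rpc.retry", "timeout", "teardown"],
    "disk-full":   ["write", "flush", "enospc", "abort"],
}
reported = ["rpc.send", "timeout", "teardown", "alert"]

# The best match serves as a hint to the root cause of the reported failure.
best = max(database, key=lambda k: jaccard(database[k], reported))
print(best)  # → net-timeout
```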
Towards Practical and Near-optimal Coflow Scheduling for Data Center Networks
In current data centers, an application (e.g., MapReduce, Dryad, search platform, etc.) usually
generates a group of parallel flows to complete a job. These flows compose a coflow and only
completing them all is meaningful to the application. Accordingly, minimizing the average
Coflow Completion Time (CCT) becomes a critical objective of flow scheduling. However,
achieving this goal in today’s Data Center Networks (DCNs) is quite challenging, not only
because the scheduling problem is theoretically NP-hard, but also because it is tough to perform
practical flow scheduling in large-scale DCNs. In this paper, we find that minimizing the average
CCT of a set of coflows is equivalent to the well-known problem of minimizing the sum of
completion times in a concurrent open shop. As there are abundant existing solutions for
concurrent open shop, we open up a variety of techniques for coflow scheduling. Inspired by the
best known result, we derive a 2-approximation algorithm for coflow scheduling, and further
develop a decentralized coflow scheduling system, D-CAS, which avoids the system problems
associated with current centralized proposals while addressing the performance challenges of
decentralized suggestions. Trace-driven simulations indicate that D-CAS achieves a performance
close to Varys, the state-of-the-art centralized method, and outperforms Baraat, the only existing
decentralized method, significantly.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
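In the concurrent open shop view described above, a permutation schedule processes coflows in one fixed order on every port. The sketch below orders coflows by their bottleneck load, a common heuristic in this setting, and computes the resulting average CCT; it is illustrative only, not the paper's 2-approximation or D-CAS:

```python
# Illustrative coflow scheduling sketch: each coflow places a load on every
# port; a permutation schedule serves coflows in one order on all ports.
# Coflows are ordered smallest-bottleneck-first, and the average Coflow
# Completion Time (CCT) of that schedule is computed.

def avg_cct(coflows, num_ports):
    """coflows: list of per-port load vectors (time units)."""
    order = sorted(range(len(coflows)),
                   key=lambda c: max(coflows[c]))   # smallest bottleneck first
    finish = [0.0] * num_ports
    ccts = {}
    for c in order:
        for p in range(num_ports):
            finish[p] += coflows[c][p]
        # A coflow completes when its last flow does (ports it actually uses).
        ccts[c] = max(finish[p] for p in range(num_ports)
                      if coflows[c][p] > 0)
    return sum(ccts.values()) / len(ccts)

# Two ports, three coflows with hypothetical loads.
print(avg_cct([[4, 1], [1, 2], [3, 3]], 2))  # → 5.0
```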
Traffic Load Balancing Schemes for Devolved Controllers in Mega Data Centers
In most existing cloud services, a centralized controller is used for resource management and
coordination. However, such an infrastructure is increasingly insufficient to meet the rapid growth of
mega data centers. In recent literature, a new approach named devolved controller was proposed
to address scalability concerns. This approach splits the whole network into several regions, each with
one controller to monitor and reroute a portion of the flows. This technique alleviates the
problem of an overloaded single controller, but introduces other problems such as unbalanced
workload among controllers and reconfiguration complexity. In this paper, we explore the use
of devolved controllers for mega data centers, and design new schemes to
overcome these shortcomings and improve the performance of the system. We first formulate
Load Balancing problem for Devolved Controllers (LBDC) in data centers, and prove that it is
NP-complete. We then design an f-approximation algorithm for LBDC, where f is the largest number of
potential controllers for a switch in the network. Furthermore, we propose both centralized and
distributed greedy approaches to solve the LBDC problem effectively. The numerical results
validate the efficiency of our schemes, which offer a solution for monitoring, managing,
and coordinating mega data centers with multiple controllers working together.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
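A minimal sketch of a centralized greedy assignment in the spirit described above (the switch loads, candidate sets, and heaviest-first rule are assumptions, not the paper's exact algorithm):

```python
# Hypothetical greedy LBDC sketch: each switch has a set of potential
# controllers; switches are assigned, heaviest first, to the least-loaded
# eligible controller, balancing monitoring load across controllers.

def greedy_lbdc(switch_load, candidates, num_controllers):
    load = [0] * num_controllers
    assign = {}
    # Heaviest switches first, so large loads are placed while headroom exists.
    for s in sorted(switch_load, key=switch_load.get, reverse=True):
        c = min(candidates[s], key=lambda k: load[k])
        assign[s] = c
        load[c] += switch_load[s]
    return assign, load

# Invented example: 4 switches, 3 controllers.
switch_load = {"s1": 5, "s2": 3, "s3": 4, "s4": 2}
candidates = {"s1": [0, 1], "s2": [0], "s3": [1, 2], "s4": [1, 2]}
assign, load = greedy_lbdc(switch_load, candidates, 3)
print(assign, load)  # → {'s1': 0, 's3': 1, 's2': 0, 's4': 2} [8, 4, 2]
```

Here f from the approximation bound would be the largest candidate-set size (2 in this example).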
Reactive Molecular Dynamics on Massively Parallel Heterogeneous Architectures
We present a parallel implementation of the ReaxFF force field on massively parallel
heterogeneous architectures, called PuReMD-Hybrid. PuReMD, on which this work is based,
along with its integration into LAMMPS, is currently used by a large number of research groups
worldwide. Accelerating this important community codebase that implements a complex reactive
force field poses a number of algorithmic, design, and optimization challenges, as we discuss in
detail. In particular, different computational kernels are best suited to different computing
substrates – CPUs or GPUs. Scheduling these computations requires complex resource
management, as well as minimizing data movement across CPUs and GPUs. Integrating
powerful nodes, each with multiple CPUs and GPUs, into clusters and utilizing the immense
compute power of these clusters requires significant optimizations for minimizing
communication and, potentially, redundant computations. From a programming model
perspective, PuReMD-Hybrid relies on MPI across nodes, pthreads across cores, and CUDA on
the GPUs to address these challenges. Using a variety of innovative algorithms and
optimizations, we demonstrate that our code can achieve over a 565-fold speedup compared to a
single-core implementation on a cluster of 36 state-of-the-art GPUs for complex systems. In
terms of application performance, our code enables simulations of over 1.8M atoms in under
0.68 seconds per simulation time step.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A fast discrete wavelet transform using hybrid parallelism on GPUs
Wavelet transform has been widely used in many signal and image processing applications. Due
to its wide adoption in time-critical applications, such as streaming and real-time signal
processing, many acceleration techniques were developed during the past decade. Recently, the
graphics processing unit (GPU) has gained much attention for accelerating computationally-
intensive problems and many solutions of GPU-based discrete wavelet transform (DWT) have
been introduced, but most of them did not fully leverage the potential of the GPU. In this paper,
we present various state-of-the-art GPU optimization strategies in DWT implementation, such as
leveraging shared memory, registers, warp shuffling instructions, and thread- and instruction-
level parallelism (TLP, ILP), and finally present our hybrid approach to further boost its
performance. In addition, we introduce a novel mixed-band memory layout for Haar DWT,
where multi-level transform can be carried out in a single fused kernel launch. As a result, unlike
recent GPU DWT methods that focus mainly on maximizing ILP, we show that the optimal GPU
DWT performance can be achieved by hybrid parallelism combining both TLP and ILP together
in a mixed-band approach. We demonstrate the performance of our proposed method by
comparison with other CPU and GPU DWT methods.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
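For reference, the butterfly that all these GPU strategies optimize is just the Haar analysis step; a plain-Python single-level version (orthonormal 1/sqrt(2) scaling assumed, no GPU specifics):

```python
# Single-level 1-D Haar DWT: each adjacent pair of samples yields one
# approximation (sum) and one detail (difference) coefficient, scaled by
# 1/sqrt(2) to keep the transform orthonormal.

import math

def haar_dwt_1d(signal):
    """Return (approximation, detail) coefficients for an even-length signal."""
    assert len(signal) % 2 == 0
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_dwt_1d([4, 2, 5, 5])
print(a, d)  # approx ≈ [4.24, 7.07], detail ≈ [1.41, 0.0]
```

Multi-level transforms recurse on the approximation coefficients; the mixed-band layout above lets a GPU kernel fuse those levels into a single launch.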
SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per your university's guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training in the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication of your project.
FOLLOW US ON FACEBOOK @ TSYS Academic Projects
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 

Recently uploaded (20)

internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 

Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds

For more details, feel free to contact us at any time.
Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/
Mail Id: tsysglobalsolutions2014@gmail.com

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 2016 TOPICS

Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds

Clouds are becoming an important platform for scientific workflow applications. However, with many nodes deployed in clouds, managing the reliability of resources becomes a critical issue, especially for real-time scientific workflow execution, where deadlines must be satisfied. Fault tolerance in clouds is therefore essential. Primary-backup (PB) based scheduling is a popular fault-tolerance technique that has been used effectively in cluster and grid computing. However, applying this technique to real-time workflows in a virtualized cloud is much more complicated and has rarely been studied. In this paper, we address this problem. We first establish a real-time workflow fault-tolerant model that extends the traditional PB model by incorporating cloud characteristics. Based on this model, we develop approaches for task allocation and message transmission to ensure that faults can be tolerated during workflow execution. Finally, we propose a dynamic fault-tolerant scheduling algorithm, FASTER, for real-time workflows in the virtualized cloud. FASTER has three key features: 1) it employs a backward shifting method to make full use of idle resources and incorporates task overlapping and VM migration for high resource utilization; 2) it applies vertical/horizontal scaling-up to quickly provision resources for a burst of workflows; and 3) it uses vertical scaling-down to avoid unnecessary and ineffective resource changes caused by fluctuating workflow requests.
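The placement constraint at the heart of primary-backup scheduling can be sketched as follows. This is a minimal illustration with hypothetical names (`pb_allocate`, a greedy least-loaded policy), not the FASTER algorithm itself, which additionally handles deadlines, backward shifting, task overlapping, and VM scaling:

```python
# Minimal primary-backup (PB) allocation sketch -- illustrative only,
# not the FASTER algorithm. Each task's backup is placed on a different
# host than its primary, so a single host failure is always tolerated.

def pb_allocate(tasks, hosts):
    """Greedily map each task to a (primary_host, backup_host) pair."""
    load = {h: 0 for h in hosts}
    placement = {}
    for task in tasks:
        # Primary goes on the currently least-loaded host.
        primary = min(load, key=load.get)
        load[primary] += 1
        # Backup goes on the least-loaded host OTHER than the primary.
        backup = min((h for h in hosts if h != primary), key=lambda h: load[h])
        load[backup] += 1
        placement[task] = (primary, backup)
    return placement

placement = pb_allocate(["t1", "t2", "t3"], ["vm_a", "vm_b", "vm_c"])
for task, (p, b) in placement.items():
    assert p != b  # one host failure never loses both copies of a task
```

The key invariant is only that primary and backup never share a host; any load-balancing policy could replace the greedy one used here.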
We evaluate the FASTER algorithm with synthetic workflows and with workflows collected from real scientific and business applications, and compare it with six baseline algorithms. The experimental results demonstrate that FASTER can effectively improve resource utilization and schedulability even in the presence of node failures in virtualized clouds.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length of a DAG with Random Weights
Coping with uncertainty when scheduling task graphs on parallel machines requires non-trivial evaluations. When each computation and communication duration is a random variable, evaluating the distribution of the critical path length of such graphs involves computing maximums and sums of possibly dependent random variables. The discrete version of this evaluation problem is known to be #P-hard. Here, we propose two heuristics, CorLCA and Cordyn, to compute such lengths. They approximate the input random variables and the intermediate ones as normal random variables, and they take correlations into account through two distinct mechanisms: lowest common ancestor queries for CorLCA and a dynamic programming approach for Cordyn. Moreover, we empirically compare classical methods from the literature against our solutions. Simulations on a large set of cases indicate that CorLCA and Cordyn each constitute a new, relevant trade-off between speed and precision.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

A Hybrid Static-Dynamic Classification for Dual-Consistency Cache Coherence

Traditional cache coherence protocols manage all memory accesses equally and ensure the strongest memory model, namely sequential consistency. Recent cache coherence protocols based on self-invalidation advocate the model of sequential consistency for data-race-free code, which enables powerful optimizations for race-free code. However, for racy code these protocols provide sub-optimal performance compared to traditional protocols.

This paper proposes SPEL++, a dual-consistency cache coherence protocol that supports two execution modes: a traditional sequentially consistent protocol and a protocol that provides weak consistency (or sequential consistency for data-race-free code). SPEL++ exploits a hybrid static-dynamic classification of memory accesses based on (i) compile-time identification of extended data-race-free code regions for OpenMP applications and (ii) runtime classification of accesses based on the operating system's memory page management. By executing racy code under the sequentially consistent protocol and race-free code under the protocol that provides sequential consistency for data-race-free code, the end result is efficient execution of the applications while still providing sequential consistency. Compared to a traditional protocol, we show performance improvements from 19% to 38% and energy-consumption reductions from 47% to 53%, on average for different benchmark suites, on a 64-core chip multiprocessor.
IEEE Transactions on Parallel and Distributed Systems (February 2016)

REFRESH: REDEFINE for Face Recognition using SURE Homogeneous Cores

In this paper we present the design and analysis of a scalable real-time Face Recognition (FR) module that performs 450 recognitions per second. We introduce an algorithm for FR that combines Weighted Modular Principal Component Analysis with Radial Basis Function Neural Networks. This algorithm offers better recognition accuracy under various practical conditions than the algorithms used in existing architectures for real-time FR. To meet real-time requirements, a Scalable Parallel Pipelined Architecture (SPPA) is developed by realizing the FR algorithm as independent parallel streams and sub-streams of computation. SPPA is capable of supporting large databases maintained in external (DDR) memory. By casting the computations in a stream into hardware, we arrive at the design of a Scalable Unit for Region Evaluation (SURE) core. Using SURE cores as compute elements in a massively parallel CGRA such as REDEFINE, we obtain an FR system on REDEFINE called REFRESH. We report FPGA and ASIC synthesis results for SPPA and REFRESH. Through analysis using these results, we show that its excellent scalability and added programmability make REFRESH a flexible and favorable solution for real-time FR.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints

As systems scale toward exascale, many resources will become increasingly constrained. While some of these resources have historically been explicitly allocated, many, such as network bandwidth, I/O bandwidth, or power, have not. As systems continue to evolve, we expect many such resources to become explicitly managed.
This change will pose critical challenges for resource management and job scheduling. In this paper, we explore the potential of relaxing network allocation constraints on Blue Gene systems. Our objective is to improve batch scheduling performance, where the partition-based interconnect architecture provides a unique opportunity to explicitly allocate network resources to jobs. This paper makes three major contributions. The first is substantial benchmarking of parallel applications, focusing on assessing application sensitivity to communication bandwidth at large scale. The second is three new scheduling schemes that use relaxed network allocation and target a balance between individual job performance and overall system performance. The third is a comparative study of our scheduling schemes versus the existing scheduler on Mira, a 48-rack Blue Gene/Q system at Argonne National Laboratory. Specifically, we use job traces collected from this production system.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

Transparent and optimized distributed processing on GPUs

DistributedCL is middleware that enables transparent parallel processing on distributed GPUs. With the support of the DistributedCL middleware, an application designed to use the OpenCL API can run in a distributed manner and transparently use remote GPUs without having to change or rebuild the code. The proposed architecture for the DistributedCL middleware is modular, with well-defined layers. A prototype was built according to the architecture, incorporating various optimizations, including sending data in batches, asynchronous network communication, and asynchronous requests to the OpenCL API. The prototype was evaluated using available benchmarks, and a dedicated benchmark, CLBench, was developed to facilitate evaluation according to the amount of processed data. The prototype showed good performance, higher than that of similar proposals that also consider transparent use of remote GPUs. The size of the data to be transmitted over the network was the major limiting factor.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

Distributed Control for Charging Multiple Electric Vehicles with Overload Limitation

Severe pollution caused by traditional fossil fuels has drawn great attention to the usage of plug-in electric vehicles (PEVs) and renewable energy. However, large-scale penetration of PEVs, combined with other kinds of appliances, tends to place an excessive or even disastrous burden on the power grid, especially during peak hours.
This paper focuses on scheduling the charging of PEVs among different charging stations, where each station can be supplied both by renewable energy generators and by a distribution network that also powers some uncontrollable loads. To minimize the on-grid energy cost with local renewable energy and non-ideal storage while avoiding the overload risk of the distribution network, an online algorithm that schedules the charging of PEVs and manages the energy of charging stations is developed based on Lyapunov optimization and Lagrange dual decomposition techniques. The algorithm satisfies random charging requests from PEVs with provable performance. Simulation results with real data demonstrate that the proposed algorithm can decrease the time-average cost of stations while avoiding overload in the distribution network in the presence of random uncontrollable loads.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

CoRE: Cooperative End-to-End Traffic Redundancy Elimination for Reducing Cloud Bandwidth Cost

The pay-as-you-go service model impels cloud customers to reduce their bandwidth usage cost. Traffic Redundancy Elimination (TRE) has been shown to be an effective solution for reducing bandwidth costs and has thus recently captured significant attention in the cloud environment. By studying TRE techniques in a trace-driven approach, we found that both short-term (time spans of seconds) and long-term (time spans of hours or days) data redundancy can appear concurrently in the traffic, and that using only sender-based or only receiver-based TRE cannot capture both types of redundancy simultaneously. Moreover, the efficiency of existing receiver-based TRE solutions is susceptible to data changes relative to the historical data in the cache. In this paper, we propose a Cooperative end-to-end TRE solution (CoRE) that can detect and remove both short-term and long-term redundancy through a two-layer TRE design with cooperative operations between the layers. An adaptive prediction algorithm is further proposed to improve TRE efficiency by dynamically adjusting the prediction window size based on the hit ratio of historical predictions. In addition, we enhance CoRE to adapt to the different traffic redundancy characteristics of cloud applications in order to reduce its operating cost.
Extensive evaluation with several real traces shows that CoRE is capable of effectively identifying both short-term and long-term redundancy at low additional cost while maintaining TRE efficiency under data changes.

IEEE Transactions on Parallel and Distributed Systems (June 2016)

Clustering-based Task Scheduling in a Large Number of Heterogeneous Processors

Parallelization paradigms for the effective execution of Directed Acyclic Graph (DAG) applications have been widely studied in the area of task scheduling. Schedule length can vary depending on task assignment policies, scheduling policies, and the heterogeneity of processors and communication bandwidths in a heterogeneous system. One disadvantage of existing task scheduling algorithms is that the schedule length cannot be reduced for data-intensive applications. In this paper, we propose a clustering-based task scheduling algorithm called Clustering for Minimizing the Worst Schedule Length (CMWSL) that minimizes the schedule length on a large number of heterogeneous processors. First, the proposed method derives a lower bound on the total execution time for each processor by taking both system and application characteristics into account. As a result, the number of processors used for actual execution is regulated to minimize the Worst Schedule Length (WSL). Then, the actual task assignment and task clustering are performed to minimize the schedule length until the total execution time in a task cluster exceeds the lower bound. Experimental results indicate that CMWSL outperforms both existing list-based and clustering-based task scheduling algorithms in terms of schedule length and efficiency, especially for data-intensive applications.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model

The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities) in order to further improve the fault-tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint interval for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads, such as transient memory errors, while checkpoint level 2 deals with hardware crashes, such as node failures.
Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution that requires no advance knowledge of the job length, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed from a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3% compared with other state-of-the-art approaches. Finally, a brute-force comparison with all possible patterns shows that our solution is always within 1% of the best pattern in the experiments.
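For context, the classic single-level baseline that such multilevel solutions are measured against is the Young/Daly periodic checkpoint interval. A one-line sketch of that textbook first-order formula (this is the standard baseline, not the paper's two-level optimum):

```python
import math

# Young/Daly first-order optimal checkpoint period for a SINGLE level:
# tau = sqrt(2 * C * M), where C is the checkpoint cost and M is the
# mean time between failures. Textbook baseline, not the paper's
# two-level online solution.

def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# e.g., a 60 s checkpoint and a 24 h MTBF suggest checkpointing
# roughly every 54 minutes:
tau = young_daly_period(60.0, 24 * 3600.0)
```

A two-level scheme must pick one such interval per level, which is why the joint optimization is substantially harder than this closed form.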
IEEE Transactions on Parallel and Distributed Systems (March 2016)

Shadow/Puppet Synthesis: A Stepwise Method for the Design of Self-Stabilization

This paper presents a novel two-step method for the automated design of self-stabilization. The first step enables the specification of legitimate states and an intuitive (but imprecise) specification of the desired functional behaviors in the set of legitimate states (hence the term "shadow"). After creating the shadow specifications, we systematically introduce the main variables and the topology of the desired self-stabilizing system. Subsequently, we devise a parallel, complete backtracking search for a self-stabilizing solution that implements a precise version of the shadow behaviors and guarantees recovery to legitimate states from any state. To the best of our knowledge, shadow/puppet synthesis is the first sound and complete method that exploits parallelism and randomization, along with expansion of the state space, to generate self-stabilizing systems that cannot be synthesized with existing methods. We have validated the proposed method by creating both a sequential and a parallel implementation in the context of a software tool called Protocon. Moreover, we have used Protocon to automatically design three new self-stabilizing protocols that we conjecture to require the minimal number of states per process to achieve stabilization (when processes are deterministic): 2-state maximal matching on bidirectional rings, 5-state token passing on unidirectional rings, and 3-state token passing on bidirectional chains.
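To make "recovery to legitimate states from any state" concrete, here is the classic Dijkstra K-state token ring, a well-known hand-designed self-stabilizing protocol shown purely for illustration (it is not output of Protocon): from an arbitrary initial configuration, repeatedly firing any enabled process converges to configurations with exactly one token.

```python
# Dijkstra's K-state self-stabilizing token ring (unidirectional),
# shown to illustrate self-stabilization; not produced by Protocon.
# Process i > 0 holds a token when its value differs from process i-1;
# process 0 holds a token when its value equals the last process's.

def enabled(states):
    tokens = [0] if states[0] == states[-1] else []
    tokens += [i for i in range(1, len(states)) if states[i] != states[i - 1]]
    return tokens

def fire(states, i, k):
    new = list(states)
    if i == 0:
        new[0] = (states[-1] + 1) % k   # process 0 increments its value
    else:
        new[i] = states[i - 1]          # other processes copy the predecessor
    return new

# From an arbitrary (illegitimate) start, a central-daemon run converges
# to legitimate states, in which exactly one process holds a token:
states, k = [3, 1, 4, 1], 5             # K >= number of processes
for _ in range(50):
    states = fire(states, enabled(states)[0], k)
assert len(enabled(states)) == 1        # exactly one token circulates
```

The legitimate states here are exactly those with one token; stabilization means every execution eventually reaches and stays in that set.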
IEEE Transactions on Parallel and Distributed Systems (February 2016)

Optimizing End-to-End Big Data Transfers over Terabits Network Infrastructure

While future terabit networks hold the promise of significantly improving big-data movement among geographically distributed data centers, significant challenges must be overcome even on today's 100-gigabit networks to realize end-to-end performance. Multiple bottlenecks exist along the end-to-end path from source to sink; for instance, the data storage infrastructure at both the source and sink, and its interplay with the wide-area network, is increasingly the bottleneck to achieving high performance. In this paper, we identify the issues that lead to congestion on the path of an end-to-end data transfer in the terabit network environment, and we present a new bulk data movement framework for terabit networks, called LADS. LADS exploits the underlying storage layout at each endpoint to maximize throughput without negatively impacting the performance of shared storage resources for other users. LADS also uses the Common Communication Interface (CCI), in lieu of the sockets interface, to benefit from hardware-level zero-copy and operating-system bypass capabilities when available. It can further improve data transfer performance under congestion on the end systems by buffering data at the source in flash storage. With our evaluations, we show that LADS can avoid congested storage elements within the shared storage resource, improving input/output bandwidth and data transfer rates across high-speed networks. We also investigate the performance degradation of LADS due to I/O contention on the parallel file system (PFS) when multiple LADS tools share the PFS. We design and evaluate a meta-scheduler that coordinates multiple I/O streams sharing the PFS in order to minimize I/O contention. With our evaluations, we observe that LADS with meta-scheduling can further improve performance by up to 14% relative to LADS without meta-scheduling.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

Analysis of parallel computing strategies to accelerate ultrasound imaging processes

This work analyses the use of parallel processing techniques in synthetic aperture ultrasonic imaging applications. In particular, the Total Focussing Method, which is an O(N²P) problem, is studied. The work presents different parallelization strategies for multicore CPU and GPU architectures. The parallelization processes on both platforms are discussed and optimized in order to achieve real-time performance.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Improving Performance of Parallel I/O Systems through Selective and Layout-Aware SSD Cache

Parallel file systems (PFS) are widely used to ease the I/O bottleneck of modern high-performance computing systems.
However, PFSs do not work well for small requests, especially small random requests. Newer Solid State Drives (SSDs) have excellent performance on small random data accesses but incur a high monetary cost. In this study, we propose SLA-Cache, a Selective and Layout-Aware Cache system that employs a small set of SSD-based file servers as a cache for conventional HDD-based file servers. SLA-Cache uses a novel scheme to identify performance-critical data and applies a selective cache admission (SCA) policy to fully utilize the SSD-based file servers. Moreover, since the data layout of the cache system also strongly influences access performance, SLA-Cache applies a layout-aware cache placement scheme (LCP) to store data on the SSD-based file servers. By storing data with the optimal layout, i.e., the one with the lowest access cost among three typical layout candidates, LCP can further improve system performance. We have implemented SLA-Cache under the MPICH2 I/O library. Experimental results show that SLA-Cache can significantly improve I/O throughput and is a promising approach for parallel applications.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing

Modern-day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at delivering highly reliable cloud computing services. Although offering equal reliability to all users benefits everyone at the same time, users may find such an approach either inadequate or too expensive for their individual requirements, which can vary dramatically. In this paper, we propose a novel method for providing elastic reliability optimization in cloud computing. Our scheme makes use of peer-to-peer checkpointing and allows user reliability levels to be jointly optimized based on an assessment of their individual requirements and the total available resources in the data center. We show that the joint optimization can be solved efficiently by a distributed algorithm using dual decomposition. The solution improves resource utilization and presents an additional source of revenue to data center operators. Our validation results suggest a significant improvement in reliability over existing schemes.
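The flavor of such a distributed dual-decomposition scheme can be sketched with a toy problem. All names, the log utility, and the step size below are illustrative assumptions, not the paper's model: each user independently picks a checkpointing rate against a price that a coordinator adjusts by subgradient ascent until a shared capacity constraint is met.

```python
# Toy dual-decomposition sketch (illustrative, not the paper's model):
# user i maximizes w_i*log(1 + r_i) - lam*r_i for its rate r_i, under a
# shared capacity sum(r_i) <= R. A coordinator updates the price lam;
# users respond independently -- this separation is the "distributed" part.

def best_response(w, lam):
    # argmax_{r >= 0} of w*log(1+r) - lam*r  =>  r = max(0, w/lam - 1)
    return max(0.0, w / lam - 1.0)

def solve(weights, capacity, iters=2000, step=0.01):
    lam = 1.0
    for _ in range(iters):
        rates = [best_response(w, lam) for w in weights]
        # Subgradient price update: raise lam if demand exceeds capacity.
        lam = max(1e-6, lam + step * (sum(rates) - capacity))
    return [best_response(w, lam) for w in weights]

rates = solve([1.0, 2.0, 4.0], capacity=10.0)
# Users with higher weight (stricter reliability needs) obtain higher
# rates, and total demand converges to the shared capacity.
```

The coordinator never needs the users' utility functions, only their aggregate demand at the current price, which is what makes the algorithm distributed.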
IEEE Transactions on Parallel and Distributed Systems (May 2016)

A Taxonomy of Job Scheduling on Distributed Computing Systems

Hundreds of papers on job scheduling for distributed systems are published every year, and it becomes increasingly difficult to classify them. Our analysis revealed that half of these papers are barely cited. This paper presents a general taxonomy for scheduling problems and solutions in distributed systems. The taxonomy was used to classify 109 scheduling problems and their solutions and to make the classification publicly available. These 109 problems were further clustered into ten groups based on the features of the taxonomy. The proposed taxonomy will help researchers build on prior art, increase the visibility of new research, and minimize redundant effort.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because there is no indication that errors have occurred during the execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss their runtime data features, as well as the impact of SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges to guarantee low false-alarm rates. Experiments show that our detector can detect 80-99.99% of SDCs with a false alarm rate of less than 1% of iterations in most cases. The memory cost and detection overhead are reduced to 15% and 6.3%, respectively, for a large majority of applications.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Time Series-Oriented Load Prediction Model and Migration Policies for Distributed Simulation Systems

HLA-based simulation systems are prone to load imbalances due to the lack of management of shared resources in distributed environments. Such imbalances cause these simulations to lose performance in terms of execution time. As a result, many dynamic load balancing systems have been introduced to manage distributed load. These systems use specific methods, depending on load or application characteristics, to perform the required balancing. Load prediction is a technique that has been used extensively to enhance load redistribution heuristics towards preventing load imbalances.
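A classic time-series predictor for this kind of load prediction is Holt's linear-trend (double exponential smoothing) model. A minimal sketch follows; the smoothing constants and the initialization are illustrative choices, not the configuration used in the paper.

```python
def holt_forecast(series, alpha=0.5, beta=0.3):
    # Holt's linear trend model: maintain a smoothed level and trend,
    # and forecast one step ahead as level + trend.
    level, trend = series[0], series[1] - series[0]
    forecasts = []
    for x in series[1:]:
        forecasts.append(level + trend)  # forecast made before observing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts, level + trend      # per-step forecasts, next prediction
```

On a perfectly linear load series the model tracks the trend exactly; real loads need the corrections and migration policies the abstract below describes.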
In this paper, several efficient time series model variants are presented and used to enhance prediction precision for large-scale distributed simulation-based systems. These variants extend Holt's model for time series and correct issues originating from its implementation in the predictive module of a dynamic load balancing system for HLA-based distributed simulations. A set of migration decision-making techniques is also proposed to make a prediction-based load balancing system independent of any particular prediction model, promoting a more modular construction.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

Enabling Parallel Simulation of Large-Scale HPC Network Systems
With the increasing complexity of today's high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems, networks in particular. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today's IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at flit-level detail, using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today's high-performance cluster systems.
Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered alongside high-fidelity network simulations.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

An Evolutionary Optimal Fuzzy System with Information Fusion of Heterogeneous Distributed Computing and Polar-Space Dynamic Model for Online Motion Control of Swedish Redundant Robots

This paper presents an evolutionary optimal fuzzy system with information fusion of heterogeneous distributed computing and a polar-space dynamic model for online motion control of Swedish redundant robots. The intelligent fuzzy system is incorporated with the parallel metaheuristic BFO (Bacteria Foraging Optimization)-AIS (Artificial Immune System), called FS-PBFOAIS, and realized on a field-programmable gate array (FPGA) for optimal polar-space online motion control of four-wheeled redundant mobile robots. This hybrid paradigm gains the
benefits of the Taguchi quality method, BFO, AIS, distributed processing, and FPGA techniques. Experimental results demonstrate the effective optimization and high accuracy of the proposed FPGA-based FS-PBFOAIS tracking controller. Finally, comparative studies demonstrate the superiority of the FPGA-based FS-PBFOAIS polar-space redundant controller over conventional control methods.

IEEE Transactions on Industrial Electronics (May 2016)

Cache Line Aware Algorithm Design for Cache-Coherent Architectures

The increase in the number of cores per processor and the complexity of memory hierarchies make cache coherence key to the programmability of current shared memory systems. However, ignoring its detailed architectural characteristics can harm performance significantly. In order to assist performance-centric programming, we propose a methodology that allows semi-automatic performance tuning with a systematic translation from an algorithm to an analytic performance model for cache line transfers. For this, we design a simple interface for cache line aware optimization, a translation methodology, and a full performance model that exposes the block-based design of caches to middleware designers. We investigate two different architectures to show the applicability of our techniques and methods: the many-core accelerator Intel Xeon Phi and a multi-core processor with a NUMA configuration (Intel Sandy Bridge). We use mathematical optimization techniques to tune synchronization algorithms to the microarchitectures, identifying three techniques to design and optimize data transfers in our model: single-use, single-step broadcast, and private cache lines.
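The block-based cache model above rests on a simple observation: the cost of a contiguous access is governed by how many cache lines it touches. A minimal sketch of that counting rule (the 64-byte line size and the function name are illustrative assumptions, not the paper's interface):

```python
def lines_touched(offset, nbytes, line_size=64):
    # A contiguous access of `nbytes` starting at byte `offset` touches
    # every line from floor(offset/L) through floor((offset+nbytes-1)/L).
    first = offset // line_size
    last = (offset + nbytes - 1) // line_size
    return last - first + 1
```

Misalignment costs an extra transfer: 64 bytes read from offset 0 touch one line, while the same 64 bytes from offset 60 straddle two. Avoiding such effects is exactly what cache line aware algorithm design is about.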
IEEE Transactions on Parallel and Distributed Systems (January 2016)

CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-Wide Alignment in GPU Clusters

This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the
application scalability. In order to obtain a highly parallel multi-GPU traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far and producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 peta-cells, obtaining the optimal local alignments of all human and chimpanzee homologous chromosomes, whose sizes range from 26 million base pairs (MBP) up to 249 MBP. As far as we know, this is the first time such a comparison was made with the exact SW method. We also show that the IST algorithm reduces the traceback time by factors ranging from 2.15x up to 21.03x when compared with the baseline traceback algorithm. The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10,370.00 GCUPS (billions of cells updated per second) using 384 GPUs, with a speculation hit ratio of 98.2%.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Xscale: Online X-code RAID-6 Scaling Using Lightweight Data Reorganization

Disk additions to a RAID-6 storage system can simultaneously increase the I/O parallelism and expand the storage capacity. To regain a balanced load among both old and new disks, RAID-6 scaling requires moving certain data blocks onto newly added disks. Existing approaches to RAID-6 scaling are restricted by preserving a round-robin data distribution and require migrating all the data, resulting in an expensive scaling cost. In this paper, we propose Xscale, a new approach to accelerating X-code RAID-6 scaling by using lightweight data reorganization.
Xscale minimizes the number of data blocks that need to be moved while maintaining a uniform data distribution across all disks. Furthermore, Xscale eliminates metadata updates while guaranteeing data consistency and reliability. Compared with the round-robin approach, Xscale reduces the number of blocks to be moved by 63.6-89.5%, decreases the reorganization time by 35.62-37.26%, and reduces the I/O latency by 23.29-37.74% while the scaling programs are running in the background. In addition, there is no penalty in the performance of the data layout after scaling with Xscale, compared with the layouts maintained by other existing scaling approaches.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

The Importance of Worker Reputation Information in Microtask-Based Crowd Work Systems
This paper presents the first systematic investigation of the potential performance gains for crowd work systems deriving from information available at the requester about individual worker reputation. In particular, we first formalize the optimal task assignment problem when workers' reputation estimates are available, as the maximization of a monotone (submodular) function subject to matroid constraints. Then, since the optimal problem is NP-hard, we propose a simple but efficient greedy heuristic task allocation algorithm. We also propose a simple "maximum a-posteriori" decision rule and a decision algorithm based on message passing. Finally, we test and compare different solutions, showing that system performance can greatly benefit from information about workers' reputation. Our main findings are that: i) even largely inaccurate estimates of workers' reputation can be effectively exploited in the task assignment to greatly improve system performance; ii) the performance of the maximum a-posteriori decision rule quickly degrades as worker reputation estimates become inaccurate; iii) when workers' reputation estimates are significantly inaccurate, the best performance can be obtained by combining our proposed task assignment algorithm with the message-passing decision algorithm.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

On Data Integrity Attacks against Real-time Pricing in Energy-based Cyber-Physical Systems

In this paper, we investigate a novel real-time pricing scheme that considers both renewable energy resources and traditional power resources and can effectively guide the participants to achieve individual welfare maximization in the system.
To be specific, we develop a Lagrangian-based approach to transform the global optimization conducted by the power company into distributed optimization problems that yield explicit energy consumption, supply, and price decisions for individual participants. We also show that these distributed problems derived from the power company's global optimization are consistent with the individual welfare maximization problems of end-users and traditional power plants. We then investigate and formalize the vulnerabilities of the real-time pricing scheme by considering two types of data integrity attacks: ex-ante attacks and ex-post attacks, launched by the adversary before or after the decision-making process, respectively. We systematically analyze the welfare impacts of these attacks on the real-time pricing scheme. Through a combination of theoretical analysis and performance evaluation, our results show that the proposed real-time pricing scheme could
effectively guide the participants to achieve welfare maximization, while cyber-attacks could significantly disrupt the results of real-time pricing decisions, imposing welfare reduction on the participants.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

A Fast and Accurate Hardware String Matching Module with Bloom Filters

Many fields of computing, such as Deep Packet Inspection (DPI), employ string matching modules (SMM) that search for a given set of positive strings in their input. An SMM is expected to produce correct outcomes while scanning the input data at high rates. Furthermore, the string sets that are searched for are usually large, and their sizes increase steadily. Bloom Filters (BFs) are fast hashing data structures, but their false positive results require further processing. That is, their speed can be exploited for Standard Bloom Filter SMMs (SBFs) as long as the positive probability is low. Multiple BFs in parallel can further increase the throughput. In this paper, we propose the Double Bloom Filter SMM (DBF), which achieves a higher throughput than the SBF and maintains a high throughput even for large positive probabilities. The second Bloom Filter of the DBF stores a subset of the positive strings small enough that its false positive probability is approximately zero. We develop an analytical model of the DBF and show that the throughput advantage of DBF over SBF becomes more prominent as the positive probability and the fraction of matches in the second Bloom Filter increase. Accordingly, we propose a heuristic algorithm that stores the strings that are more frequently matched in the second Bloom Filter, according to localities identified in the input.
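As background, a minimal Bloom filter, the building block of both the SBF and DBF designs above, can be sketched as follows. The bit-array size, hash count, and use of SHA-256 are illustrative software choices, not the paper's hardware design.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # No false negatives; false positives possible when unrelated
        # items happen to have set all k positions.
        return all(self.bits[pos] for pos in self._positions(item))
```

Queries for inserted strings always return true; the false-positive rate is controlled by the array size m and hash count k, and it is precisely this rate that the DBF's second filter drives toward zero for the frequently matched strings.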
Our numerical results are obtained using realistic values from an FPGA implementation and are validated by SystemC simulations.

IEEE Transactions on Parallel and Distributed Systems (June 2016)

Seer Grid: Privacy and Utility Implications of Two-Level Load Prediction in Smart Grids

We propose "Seer Grid", a novel two-level energy consumption prediction framework for smart grids, aimed at decreasing the trade-off between the privacy requirements of the customer and the data utility requirements of the energy company (EC). The first-level prediction at the household level is performed by each smart meter (SM), and the predicted energy consumption pattern (instead of the actual energy usage data) is reported to a cluster head (CH). Then, a second-level prediction at the neighborhood level is done by the CH, which predicts the energy spikes in the
neighborhood or cluster and shares them with the EC. Our two-level prediction mechanism is designed such that it preserves the correlation between the predicted and actual energy consumption patterns at the cluster level and removes this correlation in the predicted data communicated by each SM to the CH. This maintains the usefulness of the cluster-level energy consumption data communicated to the EC, while preserving the privacy of the household-level energy consumption data against the CH (and thus the EC). Our evaluation results show that Seer Grid succeeds in hiding private consumption patterns at the household level while still accurately predicting energy consumption at the neighborhood level.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

Many-Core Real-Time Task Scheduling with Scratchpad Memory

This work is motivated by the demand for scheduling tasks upon the increasingly popular island-based many-core architectures. On such an architecture, homogeneous cores are grouped into islands, each of which is equipped with a scratchpad memory module (referred to as local memory). We first show the NP-hardness and the inapproximability of the scheduling problem. Despite the inapproximability, positive results can still be found when different cases of the problem are investigated. A (3 − 1/F)-approximation algorithm is proposed for the minimization of the maximum system utilization, where F is the number of cores in the platform. When the technique of resource augmentation is considered, this paper further develops a (γ + 1)-memory, (2γ − 1)/(γ − 1)-approximation algorithm, where γ represents the trade-off between CPU utilization and local memory space.
On the other hand, a special case is also considered in which the ratio of the worst-case execution time of a task without and with the use of local memory is bounded by a constant. The capabilities of the proposed algorithms are then evaluated with benchmarks from MRTC, UTDSP, NetBench, and DSPstone, where the maximum system utilization can be significantly reduced even when the local memory size is only 5% of the total footprint of all the tasks.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when
these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieve good performance and scalability in computing maximum cardinality matchings in bipartite graphs. Our algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single-source algorithms. Unlike algorithms that rely on single-source searches, algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from it. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. We provide a proof of correctness for our algorithm. Our NUMA-aware implementation scales to 80 threads on an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with a small matching number.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

A Group-Ordered Fast Iterative Method for Eikonal Equations

In the past decade, many numerical algorithms for the Eikonal equation have been proposed.
Recently, research on Eikonal equation solvers has focused more on developing efficient parallel algorithms in order to leverage the computing power of parallel systems, such as multi-core CPUs and GPUs (Graphics Processing Units). In this paper, we introduce an efficient parallel algorithm that extends Jeong et al.'s FIM (Fast Iterative Method, [1]), originally developed for the GPU, to multi-core shared memory systems. First, we propose a parallel implementation of FIM using a lock-free local queue approach and provide an in-depth analysis of the parallel performance of the method. Second, we propose a new parallel algorithm, Group-Ordered Fast Iterative Method (GO-FIM), that exploits the causality of grid blocks to reduce redundant computations, which were the main drawback of the original FIM. In addition, the proposed GO-FIM method employs clustering of blocks based on the updating order, where each
cluster can be updated in parallel using multi-core parallel architectures. We discuss the performance of GO-FIM and compare it with state-of-the-art parallel Eikonal equation solvers.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

Optimal Reconfiguration of High-Performance VLSI Subarrays with Network Flow

A two-dimensional mesh-connected processor array is an extensively investigated architecture used in parallel processing. Many studies have addressed the use of reconfiguration algorithms for processor arrays with faults. However, the subarrays generated by previous algorithms contain a large number of long interconnects, which in turn lead to higher communication costs, capacitance, and dynamic power dissipation. In this paper, we propose novel techniques, based on the idea of network flow, to construct a high-performance subarray with the minimum number of long interconnects. First, we construct a network flow model according to the host array under a specific constraint. Second, we show that the reconfiguration problem for high-performance subarrays can be solved optimally in polynomial time by using efficient minimum-cost flow algorithms. Finally, we prove that the geometric properties of the resulting subarray meet the system requirements. Simulations based on several random and clustered fault scenarios clearly reveal the advantage of the proposed technique in reducing the number of long interconnects. It is shown that, for a host array of size 512 × 512, the number of long interconnects in the subarray can be reduced by up to 70.05% for clustered faults and by up to 55.28% for random faults with a density of 1%, as compared to the state of the art.
IEEE Transactions on Parallel and Distributed Systems (March 2016)

DREAM-(L)G: A Distributed Grouping-based Algorithm for Resource Assignment for Bandwidth-Intensive Applications in the Cloud

Increasingly, many bandwidth-intensive applications have been ported to the cloud platform. In practice, however, disadvantages including equipment failures, bandwidth overload, and long-distance transmission often degrade QoS in terms of data availability, bandwidth provision, and access locality, respectively. Some recent solutions have been proposed to cope with one or two of these disadvantages, but not all of them. Moreover, as the number of data objects scales, most of the current offline algorithms, which solve a constraint optimization problem, suffer from low computational efficiency. To overcome these problems, in this paper we propose an approach that aims to make fully efficient use of cloud resources to enable bandwidth-intensive
applications to achieve the desired level of SLA-specified QoS cost-effectively and in a timely manner. First, we devise a constraint-based model that describes the relationship among data object placement, bandwidth allocation for user cells, operating costs, and QoS constraints. Second, we present a distributed heuristic algorithm, called DREAM-L, that solves the model and produces a budget solution meeting the SLA-specified QoS. Third, we propose an object-grouping technique that is integrated into DREAM-L, called DREAM-LG, to significantly improve the computational efficiency of our algorithm. The results of hundreds of thousands of simulation-based experiments demonstrate that DREAM-LG provides much better data availability, bandwidth provision, and access locality than the state-of-the-art solutions, at modest cloud operating costs and within a small and acceptable amount of time.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal Systems of Linear Equations

Quasi-tridiagonal systems of linear equations arise in numerical simulations, and existing algorithms face great challenges in solving quasi-tridiagonal systems with more than millions of dimensions as the scale of problems increases. We present a solving method that mixes direct and iterative methods and needs less storage space during computation. With our method, a quasi-tridiagonal matrix is split into a tridiagonal matrix and a sparse matrix, and the tridiagonal equation can then be solved by direct methods within the iteration process.
Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the quasi-tridiagonal system of linear equations can be improved. Furthermore, we present an improved cyclic reduction algorithm that uses a partition strategy to solve tridiagonal equations on the GPU, with the intermediate data stored in shared memory so as to significantly reduce the latency of memory access. According to our experiments on 10 test cases, the average number of iterations of our method is significantly lower than those of Jacobi, GS, GMRES, and BiCG, and close to those of BiCGSTAB, BiCRSTAB, and TFQMR. In parallel mode, the computing efficiency of our method is raised by the partition strategy, and its performance is better than that of the commonly used iterative and direct methods because each iteration involves less computation.
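The direct tridiagonal solve used inside each iteration can be done serially with the classic Thomas algorithm; the paper's cyclic-reduction variant parallelizes this step on the GPU. A minimal serial sketch under the usual conventions (a is the sub-diagonal with a[0] unused, c the super-diagonal with c[n-1] unused):

```python
def thomas_solve(a, b, c, d):
    # Solve a tridiagonal system Ax = d in O(n):
    # forward elimination of the sub-diagonal, then back substitution.
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = (c[i] / denom) if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The sequential data dependence in both sweeps is what motivates cyclic reduction and partitioned solvers on parallel hardware.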
IEEE Transactions on Parallel and Distributed Systems (January 2016)

Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors

The increasing number of cores on a chip has made the Network-on-Chip (NoC) concept the standard communication paradigm for chip multiprocessors. A fault in an NoC leads to undesirable ramifications that can severely impact the performance of a chip. Therefore, it is vital to design fault-tolerant NoCs. In this paper, we present Shield, a reliable NoC router architecture that has the unique ability to tolerate both hard and soft errors in the routing pipeline using techniques such as spatial redundancy, exploitation of idle cycles, bypassing of faulty resources, and selective hardening. Using the Mean Time to Failure and Silicon Protection Factor metrics, we show that Shield is six times more reliable than the baseline unprotected router and is at least 1.5 times more reliable than existing fault-tolerant router architectures. We introduce a new metric called the Soft Error Improvement Factor and show that the soft error tolerance of Shield improves by three times in comparison to the baseline unprotected router. This reliability improvement is accomplished by incurring area and power overheads of 34% and 31%, respectively. Latency analysis using SPLASH-2 and PARSEC reveals that, in the presence of faults, latency increases by a modest 13% and 10%, respectively.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

An Energy-Efficient Directory Based Multicore Architecture with Wireless Routers to Minimize the Communication Latency

Multicore architectures suffer from high core-to-core communication latency, primarily due to the cache's dynamic behavior.
Studies suggest that a directory approach can help reduce communication latency by storing the cached block information. Recent studies also indicate that wireless routers have the potential to decrease communication latency in multicore architectures. In this work, we propose a directory-based multicore architecture with wireless routers to minimize communication latency. We simulate systems with a mesh (used in the Stanford Directory Architecture for SHared memory (DASH)), a wireless network-on-chip (WNoC), and the proposed directory-based architecture with wireless routers. According to the experimental results, our proposed architecture outperforms the WNoC and mesh architectures. It is observed that the proposed architecture helps decrease the communication delay by up to 15.71% and the total power consumption by up to 67.58% when compared with
the mesh architecture. Similarly, the proposed architecture helps decrease the communication delay by up to 10.00% and the total power consumption by up to 58.10% when compared with the WNoC architecture. This is because the proposed directory-based mechanism helps reduce the number of core-to-core communications, while the wireless routers help reduce the total number of hops.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

Trajectory Pattern Mining for Urban Computing in the Cloud

The increasing pervasiveness of mobile devices, along with the use of technologies like GPS, Wi-Fi networks, RFID, and sensors, allows for the collection of large amounts of movement data. These data can be analyzed to extract descriptive and predictive models that can be properly exploited to improve urban life. From a technological viewpoint, Cloud computing can play an essential role by helping city administrators quickly acquire new capabilities and by reducing initial capital costs through a comprehensive pay-as-you-go solution. This paper presents a workflow-based parallel approach for discovering patterns and rules from trajectory data in a Cloud-based framework. Experimental evaluation has been carried out on both real-world and synthetic trajectory data, with up to one million trajectories. The results show that, due to the high complexity and large volumes of data involved in the application scenario, the trajectory pattern mining process benefits from the scalable execution environment offered by a Cloud architecture in terms of execution time, speed-up, and scale-up.
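A toy version of the trajectory pattern mining task above is counting frequent region-to-region transitions across trajectories. This is only an illustrative sketch: the region names and support threshold are made up, and the paper's workflow mines far richer sequential patterns than single transitions.

```python
from collections import Counter

def frequent_transitions(trajectories, min_support):
    # Each trajectory is a sequence of visited regions; count every
    # consecutive (from, to) move and keep those meeting the support.
    counts = Counter()
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[(a, b)] += 1
    return {move: n for move, n in counts.items() if n >= min_support}
```

The per-trajectory counting step is embarrassingly parallel, which is what lets a workflow engine distribute it across Cloud nodes before a final merge of the counters.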
IEEE Transactions on Parallel and Distributed Systems (May 2016)

FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by identifying a serious performance problem in existing parallel Frequent Itemset Mining algorithms: given a large dataset, the data partitioning strategies in existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among
transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by virtue of eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31%, with an average of 18%. IEEE Transactions on Parallel and Distributed Systems (April 2016)

DistR: A Distributed Method for the Reachability Query over Large Uncertain Graphs

Among uncertain graph queries, reachability, i.e., the probability that one vertex is reachable from another, is likely the most fundamental one. Although this problem has been studied within the field of network reliability, existing solutions are implemented on a single computer and can only handle small graphs. However, as the size of graph applications continually increases, the corresponding graph data can no longer fit within a single computer's memory and must therefore be distributed across several machines. Furthermore, the computation of probabilistic reachability queries is #P-complete, making it very expensive even on small graphs. In this paper, we develop an efficient distributed strategy, called DistR, to solve the reachability query problem over large uncertain graphs. Specifically, we perform the task in two steps: distributed graph reduction and distributed consolidation.
In the distributed graph reduction step, we find all maximal subgraphs of the original graph whose reachability probabilities can be calculated in polynomial time, compute them, and reduce the graph accordingly. After this step, only a small graph remains. In the distributed consolidation step, we transform the problem into a relational join process and provide an approximate answer to the #P-complete reachability query. Extensive experimental studies show that our distributed approach is efficient in terms of both computational and communication costs, and has high accuracy. IEEE Transactions on Parallel and Distributed Systems (February 2016)

A Constraint Programming Scheduler for Heterogeneous High-Performance Computing Machines
Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the key role of mapping jobs to the available resources, trying to maximize performance and Quality-of-Service (QoS). Allocation and scheduling in the general case are well-known NP-hard problems, forcing commercial schedulers to adopt greedy approaches to improve performance and QoS. Search-based approaches that explore the solution space have seldom been employed in this setting, and have mostly been applied in off-line scenarios. In this paper, we present the first search-based approach to job allocation and scheduling for HPC machines working in a production environment. The scheduler is based on Constraint Programming, an effective programming technique for optimization problems. The resulting scheduler is flexible, as it can be easily customized to deal with heterogeneous resources, user-defined constraints, and different metrics. We evaluate our solution both on virtual machines using synthetic workloads and on the Eurora HPC machine with production workloads. Tests over a wide range of operating conditions show significant improvements in waiting times and QoS on mid-tier HPC machines w.r.t. state-of-the-art commercial rule-based dispatchers. Furthermore, we analyze the conditions under which our approach outperforms commercial approaches, to create a portfolio of scheduling algorithms that ensures robustness, flexibility, and scalability. IEEE Transactions on Parallel and Distributed Systems (January 2016)

Enabling data-centric distribution technology for partitioned embedded systems

Modern complex embedded systems are evolving into mixed-criticality systems in order to satisfy a wide set of non-functional requirements such as security, cost, weight, timing, or power consumption.
Partitioning is an enabling technology for this purpose, as it provides an environment with strong temporal and spatial isolation, which allows the integration of applications with different requirements onto a common hardware platform. At the same time, embedded systems are increasingly networked (e.g., cyber-physical systems) and may even require global connectivity in open environments, so enhanced communication mechanisms are needed to develop distributed partitioned systems. To this end, this work proposes an architecture to enable the use of data-centric real-time distribution middleware in partitioned embedded systems based on a hypervisor. This architecture relies on distribution middleware and a set of virtual devices to provide mixed-criticality partitions with a homogeneous and interoperable communication subsystem. The results obtained show that this
approach provides low overhead and a reasonable trade-off between temporal isolation and performance. IEEE Transactions on Parallel and Distributed Systems (February 2016)

A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously

Intelligent partitioning models are commonly used for efficient parallelization of irregular applications on distributed systems. These models usually aim to minimize a single communication cost metric, which is related either to communication volume or to message count. However, both volume- and message-related metrics should be taken into account during partitioning for a more efficient parallelization. Only a few works consider both, and they usually address each metric in a separate phase of a two-phase approach. In this work, we propose a recursive hypergraph bipartitioning framework that reduces the total volume and total message count in a single phase. In this framework, the standard hypergraph models, whose nets already capture the bandwidth cost, are augmented with message nets. The message nets encode the message count, so that minimizing the conventional cutsize captures the minimization of bandwidth and latency costs together. Our model provides a more accurate representation of the overall communication cost by incorporating both the bandwidth and the latency components into the partitioning objective. The use of the widely adopted recursive bipartitioning framework provides the flexibility of using any existing hypergraph partitioner. Experiments on instances from different domains show that our model achieves up to a 52% reduction in total message count, and hence a 29% reduction in parallel running time, compared to the model that considers only the total volume.
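The two communication metrics contrasted above can be made concrete on a toy example. This sketch (hypothetical data, not the authors' partitioner) computes, for a given vertex-to-part assignment, the total volume under the connectivity-minus-one metric and the number of distinct sender-receiver message pairs, under the simplifying assumption that each net is communicated by the part of its first pin:

```python
def comm_costs(nets, part):
    """Given hyperedges (each a list of vertex ids) and a vertex->part map,
    return (total_volume, message_count): volume under the lambda-1 metric,
    messages as distinct (sender, receiver) part pairs, with each net
    assumed to be sent by the part owning its first pin."""
    volume = 0
    messages = set()
    for net in nets:
        parts = {part[v] for v in net}   # parts this net spans
        volume += len(parts) - 1         # connectivity-minus-one metric
        owner = part[net[0]]
        for p in parts - {owner}:
            messages.add((owner, p))     # one message per receiver part
    return volume, len(messages)

# Toy hypergraph: 4 nets over 6 vertices placed on 3 parts.
nets = [[0, 1, 2], [2, 3], [3, 4, 5], [0, 5]]
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(comm_costs(nets, part))
```

A partitioner that minimizes only the first number can still leave the second number large, which is exactly the gap the message-net augmentation described above is meant to close.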
IEEE Transactions on Parallel and Distributed Systems (June 2016)

Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data Processing Frameworks

The shift to the in-memory data processing paradigm has had a major influence on the development of cluster data processing frameworks. Numerous frameworks from industry, the open source community, and academia are adopting the in-memory paradigm to achieve functionality and performance breakthroughs. However, despite the advantages of these in-memory frameworks, in practice they are susceptible to memory-pressure-related performance
collapse and failures. The contributions of this paper are two-fold. First, we conduct a detailed diagnosis of the memory pressure problem and identify three preconditions for the performance collapse. These preconditions not only explain the problem but also shed light on possible solution strategies. Second, we propose a novel programming abstraction called the leaky buffer that eliminates one of the preconditions, thereby addressing the underlying problem. We have implemented a leaky-buffer-enabled hashtable in Spark, and we believe it can also substitute for hashtables that perform similar hash aggregation operations in other programs or data processing frameworks. Experiments on a range of memory-intensive aggregation operations show that the leaky buffer abstraction can drastically reduce the occurrence of memory-related failures, improve performance by up to 507%, and reduce memory usage by up to 87.5%. IEEE Transactions on Parallel and Distributed Systems (March 2016)

Parity-Switched Data Placement: Optimizing Partial Stripe Writes in XOR-Coded Storage Systems

Erasure codes tolerate disk failures by pre-storing a low degree of data redundancy, and have been commonly adopted in current storage systems. However, the attached requirement of data consistency amplifies partial stripe write operations and thus seriously degrades system performance. Previous works on optimizing partial stripe writes are relatively limited, and a general mechanism is still absent. In this paper, we propose Parity-Switched Data Placement (PDP) to optimize partial stripe writes for any XOR-coded storage system. PDP first reduces write operations by arranging continuous data elements to join a common parity element's generation.
To achieve deeper optimization, PDP further explores the generation orders of parity elements and makes any two continuous data elements associate with a common parity element. Intensive evaluations show that, for the tested erasure codes, PDP reduces up to 31.9% of write operations and further increases the write speed by up to 59.8% when compared with two state-of-the-art data placement methods. IEEE Transactions on Parallel and Distributed Systems (February 2016)

VINEA: An Architecture for Virtual Network Embedding Policy Programmability

Network virtualization has enabled new business models by allowing infrastructure providers to lease or share their physical network. A fundamental management problem that cloud providers
face in supporting customized virtual network (VN) services is virtual network embedding. This requires solving the (NP-hard) problem of matching constrained virtual networks onto the physical network. In this paper we present VINEA, a policy-based virtual network embedding architecture, and its system implementation. VINEA leverages our previous results on VN embedding optimality and convergence guarantees, and it is based on a network utility maximization approach that separates policies (i.e., high-level goals) from the underlying embedding mechanisms: resource discovery, virtual network mapping, and allocation on the physical infrastructure. We show how VINEA can subsume existing embedding approaches, and how it can be used to design novel solutions that adapt to different scenarios merely by instantiating different policies. We describe the VINEA architecture, as well as our object model: our VINO protocol and the API for programming the embedding policies; we then analyze key representative tradeoffs among novel and existing VN embedding policy configurations, via event-driven simulations and with our prototype implementation. Among our findings, our evaluation shows how, in contrast to existing solutions, simultaneously embedding nodes and links may lead to lower providers' revenue. We release our implementation on a testbed that uses a Linux system architecture to reserve virtual node and link capacities. Our prototype can also be used to augment existing open-source "Networking as a Service" architectures such as OpenStack Neutron, which currently lacks a VN embedding protocol, and as a policy-programmable solution to the "slice stitching" problem within wide-area virtual network testbeds.
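As a minimal illustration of the node-mapping subproblem mentioned above, the sketch below greedily assigns virtual nodes with CPU demands to the substrate node with the most residual capacity. This is a deliberately simple baseline, not VINEA's distributed, policy-driven mechanism; all names and numbers are hypothetical:

```python
def greedy_embed(vn_nodes, substrate_cpu):
    """Greedy virtual-node embedding: place each virtual node (largest
    CPU demand first) on the substrate node with the most residual CPU.
    Returns a virtual->substrate mapping, or None if some demand cannot
    be satisfied (the embedding request is rejected)."""
    residual = dict(substrate_cpu)
    mapping = {}
    for vnode, demand in sorted(vn_nodes.items(), key=lambda kv: -kv[1]):
        host = max(residual, key=residual.get)  # least-loaded substrate node
        if residual[host] < demand:
            return None                          # request rejected
        mapping[vnode] = host
        residual[host] -= demand
    return mapping

vn = {"a": 30, "b": 20, "c": 10}     # virtual nodes and CPU demands
substrate = {"s1": 50, "s2": 25}     # substrate nodes and CPU capacities
print(greedy_embed(vn, substrate))
```

In a policy-programmable design like the one described above, the ordering rule and the host-selection rule are exactly the points one would expose as swappable policies, rather than hard-coding the two lambdas used here.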
IEEE Transactions on Parallel and Distributed Systems (February 2016)

Application control configurations for parallel connection of single-phase energy conversion units operating in island mode

This paper presents the design and implementation of controllers for the parallel connection of single-phase energy conversion units in island-mode applications. To meet this objective, two control configurations are implemented, taking as reference the output voltage of a droop scheme. These controllers are: a two-degrees-of-freedom controller plus a repetitive controller, and a proportional-integral/proportional controller plus a resonant controller. These control configurations are intended to maintain the amplitude, waveform, and frequency of the voltage signal and to handle increases in linear and nonlinear load during island-mode operation of a single-phase
energy conversion unit. That is, with these control strategies, several inverters connected in parallel to a microgrid can operate as voltage sources, sharing the active and reactive power demanded by the load. IEEE Latin America Transactions (March 2016)

PathGraph: A Path Centric Graph Processing System

Large-scale iterative graph computation presents an interesting systems challenge due to two well-known problems: (1) the lack of access locality and (2) the lack of storage efficiency. This paper presents PathGraph, a system for improving iterative graph computation on graphs with billions of edges. First, we improve the memory and disk access locality of iterative computation algorithms on large graphs by modeling a large graph as a collection of tree-based partitions. This enables us to use path-centric computation rather than vertex-centric or edge-centric computation. For each tree partition, we re-label vertices using DFS in order to preserve consistency between the order of vertex ids and the vertex order along paths. Second, a compact storage optimized for iterative parallel graph computation is developed in the PathGraph system. Concretely, we employ delta-compression and store tree-based partitions in DFS order. By clustering highly correlated paths together as tree-based partitions, we maximize sequential access and minimize random access on the storage media. Third, but not least, our path-centric computation model is implemented using a scatter/gather programming model. We parallelize the iterative computation at the partition-tree level and perform sequential local updates for the vertices in each tree partition to improve the convergence speed.
To provide well-balanced workloads among parallel threads at the tree-partition level, we introduce a task queue with multiple stealing points, which allows work stealing from multiple points in the queue. We evaluate the effectiveness of PathGraph by comparing it with recent representative graph processing systems such as GraphChi and X-Stream. Our experimental results show that our approach outperforms the two systems on a number of graph algorithms for both in-memory and out-of-core graphs. While our approach achieves better data balance and load balance, it also shows better speedup than the two systems as the number of threads grows. IEEE Transactions on Parallel and Distributed Systems (January 2016)

CIACP: A Correlation- and Iteration-Aware Cache Partitioning Mechanism to Improve Performance of Multiple Coarse-Grained Reconfigurable Arrays
Multiple coarse-grained reconfigurable arrays (CGRAs), organized in parallel or in a pipeline to execute applications, have become a productive solution for balancing performance with flexibility. One of the keys to obtaining high performance from multiple CGRAs is to manage the shared on-chip cache efficiently to reduce off-chip memory bandwidth requirements. Cache partitioning has been viewed as a promising technique for enhancing the efficiency of a shared cache. However, the majority of prior partitioning techniques were developed for multi-core platforms and aimed at multi-programmed workloads. They cannot directly address the adverse impacts of data correlation and computation imbalance among competing CGRAs in a multi-CGRA platform. This paper proposes a correlation- and iteration-aware cache partitioning (CIACP) mechanism for shared cache partitioning in multiple-CGRA systems. This mechanism employs correlation monitors (CMONs) to trace the amount of overlapping data among parallel CGRAs, and iteration monitors (IMONs) to track the computation load of each CGRA. Using the information collected by the CMONs and IMONs, the CIACP mechanism can eliminate redundant cache utilization of overlapping data and can also shorten the total execution time of pipelined CGRAs. Experimental results showed that CIACP outperforms state-of-the-art utility-based cache partitioning techniques by up to 16% in performance. IEEE Transactions on Parallel and Distributed Systems (April 2016)

Failure Diagnosis for Distributed Systems using Targeted Fault Injection

This paper introduces a novel approach to automating failure diagnostics in distributed systems by combining fault injection and data analytics. We use fault injection to populate a database of failures for a target distributed system.
When a failure is reported from a production environment, the database is queried to find "matched" failures generated by fault injections. Relying on the assumption that similar faults generate similar failures, we use information from the matched failures as hints to locate the actual root cause of the reported failures. To implement this approach, we introduce techniques for (i) reconstructing end-to-end execution flows of distributed software components, (ii) computing the similarity of the reconstructed flows, and (iii) performing precise fault injection at pre-specified execution points in distributed systems. We have evaluated our approach using an OpenStack cloud platform, a popular cloud infrastructure management system. Our experimental results showed that this approach is effective in determining the root causes, e.g., fault types and affected components, for 71-100%
of tested failures. Furthermore, it can provide fault locations close to the actual ones and can easily be used to find and fix actual root causes. We have also validated this technique by localizing real bugs that occurred in OpenStack. IEEE Transactions on Parallel and Distributed Systems (June 2016)

Towards Practical and Near-Optimal Coflow Scheduling for Data Center Networks

In current data centers, an application (e.g., MapReduce, Dryad, a search platform, etc.) usually generates a group of parallel flows to complete a job. These flows compose a coflow, and only completing them all is meaningful to the application. Accordingly, minimizing the average Coflow Completion Time (CCT) becomes a critical objective of flow scheduling. However, achieving this goal in today's Data Center Networks (DCNs) is quite challenging, not only because the scheduling problem is theoretically NP-hard, but also because it is tough to perform practical flow scheduling in large-scale DCNs. In this paper, we observe that minimizing the average CCT of a set of coflows is equivalent to the well-known problem of minimizing the sum of completion times in a concurrent open shop. As there are abundant existing solutions for the concurrent open shop problem, we open up a variety of techniques for coflow scheduling. Inspired by the best known result, we derive a 2-approximation algorithm for coflow scheduling, and further develop a decentralized coflow scheduling system, D-CAS, which avoids the system problems associated with current centralized proposals while addressing the performance challenges of decentralized suggestions. Trace-driven simulations indicate that D-CAS achieves performance close to Varys, the state-of-the-art centralized method, and significantly outperforms Baraat, the only existing decentralized method.
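The link to scheduling theory mentioned above can be seen in a single-bottleneck toy model: if coflows are served to completion one at a time on one link, minimizing average CCT is the classic sum-of-completion-times problem, for which smallest-total-size-first is optimal on a single machine. A minimal sketch with illustrative numbers, far simpler than D-CAS or Varys:

```python
def avg_cct(coflows, order):
    """Average Coflow Completion Time when all flows share one bottleneck
    link of unit bandwidth and coflows are served to completion in `order`.
    A coflow finishes only when its last flow finishes, so its completion
    time is the cumulative size drained so far."""
    t, total = 0.0, 0.0
    for i in order:
        t += sum(coflows[i])   # time to drain every flow of coflow i
        total += t             # coflow i completes at time t
    return total / len(order)

coflows = [[4, 4], [1, 1], [2, 2]]   # flow sizes per coflow
fifo = avg_cct(coflows, [0, 1, 2])   # arrival order
# Smallest-total-size-first (SPT) minimizes the sum of completion
# times on a single machine:
spt_order = sorted(range(len(coflows)), key=lambda i: sum(coflows[i]))
spt = avg_cct(coflows, spt_order)
print(fifo, spt)
```

Real DCNs have many ports and coflows overlap across them, which is what makes the general problem NP-hard and motivates the approximation and decentralization techniques described above; the toy model only shows why ordering by coflow size is the right intuition.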
IEEE Transactions on Parallel and Distributed Systems (February 2016)

Traffic Load Balancing Schemes for Devolved Controllers in Mega Data Centers

In most existing cloud services, a centralized controller is used for resource management and coordination. However, such an infrastructure is gradually becoming insufficient to meet the rapid growth of mega data centers. In recent literature, a new approach named the devolved controller was proposed to address scalability concerns. This approach splits the whole network into several regions, each with one controller to monitor and reroute a portion of the flows. This technique alleviates the problem of an overloaded single controller, but brings other problems, such as unbalanced workload among controllers and reconfiguration complexities. In this paper, we make an exploration
on the usage of devolved controllers for mega data centers, and design new schemes to overcome these shortcomings and improve the performance of the system. We first formulate the Load Balancing problem for Devolved Controllers (LBDC) in data centers and prove that it is NP-complete. We then design an f-approximation algorithm for LBDC, where f is the largest number of potential controllers for a switch in the network. Furthermore, we propose both centralized and distributed greedy approaches to solve the LBDC problem effectively. The numerical results validate the efficiency of our schemes, which can become a solution for monitoring, managing, and coordinating mega data centers with multiple controllers working together. IEEE Transactions on Parallel and Distributed Systems (June 2016)

Reactive Molecular Dynamics on Massively Parallel Heterogeneous Architectures

We present a parallel implementation of the ReaxFF force field on massively parallel heterogeneous architectures, called PuReMD-Hybrid. PuReMD, on which this work is based, along with its integration into LAMMPS, is currently used by a large number of research groups worldwide. Accelerating this important community codebase, which implements a complex reactive force field, poses a number of algorithmic, design, and optimization challenges, as we discuss in detail. In particular, different computational kernels are best suited to different computing substrates: CPUs or GPUs. Scheduling these computations requires complex resource management, as well as minimizing data movement across CPUs and GPUs. Integrating powerful nodes, each with multiple CPUs and GPUs, into clusters, and utilizing the immense compute power of these clusters, requires significant optimizations to minimize communication and, potentially, redundant computations.
From a programming-model perspective, PuReMD-Hybrid relies on MPI across nodes, pthreads across cores, and CUDA on the GPUs to address these challenges. Using a variety of innovative algorithms and optimizations, we demonstrate that our code can achieve an over 565-fold speedup compared to a single-core implementation on a cluster of 36 state-of-the-art GPUs for complex systems. In terms of application performance, our code enables simulations of over 1.8M atoms in under 0.68 seconds per simulation time step. IEEE Transactions on Parallel and Distributed Systems (March 2016)

A fast discrete wavelet transform using hybrid parallelism on GPUs
Wavelet transforms have been widely used in many signal and image processing applications. Due to their wide adoption for time-critical applications, such as streaming and real-time signal processing, many acceleration techniques were developed during the past decade. Recently, the graphics processing unit (GPU) has gained much attention for accelerating computationally intensive problems, and many GPU-based discrete wavelet transform (DWT) solutions have been introduced, but most of them did not fully leverage the potential of the GPU. In this paper, we present various state-of-the-art GPU optimization strategies for DWT implementation, such as leveraging shared memory, registers, warp shuffling instructions, and thread- and instruction-level parallelism (TLP, ILP), and finally elaborate our hybrid approach to further boost performance. In addition, we introduce a novel mixed-band memory layout for the Haar DWT, where a multi-level transform can be carried out in a single fused kernel launch. As a result, unlike recent GPU DWT methods that focus mainly on maximizing ILP, we show that optimal GPU DWT performance can be achieved by hybrid parallelism combining both TLP and ILP in a mixed-band approach. We demonstrate the performance of our proposed method by comparison with other CPU and GPU DWT methods. IEEE Transactions on Parallel and Distributed Systems (February 2016)

SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per the individual's university guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training on the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication on your project.

FOLLOW US ON FACEBOOK @ TSYS Academic Projects