For more Details, Feel free to contact us at any time.
Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/
Mail Id: tsysglobalsolutions2014@gmail.com.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 2016 TOPICS
Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource
Provisioning in Virtualized Clouds
Clouds are becoming an important platform for scientific workflow applications. However, with
many nodes being deployed in clouds, managing reliability of resources becomes a critical issue,
especially for the real-time scientific workflow execution where deadlines should be satisfied.
Fault tolerance in clouds is therefore essential. Primary-backup (PB) based scheduling is a
popular fault-tolerance technique that has been used effectively in cluster and grid computing.
However, applying this technique to real-time workflows in a virtualized cloud is much more
complicated and has rarely been studied. In this paper, we address this
problem. We first establish a real-time workflow fault-tolerant model that extends the traditional
PB model by incorporating the cloud characteristics. Based on this model, we develop
approaches for task allocation and message transmission to ensure faults can be tolerated during
the workflow execution. Finally, we propose a dynamic fault-tolerant scheduling algorithm,
FASTER, for real-time workflows in the virtualized cloud. FASTER has three key features: 1) it
employs a backward shifting method to make full use of the idle resources and incorporates task
overlapping and VM migration for high resource utilization, 2) it applies the vertical/horizontal
scaling-up technique to quickly provision resources for a burst of workflows, and 3) it uses the
vertical scaling-down scheme to avoid unnecessary and ineffective resource changes caused by
fluctuating workflow requests. We evaluate our FASTER algorithm with synthetic workflows and
workflows collected from real scientific and business applications, and compare it with six
baseline algorithms. The experimental results demonstrate that FASTER can effectively improve
the resource utilization and schedulability even in the presence of node failures in virtualized
clouds.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
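The primary-backup idea the abstract builds on can be illustrated with a minimal sketch: every task is placed twice, on two different hosts, so a single host failure can be tolerated, and both copies must meet the deadline. The data model (`(name, exec_time, deadline)` tuples) and the earliest-available-host policy are illustrative assumptions; FASTER's backward shifting, task overlapping, and VM scaling are not modeled.

```python
def pb_schedule(tasks, num_hosts):
    """Toy primary-backup (PB) scheduler.

    tasks: list of (name, exec_time, deadline) tuples (hypothetical model).
    Returns a dict mapping task name to its primary/backup host, or None
    if some task cannot meet its deadline under this simple policy.
    """
    ready = [0.0] * num_hosts          # time at which each host becomes free
    schedule = {}
    for name, exec_time, deadline in tasks:
        # place the primary copy on the earliest-available host
        p = min(range(num_hosts), key=lambda h: ready[h])
        p_finish = ready[p] + exec_time
        # place the backup copy on the earliest-available *different* host
        b = min((h for h in range(num_hosts) if h != p),
                key=lambda h: ready[h])
        b_finish = ready[b] + exec_time
        # both copies must finish by the deadline for a fault to be tolerable
        if max(p_finish, b_finish) > deadline:
            return None
        ready[p], ready[b] = p_finish, b_finish
        schedule[name] = {"primary": p, "backup": b}
    return schedule
```

Because primary and backup never share a host, any single host failure leaves one live copy of every task.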
Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length
of a DAG with Random Weights
Coping with uncertainties when scheduling task graphs on parallel machines requires performing
non-trivial evaluations. When considering that each computation and communication duration is
a random variable, evaluating the distribution of the critical path length of such graphs involves
computing maximums and sums of possibly dependent random variables. The discrete version of
this evaluation problem is known to be #P-hard. Here, we propose two heuristics, CorLCA and
Cordyn, to compute such lengths. They approximate the input random variables and the
intermediate ones as normal random variables, and they precisely take into account correlations
with two distinct mechanisms: through lowest common ancestor queries for CorLCA and with a
dynamic programming approach for Cordyn. Moreover, we empirically compare some classical
methods from the literature against our solutions. Simulations on a large set of
cases indicate that CorLCA and Cordyn each constitute a relevant new trade-off between
speed and precision.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
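The normal-approximation step that such heuristics rely on is classical: at each "max" node of the DAG, the maximum of two correlated normal variables is matched to a new normal distribution using Clark's moment formulas. A sketch of that single building block (the standard formulas, not the authors' code):

```python
import math

def normal_max_moments(mu1, var1, mu2, var2, rho):
    """Clark's moment matching for max(X, Y) of two jointly normal
    variables X ~ N(mu1, var1), Y ~ N(mu2, var2) with correlation rho.
    Returns (mean, variance) of the approximating normal."""
    a2 = var1 + var2 - 2.0 * rho * math.sqrt(var1 * var2)
    if a2 <= 0.0:                      # degenerate case: X - Y is constant
        return (max(mu1, mu2), var1)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    # standard normal pdf and cdf at alpha
    phi = math.exp(-alpha * alpha / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    mean = mu1 * Phi + mu2 * (1.0 - Phi) + a * phi
    second = ((mu1 * mu1 + var1) * Phi
              + (mu2 * mu2 + var2) * (1.0 - Phi)
              + (mu1 + mu2) * a * phi)
    return (mean, second - mean * mean)
```

Sums of normals are handled exactly (means and variances add), so a heuristic only needs a way to estimate the correlation fed into each max, which is where CorLCA's lowest-common-ancestor queries and Cordyn's dynamic programming differ.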
A Hybrid Static-Dynamic Classification for Dual-Consistency Cache Coherence
Traditional cache coherence protocols manage all memory accesses equally and ensure the
strongest memory model, namely, sequential consistency. Recent cache coherence protocols
based on self-invalidation instead advocate a weaker model, sequential consistency for data-race-free,
which enables powerful optimizations for race-free code. However, for racy code these cache
coherence protocols provide sub-optimal performance compared to traditional protocols. This
paper proposes SPEL++, a dual-consistency cache coherence protocol that supports two
execution modes: a traditional sequential-consistent protocol and a protocol that provides weak
consistency (or sequential consistency for data-race-free). SPEL++ exploits a static-dynamic
hybrid classification of memory accesses based on (i) a compile-time identification of extended
data-race-free code regions for OpenMP applications and (ii) a runtime classification of accesses
based on the operating system’s memory page management. By executing racy code under the
sequential-consistent protocol and race-free code under the cache coherence protocol that
provides sequential consistency for data-race-free, the end result is an efficient execution of the
applications while still providing sequential consistency. Compared to a traditional protocol, we
show improvements in performance from 19% to 38% and reductions in energy consumption
from 47% to 53%, on average for different benchmark suites, on a 64-core chip multiprocessor.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
REFRESH: REDEFINE for Face Recognition using SURE Homogeneous Cores
In this paper we present the design and analysis of a scalable real-time Face Recognition (FR)
module that performs 450 recognitions per second. We introduce an algorithm for FR that
combines Weighted Modular Principal Component Analysis with Radial Basis Function
Neural Networks. This algorithm offers better recognition accuracy in various practical
conditions than algorithms used in existing architectures for real-time FR. To meet real-time
requirements, a Scalable Parallel Pipelined Architecture (SPPA) is developed by realizing the
above FR algorithm as independent parallel streams and sub-streams of computations. SPPA is
capable of supporting large databases maintained in external (DDR) memory. By casting the
computations in a stream into hardware, we present the design of a Scalable Unit for Region
Evaluation (SURE) core. Using SURE cores as compute elements in a massively parallel
CGRA such as REDEFINE, we obtain an FR system on REDEFINE called REFRESH. We report
FPGA and ASIC synthesis results for SPPA and REFRESH. Through analysis using these
results, we show that the excellent scalability and added programmability of REFRESH make it a
flexible and favorable solution for real-time FR.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints
As systems scale toward exascale, many resources will become increasingly constrained. While
some of these resources have historically been explicitly allocated, many—such as network
bandwidth, I/O bandwidth, or power—have not. As systems continue to evolve, we expect many
such resources to become explicitly managed. This change will pose critical challenges to
resource management and job scheduling. In this paper, we explore the potential of relaxing
network allocation constraints for Blue Gene systems. Our objective is to improve the batch
scheduling performance, where the partition-based interconnect architecture provides a unique
opportunity to explicitly allocate network resources to jobs. This paper makes three major
contributions. The first is substantial benchmarking of parallel applications, focusing on
assessing application sensitivity to communication bandwidth at large scale. The second is three
new scheduling schemes using relaxed network allocation and targeted at balancing individual
job performance with overall system performance. The third is a comparative study of our
scheduling schemes versus the existing scheduler on Mira, a 48-rack Blue Gene/Q system at
Argonne National Laboratory. Specifically, we use job traces collected from this production
system.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Transparent and optimized distributed processing on GPUs
DistributedCL is middleware that enables transparent parallel processing on distributed
GPUs. With its support, an application designed for the OpenCL API can run in a distributed
manner and transparently use remote GPUs without any change to, or rebuild of, its code.
The proposed architecture for the DistributedCL middleware is
modular, with well-defined layers. A prototype was built according to this architecture,
incorporating several optimizations, including batched data transfers, asynchronous network
communication, and asynchronous requests to the OpenCL API. The prototype was evaluated
using available benchmarks, and a dedicated benchmark, CLBench, was developed to facilitate
evaluation as a function of the amount of processed data. The prototype showed good
performance, higher than that of similar proposals that also provide transparent use of
remote GPUs. The size of the data to be transmitted over the network was the major limiting factor.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
Distributed Control for Charging Multiple Electric Vehicles with Overload Limitation
Severe pollution caused by traditional fossil fuels has drawn great attention to plug-in
electric vehicles (PEVs) and renewable energy. However, large-scale penetration of PEVs,
combined with other kinds of appliances, tends to place an excessive or even disastrous burden on
the power grid, especially during peak hours. This paper focuses on scheduling the charging of
PEVs among different charging stations, where each station can be supplied by both
renewable energy generators and a distribution network. The distribution network also powers
some uncontrollable loads. In order to minimize the on-grid energy cost with local renewable
energy and non-ideal storage while avoiding the overload risk of the distribution network, an
online algorithm consisting of scheduling the charging of PEVs and energy management of
charging stations is developed based on Lyapunov optimization and Lagrange dual
decomposition techniques. The algorithm can satisfy the random charging requests from PEVs
with provable performance. Simulation results with real data demonstrate that the proposed
algorithm can decrease the time-average cost of stations while avoiding overload in the
distribution network in the presence of random uncontrollable loads.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
CoRE: Cooperative End-to-End Traffic Redundancy Elimination for Reducing Cloud
Bandwidth Cost
The pay-as-you-go service model impels cloud customers to reduce the usage cost of bandwidth.
Traffic Redundancy Elimination (TRE) has been shown to be an effective solution for reducing
bandwidth costs, and thus has recently captured significant attention in the cloud environment.
By studying TRE techniques in a trace-driven manner, we found that both short-term (time
span of seconds) and long-term (time span of hours or days) data redundancy can appear
concurrently in the traffic, and that using either sender-based or receiver-based TRE alone
cannot capture both types of redundancy simultaneously. Moreover, the efficiency of existing
receiver-based TRE solutions is susceptible to changes in the data relative to the historical data in the
cache. In this paper, we propose a Cooperative end-to-end TRE solution (CoRE) that can detect
and remove both short-term and long-term redundancy through a two-layer TRE design with
cooperative operations between layers. An adaptive prediction algorithm is further proposed to
improve TRE efficiency through dynamically adjusting the prediction window size based on the
hit ratio of historical predictions. In addition, we enhance CoRE to adapt to the different
traffic redundancy characteristics of cloud applications, reducing its operating cost. Extensive
evaluation with several real traces shows that CoRE is capable of effectively identifying both
short-term and long-term redundancy with low additional cost while keeping TRE efficient
under data changes.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
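The basic sender-side TRE mechanism can be sketched with fixed-size chunking and a chunk cache shared (logically) by sender and receiver: a chunk already seen is replaced by its short digest. Content-defined chunking, CoRE's two-layer design, the prediction window, and receiver cooperation are deliberately omitted, and the chunk size is an arbitrary choice.

```python
import hashlib

CHUNK = 64  # bytes per chunk; real TRE systems use content-defined chunking

def tre_encode(data, cache):
    """Replace already-seen chunks by a 20-byte SHA-1 reference."""
    out = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha1(chunk).digest()
        if h in cache:
            out.append(("ref", h))       # send the digest instead of the data
        else:
            cache[h] = chunk
            out.append(("raw", chunk))
    return out

def tre_decode(stream, cache):
    """Rebuild the byte stream, learning new chunks as they arrive."""
    data = b""
    for kind, payload in stream:
        chunk = payload if kind == "raw" else cache[payload]
        if kind == "raw":
            cache[hashlib.sha1(chunk).digest()] = chunk
        data += chunk
    return data
```

On a second transmission of the same data, every chunk is sent as a reference, which is the bandwidth saving TRE targets.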
Clustering-based Task Scheduling in a Large Number of Heterogeneous Processors
Parallelization paradigms for the effective execution of a Directed Acyclic Graph (DAG)
application have been widely studied in the area of task scheduling. Schedule length can vary
depending on task assignment policies, scheduling policies, and the heterogeneity of processors
and communication bandwidths in a heterogeneous system. One disadvantage of existing
task scheduling algorithms is that the schedule length cannot be reduced for data-intensive
applications. In this paper, we propose a clustering-based task scheduling algorithm called
Clustering for Minimizing the Worst Schedule Length (CMWSL) to minimize the schedule
length on a large number of heterogeneous processors. First, the proposed method derives the
lower bound of the total execution time for each processor by taking both the system and
application characteristics into account. As a result, the number of processors used for actual
execution is regulated to minimize the Worst Schedule Length (WSL). Then, the actual task
assignment and task clustering are performed to minimize the schedule length until the total
execution time in a task cluster exceeds the lower bound. Experimental results indicate that
CMWSL outperforms both existing list-based and clustering-based task scheduling algorithms in
terms of the schedule length and efficiency, especially in data-intensive applications.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint
Model
The traditional single-level checkpointing method suffers from significant overhead on large-
scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in
recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set
(each with different checkpoint overheads and recovery abilities), in order to further improve the
fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint
intervals for each level, however, is an extremely difficult problem. In this paper, we construct
an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low
checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals
with hardware crashes such as node failures. Compared with previous optimization work, our
new optimal checkpoint solution offers two improvements: (1) it is an online solution without
requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are
optimal and determines the best pattern. We evaluate the proposed solution and compare it with
the most up-to-date related approaches on an extreme-scale simulation testbed constructed based
on a real HPC application execution. Simulation results show that our proposed solution
outperforms other optimized solutions and can improve the performance significantly in some
cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3%
over that of other state-of-the-art approaches. Finally, a brute-force comparison with all possible
patterns shows that our solution is always within 1% of the best pattern in the experiments.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
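For context, the classical single-level baseline that multilevel models such as this two-level solution generalize is Young's first-order interval W = sqrt(2·C·M), for checkpoint cost C and mean time between failures M. A sketch of the textbook formula and its waste estimate (background material, not the paper's optimization):

```python
import math

def young_daly_interval(ckpt_cost, mtbf):
    """First-order optimal checkpoint interval W = sqrt(2 * C * MTBF)
    for single-level checkpointing (Young's formula)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)

def expected_waste(interval, ckpt_cost, mtbf):
    """First-order fraction of time lost: checkpointing overhead C/W plus
    expected re-execution after a failure, roughly W / (2 * MTBF)."""
    return ckpt_cost / interval + interval / (2.0 * mtbf)
```

With C = 50 s and MTBF = 10,000 s the formula gives W = 1,000 s; neighboring intervals incur strictly more waste, which is the local-optimality property the formula captures.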
Shadow/Puppet Synthesis: A Stepwise Method for the Design of Self-Stabilization
This paper presents a novel two-step method for the automated design of self-stabilization. The
first step enables the specification of legitimate states and an intuitive (but imprecise)
specification of the desired functional behaviors in the set of legitimate states (hence the term
"shadow"). After
creating the shadow specifications, we systematically introduce the main variables and the
topology of the desired self-stabilizing system. Subsequently, we devise a parallel and complete
backtracking search towards finding a self-stabilizing solution that implements a precise version
of the shadow behaviors, and guarantees recovery to legitimate states from any state. To the best
of our knowledge, the shadow/puppet synthesis is the first sound and complete method that
exploits parallelism and randomization along with the expansion of the state space towards
generating self-stabilizing systems that cannot be synthesized with existing methods. We have
validated the proposed method by creating both a sequential and a parallel implementation in the
context of a software tool, called Protocon. Moreover, we have used Protocon to automatically
design three new self-stabilizing protocols that we conjecture to require the minimal number of
states per process to achieve stabilization (when processes are deterministic): 2-state maximal
matching on bidirectional rings, 5-state token passing on unidirectional rings, and 3-state token
passing on bidirectional chains.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Optimizing End-to-End Big Data Transfers over Terabits Network Infrastructure
While future terabit networks hold the promise of significantly improving big-data motion
among geographically distributed data centers, significant challenges must be overcome even on
today’s 100 gigabit networks to realize end-to-end performance. Multiple bottlenecks exist along
the end-to-end path from source to sink; for instance, the data storage infrastructure at the
source and sink, and its interplay with the wide-area network, is increasingly the bottleneck to
achieving high performance. In this paper, we identify the issues that lead to congestion on the
path of an end-to-end data transfer in the terabit network environment, and we present a new
bulk data movement framework for terabit networks, called LADS. LADS exploits the
underlying storage layout at each endpoint to maximize throughput without negatively impacting
the performance of shared storage resources for other users. LADS also uses the Common
Communication Interface (CCI) in lieu of the sockets interface to benefit from hardware-level
zero-copy and operating-system-bypass capabilities when available. It can further improve data
transfer performance under congestion on the end systems by buffering data at the source in
flash storage. Our evaluations show that LADS can avoid congested storage elements
within the shared storage resource, improving input/output bandwidth and data transfer rates
across high-speed networks. We also investigate the performance degradation of
LADS due to I/O contention on the parallel file system (PFS), when multiple LADS tools share
the PFS. We design and evaluate a meta-scheduler that coordinates multiple I/O streams
sharing the PFS, minimizing I/O contention. We observe that LADS with meta-scheduling
further improves performance by up to 14% relative to LADS without it.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
Analysis of parallel computing strategies to accelerate ultrasound imaging processes
This work analyses the use of parallel processing techniques in synthetic aperture ultrasonic
imaging applications. In particular, the Total Focusing Method, an O(N²P) problem, is
studied. This work presents different parallelization strategies for multicore CPU and GPU
architectures. The parallelization processes on both platforms are discussed and optimized in
order to achieve real-time performance.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
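The O(N²P) structure comes from summing, for each of P image pixels, one delayed sample per transmit/receive pair of an N-element array. A toy per-pixel kernel under assumed geometry, with a uniform sound speed, a linear array on the x-axis, and nearest-sample lookup instead of interpolation (illustrative, not the paper's implementation):

```python
import math

def tfm_pixel(fmc, elements, px, py, c, fs):
    """Total Focusing Method intensity of one pixel at (px, py).

    fmc: full matrix capture, fmc[tx][rx] is the A-scan (list of samples)
         for transmitter tx and receiver rx.
    elements: x-coordinates of the array elements (assumed layout).
    c: sound speed; fs: sampling frequency.
    The double loop over element pairs is the O(N^2) inner kernel that,
    repeated over P pixels, yields the O(N^2 P) cost cited above.
    """
    total = 0.0
    n = len(elements)
    for tx in range(n):
        d_tx = math.hypot(px - elements[tx], py)       # transmit path length
        for rx in range(n):
            d_rx = math.hypot(px - elements[rx], py)   # receive path length
            sample = int(round((d_tx + d_rx) / c * fs))
            if 0 <= sample < len(fmc[tx][rx]):
                total += fmc[tx][rx][sample]
    return total
```

The pixel loop around this kernel is embarrassingly parallel, which is what makes the method a natural fit for multicore CPUs and GPUs.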
Improving Performance of Parallel I/O Systems through Selective and Layout-Aware SSD
Cache
Parallel file systems (PFS) are widely used to ease the I/O bottleneck of modern high-performance
computing systems. However, PFSs do not work well for small requests, especially
small random requests. Solid State Drives (SSD) offer excellent performance on small
random data accesses but incur a high monetary cost. In this study, we propose SLA-Cache,
a Selective and Layout-Aware Cache system that employs a small set of SSD-based file servers
as a cache of conventional HDD-based file servers. SLA-Cache uses a novel scheme to identify
performance-critical data, and conducts a selective cache admission (SCA) policy to fully utilize
SSD-based file servers. Moreover, since data layout of the cache system can also largely
influence its access performance, SLA-Cache applies a layout-aware cache placement scheme
(LCP) to store data on SSD-based file servers. By storing data with an optimal layout requiring
the lowest access cost among three typical layout candidates, LCP can further improve system
performance. We have implemented SLA-Cache under the MPICH2 I/O library. Experimental
results show that SLA-Cache can significantly improve I/O throughput, and is a promising
approach for parallel applications.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing
Modern-day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at
delivering highly reliable cloud computing services. Although offering equal reliability to all
users benefits everyone at once, users may find such an approach either inadequate or
too expensive for their individual requirements, which may vary dramatically. In this paper, we
propose a novel method for providing elastic reliability optimization in cloud computing. Our
scheme makes use of peer-to-peer checkpointing and allows user reliability levels to be jointly
optimized based on an assessment of their individual requirements and total available resources
in the data center. We show that the joint optimization can be efficiently solved by a distributed
algorithm using dual decomposition. The solution improves resource utilization and presents an
additional source of revenue to data center operators. Our validation results suggest a significant
improvement of reliability over existing schemes.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
A Taxonomy of Job Scheduling on Distributed Computing Systems
Hundreds of papers on job scheduling for distributed systems are published every year and it
becomes increasingly difficult to classify them. Our analysis revealed that half of these papers
are barely cited. This paper presents a general taxonomy for scheduling problems and solutions
in distributed systems. This taxonomy was used to classify and make publicly available the
classification of 109 scheduling problems and their solutions. These 109 problems were further
clustered into ten groups based on the features of the taxonomy. The proposed taxonomy will
help researchers build on prior art, increase the visibility of new research, and minimize
redundant effort.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because it leaves no indication that errors occurred during execution. We propose an
adaptive impact-driven method that can detect SDCs dynamically. The key contributions are
threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss the
runtime data features, as well as the impact of the SDCs on their execution results. (2) We
propose an impact-driven detection model that does not blindly improve the prediction accuracy,
but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our
solution can adapt to dynamic prediction errors based on local runtime data and can
automatically tune detection ranges to guarantee low false-alarm rates. Experiments show that
our detector can detect 80-99.99% of SDCs with a false alarm rate of less than 1% of iterations in
most cases. The memory cost and detection overhead are reduced to 15% and 6.3%, respectively,
for a large majority of applications.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
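A range-based detector of the kind described can be sketched as: predict each value by extrapolating from recent history, keep an adaptive bound on normal prediction error, and flag values far outside that bound. The last-two-points extrapolation, the decay factor, and the threshold multiplier here are all illustrative assumptions, not the paper's impact model.

```python
def sdc_detector(series, k=3.0):
    """Return indices of suspected silent data corruptions in a time series.

    Each value is predicted by linear extrapolation from the previous two;
    a value whose prediction error exceeds k times the adaptive error bound
    is flagged. The bound tracks recent normal errors and decays slowly.
    """
    flags = []
    err_bound = 0.0
    for i in range(2, len(series)):
        predicted = 2.0 * series[i - 1] - series[i - 2]  # linear extrapolation
        err = abs(series[i] - predicted)
        if err_bound > 0.0 and err > k * err_bound:
            flags.append(i)                    # far outside the normal range
        else:
            err_bound = max(err_bound * 0.9, err)   # adapt to normal errors
    return flags
```

Note the trade-off the abstract describes: a smaller k catches more corruptions but raises the false-alarm rate on noisy data.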
Time Series-Oriented Load Prediction Model and Migration Policies for Distributed
Simulation Systems
HLA-based simulation systems are prone to load imbalances due to a lack of management of
shared resources in distributed environments. Such imbalances cause these simulations to lose
performance in terms of execution time. As a result, many dynamic load balancing systems
have been introduced to manage distributed load. These systems use specific methods, depending
on load or application characteristics, to perform the required balancing. Load prediction is a
technique that has been used extensively to enhance load redistribution heuristics towards
preventing load imbalances. In this paper, several efficient Time Series model variants are
presented and used to enhance prediction precision for large-scale distributed simulation-based
systems. These variants are proposed to extend and correct the issues originating from the
implementation of Holt’s model for time series in the predictive module of a dynamic load
balancing system for HLA-based distributed simulations. A set of migration decision-making
techniques is also proposed to enable a prediction-based load balancing system to be independent
of any prediction model, promoting a more modular construction.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
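Holt's linear model mentioned above smooths a level term and a trend term and forecasts their sum. A minimal one-step-ahead version (the smoothing constants are illustrative, not the values used in the paper's predictive module):

```python
def holt_forecast(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecast with Holt's linear (double) exponential
    smoothing. alpha smooths the level, beta smooths the trend."""
    level = series[0]
    trend = series[1] - series[0]          # initial trend estimate
    for x in series[1:]:
        forecast = level + trend           # prediction for this observation
        new_level = alpha * x + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return level + trend                   # prediction for the next observation
```

On a perfectly linear load trace the forecast is exact, which is the baseline behavior the paper's variants extend to handle the irregularities of distributed simulation load.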
Enabling Parallel Simulation of Large-Scale HPC Network Systems
With the increasing complexity of today’s high-performance computing (HPC) architectures,
simulation has become an indispensable tool for exploring the design space of HPC systems—in
particular, networks. In order to make effective design decisions, simulations of these systems
must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in
a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-
the-art HPC network simulation frameworks, however, are constrained in one or more of these
areas. In this work, we present a simulation framework for modeling two important classes of
networks used in today’s IBM and Cray supercomputers: torus and dragonfly networks. We use
the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to
simulate these network topologies at a flit-level detail using the Rensselaer Optimistic
Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework
meets all the requirements of a practical network simulation and can assist network designers in
design space exploration. First, it uses validated and detailed flit-level network models to provide
an accurate and high-fidelity network simulation. Second, instead of relying on serial time-
stepped or traditional conservative discrete-event simulations that limit simulation scalability and
efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and
scalable HPC network simulations on today’s high-performance cluster systems. Third, our
models give network designers a choice in simulating a broad range of network workloads,
including HPC application workloads using detailed network traces, an ability rarely offered
alongside high-fidelity network simulations.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
An Evolutionary Optimal Fuzzy System with Information Fusion of Heterogeneous
Distributed Computing and Polar-Space Dynamic Model for Online Motion Control of
Swedish Redundant Robots
This paper presents an evolutionary optimal fuzzy system with information fusion of
heterogeneous distributed computing and polar-space dynamic model for online motion control
of Swedish redundant robots. The intelligent fuzzy system is combined with the parallel
metaheuristic BFO (Bacterial Foraging Optimization)-AIS (Artificial Immune System), called
FS-PBFOAIS, and realized on a field-programmable gate array (FPGA) for optimal polar-space
online motion control of four-wheeled redundant mobile robots. This hybrid paradigm gains the
benefits of the Taguchi quality method, BFO, AIS, distributed processing, and FPGA techniques.
Experiments demonstrate the effective optimization and high accuracy of the proposed
FPGA-based FS-PBFOAIS tracking controller. Finally, comparative studies demonstrate the
superiority of the FPGA-based FS-PBFOAIS polar-space redundant controller over other
conventional control methods.
IEEE Transactions on Industrial Electronics (May 2016)
Cache Line Aware Algorithm Design for Cache-Coherent Architectures
The increase in the number of cores per processor and the complexity of memory hierarchies
make cache coherence key for programmability of current shared memory systems. However,
ignoring its detailed architectural characteristics can harm performance significantly. In order to
assist performance-centric programming, we propose a methodology to allow semi-automatic
performance tuning with the systematic translation from an algorithm to an analytic performance
model for cache line transfers. For this, we design a simple interface for cache line aware
optimization, a translation methodology, and a full performance model that exposes the block-
based design of caches to middleware designers. We investigate two different architectures to
show the applicability of our techniques and methods: the many-core accelerator Intel Xeon Phi
and a multi-core processor with a NUMA configuration (Intel Sandy Bridge). We use
mathematical optimization techniques to tune synchronization algorithms to the
microarchitectures, identifying three techniques to design and optimize data transfers in our
model: single-use, single-step broadcast, and private cache lines.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-Wide
Alignment in GPU Clusters
This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal
alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman
(SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix
is computed by multiple GPUs, which asynchronously communicate border elements to the right
neighbor in order to find the optimal score. After that, the traceback phase of SW is executed.
The efficient parallelization of the traceback phase is very challenging because of the high
amount of data dependency, which particularly impacts the performance and limits the
application scalability. In order to obtain a multi-GPU highly parallel traceback phase, we
propose and evaluate a new parallel traceback algorithm called Incremental Speculative
Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values
calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate
SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all Human and
Chimpanzee homologous chromosomes, whose sizes range from 26 Million Base Pairs
(MBP) up to 249 MBP. As far as we know, this is the first time such a comparison has been made with
the SW exact method. We also show that the IST algorithm is able to reduce the traceback time
from 2.15x up to 21.03x when compared with the baseline traceback algorithm. The human ×
chimpanzee chromosome 5 comparison (180 MBP x 183 MBP) attained 10,370.00 GCUPS
(Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2%.
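At the heart of CUDAlign is the exact Smith-Waterman recurrence. A minimal, sequential, score-only sketch of the first phase (the function name and the match/mismatch/gap values are illustrative choices, not CUDAlign's actual kernels or parameters) might look like:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Compute the optimal local alignment score of strings a and b
    with the exact Smith-Waterman dynamic-programming recurrence."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores are clamped at zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The paper's contribution lies in distributing this matrix across GPUs and speculating on the traceback, which this sketch deliberately omits.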
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Xscale: Online X-code RAID-6 Scaling Using Lightweight Data Reorganization
Disk additions to a RAID-6 storage system can simultaneously increase the I/O parallelism and
expand the storage capacity. To regain a balanced load among both old and new disks, RAID-6
scaling requires moving certain data blocks onto newly added disks. Existing approaches to
RAID-6 scaling are restricted by preserving a round-robin data distribution, and require
migrating all the data, resulting in an expensive cost for RAID-6 scaling. In this paper, we
propose Xscale, a new approach to accelerating X-code RAID-6 scaling by using lightweight
data reorganization. Xscale minimizes the number of data blocks that need to be moved, while
maintaining a uniform data distribution across all disks. Furthermore, Xscale eliminates metadata
updates while guaranteeing data consistency and data reliability. Compared with the round-robin
approach, Xscale reduces the number of blocks to be moved by 63.6–89.5%, decreases the
reorganization time by 35.62–37.26%, and reduces the I/O latency by 23.29–37.74% while the
scaling programs are running in the background. In addition, there is no penalty in the
performance of the data layout after scaling using Xscale, compared with the layouts maintained
by other existing scaling approaches.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
The Importance of Worker Reputation Information in Microtask-Based Crowd Work
Systems
This paper presents the first systematic investigation of the potential performance gains for
crowd work systems that derive from information available to the requester about individual
worker reputation. In particular, we first formalize the optimal task assignment problem when
workers’ reputation estimates are available, as the maximization of a monotone (submodular)
function subject to matroid constraints. Then, since the optimal problem is NP-hard, we propose a
simple but efficient greedy heuristic task allocation algorithm. We also propose a simple
"maximum a-posteriori" decision rule and a decision algorithm based on message passing.
Finally, we test and compare different solutions, showing that system performance can greatly
benefit from information about workers’ reputation. Our main findings are that: i) even largely
inaccurate estimates of workers’ reputation can be effectively exploited in the task assignment to
greatly improve system performance; ii) the performance of the maximum a-posteriori decision
rule quickly degrades as worker reputation estimates become inaccurate; iii) when workers’
reputation estimates are significantly inaccurate, the best performance can be obtained by
combining our proposed task assignment algorithm with the message-passing decision algorithm.
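As a rough illustration of a greedy heuristic under a partition-matroid constraint, the sketch below assumes each worker takes at most `capacity` tasks and uses an illustrative monotone submodular objective, the expected number of tasks answered correctly by at least one assigned worker; the objective and all names are assumptions, not the paper's exact formulation:

```python
def greedy_assign(reputations, num_tasks, capacity):
    """Greedy task assignment: repeatedly pick the (worker, task) pair
    with the largest marginal gain in
    f = sum_t [1 - prod_{w assigned to t} (1 - reputations[w])]."""
    assigned = {t: [] for t in range(num_tasks)}
    load = {w: 0 for w in reputations}
    miss = {t: 1.0 for t in range(num_tasks)}  # P(no assigned worker is correct)
    for _ in range(capacity * len(reputations)):
        best_gain, best = 0.0, None
        for w, r in reputations.items():
            if load[w] >= capacity:          # partition-matroid constraint
                continue
            for t in range(num_tasks):
                if w in assigned[t]:
                    continue
                gain = miss[t] * r           # marginal increase in f
                if gain > best_gain:
                    best_gain, best = gain, (w, t)
        if best is None:
            break
        w, t = best
        assigned[t].append(w)
        load[w] += 1
        miss[t] *= 1 - reputations[w]
    return assigned
```

Because f is monotone submodular, the classic result for greedy maximization under a matroid constraint gives a 1/2-approximation guarantee for this kind of heuristic.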
IEEE Transactions on Parallel and Distributed Systems (May 2016)
On Data Integrity Attacks against Real-time Pricing in Energy-based Cyber-Physical
Systems
In this paper, we investigate a novel real-time pricing scheme, which considers both renewable
energy resources and traditional power resources and could effectively guide the participants to
achieve individual welfare maximization in the system. To be specific, we develop a Lagrangian-
based approach to transform the global optimization conducted by the power company into
distributed optimization problems to obtain explicit energy consumption, supply, and price
decisions for individual participants. Also, we show that these distributed problems derived from
the global optimization by the power company are consistent with individual welfare
maximization problems for end-users and traditional power plants. We also investigate and
formalize the vulnerabilities of the real-time pricing scheme by considering two types of data
integrity attacks: Ex-ante attacks and Ex-post attacks, which are launched by the adversary
before or after the decision-making process. We systematically analyze the welfare impacts of
these attacks on the real-time pricing scheme. Through a combination of theoretical analysis and
performance evaluation, our data shows that the proposed real-time pricing scheme could
effectively guide the participants to achieve welfare maximization, while cyber-attacks could
significantly disrupt the results of real-time pricing decisions, imposing welfare reduction on the
participants.
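A toy dual-decomposition sketch of such a Lagrangian pricing loop: the power company adjusts the price by the supply-demand imbalance, and each user independently maximizes an illustrative concave utility. The utility form, step size, and all names are assumptions for illustration, not the paper's model:

```python
def real_time_pricing(utilities, supply, steps=500, lr=0.01):
    """Price iteration: each user maximizes u*log(1+x) - price*x,
    whose optimum is x = max(0, u/price - 1); the company then raises
    the price when demand exceeds supply and lowers it otherwise."""
    price = 1.0
    for _ in range(steps):
        demands = [max(0.0, u / price - 1.0) for u in utilities]
        price += lr * (sum(demands) - supply)  # dual (price) update
        price = max(price, 1e-6)               # keep the price positive
    return price, demands
```

At the fixed point, total demand equals supply, which mirrors how the distributed per-participant problems stay consistent with the company's global optimization.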
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A Fast and Accurate Hardware String Matching Module with Bloom Filters
Many fields of computing such as Deep Packet Inspection (DPI) employ string matching
modules (SMM) that search for a given set of positive strings in their input. An SMM is expected
to produce correct outcomes while scanning the input data at high rates. Furthermore, the string
sets that are searched for are usually large and their sizes increase steadily. Bloom Filters (BFs)
are fast hashing data structures, but their false-positive results require further
processing. That is, their speed can be exploited for Standard Bloom Filter SMMs (SBFs) as long
as the positive probability is low. Multiple BFs in parallel can further increase the throughput. In
this paper, we propose the Double Bloom Filter SMM (DBF) which achieves a higher
throughput than the SBF and maintains a high throughput even for large positive probabilities.
The second Bloom Filter of DBF stores a small enough subset of the positive strings such that its
false positive probability is approximately zero. We develop an analytical model of the DBF and
show that the throughput advantage of DBF over SBF becomes more prominent if the positive
probability and the fraction of matches in the second Bloom Filter increase. Accordingly, we
propose a heuristic algorithm that stores the strings that are more frequently matched in the
second Bloom Filter according to localities identified in the input. Our numerical results are
obtained using realistic values from an FPGA implementation and are validated by SystemC
simulations.
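The two-filter idea can be sketched in software (the paper targets FPGAs; class names, sizes, and the exact-match fallback below are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

class DoubleBloomSMM:
    """Sketch of the double-filter SMM: the first filter screens every
    input; the second holds a small hot subset of positive strings,
    sized so its false-positive probability is negligible."""
    def __init__(self, strings, hot):
        self.first, self.second = BloomFilter(), BloomFilter(m=8192)
        for s in strings:
            self.first.add(s)
        for s in hot:
            self.second.add(s)
        self.exact = set(strings)  # slow exact store for verification

    def match(self, s):
        if s not in self.first:
            return False           # fast negative path, no verification
        if s in self.second:
            return True            # hot strings resolved without exact check
        return s in self.exact     # verify the remaining (rare) positives
```

The throughput gain comes from the middle branch: frequently matched strings skip the expensive exact verification that ordinary Bloom-filter positives require.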
IEEE Transactions on Parallel and Distributed Systems (June 2016)
Seer Grid: Privacy and Utility Implications of Two-Level Load Prediction in Smart Grids
We propose "Seer Grid", a novel two-level energy consumption prediction framework for smart
grids, aimed to decrease the trade-off between privacy requirements (of the customer) and data
utility requirements (of the energy company (EC)). The first-level prediction at the household
level is performed by each smart meter (SM), and the predicted energy consumption pattern
(instead of the actual energy usage data) is reported to a cluster head (CH). Then, a second-level
prediction at the neighborhood level is done by the CH which predicts the energy spikes in the
neighborhood or cluster and shares it with the EC. Our two-level prediction mechanism is
designed such that it preserves the correlation between the predicted and actual energy
consumption patterns at the cluster level and removes this correlation in the predicted data
communicated by each SM to the CH. This maintains the usefulness of the cluster-level energy
consumption data communicated to the EC, while preserving the privacy of the household-level
energy consumption data against the CH (and thus the EC). Our evaluation results show that Seer
Grid is successful in hiding private consumption patterns at the household-level while still being
able to accurately predict energy consumption at the neighborhood-level.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
Many-Core Real-Time Task Scheduling with Scratchpad Memory
This work is motivated by the demand for scheduling tasks upon the increasingly popular island-
based many-core architectures. On such an architecture, homogeneous cores are grouped into
islands, each of which is equipped with a scratchpad memory module (referred to as local
memory). We first show the NP-hardness and the inapproximability of the scheduling problem.
Despite the inapproximability, positive results can still be found when different cases of the
problem are investigated. A (3 − 1 F )- approximation algorithm is proposed for the minimization
of the maximum system utilization, where F is the number of cores in the platform. When the
technique of resource augmentation is considered, this paper further develops a (γ + 1)-memory
2γ−1 γ−1 - approximation algorithm, where γ represents the trade-off between CPU utilization
and local memory space. On the other hand, a special case is also considered when the ratio of
the worst-case execution time of a task without and with using the local memory is bounded by a
constant. The capabilities of the proposed algorithms are then evaluated with benchmarks from
MRTC, UTDSP, NetBench and DSPstone, where the maximum system utilization can be
significantly reduced even when the local memory size is only 5% of the total footprint of all of
the tasks.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-
Grafting
It is difficult to obtain high performance when computing matchings on parallel processors
because matching algorithms explicitly or implicitly search for paths in the graph, and when
these paths become long, there is little concurrency. In spite of this limitation, we present a new
algorithm and its shared-memory parallelization that achieves good performance and scalability
in computing maximum cardinality matchings in bipartite graphs. Our algorithm searches for
augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices,
hence creating more parallelism than single source algorithms. Algorithms that employ multiple-
source searches cannot discard a search tree once no augmenting path is discovered from the
tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting
method that eliminates most of the redundant edge traversals resulting from this property of
multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a
subroutine to discover augmenting paths faster. Our algorithm compares favorably with the
current best algorithms in terms of the number of edges traversed, the average augmenting path
length, and the number of iterations. We provide a proof of correctness for our algorithm. Our
NUMA-aware implementation is scalable to 80 threads on an Intel multiprocessor and to 240
threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order
of magnitude faster than the fastest algorithms available. The performance improvement is more
significant on graphs with small matching number.
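The sequential baseline that the paper parallelizes is repeated augmenting-path search. A minimal single-source sketch (Kuhn's algorithm; the paper's multi-source BFS with tree grafting is considerably more involved) is:

```python
def max_bipartite_matching(adj, n_left, n_right):
    """Maximum-cardinality bipartite matching via repeated augmenting-path
    searches. adj[u] lists the right-side neighbours of left vertex u."""
    match_r = [-1] * n_right  # match_r[v] = left vertex matched to v

    def try_augment(u, seen):
        for v in adj[u]:
            if seen[v]:
                continue
            seen[v] = True
            # v is free, or its current partner can be re-routed elsewhere
            if match_r[v] == -1 or try_augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    matching = 0
    for u in range(n_left):
        if try_augment(u, [False] * n_right):
            matching += 1
    return matching
```

Each successful search flips one augmenting path; the concurrency problem the abstract describes arises because these paths can grow long, which is what multi-source searches and tree grafting mitigate.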
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A Group-Ordered Fast Iterative Method for Eikonal Equations
In the past decade, many numerical algorithms for the Eikonal equation have been proposed.
Recently, research on Eikonal equation solvers has focused more on developing efficient
parallel algorithms in order to leverage the computing power of parallel systems, such as multi-
core CPUs and GPUs (Graphics Processing Units). In this paper, we introduce an efficient
parallel algorithm that extends Jeong et al.’s FIM (Fast Iterative Method, [1]), originally
developed for the GPU, for multi-core shared memory systems. First, we propose a parallel
implementation of FIM using a lock-free local queue approach and provide an in-depth analysis
of the parallel performance of the method. Second, we propose a new parallel algorithm, Group-
Ordered Fast Iterative Method (GO-FIM), that exploits causality of grid blocks to reduce
redundant computations, which was the main drawback of the original FIM. In addition, the
proposed GO-FIM method employs clustering of blocks based on the updating order where each
cluster can be updated in parallel using multi-core parallel architectures. We discuss the
performance of GO-FIM and compare with the state-of-the-art parallel Eikonal equation solvers.
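A minimal sequential FIM sketch for |∇T| = 1 on a uniform 2D grid (unit spacing and the active-set bookkeeping below are simplifying assumptions; the papers update the active list in parallel and, in GO-FIM, order blocks by causality):

```python
import math

def fim_eikonal(nx, ny, sources, tol=1e-9):
    """Fast Iterative Method for |grad T| = 1: nodes on the active list
    are re-solved until convergence; converged nodes wake up any
    neighbour whose value can still improve."""
    INF = float("inf")
    T = [[INF] * ny for _ in range(nx)]
    for (i, j) in sources:
        T[i][j] = 0.0

    def solve(i, j):
        # Godunov upwind update with unit grid spacing
        a = min(T[i - 1][j] if i > 0 else INF, T[i + 1][j] if i < nx - 1 else INF)
        b = min(T[i][j - 1] if j > 0 else INF, T[i][j + 1] if j < ny - 1 else INF)
        if abs(a - b) >= 1:
            return min(a, b) + 1
        return (a + b + math.sqrt(2 - (a - b) ** 2)) / 2

    active = set()
    for (si, sj) in sources:
        for (i, j) in [(si - 1, sj), (si + 1, sj), (si, sj - 1), (si, sj + 1)]:
            if 0 <= i < nx and 0 <= j < ny:
                active.add((i, j))
    while active:
        nxt = set()
        for (i, j) in active:
            new = solve(i, j)
            if abs(new - T[i][j]) > tol:
                T[i][j] = new
                nxt.add((i, j))      # not converged yet: stay active
            else:
                for (p, q) in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]:
                    if 0 <= p < nx and 0 <= q < ny and solve(p, q) < T[p][q] - tol:
                        nxt.add((p, q))
        active = nxt
    return T
```

The active list is what makes the method parallel-friendly: every node on it can be re-solved independently within a sweep, which is exactly the structure GO-FIM reorders by block causality.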
IEEE Transactions on Parallel and Distributed Systems (May 2016)
Optimal Reconfiguration of High-Performance VLSI Subarrays with Network Flow
A two-dimensional mesh-connected processor array is an extensively investigated architecture
used in parallel processing. Numerous studies have addressed the use of reconfiguration algorithms
for processor arrays with faults. However, the subarray generated by previous algorithms
contains a large number of long interconnects, which in turn leads to more communication costs,
capacitance and dynamic power dissipation. In this paper, we propose novel techniques, making
use of the idea of network flow, to construct the high-performance subarray, which has the
minimum number of long interconnects. Firstly, we construct a network flow model according to
the host array under a specific constraint. Secondly, we show that the reconfiguration problem of
high-performance subarray can be optimally solved in polynomial time by using efficient
minimum-cost flow algorithms. Finally, we prove that the geometric properties of the resulting
subarray meet the system requirements. Simulations based on several random and clustered fault
scenarios clearly reveal the advantage of the proposed technique for reducing the number of long
interconnects. It is shown that, for a host array of size 512 × 512, the number of long interconnects
in the subarray can be reduced by up to 70.05% for clustered faults and by up to 55.28% for
random faults with a density of 1%, as compared to the state-of-the-art.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
DREAM-(L)G: A Distributed Grouping-based Algorithm for Resource Assignment for
Bandwidth-Intensive Applications in the Cloud
Increasingly, many bandwidth-intensive applications have been ported to the cloud platform. In
practice, however, some disadvantages including equipment failures, bandwidth overload and
long-distance transmission often damage the QoS in terms of data availability, bandwidth provision
and access locality, respectively. Some recent solutions have been proposed to cope with
one or two of these disadvantages, but not all. Moreover, as the number of data objects scales, most of
the current offline algorithms solving a constraint optimization problem suffer from low
computational efficiency. To overcome these problems, in this paper we propose an approach
that aims to make fully efficient use of the cloud resources to enable bandwidth-intensive
applications to achieve the desirable level of SLA-specified QoS mentioned above cost-
effectively and in a timely manner. First, we devise a constraint-based model that describes the relationship
among data object placement, user cells bandwidth allocation, operating costs and QoS
constraints. Second, we propose a distributed heuristic algorithm, called DREAM-L, that solves the
model and produces a budget solution to meet SLA-specified QoS. Third, we propose an object-
grouping technique that is integrated into DREAM-L, called DREAM-LG, to significantly
improve the computational efficiency of our algorithm. The results of hundreds of thousands of
simulation-based experiments demonstrate that DREAM-LG provides much better data
availability, bandwidth provision and access locality than the state-of-the-art solutions at modest
cloud operating costs and within a small and acceptable range of time.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal System of Linear
Equations
Quasi-tridiagonal systems of linear equations arise from numerical simulations, and existing
solvers face great challenges as such problems grow to millions of dimensions and beyond. We
present a method that mixes direct and iterative techniques and needs less storage space during
the computing process. A quasi-tridiagonal matrix is split into a tridiagonal matrix
and a sparse matrix using our method and then the tridiagonal equation can be solved by the
direct methods in the iteration processes. Because the approximate solutions obtained by the
direct methods are closer to the exact solutions, the convergence speed of solving the quasi-
tridiagonal system of linear equations can be improved. Furthermore, we present an improved
cyclic reduction algorithm using a partition strategy to solve tridiagonal equations on GPU, and
the intermediate data in computing are stored in shared memory so as to significantly reduce the
latency of memory access. According to our experiments on 10 test cases, the average number of
iterations is reduced significantly by our method compared with Jacobi, GS, GMRES, and
BiCG, and is close to those of BiCGSTAB, BiCRSTAB, and TFQMR. In parallel
mode, the computing efficiency of our method is raised by the partition strategy, and its
performance is better than that of the commonly used iterative and direct
methods because of the smaller amount of calculation per iteration.
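The split-and-iterate idea can be sketched as follows: a direct O(n) tridiagonal solve (the Thomas algorithm, standing in for the paper's GPU cyclic reduction) inside a splitting iteration that moves the sparse off-tridiagonal entries to the right-hand side. The sparse-map representation and parameter names are illustrative assumptions:

```python
def thomas(a, b, c, d):
    """Direct O(n) tridiagonal solver: a is the sub-diagonal, b the
    diagonal, c the super-diagonal, d the right-hand side."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def quasi_tridiag_solve(a, b, c, sparse, d, iters=50):
    """Splitting iteration for (T + S) x = d: each sweep solves
    T x_{k+1} = d - S x_k with the direct solver above. `sparse` maps
    (row, col) -> value for the few off-tridiagonal entries of S."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        rhs = list(d)
        for (i, j), v in sparse.items():
            rhs[i] -= v * x[j]   # move S x_k to the right-hand side
        x = thomas(a, b, c, rhs)
    return x
```

Because each iterate comes from an exact tridiagonal solve, it stays close to the true solution whenever S is small relative to T, which is the convergence-speed argument the abstract makes.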
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors
The increasing number of cores on a chip has made the Network on Chip (NoC) concept the
standard communication paradigm for Chip Multiprocessors. A fault in an NoC leads to
undesirable ramifications that can severely impact the performance of a chip. Therefore, it is
vital to design fault tolerant NoCs. In this paper, we present Shield, a reliable NoC router
architecture that has the unique ability to tolerate both hard and soft errors in the routing pipeline
using techniques such as spatial redundancy, exploitation of idle cycles, bypassing of faulty
resources and selective hardening. Using Mean Time to Failure and Silicon Protection Factor
metrics, we show that Shield is six times more reliable than the baseline-unprotected router and
is at least 1.5 times more reliable than existing fault tolerant router architectures. We introduce a
new metric called Soft Error Improvement Factor and show that the soft error tolerance of Shield
has improved by three times in comparison to the baseline-unprotected router. This reliability
improvement is accomplished by incurring an area and power overhead of 34% and 31%
respectively. Latency analysis using SPLASH-2 and PARSEC reveals that in the presence of
faults, latency increases by a modest 13% and 10% respectively.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
An Energy-Efficient Directory Based Multicore Architecture with Wireless Routers to
Minimize the Communication Latency
Multicore architectures suffer from high core-to-core communication latency primarily due to
the cache’s dynamic behavior. Studies suggest that a directory-approach can be helpful to reduce
communication latency by storing the cached block information. Recent studies also indicate that
a wireless router has potential to help decrease communication latency in multicore architectures.
In this work, we propose a directory based multicore architecture with wireless routers to
minimize communication latency. We simulate systems with mesh (used in the Standford
Directory Architecture for SHared memory (DASH) architecture), wireless network-on-chip
(WNoC), and the proposed directory based architecture with wireless routers. According to the
experimental results, our proposed architecture outperforms the WNoC and the mesh
architectures. It is observed that the proposed architecture helps decrease the communication
delay by up to 15.71% and the total power consumption by up to 67.58% when compared with
the mesh architecture. Similarly, the proposed architecture helps decrease the communication
delay by up to 10.00% and the total power consumption by up to 58.10% when compared with
the WNoC architecture. This is due to the fact that the proposed directory based mechanism
helps reduce the number of core-to-core communications and the wireless routers help reduce the
total number of hops.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
Trajectory Pattern Mining for Urban Computing in the Cloud
The increasing pervasiveness of mobile devices along with the use of technologies like GPS,
Wi-Fi networks, RFID, and sensors, allows for the collection of large amounts of movement data.
These data can be analyzed to extract descriptive and predictive models that can be
properly exploited to improve urban life. From a technological viewpoint, Cloud computing can
play an essential role by helping city administrators to quickly acquire new capabilities and
reducing initial capital costs by means of a comprehensive pay-as-you-go solution. This paper
presents a workflow-based parallel approach for discovering patterns and rules from trajectory
data, in a Cloud-based framework. Experimental evaluation has been carried out on both real-
world and synthetic trajectory data, up to one million trajectories. The results show that, due
to the high complexity and large volumes of data involved in the application scenario, the
trajectory pattern mining process takes advantage from the scalable execution environment
offered by a Cloud architecture in terms of both execution time, speed-up and scale-up.
IEEE Transactions on Parallel and Distributed Systems (May 2016)
FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally
partitioning data among a group of computing nodes. We start this study by discovering a serious
performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large
dataset, data partitioning strategies in the existing solutions suffer high communication and
mining overhead induced by redundant transactions transmitted among computing nodes. We
address this problem by developing a data partitioning approach called FiDoop-DP using the
MapReduce programming model. The overarching goal of FiDoop-DP is to boost the
performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP
is the Voronoi diagram-based data partitioning technique, which exploits correlations among
transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique,
FiDoop-DP places highly similar transactions into a data partition to improve locality without
creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node
Hadoop cluster, driven by a wide range of datasets created by IBM Quest Market-Basket
Synthetic Data Generator. Experimental results reveal that FiDoop-DP is conducive to reducing
network and computing loads by virtue of eliminating redundant transactions on Hadoop
nodes. FiDoop-DP significantly improves the performance of the existing parallel frequent-
pattern scheme by up to 31% with an average of 18%.
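The similarity-driven grouping can be illustrated with MinHash signatures and LSH banding, a common stand-in for this kind of locality-sensitive partitioning (the paper combines LSH with a Voronoi-diagram scheme; the signature length, band count, and hash construction below are assumptions):

```python
import hashlib

def minhash_signature(items, num_hashes=8):
    """MinHash signature of a transaction (a set of items): for each
    seeded hash function keep the minimum item hash. Transactions with
    high Jaccard similarity agree on many signature positions."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return tuple(sig)

def lsh_partition(transactions, bands=4, num_hashes=8):
    """Bucket transactions whose signatures agree on at least one band,
    so highly similar transactions land in the same partition."""
    rows = num_hashes // bands
    buckets = {}
    for tid, items in transactions.items():
        sig = minhash_signature(items, num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(tid)
    return buckets
```

Placing similar transactions together is what lets FiDoop-DP avoid shipping the same transaction redundantly to many mining nodes.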
IEEE Transactions on Parallel and Distributed Systems (April 2016)
DistR: A Distributed Method for the Reachability Query over Large Uncertain Graphs
Among uncertain graph queries, reachability, i.e., the probability that one vertex is reachable
from another, is likely the most fundamental one. Although this problem has been studied within
the field of network reliability, solutions are implemented on a single computer and can only
handle small graphs. However, as the size of graph applications continually increases, the
corresponding graph data can no longer fit within a single computer’s memory and must
therefore be distributed across several machines. Furthermore, the computation of probabilistic
reachability queries is #P-complete making it very expensive even on small graphs. In this paper,
we develop an efficient distributed strategy, called DistR, to solve the problem of reachability
query over large uncertain graphs. Specifically, we perform the task in two steps: distributed
graph reduction and distributed consolidation. In the distributed graph reduction step, we find all
of the maximal subgraphs of the original graph, whose reachability probabilities can be
calculated in polynomial time, compute them and reduce the graph accordingly. After this step,
only a small graph remains. In the distributed consolidation step, we transform the problem into
a relational join process and provide an approximate answer to the #P-complete reachability
query. Extensive experimental studies show that our distributed approach is efficient in terms of
both computational and communication costs, and has high accuracy.
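Since exact probabilistic reachability is #P-complete, a naive single-machine Monte Carlo baseline (sampling possible worlds, not the paper's DistR reduction-and-consolidation method) makes the problem concrete:

```python
import random

def reachability_prob(edges, src, dst, samples=20000, seed=1):
    """Monte Carlo estimate of P(src reaches dst) in an uncertain graph.
    `edges` maps a directed edge (u, v) to its existence probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # sample one possible world, then do a plain DFS inside it
        adj = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            if u == dst:
                hits += 1
                break
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return hits / samples
```

The cost of sampling enough worlds on one machine is exactly what motivates DistR's polynomial-time subgraph reduction before any expensive estimation.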
IEEE Transactions on Parallel and Distributed Systems (February 2016)
A Constraint Programming Scheduler for Heterogeneous High-Performance Computing
Machines
Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the
key role of mapping jobs to the available resources, trying to maximize performance and
Quality-of-Service (QoS). Allocation and Scheduling in the general case are well-known NP-
hard problems, forcing commercial schedulers to adopt greedy approaches to improve
performance and QoS. Search-based approaches featuring the exploration of the solution space
have seldom been employed in this setting, but mostly applied in off-line scenarios. In this paper,
we present the first search-based approach to job allocation and scheduling for HPC machines,
working in a production environment. The scheduler is based on Constraint Programming, an
effective programming technique for optimization problems. The resulting scheduler is flexible,
as it can be easily customized for dealing with heterogeneous resources, user-defined constraints
and different metrics. We evaluate our solution both on virtual machines using synthetic
workloads, and on the Eurora HPC with production workloads. Tests on a wide range of
operating conditions show significant improvements in waiting times and QoS on mid-tier HPC
machines w.r.t. state-of-the-art commercial rule-based dispatchers. Furthermore, we analyze the
conditions under which our approach outperforms commercial approaches, to create a portfolio
of scheduling algorithms that ensures robustness, flexibility and scalability.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
Enabling data-centric distribution technology for partitioned embedded systems
Modern complex embedded systems are evolving into mixed-criticality systems in order to
satisfy a wide set of non-functional requirements such as security, cost, weight, timing or power
consumption. Partitioning is an enabling technology for this purpose, as it provides an
environment with strong temporal and spatial isolation which allows the integration of
applications with different requirements into a common hardware platform. At the same time,
embedded systems are increasingly networked (e.g., cyber-physical systems) and they even
might require global connectivity in open environments, so enhanced communication
mechanisms are needed to develop distributed partitioned systems. To this end, this work
proposes an architecture to enable the use of data-centric real-time distribution middleware in
partitioned embedded systems based on a hypervisor. This architecture relies on distribution
middleware and a set of virtual devices to provide mixed-criticality partitions with a
homogeneous and interoperable communication subsystem. The results obtained show that this
approach provides low overhead and a reasonable trade-off between temporal isolation and
performance.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency
Costs Simultaneously
Intelligent partitioning models are commonly used for efficient parallelization of irregular
applications on distributed systems. These models usually aim to minimize a single
communication cost metric, which is either related to communication volume or message count.
However, both volume- and message-related metrics should be taken into account during
partitioning for a more efficient parallelization. There are only a few works that consider both of
them and they usually address each in separate phases of a two-phase approach. In this work, we
propose a recursive hypergraph bipartitioning framework that reduces the total volume and total
message count in a single phase. In this framework, the standard hypergraph models, nets of
which already capture the bandwidth cost, are augmented with message nets. The message nets
encode the message count so that minimizing conventional cutsize captures the minimization of
bandwidth and latency costs together. Our model provides a more accurate representation of the
overall communication cost by incorporating both the bandwidth and the latency components
into the partitioning objective. The use of the widely-adopted successful recursive bipartitioning
framework provides the flexibility of using any existing hypergraph partitioner. The experiments
on instances from different domains show that our model on average achieves up to a 52%
reduction in total message count and hence a 29% reduction in parallel running time
compared to the model that considers only the total volume.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
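As a toy illustration of the cutsize objective described above (the nets, pins, and weights below are invented, not taken from the paper), the connectivity-1 metric charges each net for every extra part its pins span; adding unit-weight message nets makes the same metric count messages as well:

```python
# Hypothetical sketch: connectivity-1 cutsize of a partitioned hypergraph.
# Volume nets carry data-size weights; message nets carry unit (per-message)
# weights, so one cutsize captures bandwidth and latency costs together.

def cutsize(nets, part):
    """Each net contributes weight * (number of parts its pins span - 1)."""
    total = 0
    for pins, weight in nets:
        spanned = {part[v] for v in pins}
        total += weight * (len(spanned) - 1)
    return total

# Volume nets: (pins, data size communicated).
volume_nets = [((0, 1, 2), 4), ((2, 3), 2)]
# Message nets: (sender plus receivers, per-message cost); spanning an
# extra part corresponds to one extra message.
message_nets = [((0, 1), 1), ((1, 2, 3), 1)]

part = {0: 0, 1: 0, 2: 1, 3: 1}
print(cutsize(volume_nets + message_nets, part))  # → 5 (volume 4 + 1 message)
```

Any off-the-shelf hypergraph partitioner that minimizes this cutsize would then trade volume against message count through the net weights.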
Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data
Processing Frameworks
The shift to the in-memory data processing paradigm has had a major influence on the
development of cluster data processing frameworks. Numerous frameworks from the industry,
open source community and academia are adopting the in-memory paradigm to achieve
functionalities and performance breakthroughs. However, despite the advantages of these
in-memory frameworks, in practice they are susceptible to memory-pressure-related performance
collapse and failures. The contributions of this paper are two-fold. Firstly, we conduct a detailed
diagnosis of the memory pressure problem and identify three preconditions for the performance
collapse. These preconditions not only explain the problem but also shed light on the possible
solution strategies. Secondly, we propose a novel programming abstraction called the leaky
buffer that eliminates one of the preconditions, thereby addressing the underlying problem. We
have implemented a leaky-buffer-enabled hashtable in Spark, and we believe it can also
replace hashtables that perform similar hash aggregation operations in other programs
or data processing frameworks. Experiments on a range of memory-intensive aggregation
operations show that the leaky buffer abstraction can drastically reduce the occurrence of
memory-related failures, improve performance by up to 507% and reduce memory usage by up
to 87.5%.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
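The leaky-buffer idea above can be caricatured in a few lines (a sketch under assumed semantics: the capacity, spill target, and merge step are invented, and the Spark hashtable in the paper is far more involved):

```python
# Minimal sketch of a "leaky buffer": an aggregation buffer that caps its
# in-memory footprint and leaks (spills) entries to secondary storage when
# the cap is exceeded, so memory use stays bounded for any number of keys.

class LeakyBuffer:
    def __init__(self, capacity, combine):
        self.capacity = capacity      # max in-memory entries (hypothetical)
        self.combine = combine        # aggregation function, e.g. operator.add
        self.mem = {}                 # in-memory partial aggregates
        self.spilled = []             # stand-in for on-disk spill runs

    def insert(self, key, value):
        if key in self.mem:
            self.mem[key] = self.combine(self.mem[key], value)
        else:
            if len(self.mem) >= self.capacity:
                # Leak the whole buffer instead of growing it.
                self.spilled.append(self.mem)
                self.mem = {}
            self.mem[key] = value

    def result(self):
        # Merge spilled runs back in (a real system would stream from disk).
        out = {}
        for run in self.spilled + [self.mem]:
            for k, v in run.items():
                out[k] = self.combine(out[k], v) if k in out else v
        return out

import operator
buf = LeakyBuffer(capacity=2, combine=operator.add)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]:
    buf.insert(k, v)
print(buf.result())  # → {'a': 4, 'b': 7, 'c': 4}
```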
Parity-Switched Data Placement: Optimizing Partial Stripe Writes in XOR-Coded Storage
Systems
Erasure codes tolerate disk failures by pre-storing a low degree of data redundancy, and have
been commonly adopted in current storage systems. However, the attached requirement on data
consistency amplifies partial stripe write operations and thus seriously degrades system
performance. Previous works to optimize partial stripe writes are relatively limited, and a general
mechanism is still absent. In this paper, we propose a Parity-Switched Data Placement (PDP) to
optimize partial stripe writes for any XOR-coded storage system. PDP first reduces the write
operations by arranging continuous data elements to join a common parity element’s generation.
To achieve a deeper optimization, PDP further explores the generation orders of parity elements
and makes any two continuous data elements associate with a common parity element. Intensive
evaluations show that for tested erasure codes, PDP reduces up to 31.9% of write operations and
further increases the write speed by up to 59.8% when compared with two state-of-the-art data
placement methods.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
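To see why placement matters for partial stripe writes, consider this illustrative count (the placements below are invented; PDP's actual construction applies to any XOR code):

```python
# Illustrative sketch, not PDP itself: writing w consecutive data elements
# updates each element plus every distinct parity element those elements
# participate in, so a placement mapping neighbors to a common parity
# needs fewer write operations.

def write_ops(parity_of, start, w):
    """Total element writes for updating data elements start..start+w-1."""
    parities = {parity_of[i] for i in range(start, start + w)}
    return w + len(parities)

striped   = [0, 1, 0, 1]   # neighboring data elements alternate parities
clustered = [0, 0, 1, 1]   # neighboring data elements share a parity
print(write_ops(striped, 0, 2), write_ops(clustered, 0, 2))  # → 4 3
```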
VINEA: An Architecture for Virtual Network Embedding Policy Programmability
Network virtualization has enabled new business models by allowing infrastructure providers to
lease or share their physical network. A fundamental management problem that cloud providers
face to support customized virtual network (VN) services is the virtual network embedding. This
requires solving the (NP-hard) problem of matching constrained virtual networks onto the
physical network. In this paper we present VINEA, a policy-based virtual network embedding
architecture, and its system implementation. VINEA leverages our previous results on VN
embedding optimality and convergence guarantees, and it is based on a network utility
maximization approach that separates policies (i.e., high-level goals) from underlying
embedding mechanisms: resource discovery, virtual network mapping, and allocation on the
physical infrastructure. We show how VINEA can subsume existing embedding approaches, and
how it can be used to design novel solutions that adapt to different scenarios, by merely
instantiating different policies. We describe the VINEA architecture, as well as our object model:
our VINO protocol and the API to program the embedding policies; we then analyze key
representative tradeoffs among novel and existing VN embedding policy configurations, via
event-driven simulations, and with our prototype implementation. Among our findings, our
evaluation shows how, in contrast to existing solutions, simultaneously embedding nodes and
links may lead to lower providers’ revenue. We release our implementation on a testbed that uses
a Linux system architecture to reserve virtual node and link capacities. Our prototype can be also
used to augment existing open-source "Networking as a Service" architectures such as
OpenStack Neutron, which currently lacks a VN embedding protocol, and as a policy-
programmable solution to the "slice stitching" problem within wide-area virtual network
testbeds.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
Application control configurations for parallel connection of single-phase energy
conversion units operating in island mode
This paper presents the design and implementation of controllers for the parallel connection
of single-phase energy conversion units operating in island mode. To meet this objective, two
control configurations are implemented, both taking as reference the output voltage of a
droop scheme: a two-degrees-of-freedom controller plus a repetitive controller, and a
proportional-integral/proportional controller plus a resonant controller. These control
configurations are intended to maintain the amplitude, waveform, and frequency of the voltage
signal and to handle linear and nonlinear load increases during island-mode operation of a
single-phase energy conversion unit. In other words, with these control strategies, several
inverters connected in parallel to a microgrid can operate as voltage sources, sharing the
active and reactive power demanded by the load.
IEEE Latin America Transactions (March 2016)
PathGraph: A Path Centric Graph Processing System
Large scale iterative graph computation presents an interesting systems challenge due to two
well known problems: (1) the lack of access locality and (2) the lack of storage efficiency. This
paper presents PathGraph, a system for improving iterative graph computation on graphs with
billions of edges. First, we improve the memory and disk access locality for iterative
computation algorithms on large graphs by modeling a large graph using a collection of tree-
based partitions. This enables us to use path-centric computation rather than vertex-centric or
edge-centric computation. For each tree partition, we re-label vertices using DFS in order to
preserve consistency between the order of vertex ids and vertex order in the paths. Second, a
compact storage that is optimized for iterative graph parallel computation is developed in the
PathGraph system. Concretely, we employ delta-compression and store tree-based partitions in a
DFS order. By clustering highly correlated paths together as tree based partitions, we maximize
sequential access and minimize random access on storage media. Third but not least, our
path-centric computation model is implemented using a scatter/gather programming model. We
parallelize the iterative computation at the partition-tree level and perform sequential local updates for
vertices in each tree partition to improve the convergence speed. To provide well balanced
workloads among parallel threads at tree partition level, we introduce the concept of multiple
stealing points based task queue to allow work stealings from multiple points in the task queue.
We evaluate the effectiveness of PathGraph by comparing with recent representative graph
processing systems such as GraphChi and X-Stream etc. Our experimental results show that our
approach outperforms the two systems on a number of graph algorithms for both in-memory and
out-of-core graphs. While our approach achieves better data balance and load balance, it also
shows better speedup than the two systems as the number of threads grows.
IEEE Transactions on Parallel and Distributed Systems (January 2016)
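The DFS relabeling step mentioned above can be sketched as follows (the tree partition is a made-up example):

```python
# Sketch of DFS relabeling within a tree partition: vertices are renumbered
# in DFS preorder so that vertex ids increase along any root-to-leaf path,
# keeping the order of ids consistent with vertex order in the paths.

def dfs_relabel(tree, root):
    """Return {old_id: new_id} with new ids assigned in DFS preorder."""
    label, order, stack = {}, 0, [root]
    while stack:
        v = stack.pop()
        label[v] = order
        order += 1
        # Push children in reverse so they are visited left-to-right.
        stack.extend(reversed(tree.get(v, [])))
    return label

tree = {10: [7, 42], 7: [3], 42: [5, 8]}
print(dfs_relabel(tree, 10))  # → {10: 0, 7: 1, 3: 2, 42: 3, 5: 4, 8: 5}
```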
CIACP: A Correlation- and Iteration-Aware Cache Partitioning Mechanism to Improve
Performance of Multiple Coarse-Grained Reconfigurable Arrays
Multiple coarse-grained reconfigurable arrays (CGRA), which are organized in parallel or
pipeline to complete applications, have become a productive solution to balance the performance
with the flexibility. One of the keys to obtain high performance from multiple CGRAs is to
manage the shared on-chip cache efficiently to reduce off-chip memory bandwidth requirements.
Cache partitioning has been viewed as a promising technique to enhance the efficiency of a
shared cache. However, the majority of prior partitioning techniques were developed for multi-
core platforms and aimed at multi-programmed workloads. They cannot directly address the
adverse impacts of data correlation and computation imbalance among competing CGRAs on a
multi-CGRA platform. This paper proposes a correlation- and iteration-aware cache partitioning
(CIACP) mechanism for shared cache partitioning in multi-CGRA systems. This mechanism
employs correlation monitors (CMONs) to trace the amount of overlapping data among parallel
CGRAs, and iteration monitors (IMONs) to track the computation load of each CGRA. Using the
information collected by CMONs and IMONs, the CIACP mechanism can eliminate redundant
cache utilization of the overlapping data and can also shorten the total execution time of
pipelined CGRAs. Experimental results showed that CIACP outperformed state-of-the-art utility-
based cache partitioning techniques by up to 16% in performance.
IEEE Transactions on Parallel and Distributed Systems (April 2016)
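For context, the utility-based baseline that CIACP is compared against can be sketched as a greedy way-allocation loop (the hit curves below are invented; CIACP itself additionally folds in the CMON/IMON correlation and iteration information, which this sketch omits):

```python
# Simplified utility-based cache-way partitioning: each consumer (here, a
# CGRA) reports hits(w) = hits achieved with w ways, and ways are handed
# out one at a time to whichever consumer gains the most from the next way.

def partition_ways(hit_curves, total_ways):
    alloc = [0] * len(hit_curves)
    for _ in range(total_ways):
        # Marginal utility of one more way for consumer i.
        best = max(range(len(hit_curves)),
                   key=lambda i: hit_curves[i][alloc[i] + 1] - hit_curves[i][alloc[i]])
        alloc[best] += 1
    return alloc

# Hypothetical hit curves: hit_curves[i][w] = hits of CGRA i with w ways.
curves = [[0, 50, 60, 65, 67], [0, 30, 55, 75, 80]]
print(partition_ways(curves, 4))  # → [1, 3]
```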
Failure Diagnosis for Distributed Systems using Targeted Fault Injection
This paper introduces a novel approach to automating failure diagnostics in distributed systems
by combining fault injection and data analytics. We use fault injection to populate the database
of failures for a target distributed system. When a failure is reported from a production
environment, the database is queried to find "matched" failures generated by fault injections.
Relying on the assumption that similar faults generate similar failures, we use information from
the matched failures as hints to locate the actual root cause of the reported failures. In order to
implement this approach, we introduce techniques for (i) reconstructing end-to-end execution
flows of distributed software components, (ii) computing the similarity of the reconstructed
flows, and (iii) performing precise fault injection at pre-specified executing points in distributed
systems. We have evaluated our approach using an OpenStack cloud platform, a popular cloud
infrastructure management system. Our experimental results showed that this approach is
effective in determining the root causes, e.g., fault types and affected components, for 71-100%
of tested failures. Furthermore, it can provide fault locations close to actual ones and can easily
be used to find and fix actual root causes. We have also validated this technique by localizing
real bugs that occurred in OpenStack.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
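A highly simplified illustration of the matching step above (the event names and the Jaccard measure are assumptions for illustration; the paper's flow reconstruction and similarity computation are more elaborate):

```python
# Hypothetical sketch: look up the injected failure whose reconstructed
# execution flow is most similar to a reported failure's flow, using
# Jaccard similarity over event sets as a stand-in similarity measure.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Fault-injection database: injected fault -> observed event flow (invented).
database = {
    "net-timeout": ["rpc.send", "rpc.retry", "timeout", "teardown"],
    "disk-full":   ["write", "flush", "enospc", "abort"],
}
reported = ["rpc.send", "timeout", "teardown", "alert"]

# The best match serves as a hint to the root cause of the reported failure.
best = max(database, key=lambda k: jaccard(database[k], reported))
print(best)  # → net-timeout
```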
Towards Practical and Near-optimal Coflow Scheduling for Data Center Networks
In current data centers, an application (e.g., MapReduce, Dryad, search platform, etc.) usually
generates a group of parallel flows to complete a job. These flows compose a coflow and only
completing them all is meaningful to the application. Accordingly, minimizing the average
Coflow Completion Time (CCT) becomes a critical objective of flow scheduling. However,
achieving this goal in today’s Data Center Networks (DCNs) is quite challenging, not only
because the scheduling problem is theoretically NP-hard, but also because it is tough to perform
practical flow scheduling in large-scale DCNs. In this paper, we find that minimizing the average
CCT of a set of coflows is equivalent to the well-known problem of minimizing the sum of
completion times in a concurrent open shop. As there are abundant existing solutions for
concurrent open shop, we open up a variety of techniques for coflow scheduling. Inspired by the
best known result, we derive a 2-approximation algorithm for coflow scheduling, and further
develop a decentralized coflow scheduling system, D-CAS, which avoids the system problems
associated with current centralized proposals while addressing the performance challenges of
decentralized suggestions. Trace-driven simulations indicate that D-CAS achieves a performance
close to Varys, the state-of-the-art centralized method, and outperforms Baraat, the only existing
decentralized method, significantly.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
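In the concurrent open shop view described above, a permutation schedule processes coflows in one fixed order on every port. The sketch below orders coflows by their bottleneck load, a common heuristic in this setting, and computes the resulting average CCT; it is illustrative only, not the paper's 2-approximation or D-CAS:

```python
# Illustrative coflow scheduling sketch: each coflow places a load on every
# port; a permutation schedule serves coflows in one order on all ports.
# Coflows are ordered smallest-bottleneck-first, and the average Coflow
# Completion Time (CCT) of that schedule is computed.

def avg_cct(coflows, num_ports):
    """coflows: list of per-port load vectors (time units)."""
    order = sorted(range(len(coflows)),
                   key=lambda c: max(coflows[c]))   # smallest bottleneck first
    finish = [0.0] * num_ports
    ccts = {}
    for c in order:
        for p in range(num_ports):
            finish[p] += coflows[c][p]
        # A coflow completes when its last flow does (ports it actually uses).
        ccts[c] = max(finish[p] for p in range(num_ports)
                      if coflows[c][p] > 0)
    return sum(ccts.values()) / len(ccts)

# Two ports, three coflows with hypothetical loads.
print(avg_cct([[4, 1], [1, 2], [3, 3]], 2))  # → 5.0
```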
Traffic Load Balancing Schemes for Devolved Controllers in Mega Data Centers
In most existing cloud services, a centralized controller is used for resource management and
coordination. However, such an infrastructure is increasingly insufficient to meet the rapid growth of
mega data centers. In recent literature, a new approach named devolved controller was proposed
to address scalability concerns. This approach splits the whole network into several regions, each with
one controller to monitor and reroute a portion of the flows. This technique alleviates the
problem of an overloaded single controller, but introduces other problems such as unbalanced
workload among controllers and reconfiguration complexity. In this paper, we explore the use
of devolved controllers for mega data centers, and design new schemes to
overcome these shortcomings and improve the performance of the system. We first formulate
Load Balancing problem for Devolved Controllers (LBDC) in data centers, and prove that it is
NP-complete. We then design an f-approximation algorithm for LBDC, where f is the largest number of
potential controllers for a switch in the network. Furthermore, we propose both centralized and
distributed greedy approaches to solve the LBDC problem effectively. The numerical results
validate the efficiency of our schemes, which offer a solution for monitoring, managing,
and coordinating mega data centers with multiple controllers working together.
IEEE Transactions on Parallel and Distributed Systems (June 2016)
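A minimal sketch of a centralized greedy assignment in the spirit described above (the switch loads, candidate sets, and heaviest-first rule are assumptions, not the paper's exact algorithm):

```python
# Hypothetical greedy LBDC sketch: each switch has a set of potential
# controllers; switches are assigned, heaviest first, to the least-loaded
# eligible controller, balancing monitoring load across controllers.

def greedy_lbdc(switch_load, candidates, num_controllers):
    load = [0] * num_controllers
    assign = {}
    # Heaviest switches first, so large loads are placed while headroom exists.
    for s in sorted(switch_load, key=switch_load.get, reverse=True):
        c = min(candidates[s], key=lambda k: load[k])
        assign[s] = c
        load[c] += switch_load[s]
    return assign, load

# Invented example: 4 switches, 3 controllers.
switch_load = {"s1": 5, "s2": 3, "s3": 4, "s4": 2}
candidates = {"s1": [0, 1], "s2": [0], "s3": [1, 2], "s4": [1, 2]}
assign, load = greedy_lbdc(switch_load, candidates, 3)
print(assign, load)  # → {'s1': 0, 's3': 1, 's2': 0, 's4': 2} [8, 4, 2]
```

Here f from the approximation bound would be the largest candidate-set size (2 in this example).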
Reactive Molecular Dynamics on Massively Parallel Heterogeneous Architectures
We present a parallel implementation of the ReaxFF force field on massively parallel
heterogeneous architectures, called PuReMD-Hybrid. PuReMD, on which this work is based,
along with its integration into LAMMPS, is currently used by a large number of research groups
worldwide. Accelerating this important community codebase that implements a complex reactive
force field poses a number of algorithmic, design, and optimization challenges, as we discuss in
detail. In particular, different computational kernels are best suited to different computing
substrates – CPUs or GPUs. Scheduling these computations requires complex resource
management, as well as minimizing data movement across CPUs and GPUs. Integrating
powerful nodes, each with multiple CPUs and GPUs, into clusters and utilizing the immense
compute power of these clusters requires significant optimizations for minimizing
communication and, potentially, redundant computations. From a programming model
perspective, PuReMD-Hybrid relies on MPI across nodes, pthreads across cores, and CUDA on
the GPUs to address these challenges. Using a variety of innovative algorithms and
optimizations, we demonstrate that our code can achieve over a 565-fold speedup compared to a
single-core implementation on a cluster of 36 state-of-the-art GPUs for complex systems. In
terms of application performance, our code enables simulations of over 1.8M atoms in under
0.68 seconds per simulation time step.
IEEE Transactions on Parallel and Distributed Systems (March 2016)
A fast discrete wavelet transform using hybrid parallelism on GPUs
Wavelet transform has been widely used in many signal and image processing applications. Due
to its wide adoption in time-critical applications, such as streaming and real-time signal
processing, many acceleration techniques were developed during the past decade. Recently, the
graphics processing unit (GPU) has gained much attention for accelerating computationally-
intensive problems and many solutions of GPU-based discrete wavelet transform (DWT) have
been introduced, but most of them did not fully leverage the potential of the GPU. In this paper,
we present various state-of-the-art GPU optimization strategies in DWT implementation, such as
leveraging shared memory, registers, warp shuffling instructions, and thread- and instruction-
level parallelism (TLP, ILP), and finally present our hybrid approach to further boost its
performance. In addition, we introduce a novel mixed-band memory layout for Haar DWT,
where multi-level transform can be carried out in a single fused kernel launch. As a result, unlike
recent GPU DWT methods that focus mainly on maximizing ILP, we show that the optimal GPU
DWT performance can be achieved by hybrid parallelism combining both TLP and ILP together
in a mixed-band approach. We demonstrate the performance of our proposed method by
comparison with other CPU and GPU DWT methods.
IEEE Transactions on Parallel and Distributed Systems (February 2016)
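For reference, the butterfly that all these GPU strategies optimize is just the Haar analysis step; a plain-Python single-level version (orthonormal 1/sqrt(2) scaling assumed, no GPU specifics):

```python
# Single-level 1-D Haar DWT: each adjacent pair of samples yields one
# approximation (sum) and one detail (difference) coefficient, scaled by
# 1/sqrt(2) to keep the transform orthonormal.

import math

def haar_dwt_1d(signal):
    """Return (approximation, detail) coefficients for an even-length signal."""
    assert len(signal) % 2 == 0
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_dwt_1d([4, 2, 5, 5])
print(a, d)  # approx ≈ [4.24, 7.07], detail ≈ [1.41, 0.0]
```

Multi-level transforms recurse on the approximation coefficients; the mixed-band layout above lets a GPU kernel fuse those levels into a single launch.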
SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per your university's guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training in the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication of your project.
FOLLOW US ON FACEBOOK @ TSYS Academic Projects
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 

Recently uploaded (20)

internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 

Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds

For more details, feel free to contact us at any time.
Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/
Mail Id: tsysglobalsolutions2014@gmail.com

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 2016 TOPICS

Fault-Tolerant Scheduling for Real-Time Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds

Clouds are becoming an important platform for scientific workflow applications. However, with many nodes deployed in clouds, managing the reliability of resources becomes a critical issue, especially for real-time scientific workflow execution, where deadlines must be satisfied. Fault tolerance in clouds is therefore essential. Primary-backup (PB) based scheduling is a popular fault-tolerance technique that has been used effectively in cluster and grid computing. However, applying this technique to real-time workflows in a virtualized cloud is much more complicated and has rarely been studied. In this paper, we address this problem. We first establish a real-time workflow fault-tolerant model that extends the traditional PB model by incorporating cloud characteristics. Based on this model, we develop approaches for task allocation and message transmission to ensure that faults can be tolerated during workflow execution. Finally, we propose a dynamic fault-tolerant scheduling algorithm, FASTER, for real-time workflows in the virtualized cloud. FASTER has three key features: 1) it employs a backward shifting method to make full use of idle resources and incorporates task overlapping and VM migration for high resource utilization; 2) it applies vertical/horizontal scaling-up to quickly provision resources for a burst of workflows; and 3) it uses vertical scaling-down to avoid unnecessary and ineffective resource changes caused by fluctuating workflow requests.
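The placement constraint at the heart of primary-backup scheduling can be sketched as follows. This is a minimal illustration with hypothetical names (`pb_allocate`, a greedy least-loaded policy), not the FASTER algorithm itself, which additionally handles deadlines, backward shifting, task overlapping, and VM scaling:

```python
# Minimal primary-backup (PB) allocation sketch -- illustrative only,
# not the FASTER algorithm. Each task's backup is placed on a different
# host than its primary, so a single host failure is always tolerated.

def pb_allocate(tasks, hosts):
    """Greedily map each task to a (primary_host, backup_host) pair."""
    load = {h: 0 for h in hosts}
    placement = {}
    for task in tasks:
        # Primary goes on the currently least-loaded host.
        primary = min(load, key=load.get)
        load[primary] += 1
        # Backup goes on the least-loaded host OTHER than the primary.
        backup = min((h for h in hosts if h != primary), key=lambda h: load[h])
        load[backup] += 1
        placement[task] = (primary, backup)
    return placement

placement = pb_allocate(["t1", "t2", "t3"], ["vm_a", "vm_b", "vm_c"])
for task, (p, b) in placement.items():
    assert p != b  # one host failure never loses both copies of a task
```

The key invariant is only that primary and backup never share a host; any load-balancing policy could replace the greedy one used here.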
We evaluate the FASTER algorithm with synthetic workflows and with workflows collected from real scientific and business applications, and compare it with six baseline algorithms. The experimental results demonstrate that FASTER can effectively improve resource utilization and schedulability even in the presence of node failures in virtualized clouds.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length of a DAG with Random Weights
Coping with uncertainty when scheduling task graphs on parallel machines requires non-trivial evaluations. When each computation and communication duration is a random variable, evaluating the distribution of the critical path length of such graphs involves computing maximums and sums of possibly dependent random variables. The discrete version of this evaluation problem is known to be #P-hard. Here, we propose two heuristics, CorLCA and Cordyn, to compute such lengths. They approximate the input random variables and the intermediate ones as normal random variables, and they take correlations into account through two distinct mechanisms: lowest common ancestor queries for CorLCA and a dynamic programming approach for Cordyn. Moreover, we empirically compare classical methods from the literature against our solutions. Simulations on a large set of cases indicate that CorLCA and Cordyn each constitute a new, relevant trade-off between speed and precision.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

A Hybrid Static-Dynamic Classification for Dual-Consistency Cache Coherence

Traditional cache coherence protocols manage all memory accesses equally and ensure the strongest memory model, namely sequential consistency. Recent cache coherence protocols based on self-invalidation advocate the model of sequential consistency for data-race-free code, which enables powerful optimizations for race-free code. However, for racy code these protocols provide sub-optimal performance compared to traditional protocols.

This paper proposes SPEL++, a dual-consistency cache coherence protocol that supports two execution modes: a traditional sequentially consistent protocol and a protocol that provides weak consistency (or sequential consistency for data-race-free code). SPEL++ exploits a hybrid static-dynamic classification of memory accesses based on (i) compile-time identification of extended data-race-free code regions for OpenMP applications and (ii) runtime classification of accesses based on the operating system's memory page management. By executing racy code under the sequentially consistent protocol and race-free code under the protocol that provides sequential consistency for data-race-free code, the end result is efficient execution of the applications while still providing sequential consistency. Compared to a traditional protocol, we show performance improvements from 19% to 38% and energy-consumption reductions from 47% to 53%, on average for different benchmark suites, on a 64-core chip multiprocessor.
IEEE Transactions on Parallel and Distributed Systems (February 2016)

REFRESH: REDEFINE for Face Recognition using SURE Homogeneous Cores

In this paper we present the design and analysis of a scalable real-time Face Recognition (FR) module that performs 450 recognitions per second. We introduce an algorithm for FR that combines Weighted Modular Principal Component Analysis with Radial Basis Function Neural Networks. This algorithm offers better recognition accuracy under various practical conditions than the algorithms used in existing architectures for real-time FR. To meet real-time requirements, a Scalable Parallel Pipelined Architecture (SPPA) is developed by realizing the FR algorithm as independent parallel streams and sub-streams of computation. SPPA is capable of supporting large databases maintained in external (DDR) memory. By casting the computations in a stream into hardware, we arrive at the design of a Scalable Unit for Region Evaluation (SURE) core. Using SURE cores as compute elements in a massively parallel CGRA such as REDEFINE, we obtain an FR system on REDEFINE called REFRESH. We report FPGA and ASIC synthesis results for SPPA and REFRESH. Through analysis using these results, we show that its excellent scalability and added programmability make REFRESH a flexible and favorable solution for real-time FR.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints

As systems scale toward exascale, many resources will become increasingly constrained. While some of these resources have historically been explicitly allocated, many, such as network bandwidth, I/O bandwidth, or power, have not. As systems continue to evolve, we expect many such resources to become explicitly managed.
This change will pose critical challenges for resource management and job scheduling. In this paper, we explore the potential of relaxing network allocation constraints on Blue Gene systems. Our objective is to improve batch scheduling performance, where the partition-based interconnect architecture provides a unique opportunity to explicitly allocate network resources to jobs. This paper makes three major contributions. The first is substantial benchmarking of parallel applications, focusing on assessing application sensitivity to communication bandwidth at large scale. The second is three new scheduling schemes that use relaxed network allocation and target a balance between individual job performance and overall system performance. The third is a comparative study of our scheduling schemes versus the existing scheduler on Mira, a 48-rack Blue Gene/Q system at Argonne National Laboratory. Specifically, we use job traces collected from this production system.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

Transparent and optimized distributed processing on GPUs

DistributedCL is middleware that enables transparent parallel processing on distributed GPUs. With the support of the DistributedCL middleware, an application designed to use the OpenCL API can run in a distributed manner and transparently use remote GPUs without having to change or rebuild the code. The proposed architecture for the DistributedCL middleware is modular, with well-defined layers. A prototype was built according to the architecture, incorporating various optimizations, including sending data in batches, asynchronous network communication, and asynchronous requests to the OpenCL API. The prototype was evaluated using available benchmarks, and a dedicated benchmark, CLBench, was developed to facilitate evaluation according to the amount of processed data. The prototype showed good performance, higher than that of similar proposals that also consider transparent use of remote GPUs. The size of the data to be transmitted over the network was the major limiting factor.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

Distributed Control for Charging Multiple Electric Vehicles with Overload Limitation

Severe pollution caused by traditional fossil fuels has drawn great attention to the usage of plug-in electric vehicles (PEVs) and renewable energy. However, large-scale penetration of PEVs, combined with other kinds of appliances, tends to place an excessive or even disastrous burden on the power grid, especially during peak hours.
This paper focuses on scheduling the charging of PEVs among different charging stations, where each station can be supplied both by renewable energy generators and by a distribution network that also powers some uncontrollable loads. To minimize the on-grid energy cost with local renewable energy and non-ideal storage while avoiding the overload risk of the distribution network, an online algorithm that schedules the charging of PEVs and manages the energy of charging stations is developed based on Lyapunov optimization and Lagrange dual decomposition techniques. The algorithm satisfies random charging requests from PEVs with provable performance. Simulation results with real data demonstrate that the proposed algorithm can decrease the time-average cost of stations while avoiding overload in the distribution network in the presence of random uncontrollable loads.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

CoRE: Cooperative End-to-End Traffic Redundancy Elimination for Reducing Cloud Bandwidth Cost

The pay-as-you-go service model impels cloud customers to reduce their bandwidth usage cost. Traffic Redundancy Elimination (TRE) has been shown to be an effective solution for reducing bandwidth costs and has thus recently captured significant attention in the cloud environment. By studying TRE techniques in a trace-driven approach, we found that both short-term (time spans of seconds) and long-term (time spans of hours or days) data redundancy can appear concurrently in the traffic, and that using only sender-based or only receiver-based TRE cannot capture both types of redundancy simultaneously. Moreover, the efficiency of existing receiver-based TRE solutions is susceptible to data changes relative to the historical data in the cache. In this paper, we propose a Cooperative end-to-end TRE solution (CoRE) that can detect and remove both short-term and long-term redundancy through a two-layer TRE design with cooperative operations between the layers. An adaptive prediction algorithm is further proposed to improve TRE efficiency by dynamically adjusting the prediction window size based on the hit ratio of historical predictions. In addition, we enhance CoRE to adapt to the different traffic redundancy characteristics of cloud applications in order to reduce its operating cost.
Extensive evaluation with several real traces shows that CoRE is capable of effectively identifying both short-term and long-term redundancy at low additional cost while maintaining TRE efficiency under data changes.

IEEE Transactions on Parallel and Distributed Systems (June 2016)

Clustering-based Task Scheduling in a Large Number of Heterogeneous Processors

Parallelization paradigms for the effective execution of Directed Acyclic Graph (DAG) applications have been widely studied in the area of task scheduling. Schedule length can vary depending on task assignment policies, scheduling policies, and the heterogeneity of processors and communication bandwidths in a heterogeneous system. One disadvantage of existing task scheduling algorithms is that the schedule length cannot be reduced for data-intensive applications. In this paper, we propose a clustering-based task scheduling algorithm called Clustering for Minimizing the Worst Schedule Length (CMWSL) that minimizes the schedule length on a large number of heterogeneous processors. First, the proposed method derives a lower bound on the total execution time for each processor by taking both system and application characteristics into account. As a result, the number of processors used for actual execution is regulated to minimize the Worst Schedule Length (WSL). Then, the actual task assignment and task clustering are performed to minimize the schedule length until the total execution time in a task cluster exceeds the lower bound. Experimental results indicate that CMWSL outperforms both existing list-based and clustering-based task scheduling algorithms in terms of schedule length and efficiency, especially for data-intensive applications.

IEEE Transactions on Parallel and Distributed Systems (February 2016)

Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model

The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities) in order to further improve the fault-tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint interval for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads, such as transient memory errors, while checkpoint level 2 deals with hardware crashes, such as node failures.
Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution that requires no advance knowledge of the job length, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed from a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3% compared with other state-of-the-art approaches. Finally, a brute-force comparison with all possible patterns shows that our solution is always within 1% of the best pattern in the experiments.
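For context, the classic single-level baseline that such multilevel solutions are measured against is the Young/Daly periodic checkpoint interval. A one-line sketch of that textbook first-order formula (this is the standard baseline, not the paper's two-level optimum):

```python
import math

# Young/Daly first-order optimal checkpoint period for a SINGLE level:
# tau = sqrt(2 * C * M), where C is the checkpoint cost and M is the
# mean time between failures. Textbook baseline, not the paper's
# two-level online solution.

def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# e.g., a 60 s checkpoint and a 24 h MTBF suggest checkpointing
# roughly every 54 minutes:
tau = young_daly_period(60.0, 24 * 3600.0)
```

A two-level scheme must pick one such interval per level, which is why the joint optimization is substantially harder than this closed form.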
IEEE Transactions on Parallel and Distributed Systems (March 2016)

Shadow/Puppet Synthesis: A Stepwise Method for the Design of Self-Stabilization

This paper presents a novel two-step method for the automated design of self-stabilization. The first step enables the specification of legitimate states and an intuitive (but imprecise) specification of the desired functional behaviors in the set of legitimate states (hence the term "shadow"). After creating the shadow specifications, we systematically introduce the main variables and the topology of the desired self-stabilizing system. Subsequently, we devise a parallel, complete backtracking search for a self-stabilizing solution that implements a precise version of the shadow behaviors and guarantees recovery to legitimate states from any state. To the best of our knowledge, shadow/puppet synthesis is the first sound and complete method that exploits parallelism and randomization, along with expansion of the state space, to generate self-stabilizing systems that cannot be synthesized with existing methods. We have validated the proposed method by creating both a sequential and a parallel implementation in the context of a software tool called Protocon. Moreover, we have used Protocon to automatically design three new self-stabilizing protocols that we conjecture to require the minimal number of states per process to achieve stabilization (when processes are deterministic): 2-state maximal matching on bidirectional rings, 5-state token passing on unidirectional rings, and 3-state token passing on bidirectional chains.
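To make "recovery to legitimate states from any state" concrete, here is the classic Dijkstra K-state token ring, a well-known hand-designed self-stabilizing protocol shown purely for illustration (it is not output of Protocon): from an arbitrary initial configuration, repeatedly firing any enabled process converges to configurations with exactly one token.

```python
# Dijkstra's K-state self-stabilizing token ring (unidirectional),
# shown to illustrate self-stabilization; not produced by Protocon.
# Process i > 0 holds a token when its value differs from process i-1;
# process 0 holds a token when its value equals the last process's.

def enabled(states):
    tokens = [0] if states[0] == states[-1] else []
    tokens += [i for i in range(1, len(states)) if states[i] != states[i - 1]]
    return tokens

def fire(states, i, k):
    new = list(states)
    if i == 0:
        new[0] = (states[-1] + 1) % k   # process 0 increments its value
    else:
        new[i] = states[i - 1]          # other processes copy the predecessor
    return new

# From an arbitrary (illegitimate) start, a central-daemon run converges
# to legitimate states, in which exactly one process holds a token:
states, k = [3, 1, 4, 1], 5             # K >= number of processes
for _ in range(50):
    states = fire(states, enabled(states)[0], k)
assert len(enabled(states)) == 1        # exactly one token circulates
```

The legitimate states here are exactly those with one token; stabilization means every execution eventually reaches and stays in that set.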
IEEE Transactions on Parallel and Distributed Systems (February 2016)

Optimizing End-to-End Big Data Transfers over Terabits Network Infrastructure

While future terabit networks hold the promise of significantly improving big-data movement among geographically distributed data centers, significant challenges must be overcome even on today's 100-gigabit networks to realize end-to-end performance. Multiple bottlenecks exist along the end-to-end path from source to sink; for instance, the data storage infrastructure at both the source and sink, and its interplay with the wide-area network, is increasingly the bottleneck to achieving high performance. In this paper, we identify the issues that lead to congestion on the path of an end-to-end data transfer in the terabit network environment, and we present a new bulk data movement framework for terabit networks, called LADS. LADS exploits the underlying storage layout at each endpoint to maximize throughput without negatively impacting the performance of shared storage resources for other users. LADS also uses the Common Communication Interface (CCI), in lieu of the sockets interface, to benefit from hardware-level zero-copy and operating-system bypass capabilities when available. It can further improve data transfer performance under congestion on the end systems by buffering data at the source in flash storage. With our evaluations, we show that LADS can avoid congested storage elements within the shared storage resource, improving input/output bandwidth and data transfer rates across high-speed networks. We also investigate the performance degradation of LADS due to I/O contention on the parallel file system (PFS) when multiple LADS tools share the PFS. We design and evaluate a meta-scheduler that coordinates multiple I/O streams sharing the PFS in order to minimize I/O contention. With our evaluations, we observe that LADS with meta-scheduling can further improve performance by up to 14% relative to LADS without meta-scheduling.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

Analysis of parallel computing strategies to accelerate ultrasound imaging processes

This work analyses the use of parallel processing techniques in synthetic aperture ultrasonic imaging applications. In particular, the Total Focussing Method, which is an O(N²P) problem, is studied. The work presents different parallelization strategies for multicore CPU and GPU architectures. The parallelization processes on both platforms are discussed and optimized in order to achieve real-time performance.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Improving Performance of Parallel I/O Systems through Selective and Layout-Aware SSD Cache

Parallel file systems (PFS) are widely used to ease the I/O bottleneck of modern high-performance computing systems.
However, PFSs do not work well for small requests, especially small random requests. Newer Solid State Drives (SSDs) have excellent performance on small random data accesses but incur a high monetary cost. In this study, we propose SLA-Cache, a Selective and Layout-Aware Cache system that employs a small set of SSD-based file servers as a cache for conventional HDD-based file servers. SLA-Cache uses a novel scheme to identify performance-critical data and applies a selective cache admission (SCA) policy to fully utilize the SSD-based file servers. Moreover, since the data layout of the cache system also strongly influences access performance, SLA-Cache applies a layout-aware cache placement scheme (LCP) to store data on the SSD-based file servers. By storing data with the optimal layout, i.e., the one with the lowest access cost among three typical layout candidates, LCP can further improve system performance. We have implemented SLA-Cache under the MPICH2 I/O library. Experimental results show that SLA-Cache can significantly improve I/O throughput and is a promising approach for parallel applications.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Elastic Reliability Optimization Through Peer-to-Peer Checkpointing in Cloud Computing

Modern-day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at delivering highly reliable cloud computing services. Although offering equal reliability to all users benefits everyone at the same time, users may find such an approach either inadequate or too expensive for their individual requirements, which can vary dramatically. In this paper, we propose a novel method for providing elastic reliability optimization in cloud computing. Our scheme makes use of peer-to-peer checkpointing and allows user reliability levels to be jointly optimized based on an assessment of their individual requirements and the total available resources in the data center. We show that the joint optimization can be solved efficiently by a distributed algorithm using dual decomposition. The solution improves resource utilization and presents an additional source of revenue to data center operators. Our validation results suggest a significant improvement in reliability over existing schemes.
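The flavor of such a distributed dual-decomposition scheme can be sketched with a toy problem. All names, the log utility, and the step size below are illustrative assumptions, not the paper's model: each user independently picks a checkpointing rate against a price that a coordinator adjusts by subgradient ascent until a shared capacity constraint is met.

```python
# Toy dual-decomposition sketch (illustrative, not the paper's model):
# user i maximizes w_i*log(1 + r_i) - lam*r_i for its rate r_i, under a
# shared capacity sum(r_i) <= R. A coordinator updates the price lam;
# users respond independently -- this separation is the "distributed" part.

def best_response(w, lam):
    # argmax_{r >= 0} of w*log(1+r) - lam*r  =>  r = max(0, w/lam - 1)
    return max(0.0, w / lam - 1.0)

def solve(weights, capacity, iters=2000, step=0.01):
    lam = 1.0
    for _ in range(iters):
        rates = [best_response(w, lam) for w in weights]
        # Subgradient price update: raise lam if demand exceeds capacity.
        lam = max(1e-6, lam + step * (sum(rates) - capacity))
    return [best_response(w, lam) for w in weights]

rates = solve([1.0, 2.0, 4.0], capacity=10.0)
# Users with higher weight (stricter reliability needs) obtain higher
# rates, and total demand converges to the shared capacity.
```

The coordinator never needs the users' utility functions, only their aggregate demand at the current price, which is what makes the algorithm distributed.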
IEEE Transactions on Parallel and Distributed Systems (May 2016)

A Taxonomy of Job Scheduling on Distributed Computing Systems

Hundreds of papers on job scheduling for distributed systems are published every year, and it becomes increasingly difficult to classify them. Our analysis revealed that half of these papers are barely cited. This paper presents a general taxonomy for scheduling problems and solutions in distributed systems. The taxonomy was used to classify 109 scheduling problems and their solutions and to make the classification publicly available. These 109 problems were further clustered into ten groups based on the features of the taxonomy. The proposed taxonomy will help researchers build on prior art, increase the visibility of new research, and minimize redundant effort.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because there is no indication that errors have occurred during the execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss their runtime data features, as well as the impact of SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges to guarantee low false-alarm rates. Experiments show that our detector can detect 80-99.99% of SDCs with a false alarm rate of less than 1% of iterations in most cases. The memory cost and detection overhead are reduced to 15% and 6.3%, respectively, for a large majority of applications.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Time Series-Oriented Load Prediction Model and Migration Policies for Distributed Simulation Systems

HLA-based simulation systems are prone to load imbalances due to the lack of management of shared resources in distributed environments. Such imbalances cause these simulations to lose performance in terms of execution time. As a result, many dynamic load balancing systems have been introduced to manage distributed load. These systems use specific methods, depending on load or application characteristics, to perform the required balancing. Load prediction is a technique that has been used extensively to enhance load redistribution heuristics towards preventing load imbalances.
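A classic time-series predictor for this kind of load prediction is Holt's linear-trend (double exponential smoothing) model. A minimal sketch follows; the smoothing constants and the initialization are illustrative choices, not the configuration used in the paper.

```python
def holt_forecast(series, alpha=0.5, beta=0.3):
    # Holt's linear trend model: maintain a smoothed level and trend,
    # and forecast one step ahead as level + trend.
    level, trend = series[0], series[1] - series[0]
    forecasts = []
    for x in series[1:]:
        forecasts.append(level + trend)  # forecast made before observing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts, level + trend      # per-step forecasts, next prediction
```

On a perfectly linear load series the model tracks the trend exactly; real loads need the corrections and migration policies the abstract below describes.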
In this paper, several efficient time series model variants are presented and used to enhance prediction precision for large-scale distributed simulation-based systems. These variants extend Holt's model for time series and correct issues originating from its implementation in the predictive module of a dynamic load balancing system for HLA-based distributed simulations. A set of migration decision-making techniques is also proposed to make a prediction-based load balancing system independent of any particular prediction model, promoting a more modular construction.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

Enabling Parallel Simulation of Large-Scale HPC Network Systems
With the increasing complexity of today's high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems, networks in particular. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today's IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at flit-level detail, using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today's high-performance cluster systems.
Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered alongside high-fidelity network simulations.

IEEE Transactions on Parallel and Distributed Systems (April 2016)

An Evolutionary Optimal Fuzzy System with Information Fusion of Heterogeneous Distributed Computing and Polar-Space Dynamic Model for Online Motion Control of Swedish Redundant Robots

This paper presents an evolutionary optimal fuzzy system with information fusion of heterogeneous distributed computing and a polar-space dynamic model for online motion control of Swedish redundant robots. The intelligent fuzzy system is incorporated with the parallel metaheuristic BFO (Bacteria Foraging Optimization)-AIS (Artificial Immune System), called FS-PBFOAIS, and realized on a field-programmable gate array (FPGA) for optimal polar-space online motion control of four-wheeled redundant mobile robots. This hybrid paradigm gains the
benefits of the Taguchi quality method, BFO, AIS, distributed processing, and FPGA techniques. Experimental results demonstrate the effective optimization and high accuracy of the proposed FPGA-based FS-PBFOAIS tracking controller. Finally, comparative studies demonstrate the superiority of the FPGA-based FS-PBFOAIS polar-space redundant controller over conventional control methods.

IEEE Transactions on Industrial Electronics (May 2016)

Cache Line Aware Algorithm Design for Cache-Coherent Architectures

The increase in the number of cores per processor and the complexity of memory hierarchies make cache coherence key to the programmability of current shared memory systems. However, ignoring its detailed architectural characteristics can harm performance significantly. In order to assist performance-centric programming, we propose a methodology that allows semi-automatic performance tuning with a systematic translation from an algorithm to an analytic performance model for cache line transfers. For this, we design a simple interface for cache line aware optimization, a translation methodology, and a full performance model that exposes the block-based design of caches to middleware designers. We investigate two different architectures to show the applicability of our techniques and methods: the many-core accelerator Intel Xeon Phi and a multi-core processor with a NUMA configuration (Intel Sandy Bridge). We use mathematical optimization techniques to tune synchronization algorithms to the microarchitectures, identifying three techniques to design and optimize data transfers in our model: single-use, single-step broadcast, and private cache lines.
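The block-based cache model above rests on a simple observation: the cost of a contiguous access is governed by how many cache lines it touches. A minimal sketch of that counting rule (the 64-byte line size and the function name are illustrative assumptions, not the paper's interface):

```python
def lines_touched(offset, nbytes, line_size=64):
    # A contiguous access of `nbytes` starting at byte `offset` touches
    # every line from floor(offset/L) through floor((offset+nbytes-1)/L).
    first = offset // line_size
    last = (offset + nbytes - 1) // line_size
    return last - first + 1
```

Misalignment costs an extra transfer: 64 bytes read from offset 0 touch one line, while the same 64 bytes from offset 60 straddle two. Avoiding such effects is exactly what cache line aware algorithm design is about.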
IEEE Transactions on Parallel and Distributed Systems (January 2016)

CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-Wide Alignment in GPU Clusters

This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the
application scalability. In order to obtain a highly parallel multi-GPU traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far and producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 peta-cells, obtaining the optimal local alignments of all human and chimpanzee homologous chromosomes, whose sizes range from 26 million base pairs (MBP) up to 249 MBP. As far as we know, this is the first time such a comparison was made with the exact SW method. We also show that the IST algorithm reduces the traceback time by factors ranging from 2.15x up to 21.03x when compared with the baseline traceback algorithm. The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10,370.00 GCUPS (billions of cells updated per second) using 384 GPUs, with a speculation hit ratio of 98.2%.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Xscale: Online X-code RAID-6 Scaling Using Lightweight Data Reorganization

Disk additions to a RAID-6 storage system can simultaneously increase the I/O parallelism and expand the storage capacity. To regain a balanced load among both old and new disks, RAID-6 scaling requires moving certain data blocks onto newly added disks. Existing approaches to RAID-6 scaling are restricted by preserving a round-robin data distribution and require migrating all the data, resulting in an expensive scaling cost. In this paper, we propose Xscale, a new approach to accelerating X-code RAID-6 scaling by using lightweight data reorganization.
Xscale minimizes the number of data blocks that need to be moved while maintaining a uniform data distribution across all disks. Furthermore, Xscale eliminates metadata updates while guaranteeing data consistency and reliability. Compared with the round-robin approach, Xscale reduces the number of blocks to be moved by 63.6-89.5%, decreases the reorganization time by 35.62-37.26%, and reduces the I/O latency by 23.29-37.74% while the scaling programs are running in the background. In addition, there is no penalty in the performance of the data layout after scaling with Xscale, compared with the layouts maintained by other existing scaling approaches.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

The Importance of Worker Reputation Information in Microtask-Based Crowd Work Systems
This paper presents the first systematic investigation of the potential performance gains for crowd work systems deriving from information available at the requester about individual worker reputation. In particular, we first formalize the optimal task assignment problem when workers' reputation estimates are available, as the maximization of a monotone (submodular) function subject to matroid constraints. Then, since the optimal problem is NP-hard, we propose a simple but efficient greedy heuristic task allocation algorithm. We also propose a simple "maximum a-posteriori" decision rule and a decision algorithm based on message passing. Finally, we test and compare different solutions, showing that system performance can greatly benefit from information about workers' reputation. Our main findings are that: i) even largely inaccurate estimates of workers' reputation can be effectively exploited in the task assignment to greatly improve system performance; ii) the performance of the maximum a-posteriori decision rule quickly degrades as worker reputation estimates become inaccurate; iii) when workers' reputation estimates are significantly inaccurate, the best performance can be obtained by combining our proposed task assignment algorithm with the message-passing decision algorithm.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

On Data Integrity Attacks against Real-time Pricing in Energy-based Cyber-Physical Systems

In this paper, we investigate a novel real-time pricing scheme that considers both renewable energy resources and traditional power resources and can effectively guide the participants to achieve individual welfare maximization in the system.
To be specific, we develop a Lagrangian-based approach to transform the global optimization conducted by the power company into distributed optimization problems that yield explicit energy consumption, supply, and price decisions for individual participants. We also show that these distributed problems derived from the power company's global optimization are consistent with the individual welfare maximization problems of end-users and traditional power plants. We then investigate and formalize the vulnerabilities of the real-time pricing scheme by considering two types of data integrity attacks: ex-ante attacks and ex-post attacks, launched by the adversary before or after the decision-making process, respectively. We systematically analyze the welfare impacts of these attacks on the real-time pricing scheme. Through a combination of theoretical analysis and performance evaluation, our results show that the proposed real-time pricing scheme could
effectively guide the participants to achieve welfare maximization, while cyber-attacks could significantly disrupt the results of real-time pricing decisions, imposing welfare reduction on the participants.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

A Fast and Accurate Hardware String Matching Module with Bloom Filters

Many fields of computing, such as Deep Packet Inspection (DPI), employ string matching modules (SMM) that search for a given set of positive strings in their input. An SMM is expected to produce correct outcomes while scanning the input data at high rates. Furthermore, the string sets that are searched for are usually large, and their sizes increase steadily. Bloom Filters (BFs) are fast hashing data structures, but their false positive results require further processing. That is, their speed can be exploited for Standard Bloom Filter SMMs (SBFs) as long as the positive probability is low. Multiple BFs in parallel can further increase the throughput. In this paper, we propose the Double Bloom Filter SMM (DBF), which achieves a higher throughput than the SBF and maintains a high throughput even for large positive probabilities. The second Bloom Filter of the DBF stores a subset of the positive strings small enough that its false positive probability is approximately zero. We develop an analytical model of the DBF and show that the throughput advantage of DBF over SBF becomes more prominent as the positive probability and the fraction of matches in the second Bloom Filter increase. Accordingly, we propose a heuristic algorithm that stores the strings that are more frequently matched in the second Bloom Filter, according to localities identified in the input.
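As background, a minimal Bloom filter, the building block of both the SBF and DBF designs above, can be sketched as follows. The bit-array size, hash count, and use of SHA-256 are illustrative software choices, not the paper's hardware design.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # No false negatives; false positives possible when unrelated
        # items happen to have set all k positions.
        return all(self.bits[pos] for pos in self._positions(item))
```

Queries for inserted strings always return true; the false-positive rate is controlled by the array size m and hash count k, and it is precisely this rate that the DBF's second filter drives toward zero for the frequently matched strings.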
Our numerical results are obtained using realistic values from an FPGA implementation and are validated by SystemC simulations.

IEEE Transactions on Parallel and Distributed Systems (June 2016)

Seer Grid: Privacy and Utility Implications of Two-Level Load Prediction in Smart Grids

We propose "Seer Grid", a novel two-level energy consumption prediction framework for smart grids, aimed at decreasing the trade-off between the privacy requirements of the customer and the data utility requirements of the energy company (EC). The first-level prediction at the household level is performed by each smart meter (SM), and the predicted energy consumption pattern (instead of the actual energy usage data) is reported to a cluster head (CH). Then, a second-level prediction at the neighborhood level is done by the CH, which predicts the energy spikes in the
neighborhood or cluster and shares them with the EC. Our two-level prediction mechanism is designed such that it preserves the correlation between the predicted and actual energy consumption patterns at the cluster level and removes this correlation in the predicted data communicated by each SM to the CH. This maintains the usefulness of the cluster-level energy consumption data communicated to the EC, while preserving the privacy of the household-level energy consumption data against the CH (and thus the EC). Our evaluation results show that Seer Grid succeeds in hiding private consumption patterns at the household level while still accurately predicting energy consumption at the neighborhood level.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

Many-Core Real-Time Task Scheduling with Scratchpad Memory

This work is motivated by the demand for scheduling tasks upon the increasingly popular island-based many-core architectures. On such an architecture, homogeneous cores are grouped into islands, each of which is equipped with a scratchpad memory module (referred to as local memory). We first show the NP-hardness and the inapproximability of the scheduling problem. Despite the inapproximability, positive results can still be found when different cases of the problem are investigated. A (3 − 1/F)-approximation algorithm is proposed for the minimization of the maximum system utilization, where F is the number of cores in the platform. When the technique of resource augmentation is considered, this paper further develops a (γ + 1)-memory, (2γ − 1)/(γ − 1)-approximation algorithm, where γ represents the trade-off between CPU utilization and local memory space.
On the other hand, a special case is also considered in which the ratio of the worst-case execution time of a task without and with the use of local memory is bounded by a constant. The capabilities of the proposed algorithms are then evaluated with benchmarks from MRTC, UTDSP, NetBench, and DSPstone, where the maximum system utilization can be significantly reduced even when the local memory size is only 5% of the total footprint of all the tasks.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when
these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieve good performance and scalability in computing maximum cardinality matchings in bipartite graphs. Our algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single-source algorithms. Unlike algorithms that rely on single-source searches, algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from it. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. We provide a proof of correctness for our algorithm. Our NUMA-aware implementation scales to 80 threads on an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with a small matching number.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

A Group-Ordered Fast Iterative Method for Eikonal Equations

In the past decade, many numerical algorithms for the Eikonal equation have been proposed.
Recently, research on Eikonal equation solvers has focused more on developing efficient parallel algorithms in order to leverage the computing power of parallel systems, such as multi-core CPUs and GPUs (Graphics Processing Units). In this paper, we introduce an efficient parallel algorithm that extends Jeong et al.'s FIM (Fast Iterative Method, [1]), originally developed for the GPU, to multi-core shared memory systems. First, we propose a parallel implementation of FIM using a lock-free local queue approach and provide an in-depth analysis of the parallel performance of the method. Second, we propose a new parallel algorithm, Group-Ordered Fast Iterative Method (GO-FIM), that exploits the causality of grid blocks to reduce redundant computations, which were the main drawback of the original FIM. In addition, the proposed GO-FIM method employs clustering of blocks based on the updating order, where each
cluster can be updated in parallel using multi-core parallel architectures. We discuss the performance of GO-FIM and compare it with state-of-the-art parallel Eikonal equation solvers.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

Optimal Reconfiguration of High-Performance VLSI Subarrays with Network Flow

A two-dimensional mesh-connected processor array is an extensively investigated architecture used in parallel processing. Many studies have addressed the use of reconfiguration algorithms for processor arrays with faults. However, the subarrays generated by previous algorithms contain a large number of long interconnects, which in turn lead to higher communication costs, capacitance, and dynamic power dissipation. In this paper, we propose novel techniques, based on the idea of network flow, to construct a high-performance subarray with the minimum number of long interconnects. First, we construct a network flow model according to the host array under a specific constraint. Second, we show that the reconfiguration problem for high-performance subarrays can be solved optimally in polynomial time by using efficient minimum-cost flow algorithms. Finally, we prove that the geometric properties of the resulting subarray meet the system requirements. Simulations based on several random and clustered fault scenarios clearly reveal the advantage of the proposed technique in reducing the number of long interconnects. It is shown that, for a host array of size 512 × 512, the number of long interconnects in the subarray can be reduced by up to 70.05% for clustered faults and by up to 55.28% for random faults with a density of 1%, as compared to the state of the art.
IEEE Transactions on Parallel and Distributed Systems (March 2016)

DREAM-(L)G: A Distributed Grouping-based Algorithm for Resource Assignment for Bandwidth-Intensive Applications in the Cloud

Increasingly, many bandwidth-intensive applications have been ported to the cloud platform. In practice, however, disadvantages including equipment failures, bandwidth overload, and long-distance transmission often degrade QoS in terms of data availability, bandwidth provision, and access locality, respectively. Some recent solutions have been proposed to cope with one or two of these disadvantages, but not all of them. Moreover, as the number of data objects scales, most of the current offline algorithms, which solve a constraint optimization problem, suffer from low computational efficiency. To overcome these problems, in this paper we propose an approach that aims to make fully efficient use of cloud resources to enable bandwidth-intensive
applications to achieve the desired level of SLA-specified QoS cost-effectively and in a timely manner. First, we devise a constraint-based model that describes the relationship among data object placement, bandwidth allocation for user cells, operating costs, and QoS constraints. Second, we present a distributed heuristic algorithm, called DREAM-L, that solves the model and produces a budget solution meeting the SLA-specified QoS. Third, we propose an object-grouping technique that is integrated into DREAM-L, called DREAM-LG, to significantly improve the computational efficiency of our algorithm. The results of hundreds of thousands of simulation-based experiments demonstrate that DREAM-LG provides much better data availability, bandwidth provision, and access locality than the state-of-the-art solutions, at modest cloud operating costs and within a small and acceptable amount of time.

IEEE Transactions on Parallel and Distributed Systems (March 2016)

A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal Systems of Linear Equations

Quasi-tridiagonal systems of linear equations arise in numerical simulations, and existing algorithms face great challenges in solving quasi-tridiagonal systems with more than millions of dimensions as the scale of problems increases. We present a solving method that mixes direct and iterative methods and needs less storage space during computation. With our method, a quasi-tridiagonal matrix is split into a tridiagonal matrix and a sparse matrix, and the tridiagonal equation can then be solved by direct methods within the iteration process.
Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the quasi-tridiagonal system of linear equations can be improved. Furthermore, we present an improved cyclic reduction algorithm that uses a partition strategy to solve tridiagonal equations on the GPU, with the intermediate data stored in shared memory so as to significantly reduce the latency of memory access. According to our experiments on 10 test cases, the average number of iterations of our method is significantly lower than those of Jacobi, GS, GMRES, and BiCG, and close to those of BiCGSTAB, BiCRSTAB, and TFQMR. In parallel mode, the computing efficiency of our method is raised by the partition strategy, and its performance is better than that of the commonly used iterative and direct methods because each iteration involves less computation.
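The direct tridiagonal solve used inside each iteration can be done serially with the classic Thomas algorithm; the paper's cyclic-reduction variant parallelizes this step on the GPU. A minimal serial sketch under the usual conventions (a is the sub-diagonal with a[0] unused, c the super-diagonal with c[n-1] unused):

```python
def thomas_solve(a, b, c, d):
    # Solve a tridiagonal system Ax = d in O(n):
    # forward elimination of the sub-diagonal, then back substitution.
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = (c[i] / denom) if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The sequential data dependence in both sweeps is what motivates cyclic reduction and partitioned solvers on parallel hardware.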
IEEE Transactions on Parallel and Distributed Systems (January 2016)

Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors

The increasing number of cores on a chip has made the Network-on-Chip (NoC) concept the standard communication paradigm for chip multiprocessors. A fault in an NoC leads to undesirable ramifications that can severely impact the performance of a chip. Therefore, it is vital to design fault-tolerant NoCs. In this paper, we present Shield, a reliable NoC router architecture that has the unique ability to tolerate both hard and soft errors in the routing pipeline using techniques such as spatial redundancy, exploitation of idle cycles, bypassing of faulty resources, and selective hardening. Using the Mean Time to Failure and Silicon Protection Factor metrics, we show that Shield is six times more reliable than the baseline unprotected router and is at least 1.5 times more reliable than existing fault-tolerant router architectures. We introduce a new metric called the Soft Error Improvement Factor and show that the soft error tolerance of Shield improves by three times in comparison to the baseline unprotected router. This reliability improvement is accomplished by incurring area and power overheads of 34% and 31%, respectively. Latency analysis using SPLASH-2 and PARSEC reveals that, in the presence of faults, latency increases by a modest 13% and 10%, respectively.

IEEE Transactions on Parallel and Distributed Systems (January 2016)

An Energy-Efficient Directory Based Multicore Architecture with Wireless Routers to Minimize the Communication Latency

Multicore architectures suffer from high core-to-core communication latency, primarily due to the cache's dynamic behavior.
Studies suggest that a directory approach can help reduce communication latency by storing the cached block information. Recent studies also indicate that wireless routers have the potential to decrease communication latency in multicore architectures. In this work, we propose a directory-based multicore architecture with wireless routers to minimize communication latency. We simulate systems with a mesh (used in the Stanford Directory Architecture for SHared memory (DASH)), a wireless network-on-chip (WNoC), and the proposed directory-based architecture with wireless routers. According to the experimental results, our proposed architecture outperforms the WNoC and mesh architectures. It is observed that the proposed architecture helps decrease the communication delay by up to 15.71% and the total power consumption by up to 67.58% when compared with
the mesh architecture. Similarly, the proposed architecture helps decrease the communication delay by up to 10.00% and the total power consumption by up to 58.10% when compared with the WNoC architecture. This is because the proposed directory-based mechanism helps reduce the number of core-to-core communications, while the wireless routers help reduce the total number of hops.

IEEE Transactions on Parallel and Distributed Systems (May 2016)

Trajectory Pattern Mining for Urban Computing in the Cloud

The increasing pervasiveness of mobile devices, along with the use of technologies like GPS, Wi-Fi networks, RFID, and sensors, allows for the collection of large amounts of movement data. These data can be analyzed to extract descriptive and predictive models that can be properly exploited to improve urban life. From a technological viewpoint, Cloud computing can play an essential role by helping city administrators quickly acquire new capabilities and by reducing initial capital costs through a comprehensive pay-as-you-go solution. This paper presents a workflow-based parallel approach for discovering patterns and rules from trajectory data in a Cloud-based framework. Experimental evaluation has been carried out on both real-world and synthetic trajectory data, with up to one million trajectories. The results show that, due to the high complexity and large volumes of data involved in the application scenario, the trajectory pattern mining process benefits from the scalable execution environment offered by a Cloud architecture in terms of execution time, speed-up, and scale-up.
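A toy version of the trajectory pattern mining task above is counting frequent region-to-region transitions across trajectories. This is only an illustrative sketch: the region names and support threshold are made up, and the paper's workflow mines far richer sequential patterns than single transitions.

```python
from collections import Counter

def frequent_transitions(trajectories, min_support):
    # Each trajectory is a sequence of visited regions; count every
    # consecutive (from, to) move and keep those meeting the support.
    counts = Counter()
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[(a, b)] += 1
    return {move: n for move, n in counts.items() if n >= min_support}
```

The per-trajectory counting step is embarrassingly parallel, which is what lets a workflow engine distribute it across Cloud nodes before a final merge of the counters.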
IEEE Transactions on Parallel and Distributed Systems (May 2016)

FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by identifying a serious performance problem in existing parallel Frequent Itemset Mining algorithms: given a large dataset, the data partitioning strategies in existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among
transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by virtue of eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31%, with an average of 18%. IEEE Transactions on Parallel and Distributed Systems (April 2016)

DistR: A Distributed Method for the Reachability Query over Large Uncertain Graphs

Among uncertain graph queries, reachability, i.e., the probability that one vertex is reachable from another, is likely the most fundamental one. Although this problem has been studied within the field of network reliability, existing solutions are implemented on a single computer and can only handle small graphs. However, as the size of graph applications continually increases, the corresponding graph data can no longer fit within a single computer's memory and must therefore be distributed across several machines. Furthermore, the computation of probabilistic reachability queries is #P-complete, making it very expensive even on small graphs. In this paper, we develop an efficient distributed strategy, called DistR, to solve the reachability query problem over large uncertain graphs. Specifically, we perform the task in two steps: distributed graph reduction and distributed consolidation.
In the distributed graph reduction step, we find all maximal subgraphs of the original graph whose reachability probabilities can be calculated in polynomial time, compute them, and reduce the graph accordingly. After this step, only a small graph remains. In the distributed consolidation step, we transform the problem into a relational join process and provide an approximate answer to the #P-complete reachability query. Extensive experimental studies show that our distributed approach is efficient in terms of both computational and communication costs, and has high accuracy. IEEE Transactions on Parallel and Distributed Systems (February 2016)

A Constraint Programming Scheduler for Heterogeneous High-Performance Computing Machines
Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the key role of mapping jobs to the available resources, trying to maximize performance and Quality-of-Service (QoS). Allocation and scheduling in the general case are well-known NP-hard problems, forcing commercial schedulers to adopt greedy approaches to improve performance and QoS. Search-based approaches that explore the solution space have seldom been employed in this setting, and have mostly been applied in off-line scenarios. In this paper, we present the first search-based approach to job allocation and scheduling for HPC machines working in a production environment. The scheduler is based on Constraint Programming, an effective programming technique for optimization problems. The resulting scheduler is flexible, as it can be easily customized to deal with heterogeneous resources, user-defined constraints, and different metrics. We evaluate our solution both on virtual machines using synthetic workloads and on the Eurora HPC machine with production workloads. Tests over a wide range of operating conditions show significant improvements in waiting times and QoS on mid-tier HPC machines w.r.t. state-of-the-art commercial rule-based dispatchers. Furthermore, we analyze the conditions under which our approach outperforms commercial approaches, to create a portfolio of scheduling algorithms that ensures robustness, flexibility, and scalability. IEEE Transactions on Parallel and Distributed Systems (January 2016)

Enabling data-centric distribution technology for partitioned embedded systems

Modern complex embedded systems are evolving into mixed-criticality systems in order to satisfy a wide set of non-functional requirements such as security, cost, weight, timing, or power consumption.
Partitioning is an enabling technology for this purpose, as it provides an environment with strong temporal and spatial isolation, which allows the integration of applications with different requirements onto a common hardware platform. At the same time, embedded systems are increasingly networked (e.g., cyber-physical systems) and may even require global connectivity in open environments, so enhanced communication mechanisms are needed to develop distributed partitioned systems. To this end, this work proposes an architecture to enable the use of data-centric real-time distribution middleware in partitioned embedded systems based on a hypervisor. This architecture relies on distribution middleware and a set of virtual devices to provide mixed-criticality partitions with a homogeneous and interoperable communication subsystem. The results obtained show that this
approach provides low overhead and a reasonable trade-off between temporal isolation and performance. IEEE Transactions on Parallel and Distributed Systems (February 2016)

A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously

Intelligent partitioning models are commonly used for efficient parallelization of irregular applications on distributed systems. These models usually aim to minimize a single communication cost metric, which is related either to communication volume or to message count. However, both volume- and message-related metrics should be taken into account during partitioning for a more efficient parallelization. Only a few works consider both, and they usually address each metric in a separate phase of a two-phase approach. In this work, we propose a recursive hypergraph bipartitioning framework that reduces the total volume and total message count in a single phase. In this framework, the standard hypergraph models, whose nets already capture the bandwidth cost, are augmented with message nets. The message nets encode the message count, so that minimizing the conventional cutsize captures the minimization of bandwidth and latency costs together. Our model provides a more accurate representation of the overall communication cost by incorporating both the bandwidth and the latency components into the partitioning objective. The use of the widely adopted recursive bipartitioning framework provides the flexibility of using any existing hypergraph partitioner. Experiments on instances from different domains show that our model achieves up to a 52% reduction in total message count, and hence a 29% reduction in parallel running time, compared to the model that considers only the total volume.
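The two communication metrics contrasted above can be made concrete on a toy example. This sketch (hypothetical data, not the authors' partitioner) computes, for a given vertex-to-part assignment, the total volume under the connectivity-minus-one metric and the number of distinct sender-receiver message pairs, under the simplifying assumption that each net is communicated by the part of its first pin:

```python
def comm_costs(nets, part):
    """Given hyperedges (each a list of vertex ids) and a vertex->part map,
    return (total_volume, message_count): volume under the lambda-1 metric,
    messages as distinct (sender, receiver) part pairs, with each net
    assumed to be sent by the part owning its first pin."""
    volume = 0
    messages = set()
    for net in nets:
        parts = {part[v] for v in net}   # parts this net spans
        volume += len(parts) - 1         # connectivity-minus-one metric
        owner = part[net[0]]
        for p in parts - {owner}:
            messages.add((owner, p))     # one message per receiver part
    return volume, len(messages)

# Toy hypergraph: 4 nets over 6 vertices placed on 3 parts.
nets = [[0, 1, 2], [2, 3], [3, 4, 5], [0, 5]]
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(comm_costs(nets, part))
```

A partitioner that minimizes only the first number can still leave the second number large, which is exactly the gap the message-net augmentation described above is meant to close.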
IEEE Transactions on Parallel and Distributed Systems (June 2016)

Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data Processing Frameworks

The shift to the in-memory data processing paradigm has had a major influence on the development of cluster data processing frameworks. Numerous frameworks from industry, the open source community, and academia are adopting the in-memory paradigm to achieve functionality and performance breakthroughs. However, despite the advantages of these in-memory frameworks, in practice they are susceptible to memory-pressure-related performance
collapse and failures. The contributions of this paper are two-fold. First, we conduct a detailed diagnosis of the memory pressure problem and identify three preconditions for the performance collapse. These preconditions not only explain the problem but also shed light on possible solution strategies. Second, we propose a novel programming abstraction called the leaky buffer that eliminates one of the preconditions, thereby addressing the underlying problem. We have implemented a leaky-buffer-enabled hashtable in Spark, and we believe it can also substitute for hashtables that perform similar hash aggregation operations in other programs or data processing frameworks. Experiments on a range of memory-intensive aggregation operations show that the leaky buffer abstraction can drastically reduce the occurrence of memory-related failures, improve performance by up to 507%, and reduce memory usage by up to 87.5%. IEEE Transactions on Parallel and Distributed Systems (March 2016)

Parity-Switched Data Placement: Optimizing Partial Stripe Writes in XOR-Coded Storage Systems

Erasure codes tolerate disk failures by pre-storing a low degree of data redundancy, and have been commonly adopted in current storage systems. However, the attached requirement of data consistency amplifies partial stripe write operations and thus seriously degrades system performance. Previous works on optimizing partial stripe writes are relatively limited, and a general mechanism is still absent. In this paper, we propose Parity-Switched Data Placement (PDP) to optimize partial stripe writes for any XOR-coded storage system. PDP first reduces write operations by arranging continuous data elements to join a common parity element's generation.
To achieve deeper optimization, PDP further explores the generation orders of parity elements and makes any two continuous data elements associate with a common parity element. Intensive evaluations show that, for the tested erasure codes, PDP reduces up to 31.9% of write operations and further increases the write speed by up to 59.8% when compared with two state-of-the-art data placement methods. IEEE Transactions on Parallel and Distributed Systems (February 2016)

VINEA: An Architecture for Virtual Network Embedding Policy Programmability

Network virtualization has enabled new business models by allowing infrastructure providers to lease or share their physical network. A fundamental management problem that cloud providers
face in supporting customized virtual network (VN) services is virtual network embedding. This requires solving the (NP-hard) problem of matching constrained virtual networks onto the physical network. In this paper we present VINEA, a policy-based virtual network embedding architecture, and its system implementation. VINEA leverages our previous results on VN embedding optimality and convergence guarantees, and it is based on a network utility maximization approach that separates policies (i.e., high-level goals) from the underlying embedding mechanisms: resource discovery, virtual network mapping, and allocation on the physical infrastructure. We show how VINEA can subsume existing embedding approaches, and how it can be used to design novel solutions that adapt to different scenarios merely by instantiating different policies. We describe the VINEA architecture, as well as our object model: our VINO protocol and the API for programming the embedding policies; we then analyze key representative tradeoffs among novel and existing VN embedding policy configurations, via event-driven simulations and with our prototype implementation. Among our findings, our evaluation shows how, in contrast to existing solutions, simultaneously embedding nodes and links may lead to lower providers' revenue. We release our implementation on a testbed that uses a Linux system architecture to reserve virtual node and link capacities. Our prototype can also be used to augment existing open-source "Networking as a Service" architectures such as OpenStack Neutron, which currently lacks a VN embedding protocol, and as a policy-programmable solution to the "slice stitching" problem within wide-area virtual network testbeds.
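As a minimal illustration of the node-mapping subproblem mentioned above, the sketch below greedily assigns virtual nodes with CPU demands to the substrate node with the most residual capacity. This is a deliberately simple baseline, not VINEA's distributed, policy-driven mechanism; all names and numbers are hypothetical:

```python
def greedy_embed(vn_nodes, substrate_cpu):
    """Greedy virtual-node embedding: place each virtual node (largest
    CPU demand first) on the substrate node with the most residual CPU.
    Returns a virtual->substrate mapping, or None if some demand cannot
    be satisfied (the embedding request is rejected)."""
    residual = dict(substrate_cpu)
    mapping = {}
    for vnode, demand in sorted(vn_nodes.items(), key=lambda kv: -kv[1]):
        host = max(residual, key=residual.get)  # least-loaded substrate node
        if residual[host] < demand:
            return None                          # request rejected
        mapping[vnode] = host
        residual[host] -= demand
    return mapping

vn = {"a": 30, "b": 20, "c": 10}     # virtual nodes and CPU demands
substrate = {"s1": 50, "s2": 25}     # substrate nodes and CPU capacities
print(greedy_embed(vn, substrate))
```

In a policy-programmable design like the one described above, the ordering rule and the host-selection rule are exactly the points one would expose as swappable policies, rather than hard-coding the two lambdas used here.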
IEEE Transactions on Parallel and Distributed Systems (February 2016)

Application control configurations for parallel connection of single-phase energy conversion units operating in island mode

This paper presents the design and implementation of controllers for the parallel connection of single-phase energy conversion units in island-mode applications. To meet this objective, two control configurations are implemented, taking as reference the output voltage of a droop scheme. These controllers are: a two-degrees-of-freedom controller plus a repetitive controller, and a proportional-integral/proportional controller plus a resonant controller. These control configurations are intended to maintain the amplitude, waveform, and frequency of the voltage signal and to handle increases in linear and nonlinear load during island-mode operation of a single-phase
energy conversion unit. That is, with these control strategies, several inverters connected in parallel to a microgrid can operate as voltage sources, sharing the active and reactive power demanded by the load. IEEE Latin America Transactions (March 2016)

PathGraph: A Path Centric Graph Processing System

Large-scale iterative graph computation presents an interesting systems challenge due to two well-known problems: (1) the lack of access locality and (2) the lack of storage efficiency. This paper presents PathGraph, a system for improving iterative graph computation on graphs with billions of edges. First, we improve the memory and disk access locality of iterative computation algorithms on large graphs by modeling a large graph as a collection of tree-based partitions. This enables us to use path-centric computation rather than vertex-centric or edge-centric computation. For each tree partition, we re-label vertices using DFS in order to preserve consistency between the order of vertex ids and the vertex order along paths. Second, a compact storage optimized for iterative parallel graph computation is developed in the PathGraph system. Concretely, we employ delta-compression and store tree-based partitions in DFS order. By clustering highly correlated paths together as tree-based partitions, we maximize sequential access and minimize random access on the storage media. Third, but not least, our path-centric computation model is implemented using a scatter/gather programming model. We parallelize the iterative computation at the partition-tree level and perform sequential local updates for the vertices in each tree partition to improve the convergence speed.
To provide well-balanced workloads among parallel threads at the tree-partition level, we introduce a task queue with multiple stealing points, which allows work stealing from multiple points in the queue. We evaluate the effectiveness of PathGraph by comparing it with recent representative graph processing systems such as GraphChi and X-Stream. Our experimental results show that our approach outperforms the two systems on a number of graph algorithms for both in-memory and out-of-core graphs. While our approach achieves better data balance and load balance, it also shows better speedup than the two systems as the number of threads grows. IEEE Transactions on Parallel and Distributed Systems (January 2016)

CIACP: A Correlation- and Iteration-Aware Cache Partitioning Mechanism to Improve Performance of Multiple Coarse-Grained Reconfigurable Arrays
Multiple coarse-grained reconfigurable arrays (CGRAs), organized in parallel or in a pipeline to execute applications, have become a productive solution for balancing performance with flexibility. One of the keys to obtaining high performance from multiple CGRAs is to manage the shared on-chip cache efficiently to reduce off-chip memory bandwidth requirements. Cache partitioning has been viewed as a promising technique for enhancing the efficiency of a shared cache. However, the majority of prior partitioning techniques were developed for multi-core platforms and aimed at multi-programmed workloads. They cannot directly address the adverse impacts of data correlation and computation imbalance among competing CGRAs in a multi-CGRA platform. This paper proposes a correlation- and iteration-aware cache partitioning (CIACP) mechanism for shared cache partitioning in multiple-CGRA systems. This mechanism employs correlation monitors (CMONs) to trace the amount of overlapping data among parallel CGRAs, and iteration monitors (IMONs) to track the computation load of each CGRA. Using the information collected by the CMONs and IMONs, the CIACP mechanism can eliminate redundant cache utilization of overlapping data and can also shorten the total execution time of pipelined CGRAs. Experimental results showed that CIACP outperforms state-of-the-art utility-based cache partitioning techniques by up to 16% in performance. IEEE Transactions on Parallel and Distributed Systems (April 2016)

Failure Diagnosis for Distributed Systems using Targeted Fault Injection

This paper introduces a novel approach to automating failure diagnostics in distributed systems by combining fault injection and data analytics. We use fault injection to populate a database of failures for a target distributed system.
When a failure is reported from a production environment, the database is queried to find "matched" failures generated by fault injections. Relying on the assumption that similar faults generate similar failures, we use information from the matched failures as hints to locate the actual root cause of the reported failures. To implement this approach, we introduce techniques for (i) reconstructing end-to-end execution flows of distributed software components, (ii) computing the similarity of the reconstructed flows, and (iii) performing precise fault injection at pre-specified execution points in distributed systems. We have evaluated our approach using an OpenStack cloud platform, a popular cloud infrastructure management system. Our experimental results showed that this approach is effective in determining the root causes, e.g., fault types and affected components, for 71-100%
of tested failures. Furthermore, it can provide fault locations close to the actual ones and can easily be used to find and fix actual root causes. We have also validated this technique by localizing real bugs that occurred in OpenStack. IEEE Transactions on Parallel and Distributed Systems (June 2016)

Towards Practical and Near-Optimal Coflow Scheduling for Data Center Networks

In current data centers, an application (e.g., MapReduce, Dryad, a search platform, etc.) usually generates a group of parallel flows to complete a job. These flows compose a coflow, and only completing them all is meaningful to the application. Accordingly, minimizing the average Coflow Completion Time (CCT) becomes a critical objective of flow scheduling. However, achieving this goal in today's Data Center Networks (DCNs) is quite challenging, not only because the scheduling problem is theoretically NP-hard, but also because it is tough to perform practical flow scheduling in large-scale DCNs. In this paper, we observe that minimizing the average CCT of a set of coflows is equivalent to the well-known problem of minimizing the sum of completion times in a concurrent open shop. As there are abundant existing solutions for the concurrent open shop problem, we open up a variety of techniques for coflow scheduling. Inspired by the best known result, we derive a 2-approximation algorithm for coflow scheduling, and further develop a decentralized coflow scheduling system, D-CAS, which avoids the system problems associated with current centralized proposals while addressing the performance challenges of decentralized suggestions. Trace-driven simulations indicate that D-CAS achieves performance close to Varys, the state-of-the-art centralized method, and significantly outperforms Baraat, the only existing decentralized method.
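The link to scheduling theory mentioned above can be seen in a single-bottleneck toy model: if coflows are served to completion one at a time on one link, minimizing average CCT is the classic sum-of-completion-times problem, for which smallest-total-size-first is optimal on a single machine. A minimal sketch with illustrative numbers, far simpler than D-CAS or Varys:

```python
def avg_cct(coflows, order):
    """Average Coflow Completion Time when all flows share one bottleneck
    link of unit bandwidth and coflows are served to completion in `order`.
    A coflow finishes only when its last flow finishes, so its completion
    time is the cumulative size drained so far."""
    t, total = 0.0, 0.0
    for i in order:
        t += sum(coflows[i])   # time to drain every flow of coflow i
        total += t             # coflow i completes at time t
    return total / len(order)

coflows = [[4, 4], [1, 1], [2, 2]]   # flow sizes per coflow
fifo = avg_cct(coflows, [0, 1, 2])   # arrival order
# Smallest-total-size-first (SPT) minimizes the sum of completion
# times on a single machine:
spt_order = sorted(range(len(coflows)), key=lambda i: sum(coflows[i]))
spt = avg_cct(coflows, spt_order)
print(fifo, spt)
```

Real DCNs have many ports and coflows overlap across them, which is what makes the general problem NP-hard and motivates the approximation and decentralization techniques described above; the toy model only shows why ordering by coflow size is the right intuition.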
IEEE Transactions on Parallel and Distributed Systems (February 2016)

Traffic Load Balancing Schemes for Devolved Controllers in Mega Data Centers

In most existing cloud services, a centralized controller is used for resource management and coordination. However, such an infrastructure is gradually becoming insufficient to meet the rapid growth of mega data centers. In recent literature, a new approach named the devolved controller was proposed to address scalability concerns. This approach splits the whole network into several regions, each with one controller to monitor and reroute a portion of the flows. This technique alleviates the problem of an overloaded single controller, but brings other problems, such as unbalanced workload among controllers and reconfiguration complexities. In this paper, we make an exploration
on the usage of devolved controllers for mega data centers, and design new schemes to overcome these shortcomings and improve the performance of the system. We first formulate the Load Balancing problem for Devolved Controllers (LBDC) in data centers and prove that it is NP-complete. We then design an f-approximation algorithm for LBDC, where f is the largest number of potential controllers for a switch in the network. Furthermore, we propose both centralized and distributed greedy approaches to solve the LBDC problem effectively. The numerical results validate the efficiency of our schemes, which can become a solution for monitoring, managing, and coordinating mega data centers with multiple controllers working together. IEEE Transactions on Parallel and Distributed Systems (June 2016)

Reactive Molecular Dynamics on Massively Parallel Heterogeneous Architectures

We present a parallel implementation of the ReaxFF force field on massively parallel heterogeneous architectures, called PuReMD-Hybrid. PuReMD, on which this work is based, along with its integration into LAMMPS, is currently used by a large number of research groups worldwide. Accelerating this important community codebase, which implements a complex reactive force field, poses a number of algorithmic, design, and optimization challenges, as we discuss in detail. In particular, different computational kernels are best suited to different computing substrates: CPUs or GPUs. Scheduling these computations requires complex resource management, as well as minimizing data movement across CPUs and GPUs. Integrating powerful nodes, each with multiple CPUs and GPUs, into clusters, and utilizing the immense compute power of these clusters, requires significant optimizations to minimize communication and, potentially, redundant computations.
From a programming-model perspective, PuReMD-Hybrid relies on MPI across nodes, pthreads across cores, and CUDA on the GPUs to address these challenges. Using a variety of innovative algorithms and optimizations, we demonstrate that our code can achieve an over 565-fold speedup compared to a single-core implementation on a cluster of 36 state-of-the-art GPUs for complex systems. In terms of application performance, our code enables simulations of over 1.8M atoms in under 0.68 seconds per simulation time step. IEEE Transactions on Parallel and Distributed Systems (March 2016)

A fast discrete wavelet transform using hybrid parallelism on GPUs
Wavelet transforms have been widely used in many signal and image processing applications. Due to their wide adoption for time-critical applications, such as streaming and real-time signal processing, many acceleration techniques were developed during the past decade. Recently, the graphics processing unit (GPU) has gained much attention for accelerating computationally intensive problems, and many GPU-based discrete wavelet transform (DWT) solutions have been introduced, but most of them did not fully leverage the potential of the GPU. In this paper, we present various state-of-the-art GPU optimization strategies for DWT implementation, such as leveraging shared memory, registers, warp shuffling instructions, and thread- and instruction-level parallelism (TLP, ILP), and finally elaborate our hybrid approach to further boost performance. In addition, we introduce a novel mixed-band memory layout for the Haar DWT, where a multi-level transform can be carried out in a single fused kernel launch. As a result, unlike recent GPU DWT methods that focus mainly on maximizing ILP, we show that optimal GPU DWT performance can be achieved by hybrid parallelism combining both TLP and ILP in a mixed-band approach. We demonstrate the performance of our proposed method by comparison with other CPU and GPU DWT methods. IEEE Transactions on Parallel and Distributed Systems (February 2016)

SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per the individual's university guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training on the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication on your project.

FOLLOW US ON FACEBOOK @ TSYS Academic Projects