IEEE Final Year Projects 2011-2012 :: Elysium Technologies Pvt Ltd::Parallel Computing

Elysium Technologies Private Limited
ISO 9001:2008 A leading Research and Development Division
Madurai | Chennai | Trichy | Coimbatore | Kollam | Singapore
Website: elysiumtechnologies.com, elysiumtechnologies.info
Email: info@elysiumtechnologies.com

IEEE Project List 2011 - 2012

20 Churn-Resilient Protocol for Massive Data Dissemination in P2P Networks

Massive data dissemination is often disrupted by frequent join and departure or failure of client nodes in a peer-to-peer (P2P)
network. We propose a new churn-resilient protocol (CRP) to assure alternating path and data proximity to accelerate the data
dissemination process under network churn. The CRP enables the construction of proximity-aware P2P content delivery systems. We
present new data dissemination algorithms using this proximity-aware overlay design. We simulated P2P networks up to 20,000
nodes to validate the claimed advantages. Specifically, we make four technical contributions: 1) The CRP scheme promotes
proximity awareness, dynamic load balancing, and resilience to node failures and network anomalies. 2) The proximity-aware
overlay network has a 28-50 percent speed gain in massive data dissemination, compared with the use of scope-flooding or epidemic
tree schemes in unstructured P2P networks. 3) The CRP-enabled network requires only 1/3 of the control messages used in a large
CAM-Chord network. 4) Even with 40 percent of node failures, the CRP network guarantees atomic broadcast of all data items. These
results clearly demonstrate the scalability and robustness of CRP networks under churn conditions. The scheme appeals especially
to webscale applications in digital content delivery, network worm containment, and consumer relationship management over
hundreds of datacenters in cloud computing services.

21 Cloud Technologies for Bioinformatics Applications

Executing a large number of independent jobs, or jobs comprising a large number of tasks that perform minimal inter-task
communication, is a common requirement in many domains. Various technologies, ranging from classic job
schedulers to the latest cloud technologies such as MapReduce, can be used to execute these “many-tasks” in
parallel. In this paper, we present our experience in applying two cloud technologies, Apache Hadoop and Microsoft
DryadLINQ, to two bioinformatics applications with the above characteristics. The applications are a pairwise Alu
sequence alignment application and an Expressed Sequence Tag (EST) sequence assembly program. First, we
compare the performance of these cloud technologies using the above applications and also compare them with
a traditional MPI implementation in one application. Next, we analyze the effect of inhomogeneous data on the
scheduling mechanisms of the cloud technologies. Finally, we present a comparison of performance of the cloud
technologies under virtual and nonvirtual hardware platforms.

22 Collective Receiver-Initiated Multicast for Grid Applications

Grid applications often need to distribute large amounts of data efficiently from one cluster to multiple others
(multicast). Existing sender-initiated methods arrange nodes in optimized tree structures, based on external network
monitoring data. This dependence on monitoring data severely impacts both ease of deployment and adaptivity to
dynamically changing network conditions. In this paper, we present Robber, a collective, receiver-initiated, high-
throughput multicast approach inspired by the BitTorrent protocol. Unlike BitTorrent, Robber is specifically designed
to maximize the throughput between multiple cluster computers. Nodes in the same cluster work together as a
collective that tries to steal data from peer clusters. Instead of using potentially outdated monitoring data, Robber
automatically adapts to the currently achievable bandwidth ratios. Within a collective, nodes automatically tune the
amount of data they steal remotely to their relative performance. Our experimental evaluation compares Robber to
BitTorrent, to Balanced Multicasting, and to its predecessor MOB. Balanced Multicasting optimizes multicast trees
based on external monitoring data, while MOB uses collective, receiver-initiated multicast with static load balancing.
We show that both Robber and MOB outperform BitTorrent. They are competitive with Balanced Multicasting as long
as the network bandwidth remains stable, and outperform it by wide margins when bandwidth changes dynamically. In
large environments and heterogeneous clusters, Robber outperforms MOB.

Madurai: Elysium Technologies Private Limited, 230, Church Road, Annanagar, Madurai, Tamilnadu – 625 020.
Contact: 91452 4390702, 4392702, 4394702. eMail: info@elysiumtechnologies.com
Trichy: Elysium Technologies Private Limited, 3rd Floor, SI Towers, 15, Melapudur, Trichy, Tamilnadu – 620 001.
Contact: 91431 - 4002234. eMail: elysium.trichy@gmail.com
Kollam: Elysium Technologies Private Limited, Surya Complex, Vendor junction, Kollam, Kerala – 691 010.
Contact: 91474 2723622. eMail: elysium.kollam@gmail.com

23 Comparing Hardware Accelerators in Scientific Applications: A Case Study

Multicore processors and a variety of accelerators have allowed scientific applications to scale to larger problem
sizes. We present a performance, design methodology, platform, and architectural comparison of several application
accelerators executing a Quantum Monte Carlo application. We compare the application’s performance and
programmability on a variety of platforms including CUDA with Nvidia GPUs, Brook+ with ATI graphics accelerators,
OpenCL running on both multicore and graphics processors, C++ running on multicore processors, and a VHDL
implementation running on a Xilinx FPGA. We show that OpenCL provides application portability between multicore
processors and GPUs, but may incur a performance cost. Furthermore, we illustrate that graphics accelerators can
make simulations involving large numbers of particles feasible.

24 Computing Localized Power-Efficient Data Aggregation Trees for Sensor Networks

We propose localized, self-organizing, robust, and energy-efficient data aggregation tree approaches for sensor
networks, which we call Localized Power-Efficient Data Aggregation Protocols (L-PEDAPs). They are based on
topologies, such as LMST and RNG, that can approximate a minimum spanning tree and can be efficiently computed
using only position or distance information of one-hop neighbors. The actual routing tree is constructed over these
topologies. We also consider different parent selection strategies while constructing a routing tree. We compare each
topology and parent selection strategy and conclude that the best among them is the shortest path strategy over
the LMST structure. Our solution also involves route maintenance procedures that will be executed when a sensor node
fails or a new node is added to the network. The proposed solution is also adapted to consider the remaining power
levels of nodes in order to increase the network lifetime. Our simulation results show that by using our power-aware
localized approach, we can almost have the same performance of a centralized solution in terms of network lifetime,
and close to 90 percent of an upper bound derived here.
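
As a rough illustration of how a topology such as RNG can be computed from one-hop position information alone, the sketch below applies the standard relative neighborhood graph test: a node keeps the edge to a neighbor unless some other neighbor is strictly closer to both endpoints than they are to each other. The function name `rng_neighbors` and the sample coordinates are hypothetical and are not taken from the paper; routing-tree construction and parent selection are not modeled.

```python
import math

def rng_neighbors(u, positions, one_hop):
    """Return the RNG neighbors of node u using only one-hop neighbor positions."""
    def dist(a, b):
        return math.dist(positions[a], positions[b])

    kept = []
    for v in one_hop:
        d_uv = dist(u, v)
        # Drop edge (u, v) if some witness w is strictly closer to both u and v.
        blocked = any(
            max(dist(u, w), dist(v, w)) < d_uv
            for w in one_hop
            if w != v
        )
        if not blocked:
            kept.append(v)
    return kept

# Hypothetical layout: node "a" with three one-hop neighbors.
positions = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (1.0, 0.5), "d": (0.0, 3.0)}
print(rng_neighbors("a", positions, ["b", "c", "d"]))
```

In this layout, node "c" lies between "a" and each of the other two neighbors, so only the edge to "c" survives.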

25 Conflicts and Incentives in Wireless Cooperative Relaying: A Distributed Market Pricing Framework

Extensive research in recent years has shown the benefits of cooperative relaying in wireless networks, where nodes
overhear and cooperatively forward packets transmitted between their neighbors. Most existing studies focus on
physical-layer optimization of the effective channel capacity for a given transmitter-receiver link; however, the
interaction among simultaneous flows between different endpoint pairs, and the conflicts arising from their
competition for a shared pool of relay nodes, are not yet well understood. In this paper, we study a distributed pricing
framework, where sources pay relay nodes to forward their packets, and the payment is shared equally whenever a
packet is successfully relayed by several nodes at once. We formulate this scenario as a Stackelberg (leader-follower)
game, in which sources set the payment rates they offer, and relay nodes respond by choosing the flows to cooperate
with. We provide a systematic analysis of the fundamental structural properties of this generic model. We show that
multiple follower equilibria exist in general due to the nonconcave nature of their game, yet only one equilibrium
possesses certain continuity properties that further lead to a unique system equilibrium among the leaders. We further
demonstrate that the resulting equilibria are reasonably efficient in several typical scenarios.

26 Consensus and Mutual Exclusion in a Multiple Access Channel

We consider deterministic feasibility and time complexity of two fundamental tasks in distributed computing:
consensus and mutual exclusion. Processes have different labels and communicate through a multiple access
channel. The adversary wakes up some processes in possibly different rounds. In any round, every awake process
either listens or transmits. The message of a process i is heard by all other awake processes, if i is the only process to
transmit in a given round. If more than one process transmits simultaneously, there is a collision and no message is
heard. We consider three characteristics that may or may not exist in the channel: collision detection (listening
processes can distinguish collision from silence), the availability of a global clock showing the round number, and the
knowledge of the number n of all processes. If none of the above three characteristics is available in the channel, we
prove that consensus and mutual exclusion are infeasible; if at least one of them is available, both tasks are feasible,
and we study their time complexity. Collision detection is shown to cause an exponential gap in complexity: if it is
available, both tasks can be performed in time logarithmic in n, which is optimal, and without collision detection both
tasks require linear time. We then investigate both consensus and mutual exclusion in the absence of collision
detection, but under alternative presence of the two other features. With global clock, we give an algorithm whose time
complexity linearly depends on n and on the wake-up time, and an algorithm whose complexity does not depend on
the wake-up time and differs from the linear lower bound only by a factor O(log^2 n). If n is known, we also show an
algorithm whose complexity differs from the linear lower bound only by a factor O(log^2 n).
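
The channel model described in this abstract can be sketched as a single-round resolution function. This is only a minimal illustration of the stated rules (one transmitter is heard, several collide, and collision detection decides whether listeners can tell a collision from silence). The names `channel_round`, `SILENCE`, and `COLLISION` are hypothetical, and no consensus or mutual exclusion algorithm from the paper is implemented.

```python
SILENCE = "SILENCE"
COLLISION = "COLLISION"

def channel_round(transmissions, collision_detection=False):
    """Resolve one round of a multiple access channel.

    `transmissions` maps the label of each transmitting process to its message.
    Returns what every listening awake process perceives in this round.
    """
    if len(transmissions) == 1:
        # Exactly one transmitter: its message is heard by all listeners.
        (message,) = transmissions.values()
        return message
    if len(transmissions) > 1 and collision_detection:
        # Listeners can distinguish a collision from silence.
        return COLLISION
    # No transmitter, or a collision that is indistinguishable from silence.
    return SILENCE

print(channel_round({7: "value=1"}))
print(channel_round({3: "a", 5: "b"}))
print(channel_round({3: "a", 5: "b"}, collision_detection=True))
```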

27 Cooperative Channelization in Wireless Networks with Network Coding

In this paper, we address congestion of multicast traffic in multihop wireless networks through a combination of
network coding and resource reservation. Network coding reduces the number of transmissions required in multicast
flows, thus allowing a network to approach its multicast capacity. In addition, it efficiently repairs errors in multicast
flows by combining packets lost at different destinations. However, under conditions of extremely high congestion the
repair capability of network coding is seriously degraded. In this paper, we propose cooperative channelization, in
which portions of the transmission media are allocated to links that are congested at the point where network coding
cannot efficiently repair loss. A health metric is proposed to allow comparison of need for channelization of different
multicast links. Cooperative channelization considers the impact of channelization on overall network performance
before resource reservation is triggered. Our results show that cooperative channelization improves overall network
performance while being well suited for wireless networks using network coding.

28 Cooperative Search and Survey Using Autonomous Underwater Vehicles (AUVs)

In this work, we study algorithms for cooperative search and survey using a fleet of Autonomous Underwater Vehicles
(AUVs). Due to the limited energy, communication range/bandwidth, and sensing range of the AUVs, underwater
search and survey with multiple AUVs brings about several new challenges since a large amount of data needs to be
collected by each AUV, and any AUV may fail unexpectedly. To address the challenges and meet our objectives of
minimizing the total survey time and traveled distance of AUVs, we propose a cooperative rendezvous scheme called
Synchronization-Based Survey (SBS) to facilitate cooperation among a large number of AUVs when surveying a large
area. In SBS, AUVs form an intermittently connected network (ICN) in that they periodically meet each other for data
aggregation, control signal dissemination, and AUV failure detection/recovery. Numerical analysis and simulations
have been performed to compare the performance of three variants of SBS schemes, namely, Alternating Column
Synchronization (ACS), Strict Line Synchronization (SLS), and X Synchronization (XS). The results show that XS can
outperform other SBS schemes in terms of the survey time and the traveled distance of AUVs. We also compare XS
with nonsynchronization-based survey and the lower bound on the survey time and traveled distance. The results
show that XS achieves a close to optimal performance.

29 Coordinating Computation and I/O in Massively Parallel Sequence Search

With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the
most computation and data-intensive scientific applications. Our previous studies suggested that parallel genomic
sequence-search possesses highly irregular computation and I/O patterns. Effectively addressing these runtime
irregularities is thus the key to designing scalable sequence-search tools on massively parallel computers. While the
computation scheduling for irregular scientific applications and the optimization of noncontiguous file accesses have
been well-studied independently, little attention has been paid to the interplay between the two. In this paper, we
systematically investigate the computation and I/O scheduling for data-intensive, irregular scientific applications
within the context of genomic sequence search. Our study reveals that the lack of coordination between computation
scheduling and I/O optimization could result in severe performance issues. We then propose an integrated scheduling
approach that effectively improves sequence-search throughput by gracefully coordinating the dynamic load
balancing of computation and high-performance noncontiguous I/O.

30 Coordinating Power Control and Performance Management for Virtualized Server Clusters

Today’s data centers face two critical challenges. First, customers must be assured that their required
service-level agreements, such as response time and throughput, are met. Second, server power consumption must
be controlled in order to avoid failures caused by power capacity overload or system overheating due to increasingly
high server density. However, existing work controls power and application-level performance separately, and thus,
cannot simultaneously provide explicit guarantees on both. In addition, as power and performance control strategies
may come from different hardware/software vendors and coexist at different layers, it is more feasible to coordinate
various strategies to achieve the desired control objectives than relying on a single centralized control strategy. This
paper proposes Co-Con, a novel cluster-level control architecture that coordinates individual power and performance
control loops for virtualized server clusters. To emulate the current practice in data centers, the power control loop
changes hardware power states with no regard to the application-level performance. The performance control loop is
then designed for each virtual machine to achieve the desired performance even when the system model varies
significantly due to the impact of power control. Co-Con configures the two control loops rigorously, based on
feedback control theory, for theoretically guaranteed control accuracy and system stability. Empirical results on a
physical testbed demonstrate that Co-Con can simultaneously provide effective control on both application-level
performance and underlying power consumption.

31 Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid

We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse
linear equation systems as they typically arise in the finite element discretization of partial differential equations. These
schemes have been evaluated for a number of hardware platforms, in particular, single-precision GPUs as accelerators
to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on
the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double
precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal
systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating
direction implicit variant of this advanced smoother, we can extend the applicability of the GPU multigrid solvers to
very ill-conditioned systems arising from the discretization on anisotropic meshes, which previously had to be solved on
the CPU. The resulting mixed-precision schemes are always faster than double precision alone, and outperform tuned
CPU solvers consistently by almost an order of magnitude.
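
For readers unfamiliar with cyclic reduction, the following is a serial sketch of the textbook recursion, not the paper's GPU implementation. Each level eliminates the even-indexed unknowns, solves the half-sized system on the odd-indexed ones, and then back-substitutes. The loop iterations within a level are independent of each other, which is what makes the scheme attractive for parallel hardware. The sketch assumes `a[0] == 0`, `c[-1] == 0`, and nonzero diagonals at every level (for example, a diagonally dominant system); the function name is hypothetical.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: build the system for the odd-indexed unknowns.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        has_right = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if has_right else 0.0))
        rc.append(-gamma * c[i + 1] if has_right else 0.0)
        rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if has_right else 0.0))

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: recover the even-indexed unknowns.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x

# Small diagonally dominant example whose exact solution is [1, 2, 3, 4].
a = [0.0, 1.0, 1.0, 1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [1.0, 1.0, 1.0, 0.0]
d = [6.0, 12.0, 18.0, 19.0]
print(cyclic_reduction(a, b, c, d))
```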

32 Data Fusion with Desired Reliability in Wireless Sensor Networks

Energy-efficient and reliable transmission of sensory information is a key problem in wireless sensor networks. To
save more energy, in-network processing such as data fusion is a widely used technique, which, however, may often
lead to unbalanced information among nodes in the data fusion tree. Traditional schemes aim at providing reliable
transmission to individual data packets from source node to the sink, but seldom offer the desired reliability to a data
fusion tree. In this paper, we explore the problem of Minimum Energy Reliable Information Gathering (MERIG) when
performing data fusion. By adaptively using redundant transmission on fusion routes without acknowledgments,
packets with more information are delivered with higher reliability. For different data fusion topologies, such as star,
chain, and tree, we provide optimal solutions to compute the number of transmissions for each node. We also propose
practical, distributed approximation algorithms for chain and tree topologies. Analytical proofs and simulation results
show that energy-efficient information reliability can be guaranteed in an unreliable wireless environment with the help
of our proposed schemes.
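
The idea of trading redundant transmissions for reliability without acknowledgments can be illustrated with elementary probability. Assuming each transmission over a link succeeds independently with probability p, n blind transmissions deliver the packet at least once with probability 1 - (1 - p)^n. The sketch below finds the smallest such n for a target reliability. It is not the MERIG optimization from the paper, which assigns transmission counts across a whole fusion topology; the function name and parameters are hypothetical.

```python
def min_transmissions(link_success, target_reliability):
    """Smallest n with 1 - (1 - link_success)**n >= target_reliability.

    Assumes independent losses and no acknowledgments.
    """
    if not 0.0 < link_success <= 1.0:
        raise ValueError("link_success must be in (0, 1]")
    if not 0.0 <= target_reliability < 1.0:
        raise ValueError("target_reliability must be in [0, 1)")

    n = 1
    failure = 1.0 - link_success  # probability that all n transmissions are lost
    while 1.0 - failure < target_reliability:
        n += 1
        failure *= 1.0 - link_success
    return n

print(min_transmissions(0.5, 0.9))
print(min_transmissions(0.8, 0.99))
```

A packet that carries more fused information would be given a higher target reliability and therefore, on the same link, at least as many transmissions.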

33 Data Replication in Data Intensive Scientific Applications with Performance Guarantee

Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and
bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data
intensive applications, has proven to be NP-hard and even non-approximable, making this problem difficult to solve.
Meanwhile, most of the previous research in this field is either theoretical investigation without practical consideration,
or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication
algorithm that not only has a provable theoretical performance guarantee, but also can be implemented in a distributed
and practical manner. Specifically, we design a polynomial time centralized replication algorithm that reduces the total
data file access delay by at least half of that reduced by the optimal replication solution. Based on this centralized
algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment
such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using
our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm
and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator,
we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching
technique in Data Grids, and it is more scalable and adaptive to the dynamic change of file access patterns in Data
Grids.

34 Dealing with Nonuniformity in Data Centric Storage for Wireless Sensor Networks

In-network storage of data in Wireless Sensor Networks (WSNs) is considered a promising alternative to external
storage since it helps reduce the communication overhead inside the network. Recent approaches to data
storage rely on Geographic Hash Tables (GHT) for efficient data storage and retrieval. These approaches, however,
assume that sensors are uniformly distributed in the sensor field, which is seldom true in real applications. Also they
do not allow tuning the redundancy level in the storage according to the importance of the data to be stored. To deal
with these issues, we propose an approach based on two mechanisms. The first is aimed at estimating the real network
distribution. The second exploits a data dispersal method based on the estimated network distribution. Experiments
through simulation show that our approach approximates quite closely the real distribution of sensors and that our
dispersal protocol appreciably reduces data losses due to unbalanced data load.
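
The basic Geographic Hash Table idea that this abstract builds on can be sketched as follows: a data key is hashed to a point in the sensor field, and the datum is stored at the node closest to that point. The sketch is a simplified assumption (SHA-256 as the hash, a rectangular field, a global view of node positions) and does not implement the paper's distribution estimation or dispersal protocol. All names are hypothetical.

```python
import hashlib
import math

def key_to_point(key, width, height):
    """Hash a data key to a pseudo-random point inside a width x height field."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big") / 2**64 * width
    y = int.from_bytes(digest[8:16], "big") / 2**64 * height
    return (x, y)

def home_node(key, nodes, width, height):
    """Return the name of the node closest to the key's hashed location."""
    target = key_to_point(key, width, height)
    return min(nodes, key=lambda name: math.dist(nodes[name], target))

# Hypothetical non-uniform deployment: three nodes clustered in one corner.
nodes = {"n1": (5.0, 5.0), "n2": (8.0, 6.0), "n3": (6.0, 9.0), "n4": (90.0, 40.0)}
print(home_node("temperature", nodes, 100.0, 50.0))
```

Because the hash spreads keys uniformly over the field while the nodes are not uniformly placed, an isolated node such as "n4" becomes the home for a disproportionate share of keys, which is the kind of imbalance the abstract refers to.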

35 Decomposing Workload Bursts for Efficient Storage Resource Management

The growing popularity of hosted storage services and shared storage infrastructure in data centers is driving the
recent interest in resource management and QoS in storage systems. The bursty nature of storage workloads raises
significant performance and provisioning challenges, leading to increased resource requirements, management costs,
and energy consumption. We present a novel workload shaping framework to handle bursty workloads, where the
arrival stream is dynamically decomposed to isolate its bursts, and then rescheduled to exploit available slack. We
show how decomposition reduces the server capacity requirements and power consumption significantly, while
affecting QoS guarantees minimally. We present an optimal decomposition algorithm RTT and a recombination
algorithm Miser, and show the benefits of the approach by evaluating the performance of several storage workloads
using both simulation and Linux implementation.

36 Design and Evaluation of MPI File Domain Partitioning Methods under Extent-Based File Locking Protocol

MPI collective I/O has been an effective method for parallel shared-file access and maintaining the canonical orders of
structured data in files. Its implementation commonly uses a two-phase I/O strategy that partitions a file into disjoint file
domains, assigns each domain to a unique process, redistributes the I/O data based on their locations in the domains,
and has each process perform I/O for the assigned domain. The partitioning quality determines the maximal
performance achievable by the underlying file system, as the shared-file I/O has long been impeded by the cost of file
system’s data consistency control, particularly due to the conflicted locks. This paper proposes a few file domain
partitioning methods designed to reduce lock conflicts under the extent-based file locking protocol. Experiments from
four I/O benchmarks on the IBM GPFS and Lustre parallel file systems show that the partitioning method producing
the fewest lock conflicts achieves the highest performance. The benefit of removing conflicted locks can be so significant
that write bandwidth differs by more than a factor of thirty between the best and worst methods.
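
To make the notion of lock conflicts between file domains concrete, the sketch below compares an even split of a byte range with a split whose interior boundaries are rounded down to a lock-unit multiple. The alignment strategy and the fixed `lock_unit` granularity are illustrative assumptions and are not necessarily among the methods evaluated in the paper; all function names are hypothetical.

```python
def even_domains(start, end, nprocs):
    """Split [start, end) into nprocs contiguous domains of near-equal size."""
    size = end - start
    bounds = [start + size * i // nprocs for i in range(nprocs + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def aligned_domains(start, end, nprocs, lock_unit):
    """Like even_domains, but interior boundaries fall on lock_unit multiples."""
    bounds = [start]
    for lo, _ in even_domains(start, end, nprocs)[1:]:
        aligned = (lo // lock_unit) * lock_unit
        bounds.append(max(aligned, bounds[-1]))  # keep boundaries monotonic
    bounds.append(end)
    return list(zip(bounds[:-1], bounds[1:]))

def shared_lock_units(domains, lock_unit):
    """Count lock units touched by more than one non-empty domain."""
    owners = {}
    for rank, (lo, hi) in enumerate(domains):
        if hi > lo:
            for unit in range(lo // lock_unit, (hi - 1) // lock_unit + 1):
                owners.setdefault(unit, set()).add(rank)
    return sum(1 for ranks in owners.values() if len(ranks) > 1)

even = even_domains(0, 1000, 4)
aligned = aligned_domains(0, 1000, 4, 64)
print(even, shared_lock_units(even, 64))
print(aligned, shared_lock_units(aligned, 64))
```

With a 64-byte lock unit, the even split leaves three units shared between neighboring processes, while the aligned split leaves none, at the cost of slightly unequal domain sizes.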

37 Design and Evaluation of Multiple-Level Data Staging for Blue Gene Systems

Parallel applications currently suffer from a significant imbalance between computational power and available I/O
bandwidth. Additionally, the hierarchical organization of current Petascale systems contributes to an increase of the I/O
subsystem latency. In these hierarchies, file access involves pipelining data through several networks with incremental
latencies and higher probability of congestion. Future Exascale systems are likely to share this trait. This paper
presents a scalable parallel I/O software system designed to transparently hide the latency of file system accesses to
applications on these platforms. Our solution takes advantage of the hierarchy of networks involved in file accesses, to
maximize the degree of overlap between computation, file I/O-related communication, and file system access. We
describe and evaluate a two-level hierarchy for Blue Gene systems consisting of client-side and I/O node-side caching.
Our file cache management modules coordinate the data staging between application and storage through the Blue
Gene networks. The experimental results demonstrate that our architecture achieves significant performance
improvements through a high degree of overlap between computation, communication, and file I/O.

38 Design and Performance Evaluation of Image Processing Algorithms on GPUs

In this paper, we construe key factors in design and evaluation of image processing algorithms on the massively parallel
graphics processing units (GPUs) using the compute unified device architecture (CUDA) programming model. A set of
metrics, customized for image processing, is proposed to quantitatively evaluate algorithm characteristics. In addition,
we show that a range of image processing algorithms map readily to CUDA using multiview stereo matching, linear
feature extraction, JPEG2000 image encoding, and nonphotorealistic rendering (NPR) as our example applications. The
algorithms are carefully selected from major domains of image processing, so they inherently contain a variety of
subalgorithms with diverse characteristics when implemented on the GPU. Performance is evaluated in terms of
execution time and is compared to the fastest host-only version implemented using OpenMP. It is shown that the
observed speedup varies extensively depending on the characteristics of each algorithm. Intensive analysis is
conducted to show the appropriateness of the proposed metrics in predicting the effectiveness of an application for
parallel implementation.

39 Design of Distributed Heterogeneous Embedded Systems in DDFCharts

The use of formal models of computation in dealing with increasing complexity of embedded systems design is gaining
attention. A successful model of computation must be able to handle both control-dominated and data-dominated
behaviors, which are most often simultaneously present in complex embedded systems. Besides behavioral
heterogeneity, direct support for modeling distributed systems is also desirable, since an increasing number of
embedded systems belong to this category. In this paper, we present distributed DFCharts (DDFCharts), a language
based on a formal model that targets distributed heterogeneous embedded systems. Its top hierarchical level is made
suitable to capture distributed systems. Behavioral heterogeneity is addressed by composing finite-state machines
(FSMs) and synchronous dataflow graphs (SDFGs). We illustrate modeling in DDFCharts with practical examples and
describe its implementation on a heterogeneous target architecture.

40 Dynamic Resource Provisioning in Massively Multiplayer Online Games

Today’s Massively Multiplayer Online Games (MMOGs) can include millions of concurrent players spread across the
world and interacting with each other within a single session. Faced with high resource demand variability and with
misfit resource renting policies, the current industry practice is to overprovision for each game tens of self-owned data
centers, making the market entry affordable only for big companies. Focusing on the reduction of entry and operational
costs, we investigate a new dynamic resource provisioning method for MMOG operation using external data centers as
low-cost resource providers. First, we identify in the various types of player interaction a source of short-term load
variability, which complements the long-term load variability due to the size of the player population. Then, we
introduce a combined MMOG processor, network, and memory load model that takes into account both the player
interaction type and the population size. Our model is best used for estimating the MMOG resource demand
dynamically, and thus, for dynamic resource provisioning based on the game world entity distribution. We evaluate
several classes of online predictors for MMOG entity distribution and propose and tune a neural network-based
predictor to deliver good accuracy consistently under real-time performance constraints. We assess using trace-based
simulation the impact of the data center policies on the quality of resource provisioning. We find that the dynamic
resource provisioning can be much more efficient than its static alternative even when the external data centers are
busy, and that data centers with policies unsuitable for MMOGs are penalized by our dynamic resource provisioning
method. Finally, we present experimental results showing the real-time parallelization and load balancing of a real game
prototype using data center resources provisioned using our method and show its advantage against a rudimentary
client threshold approach.

41 Edge Self-Monitoring for Wireless Sensor Networks

Local monitoring is an effective mechanism for the security of wireless sensor networks (WSNs). Existing schemes
assume the existence of a sufficient number of active nodes to carry out monitoring operations. Such an assumption,
however, is often difficult to satisfy in a large-scale sensor network. In this work, we focus on designing an efficient scheme
integrated with good self-monitoring capability as well as providing an infrastructure for various security protocols
using local monitoring. To the best of our knowledge, we are the first to present the formal study on optimizing network
topology for edge self-monitoring in WSNs. We show that the problem is NP-complete even under the unit disk graph
(UDG) model and give the upper bound on the approximation ratio in various graph models. We provide polynomial-time
approximation scheme (PTAS) algorithms for the problem in some specific graphs, for example, the
monitoring-set-bounded graph. We further design two distributed polynomial algorithms with provable approximation ratio.
Through comprehensive simulations, we evaluate the effectiveness of our design.

42 Efficient Adaptive Scheduling of Multiprocessors with Stable Parallelism Feedback

With proliferation of multicore computers and multiprocessor systems, an imminent challenge is to efficiently schedule
parallel applications on these resources. In contrast to conventional static scheduling, adaptive schedulers that
dynamically allocate processors to jobs possess good potential for improving processor utilization and speeding up
job execution. In this paper, we focus on adaptive scheduling of malleable jobs with periodic processor reallocations
based on parallelism feedback of the jobs and allocation policy of the system. We present an efficient adaptive
scheduler ACDEQ that provides parallelism feedback using an adaptive controller A-CONTROL and allocates
processors based on the well-known Dynamic Equipartitioning algorithm (DEQ). Compared to A-GREEDY, an existing
adaptive scheduler that experiences feedback instability and thus incurs unnecessary scheduling overheads, we show that
A-CONTROL achieves much more stable feedback among other desirable control-theoretic properties. Furthermore, we
analyze algorithmically the performance of ACDEQ in terms of its response time and processor waste for an individual
job as well as makespan and total response time for a set of jobs. To the best of our knowledge, ACDEQ is the first
multiprocessor scheduling algorithm that offers both control-theoretic and algorithmic guarantees. We further evaluate
ACDEQ via simulations by using Downey’s parallel job model augmented with internal parallelism variations. The
results confirm its improved performance over AGDEQ, and they show that ACDEQ excels especially when the
scheduling overhead becomes high.
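
The allocation step named in this abstract, Dynamic Equipartitioning (DEQ), can be sketched as follows. This is a minimal illustration of the generic DEQ idea only: every job gets an equal share unless it asks for less, and the surplus is re-split among the rest. The function name deq, the dictionary interface, and the tie-breaking of leftover processors are assumptions made for illustration; the A-CONTROL feedback controller that would supply the requests in ACDEQ is not modeled.

```python
def deq(desires, processors):
    """Split `processors` among jobs; `desires` maps job name -> requested count."""
    allot = {}
    remaining = dict(desires)
    free = processors
    while remaining:
        share = free // len(remaining)
        # Jobs asking for no more than the equal share are satisfied in full.
        small = [job for job, want in remaining.items() if want <= share]
        if not small:
            # Every remaining job wants more than the share: split evenly.
            # Leftover processors go to the first jobs in name order
            # (an arbitrary tie-break chosen for this sketch).
            jobs = sorted(remaining)
            extra = free - share * len(jobs)
            for i, job in enumerate(jobs):
                allot[job] = share + (1 if i < extra else 0)
            return allot
        for job in small:
            allot[job] = remaining.pop(job)
            free -= allot[job]
    return allot


if __name__ == "__main__":
    print(deq({"a": 2, "b": 10, "c": 10}, 12))
```

With 12 processors and requests of 2, 10, and 10, the small job receives its full request and the other two split the remaining 10 equally.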

43 Enabling Public Auditability and Data Dynamics for Storage Security in Cloud Computing

Cloud Computing has been envisioned as the next-generation architecture of IT Enterprise. It moves the application
software and databases to the centralized large data centers, where the management of the data and services may not
be fully trustworthy. This unique paradigm brings about many new security challenges, which have not been well
understood. This work studies the problem of ensuring the integrity of data storage in Cloud Computing. In particular,
we consider the task of allowing a third party auditor (TPA), on behalf of the cloud client, to verify the integrity of the
dynamic data stored in the cloud. The introduction of TPA eliminates the involvement of the client through the auditing
of whether his data stored in the cloud are indeed intact, which can be important in achieving economies of scale for
Cloud Computing. The support for data dynamics via the most general forms of data operation, such as block
modification, insertion, and deletion, is also a significant step toward practicality, since services in Cloud Computing
are not limited to archive or backup data only. While prior works on ensuring remote data integrity often lack the
support of either public auditability or dynamic data operations, this paper achieves both. We first identify the
difficulties and potential security problems of direct extensions with fully dynamic data updates from prior works and
then show how to construct an elegant verification scheme for the seamless integration of these two salient features in
our protocol design. In particular, to achieve efficient data dynamics, we improve the existing proof of storage models
by manipulating the classic Merkle Hash Tree construction for block tag authentication. To support efficient handling of
multiple auditing tasks, we further explore the technique of bilinear aggregate signature to extend our main result into a
multiuser setting, where TPA can perform multiple auditing tasks simultaneously. Extensive security and performance
analyses show that the proposed schemes are highly efficient and provably secure.
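
For readers unfamiliar with the classic Merkle Hash Tree that the abstract builds on, the following is a minimal sketch of the textbook construction: hash the blocks, hash pairs upward to a root, and verify one block against the root using its sibling hashes. It does not reproduce the paper's modified tree, its block tags, or the bilinear aggregate signatures. SHA-256, the duplication of the last node on odd levels, and all function names are assumptions chosen for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(blocks):
    """Return all tree levels, leaf hashes first, root level last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # assumption: pair an odd last node with itself
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(block, proof, root):
    """Recompute the root from one block and its sibling path."""
    node = h(block)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


if __name__ == "__main__":
    blocks = [b"block-%d" % i for i in range(5)]
    levels = build_levels(blocks)
    root = levels[-1][0]
    print(verify(blocks[3], prove(levels, 3), root))
```

A verifier that holds only the root can check any single block with a logarithmic number of hashes, which is why tree-based authentication suits block-level updates.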

44 Energy Conscious Scheduling for Distributed Computing Systems under Different Operating Conditions

Traditionally, the primary performance goal of computer systems has focused on reducing the execution time of
applications while increasing throughput. This performance goal has been mostly achieved by the development of
high-density computer systems. As witnessed recently, these systems provide very powerful processing capability and
capacity. They often consist of tens or hundreds of thousands of processors and other resource-hungry devices. The
energy consumption of these systems has become a major concern. In this paper, we address the problem of
scheduling precedence-constrained parallel applications on multiprocessor computer systems and present two energy-
conscious scheduling algorithms using dynamic voltage scaling (DVS). A number of recent commodity processors are
capable of DVS, which enables processors to operate at different voltage supply levels at the expense of sacrificing
clock frequencies. In the context of scheduling, this multiple voltage facility implies that there is a trade-off between the
quality of schedules and energy consumption. To effectively balance these two performance goals, we have devised a
novel objective function and a variant of it. The main difference between the two algorithms is in their
measurement of energy consumption. The extensive comparative evaluations conducted as part of this work show that
the performance of our algorithms is very compelling in terms of both application completion time and energy
consumption.
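
The trade-off between schedule quality and energy can be illustrated with a small calculation. The sketch below assumes the common CMOS approximation in which dynamic power scales with the square of the supply voltage times the clock frequency; that model, the function name run_cost, the capacitance constant, and the two voltage/frequency pairs are illustrative assumptions, not the paper's objective function or measured processor data.

```python
def run_cost(cycles, volts, hertz, capacitance=1.0):
    """Return (seconds, energy) for a task of `cycles` clock cycles.

    Assumes dynamic power = capacitance * volts**2 * hertz.
    """
    seconds = cycles / hertz
    power = capacitance * volts ** 2 * hertz
    return seconds, power * seconds


if __name__ == "__main__":
    # Hypothetical operating points: (supply voltage, clock frequency in Hz).
    for volts, hertz in [(1.0, 1e9), (0.5, 5e8)]:
        seconds, energy = run_cost(1e9, volts, hertz)
        print(f"{volts} V at {hertz:.0e} Hz: {seconds} s, energy {energy:.3g}")
```

Under this model the lower operating point doubles the execution time of the task but uses a quarter of the energy, which is the tension an energy-conscious scheduler has to balance.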

45 Energy-Efficient Localized Routing in Random Multihop Wireless Networks

A number of energy-aware routing protocols were proposed to seek the energy efficiency of routes in multihop wireless
networks. Among them, several geographical localized routing protocols were proposed to help make smarter
routing decisions using only local information and reduce the routing overhead. However, none of the proposed localized
routing methods can guarantee the energy efficiency of their routes. In this paper, we first give a simple localized routing
algorithm, called Localized Energy-Aware Restricted Neighborhood routing (LEARN), which can guarantee the energy
efficiency of its route if it can find the route successfully. We then theoretically study its critical transmission radius in
random networks which can guarantee that LEARN routing finds a route for any source and destination pair
asymptotically almost surely. We also extend the proposed routing into three-dimensional (3D) networks and derive its
critical transmission radius in 3D random networks. Simulation results confirm our theoretical analysis of LEARN
routing and demonstrate its energy efficiency in large scale random networks.

46 Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud

In recent years ad hoc parallel data processing has emerged to be one of the killer applications for Infrastructure-as-a-
Service (IaaS) clouds. Major Cloud computing companies have started to integrate frameworks for parallel data
processing in their product portfolio, making it easy for customers to access these services and to deploy their
programs. However, the processing frameworks which are currently used have been designed for static, homogeneous
cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be
inadequate for big parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we
discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research
project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation
offered by today’s IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be
assigned to different types of virtual machines which are automatically instantiated and terminated during the job
execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on
an IaaS cloud system and compare the results to the popular data processing framework Hadoop.

47 Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of
parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures
possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets.
Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of
these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly
found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory
subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the
memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory
access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures
(e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We
demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the
benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4x and 13.5x
over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.

48 Fast and Cost-Effective Online Load-Balancing in Distributed Range-Queriable Systems

Distributed systems such as Peer-to-Peer overlays have been shown to efficiently support the processing of range
queries over large numbers of participating hosts. In such systems, uneven load allocation has to be effectively tackled
in order to minimize overloaded peers and optimize their performance. In this work, we detect the two basic
methodologies used to achieve load-balancing: Iterative key redistribution between neighbors and node migration. We
identify these two key mechanisms and describe their relative advantages and disadvantages. Based on this analysis,
we propose NIXMIG, a hybrid method that adaptively utilizes these two extremes to achieve both fast and cost-effective
load-balancing in distributed systems that support range queries. We theoretically prove its convergence and as a case
study, we offer an implementation on top of a Skip Graph, where we thoroughly validate our findings in a variety of
static, dynamic and realistic workloads. We compare NIXMIG with an existing load-balancing algorithm proposed by
Karger and Ruhl [1] and our experimental analysis shows that NIXMIG can be as much as three times faster, requiring
only one sixth and one third of message and item exchanges, respectively, to bring the system to a balanced state.

49 FDAC: Toward Fine-Grained Distributed Data Access Control in Wireless Sensor Networks

Distributed sensor data storage and retrieval have gained increasing popularity in recent years for supporting various
applications. While distributed architecture enjoys a more robust and fault-tolerant wireless sensor network (WSN),
such architecture also poses a number of security challenges especially when applied in mission-critical applications
such as battlefield and e-healthcare. First, as sensor data are stored and maintained by individual sensors and
unattended sensors are easily subject to strong attacks such as physical compromise, it is significantly harder to
ensure data security. Second, in many mission-critical applications, fine-grained data access control is a must as illegal
access to the sensitive data may cause disastrous results and/or be prohibited by the law. Last but not least, sensor
nodes usually are resource-constrained, which limits the direct adoption of expensive cryptographic primitives. To
address the above challenges, we propose, in this paper, a distributed data access control scheme that is able to
enforce fine-grained access control over sensor data and is resilient against strong attacks such as sensor
compromise and user colluding. The proposed scheme exploits a novel cryptographic primitive called attribute-based
encryption (ABE), tailoring and adapting it for WSNs with respect to both performance and security requirements. The
feasibility of the scheme is demonstrated by experiments on real sensor platforms. To the best of our knowledge, this paper
is the first to realize distributed fine-grained data access control for WSNs.

50 Flexible Robust Group Key Agreement

A robust group key agreement protocol (GKA) allows a set of players to establish a shared secret key, regardless of
network/node failures. Current constant-round GKA protocols are either efficient and nonrobust or robust but not
efficient; assuming a reliable broadcast communication medium, the standard encryption-based group key agreement
protocol can be robust against an arbitrary number of node faults, but the size of the messages broadcast by every player
is proportional to the number of players. In contrast, nonrobust group key agreement can be achieved with each player
broadcasting just constant-sized messages. We propose a novel 2-round group key agreement protocol, which
tolerates up to T node failures, using O(T)-sized messages for any T. We show that the new protocol implies a fully
robust group key agreement with logarithmic-sized messages and expected round complexity close to 2, assuming
random node faults. The protocol can be extended to withstand malicious insiders at small constant factor increases in
bandwidth and computation. The proposed protocol is secure under the (standard) Decisional Square Diffie-Hellman
assumption.

51 Group Strategyproof Multicast in Wireless Networks

We study the dissemination of common information from a source to multiple nodes within a multihop wireless network,
where nodes are equipped with uniform omnidirectional antennas and have a fixed cost per packet transmission. While
many nodes may be interested in the dissemination service, their valuation or utility for such a service is usually private
information. A desirable routing and charging mechanism encourages truthful utility reports from the nodes. We provide
both negative and positive results toward such mechanism design. We show that in order to achieve the group
strategyproof property, a compromise in routing optimality or budget-balance is inevitable. In particular, the fraction of
optimal routing cost that can be recovered through node charges cannot be significantly higher than 1/2. To answer the
question whether constant-ratio cost recovery is possible, we further apply a primal-dual schema to simultaneously
build a routing solution and a cost-sharing scheme, and prove that the resulting mechanism is group strategyproof and
guarantees 1/4-approximate cost recovery against an optimal routing scheme.

52 HaRP: Rapid Packet Classification via Hashing Round-Down Prefixes

Packet classification is central to a wide array of Internet applications and services, with its approaches mostly
involving either hardware support or optimization steps needed by software-oriented techniques (to add precomputed
markers and insert rules in the search data structures). Unfortunately, an approach with hardware support is expensive
and has limited scalability, whereas one with optimization fails to handle incremental rule updates effectively. This work
deals with rapid packet classification, realized by hashing round-down prefixes (HaRP) in a way that the source and the
destination IP prefixes specified in a rule are rounded down to “designated prefix lengths” (DPL) for indexing into hash
sets. HaRP exhibits superb hash storage utilization, able to not only outperform those earlier software-oriented
classification techniques but also well accommodate dynamic creation and deletion of rules. HaRP makes it possible to
hold all its search data structures in the local cache of each core within a contemporary processor, dramatically
elevating its classification performance. Empirical results measured on an AMD 4-way 2.8 GHz Opteron system (with 1
MB cache for each core) under six filter data sets (each with up to 30 K rules) obtained from a public source unveil that
HaRP enjoys up to some 3.6x the throughput level achievable by the best known decision tree-based counterpart,
HyperCuts (HC).
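
The round-down idea can be sketched in a few lines. The example below is a simplified, single-field illustration: a rule's prefix is truncated to the longest designated prefix length (DPL) not exceeding its own length and stored in a hash table under that key, and a lookup probes one key per DPL before checking the stored rules against the full prefix. The DPL set, the use of only one address field instead of source and destination, the longest-prefix tie-break, and all names are assumptions for illustration rather than details of HaRP itself.

```python
import ipaddress

DPLS = (0, 8, 16, 24, 32)  # hypothetical designated prefix lengths


def round_down(prefix):
    """Truncate a prefix to the longest DPL that does not exceed its length."""
    net = ipaddress.ip_network(prefix)
    dpl = max(length for length in DPLS if length <= net.prefixlen)
    return net.supernet(new_prefix=dpl)


def build(rules):
    """Hash each (prefix, action) rule under its rounded-down prefix."""
    table = {}
    for prefix, action in rules:
        table.setdefault(round_down(prefix), []).append(
            (ipaddress.ip_network(prefix), action))
    return table


def classify(table, address):
    """Probe one hash key per DPL and keep the longest matching rule."""
    addr = ipaddress.ip_address(address)
    best = None
    for dpl in DPLS:
        key = ipaddress.ip_network(f"{address}/{dpl}", strict=False)
        for net, action in table.get(key, []):
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, action)
    return best[1] if best else None


if __name__ == "__main__":
    table = build([("10.1.0.0/20", "A"), ("10.0.0.0/8", "B"), ("0.0.0.0/0", "default")])
    print(classify(table, "10.1.5.9"))
```

Because a new rule only touches the bucket of its rounded-down prefix, insertion and deletion stay local in this sketch, which mirrors the abstract's point about accommodating dynamic rule updates.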

53 hiCUDA: High-Level GPGPU Programming

Graphics Processing Units (GPUs) have become a competitive accelerator for applications outside the graphics domain,
mainly driven by the improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA)
is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average
programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions,
of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of
the GPU memory. Practical experience shows that the programmer needs to make significant code changes, often
tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based
language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner and
directly to the sequential code, thus speeding up the porting process. In this paper, we describe the hiCUDA directives
as well as the design and implementation of a prototype compiler that translates a hiCUDA program to a CUDA
program. Our compiler is able to support real-world applications that span multiple procedures and use dynamically
allocated arrays. Experiments using nine CUDA benchmarks show that the simplicity hiCUDA provides comes at no
expense to performance.

54 Hybrid Core Acceleration of UWB SIRE Radar Signal Processing

To move High-Performance Computing (HPC) closer to forward operating environments and missions, the Army Research
Laboratory is developing approaches using hybrid, asymmetric core computing. By blending capabilities found in Graphics
Processing Units (GPUs) and traditional von Neumann multicore Central Processing Units (CPUs), approaches are being
developed and optimized to provide at or near real-time processing speeds for research project applications. Algorithms are
designed to partition work to resources best designed to handle the processing load. The use of commodity resources allows the
design to be flexible throughout the life cycle without the costly and time-consuming delays associated with Application-Specific
Integrated Circuit (ASIC) development. This paradigm allows for rapid technology transfer to end users. In this paper, we describe
a synchronous impulse reconstruction radar imaging algorithm that has been designed for hybrid CPU-GPU processing. We
discuss various optimizations such as asynchronous task partitioning between the CPU and GPU as well as data movement
reduction. We also discuss analysis and design of the algorithms within the context of two programming models: NVIDIA’s CUDA
and AMD’s ATI Brook+. Finally, we report on the speedup achieved by this approach that allowed us to take a code once
restricted to post processing and transform it into one that exceeds real-time
performance requirements.

55 Impact of Traffic Influxes: Revealing Exponential Intercontact Time in Urban VANETs

Intercontact time between moving vehicles is one of the key metrics in vehicular ad hoc networks (VANETs) and central
to forwarding algorithms and the end-to-end delay. Due to prohibitive costs, few works have conducted experimental
studies on intercontact time in urban vehicular environments. In this paper, we carry out an extensive experiment
involving thousands of operational taxis in Shanghai. Studying the taxi trace data on the frequency and duration of
transfer opportunities between taxis, we observe that the tail distribution of the intercontact time, that is, the time gap
separating two contacts of the same pair of taxis, exhibits an exponential decay over a large range of timescales. This
observation is in sharp contrast to recent empirical data studies based on human mobility, in which the distribution of
the intercontact time obeys a power law. By analyzing a simplified mobility model that captures the effect of hot areas in
the city, we rigorously prove that common traffic influxes, where large volumes of traffic converge, play a major role in
generating the exponential tail of the intercontact time. Our results thus provide fundamental guidelines on the design of
new vehicular mobility models in urban scenarios, new data forwarding protocols and their performance analysis.

56 Integrating Caching and Prefetching Mechanisms in a Distributed Transactional Memory

We present a distributed transactional memory system that exploits a new opportunity to automatically hide network latency by
speculatively prefetching and caching objects. The system includes an object caching framework, language extensions to
support our approach, and symbolic prefetches. To our knowledge, this is the first prefetching approach that can prefetch objects
whose addresses have not been computed or predicted. Our approach makes aggressive use of both prefetching and caching of
remote objects to hide network latency while relying on the transaction commit mechanism to preserve the simple transactional
consistency model that we present to the developer. We have evaluated this approach on three distributed benchmarks, five
scientific benchmarks, and several microbenchmarks. We have found that our approach enables our benchmark applications to
effectively utilize multiple machines and benefit from prefetching and caching. We have observed a speedup of up to 7.26 for
distributed applications on our system using prefetching and caching and a speedup of up to 5.55 for parallel applications on
our system.

57 Interlacing Bypass Rings to Torus Networks for More Efficient Networks

We introduce a new technique for generating more efficient networks by systematically interlacing bypass rings to torus
networks (iBT networks). The resulting network can improve the original torus network by reducing the network
diameter, node-to-node distances, and by increasing the bisection width without increasing wiring and other
engineering complexity. We present and analyze the statement that a 3D iBT network proposed by our technique
outperforms 4D torus networks of the same node degree. We found that interlacing rings of sizes 6 and 12 to all three
dimensions of a torus network with meshes 30 x 30 x 36 generates the best network of all possible networks, including 4D
torus and hypercube of approximately 32,000 nodes. This demonstrates that strategically interlacing bypass rings into a
3D torus network enhances the torus network more effectively than adding a fourth dimension, although we may
generalize the claim. We also present a node-to-node distance formula for the iBT networks.
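
As a baseline for the metrics mentioned above, the hop distance and diameter of a plain torus (without bypass rings) follow from standard wraparound arithmetic, sketched below. The iBT distance formula itself is not given in this abstract and is not reproduced here; the function names are illustrative.

```python
def torus_distance(a, b, dims):
    """Hop count between nodes a and b in a plain torus with the given sizes."""
    # In each dimension, take the shorter way around the ring.
    return sum(min(abs(x - y), k - abs(x - y)) for x, y, k in zip(a, b, dims))


def torus_diameter(dims):
    """Largest hop count between any two nodes of a plain torus."""
    return sum(k // 2 for k in dims)


if __name__ == "__main__":
    dims = (30, 30, 36)
    print(torus_diameter(dims))
    print(torus_distance((0, 0, 0), (29, 15, 1), dims))
```

For the 30 x 30 x 36 mesh named in the abstract, the unmodified torus has a diameter of 15 + 15 + 18 = 48 hops, which is the figure bypass rings aim to reduce.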

58 Joint Optimization of Complexity and Overhead for the Routing in Hierarchical Networks

The hierarchical network structure was proposed in the early 1980s and has become popular. The routing complexity and
the routing table size are the two primary performance measures in a dynamic route guidance system. Although various
algorithms exist for finding the best routing policy in a hierarchical network, hardly any work exists on studying and evaluating the
aforementioned measures for a hierarchical network. In this paper, a new mathematical framework is proposed to compute the
averages of the routing complexity and the routing table size and to express them as
functions of the hierarchical network parameters such as the number of the hierarchical levels and the subscriber density
(cluster-population) for each hierarchical level.

59 Key Predistribution Schemes for Establishing Pairwise Keys with a Mobile Sink in Sensor Networks

Security services such as authentication and pairwise key establishment are critical to sensor networks. They enable
sensor nodes to communicate securely with each other using cryptographic techniques. In this paper, we propose two
key predistribution schemes that enable a mobile sink to establish a secure data-communication link, on the fly, with
any sensor node. The proposed schemes are based on the polynomial pool-based key predistribution scheme, the
probabilistic generation key predistribution scheme, and the Q-composite scheme. The security analysis in this paper
indicates that these two proposed predistribution schemes assure, with high probability and low communication
overhead, that any sensor node can establish a pairwise key with the mobile sink. Comparing the two proposed key
predistribution schemes with the Q-composite scheme, the probabilistic key predistribution scheme, and the polynomial
pool-based scheme, our analytical results clearly show that our schemes perform better in terms of network resilience to
node capture than existing schemes if used in wireless sensor networks with mobile sinks.
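
The building block behind polynomial pool-based predistribution is a symmetric bivariate polynomial: each node stores a share evaluated at its own identifier, and any two nodes can then compute the same key without exchanging secrets. The sketch below shows that single-polynomial step only, not the two schemes proposed in the paper or the pool selection. The prime modulus, the degree, the seeded random generator, and the function names are illustrative assumptions, and Python's random module is not suitable for real key material.

```python
import random

P = 2 ** 31 - 1  # prime modulus, chosen only for illustration


def symmetric_poly(t, rng):
    """Coefficients a[i][j] == a[j][i] of f(x, y) = sum a[i][j] x^i y^j mod P."""
    a = [[0] * (t + 1) for _ in range(t + 1)]
    for i in range(t + 1):
        for j in range(i, t + 1):
            a[i][j] = a[j][i] = rng.randrange(P)
    return a


def share(a, node_id):
    """Share for one node: coefficients of the univariate f(node_id, y)."""
    size = len(a)
    return [sum(a[i][j] * pow(node_id, i, P) for i in range(size)) % P
            for j in range(size)]


def pairwise_key(my_share, peer_id):
    """Evaluate the stored share at the peer's identifier."""
    return sum(c * pow(peer_id, j, P) for j, c in enumerate(my_share)) % P


if __name__ == "__main__":
    poly = symmetric_poly(3, random.Random(1))
    sensor, sink = share(poly, 5), share(poly, 9)
    print(pairwise_key(sensor, 9) == pairwise_key(sink, 5))
```

Because f(x, y) = f(y, x), node 5 evaluating its share at 9 and node 9 evaluating its share at 5 arrive at the same value, which serves as their pairwise key.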

60 LBMP: A Logarithm-Barrier-Based Multipath Protocol for Internet Traffic Management

Traffic management is the adaptation of source rates and routing to efficiently utilize network resources. Recently, the
complicated interactions between different Internet traffic management modules have been elegantly modeled by distributed
primal-dual utility maximization, which sheds new light for developing effective management protocols. For single-path routing
with given routes, the dual is a strictly concave network optimization problem. Unfortunately, the general form of multipath utility
optimization is not strictly concave, making its solution quite unstable. Decomposition-based techniques like TRaffic-management
Using Multipath Protocol (TRUMP) alleviate the instability, but their convergence is not guaranteed, nor is their
optimality. They are also inflexible in differentiating the control at different links. In this paper, we address the above issues
through a novel logarithm-barrier-based approach. Our approach jointly considers user utility and routing/congestion control. It
translates the multipath utility maximization into a sequence of unconstrained optimization problems, with infinite logarithm
barriers being deployed at the constraint boundary. We demonstrate that setting up barriers is much simpler than choosing
traditional cost functions and, more importantly, it makes optimal solution achievable. We further demonstrate a distributed
implementation, together with the design of a practical Logarithm-Barrier-Based Multipath Protocol (LBMP). We evaluate the
performance of LBMP through both numerical analysis and packet-level simulations. The results show that LBMP achieves high
throughput and fast convergence over diverse representative network topologies. Such performance is comparable to TRUMP,
and is often better. Moreover, LBMP is flexible in differentiating the control at different links, and its optimality and convergence
are theoretically guaranteed.

61 Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip

Irregular and dynamic applications, such as graph problems and agent-based simulations, often require fine-grained
parallelism to achieve good performance. However, current multicore processors only provide architectural support for
coarse-grained parallelism, making it necessary to use software-based multithreading environments to effectively
implement fine-grained parallelism. Although these software-based environments have demonstrated superior
performance over heavyweight, OS-level threads, they are still limited by the significant overhead involved in thread
management and synchronization. In order to address this, we propose a Lightweight Chip Multi-Threaded (LCMT)
architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an
“unlimited” number of dynamically created lightweight threads with very low thread management and synchronization
overhead. The LCMT architecture can be implemented atop a mainstream architecture with minimum extra hardware to
leverage existing legacy software environments. We compare the LCMT architecture with a Niagara-like baseline
architecture. Our results show up to 1.8X better scalability, 1.91X better performance, and more importantly, 1.74X better
performance per watt, using the LCMT architecture for irregular and dynamic benchmarks, when compared to the
baseline architecture. The LCMT architecture delivers similar performance to the baseline architecture for regular
benchmarks.

62 Load Balance with Imperfect Information in Structured Peer-to-Peer Systems

With the notion of virtual servers, peers participating in a heterogeneous, structured peer-to-peer (P2P) network may host
different numbers of virtual servers, and by migrating virtual servers, peers can balance their loads proportional to their
capacities. Existing decentralized load balancing algorithms designed for heterogeneous, structured P2P networks
either explicitly construct auxiliary networks to manipulate global information or implicitly demand that the P2P substrates be
organized in a hierarchical fashion. Without relying on any auxiliary networks and independent of the geometry of the P2P substrates, we
present, in this paper, a novel load balancing algorithm that is unique in that each participating peer uses its partial
knowledge of the system to estimate the probability distributions of the capacities of peers and the loads of virtual servers,
resulting in imperfect knowledge of the system state. With the imperfect system state, peers can compute their expected loads
and reallocate their loads in parallel. Through extensive simulations, we compare our proposal to prior load balancing algorithms.

63 Many Task Computing for Real-Time Uncertainty Prediction and Data Assimilation in the Ocean

Uncertainty prediction for ocean and climate predictions is essential for multiple applications today. Many-Task
Computing can play a significant role in making such predictions feasible. In this manuscript, we focus on ocean
uncertainty prediction using the Error Subspace Statistical Estimation (ESSE) approach. In ESSE, uncertainties are
represented by an error subspace of variable size. To predict these uncertainties, we perturb an initial state based on the
initial error subspace and integrate the corresponding ensemble of initial conditions forward in time, including
stochastic forcing during each simulation. The dominant error covariance (generated via SVD of the ensemble) is used
for data assimilation. The resulting ocean fields are used as inputs for predictions of underwater sound propagation.
ESSE is a classic case of Many Task Computing: It uses dynamic heterogeneous workflows and ESSE ensembles are
data intensive applications. We first study the execution characteristics of a distributed ESSE workflow on a medium
size dedicated cluster, examine in more detail the I/O patterns exhibited and throughputs achieved by its components as
well as the overall ensemble performance seen in practice. We then study the performance/usability challenges of
employing Amazon EC2 and the Teragrid to augment our ESSE ensembles and provide better solutions faster.

64 Mars: Accelerating MapReduce with Graphics Processors

We design and implement Mars, a MapReduce runtime system accelerated with graphics processing units (GPUs). MapReduce
is a simple and flexible parallel programming paradigm originally proposed by Google, for the ease of large-scale data
processing on thousands of CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and
memory bandwidth. However, GPUs are designed as special-purpose coprocessors and their programming interfaces are less
familiar than those on the CPUs to MapReduce programmers. To harness GPUs’ power for MapReduce, we developed Mars to
run on NVIDIA GPUs, AMD GPUs as well as multicore CPUs. Furthermore, we integrated Mars into Hadoop, an open-source
CPU-based MapReduce system. Mars hides the programming complexity of GPUs behind the simple and familiar MapReduce
interface, and automatically manages task partitioning, data distribution, and parallelization on the processors. We have
implemented six representative applications on Mars and evaluated their performance on PCs equipped with GPUs as well as
multicore CPUs. The experimental results show that the GPU-CPU coprocessing of Mars on an NVIDIA GTX280 GPU and an
Intel quad-core CPU outperformed Phoenix, the state-of-the-art MapReduce on the multicore CPU with a speedup of up to 72
times and 24 times on average, depending on the applications. Additionally, integrating Mars into Hadoop enabled GPU
acceleration for a network of PCs.
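
The MapReduce interface that the abstract refers to can be illustrated with a minimal sequential sketch. This is not the Mars or Hadoop API: the names `map_reduce`, `word_map`, and `word_reduce` are hypothetical, and the sketch runs on one CPU thread. It only shows the contract a programmer writes against (a map function emitting key/value pairs and a reduce function folding the values of one key), which is the part a runtime such as Mars would parallelize.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Sequential map -> group-by-key -> reduce pipeline (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each record is independent, so a runtime could process
    # records in parallel. Here they are processed one after another.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: each key is independent of the others.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word, 1


def word_reduce(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


counts = map_reduce(["gpu cpu gpu", "cpu hadoop"], word_map, word_reduce)
```

The user-supplied functions never mention threads or devices, which is why the same program could in principle be scheduled on different processors by the runtime.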

65 Massively LDPC Decoding on Multicore Architectures

Unlike the usual VLSI approaches needed for computationally intensive Low-Density Parity-Check (LDPC) code
decoders, this paper presents flexible software-based LDPC decoders. Algorithms and data structures suitable for
parallel computing are proposed to perform LDPC decoding on multicore architectures. To evaluate the efficiency
of the proposed parallel algorithms, LDPC decoders were developed on recent multicores, such as off-the-shelf
general-purpose x86 processors, Graphics Processing Units (GPUs), and the CELL Broadband Engine (CELL/B.E.).
Challenging restrictions, such as memory access conflicts, latency, coalescence, or unknown behavior of thread and
block schedulers, were identified and resolved. Experimental results for different code lengths show throughputs on
the order of 1 to 2 Mbps on the general-purpose multicores, and ranging from 40 Mbps on the GPU to nearly 70 Mbps on
the CELL/B.E. The analysis of the results allows us to conclude that the CELL/B.E. performs better for short to
medium length codes, while the GPU achieves superior throughputs with larger codes. In some cases these throughputs
come close to those obtained with VLSI decoders. From the analysis of the results, we can predict a
throughput increase with the rise of the number of cores. Index Terms—LDPC, data-parallel computing, multicore,
graphics
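
To make the data-parallel structure of LDPC decoding concrete, here is a minimal sketch of a hard-decision bit-flipping decoder. The abstract does not state which decoding algorithm was parallelized, so this is a deliberately simple stand-in and not the paper's method. The tiny parity-check matrix `H` and the function names are assumptions chosen for illustration. Each check (row) and each bit (column) is evaluated independently of the others, which is the property a multicore or GPU implementation would exploit.

```python
# Hypothetical toy parity-check matrix: 4 checks, 6 bits, every bit in 2 checks.
H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]


def syndrome(H, word):
    """One parity value per check; 0 means the check is satisfied."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def bit_flip_decode(H, received, max_iters=10):
    """Flip the bits involved in the most unsatisfied checks until all pass.

    Returns the current word after max_iters even if checks still fail.
    """
    word = list(received)
    for _ in range(max_iters):
        s = syndrome(H, word)          # independent per check
        if not any(s):
            return word
        # Independent per bit: count unsatisfied checks touching bit j.
        votes = [
            sum(s[i] for i in range(len(H)) if H[i][j])
            for j in range(len(word))
        ]
        worst = max(votes)
        word = [b ^ 1 if v == worst else b for b, v in zip(word, votes)]
    return word


codeword = [1, 1, 0, 1, 0, 0]
received = [1, 1, 0, 1, 1, 0]          # same codeword with bit 4 flipped
decoded = bit_flip_decode(H, received)
```

Practical LDPC decoders usually pass soft messages between checks and bits rather than flipping hard decisions, but the per-check and per-bit independence shown here is the same.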

66 Maximizing the Number of Broadcast Operations in Random Geometric Ad Hoc Wireless Networks

We consider static ad hoc wireless networks whose nodes, equipped with the same initial battery charge, may dynamically
change their transmission range. When a node v transmits with range r(v), its battery charge is decreased by B · r(v)^2, where B >
0 is a fixed constant. The goal is to provide a range assignment schedule that maximizes the number of broadcast operations
from a given source (this number is denoted by the length of the schedule). This maximization problem, denoted by MAX
LIFETIME, is known to be NP-hard, and the best algorithm yields worst-case approximation ratio Θ(log n), where n is the number
of nodes of the network. We consider random geometric instances formed by selecting n points independently and uniformly at
random from a square of side length √n in the Euclidean plane. We present an efficient algorithm that constructs a range
assignment schedule having length not smaller than 1/12 of the optimum with high probability. Then we design an efficient
distributed version of the above algorithm, where nodes initially know n and their own position only. The resulting schedule
guarantees the same approximation ratio achieved by the centralized version, thus, obtaining the first distributed algorithm
having provably good performance for this problem.
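
The energy model stated above (each transmission at range r costs B · r^2 of battery charge) can be checked with a short sketch. It covers only the cost model for a single node, not the scheduling algorithm of the paper. The function names and the numeric values are hypothetical.

```python
def max_transmissions(charge, B, r):
    """Number of transmissions at fixed range r that a battery can pay for."""
    return int(charge // (B * r ** 2))


def remaining_charge(charge, B, ranges):
    """Charge left after transmitting once at each range in `ranges`."""
    for r in ranges:
        cost = B * r ** 2              # cost model from the abstract
        if cost > charge:
            raise ValueError("not enough charge for this transmission")
        charge -= cost
    return charge


# Assumed values: initial charge 100, B = 1.
full_range = max_transmissions(100.0, 1.0, 5.0)    # each transmission costs 25
half_range = max_transmissions(100.0, 1.0, 2.5)    # each transmission costs 6.25
left = remaining_charge(100.0, 1.0, [5.0, 2.5])
```

Because the cost is quadratic in the range, halving a node's range quadruples the number of transmissions it can afford. This is why varying the range assignment from one broadcast to the next can lengthen the schedule.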

67 Measuring Client-Perceived Pageview Response Time of Internet Services

As e-commerce services are growing exponentially, businesses need quantitative estimates of client-perceived
response times to continuously improve the quality of their services. Current server-side nonintrusive measurement
techniques are limited to nonsecured HTTP traffic. In this paper, we present the design and evaluation of a monitor,