The document describes parallelizing the Smith-Waterman algorithm for sequence alignment using MPI. It explores parallelization techniques including blocking and blocking with interleaving, and presents solutions using scatter-gather and send-receive approaches. Performance is evaluated on an Altix cluster across various problem sizes, processor counts, and blocking/interleaving parameters to determine the optimal configuration. The code was also modified to improve load balancing and further optimize performance.
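The dynamic-programming recurrence being parallelized can be sketched in Python. This is a minimal serial reference, not the MPI code from the document, and the scoring parameters are illustrative. Each cell H[i][j] depends on its left, upper, and upper-left neighbors, which is why anti-diagonal (wavefront) blocking with interleaving is a natural MPI decomposition:

```python
# Serial Smith-Waterman local alignment score (illustrative scoring values).
# The MPI version described in the document distributes blocks of this matrix;
# this sketch only shows the dependency structure being parallelized:
# H[i][j] depends on H[i-1][j-1], H[i-1][j], and H[i][j-1].

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # -> 8 (four matches at +2 each)
```

Cells on the same anti-diagonal have no mutual dependencies, so they can be computed concurrently once the previous two anti-diagonals are done; blocking and interleaving trade off message count against idle time at the wavefront edges.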
Wireless Universal Serial Bus (USB) is a wireless extension of the USB standard that allows peripheral devices to connect to a host, such as a personal computer, without cables. It uses ultra-wideband radio technology to provide high bandwidth and security. Wireless USB aims to offer the same features and speeds as wired USB while allowing cable-free connections between devices up to 10 meters apart at speeds up to 480 Mbps. It operates in the 3.1-10.6 GHz frequency band and uses orthogonal frequency-division multiplexing to provide robust and secure wireless connections between devices. Early implementations of Wireless USB are beginning to emerge, but widespread adoption will depend on continued development and integration into more devices.
This document is a seminar report submitted by Sabith P. for the degree of Bachelor of Technology in Computer Science & Engineering at Cochin University of Science and Technology. The report discusses Certified Wireless USB, which uses Ultra-Wideband technology to provide a high-bandwidth wireless connection as an alternative to wired USB. It describes the motivation for developing Wireless USB to address issues with wired USB and limitations of existing wireless technologies like Bluetooth and Wi-Fi. The report also provides details on the radio environment including UWB protocols and PHY/MAC specifications.
Wireless USB (WUSB) allows devices like printers, mice, hard drives and cameras to connect to computers without wires. It uses ultra-wideband radio technology to transmit data at speeds comparable to USB cables within a 10 meter range. WUSB provides an easy migration path for existing USB devices to work wirelessly. It is being used in game controllers, printers, and other peripherals and can transfer video streams between devices. The technology was introduced in early 2007 and is expected to grow substantially in laptops and PCs over the next few years.
This document discusses diodes, including:
1) Diodes are made from silicon or germanium semiconductors and function by allowing current to flow in one direction.
2) There are two types of semiconductors used in diodes - N-type and P-type, which are created by adding impurities.
3) Diodes can be forward or reverse biased, determining whether current flows through the diode.
The document describes a pill-sized camera that can be swallowed to take pictures inside the digestive tract. It contains a camera, lights, transmitter and batteries inside a capsule. Over 50,000 color images are transmitted as it passes through the tract. Components include an optical dome, lens, LED lights, image sensor, battery and transmitter. The capsule is swallowed and images are transmitted to a receiver and computer for processing. It can diagnose conditions like Crohn's disease without surgery. Advantages are it is painless and provides high quality images of the small intestine. Drawbacks are it may get stuck if obstructions are present, though new bi-directional cameras aim to overcome this.
The document provides an overview of the Carey Foster bridge circuit, which is a modification of the Wheatstone bridge used to measure low resistances more accurately. It works on the same principles as the Wheatstone bridge but increases the effective length of the bridge wire by adding resistances in series at each end, improving sensitivity. The document explains how to set up the circuit and calculate the resistance per unit length of the bridge wire to determine an unknown resistance value. Precautions for accurate measurements like using similar resistance values and gently operating the jockey are also outlined.
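The balance-point arithmetic the summary describes can be sketched as follows. The resistance values and balance lengths are illustrative assumptions, and the sign convention assumes the unknown X sits in the left gap before the interchange:

```python
# Carey Foster bridge arithmetic (illustrative numbers, not from the document).
# Calibration: replace the unknown with a copper strip (~0 ohm) and balance a
# known resistance Y; the shift of the balance point gives rho, the bridge
# wire's resistance per unit length. Measurement: interchange unknown X and
# known Y between the outer gaps; the balance point shifts from l1 to l2,
# and X = Y + rho * (l1 - l2).

def rho_from_strip(Y, l1, l2):
    """Resistance per unit length from the copper-strip calibration."""
    return Y / (l2 - l1)

def unknown_resistance(Y, rho, l1, l2):
    """Unknown resistance from the shift in balance point after interchange."""
    return Y + rho * (l1 - l2)

rho = rho_from_strip(2.0, 10.0, 90.0)            # 2 ohm shifts balance by 80 cm
print(unknown_resistance(2.0, rho, 48.0, 52.0))  # -> 1.9 (ohms)
```

Because only the small difference (l1 - l2) enters the result, end-resistance errors cancel, which is why the Carey Foster arrangement measures low resistances more accurately than a plain Wheatstone bridge.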
Wireless USB (WUSB) allows USB devices to connect without cables. It uses ultra-wideband radio technology to transmit data at speeds up to 480 Mbps within a 10 meter range. WUSB maintains the same architecture and compatibility as wired USB, allowing for easy migration. It will bring wireless connectivity to devices like printers, hard drives, cameras and displays. WUSB aims to eliminate cables while providing a fast, standardized wireless connection for USB devices.
This document outlines an experiment to study the characteristics and voltage regulation properties of a Zener diode. The experiment involves constructing circuits using a 10V Zener diode and measuring voltages and currents across the diode and various resistors as the supply voltage is varied. Key characteristics of the Zener diode such as reverse breakdown voltage and operation as a voltage regulator are examined.
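The regulation behavior under test can be sketched with an idealized model. The series and load resistor values below are illustrative assumptions, not the ones from the experiment:

```python
# Idealized 10 V Zener shunt regulator (illustrative Rs and RL values).
# A series resistor Rs drops the excess supply voltage; once the supply is
# high enough to drive the Zener into reverse breakdown, the load is clamped
# at Vz regardless of further supply increases.

def regulator_output(Vs, Vz=10.0, Rs=470.0, RL=1000.0):
    # Below breakdown the Zener conducts almost nothing: plain voltage divider.
    unregulated = Vs * RL / (Rs + RL)
    return unregulated if unregulated < Vz else Vz

print(regulator_output(20.0))  # -> 10.0 (regulated)
print(regulator_output(30.0))  # -> 10.0 (still regulated)
```

A real design additionally checks that the Zener current stays between its minimum knee current and its maximum power rating across the supply range, which is what the measured V-I curves in the experiment reveal.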
Physics Class 12 Communication PowerPoint presentation, by Bibin Vincent
This document provides an overview of communication systems and modulation techniques. It begins with defining communication as the transfer of information from one place to another, and a communication system as consisting of components that act together to accomplish information transfer. It then discusses the basic components of a communication system including the input and output transducers, transmitter, channel, and receiver. The document also covers basic modulation techniques like amplitude modulation and provides examples of its applications and drawbacks.
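The amplitude modulation technique the summary mentions can be sketched directly from its defining equation. The carrier and message frequencies here are illustrative, not taken from the presentation:

```python
import math

# AM signal: s(t) = Ac * (1 + m*cos(2*pi*fm*t)) * cos(2*pi*fc*t)
# fc: carrier frequency, fm: message frequency, m: modulation index.
# All values below are illustrative.

def am_sample(t, fc=10_000.0, fm=100.0, Ac=1.0, m=0.5):
    envelope = Ac * (1.0 + m * math.cos(2 * math.pi * fm * t))
    return envelope * math.cos(2 * math.pi * fc * t)

# The envelope swings between Ac*(1-m) and Ac*(1+m); pushing m above 1
# overmodulates and distorts, one of the drawbacks of simple AM.
peaks = [abs(am_sample(t / 1_000_000)) for t in range(10_000)]
print(am_sample(0.0))  # -> 1.5 (envelope at its 1.5 peak times carrier at +1)
```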
The document discusses how bipolar junction transistors (BJTs) and field effect transistors (FETs) can be used as switches. It provides examples of using NPN and PNP BJTs as switches by biasing them into saturation or cutoff regions. FETs like JFETs and MOSFETs can also be used as switches by operating them in saturation, cutoff, or ohmic regions depending on the gate-source voltage. Circuits are presented showing BJTs and FETs controlling loads like lamps and LEDs by acting as open or closed switches. The document explains how small control signals can turn transistors on to handle much larger load currents.
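The "small control signal drives a much larger load current" point can be made concrete with the usual base-resistor sizing calculation for a saturated NPN switch. All component values are illustrative assumptions, not circuits from the document:

```python
# Sizing the base resistor for an NPN BJT used as a saturated switch.
# To guarantee saturation, force several times the minimum base current
# Ic/hFE (the "overdrive" factor). All values below are illustrative.

def base_resistor(Vctrl, Vbe, Ic, hfe, overdrive=5.0):
    """Base resistor letting a logic-level signal saturate the transistor."""
    Ib = overdrive * Ic / hfe  # forced base current, well above Ic/hFE
    return (Vctrl - Vbe) / Ib

# A 5 V logic signal switching a 100 mA lamp through a BJT with hFE = 100:
print(round(base_resistor(5.0, 0.7, 0.100, 100)))  # -> 860 (ohms)
```

The 5 mA base drive controls a 100 mA collector current, illustrating the switching gain; in cutoff (control signal at 0 V) the transistor behaves as an open switch and the lamp is off.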
This document provides an overview of the key components and functioning of a typical power supply that converts 230V AC household voltage to 12V DC. It describes how the AC voltage passes through a filter, then a transformer that steps it down, before being rectified and filtered to produce a pulsating DC signal. Finally, a regulator circuit stabilizes the output to produce a steady 12V DC power source required by many electronic devices.
Do diodes and electronic stuff freak you out? And what about those clippers and clampers? The details are as follows.
You can learn every concept related to them here. Enjoy clipping :)
A diode is a two-terminal electronic component that allows current to flow in only one direction. It is used to convert alternating current to direct current through a process called rectification. Diodes come in various types including laser diodes, light emitting diodes, Zener diodes, and silicon diodes. Rectification uses diodes to convert AC to DC through either half-wave or full-wave rectification circuits. Zener diodes are used in the reverse bias mode as voltage regulators. Photodiodes generate current or voltage when illuminated by light and are used in applications like machine vision, range finding, and medical diagnostics.
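Half-wave versus full-wave rectification with an ideal diode model can be sketched numerically. This is a toy model of the behavior described, not a circuit from the document:

```python
import math

# Ideal-diode rectification of a sampled AC waveform (toy model).
# Half-wave: a single diode passes only the positive half-cycles.
# Full-wave: a bridge of four diodes flips the negative half-cycles up.

def half_wave(samples):
    return [max(0.0, v) for v in samples]

def full_wave(samples):
    return [abs(v) for v in samples]

ac = [math.sin(2 * math.pi * t / 100) for t in range(200)]  # two AC cycles

# Neither rectified output ever goes negative:
print(min(half_wave(ac)), min(full_wave(ac)))  # -> 0.0 0.0
```

A smoothing capacitor and regulator would then turn this pulsating DC into steady DC, which is where the Zener regulation described above comes in.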
This document provides an overview of various types of electronic test equipment used for measuring voltage, current, resistance and radio signals. It discusses measurement systems and how equipment can affect the system under test through loading or accuracy issues. Specifically, it describes voltmeters, ammeters, ohmmeters and multi-meters, how they measure different components, and factors like ranges, internal resistance and accuracy. Standards agencies and devices like the Weston cell are also mentioned.
This document describes the P-ISM (Pen-style Personal Networking Gadget Package), which was created in 2012. P-ISM allows users to use two pens to control a projected keyboard and monitor on any flat surface. It functions like a desktop computer through its CPU pen, communication pen, LED projector, virtual keyboard, digital camera, and battery. The document discusses P-ISM's history, components, functions, block diagram, working, merits such as portability, demerits such as cost, and references.
In this PPT we will study the transistor: its symbol, types, operation, configurations, advantages, and limitations.
The document discusses recent advances in wearable sensors. It describes how wearable sensors composed of wireless body area networks, personal servers, and medical servers are being used for health monitoring. Key medical use cases discussed are monitoring Parkinson's disease using movement sensors, stroke rehabilitation using exercise coaching sensors, and detecting head impacts using accelerometers. The document outlines advantages like early disease detection and cost savings, disadvantages like cost and weight of units, and applications in health/wellness monitoring, safety, and sports. The conclusion is that wearable sensors show promise for remote healthcare monitoring with improved integration of sensors and power sources.
The document summarizes the evolution of wireless technologies from 1G to 5G. It discusses the key features and limitations of each generation, including the increasing data speeds and capabilities. The document compares technologies such as 2G, 3G, and 4G and highlights how each new generation improved upon the previous by offering higher speeds and new services like texting, multimedia messaging, and video calling. It concludes that 5G will provide wireless connectivity with almost no limitations and will become the next wireless standard after its full deployment in 2020.
This presentation describes all the features and types of diodes, starting with a short but understandable history of the diode and the Zener diode. How does a diode work? The answer to this question also becomes clear after reading the presentation.
This document provides an overview of wireless technology. It discusses how wireless uses electromagnetic waves for communication and how early wireless transmitters used radiotelegraphy. It describes different types of wireless including fixed, mobile, portable, and infrared. Examples of common wireless technologies are given like cellular phones, GPS, and cordless peripherals. The document also discusses the history and development of wireless technology as well as comparisons to wired networks in terms of speed, installation, reliability, and cost. Finally, it outlines how wireless networks work and describes technologies like Wi-Fi and Bluetooth.
Wireless charge share between two mobiles, by MOHITH Royal
This document discusses wireless charging technology that allows two mobile phones to share battery charge between each other. It begins with an introduction to wireless power transfer (WPT) and its uses in consumer electronics. It then outlines a proposed system where a mobile phone can act as either a wireless power transmitter or receiver, allowing battery-to-battery (B2B) charging between two phones wirelessly. The objectives are to enable wireless charging between any two mobile phones, not just certain models. It reviews related work on bidirectional wireless charging and discusses the advantages and disadvantages of existing wired phone charging systems compared to the proposed wireless charging system.
Photoresistor devices, their working and applications, by Soudip Sinha Roy
Here you will get every detail about the photoresistor device and its applications: how it is made in the factory and, most importantly, a light-controlled switch using a transistor.
Wireless Networked Digital Devices, by Saikiran Panjala
Wireless networks allow devices to connect to each other or to access points without the use of cables. Digital devices represent information using binary digits of 0 and 1. Mobile phones and other wireless devices now access the internet for photos, videos, music and more through high-speed wireless networks. Wireless transmission uses technologies like microwave systems and communication satellites. Common wireless devices include cell phones, PDAs and smartphones. Wireless networks are convenient, flexible and lower cost than wired networks. Standards like GSM, CDMA, Bluetooth and Wi-Fi govern wireless connectivity and communication.
Nanowire Based FET Biosensors and Their Biomedical Applications, by Fawad Majeed
This document discusses nanowire-based biosensors and their applications in biomedicine. It begins by defining sensors and biosensors. Nanowires can be used as sensors due to their distinct electrical and optical properties. There are two main types of nanosensors: mechanical and chemical. Nanowire field-effect transistors can detect charged biomolecules through changes in conductivity. These sensors have been used to detect proteins, DNA, RNA and viruses by functionalizing the nanowire surface with specific receptors. While promising for detection, further improvements are needed for commercialization, such as enhanced sensitivity and simpler fabrication processes.
The theme of the Finnish Taido Association's 2012 Dan camp was reaction ability. Here is part of the slides from Saturday's lecture, together with a collection of technique durations measured by hand timing.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
Arviointi ja palaute ("Assessment and Feedback") is a level II course in the Finnish Taido Association's coach and instructor training system, covering assessment methods and the psychology of giving feedback in both theory and practice.
Kelvin Glen has over 22 years of experience in fundraising and corporate social investment in South Africa. He has managed CSR programs for many corporations and non-profits. Kelvin is a social worker by trade and holds an MBA. He currently serves on the boards of several non-profits and runs two social enterprises. The document goes on to define corporate social investment and the responsibilities of businesses to contribute to economic and social development. It describes the Netcare Foundation, which manages Netcare's corporate social investment programs for the benefit of South African citizens. Kelvin encourages Netcare to strengthen employee volunteer programs and leverage its brand to have greater social impact.
An Integer Programming Representation for Data Center Power-Aware Management ..., by Arinto Murdopo
This document describes research on scheduling jobs in data center grids. It presents an integer linear programming (ILP) model to optimize revenue, power usage, and quality of service. It also describes a heuristic algorithm as an alternative to solve larger problems. The authors test the ILP and heuristic on generated data sets of various sizes. Results show the heuristic achieves near-optimal solutions faster than the ILP for large problems. Lower values of a parameter alpha and random node selection produced the best heuristic results.
This report discusses quantum cryptography and potential attacks on quantum key distribution systems. It provides background on quantum cryptography and describes the BB84 quantum key distribution protocol. It then analyzes several potential attacks on quantum key distribution systems, including photon-number attacks, spectral attacks, and random-number attacks, which are relatively easy to counter. It focuses on the more challenging "faked-state" attack, providing details on how an attacker could implement it in practice using superconducting nanowire single-photon detectors. The report evaluates the security of quantum key distribution against these attacks.
The document provides links to three YouTube videos about natural gas drilling and how it can contaminate drinking water sources. The titles and URLs of the videos are listed but no other context or descriptions are provided.
Architecting a Cloud-Scale Identity Fabric, by Arinto Murdopo
This document discusses architecting an identity fabric for cloud-scale computing. It argues that a new approach is needed for identity management in cloud environments due to issues around cross-cutting nature, organizational impacts, and lack of management skills. The key components of a cloud-scale identity fabric are discussed, including access control, authentication, user management, auditing, and meeting requirements of cloud platforms. Building identity as a distributed fabric can significantly reduce management costs and complexity compared to traditional centralized models. Identity must integrate and abstract to provide infrastructure as a service for applications and users in cloud environments.
The document is a lesson on the eight parts of speech taught by Miss Lawson to 9th grade students. It defines each part of speech and provides examples to identify them, including nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, and interjections. Students are instructed to identify the parts of speech in sentences for each category and email their answers to the teacher for a grade.
File sharing networks can expose sensitive personal and financial information due to misconfigured sharing settings, confusing interfaces, and malware distribution. An experiment sharing documents with credit card and calling card details showed the information was quickly downloaded and redistributed, with the funds drained within a week. A separate experiment sharing mock business documents saw them downloaded 12 times within a week, illustrating the risk of unintended secondary disclosures on these networks. The growing usage and "set and forget" tendencies of some users increases the risk of privacy leaks and losses over such file sharing networks.
Slides for sharing session with PPI Stockholm. The topic is about Distributed Computing, covering what it is, why it is important in our daily life and how we can utilize it in Indonesia.
This document summarizes Kanika Anand's master's thesis which examines global optimization of noisy computer simulators using surrogate models. The thesis compares two improvement functions - one proposed by Picheny et al. and one by Ranjan - for choosing new points to minimize a simulator output observed with noise. Gaussian process and Bayesian additive regression tree models are used as surrogates. Four test functions acting as simulators are optimized using either a one-shot design or genetic algorithm to find new points. Results show how well the surrogate minimum matches the true minimum and distance between the two minimizers under different settings.
This document summarizes Nicholas Dingwall's Master's thesis project which extends the Deep Q Network (DQN) to recurrent neural networks, called the Recurrent DQN (RDQN). The project tests the RDQN and DQN on three toy games - Pong, Pong 2, and Transporter Man. The results show that while both models perform poorly on stochastic games, the RDQN is able to outperform the DQN on Transporter Man, which requires memory, and matches its performance on deterministic Pong games that do not require memory. The thesis provides background on reinforcement learning, neural networks, describes the three experiments and discusses the findings.
Robustness in Deep Learning - Single Image Denoising using Untrained Networks...Daniel983829
This document is a thesis submitted by Esha Singh to the University of Minnesota for the degree of Master of Science in May 2021. The thesis explores single image denoising using untrained neural networks. It first provides background on deep learning, inverse problems, image denoising and neural networks. It then reviews existing image denoising algorithms including spatial/transform domain and neural network methods. The thesis also discusses recent work on deep image priors and rethinking single image denoising using over-parameterization and low-rank matrix recovery. Preliminary experiments on denoising images with salt and pepper noise are presented to demonstrate the proposed methodology.
Build Your Own 3D Scanner:
Course Notes
http://mesh.brown.edu/byo3d/
SIGGRAPH 2009 Courses
Douglas Lanman and Gabriel Taubin
This course provides a beginner with the necessary mathematics, software, and practical details to leverage projector-camera systems in their own 3D scanning projects. An example-driven approach is used throughout; each new concept is illustrated using a practical scanner implemented with off-the-shelf parts. The course concludes by detailing how these new approaches are used in rapid prototyping, entertainment, cultural heritage, and web-based applications.
This document summarizes a student project on predicting malicious activity using real-time video surveillance. The project applies techniques like super-resolution, face and object recognition using HOG features, and neural networks to enhance video quality, identify objects and faces, and semantically describe scenes to detect unusual activity. Algorithms were implemented in MATLAB and results were stored in a MongoDB database. Key techniques included super-resolution, PCA-based face recognition, HOG-based object detection, and neural networks like CNNs and RNNs for image captioning. The project aims to help detect criminal activity and track convicted individuals in public spaces.
This document is a thesis submitted by Md Sabbir Hussain in partial fulfillment of the requirements for a Master of Applied Science degree from Dalhousie University in July 2014. The thesis proposes a new design for asymmetric M-ary quadrature amplitude modulation based on triangular QAM called asymmetric TQAM, which provides considerable power gain over asymmetric square QAM. It also considers image transmission using the proposed asymmetric 64-TQAM modulation scheme and analyzes the quality of reconstructed images at different signal-to-noise ratios.
The report discusses two decomposition methods: Proper Orthogonal Decomposition (POD) and Dynamic Mode Decomposition (DMD). Two examples of data decomposition applications are presented: approximating the pulsating velocity profile of the Poiseuille flow and approximating the flow behind a cylinder. The results are obtained using Matlab software and various codes, along with the development of the Graphical User Interface (GUI) for performing POD and DMD and for post-processing are shown.
This document describes research on spectral X-ray computed tomography using energy sensitive pixel detectors. It introduces X-ray CT and reconstruction techniques. It then describes the Medipix chip, which can perform photon counting and energy discrimination on individual pixels. Experiments were conducted to characterize the energy response of Medipix detectors using particle beams and synchrotron radiation. The energy response was modeled based on charge sharing between pixels. Finally, the document proposes methods for spectral CT reconstruction that leverage the energy information from Medipix to distinguish and quantify different materials.
The document is a project report for a Master's thesis investigating conjugate heat transfer for electronic cooling using the open-source software OpenFOAM. The student, Avinash Gorde, modeled a server system with solid components like RAM and a PCB coupled to an air domain. The project involved setting up the geometry, meshing, boundary conditions, and solving the conjugate heat transfer equations using OpenFOAM's chtMultiRegionSimpleFoam solver. Results showed temperature distributions within the heat sink and across solid components like the socket and RAM. The project provided experience applying OpenFOAM to model complex fluid-solid interaction problems for electronic cooling applications.
Performance Evaluation of Path Planning Techniques for Unmanned Aerial VehiclesApuroop Paleti
This thesis evaluates the performance of the A-star algorithm and Mixed Integer Linear Programming (MILP) for path planning of unmanned aerial vehicles (UAVs). The author conducted simulation experiments to compare the computational time and efficiency of the two methods in different scenarios. The key findings are:
1) MILP had significantly lower computation times than A-star for both scenarios, taking on average 10-12 seconds per function compared to 1-3 seconds for A-star.
2) Statistical analysis via hypothesis testing found a significant difference between the computation times of the two algorithms.
3) Based on the results, MILP was found to be more efficient than A-star for path planning
On Inexact Newton Directions in Interior Point Methods for Linear OptimizationSSA KPI
This document discusses using interior point methods to solve large-scale linear programming problems. It focuses on primal-dual interior point methods applied to the primal-dual formulation. The key steps are:
1) The primal-dual interior point method solves the Karush-Kuhn-Tucker (KKT) optimality conditions by moving through the interior of the feasible region along a central path defined by a parameter τ.
2) In each iteration, at least one linear system involving the KKT matrix must be solved. Solving this directly becomes very expensive for large problems.
3) The document proposes using an iterative method like preconditioned conjugate gradients to solve the augmented KKT system approximately. This reduces
This thesis proposes a new interpolation-based method for linearly constrained convex optimization problems. The method generalizes Shepard's interpolation formula to handle linear constraints by projecting points to the constraint boundaries. The interpolation surface passes through objective function values at the boundary of the feasible region. Several descent methods like steepest descent, quasi-Newton, and Newton's method are explored to optimize the interpolation surface. The method is implemented in MATLAB and tested on problems up to 1000 dimensions, showing competitive performance with standard methods like log-barrier and active-set. Steepest descent appears most effective, with quasi-Newton and Newton's method not improving the number of iterations.
Internship Report: Interaction of two particles in a pipe flowPau Molas Roca
The present document sums up the development and results of the research internship carried out at LEGI Laboratory. The study aimed to understand the hydrodynamic forces involvement in the interaction between two red blood cells located in a capillary (pipe flow). The problem regarding Red Blood Cells (RBCs) moving through a capillary has been tackled from a two-dimensional point of view and has been both analytically and numerically outlined. Finite elements have been used to discretize the geometries considered. Several boundary conditions and geometries were simulated and deeply examined aiming to understand the mechanism governing hydrodynamic attraction and repulsion between red blood cells. The consequent results are analyzed in this report.
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...Alexander Zhdanov
This master's thesis investigates efficiency optimization techniques for real-time GPU raytracing used in modeling car-to-car communication systems. Specifically, it aims to improve the simulation of the propagation channel through ray reordering and caching. The research analyzes existing caching schemes exploiting frame coherence, GPU data structures, and ray reordering techniques. It proposes algorithms for ray sorting on the CPU and caching tracing data. The thesis then implements and evaluates the proposed methods, analyzing system performance for static and dynamic scenes. Testing shows ray reordering significantly increases efficiency, though caching provides varying benefits depending on the scheme used.
This document is a master's thesis that examines localization techniques in wireless sensor networks. It provides background on wireless sensor networks and how they emerged from military applications but are now used in various civil applications. The thesis focuses on developing and analyzing new localization algorithms. It presents the results of experiments measuring received signal strength indication (RSSI) from wireless sensor nodes, which indicate significant fluctuations that could limit the reliability of localization schemes. Overall, the thesis evaluates localization methods and develops new algorithms to improve positioning accuracy in wireless sensor networks.
This is my master degree dissertation project. Which has inspired the "Alagba" project.
It is the proposal for the use of a drone which detects turtle nests using the camera and deep learning.
A Comparative Study Of Generalized Arc-Consistency AlgorithmsSandra Long
This document describes a thesis that compares several algorithms for enforcing Generalized Arc Consistency (GAC) on Constraint Satisfaction Problems (CSPs). It studies GAC2001, algorithms for table constraints (STR1, STR2, STR3, eSTR1, eSTR2), and an algorithm for constraints represented as multi-valued decision diagrams (mddc). The author empirically evaluates and compares the performance of these algorithms on randomly generated CSPs and benchmark problems. A new hybrid algorithm (STR-h) is also proposed that uses a selection criterion to combine STR1 and STR-Ni.
Math Senior Project Digital- Ahnaf KhanM. Ahnaf Khan
This document proposes a pre-processing approach for K-means clustering using Delaunay triangulation. It defines a measure of clusterability based on edge weights in the triangulation. The pre-processing removes edges above a threshold to identify an optimal range for K and initial centroids. Results show the pre-processed K-means requires fewer iterations and produces more accurate clusters than standard K-means based on evaluation metrics. A new clustering algorithm is also proposed that uses the triangulation to directly obtain clusters, demonstrating improved performance over K-means on some datasets.
Similar to Parallelization of Smith-Waterman Algorithm using MPI (20)
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
This document presents a distributed decision tree learning algorithm called Vertical Hoeffding Tree (VHT) for mining big data streams. It summarizes the contributions of the master's thesis, which include: (1) Developing the SAMOA framework for distributed streaming machine learning, (2) Integrating SAMOA with the Storm distributed stream processing engine, and (3) Implementing the VHT algorithm to improve scalability over the standard Hoeffding Tree algorithm when dealing with high-dimensional data streams. The evaluation shows that VHT achieves similar accuracy to Hoeffding Tree but higher throughput, especially on datasets with many attributes.
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
The document presents a master's thesis that proposes and develops Scalable Advanced Massive Online Analysis (SAMOA), a distributed streaming machine learning framework. SAMOA aims to address the big data challenges of volume, velocity, and variety by providing flexible APIs for developing machine learning algorithms, and integrating with Storm, a stream processing engine, to inherit its scalability. The thesis describes SAMOA's modular components, its integration with Storm, and evaluates a distributed online classification algorithm implemented on SAMOA and Storm to demonstrate its features.
Next Generation Hadoop: High Availability for YARN Arinto Murdopo
The document proposes a new architecture for YARN to solve its availability limitation of single-point-of-failure in the resource manager. The key aspects of the proposed architecture are:
1. It utilizes a stateless failure model where all necessary states and information used by the resource manager are stored in a persistent storage.
2. MySQL Cluster (NDB) is proposed as the storage technology due to its high availability, linear scalability, and high throughput of up to 1.8 million writes per second.
3. A proof-of-concept implementation was done using NDB to store application states and their corresponding attempts. Evaluations showed the architecture is able to increase YARN's availability and NDB
Project presentation for High Availability in YARN project. We propose to use MySQL Cluster (NDB) to tackle High Availability issue in YARN. We also developed benchmark framework to investigate whether MySQL Cluster (NDB) is better than Apache's proposed storage (ZooKeeper and HDFS)
Full project report will be uploaded after I finish it.
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
This document describes an integer linear programming (ILP) model and heuristic approach for scheduling jobs in a data center while maximizing benefits related to power costs, revenue, migration costs, and quality of service. The ILP formulation is implemented in CPLEX and a greedy randomized adaptive search procedure (GRASP) metaheuristic is designed to find near-optimal solutions more efficiently. Two variants of the GRASP heuristic are tested on generated problem instances and results are compared to the optimal ILP solutions in terms of solution quality and runtime.
Quantum Cryptography and Possible Attacks-slideArinto Murdopo
Quantum cryptography uses principles of quantum mechanics to securely distribute encryption keys. The BB84 protocol is a seminal quantum key distribution protocol that works as follows: Alice sends Bob polarized photons encoded with random key bits and basis choices. Bob measures the photons randomly in different bases. They communicate to discard mismatched bases, leaving a shared raw key. They test for errors from eavesdropping and apply privacy amplification to distill a final secure key. However, quantum cryptography is vulnerable to attacks like the faked-state attack, where an eavesdropper Eve blind's Bob's detectors and forces measurement outcomes to match her own. If successful, Eve can learn almost all the raw key without introducing errors.
The document describes Megastore, a scalable and highly available storage system for interactive services. Megastore provides ACID semantics through entity groups and uses a modified Paxos algorithm for synchronous replication across groups. It scales through data partitioning and ensures availability by replicating write-ahead logs within entity groups. The system aims to balance the easy usability of relational databases with the scalability of NoSQL systems.
This document analyzes the scalability of Apache Flume by conducting experiments with different Flume configurations and load levels. Two experiment setups are used: one with a one-to-one relationship between Flume nodes and load generators, and another using a cascading setup to aggregate events from two scale nodes into a collector node. The results show that doubling the channel capacity does not necessarily double the maximum event rate, and that a cascading setup can improve scalability over a non-cascading setup.
Large Scale Distributed Storage Systems in Volunteer Computing - SlideArinto Murdopo
This document discusses using decentralized storage systems (DSS) with volunteer computing (VC). It outlines the problem, defines VC and DSS, and reviews several DSS approaches. Key criteria for DSS like availability, scalability, and consistency are examined. The document then analyzes characteristics of DSS that could integrate with VC, challenges of providing incentives, and security issues. It concludes that balancing functionality with complexity is important when integrating DSS with VC systems.
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsArinto Murdopo
This document provides a survey of existing decentralized storage systems and their suitability for use in volunteer computing systems. It discusses several decentralized storage systems including Farsite, Ivy, Overnet/Kademlia, PAST, PASTIS, Voldemort, OceanStore, Glacier, Total Recall, Cassandra, Riak, Dynamo, and Attic. It evaluates each system based on availability, scalability, eventual consistency, performance, and security. The document proposes that the most suitable state-of-the-art decentralized storage system for volunteer computing would combine the best properties of these existing systems.
The document discusses the rise of network virtualization and software-defined networking (SDN). It describes early research projects on campus networks like Ethane in 2007 that led to the OpenFlow specification in 2008. This allowed experiments with new network protocols. The document outlines the founding of Nicira in 2007 and the company's product launch of the Network Virtualization Platform in 2012, which used SDN to virtualize networks.
Distributed Storage System for Volunteer ComputingArinto Murdopo
This document presents a preliminary proposal for a distributed storage system for volunteer computing. It discusses using volunteer computing resources to create a distributed storage system with no single point of failure that provides data availability and integrity. Some challenges include that distributed storage systems for volunteer computing do not currently exist. The proposal reviews existing peer-to-peer distributed storage systems and surveys which may be suitable. The objectives are to evaluate systems based on security, availability, and reliability, and possibly experiment with suitable systems in a distributed storage testbed.
This document outlines Apache Flume, a distributed system for collecting large amounts of log data from various sources and transporting it to a centralized data store such as Hadoop. It describes the key components of Flume including agents, sources, sinks and flows. It explains how Flume provides reliable, scalable, extensible and manageable log aggregation capabilities through its node-based architecture and horizontal scalability. An example use case of using Flume for near real-time log aggregation is also briefly mentioned.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
A Survey of Techniques for Maximizing LLM Performance.pptx
Parallelization of Smith-Waterman Algorithm using MPI
Universitat Politècnica de Catalunya
AMPP Final Project Report
Parallelization of Smith-Waterman Algorithm
Authors:
Iuliia Proskurnia
Arinto Murdopo
Muhammad Anis uddin Nasir

Supervisors:
Josep R. Herrero
Dani Jimenez-Gonzalez
January 16, 2012
3.5 Testing with different GAP penalties . . . 47

4 Conclusions 48

A Source Code Compilation 49
B Execution on ALTIX 50
C Timing diagram for Blocking technique in Solution 2 51
D Timing diagram for Blocking-and-Interleave technique in Solution 2 52
List of Figures

1 Blocking Communication . . . 7
2 Data Partitioning among processes . . . 12
3 Blocking Communication . . . 14
4 Blocking and Interleave Communication . . . 19
5 Blocking and Interleave Communication . . . 24
6 Sequential Code Performance Measurement Result . . . 34
7 Measurement result when N is 5000, B is 100 and I is 1 . . . 35
8 Diagram of measurement result when N is 5000, B is 100, I is 1 . . . 35
9 Measurement result when N is 10000, B is 100 and I is 1 . . . 36
10 Diagram of measurement result when N is 10000, B is 100, I is 1 . . . 36
11 Performance measurement result when N is 10000, P is 8, I is 1 . . . 37
12 Diagram of measurement result when N is 10000, P is 8, I is 1 . . . 37
13 Diagram of measurement result when N is 10000, P is 8, B is 100 . . . 38
14 Measurement result when N is 10000, B is 100 and I is 1 . . . 38
15 Diagram of measurement result when N is 10000, B is 100, I is 1 . . . 39
16 Performance measurement result when N is 10000, P is 8, I is 1 . . . 40
17 Diagram of measurement result when N is 10000, P is 8, I is 1 . . . 40
18 Diagram of measurement result when N is 10000, P is 8, B is 200 . . . 41
19 Measurement result when N is 5000, B is 100 and I is 1 . . . 41
20 Diagram of measurement result when N is 5000, B is 100, I is 1 . . . 42
21 Measurement result when N is 10000, B is 100 and I is 1 . . . 42
22 Diagram of measurement result when N is 10000, B is 100, I is 1 . . . 43
23 Performance measurement result when N is 10000, P is 32, I is 1 . . . 43
24 Diagram of measurement result when N is 10000, P is 32, I is 1 . . . 44
25 Performance measurement result when N is 10000, P is 32, B is 50 . . . 44
26 Diagram of measurement result when N is 10000, P is 32, B is 50 . . . 45
27 Putting all of them together . . . 46
28 Putting all of them together - the plot . . . 46
29 Testing with different gap penalties . . . 47
30 Gap penalty vs Time . . . 47
31 Performance Model Solution 2 . . . 51
32 Performance Model with Interleave . . . 52
1 Introduction

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, i.e., for determining similar regions between two nucleotide or protein sequences. Proteins are built from amino-acid sequences, and proteins with similar structures have similar amino-acid sequences. In this project we implemented a parallel version of the Smith-Waterman algorithm using Message Passing Interface (MPI) code.
To compare two amino-acid sequences, we first have to align them. To find the best alignment between two sequences, the algorithm populates a matrix H of size N × N (where N is the sequence length) according to a scoring criterion. This requires a scoring matrix (the cost of matching two symbols) and a gap penalty for a mismatch between two symbols. Once the matrix H is populated, we can obtain the optimum local alignment by tracing back through the matrix, starting from its highest value.
In our implementation of the Smith-Waterman algorithm we populate the matrix H in parallel using multiple processes running on multicore machines. We used pipelined computation to achieve a certain degree of parallelism, and compared different parallelization techniques to find the optimum technique for the problem. We started by parallelizing our code using different blocking sizes B at the column level. Furthermore, we also introduced parallelization using different levels of interleave I at the row level.

For performance measurement we created performance models of both implementations for two interconnection networks: a linear network and a 2D mesh. We ran our code on the Altix machine using different values of the parameters ∆ (gap penalty), B (column blocking factor) and I (row interleave factor) to find the optimum B and I for the problem empirically. We also calculated the optimum B and I analytically, by finding the global minima of the equations of the performance model.
2 Main Issues and Solutions

2.1 Available Parallelization Techniques

We can achieve pipelining with blocking at both the column and the row level. Blocking at the column level can be interpreted in several ways:

1. Each processor Pi processes B complete columns of the matrix before doing any communication.

2. Each processor Pi processes B complete columns, but after processing the B columns of a single row of the matrix it communicates with the next processor.

3. Each processor Pi processes B complete columns, but after processing the B columns of a set of rows it communicates with the next processor.

4. Each processor Pi processes N/P complete rows; after processing B columns of those N/P rows, it communicates with the next processor.

Among the techniques above, we chose the last one because it gives us the most effective pipelined computation.
2.2 Blocking Technique
2.2.1 Solution 1: Using Scatter and Gather
Based on the technique chosen above, we developed the following solution.
Note that this solution already incorporates the interleave factor I, but
here we set I to 1.
At the first step, the process with rank 0 (the master process) reads
all the necessary files, namely the two protein sequence files. The reading
result is stored in short *a and short *b. It also allocates enough memory
to store the resulting matrix, as shown in the code snippet below.
{
    // note that sizeA is the total number of rows that we need to process.
    // We round up N if N is not divisible by total_processes, as shown
    // below. I is set to 1 here.

    if (N % (total_processes * I) != 0) {
        sizeA = N + (total_processes * I) - (N % (total_processes * I)); // handle case where N is not divisible by (total_processes * I)
    } else {
        sizeA = N;
    }

    read_files(in1, in2, a, b, N - 1); // in1 = input file 1, in2 = input file 2, a = result of reading in1, b = result of reading in2
    chunk_size = sizeA / (total_processes * I); // number of rows that each process needs to work on
    CHECK_NULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int)))); // resulting data
    CHECK_NULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *)))); // contains the list of pointers

    for (i = 0; i <= sizeA; i++)
        h_all[i] = h_all_ptr + i * N; // put the pointers in an array

    // initialize the first row of the resulting matrix with 0
    for (i = 0; i < N; i++)
    {
        h_all[0][i] = 0;
    }
}
Every process reads the PAM matrix, and the master process broadcasts the
chunk_size, N, B, and I values.
MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast chunk size, the number of rows calculated by each slave
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast N
MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast B
MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast I
Then each process allocates enough memory to receive its chunk. Processes
other than rank 0 also need to allocate memory to receive the whole of
protein sequence 2 (which has size N).
CHECK_NULL((chunk_a = (short *) calloc(sizeof(short), chunk_size))); // slave processes will obtain it from the master process
if (rank != 0) {
    CHECK_NULL((b = (short *) malloc(sizeof(short) * (N)))); // slave processes will obtain it from the master process
}

MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD); // broadcast protein 2 to every process
Now for the parallel part. First we calculate how many blocks we will
process, storing the result in the total_blocks variable. We also compute
last_block, which holds the size of the last block to process when N is
not divisible by B (N % B != 0).
int total_blocks = N / B + (N % B == 0 ? 0 : 1);
int last_block = N % B == 0 ? B : N % B;
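The block arithmetic above can be checked with two small standalone helpers (a hypothetical example, not part of the report's code): with N = 10 columns and B = 4, the ceiling division yields 3 blocks, and the last block carries the 2 leftover columns.

```c
/* Number of B-column blocks needed to cover N columns, and the width of
   the final (possibly partial) block. Mirrors the snippet above. */
int total_blocks_for(int N, int B) { return N / B + (N % B == 0 ? 0 : 1); }
int last_block_for(int N, int B)  { return N % B == 0 ? B : N % B; }
```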
Then we scatter the 1st protein sequence (here stored in a), with the size
of each scattered part equal to chunk_size. After each process receives its
scattered part, the computation begins for the process with rank 0. It does
not wait to receive any data from other processes and directly calculates the
1st block of data. Meanwhile every other process with rank r waits for data
from the process with rank r-1. The data sent between processes here is the
last row of the calculated block (an array of integers of size B).
After a process receives the required data, it performs the computation
for that data. At the end, each process with rank r sends the last row of
its calculated block, with size B, to the neighboring process with rank r+1.
Finally, we perform a gather to combine the results. Note that the
current_interleave variable is set to 0 and I is set to 1 here because we
are not using the interleaving factor. The code snippet below shows how to
implement this functionality.
for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size,
                MPI_SHORT, 0, MPI_COMM_WORLD); // chunk_a is the receiving buffer
    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) h[i][0] = 0;

    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) { // if rank 0 is processing the first block, it doesn't need to receive anything
            for (int k = current_column; k < block_end; k++) {
                h[0][k] = 0; // init row 0
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1; // receive from neighboring process
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;
            MPI_Recv(h[0] + current_block * B, size_to_receive, MPI_INT,
                     receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag  = h[i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                down  = h[i - 1][j] + DELTA;
                right = h[i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[i][j] = 0;
                } else {
                    h[i][j] = max;
                }
            }
        }

        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(h[chunk_size] + current_block * B, size_to_send,
                     MPI_INT, send_to, 0, MPI_COMM_WORLD);
            print_vector(h[chunk_size] + current_block * B, size_to_send);
        }
    }

    // Gathering result
    MPI_Gather(h_ptr + N, N * chunk_size, MPI_INT,
               h_all_ptr + N + current_interleave * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
Once the result is gathered, the process with rank 0 deallocates the memory
and performs an optional verification of the result. The verification is done
by comparing the parallel version of the h matrix (stored in h_all) with a
serially computed version (stored in h_verify).
if (rank == 0) {
    if (verifyResult == 1) {
        Max = 0;
        xMax = 0;
        yMax = 0;
        CHECK_NULL((h_verify_ptr = (int *) malloc(sizeof(int) * (N + 1) * (N + 1))));
        CHECK_NULL((h_verify = (int **) malloc(sizeof(int *) * (N + 1))));
        /* Mount h_verify[N][N] */
        for (i = 0; i <= N; i++)
            h_verify[i] = h_verify_ptr + i * (N + 1);
        for (i = 0; i <= N; i++) h_verify[i][0] = 0;
        for (j = 0; j <= N; j++) h_verify[0][j] = 0;

        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                diag  = h_verify[i - 1][j - 1] + sim[a[i - 1]][b[j - 1]];
                down  = h_verify[i - 1][j] + DELTA;
                right = h_verify[i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h_verify[i][j] = 0;
                } else if (max == diag) {
                    h_verify[i][j] = diag;
                } else if (max == down) {
                    h_verify[i][j] = down;
                } else {
                    h_verify[i][j] = right;
                }
                if (max > Max) {
                    Max = max;
                    xMax = i;
                    yMax = j;
                }
            }

        int verFailFlag = 0;
        for (i = 0; i <= N - 1; i++) {
            for (j = 0; j <= N - 1; j++) {
                if (h_all[i][j] != h_verify[i][j]) {
                    printf("Verification fail!\n");
                    printf("h_all[i][j] = %d, h_verify[i][j] = %d\n",
                           h_all[i][j], h_verify[i][j]);
                    verFailFlag = -1;
                    break;
                }
            }

            if (verFailFlag != 0) {
                break;
            }
        }

        if (verFailFlag == 0)
        {
            printf("Verification success!\n");
        }

        free(h_verify_ptr);
        free(h_verify);
    }

    free(a);
    free(h_all_ptr);
    free(h_all);
}

free(b);
free(chunk_a);
free(h);
free(h_ptr);

MPI_Finalize();
Figure 1: Blocking Communication
To summarize this technique, Figure 1 shows how the matrix is divided into
blocks. The number inside each block indicates the step. The red portion in
block 1 indicates the amount of data (B integers) that is sent from
process 0 to process 1 at the end of the calculation of block 1, in step 1.
2.2.2 Solution 1: Linear-array Model
First, we use the linear-array topology to model our solution. Here is the
model for the communication part of our chosen blocking technique.
1. Broadcasting chunk size, N, B, and I
tcomm−bcast−4−int = 4 × (ts + tw ) × log2 (p)
2. Broadcasting of 2nd protein sequence (vector b)
tcomm−bcast−protein−seq = (ts + tw × N ) × log2 (p)
3. Scattering chunk size for each process to compute
Note that the size of chunk_size is the following:
chunk_size = N/p
Therefore the communication time for scattering is shown below:
tcomm−scatter−protein−seq = ts × log2(p) + tw × (N/p) × (p − 1)
4. Sending shared data
To start the first block of computation, the process with rank 0 does not
need to wait for any data from other processes. That means we only
have (N/B + p − 2) stages for sending shared data. The shared data is the
last row of the current finished block, which consists of B items. Therefore,
putting it all together, the communication time to send shared data
is
tcomm−send−shared−data = (N/B + p − 2) × (ts + B × tw)
5. Gathering calculated data
Finally, we need to perform a gather to combine all calculated data.
Note that every process contributes N × chunk_size data, which equals
N × N/p data. Therefore the communication time for this step is given by
tcomm−gather = ts × log2(p) + tw × (N/p) × N × (p − 1)
6. Putting all the communication time together
tcomm−all = tcomm−bcast−4−int + tcomm−bcast−protein−seq + tcomm−scatter−protein−seq + tcomm−send−shared−data + tcomm−gather
tcomm−all(B) = (6 log2(p) + p − 1) × ts + ((4 + N) log2(p) + N + ((N + N²)(p − 1))/p) × tw + (N/B) × ts + (p − 2) × B × tw
Now we calculate the computation time for this blocking technique. Note
that in our blocking technique we have N/B + p − 1 stages of block-calculation.
In each block-calculation, we need to compute (N/p) × B points. Therefore,
if we represent the time to compute one point as tc, we obtain the following
calculation time model:
tcalc = (N/B + p − 1) × ((N/p) × B) × tc
tcalc = (N²/p + NB − NB/p) × tc
tcalc = (N²/p + (N(p − 1)/p) × B) × tc
The final model can be obtained by adding the calculation time and the
communication time:
ttotal = tcomm + tcalc
ttotal(B) = (6 log2(p) + p − 1) × ts + ((4 + N) log2(p) + N + ((N + N²)(p − 1))/p) × tw + (N/B) × ts + (p − 2) × B × tw + (N²/p + (N(p − 1)/p) × B) × tc
2.2.3 Solution 1: Optimum B for Linear-array Model
To find the optimum B for the linear-array model, we calculate the derivative
of the final model of the linear topology with respect to B and set the
derivative to 0, as shown below:
dttotal(B)/dB = 0
Using the model obtained in section 2.2.2, we obtain the following equations:
−(N/B²) × ts + (p − 2) × tw + (N(p − 1)/p) × tc = 0
(p − 2) × tw + (N(p − 1)/p) × tc = (N/B²) × ts
B² = N ts / ((p − 2) tw + (N(p − 1)/p) tc)
B² = p N ts / (p(p − 2) tw + N(p − 1) tc)
B = √(p N ts / (p(p − 2) tw + N(p − 1) tc))
Using the assumption that p is very small in comparison with N, we simplify
the equation above into the following:
B ≈ √(ts/tc)
2.2.4 Solution 1: 2-D Mesh Model
Using the same steps as in section 2.2.2, here is the 2-D Mesh model of
solution 1.
1. Broadcasting chunk size, N, B, and I
tcomm−bcast−4−int = 4 × 2 × (ts + tw) × log2(√p)
2. Broadcasting of 2nd protein sequence (vector b)
tcomm−bcast−protein−seq = 2 × (ts + tw × N) × log2(√p)
3. Scattering chunk size for each process to compute
Note that the size of chunk_size is the following:
chunk_size = N/p
Communication time for scattering in the 2-D Mesh model can be modeled
using a hypercube. It is similar to the communication time for scattering
in the Linear Array model.[1]
tcomm−scatter−protein−seq = ts × log2(p) + tw × (N/p) × (p − 1)
4. Sending shared data
Since sending shared data uses primitive send and receive, the
communication time for this part in the 2-D Mesh model also does not
change:
tcomm−send−shared−data = (N/B + p − 2) × (ts + B × tw)
5. Gathering calculated data
Communication time for gathering uses the same formula as scattering,
but with a different amount of data being gathered:
tcomm−gather = ts × log2(p) + tw × (N/p) × N × (p − 1)
6. Putting all the communication time together
tcomm−all = tcomm−bcast−4−int + tcomm−bcast−protein−seq + tcomm−scatter−protein−seq + tcomm−send−shared−data + tcomm−gather
tcomm−all(B) = (10 log2(√p) + log2(p) + p − 1) × ts + ((8 + 2N) log2(√p) + (N(p − 1))/p + N + (N²(p − 1))/p) × tw + (N/B) × ts + (p − 2) × B × tw
The calculation time does not change between the 2-D Mesh model and the
Linear Array model, therefore the calculation time is
tcalc = (N²/p + (N(p − 1)/p) × B) × tc
Putting it all together:
ttotal = tcomm + tcalc
ttotal(B) = (N²/p) × tc + (10 log2(√p) + log2(p) + p − 1) × ts + ((8 + 2N) log2(√p) + (N(p − 1))/p + N + (N²(p − 1))/p) × tw + (N/B) × ts + (p − 2) × B × tw + ((N(p − 1)/p) × B) × tc
2.2.5 Solution 1: Optimum B for 2-D Mesh Model
We need to calculate the derivative of the final model of the 2-D Mesh model
with respect to B and set the derivative to 0, as shown below:
dttotal(B)/dB = 0
Using the model obtained in section 2.2.4, we obtain the following equations:
−(N/B²) × ts + (p − 2) × tw + (N(p − 1)/p) × tc = 0
(p − 2) × tw + (N(p − 1)/p) × tc = (N/B²) × ts
B² = N ts / ((p − 2) tw + (N(p − 1)/p) tc)
B² = p N ts / (p(p − 2) tw + N(p − 1) tc)
B = √(p N ts / (p(p − 2) tw + N(p − 1) tc))
B ≈ √(ts/tc)
As we observe here, the optimum B does not change when we use the 2-D
Mesh to model the communication. In our solution 1, using the 2-D Mesh
model only affects the broadcast time. Referring to the total time equation
with respect to B (ttotal(B)), the broadcast time is only a constant and
it disappears when we calculate dttotal(B)/dB.
2.2.6 Solution 2: Using Send and Receive
In the second solution, we used the Send and Receive methods provided by
the MPI library for communicating among the processes. In this implementation
every process reads the input file. Every process also reads the similarity
matrix.
After reading the files each process calculates the number of rows that
it has to process and allocates the required memory. The process with rank 0
allocates the matrix H of size N * N. In our implementation data distribution
is fair among all the processes. When the number of rows is not evenly
divisible among all the processes, we give one extra row to each process,
starting from the master process. Figure 2 shows the distribution of data in
the case where data is not equally divisible among the processes.
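This distribution rule can be sketched as two small standalone helpers (a hypothetical illustration reusing the report's names r = N % p and N/p): the first r processes take N/p + 1 rows, and the remaining processes take N/p rows.

```c
/* Rows assigned to process `id` out of p processes for an N-row matrix:
   the first r = N % p processes take one extra row each. */
int rows_for(int id, int N, int p)
{
    int r = N % p;
    return N / p + (id < r ? 1 : 0);
}

/* Global index of the first row owned by process `id` (0-based), matching
   the RowPosition arithmetic used in the filling loop below. */
int first_row(int id, int N, int p)
{
    int r = N % p;
    return id < r ? id * (N / p + 1)
                  : r * (N / p + 1) + (id - r) * (N / p);
}
```

For example, with N = 10 rows and p = 4 processes, the row counts come out as 3, 3, 2, 2, and the owned ranges tile the matrix without gaps.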
Each process calculates the block size that it needs to communicate with
its neighbour. Filling starts with the master process, while the other
processes wait to receive a block before starting processing. The master
communicates its first block with its neighbour after processing its required
number of rows for the first block. Below is the code snippet for filling the
matrix at every process.
Figure 2: Data Partitioning among processes
if (id == 0)
{
    for (i = 0; i < ColumnBlock; i++)
    {
        for (j = 1; j <= s; j++)
        {
            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                int RowPosition;
                if (id < r)
                    RowPosition = id * ((N / p) + 1) + j;
                else
                    RowPosition = (r * ((N / p) + 1)) + ((id - r) * (N / p)) + j;

                diag  = h[j - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down  = h[j - 1][k] + DELTA;
                right = h[j][k - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[j][k] = 0;
                } else {
                    h[j][k] = max;
                }
                chunk[k - (i * B + 1)] = h[j][k];
            }
        }
        MPI_Send(chunk, B, MPI_SHORT, id + 1, 0, MPI_COMM_WORLD);
    }
} else
{
    for (i = 0; i < ColumnBlock; i++)
    {
        MPI_Recv(chunk, B, MPI_SHORT, id - 1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i * B + z + 1) <= N)
                h[0][i * B + z + 1] = chunk[z];
        }
        for (j = 1; j <= s; j++)
        {
            int RowPosition;
            if (id < r)
                RowPosition = id * ((N / p) + 1) + j;
            else
                RowPosition = (r * ((N / p) + 1)) + ((id - r) * (N / p)) + j;

            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                diag  = h[j - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down  = h[j - 1][k] + DELTA;
                right = h[j][k - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;

                chunk[k - (i * B + 1)] = h[j][k];
            }
        }
        if (id != p - 1)
            MPI_Send(chunk, B, MPI_SHORT, id + 1, 0, MPI_COMM_WORLD);
    }
}
At the end every process sends its portion of the matrix H to the master
process using the Send method available in the MPI library. Below is the
code snippet of the gathering process.
if (id == 0)
{
    int row, col;
    for (i = 1; i < p; i++)
    {
        MPI_Recv(&row, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * (row) * (N))));

        MPI_Recv(recv_hptr, row * N, MPI_INT, i, 0, MPI_COMM_WORLD, &status);

        for (j = 0; j < row; j++)
        {
            int RowPosition;
            if (i < r)
                RowPosition = (i * ((N / p) + 1)) + j + 1;
            else
                RowPosition = (r * ((N / p) + 1)) + ((i - r) * (N / p)) + j + 1;

            for (k = 0; k < N; k++)
                h[RowPosition][k + 1] = recv_hptr[j * N + k];
        }
        free(recv_hptr);
    }
}
else
{
    MPI_Send(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * (s) * (N))));

    for (j = 0; j < s; j++)
    {
        for (k = 0; k < N; k++)
        {
            recv_hptr[j * N + k] = h[j + 1][k + 1];
        }
    }
    MPI_Send(recv_hptr, s * N, MPI_INT, 0, 0, MPI_COMM_WORLD);
    free(recv_hptr);
}
Once the result is gathered, the process with rank 0 deallocates the memory
and performs an optional verification of the result.
Figure 3: Blocking Communication
As reflected in Figure 3, the division into blocks in Solution 2 is the same
as in Solution 1. But instead of using scatter and gather to distribute data,
solution 2 uses primitive sends and receives.
2.2.7 Solution 2: Linear-array Model
Initially we calculated the performance model for the linear interconnection
network. The timing diagram can be found in Appendix C.
1. In solution 2 every process calculates (N/p) × B values before
communicating a chunk with the next process. It takes N/B + p − 1 steps
in total for the computation. Below is the equation for the computation:
tcomp1 = (N/B + p − 1) × ((N/p) × B) × tc
2. After each computation step each process communicates a block with its
neighbour process. There are N/B + p − 2 steps of communication
among all the processes.
tcomm1 = (N/B + p − 2) × (ts + B × tw)
3. After completing its part of matrix H every process sends it to the
master process.
tcomm2 = ts + (N/p) × N × tw
4. In the end the master process puts all the partial results into the matrix
H to finalize it.
tcomp2 = ts + (N/p) × N × tw
The total time can be calculated by combining all the computation and
communication times:
ttotal = tcomp1 + tcomm1 + tcomp2 + tcomm2
ttotal = (N/B + p − 1) × ((N/p) × B) × tc + (N/B + p − 2) × (ts + B × tw) + (ts + (N/p) × N × tw) + (ts + (N/p) × N × tw)
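The total-time model above can be evaluated numerically to compare candidate block sizes. The sketch below uses hypothetical machine parameters (assumed values for ts, tw and tc, not Altix measurements).

```c
/* ttotal(B) for Solution 2 on a linear array, as derived above:
   (N/B + p - 1)*(N/p)*B*tc + (N/B + p - 2)*(ts + B*tw) + 2*(ts + (N*N/p)*tw) */
double ttotal_sol2(double B, double N, double p,
                   double ts, double tw, double tc)
{
    double comp1 = (N / B + p - 1.0) * (N / p) * B * tc;
    double comm1 = (N / B + p - 2.0) * (ts + B * tw);
    double tail  = 2.0 * (ts + (N * N / p) * tw); /* tcomm2 + tcomp2 */
    return comp1 + comm1 + tail;
}
```

With N = 1000, p = 8, ts = 100, tw = 1 and tc = 0.1 (assumed), a moderate block size such as B = 30 beats both the extreme B = 1 (many expensive message startups) and B = N (no pipelining).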
2.2.8 Solution 2: Optimum B for Linear-array Model
To find the optimum B for the linear-array model, we calculate the derivative
of the final model of the linear topology with respect to B and set the
derivative to 0, as shown below:
dttotal(B)/dB = 0
Using the model obtained in section 2.2.7, we obtain the following equations:
−(N/B²) × ts + (p − 2) × tw + (N(p − 1)/p) × tc = 0
(p − 2) × tw + (N(p − 1)/p) × tc = (N/B²) × ts
B² = N ts / ((p − 2) tw + (N(p − 1)/p) tc)
B² = p N ts / (p(p − 2) tw + N(p − 1) tc)
B = √(p N ts / (p(p − 2) tw + N(p − 1) tc))
2.2.9 Solution 2: 2-D Mesh Model
We also calculated the performance model for the 2D-Mesh interconnection
network, and we found that there is no difference between the Linear Array
model and the 2-D Mesh model, because the difference between them lies mainly
in the time to perform broadcasting, and this solution does not involve any
broadcasting of elements from the root to the other processes in the system.
2.3 Blocking-and-Interleave Technique
2.3.1 Solution 1: Using Scatter and Gather
Taking into account not only the blocking size B but also the interleave size
I, we developed the solution below. The first step is to allocate memory for
all necessary variables in each process. The master process also allocates
memory for the final matrix where all the partial results will be stored. All
slave processes also allocate memory for partial result matrices which will
eventually be sent to the master process.
main(int argc, char *argv[]) {

    {...}

    int B, I;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    if (rank == 0) {
        chunk_size = sizeA / (total_processes * I);

        CHECK_NULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int)))); // resulting data
        CHECK_NULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *)))); // contains the list of pointers

        for (i = 0; i < sizeA; i++)
            h_all[i] = h_all_ptr + i * N;

        // initialize the first row of the resulting matrix with 0
        for (i = 0; i < N; i++)
        {
            h_all[0][i] = 0;
        }
    }

    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);

    CHECK_NULL((h_ptr = (int *) malloc(sizeof(int) * (N) * (chunk_size + 1))));
    CHECK_NULL((h = (int **) malloc(sizeof(int *) * (chunk_size + 1))));
    for (i = 0; i < chunk_size + 1; i++)
        h[i] = h_ptr + i * N;

    CHECK_NULL((chunk_a = (short *) calloc(sizeof(short), chunk_size)));
    if (rank != 0) {
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);
The master process scatters vector a to the processes in parts; in each
interleave step a part of the vector a is sent. The sequence of code for
interleave 0 is the same as in the previous section, with one exception: the
last process sends its results to the first process. Each process receives B
items of data from the previous one before processing the next B columns.
Each process sends data after processing B columns to the next process, but
the last process sends the data to the first (master) process if it is not
the last stage.
Finally, after calculating all partial matrices, each process sends its
result to the master process (this happens I times).
for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT, 0,
                MPI_COMM_WORLD);
    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) h[i][0] = 0;
    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            for (int k = current_column; k < block_end; k++) {
                h[0][k] = 0;
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;
            MPI_Recv(h[0] + current_block * B, size_to_receive,
                     MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag  = h[i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                down  = h[i - 1][j] + DELTA;
                right = h[i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[i][j] = 0;
                } else {
                    h[i][j] = max;
                }
            }
        }
        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(h[chunk_size] + current_block * B, size_to_send,
                     MPI_INT, send_to, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Gather(h_ptr + N, N * chunk_size, MPI_INT,
               h_all_ptr + N + current_interleave * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
}
{...}
To summarize, the interleave realization is illustrated in Figure 4.
Figure 4: Blocking and interleave communication
2.3.2 Solution 1: Linear-Array Model
Here is the linear-array model for the communication part of the blocking
technique with interleaving:
1. Broadcasting chunk size, N, B, and I
tcomm−bcast−4−int = 4 × (ts + tw ) × log2 (p)
2. Broadcasting of 2nd protein sequence (vector b)
tcomm−bcast−protein−seq = (ts + tw × N ) × log2 (p)
3. Scattering chunk size for each process to compute
Note that the size of chunk_size is the following:
chunk_size = N/(p × I), where I is the interleave factor.
Scattering is performed I times. Therefore, the communication
cost of scattering is
tcomm−scatter−protein−seq = I × (ts × log2(p) + tw × (N/(p × I)) × (p − 1))
4. Sending shared data
To start the first block of computation, the process with rank 0 does not
need to wait for any data from other processes. We also need to take
note that in each interleave except the last one, the last process
(the (p − 1)th process) needs to send N data to process 0. Therefore, for
I − 1 occurrences, we need (N/B + p − 1) pipeline stages for sending
data, and for the last interleave step (the Ith step), we have
(N/B + p − 2) stages for sending data. The shared data is the last row
of the current finished block, which consists of B items. Therefore, putting
it all together, the communication time to send shared data is
tcomm−send−shared−data = (I − 1) × (N/B + p − 1) × (ts + B × tw) + (N/B + p − 2) × (ts + B × tw)
5. Gathering calculated data
We need to perform a gather to combine all calculated data in every
interleave step. Note that every process contributes N × chunk_size
data, which equals N × N/(p × I) data. This gather procedure is
repeated I times. Therefore the communication time for this step is
given by
tcomm−gather = I × (ts × log2(p) + tw × (N/(p × I)) × N × (p − 1))
6. Putting all the communication time together
tcomm−all = tcomm−bcast−4−int + tcomm−bcast−protein−seq + tcomm−scatter−protein−seq + tcomm−send−shared−data + tcomm−gather
Simplifying the equation with respect to B (by separating the constant
part of the equation from the part containing B, so that we can easily
calculate the derivative of the equation to obtain the optimum B), we
obtain the following equation:
tcomm−all(B) = ((5 + 2I) log2(p) + (p − 1)(I − 1) + (p − 2)) × ts + ((4 + N) log2(p) + (N/p)(p − 1) + (N²/p)(p − 1) + I − 1 + N) × tw + (IN/B) × ts + ((I − 1)(p − 1) + p − 2) × B × tw
Simplifying the equation with respect to I, we obtain the following
equation:
tcomm−all(I) = ((5 + 2I) log2(p) − 1) × ts + ((4 + N) log2(p) + (N/p)(p − 1) + (N²/p)(p − 1) + B) × tw + (N/B + p − 1) × (ts + B × tw) × I
Now we calculate the computation time for this blocking technique. Note
that in our blocking technique we have I × (N/B + p − 1) stages of
block-calculation. In each block-calculation, we need to compute
(N/(p × I)) × B points. Therefore, if we represent the time to compute one
point as tc, we obtain the following calculation time model:
tcalc = I × (N/B + p − 1) × ((N/(p × I)) × B) × tc
I cancels out and we obtain the following:
tcalc = (N²/p + NB − NB/p) × tc
tcalc = (N²/p + (N(p − 1)/p) × B) × tc
The final model is obtained by adding the calculation time and the
communication time. Here is the final equation with respect to B:
ttotal = tcomm + tcalc
ttotal(B) = ((5 + 2I) log2(p) + (p − 1)(I − 1) + (p − 2)) × ts + ((4 + N) log2(p) + (N²/p)(p − 1) + (N/p)(p − 1) + I − 1 + N) × tw + (IN/B) × ts + ((I − 1)(p − 1) + p − 2) × B × tw + (N²/p + (N(p − 1)/p) × B) × tc
Here is the final equation with respect to I:
ttotal(I) = ((5 + 2I) log2(p) − 1) × ts + ((4 + N) log2(p) + (N/p)(p − 1) + (N²/p)(p − 1) + B) × tw + (N/B + p − 1) × (ts + B × tw) × I + (N²/p + (N(p − 1)/p) × B) × tc
2.3.3 Solution 1: Optimum B and I for Linear-array Model
The optimum B can be derived by calculating dttotal(B)/dB and setting the
derivative to 0:
dttotal(B)/dB = 0
Using the model obtained in the previous section, we obtain the following
equations:
−(IN/B²) × ts + ((I − 1)(p − 1) + (p − 2)) × tw + (N(p − 1)/p) × tc = 0
((I − 1)(p − 1) + (p − 2)) × tw + (N(p − 1)/p) × tc = (IN/B²) × ts
B² = IN ts / (((I − 1)(p − 1) + (p − 2)) tw + (N(p − 1)/p) tc)
B² = pIN ts / (((I − 1)(p − 1) + (p − 2)) p tw + N(p − 1) tc)
B = √(pIN ts / (((I − 1)(p − 1) + (p − 2)) p tw + N(p − 1) tc))
B ≈ √(IN ts / (N tc + I))
However, we cannot find an optimum I for the Blocking-and-Interleave
technique because the derivative dttotal(I)/dI results in a constant, as
shown below:
dttotal(I)/dI = 0
(N/B + p − 1) × (ts + B × tw) = 0
Looking at the equation for dttotal(I)/dI, the interleave factor only
introduces more communication time when sending and receiving shared data.
Therefore no optimum interleave level can be derived using this model.
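This conclusion can be checked numerically: the I-dependent send/receive term (N/B + p − 1) × (ts + B × tw) × I grows linearly in I, so under this model any I > 1 only adds communication time. The parameter values in the sketch below are hypothetical, chosen only for illustration.

```c
/* I-dependent part of tcomm-all(I): the pipelined send/receive cost,
   (N/B + p - 1) * (ts + B*tw) * I, which is strictly increasing in I. */
double interleave_comm(double I, double N, double B, double p,
                       double ts, double tw)
{
    return (N / B + p - 1.0) * (ts + B * tw) * I;
}
```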
2.3.4 Solution 1: 2-D Mesh Model

Using the same technique as for the Linear-array model, here is the communication and computation model of the 2-D Mesh model

1. Broadcasting chunk size, N, B, and I

   t_comm-bcast-4-int = 4 × 2 × (t_s + t_w) × log2(√p)

2. Broadcasting the 2nd protein sequence (vector b)

   t_comm-bcast-protein-seq = 2 × (t_s + t_w × N) × log2(√p)

3. Scattering the chunk for each process to compute

   As discussed in section 2.2.4, the scattering communication model is the same for the 2-D Mesh and Linear-array models.

   t_comm-scatter-protein-seq = I × (t_s × log2(p) + t_w × N/(p×I) × (p − 1))

4. Sending shared data

   Communication time for sending shared data is also the same for the 2-D Mesh and Linear-array models.

   t_comm-send-shared-data = (I − 1) × (N/B + p − 1) × (t_s + B×t_w) + (N/B + p − 2) × (t_s + B×t_w)

5. Gathering calculated data

   The gathering formula equals the scattering formula except for the amount of data being gathered.

   t_comm-gather = I × (t_s × log2(p) + t_w × N/(p×I) × N × (p − 1))
6. Putting all the communication time together

   t_comm-all = t_comm-bcast-4-int + t_comm-bcast-protein-seq + t_comm-scatter-protein-seq + t_comm-send-shared-data + t_comm-gather

   Simplifying the equation with respect to B (separating the constant part of the equation from the terms containing B, so that we can easily take the derivative to obtain the optimum B), we obtain

   t_comm-all(B) = (10×log2(√p) + 2I×log2(p) + (p−1)(I−1) + (p−2)) × t_s + ((8+2N)×log2(√p) + (N/p)×(p−1) + (N^2/p)×(p−1) + I − 1 + N) × t_w + (I×N/B) × t_s + ((I−1)(p−1) + p − 2) × B × t_w

   Simplifying the equation with respect to I, we obtain

   t_comm-all(I) = (10×log2(√p) + 2I×log2(p) − 1) × t_s + ((8+2N)×log2(√p) + (N/p)×(p−1) + (N^2/p)×(p−1) + B) × t_w + (N/B + p − 1) × (t_s + B×t_w) × I
2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model

The optimum B can be derived by taking the derivative d t_total(B)/dB and setting it to 0.

d t_total(B)/dB = 0

Using the model obtained in the previous section, we get the following equation

−(I×N/B^2) × t_s + ((I−1)(p−1) + (p−2)) × t_w + (N×(p−1)/p) × t_c = 0

((I−1)(p−1) + (p−2)) × t_w + (N×(p−1)/p) × t_c = (I×N/B^2) × t_s

B^2 = I×N×t_s / (((I−1)(p−1) + (p−2)) × t_w + (N×(p−1)/p) × t_c)

B^2 = p×I×N×t_s / (((I−1)(p−1) + (p−2)) × p×t_w + N×(p−1) × t_c)

B = sqrt( p×I×N×t_s / (((I−1)(p−1) + (p−2)) × p×t_w + N×(p−1) × t_c) )
We observe that the resulting optimum B for the 2-D Mesh model is equal to that of the Linear-array model. As discussed in section 2.2.5, the 2-D Mesh model differs only in the broadcast time, which acts as a constant in the t_total(B) equation and disappears when we take the derivative.

As in the Linear-array model, we cannot find an optimum I for the Blocking-and-Interleave technique because the derivative d t_total(I)/dI reduces to a constant, as shown below

d t_total(I)/dI = 0

(N/B + p − 1) × (t_s + B×t_w) = 0
2.3.6 Solution 1: Improvement
Figure 5: Blocking and Interleave Communication
The main idea of this improvement is to move the gathering of final data to the end of the whole calculation in each process. That means, referring to Figure 5, gathering is performed after step 14.

To implement this improvement, we performed the following steps:

1. Allocate enough memory for each process to hold I × N × chunk_size values. Note that chunk_size in this case is N/(P×I).
   CHECK_NULL((hptr = (int *) malloc(sizeof(int) * N * I * (chunk_size + 1)))); /* temporary resulting matrix for each process */
   CHECK_NULL((h = (int **) malloc(sizeof(int *) * I * (chunk_size + 1)))); /* list of pointers */

   int ***hfin;
   CHECK_NULL(hfin = (int ***) malloc(sizeof(int **) * I));

   for (i = 0; i < (chunk_size + 1) * I; i++) {
       h[i] = hptr + i * N; /* put the pointer into the array */
   }

   for (i = 0; i < I; i++) {
       hfin[i] = h + i * (chunk_size + 1);
   }
2. Change the way each process manipulates the data. Each process now stores its results through hfin, a variable of type int ***, as shown in the following code snippet
   for (int current_interleave = 0; current_interleave < I; current_interleave++) {

       MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                   chunk_size, MPI_SHORT, chunk_a, chunk_size,
                   MPI_SHORT, 0, MPI_COMM_WORLD); /* chunk_a is the receiving buffer */

       int current_column = 1;
       for (i = 0; i < chunk_size + 1; i++) hfin[current_interleave][i][0] = 0;

       for (int current_block = 0; current_block < total_blocks; current_block++) {
           /* Receive */
           int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
           if (rank == 0 && current_interleave == 0) { /* if rank 0 is processing the first block, it doesn't need to receive anything */
               for (int k = current_column; k < block_end; k++) {
                   hfin[current_interleave][0][k] = 0; /* init row 0 */
               }
           } else {
               int receive_from = rank == 0 ? total_processes - 1 : rank - 1; /* receive from neighboring process */
               int size_to_receive = current_block == total_blocks - 1 ? last_block : B;

               MPI_Recv(hfin[current_interleave][0] + current_block * B, size_to_receive,
                        MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
           }
           for (j = current_column; j < block_end; j++, current_column++) {
               for (i = 1; i < chunk_size + 1; i++) {
                   diag  = hfin[current_interleave][i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                   down  = hfin[current_interleave][i - 1][j] + DELTA;
                   right = hfin[current_interleave][i][j - 1] + DELTA;
                   max = MAX3(diag, down, right);
                   if (max <= 0) {
                       hfin[current_interleave][i][j] = 0;
                   } else {
                       hfin[current_interleave][i][j] = max;
                   }
               }
           }

           /* Send */
           if (current_interleave + 1 != I || rank + 1 != total_processes) {
               int send_to = rank + 1 == total_processes ? 0 : rank + 1;
               int size_to_send = current_block == total_blocks - 1 ? last_block : B;
               MPI_Send(hfin[current_interleave][chunk_size] + current_block * B,
                        size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
           }
       }
   }
Note that hfin[i] holds the data for the i-th interleaving stage in each process.
3. Move the gathering process to the end of all calculation, as shown in the following code snippet

   for (i = 0; i < I; i++) {
       MPI_Gather(hptr + N + i * chunk_size * N, N * chunk_size, MPI_INT,
                  h_all_ptr + N + i * chunk_size * total_processes * N,
                  N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
   }
2.3.7 Solution 1: Optimum B and I for the Improved Solution

Here are the parts of the model that are affected by the improved solution.

1. Sending shared data

   For the first I − 1 interleaving stages the communication time is

   (I − 1) × (t_s + t_w × B) × N/B

   The last interleaving stage consists of the following amount of communication time:

   (t_s + t_w × B) × (N/B + P − 2)

   Putting them together, the communication time to send shared data is

   (t_s + t_w × B) × (N/B + P − 2) + (I − 1) × (t_s + t_w × B) × N/B

2. Computation time

   Along with the send and receive changes, the time for computation also improves:

   ((N/B) × B × (N/(P×I)) × (I − 1) + B × (N/(P×I)) × (N/B + P − 1)) × t_c
Optimal B and I for Improved Solution

To calculate the optimal values we ignore all the communication time that does not influence the optimal B and I. For optimal B, only the following terms enter the calculation.

t_total-improved(B) = (t_s + t_w × B) × (N/B + P − 2) + (I − 1) × (t_s + t_w × B) × N/B + ((N/B) × B × (N/(P×I)) × (I − 1) + B × (N/(P×I)) × (N/B + P − 1)) × t_c

d t_total-improved(B)/dB = 0

−(I − 1) × t_s × N/B^2 − N × t_s/B^2 + (P − 2) × t_w + (P − 1) × (N/(P×I)) × t_c = 0

B = sqrt( I^2 × t_s × N × P / ((P − 2) × t_w × P × I + (P − 1) × N × t_c) )

B ≈ sqrt( I × N × t_s / ((P − 2) × t_w) )
However, for the optimal I value we also need to consider the scatter time. Therefore we obtain the following formula for t_total-improved(I)

t_total-improved(I) = I × t_s × log2(p) + (t_s + t_w × B) × (N/B + P − 2) + (I − 1) × (t_s + t_w × B) × N/B + ((N/B) × B × (N/(P×I)) × (I − 1) + B × (N/(P×I)) × (N/B + P − 1)) × t_c

d t_total-improved(I)/dI = 0

t_s × log2(p) + (t_s + t_w × B) × N/B + (N^2/(P × I^2)) × t_c − (B × N/(P × I^2)) × (N/B + P − 1) × t_c = 0

I = sqrt( (B^2 × N × (N/B + P − 1) × t_c − N^2 × B × t_c) / (B × P × t_s × log2(p) + (t_s + t_w × B) × N × P) )

I ≈ sqrt( B × N × t_c / (t_s × log2(p) + N × t_w + (N/B) × t_s) )
2.3.8 Solution 2: Using Send and Receive

This implementation also takes the row interleave factor into account along with the column interleave. Every process calculates the number of rows it has to process in every interleave and initializes the memory. The master process declares the matrix H and uses it for its own partial processing as well.

Each process handles N/(p×I) rows in every interleave and communicates each block with its neighbour process. The last process communicates its block with the master process, and does not perform this communication in the last interleave.
   if (id == 0)
   {
       for (i = 0; i < ColumnBlock; i++)
       {
           CHECK_NULL((chunk = (int *) malloc(sizeof(int) * B)));

           for (j = 1; j <= s; j++)
           {
               for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
               {
                   int RowPosition;

                   if ((interleave*p + id) < r)
                       RowPosition = (interleave * (N/(p*I) + 1) * p) + id * (N/(p*I) + 1) + j;
                   else
                       RowPosition = (r * (N/(p*I) + 1)) + (interleave*p + id - r) * (N/(p*I)) + j;

                   diag  = h[RowPosition - 1][k-1] + sim[a[RowPosition]][b[k]];
                   down  = h[RowPosition - 1][k] + DELTA;
                   right = h[RowPosition][k-1] + DELTA;
                   max = MAX3(diag, down, right);

                   if (max <= 0) {
                       h[RowPosition][k] = 0;
                   } else {
                       h[RowPosition][k] = max;
                   }
                   chunk[k - (i*B + 1)] = h[RowPosition][k];
               }
           } /* communicate the partial block to the next process */
           MPI_Send(chunk, B, MPI_INT, id + 1, 0, MPI_COMM_WORLD);
           free(chunk);
       }
       /* end filling matrix H[][] at master */
   } else if (id != p - 1)
   {   /* filling matrix at other processes */

       for (i = 0; i < ColumnBlock; i++)
       {
           CHECK_NULL((chunk = (int *) malloc(sizeof(int) * B)));

           MPI_Recv(chunk, B, MPI_INT, id - 1, 0, MPI_COMM_WORLD, &status);
           for (z = 0; z < B; z++)
           {
               if ((i*B + z) <= N)
                   h[0][i*B + z + 1] = chunk[z];
           }
           for (j = 1; j <= s; j++)
           {
               int RowPosition;

               if ((interleave*p + id) < r)
                   RowPosition = (interleave * (N/(p*I) + 1) * p) + id * (N/(p*I) + 1) + j;
               else
                   RowPosition = (r * (N/(p*I) + 1)) + (interleave*p + id - r) * (N/(p*I)) + j;

               for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
               {
                   diag  = h[j-1][k-1] + sim[a[RowPosition]][b[k]];
                   down  = h[j-1][k] + DELTA;
                   right = h[j][k-1] + DELTA;
                   max = MAX3(diag, down, right);
                   if (max <= 0)
                       h[j][k] = 0;
                   else
                       h[j][k] = max;

                   chunk[k - (i*B + 1)] = h[j][k];
               }
           }
           MPI_Send(chunk, B, MPI_INT, id + 1, 0, MPI_COMM_WORLD);
           free(chunk);
       } /* end filling matrix at other processes */
   } else /* start filling matrix at last process */
   {
       for (i = 0; i < ColumnBlock; i++)
       {
           CHECK_NULL((chunk = (int *) malloc(sizeof(int) * B)));

           MPI_Recv(chunk, B, MPI_INT, id - 1, 0, MPI_COMM_WORLD, &status);
           for (z = 0; z < B; z++)
           {
               if ((i*B + z) <= N)
                   h[0][i*B + z + 1] = chunk[z];
           }

           free(chunk);
           for (j = 1; j <= s; j++)
           {
               int RowPosition;
               if ((interleave*p + id) < r)
                   RowPosition = (interleave * (N/(p*I) + 1) * p) + id * (N/(p*I) + 1) + j;
               else
                   RowPosition = (r * (N/(p*I) + 1)) + (interleave*p + id - r) * (N/(p*I)) + j;

               for (k = i*B + 1; k <= (i+1)*B && k <= b[0] && k <= N; k++)
               {
                   diag  = h[j-1][k-1] + sim[a[RowPosition]][b[k]];
                   down  = h[j-1][k] + DELTA;
                   right = h[j][k-1] + DELTA;
                   max = MAX3(diag, down, right);
                   if (max <= 0)
                       h[j][k] = 0;
                   else
                       h[j][k] = max;
               }
           }
       }
   }
After filling the partial matrix H, every process sends the partial result to the master process at every interleave. The following code snippet shows the master gathering the partial results after every interleave.
   if (id == 0)
   {
       int row;
       for (i = 1; i < p; i++)
       {
           MPI_Recv(&row, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
           CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * row * N)));

           MPI_Recv(recv_hptr, row * N, MPI_INT, i, 0, MPI_COMM_WORLD, &status);

           for (j = 0; j < row; j++)
           {
               int RowPosition;

               if ((interleave*p + i) < r)
                   RowPosition = (interleave * (N/(p*I) + 1) * p) + i * (N/(p*I) + 1) + j + 1;
               else
                   RowPosition = (r * (N/(p*I) + 1)) + (interleave*p + i - r) * (N/(p*I)) + j + 1;

               for (k = 0; k < N; k++)
                   h[RowPosition][k+1] = recv_hptr[j*N + k];
           }
           free(recv_hptr);
       }
   }
   else
   {
       MPI_Send(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
       CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * s * N)));

       for (j = 0; j < s; j++)
       {
           for (k = 0; k < N; k++)
               recv_hptr[j*N + k] = h[j+1][k+1];
       }
       MPI_Send(recv_hptr, s * N, MPI_INT, 0, 0, MPI_COMM_WORLD);

       free(recv_hptr);
   }
To summarize, the interleave realization is illustrated in Appendix D.
2.3.9 Solution 2: Linear-array Model

1. Every process calculates N/(p×I) × B values in every interleave before communicating a chunk with the next process. In total, computation takes (N/B + p − 1) × I steps. The equation for computation is

   t_comp1 = I × (N/B + p − 1) × (N/(p×I) × B) × t_c

2. After each computation step a process communicates a block with its neighbour process. There are N/B + p − 2 steps of communication among all the processes.

   t_comm1 = (I − 1) × (N/B + p − 1) × (t_s + B × t_w) + (N/B + p − 2) × (t_s + B × t_w)

3. After completing its part of matrix H, every process sends it to the master process.

   t_comm2 = (t_s + N/(p×I) × N × t_w) × I

4. In the end the master process puts all the partial results into matrix H to finalize it.

   t_comp2 = I × (t_s + N/(p×I) × N × t_w)

The total execution time is obtained by combining all the times.

t_total = t_comp1 + t_comm1 + t_comp2 + t_comm2

t_total = I × (N/B + p − 1) × (N/(p×I) × B) × t_c + (I − 1) × (N/B + p − 1) × (t_s + B × t_w) + (N/B + p − 2) × (t_s + B × t_w) + (t_s + N/(p×I) × N × t_w) × I + I × (t_s + N/(p×I) × N × t_w)
2.3.10 Solution 2: Optimum B and I for Linear-array Model

The optimum B can be derived by taking the derivative d t_total(B)/dB and setting it to 0.

d t_total(B)/dB = 0

Using the model obtained in the previous section, we get the following equation

−(I×N/B^2) × t_s + ((I−1)(p−1) + (p−2)) × t_w + (N×(p−1)/p) × t_c = 0

((I−1)(p−1) + (p−2)) × t_w + (N×(p−1)/p) × t_c = (I×N/B^2) × t_s

B^2 = I×N×t_s / (((I−1)(p−1) + (p−2)) × t_w + (N×(p−1)/p) × t_c)

B^2 = p×I×N×t_s / (((I−1)(p−1) + (p−2)) × p×t_w + N×(p−1) × t_c)

B = sqrt( p×I×N×t_s / (((I−1)(p−1) + (p−2)) × p×t_w + N×(p−1) × t_c) )
However, we cannot find an optimum I for the Blocking-and-Interleave technique because the derivative d t_total(I)/dI reduces to a constant, as shown below

d t_total(I)/dI = 0

(N/B + p − 1) × (t_s + B×t_w) = 0

Looking at the equation for d t_total(I)/dI, the interleave factor only introduces more communication time when sending and receiving shared data. Therefore no optimum interleave level can be derived from this model.
2.3.11 Solution 2: 2-D Mesh Model

As discussed in section 2.2.9, the 2-D Mesh model is the same as the Linear-array model here, because the 2-D Mesh model only affects the broadcast procedure and solution 2 does not include any broadcast in its implementation.

3 Performance Results

We measured the performance of both parallel versions on the Altix machine and compared the results against the sequential version.
3.1 Solution 1
3.1.1 Performance of Sequential Code
First we measured the performance of the Smith-Waterman algorithm using the sequential code. Figure 6 shows the results.
Figure 6: Sequential Code Performance Measurement Result
Figure 6 shows that as N increases, the time taken to fill matrix h also increases almost linearly.
3.1.2 Find Out Optimum Number of Processor (P)
At first, we observe the performance by fixing the size of the compared protein (N) to 5000 and 10000, the block size (B) to 100, and the interleave factor (I) to 1. The result is shown in Figure 7.

1. Protein size equals 5000 (N = 5000), block size (B) is 100, and interleave factor (I) is 1

   Figure 7: Measurement result when N is 5000, B is 100 and I is 1

   Plotting the result gives the diagram in Figure 8.

   Figure 8: Diagram of measurement result when N is 5000, B is 100, I is 1

   When the protein size (N) is 5000 and the number of processors (P) is 4, we obtain t_serial/t_parallel = 3.3/1.454 = 2.26 times speedup.
2. Protein size equals 10000 (N = 10000). We obtain the result shown in Figure 9.

   Figure 9: Measurement result when N is 10000, B is 100 and I is 1

   Plotting the result gives the diagram in Figure 10.

   Figure 10: Diagram of measurement result when N is 10000, B is 100, I is 1

   When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain t_serial/t_parallel = 12.508/2.47 = 5.06 times speedup.

Based on the results above, we found that maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments, we fix the number of processors to 8 and vary the other parameters.
3.1.3 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000 and the interleave factor (I) to 1. The results are in Figure 11.

Figure 11: Performance measurement result when N is 10000, P is 8, I is 1

Figure 12: Diagram of measurement result when N is 10000, P is 8, I is 1

The results are plotted in Figure 12. The right-hand side of Figure 12 zooms in to give a clearer picture of the performance when B is less than or equal to 500.

We found that the empirical optimum blocking size (B) in solution 1 is 100, which yields t_serial/t_parallel = 12.508/2.401 = 5.21 times speedup.
3.1.4 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) from the previous section, we search for the optimum I. The result is shown in Figure 13.

Figure 13: Diagram of measurement result when N is 10000, P is 8, B is 100

We found that the optimum I is 1, which gives 4.76 times speedup compared to sequential execution.
3.2 Solution 1-Improved

We repeated the Solution 1 experiments to obtain the corresponding data for our improved solution.

3.2.1 Find Out Optimum Number of Processor (P)

At first, we observe the performance by fixing the protein size (N) to 10000, the block size (B) to 100 and the interleave factor (I) to 1. The result is shown in Figure 14.

Figure 14: Measurement result when N is 10000, B is 100 and I is 1

Plotting the result gives the diagram in Figure 15.

Figure 15: Diagram of measurement result when N is 10000, B is 100, I is 1

When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain t_serial/t_parallel = 12.508/2.977 = 4.201 times speedup.

Based on the result above, we found that maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments, we fix the number of processors to 8 and vary the other parameters.
3.2.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000 and the interleave factor (I) to 1. The results are in Figure 16.

Figure 16: Performance measurement result when N is 10000, P is 8, I is 1

The results are plotted in Figure 17. The right-hand side of Figure 17 zooms in to give a clearer picture of the performance when B is less than or equal to 500.

Figure 17: Diagram of measurement result when N is 10000, P is 8, I is 1

We found that the empirical optimum blocking size (B) in the improved solution is 200, which yields t_serial/t_parallel = 12.508/2.464 = 5.08 times speedup.
3.2.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) from the previous section, we search for the optimum I. The result is shown in Figure 18.

Figure 18: Diagram of measurement result when N is 10000, P is 8, B is 200

We found that the optimum I is 2, which gives t_serial/t_parallel = 12.508/2.613 = 4.79 times speedup.
3.3 Solution 2

Using the sequential-code performance results obtained during the Solution 1 evaluation, we measured the performance of solution 2.

3.3.1 Find Out Optimum Number of Processor (P)

The first step is to observe the performance by fixing the protein size (N) and block size (B), and setting the interleave factor (I) to 1.

1. Protein size equals 5000 (N = 5000), block size (B) is 100, and interleave factor (I) is 1

   Figure 19: Measurement result when N is 5000, B is 100 and I is 1

   The result is plotted in Figure 20.
Figure 20: Diagram of measurement result when N is 5000, B is 100, I is 1

   Using a protein size (N) of 5000 and 32 processors (P), we achieve a maximum of 31.55% speedup compared to the existing sequential code.

2. Protein size equals 10000 (N = 10000), block size (B) is 100, and interleave factor (I) is 1

   Figure 21: Measurement result when N is 10000, B is 100 and I is 1

   The result is plotted in Figure 22.

   Figure 22: Diagram of measurement result when N is 10000, B is 100, I is 1

   Using a protein size (N) of 10000 and 32 processors (P), we achieve 54.67% speedup compared to the existing sequential code.

Based on the results obtained in this section, we found that the parallel implementation of solution 2 achieves the most speedup when the number of processors is 32. In our subsequent performance evaluation, we fix the number of processors to 32 and observe the most optimum values for the other variables.
3.3.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 32, the protein size (N) to 10000 and the interleave factor (I) to 1. The results are in Figure 23.

Figure 23: Performance measurement result when N is 10000, P is 32, I is 1

The results are plotted in Figure 24.

Figure 24: Diagram of measurement result when N is 10000, P is 32, I is 1

We found that the empirical optimum blocking size (B) in our solution 2 is 50. Interestingly, the performance using this B is slightly worse than the result from section 3.3.1: with B equal to 50, we achieve 53.91% speedup compared to sequential execution, but are 1.69% slower compared to the result from section 3.3.1.
3.3.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) from the previous section, we search for the optimum I. The result is shown in Figure 25 and Figure 26.

Figure 25: Performance measurement result when N is 10000, P is 32, B is 50

Figure 26: Diagram of measurement result when N is 10000, P is 32, B is 50

We found that the optimum I is 30. Another interesting point is that the execution times are very close to each other when I is between 10 and 100. That means that for the existing configuration (N = 10000, P = 32 and B = 50), the value of I does not affect the execution time much in that range; practically, we can choose any I value from 10 to 100. Using the optimum I of 30, we obtain 58.79% speedup compared to sequential execution, and 10.58% speedup compared to the result without interleaving.