This document discusses various strategies for parallelizing medical image processing tasks across multi-core CPUs. It begins with a poll asking about readers' computer hardware and parallel programming experience. It then covers degrees of parallelism from serial to large-scale parallelism. The document presents pragmatic approaches using C/C++ with "bolted on" parallel concepts. It provides a brief introduction to SIMD and focuses on SMP concepts using threads, concurrency, and parallel programming models like OpenMP, TBB, and ITK. It discusses example problems like thresholding images and common errors. It concludes with next steps in parallel computing.
10. Parallel Computing – according to Google
"parallel computing" – 1.4M hits on Google
"multithreading" – 10M hits
"multicore" – 2.4M hits
"parallel programming" – 1.1M hits
Why is it so hard? The world is parallel, and we all think in parallel, yet we are taught to program in serial.
11. Degrees of parallelism (my take)
Serial – SISD, single thread of execution
Data parallel – SIMD (fine-grained parallelism)
Embarrassingly parallel – larger-scale SIMD, e.g. CT or MR reconstruction, where each operation is independent (e.g. iFFT of slices)
Worker thread – e.g. virus-scanning software
Coarse-grained parallelism – SMP or MIMD; the focus of this presentation (more in the GPU talk): concurrency, OpenMP, TBB, pthreads/Winthreads
Large scale – MPI on a cluster, tight coupling
Large scale – grid computing, loose coupling
12. Pragmatic approach
C/C++ and Fortran are the kings of performance (I've never written a single line of Fortran, so don't ask)
"Bolted on" parallel concepts
Zero language support
Huge existing codebase
13. Pragmatic approach
Briefly touch on SIMD
Introduce SMP concepts: threads, concurrency
Development models: pthreads/WinThreads, OpenMP, TBB, ITK
Medical image processing example problems
Common errors
Next steps
16. Data structures for SIMD
Array of Structures:
  struct Vec { float x, y, z; };
  Vec* points = new Vec[sz];
[Figure: each Vec packed as X Y Z plus padding; SIMD use requires a pack step before the operation and an unpack step after.]
17. Data structures for SIMD
Structure of Arrays:
  struct Vec {
    float *x, *y, *z;
    Vec ( int sz ) {
      x = new float[sz];
      y = new float[sz];
      z = new float[sz];
    }
  };
Structure of Arrays (SIMD type):
  struct Vec {
    Vector4f* v;
    Vec ( int sz ) {
      // must be word aligned
      v = new Vector4f[sz];
    }
  };
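To make the two layouts concrete, here is a minimal, self-contained C++ sketch (not from the slides; plain scalar code, no intrinsics) contrasting how a field is traversed in each layout. The unit-stride SoA loop is the shape an autovectorizing compiler can turn into packed SIMD adds:

```cpp
#include <vector>

// Array of Structures: the fields of one point are adjacent in memory,
// so walking all x values strides over y and z as well.
struct PointAoS { float x, y, z; };

// Structure of Arrays: all x values are contiguous, the layout SIMD
// units (and autovectorizers) prefer.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(int sz) : x(sz), y(sz), z(sz) {}
};

// Sum of x over AoS: strided access (stride = sizeof(PointAoS)).
float sumXAoS(const std::vector<PointAoS>& pts) {
    float s = 0.0f;
    for (const PointAoS& p : pts) s += p.x;
    return s;
}

// Sum of x over SoA: unit-stride access over one contiguous array.
float sumXSoA(const PointsSoA& pts) {
    float s = 0.0f;
    for (float v : pts.x) s += v;
    return s;
}
```

Both functions compute the same result; only the memory-access pattern differs, which is what decides whether the compiler can vectorize the loop.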
18. SIMD pitfalls
Structure alignment – usually needs to be aligned on a word boundary
Structure considerations – may need to refactor existing code/structures
Generally not cross-platform: MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc.
Performance gains are modest – 2x to 4x common
Limited instructions – add, multiply, divide, round; not suitable for branching logic
Autovectorizing compilers for simple loops: -ftree-vectorize (GCC); -fast, -O2 or -O3 (Intel Compiler)
21. SMP concepts
Useful to think in terms of "cores": 2 dual-core CPUs = 4 "cores"
Cores share main memory, and may share cache
Threads in the same process share memory
Generally, one executing thread per core; other threads sleeping
22. Cores – they're everywhere
How many cores does your laptop have? Mine has 50(!)
2 Intel cores (Core 2 Duo)
32 nVidia cores (9600M GT)
16 nVidia cores (9400M)
23. Parallel concepts for SMP
Process – started by the OS; a single thread executes "main"; no direct access to the memory of other processes
Threads – streams of execution under a process; access to memory in the containing process; private memory; lifetime may be less than the main thread's
Concurrency – coordination between threads; high level (mutexes, locks, barriers), low level (atomic operations)
25. Thread construction – pthread example
#include <pthread.h>
// Thread work function, must return pointer to void
void *doWork(void *work) {
  // Do work
  return work; // equivalent to pthread_exit ( myWork );
}
...
pthread_t child;
...
rc = pthread_create ( &child, &attr, doWork, (void *)work );
...
rc = pthread_join ( child, &threadwork );
...
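Filling in the elided pieces, here is a complete, compilable sketch of the same create/join pattern (hedged: the doubling payload is an illustrative choice, not from the slides, and default attributes are passed as NULL in place of the slide's attr):

```cpp
#include <pthread.h>

// Thread work function: must take and return void*.
static void* doWork(void* arg) {
    int* value = static_cast<int*>(arg);
    *value *= 2;   // the "work": double the input in place
    return arg;    // equivalent to pthread_exit(arg)
}

// Spawn one child thread, let it do the work, and join it.
int doubled(int n) {
    int value = n;
    pthread_t child;
    void* result = nullptr;
    // NULL attributes = default stack size, joinable thread.
    if (pthread_create(&child, nullptr, doWork, &value) != 0) return -1;
    // join blocks until the child returns; its return value lands in result.
    if (pthread_join(child, &result) != 0) return -1;
    return *static_cast<int*>(result);
}
```

Link with -pthread on most Unix/Linux toolchains.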
26. Thread construction – Win32 example
#include <windows.h>
DWORD WINAPI doWork ( LPVOID work ) { /* Do work */ return 0; }
...
PMYDATA work;
DWORD childID;
HANDLE child;
child = CreateThread (
    NULL,       // default security attributes
    0,          // use default stack size
    doWork,     // thread function name
    work,       // argument to thread function
    0,          // use default creation flags
    &childID ); // returns the thread identifier
WaitForSingleObject ( child, INFINITE ); // wait for the single child thread
27. Thread construction – Java example
import java.lang.Thread;
class Worker implements Runnable {
  public Worker ( Work work ) {}
  public void run() {} // Do work here
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();
29. Mutex
Mutex – mutual exclusion lock
Protects a section of code; only one thread has a lock on the object
Threads may wait for the mutex, or return a status if the mutex is locked
Semaphore – N threads
Critical section – one thread executes the code
Protects global resources; maintains consistent state
30. Race conditions
...
N = 0;
...
// Start some threads
...
void* doWork() {
  N++; // get, incr, store – three steps, so two threads can interleave and lose updates
}
Solution w/Mutex:
Mutex mutex;
mutex.lock();
N++;
mutex.release();
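The race and its mutex fix can be sketched with C++11 threads (an illustrative sketch, not from the slides): two threads increment a shared counter, and holding the lock around the get/incr/store makes the final count deterministic:

```cpp
#include <mutex>
#include <thread>

// Two threads increment a shared counter. Without the lock, the
// read-modify-write in ++counter can interleave between threads and
// updates are lost; with it, every increment is observed.
int incrementFromTwoThreads(int perThread) {
    int counter = 0;
    std::mutex m;
    auto work = [&]() {
        for (int i = 0; i < perThread; ++i) {
            std::lock_guard<std::mutex> lock(m); // lock around get, incr, store
            ++counter;
        } // lock released at end of each iteration
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter; // exactly 2 * perThread with the mutex held
}
```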
31. Atomic operations
Locks are not perfect: they cause blocking and are relatively heavyweight
Atomic operations – simple operations with hardware support; can be implemented w/Mutex
Conditions:
Invisibility – no other thread knows about the change
Atomicity – if the operation fails, return to the original state
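A lock-free counterpart of the previous counter using std::atomic (an illustrative modern-C++ sketch, not from the slides): fetch_add is a single hardware read-modify-write, so no mutex and no blocking are needed:

```cpp
#include <atomic>
#include <thread>

// Same two-thread counter as before, but the increment is a single
// hardware-supported atomic operation instead of a locked section.
int atomicIncrementFromTwoThreads(int perThread) {
    std::atomic<int> counter{0};
    auto work = [&]() {
        for (int i = 0; i < perThread; ++i)
            counter.fetch_add(1, std::memory_order_relaxed); // atomic incr
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter.load();
}
```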
33. Thread synchronization – barrier
Initialized with the number of threads expected
Threads signal when they are ready, then wait until all expected threads are there
A stalled or dead thread can stall all the threads
34. Thread synchronization – condition variables
Workers atomically release the mutex and wait
Master atomically releases the mutex and signals
Workers wake up and acquire the mutex
[Figure: timeline of a thread holding mutex A, waiting on the condition, and reacquiring the mutex after the signal.]
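The wait/signal handshake above can be sketched with std::condition_variable (illustrative C++11 code, not from the slides); note that wait atomically releases the mutex while blocked and reacquires it on wake-up:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// One master signals one worker through a condition variable.
int waitForSignal() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int observed = 0;

    std::thread worker([&]() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready; }); // releases m while waiting
        observed = 1;                         // runs with m held again
    });

    {
        std::lock_guard<std::mutex> lock(m);
        ready = true;   // the state change must happen under the mutex
    }
    cv.notify_one();    // wake the worker

    worker.join();
    return observed;
}
```

The predicate form of wait also guards against spurious wake-ups, a classic condition-variable pitfall.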
35. Thread pool & futures
Maintains a "pool" of worker threads; work is queued until a thread is available
Optionally notify through a "future": the future can query status and holds the return value
The thread returns to the pool, so there is no startup overhead
Core concept for OpenMP and TBB
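In modern C++ the future half of this idea is available directly as std::async/std::future (an illustrative sketch; the slides predate C++11):

```cpp
#include <future>

// std::async hands work to a background thread and returns a
// std::future; get() blocks until the result is ready, exactly the
// "future holds the return value" pattern described above.
int runViaFuture(int n) {
    std::future<int> result = std::async(std::launch::async, [n] {
        return n * n;    // the queued "work"
    });
    // ... the caller could do other work here ...
    return result.get(); // blocks until the worker finishes
}
```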
37. Introduction to OpenMP
Scatter/gather paradigm; maintains a thread pool
Requires compiler support: Visual C++, gcc 4.0, Intel Compiler
Easy to adapt existing serial code, easy to debug
Simple paradigm
39. OpenMP – parallel for
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork ( i );
} // Implicit barrier
Scheduling the iterations
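A self-contained, runnable version of the pattern (squareAll is a hypothetical stand-in for doSomeWork; built without -fopenmp the pragma is simply ignored and the loop runs serially with the same result):

```cpp
#include <vector>

// Each iteration is independent, so OpenMP can hand chunks of the
// index range to different threads. Compile with -fopenmp for a
// parallel run; without it the pragma is ignored and the loop is serial.
std::vector<int> squareAll(int n) {
    std::vector<int> out(n);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = i * i;   // each thread has a private copy of i
    } // implicit barrier
    return out;
}
```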
40. OpenMP – reduction
int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork ( i );
} // Implicit barrier
// TotalAmountOfWork was properly accumulated
// Each thread has a local copy; the barrier does the reduction
// No need to use critical sections
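A runnable reduction sketch (sumToN is a hypothetical stand-in for the slide's doSomeWork accumulation; the answer is the same whether or not OpenMP is enabled at compile time):

```cpp
// Sum 1..n with a reduction clause: each thread accumulates into a
// private copy of total, and OpenMP combines the copies at the
// implicit barrier, so no critical section is needed.
int sumToN(int n) {
    int total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```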
41. OpenMP – "atomic" reduction
int TotalAmountOfWork = 0;
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork ( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
} // Implicit barrier
// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
42. OpenMP – critical
int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork ( i );
  #pragma omp critical
  {
    // Executed by one thread at a time, e.g. a "Mutex lock"
    criticalOperation();
  }
} // Implicit barrier
43. OpenMP – single
int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork ( i );
  #pragma omp single nowait
  {
    // Executed by one thread; use "master" for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of the "nowait" clause !!
} // Implicit barrier
45. Introduction to TBB
Commercial and open-source licenses (GPL with runtime exception)
Cross-platform C++ library, similar to the STL
Usual concurrency classes
Several different constructs for threading: for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/
46. TBB – parallel for
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
class Worker {
public:
  Worker ( /* ... */ ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                    Worker ( /* ... */ ),
                    tbb::auto_partitioner() );
51. ITK implementation
Threads operate across slices – the only implemented behavior in ITK
itk::MultiThreader is somewhat flexible, but requires that you break the ITK model
Uses thread join, higher overhead; no thread pool
55. Image class
class Image {
public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
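The elided accessors come down to index arithmetic. A minimal self-contained stand-in (assuming x varies fastest, then y, then z, as the later threshold loops imply; std::vector replaces the raw pointer for brevity):

```cpp
#include <vector>

// Minimal stand-in for the slide's Image class: a flat array indexed
// as x + y*width + z*voxelsPerSlice (x fastest, then y, then z).
class Image {
public:
    Image(int w, int h, int d)
        : mWidth(w), mHeight(h), mDepth(d),
          mVoxelsPerSlice(w * h), mVoxelsPerVolume(w * h * d),
          mData(static_cast<std::size_t>(w) * h * d, 0) {}

    short getVoxel(int x, int y, int z) const {
        return mData[x + y * mWidth + z * mVoxelsPerSlice];
    }
    void setVoxel(int x, int y, int z, short v) {
        mData[x + y * mWidth + z * mVoxelsPerSlice] = v;
    }

    int mWidth, mHeight, mDepth;
    int mVoxelsPerSlice, mVoxelsPerVolume;
    std::vector<short> mData;
};
```

With this layout, a whole slice z is the contiguous range [z*mVoxelsPerSlice, (z+1)*mVoxelsPerSlice), which is what makes per-slice threading and flat single-loop traversal both easy.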
56. Trivial problem – threshold
Threshold an image: if intensity > 100, output 1; otherwise output 0
Presented from simple to complex: OpenMP, TBB, ITK, pthreads (see extra slides)
57. Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving the
// pragma, but must choose at compile time
58. Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
59. Threshold – TBB #1
class Threshold {
public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int x = r.begin(); x != r.end(); ++x ) {
      if ( in->mData[x] > 100 ) {
        out->mData[x] = 1;
      } else {
        out->mData[x] = 0;
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range<int> ( 0, in->mVoxelsPerVolume ),
               Threshold ( in, out ), auto_partitioner() );
// NB: default "grain size" for blocked_range is 1 pixel
// tbb::blocked_range<int> ( ..., in->mVoxelsPerVolume / NumberOfCPUs )
60. Threshold – TBB #2
class Threshold {
public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
    for ( int z = 0; z < in->mDepth; z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range2d<int,int> ( 0, in->mHeight, 32,
                                               0, in->mWidth, 32 ),
               Threshold ( in, out ), auto_partitioner() );
61. Threshold – TBB #3
class Threshold {
public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) { out->setVoxel(x,y,z,1); }
          else { out->setVoxel(x,y,z,0); }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
               Threshold ( in, out ), auto_partitioner() );
63. Interesting problem – anisotropic diffusion. Edge-preserving smoothing method. Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629-639. Iterative process. Demonstrate OpenMP, TBB (ITK has an implementation; pthreads are tedious at the very least). Pop quiz – are the following correct?
64. Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
65. Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
66. Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
67. Anisotropic diffusion – TBB #1
class doAD {
public:
  static ADConstants* sConstants;
  doAD ( Image* in, Image* out ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    if ( sConstants == NULL ) { initConstants(); }
    // process
    ...
  }
};
68. Anisotropic diffusion – TBB #2
class doAD {
public:
  doAD ( ... ) {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
               doAD ( in, out ), auto_partitioner() );
69. Anisotropic diffusion – TBB #3
class doAD {
public:
  static tbb::atomic<int> sProgress;
  tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for (...);
70. Anisotropic diffusion – TBB #4
class doAD {
public:
  static tbb::atomic<int> sProgress;
  static tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for (...);
71. Anisotropic diffusion – OpenMP (Progress)
void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single nowait
      reportProgress ( progress );
      ...
    }
  }
}
72. Real-life problem: compute Frangi's vesselness measure. Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956. Memory-constrained solution: the ITK implementation requires 1.2 GB for a 100 MB volume. Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007). Possible solutions using OpenMP, TBB.
79. Algorithm sketch – Serial
int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
80. Algorithm sketch – OpenMP
int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Each thread is on a different slice. May cause cache contention. Similar problems for the "y" direction.
81. Algorithm sketch – OpenMP
int BlockSize = 32;
for ( int z = 0; z < image->mDepth; z += BlockSize ) {
  for ( int y = 0; y < image->mHeight; y += BlockSize ) {
    #pragma omp parallel for
    for ( int x = 0; x < image->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
All threads on the same rows. May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs. Better cache utilization.
82. Algorithm sketch – TBB
class Vesselness {
public:
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    // Process the block, could use ITK here
    processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                   r.cols().size(), r.rows().size(), r.pages().size() );
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 32,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
               Vesselness ( in, out ), auto_partitioner() );
Individual blocks. Full CPU utilization. May not have the best cache performance.
83. Next steps: go try parallel development. Try threads to gain understanding and insight. Next, OpenMP, adapting existing code. TBB: more constructs, different approaches. Experiment with new languages: Erlang, Scala, Reia, Chapel, X10, Fortress... Check out some of the resources provided. Have fun! It's a brave new world out there...
84. Resources
TBB (http://www.threadingbuildingblocks.org/), OpenMP (http://openmp.org/wp/)
Books/Articles: Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/), Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/), ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf), The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials: Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/), pthreads (https://computing.llnl.gov/tutorials/pthreads/), OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other: LLNL (https://computing.llnl.gov/), Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language), GCC-OpenMP (http://gcc.gnu.org/projects/gomp/), Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
86. Medical image processing strategies for multi-core CPUs Daniel Blezek, Mayo Clinic blezek.daniel@mayo.edu
87. Thread construction – pthread example
#include <pthread.h>
void *(*start_routine)(void *);
int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);
void pthread_exit(void *value_ptr);
int pthread_join(pthread_t thread, void **value_ptr);
88. Mutex – pthread example
#include <pthread.h>
pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical Section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
// (trylock returns EBUSY, without blocking, if another thread holds the lock)
89. Mutex – Java example
class Foo {
  public synchronized int doWork () {
    // only one thread at a time can execute doWork on this instance
  }
  Object resource;
  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}
90. Threshold – pthread
struct Work { Image* in; Image* out; int start; int end; };
Work workArray[THREADCOUNT];
pthread_t thread[THREADCOUNT];
void* doThreshold ( void* inWork ) {
  Work* work = (Work*) inWork;
  for ( int s = work->start; s < work->end; s++ ) {...}
  return NULL;
}
...
pthread_attr_t attributes;
pthread_attr_init ( &attributes );
pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
for ( int t = 0; t < THREADCOUNT; t++ ) {
  initializeWork ( in, out, t, workArray[t] );
  pthread_create ( &thread[t], &attributes, doThreshold, (void*) &workArray[t] );
}
for ( int t = 0; t < THREADCOUNT; t++ ) {
  pthread_join ( thread[t], NULL );
}
92. Semaphore: allows N threads access; protects limited resources. A binary semaphore (N = 1) is equivalent to a mutex.
93. ITK implementation: threads operate across slices, the only implemented behavior in ITK. itk::MultiThreader is somewhat flexible but requires that you break the ITK model. Uses thread join (higher overhead); no thread pool.
95. Insight Toolkit
#include <itkImageToImageFilter.h>
template <class In, class Out>
class Worker : public itk::ImageToImageFilter<In, Out> {
  ...
  void BeforeThreadedGenerateData() {
    // Master thread only
    ...
  }
  void ThreadedGenerateData( const OutputImageRegionType& r, int tid ) {
    // Generate output data for r
    ...
  }
  void AfterThreadedGenerateData() {
    // Master thread only
    ...
  }
};
// Output split on last dimension
// i.e. slices for 3D volumes
96. Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int slice = 0; slice < in->mDepth; slice++ ) {
      ...
    }
  }
}
Editor's Notes
If I had asked this question 5 years ago, almost no one would have raised their hand.
Driving is inherently a parallel task, we coordinate at stop signs, stop lights, we obey the rules of the road, but we can get deadlocked (grid lock).