Parallel Computing 2007: Overview
1. Parallel Computing 2007: Overview. February 26-March 1, 2007. Geoffrey Fox, Community Grids Laboratory, Indiana University, 505 N Morton Suite 224, Bloomington IN. [email_address] http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/
11. RMS: Recognition Mining Synthesis. Recognition (What is …?): model. Mining (Is it …?): find a model instance. Synthesis (What if …?): create a model instance. Today: model-less; real-time streaming and transactions on static, structured datasets; very limited realism. Tomorrow: model-based multimodal recognition; real-time analytics on dynamic, unstructured, multimodal datasets; photo-realism and physics-based animation.
12. Recognition: What is a tumor? Mining: Is there a tumor here? Synthesis: What if the tumor progresses? It is all about dealing efficiently with complex multimodal datasets. Images courtesy: http://splweb.bwh.harvard.edu:8000/pages/images_movies.html
40. Performance of Typical Science Code I: FLASH astrophysics code from the DoE Center at Chicago. Time is plotted as a function of the number of nodes. This is scaled speedup: the grain size stays constant as the number of nodes increases.
41. Performance of Typical Science Code II: FLASH astrophysics code from the DoE Center at Chicago, run on Blue Gene. Note that both communication and simulation time are independent of the number of processors; this is again the scaled speedup scenario. [Plot curves: Communication, Simulation.]
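The scaled speedup scenario in slides 40 and 41 can be illustrated with a toy model. This is a minimal sketch, not the FLASH measurement itself: the function name `scaled_speedup` and the timing values are hypothetical, and it assumes that per-node simulation and communication times stay fixed as nodes are added, which is what the plots report.

```python
def scaled_speedup(nodes, t_compute, t_comm):
    """Speedup when the problem grows with the machine (constant grain size).

    Assumption: every node spends t_compute on its own grain and, when more
    than one node is used, an extra t_comm communicating. Both are taken to
    be independent of the node count.
    """
    t_single = t_compute  # one node does one grain with no communication
    t_parallel = t_compute + (t_comm if nodes > 1 else 0.0)
    # N nodes finish N grains in t_parallel; one node would need N * t_single.
    return nodes * t_single / t_parallel


if __name__ == "__main__":
    for n in (1, 4, 64, 1024):
        # Hypothetical timings: 9 s of simulation and 1 s of communication.
        print(n, scaled_speedup(n, t_compute=9.0, t_comm=1.0))
```

Under these assumptions the speedup grows linearly with the node count, scaled down by the constant factor t_compute / (t_compute + t_comm).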
122. Four Communication Patterns used in CCR Tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. Each panel connects Thread0 through Thread3 with Port0 through Port3. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive.
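The connectivity of the first three patterns can be sketched in plain Python. This does not use the CCR API (a .NET library); the `destinations` and `run_stage` helpers are hypothetical, and the thread-to-port mappings are assumptions based on the usual meaning of the pattern names, since the slide text does not spell them out. Exchange is left out for the same reason.

```python
from collections import deque


def destinations(pattern, thread, n_threads=4):
    """Return the port indices that `thread` writes to in one stage.

    Assumed mappings (not taken from the slides):
      pipeline   : thread i -> port i+1, the last thread writes nowhere
      shift      : thread i -> port (i+1) mod n, closing the ring
      two_shifts : thread i -> ports (i-1) mod n and (i+1) mod n
    """
    if pattern == "pipeline":
        return [thread + 1] if thread + 1 < n_threads else []
    if pattern == "shift":
        return [(thread + 1) % n_threads]
    if pattern == "two_shifts":
        return [(thread - 1) % n_threads, (thread + 1) % n_threads]
    raise ValueError(f"unknown pattern: {pattern}")


def run_stage(pattern, n_threads=4):
    """Post each thread's id to its destination ports; return port contents."""
    ports = [deque() for _ in range(n_threads)]
    for thread in range(n_threads):
        for port in destinations(pattern, thread, n_threads):
            ports[port].append(thread)
    return [sorted(p) for p in ports]


if __name__ == "__main__":
    for name in ("pipeline", "shift", "two_shifts"):
        print(name, run_stage(name))
```

In this sketch each port receives at most one item per stage under pipeline and shift, and two items under two shifts, which is consistent with the slide's split between single-item Receive and Multiple Item Receive.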
123. 4-way Pipeline Pattern, 4 Dispatcher Threads, HP Opteron. A fixed amount of computation (4 × 10^7 units) is divided among 4 cores and into from 1 to 10^7 stages on the HP Opteron multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode. Result: 8.04 microseconds per stage, averaged from 1 to 10 million stages. [Plot: Time (seconds) against Stages (millions); annotations: "Overhead = Computation" and "Computation Component if no Overhead".]
124. 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon. A fixed amount of computation (4 × 10^7 units) is divided among 4 cores and into from 1 to 10^7 stages on the Dell Xeon multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode. Result: 12.40 microseconds per stage, averaged from 1 to 10 million stages. [Plot: Time (seconds) against Stages (millions); annotations: "Overhead = Computation" and "Computation Component if no Overhead".]
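One way to read the per-stage figures in slides 123 and 124: if total run time is modeled as a fixed computation component plus a constant overhead for each stage, the overhead is the slope of time against stage count. The sketch below fits that line by least squares. It is an assumed reading, not the authors' stated procedure; `fit_line` and `synthetic_run_times` are hypothetical helpers, and the 30-second computation component in the usage is made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def synthetic_run_times(per_stage_overhead, compute_seconds, stage_counts):
    """Made-up run times following time = compute + overhead * stages."""
    return [compute_seconds + per_stage_overhead * s for s in stage_counts]


if __name__ == "__main__":
    stages = [k * 1_000_000 for k in range(1, 11)]  # 1 to 10 million stages
    # Hypothetical data: 30 s of pure computation, 8.04 microseconds per stage.
    times = synthetic_run_times(8.04e-6, 30.0, stages)
    slope, intercept = fit_line(stages, times)
    print(f"overhead per stage: {slope * 1e6:.2f} microseconds")
    print(f"computation component: {intercept:.1f} seconds")
```

If the averaged figures applied uniformly, 10^7 stages would cost about 80.4 seconds of overhead on the HP Opteron (10^7 × 8.04 microseconds) and about 124 seconds on the Dell Xeon (10^7 × 12.40 microseconds).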
130. Typical bandwidth measurements showing the effect of cache as a slope change. 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon. 5,000 stages, with run time plotted against the size of the double array copied in each stage from thread to stepped locations in a large array on the Dell Xeon multicore. Total bandwidth: 1.0 Gigabytes/Sec up to one million double words and 1.75 Gigabytes/Sec up to 100,000 double words. [Plot: Time (seconds) against Array Size (millions of double words); annotation: "Slope Change (Cache Effect)".]
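The bandwidth figures on slide 130 relate run time to the volume of data copied. A minimal sketch of that arithmetic follows, assuming 8-byte doubles, a gigabyte of 10^9 bytes, and that "total" sums the copies made by all four threads; the slide does not confirm any of these, and the function name is hypothetical.

```python
BYTES_PER_DOUBLE = 8  # size of an IEEE 754 double-precision value


def total_bandwidth_gb_per_s(stages, words_per_stage, threads, seconds):
    """Aggregate copy bandwidth in gigabytes (10**9 bytes) per second.

    Assumption: every thread copies `words_per_stage` doubles in each stage,
    and "total" means the sum over all threads.
    """
    total_bytes = stages * words_per_stage * BYTES_PER_DOUBLE * threads
    return total_bytes / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing, chosen only to show the arithmetic.
    print(total_bandwidth_gb_per_s(5000, 100_000, 4, seconds=16.0))
```

A change in the slope of run time against array size then corresponds to a change in this ratio, which is how the cache effect shows up in the plot.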