Multicore Processing
Wu, Lieh-Hao
VMI Class 2010
Computer Science

What’s Multicore?
  - Multiple cores in a single chip
  - Improves performance by adding cores
  - Became mainstream in recent years
  - Examples
     - Core 2 Duo, Core 2 Quad, Core i3/i5/i7 (Intel)
     - Athlon II X2, Phenom II X4, Opteron (AMD)
     - Cell Broadband Engine (IBM)

Why Multicore?
  - Difficulties in developing single-core processors
     - Overheating
     - Energy consumption
     - Electron leakage
     - Example: Intel abandoned its 4GHz processor project in fall 2004
  - Multicore processors work around these problems and deliver better performance

Research Introduction
  - Purpose:
     - To measure the performance difference between single-core and multicore processors
  - How:
     - Use the PS3 as the host machine
     - Use the CPU of the PS3 to execute a series of matrix multiplications
        - Execute with a single core
        - Execute with multiple cores
           - programming tools are needed to manage the cores
     - Record the execution times and analyze the performance

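The "record the time" step can be sketched in C as below. This is a minimal sketch, not the harness used in this research: the helper names now_seconds, time_run, and busy_sum are hypothetical, and it assumes a POSIX system where clock_gettime with CLOCK_MONOTONIC is available.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: seconds read from a monotonic wall clock. */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: run work(arg) once and return the elapsed seconds. */
double time_run(void (*work)(void *), void *arg) {
    double start = now_seconds();
    work(arg);
    return now_seconds() - start;
}

/* Placeholder workload (an assumption): sums the first *arg integers.
   In the experiment this would be the matrix multiplication. */
void busy_sum(void *arg) {
    long n = *(long *)arg;
    volatile double acc = 0.0;
    for (long i = 0; i < n; i++) {
        acc += (double)i;
    }
}
```

The same time_run call can wrap the single-core run and the multicore run, so both are measured the same way.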
PlayStation 3
  - Physical Components
     - CPU: Cell Broadband Engine
     - Memory: 256MB
     - Storage: 80GB
  - Software
     - Yellow Dog Linux
     - Cell SDK

Cell Broadband Engine Processor
  - Developed jointly by Sony, Toshiba, and IBM
  - Multicore structure
     - Power Processing Element (PPE) x 1
        - Like a traditional processor
        - Has its own L1 and L2 caches
     - Synergistic Processing Element (SPE) x 8
        - Can run in parallel
        - Each has 256KB of local storage

Matrix Multiplication
  - Simple but time consuming
  - Some assumptions are made for research purposes
     - Matrices are square, with dimension N x N
     - Data type is double
     - Only even values of N are used

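For reference, the workload can be written as the textbook triple loop. This is a minimal sketch under the assumptions on the slide (square N x N matrices of doubles), stored here in row-major order, which is itself an assumption; the function name matmul is illustrative and is not taken from the research code.

```c
#include <stddef.h>

/* C = A * B for n x n matrices of doubles stored in row-major order. */
void matmul(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The three nested loops perform N x N x N multiply-add steps, which is why the operation is simple to write but time consuming for large N.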
Sequoia
  - An explicit multicore programming model

Mapping a tree structure as a memory hierarchy
Basic idea
  - Consists of three functions
     - task<inner>: distribute
     - task<leaf>: compute
     - task<ext>: connect

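The split between a distributing task and a computing task can be illustrated in plain C. This is not Sequoia syntax and not the matrixmult.sq program: it is a hedged sketch in which inner_matmul plays the "inner" role (cut the matrices into blocks) and leaf_block plays the "leaf" role (multiply one block). Both names, the block-size parameter bs, and the row-major layout are assumptions. It also assumes n is a multiple of bs and that C starts out zeroed.

```c
#include <stddef.h>

/* "Leaf" role: C_block += A_block * B_block for bs x bs blocks.
   ld is the row length of the full matrices the blocks live in. */
void leaf_block(size_t bs, size_t ld,
                const double *a, const double *b, double *c) {
    for (size_t i = 0; i < bs; i++) {
        for (size_t j = 0; j < bs; j++) {
            double sum = c[i * ld + j];
            for (size_t k = 0; k < bs; k++) {
                sum += a[i * ld + k] * b[k * ld + j];
            }
            c[i * ld + j] = sum;
        }
    }
}

/* "Inner" role: split the n x n problem into bs x bs blocks and
   hand each block product to the leaf. Assumes n % bs == 0 and
   that c has been zeroed by the caller. */
void inner_matmul(size_t n, size_t bs,
                  const double *a, const double *b, double *c) {
    for (size_t bi = 0; bi < n; bi += bs) {
        for (size_t bj = 0; bj < n; bj += bs) {
            for (size_t bk = 0; bk < n; bk += bs) {
                leaf_block(bs, n,
                           a + bi * n + bk,
                           b + bk * n + bj,
                           c + bi * n + bj);
            }
        }
    }
}
```

On the Cell, the point of such a split is that each block can be sized to fit an SPE's 256KB local storage; here everything runs sequentially in one address space.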
Programming in Sequoia
  - To program in Sequoia, four files are required to run the matrix multiplication.
     - “Makefile”: for compiling
     - “matrixmult.sq”: Sequoia program
     - “mapping_ps3.xml”: for mapping
     - “main.cc”: for starting
  - During the process
     - Good documentation
     - Good adaptability for different purposes
     - Details need to be handled by programmers

Cellgen
  - An implicit multicore programming model
  - C/C++ based programming tool
  - OpenMP-like style
     - OpenMP API
  - Basic idea
     - Starts after “#pragma cell”
     - Parameters
        - public: shared by SPEs
        - private: each SPE has a copy
  - Author: Scott Schneider, Ph.D. candidate, Virginia Tech

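Because Cellgen is described as OpenMP-like, the pragma style can be illustrated with standard OpenMP. The sketch below is OpenMP, not Cellgen code, and it does not reproduce Cellgen's clause syntax: OpenMP's shared and private clauses are used as a rough analogue of the public and private parameters listed above. The function name matmul_omp is an assumption.

```c
#include <stddef.h>

/* OpenMP analogue (not Cellgen): rows of C are divided among threads.
   shared(...)  : one copy visible to every thread.
   private(...) : each thread gets its own copy. */
void matmul_omp(int n, const double *a, const double *b, double *c) {
    int i, j, k;
    double sum;
    #pragma omp parallel for shared(a, b, c) private(j, k, sum)
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs sequentially with the same result, which shows the implicit style: the parallelism is requested by an annotation rather than by restructuring the program.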
Programming in Cellgen
  - The following files are needed to run the matrix multiplication
     - Two “Makefile” files: for compiling
     - One “matrixmult.cellgen”: Cellgen code
     - One “double16b_t.h”: for padding column data
        - suggested by the author to improve performance
  - During the process
     - Understandable
        - C/C++ based; easy to pick up.
     - Lack of documentation
        - Only a “Readme” file is available.

Result in Table
  - The following table gives the execution times for PPE only, SPEs with Sequoia, and SPEs with Cellgen.
  - 3 x 8 x 3264 x 3264 ≈ 256 MB
     - Oversized matrices will be swapped between disk and main memory.
  - No Cellgen results could be obtained for sizes 3072 to 4096.
     - Either no response or a bus error.

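The memory-limit arithmetic above reads as three matrices (A, B, and the result) times 8 bytes per double times N x N elements. A small helper makes the calculation explicit; the function name matrices_bytes is hypothetical, and the sketch assumes sizeof(double) is 8, as on the platforms discussed here.

```c
#include <assert.h>

/* Bytes needed for three n x n matrices of doubles (A, B, and C). */
unsigned long long matrices_bytes(unsigned long long n) {
    assert(sizeof(double) == 8);  /* assumption stated in the lead-in */
    return 3ULL * sizeof(double) * n * n;
}
```

For example, matrices_bytes(3264) is 255,688,704 bytes, just under 256 MB, while matrices_bytes(4096) is 402,653,184 bytes, which is well beyond the PS3's main memory.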
Result in Graph (1)
  - The following is the line chart generated from the data of the table.
  - [Line chart. Series: PPE Only, Cellgen, Sequoia. Marked: memory size limit.]

Result in Graph (2)
Result Analysis
  - Performance of Cellgen
     - Unexpected overhead or runtime errors may occur and set the performance back.
  - Performance of Sequoia
     - According to the stable measurements, it is about 8 times faster than the PPE alone.
     - Although the memory size is 256MB, performance starts dropping after 2048 x 2048.
     - The performance becomes the same as the PPE's after reaching 4096 x 4096.
        - Probably most of the data is being swapped to disk, which is beyond Sequoia's control.

Conclusion
  - A multicore processor performs better than a single-core processor: about 8 times faster if the memory space is sufficient.
  - Multicore may also bring unexpected overhead or errors, which can set the performance back, as happened with Cellgen here.
  - Multicore processing is an art.
     - In the paper “Programming Multiprocessors With Explicitly Managed Memory Hierarchies,” Cellgen performs better than Sequoia. In this research, however, Cellgen did not do as well as Sequoia.


Editor's Notes

  • #2 Title page
  • #3 Main introduction about what multicore is. Take the old English paper as reference!
  • #4 The reasons why CPU manufacturers changed from single core to multicore.
  • #5 Research intro. Talk about what my purpose is, how I test and justify my results, and what application, programming models, and host machine I will use. (PS. Application -> large-dimension matrix; Programming models -> Sequoia, Cellgen; Host machine -> PS3)
  • #6 Introduce the PS3 with its memory size, CPU, and the OS we use. (PS. Main memory -> 256MB; CPU -> Cell Broadband Engine; OS -> Yellow Dog Linux) Don’t forget to mention that the Cell SDK is necessary for developing on the Cell CPU! Picture for YDL: http://elhabib.at/files/2008/07/yellowdog-vorlage_p1.jpg Picture: http://scawley.files.wordpress.com/2008/03/sony_playstation_3_60gb_game_console__brand_new.jpg
  • #7 Introduction of the Cell Broadband Engine multicore processor. PPE L1 cache -> 32KB; L2 cache -> 512KB. Structure picture: http://www.5ilight.com/dianzi/upimg/20070222/11H154H0L05A08.jpg Picture: http://moss.csc.ncsu.edu/~mueller/cluster/ps3/cell.jpg
  • #8 The application is a series of matrix multiplications. Also, give the reason why I chose matrix multiplication as my application. Picture: http://upload.wikimedia.org/wikipedia/en/thumb/e/eb/Matrix_multiplication_diagram_2.svg/313px-Matrix_multiplication_diagram_2.svg.png
  • #9 Brief introduction to Sequoia. Major points -> explicit local storage management, mapping a tree structure as a memory hierarchy, and the major programming points (inner, leaf, and ext tasks). Picture of tree structure: http://www.stanford.edu/group/sequoia/cgi-bin/node/182 Sequoia logo: http://www.stanford.edu/group/sequoia/cgi-bin/
  • #10 Connect to the matrixmult.sq, matrixmult_ps3_mapping.xml, and main.cc files here and explain briefly. Then, talk about how I felt during the process. (Basic idea -> good documentation and adaptability for different purposes, but the programmer has to handle much more detail!) Also remind that five files are required to use Sequoia: two Makefile files (for compiling), xxx.sq code (main program), xxx.xml (for mapping), main.cc (for execution)
  • #11 Introduction to Cellgen. Major points: C/C++ based software tool, implicit local storage management, OpenMP-like support. (PS. OpenMP needs to be explained -> brief oral explanation; use http://openmp.org/wp/about-openmp/ as reference!!) Author photo: http://people.cs.vt.edu/~scschnei/pictures/scott.jpg OpenMP logo: http://openmp.org/wp/openmp_336x120.gif
  • #12 Put the matrixmult.cellgen and double16b_t.h files here and explain briefly. Also mention the lack of documentation and the problem the author described.
  • #13 Put the overall result table here and “explain”. Do not say too much here, analysis will be left on later slides!
  • #14 Put the result graph here and “explain”. Do not say too much here.
  • #15 The line chart about
  • #16 Just leave the “important” partial data here and explain my analysis. Major points: how fast Sequoia can get, the limit that the 256MB of physical main memory puts on Sequoia and Cellgen, and the unexpectedly poor performance of Cellgen (maybe some potential overhead sets back the overall performance).
  • #17 Multicore processing is an art!
  • #18 Reference list: from the old EN paper, from the OpenMP website, from the Cellgen author's websites, from the Sequoia websites, the IEEE magazine.
  • #19 Be prepared for questions! http://www.cmoe.com/blog/wp-content/images/question-mark.jpg