Micro-benchmarking the Tera MTA

   E. Jason Riedy                   Rich Vuduc
ejr@cs.berkeley.edu           richie@cs.berkeley.edu


               CS 258 – Spring 1999
Outline
         Motivation

         Basics: Threads, Streams, and Teams
         Memory System
         Synchronization

         Parallel Challenge: List Ranking
         Tera-able or Terrible?




CS 258 – Spring 1999 · 2 · Micro-benchmarking the Tera MTA
Prior Work
         Recent evaluations have been “macrobenchmarks”:
          1. NAS Parallel Benchmarks (Snavely et al., SC’98)
          2. C3I benchmarks (Brunett et al., SC’98)
          3. Volume visualization (Snavely et al., ASTC’99)
         Basic result: Apps need many threads.
         The documentation already says that.

                So it runs, but what is happening under the hood?




CS 258 – Spring 1999 · 3 · Micro-benchmarking the Tera MTA
Prior Findings
      1. Decent baseline performance on most benchmarks (conjugate
         gradient, integer sort, multigrid), which vectorize nicely
      2. Poor performance on the FFT benchmark, which is tuned for memory
         hierarchy performance
      3. Shaky scalability claims?
      4. C3I is an Air Force benchmark suite: command, control,
         communication, and intelligence
      5. C3I was selected because it was easy to parallelize — not a very
         interesting choice, is it?
      6. Poor sequential execution




CS 258 – Spring 1999 · 4 · Micro-benchmarking the Tera MTA
Threads, Streams, and Teams

         [Figure: Team #1 and Team #2, each a pool of threads multiplexed
         onto the hardware streams of one CPU.]

         A team corresponds to a CPU.

         A thread is the unit of work specified in the program.
         A stream is an instruction-issuing slot within the CPU.


CS 258 – Spring 1999 · 5 · Micro-benchmarking the Tera MTA
NOTES to slide 5

  1. Teams are currently defined to be anonymous CPUs, but that may change.
  2. Threads are the units created and managed through the Tera runtime
     libraries.
  3. Streams are created in hardware. Creation involves an instruction to
     reserve a number of streams, then a sequence of STREAM_CREATE
     instructions. Each stream has its own full register file. The compiler
     does this directly in some circumstances.
Thread Migration

         [Figure: as the threads on Team #1 outnumber its streams, the
         runtime migrates some of them onto the streams of Team #2.]

          When there is excess parallelism, the runtime migrates threads.




CS 258 – Spring 1999 · 6 · Micro-benchmarking the Tera MTA
NOTES to slide 6

  1. Because teams are anonymous, threads can be transparently migrated.
  2. The runtime also tries to manage the cost of team creation...
  3. There is no way to keep track of migration, and no way to measure the
     overhead. It’s likely a trap, a save of about 70 words, and some
     notification of the other team. The runtime streams on the other team
     probably watch for new threads, but there could be an inter-processor
     interrupt.
  4. If the program is not already executing on the other team, there will
     be some cold-start misses.
Thread Re-mapping

         [Figure: a blocked stream’s thread on Team #1 is picked up by the
         runtime and re-mapped onto a free stream.]

                  Blocked streams are re-mapped by the runtime.
         Streams block due to traps and politeness (system calls).
         Streams are not re-mapped due to latency, unless the latency is so
         large that it traps out.


CS 258 – Spring 1999 · 7 · Micro-benchmarking the Tera MTA
NOTES to slide 7

  1. Threads can also be locked to a stream.
Measuring the Re-mapping Cost
                             Outer thread:
                   terart_create_thread_on_team (...);
                   purge (end_time);
                   writeef (start_time, TERA_CLOCK (0));
                   readff (end_time);
                              Inner thread:
                   writeef (end_time, TERA_CLOCK (0));
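
               A self-contained sketch of this harness (ours, not the
               slide’s exact code): purge, writeef, readff, TERA_CLOCK, and
               terart_create_thread_on_team are the calls named above, but
               their exact signatures, the create call’s argument list, and
               the purge-before-create ordering are our assumptions.

                    /* sync-qualified words carry full/empty bits: purge()
                     * marks a word empty, writeef() waits-for-empty then
                     * fills, readff() waits-for-full then reads. */
                    sync long start_time, end_time;

                    void inner (void *arg)
                    {
                        /* Runs on the other team; records when it issues. */
                        writeef (&end_time, TERA_CLOCK (0));
                    }

                    long time_remap (void)
                    {
                        purge (&start_time);
                        purge (&end_time);
                        writeef (&start_time, TERA_CLOCK (0));
                        /* team choice and argument list assumed */
                        terart_create_thread_on_team (1, inner, 0);
                        return readff (&end_time) - readff (&start_time);
                    }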
                               From 100 samples:

                                  Clocks          Seconds
                         Mean   1.72 × 10^5    6.60 × 10^-4
                         Std    1.55 × 10^4    5.95 × 10^-5



CS 258 – Spring 1999 · 8 · Micro-benchmarking the Tera MTA
NOTES to slide 8

  1. A memory access is on the order of 10^3 cycles.
  2. Most run-time features (especially auto-growth) must be disabled. We
     disable them in all micro-benchmarks.
Measuring the Stream Creation Cost
                             Outer thread:
                   purge (end_time);
                   writeef (start_time, TERA_CLOCK (0));
                   terart_create_stream (...);
                   readff (end_time);
                              Inner thread:
                   writeef (end_time, TERA_CLOCK (0));
                        From 100 samples, through runtime:

                                  Clocks          Seconds
                         Mean   2.72 × 10^4    1.00 × 10^-4
                         Std    9.72 × 10^2    3.74 × 10^-6



CS 258 – Spring 1999 · 9 · Micro-benchmarking the Tera MTA
How Many Instructions Is That?
        [Figure: per-instruction latency (clocks, mean and standard
        deviation) vs. number of streams, 0–80; the y-axis spans 20–160
        clocks.]


        The minimum possible instruction latency is 21, the depth of the
        pipeline. The maximum should be 128, the number of hardware
        streams, if scheduling is strictly round-robin. The CPU prefers
        previously unready streams, however.


CS 258 – Spring 1999 · 10 · Micro-benchmarking the Tera MTA
The Tera Memory System
         Each 64-bit memory word carries four tag bits: trap 0 enable,
         trap 1 enable, forward enable, and the full/empty bit. A pointer
         holds a 47-bit address plus per-access control bits: trap 0
         disable, trap 1 disable, full/empty control, and forward disable.

         The segment size can vary from 8 KB to 256 MB.
         The TLB (512-entry, 4-way associative) holds privilege information
         and can toggle stalling and distribution.
         The maximum memory size is currently 2^42 bytes = 4 TB, not 2^47.

CS 258 – Spring 1999 · 11 · Micro-benchmarking the Tera MTA
NOTES to slide 11

  1. There can be up to 8 memory transactions per stream outstanding at once;
     branches can optionally wait for these to complete.
  2. The stall field allows the OS to perform quick memory management with-
     out having to stop all work.
  3. Memory accesses can be re-distributed so consecutive addresses are spread
     over the entire memory space.
  4. The TLB coherence mechanism is not specified.
  5. Privileges are handled as a stack.
Memory Bandwidth
                   The standard STREAM benchmark compiled with Tera’s
                   parallelizing compiler shows fair speedup for almost
                   no effort.

                               Teams      Bandwidth
                                 1         1.7 GB/s
                                 4         5.8 GB/s
                              Speedup        3.4
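
         For reference, the STREAM kernels are simple dense loops; a triad
         sketch in C follows. The pragma asserting that the loop is
         parallel is an assumption about the compiler interface (later
         Cray MTA compilers spell it "#pragma mta assert parallel"); the
         slide only says Tera’s parallelizing compiler handled the code
         automatically.

              /* STREAM "triad": a[i] = b[i] + q * c[i].  Bandwidth is
               * 24 bytes per iteration (two 8-byte reads, one 8-byte
               * write) divided by the measured loop time. */
              #define N (2 * 1024 * 1024)
              static double a[N], b[N], c[N];

              void triad (double q)
              {
              #pragma mta assert parallel   /* pragma spelling assumed */
                  for (long i = 0; i < N; i++)
                      a[i] = b[i] + q * c[i];
              }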




CS 258 – Spring 1999 · 12 · Micro-benchmarking the Tera MTA
Memory read latency
        [Figure: memory read latency, time (cycles, roughly 110–180) vs.
        stride (bytes, log scale), one curve per array size from 1 KB to
        16 MB.]

                     LMBENCH results, not parallelized


CS 258 – Spring 1999 · 13 · Micro-benchmarking the Tera MTA
NOTES to slide 13

  1. Measured the time to perform a load by streaming through an array of
     memory addresses
  2. Drop-off: hot-spot caches
  3. Highest point: maximum time to perform a load

  4. Lowest point: round-trip network latency
  5. Difference: corresponds with claimed memory access time
  6. Number of points between any drop-off and the end: 8
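
  A latency probe in the LMBENCH style chases a chain of addresses, so
  each load supplies the address of the next and loads cannot overlap. A
  generic C sketch (our construction, not the slide’s code):

       #include <stdlib.h>
       #include <time.h>

       /* Link a ring of pointers through an array with the given stride
        * (in pointers), then chase it: the loop time divided by the load
        * count approximates the read latency.  Array size and stride are
        * swept as in the figure. */
       double load_latency_ns (size_t nptrs, size_t stride, long nloads)
       {
           void **ring = malloc (nptrs * sizeof *ring);
           for (size_t i = 0; i < nptrs; i++)
               ring[i] = &ring[(i + stride) % nptrs];

           volatile void **p = (volatile void **) ring;
           clock_t t0 = clock ();
           for (long i = 0; i < nloads; i++)
               p = (volatile void **) *p;     /* dependent load */
           clock_t t1 = clock ();

           free (ring);
           return 1e9 * (double) (t1 - t0) / CLOCKS_PER_SEC / (double) nloads;
       }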
Effects of Software Pipelining
      Before software pipelining:

          for i = 1 to n,
             read A[i]
             loop
                operate on A[i]

      After (unrolled to depth 2):

          for i = 1 to n step 2,
             read A[i]
             read A[i+1]
             loop
                operate on A[i]
                operate on A[i+1]

      (A C sketch of the unrolled read loop follows the list below.)


               We measured the apparent bandwidth as a function of
         the depth of unrolling,
         the number of streams,

         the number of teams, and
         the number of loop iterations.
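
         A C sketch of the unrolled read loop (our construction, depth 4;
         the trivial floating-point work stands in for the "operate" step
         in the notes):

              /* The four loads issue before the dependent FP work, so up
               * to four memory transactions per stream are in flight.
               * Apparent bandwidth is 8*n bytes over the loop time. */
              double read_kernel (const double *A, long n)
              {
                  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
                  for (long i = 0; i + 3 < n; i += 4) {
                      double a0 = A[i],     a1 = A[i + 1];
                      double a2 = A[i + 2], a3 = A[i + 3];
                      s0 += a0; s1 += a1;
                      s2 += a2; s3 += a3;
                  }
                  return s0 + s1 + s2 + s3;   /* keep the loads live */
              }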


CS 258 – Spring 1999 · 14 · Micro-benchmarking the Tera MTA
NOTES to slide 14

  1. The loop does trivial floating-point work, but the compiler is instructed
     not to optimize floating-point operations.
  2. All loads are due by the time the loop begins because of this.
Bandwidth by Depth and Streams
     [Figure: surface plots of bandwidth vs. number of streams (0–100) and
     depth of unrolling (2–8); left panel, 0 loop iterations (percent of
     peak); right panel, 40 loop iterations (MB/s).]

                 Loop iterations: 0                Loop iterations: 40
          Depth   N. Streams   % of Peak    Depth   N. Streams   % of Peak
            2         51          46%         2         71          60%
            5         26         111%         5         76          76%
            8         76         101%         8         51          80%


CS 258 – Spring 1999 · 15 · Micro-benchmarking the Tera MTA
NOTES to slide 15

  1. We assume that the Tera can load one word (8 bytes) per clock cycle,
     giving a peak of 2.08 GB/s.
  2. The >100% results must be due to multiple loads completing at once.
  3. The loop overhead is very small; the scheduler packs the instructions into
     other parts of the instruction word.
  4. The number of instruction words in the loop is actually a bit over
     half the number of instructions. Two floating-point additions can be
     packed into one instruction.
Bandwidth by Iterations and Streams
     [Figure: surface plot of percent of peak bandwidth vs. number of
     streams (0–100) and number of iterations (0–40), at load depth 5.]

                        N. Iterations   N. Streams   % of Peak
                              0             26          111%
                             20             61           76%
                             40             76           77%


CS 258 – Spring 1999 · 16 · Micro-benchmarking the Tera MTA
NOTES to slide 16

  1. Bandwidth used scaled linearly with teams, once adjusted by number of
     streams per processor.
The Full / Empty Bit
         Cases when retrieving a value from memory:
         Full: return with the value, possibly toggling the f/e state.
         Forwarding: retry with the new address.
         Empty: retry with the same address.
         Too many retries trigger a trap, whose handler will (a usage
         sketch in C follows these steps):
          1. Allocate a handler structure.
          2. Copy the current value into that structure.
          3. Set forwarding to that value.
          4. Set the appropriate trap bit in memory.
          5. Add current stream’s thread to waiting list and re-map the stream.
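
         As a usage illustration, a one-word producer/consumer pair can
         hang entirely off the full/empty bit. This is our sketch using
         the generic intrinsics that appear elsewhere in these slides
         (purge, writeef, readff); readfe, the read-when-full-leave-empty
         variant, is our assumption.

              /* One-word mailbox synchronized purely by the f/e bit:
               * purge() marks the word empty; writeef() waits for empty,
               * stores, and sets full; readfe() waits for full, loads,
               * and sets empty. */
              sync long mailbox;

              void init (void)      { purge (&mailbox); }
              void produce (long v) { writeef (&mailbox, v);    } /* blocks while full  */
              long consume (void)   { return readfe (&mailbox); } /* blocks while empty */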




CS 258 – Spring 1999 · 17 · Micro-benchmarking the Tera MTA
The Full / Empty Bit
         The only interesting case when storing a value is one that changes the
         f/e state to full.

         When this happens, the access will trap.
          1. Ignoring forwarding, fetch the pointer in memory.
          2. Subtract off the offset, giving the handler structure.
          3. Store the value and turn off forwarding and trapping in the
             original location.
          4. Wake threads as appropriate, allowing them to be mapped to
             streams again.
          5. De-allocate the handler structure.




CS 258 – Spring 1999 · 18 · Micro-benchmarking the Tera MTA
NOTES to slide 18

  1. There must be some synchronization in this. It is carried out by the
     program’s linked runtime environment, so it is only local to the
     program. Most likely: the thread doing this is locked to a stream,
     preventing traps while waiting for another stream to set this up.
  2. The operations necessary to set this up are privileged. The right traps
     raise the privilege enough to do this.
Number of Retries
        The Tera will keep track of the total number of retries if requested.

      1. Purge two sync variables.
      2. Create a stream.
         (a) Read (ff) from the first sync variable.
         (b) Write the retry counter to the second.
      3. Spin in registers long enough for the other stream to trap.

      4. Write the retry counter to the first sync variable.
      5. Spin in registers long enough for the other stream to set the counter
         result.
      6. Print second and first variables.
     Result: 2045, 1022.                 Result with locking: 95409, 95021.
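
      A reconstruction of the harness (ours): RETRY_COUNT() and spin() are
      hypothetical stand-ins for reading the hardware retry counter and
      for the "spin in registers" delay, and the stream-creation argument
      list is assumed.

           #include <stdio.h>

           sync long v1, v2;                  /* the two sync variables */

           void other_stream (void)
           {
               (void) readff (&v1);           /* retries, then traps */
               writeef (&v2, RETRY_COUNT ()); /* hypothetical accessor */
           }

           void count_retries (void)
           {
               purge (&v1);  purge (&v2);
               terart_create_stream (other_stream);  /* args assumed */
               spin (TRAP_DELAY);             /* let the other stream trap */
               writeef (&v1, RETRY_COUNT ());
               spin (SETTLE_DELAY);           /* let it record its counter */
               printf ("%ld %ld\n", readff (&v2), readff (&v1));
           }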


CS 258 – Spring 1999 · 19 · Micro-benchmarking the Tera MTA
A Look Back at Re-mapping...
          Total                                                   172 000
          Retries       1024 accesses × 120 cycles/access    =    122 880
          Remaining                                                49 120
          State write   70 words × 170 cycles/word           =     11 900
          State read                                               11 900
          Remaining                                                25 320
          Insn words    at ~30 cycles per instruction        ≈        900
            The documentation claims that re-mapping threads takes on
            the order of a few hundred instructions. Assuming a bit
            more memory traffic, this result agrees.




CS 258 – Spring 1999 · 20 · Micro-benchmarking the Tera MTA
NOTES to slide 20

  1. Note that this is likely about 1 800 instructions, as the compiler
     seems to average two instructions per word in book-keeping sections.
Synchronization Costs
     [Figure: synchronization cost, time (s) per stream vs. number of
     streams (0–70). Left panel (×10^-5 s scale): Mass FE and Chained FE
     at 0 iterations. Right panel (×10^-4 s scale): Mass FE and Chained FE
     at 100 and 200 iterations.]




        The times include a read, at least a test, and a write for each
        stream.
        One memory transaction should take 6.5 × 10^-7 s; two, 1.3 × 10^-6 s.
        Using only one location might get a boost from the hot-spot cache
        (or request merging?), but that becomes insignificant with a load.


CS 258 – Spring 1999 · 21 · Micro-benchmarking the Tera MTA
NOTES to slide 21

  1. The “Chained FE” test measures the time per element to traverse a
     linked list, emptying a full variable and filling the next. (A sketch
     follows these notes.)
  2. The “Mass FE” test is similar, but all threads are emptying and
     filling the same location.
  3. Only two non-runtime streams are active, so the insn latency should
     be dominated by the memory latency.
  4. Does Mass FE cause a thundering-herd problem? Or does the trap
     handler note that the thread to be restarted will leave the memory
     empty, so it only starts one?
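
  A sketch of the Chained FE traversal (our reconstruction; the node
  layout is assumed): each step empties the current node’s sync word and
  fills the next, so the per-element cost is one full/empty read plus one
  full/empty write.

       #include <stddef.h>

       /* readfe() waits for full and leaves the word empty; writeef()
        * waits for empty and fills it.  Timing the walk and dividing by
        * the list length gives the per-element synchronization cost. */
       struct node {
           sync long token;
           struct node *next;
       };

       void chain_walk (struct node *head)
       {
           for (struct node *p = head; p->next != NULL; p = p->next) {
               long t = readfe (&p->token);   /* empty this node   */
               writeef (&p->next->token, t);  /* fill the next one */
           }
       }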
Multi-Team Synchronization Costs
     [Figure: time (s) per stream (×10^-4 scale) vs. number of streams
     (0–70): Mass FE at 200 iterations on 1 and 3 teams, and Chained FE at
     200 iterations on 1 and 2 teams.]




         Until the streams begin to trap, there is no difference between
         multiple and single-team synchronization.

         Note that each stream in “Chained FE” wakes a stream in another
         team.



CS 258 – Spring 1999 · 22 · Micro-benchmarking the Tera MTA
List ranking (and list scan)
         List ranking: given a linked list, find the distance of each node
         from the head (tail) of the list.

         Basic parallel primitive
         Practical parallel implementation is a challenge: irregular data
         structure, communication intensive
         We examine three practical algorithms:
          1. Straightforward serial implementation
          2. Wyllie’s pointer-jumping algorithm (Wyllie, 1979)
          3. Reid-Miller/Blelloch (1994)

         The straightforward serial implementation is tough to beat!
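
         For concreteness, the serial baseline is a single dependent-pointer
         walk; a C sketch (ranking from the head) follows.

              #include <stddef.h>

              /* One pass, one dependent load per element.  Every parallel
               * scheme has to beat this loop. */
              struct lnode {
                  struct lnode *next;   /* NULL at the tail */
                  long rank;            /* distance from the head */
              };

              void serial_rank (struct lnode *head)
              {
                  long r = 0;
                  for (struct lnode *p = head; p != NULL; p = p->next)
                      p->rank = r++;
              }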




CS 258 – Spring 1999 · 23 · Micro-benchmarking the Tera MTA
NOTES to slide 23

  1. List ranking is a special case of list scan; our implementations are
     actually scan implementations.
  2. Uses: ordering elements of a list, finding the Euler tour of a tree,
     load balancing (Gazit et al. ’88, “Optimal tree construction...”),
     scheduling, parallel tree contraction.
  3. Other important practical algorithms which we did not try: the
     randomized algorithms by Anderson/Miller and Reif/Miller.
Wyllie’s Pointer Jumping algorithm
                                   (Wyllie, 1979)
               Round 0:   1 → 1 → 1 → 1 → 0
               Round 1:   2 → 2 → 2 → 1 → 0
               Round 2:   4 → 3 → 2 → 1 → 0

         Idea: Each node adds itself to its neighbor, then links to its
         neighbor’s neighbor.
         Simple but not work-efficient: O(n log n) total ops.
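
         An array-based sketch of Wyllie’s algorithm (our construction):
         values start at 1, with 0 at the tail and next[tail] == tail, so
         each node’s final value is its distance from the tail as in the
         picture above. Double-buffering stands in for the per-step
         synchronization discussed in the notes.

              #include <stdlib.h>

              /* Each round, every live node adds its successor's value
               * into its own and links to its successor's successor:
               * ceil(log2 n) rounds, O(n log n) work.  The loops over i
               * are the parallelizable ones. */
              void wyllie_rank (long *value, long *next, long n)
              {
                  long *v2 = malloc (n * sizeof *v2);
                  long *n2 = malloc (n * sizeof *n2);
                  int active = 1;
                  while (active) {
                      active = 0;
                      for (long i = 0; i < n; i++) {
                          if (next[i] != i) {
                              v2[i] = value[i] + value[next[i]];
                              n2[i] = next[next[i]];
                              active = 1;
                          } else {
                              v2[i] = value[i];
                              n2[i] = next[i];
                          }
                      }
                      for (long i = 0; i < n; i++) {   /* bulk update */
                          value[i] = v2[i];
                          next[i]  = n2[i];
                      }
                  }
                  free (v2);
                  free (n2);
              }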

CS 258 – Spring 1999 · 24 · Micro-benchmarking the Tera MTA
NOTES to slide 24

  1. At each bulk step, double the number of pointers skipped; the steps
     must be synchronized, giving the log n factor.
  2. Also, must synchronize accesses to value and next fields together.
Reid-Miller algorithm
                                          (Reid-Miller and Blelloch, 1994)
            Phase 1: Randomly break L into sublists and compute the
            sublist sums.

                L:      1 → 1 → 1 → 1 → 1 → 1 → 0
                Sub-L:  2, 3, 2   (one sum per sublist)

            Phase 2: Find the list scan of the sublist list (Sub-L).

                Sub-L:  0, 2, 5

            Phase 3: Fill in the scan of the original list using the
            reduced-list results.

                L:      0 → 1 → 2 → 3 → 4 → 5 → 6


CS 258 – Spring 1999 · 25 · Micro-benchmarking the Tera MTA
NOTES to slide 25

  1. Phases 1 and 3 are parallelizable
  2. Phase 2 is serial, or a call to Wyllie, or a recursive call
  3. Periodic load balancing (packing) during phases 1 and 3
  4. Many tuning parameters — size of sublists? when to load balance?
  5. Strictly speaking, not work optimal, but has small constants.

  6. Expected running time: O(n/p + (n/m) log m).
  7. New algorithm by Helman-JaJa (UMD, 1998) based on the Reid-Miller
     sparse ruling set idea, tuned for SMPs
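
  A sketch of Phase 1 in the same array form as the Wyllie sketch above
  (our construction; the head must be among the splitters, and
  next[tail] == tail):

       /* Each randomly chosen splitter walks its sublist serially,
        * summing values until the tail or the next splitter.  The loop
        * over i is the parallel one; the sublist size is the tuning
        * parameter mentioned above. */
       void phase1_sublist_sums (const long *value, const long *next,
                                 const char *is_splitter, long n,
                                 long *subsum)
       {
           for (long i = 0; i < n; i++) {
               if (!is_splitter[i])
                   continue;
               long s = value[i];
               long j = i;
               while (next[j] != j && !is_splitter[next[j]]) {
                   j = next[j];
                   s += value[j];
               }
               subsum[i] = s;
           }
       }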
On a conventional SMP: Ultra E5000
     [Figure: list ranking on an Ultra E5000; time per element (µsec, log
     scale) vs. number of elements (10^3 to 10^8), for the serial code,
     Wyllie at p = 2, 4, 8, and Reid-Miller at p = 1, 2, 4.]


CS 258 – Spring 1999 · 26 · Micro-benchmarking the Tera MTA
List scan on the Tera MTA
                                                          (auto-parallel)
     [Figure: list ranking on the Tera MTA (auto-parallel); time per
     element (µsec, log scale) vs. number of elements (10^3 to 10^7), for
     serial, Wyllie at p = 1, and Reid-Miller at p = 1.]


CS 258 – Spring 1999 · 27 · Micro-benchmarking the Tera MTA
NOTES to slide 27

  1. Compiler requested 40 streams for parallel region in Wyllie
  2. Compiler requested 120 streams for parallel regions in Reid-Miller
Alternating Wyllie’s Algorithm
     [Figure: list ranking on the Tera MTA with separate read/write
     phases; time per element (µsec, log scale) vs. number of elements
     (10^3 to 10^6), for serial, Wyllie at s = 2, 8, 16, 32, 64, and
     Reid-Miller at p = 1.]


CS 258 – Spring 1999 · 28 · Micro-benchmarking the Tera MTA
So, Tera-able or Terrible?
                        Given enough streams, a processor can sustain what appears
                        to be peak memory bandwidth.
                        Stream creation is fast, and the runtime seems to provide
                        flexible performance.
                        The fine-grained synchronization is fast, making finer-grained
                        parallelism profitable to exploit.


                        All the problems of an experimental machine remain...
                        The compiler is temperamental about parallelizing code.
                        No magic: algorithms such as sparse matrix-vector
                        multiplication still need programmer effort, of course.


CS 258 – Spring 1999                29        benchmarking the Tera MTA
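The fine-grained synchronization praised above is the full/empty mechanism
measured earlier in the deck. As a reminder of the idiom, here is a minimal
producer/consumer handoff; this is an illustrative sketch, not the benchmark
code, and it assumes Tera C's sync qualifier together with the
purge/writeef/readff intrinsics that appear in the timing harness.

    sync long mailbox;          /* one word gated by its full/empty bit */

    void
    producer (long v)
    {
      writeef (mailbox, v);     /* wait until empty, write v, set full */
    }

    long
    consumer (void)
    {
      return readff (mailbox);  /* wait until full, read, leave full */
    }

    /* Setup, as in the harness: purge (mailbox); resets the bit to empty. */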
