Unlike current CMPs, which tile either all large cores for high serial-thread performance or all small cores for high throughput, the ACMP provides one large core and many small cores. The large core of the ACMP executes the serial (non-parallelized) part of the application, and the small cores execute the parallelized part. Today I will show how the ACMP paradigm can also improve performance of the parallelized part by accelerating the execution of critical sections. So what are critical sections?
ACMP properties: homogeneous ISA; one (or a few) large core(s); many small cores; all cores on the same interconnect; hardware cache coherence.
The large core and the small cores are functionally similar. The difference is in their performance characteristics. We envision the large core to be an aggressive, high-performance processor. It may include features like out-of-order execution, wide fetch, deeper pipelines, aggressive branch prediction, etc. The small core, on the other hand, must be power-efficient. It can be a simple "Mickey Mouse" core: in-order, with a narrow fetch, a shallow pipeline, and a simple branch predictor.
First, we compare the three approaches analytically. The y-axis is the speedup achieved over a single conventional P6 core. The x-axis is the degree of parallelism, which is the percentage of the program parallelized by the programmer. The three curves show the performance of ACMP, Niagara, and P6-Tile. When the parallelism is low, both the ACMP and P6-Tile outperform the Niagara approach because of their high single-thread performance. When parallelism is high, Niagara outperforms both P6-Tile and ACMP because of its high throughput. But when the parallelism is medium, the ACMP outperforms both Niagara and P6-Tile. Note that the P6-Tile (tile-large) approach never outperforms the ACMP; Niagara does, but only at high parallelism, because it has a higher throughput.
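The shape of these three curves can be reproduced with a rough Amdahl's-law sketch. The numbers below are illustrative assumptions, not values from the talk: a chip budget of 16 small-core areas, a large core that costs 4 small-core areas, and a small core that runs at half the speed of the large core. The function name `speedups` and its parameters are hypothetical. As described above, the ACMP model runs the serial part on the large core and the parallel part on the small cores only.

```python
def speedups(p, budget=16, large_area=4, small_perf=0.5):
    """Amdahl-style speedup over one large core for parallel fraction p (0..1).

    All default parameters are assumed values for illustration only:
      budget     -- chip area, measured in small-core equivalents
      large_area -- area of one large core, in small-core equivalents
      small_perf -- small-core performance relative to the large core (= 1.0)
    """
    serial = 1.0 - p
    n_large = budget // large_area        # P6-Tile: large cores only
    n_small = budget                      # Niagara: small cores only
    n_acmp_small = budget - large_area    # ACMP: one large core + the rest small

    # Execution time is normalized so that one large core takes 1.0 in total.
    p6_tile = 1.0 / (serial + p / n_large)
    niagara = 1.0 / (serial / small_perf + p / (n_small * small_perf))
    acmp = 1.0 / (serial + p / (n_acmp_small * small_perf))
    return {"ACMP": acmp, "Niagara": niagara, "P6-Tile": p6_tile}


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 0.99, 1.0):
        row = ", ".join(f"{name}={value:.2f}" for name, value in speedups(p).items())
        print(f"p={p:.2f}: {row}")
```

Under these assumed parameters the ACMP is never slower than P6-Tile, and Niagara overtakes the ACMP only when roughly 96% or more of the program is parallel. The exact crossover point moves with the assumed area and performance ratios.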