The document summarizes a talk on future high-performance microprocessors. It discusses how multi-core chips came about because of the limits on improving single-core performance. It argues that multi-core is not a true solution and that breaking down the abstraction layers between software and hardware is needed to fully utilize increasing transistor counts. The talk proposes designing microprocessors with a few high-performance cores, many simple cores, and specialized accelerators, along with multiple programming interfaces.
Future High-Performance Microprocessors: What will they look like? How will we get there?
Yale Patt, The University of Texas at Austin
Chalmers University, Göteborg, Sweden, September 6, 2010
Current CMPs either tile large cores for high serial-thread performance or tile small cores for high throughput. The ACMP instead provides one large core and many small cores. The LARGE core of the ACMP executes the serial (non-parallelized) part of the application, and the small cores execute the parallelized part. Today I will show how the ACMP paradigm can ALSO improve the performance of the parallelized part by accelerating the execution of critical sections. So what are critical sections?

The ACMP has the following properties: a homogeneous ISA; one (or a few) large core(s); many small cores; all cores on the same interconnect; and hardware cache coherence.
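A critical section is a region of code that accesses shared data and is guarded by a lock. Only one thread can execute it at a time, and threads that reach it while the lock is held must wait. The sketch below only illustrates what a critical section looks like in software, using Python's standard `threading.Lock`. The names `worker` and `run` are hypothetical. The sketch does not model the ACMP hardware mechanism for accelerating critical sections.

```python
import threading

# Shared state protected by a lock. The body of the `with lock:` block is a
# critical section: only one thread may execute it at a time, so other threads
# that reach it must wait. This serialization limits the speedup of the
# parallelized part of a program.
counter = 0
lock = threading.Lock()


def worker(iterations: int) -> None:
    """Hypothetical worker: each iteration updates shared state under the lock."""
    global counter
    for _ in range(iterations):
        # Independent, parallelizable work would go here.
        with lock:
            # Critical section: a read-modify-write of shared data.
            counter += 1


def run(num_threads: int = 4, iterations: int = 10_000) -> int:
    """Start num_threads workers, wait for them, and return the final count."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=worker, args=(iterations,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run())  # 40000: the lock keeps every increment intact
```

All threads funnel through the same lock, so the time spent inside the critical section cannot be overlapped across threads. This is the serialized portion that the talk proposes to speed up.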
The large core and the small cores are functionally similar; they differ in their performance characteristics. We envision the large core as an aggressive high-performance processor. It may include features such as out-of-order execution, wide fetch, deeper pipelines, and aggressive branch prediction. The small core, on the other hand, must be POWER-EFFICIENT. It can be a simple Mickey Mouse core: in-order, with narrow fetch, a shallow pipeline, and a Mickey Mouse branch predictor.
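The notes do not quantify the large-versus-small trade-off. A common rule of thumb is Pollack's rule, which says single-thread performance grows roughly with the square root of core area. This rule is an assumption here and is not taken from the talk. The sketch compares spending a hypothetical area budget on large cores versus unit-area small cores. The function names and parameters are illustrative, and power is not modeled.

```python
import math


def pollack_perf(area: float) -> float:
    """Single-thread performance relative to a unit-area small core,
    assuming Pollack's rule: performance ~ sqrt(area)."""
    return math.sqrt(area)


def compare(budget: float, large_area: float) -> dict:
    """Compare spending an area budget on large cores vs. unit-area small cores
    (hypothetical model; each small core occupies exactly 1 unit of area)."""
    n_large = int(budget // large_area)
    n_small = int(budget)
    return {
        "large_single_thread": pollack_perf(large_area),
        "small_single_thread": pollack_perf(1.0),
        "large_throughput": n_large * pollack_perf(large_area),
        "small_throughput": n_small * pollack_perf(1.0),
    }


print(compare(budget=16, large_area=4))
# {'large_single_thread': 2.0, 'small_single_thread': 1.0, 'large_throughput': 8.0, 'small_throughput': 16.0}
```

Under this assumption, a core four times larger runs one thread only twice as fast. The same area filled with small cores therefore delivers twice the aggregate throughput. This is why the notes pair one aggressive core with many simple ones.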
First, we compare the three approaches analytically. The y-axis is the speedup achieved over a single conventional P6 core. The x-axis is the degree of parallelism, i.e., the percentage of the program parallelized by the programmer. The three curves show the performance of the ACMP, Niagara, and P6-Tile. When parallelism is low, both the ACMP and P6-Tile outperform Niagara because of their high single-thread performance. When parallelism is high, Niagara outperforms both P6-Tile and the ACMP because of its high throughput. When parallelism is medium, however, the ACMP outperforms both Niagara and P6-Tile. Note that the Tile-Large (P6-Tile) approach never outperforms the ACMP. Niagara does outperform it at high parallelism because of its higher throughput.
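The notes do not give the parameters behind the analytical comparison, so the model below is a sketch under stated assumptions, not the talk's actual model. It assumes an area budget of 32 small-core units. A P6-like large core is assumed to cost 4 units and to run a single thread 2x faster than a Niagara-like small core. The ACMP is modeled as one large core plus 28 small cores, with the large core also helping on the parallel part. Speedup over one large core then follows from Amdahl's law. All constants and names here are hypothetical.

```python
from dataclasses import dataclass

# Assumed parameters (hypothetical, for illustration only).
BUDGET = 32        # chip area, in small-core units
LARGE_AREA = 4     # area of one large (P6-like) core
LARGE_PERF = 2.0   # single-thread performance of a large core
SMALL_PERF = 1.0   # single-thread performance of a small (Niagara-like) core


@dataclass
class Config:
    name: str
    serial_perf: float    # speed at which the serial part executes
    parallel_perf: float  # aggregate throughput on the parallel part


def configs() -> list[Config]:
    n_large = BUDGET // LARGE_AREA
    acmp_small = BUDGET - LARGE_AREA
    return [
        Config("P6-Tile", LARGE_PERF, n_large * LARGE_PERF),
        Config("Niagara", SMALL_PERF, BUDGET * SMALL_PERF),
        # ACMP: serial part on the large core; parallel part on all cores.
        Config("ACMP", LARGE_PERF, LARGE_PERF + acmp_small * SMALL_PERF),
    ]


def speedup(cfg: Config, f: float) -> float:
    """Speedup over one large core for parallel fraction f (Amdahl's law)."""
    baseline = 1.0 / LARGE_PERF
    time = (1.0 - f) / cfg.serial_perf + f / cfg.parallel_perf
    return baseline / time


if __name__ == "__main__":
    for f in (0.1, 0.5, 0.9, 0.999):
        row = {c.name: round(speedup(c, f), 2) for c in configs()}
        best = max(row, key=row.get)
        print(f"f={f}  {row}  best={best}")
```

With these assumed numbers, the ACMP matches P6-Tile on the serial part and beats it on the parallel part, so it never loses to P6-Tile. Niagara overtakes the ACMP only once f exceeds 240/241 (about 99.6%). Where that crossover falls depends entirely on the assumed core sizes and performance ratios.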