The document presents a framework for practical fast matrix multiplication algorithms that can outperform Intel's MKL library. It covers block partitioning schemes, code generation for rapid prototyping of new algorithms, and shared-memory parallelization strategies. Several fast algorithms outperform MKL on large problems, but memory-bandwidth limits often prevent large speedups at high core counts. Matching an algorithm's partitioning shape to the problem's dimensions, as well as moving to distributed-memory parallelism, may help address these limitations.
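To make the idea of a block-partitioned fast algorithm concrete, here is a minimal sketch of the best-known example, Strassen's <2,2,2> scheme, which splits each operand into 2x2 blocks and uses 7 block multiplications instead of 8. This is an illustrative recursion in NumPy, not the framework's implementation; the `leaf` cutoff and the power-of-two size assumption are simplifications for the sketch.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's <2,2,2> fast algorithm: recurse on 2x2 block partitions,
    trading 8 block multiplies for 7 multiplies plus extra additions.
    Assumes square matrices whose side is `leaf` times a power of two."""
    n = A.shape[0]
    if n <= leaf:  # fall back to ordinary matrix multiply at the leaves
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive block products
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Reassemble the four blocks of the result
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra block additions are what make these algorithms bandwidth-hungry in practice: they perform more memory traffic per flop than a classical multiply, which is one reason the speedups shrink as core counts grow.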