This document gives an overview of techniques for making matrix multiplication more I/O-efficient. It introduces the parallel disk model and shows how building locality into an algorithm reduces the number of I/O steps. It describes Cannon's algorithm for matrix multiplication on a 2D mesh network, and it discusses loop interchange, which improves cache efficiency by reordering the loops of the multiplication. Results are reported for parallel I/O-efficient matrix multiplication on matrices of different sizes, with running times ranging from 0.38 to 7 seconds. References on cache-oblivious algorithms and distributed-memory matrix multiplication are provided.
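
As a rough illustration of loop interchange, the sketch below multiplies two matrices with two different loop orderings. It is not taken from the document: the function names `matmul_ijk` and `matmul_ikj` are hypothetical, and the matrices are assumed to be plain lists of rows. In a row-major language such as C, the `i, k, j` ordering reads `b` and writes `c` along rows, which tends to be friendlier to the cache than the `i, j, k` ordering, whose inner loop walks down a column of `b`. With Python lists the timing difference is likely to be muted, so the code is meant only to show that both orderings compute the same product.

```python
# Minimal sketch of loop interchange for matrix multiplication.
# Assumption: matrices are lists of rows; function names are illustrative only.

def matmul_ijk(a, b):
    """Textbook ordering: the inner loop walks down a column of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0
            for k in range(m):
                total += a[i][k] * b[k][j]  # b is accessed column-wise here
            c[i][j] = total
    return c


def matmul_ikj(a, b):
    """Interchanged ordering: the inner loop walks along rows of b and c."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(m):
            a_ik = a[i][k]   # reused across the whole inner loop
            row_b = b[k]     # b is accessed row-wise here
            for j in range(p):
                row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # Both orderings perform the same multiply-adds, only in a different order.
    print(matmul_ijk(a, b) == matmul_ikj(a, b))
```

Both functions perform the same set of multiply-add operations, so the choice between them affects only the memory access pattern, not the result.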