This document discusses performance, parallelism, and scalability for machine learning. It begins by showing how rewriting stochastic gradient descent (SGD) with NumPy, Numba, and Cython speeds it up relative to a pure-Python baseline, and how optimizing for memory layout and caching helps further. A case study then optimizes Gensim's word2vec by rewriting it in Cython and calling into BLAS for additional speedups. Turning to parallelism, the document notes that hardware trends have increased parallelism through growing core counts, and describes parallelizing word2vec training across threads to achieve near-linear speedup. Finally, it mentions an experiment with an asynchronous, lock-free implementation of stochastic average gradient (SAG) in Julia.
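
To make the SGD comparison concrete, here is a minimal sketch (not the document's actual benchmark) of the same least-squares SGD inner loop written twice: once as a Python-level loop over samples and once JIT-compiled with Numba. All names and hyperparameters are illustrative.

```python
import numpy as np
from numba import njit

def sgd_python(X, y, lr=0.01, epochs=10):
    """Python-level loop over samples; each step is interpreted."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            # gradient of 0.5 * (x_i . w - y_i)^2 with respect to w
            err = np.dot(X[i], w) - y[i]
            w -= lr * err * X[i]
    return w

@njit
def sgd_numba(X, y, lr=0.01, epochs=10):
    """Same loop, compiled to machine code by Numba."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            err = np.dot(X[i], w) - y[i]
            for j in range(d):
                w[j] -= lr * err * X[i, j]
    return w
```

Timing the two on a moderately sized problem (say, `X = np.random.rand(10_000, 20)`) typically shows the jitted version running far faster once its one-time compilation cost is paid, which is the kind of gap between interpreted and compiled loops that the document explores.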
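
The BLAS point can be illustrated in the same spirit. The word2vec inner loop largely reduces to dot products and axpy-style vector updates, and the case study's further speedup comes from dispatching those to BLAS rather than looping in Cython. The sketch below shows the equivalent primitives via SciPy's BLAS bindings; it is illustrative, not the Gensim code.

```python
import numpy as np
from scipy.linalg.blas import saxpy, sdot

x = np.ones(8, dtype=np.float32)
y = np.arange(8, dtype=np.float32)

s = sdot(x, y)          # single-precision dot product, computed in BLAS
y = saxpy(x, y, a=0.5)  # y := 0.5 * x + y, also a single BLAS call
print(s, y)
```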
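
The thread-based parallelism works because the compiled inner loop releases Python's GIL (as Cython `nogil` blocks and BLAS-backed NumPy calls do), so ordinary `threading.Thread` workers can run truly concurrently. A minimal, illustrative sketch, with a BLAS matrix multiply standing in for the training kernel:

```python
import threading
import numpy as np

def train_chunk(chunk, out, idx):
    # Stand-in for a GIL-releasing training kernel: a BLAS matrix
    # multiply, which NumPy executes with the GIL released.
    out[idx] = np.linalg.norm(chunk @ chunk.T)

data = np.random.rand(4, 500, 500)
results = [0.0] * len(data)
threads = [
    threading.Thread(target=train_chunk, args=(data[i], results, i))
    for i in range(len(data))
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

With one thread per core and a kernel that spends its time outside the GIL, this pattern is what yields the near-linear speedup the document describes for word2vec training.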