This document discusses extracting topics from text using machine learning techniques. It traces the history of topic models through TF-IDF (term frequency-inverse document frequency), LSI (latent semantic indexing), pLSI (probabilistic LSI), and LDA (latent Dirichlet allocation). It describes how LDA uses a hierarchical Bayesian model to represent each document as a mixture of topics and each topic as a mixture of words. The document demonstrates topic modeling with LDA and k-means in R and Spark. It concludes that LDA yields mixtures of topics per document while k-means yields distinct topics, and that unsupervised LDA may need input from domain experts to improve topic representation.
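
The generative view of LDA summarized above (documents as mixtures of topics, topics as mixtures of words) can be illustrated with a minimal sketch. It is not the R or Spark code from the demonstrations; it is a toy simulation in plain Python, and the vocabulary, the prior values `ALPHA` and `BETA`, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative story.

    topic_word: one word distribution per topic (each row sums to 1)
    alpha:      Dirichlet prior over the document's topic mixture
    """
    # Each document gets its own mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        # Pick a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


# Hypothetical toy setup: six words, two topics (illustrative values only).
VOCAB = ["goal", "match", "team", "vote", "party", "law"]
ALPHA = [0.5, 0.5]          # prior over topics per document
BETA = [0.5] * len(VOCAB)   # prior over words per topic

rng = random.Random(0)
# Each topic is itself a mixture (distribution) over the vocabulary.
TOPIC_WORD = [sample_dirichlet(BETA, rng) for _ in ALPHA]
theta, doc = generate_document(TOPIC_WORD, VOCAB, ALPHA, 8, rng)

print("topic mixture:", [round(t, 3) for t in theta])
print("document:", " ".join(doc))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the per-document topic mixtures and the per-topic word distributions. A k-means approach, by contrast, would assign each document to exactly one cluster rather than to a mixture.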