This document provides an overview of Apache Spark, including why it was created, how it works, and how to get started with it. Some key points:

- Spark was initially developed at UC Berkeley in 2009, growing out of a class project aimed at testing cluster management systems such as Mesos. It was open sourced in 2010 and became a top-level Apache project in 2014.
- Spark is faster than Hadoop for machine learning workloads because it keeps data in memory between jobs rather than writing intermediate results to disk, and it has a much smaller codebase.
- The basic unit of data in Spark is the resilient distributed dataset (RDD), an immutable collection of records partitioned across the machines of a cluster. RDDs support two kinds of operations, transformations and actions (see the sketch after this list).
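The sketch below illustrates the RDD model described above: transformations are declared lazily, an explicit cache keeps the data in memory between jobs, and actions trigger the actual computation. It is a minimal example in Scala, assuming only a local Spark runtime with `spark-core` on the classpath; the application name, the `local[*]` master, and the sample data are illustrative choices rather than anything specified by the document.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; "local[*]" runs on all local cores.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from an in-memory collection, distributed across partitions.
    val numbers = sc.parallelize(1 to 1000)

    // Transformations are lazy: they describe new RDDs without computing them yet.
    val squares = numbers.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // cache() asks Spark to keep the computed partitions in memory across jobs,
    // which is what makes iterative workloads such as machine learning fast.
    evens.cache()

    // Actions trigger computation and return results to the driver.
    println(evens.count())              // first job: computes and caches the RDD
    println(evens.take(5).mkString(","))// second job: reads from the in-memory cache

    sc.stop()
  }
}
```

Running the example twice over the same cached RDD shows the practical effect of keeping data in memory: only the first action pays the cost of computing the lineage of transformations.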