This document summarizes how switching from Hadoop to Spark for data science applications improved performance, reliability, and reduced costs at Salesforce. Some key issues addressed were handling large datasets across many S3 prefixes, efficiently computing segment overlap on skewed user data, and performing joins on highly skewed datasets. These changes resulted in applications that were 100x faster, used 10x less data, had fewer failures, and reduced infrastructure costs.