This document discusses Synchronoss' journey in developing its data pipeline and data profiling capabilities. It describes: 1) The initial ETL-based pipeline (V1), which relied on long batch processes and could not handle large, unstructured datasets. 2) An upgraded version (V2) built on an MPP (massively parallel processing) appliance, which improved performance but came at a high cost. 3) The adoption of Spark (V4) to build a flexible, scalable pipeline that profiles data in the data lake using RDDs and built-in transformations. 4) The results of this approach: data analysis time fell from weeks to hours, and data quality issues were identified earlier.
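To make the idea of profiling with RDD-style transformations more concrete, here is a minimal sketch in plain Python. It is not Synchronoss' implementation: the column names, the sample records, and the helper names (`zero`, `seq_op`, `comb_op`, `profile_partitions`) are all hypothetical. The sketch folds records into a per-partition profile and then merges the partial profiles, which is the same shape of computation that Spark's `RDD.aggregate(zeroValue, seqOp, combOp)` performs across a cluster.

```python
from functools import reduce

# Hypothetical schema, chosen only for illustration.
COLUMNS = ["id", "country", "plan"]


def zero():
    """Return an empty profile: one stats dict per column."""
    return {c: {"rows": 0, "nulls": 0, "distinct": set()} for c in COLUMNS}


def seq_op(profile, record):
    """Fold a single record into a partition-local profile."""
    for c in COLUMNS:
        stats = profile[c]
        stats["rows"] += 1
        value = record.get(c)
        # Treat missing, None, and empty-string values as nulls.
        if value is None or value == "":
            stats["nulls"] += 1
        else:
            stats["distinct"].add(value)
    return profile


def comb_op(left, right):
    """Merge two partition-local profiles into one."""
    for c in COLUMNS:
        left[c]["rows"] += right[c]["rows"]
        left[c]["nulls"] += right[c]["nulls"]
        left[c]["distinct"] |= right[c]["distinct"]
    return left


def profile_partitions(partitions):
    """Local stand-in for an aggregate over partitioned data."""
    # Each partition is profiled independently, then the partials are merged.
    partials = [reduce(seq_op, part, zero()) for part in partitions]
    return reduce(comb_op, partials, zero())


# Made-up sample data, split into two "partitions".
partitions = [
    [{"id": 1, "country": "US", "plan": "basic"},
     {"id": 2, "country": "", "plan": "basic"}],
    [{"id": 3, "country": "DE", "plan": None},
     {"id": 3, "country": "US", "plan": "pro"}],
]

profile = profile_partitions(partitions)
for column, stats in profile.items():
    print(column, stats["rows"], stats["nulls"], len(stats["distinct"]))
```

With PySpark available, the same `seq_op` and `comb_op` functions could in principle be passed to `rdd.aggregate(zero(), seq_op, comb_op)` on an RDD of record dicts. Keeping exact distinct-value sets is only practical for low-cardinality columns; for large data an approximate counter would be the more likely choice.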