Big data is everywhere, but we rarely hear about the work required to clean and conform massive datasets as a first step toward reliable data analysis. I'd like to share our experience over the past two years at Shopify, re-implementing our data warehouse using Apache Spark (PySpark) and building trustworthy data models that benefit both business users and data analysts across different business units.

Providing meaningful metrics and insights for a rapidly growing e-commerce platform that is constantly adding features and new products is a challenge. Staying on top of changing and moving source data, ensuring data quality, avoiding incorrect conclusions, and then building predictive analysis on top of all this is what my team and I spend most of our time on at Shopify.

The problem is that very few conversations in the data community address the joys and frustrations of building these data models and data warehouses. I'd like to talk about the challenges, the things that have worked for us (and the things that haven't), and to start a conversation around this dark corner of data analysis.