Talk about testing large data pipelines, mostly inspired by my experience at LinkedIn working on relevancy and recommender system pipelines.
Abstract: Applied machine learning data pipelines are being developed at a very fast pace and often exceed traditional web/business application codebases in terms of scale and complexity. The algorithms and processes these data workflows implement serve business-critical applications that require robust and scalable architectures. But how do we make these data pipelines robust? When the number of developers and data jobs grows while, at the same time, the underlying data changes, how do we test that everything works as expected?
In software development we divide things into clean, independent modules and use unit and integration testing to prevent bugs and regressions. So why is it more complicated with big data workflows? Partly because these workflows usually pull data from dozens of sources outside of our control and have a large number of interdependent data processing jobs. And partly because we don't yet know how to do it, or lack the proper tools.
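As a minimal sketch of what "clean, independent modules" means for a data job: if the transformation logic is extracted into a pure function (here a hypothetical `dedupe_events`, not from any real pipeline), it can be unit-tested with a handful of in-memory records, independently of the data sources and the execution engine.

```python
# Hypothetical example: a small, pure transformation extracted from a
# pipeline job so it can be unit-tested in isolation.

def dedupe_events(events):
    """Keep only the latest event per user_id, based on the ts field."""
    latest = {}
    for e in events:
        key = e["user_id"]
        if key not in latest or e["ts"] > latest[key]["ts"]:
            latest[key] = e
    # Sort for deterministic output, which makes assertions simple.
    return sorted(latest.values(), key=lambda e: e["user_id"])

# A unit test pins down the behavior with fixture data, regardless of
# which upstream source the real pipeline reads from.
events = [
    {"user_id": "a", "ts": 1, "action": "click"},
    {"user_id": "a", "ts": 3, "action": "view"},
    {"user_id": "b", "ts": 2, "action": "click"},
]
result = dedupe_events(events)
assert [e["ts"] for e in result] == [3, 2]
```

The hard part in practice is that most pipeline logic is not written this way: it is entangled with I/O against sources we don't control, which is exactly the difficulty described above.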