This document discusses the importance of data quality measurement for machine learning applications. It covers quality metrics for both structured/tabular data and unstructured text data.
For structured data, it discusses common data cleaning techniques and whether cleaning always helps machine learning pipelines. It also covers topics like class imbalance, label noise, data valuation, data homogeneity, and transformations.
For class imbalance specifically, it describes factors that can affect imbalanced classification like imbalance ratio, overlap between classes, smaller sub-concepts, dataset size, and label noise. It also discusses different modeling strategies and resampling techniques to address imbalance.
Finally, it covers label noise as another important data quality metric, how noise can be introduced in labels,