Let the Product Lead: R&D Paradigms for Recognition Models of Handwritten Math Expressions

Typical approaches to handwriting recognition tasks involve collecting and tagging large amounts of data, on which many iterations of models are then trained. This “one dataset, many models” paradigm has specific drawbacks in the context of agile product development. As product requirements evolve, for example when new characters are added to the prediction space, a new data collection and tagging effort must be undertaken. In this talk, I’ll describe a different approach in which we iteratively build a complex, synthetically generated dataset towards specific requirements. The generation process gives the developer exact control over the distribution of math expressions, the characters, the location of characters, specific visual qualities of the math, image noise, and image augmentations. Thus, we can arrive at a “many datasets, one model” paradigm in which the model's capability, driven by iterative changes in the data, can adapt quickly on agile cycles. In addition to aligning with the product development process, synthetic data allows for 100% correct pixel-by-pixel tagging, which opens the door to new modeling possibilities at scale.

https://ghostday.pl/index.php/20-edition-speakers/
https://www.kaggle.com/datasets/aidapearson/ocr-data
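
To make the “many datasets, one model” idea in the abstract concrete, here is a minimal sketch of a config-driven generator that returns an image together with an exact per-pixel label mask. It is not the pipeline from the talk: the tiny bitmap `GLYPHS`, the `GeneratorConfig` fields, and `generate_sample` are all hypothetical stand-ins for a real handwriting renderer, and Gaussian noise is used only as one possible noise model.

```python
from dataclasses import dataclass, field

import numpy as np

# Hypothetical 5x3 bitmap glyphs standing in for a real handwriting renderer.
GLYPHS = {
    "1": ["010", "110", "010", "010", "111"],
    "+": ["000", "010", "111", "010", "000"],
    "=": ["000", "111", "000", "111", "000"],
}


@dataclass
class GeneratorConfig:
    """Assumed knobs; editing them yields a new dataset for the same model."""
    char_weights: dict = field(
        default_factory=lambda: {"1": 2.0, "+": 1.0, "=": 1.0}
    )
    n_chars: int = 4        # characters per generated expression
    height: int = 16
    width: int = 32
    noise_std: float = 0.1  # standard deviation of additive Gaussian noise
    seed: int = 0


def generate_sample(cfg, rng):
    """Return (noisy image, per-pixel class mask, expression string)."""
    chars = sorted(cfg.char_weights)
    class_id = {c: i + 1 for i, c in enumerate(chars)}  # 0 means background
    weights = np.array([cfg.char_weights[c] for c in chars], dtype=float)
    probs = weights / weights.sum()

    image = np.zeros((cfg.height, cfg.width), dtype=np.float32)
    mask = np.zeros((cfg.height, cfg.width), dtype=np.int64)
    slot = cfg.width // cfg.n_chars  # one horizontal slot per character
    expression = []

    for i in range(cfg.n_chars):
        ch = str(rng.choice(chars, p=probs))
        glyph = np.array(
            [[c == "1" for c in row] for row in GLYPHS[ch]], dtype=bool
        )
        gh, gw = glyph.shape
        # Jitter the position inside the slot; slots never overlap, so every
        # ink pixel belongs to exactly one character.
        top = int(rng.integers(0, cfg.height - gh + 1))
        left = i * slot + int(rng.integers(0, slot - gw + 1))
        image[top:top + gh, left:left + gw][glyph] = 1.0
        mask[top:top + gh, left:left + gw][glyph] = class_id[ch]
        expression.append(ch)

    # Noise is applied to the image only; the label mask stays exact.
    noise = rng.normal(0.0, cfg.noise_std, size=image.shape)
    return (image + noise).astype(np.float32), mask, "".join(expression)


cfg = GeneratorConfig()
rng = np.random.default_rng(cfg.seed)
image, mask, expr = generate_sample(cfg, rng)

# A new requirement (here the digit "7") is a glyph plus a weight, followed by
# regeneration, rather than a new collection and tagging effort.
GLYPHS["7"] = ["111", "001", "010", "010", "010"]
cfg_v2 = GeneratorConfig(char_weights={**cfg.char_weights, "7": 1.0})
image_v2, mask_v2, expr_v2 = generate_sample(cfg_v2, np.random.default_rng(1))
```

Because the mask is written at the same moment as the ink, the labels are correct by construction, which is the property the abstract relies on for pixel-level modeling. Class ids are derived from the sorted character set, so a real system would likely want to pin them explicitly to keep them stable as characters are added.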