Applied Machine Learning Conference: Synthetic OCR data

Let the Product Lead:
R&D Paradigms for Recognition Models of Handwritten Math Expressions
Quinn Lathrop
quinn.lathrop@gmail.com

AI Products and Solutions
• Global organization
• Research, Development, Engineering, Product, Design
• We own the product/feature from ideation to delivery
• Achieve broad innovation goals by developing B2C products first
Contributors to this work:
Zac Hancock, Jiamin He, Michael Chifala, Sungjin Nam, Bill Vander Lugt, Mounika Kakarla, Teddy Ampian, Luke Tuguluke, JB
DeVries, Leslie Satterfield, Claudia Cassidy, Douglas Cobb, Brian LoPiccolo, Joey Ashcroft, Ashley Fallon, Holly Smith, Tim
Stewart, JD Corbin, Tejawini Nallagatla, Jakob Vendegna, Wes Galbraith, Randall Barnhart, Eric Kattwinkel, Johann Larusson,
Jason Fournier, Michal Okulski, Piotr Kabacinski, Kacper Lodzikowski

Aida Calculus
Check This Problem Flow:
• Student inputs their problem by taking a picture of their handwritten work
• Student receives step-by-step feedback
• Personalized hints and tutoring
Video

Handwriting Recognition of Math Expressions
y = ( 2 x - 3 ) ( x^2 - 5 )^3

Typical Data -> Model Flow
Waterfall:
• Collect dataset
• Iterate on model
At a certain point in time, the dataset is fixed.
Real data is not perfect.

Product Driven R&D
• Iteratively build a synthetic generation capability towards requirements
• Control over the distribution of math expressions, characters, location of characters, specific visual qualities of the math, image noise, and image augmentations
• Control every pixel of the image
• Can create millions of perfectly labelled images in hours
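A minimal sketch of what "control over the distribution of math expressions" could look like in code. It samples expression strings from a weighted set of families and then counts what was actually produced. The family names, the weights, and the templates are assumptions made for illustration; they are not taken from the actual generator.

```python
import random
from collections import Counter

# Hypothetical target mix of expression families (illustrative numbers only).
TEMPLATE_WEIGHTS = {
    "linear": 0.5,   # e.g. y=2x-3
    "power": 0.3,    # e.g. y=x^2-5
    "product": 0.2,  # e.g. y=(2x-3)(x^2-5)^3
}


def sample_expression(rng):
    """Draw one (family, expression string) pair from the target mix."""
    families = list(TEMPLATE_WEIGHTS)
    weights = [TEMPLATE_WEIGHTS[f] for f in families]
    family = rng.choices(families, weights=weights, k=1)[0]
    a, b, n = rng.randint(2, 9), rng.randint(1, 9), rng.randint(2, 4)
    if family == "linear":
        expr = f"y={a}x-{b}"
    elif family == "power":
        expr = f"y=x^{n}-{b}"
    else:
        expr = f"y=({a}x-{b})(x^2-{b})^{n}"
    return family, expr


def generate(count, seed=0):
    """Generate `count` labelled expressions; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [sample_expression(rng) for _ in range(count)]


samples = generate(10_000)
# Because we sampled the labels ourselves, the population is fully visible.
family_counts = Counter(family for family, _ in samples)
char_counts = Counter(ch for _, expr in samples for ch in expr)
```

Changing a weight in `TEMPLATE_WEIGHTS` and regenerating is the whole "data collection" step in this sketch, which is the contrast with a fixed real-world dataset.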

Math Expression: y=(2x-3)(x^2-5)^3
Background
Extra Marks
Bleed Through
Our synthetic data builds these features from the ground up:
• Math Expression
• Font and writing utensil
• Character-specific distortions/augmentations
• Backgrounds
• Other visual noise
• Simulated photo – angle, distance, quality, shadows
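The layering idea in the list above can be sketched on a tiny pixel grid: draw a background, place each character, then add noise, while recording which character owns each pixel. This is a toy illustration under stated assumptions. The 3x5 bitmap glyphs, the grid size, and the `render` function are hypothetical stand-ins for real handwriting fonts, backgrounds, and photo simulation.

```python
import random

WIDTH, HEIGHT = 24, 8

# Hypothetical 3x5 bitmap "font"; a real generator would use handwriting fonts.
GLYPHS = {
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "+": ["000", "010", "111", "010", "000"],
    "=": ["000", "111", "000", "111", "000"],
}


def blank(value):
    """Create a HEIGHT x WIDTH grid filled with `value`."""
    return [[value] * WIDTH for _ in range(HEIGHT)]


def render(expression, seed=0, noise_rate=0.05):
    """Return (image, labels) where labels[r][c] is the owning character index."""
    rng = random.Random(seed)
    image = blank(0)      # layer 1: empty background (0 = paper, 1 = ink)
    labels = blank(None)  # pixel-level ground truth, None = no character
    x = 1
    for index, ch in enumerate(expression):
        y = 1 + rng.randint(0, 1)  # layer 2: character with vertical jitter
        for row, bits in enumerate(GLYPHS[ch]):
            for col, bit in enumerate(bits):
                if bit == "1":
                    image[y + row][x + col] = 1
                    labels[y + row][x + col] = index
        x += 4  # glyph width plus one pixel of spacing
    # layer 3: extra marks, added only where no character owns the pixel
    for r in range(HEIGHT):
        for c in range(WIDTH):
            if labels[r][c] is None and rng.random() < noise_rate:
                image[r][c] = 1
    return image, labels


image, labels = render("1+1=2")
```

The noise pixels are ink in `image` but stay unlabelled in `labels`, so the ground truth remains exact no matter how much visual noise is layered on.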

Benefits: Product Development Cycles
When other product features depend on a developing AI capability, it is important to integrate and iterate early and often
• Alpha
• Beta
• MVP

Benefits: Exact Control and Visibility of the Population Distributions

Benefits: User Behavior Drives Backlog
Example: computer screens
• Can generate data and quickly ask the question: Does our model architecture support this expansion in scope?
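One way to ask that question quickly is to score the model separately on each capture condition. The sketch below computes an exact-match rate per condition. The condition names and the four records are invented toy inputs for illustration, not results from the actual model, and `exact_match_by_condition` is a hypothetical helper.

```python
from collections import defaultdict


def exact_match_by_condition(records):
    """records: iterable of (condition, predicted_label, true_label) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, predicted, truth in records:
        totals[condition] += 1
        hits[condition] += int(predicted == truth)
    # Fraction of exactly correct predictions within each condition.
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Toy records, made up for this example only.
records = [
    ("paper", "y=2x-3", "y=2x-3"),
    ("paper", "y=x^2-5", "y=x^2-5"),
    ("screen", "y=2x-3", "y=2x-3"),
    ("screen", "y=x^2-S", "y=x^2-5"),
]
scores = exact_match_by_condition(records)
# scores == {"paper": 1.0, "screen": 0.5}
```

A gap between conditions on freshly generated data would suggest the new scope needs more than additional training images.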

Benefits: Modeling
Because we have 100% correct pixel-level tagging, a range of object detection models is available. Bounding boxes and masks come at no additional data collection cost.
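To illustrate why bounding boxes are free once a pixel mask exists, here is a small sketch that derives a box from a binary mask. The function name and the inclusive `(x_min, y_min, x_max, y_max)` convention are assumptions for this example; the released dataset may store boxes differently.

```python
def bounding_box(mask):
    """Return inclusive (x_min, y_min, x_max, y_max) of truthy pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # empty mask: the character has no visible pixels
    return (min(cols), min(rows), max(cols), max(rows))


# A toy 4x5 mask for a single character.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
box = bounding_box(mask)
# box == (1, 1, 3, 3)
```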

Open Sourcing Dataset on Kaggle
100,000 images with ground truth LaTeX, bounding boxes, and masks
https://www.kaggle.com/aidapearson/ocr-data

Thank you!
Open Sourced Kaggle Dataset
https://www.kaggle.com/aidapearson/ocr-data
Quinn Lathrop
quinn.lathrop@gmail.com
