This document proposes a data structure for standardizing person metadata, such as inventor and applicant names and addresses, extracted from patent documents. It discusses:
- Four main steps for cleaning and standardizing person data: re-parsing, cleaning, standardization, and deduplication.
- Challenges that arise because different teams specialize in cleaning local addresses and because address standards vary between countries.
- A proposed metadata structure that splits name and address fields into standardized subfields such as first name, last name, street, and city.
- Cleaning operators such as map+correct, which moves a string from one field to another while correcting errors, and conditions that restrict a transformation by country, time period, or text patterns in the data.
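
The metadata structure and the map+correct operator summarized above can be illustrated with a minimal Python sketch. It is not the proposal's actual schema or implementation: the names `PersonRecord`, `Condition`, `map_correct`, and `german_city`, the choice of fields, and the sample address are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class PersonRecord:
    """Hypothetical record: raw strings plus standardized subfields."""
    raw_name: str
    raw_address: str
    country: str                      # assumed two-letter country code, e.g. "DE"
    year: int                         # assumed publication year of the document
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    street: Optional[str] = None
    city: Optional[str] = None


@dataclass(frozen=True)
class Condition:
    """Restricts a transformation by country, time period, or text pattern."""
    countries: Optional[frozenset] = None
    years: Optional[range] = None
    pattern: Optional[str] = None     # regex searched in the source field

    def matches(self, record: PersonRecord, value: str) -> bool:
        if self.countries is not None and record.country not in self.countries:
            return False
        if self.years is not None and record.year not in self.years:
            return False
        if self.pattern is not None and not re.search(self.pattern, value):
            return False
        return True


def map_correct(record: PersonRecord, source: str, target: str,
                correct: Callable[[str], str],
                condition: Condition = Condition()) -> PersonRecord:
    """Copy `source` into `target`, applying `correct`, if the condition holds."""
    value = getattr(record, source)
    if value is None or not condition.matches(record, value):
        return record                 # rule does not apply; record is unchanged
    return replace(record, **{target: correct(value)})


def german_city(address: str) -> str:
    """Toy corrector: take the text after a 5-digit postal code, fix one spelling."""
    m = re.search(r"\b\d{5}\s+(.+)$", address)
    city = m.group(1).strip() if m else address.strip()
    return {"Muenchen": "München"}.get(city, city)


rec = PersonRecord(raw_name="MUELLER, Hans",
                   raw_address="Hauptstr. 5, 80331 Muenchen",
                   country="DE", year=1998)
rule = Condition(countries=frozenset({"DE"}), years=range(1990, 2010),
                 pattern=r"\d{5}")
cleaned = map_correct(rec, "raw_address", "city", german_city, rule)
print(cleaned.city)
```

Keeping the condition separate from the corrector function is one possible design: it would let a team that specializes in one country's addresses contribute rules without affecting records from other countries or time periods.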