Bad data can be as problematic as no data at all. This document discusses how Monsanto has improved data stewardship workflows in plant agriculture through automation. Key points:
- Monsanto uses various technologies like biotechnology, plant breeding, and data science to help farmers grow crops more sustainably.
- The company has implemented processes to standardize data from acquisition to analytics to ensure integrity and enable decisions. This includes curating data, developing ontologies, and integrating databases.
- Automating data curation through a program called dataCuratoR has increased accuracy, minimized resource usage, and improved data accessibility over time. It helps standardize real-time insect assay data.
1. Bad Data is No Better Than No Data! - Impact of Automation in Data Stewardship Workflows in the Plant Agriculture Industry
Karnam Vasudeva Rao
Senior Scientist, Data Science Team
Monsanto
2. Innovations at Monsanto
R & D: Discovering innovative solutions to challenges big and small, helping farmers grow more sustainably.
Agricultural Biologicals: Using naturally occurring microbes to benefit the soil and seed.
Modern Agriculture: Evolving the approach to agricultural innovations and farming practices to help farmers increase efficiency.
Crop Protection: Guarding plants from disease, weeds, and pests.
Data Science: Measuring the health of plants, available natural resources, and the efficiency of a farm.
Biotechnology: Introducing greater tolerance and adaptability to a seed product.
Plant Breeding: Merging plant genetics for improved yield, water efficiency, and more.
Monsanto at a glance:
• Headquarters: St. Louis, Missouri, United States
• Fortune 500 company
• Over 20,000 employees globally
• Facilities in 69 countries
3. Data Stewardship phases to enable Data2Decisions
01 – Acquisition: getting data, compilation
02 – Normalization: curation and ontology
03 – Integration: data processing and DB management
04 – Analytics: data analysis, app development, and visual analytics
4. Tracing entities in the R & D pipeline is difficult
Pipeline stage | Entities | Databases
Registration | in-house IDs, Gene and Protein IDs | DB 1 & 3
Cloning | Gene, Monsanto vector name | DB 1, 2, & 4
Gene transfer | Monsanto vector name, Plant barcode, sample barcode | DB 5 & 6
Green-house | Monsanto vector name, Plant barcode, sample barcode | DB 7
Field studies | Monsanto vector name, Plant barcode, sample barcode, Field IDs | DB 8-10
5. D3 data from in-house research studies
Study type | Identifiers | Data storage (example)
Lab experiments | Gene/in-house IDs | DB1, DB2, DB3, DB4, DB5
Green-house experiments | Monsanto vector ID | DB6, DB7
Field studies | Plant/Seed IDs | DB8, DB9, DB10

Sample name | Insect name
NCR | Corn rootworm (CRW)
SCR | Corn rootworm (CRW)
6. From D3 to C3
Common name (Acronym) | Scientific name | Colloquial term
Northern corn rootworm (NCR) | Diabrotica barberi | Corn rootworm (CRW)
Southern corn rootworm (SCR) | Diabrotica undecimpunctata howardi | Corn rootworm (CRW)
Implementing CV and ontologies removes the ambiguity caused by using colloquial terms and makes the data Clean, Consistent, and Connected.
Corn/corn = Maize/maize = Zea mays
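The controlled-vocabulary (CV) idea above can be sketched as a simple lookup that maps acronyms and colloquial terms to canonical scientific names. The dictionary here is hand-built for illustration; in practice the CV would come from a curated ontology database.

```python
# Minimal CV sketch: colloquial terms and acronyms map to canonical names.
# The entries are taken from the table above; the structure is illustrative.
CV = {
    "ncr": "Diabrotica barberi",                  # Northern corn rootworm
    "northern corn rootworm": "Diabrotica barberi",
    "scr": "Diabrotica undecimpunctata howardi",  # Southern corn rootworm
    "southern corn rootworm": "Diabrotica undecimpunctata howardi",
    "corn": "Zea mays",
    "maize": "Zea mays",
}

def normalize(term: str) -> str:
    """Map a colloquial term or acronym to its canonical name; pass unknowns through."""
    return CV.get(term.strip().lower(), term)

print(normalize("NCR"))    # Diabrotica barberi
print(normalize("Maize"))  # Zea mays
```

Passing unknown terms through unchanged (rather than failing) lets a curation job flag them later for human review and CV expansion.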
7. Data Stewardship to Achieve Data Integrity
• Ensures data reusability, accessibility, and quality
• Maintains consistent data definitions and data aliases
• Metadata (data about data) enables organized information retrieval
• An integrated, enterprise-wide view of the data provides the foundation for shared data
Standardizing metadata is important for data integrity, reproducibility, and accessibility.
8. dataCuratoR: Automated data standardization of real-time insect assay data to enable decisions
Raw (dirty) data (DB1, DB2) → API → curation against metadata (Crop, Insect, Plant stage, and Gen) → curated data (DB 3: Oracle, PostgreSQL) → API → clean and consistent data → dashboards and analytics → Decisions
Curation jobs are scheduled via CRON.
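The flow above can be sketched as a small curation job of the kind a CRON schedule would trigger: pull raw assay records, apply the CV to the metadata fields, and hand the curated rows to a loader. All function names, field names, and the sample record are invented for illustration; this is not the actual dataCuratoR implementation.

```python
# Hypothetical sketch of an automated curation job (names are illustrative).
def fetch_raw_records():
    """Stand-in for the API pulls from the source databases (DB1, DB2)."""
    return [{"insect": "NCR", "crop": "corn", "score": 7}]

def curate(record, cv):
    """Standardize one record by applying the controlled vocabulary to key fields."""
    out = dict(record)
    out["insect"] = cv.get(out["insect"].strip().lower(), out["insect"])
    out["crop"] = cv.get(out["crop"].strip().lower(), out["crop"])
    return out

def run_curation(cv):
    """The unit a CRON schedule would invoke: fetch, curate, then load."""
    curated = [curate(r, cv) for r in fetch_raw_records()]
    # load_to_curated_db(curated)  # stand-in for the write to the curated store
    return curated

cv = {"ncr": "Diabrotica barberi", "corn": "Zea mays"}
print(run_curation(cv))
```

Keeping curation a pure function of (record, CV) makes the job idempotent, so a re-run of the same CRON window cannot corrupt already-clean data.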
9. Automation increased accuracy and minimized resource usage
FY16 (2.2 resource hours): data access; requirement gathering; patterns, missing data, and inconsistencies; source for answers; manual curation
FY17 (0.9 resource hours): programming; more patterns, gaps, and inconsistencies?; maintenance & enhancements
FY18 (0.4 resource hours): minimal coding; patterns, gaps, and inconsistencies; coding & APIs
FY19 (0.3 resource hours): more patterns and inconsistencies?; maintenance & enhancements; minimal coding
Result: increased data accessibility.
Steps:
Collect data from relevant sources.
Organize data and upload it to a centralized repository for normalization.
Define the metadata and capture relevant information using the CV.
Annotate, map, and enrich data relations for consistency.
Develop and maintain databases for structured and unstructured data.
Format and load data into the DB.
Perform exploratory and inferential statistics and predictive analytics.
Develop web services, web apps, and cloud-based solutions.
Build a user-friendly interface to access and analyze data.
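The early steps above (collect, normalize, load) can be chained as a toy pipeline. Each stage here is a deliberately thin placeholder for the corresponding stewardship activity; the field names and the lowercasing rule are assumptions for the example.

```python
# Toy sketch of the stewardship steps chained as a pipeline (illustrative names).
def collect(sources):
    """Step 1: gather rows from all relevant sources into one list."""
    return [row for src in sources for row in src]

def standardize(rows):
    """Steps 2-4 stand-in: normalize a metadata field (here, just lowercase crop)."""
    return [{**r, "crop": r.get("crop", "").lower()} for r in rows]

def load(rows, db):
    """Steps 5-6 stand-in: append formatted rows to the target store."""
    db.extend(rows)
    return db

db = []
rows = collect([[{"crop": "Corn"}], [{"crop": "Maize"}]])
load(standardize(rows), db)
print(db)  # [{'crop': 'corn'}, {'crop': 'maize'}]
```

Separating the stages this way mirrors the Acquisition → Normalization → Integration phases on slide 3: each stage can be tested, replaced, or automated independently.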
How easy is it for us to track entities across the pipeline?
An "entity" is defined as anything that is tracked in the pipeline at a relatively high frequency and has a physical presence.
"Relationships" connect entities through a specific property.
"Metadata" is defined as data that supports entity discovery in the research pipeline.
The research pipeline has numerous entities and relationships, and linking the different data systems and their lineage to enable faster decisions is a persistent challenge. For example, the genes to be tested are identified by different IDs during different phases of the pipeline. Creating one data system that links these sources is possible only if the metadata for the entities is uniform: construct names, gene names, and protein names must be consistent, and so must the linkage between them, i.e., the mapping of protein names to construct names.
Stewardship work is therefore essential to bring consistency to the data for these entities.
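The lineage problem described above can be illustrated with a toy trace: once the metadata is uniform, following a plant barcode back to its vector, gene, and protein is a chain of lookups across systems. Every ID, table shape, and field name below is invented for the example, not an actual Monsanto schema.

```python
# Illustrative entity trace across three mock data systems (invented IDs/schemas).
db_genes = {"GENE-001": {"protein": "ProtA"}}         # in-house gene IDs → protein
db_vectors = {"MON-VEC-17": {"gene_id": "GENE-001"}}  # vector name → gene ID
db_plants = {"PLANT-9f3": {"vector": "MON-VEC-17"}}   # plant barcode → vector

def trace(plant_barcode):
    """Follow a plant barcode back through the pipeline to its gene and protein."""
    vector = db_plants[plant_barcode]["vector"]
    gene = db_vectors[vector]["gene_id"]
    return {"vector": vector, "gene": gene, "protein": db_genes[gene]["protein"]}

print(trace("PLANT-9f3"))
```

The trace only works because each system records the upstream entity's ID consistently; if one database stored the vector under a colloquial alias, the chain would break, which is exactly the failure mode the stewardship work prevents.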