Software	and	Data	as	Scaffolds	
for	Integrative	Science
David	LeBauer,	Ph.D.	
University	of	Illinois	at	Urbana-Champaign	
Department	of	Agricultural	and	Biological	Engineering	
Carl	R	Woese	Institute	for	Genomic	Biology	
National	Center	for	Supercomputing	Applications
1
Outline
• Overview:	Problems	and	Approach	
• Combining	Information	
• Models:	integration	across	domains		
• PEcAn:	integration	of	models	and	data	
• TERRA	REF:	automated	data	collection	and	analysis	
• Future	Directions
2
Challenges	we	face
• Agricultural	Production:	
• Feeding	9bn	by	2050	
• Climate	is	changing	
• Resources	are	becoming	scarce	
• Scientific	Problems:	
• How	do	genes	control	traits?	
• How can we leverage data and computing?
3
Tilman et al, Nature 2002
Yield
Fertilizer
Pesticides
Technical	Solutions	for	Science	and	Agriculture
• Knowledge	is	Spread	Across	Many	Scales	and	Formats:	
• Expert	Knowledge		
• Data	
• Mechanistic	Models	
• Integrating	these	will	enable:	
• Stronger	Inference	and	Prediction	
• More	Science	and	Engineering
4
Marshall-Colon et al 2017
Frontiers in Plant Science
10^2 m
10^-3 m
10^3 m
10^4 m
10^5 m
Which crops are viable,
…	and	where?	
What	fraction	of	global	
energy	/	food	demand?
County	level	mean	yields	
Supply	chain	optimization
Local	topography:	soil,	hydrology	
Sub-field	management
Crop	Architecture	
Row	Spacing	/	Orientation	
Harvesting	Equipment	
Shading	response
Spatial	Scale Questions
Opportunities		
Across	Scales
5
Outline
• Conceptual	Overview	
• Computational	Solutions	
• Crop	Models	
• PEcAn	
• TERRA	REF	
• Future	Directions
Zhu,	Lynch,	LeBauer,	Millar,	Stitt,	Long,	2015	Plant	Cell	&	Environment
6
Evan	Delucia
Starting	Point:	Conceptual	Models
7
BioCro:	Combining	Biology,	Physics,	Chemistry
Humphries	and	Long	2005	
Miguez	et	al	2009,	2012	
Jaiswal,	DeSouza,	Larsen,	LeBauer,	…		et	al	2017		
Wang,	Jaiswal,	LeBauer,	…	et	al	2015	
8
Inputs	
Meteorology	(energy,	water)	
Soil	(physics,	carbon,	nutrients)	
Parameters	(e.g.	plant	traits)	
Outputs	
Yield,	Biomass,		
Energy	Balance	
Water	Use	
Nutrient	Use
Scaling	Photosynthesis	from	Leaf	to	Canopy
[Figure axis labels: Light, Temperature, Photosynthesis]
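A minimal sketch of leaf-to-canopy scaling, under stated assumptions: light decays through the canopy following Beer's law, each layer uses a rectangular-hyperbola light response, and layer rates are summed. This is not BioCro's implementation. The temperature response is left out, and `a_max`, `k_half`, and `k_ext` are illustrative values.

```python
import math


def leaf_photosynthesis(light, a_max=30.0, k_half=400.0):
    """Rectangular-hyperbola light response (umol CO2 m-2 s-1).

    a_max and k_half are illustrative values, not fitted parameters.
    """
    return a_max * light / (light + k_half)


def canopy_photosynthesis(light_top, lai, k_ext=0.5, n_layers=10):
    """Sum leaf photosynthesis over canopy layers of equal leaf area."""
    d_lai = lai / n_layers
    total = 0.0
    for i in range(n_layers):
        # Cumulative leaf area above the middle of layer i
        lai_above = (i + 0.5) * d_lai
        # Beer's law: light decays exponentially with the leaf area above
        light = light_top * math.exp(-k_ext * lai_above)
        total += leaf_photosynthesis(light) * d_lai
    return total


top_leaf = leaf_photosynthesis(1500.0)
canopy = canopy_photosynthesis(1500.0, lai=4.0)
print(f"top leaf: {top_leaf:.1f}, canopy (LAI 4): {canopy:.1f}")
```

Shaded lower layers contribute less than the top leaf, so the canopy total is well below leaf area index times the top-leaf rate.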
9
Scaling	Up	&	Predicting	the	Future
IPCC AR5
Warszawski et al. PNAS
Temperature Precipitation
Climate	Forecasts	(2040-2050)

CMIP5: 5 climate models x 4 CO2 emissions scenarios
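The ensemble size follows from crossing models with scenarios. A small sketch, assuming the four scenarios are the four CMIP5 Representative Concentration Pathways; the slide does not name the five climate models, so placeholder names are used.

```python
from itertools import product

# Placeholder names: the slide does not say which five models were used
climate_models = [f"climate_model_{i}" for i in range(1, 6)]
# Assumption: the four scenarios are the four CMIP5 RCPs
scenarios = ["RCP2.6", "RCP4.5", "RCP6.0", "RCP8.5"]

# One crop-model run per (climate model, scenario) pair
ensemble = list(product(climate_models, scenarios))
print(len(ensemble))  # 20
```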
10
Effects	of	Climate	on	Sugarcane	Yield	in	Brazil
2040-2050 Climate Impact
(metric Tons / ha)
Jaiswal, DeSouza, Larsen, LeBauer, Miguez, Sparovek, Bollero, Buckeridge, Long, 2017
Scaling leaf-level CO2 x T x H2O response
11
Outline
• Conceptual	Overview	
• Computational	Solutions	
• Crop	Models	
• PEcAn:	Linking	Models	and	Data	
• TERRA	REF	
• Future	Directions
12
PE
Ecol
LeBauer	and	Treseder,	2008
13
Thomas	et	al	2013
Combining	Data	and	Models	Is	Hard,		
Mostly	a	Technical	Challenge
Traits
System
States
Prediction
Soil
Meteorology
Parameters
Boundary Conditions
Drivers
Publications
Primary Data
Repositories
Wild Data Relevant
Information
Configuration
Sensitivity
Calibration
Validation
Analyses
Run Model
Outputs
Just	Running	a	Model	is	Hard
Most of this work is model independent, so solutions can be shared
14
Data	Sources	 Analyses	
Ecosystem	
Models	
BioCro
ED2
CLM
SIPNET
...
n=12
The	Standard	Approach:		
Redundant,	Labor	Intensive,	Error	Prone
Converter
For	Met,	need	one	converter	per	driver	(m)	x	model	(n)	combination
Prediction
NARR
NOAA
Fluxnet
CMIP5
… m = 10
Met Station
Calibration
Sensitivity
Validation
Visualization
15
PEcAn	common	formats:	
				Many	users	use,	reuse,	test,	and	improve	components		
Common	
Format
Common	
Format
Ecosystem	
Models	
BioCro
ED2
CLM
SIPNET
...
n=12
Converter
Only	need	n+m	(not	n×m)	converters	
Less	work,	more	robust	and	valid	results
Diverse Met Data
NARR
NOAA
Fluxnet
CMIP5
… m = 10
Met Station
Analyses	
Prediction
Calibration
Sensitivity
Validation
Visualization
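A sketch of why a common format cuts the converter count from n×m to n+m. The source names, model names, and the one-field record are hypothetical and are not PEcAn's actual formats or API.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]  # hypothetical common format: one dict per time step

# m readers: one per met source, each producing the common format
readers: Dict[str, Callable[[str], List[Record]]] = {
    "source_a": lambda raw: [{"air_temp_K": float(v) + 273.15} for v in raw.split(",")],
    "source_b": lambda raw: [{"air_temp_K": float(v)} for v in raw.split()],
}

# n writers: one per model, each consuming the common format
writers: Dict[str, Callable[[List[Record]], str]] = {
    "model_x": lambda recs: "\n".join(f"{r['air_temp_K']:.2f}" for r in recs),
    "model_y": lambda recs: ";".join(f"{r['air_temp_K'] - 273.15:.1f}" for r in recs),
}


def convert(source: str, model: str, raw: str) -> str:
    """Route any source to any model through the common format."""
    return writers[model](readers[source](raw))


def converters_needed(n_models: int, m_drivers: int) -> Dict[str, int]:
    """Compare pairwise converters with converters via a common format."""
    return {"pairwise": n_models * m_drivers, "common_format": n_models + m_drivers}


print(converters_needed(12, 10))
print(convert("source_a", "model_y", "20,25"))
```

With n=12 models and m=10 met sources, that is 22 converters instead of 120.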
16
Parameter	Estimation:		
Combining	Literature	and	Field	Data
LeBauer et al, 2013
17
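A simplified stand-in for combining literature and field data: a conjugate normal update, where a literature-based prior and a field-data summary are weighted by their precision. This is not PEcAn's meta-analysis, and the numbers are illustrative.

```python
def combine_normal(prior_mean, prior_sd, data_mean, data_se):
    """Precision-weighted combination of a normal prior and a normal data summary."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / data_se ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
    post_sd = post_prec ** -0.5
    return post_mean, post_sd


# Illustrative numbers only, not values from LeBauer et al. 2013
post_mean, post_sd = combine_normal(
    prior_mean=20.0, prior_sd=5.0, data_mean=26.0, data_se=2.0
)
print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}")
```

The posterior sits between the prior and the data, closer to whichever is more precise, and is narrower than either.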
LeBauer	et	al,	2013
Given	current	data,	what	drives	uncertainty?							
3	Years,	1	crop,	1	location	
18
PEcAn	Variance	Decomposition	
Bars:	Parameter	Contribution	to	Uncertainty	in	Yield	Prediction	
Grey	=	Prior	
Black	=	Posterior	
Used	to	inform	optimal	data	collection
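A sketch of a one-at-a-time variance decomposition: vary each parameter across its distribution while the others stay at their median, then report each parameter's share of the summed output variance. The toy model and parameter names are hypothetical. PEcAn's own analysis is more involved.

```python
import random
import statistics


def toy_yield(params):
    """Stand-in for a crop model run; not BioCro."""
    return 2.0 * params["param_a"] + 0.5 * params["param_b"]


# Hypothetical parameter distributions as (mean, sd)
distributions = {"param_a": (10.0, 1.0), "param_b": (10.0, 1.0)}


def variance_decomposition(model, dists, n_samples=2000, seed=1):
    """Return each parameter's share of the summed one-at-a-time variance."""
    rng = random.Random(seed)
    medians = {name: mean for name, (mean, _) in dists.items()}
    partial = {}
    for name, (mean, sd) in dists.items():
        outputs = []
        for _ in range(n_samples):
            # Vary one parameter; hold the others at their median
            params = dict(medians, **{name: rng.gauss(mean, sd)})
            outputs.append(model(params))
        partial[name] = statistics.variance(outputs)
    total = sum(partial.values())
    return {name: value / total for name, value in partial.items()}


shares = variance_decomposition(toy_yield, distributions)
print(shares)
```

Rerunning with narrower (posterior) distributions shows which parameters still dominate, which is how the result can guide data collection.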
LeBauer	et	al,	2013
Automation	&	Reuse:	Uncertainty	analysis		
													bars	/	color	=	Parameter	Contribution	to	Predictive	Uncertainty
3	Years,	1	crop,	1	location	
19
Dietze et al, 2014
~1	Year,	8	scientists,	17	PFTs,	6	biomes
Targeted	Field	Study:	Willow	Water	Use
Wertin,	LeBauer,	Volk,	Leakey,	in	prep
Predictions
20
Before After	Data	Collection
Add Data, Configure, Run, Analyze
Making	Crop	&	Ecosystem	Models	Accessible
LeBauer	et	al	2013,	Kooper	et	al	2013,	Dietze	et	al,	2013
21
PEcAn	is	a	community	project
42	Contributors	
>	50	citations	
Textbook	
100s	of	students	trained
22
PEcAn	Radiative	Transfer	Model	Inversion
23
Ely,	Serbin,	Shiklomanov,	Dietze	and	others
PEcAn	now	provides	a	place	for	shared		
models,	data	access,	and	tools
Tools:
Web front end
PostGIS database*
Met Scaling and Gap filling
Data Ingest
Meta-Analysis*
Sensitivity & Uncertainty Analysis*
Ensemble Prediction
Parameter Data Assimilation
State Data Assimilation
Benchmarking
Visualization*
Data Modeling:
Radiative Transfer
Photosynthesis
Tree Rings
Models:
BioCro*
CABLE
CLM
DALEC
ED*
FATES
G’Day
JULES
Linkages
LPJguess
MAAT
MAESPA
PRELES
SIPNET
Data:
Literature*
Field Measurements
Expert Priors*
Meteorology
Soils
PalEON
Fluxnet
ORNL
NEON
TERRA REF*
LTER
…
github.com/pecanproject/pecan	
pecanproject.org
24
Outline
• Conceptual	Overview	
• Computational	Solutions	
• Crop	Models	
• PEcAn:	Linking	Models	and	Data	
• TERRA	REF	
• Future	Directions
25
High	Throughput	Phenotyping
• High	Throughput	Phenotyping:	
• Replace	manual	with	sensor-based	measurements	
• Measure	more	traits	with	higher	frequency	
• But	…	sensors	are	expensive	and	data	are	difficult	to	
interpret	
• The TERRA program is a major investment to push this forward
http://bulletin.ipm.illinois.edu/print.php?id=513
26
TERRA	REF
• Motivation:		
• Automated Measurements -> Stronger Inference
• Software & Data -> Framework for Interdisciplinary Collaboration
• Solutions:	
• Reference	Datasets		
• Modular	and	Interoperable		
• Open	Data,	Software,	Computing
27
A	Phenomics	Pipeline	for	Crop	Improvement
Sensors Traits Genotypes
Selection
Genomics
Higher Yield
Yield Stability
Nutrition
Stress Tolerance
and more …
Automated Measurements
Component & Aggregate
Genomic Prediction
Pan Genome
28
Diverse	Scientific	Disciplines	
Sensors Traits Genotypes
Selection
Genomics
Engineering
Robotics
Computer Vision
(Eco)Physiology
Agronomy
Biology
Breeding
Statistics & Machine Learning
29
ARPA-E	TERRA
Open	Dataset	for	Six	Projects	+	Public	Release
30
TERRA	Reference	Data	Sources
Lemnatec	Scanalyzer	
Danforth,	St.	Louis
Lemnatec	Field	Scanner	
USDA	ALRC,	Maricopa,	AZ
Tractor	and	UAV	
AZ	and	Kansas	State
31
Field	Scanner	Sensors
terraref.org/articles/lemnatec-scanalyzer-field-sensors/
VNIR Imaging Spectrometer 380-1000nm

SWIR Imaging Spectrometer 900-2500 nm
IR Temperature Sensor

NDVI (1 down, 1 up) 650, 800 nm

PRI Sensor 531, 570 nm

PAR Sensor 410-655 nm

Color Sensor 410-655 nm

3D Scanners: 2 Side View, 1 Down

RGB: 2 Side View, 1 Down (1)

Active Reflectance 670, 730, 780 nm

PS II Fluorescence
Environmental:
wind, temperature, humidity, 

light, rain, CO2
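The NDVI and PRI band pairs listed above feed standard normalized-difference indices. A minimal sketch; the reflectance values are illustrative and are not TERRA REF data.

```python
def normalized_difference(band_a, band_b):
    """(a - b) / (a + b), the form shared by NDVI and PRI."""
    return (band_a - band_b) / (band_a + band_b)


def ndvi(red_650, nir_800):
    """Normalized Difference Vegetation Index from red and near-infrared reflectance."""
    return normalized_difference(nir_800, red_650)


def pri(r_531, r_570):
    """Photochemical Reflectance Index from 531 nm and 570 nm reflectance."""
    return normalized_difference(r_531, r_570)


# Illustrative reflectance values (fraction of incident light)
print(round(ndvi(red_650=0.05, nir_800=0.45), 2))
print(round(pri(r_531=0.048, r_570=0.050), 3))
```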
32
Approach:	Integrate	Software	and	Databases
• What	do	people	currently	use?	
• What	domain	specific	software	and	databases	exist?	
• How	can	we	connect	these?	
• What	standards	&	conventions	to	adopt?
33
General	Framework	for	Cross-Domain	Links
Sensors Traits Genotypes
Selection
Genomics
Location	
Time	
Genotype	
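One way to read this slide: location, time, and genotype are the shared keys that let records from different domains be joined. A sketch with hypothetical records; the field names are illustrative and are not the TERRA REF schema.

```python
from datetime import date

# Hypothetical lookup: which genotype was planted at each location
plot_genotypes = {"plot_001": "genotype_A", "plot_002": "genotype_B"}

# Hypothetical sensor-derived trait records, keyed by location and time
trait_observations = [
    {"location": "plot_001", "time": date(2017, 6, 1), "trait": "canopy_height_cm", "value": 41.0},
    {"location": "plot_002", "time": date(2017, 6, 1), "trait": "canopy_height_cm", "value": 55.0},
    {"location": "plot_001", "time": date(2017, 6, 8), "trait": "canopy_height_cm", "value": 63.0},
]


def link_traits_to_genotypes(observations, genotypes_by_location):
    """Attach a genotype to each observation using location as the shared key."""
    return [
        dict(obs, genotype=genotypes_by_location[obs["location"]])
        for obs in observations
    ]


linked = link_traits_to_genotypes(trait_observations, plot_genotypes)
for row in linked:
    print(row["genotype"], row["time"].isoformat(), row["value"])
```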
34
Data	Formats,	Standards	&	Conventions
Sensors Traits Genotypes
Selection
CF Conventions
OGC
geoTIFF
NetCDF-CF
LAS
PEcAn
Crop Ontology
AgMIP/ICASA
BRAPI
BAM,
FASTQ,
VCF, BED,
FASTA, GFF
Genomics
35
TERRA	REF	Databases
Sensors Traits Genotypes
Selection
Genomics
36
Modular	Software
github.com/terraref
37
TERRA	REF	Pipeline
Field	measurements
Metadata
Trait	Data
Pipeline	Orchestration
Sensor	Data
Analysis	&	Development
1TB/d
<48h
Genomics
38
Data	Analysis	Environments
Any	Linux	Configuration	
+	Large	Filesystem	
+ Databases	
+ Compute
Workflows:	
  Analyze -> Share -> Publish
  Develop -> Deploy
workbench.terraref.org
39
~/data	
~/tutorials
40
Web	Application	Developed	with	NDS	
Workbench
traitvis.workbench.terraref.org
41
218	mm
Robert	Pless	
Zongyang	Li	
Solmaz	Hajmohammadi
3D	Laser	Scanner
42
%	Reflectance
10	cm
N
scan	direction
Hyperspectral	Image	at	543	nm
x
y
43
Thermal
44
Automated	Detection	Algorithms		
(Time	Series	of	Panicle	Counts)
Zongyang	Li	and	Robert	Pless
45
Geoff Morris & Zhenbin Hu, KSU
LOD (Logarithm of Odds): genes linked to trait
Genes	That	Control	Growth	Rate
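For readers new to LOD scores, a sketch of a single-marker LOD under a normal model, using the standard form (n/2) * log10(RSS0 / RSS1). The data are made up, and this is not necessarily the method used for the result shown.

```python
import math


def lod_score(phenotypes, genotypes):
    """LOD for one marker: (n/2) * log10(RSS_null / RSS_marker)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    # Null model: one mean for all lines
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    # Marker model: a separate mean for each genotype class
    rss_marker = 0.0
    for allele in set(genotypes):
        group = [y for y, g in zip(phenotypes, genotypes) if g == allele]
        group_mean = sum(group) / len(group)
        rss_marker += sum((y - group_mean) ** 2 for y in group)
    return (n / 2.0) * math.log10(rss_null / rss_marker)


# Made-up growth rates and marker calls for eight lines
growth = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2]
linked_marker = [0, 0, 0, 0, 1, 1, 1, 1]
unlinked_marker = [0, 1, 0, 1, 0, 1, 0, 1]

print(round(lod_score(growth, linked_marker), 2))
print(round(lod_score(growth, unlinked_marker), 2))
```

A marker that separates fast from slow growers gives a high LOD; one that does not gives a LOD near zero.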
67
Get	involved
• Sign	up	for	beta	release	of	software	and	data		
• terraref.org/data	
• Use	and	provide	feedback	on	software	and	data	formats	
• github.com/terraref	
• Collaborate	
• Field	measurements	
• Software	
• Algorithms	
• Colocated	Sensors
68
Outline
• Conceptual	Overview	
• Computational	Solutions	
• Crop	Models	
• PEcAn:	Linking	Models	and	Data	
• TERRA	REF	
• Future	Directions
69
Barone	et	al	2017	bioRxiv		
			“Unmet	Needs	for	Analyzing	Biological	Big	Data:	A	Survey	of	704	NSF	Principal	Investigators”		
												
Software	Carpentry
XSEDE.org,	Shared	Clusters
Training	is	the	bottleneck
70
Introduction	to	data	science,	
with	examples	and	projects	
from	TERRA	REF
Hackathons	and	Training
71
Arkansas	State	University		
Iowa	State	University	
Purdue	University		
University	of	Arizona		
University	of	Illinois	
University	of	Nebraska	
University	of	Arkansas
Topp	et	al,	unpublished
72
Sensor	Modeling	and	Model	Coupling
Topp	et	al.	unpublished
73
Modular	Model	Components
Zhu,	Lynch,	LeBauer,	Millar,	Stitt,	Long,	2015	Plant	Cell	&	Environment	
Marshall-Colon et al 2017 Frontiers in Plant Science
cropsinsilico.org
Each	component	represents	>=	1	hypothesis.	
Each	parameter	or	output	can	be	treated	as	a	
phenotype	
Environmental	drivers	can	be	integrated	over	
to	address	GxE
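A toy sketch of treating model output as a phenotype and integrating over environments to address GxE. The growth function, trait values, and temperatures are all hypothetical.

```python
import statistics


def toy_growth(trait_value, temperature_c):
    """Stand-in for a modular crop model; the response shape is illustrative."""
    optimum = 20.0 + trait_value  # the trait shifts the temperature optimum
    return max(0.0, 10.0 - 0.1 * (temperature_c - optimum) ** 2)


genotypes = {"genotype_A": 0.0, "genotype_B": 8.0}  # hypothetical trait values
environments = [18.0, 22.0, 26.0]                   # hypothetical site temperatures (C)

# G x E table: one model output (phenotype) per genotype and environment
gxe = {
    name: [toy_growth(trait, env) for env in environments]
    for name, trait in genotypes.items()
}
# Integrating (here: averaging) over environments gives expected performance
expected = {name: statistics.mean(values) for name, values in gxe.items()}

print(gxe)
print(expected)
```

The ranking of the two genotypes flips between the coolest and warmest site, which is the interaction that averaging over environments summarizes.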
74
Purdue	Phenomics	&	IoT	Platforms
• Develop	Cyberinfrastructure	
• Make	data	useable	
• Facilitate	interdisciplinary	research		
• Assess	existing	capabilities,	current	roadblocks,	future	needs	
• Work	with	Library,	RCAC,	faculty	to	facilitate	data	publishing	
• QA/QC	
• Community	Standards	and	Common	Interfaces
75
Funding:		
NSF	Advances	in	Biological	Infrastructure	
USDA	NIFA	Food	and	Agriculture	Cyberinformatics	and	Tools
Agricultural	Technology
Once we understand how these systems work, we can engineer for
ecosystem services rather than solely for yield:
• Climate	control	
• Soil	improvement,	carbon	storage		
• Roots,	mycorrhizae,	microbiome	
• Pharmaceuticals		
• Petrochemical	Substitutes	
• …	anything	plants	can	do
NASA Ames Research
76
Todd	Mockler Project	Lead
Nadia	Shakoor Project	Director
Noah	Fahlgren Phenotyping	&	Bioinformatics
Erica	Fishel Technology	Transfer
Solmaz	Hajmohammadi Sensor	Fusion
Stephen	Kresovich Breeding
Jeremy	Schmutz Sequencing
Geoff	Morris Gene-trait	Associations
William	Rooney Breeding
Pedro	Andrade-Sanchez Agronomy	&	Phenomics
Michael	Ottman Physiology
Maria	Newcomb Field	Measurements
Jeff	White Agronomy
David	LeBauer Informatics	&	Computing
Robert	Pless Image	Analysis
Roman	Garnett Prediction	Algorithms
Wasit Wulamu Sensing & Physiology
Max	Burnette
Craig	Willis
Rob	Kooper
Jeff Terstriep
Zongyang	Li
Zhenbin	Hu
Nick	Heyek
Charlie	Zender
Henry	Butowsky
Team
77
• Mike	Dietze,	Boston	University	
• David	LeBauer,	University	of	Illinois	
• Shawn	Serbin,	Brookhaven	National	Lab	
• Ankur	Desai,	University	of	Wisconsin	
• Kenton	McHenry,	National	Center	for	Supercomputing	Applications	
• and	many	other	user/contributors
78
David	LeBauer	
		dlebauer@illinois.edu	
TERRA	REF	
		terraref.org	
		github.com/terraref	
		@terra_ref	
PEcAn	Project	
		pecanproject.org	
		github.com/pecanproject	
		@pecanproject
79
