Slides from a talk given at GraphConnect San Francisco, 21 October 2015
http://graphconnect.com/speaker/tim-williamson/
Video of this talk can be found on YouTube:
https://youtu.be/6KEvLURBenM
Abstract:
Modern agriculture has seen only four major transformations in the last century, starting with the hybridization of crops such as corn and the development of biotech traits, both of which dramatically improved farm productivity and profitability. More recently, the application of molecular techniques to crop development, combined with a nondestructive seed sampling process called seed chipping, has increased the rate of yield gain in new hybrids and varieties of row crops such as corn, soybeans, and cotton. The agricultural industry is currently in the midst of an information revolution that will enable farmers globally to meet the growing need for food, fuel, and fiber as the world population climbs to 10 billion and a greater fraction shifts to an animal-based diet. This information revolution requires the near-real-time integration of multiple disparate data sources, including ancestry, genomic, market, and grower data. Each of these data sources spans one or more decades and is complex in its own right. An example is the movement of seeds through the product development pipeline, beginning at the earliest recorded discovery breeding cross and ending with the most recent commercialized products. Historically, the constraints of modeling and processing this data within a relational database have made drawing inferences from this dataset complex and computationally infeasible at the scale required for modern analytics uses such as prescriptive breeding and genome-wide selection.
In this talk we present how we leveraged a polyglot environment, with a graph database implemented in Neo4j at the core, to enable this shift in agricultural product development. We will share examples of how the transformation of our genetic ancestry dataset into a graph has replaced months of computational effort. Our approach to polyglot persistence will be discussed via our use of a distributed commit log, Apache Kafka, to feed our graph store from sources of live transactional data. Finally, we will touch upon how we are using these technologies to annotate our genetic ancestry dataset with molecular genomics data in order to build a computational platform capable of imputing the genotype of every seed produced during new product development.
Genomic selection and systems biology – lessons from dairy cattle breeding, by John B. Cole, Ph.D.
Presentation made to the staff of Keygene, NV, in Wageningen, The Netherlands.
(I don't know what the problem is with the template here. It looks fine if you use a dark background.)
A talk for an agriculture audience to introduce them to new gene editing technologies and how they are changing plant and animal breeding. Presented June 6, 2016 at GCREC in Balm, FL
Using the Semantic Web to Support Ecoinformatics (ebiquity)
We describe our on-going work in using the semantic web in support of ecological informatics, and demonstrate a distributed platform for constructing end-to-end use cases. Specifically, we describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which allows scientists to semi-automatically construct distributed datasets relevant to the queries they want to ask. ELVIS functionality is exposed as a collection of web services, and all input and output data is expressed in OWL, thereby enabling its integration with Triple Shop and other semantic web resources.
Computational approaches to study Genetics (Arithmer Inc.)
Slides for an Arithmer Seminar given by Dr. Jeffrey Fawcett (RIKEN) at Arithmer Inc.
The topic is how data science is used in genetics, especially in analyzing the thoroughbred gene pool.
The Arithmer Seminar is held weekly; professionals from inside and outside the company give lectures on their respective areas of expertise.
These slides were made by an outside lecturer and are shared here with his permission.
Arithmer Inc. is a mathematics company that grew out of the University of Tokyo Graduate School of Mathematical Sciences. We apply modern mathematics to introduce advanced AI systems into solutions across a wide range of fields. Our job is to work out how to use AI effectively to improve work efficiency and to produce results that are useful to society.
Project Unity: The Way of the Future for Plant Breeding (Phenome Networks)
Project Unity is a platform that will host all phenotype-to-genotype public-domain data in a common and unified platform, offered as a free service to academia. Each researcher will be able to load their data and connect it to existing global knowledge by linking traits to ontologies, markers to genetic/physical maps, and germplasm to pedigrees and their sources. Initially, each dataset is stored privately and can only be accessed by the researcher comparing their results to public ones. Data is made public once the researcher decides to do so, typically after publication of the corresponding scientific paper.
Inference and informatics in a 'sequenced' world (Joe Parker)
A short lecture relating my recent work on real-time phylogenomics, its implications for bioinformatics research, and future directions of genomic/phylogenetic modelling to explicitly account for phylogeny, synteny, and identity through coloured graphs.
University of Reading, 2nd August 2017
Managing Genetic Ancestry at Scale with Neo4j and Kafka – StampedeCon 2015 (StampedeCon)
At the StampedeCon 2015 Big Data Conference: The global Monsanto R&D pipeline produces millions of new plant populations every year, each of which contributes to a dataset of genetic ancestry spanning several decades. Historically, the constraints of modeling and processing this data within an RDBMS have made drawing inferences from this dataset complex and computationally infeasible at large scale. Fortunately, the genetic history of any plant population forms a naturally occurring directed acyclic graph, a property that has allowed us to use graph theory to re-imagine how ancestral lineage data is modeled, stored, and queried.
In this talk we present our solutions to these problems, as realized using a graph-based approach within Neo4j. We will discuss our learnings around using Neo4j in a production setting that includes transactional and high-throughput computation, including how we transitioned from recursive JOIN queries to using Cypher and the Neo4j traversal framework to take full advantage of index-free adjacency. Our approach to polyglot persistence will be discussed via our use of a distributed commit log, Apache Kafka, to feed our graph store from sources of live transactional data. Finally, we will touch upon how we are using these technologies to annotate our genetic ancestry dataset with molecular genomics data in order to build a pipeline-scale genotype imputation platform with core algorithms built using Apache Spark.
Partnering on crop wild relative research at three scales: commonalities for ... (CWRofUS)
The potential for crop wild relatives (CWR) to contribute to crop improvement is growing due to improvements in information on species and their diversity, advancements in breeding tools, and the growing need for exotic genetic diversity to address compounding agronomic challenges. As wild plants, CWR are subject to a myriad of human-caused threats to natural ecosystems, and their representation ex situ is often far from comprehensive. Ex situ conservation of many of these wild plants is also technically challenging, particularly in an environment of insufficient resources. Enhancing conservation, availability, and access to CWR requires a spectrum of action spanning basic and applied research on wild species to inform on-the-ground collecting, ex situ maintenance, and germplasm utilization. The development of effective information channels and productive partnerships between diverse organizations is essential to the success of these actions. Here we report on a spectrum of CWR activities involving broad partnerships, at three levels: a) the collaborative compilation and distribution of over 5 million occurrence data records on the CWR of major food crops, b) the analysis of conservation concerns and genetic resource potential of the CWR of potato, sweetpotato, and pigeonpea, and c) ongoing efforts to map the diversity and conservation concerns for CWR in the USA. Although differing in scale and depth of collaboration, the success of these initiatives is largely due to commonalities in research orientation, e.g., inclusiveness, offering clear incentives for involvement, and providing services to the crop science community.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay – Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
1. Graphs are Feeding the World
Tim Williamson (@TimWilliate), Data Scientist, Monsanto
2. Our Growing Planet Faces Difficult Challenges
• Rising Population – Growing enough for a growing world. Global population: 4.4B in 1980, 7.1B today, 9.6B+ in 2050.
• Limited Farmland – Farmers will need to produce enough food with fewer resources to support our world population. Acres per person: 1 in 1961, <1/3 in 2050.
• Changing Economies and Diets – A growing global middle class is choosing animal protein – meat, eggs, and dairy – as a larger part of their diet. Dietary percentage of protein: 9% in 1965, 14% in 2030.
• Changing Climate – Farmers are impacted by climate change in many ways: water availability issues, increasingly unpredictable weather, insect range expansion, weed pressure changes, crop disease increases, and planting zone shifts.
Sources: http://esa.un.org/unpd/wpp/; UN FAO Food Balance Sheet, "World Health Organization Global and regional food consumption patterns and trends"; The World Bank, Food and Agriculture Organization of the United Nations (FAO-STAT); Monsanto internal calculations. @TimWilliate #MonDataScience
3. Improved Genetic Gain is One of Several Tools Humanity has to Address These Challenges
• 8 commodity crops and 18 vegetable crop families, sold in 160 countries
[Chart: Average US Corn Yield, 1866–2014 – Yield (Bushels/Acre), 0–180, by Year, 1865–2015; annotated "10,000 Years"]
Sources: http://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables.aspx
4. Genetic Gain is Created Through Breeding Cycles
[Diagram: a cross (X) between two parents produces progeny, each characterized by lab data (genotypes) and field data (phenotypes) through screening and field trials. All progeny of two parents enter; the best one leaves to become a future parent. Select the best, discard the rest.]
• 1000's of crosses/year
• Dozens of progeny/cross
• 5–10 locations/progeny
• $3–5 million/year
5. Every Breeding Cycle Extends a Tree of Genetic Ancestry
[Diagram: parents A and B are crossed to produce C, adding node C – with parent edges to A and B – to the ancestry tree.]
7. Forcing Genetic Ancestry Data into Rows and Columns
• In our relational store, genetic ancestry data was spread across a hierarchy of ~11 tables representing a total of ~895 million rows
• Every read became an unpleasant exercise in CONNECT BY PRIOR
[Schema: a Plant table (plant id, attributes…) and a Plant:Plant Relationship table (plant id, parent plant id, parental role)]
8. Given a Starting Population, Return All Ancestors
[Chart: Response Time (s), 0–30, vs. Depth, 1–15, for SQL on Oracle Exadata]
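Only the chart's axes survive above, but the operation it measures is easy to sketch. Here is a minimal, hypothetical Python version of the ancestor query: the Plant:Plant Relationship table is treated as an adjacency map, and ancestors are gathered breadth-first. In an RDBMS each additional level of depth is another self-join over the relationship table, which is what drives the response-time curve; a graph store with index-free adjacency just walks edges. The pedigree data here is a toy example, not production data.

```python
from collections import deque

def ancestors(start_ids, parents, max_depth=None):
    """Breadth-first walk up a pedigree.

    start_ids: iterable of plant ids for the starting population.
    parents:   dict mapping plant id -> set of parent plant ids
               (the Plant:Plant Relationship table as an adjacency map).
    Returns the set of all ancestor ids reachable from start_ids.
    """
    seen, queue = set(), deque((pid, 0) for pid in start_ids)
    while queue:
        pid, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in parents.get(pid, ()):
            if parent not in seen:      # pedigrees are DAGs, so guard
                seen.add(parent)        # against revisiting shared ancestors
                queue.append((parent, depth + 1))
    return seen

# Toy pedigree: C was bred from A and B; D from C and B.
pedigree = {"C": {"A", "B"}, "D": {"C", "B"}}
print(sorted(ancestors({"D"}, pedigree)))   # ['A', 'B', 'C']
```

The `max_depth` cap mirrors the x-axis of the chart: each extra unit of depth is one more round of joins in the SQL formulation, but only one more ring of edge hops here.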
14. Ancestry-as-a-Service is Released September 2014
REST API (Ancestry-as-a-Service), serving data scientists and application developers:
• >30 elements of RESTful grammar
• ~120 applications and data scientists
• >600 million REST requests
• 10x performance boost
• 1-month analysis now takes 3 hours
15. Real-Time Reads Require Real-Time Data
• Ingestion volume is ~10 million writes/day (not a write-heavy flow)
• https://github.com/MonsantoCo/goldengate-kafka-adapter
Field + lab applications emit change records such as:
{
  "table": "foo",
  "type": "INSERT",
  "columns": [
    {
      "name": "bar",
      "before": "fizz",
      "after": "buzz"
    }
  ]
}
which are applied to the REST API (Ancestry-as-a-Service):
POST /population
PUT /population/1234
PUT /population/parents
DELETE /population
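A downstream consumer has to translate each change record from Kafka into one of the REST calls above. The sketch below shows one plausible shape of that translation step; the table-to-resource map, the `row_id` field, and the verb mapping are illustrative assumptions, not the actual schema of the goldengate-kafka-adapter.

```python
# Map a GoldenGate-style change record onto a REST verb, path, and body.
# The "population" resource mapping and the row_id convention are
# hypothetical, chosen only to mirror the verbs shown on the slide.

VERB_FOR = {"INSERT": "POST", "UPDATE": "PUT", "DELETE": "DELETE"}
RESOURCE_FOR_TABLE = {"foo": "population"}   # illustrative table -> resource map

def to_rest_call(record):
    verb = VERB_FOR[record["type"]]
    path = f"/{RESOURCE_FOR_TABLE[record['table']]}"
    if record["type"] == "UPDATE":           # updates address a single resource
        path += f"/{record['row_id']}"
    # Body carries the post-change column values.
    body = {c["name"]: c["after"] for c in record.get("columns", [])}
    return verb, path, body

record = {
    "table": "foo",
    "type": "INSERT",
    "columns": [{"name": "bar", "before": "fizz", "after": "buzz"}],
}
print(to_rest_call(record))   # ('POST', '/population', {'bar': 'buzz'})
```

In production this logic would sit in a Kafka consumer loop, turning the commit log into a stream of idempotent writes against the graph-backed service.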
17. Layering Genotype Data Over Ancestry Trees
Genotype nodes act as simple pointers to remote systems which store the raw data.
[Graph model: node labels :Plant, :Plant Inventory, :Planting, :Selection, and :Genotype, connected by relationship types :PARENT, :PLANTED, :SELECTED, :HARVESTED, :INVENTORY, and :HAS_GENOTYPE]
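Only the node labels and relationship types survive from the slide's diagram, so here is a minimal in-memory sketch of that model: plants linked by :PARENT edges, with :HAS_GENOTYPE edges pointing at genotype nodes that hold only a reference into a remote raw-data store. The property names and the `genostore://` URI scheme are illustrative assumptions; in production this model lives in Neo4j.

```python
# A tiny property-graph sketch of the slide's model: nodes carry labels and
# properties, and Genotype nodes store only a pointer to a remote system.
nodes = {
    1: {"labels": {"Plant"}, "props": {"plant_id": "P-001"}},
    2: {"labels": {"Plant"}, "props": {"plant_id": "P-002"}},
    3: {"labels": {"Genotype"},                            # pointer, not raw data
        "props": {"remote_ref": "genostore://runs/42"}},   # illustrative URI
}
edges = [
    (2, "PARENT", 1),          # P-002 descends from P-001
    (2, "HAS_GENOTYPE", 3),    # P-002's raw genotype lives elsewhere
]

def genotype_refs(node_id):
    """Follow HAS_GENOTYPE edges and return the remote pointers."""
    return [nodes[dst]["props"]["remote_ref"]
            for src, rel, dst in edges
            if src == node_id and rel == "HAS_GENOTYPE"]

print(genotype_refs(2))   # ['genostore://runs/42']
```

Keeping genotypes as pointers keeps the graph small and traversal-fast while the bulky marker data stays in a system built for it.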
19. Estimate the Genotype of Every Seed Produced
[Architecture: field + lab applications feed genotypes through the REST API (Ancestry-as-a-Service); a Genotype Estimation Engine consumes genotype-annotated ancestry trees and the required genotype datasets, and publishes its results back as new estimated-genotype messages.]
20. Let's Revisit the Flow of a Breeding Cycle
[Diagram: the same breeding cycle, now with hi-res genotypes estimated for every progeny. All progeny of two parents enter; the best one leaves to become a future parent. Select the best, discard the rest – now via genome-wide selection.]
• 1000's of crosses/year
• Dozens of progeny/cross
• 1 genotype/progeny
• < $1 million/year
• Width of pipeline increases to accommodate more crosses
21. A Glimpse Inside Our Active 'Graphy' Work
Sources: http://biodiversitylibrary.org/page/27066167#page/125/mode/1up
22. Constructing Coancestry Matrices
[Pedigree: A is the common ancestor; B and C are progeny of A; D and E are progeny of B; F and G are progeny of C.]

Coancestry(A):
      A     B     C     D     E     F     G
A     1    0.5   0.5   0.25  0.25  0.25  0.25
B          1     0     0.5   0.5   0     0
C                1     0     0     0.5   0.5
D                      1     0     0     0
E                            1     0     0
F                                  1     0
G                                        1

• Consider a reduced ancestor tree only between crosses
• A progeny inherits 50% of its genetics from each parent
• Key input for a large class of predictive genetic analysis algorithms
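The matrix above counts only sharing through direct descent: parent–offspring pairs score 0.5, grandparent–grandchild pairs 0.25, and collateral pairs such as B and C score 0. Under that reading – halve at every generation, sum over ancestor paths, which is inferred from the slide's numbers rather than a stated algorithm – the matrix can be reproduced from the reduced pedigree:

```python
from fractions import Fraction

# Reduced pedigree from the slide: B and C are progeny of A,
# D and E of B, F and G of C (the other parent of each cross is
# outside A's tree and is therefore omitted).
parents = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"], "F": ["C"], "G": ["C"]}

def descent(anc, node):
    """Expected fraction of `anc`'s genome in `node` via direct descent:
    halved at every generation, summed over all ancestor paths."""
    return sum(Fraction(1, 2) * (1 if p == anc else descent(anc, p))
               for p in parents.get(node, []))

def coancestry(names):
    # Symmetric: one of descent(i, j) / descent(j, i) is nonzero at most,
    # and collateral pairs (no directed path either way) come out as 0.
    return {(i, j): (Fraction(1) if i == j else descent(i, j) + descent(j, i))
            for i in names for j in names}

M = coancestry("ABCDEFG")
print(float(M["A", "B"]), float(M["A", "D"]), float(M["B", "C"]))
# 0.5 0.25 0.0
```

Exact rationals (`Fraction`) keep the halving per generation free of floating-point drift, which matters when these coefficients feed downstream predictive-genetics algorithms.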