Managing Genetic Ancestry at Scale
Jason Clark (@jclark1985)
Copyright 2015 Monsanto Company

Managing Genetic Ancestry at Scale with Neo4j and Kafka - StampedeCon 2015

At the StampedeCon 2015 Big Data Conference: The global Monsanto R&D pipeline produces millions of new plant populations every year, each of which contributes to a dataset of genetic ancestry spanning several decades. Historically, the constraints of modeling and processing this data within an RDBMS have made drawing inferences from this dataset complex and computationally infeasible at large scale. Fortunately, the genetic history of any plant population forms a naturally occurring directed acyclic graph, a property that has allowed us to utilize graph theory to re-imagine how ancestral lineage data is modeled, stored, and queried.
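
To make the relational side of this concrete, the following is a minimal sketch of an ancestry lookup against a two-table model, using Python's built-in `sqlite3` module and a recursive common table expression. The table and column names (`plant`, `plant_relationship`, `plant_id`, `parent_id`), the sample rows, and the helper `all_ancestors` are assumptions made for illustration; they are not the production schema or query.

```python
import sqlite3


def all_ancestors(conn, plant_id):
    """Return (ancestor_id, shortest_depth) rows for every ancestor of plant_id.

    Assumes the relationship data is acyclic (a DAG); on cyclic data this
    recursive query would not terminate.
    """
    query = """
        WITH RECURSIVE ancestry(ancestor_id, depth) AS (
            SELECT parent_id, 1
            FROM plant_relationship
            WHERE plant_id = ?
            UNION
            SELECT r.parent_id, a.depth + 1
            FROM plant_relationship AS r
            JOIN ancestry AS a ON r.plant_id = a.ancestor_id
        )
        SELECT ancestor_id, MIN(depth)
        FROM ancestry
        GROUP BY ancestor_id
        ORDER BY ancestor_id
    """
    return conn.execute(query, (plant_id,)).fetchall()


# Hypothetical sample data: plant 3 has parents 1 and 2; plant 5 has parents 3 and 4.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plant (plant_id INTEGER PRIMARY KEY);
    CREATE TABLE plant_relationship (plant_id INTEGER, parent_id INTEGER);
    CREATE INDEX idx_rel_plant ON plant_relationship (plant_id);
""")
conn.executemany("INSERT INTO plant VALUES (?)", [(i,) for i in range(1, 6)])
conn.executemany(
    "INSERT INTO plant_relationship VALUES (?, ?)",
    [(3, 1), (3, 2), (5, 3), (5, 4)],
)

ancestors = all_ancestors(conn, 5)
print(ancestors)
```

Every recursive step joins back through the index on `plant_relationship`, so the work grows with both the depth of the ancestry and the size of the table, which is the kind of cost the talk describes.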

In this talk we present our solutions to these problems, as realized using a graph-based approach within Neo4j. We will discuss what we learned from using Neo4j in a production setting that includes transactional and high-throughput computation, including how we transitioned from recursive JOIN queries to using Cypher and the Neo4j traversal framework to take full advantage of index-free adjacency. Our approach to polyglot persistence will be discussed via our use of a distributed commit log, Apache Kafka, to feed our graph store from sources of live transactional data. Finally, we will touch upon how we are using these technologies to annotate our genetic ancestry dataset with molecular genomics data in order to build a pipeline-scale genotype imputation platform with core algorithms built using Apache Spark.
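
Index-free adjacency can be illustrated with a small in-memory analogy: one dictionary lookup locates the starting material, and every further hop follows a direct reference instead of consulting an index again. This is a conceptual sketch only, not Neo4j's storage engine or API; the `Material` class, the `ancestors` function, and the sample data are hypothetical.

```python
class Material:
    """A node that holds direct references to its parent nodes."""

    def __init__(self, material_id):
        self.id = material_id
        self.parents = []  # direct object references, no index lookup needed


def ancestors(start):
    """Collect every ancestor of `start` by following parent references."""
    seen = {}
    stack = list(start.parents)
    while stack:
        node = stack.pop()
        if node.id not in seen:  # a DAG can reach the same ancestor twice
            seen[node.id] = node
            stack.extend(node.parents)  # constant-time hop per relationship
    return seen


# Hypothetical sample data: 3 has parents 1 and 2; 5 has parents 3 and 4.
index = {i: Material(i) for i in range(1, 6)}
for child_id, parent_id in [(3, 1), (3, 2), (5, 3), (5, 4)]:
    index[child_id].parents.append(index[parent_id])

start = index[5]  # the single "index hit" that finds the starting point
result = sorted(ancestors(start))
print(result)
```

The cost of the traversal depends on the number of relationships actually visited, not on the total number of materials stored.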


  1. Managing Genetic Ancestry at Scale. Jason Clark, @jclark1985. Copyright 2015 Monsanto Company
  2. Food is a looming issue as populations rise and farm acres shrink. By 2050, the world will grow by 2 billion people; that's as many people as there are currently in North and South America combined, TWICE!
  3. Breeding for a Better Harvest. Approaches to make crops yield better under dwindling resources require huge advances in breeding. [Graphic labels: FEED, FOOD, 10K YEARS]
  4. Plant Breeding in a Nutshell
  5. Tracking Our Plant Ancestry. Plant table: Plant ID (1, 2, 3) plus attributes. Plant Relationship table (Plant ID, Parent ID): (3, 1), (3, 2). [Chart: # of Inserts (M) by year, 2005 to 2015]
  6. Our R&D pipeline can be cyclical: 8-10 years
  7. Ask a question about an Ancestry: Can you return to me all ancestors of a given plant?
  8. It's Complicated. This is a single breeding line!
  9. Our reads do not scale… [Chart: Response (s)] At a depth of 15, we killed the query at 1.5 hours.
  10. Database indexes do not help. Identifying each set of related materials potentially requires a full scan of an index: O(m log n)
  11. Ask a question about an Ancestry: Can you return to me all ancestors of a given plant?
  12. Index Free Adjacency (IFA). A single index hit finds my starting point; all other relationship identification is O(1)
  13. We were looking for… Something that can accurately represent the domain model
  14. We were looking for… Query performance to remain near constant as we ask questions about particular plants
  15. We were looking for… Something that easily lends itself to TDD
  16. We were looking for… Ideally open source with a low barrier to entry
  17. [No text on slide]
  18. VS. ~700M nodes, ~1.2B relationships
  19. Ask a question about an Ancestry: Can you return to me all ancestors of a given plant?
  20. Enabling Innovation. Providing the ability to consume raw trees gives our consumers a way to leverage the power of the Graph Database on top of our ancestry grammar. Team identified a basic set of features and codified patterns to identify important features in an Ancestry, derived at query time. • Return "raw" ancestral trees to consumers • Allow on-demand pruning of raw trees • Promote language consistency across business consumers
  21. In RESTful Style. /materials/1/parents { "nodes": [ { "id": 1, "attr1": "foo" }, { "id": 2, "attr1": "bar" } ], "relationships": [ { "from": 1, "to": 2, "relation": "PARENT" } ] }
  22. Predefined Ancestral Milestones. Given where I am on the Earth now, where is the closest sandwich shop "X"? Team identified a basic set of features and codified patterns to identify important features in an Ancestry, derived at query time. • Traverse raw crossing records at query time • Derivation at query time allows patterns to more easily adapt to changes in business process • Prevents data decay
  23. Binary Cross Milestone. GET /materials/5/binary-cross { "male": { "id": 1 }, "female": { "id": 2 } }
  24. Let's ask a more complex question. Do any ancestors of a given plant show a strong resistance to a particular disease? Who are the first of my ancestors to immigrate from Germany and Ireland to America?
  25. Decorating the Ancestry. Genotype (G) nodes act as simple pointers to remote systems
  26. Ask a complex question. /materials/1/parents?until=genotyped-ancestor&props=genotypes { "nodes": [ { "id": 1 }, { "id": 2 }, { "id": 3, "genotypes": [ { "id": 1234 } ] } ], "relationships": [ { "from": 1, "to": 2, "relation": "PARENT" }, { "from": 2, "to": 3, "relation": "PARENT" } ] }
  27. Architecture. Informing our Ancestry backbone of additional data that identify significant events in a line's history allows our APIs to evolve and adapt as our agronomic practices change.
  28. Take this with you… • Untie yourself from your database indexes • Let Neo4j do the heavy lifting • Value added even as non system of record • Keep the storage model as close to mental model as possible
  29. Thank You. engineering.monsanto.com discover.monsanto.com
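
The response shapes shown on slides 21 and 26 can be approximated with a short traversal sketch. It walks parent relationships breadth-first, emits `nodes` and `relationships` in the same shape as the slides, and optionally prunes a branch once an ancestor satisfies a milestone predicate, loosely mirroring `until=genotyped-ancestor`. The function name `ancestry_payload`, its parameters, and the in-memory dictionaries are assumptions for illustration; the actual service is backed by Neo4j and its implementation is not shown in the slides.

```python
from collections import deque


def ancestry_payload(material_id, parents_of, props_of=None, stop_at=None):
    """Build a {"nodes": [...], "relationships": [...]} payload for an ancestry.

    parents_of: dict mapping a material id to a list of parent ids (hypothetical store).
    props_of:   dict mapping a material id to extra properties to include.
    stop_at:    optional predicate on a node dict; a matching ancestor is
                included in the result but its own parents are not explored.
    """
    props_of = props_of or {}
    nodes, relationships = [], []
    seen = {material_id}
    queue = deque([material_id])
    while queue:
        current = queue.popleft()
        node = {"id": current, **props_of.get(current, {})}
        nodes.append(node)
        # Prune this branch once an ancestor satisfies the milestone predicate.
        if stop_at is not None and current != material_id and stop_at(node):
            continue
        for parent_id in parents_of.get(current, []):
            relationships.append(
                {"from": current, "to": parent_id, "relation": "PARENT"}
            )
            if parent_id not in seen:
                seen.add(parent_id)
                queue.append(parent_id)
    return {"nodes": nodes, "relationships": relationships}


# Hypothetical data: a chain 1 -> 2 -> 3 -> 4 where only material 3 is genotyped.
parents_of = {1: [2], 2: [3], 3: [4]}
props_of = {3: {"genotypes": [{"id": 1234}]}}

raw_tree = ancestry_payload(1, parents_of, props_of)
pruned = ancestry_payload(
    1, parents_of, props_of, stop_at=lambda node: bool(node.get("genotypes"))
)
print(pruned)
```

Passing the milestone as a predicate keeps the raw traversal and the pruning rule separate, in the spirit of deriving milestones at query time instead of storing them.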
