Agile India breakout session on serverless applications. This talk covers how AWS serverless infrastructure can be used for a wide range of applications, such as compute-intensive tasks (GT-Scan), tasks requiring continuous learning (CryptoBreeder), and data-intensive tasks (GenPhen DB).
Going Server-less for Web-Services that need to Crunch Large Volumes of Data
1. Going Server-less for Web-Services that need to Crunch Large Volumes of Data
HEALTH & BIOSECURITY
Dr Denis Bauer | Bioinformatics | @allPowerde
9 Mar 2018 – Continuous Delivery and DevOps Day, Agile India
2. Overview
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
• CSIRO: An agile government research organization.
• GT-Scan2: A serverless web-service for complex research workflows.
• GenPhen DB: A serverless system for large data.
• Cryptobreeder: A serverless system that continuously learns. (Not CSIRO-funded)
3. Team CSIRO
• 5319 talented staff
• $1 billion+ budget
• Working with 2800+ industry partners
• 55 sites across Australia
• Top 1% of global research agencies
• Each year 6 CSIRO technologies contribute $5 billion to the economy
4. Australia’s innovation catalyst
• Extended-wear contacts
• Polymer banknotes
• Relenza flu treatment
• Fast WLAN (Wireless Local Area Network)
• Aerogard
• Total Wellbeing Diet
• RAFT polymerisation
• BARLEYmax™
• Self-twisting yarn
• Softly washing liquid
• Hendra vaccine
• NOVACQ™ prawn feed
5. Overview
6. Recruiting instantaneous appropriately powered compute
                          | Desktop compute | High-performance compute | Hadoop/Spark   | Serverless
Focus                     | small data      | Compute-intensive        | Data-intensive | Agility
Fault tolerant            | No              | No                       | Yes            | (Yes)
Node-bound                | Yes             | Yes                      | No             | No
Parallelization           | 10 CPU          | 100 CPU                  | 1000 CPU       | 1000 CPU
Parallelization procedure | bespoke         | bespoke                  | standardized   | standardized
Overhead in the cloud     | NA              | spin-up lag              | spin-up lag    | instantaneously
CSIRO solution
7. Ideal application case for serverless:
Small tasks…
• Embarrassingly parallel tasks
…that need to scale
• Unpredictable burstable workload that needs to be delivered online
Agility + Scalability =
8. Genome Editing (CRISPR) can correct genetic diseases, such as hypertrophic cardiomyopathy. However, editing does not work every time, e.g. only 7 in 10 embryos were mutation-free.
Aim: Develop a computational guidance framework to enable edits the first time, every time.
Ma et al. Nature 2017 *
* Some controversy around the paper
12. Interoperable Workflows
• Programmable call to GT-Scan2 (API)
• Automatic result retrieval to notebook environment
• Seamless and reproducible access to tertiary analytics.
This notebook shows the genome engineering workflow: finding a specific target site and then retrieving the result from GT-Scan for direct visualization.
GenEng: Reproducible Genome Engineering
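The programmable call and automatic result retrieval listed on this slide could look roughly like the sketch below when run from a notebook. The slides do not give the GT-Scan2 API, so the URL, the `/jobs` routes, and the `id`, `status`, and `targets` fields are all hypothetical; only the submit-then-poll pattern is the point.

```python
import json
import time
import urllib.request

# Hypothetical endpoint: the slides do not give the real GT-Scan2 API URL.
API_URL = "https://example.org/gt-scan2/api"


def http_json(url, payload=None):
    """Send a JSON request (POST if payload is given, else GET) and decode the reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_job(gene, send=http_json, poll_seconds=5.0, max_polls=60):
    """Submit a job, then poll until the (assumed) status field reports completion."""
    job = send(f"{API_URL}/jobs", {"gene": gene})
    for _ in range(max_polls):
        result = send(f"{API_URL}/jobs/{job['id']}")
        if result["status"] == "done":
            return result["targets"]  # assumed field holding the target sites
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job['id']} did not finish after {max_polls} polls")
```

Passing the transport in as `send` keeps the workflow reproducible: the same notebook cell can run against a recorded response instead of the live service.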
14. Serverless systems are hard to optimize
• Pay only for what you use
-> Optimize to use as little as possible
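To make "pay only for what you use" concrete, here is a minimal cost estimate for AWS Lambda, which bills per request and per GB-second of execution. The two prices are assumed list prices (x86, on-demand) and may differ by region or over time; the free tier is ignored.

```python
# Assumed list prices (x86, on-demand); check the current AWS pricing page.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def lambda_cost(invocations, avg_duration_s, memory_mb):
    """Estimate the cost in USD of a batch of Lambda invocations (free tier ignored)."""
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST
```

Because the compute part scales with duration times memory, shortening a function's runtime or lowering its memory setting reduces the bill proportionally, which is why per-function optimization matters.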
15. GT-Scan2 X-Ray Analysis
[Figure: runtime (s) per function, by Type (base vs. old). Functions: getFastaSequence, createJob, targetScan, offtargetScanStarter, offtargetSearch, targetIntersects, targetTranscriptionIntersects, targetWuScorer, targetSgRNAScorer, OnTargetScorer, genomeCRISPR]
16. Results – 4x Faster (80% improvement)
2 min -> 30 sec
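The two ways of stating the result (a speedup factor and a percentage improvement) can be checked with a few lines of arithmetic. The sketch below uses the rounded values shown on the slide, which give a 4x speedup and a 75% reduction in runtime; the 80% in the title may be based on unrounded timings, which the slide does not show.

```python
def speedup(old_seconds, new_seconds):
    """How many times faster the new runtime is."""
    return old_seconds / new_seconds


def runtime_reduction(old_seconds, new_seconds):
    """Fraction of the old runtime that was saved."""
    return (old_seconds - new_seconds) / old_seconds


old, new = 120.0, 30.0  # the rounded "2 min" and "30 sec" from the slide
print(speedup(old, new))            # 4.0
print(runtime_reduction(old, new))  # 0.75
```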
17. Using hypothesis-driven architecture to improve serverless infrastructure
Architecture as text
Evolve
Automatic performance measure
Evaluate
James Lewis
https://www.epsagon.com/
4pm
Kief Morris
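The slide names the loop only in outline (evolve the architecture, measure performance automatically, evaluate). A minimal sketch of the evaluate step might look as follows; the use of the median and the 5% acceptance threshold are assumptions for illustration, not something the slides specify.

```python
from statistics import median


def evaluate_change(baseline_runtimes, candidate_runtimes, min_improvement=0.05):
    """Accept a candidate architecture if its median runtime beats the
    baseline by at least min_improvement (a fraction, e.g. 0.05 = 5%)."""
    base = median(baseline_runtimes)
    cand = median(candidate_runtimes)
    improvement = (base - cand) / base
    return {
        "baseline_s": base,
        "candidate_s": cand,
        "improvement": improvement,
        "accept": improvement >= min_improvement,
    }
```

Each architecture change is then treated as a hypothesis ("this change makes the workflow faster") that the measured runtimes either support or reject.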
18. Overview
19. CryptoKitties in a nutshell
What’s in the genes of your CryptoKitties? By cryptobreeder
Not CSIRO-funded
20. CryptoBreeder.net
• Objective:
  • Build a machine learning web-service to predict the ‘cattributes’ of the offspring from a breeding pair
• Problem:
  • New ‘cattributes’ emerge all the time
• Solution:
  • Continuously learning model
Not CSIRO-funded
Uptake: 164 sessions / week
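The slides do not describe the CryptoBreeder model itself. As an illustration of what "continuously learning" can mean when new labels keep appearing, here is a deliberately simple frequency-based sketch that is updated one observation at a time and accepts previously unseen traits without retraining. The class and its methods are hypothetical and are not the author's implementation.

```python
from collections import Counter, defaultdict


class OnlineCattributeModel:
    """Frequency-based predictor that is updated one breeding at a time."""

    def __init__(self):
        # (parent trait, parent trait) -> Counter of observed offspring traits
        self.counts = defaultdict(Counter)

    @staticmethod
    def _key(trait_a, trait_b):
        # Parent order does not matter in this sketch.
        return tuple(sorted((trait_a, trait_b)))

    def learn(self, trait_a, trait_b, offspring_trait):
        """Incorporate one newly observed breeding; unseen traits need no retraining."""
        self.counts[self._key(trait_a, trait_b)][offspring_trait] += 1

    def predict(self, trait_a, trait_b):
        """Return offspring traits with estimated probabilities, most likely first."""
        observed = self.counts.get(self._key(trait_a, trait_b))
        if not observed:
            return []
        total = sum(observed.values())
        return [(trait, n / total) for trait, n in observed.most_common()]
```

In a serverless setting, `learn` would typically be triggered by each new breeding event, so the model stays current without a scheduled retraining job.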
22. Cost to date
• Any guesses?
23. Cost to date: AU $24.35
Ongoing costs: still within the 1-year AWS Free Tier; however, the S3 limit has already been exceeded and EC2 is not eligible
Uptake: 164 sessions / week
Not CSIRO-funded
24. Overview
25. Stephens et al. PLOS Biology 2015
Genomics will outpace other BigData disciplines
Astronomy, Twitter, YouTube, Genomics
26. Clinical use of GenPhen DB
• Objective:
  • Build a web-service that can query databases in response to a patient’s genome and medical record
• Problem:
  • Genomic data is so large
• Solution:
  • Athena-based query engine
genomephenome
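An Athena-based query engine runs SQL directly over files in S3, so a web-service only needs to build a query and submit it. The sketch below assumes a hypothetical variant table with `chrom` and `pos` columns; the slides do not show the GenPhen DB schema. The `start_query_execution` call is the boto3 Athena client method, and the client is passed in so the function can be exercised without AWS credentials.

```python
import re


def build_variant_query(table, chromosome, start, end):
    """Build a SQL string selecting variants in a genomic region."""
    # Validate inputs before interpolating them into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"invalid table name: {table!r}")
    if not re.fullmatch(r"[0-9A-Za-z]+", chromosome):
        raise ValueError(f"invalid chromosome: {chromosome!r}")
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start must not exceed end")
    return (
        f"SELECT * FROM {table} "
        f"WHERE chrom = '{chromosome}' "
        f"AND pos BETWEEN {start} AND {end}"
    )


def submit_query(athena_client, sql, database, output_location):
    """Start the query on Athena and return its execution id.

    athena_client is expected to be boto3.client("athena").
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena queries are asynchronous: the returned execution id is then used to check the query state and fetch the results once it has finished.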
28. Three things to remember
• Distributed architecture (serverless) can cater for a wide range of applications
  • Compute-intensive tasks (GT-Scan)
  • Tasks requiring continuous learning (CryptoBreeder)
  • Data-intensive tasks (GenPhen DB)
• Interoperability is built in, supporting evidence-based decision-making
• Optimization is currently still work-intensive; however, there are many startups addressing this issue
29. Team and collaborators
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Aidan O’Brien
Laurence Wilson, PhD
Kaitao Lai, PhD
Arash Bayat
Lynn Langit
Natalie Twine, PhD
News
Software
Top 10 Australian IT stories of 2017
Transformational Bioinformatics
Editor's Notes
Going Server-less for Web-Services that need to Crunch Large Volumes of Data
By Denis Bauer Team Leader Transformational Bioinformatics @ CSIRO
12:30 - 13:15
Real-time analysis through cloud-based solutions is expected in every domain, including life sciences. However, keeping runtime constant and close to real time can be challenging for problems that vary in their complexity, such as genome engineering. Here, the whole genome needs to be analyzed for every potential modification spot, so the computational complexity of finding the optimal spot can vary by orders of magnitude. Using AWS Lambda, we break this task down into smaller sub-tasks that can be solved in parallel by instantaneously recruiting additional Lambda functions as the complexity increases. The resulting web tool, GT-Scan2, was featured on the prestigious AWS blog by Jeff Barr, as it brings together novel scientific insights and unprecedented cloud-compute capacity. The same idea has been used for building CryptoBreeder. In this presentation, we will discuss the general template for serverless web applications and bespoke solutions for overcoming the technical limitations that serverless imposes.
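The fan-out idea in the abstract (split the task, solve the pieces in parallel, merge the results) can be sketched locally as follows. This is not the GT-Scan2 code: a thread pool stands in for parallel Lambda invocations, the search only covers the forward strand, and the task is simplified to locating candidate SpCas9 sites (a 20-nt protospacer followed by an NGG PAM). Function names and the chunk size are assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# SpCas9 target: 20-nt protospacer followed by an NGG PAM (23 nt in total).
SITE_LENGTH = 23
# Lookahead so that overlapping candidate sites are all reported.
SITE_PATTERN = re.compile(r"(?=([ACGT]{21}GG))")


def split_with_overlap(sequence, chunk_size):
    """Split a sequence into chunks that overlap by SITE_LENGTH - 1 bases,
    so a site spanning a chunk boundary is still seen whole by one worker."""
    if chunk_size < SITE_LENGTH:
        raise ValueError("chunk_size must be at least SITE_LENGTH")
    step = chunk_size - (SITE_LENGTH - 1)
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break
    return chunks


def find_sites_in_chunk(task):
    """Stand-in for one Lambda invocation: scan a single chunk."""
    offset, chunk = task
    return [(offset + m.start(), m.group(1)) for m in SITE_PATTERN.finditer(chunk)]


def find_sites(sequence, chunk_size=1000, max_workers=8):
    """Fan out over chunks in parallel, then merge into one sorted list."""
    tasks = split_with_overlap(sequence.upper(), chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = pool.map(find_sites_in_chunk, tasks)
    return sorted(site for sites in per_chunk for site in sites)
```

Because the number of chunks grows with the size of the input, the number of parallel workers recruited grows with the complexity of the request, which is the property the abstract attributes to the Lambda-based design.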
CONTINUOUS DELIVERY AND DEVOPS
Staff # as at 3 March 2016 = 5319
2014–15 budget = $1.2 billion
--------------------
Today we have around 5300 talented people working out of 50-plus centres in Australia and internationally.
We are a billion dollar organisation
We generate $485+ million in external revenue; nearly 40 per cent of our revenue is externally sourced
Our people work closely with industry and communities to leave a lasting legacy.
Our ability to achieve results is shown by the quality of our research. We are in the top 1% of global research institutions in 15 of 22 research fields and in the top 0.1% in four research fields.
CSIRO is the key connector of institutions in the Australian system for some areas. CSIRO is the most central Australian institution in 6 research fields – Agricultural Sciences, Environment/Ecology, Plant and Animal Sciences, Geosciences, Chemistry and Materials Science.
CSIRO works with 1208 SMEs and 2,877 customers each year. We’re always looking for ways we can help business and industry.
The Square Kilometre Array (SKA) project is expected to lead to a storage demand of 1 exabyte per year. YouTube currently requires from 100 petabytes to 1 exabyte for storage and is projected to require between 1 and 2 exabytes of additional storage per year by 2025. Twitter’s storage needs today are estimated at 0.5 petabytes per year, which may increase to 1.5 petabytes in the next ten years.