SlideShare a Scribd company logo
1 of 21
Download to read offline
Management of large genomics data with free
software
Michele Finelli
m@biodec.com
BioDec
The genomics revolution
Index
The genomics revolution
Problems
Solutions
Contacts
The genomics revolution
The NGS revolution
I The last twenty years have seen an amazing improvement
in the capability to perform human (and not-human) genome
sequencing.
I These so called Next Generation Sequencing (NGS)
technologies have revolutionized the world of biology and
medicine.
I The Apr 13th 2023 Economist article "Epic ambition"
summarizes the current status and shows how the cost of
the complete sequence of a human genome went from over
the 50.000.000$ of the Human Genome Project (HGP) to
few hundreds of dollars.
The genomics revolution
Cost of sequencing a human-sized genome
The genomics revolution
The sequencing process
I Using physical properties, the NGS device converts the
information of a biological sample (blood or tissue, for
example) into text sequences (reads) representing elements
(bases) of the genome or of other genetic material.
I Different sequencing techniques create reads of different
lengths: some technologies are more suited to shorter
reads, others to longer ones.
I Given the error rate of sequencing technologies — which is
a physical process — the same part of the genome is read
multiple times, leading to many reads for the same region
(that is the coverage of the genome).
The genomics revolution
Size of sequences
Application Estimated output
Human whole-genome se-
quencing (at 40x coverage)
~120 Gbases
Human exome sequencing (at
100x coverage)
~8 Gbases
Microbial whole-genome se-
quencing
~300 Mbases
16S rRNA sequencing ~60 Mbases
Table: Data output for common NGS applications. 1 megabase
(Mbases) = 1.000.000 bases; 1 gigabase (Gbases) = 1.000.000.000
bases
Problems
Index
The genomics revolution
Problems
Solutions
Contacts
Problems
Storage and computation
I Genomes must be stored in secure and meaningful way.
I The value of data is in the analysis that can be done on
them.
I There are computational processes — called pipelines —
that produce the annotations that people will use to
interpret the genomics data.
I Annotations (i.e. the genome metadata) must be accessed,
often with graphical interfaces.
Problems
Meaningful storage
I Genomics data is large (i.e. many TBs) and in some cases
even huge (many PBs).
I Structuring a large amount of data is a complex task.
I Also, you have the usual issues of large data management:
backup, disaster recovery, etcetera.
Problems
Pipelines execution
I Tools that analyze the data take a long time: it is not
unusual that they take days.
I Access to computational data must be as fast as possible:
but fast storage is small, i.e. it cannot accommodate large
data.
I Balancing this requirement is a complex task.
Problems
Access to annotations
I Annotations created by the computational process are
intrinsically complicated to visualize and explore, because
the genomics strands are very long sequences, with a rich
and heterogeneous ecosystem of different metadata,
depending on the biological or the medical questions that
are asked.
I The metadata of one person’s genome are personal data so
they are subject to privacy regulations.
I Balancing UX with privacy guarantees is another complex
task.
Solutions
Index
The genomics revolution
Problems
Solutions
Contacts
Solutions
Two suggestions
FREE SOFTWARE AT THE RESCUE !
1. The OpenCB project.
2. The Galaxy project.
Solutions
The OpenCB project
I OpenCB (Open Computational Biology -
https://github.com/opencb) is a project that aims to
create tools to efficiently manage large genomics
databases.
I These tools are not widely known or used:
1. the software stack is complex (and it is getting even more
complex with more recent releases);
2. steep learning curve.
Solutions
The OpenCB tools
OpenCGA The storage engine: emphasis on privacy,
performances and scalability (the older version
uses MongoDB, the newest is Hadoop/HBase
based).
IVA The user interface (Interactive Variant Analyser) to
OpenCGA data.
Cellbase An integration database to manage annotations
from several genomics and biological databases.
Solutions
The Galaxy project
Galaxy (https://galaxyproject.org/) is an open-source
platform for FAIR1 data analysis.
I Use tools from various domains (that can be plugged into
workflows) through its graphical web interface.
I Run code in interactive environments (RStudio, Jupyter, . . . )
along with other tools or workflows.
I Manage data by sharing and publishing results, workflows,
and visualizations.
I Ensure reproducibility by capturing the necessary
information to repeat and understand data analyses.
1FAIR: Findability, Accessibility,
Interoperability and Reusability.
Solutions
The Galaxy tool
I Galaxy is a Python/Django-based web application that uses
a PostgreSQL database.
I The main use case of Galaxy is as a web frontend to
manage, schedule and execute computational pipelines.
I Galaxy has a large number of integrations, ranging from
queuing systems (SLURM) to containers (Docker), identity
management systems (LDAP, AD) etcetera.
I Galaxy has a pretty solid user base and an active
community.
Solutions
Summing up
Storage OpenCGA and Cellbase as an improvement to . . .
files and custom integration (even Excel files2).
Pipelines Galaxy as an improvement to . . .
shell and the CLI.
Visualization IVA and Galaxy as an improvement to . . .
any random tool.
Remember: the use cases of the Galaxy and the IVA web
interface are different so you will probably need both.
2I wish I were joking.
Solutions
BioDec projects
OpenCB BioDec is currently deploying a large installation for
the Genetic Unit of a main Italian hospital: data in
the order of the hundreds of terabytes will be
managed and analyzed by bioinformaticians.
Galaxy BioDec did many installations in the past ten years,
ranging from research institutions to companies
involved in drug-design and gene therapy, hospitals
etcetera.
Contacts
Index
The genomics revolution
Problems
Solutions
Contacts
Contacts
Get in touch
Who BioDec S.r.l. — a bioinformatics company.
Where Physical and digital places.
Address Via Calzavecchio 20/2, Casalecchio di
Reno (BO).
Web https://www.biodec.com/
Email info@biodec.com
Phone +39 051 0548263, ask about me
(Michele Finelli).
THANKS !

More Related Content

Similar to SFSCON23 - Michele Finelli - Management of large genomic data with free software

tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...David Peyruc
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Ola Spjuth
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Paolo Romano
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERcscpconf
 
Ontologies Ontop Databases
Ontologies Ontop DatabasesOntologies Ontop Databases
Ontologies Ontop DatabasesMartín Rezk
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated GenomicsIdan Tohami
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Bioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxBioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxxRowlet
 
NGS: bioinformatic challenges
NGS: bioinformatic challengesNGS: bioinformatic challenges
NGS: bioinformatic challengesLex Nederbragt
 
Cao report 2007-2012
Cao report 2007-2012Cao report 2007-2012
Cao report 2007-2012Elif Ceylan
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 

Similar to SFSCON23 - Michele Finelli - Management of large genomic data with free software (20)

tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
 
Collins seattle-2014-final
Collins seattle-2014-finalCollins seattle-2014-final
Collins seattle-2014-final
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
Pine education-platform
Pine education-platformPine education-platform
Pine education-platform
 
Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Ontologies Ontop Databases
Ontologies Ontop DatabasesOntologies Ontop Databases
Ontologies Ontop Databases
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated Genomics
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
 
Bioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxBioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptx
 
BioNLPSADI
BioNLPSADIBioNLPSADI
BioNLPSADI
 
NGS: bioinformatic challenges
NGS: bioinformatic challengesNGS: bioinformatic challenges
NGS: bioinformatic challenges
 
Cao report 2007-2012
Cao report 2007-2012Cao report 2007-2012
Cao report 2007-2012
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 

More from South Tyrol Free Software Conference

SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...
SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...
SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...South Tyrol Free Software Conference
 
SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...
SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...
SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...South Tyrol Free Software Conference
 
SFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data Hub
SFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data HubSFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data Hub
SFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data HubSouth Tyrol Free Software Conference
 
SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...
SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...
SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...South Tyrol Free Software Conference
 
SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...
SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...
SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...South Tyrol Free Software Conference
 
SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...
SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...
SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...South Tyrol Free Software Conference
 
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelines
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelinesSFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelines
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelinesSouth Tyrol Free Software Conference
 
SFSCON23 - Charles H. Schulz - Why open digital infrastructure matters
SFSCON23 - Charles H. Schulz - Why open digital infrastructure mattersSFSCON23 - Charles H. Schulz - Why open digital infrastructure matters
SFSCON23 - Charles H. Schulz - Why open digital infrastructure mattersSouth Tyrol Free Software Conference
 
SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...
SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...
SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...South Tyrol Free Software Conference
 
SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...
SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...
SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...South Tyrol Free Software Conference
 
SFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free software
SFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free softwareSFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free software
SFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free softwareSouth Tyrol Free Software Conference
 
SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...
SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...
SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...South Tyrol Free Software Conference
 
SFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changer
SFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changerSFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changer
SFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changerSouth Tyrol Free Software Conference
 
SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...
SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...
SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...South Tyrol Free Software Conference
 
SFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation Internet
SFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation InternetSFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation Internet
SFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation InternetSouth Tyrol Free Software Conference
 
SFSCON23 - Davide Vernassa - Empowering Insights Unveiling the latest innova...
SFSCON23 - Davide Vernassa - Empowering Insights  Unveiling the latest innova...SFSCON23 - Davide Vernassa - Empowering Insights  Unveiling the latest innova...
SFSCON23 - Davide Vernassa - Empowering Insights Unveiling the latest innova...South Tyrol Free Software Conference
 

More from South Tyrol Free Software Conference (20)

SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...
SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...
SFSCON23 - Rufai Omowunmi Balogun - SMODEX – a Python package for understandi...
 
SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...
SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...
SFSCON23 - Roberto Innocenti - From the design to reality is here the Communi...
 
SFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data Hub
SFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data HubSFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data Hub
SFSCON23 - Martin Rabanser - Real-time aeroplane tracking and the Open Data Hub
 
SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...
SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...
SFSCON23 - Marianna d'Atri Enrico Zanardo - How can Blockchain technologies i...
 
SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...
SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...
SFSCON23 - Lucas Lasota - The Future of Connectivity, Open Internet and Human...
 
SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...
SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...
SFSCON23 - Giovanni Giannotta - Intelligent Decision Support System for trace...
 
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelines
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelinesSFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelines
SFSCON23 - Elena Maines - Embracing CI/CD workflows for building ETL pipelines
 
SFSCON23 - Christian Busse - Free Software and Open Science
SFSCON23 - Christian Busse - Free Software and Open ScienceSFSCON23 - Christian Busse - Free Software and Open Science
SFSCON23 - Christian Busse - Free Software and Open Science
 
SFSCON23 - Charles H. Schulz - Why open digital infrastructure matters
SFSCON23 - Charles H. Schulz - Why open digital infrastructure mattersSFSCON23 - Charles H. Schulz - Why open digital infrastructure matters
SFSCON23 - Charles H. Schulz - Why open digital infrastructure matters
 
SFSCON23 - Andrea Vianello - Achieving FAIRness with EDP-portal
SFSCON23 - Andrea Vianello - Achieving FAIRness with EDP-portalSFSCON23 - Andrea Vianello - Achieving FAIRness with EDP-portal
SFSCON23 - Andrea Vianello - Achieving FAIRness with EDP-portal
 
SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...
SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...
SFSCON23 - Thomas Aichner - How IoT and AI are revolutionizing Mass Customiza...
 
SFSCON23 - Stefan Mutschlechner - Smart Werke Meran
SFSCON23 - Stefan Mutschlechner - Smart Werke MeranSFSCON23 - Stefan Mutschlechner - Smart Werke Meran
SFSCON23 - Stefan Mutschlechner - Smart Werke Meran
 
SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...
SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...
SFSCON23 - Mirko Boehm - European regulators cast their eyes on maturing OSS ...
 
SFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free software
SFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free softwareSFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free software
SFSCON23 - Marco Pavanelli - Monitoring the fleet of Sasa with free software
 
SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...
SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...
SFSCON23 - Marco Cortella - KNOWAGE and AICS for 2030 agenda SDG goals monito...
 
SFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changer
SFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changerSFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changer
SFSCON23 - Lina Ceballos - Interoperable Europe Act - A real game changer
 
SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...
SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...
SFSCON23 - Johannes Näder Linus Sehn - Let’s monitor implementation of Free S...
 
SFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation Internet
SFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation InternetSFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation Internet
SFSCON23 - Gabriel Ku Wei Bin - Why Do We Need A Next Generation Internet
 
SFSCON23 - Edoardo Scepi - The Brand-New Version of IGis Maps
SFSCON23 - Edoardo Scepi - The Brand-New Version of IGis MapsSFSCON23 - Edoardo Scepi - The Brand-New Version of IGis Maps
SFSCON23 - Edoardo Scepi - The Brand-New Version of IGis Maps
 
SFSCON23 - Davide Vernassa - Empowering Insights Unveiling the latest innova...
SFSCON23 - Davide Vernassa - Empowering Insights  Unveiling the latest innova...SFSCON23 - Davide Vernassa - Empowering Insights  Unveiling the latest innova...
SFSCON23 - Davide Vernassa - Empowering Insights Unveiling the latest innova...
 

Recently uploaded

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Recently uploaded (20)

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

SFSCON23 - Michele Finelli - Management of large genomic data with free software

  • 1. Management of large genomics data with free software Michele Finelli m@biodec.com BioDec
  • 2. The genomics revolution Index The genomics revolution Problems Solutions Contacts
  • 3. The genomics revolution The NGS revolution I The last twenty years have seen an amazing improvement in the capability to perform human (and not-human) genome sequencing. I These so called Next Generation Sequencing (NGS) technologies have revolutionized the world of biology and medicine. I The Apr 13th 2023 Economist article "Epic ambition" summarizes the current status and shows how the cost of the complete sequence of a human genome went from over the 50.000.000$ of the Human Genome Project (HGP) to few hundreds of dollars.
  • 4. The genomics revolution Cost of sequencing a human-sized genome
  • 5. The genomics revolution The sequencing process I Using physical properties, the NGS device converts the information of a biological sample (blood or tissue, for example) into text sequences (reads) representing elements (bases) of the genome or of other genetic material. I Different sequencing techniques create reads of different lengths: some technologies are more suited to shorter reads, others to longer ones. I Given the error rate of sequencing technologies — which is a physical process — the same part of the genome is read multiple times, leading to many reads for the same region (that is the coverage of the genome).
  • 6. The genomics revolution Size of sequences Application Estimated output Human whole-genome se- quencing (at 40x coverage) ~120 Gbases Human exome sequencing (at 100x coverage) ~8 Gbases Microbial whole-genome se- quencing ~300 Mbases 16S rRNA sequencing ~60 Mbases Table: Data output for common NGS applications. 1 megabase (Mbases) = 1.000.000 bases; 1 gigabase (Gbases) = 1.000.000.000 bases
  • 8. Problems Storage and computation I Genomes must be stored in secure and meaningful way. I The value of data is in the analysis that can be done on them. I There are computational processes — called pipelines — that produce the annotations that people will use to interpret the genomics data. I Annotations (i.e. the genome metadata) must be accessed, often with graphical interfaces.
  • 9. Problems Meaningful storage I Genomics data is large (i.e. many TBs) and in some cases even huge (many PBs). I Structuring a large amount of data is a complex task. I Also, you have the usual issues of large data management: backup, disaster recovery, etcetera.
  • 10. Problems Pipelines execution I Tools that analyze the data take a long time: it is not unusual that they take days. I Access to computational data must be as fast as possible: but fast storage is small, i.e. it cannot accommodate large data. I Balancing this requirement is a complex task.
  • 11. Problems Access to annotations I Annotations created by the computational process are intrinsically complicated to visualize and explore, because the genomics strands are very long sequences, with a rich and heterogeneous ecosystem of different metadata, depending on the biological or the medical questions that are asked. I The metadata of one person’s genome are personal data so they are subject to privacy regulations. I Balancing UX with privacy guarantees is another complex task.
  • 13. Solutions Two suggestions FREE SOFTWARE AT THE RESCUE ! 1. The OpenCB project. 2. The Galaxy project.
  • 14. Solutions The OpenCB project I OpenCB (Open Computational Biology - https://github.com/opencb) is a project that aims to create tools to efficiently manage large genomics databases. I These tools are not widely known or used: 1. the software stack is complex (and it is getting even more complex with more recent releases); 2. steep learning curve.
  • 15. Solutions The OpenCB tools OpenCGA The storage engine: emphasis on privacy, performances and scalability (the older version uses MongoDB, the newest is Hadoop/HBase based). IVA The user interface (Interactive Variant Analyser) to OpenCGA data. Cellbase An integration database to manage annotations from several genomics and biological databases.
  • 16. Solutions The Galaxy project Galaxy (https://galaxyproject.org/) is an open-source platform for FAIR1 data analysis. I Use tools from various domains (that can be plugged into workflows) through its graphical web interface. I Run code in interactive environments (RStudio, Jupyter, . . . ) along with other tools or workflows. I Manage data by sharing and publishing results, workflows, and visualizations. I Ensure reproducibility by capturing the necessary information to repeat and understand data analyses. 1FAIR: Findability, Accessibility, Interoperability and Reusability.
  • 17. Solutions The Galaxy tool I Galaxy is a Python/Django-based web application that uses a PostgreSQL database. I The main use case of Galaxy is as a web frontend to manage, schedule and execute computational pipelines. I Galaxy has a large number of integrations, ranging from queuing systems (SLURM) to containers (Docker), identity management systems (LDAP, AD) etcetera. I Galaxy has a pretty solid user base and an active community.
  • 18. Solutions Summing up Storage OpenCGA and Cellbase as an improvement to . . . files and custom integration (even Excel files2). Pipelines Galaxy as an improvement to . . . shell and the CLI. Visualization IVA and Galaxy as an improvement to . . . any random tool. Remember: the use cases of the Galaxy and the IVA web interface are different so you will probably need both. 2I wish I were joking.
  • 19. Solutions BioDec projects OpenCB BioDec is currently deploying a large installation for the Genetic Unit of a main Italian hospital: data in the order of the hundreds of terabytes will be managed and analyzed by bioinformaticians. Galaxy BioDec did many installations in the past ten years, ranging from research institutions to companies involved in drug-design and gene therapy, hospitals etcetera.
  • 21. Contacts Get in touch Who BioDec S.r.l. — a bioinformatics company. Where Physical and digital places. Address Via Calzavecchio 20/2, Casalecchio di Reno (BO). Web https://www.biodec.com/ Email info@biodec.com Phone +39 051 0548263, ask about me (Michele Finelli). THANKS !