Abril/2024
Trazendo data provenance
e reprodutibilidade para
suas análises de dados em
Python
Using computers to collect, store, analyze, and disseminate data and information
Large files
>
100 GB for one raw
human genome…
Many languages
Bash, Python, R, PERL…
Complex interactions
Networks of software
and their dependencies…
Workflows
Reproducibility
Hidden reproducibility issues are like an iceberg
First, we tried to re-run the
analysis with the code and data
provided by the authors.
Second, we reimplemented the whole
method in a Python package...
Experimenting with reproducibility:
a case study of robustness in bioinformatics
Kim et al., GigaScience
)
https://doi.org/10.1093/gigascience/giy077
Nextflow
Managing modern workflows is complicated
A reactive workflow framework and a programming DSL
Processes
Channels
Workflow
Nextflow
A reactive workflow framework and a programming DSL
data x data y data z
Input Channel
Process A
Task 1
Task 2
Task 3
data x
data y
data z
output x
output y
output z
Nextflow
A reactive workflow framework and a programming DSL
a x data y data z
ut Channel
Process A
Task 1
Task 2
Task 3
data x
data y
data z
output x
output y
output z
Nextflow
output x output y output z
Output Channel
A reactive workflow framework and a programming DSL
output x
output y
output z
Nextflow
output x output y output z
Output Channel
Process B
Task 1
Task 2
Task 3
data x
data y
data z
A reactive workflow framework and a programming DSL
Nextflow
A reactive workflow framework and a programming DSL
Nextflow on the Cloud
A reactive workflow framework and a programming DSL
Nextflow on the Cloud
A Nextflow pipeline for audio transcription using Whisper from OpenAI
nf-whisper
YouTube
URL
Video
File
Audio
File
Transcript
File
pytube Whisper
A Nextflow pipeline for audio transcription using Whisper from OpenAI
nf-whisper
A Nextflow pipeline for audio transcription using Whisper from OpenAI
nf-whisper
A Nextflow pipeline for audio transcription using Whisper from OpenAI
nf-whisper
A reactive workflow framework and a programming DSL
Parallelism
Reentrancy
(
Resume partial runs)
Reusability
Subworkflow foo
Nextflow
Nextflow
Nextflow is a language, a runtime, and a community
pipeline
runtime
Task orchestration
and execution
Built-in version control
with Git
Write code
in any language
Define software
dependencies via containers
Orchestrate tasks with
dataflow programming
Nextflow is a language, a runtime, and a community
Reproducible
Integration with code
management tools, with
versioned releases.
Portable
Docker, Singularity,
Conda, works with most
compute environments.
Scalable
5 samples on your
laptop, 5k on an HPC or
5 million in the cloud.
Nextflow
Provenance report
Generates provenance
reports in different
formats
Enables partial
execution in different
platforms
Makes it easier to
connect workflows and
different technologies
Nextflow plugins
nf-prov: Nextflow plugin to render provenance reports for pipeline runs.
https://github.com/nextflow-io/nf-prov
nextflow-io/nf-prov
Different platforms Workflow integration
nf-core
A community effort to collect a curated set of analysis pipelines built using Nextflow
8k+
Slack
users
40k
GitHub
commits
2k+
GitHub
contributors
16k+
Pull
requests
120
+
GitHub
repositories
7k+
GitHub
issues
Pipelines
>
95 pipelines and a
base template
Linting
Choose conventions to
test for consistency
Modules
>
1150 modules
Tooling
Development and
deployment
Subworkflows
>
55 subworkflows
Schema
Validation, channels
and user interface
nf-core components
Pick and choose which component you need
Pipelines
Create from template,
sync to get updates
Schema
Build your pipeline
schema with a GUI
Modules
Create, install, update,
patch, test
Download
Fetch with singularity
images for offline use
Subworkflows
Create, install and
update
Linting
Test nf-core standards
and best practices
nf-core/tools
Command line tools to help you build your pipeline with ease
Participate
Seminars, training, hackathons, and more
● Bytesize seminars
● Training sessions
● Hackathons
● Social media
● Blogs
● Community Forum and Slack
● Documentation
● Mentorships
Thank you
Marcel Ribeiro-Dantas, Ph.D.
Developer Advocate at Seqera
mribeirodantas@seqera.io
Barcelona | Natal

PyData Meetup Presentation in Natal April 2024

  • 1.
    Abril/2024 Trazendo data provenance ereprodutibilidade para suas análises de dados em Python
  • 2.
    Using computers tocollect, store, analyze, and disseminate data and information Large files > 100 GB for one raw human genome… Many languages Bash, Python, R, PERL… Complex interactions Networks of software and their dependencies… Workflows
  • 3.
    Reproducibility Hidden reproducibility issuesare like an iceberg First, we tried to re-run the analysis with the code and data provided by the authors. Second, we reimplemented the whole method in a Python package... Experimenting with reproducibility: a case study of robustness in bioinformatics Kim et al., GigaScience ) https://doi.org/10.1093/gigascience/giy077
  • 4.
  • 5.
    A reactive workflowframework and a programming DSL Processes Channels Workflow Nextflow
  • 6.
    A reactive workflowframework and a programming DSL data x data y data z Input Channel Process A Task 1 Task 2 Task 3 data x data y data z output x output y output z Nextflow
  • 7.
    A reactive workflowframework and a programming DSL a x data y data z ut Channel Process A Task 1 Task 2 Task 3 data x data y data z output x output y output z Nextflow output x output y output z Output Channel
  • 8.
    A reactive workflowframework and a programming DSL output x output y output z Nextflow output x output y output z Output Channel Process B Task 1 Task 2 Task 3 data x data y data z
  • 9.
    A reactive workflowframework and a programming DSL Nextflow
  • 10.
    A reactive workflowframework and a programming DSL Nextflow on the Cloud
  • 11.
    A reactive workflowframework and a programming DSL Nextflow on the Cloud
  • 12.
    A Nextflow pipelinefor audio transcription using Whisper from OpenAI nf-whisper YouTube URL Video File Audio File Transcript File pytube Whisper
  • 13.
    A Nextflow pipelinefor audio transcription using Whisper from OpenAI nf-whisper
  • 14.
    A Nextflow pipelinefor audio transcription using Whisper from OpenAI nf-whisper
  • 16.
    A Nextflow pipelinefor audio transcription using Whisper from OpenAI nf-whisper
  • 17.
    A reactive workflowframework and a programming DSL Parallelism Reentrancy ( Resume partial runs) Reusability Subworkflow foo Nextflow
  • 18.
    Nextflow Nextflow is alanguage, a runtime, and a community pipeline runtime Task orchestration and execution Built-in version control with Git Write code in any language Define software dependencies via containers Orchestrate tasks with dataflow programming
  • 19.
    Nextflow is alanguage, a runtime, and a community Reproducible Integration with code management tools, with versioned releases. Portable Docker, Singularity, Conda, works with most compute environments. Scalable 5 samples on your laptop, 5k on an HPC or 5 million in the cloud. Nextflow
  • 20.
    Provenance report Generates provenance reportsin different formats Enables partial execution in different platforms Makes it easier to connect workflows and different technologies Nextflow plugins nf-prov: Nextflow plugin to render provenance reports for pipeline runs. https://github.com/nextflow-io/nf-prov nextflow-io/nf-prov Different platforms Workflow integration
  • 21.
    nf-core A community effortto collect a curated set of analysis pipelines built using Nextflow 8k+ Slack users 40k GitHub commits 2k+ GitHub contributors 16k+ Pull requests 120 + GitHub repositories 7k+ GitHub issues
  • 22.
    Pipelines > 95 pipelines anda base template Linting Choose conventions to test for consistency Modules > 1150 modules Tooling Development and deployment Subworkflows > 55 subworkflows Schema Validation, channels and user interface nf-core components Pick and choose which component you need
  • 23.
    Pipelines Create from template, syncto get updates Schema Build your pipeline schema with a GUI Modules Create, install, update, patch, test Download Fetch with singularity images for offline use Subworkflows Create, install and update Linting Test nf-core standards and best practices nf-core/tools Command line tools to help you build your pipeline with ease
  • 24.
    Participate Seminars, training, hackathons,and more ● Bytesize seminars ● Training sessions ● Hackathons ● Social media ● Blogs ● Community Forum and Slack ● Documentation ● Mentorships
  • 25.
    Thank you Marcel Ribeiro-Dantas,Ph.D. Developer Advocate at Seqera mribeirodantas@seqera.io Barcelona | Natal