April 2024
Bringing data provenance and reproducibility to your data analyses in Python
Using computers to collect, store, analyze, and disseminate data and information
Large files: > 100 GB for one raw human genome…
Many languages: Bash, Python, R, Perl…
Complex interactions: Networks of software and their dependencies…
Workflows
Reproducibility
Hidden reproducibility issues are like an iceberg
First, we tried to re-run the analysis with the code and data provided by the authors. Second, we reimplemented the whole method in a Python package...
Experimenting with reproducibility: a case study of robustness in bioinformatics
Kim et al., GigaScience
https://doi.org/10.1093/gigascience/giy077
Nextflow
Managing modern workflows is complicated
A reactive workflow framework and a programming DSL
Processes
Channels
Workflow
[Diagram: an input channel carries data x, data y, and data z into Process A, which runs one task per item (Task 1, Task 2, Task 3) and produces output x, output y, and output z.]
[Diagram: output x, output y, and output z are collected in an output channel, which feeds Process B, again with one task per item (Task 1, Task 2, Task 3).]
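The channel and process diagrams above can be approximated in plain Python. This is only a rough sketch of the dataflow idea, assuming thread-safe queues stand in for channels and a small thread pool stands in for the tasks of a process. The names run_process, drain, and DONE are made up for this illustration and are not part of Nextflow.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of a channel


def run_process(task, inbox, outbox, workers=3):
    """Apply `task` to each item from `inbox` and put the results on `outbox`."""

    def worker():
        while True:
            item = inbox.get()
            if item is DONE:
                inbox.put(DONE)  # hand the sentinel on so sibling workers stop too
                return
            outbox.put(task(item))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    outbox.put(DONE)  # close the output channel after every task has finished


def drain(channel):
    """Read a channel until its sentinel and return the items as a list."""
    items = []
    while (item := channel.get()) is not DONE:
        items.append(item)
    return items


input_channel, middle_channel, output_channel = queue.Queue(), queue.Queue(), queue.Queue()
for name in ("data x", "data y", "data z"):
    input_channel.put(name)
input_channel.put(DONE)

# Process A runs in the background; Process B consumes its outputs as they arrive.
process_a = threading.Thread(
    target=run_process,
    args=(lambda d: d.replace("data", "output"), input_channel, middle_channel),
)
process_a.start()
run_process(str.upper, middle_channel, output_channel)
process_a.join()

results = sorted(drain(output_channel))
print(results)
```

Results are sorted before printing because the tasks run in parallel and may finish in any order.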
Nextflow on the Cloud
nf-whisper
A Nextflow pipeline for audio transcription using Whisper from OpenAI
[Diagram: YouTube URL to video file to audio file to transcript file, with pytube for the download and Whisper for the transcription.]
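The stages of the nf-whisper diagram can be pictured as a chain of functions that each take a file and return the next one. The sketch below is not the nf-whisper code: download_video, extract_audio, and transcribe are hypothetical stand-ins that only write placeholder text files, and the file names and extensions are assumptions. A real pipeline would call pytube and Whisper at the marked places.

```python
from pathlib import Path


def download_video(url: str, workdir: Path) -> Path:
    # Hypothetical stand-in: a real stage would fetch the video with pytube.
    video = workdir / "video.mp4"
    video.write_text(f"video from {url}")
    return video


def extract_audio(video: Path) -> Path:
    # Hypothetical stand-in for the audio extraction step.
    audio = video.with_suffix(".mp3")
    audio.write_text(f"audio of {video.name}")
    return audio


def transcribe(audio: Path) -> Path:
    # Hypothetical stand-in: a real stage would run Whisper on the audio file.
    transcript = audio.with_suffix(".txt")
    transcript.write_text(f"transcript of {audio.name}")
    return transcript


def run_pipeline(url: str, workdir: Path) -> Path:
    """Chain the three stages: URL to video, video to audio, audio to transcript."""
    return transcribe(extract_audio(download_video(url, workdir)))
```

Because every stage communicates only through files, each one can be swapped out or rerun on its own, which is the property a workflow manager builds on.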
Parallelism
Reentrancy (resume partial runs)
Reusability: Subworkflow foo
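Resuming a partial run can be pictured as caching each task result under a key derived from the task and its inputs. The following is a simplified sketch of that idea, not Nextflow's implementation: task_key and run_cached are hypothetical helpers, and the key here covers only the task name and JSON-serializable inputs.

```python
import hashlib
import json
from pathlib import Path


def task_key(name: str, inputs: dict) -> str:
    """Derive a stable key from the task name and its inputs."""
    payload = json.dumps({"task": name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(name, func, inputs: dict, cache_dir: Path):
    """Run func(**inputs) unless a result for the same key is already stored.

    Returns (result, resumed), where resumed is True on a cache hit.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{task_key(name, inputs)}.json"
    if cached.exists():
        return json.loads(cached.read_text()), True
    result = func(**inputs)
    cached.write_text(json.dumps(result))
    return result, False
```

Calling run_cached twice with the same inputs executes the function once; changing any input changes the key and triggers a fresh run, so only the affected tasks are recomputed.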
Nextflow is a language, a runtime, and a community
pipeline
runtime
Task orchestration and execution
Built-in version control with Git
Write code in any language
Define software dependencies via containers
Orchestrate tasks with dataflow programming
Reproducible: Integration with code management tools, with versioned releases.
Portable: Docker, Singularity, Conda; works with most compute environments.
Scalable: 5 samples on your laptop, 5k on an HPC or 5 million in the cloud.
Nextflow plugins
Provenance report: Generates provenance reports in different formats
Different platforms: Enables partial execution in different platforms
Workflow integration: Makes it easier to connect workflows and different technologies
nf-prov: Nextflow plugin to render provenance reports for pipeline runs.
https://github.com/nextflow-io/nf-prov
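To make the idea of a provenance report concrete, here is a small Python sketch of what a single record might capture for one task. The field names and the provenance_record helper are hypothetical and chosen for illustration only; they do not reproduce the report formats that nf-prov generates.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Checksum used to identify a file by its content."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(task: str, command: str, inputs: dict, outputs: dict) -> dict:
    """Describe one task run; `inputs` and `outputs` map file names to bytes."""
    return {
        "task": task,
        "command": command,
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "inputs": {name: sha256_of(data) for name, data in inputs.items()},
        "outputs": {name: sha256_of(data) for name, data in outputs.items()},
    }


record = provenance_record(
    task="uppercase",
    command="tr a-z A-Z < in.txt > out.txt",
    inputs={"in.txt": b"hello\n"},
    outputs={"out.txt": b"HELLO\n"},
)
print(json.dumps(record, indent=2))
```

Recording checksums rather than file names alone lets a later reader verify that the inputs and outputs they hold are the same ones the run used and produced.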
nf-core
A community effort to collect a curated set of analysis pipelines built using Nextflow
8k+ Slack users
40k GitHub commits
2k+ GitHub contributors
16k+ pull requests
120+ GitHub repositories
7k+ GitHub issues
nf-core components
Pick and choose which component you need
Pipelines: > 95 pipelines and a base template
Linting: Choose conventions to test for consistency
Modules: > 1150 modules
Tooling: Development and deployment
Subworkflows: > 55 subworkflows
Schema: Validation, channels and user interface
nf-core/tools
Command line tools to help you build your pipeline with ease
Pipelines: Create from template, sync to get updates
Schema: Build your pipeline schema with a GUI
Modules: Create, install, update, patch, test
Download: Fetch with Singularity images for offline use
Subworkflows: Create, install and update
Linting: Test nf-core standards and best practices
Participate
Seminars, training, hackathons, and more
● Bytesize seminars
● Training sessions
● Hackathons
● Social media
● Blogs
● Community Forum and Slack
● Documentation
● Mentorships
Thank you
Marcel Ribeiro-Dantas, Ph.D.
Developer Advocate at Seqera
mribeirodantas@seqera.io
Barcelona | Natal

PyData Meetup Presentation in Natal April 2024
