1) The document presents a methodology for converting script-based experiments into reproducible workflow research objects (WROs), addressing the difficulty of understanding, reusing, and reproducing experiments conducted through scripts.
2) The methodology comprises five steps: generate an abstract workflow, create an executable workflow, refine the workflow, record provenance data, and annotate and bundle the resources into a WRO.
3) The methodology was demonstrated on a molecular dynamics simulation case study. It produces executable workflows, records the provenance of the conversion process, and bundles resources such as scripts, workflows, annotations, and papers to support reproducibility and reuse.
Converting scripts into reproducible workflow research objects
Khalid Belhajjame
This talk was given at the eScience 2016 conference. It presents a principled methodology for converting raw scripts into annotated workflow research objects.
Converting Scripts into Reproducible Workflow Research Objects
1. Converting Scripts into Reproducible Workflow Research Objects
Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros
lucas.carvalho@ic.unicamp.br
Baltimore, Maryland, USA
October 23-26, 2016
3. Background and Motivation
● Data-Intensive Experiments
– Collection of scripts, programs, (big) data and papers.
How to understand, reproduce or reuse the data and models of experiments?
4. Background and Motivation
● Data-Intensive Experiments
– Collection of scripts, programs, (big) data and papers.
– Manual collection and organization of data provenance.
How to understand, reproduce or reuse the data and models of experiments?
5. Background and Motivation
● Script-based experiments
– What are the inputs and outputs?
– How to change this local program for a similar web service?
(Figure: example of script code.)
Script-based experiments are difficult to understand, to reuse, and to reproduce.
10. Related Work
● Script-language specific.
● Workflow-engine specific.
● A new language is needed.
● Outcome is not an executable workflow.
● Do not collect provenance data of the conversion process.
11. Two Kinds of Experts
● Scientists:
– Domain experts who understand the experiment and the script (sometimes called users).
● Curators:
– Scientists who are also familiar with workflow and script programming, or
– Computer scientists who are familiar enough with the domain to be able to implement our methodology.
– Responsible for authoring, documenting and publishing workflows and associated resources.
12. Requirements
1. Produce a workflow-like view of the script.
2. Create an executable workflow and compare the execution of the workflow and the script.
3. Modify the workflow resources.
4. Record provenance data.
5. Aggregate all resources to support reproducibility and reuse.
13. Requirements
● Requirement 1: Produce a workflow-like view of the script.
(Figure: a script-based experiment mapped to an abstract workflow of activities connected through ports.)
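A workflow-like view of a script can be produced with the kind of comment annotations used by YesWorkflow, which the deck cites later. Below is a minimal sketch in Python with hypothetical step and variable names: each @begin/@end comment block would become an activity of the abstract workflow, and each @in/@out a port.

```python
# Hypothetical script with YesWorkflow-style annotations; the function
# bodies are stand-ins, the comments carry the abstract workflow view.

# @begin simulate @in parameters @out trajectory
def simulate(parameters):
    # stand-in for the real simulation step
    return [p * 2 for p in parameters]
# @end simulate

# @begin analyze @in trajectory @out result
def analyze(trajectory):
    # stand-in for the real trajectory analysis
    return sum(trajectory) / len(trajectory)
# @end analyze

result = analyze(simulate([1, 2, 3]))
print(result)  # 4.0
```

Because the annotations are ordinary comments, the script keeps running unchanged while a tool can extract activities and ports from them.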
16. Requirements
● Requirement 4: Record provenance data.
(Figure: a PROV graph in which two outputs wasGeneratedBy Activity 1, which used a Sample and wasStartedAt “2012-06-01”; Activity 2 used one of the outputs and wasAssociatedWith the LucasWorkflow Run.)
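The provenance relations pictured on this slide can be recorded, in a minimal sketch, as plain (subject, relation, object) triples. The names below are taken from the figure, and the query helper is a hypothetical illustration rather than part of any W3C PROV API.

```python
# PROV-style statements from the slide, stored as simple triples.
triples = [
    ("Output 1", "wasGeneratedBy", "Activity 1"),
    ("Output 2", "wasGeneratedBy", "Activity 1"),
    ("Activity 1", "used", "Sample"),
    ("Activity 1", "wasStartedAt", "2012-06-01"),
    ("Activity 2", "used", "Output 1"),
    ("Activity 2", "wasAssociatedWith", "LucasWorkflow Run"),
]

def generated_by(entity):
    """Return the activities recorded as having generated `entity`."""
    return [o for s, r, o in triples if s == entity and r == "wasGeneratedBy"]

print(generated_by("Output 1"))  # ['Activity 1']
```

Even this flat representation already supports the lineage queries that make the conversion process auditable.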
17. Requirements
● Requirement 5: Aggregate all resources to support reproducibility and reuse.
(Figure: a Research Object aggregating abstract workflows, concrete workflows, annotations, papers and reports, provenance, authors, scripts and data.)
18. Methodology
(Figure: the conversion pipeline from script, via abstract and concrete workflows, to a Research Object.)
1. Generate abstract workflow.
2. Create an executable workflow.
3. Refine workflow.
4. Annotate and check quality.
5. Bundle resources into a Research Object.
19. Workflow Research Object (WRO)
● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.
● WROs encapsulate scientific workflows and additional information regarding their context and resources.
(Figure: the Research Object Model.)
20. Running Example
● Molecular Dynamics Simulations
– Used in many branches of materials science, computational engineering, physics and chemistry.
– Scripts (shell script), programs (NAMD, VMD, Fortran).
– Phases: set-up, simulation and analysis of trajectories.
– Inputs: protein structure, simulation parameters and force field files.
– Outputs: trajectories and analysis results.
34. Step 4: Annotate and check quality
● Annotations describing the workflow.
● Use provenance data to check the quality of the conversion process.
● Run checks to verify the soundness of the workflow.
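The soundness check on this slide can be as simple as re-running both versions on the same inputs and comparing their outputs. A minimal sketch, with hypothetical stand-ins for the script and workflow runners:

```python
def run_script(inputs):
    # stand-in for executing the original script
    return sum(inputs)

def run_workflow(inputs):
    # stand-in for executing the converted workflow
    return sum(inputs)

def conversion_is_sound(inputs):
    """The conversion is sound when both runs agree on the output."""
    return run_script(inputs) == run_workflow(inputs)

print(conversion_is_sound([1, 2, 3]))  # True
```

A disagreement here points at one of the common conversion mistakes listed on the next slide.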
37. Step 4: Annotate and check quality
● Common mistakes during the conversion:
– the main logical processing units in the script were not clearly identified;
– script code was migrated incorrectly into the corresponding activity;
– the correct input files and parameters were not provided;
– the coding of the workflow itself contained errors.
38. Step 5: Bundle Resources into a Research Object
(Figure: the bundle aggregates the script, abstract workflow, concrete workflow(s), annotations, paper, provenance data and attributions.)
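Step 5 can be sketched with the Python standard library alone: write the aggregated resources into a zip archive together with a small manifest, in the spirit of Research Object bundles. The file names and manifest layout below are illustrative assumptions, not the official RO bundle specification.

```python
import io
import json
import zipfile

# Hypothetical resources aggregated by the conversion (names illustrative).
resources = {
    "script.sh": "#!/bin/sh\necho 'run simulation'\n",
    "workflow/concrete.t2flow": "<workflow/>",
    "annotations/workflow.ttl": "# annotations",
    "provenance/conversion.provn": "document endDocument",
}

# The manifest lists everything the bundle aggregates.
manifest = {"aggregates": sorted(resources)}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as bundle:
    bundle.writestr(".ro/manifest.json", json.dumps(manifest))
    for name, content in resources.items():
        bundle.writestr(name, content)

# Read the manifest back to confirm the bundle is self-describing.
with zipfile.ZipFile(buf) as bundle:
    aggregates = json.loads(bundle.read(".ro/manifest.json"))["aggregates"]
print(aggregates)
```

Packing the manifest inside the archive is what makes the bundle self-describing: a consumer can list the aggregated resources without any external catalogue.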
39. Contributions
● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs.
● This addresses an important issue in the area of script provenance.
40. Conclusions
● We addressed issues regarding the understanding, reuse and reproducibility of script-based experiments.
● The methodology was:
– elaborated based on requirements;
– showcased via a real-world use case from the field of Molecular Dynamics.
● We exploited tools and standards from the scientific community: scientific workflows, YesWorkflow, Research Objects, the W3C PROV recommendations and the Web Annotation Data Model.
● The bundle is available at http://w3id.org/w2share/s2rwro/
41. Next Steps
● Evaluation using other case studies.
● Evaluation of the cost-effectiveness of our methodology.
● Extension of YesWorkflow to support the semantic annotation of blocks.
● Implementation of tools.
42. Acknowledgments
● FAPESP (grant #2014/23861-4)
● CCES/CEPID (grant #2013/08293-7), Center for Computational Engineering & Sciences
● LIS (Laboratory of Information Systems)
● Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.
43. Converting Scripts into Reproducible Workflow Research Objects
Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros
lucas.carvalho@ic.unicamp.br
Baltimore, Maryland, USA
October 23-26, 2016