This document introduces khmer, a platform for scalable sequence analysis. It discusses how khmer uses k-mers to provide implicit read alignments and assemble sequences using de Bruijn graphs. It also describes a key challenge with k-mers: each sequencing error produces k novel k-mers. The document outlines khmer's data structures and algorithms for efficiently counting k-mers and representing de Bruijn graphs. It discusses how khmer has been applied to real biological problems and highlights areas of current research using khmer, such as error correction, variant calling, and assembly-free comparison of data sets.
4. K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
5. K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
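The idea on these two slides can be made concrete in a few lines: if two reads share exact k-mers, the positions of the matches imply an alignment offset without running any alignment algorithm. A minimal sketch (helper names are mine, not khmer's API):

```python
def kmers(seq, k):
    """Yield each k-mer of seq together with its start position."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], i

def implicit_alignment(read_a, read_b, k=17):
    """Return (pos_in_a, pos_in_b) pairs for every k-mer shared by both
    reads -- an alignment implied purely by exact k-mer matches."""
    index = {}
    for kmer, i in kmers(read_a, k):
        index.setdefault(kmer, []).append(i)
    hits = []
    for kmer, j in kmers(read_b, k):
        for i in index.get(kmer, []):
            hits.append((i, j))
    return hits

# the two reads from the slide
a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
hits = implicit_alignment(a, b)
offsets = {j - i for i, j in hits}
print(offsets)  # -> {6}: every shared k-mer implies the same overlap offset
```

A single consistent offset across all shared k-mers is exactly the "implicit alignment" of the slide title; with no mismatch or indel handling, any disagreement among offsets simply produces no shared k-mers there.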
6. De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)
7. The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
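The claim is easy to verify directly: a substitution in the middle of a read changes every k-mer that covers it, so one error yields up to k k-mers never seen in the error-free data. A small sketch using the slide's read:

```python
def kmer_set(seq, k):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 17
correct = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
errored = correct[:17] + "C" + correct[18:]  # single G->C substitution, as on the slide

# k-mers present in the errored read but nowhere in the correct one
novel = kmer_set(errored, k) - kmer_set(correct, k)
print(len(novel))  # -> 17, i.e. k novel k-mers from one error
```

This is why error-containing reads bloat k-mer counting structures: errors dominate the set of *distinct* k-mers even when they are rare per-base.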
11. This leads to good things.
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
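The first two items fit together: khmer's online counting is built on a Count-Min-sketch-style structure (the Zhang et al. paper cited below), and abundance trimming queries it as reads stream past. Here is a toy stand-in showing the mechanics; class and function names are illustrative, not khmer's actual API:

```python
import hashlib

class CountMinSketch:
    """Toy fixed-memory k-mer counter. Hash collisions can only
    inflate counts, so the reported count upper-bounds the true one."""
    def __init__(self, num_tables=4, table_size=10007):
        self.tables = [[0] * table_size for _ in range(num_tables)]
        self.table_size = table_size

    def _slots(self, kmer):
        # one independent slot per table, derived from a salted hash
        for t in range(len(self.tables)):
            h = hashlib.sha256(f"{t}:{kmer}".encode()).digest()
            yield t, int.from_bytes(h[:8], "big") % self.table_size

    def add(self, kmer):
        for t, slot in self._slots(kmer):
            self.tables[t][slot] += 1

    def get(self, kmer):
        # the minimum across tables is the tightest available bound
        return min(self.tables[t][slot] for t, slot in self._slots(kmer))

def count_read(sketch, read, k):
    """Online counting: feed every k-mer of a read into the sketch."""
    for i in range(len(read) - k + 1):
        sketch.add(read[i:i + k])

def trim_on_abundance(sketch, read, k, cutoff):
    """Truncate a read at its first k-mer whose count falls below cutoff
    -- the idea behind abundance-based read trimming."""
    for i in range(len(read) - k + 1):
        if sketch.get(read[i:i + k]) < cutoff:
            return read[:i + k - 1]
    return read
```

After counting a few copies of a read, a version of it carrying a tail error is trimmed just before the never-seen k-mer, while the clean read passes through untouched.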
12. Data structures & algorithms papers
• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
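The core idea of the Pell et al. paper is worth a sketch: store only k-mer *membership* in a Bloom filter, and the de Bruijn graph exists implicitly, because each node's possible neighbors are just the four (k-1)-overlapping extensions in each direction. A toy version (not khmer's implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set membership in fixed memory,
    with a small false-positive rate and no false negatives."""
    def __init__(self, size=100003, num_hashes=4):
        self.bits = bytearray(size)
        self.size = size
        self.num_hashes = num_hashes

    def _slots(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for s in self._slots(item):
            self.bits[s] = 1

    def __contains__(self, item):
        return all(self.bits[s] for s in self._slots(item))

def neighbors(bloom, kmer):
    """Implicit de Bruijn graph traversal: a k-mer's neighbors are the
    k-mers overlapping it by k-1 bases that are present in the filter."""
    prefix, suffix = kmer[1:], kmer[:-1]
    out = [prefix + b for b in "ACGT" if prefix + b in bloom]
    back = [b + suffix for b in "ACGT" if b + suffix in bloom]
    return back, out
```

No edges are ever stored; graph exploration just probes the filter four times per direction, which is what makes the representation "probabilistic" and memory-efficient.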
13. Data analysis papers
• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
14. Lab approach – not intentional, but working out.
Novel data structures and algorithms → implement at scale → apply to real biological problems.
15. This leads to good things.
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
(khmer software)
16. (khmer software)
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
Current research:
• Streaming algorithms for assembly, variant calling, and error correction
• Cloud assembly protocols
• Efficient graph labeling & exploration
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
• Efficient search for target genes
17. How is this feasible?!
Representative half-arsed lab software development:
• Version that worked once, for some publication.
• Grad student 1 research and grad student 2 research, diverging.
• Incompatible and broken code.
18. A not-insane way to do software development
• Stable version, built on stable, tested code.
• Grad student 1 research and grad student 2 research branch off the stable version.
• Run tests at every step.
20. Testing & version control – the not so secret sauce
• High test coverage – grown over time.
• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.
• Pull requests & continuous integration – does your proposed merge break tests?
• Pull requests & code review – does new code meet our minimal coding etc. requirements?
o Note: spellchecking!!!
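"Stupidity driven testing" in practice: when a bug surfaces, a failing regression test is committed first, then the fix, so the bug can never silently return. A hypothetical example; the `reverse_complement` helper and the lowercase bug are illustrative, not taken from khmer:

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string.
    Bug found in the wild (hypothetical): lowercase input used to
    raise KeyError; the fix normalizes case before the lookup."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[b] for b in reversed(seq.upper()))

def test_reverse_complement_lowercase():
    # written the moment the bug was found, before fixing it:
    # lowercase reads must not crash
    assert reverse_complement("acgt") == "ACGT"

def test_reverse_complement_basic():
    assert reverse_complement("CCGA") == "TCGG"
```

Under pytest, the lowercase test fails against the buggy version and passes after the fix, pinning the bug permanently into the suite.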
21. Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages.
• These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…
23. khmer-protocols:
• Provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers).
• Open, versioned, forkable, citable….
Pipeline: read cleaning → diginorm → assembly → annotation → RSEM differential expression.
24. Literate testing
• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and confidence moving forward!
Leigh Sheneman
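The extraction step can be as simple as walking a tutorial's reST literal blocks and collecting the command lines. A rough sketch of the idea, not the actual literate-resting code (the tutorial text and script names are made up for illustration):

```python
def extract_commands(doc_text):
    """Collect shell commands from a reST-style tutorial: every
    indented line inside a '::' literal block, minus comments."""
    commands = []
    in_block = False
    for line in doc_text.splitlines():
        if line.rstrip().endswith("::"):
            in_block = True      # a literal block starts after this line
            continue
        if in_block:
            if line.startswith("   "):
                cmd = line.strip()
                if cmd and not cmd.startswith("#"):
                    commands.append(cmd)
            elif line.strip():
                in_block = False  # dedented non-blank line ends the block

    return commands

# hypothetical tutorial fragment
tutorial = """\
First, trim the reads::

   # quality-trim
   trim-reads.py input.fastq > trimmed.fastq

Then assemble::

   assemble.py trimmed.fastq
"""
print("\n".join(extract_commands(tutorial)))
```

Writing the collected commands to a shell script and running it under CI is what turns the tutorial itself into an executable test.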
25. Doing things right => #awesomesauce
• Protocols in English for running analyses in the cloud
• Literate reSTing => shell scripts
• Tool competitions
• Benchmarking
• Education
• Acceptance tests
31. Error correction on simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell

           TP (corrected)     FP (mistakes)   TN (OK)        FN (missed)
ideal      3,469,834 (99.1%)  8,186           460,655,449    31,731 (0.9%)
1-pass     2,827,839 (80.8%)  30,254          460,633,381    673,726 (19.2%)
1.2-pass   3,403,171 (97.2%)  8,764           460,654,871    98,394 (2.8%)
32. Single pass, reference free, tunable, streaming online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
33. Novelty… to what power?
• “Novelty” requirements for “high impact publishing”:
o Must do novel algorithm development
o …and apply to novel and interesting data sets.
o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)
34. Reproducibility
Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)
Qingpeng Zhang
35. Concluding thoughts
• API is destiny – without online counting, diginorm & streaming approaches would not have been possible.
• Tackle the hard problems – engineering optimization would not have gotten us very far.
• Testing lets us scale development & process – which means when something works, we can run with it.
36. Caveats
• Expense and effort – you can spend an infinite amount of time on infrastructure & process!
o Advice: choose techniques that address actual pain points.
o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good software practices for yourself, not others.
o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
o These are 90% true statements :>
37. Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)
“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
39. Prospective: sequencing tumor cells
• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate data while retaining variant information.
40. Where are we taking this?
• Streaming online algorithms only look at data ~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
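Diginorm's streaming nature is easy to see in miniature: one pass over the reads, keeping a read only while the median abundance of its k-mers (counted over reads kept so far) is still below the target coverage. A toy version of the published algorithm (Brown et al., cited earlier), with a plain dict standing in for khmer's fixed-memory counting structure:

```python
from statistics import median

def digital_normalization(reads, k=17, cutoff=20):
    """Streaming digital normalization: yield a read only if the median
    count of its k-mers, over reads kept so far, is below the cutoff.
    Each read is examined exactly once; discarded reads add no counts."""
    counts = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts.get(km, 0) for km in kmers) < cutoff:
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read
```

On redundant data (e.g. the tumor-cell scenario on slide 39) most reads never raise any k-mer's median and are dropped immediately, while reads carrying novel k-mers, including true variants, keep a low median and survive the pass.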