The document discusses genome assembly from sequencing reads. It describes how reads can be aligned to a reference genome if one is available, but for a new genome the reads must be assembled without a reference. Two main assembly approaches are described: overlap-layout-consensus, which builds an overlap graph, and de Bruijn graph assembly, which constructs a de Bruijn graph from k-mers. Both approaches aim to build contiguous sequences (contigs) from the reads, but face challenges from computational complexity and from sequencing errors in the reads.
2. Processing Reads
• As we’ve covered before, if we already have a
reference assembly, we can process reads by
aligning to the reference genome
3. The Sequencing Abstraction
It was the best of times, it was the worst of times…
worst of times
was the worst
the worst of
• Sequencing performs a Poisson-distributed
sampling of substrings from a larger string
• Reads are exact substrings (i.e., error free)
Metaphor borrowed from Michael Schatz
It was the
the best of
times, it was
best of times
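As a rough illustration of this abstraction (not from the slides), the sketch below samples error-free substrings from uniformly random start positions, which makes per-base coverage approximately Poisson-distributed; the function name and parameters are illustrative.

import random

def shotgun(text, read_len=12, coverage=5, seed=0):
    # Sample error-free substrings ("reads") from uniformly random start
    # positions; per-base coverage is then approximately Poisson-distributed.
    random.seed(seed)
    n_reads = coverage * len(text) // read_len
    starts = [random.randrange(len(text) - read_len + 1) for _ in range(n_reads)]
    return [text[s:s + read_len] for s in starts]

print(shotgun("It was the best of times, it was the worst of times")[:4])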
4. The Alignment Abstraction
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
worst of times
the worst of
best of times was the worst
was the worst
It was the
worst of times
the best of
the worst of
times, it was
best of times
5. But!
• What do we do if we don’t have a reference
genome to map against?
• Can we use information in the reads to assemble
the reads together into a string?
6. Sequence Assembly
was the worst
best of times
It was the
worst of times
the best of
the worst of
times, it was
It was the
the best of
best of times
times, it was
was the worst
the worst of
worst of times
It was the best of times, it was the worst of times…
7. The Assembly Problem
• Given a set of reads, we want to assemble the
“best” contigs possible
• Contig = contiguous sequence
• Two general formulations for assembly:
• Overlap-layout-consensus (OLC)
• de Bruijn graph (DBG)
9. Assembly is Graph Traversal
• In OLC, we create an overlap graph, and find a
Hamiltonian path
• In DBG, we create a de Bruijn graph, and find an
Eulerian path
10. Overlap Graphs
• Given a set of reads, represents how these reads
overlap
Nodes are reads, edges are overlaps.
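A minimal Python sketch (not from the slides) of building such an overlap graph: nodes are the reads, and a directed edge records the longest suffix-prefix overlap between two reads. The min_len threshold and function names are illustrative.

def overlap(a, b, min_len=3):
    # Length of the longest suffix of a that matches a prefix of b
    # (at least min_len characters), or 0 if there is none.
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def overlap_graph(reads, min_len=3):
    # Nodes are reads; an edge (a, b) records how much a's suffix
    # overlaps b's prefix.
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b, min_len)
                if olen > 0:
                    edges[(a, b)] = olen
    return edges

print(overlap_graph(["It was the", "was the worst", "the worst of"]))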
11. Example Overlap Graph
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
the best of
It was the
times, it was
worst of times
the worst of
best of times
was the worst
12. Hamiltonian Path
• A Hamiltonian Path is a path which visits each node
in the graph exactly once
13. Computing Overlaps
• To compute overlaps between two reads, we
compute the pairwise alignment of these two reads
• This can be done using dynamic programming
(Smith-Waterman) or a profile HMM
• We can accelerate this with indexing-based
methods, similar to those in SNAP
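A small dynamic-programming sketch of the overlap computation, in the spirit of Smith-Waterman but simplified (this is an illustration, not the aligner used in SNAP or any production tool): it scores the best alignment of a suffix of one read against a prefix of the other. Scoring parameters are illustrative.

def overlap_align(a, b, match=1, mismatch=-1, gap=-1):
    # Score the best alignment of a suffix of a against a prefix of b,
    # allowing mismatches and gaps.
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]   # dp[i][0] = 0: skipping a prefix of a is free
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap        # skipping a prefix of b is penalized
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return max(dp[rows - 1])                 # the overlap may end anywhere in b

print(overlap_align("It was the", "was the worst"))   # 7 matching characters -> score 7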
14. Two Problems
1. Overlapping is expensive:
• Must compute O(n²) overlaps, n = # reads
• Computing an overlap is O(l²), l = read length
2. Hamiltonian Path is NP-hard:
• Approximate solvers exist, but don’t scale up
to genomics datasets
15. de Bruijn Graphs
• In a de Bruijn graph, nodes are k-mers, and edges
represent observed transitions between k-mers
• k-mers are k-length substrings from reads
ACACTGCACT
[Figure: de Bruijn graph built from the 3-mers of ACACTGCACT]
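A minimal sketch of de Bruijn graph construction under the slide's convention (nodes are k-mers, edges are observed transitions between consecutive k-mers in a read); the function name is illustrative.

from collections import defaultdict

def de_bruijn(reads, k):
    # Nodes are k-mers; each pair of consecutive k-mers in a read
    # (overlapping by k-1 bases) contributes one directed edge.
    graph = defaultdict(list)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            graph[a].append(b)
    return graph

print(dict(de_bruijn(["ACACTGCACT"], k=3)))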
16. de Bruijn Graphs
• In a de Bruijn graph, we may have multiple paths
between two nodes
ACACTGCACT
ACA CAC ACT
GCA TGC CTG
[Figure: the de Bruijn graph built from these 3-mers has multiple paths between the same pair of nodes]
17. Eulerian Path
• In an Eulerian path, we use every edge exactly once
• Preconditions for finding an Eulerian path assembly on a DBG:
1. One node must have one more edge leaving than
entering
2. One node must have one more edge entering than
leaving
3. All other nodes must have equal numbers of edges
entering and leaving
18. Finding an Eulerian Path
• Connect the two nodes with unbalanced edges
• This gives us a graph with an Eulerian cycle
• From an arbitrary node n, walk the graph until we return to
n, and save the path we’ve walked
• Until all edges have been used:
• Pick a point n’ from our path, where n’ has unused
edges
• Walk from n’ until we return to n’, and track visited edges
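A compact Hierholzer-style sketch of this procedure (an illustration, not a production assembler); it assumes the graph already satisfies the preconditions above and is given as a node -> successors mapping. The usage line at the end reuses the de_bruijn sketch from earlier.

from collections import defaultdict

def eulerian_path(graph):
    # Follow unused edges, backing up whenever a node has none left.
    g = {node: list(succ) for node, succ in graph.items()}
    out_deg = {n: len(s) for n, s in g.items()}
    in_deg = defaultdict(int)
    for succ in g.values():
        for v in succ:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in g if out_deg[n] - in_deg[n] == 1), next(iter(g)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if g.get(v):
            stack.append(g[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

# Joining the path's first node with the last base of every later node
# reconstructs the sequence.
path = eulerian_path(de_bruijn(["ACACTGCACT"], k=3))
print(path[0] + "".join(node[-1] for node in path[1:]))   # ACACTGCACT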
19. Problems with Eulerian Path
• For a given graph, we may have multiple valid paths!
ACA CAC ACT
GCA TGC CTG
CAA AAT ATG
[Figure: numbering the edges of this de Bruijn graph in two different orders gives two valid Eulerian paths, spelling out two different assemblies]
ACACTGCACAATGC
ACAATGCACACTGC
21. How Do We Assemble
Multiple Reads?
• In practice, de Bruijn graphs are additive
• This allows us to merge graphs from multiple reads
• When do we keep/remove edges?
23. Errors!
• One of the key assumptions that we make in the
sequencing process is that reads are correct
• But, in reality, reads have a 2% error rate
• How does this impact us?
24. What Are The Errors Like?
ACATATAGAA
AGATATAGAN
• Currently, the most common sequencing technology
is called Illumina
• Errors tend to be a misread of a single base
• Errors tend to be clustered at the ends of reads
28. Help‽ What Can We Do?
• For some errors, we can inspect the de Bruijn
graph directly, and eliminate edges from the graph
• More generally, we can look at the distribution of
k-mers, and try to make corrections to the reads
29. Trimming Spurs
• Since errors are at the ends of reads, we see spurious branches
off of the graph
• Use heuristics to determine whether we can remove these nodes
• E.g., if these nodes are only present in 1 read, probably OK
30. The k-mer Spectrum
• If we look at the frequencies of k-mers, we see
something interesting…
32. Those Are Our Errors!
• Errors create low-frequency substrings
• We can identify errors with a mixture model:
• Mixture of Poissons
• Distribution with lowest mean —> errors
• From here, we can remove those “erroneous”
strings, and pick likely replacements
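A sketch of computing the k-mer spectrum and flagging rare k-mers; instead of fitting the mixture of Poissons described above, it uses a hard frequency cutoff as a stand-in, and the cutoff value and function names are illustrative.

from collections import Counter

def kmer_spectrum(reads, k):
    # Count how often every k-mer occurs across all reads.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def likely_error_kmers(counts, cutoff=2):
    # Stand-in for the Poisson mixture: treat k-mers rarer than the
    # cutoff as likely products of sequencing errors.
    return {kmer for kmer, c in counts.items() if c < cutoff}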
33. How Do We Define Likely?
• Can use edit distance of replacement as a heuristic
• Can define a probabilistic measure for the quality of
a replacement:
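The slide's actual formula is not reproduced here; as one plausible stand-in, the sketch below scores a candidate replacement by the probability, derived from Phred base qualities, that exactly the changed bases were miscalled. All names and values are illustrative.

def replacement_score(read, candidate, phred_quals):
    # Probability that the bases being changed were miscalled and the
    # rest were called correctly (one possible "likely replacement" score).
    score = 1.0
    for base, new, q in zip(read, candidate, phred_quals):
        p_err = 10 ** (-q / 10)        # Phred quality -> error probability
        score *= p_err if base != new else 1 - p_err
    return score

print(replacement_score("ACGT", "ACTT", [30, 30, 10, 30]))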
34. Dealing With Repeats
• A cycle in a de Brujin graph is caused by repeated
sequence
• In real genomes, there is a lot of repetition:
• Structural variation —> duplicated sequences
• Transposons/Mobile Elements
• Centromeres and Telomeres
35. Increased k-mer Length
ACACTGCACT
3-mers: ACA CAC ACT GCA TGC CTG
5-mers: ACACT CACTG ACTGC GCACT TGCAC CTGCA
• If a repeated sequence is less than b bases long, we can resolve the repeat by using k-mers with k > b
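A quick check of this idea using the de_bruijn sketch from earlier (an assumed helper): with k=3 the repeated k-mer CAC accumulates multiple edges, so the walk through it is ambiguous, while with k=5 every node has a single successor.

for k in (3, 5):
    g = de_bruijn(["ACACTGCACT"], k)
    print(k, {n: succ for n, succ in g.items() if len(succ) > 1})
# k=3 -> {'CAC': ['ACT', 'ACT']}: the repeat makes the walk ambiguous
# k=5 -> {}: every node has a single successor, so the repeat is resolved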
36. Scaffolding
It was the best of times, it was the worst of times…
the best of
best of times was the worst
It was the
worst of times
times, it was
• Current sequencing technology gives us paired reads,
with approximately known distance between reads
37. Scaffolding
• We can use this to estimate repeat sizes:
• Or, to estimate the size of gaps:
[Figure: read pairs whose observed spacing is smaller or bigger than the expected insert size]
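A trivial sketch of the gap estimate (illustrative names, not from the slides): whatever part of the known insert size is not accounted for inside the two contigs is attributed to the gap between them.

def estimate_gap(insert_size, left_overhang, right_overhang):
    # left_overhang / right_overhang: distance from each read of the pair
    # to the end of its contig; the remainder of the insert is the gap.
    return insert_size - left_overhang - right_overhang

print(estimate_gap(insert_size=500, left_overhang=180, right_overhang=220))   # ~100 bp gap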
40. Opportunities
• New read technologies are available
• Provide much longer reads (>10 kbp vs. 250 bp)
• Different error model… (15% INDEL errors, vs. 2% SNP errors)
• Generally, lower sequence-specific bias
• But, need to improve OLC assembler performance!
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
41. Can we turn an expensive,
serial problem into a
cheap, parallel problem?
42. Fast Overlapping with
MinHashing
• Wonderful realization by Berlin et al.1: overlapping is
similar to the document similarity problem
• Use MinHashing to approximate similarity:
1: Berlin et al, bioRxiv 2014
Per document/read, compute signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. Take min over all random hashes
Hash into buckets: signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)
Compare: for two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l
Can reduce complexity from O(n²) to O(nb)!
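A self-contained Python sketch of the MinHash scheme described above (an illustration, not the method of Berlin et al.): signatures from shingles and random hashes, banded hashing into buckets, and Jaccard estimation from equal hash counts. Python's built-in hash XORed with random masks stands in for a family of random hash functions; all names and parameters are illustrative.

import random

def minhash_signature(read, k=8, num_hashes=64, seed=0):
    # 1. Cut into shingles; 2. apply random hashes; 3. take the min.
    rng = random.Random(seed)                 # same seed -> same hash family for every read
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    shingles = {read[i:i + k] for i in range(len(read) - k + 1)}
    return [min(hash(s) ^ m for s in shingles) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    # Jaccard similarity is estimated by (# equal hashes) / l.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(signatures, bands=16):
    # Hash each length-l signature into b bands; reads sharing a bucket
    # in any band become candidate pairs for exact overlap computation.
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(name)
    return buckets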
43. MapReduce
• Intuition: if we have a data parallel algorithm, we can
run the algorithm across many computers
• Many popular systems:
• MapReduce at Google
• Hadoop
• (from Berkeley!)
• Provide special programming models for graphs…
44. MinHash On MR
Per document/read, compute signature (map): cut into shingles, apply random hashes to shingles, take min over all random hashes
Hash into buckets (groupBy): signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)
Compare (map + filter): for two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l
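A plain-Python stand-in for this data-parallel pipeline (not Spark or Hadoop code), reusing the minhash_signature helper sketched earlier: map computes signatures, groupBy buckets them by band, and map + filter emits candidate pairs within each bucket.

from itertools import groupby

def candidate_pairs(reads, bands=16):
    # map: compute one MinHash signature per (name, sequence) read record
    sigs = [(name, minhash_signature(seq)) for name, seq in reads]
    rows = len(sigs[0][1]) // bands
    # groupBy: key every signature band and group records by bucket
    keyed = sorted(((band, tuple(sig[band * rows:(band + 1) * rows]), name)
                    for name, sig in sigs for band in range(bands)),
                   key=lambda rec: rec[:2])
    # map + filter: within each bucket, emit the candidate read pairs
    pairs = set()
    for _, bucket in groupby(keyed, key=lambda rec: rec[:2]):
        names = [name for _, _, name in bucket]
        pairs.update((a, b) for i, a in enumerate(names) for b in names[i + 1:])
    return pairs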
45. Transitive Reduction
• We can find a consensus between clique members
• Or, we can reduce down:
• Can be implemented efficiently using graph-optimized
MapReduce libraries!