Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
Lecture given for the Data Mining course at Uppsala University in October 2013. The presentation talks about data analysis in the context of genomics, next-generation sequencing, metagenomics, etc.
Hail: Scaling Genetic Data Analysis with Apache Spark: Keynote by Cotton Seed - Spark Summit
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident: for example, drugs with supporting genetic evidence have twice the clinical trial success rate. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
Therefore, we began the open-source Hail project (https://hail.is) to be a scalable platform built on Apache Spark to enable the worldwide genetics community to build, share, and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, annotations and sample data; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes.
We will give an overview of the goals of the Hail project and its architecture. The challenge of efficiently manipulating genetic data in Spark has led to several innovations that may have wider applicability, including an RDD-like abstraction for representing multidimensional data and an OrderedRDD abstraction for ordered data (for example, data indexed by position in the genome). Finally, we will discuss Hail performance and future directions.
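The ordered-data idea is easy to illustrate outside Spark. The following is a minimal Python sketch, not Hail's actual OrderedRDD code: keeping records sorted by (contig, position) lets genomic-interval queries use binary search instead of a full scan. All records here are invented.

```python
import bisect

# Toy variant records keyed by (contig, position); kept sorted, the way an
# ordered abstraction keeps its data sorted by genomic position.
variants = sorted([
    ("chr1", 100, "A>T"),
    ("chr1", 250, "G>C"),
    ("chr2", 50, "C>A"),
])
positions = [(contig, pos) for contig, pos, _ in variants]

def variants_in_interval(contig, start, end):
    """Return variants with start <= position < end on `contig`."""
    lo = bisect.bisect_left(positions, (contig, start))
    hi = bisect.bisect_left(positions, (contig, end))
    return variants[lo:hi]

print(variants_in_interval("chr1", 0, 300))  # the two chr1 variants
```

On sorted data each lookup costs O(log n) rather than O(n), which is the property that makes range queries and ordered joins cheap at genome scale.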
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek, Data Driven Innovation
Thanks to Next Generation Sequencing (NGS), a technology that is lowering the cost and time of reading DNA, we are faced with huge amounts of biomedical data. These data are continuously collected by research laboratories, often organized through worldwide consortia, which are releasing many public databases. One of the main aims of bioinformatics is to solve fundamental issues in biomedical research (e.g., how cancer occurs) starting from big genomic data and their analysis. In this talk I will give an overview of big genomic data management, integration, and mining.
The slides cover why bioinformatics emerged, who bioinformaticians are, what they do, and what interesting applications and challenges exist in the field.
Slides were prepared for the Bioinformatics seminar 2016, Institute of Computer Science, University of Tartu.
Presentation from the "Demystifying Big Data" Technical Conference (Universidad de La Laguna, Spain, June 2014).
Biomedical sciences rely on massive data sets. By using machines capable of generating large amounts of data with low cost, science has entered the 'Big Data' era, making computational infrastructures essential to maintain, transfer and analyze all this information.
Building Genomic Data Processing and Machine Learning Workflows Using Apache ... - Databricks
Epinomics is advancing epigenetic research to drive personalized medicine, using epigenomic data analysis. Their goal is to provide an analysis resource to the community that will promote high-quality data and replicable and interpretable results. They work with academic and commercial users to ingest and analyze their genomic sequencing data and metadata. They extract epigenetic features from the sequenced genome, called “chromatin accessibility”, which are indicative of instrumental epigenetic changes responsible for differential gene expression and disease development.
Epinomics has built an Apache Spark-based pipeline that retrieves chromatin accessibility data from the epigenome, uses GraphX to find overlapping accessibility atlases, and then clusters the data and runs machine learning algorithms.
This session will provide a primer on epigenomics, details about Epinomics' Spark-based data pipeline with a focus on parallel bioinformatic analysis, and a look at how machine learning models are used to build the epigenomic landscape and accelerate the field of personalized immunotherapy.
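The pipeline's overlap step runs on Spark/GraphX; as an illustration only, the core operation of finding overlapping accessibility peaks can be sketched in plain Python with a sweep over intervals sorted by start coordinate (the intervals below are invented):

```python
# Sketch of interval-overlap detection, not the Epinomics GraphX code.
def overlapping_pairs(peaks):
    """Return pairs of (start, end) intervals that overlap each other."""
    peaks = sorted(peaks)
    pairs = []
    for i, (s1, e1) in enumerate(peaks):
        for s2, e2 in peaks[i + 1:]:
            if s2 >= e1:       # sorted by start: no later peak can overlap
                break
            pairs.append(((s1, e1), (s2, e2)))
    return pairs

print(overlapping_pairs([(0, 100), (50, 150), (200, 300)]))
```

Sorting by start coordinate lets the inner loop stop early, which is the same pruning idea that makes distributed overlap joins tractable.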
Finding Needles in Genomic Haystacks with “Wide” Random Forest - Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at the population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases such as diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project, with genomes of 2,504 individuals, includes nearly 85M genomic variants and a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both for its potential for parallelization and for its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and relatively few variables, and they either fail or are inefficient in the WGAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for high-dimensional data sets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets we found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and its challenges, present RandomForestHD with its design principles and Spark-specific implementation details, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
Building an Information Infrastructure to Support Microbial Metagenomic Sciences - Larry Smarr
06.01.14
Presentation for the Microbe Project Interagency Team
Title: Building an Information Infrastructure to Support Microbial Metagenomic Sciences
La Jolla, CA
Variant (SNPs/Indels) calling in DNA sequences, Part 1 - Denis C. Bauer
Abstract: This session will focus on the first steps involved in identifying SNPs from whole genome, exome capture or targeted resequencing data: the different approaches to mapping reads to a DNA reference sequence will be introduced, and quality metrics discussed.
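One quality metric from this topic is easy to demonstrate: the SAM format's mapping quality (MAPQ) is a Phred-scaled score, with error probability 10^(-MAPQ/10). The records below are made-up minimal SAM fields (QNAME, FLAG, RNAME, POS, MAPQ), not real data:

```python
# Invented minimal SAM-like records: QNAME, FLAG, RNAME, POS, MAPQ.
sam_lines = [
    "read1\t0\tchr1\t10468\t60",
    "read2\t0\tchr1\t10502\t3",
    "read3\t16\tchr2\t20044\t37",
]

def mapq_to_error_prob(mapq):
    """Phred scale: MAPQ 20 means a 1-in-100 chance the mapping is wrong."""
    return 10 ** (-mapq / 10)

# Keep only confidently mapped reads (MAPQ >= 20 is a common threshold).
kept = []
for line in sam_lines:
    qname, flag, rname, pos, mapq = line.split("\t")
    if int(mapq) >= 20:
        kept.append(qname)

print(kept)                               # read2 (MAPQ 3) is filtered out
print(round(mapq_to_error_prob(20), 3))   # 0.01
```

Filtering on MAPQ before variant calling is one of the "first steps" the session refers to, since ambiguously mapped reads inflate false-positive SNP calls.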
The presentation opens with background on big data, mainly metagenomic data, and discusses the hurdles of analyzing it with conventional approaches. The later part gives a brief introduction to machine learning approaches, with a biological example for each. Finally, it presents work focused on applying one machine learning approach, Random Forest, to the functional annotation and taxonomic classification of metagenomic data.
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research... - Golden Helix Inc
With a focus on scalable architecture and optimized native code that fully utilizes the CPU and RAM available, we can scale genomic analysis into sizes conventionally considered Big Data on a single host. In this webcast, we demonstrate recent innovations and features in Golden Helix solutions that enable the analysis of big data on your own terms.
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics - Christopher Mason
Challenges and biases in preparing, characterizing, and sequencing DNA and RNA can have significant impacts on research in genomics across all kingdoms of life, including experiments in single cells, RNA profiling, and metagenomics. Technical artifacts and contaminations can arise at each point of sample manipulation, extraction, sequencing, and analysis. Thus, the measurement and benchmarking of these potential sources of error are of paramount importance as next-generation sequencing (NGS) projects become more global and ubiquitous.
Fortunately, a variety of methods, standards, and technologies have recently emerged that improve measurements in genomics and sequencing, from the initial input material to the computational pipelines that process and annotate the data.
This webinar will review work to develop standards and their applications in genomics, including the ABRF-NGS Phase II NGS Study on DNA Sequencing; the FDA's Sequencing Quality Control Consortium (SEQC2); metagenomics standards efforts (ABRF, ATCC, Zymo, Metaquins); and the Epigenomics QC group of the SEQC2. The webinar will also review the computational methods for detection, validation, and implementation of these genomic measures.
Building bioinformatics resources for the global community - ExternalEvents
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
Building bioinformatics resources for the global community. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management and GMI-9, 23-25 May 2016, Rome, Italy.
GRC Workshop held at Churchill College on Sep 21, 2014. Talk by Bronwen Aken discussing the Ensembl approach to annotating the complete human reference assembly.
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ... - Larry Smarr
06.09.15
Invited Talk
2006 Synthetic Biology Symposium
Aliso Creek Inn
Title: Building a Community Cyberinfrastructure to Support Marine Microbial Ecology Metagenomics
Laguna Beach, CA
A description of how technology has changed the face of biology, especially in the fields of genetics, proteomics, and evolution.
It includes a brief history, examples of usage, and a look into the future.
Presentation given at the Stockholm R useR Group (SRUG) meetup on Dec 6, 2016. Contains a general overview of deep learning, material on using Tensorflow in R etc.
The Seven Deadly Sins of Bioinformatics - Duncan Hull
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
Semantics for Bioinformatics: What, Why and How of Search, Integration and An... - Amit Sheth
Amit Sheth's Keynote at Semantic Web Technologies for Science and Engineering Workshop (held in conjunction with ISWC2003), Sanibel Island, FL, October 20, 2003.
Presentation for the BioAssist programmers' face-to-face, November 17, 2008, Utrecht, The Netherlands. BioAssist is a nationwide bioinformatics support programme.
Introduction to Gene Mining Part A: BLASTn-off! - adcobb
In this lesson, students will learn to use bioinformatics portals and tools to mine plant versions of human genes. Student handout and teacher resource materials are available at www.Araport.org, Teaching Resources (Community tab). Suitable for grades 9-12 or first year undergraduate students.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at The Linnean Society, Burlington Square, London, run by the RSC CICAG group.
Semantic Web for Health Care and Biomedical Informatics - Amit Sheth
Amit Sheth, "Semantic Web for Health Care and Biomedical Informatics," Keynote at NSF Biomed Web Workshop, Corbett, Oregon, December 4-5, 2007.
http://www.biomedweb.info/2007/
Similar to Data analysis & integration challenges in genomics:
This presentation gives a brief overview of the structural and functional attributes of nucleotides and the structure and function of genetic material, along with the impact of UV rays and pH upon them.
Richard's adventures in two entangled wonderlands - Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
(May 29th, 2024) Advancements in Intravital Microscopy: Insights for Preclini... - Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been achieved using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space as they occur in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, and developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, vascularization, and tumor metastasis in exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
A brief overview of the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
microRNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61-nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
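The base-pairing rule in the last point can be shown in a minimal Python sketch; the sequences below are invented, not from any real gene:

```python
# A small RNA pairs with the mRNA stretch equal to its reverse complement.
COMP = str.maketrans("AUGC", "UACG")

def reverse_complement(rna):
    """Complement each base (A<->U, G<->C), then reverse the strand."""
    return rna.translate(COMP)[::-1]

target = "UACGGAUUUCCCAAAGGGAUU"          # an invented 21-nt site in the mRNA
mrna = "GGGAUCC" + target + "CUAGUCA"
small_rna = reverse_complement(target)    # the 21-nt silencing RNA

# find where the small RNA can base-pair with the mRNA
site = mrna.find(reverse_complement(small_rna))
print(site)  # 7: the target site starts right after the 7-nt leader
```

Note the 21-nt length matches the typical small-RNA size given above; in cells the pairing tolerates mismatches for miRNAs, which a simple exact-match search does not capture.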
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (in Victor Ambros's lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that these transcripts must be causing the silencing through RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA: 23-25 nt long; trans-acting; binds its target mRNA with mismatches; causes translation inhibition.
siRNA: 21 nt long; cis-acting; binds its target mRNA as a perfectly complementary sequence.
piRNA (Piwi-interacting RNA): 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAi:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
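The three steps above can be sketched as a toy simulation; all sequences and fragment handling are invented simplifications (real Dicer cuts a duplex with overhangs, and strand selection is biochemical, not positional):

```python
# Toy model: Dicer dices long dsRNA into ~21-nt pieces, RISC keeps one
# strand as the guide, and any mRNA carrying the complementary site is cut.
COMP = str.maketrans("AUGC", "UACG")

def dicer(long_dsrna, size=21):
    """Cut a long dsRNA (sense strand shown) into size-nt fragments."""
    return [long_dsrna[i:i + size]
            for i in range(0, len(long_dsrna) - size + 1, size)]

def risc_cleave(guide, mrna):
    """If the guide base-pairs with the mRNA, cleave the mRNA at the site."""
    target = guide.translate(COMP)[::-1]      # reverse complement
    i = mrna.find(target)
    if i == -1:
        return [mrna]                         # no pairing: no silencing
    return [mrna[:i], mrna[i + len(target):]] # mRNA destroyed (cleaved)

dsrna = "AUGCUAGCUAGGAUCCGUACGAUCGUAGCUAGCAUGGCUAGCUA"
guides = dicer(dsrna)                          # short pieces from Dicer
mrna = "CCC" + guides[0].translate(COMP)[::-1] + "GGG"  # has a target site
print(risc_cleave(guides[0], mrna))            # the mRNA is cut in two
```

The output is the mRNA split at the target site, mirroring the sequence-specific degradation described above.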
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex which triggers mRNA degradation.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease which cleaves the target mRNA.
DICER: endonuclease (RNase III family)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille) domain: recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis) domain: breaks the phosphodiester bond of the mRNA (RNase H activity).
miRNA:
These double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... - Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... - Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides a means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior compounded of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
This pdf is about the Schizophrenia.
For more details visit on YouTube; @SELF-EXPLANATORY;
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
3. SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
Inaugurated mid-2010
Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node also in Uppsala.
Approximately 700 researchers
More than 100 researchers in bioinformatics and systems biology
5. National facilities at SciLifeLab
Clinical Genomics – clinical biomarkers, clinical sequencing, clinical diagnostics
Functional genomics – Eukaryotic Single Cell Genomics, Single Cell Proteomics, Microbial Single Cell Genomics, Karolinska High Throughput Center (KHTC)
Bioimaging – Advanced Light Microscopy, Fluorescence Correlation Spectroscopy
Drug discovery – ADME, Antibody Therapeutics, Protein Expression & Characterization, Lead Identification, Biophysical Screening etc.
Chemical Biology Consortium Sweden – Umeå, Uppsala, KI
Structural Biology – Protein Science Facility
Affinity proteomics – Biobank Profiling, Cell Profiling, Fluorescence Tissue Profiling, Mass Cytometry, PLA Proteomics, Protein and Peptide Arrays, Tissue Profiling
6. Bioinformatics facilities
• Bioinformatics compute and storage (UPPNEX)
• Short-term support (2 weeks / 80 h) + paid extension
– About 45 FTEs
• Long-term support (500 h) for projects selected by an external committee
– “Embedded bioinformaticians” participate in projects on a longer-term basis
7. Long-term bioinformatics support group
• Currently 13 senior bioinformaticians + 2 managers
• Currently recruiting 10 new employees, thereby expanding from Uppsala and Stockholm to other locations in Sweden
• Example projects (from my own work):
– Characterizing the human muscle transcriptome in connection with exercise
– Metagenomics for looking at the connection between international travel and antibiotic resistance
– Characterizing neural stem cells in the developing mouse brain
– Small RNAs involved in the CRISPR/Cas9 system in bacteria
8. Integrative bioinformatics initiative (“big data” project)
• Advertising for 4 positions: 2 in Gothenburg & 2 in Stockholm
• More in-depth support, experimental planning, method development
• Data integration
9. Pilot project: connecting layers of information
DNA – whole-genome sequencing, exome sequencing, CGH => mutations/SNVs, copy number variations, structural variations, gene fusions
RNA – RNA-seq, microarrays => mRNA isoforms, allele-specific expression, fusion transcripts, eQTLs
Proteins – high-throughput mass spectrometry => protein isoforms, post-translational modifications
11. My blog: Follow the Data
Machine learning, “big data”, “data science”, often in connection with life science
Published brief notes on APIs from One Codex, Google Genomics, SolveBio
13. … but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5 TB of unstructured data or 7.5–10 TB of structured data, which cannot be reduced any further” (OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data” (Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
14. Genomics big data in context: Throughput
[Chart: data processed per day, in terabytes, on a log scale from 1e+00 to 1e+06 Tb, for SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Twitter, Facebook, Baidu, NSA, Google, Ebay, the Internet and the world]
15. Genomics big data in context: Storage
[Chart: data stored, in petabytes, on a log scale from 1 to 10,000 PB, for AZ, SciLifeLab, Spotify, Sanger, Novartis, Ebay, Facebook, Baidu, NSA and Google]
16. Aside: Storage & processing frameworks
Hadoop, the standard solution for “big data” in industry, has not really caught on in genomics. Why? Some ideas:
- Existing computing infrastructure is sufficient
- Or, a focus on supercomputing solutions rather than commodity servers
- The programming/sysadmin skills and training are not there
- Many problems are not parallelizable
- Not enough flexibility for ad hoc, exploratory analysis
Spark/ADAM is a newer framework enabling more interactive, in-memory-oriented analysis.
17. Genomics big data in context: Heterogeneity
“The size of the data is not the whole story. If the data are uniform, they can almost always be compressed and filtered with traditional methods. You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”
(adapted from Aleksi Kallio)
18. Ideas on improving data integration
1. APIs to mitigate friction in data collection and preprocessing
2. Querying “by data set”
3. Leveraging advances in machine learning
So much public data out there!
19. APIs
Lowering barriers to entry with APIs (application programming interfaces: ways for a computer program to retrieve information automatically, in a defined manner).
“80% of the time of a data scientist is spent finding and preparing the data”
APIs against good reference collections mitigate the hassle of looking for the right data sources, handling different versions/releases, etc.
We should be able to ask questions such as:
“Which gene variants in a patient have been previously associated with a specific disease?” <= addressed by SolveBio and Google Genomics (with the inclusion of the Tute annotation db)
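As a sketch of what asking such a question in code might look like, here is a minimal, hypothetical API client. The endpoint, field names and response shape are all invented for illustration; they do not correspond to the real SolveBio or Google Genomics APIs.

```python
import json

# Hypothetical variant-annotation endpoint (placeholder, not a real API):
API_URL = "https://api.example.org/variants"

def parse_disease_associations(response_text, disease):
    """Return the variants in an API response previously associated with a disease."""
    records = json.loads(response_text)
    return [r["variant"] for r in records
            if disease in r.get("associated_diseases", [])]

# A canned JSON response standing in for a real HTTP call:
canned = json.dumps([
    {"variant": "rs429358", "associated_diseases": ["Alzheimer disease"]},
    {"variant": "rs12345",  "associated_diseases": []},
])

print(parse_disease_associations(canned, "Alzheimer disease"))
```

The point is the shape of the interaction: a defined request, a machine-readable response, no manual downloading or reformatting in between.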
20. APIs
Other questions could be, e.g.:
“Which microorganisms are found in this tissue sample?” <= addressed by the One Codex API
21. APIs
“Which genes are expressed exclusively in the parathyroid gland?”
22. APIs
“What is the most similar expression dataset to this one that I am currently working on?” <= partly addressed by NextBio (but it’s a commercial package!)
23. APIs
“Download all available sequences for arthropoda and store them as FASTQ files” <= addressed by bionode.io
24. APIs
“Give me the publicly available RNA-seq sequences that support this peptide that I found in mass spectrometry and which appears to have been translated from a fusion transcript”
25. Data provenance
Researchers often want to use processed data (avoiding the work of reprocessing everything from scratch), but they also want to know how the processing was done.
Each data set should have an “analysis history” attached.
This is also important for reproducibility and paper writing.
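A minimal sketch of what an attached analysis history could look like; the class and field names here are invented for illustration, not taken from any existing provenance standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnalysisStep:
    """One processing step applied to a data set."""
    tool: str
    version: str
    parameters: dict
    timestamp: str

@dataclass
class Dataset:
    name: str
    history: list = field(default_factory=list)

    def record(self, tool, version, **parameters):
        """Append a processing step, so downstream users can see how the data were made."""
        self.history.append(AnalysisStep(
            tool, version, parameters,
            datetime.now(timezone.utc).isoformat()))

ds = Dataset("patient_42_exome")
ds.record("bwa", "0.7.12", reference="GRCh38")
ds.record("gatk-HaplotypeCaller", "3.3", stand_call_conf=30)
for step in ds.history:
    print(step.tool, step.version, step.parameters)
```

With every data set carrying such a record, downstream users can decide whether the processed data are suitable for reuse without reprocessing from scratch.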
26. Querying by data set
Querying by dataset: we often want to relate our dataset to something “out there” without necessarily having a good preconception of what it could be (especially in metagenomics!).
NextBio does an interesting version of this, but it costs money (it has been acquired by Illumina) and focuses on selected types of functional studies.
27. Querying by dataset
Using the dataset itself, or a statistical description of it, as the query.
Jeff Jonas: “Data finds data”; “The data is the query”
“we want to support automated data exploration in ways that are simply not possible today”
C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
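The “data is the query” idea can be sketched in a few lines: summarize each data set as a vector (here, mean expression per gene) and rank public data sets by cosine similarity to the query set. The data-set names and values below are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Statistical summaries (mean expression of 4 genes) of hypothetical public sets:
public_sets = {
    "GSE_liver":  [9.1, 0.2, 4.3, 7.8],
    "GSE_brain":  [1.0, 8.5, 0.3, 2.2],
    "GSE_muscle": [8.9, 0.1, 4.0, 8.1],
}
query = [9.0, 0.3, 4.5, 7.5]  # summary of "my" data set: the data is the query

ranked = sorted(public_sets, key=lambda k: cosine(query, public_sets[k]),
                reverse=True)
print(ranked[0])  # most similar public data set
```

A real system would use richer summaries and approximate nearest-neighbor search over millions of sets, but the query interface is the same: a data set in, a ranked list of related data sets out.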
28. Cumulative biology and metagenomics: the unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”, “the unknown continent”
According to one estimate, less than 1% of viral diversity has been explored!
=> Reference databases are very limited!
29. The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that 80% of the 398 billion sequences they obtained could not be assembled into putative genes.
Where sequences could be assembled into putative genes encoding putative proteins, 60% of those proteins could not be matched to anything in the databases!
30. Ergo…
For metagenomics in particular, but also for other applications, we would like everything that has been published to be indexed in a better way, so that we can relate new findings to it. We need a constantly growing index.
When we perform a new experiment, we could then relate our results to all of the data out there, not just the part that has made it into the official reference databases.
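A toy sketch of such a growing index: decompose each sequence into k-mers and compare new data against everything indexed so far by Jaccard similarity. Production systems use scalable sketches (e.g. MinHash) instead of full k-mer sets; the sequences below are made up.

```python
def kmers(seq, k=4):
    """All length-k substrings of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# A constantly growing index: sequence name -> k-mer set
index = {}

def add_to_index(name, seq):
    index[name] = kmers(seq)

def query(seq):
    """Rank everything indexed so far by k-mer similarity to a new sequence."""
    q = kmers(seq)
    return sorted(((jaccard(q, s), name) for name, s in index.items()),
                  reverse=True)

add_to_index("ref_A", "ACGTACGTGGCCTTAA")
add_to_index("ref_B", "TTTTGGGGCCCCAAAA")
hits = query("ACGTACGTGGCC")
print(hits[0][1])  # best match
```

Because the index only grows, every new experiment can be compared against all previously seen data, not just curated references.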
31. Machine learning
Google has had great success with deep learning…
Learning to recognize cats from unlabeled YouTube videos (2012): a neural network with “3 million neurons and 1 billion synapses”.
…now it’s all over the place.
Inaugural Stockholm deep learning meetup, March 10, 2015
32. Deep learning
Perhaps deep learning could be used in genomics, proteomics etc. to transform diverse data sets into a more general representation, which would facilitate data integration?
New datasets could then be overlaid onto representations trained on large collections.
33. Deep learning in genomics (1)
How do gene expression patterns relate to cell type and state? Classifying expression profiles into cell types is a hard problem, because cell identity is really a hierarchy in which different genes are important at different levels.
We may be starting to accumulate enough data to enable a deep learning approach to learn a hierarchical representation of cell state based on expression profiles (particularly with all the single-cell RNA-seq data now coming out).
34. Deep learning in genomics (1)
First step: Casey Greene’s group (Dartmouth). A denoising autoencoder learned a generalized representation of breast cancer expression profiles based on the METABRIC cohort (>2,000 samples), validated on TCGA.
The nodes in the network can be interpreted as standing for different biological features.
Tan et al. (2015)
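To make the mechanics concrete, here is a toy denoising autoencoder on simulated data: corrupt the input, then train the network to reconstruct the clean version. Tan et al. trained on real expression data; the “expression matrix”, network sizes and training settings below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated low-rank "expression" data: 200 samples x 20 genes
X = sigmoid(rng.normal(size=(200, 5)) @ rng.normal(size=(5, 20)))

n_hidden = 5
W1 = rng.normal(0, 0.1, size=(20, n_hidden))   # encoder weights
W2 = rng.normal(0, 0.1, size=(n_hidden, 20))   # decoder weights

lr, losses = 0.5, []
for _ in range(300):
    noisy = X * (rng.random(X.shape) > 0.2)    # corrupt: zero out ~20% of inputs
    H = sigmoid(noisy @ W1)                    # encode the corrupted input
    Xhat = sigmoid(H @ W2)                     # reconstruct the clean input
    err = Xhat - X
    losses.append(float((err ** 2).mean()))
    # gradient descent on the squared reconstruction error
    dZ2 = err * Xhat * (1 - Xhat)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)
    W2 -= lr * (H.T @ dZ2) / len(X)
    W1 -= lr * (noisy.T @ dZ1) / len(X)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

After training, the hidden activations `H` are the learned representation; in the real model, individual hidden nodes could be interpreted as biological features.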
35. Deep learning in genomics (2)
A convolutional network for splice site detection reads the DNA sequence directly and abstracts it into higher-level features.
This network learned patterns of splice sites, and also rediscovered the concept of codons.
Hannes Bretschneider: http://www.psi.toronto.edu/~hannes/resources/MLCB2014-Presentation.pdf
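The first layer of such a network amounts to one-hot encoding the DNA and sliding filters along it. In this sketch the filter is hand-set to the canonical “GT” splice-donor dinucleotide rather than learned, purely to show the mechanics; the example sequence is made up.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length x 4) binary matrix."""
    return np.array([[b == base for base in BASES] for b in seq], dtype=float)

def conv_scores(x, filt):
    """Valid 1D cross-correlation of a (length x 4) sequence with a (w x 4) filter."""
    w = len(filt)
    return np.array([(x[i:i + w] * filt).sum() for i in range(len(x) - w + 1)])

donor = one_hot("GT")          # a 2-position filter matching the "GT" donor site
seq = "AACCGTAAGT"
scores = conv_scores(one_hot(seq), donor)
best = int(scores.argmax())
print(best, seq[best:best + 2])
```

A trained network learns many such filters from data instead of having them hand-set, and stacks further layers on top to combine them into higher-level features.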
36. “Classical” machine learning
Predictive modeling as a way to integrate information from different experimental assays.
Example: an ongoing mouse neural development project. A number of genome-wide experiments have been done in developing spinal cord and cortex; we have measurements/genome-wide signals for:
- Gene expression (RNA-seq)
- Where the Sox2 transcription factor is bound in each tissue (ChIP-seq)
- How open/accessible the chromatin is (DNase-seq)
- Potential transcription factor binding sites (DNase footprints)
as well as some calculated features, like certain interesting “DNA words” (transcription factor binding motifs) and how conserved each stretch of DNA is between mice and other organisms.
How to make some sense of all these data?
37. “Genome browser” view of the genomic landscape around a gene
[Figure: genome-browser screenshot showing a gene and different data tracks: conservation, chromatin “openness”, and Sox2 binding (raw signal and peaks)]
38. (borrowed from Mark Gerstein)
We decided we are most interested in understanding differences in gene expression between spinal cord and cortex neurons. Can the other measurements help?
We progressively summarized and abstracted the raw signals into blocks with various features => a matrix of ~20,000 genes x 13 features.
We use machine learning techniques to predict relative gene expression in cortex/spinal cord based on these features (ongoing…).
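As a toy stand-in for this kind of model, here is a plain linear fit on a simulated gene x feature matrix. The real project uses ~20,000 genes x 13 features from actual assays; everything below (sizes, weights, noise level) is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_features = 500, 13
# Feature matrix: one row per gene (e.g. Sox2 binding, openness, conservation, ...)
X = rng.normal(size=(n_genes, n_features))
true_w = rng.normal(size=n_features)
# Target: relative expression (e.g. a cortex/spinal-cord log ratio) plus noise
y = X @ true_w + rng.normal(scale=0.1, size=n_genes)

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit the linear model
pred = X @ w
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 3))
```

In practice one would use cross-validation and richer models (random forests, regularized regression) and inspect which features carry predictive weight, since that is what links the assays to expression differences.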
39. Recap
Indexing and querying technology such as Google’s can help genomics researchers by, e.g.:
- Enabling programmatic access to published data (processed, but with a known analysis history) to lower the threshold for integrative analysis
- Allowing them to relate their datasets to other published data without overly relying on curated reference databases (cumulative biology)
- Facilitating ingestion into machine learning (e.g. deep learning) systems for learning general features of biological data from a very large set of samples