Simon Andrews and Laura Biggins describe the process of RNA-Seq analysis from library preparation to data exploration and differential expression analysis. Key steps include rRNA depletion, fragmentation, adapter ligation, sequencing, quality control of raw reads, alignment to a reference genome, and quantification of gene and transcript expression between conditions. Statistical analysis with tools like DESeq2 is used to identify genes that are differentially expressed between groups.
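The count-normalization step that DESeq2 performs before differential testing can be illustrated with a small pure-Python sketch of its median-of-ratios scheme. This is a toy illustration only: the gene names and counts are invented, and the real package does far more (dispersion estimation, shrinkage, hypothesis testing).

```python
import math

def size_factors(counts):
    """Median-of-ratios normalization (the scheme DESeq2 uses).

    counts: dict mapping gene -> list of raw counts, one per sample.
    Returns one size factor per sample; dividing a sample's counts
    by its factor puts all samples on a comparable scale.
    """
    # Use only genes with nonzero counts in every sample.
    genes = [g for g, row in counts.items() if all(c > 0 for c in row)]
    n_samples = len(next(iter(counts.values())))
    # Geometric mean of each gene across samples: the "pseudo-reference".
    ref = {g: math.exp(sum(math.log(c) for c in counts[g]) / n_samples)
           for g in genes}
    factors = []
    for j in range(n_samples):
        ratios = sorted(counts[g][j] / ref[g] for g in genes)
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors
```

Here a sample sequenced twice as deeply gets a size factor twice as large, so dividing by the factors removes the depth difference.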
This document provides an introduction and overview of common methods for processing and analyzing next generation sequencing (NGS) data, including mapping NGS reads and de novo assembly of NGS reads. It discusses various NGS applications such as RNA-Seq, epigenetics, structural variation detection, and metagenomics. Key steps in read alignment such as choosing an alignment program and viewing alignments are outlined. Considerations for choosing an alignment program based on library type, read type, and platform are also reviewed. Popular alignment programs including Bowtie, BWA, TopHat, and Novoalign are mentioned.
RNA sequencing analysis tutorial with NGS, by HAMNAHAMNA8
This document provides an overview of RNA-seq data analysis. It discusses quality control of sequencing data using tools like FastQC, mapping reads to a reference genome or transcriptome using aligners like BWA and TopHat, and summarizing reads using counting tools to obtain read counts for each gene. These counts can then be used to estimate gene expression levels and perform differential expression analysis to identify genes with different expression between samples or conditions.
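The step from raw read counts to comparable expression levels can be sketched as counts-per-million (CPM) scaling, one of the simplest normalizations in common use (a toy illustration with invented counts; production tools use more robust schemes):

```python
def cpm(counts_per_sample):
    """Counts-per-million: scale each gene's count by the sample's
    library size so expression is comparable across samples of
    different sequencing depth."""
    total = sum(counts_per_sample.values())
    return {gene: c * 1_000_000 / total for gene, c in counts_per_sample.items()}
```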
The Advanced Data Analysis Centre (ADAC) at the University of Nottingham provides bioinformatics and data analysis support for complex genomic and transcriptomic datasets. It offers a range of services including obtaining and processing next-generation sequencing data, quality control, mapping, variant calling, and specialized analysis. ADAC has expertise in many areas relevant to NGS analysis and is able to provide flexible consultancy, collaboration, and bespoke analysis support for high-quality research.
Neuroscience core lecture given at the Icahn School of Medicine at Mount Sinai. This is version 2 of the same topic; I have made some modifications to give a gentler introduction and to add a new example for ngs.plot.
This document provides an overview of RNA-Seq analysis. It begins with considerations for RNA-Seq experiments such as computational requirements. It then describes the general RNA-Seq analysis workflow including short-read alignment, transcript reconstruction, abundance estimation, visualization, and statistics. The document focuses on explaining the "Tuxedo" analysis pipeline, which includes Bowtie, TopHat, Cufflinks, Cuffmerge, Cuffdiff and CummeRbund. It provides examples of commands for each step and discusses alternative tools. The document concludes with resources for further information on RNA-Seq analysis.
This document discusses strategies for analyzing moderately large data sets in R when the total number of observations (N) times the total number of variables (P) is too large to fit into memory all at once. It presents several approaches including loading data incrementally from files or databases, using randomized algorithms, and outsourcing computations to SQL. Specific examples discussed include linear regression on large data sets and whole genome association studies.
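The incremental-loading strategy can be made concrete with linear regression: the normal equations need only a handful of running sums, so data can be streamed chunk by chunk from a file or database and discarded. A minimal sketch (in Python for brevity, though the document discusses R; univariate only):

```python
class StreamingOLS:
    """Simple linear regression y = a + b*x fitted incrementally,
    so arbitrarily large datasets can be processed chunk by chunk
    without ever holding them all in memory."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        # Accumulate the sufficient statistics from one chunk.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        # Closed-form least-squares solution from the running sums.
        b = (self.n * self.sxy - self.sx * self.sy) / (self.n * self.sxx - self.sx ** 2)
        a = (self.sy - b * self.sx) / self.n
        return a, b
```

Feeding the chunks in any order or size gives the same fit as loading everything at once, which is the point of the approach.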
Next-generation sequencing format and visualization with ngs.plot, by Li Shen
Lecture given at the Department of Neuroscience, Icahn School of Medicine at Mount Sinai. ngs.plot has been published in BMC Genomics. Link: http://www.biomedcentral.com/1471-2164/15/284
This document provides an overview and introduction to RNA-seq analysis using Next Generation Sequencing. It discusses the RNA-seq workflow including mapping reads with TopHat2, transcript assembly with Cufflinks, and differential expression analysis. Key points covered include the advantages of RNA-seq over microarrays, the exponential drop in sequencing costs, mapping strategies for junction reads including TopHat, and running TopHat from the command line.
Marius Eriksen discusses Reflow, a new cloud-native workflow framework for bioinformatics. Reflow expresses workflows directly in a functional programming language for simplicity and composability. It leverages lazy evaluation and caching to efficiently parallelize and distribute work across private clusters. Reflow aims to untie the hands of implementors compared to traditional workflow systems through its unified approach to programming, execution, and infrastructure.
The document discusses RNA-Seq data analysis. Some key points:
- RNA-Seq involves sequencing steady-state RNA in a sample without prior knowledge of the organism. It can uncover novel transcripts and isoforms.
- Making sense of the large and complex RNA-Seq data depends on the scientific question, such as finding transcribed SNPs for allele-specific expression or novel transcripts in cancer samples.
- Common applications of RNA-Seq include abundance estimation, alternative splicing detection, RNA editing discovery, and finding novel transcripts and isoforms.
- Analysis steps include mapping reads to a reference genome/transcriptome, generating mapping statistics and quality metrics, differential expression analysis, clustering, and pathway analysis using tools like...
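The differential expression step listed above rests on a per-gene fold-change calculation, which can be sketched minimally (the pseudocount of 1 is an assumed convention to keep zero counts finite; real tools also model variance across replicates):

```python
import math

def log2_fold_change(ctrl, treat, pseudocount=1.0):
    """Per-gene log2 fold change between two conditions.

    ctrl, treat: lists of normalized counts across replicates.
    A pseudocount keeps genes with zero counts from producing
    infinite or undefined values.
    """
    mean_c = sum(ctrl) / len(ctrl) + pseudocount
    mean_t = sum(treat) / len(treat) + pseudocount
    return math.log2(mean_t / mean_c)
```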
Next-generation sequencing data format and visualization with ngs.plot (2015), by Li Shen
An introduction to the commonly used formats for next-generation sequencing data. ngs.plot is a popular tool for visualization and data mining of NGS data.
Making powerful science: an introduction to NGS data analysis, by AdamCribbs1
This slide deck is from the Botnar Research Centre introduction to NGS sequencing workshop (2021); it gives an overview of the theoretical concepts behind sequencing data analysis.
Course: Bioinformatics for Biomedical Research (2014).
Session: 4.1- Introduction to RNA-seq and RNA-seq Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) of the Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by..., from Spark Summit
This document discusses using Apache Spark to assemble metagenomes from short-read sequencing data. Metagenomes are genomes from microbial communities containing many species. Spark provides an efficient and scalable approach compared to previous methods. The document demonstrates clustering reads from small test datasets in Spark and evaluates performance on real datasets, scaling from 20GB up to 100GB, where runs began to fail. While Spark is easy to develop for and efficient, challenges remain in robustness at large scales and in optimizing for different problem complexities.
Processing 70Tb Of Genomics Data With ADAM And Toil, by Spark Summit
This document discusses analyzing large genomic datasets with ADAM and Toil. It summarizes the sequencing and analysis process, and how ADAM implemented on Spark can provide horizontal scalability and speedups of 30-50x over traditional tools. Toil is introduced as a pipeline system for massive genomic workflows that can run on thousands of nodes and is resilient to failures. Results show ADAM produces equivalent variants to GATK while being 3.5x faster and 4x cheaper.
The document outlines key concepts related to distributed database systems including what a distributed database is, applications of distributed databases, issues in distributed database design, fragmentation, query processing, transaction management, and reliability. It provides examples of primary horizontal fragmentation and derived horizontal fragmentation. It also summarizes query optimization objectives, issues, and cost-based optimization approaches like join ordering and semijoin algorithms.
Cassandra is a decentralized structured storage system developed at Facebook as an extension of Bigtable with aspects of Dynamo. It provides high availability, high write throughput, and failure tolerance. Cassandra uses a gossip-based protocol for node communication and management, and a ring topology for data partitioning and replication across nodes. Tests on Facebook data showed Cassandra providing lower latency for writes and reads compared to MySQL, and it scaled well to large datasets and workloads in experiments.
Cassandra is a decentralized structured storage system designed for high availability, high write throughput, and failure tolerance. It uses a gossip-based protocol for node communication and a ring topology for data partitioning across nodes. Data is replicated across multiple nodes for fault tolerance. Cassandra provides low-latency reads and high-throughput writes through its use of commit logs, memtables, and Bloom filters. It was developed at Facebook to power user messaging search and scaled to support over 50TB of user data distributed across 150 nodes. Benchmark results show Cassandra providing lower read and write latencies compared to MySQL on large datasets.
Cassandra is a decentralized structured storage system developed at Facebook as an extension of Bigtable with aspects of Dynamo. It provides high availability, high write throughput, and failure tolerance. Cassandra uses a gossip-based protocol for node communication and management, and a ring topology for data partitioning and replication across nodes. Tests on Facebook data showed Cassandra providing lower latency for writes and reads compared to MySQL, and it scaled well to large datasets and workloads based on YCSB benchmarking.
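The ring topology for partitioning and replication described in these summaries can be sketched as a toy consistent-hashing ring. Node names, the hash function, and the replica count here are invented for illustration; real Cassandra adds virtual nodes, pluggable partitioners, and replication strategies.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hashing ring: each node owns a token on the
    ring, and a key is stored on the first node whose token follows
    the key's token (wrapping around), plus the next nodes clockwise
    as replicas."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((self._token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _token(key):
        # Hash to a point on the ring; md5 used only for determinism.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def nodes_for(self, key):
        """Primary node plus the next replicas-1 distinct nodes clockwise."""
        i = bisect.bisect_right(self.tokens, self._token(key))
        return [self.ring[(i + k) % len(self.ring)][1]
                for k in range(self.replicas)]
```

The appeal of the scheme is that adding or removing one node moves only the keys on the adjacent arc, rather than reshuffling the whole dataset.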
The document provides information about RNA-seq analysis using R and Bioconductor. It begins with an introduction to the BCBB branch and its services assisting researchers with bioinformatics and computational projects. The document then discusses RNA-seq, R, and Bioconductor individually before explaining how they can be used together for RNA-seq analysis. Step-by-step tutorials and resources are provided for differential expression analysis and other tasks using R packages like DESeq2.
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin..., by DataStax Academy
The audience will participate in a live, interactive demo that generates high-quality recommendations using the latest Spark-Cassandra integration for real time, approximate, and advanced analytics including machine learning, graph processing, and text processing.
The ability to easily and efficiently analyse RNA-sequencing data is a key strength of the Bioconductor project. Starting with counts summarised at the gene level, a typical analysis involves pre-processing, exploratory data analysis, differential expression testing and pathway analysis, with the results obtained informing future experiments and validation studies.
https://www.shamra.sy/academia/show/5b06e01c54e75
Cassandra is a structured storage system designed for large amounts of data across commodity servers. It provides high availability with eventual consistency and scales incrementally without centralized administration. Data is partitioned across nodes and replicated for fault tolerance. Writes are applied locally and propagated asynchronously, prioritizing availability over consistency. It uses a gossip protocol for membership and failure detection.
The document discusses tools for analyzing transcriptome data. It describes FastQC, a tool used for quality control checks on raw sequencing data by generating statistics on base quality, GC content, overrepresented sequences, etc. Scripture is described as a tool for de novo assembly of RNA-seq data that relies on aligned reads and a reference genome to reconstruct transcripts. The document outlines the typical workflow of indexing aligned reads, running quality checks with FastQC, and using Scripture or other tools for reconstruction. Common file formats like FASTQ, SAM, BAM and output formats like BED are also summarized.
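The FASTQ format mentioned above is simple enough to parse by hand. A minimal sketch of a record reader and the per-read mean Phred quality that tools like FastQC summarize, assuming the common Sanger/Illumina 1.8+ encoding (quality = ASCII code minus 33):

```python
def read_fastq(lines):
    """Yield (name, sequence, quality_string) records from FASTQ text.
    FASTQ stores each read as four lines: @name, sequence, '+', qualities."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                     # the '+' separator line
        qual = next(it).strip()
        yield header.strip().lstrip('@'), seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred quality of one read; Sanger/Illumina 1.8+ encodes
    quality Q as the character with ASCII code Q + 33."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)
```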
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin..., by Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ..., by alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Dandelion Hashtable: beyond billion requests per second on a commodity server, by Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, some going as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open addressing and adopts a fully featured, memory-aware closed-addressing design based on bounded cache-line chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
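The closed-addressing, bounded-chain idea can be illustrated with a toy sketch. This is single-threaded Python, so DLHT's lock-freedom, prefetching, and parallel resizing are not modeled, and the bucket capacity of 4 is an arbitrary stand-in for one cache line of entries:

```python
class ChainedTable:
    """Toy closed-addressing hashtable with bounded buckets, loosely
    in the spirit of bounded cache-line chaining: each key hashes to
    one bucket, entries within a bucket are scanned linearly, and a
    delete frees its slot instantly."""
    BUCKET = 4  # stand-in for the entries that fit one cache line

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        b = self._bucket(key)
        for i, (k, _) in enumerate(b):
            if k == key:
                b[i] = (key, value)   # update in place
                return
        if len(b) >= self.BUCKET:
            raise MemoryError("bucket full: a real design would resize here")
        b.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        b = self._bucket(key)
        for i, (k, _) in enumerate(b):
            if k == key:
                b.pop(i)              # slot freed immediately
                return True
        return False
```

Unlike open addressing, a delete here never leaves a tombstone behind, which is one of the properties the deck argues for.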
TrustArc Webinar - 2024 Global Privacy Survey, by TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency, by ScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
A Comprehensive Guide to DeFi Development Services in 2024, by Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
2. RNA-Seq Libraries
rRNA depleted mRNA
Fragment
Random prime + RT
2nd strand synthesis (+ U)
A-tailing
Adapter Ligation
(U strand degradation)
Sequencing
[Diagram: library fragments showing the random primer (NNNN), the U-containing second strand, and the A/T overhangs used for adapter ligation]
3. Reference based RNA-Seq Analysis
QC → Trimming → Mapping → Mapped QC → Exploration and Quantitation → Statistical Analysis
18. HiSat2 pipeline
Reference FastA files → Indexed Genome
Reference GTF models → Pool of known splice junctions
Reads (fastq) are aligned against the indexed genome using the pool of known junctions:
• Maps with known junctions → Report
• Maps convincingly with a novel junction?
– Yes → Report, and add the new junction to the pool
– No → Discard
23. Running programs in Linux
• Open a shell (text based OS interface)
• Type the name of the program you want to run
– Add on any options the program needs
– Press return - the program will run
– When the program ends control will return to the shell
• Run the next program!
24. Running programs
user@server:~$ ls
Desktop Documents Downloads examples.desktop
Music Pictures Public Templates Videos
user@server:~$
Command prompt - you can't enter a command unless you can see this
The command we're going to run (ls in this case, to list files)
The output of the command - just text in this case
25. The structure of a unix command
ls -ltd --reverse Downloads/ Desktop/ Documents/
Program
name
Switches Data
(normally files)
Each option or section is separated by spaces. Options or filenames containing spaces must be quoted.
26. Command line switches
• Change the behaviour of the program
• Come in two flavours (each option often has both types available)
– A single minus plus a single letter (eg -x -c -z)
• Can be combined (eg -xcz)
– Two minuses plus a word (eg --extract --gzip)
• Can't be combined
• Some take an additional value
-f somefile.txt (specify a filename)
--width=30 (specify a value)
27. Specifying file paths
• Specify names from whichever directory you are currently in
– If I'm in /home/simon
– Data/big_data.fq.gz
• is the same as /home/simon/Data/big_data.fq.gz
• Move to the directory with the data and just use file names
– cd Data
– big_data.fq.gz
[Diagram: directory tree home → simon → Data → big_data.fq.gz]
28. Command line completion
• Most errors in commands are typing errors in either program
names or file paths
• Shells (eg BASH) can help with this by offering to complete path names for you
• Command line completion is achieved by typing a partial path
and then pressing the TAB key (to the left of Q)
29. Command line completion
List of files / folders:
Desktop
Documents
Downloads
Music
Public
Published
Templates
Videos
T [TAB] → Templates
P [TAB] → Publ
Do [TAB] → [beep]
Do [TAB] [TAB] → Documents Downloads
Doc [TAB] → Documents
You should ALWAYS use TAB completion to fill in paths for
locations which exist so you can't make typing mistakes
(it obviously won't work for output files though)
30. Debugging Tips
• If anything (except the splice site extraction) completes almost
immediately then it didn't work!
• Look for errors before asking for help. They will either be
– The last piece of text before the program exited
– The first piece of text produced after it started (followed by the help file)
• To see if a program is running go to another shell and look at the last file
produced to see if it's growing
• Programs which are stuck can be cancelled with Control+C
31. Some useful commands
cd mydir Change directory to mydir
ls -ltrh List files in the current directory, show details and
put the newest files at the bottom
less x.txt View the x.txt text file
Return = down one line
Space = down one page
q = quit
41. Fixing Duplication?
• If duplication is biased (some genes more than others)
– Can’t be ‘fixed’ – can still analyse but be cautious
• If it’s unbiased (everything is duplicated)
– Doesn’t affect quantitation
– Will affect statistics
– Can estimate global level and correct raw counts
42. Quantitation
[Diagram: a gene with three exons. Splice form 1 contains Exons 1, 2 and 3; splice form 2 contains only Exons 1 and 3. Some reads map definitely to splice form 1, some definitely to splice form 2, and some are ambiguous between the two.]
43. Simple Quantitation - Forget splicing
• Count read overlaps with exons of each gene
– Consider library directionality
– Simple
– Gene level quantitation
– Many programs
• Seqmonk (graphical)
• featureCounts (Subread)
• BEDTools
• HTSeq
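The exon-overlap counting idea can be sketched in a few lines of Python. The gene models, coordinates and reads below are invented purely for illustration; real tools such as featureCounts or HTSeq read BAM files and handle strand, multi-mapping and many edge cases.

```python
# Minimal sketch of gene-level counting: assign a read to a gene if it
# overlaps any exon of that gene, and skip reads hitting more than one gene.
# All coordinates and gene models here are made up for illustration.

exons = {  # gene -> list of (start, end) exon intervals, inclusive
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(350, 500)],
}

def genes_hit(read_start, read_end):
    """Return the set of genes whose exons the read overlaps."""
    hits = set()
    for gene, intervals in exons.items():
        for (s, e) in intervals:
            if read_start <= e and read_end >= s:  # interval overlap test
                hits.add(gene)
    return hits

def count_reads(reads):
    """Count unambiguous reads per gene (reads overlapping two genes are skipped)."""
    counts = {gene: 0 for gene in exons}
    for (start, end) in reads:
        hits = genes_hit(start, end)
        if len(hits) == 1:           # unambiguous: exactly one gene
            counts[hits.pop()] += 1
    return counts

reads = [(110, 160), (360, 390), (450, 480), (10, 50)]
print(count_reads(reads))  # → {'geneA': 1, 'geneB': 1}
```

Note that the read at (360, 390) overlaps exons of both genes and is therefore dropped as ambiguous, which is the "union" style behaviour of the counting tools listed above.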
44. Analysing Splicing
• Try to quantitate transcripts (cufflinks, RSEM, bitSeq)
• Quantitate exons and compare to gene (EdgeR, DEXSeq)
• Quantitate splicing events (rMATS, MAJIQ)
45. Normalisation: RPKM / FPKM / TPM
• RPKM (Reads per kilobase of transcript per million reads of library)
– Corrects for total library coverage
– Corrects for gene length
– Comparable between different genes within the same dataset
• FPKM (Fragments per kilobase of transcript per million fragments of library)
– Only relevant for paired end libraries
– Pairs are not independent observations
– Effectively halves raw counts
• TPM (transcripts per million)
– Normalises to transcript copies instead of reads
– Corrects for cases where the average transcript length differs between samples
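As a rough illustration of the difference, here is a minimal numpy sketch of RPKM and TPM on made-up counts and transcript lengths; real pipelines compute these from the full count matrix.

```python
import numpy as np

# Illustrative raw counts and transcript lengths (made-up numbers)
counts = np.array([500.0, 1000.0, 200.0])    # reads per gene
lengths = np.array([2000.0, 1000.0, 500.0])  # transcript length in bp

def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million reads of library."""
    per_million = counts.sum() / 1e6
    return counts / per_million / (lengths / 1e3)

def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rate = counts / lengths
    return rate / rate.sum() * 1e6

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
# TPM values always sum to exactly 1e6 within a sample; RPKM totals vary
# between samples, which is why TPM corrects for differing average
# transcript lengths while RPKM does not.
```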
49. Size Factor Normalisation
• Make an ‘average’ sample from the mean of expression for
each gene across all samples
• For each sample calculate the distribution of differences
between the data in that sample and the equivalent in the
‘average’ sample
• Use the median of the difference distribution to normalise the
data
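The "average sample" procedure described above is essentially the median-of-ratios method used by DESeq. A minimal numpy sketch on a toy count matrix (all numbers invented) might look like this:

```python
import numpy as np

# Toy count matrix: rows = genes, columns = samples (made-up numbers;
# sample 2 is sequenced to twice the depth of sample 1).
counts = np.array([
    [100.0, 200.0],
    [ 50.0, 100.0],
    [ 30.0,  60.0],
    [ 10.0,  20.0],
])

def size_factors(counts):
    """Median-of-ratios size factors.

    Build an 'average' reference sample (geometric mean per gene),
    then take each sample's median ratio to that reference.
    """
    log_counts = np.log(counts)
    ref = log_counts.mean(axis=1)                 # log geometric mean per gene
    finite = np.isfinite(ref)                     # drop genes with any zero count
    log_ratios = log_counts[finite] - ref[finite, None]
    return np.exp(np.median(log_ratios, axis=0))

print(size_factors(counts))
```

Dividing each sample's counts by its size factor puts all samples on a common scale; because the median is used, a handful of strongly changing genes cannot distort the normalisation the way they can with total-count scaling.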
52. Exploratory Analyses
• Time to understand your data
– Behaviour of raw data and annotation
– Clustering of samples (PCA / tSNE etc)
– Pairwise comparisons of samples and
groups
– Are expected effects present (eg KO)?
– Can I validate other aspects of the
samples (eg sex)
– Can I see obvious changes?
– Are the changes convincing?
55. DESeq2 Negative Binomial Stats
• Are the counts we see for gene X in condition 1 consistent with those
for gene X in condition 2?
• Size factors
– Estimator of library sampling depth
– More stable measure than total coverage
– Based on median ratio between conditions
• Variance – required for Negative Binomial distribution
– Insufficient observations to allow direct measure
– Custom variance distribution fitted to real data
– Smooth distribution assumed to allow fitting
56. Dispersion shrinkage
• Plot observed per gene dispersion
• Calculate average dispersion for genes
with similar observation levels
• Individual dispersions regressed towards
the mean. Weighted by
– Distance from mean
– Number of observations
• Points more than 2SD above the mean
are not regressed
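The shrinkage idea can be sketched as follows. The fixed weight and the 2-SD outlier rule here are simplified stand-ins for DESeq2's empirical Bayes weighting (which depends on replicate number and distance from the trend), so this is purely for illustration.

```python
import numpy as np

# Toy per-gene dispersion shrinkage: move each gene's observed dispersion
# towards a fitted mean-dispersion trend, except outliers far above it.
# All numbers are invented for illustration.

observed = np.array([0.10, 0.40, 0.15, 2.50])  # per-gene observed dispersions
trend    = np.array([0.12, 0.20, 0.14, 0.18])  # fitted trend at each gene's mean

def shrink(observed, trend, weight=0.6, sd_cutoff=2.0):
    """Regress log dispersions towards the trend; leave extreme outliers alone."""
    log_obs, log_trend = np.log(observed), np.log(trend)
    resid = log_obs - log_trend
    sd = resid.std()
    shrunk = (1 - weight) * log_obs + weight * log_trend  # move towards trend
    outlier = resid > sd_cutoff * sd                      # > 2 SD above trend:
    shrunk[outlier] = log_obs[outlier]                    # not regressed
    return np.exp(shrunk)

print(shrink(observed, trend))
```

The last gene sits far above the trend and keeps its observed dispersion of 2.5, while the other genes move partway towards the fitted line, exactly the behaviour the slide describes for hyper-variable genes.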
57. 5x5 Replicates
8,022 out of 18,570 genes (43%) identified
as DE using DESeq (p<0.05)
Needs further filtering
Two options:
1. Decrease the p-value cutoff
2. Filter on magnitude of change
(both are a bit rubbish)
Visualising Differential Expression Results
59. Fold Change Shrinkage
• Aims to make the log2 Fold change a more useful value
• Tries to remove systematic biases
• Two types:
1. Fold Change Shrinkage – removes bias from both expression level
and variance, produces a modified fold change
2. Intensity difference – removes bias from just expression level,
produces a p-value
66. Practical Experiment Design
• What type of library?
• What type of sequencing?
• How many reads?
• How many replicates?
67. What type of library?
• Directional libraries if possible
– Easier to spot contamination
– No mixed signals from antisense transcription
– May be difficult for low input samples
• mRNA vs total vs depletion etc.
– Down to experimental questions
– Remember lincRNAs may not have a polyA tail
– Active transcription vs standing mRNA pool
68. What type of sequencing
• Depends on your interest
– Expression quantitation of known genes
• 50bp single end is fine
– Expression plus splice junction usage
• 100bp (or longer if possible) single end
– Novel transcript discovery or per transcript expression
• 100bp paired end
69. How many reads
• Typically aim for 20 million reads for human / mouse sized
genome
• More reads:
– De-novo discovery
– Low expressed transcripts
• More replicates more useful than more reads
70. Replicates
• Compared to arrays, RNA-Seq is a very clean technical measure of expression
– Generally don’t run technical replicates
– Must run biological replicates
• For clean systems (eg cell lines) 3x3 or 4x4 is common
• Higher numbers required as the system gets more variable
• Always plan for at least one sample to fail
• Randomise across sample groups
71. Power Analysis
• Power Analysis is not simple for RNA-Seq data
– Not a single test – one test per gene
– Need to apply multiple testing correction
– Each gene will have different power
• Power correlates with observation level
• Variations in variance per gene
• Several tools exist to automate power analysis
– All require parameters which are difficult to estimate, and have
dramatic effects on the outcome
73. Tools available
• RnaSeqSampleSize https://cqs-vumc.shinyapps.io/rnaseqsamplesizeweb/
• Scotty http://scotty.genetics.utah.edu/
• All require an estimate of count vs variance
– Pilot data (if only!)
– “Similar” studies
We are planning an RNA sequencing experiment to identify differential gene expression between two groups. Prior data
indicates that the minimum average read counts among the prognostic genes in the control group is 500, the maximum
dispersion is 0.1, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes for
testing is 10000 and the top 100 genes are prognostic. If the desired minimum fold change is 3, we will need to study 4
subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with
probability (power) 0.8 using exact test. The FDR associated with this test of this null hypothesis is 0.05.
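The quoted sample-size statement can be sanity-checked with a rough Monte Carlo simulation. The Welch t-test on log counts below is only a stand-in for the exact test that RnaSeqSampleSize uses, and the negative binomial parameterisation assumes var = mean + dispersion * mean^2, so the result is indicative at best.

```python
import numpy as np
from scipy import stats

# Monte Carlo sketch of per-gene power for the design described above:
# control mean 500, dispersion 0.1, fold change 3, 4 subjects per group.

rng = np.random.default_rng(42)

def nb_sample(mean, dispersion, n, size):
    """Draw NB counts with var = mean + dispersion * mean**2."""
    r = 1.0 / dispersion            # NB 'size' parameter
    p = r / (r + mean)
    return rng.negative_binomial(r, p, size=(size, n))

def power(mean=500, fold=3, dispersion=0.1, n=4, alpha=0.05, sims=2000):
    """Fraction of simulated genes detected at the given alpha."""
    a = nb_sample(mean, dispersion, n, sims)
    b = nb_sample(mean * fold, dispersion, n, sims)
    _, p = stats.ttest_ind(np.log(a + 1), np.log(b + 1),
                           axis=1, equal_var=False)
    return (p < alpha).mean()

print(power())  # high power for a 3-fold change at this depth and dispersion
```

Even this crude simulation shows why a 3-fold change at a mean of 500 reads is comfortably detectable with 4 per group, while subtler changes or noisier systems quickly demand more replicates.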
The format of the course will be…
Firstly – theory
Steps involved in an RNA-seq analysis
Practical session
Point out parts in library prep which are relevant later on.
Start with single strand RNA.
Can't sequence because it's RNA and too long
Must fragment to make short bits
Must convert to DNA, but that loses directionality - which is bad.
Random priming and RT (causes biases later)
Normal Illumina library prep once double stranded.
Can retain strand information by tagging and degrading one of the strands. Which one determines same strand or opposing strand specific libraries.
Focus on experiment types where you have a reference.
Mention transcriptome assembly.
Some QC looks very similar for different types of sequencing
Random primed so no positional bias is expected.
Expect to see 4 horizontal lines, not always all 25%, GC and AT might be different
Will see biases in the actual data. Will be flagged as a problem.
Assume hexamer libraries have all possible hexamers in equal proportion
Assume that all possible hexamers bind and extend with equal efficiency.
Should worry if you don't see this.
Any slight biases in the analysis will be similar over all samples.
Definitely a problem.
Hard to assess from raw sequence.
Shouldn't expect that all sequence are present equally – measuring expression levels
You will see high duplication levels, this isn't necessarily a problem.
Can look much more sensitively after mapping.
Explain paired end sequence and colours.
Must use a splicing aware read mapper.
TopHat was original. Mapped to transcriptome first, then to genome if no hit.
STAR and HiSat are very similar. Map directly to genome but with knowledge of splice sites.
We're using Hisat as it uses less memory.
Pick one and stick to it.
Discovery of new junctions depends a lot on where the junction is in the read.
50/50 easy to discover
90/10 probably impossible
Can do 1 or 2 pass processing. 2 pass is more complete - uses first pass just to find junctions but is slow.
In practice 1 pass is fine and you hardly lose anything.
Important thing to check is consistency
MultiQC
Zoomed right in on one gene
To get a broader overview – RNA-seq QC report examines all the data
Coincidental vs technical (PCR) duplication.
Anything that isn’t a smooth distribution is more worrying
What to do about bad duplication;
Duplication doesn't affect corrected quantitation.
Mainly affects count based statistics.
Can estimate basal level of duplication and divide raw counts. Only matters for statistical tests. Don't use unless there is a problem.
This is a simple example – about as simple as you get.
Paired end mapping as before.
Get a lot of ambiguous reads.
Not going to do any of this today.
Transcript level expression: Expectation maximisation model to create the most likely redistribution of counts between different splice forms.
Can work well in some cases. Can be *very* wrong in others.
Very difficult to evaluate.
Counting junctions is much less complex. Set of counts which you can treat the same as expression levels.
Splicing decisions is nice - ratio test.
RPKM and TPM are very functionally equivalent.
Main thing is the scale - TPM values are much higher. Can be an issue when you log transform. Very common to get negative logRPKM values which is fine, but people don't like.
Point out the upper scoop on a log scale.
Changes in high expressed genes can mess up the normalisation. Particularly rRNA but there are others too.
If normalisation is messed up then apply additional correction.
Percentile normalisation - pick a nicely behaving part of the distribution.
Size factor normalisation - take the median difference between all genes of two sets.
Most of the time it's fine. Don't do additional correction if you don't need to.
Apply DNA contamination estimation and subtraction only if you know you have a problem. Can fix problems which a mathematical transformation can't.
Count based is natural fit for raw count quantitation
Continuous would work with cufflinks or other transcript level quantitation.
Binomial stats are very powerful with good power to detect changes.
Talk about DESeq but others are very similar (EdgeR, BaySeq etc).
Big problem is that people don't do enough replicates. Need a statistical kludge to work around this.
We don’t get enough observations for an individual gene to get a good measure of variance. Need to share information between genes.
There is a global relationship between variance and observation level.
Makes sense - low observed is very variable, high observed is more stable.
Can construct a global regression line of the relationship.
In the test we don't use the observed variance but a 'shrunken' version of this where it is moved towards the global line.
Amount it moves dependent on
Number of replicates
Distance from line
For most genes this is fine and improves the analysis. For some hyper-variable genes though it's *bad*
Worst offenders removed by not shrinking any points more than 2SD above the mean. Nothing statistical or magical about 2SD. Other points won't hit that limit.
Bottom line is that the fewer replicates you have the more you rely on the global model.
Always look at your results. Spots obvious errors in calculation.
Example cell lines in a dish with a compound added.
Samples collected in the wild, dissected, extracted, posted, processed at different times.