Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015
1. Dr. Sven Nahnsen, Quantitative Biology Center (QBiC)
Data Management for Quantitative Biology
Lecture 1: Introduction and overview
2. Overview
• Administrative stuff (credits, requirements)
• Motivation/quick review of relevant contents (Bioinformatics I and II)
• Introduction to this lecture series
• Semester overview
3. Course requirements
To pass this course you must:
• regularly and actively participate in the weekly problem sessions,
• pass the final exam, the assignments and the project.
• Assignments have to be worked on individually.
• You will work in small groups on the problem-oriented research project.
4. Course credits and grading
• Credits
- MSc Bioinfo: 4 LP, module “Wahlpflichtbereich Bioinformatik”
- MSc Info: 4 LP, area “Wahlpflichtbereich Informatik”
• Grade
- 30% assignments
- 20% project
- 50% finals
• Finals: oral exam (30 minutes) covering the contents of the whole lecture, the assignments and the project
• Finals will be scheduled at the end of the semester (Thu, 30/07/2015)
5. Recommended literature
• We will point to relevant papers during the course of the lecture
• Important overview papers:
§ Hastings et al. (2005) Quantitative Bioscience for the 21st Century. BioScience, Vol. 55, No. 6
§ Cohen JE (2004) Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better. PLoS Biol 2(12): e439
• Books
§ Free e-book: Data Management in Bioinformatics (http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics)
§ Lacroix, Z.; Critchlow, T. (eds.): Bioinformatics: Managing Scientific Data. Morgan Kaufmann Publishers, San Francisco 2003
§ Michael E. Wall: Quantitative Biology: From Molecular to Cellular Systems. 2012. Chapman & Hall
§ Pierre Bonnet: Enterprise Data Governance: Reference and Master Data Management Semantic Modeling. 2013. Wiley
• Web resources
§ http://www.ariadne.ac.uk: Ariadne, Web Magazine for Information Professionals
§ http://www.dama.org: The Global Data Management Community
§ H.D. Ehrich: http://www.ifis.cs.tu-bs.de/node/2855
6. Recommended Software
• These software tools/frameworks and web servers will be used during the problem sessions:
http://www.cisd.ethz.ch/software/openBIS
https://usegalaxy.org
https://vaadin.com/home
https://www.knime.org/
7. Contact and organization
• Questions concerning the lecture/assignments
§ dmqb-ss15@informatik.uni-tuebingen.de
• Website
§ abi.inf.uni-tuebingen.de/Teaching/ws-2013-14/CPM
• Christopher Mohr (Sand 14, C322), Andreas Friedrich (Sand 14, C304)
• Dr. Sven Nahnsen (Quantitative Biology Center, Auf der Morgenstelle 10, C2P43, please send e-mail first)
• Course material will be available on the website (see above), through social media channels and (if desired) as a hard copy during the lecture
facebook.com/qbic.tuebingen twitter.com/qbic_tue
8. Who am I
• More about me and our work can be found here:
www.qbic.uni-tuebingen.de
9. Contents of this lecture
Date | Lecturer | Lecture (8-10 AM)
Thursday 16 April 15 | Nahnsen | Introduction and overview
Thursday 23 April 15 | Nahnsen | Biological Data Management
Thursday 30 April 15 | Czemmel | Data sources ("Next-generation" technologies)
Dr. Stefan Czemmel
10. Contents of this lecture
Date | Lecturer | Lecture (Thursdays, 8-10 AM)
Thursday, 7 May 2015 | Codrea | Database systems (mySQL, noSQL, etc.)
Thursday, 14 May 2015 | | Ascension Day (Himmelfahrt)
Thursday, 21 May 2015 | Czemmel | LIMS and E-Lab books
Thursday, 28 May 2015 | Kenar | Experimental Design
Dr. Marius Codrea, Erhan Kenar
10
11. Contents of this lecture
Date | Lecturer | Lecture (Thursdays, 8-10 AM)
Thursday, 4 June 2015 | | Corpus Christi Day (Fronleichnam)
Thursday, 11 June 2015 | Nahnsen | Data analysis workflows (I)
Thursday, 18 June 2015 | Nahnsen | Data analysis workflows (II)
Thursday, 25 June 2015 | Nahnsen | Standardization
Thursday, 2 July 2015 | Nahnsen | Big Data
Thursday, 9 July 2015 | Nahnsen | Integrated data management (OpenBIS, OpenBEB)
Thursday, 16 July 2015 | Nahnsen | Applications
Thursday, 23 July 2015 | Nahnsen | Exam preparation
Thursday, 30 July 2015 | Nahnsen, Mohr, Friedrich | EXAMS
11
12. What is your background?
Ad hoc collection from the audience, Apr. 16, 2015
• Computer Science
• Bioinformatics (immunoinformatics; user front-ends; integration,
visualization)
• Biology
• Drug design
• Agricultural biology (plant breeding)
• Bioinformatics (Tx, NGS)
• Geoecology
• (ecology)
• Biochemistry; Molecular Biology
• Structural Biology
• Electronic business
12
13. Let us brainstorm
Ad hoc collection from the audience, Apr. 16, 2015
• What is data management?
- Rapid access to data
- Selective access to data; database queries (see the SQL sketch after this slide)
- Combine data; manipulate data efficiently
- Big data storage/analysis
- Curating quality
- Data visualization
- Make data interpretable
13
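To make the brainstormed point about selective access concrete, here is a minimal sketch of a database query using Python's built-in sqlite3 module; the table, columns and example records are invented for illustration and are not part of the lecture material.

```python
# Minimal sketch of "selective access to data" via a database query.
# Table and column names (samples, organism, rna_yield_ng) are made up.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE samples (sample_id TEXT, organism TEXT, rna_yield_ng REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("S1", "H. sapiens", 420.0), ("S2", "M. musculus", 95.5), ("S3", "H. sapiens", 310.2)],
)

# Selective access: fetch only the human samples with sufficient RNA yield
for sample_id, yield_ng in conn.execute(
    "SELECT sample_id, rna_yield_ng FROM samples "
    "WHERE organism = ? AND rna_yield_ng > ?", ("H. sapiens", 300)
):
    print(sample_id, yield_ng)

conn.close()
```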
14. Let us brainstorm
• What is data management?
http://zonese7en.com/wp-content/uploads/2014/04/Data-Management.jpg, accessed Apr 10,
2015, 11 AM 14
15. Data Management
• The official definition provided by DAMA (Data management
association) International, the professional organization for those
in the data management profession, is: "Data Resource
Management is the development and execution of
architectures, policies, practices and procedures that
properly manage the full data lifecycle needs of an enterprise.”
• Further, the DAMA Data Management Body of Knowledge
(DAMA-DMBOK) states: "Data management is the development,
execution and supervision of plans, policies, programs and
practices that control, protect, deliver and enhance the value of
data and information assets."
Wikipedia: http://en.wikipedia.org/wiki/Data_management accessed Mar 30, 2015, 10 PM
15
16. 10 Data Management functions according to the DAMA Data Management Body of Knowledge (DMBOK)
16
17. Data governance
• Strategy
• Organization and roles
• Policies and standards
• Projects and services
• Issues
• Valuation
Source: DAMA DMBOK Guide, p. 10
“Planning, supervision and control over data management and use”
http://meship.com
17
18. Data quality management
• Data cleansing
• Data integrity
• Data enrichment
• Data quality
• Data quality assurance
Source: DAMA DMBOK Guide, p. 10
“defining, monitoring and improving data quality”
http://www.arcplan.com/
18
19. Data architecture management
• Data architecture
• Data analysis
• Data design (modeling)
Source: DAMA DMBOK Guide, p. 10
datasourceconsulting.com
19
20. Data development
• Analysis
• Data modeling
• Database design
• Implementation
Source: DAMA DMBOK Guide, p. 10
dataone.org
20
“Data development is the process of building a data set for a specific purpose. The
process includes identifying what data are required and how feasible it is to obtain
the data. Data development includes developing or adopting data standards in
consultation with stakeholders to ensure uniform data collection and reporting, and
obtaining authoritative approval for the data set.”, A guide to data development,
Australian Institute of Health and Welfare Canberra, 2007
21. Database management
• Data maintenance
• Data administration
• Database management
system
Source: DAMA DMBOK Guide, p. 10
21
23. Reference and Master Data management
• External/internal codes
• Customer Data
• Product Data
• Dimension management (how different
dimensions (entities) relate to each other)
• Taxonomy/Ontology
Source: DAMA DMBOK Guide, p. 10
[Figure: master data and reference data management]
23
25. Data warehousing and business intelligence management
• Architecture
• Implementation
• Training and Support
• Monitoring and Tuning
Source: DAMA DMBOK Guide, p. 10
[Figure: data warehouse containing raw data, metadata, ..., summary data]
25
26. Data warehousing and business intelligence management
[Figure: input data flows into the data warehouse (raw data, metadata, ..., summary data); business intelligence tools generate reports from it]
26
27. Document, record and content management
• Acquisition and storage
• Backup and Recovery
• Content Management
• Retrieval
• Retention
Source: DAMA DMBOK Guide, p. 10
27
28. Metadata management
Metadata is data about data (see the toy example after this slide)
• Architecture
• Integration
• Control
• Delivery
Source: DAMA DMBOK Guide, p. 10
28
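As a toy illustration of "data about data", the snippet below builds a small metadata record for a hypothetical raw sequencing file and serializes it as JSON for delivery; every field name and value here is invented.

```python
# Toy example: metadata ("data about data") describing a hypothetical raw data file.
# All field names and values are invented for illustration.
import json
from datetime import date

raw_file = "run042_sample7.fastq.gz"        # the actual bulk data (not shown here)

metadata = {
    "file": raw_file,
    "experiment_type": "RNA-seq",
    "organism": "Homo sapiens",
    "instrument": "Illumina HiSeq 2500",
    "run_date": date(2015, 4, 16).isoformat(),
    "operator": "lab user",
    "processing": {"adapter_trimmed": True, "quality_filtered": True},
}

# Delivery: metadata is typically stored and exchanged separately from the bulk data
print(json.dumps(metadata, indent=2))
```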
29. DAMA – DMBOK
• A broad collection of all disciplines and subtopics that are
summarized under the umbrella of data management
• These concern many business-related issues, but many of the concepts
apply very well to the field of bioscience
• We will come back to various aspects of the DAMA DMBOK during
the course
29
30. Data management needs in science and research
• Survey at the University of Oregon, USA (Brian Westra. "Data Services for the Sciences: A
Needs Assessment". July 2010, Ariadne Issue 64 http://www.ariadne.ac.uk/issue64/westra/)
• Covers different scientific disciplines
30
31. Data management in science and research
Brian Westra. "Data Services for the Sciences: A Needs Assessment". July 2010, Ariadne Issue 64
http://www.ariadne.ac.uk/issue64/westra/, accessed Apr. 10, 2015, 11 AM
Survey categories:
1. Data storage and backup
2. Making scientific data findable by others
3. Connecting data acquisition to data storage
4. Allowing or controlling access to scientific data by others
5. Documenting and tracking updates
6. Data analysis and manipulation
7. Finding and accessing related data from others
8. Connecting data storage to data analysis
9. Linking this data to publications or other assets
10. Ensuring data is secure and trustworthy
11. Others
31
32. Let us brainstorm
• What is Quantitative Biology?
Ad hoc collection from the audience, Apr. 16, 2015
- Not only yes/no answers, but attaching quantities to entities
- Huge amounts of data
- Quantitative methods to study biology
- System-wide analysis; specific pathways
- Make results human readable and accessible
32
33. Quantitative Biology
• The term quantitative biology was coined by Hastings et al.
(2005).
• High-throughput methods have led to a paradigm shift in
biomedical research
• Traditionally, the focus was on one-molecule-at-a-time for most
bio(medical) research projects
• Now, data on whole genomes, exomes, epigenomes,
transcriptomes, proteomes and metabolomes can be generated at
low cost.
• The term quantitative biology is used to describe this paradigm
shift. Improvements in this area have been driven mainly by two
technological developments:
Hastings et al., 2005, Quantitative Bioscience for the 21st century. BioScience. Vol 55 No. 6
33
34. Technological innovations
• State-of-the-art mass spectrometers coupled to high-
performance liquid chromatography through soft ionization
techniques (HPLC-ESI-MS) have quickly changed the way we do
proteomics, metabolomics, and lipidomics.
• Next-generation sequencing has similarly changed the way we
look at genomes, epigenomes, transcriptomes, and metagenomes.
Due to advances in chemistry and imaging, sequencing reactions
have been parallelized on a very large scale. The
comprehensiveness of the data produced by high-throughput
methods makes them particularly interesting as general-purpose
analytical and diagnostic techniques.
34
35. Technological innovations
• Imaging technologies can now produce high-resolution pictures
of fine-grained cellular details at a very high speed
• Finally, methods from bioinformatics and computational biology
have matured to rapidly analyze the huge raw data sets that are
generated by these high-throughput technologies
35
36. Contents from Bioinformatics II (high-throughput technologies)
• Most of the high throughput technologies have been introduced
during the Bioinformatics II lecture
• There are specialized lectures on “Transcriptomics” and on
“Computational Proteomics and Metabolomics”
• We will give a short recap of the Bioinformatics II contents that
are relevant for this lecture
• More advanced topics on data generation methods will be
introduced in lecture 3 by Dr. Stefan Czemmel (focus on next
generation sequencing)
36
37. Origin of the “Central Dogma of Molecular Biology” (Francis Crick, 1956)
The central dogma of molecular biology
• First articulation by Francis Crick in 1956
• Published in Nature in 1970
37
38. The central dogma – classical view
• In general, the classic view reflects how biology is (biological data
are) organized
• Genomics, however, enabled a more complex view
Cox Systems Biology Lab | Research, University of Toronto, Canada
38
40. Recap Bioinformatics II: Systems biology
• Quantitative data on various levels of biological complexity form
the foundation of systems biology
• Mathematical modeling has been based on gene expression
• Recent important technological improvements allow the analysis of protein
and metabolite profiles to a great depth
• Important layers for understanding biology
• New experimental techniques offer tremendous challenges for
computational analysis
40
41. Recap Bioinformatics II: Aims of Systems Biology
• Describe large-scale organization
• Quantitative modeling
• Describe cell as system of networks
- Fundamental research: time-resolved quantitative
understanding of living systems
- Medicine: enable personalized medicine (e.g., improve
treatment strategies for cancer patients)
- Biotechnology: improve production, degradation, construction
of synthetic organisms, etc.
41
42. Exp. Methods – Transcriptomics
• Extract and amplify RNA
• Hybridization on microarray
• Identify and quantify by fluorescence signal
• Sequences can be mapped back to genome
Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803
42
43. Microarray Data Analysis
• Key problems in microarray data analysis are
(a minimal sketch of two of them follows this slide):
- Data normalization
- Clustering
- Dimension reduction
- Diagnostics/classification
- Network inference
- Visualization of results
Janko Dietzsch, Nils Gehlenborg and Kay Nieselt. Mayday: a microarray
data analysis workbench. Bioinformatics 2006 22(8):1010-1012
43
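Below is a minimal sketch of two of the listed steps (normalization and clustering) on a toy expression matrix. The data are random numbers, and the specific choices (log2 transform, median centering, average-linkage correlation clustering) are just one simple option, not the method used in Mayday.

```python
# Minimal sketch: simple normalization and clustering of a toy expression matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
intensities = rng.lognormal(mean=5, sigma=1, size=(100, 6))   # 100 genes x 6 arrays

# Normalization: log2 transform, then center each array on its median
log_expr = np.log2(intensities)
normalized = log_expr - np.median(log_expr, axis=0)

# Clustering: group genes by similarity of their expression profiles
tree = linkage(normalized, method="average", metric="correlation")
gene_clusters = fcluster(tree, t=5, criterion="maxclust")
print("genes per cluster:", np.bincount(gene_clusters)[1:])
```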
45. Genome sequencing
• 2001: initial publication
• 2003: 2nd draft “Human Genome”
• > 13 years of work and > 3*10^9 $
• 2010: 8 days, ~1*10^4 $
• Today: approximately 5.5 days and < 1*10^4 $
• Future: within 3 years the biotech company Pacific Biosciences
expects a similar amount of data in < 15 min for < 1*10^3 $
45
46. Status genomics/transcriptomics
• Dramatic drop in cost for genome sequencing
• Number of sequenced genomes grows continuously
• Genome is a very static snapshot of living system
• Biological adaptation is rather slow; the genome is long-term information storage
• Proteins and their reaction products, the metabolites, are much closer
to the current state of the cell
• Genome and transcriptome databases are essential bases for
proteomics and metabolomics research
46
47. Genomics vs. Proteomics
Genomics | Proteomics
Genomes are rather static | Proteomes are dynamic (age, tissue, breakfast, ...)
~20 k genes | up to 1000 k proteins
Established technology (capillary sequencer) | Emerging technologies (MS, HPLC/MS, protein chips)
47
59. Large-scale study data – 1000 Genomes
• Sample lists and sequencing progress
• Variant Calls
• Alignments
• Raw sequence files
http://www.1000genomes.org/data
59
60. Large-scale study data – The cancer genome atlas (TCGA)
• TCGA aims to help diagnose, treat and prevent cancer
• Explores the entire spectrum of genomic changes involved in more
than 20 types of human cancer
• Approx. 2 PB of genomic raw data
http://cancergenome.nih.gov
60
61. Laboratory information management systems/
Electronic Lab Books
• How do we track all the information that is generated in the laboratory?
• Automated annotation of all experimental parameters is essential
for reproducible science
• Currently, most experiments are documented manually in paper lab
notebooks
• Data security (intellectual property versus open data)
61
62. Experimental design
• Biological experiments are very complex
• Statistical significance requires a high number of biological
replicates
• Often many different conditions and time points need to be
considered
• One study can involve many different experiments (multi-omics
studies involve different omics layers, e.g. genomics +
transcriptomics + proteomics)
• All experiments come with different meta data requirements
• For various reasons the experimental design is not always
balanced (e.g. 5 samples in group A and only 3 samples available
for group B); see the sketch after this slide
Friedrich, A., et al. Biomed Research International, April 2015 – in press.
Nahnsen, S., Drug Target, May 2015 – in press. 62
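A small sketch of how an unbalanced design like the 5-vs-3 example above can still be tested: Welch's t-test does not require equal group sizes or equal variances. The numbers are toy values, not data from the cited studies.

```python
# Sketch: testing for a group difference with the unbalanced design from the slide
# (5 samples in group A, 3 in group B). Values are toy numbers, not real measurements.
from scipy import stats

group_a = [10.1, 9.8, 10.4, 10.0, 9.9]   # n = 5
group_b = [11.2, 11.0, 11.5]             # n = 3

# Welch's t-test does not assume equal group sizes or equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With so few replicates the statistical power is low, which is exactly why the slide stresses a high number of biological replicates.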
63. Experimental design
Friedrich, A., et al. Biomed Research International, April 2015 – in press.
Nahnsen, S., Drug Target, May 2015 – in press. 63
64. Data analysis workflows
• Chain different (heterogeneous) tools (a minimal sketch follows this slide)
• Parameter handling
• Execution in high-performance computing environments made easy
64
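A minimal sketch of the workflow idea: two command-line steps chained with explicit parameter handling. The tool names and flags ("aligner", "quantifier") are placeholders, not real programs, so the script only runs end to end once real tools are substituted.

```python
# Minimal sketch of a two-step workflow: chain heterogeneous command-line tools
# with explicit parameter handling. "aligner" and "quantifier" are placeholder
# tool names, not real programs; substitute real tools to run this end to end.
import subprocess

params = {"threads": 4, "reference": "genome.fa", "reads": "sample.fastq"}

steps = [
    ["aligner", "--threads", str(params["threads"]),
     "--ref", params["reference"], "--in", params["reads"], "--out", "aligned.bam"],
    ["quantifier", "--in", "aligned.bam", "--out", "counts.tsv"],
]

for cmd in steps:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)   # stop the workflow if a step fails
```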
65. Standardization in bioinformatics
• Many world-wide bioinformatics initiatives need to rely on open
standards
• Development of standards has to be a community effort
• Standardized data formats are important to guarantee
- Sustainability
- Independence of instrument vendors
- Independence of analysis software
- Exchangeability of raw data
• Standard formats increase the amount of data by a factor of roughly
2-4 (a toy illustration follows this slide)
• Many people refrain from using open standards
65
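A toy illustration of one source of that overhead: open standards such as mzML re-encode binary measurement values as base64 text inside XML. The example compares sizes for a made-up array; real standard files add further metadata and indexing on top of this.

```python
# Toy illustration of why open, text-based standard formats are larger than
# vendor binary files: the same numbers re-encoded as base64 inside an XML
# element (the approach used e.g. by mzML) take noticeably more space.
import base64
import struct

values = [float(i) * 0.1 for i in range(10000)]         # made-up measurement values

binary = struct.pack(f"<{len(values)}d", *values)        # raw 64-bit floats
b64 = base64.b64encode(binary).decode("ascii")           # base64 text encoding
xml = f'<binaryDataArray encodedLength="{len(b64)}">{b64}</binaryDataArray>'

print("binary size :", len(binary), "bytes")
print("XML size    :", len(xml.encode("ascii")), "bytes")
print("inflation   :", round(len(xml) / len(binary), 2), "x")
```

On this toy array the inflation is only about 1.35x; the larger factors quoted on the slide come from the additional XML metadata, indexing and redundant representations in real standard files.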
66. Big data
Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database management
tools or traditional data processing applications. The challenges include
capture, curation, storage, search, sharing, transfer, analysis and
visualization. The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of related data,
as compared to separate smaller sets with the same total amount of data,
allowing correlations to be found ...
http://en.wikipedia.org/wiki/Big_Data, accessed Apr 24, 2014
66
67. Big data examples
• European Council for Nuclear Research (CERN), Geneva,
Switzerland
• 25 petabytes/year at the LHC (Large Hadron Collider) (~6.2 million DVDs)
[Figure: CERN LHC data volume]
ep.ph.bham.ac.uk, 2014
67
68. Big data examples
• Google processes 9.1 exabytes/year (~300 million DVDs)
[Figure: Google data volume]
Mayer-Schönberger, 2013; ititch.com, 2014
68
69. Biology and Big data?
• Classically: observation of nature and its phenomena
• 1950s: breakthrough in molecular biology
• In 1956, Francis Crick formulated the central dogma of molecular biology:
DNA (carrier of the hereditary information) -> RNA (expression of specific
genes) -> proteins (carry out the required functions in the cell)
69
70. Big data
Vivien Marx, Biology: The big challenges of big data, Nature. 2013, doi:10.1038/498255a
70
71. Integrated data management in biology/biomedicine
71
http://media.americanlaboratory.com/m/20/Article/35231-fig1.jpg
73. Data movers
[Figure: data flow from the NGS lab and other labs via the Data Mover to central storage]
• Automatically moves large, file-based data sets to a remote
(central) storage
• Uses an rsync routine; easy configuration using a config file
(a minimal sketch of the idea follows this slide)
• Data mover authentication: public/private key ssh authentication
• Moves data to openBIS dropboxes (individual dropboxes and users for
each of the five member labs)
DataMover:
• Developed at ETH Zurich as part of openBIS
• http://www.cisd.ethz.ch/software/Data_Mover
73
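A minimal sketch of the data-mover idea: watch a local directory and push finished runs to a remote dropbox with rsync over ssh (key-based authentication is handled by ssh itself). The host, paths and configuration keys are invented, and this is not the actual ETH DataMover implementation.

```python
# Sketch of the data-mover idea: push finished run folders to a remote dropbox
# with rsync over ssh. Host name, paths and config keys are invented;
# this is NOT the ETH DataMover code.
import subprocess
from pathlib import Path

config = {
    "local_dir": "/data/ngs/finished_runs",
    "remote": "datamover@storage.example.org:/dropboxes/lab_a/",
    "ssh_key": "~/.ssh/id_datamover",
}

local_dir = Path(config["local_dir"])
for run in sorted(p for p in local_dir.iterdir() if p.is_dir()):
    cmd = [
        "rsync", "-a", "--partial",
        "-e", f"ssh -i {config['ssh_key']}",
        str(run), config["remote"],
    ]
    print("transferring", run.name)
    subprocess.run(cmd, check=True)   # abort on transfer errors
```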
74. openBIS (meta) data store
• Open, distributed system for managing biological
information
• Captures different experiment types (omics,
imaging, screening, ...)
• Tracking, annotating and sharing of experiments,
samples and datasets for distributed research
• Different servers for metadata and bulk raw
data
• Underlying PostgreSQL database
• ETL routines for extraction of metadata and
linking (a minimal sketch follows this slide)
74
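A minimal sketch of the ETL idea: extract a few metadata fields from incoming file names and register them in a relational database, keeping a link to the bulk raw data. openBIS uses PostgreSQL and its own (typically Jython-based) dropbox scripts; sqlite and the naming scheme below are stand-ins so the sketch runs without a server.

```python
# Sketch of the ETL idea behind a (meta)data store: extract metadata from an
# incoming file name, register it in a relational database, and keep a link to
# the bulk raw data. The naming scheme and table layout are invented.
import re
import sqlite3

incoming = ["QABCD001AE_liver_rnaseq.fastq.gz", "QABCD002BF_kidney_rnaseq.fastq.gz"]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE datasets (
                 sample_code TEXT, tissue TEXT, assay TEXT, raw_path TEXT)""")

pattern = re.compile(r"(?P<code>Q\w{9})_(?P<tissue>\w+)_(?P<assay>\w+)\.fastq\.gz")

for name in incoming:
    m = pattern.match(name)
    if m is None:
        continue                      # a real ETL routine would report the failure
    db.execute("INSERT INTO datasets VALUES (?, ?, ?, ?)",
               (m["code"], m["tissue"], m["assay"], f"/raw/{name}"))

for row in db.execute("SELECT * FROM datasets"):
    print(row)
```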
78. Contact:
Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany
dmqb-ss15@informatik.uni-tuebingen.de
Thanks for listening – See you next week