My M.Sc. dissertation defense. The title is "Algorithm Implementation of Genetic Association Analysis for Rheumatoid Arthritis Data Based on Haplotype Blocks"
This document provides an overview of genome editing techniques such as CRISPR/Cas9 and rAAV and considerations for their use. It discusses how CRISPR/Cas9 and rAAV work to edit genomes and compares their advantages. Key factors for CRISPR gene editing are discussed such as gRNA design, donor design, and screening/validation approaches. The document also summarizes research optimizing CRISPR gene editing through improvements like testing different donor lengths and modifications. The goal is to translate genetic information into personalized medicines by leveraging tools like CRISPR and rAAV.
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic... (Elia Brodsky)
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
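The unsupervised steps in the outline above (PCA, then clustering) can be sketched in a few lines. This is a toy illustration, not workshop material: the expression matrix is simulated rather than one of the public datasets below, and a sign-of-PC1 split stands in for a real clustering algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 1000))   # 20 samples x 1000 genes (simulated)
expr[10:, :50] += 5.0                # second group over-expresses 50 genes

# PCA via SVD of the mean-centred matrix
centred = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
pcs = u[:, :2] * s[:2]               # sample coordinates on PC1 and PC2

# Crude two-group clustering: split samples on the sign of PC1
labels = (pcs[:, 0] > 0).astype(int)
print(pcs.shape)                     # (20, 2)
```

With a shift this strong, PC1 captures the group difference and the sign split recovers the two sample groups.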
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
Critical Reading Biomedical Research Papers-2022.pptx (MingdergLai)
1. The study investigates whether the ATAC complex, which contains the histone acetyltransferase Gcn5, regulates mitotic progression.
2. Experiments using siRNA to knock down subunits of ATAC and SAGA complexes in NIH-3T3 cells show that ATAC knockdown, but not SAGA knockdown, leads to mitotic defects including delayed or asymmetric cell divisions.
3. Further experiments localize ATAC subunits to mitotic cells and show that the ATAC complex remains intact during mitosis.
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden... (Thermo Fisher Scientific)
Presented by Jennifer D. Churchill, PhD, during a special Lunch and Learn session at the American Academy of Forensic Sciences (AAFS) 67th annual conference, February 2015. / Conclusions
• Robust panels of identity and ancestry SNPs
• Robust STR panel
• Whole genome mtDNA sequencing
• Highly informative
• Sensitive
• Quantitative – scaling comparison
• Low density chip is not necessarily a bad chip
• Wide range of density can still yield high quality data
• Based on these results, continue development and validation
Partitioning Heritability using GWAS Summary Statistics with LD Score Regression (bbuliksullivan)
1) The document describes a new method for partitioning heritability of complex traits using summary statistics from large GWAS. It uses LD Score Regression to estimate the proportion of heritability associated with different functional annotations of the genome.
2) The method was validated in simulations, where it accurately estimated null and enriched heritability proportions.
3) The method was applied to real GWAS data for 10 complex traits, finding many functional elements enriched including conserved regions, enhancers, and cell-type specific H3K27ac regions, providing new insights into genetic architecture and disease biology.
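The core regression behind this summary can be illustrated on simulated data. This is a toy single-annotation version, not the authors' ldsc software: under a polygenic model, E[chi^2_j] = (N*h2/M)*l_j + 1, so regressing per-SNP chi-square statistics on LD scores recovers the heritability h2.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, h2 = 10_000, 50_000, 0.3          # SNPs, GWAS sample size, true h2
ld = rng.uniform(1, 100, size=M)        # simulated per-SNP LD scores l_j

# Simulate chi-square statistics around the LDSC expectation
expected = N * h2 / M * ld + 1
chi2 = rng.chisquare(df=1, size=M) * expected   # crude noise model

# Ordinary least squares: chi2 ~ a + b*ld, then h2_hat = b * M / N
A = np.column_stack([np.ones(M), ld])
coef, *_ = np.linalg.lstsq(A, chi2, rcond=None)
h2_hat = coef[1] * M / N
print(round(h2_hat, 2))
```

The real method adds regression weights, block-jackknife standard errors, and per-annotation LD scores to partition h2 across functional categories.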
Single Nucleotide Polymorphism Analysis
Predictive Analytics and Data Science Conference May 27-28
Asst. Prof. Vitara Pungpapong, Ph.D.
Department of Statistics
Faculty of Commerce and Accountancy
Chulalongkorn University
T-BioInfo is a platform for processing, analyzing, and integrating multi-omics data. It is used by multiple research groups to extract meaningful insights from large multi-omics datasets. The platform is expanding its educational capabilities to enable more people to extract meaningful, data-driven insights from omics datasets with biomedical applications. The document provides links to learn more about the platform's research and educational features.
DNA microarrays, also known as DNA chips, allow simultaneous measurement of gene expression levels for every gene in a genome. They detect mRNA levels by hybridizing cDNA to arrays of gene probes spotted on glass slides or other surfaces. Differences in gene expression between cell types or conditions can be measured and analyzed to answer biological questions.
The document discusses the Genome in a Bottle Consortium (GIAB) which aims to provide reference materials and data for benchmarking and assessing sequencing technologies and bioinformatics pipelines. The GIAB analyzed multiple sequencing datasets for the NA12878 genome and established a high confidence call set for variants through integration. Quality assessment found the call set to have near 100% sensitivity and specificity compared to other datasets in high confidence regions. The NA12878 data serves as an important reference for validation studies.
Introduction and key considerations around gene-editing using CRISPR and rAAV.
With an overview of our knock-out library using the haploid cell line HAP1
This document discusses copy number variation analysis and qBiomarker Copy Number PCR Arrays. It begins with defining copy number variation and describing current methods to analyze copy number, including array CGH, SNP chips, NGS, qPCR and FISH. It then discusses issues with using single gene references and introduces the concept of a multicopy reference assay as a better reference. The remainder focuses on qBiomarker Copy Number PCR Arrays, which allow profiling copy number variation across curated gene sets or custom arrays. The arrays utilize a multicopy reference assay and are compatible with most qPCR instruments. Data analysis is performed using an online portal.
Advances and Applications Enabled by Single Cell Technology (QIAGEN)
Over the past 5 years, single-cell genomics has become a powerful technology for studying small samples and rare cells, and for dissecting complex populations such as heterogeneous tumors. Single-cell technology is enabling many new insights into diverse research areas from oncology, immunology and microbiology to neuroscience, stem cell and developmental biology. This webinar introduces single-cell technology and summarizes the newest scientific applications in various research areas, all in the context of current literature.
Presentation by Justin Zook at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on benchmarks for indels and structural variants.
Course: Bioinformatics for Biomedical Research (2014).
Session: 3.2- Basic Aspects of Microarray Technology and Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
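A minimal sketch of that genotype-to-phenotype idea, on simulated data rather than the 1 to 2 million-SNP profiles the abstract uses: rank SNPs by association with the trait and classify from the top-ranked ones, with no domain knowledge. The effect sizes and feature count here are illustrative choices, not the study's.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)   # minor-allele counts
liability = 1.5 * X[:, :5].sum(axis=1) + rng.normal(size=n)  # 5 causal SNPs
y = (liability > np.median(liability)).astype(int)  # balanced binary trait

# Domain-knowledge-free pipeline: rank SNPs by |correlation| with the
# trait, keep the top 10, and threshold their summed allele counts
scores = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]))
top = np.argsort(scores)[-10:]
risk = X[:, top].sum(axis=1)
pred = (risk > np.median(risk)).astype(int)

acc = (pred == y).mean()
print(round(acc, 2))
```

The feature-selection step reliably recovers the causal SNPs, so even this crude classifier beats chance comfortably; the study's 87% eye-color accuracy comes from a more elaborate machine-learning pipeline on real genomes.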
The document discusses various applications and techniques of DNA microarrays, including summarizing key points about Affymetrix GeneChips, spotted microarrays, experimental design, data analysis, and several case studies on various topics like ovarian cancer, Sjogren's syndrome, wine yeast genomics, and norovirus genotyping. Microarrays allow analysis of gene expression patterns and copy number variations across genomes through comparative hybridization experiments. The document provides an overview of microarray technology and applications in genomic and biomedical research.
Gene Expression - Microarrays discusses analyzing gene expression data from microarray experiments. It describes the basic workflow including experimental design, sample preparation, hybridization, image analysis, preprocessing, normalization, and statistical analysis. Key points are that microarrays allow measuring expression of thousands of genes simultaneously, and proper experimental design and data analysis are important to draw meaningful biological conclusions from microarray data.
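The normalization step in that workflow is often quantile normalization, which forces every array (column) to share the same empirical distribution. A minimal numpy sketch on a toy genes-by-arrays matrix; ties are broken by sort order here, a simplification over implementations that average tied ranks.

```python
import numpy as np

data = np.array([[5., 4., 3.],
                 [2., 1., 4.],
                 [3., 4., 6.],
                 [4., 2., 8.]])   # rows: genes, columns: arrays (toy values)

ranks = data.argsort(axis=0).argsort(axis=0)     # rank of each value per array
row_means = np.sort(data, axis=0).mean(axis=1)   # mean value at each rank
normalized = row_means[ranks]                    # substitute rank means back
print(normalized)
```

After this step every column is a permutation of the same four values, so all arrays have identical distributions and between-array technical differences in scale are removed.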
Genome in a Bottle is working to characterize difficult variants in human genomes to enable benchmarking of sequencing technologies and bioinformatics methods. They have extensively characterized five human genomes and are now focusing on large insertions, deletions, and structural variants over 20 base pairs. This work presents many challenges due to limitations in detection and representation of large variants. Genome in a Bottle is integrating calls from multiple technologies and approaches to refine sequence-resolved variants and provide benchmark variant call files.
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923 (GenomeInABottle)
The Genome in a Bottle Consortium has used accurate long reads to characterize variants in difficult genomic regions for 7 human genomes. Long and linked reads improved the small variant benchmark by expanding reference coverage and the number of called variants. Accurate long reads were also essential for generating benchmarks for medically relevant genes and for improving benchmarks on chromosomes X and Y. Ongoing work includes developing RNA sequencing benchmarks from long reads and generating the first tumor/normal cell line benchmark.
This document summarizes Christopher Mason's presentation on epigenetics quality control and single-cell RNA-seq variant calling using samples from the Genome in a Bottle project. It discusses generating reference epigenetics datasets, including whole genome bisulfite sequencing data, Illumina 450K methylation array data, and targeted bisulfite sequencing data for several GIAB samples. Parameters for variant calling from single-cell RNA-seq data are evaluated, finding best sensitivity and specificity at 97% and 80% respectively using certain settings. The work aims to establish high quality epigenetics and variant calling references to help benchmark computational methods for personalized medicine.
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ... (Candy Smellie)
Information is no longer a bottleneck; the emphasis is shifting to 'what does it all mean?'
In a translational context, we hope that by answering that question we will be able to characterise the genetics that drive disease, and to develop drugs and diagnostics that are personalised to patients.
Genome editing provides the link between that information and that outcome, by allowing scientists to recapitulate specific genetic alterations in any gene in any living tissue to probe function, develop disease models and identify therapeutic strategies. So, not only do we now have unparalleled access to genetic information, but we also have the tools to accurately understand what this genetic information means, with genome editing allowing us to explore the genetic drivers of disease in physiological models.
AAV is a single-stranded, linear DNA virus with a 4.7 kb genome which, for the purpose of genome editing, is replaced almost in its entirety with the targeting vector sequence (except for the ITRs).
It is, in effect, a highly efficient DNA delivery mechanism.
After entry of the vector into the cell, the target-specific homologous DNA is believed to activate and recruit HR-dependent repair factors, and can induce HR at rates approximately 1,000 times greater than plasmid-based double-stranded DNA vectors, though the mechanism by which it achieves this is still largely unknown.
By including a selection cassette, one can select for cells that have integrated the targeting vector, and then screen for clones which have undergone targeted insertion rather than random integration; the targeted fraction will generally be around 1%.
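The ~1% targeting rate quoted above implies a concrete screening burden. A back-of-envelope calculation (the 1% figure is from the slides; the 95% confidence target is our illustrative choice):

```python
import math

p_targeted = 0.01   # on-target fraction among integrants (from the slides)
confidence = 0.95   # illustrative confidence target, not from the slides

# P(at least one targeted clone among n) = 1 - (1 - p)^n >= confidence
n_clones = math.ceil(math.log(1 - confidence) / math.log(1 - p_targeted))
print(n_clones)   # 299
```

So at a 1% rate, roughly 300 drug-resistant clones must be screened to be 95% sure of recovering at least one correctly targeted clone.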
Here are the steps to prove the claim:
1) By Lagrange's theorem, the order of any subgroup H of G divides the order of G.
2) Since the order of G is p^n, the only possible orders for subgroups are 1, p, p^2, ..., p^n.
3) Each factor group in a composition series for G is then a simple group of prime-power order; such a group has a nontrivial centre and hence is abelian (cyclic of order p), so G is solvable.
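The jump from prime-power subgroup orders to solvability can be written as a chain of standard implications (a sketch, using the well-known fact that every nontrivial p-group has a nontrivial centre):

```latex
\begin{align*}
|G| = p^n
  &\Rightarrow \text{each composition factor } H_{i+1}/H_i
      \text{ is simple of order } p^k,\ k \ge 1 \\
  &\Rightarrow Z(H_{i+1}/H_i) \neq 1 \text{ is normal, so }
      Z(H_{i+1}/H_i) = H_{i+1}/H_i \\
  &\Rightarrow H_{i+1}/H_i \text{ is abelian, hence }
      H_{i+1}/H_i \cong \mathbb{Z}/p\mathbb{Z} \\
  &\Rightarrow G \text{ is solvable (all composition factors abelian).}
\end{align*}
```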
Microarray-based comparative genomic hybridisation (Dr. Yogesh D)
This is a brief introduction to the technique and principle of Array Comparative Genomic Hybridization. Array CGH is a powerful tool for genetic testing and has been enormously useful in cancer cytogenetics, prenatal genetic testing etc.
Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
This document summarizes the Genome in a Bottle (GIAB) Consortium's efforts to characterize structural variants in human genomes to serve as benchmarks. The GIAB Consortium has generated structural variant calls for 7 human genomes using diverse data types and analysis methods. The document describes the GIAB Consortium's process for integrating these data to identify high-confidence structural variant calls to include in version 0.6 of the structural variant benchmark set. It provides examples of different types of structural variants characterized and evaluates the trustworthiness of the benchmark calls based on independent validation. The document also discusses ongoing efforts to further improve structural variant characterization using emerging long-read technologies.
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). It defines epistasis as the detection of causal SNPs for a disease through their interactions, rather than their individual effects. It outlines the problem of epistasis detection as analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status. Popular measures discussed are chi-squared and mutual information statistics. The document reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler. It notes the challenges of reducing computational burden and detecting higher-order epistatic interactions.
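The exhaustive pairwise search that review describes can be sketched directly: cross-tabulate each SNP pair's 3x3 joint genotypes against case/control status and score the pair with a chi-square statistic. The data here are simulated with a hypothetical interaction effect; the named tools (MDR, SNPHarvester, SNPRuler) add heuristics and pruning to tame the O(p^2) cost on real datasets.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 600, 20
G = rng.integers(0, 3, size=(n, p))          # genotypes coded 0/1/2
# Simulated interaction: elevated disease risk only when SNPs 0 AND 1
# both carry a minor allele
risk = (G[:, 0] > 0) & (G[:, 1] > 0)
y = (rng.random(n) < np.where(risk, 0.8, 0.2)).astype(int)

def chi2_pair(a, b, y):
    """Chi-square of the 3x3 joint-genotype table against binary status."""
    stat = 0.0
    for ga in range(3):
        for gb in range(3):
            cell = (a == ga) & (b == gb)
            m = cell.sum()
            if m == 0:
                continue
            for cls in (0, 1):
                observed = ((y == cls) & cell).sum()
                expected = m * (y == cls).mean()
                stat += (observed - expected) ** 2 / expected
    return stat

scores = {(i, j): chi2_pair(G[:, i], G[:, j], y)
          for i, j in itertools.combinations(range(p), 2)}
best = max(scores, key=scores.get)
print(best)   # the simulated causal pair (0, 1)
```

Even at p = 20 this evaluates 190 pairs; at GWAS scale (p in the hundreds of thousands) the quadratic blow-up is exactly the computational burden the document highlights.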
This document provides an introduction to embedded computer architecture. It defines embedded computing systems as devices that include programmable computers but are not general-purpose. Examples include cell phones, printers, vehicles, and appliances. Characteristics of embedded systems include sophisticated functionality, real-time operation, low cost, low power usage, and design by small teams. The document discusses microprocessors, memory, instruction sets, and programming models used in embedded systems. It also covers topics like digital signal processors, endianness, assembly language, and bus-based computer architectures.
This document discusses single nucleotide polymorphism (SNP) data analysis. It defines key terms like SNPs, genotypes, haplotypes, and linkage disequilibrium. It describes techniques for SNP genotyping like PCR and challenges like phasing ungenotyped SNPs and inferring haplotypes. International projects like HapMap are referenced that aim to construct a haplotype map of the human genome to reveal patterns of genetic variation.
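The linkage disequilibrium concept defined there reduces to two quantities on phased haplotypes: D = p_AB - p_A * p_B and the normalised r^2 statistic. A toy computation (the eight haplotypes below are made up for illustration):

```python
import numpy as np

# Each row is a phased haplotype; columns are two biallelic SNPs
# (0 = major allele, 1 = minor allele)
hap = np.array([[0, 0], [0, 0], [0, 0], [1, 1],
                [1, 1], [1, 1], [0, 1], [1, 0]])

p_a = hap[:, 0].mean()                 # minor-allele frequency at SNP A
p_b = hap[:, 1].mean()                 # minor-allele frequency at SNP B
p_ab = (hap[:, 0] & hap[:, 1]).mean()  # frequency of the 1-1 haplotype

D = p_ab - p_a * p_b                   # deviation from independence
r2 = D ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
print(round(D, 4), round(r2, 4))       # 0.125 0.25
```

Note that D and r^2 require phased haplotypes (or inferred phase); estimating them from unphased genotypes is precisely the phasing/haplotype-inference challenge the document describes.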
This document summarizes a study that used the BigLD algorithm to partition haplotype blocks in chromosome 21 of the NARAC genomic dataset. The researchers:
1) Applied the BigLD algorithm and three other methods (FGT, CIT, SSLD) to detect haplotype blocks in a portion of chromosome 21.
2) Analyzed and compared the blocks detected by each method based on parameters like block size, number of blocks, and genomic coverage.
3) Found that BigLD produced the fewest and largest blocks, indicating more robust partitioning compared to the other methods.
The document outlines the steps taken to prepare genetic data for analysis in R, including reading in the data, removing underscores, imputing missing values, recoding alleles, and measuring linkage disequilibrium (LD). Key steps include converting the raw data to a matrix without underscores, imputing missing values using codeGeno, and implementing BigLD to partition SNPs into LD blocks and generate a heatmap of LD on chromosome 21. The entire process is timed, taking approximately 11 minutes to complete.
3. Minia University
Faculty of Engineering
Biomedical Engineering Department
Fatma Sayed Ibrahim
Master of Science thesis defense
Wednesday, January 27, 2021
Algorithm Implementation of Genetic Association Analysis
for Rheumatoid Arthritis Data Based on Haplotype Blocks
5. Supervisors
Prof. Dr. Hesham Fathy A. Hamed
Former Dean of the Faculty of Engineering, Minia University
Professor at the Egyptian-Russian University
Dr. Ashraf Mahroos Said
Associate Professor
Biomedical Engineering Department, Minia University
Dr. Mohamed Nagy Saad
Assistant Professor
Biomedical Engineering Department, Minia University
6. Thesis committee members
Dr. Muhammad Ali M. Rushdi
Biomedical Engineering and Systems Department
Faculty of Engineering, Cairo University
Dr. Essam Halim Houssein
Vice-Dean for Postgraduate Studies and Research Affairs
Faculty of Computers and Information, Minia University
Prof. Dr. Hesham Fathy A. Hamed
Former Dean of the Faculty of Engineering, Minia University
Professor at the Egyptian-Russian University
Dr. Ashraf Mahroos Said
Associate Professor
Biomedical Engineering Department, Minia University
7. Outline
1. Introduction
2. Literature review
3. Data description
4. Pre-processing
5. Methods
6. Results
7. Conclusion
20. The minor allele frequency (MAF)
...ATGTCACACGTACTT...
...ATGTCACACGTACTT...
...ATGACACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGACACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGACACACGTACTT...
...ATGACACAGGTACTT...
(SNP1 is the T/A site, SNP2 the C/G site)

                     SNP1   SNP2
Allele 1             T      C
Allele 2             A      G
Allele 1 count       6      3
Allele 2 count       4      7
Allele 1 frequency   60%    30%
Allele 2 frequency   40%    70%
Major allele         T      G
Minor allele         A      C
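The allele counts on this slide can be reproduced programmatically. A minimal Python sketch (the thesis pipeline itself works in R; the sequences and SNP positions below are taken from the slide):

```python
from collections import Counter

# The ten phased sequences from the slide; within this 15-bp window,
# SNP1 is the T/A site at index 3 and SNP2 the C/G site at index 8.
seqs = [
    "ATGTCACACGTACTT", "ATGTCACACGTACTT", "ATGACACAGGTACTT",
    "ATGTCACAGGTACTT", "ATGTCACAGGTACTT", "ATGACACAGGTACTT",
    "ATGTCACAGGTACTT", "ATGTCACAGGTACTT", "ATGACACACGTACTT",
    "ATGACACAGGTACTT",
]

def maf(alleles):
    """Frequency of the least common (minor) allele at one site."""
    counts = Counter(alleles)
    return min(counts.values()) / len(alleles)

snp1 = [s[3] for s in seqs]  # T appears 6 times, A appears 4 times
snp2 = [s[8] for s in seqs]  # C appears 3 times, G appears 7 times
print(maf(snp1), maf(snp2))  # 0.4 0.3
```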
46. Why this point?
• Genetic variations influence our predisposition to disease; every disease
has a genetic component, even infectious diseases.
• Complex diseases are very common in societies. In particular, chronic
conditions hugely affect a person's productivity and quality of life.
• Haplotype blocks are much more effective and powerful in such cases.
46
Motivations
Introduction Literature review Data description pre-processing Methods Results Conclusion
47. Why this point?
• The gap in knowledge in this field (especially regarding MAF): many
questions remain unanswered.
• Few Arab researchers work in this field, especially in Minia.
47
Introduction Literature review Data description pre-processing Methods Results Conclusion
48. 48
Research Objectives
Introduction Literature review Data description pre-processing Methods Results Conclusion
• Practically implement computational algorithms to partition
genotyped data based on the haplotype blocks.
• Find the best haplotype partitioning method applied for the
whole-genome case-control dataset to reduce the number
of SNPs in the association study.
55. The output

Start index   End index   Start rsID    End rsID
                          rs22572xxx    rs225722xxx
                          rs74307xxx    rs198574xxx
56. Major findings
Introduction Literature review Data description pre-processing Methods Results Conclusion
• Practical data exploration, uncovering interesting findings
• Investigation of partitioning methods from the literature review and an
empirical comparative study (biomarker reduction with high SNP correlation)
• A sequence of data preprocessing steps in R (hoped to become an R package)
• The role of MAF in haplotype block partitioning
62. Literature review
62
Introduction Literature review Data description pre-processing Methods Results Conclusion
The main projects in genomics and haplotype block partitioning methods
63. Timeline of the main projects in genomics
Human Genome Project (HGP):
1990  The announcement of the HGP
2001  The initial HGP sequencing
2003  The completion of the HGP
International HapMap Project:
2002  The announcement of the International HapMap Project
2005  HapMap Phase I completion
2007  HapMap Phase II completion
2009  HapMap Phase III completion
The 1000 Genomes Project (1KGP):
2008  The announcement of the 1KGP
2010  The completion of the pilot phase
2012  The 1KGP fulfills its goal
2015  The 1KGP's completion
The 100,000 Genomes Project:
2012  The 100,000 Genomes Project's announcement
2015  Northern Ireland and Scotland join the project
2019  Beginning of the initiative to involve the public in genomic research
68. Haplotype block partitioning methods, from 2001 to 2003:
• Hidden Markov model (HMM), 2001
• Greedy algorithm (GA), 2002
• Dynamic programming (DP), 2002
• Confidence interval (CI), 2002
• Four-gamete test (FGT), 2002
• The minimum description length (MDL), 2003
Key contributors: Ning Wang, Kui Zhang, Stacey Gabriel, Nila Patil, Mark Daly, Mikko Koivisto
69. Haplotype block partitioning methods, from 2005 to 2013:
• Solid spine of LD (SSLD), 2005
• Markov chain Monte Carlo (MCMC) algorithm, 2008
• Xor-genotypes, 2009
• Wavelet transforms, 2011
• GA-SVM algorithm, 2013
Key contributors: Jeffrey Barrett, Pattaro
70. Haplotype block partitioning methods, from 2014 to 2020:
• MIG++, 2014
• S-MIG++, 2015
• Big-LD, 2018
• Neutrosophic c-means (NCM) algorithm, 2020
• LDBlockShow, 2020
Key contributors: Daniel Taliun, Sunah Kim
73. Introduction Literature review Data description pre-processing Methods Results Conclusion
Data description and exploration
NARAC dataset description:
• Map file
• SNP array data file
• Missing data
• Participants (% cases, % controls, % male, % female)
• SNP annotation
• Alleles distribution
• Genotype distribution
• Rare SNPs
• Very rare SNPs
Visualization from the MAF viewpoint:
• MAF for each SNP
• Low-frequency SNPs
• Common SNPs
• % male, % female
74. Sample of the SNPs' array data

ID         Affection  Sex  DRB1_1  DRB1_2  SENum  SEStatus  Anti-CCP  RFUW  rs10439884
D0024949   0          F    0101    0401    SS     yes       ?         ?     G_G
D0024302   0          F    0101    7       SN     yes       ?         ?     G_G
D0023151   0          F    0101    11      SN     yes       ?         ?     G_G
D0022042   0          F    0101    2       SN     yes       ?         ?     G_G
D0021275   0          F    0101    7       SN     yes       ?         ?     G_G
D0021163   0          F    0101    0403    SN     yes       ?         ?     G_G
D0020795   0          F    0101    3       SN     yes       ?         ?     G_G
6045201    1          F    0101    7       SN     yes       31.3      142   G_G
D0023027   0          M    0101    3       SN     yes       ?         ?     G_G
1015200    1          M    0101    0403    SN     yes       112.9     405   ?_?
D0015941   0          F    2       7       NN     no        ?         ?     A_G
D0016405   0          F    0101    7       SN     yes       ?         ?     ?_?
KNH763243  1          M    0404    0301    SN     yes       99        ?     G_G
75. Sample of the map file

Chromosome  rsID        Position
1           rs3094315   792429
1           rs12562034  808311
11          rs3802985   188510
11          rs3741411   189256
21          rs2821850   13693682
21          rs2257226   13695103
76. North American Rheumatoid Arthritis Consortium (NARAC) dataset

         Cases (RA)  Controls  Total
Male     227         342       569
Female   641         852       1,493
Total    868         1,194     2,062
97. Our data consist of about 545,080 SNPs for
about 2,062 individuals (cases and controls).
Matrix size = 2,062 × 545,080 = 1,123,954,960 cells,
about 5,619,774,800 characters
(taking into account that each genotype, homozygous or heterozygous,
is stored as a string).
Introduction Literature review Data description pre-processing Methods Results Conclusion
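The sizes quoted on this slide can be checked with quick arithmetic; the factor of 5 characters per stored genotype string is inferred from the slide's own totals:

```python
n_snps, n_individuals = 545_080, 2_062

cells = n_individuals * n_snps   # one genotype per individual per SNP
chars = cells * 5                # assumed ~5 characters per stored genotype string
print(cells)   # 1123954960
print(chars)   # 5619774800
```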
98. SNPs from chromosome 1 to chromosome 22:
531,689 SNPs × 2,062 participants = 1,096,342,718 cells
Introduction Literature review Data description pre-processing Methods Results Conclusion
99. Reading and cropping
Starting with reading the genotyped data and removing the first 9 columns.
Introduction Literature review Data description pre-processing Methods Results Conclusion
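A minimal sketch of this step in Python (the thesis does it in R; the column layout follows the SNP-array sample slide, with 9 leading phenotype columns before the first SNP, and a tiny in-memory stand-in replaces the real NARAC file):

```python
# Stand-in for the whitespace-delimited NARAC file:
# 9 leading columns (ID .. RFUW), then the SNP genotypes.
raw = """\
ID Affection Sex DRB1_1 DRB1_2 SENum SEStatus AntiCCP RFUW rs10439884 rs2260810
D0024949 0 F 0101 0401 SS yes ? ? G_G A_A
6045201 1 F 0101 7 SN yes 31.3 142 A_G A_G
"""

rows = [line.split() for line in raw.splitlines()]
header, records = rows[0], rows[1:]

snp_names = header[9:]            # crop: drop the first 9 columns
genotypes = [r[9:] for r in records]
print(snp_names)   # ['rs10439884', 'rs2260810']
print(genotypes)   # [['G_G', 'A_A'], ['A_G', 'A_G']]
```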
101. 101
rs10439884 rs2260810 rs1296971 rs2257224
GG AA AA GG
GG AA AA GG
GG AA AA GG
GG AA AA AA
GG AA AA GG
GG AA AA GG
GG AA AA GG
GG GG CC AG
AG AG AC AG
AG AG AC GG
NA AG AC AG
GG AA AA GG
Introduction Literature review Data description pre-processing Methods Results Conclusion
102. Convert the genotype matrices and their map file into gpData form in preparation
for the codeGeno function.
Introduction Literature review Data description pre-processing Methods Results Conclusion
103. Imputation
using marginal allele distribution
Introduction Literature review Data description pre-processing Methods Results Conclusion
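The idea of marginal-distribution imputation can be sketched as follows. This is a simplified Python stand-in for what synbreed's codeGeno does in the thesis pipeline; the sampling weights come from the observed genotypes in the same SNP column:

```python
import random
from collections import Counter

def impute_column(column, rng=None):
    """Replace missing entries ('NA') by draws from the marginal
    distribution of the observed genotypes in the same column."""
    rng = rng or random.Random(0)
    freqs = Counter(g for g in column if g != "NA")
    pool, weights = zip(*freqs.items())
    return [g if g != "NA" else rng.choices(pool, weights=weights)[0]
            for g in column]

col = ["GG", "GG", "AG", "NA", "GG", "AA", "NA"]
print(impute_column(col))  # every 'NA' replaced by GG, AG, or AA
```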
105. Imputed data in bi-allelic format
Introduction Literature review Data description pre-processing Methods Results Conclusion
106. After imputation and recoding using the synbreed R package
Introduction Literature review Data description pre-processing Methods Results Conclusion
0 == homozygous for the reference (major) allele
1 == heterozygous
2 == homozygous for the minor allele
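In the usual synbreed-style coding, the value counts copies of the minor allele (0 = homozygous major, 1 = heterozygous, 2 = homozygous minor). A few-line Python stand-in for the recoding:

```python
from collections import Counter

def recode_012(genotypes):
    """Minor-allele dosage coding: 0 = homozygous major,
    1 = heterozygous, 2 = homozygous minor."""
    allele_counts = Counter(a for g in genotypes for a in g)
    minor = min(allele_counts, key=allele_counts.get)
    return [g.count(minor) for g in genotypes]

print(recode_012(["GG", "GG", "AG", "GG", "AA"]))  # [0, 0, 1, 0, 2]
```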
107. The output of imputation and
recoding
Pre-processed dataset
Introduction Literature review Data description pre-processing Methods Results Conclusion
108. From bi-allelic format to 1,2,3,4 format
A🡪1
C🡪2
G🡪3
T🡪4
Introduction Literature review Data description pre-processing Methods Results Conclusion
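This letter-to-number conversion is a direct mapping; a minimal sketch:

```python
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def to_numeric(genotype):
    """Map a bi-allelic genotype string such as 'AG' to its 1,2,3,4 form."""
    return tuple(CODE[allele] for allele in genotype)

print(to_numeric("AG"), to_numeric("TT"))  # (1, 3) (4, 4)
```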
109. Preparing data for Haploview (linkage format)
Columns: Family ID, Individual ID, Paternal ID, Maternal ID, sex, affection status,
followed by the SNP genotypes (SNP 1, SNP 2, ...)
Introduction Literature review Data description pre-processing Methods Results Conclusion
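Assembling one row of a Haploview linkage-format file can be sketched like this; the six pedigree columns are standard for the format, and the identifiers below are made up for illustration:

```python
def ped_row(family_id, individual_id, paternal_id, maternal_id,
            sex, affection, genotypes):
    """One linkage-format row: six pedigree columns, then each
    SNP's two alleles (already in 1,2,3,4 coding)."""
    fields = [family_id, individual_id, paternal_id, maternal_id,
              str(sex), str(affection)]
    for a1, a2 in genotypes:
        fields += [str(a1), str(a2)]
    return " ".join(fields)

# Hypothetical individual: female (2), affected (2), two SNPs
print(ped_row("F1", "IND1", "0", "0", 2, 2, [(3, 3), (1, 3)]))
# F1 IND1 0 0 2 2 3 3 1 3
```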
110. Introduction Literature review Data description pre-processing Methods Results Conclusion
Minor allele frequency (MAF) quality control
111. Why study the effect of MAF?
In 2019, Saad et al. showed that different MAF thresholds
discard significant SNPs while the block size stays the same:
the same-sized LD block contains 8 SNPs in the CIT
and 12 SNPs in SSLD.
114. 114
Introduction Literature review Data description pre-processing Methods Results Conclusion
Methods and workflow
The proposed methods for haplotype block partitioning
116. Flowchart and system description
1. Start: the NARAC chromosome-21 genotype dataset and its map file (chromosome-21 positions).
2. Reformat the dataset.
3. Imputation using ImputR.
4. Biomarker check.
5. Apply five MAF thresholds: 0.001, 0.01, 0.02, 0.05, and 0.1.
6. For each threshold, partition haplotype blocks with Haploview and with R (BigLD).
7. Comparison and calculations.
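The fan-out over the five thresholds amounts to filtering the SNP set before each partitioning run. A schematic Python loop (the MAF values below are invented for illustration; the real runs feed Haploview and BigLD):

```python
MAF_THRESHOLDS = [0.001, 0.01, 0.02, 0.05, 0.1]

def passing_snps(mafs, threshold):
    """Indices of SNPs whose MAF meets the quality-control threshold."""
    return [i for i, m in enumerate(mafs) if m >= threshold]

mafs = [0.005, 0.03, 0.12, 0.0005, 0.08]   # toy MAF values
for t in MAF_THRESHOLDS:
    kept = passing_snps(mafs, t)
    print(t, kept)   # each threshold feeds a separate partitioning run
```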
117. Flowchart and system description: NARAC 22-chromosome input files
1. Input: NARAC genomic data (2,062 individuals) and the NARAC map file (545,080 SNPs).
2. Perl: split into per-chromosome data and map files (ch1 to ch22).
3. R pre-processing: imputation and recoding.
4. Reformatting for Haploview and reformatting for BigLD
(chromosome-separated data and map files).
5. Apply five MAF thresholds: 0.001, 0.01, 0.02, 0.05, and 0.1.
6. Haplotype block partitioning with FGT, CIT, SSLD, and BigLD.
7. Output: haplotype blocks for the 22 chromosomes using 4 methods and 5 MAF thresholds.
118. The same workflow in brief: the ch1 to ch22 data and map files are pre-processed
(imputation and recoding), reformatted for Haploview and for BigLD, filtered at the
five MAF thresholds (0.001, 0.01, 0.02, 0.05, 0.1), and partitioned with FGT, CIT,
SSLD, and BigLD, yielding haplotype blocks for the 22 chromosomes.
124. Heatmap for the haplotype blocks detected by interval
graph modeling of clusters for a portion of chromosome 21,
from 9,993,822 bp to 14,137,685 bp.
125. • Confidence interval test (CIT)
• Four-gamete test (FGT)
• Solid spine of linkage disequilibrium (SSLD)
Haploview
133. Introduction Literature review Data description pre-processing Methods Results Conclusion
1) The total number of haplotype blocks
(chart comparing FGT, CIT, SSLD, and BigLD)
134. Introduction Literature review Data description pre-processing Methods Results Conclusion
1) The total number of haplotype blocks
The smaller the number of haplotype blocks, the greater the reduction rate.
135. 135
Introduction Literature review Data description pre-processing Methods Results Conclusion
The MAF and total number of haplotype blocks
136. Introduction Literature review Data description pre-processing Methods Results Conclusion
1) The total number of haplotype blocks
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
137. Introduction Literature review Data description pre-processing Methods Results Conclusion
2) Total number of blocks, considering the singletons
(chart comparing CIT, FGT, BigLD, and SSLD)
138. Introduction Literature review Data description pre-processing Methods Results Conclusion
2) Total number of blocks, considering the singletons
139. Introduction Literature review Data description pre-processing Methods Results Conclusion
3) Total number of SNPs in all blocks
(chart comparing SSLD, BigLD, FGT, and CIT)
141. Introduction Literature review Data description pre-processing Methods Results Conclusion
4) The total length of all blocks (bp)
(chart; SSLD)
142. Introduction Literature review Data description pre-processing Methods Results Conclusion
4) The total length of all blocks (bp)
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
143. Introduction Literature review Data description pre-processing Methods Results Conclusion
5) Mean number of SNPs in blocks
The BigLD and SSLD have a higher mean number of SNPs in blocks than FGT and CIT.
144. Introduction Literature review Data description pre-processing Methods Results Conclusion
5) Mean number of SNPs in blocks
• MAF does not affect the mean number of SNPs in blocks much.
• The highest mean number of SNPs in blocks is in chromosome 6,
using SSLD with MAF=0.1, which equals 6.637.
145. Introduction Literature review Data description pre-processing Methods Results Conclusion
5) Mean number of SNPs in blocks
BigLD has almost the same mean number of SNPs in blocks across the range
from 0.001 to 0.05 and a higher mean number at MAF=0.1.
SSLD's mean number of SNPs in blocks increases as the MAF threshold increases.
146. Introduction Literature review Data description pre-processing Methods Results Conclusion
6) The mean block length in base pairs
(chart comparing SSLD, BigLD, CIT, and FGT)
147. Introduction Literature review Data description pre-processing Methods Results Conclusion
6) The mean block length in base pairs
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
148. Introduction Literature review Data description pre-processing Methods Results Conclusion
7) The mean r2 within blocks
The mean correlation r2 within the blocks is generally higher in BigLD.
(chart comparing BigLD, CIT, FGT, and SSLD)
149. Introduction Literature review Data description pre-processing Methods Results Conclusion
7) The mean r2 within blocks
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
In contrast, in the BigLD method, the mean r2 within a block
decreases as the MAF threshold increases.
150. Introduction Literature review Data description pre-processing Methods Results Conclusion
8) The mean r2 between consecutive blocks
(without considering the singleton blocks)
(chart comparing FGT, CIT, BigLD, and SSLD over MAF thresholds
0.1, 0.05, 0.02, 0.01, and 0.001)
151. 151
Introduction Literature review Data description pre-processing Methods Results Conclusion
8) The mean r2 between consecutive blocks
(without considering the singleton blocks)
152. Introduction Literature review Data description pre-processing Methods Results Conclusion
8) The mean r2 between consecutive blocks
(without considering the singleton blocks)
(chart comparing FGT, CIT, BigLD, and SSLD)
153. Introduction Literature review Data description pre-processing Methods Results Conclusion
9) The mean r2 between all consecutive blocks
(including singleton blocks)
(chart comparing FGT, CIT, SSLD, and BigLD)
154. Introduction Literature review Data description pre-processing Methods Results Conclusion
9) The mean r2 between all consecutive blocks
(including singleton blocks)
156. The agreement (matching percentage) of haplotype blocks
produced by the compared methods

Methods                       Matching percentage
FGT, CIT, and SSLD            67%
FGT, CIT, SSLD, and Big-LD    57.45%
FGT and Big-LD                78.6%
CIT and Big-LD                76.7%
SSLD and Big-LD               71.92%
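One simple way such an agreement percentage can be computed is by comparing block boundaries; the thesis's exact matching criterion may differ, so treat this as an illustrative sketch with toy partitions:

```python
def agreement(blocks_a, blocks_b):
    """Fraction of blocks in A whose (start, end) boundaries
    also occur as a block in B."""
    b_set = set(blocks_b)
    return sum(block in b_set for block in blocks_a) / len(blocks_a)

# Toy partitions given as (start index, end index) pairs
a = [(0, 4), (5, 9), (10, 12)]
b = [(0, 4), (5, 8), (10, 12)]
print(round(agreement(a, b) * 100, 1))  # 66.7
```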
158. Plot of a sample of chromosome 21 haplotype blocks produced by
FGT, CIT, SSLD, and Big-LD.
159.
160. The comparison between Big-LD, FGT, CIT, and SSLD haplotype block
partitioning methods in chromosome 21

Compared parameters                    Big-LD      FGT         CIT         SSLD
Max. No. of SNPs in each block         26          17          20          27
Total No. of blocks                    1,182       1,562       1,464       1,378
Max. block size (in bp)                190,708     140,491     178,064     218,644
Min. block size (in bp)                34          4           2           12
Percentage of uncovered SNPs           14.5%       12.8%       22.1%       4.9%
Median No. of SNPs within each block   4           4           3           4
Median block size (in bp)              9,830       7,551       6,783       10,870
Total block size (in bp)               23,932,662  23,452,817  23,696,256  23,696,256
164. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• The alleles distribution and description.
• The percentage of SNPs appearing at each physical location in the
chromosomes is affected by the SNP's MAF.
• Genotype imputation and preprocessing are crucial steps in HBP, and we
produced a preprocessing sequence that facilitates many kinds of genetic analysis.
165. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• HBP reduces the biomarkers to about 13%.
• The Big-LD method provided robust block partitioning in terms of
block size and genomic coverage.
166. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• There is about a 70% intersection agreement among most HBP methods;
Big-LD matched most with FGT.
• FGT produces modest results in terms of correlation and biomarker reduction.
• BigLD produced large haplotype blocks and showed high r2 between blocks,
and the lowest r2 between blocks when the singleton blocks are considered.
• In terms of computation, BigLD takes less than half the computational time
of the Haploview methods.
167. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• MAF quality control has a strong effect on haplotype block partitioning.
• We recommend taking the MAF into consideration when applying a
haplotype block partition. However, it is a tradeoff: a higher MAF
produces a higher correlation within the blocks, but it truncates a
portion of the data that could be significant.
• In terms of correlation, we recommend using a high MAF with the
Haploview methods, and a low or moderate MAF with the BigLD method.
168. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• We could answer the question related to MAF: the number of blocks
does not necessarily affect the number of SNPs within blocks.
• At the same MAF, the number of SNPs within blocks is highest in SSLD
(due to its block size) and lowest in CIT.
• At the same block size, the number of SNPs within blocks decreases
as the MAF increases.