This document discusses the challenges of large-scale DNA resequencing projects at the Wellcome Trust Sanger Institute. It describes work on several major projects that have generated over 100 terabases of sequencing data, including the 1000 Genomes Project and UK10K. It outlines the institute's data processing pipeline and its strategies for managing data at this scale through tiered storage and automated compute workflows. Finally, it notes that new high-throughput sequencers such as the Illumina HiSeq 2500 will allow a human genome to be sequenced in a single day, further increasing the scale of data production.
Large Scale Resequencing: Approaches and Challenges
1. Large Scale Resequencing: Approaches and Challenges
Thomas Keane
Vertebrate Resequencing Informatics group
Wellcome Trust Sanger Institute
Hinxton, Cambridge, UK
thomas.keane@sanger.ac.uk
AGBT Tutorial Workshop 15th February, 2012
4. Vertebrate Resequencing Informatics Group
Established in 2008 with Jim Stalker
PIs: Richard Durbin and David Adams
Initial projects
1000 Genomes project (http://www.1000genomes.org)
Data processing, releases, aligner evaluation, sequencing
Pilot 2008-2009: ~5Tbp (Nature 2010;467)
Phase 1 2009-2011: ~30Tbp
Phase 2 2011-: ~36.9Tbp (LowCov ilmn only)
Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
Sequencing 17 laboratory mouse strains
SNPs, indels, SVs, de novo assembly
Approx. ~1.2Tbp (Nature 2011;477)
5. UK10K
Investigating the role of rare genetic variants in health and disease
Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection:
TWINSUK – 2,000
ALSPAC – 2,000
6x (18Gbp) per sample
Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
Neurodevelopmental diseases – 3,000
e.g. schizophrenia, autism spectrum disorders
Obesity – 2,000
e.g. severe childhood onset obesity
Rare diseases – 1,000
e.g. severe insulin resistance, congenital heart disease, ciliopathies
5Gbp per sample
Expect to generate ~100Tbp by end 2012
~40Tbp from BGI
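As a sanity check, the cohort sizes and per-sample yields above roughly reproduce the ~100Tbp figure. A back-of-the-envelope sketch in Python; all numbers are taken from this slide:

    # Rough UK10K data-volume estimate from the cohort sizes above
    whole_genomes = 4000        # TWINSUK (2,000) + ALSPAC (2,000)
    gbp_per_genome = 18         # 6x whole-genome coverage ~= 18Gbp per sample
    exomes = 6000               # neurodevelopmental + obesity + rare disease sets
    gbp_per_exome = 5

    total_gbp = whole_genomes * gbp_per_genome + exomes * gbp_per_exome
    print(f"Projected total: ~{total_gbp / 1000:.0f}Tbp")  # ~102Tbp, i.e. ~100Tbp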
6. Current Status
UK10K recently passed the 1000 Genomes Project in terms of total Gbp generated
7. What are the challenges?
NGS data production creates challenges on three fronts: storage, software/workflows, and compute power
10. Storage Challenges
Expect ~200Tbp of sequence in 2011-2012
Working estimate including processing, release, and variant calling: 10 bytes per bp (see the sizing sketch at the end of this slide)
Storage considerations
Scalability – can we easily add more storage units?
Backup and disaster recovery – what do we really need to keep?
Performance – sufficient I/O throughput to serve compute nodes
Cost
Data Formats
Standardised formats – BAM & VCF
Minimise the number of copies
Aim for two copies at most – original lanes + release (stripped) BAM
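Applying that working estimate to the projected volume gives the raw capacity requirement. A sketch only; real sizing would also account for replication and tiering:

    # Disk estimate: ~200Tbp of sequence at 10 bytes per bp
    tbp = 200
    bytes_per_bp = 10
    total_bytes = tbp * 1e12 * bytes_per_bp
    print(f"~{total_bytes / 1e15:.0f} PB of storage")  # ~2 PB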
11. A Tiered Storage Solution
[Diagram: the three storage levels arranged by cost and size, with the CPU farm served at 3Gb/sec and 800Mb/sec by the faster levels and two off-site copies of the archival data]
Level 1
Data: Current release vertical BAMs
Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
Level 2
Data: Lane level BAMs
Processes: Alignment, recalibration, local realignment
Level 3
Data: Previous release BAMs + variant calls backup
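A minimal sketch of the placement policy these levels describe; the category names are paraphrased from the slide and the function is purely illustrative:

    # Illustrative tier-placement policy for the three storage levels above
    TIER_FOR = {
        "current_release_bam": 1,    # fast tier: merging/splitting, variant calling
        "lane_bam": 2,               # mid tier: alignment, recalibration, realignment
        "previous_release_bam": 3,   # archival tier, backed up off-site
        "variant_calls_backup": 3,
    }

    def storage_tier(category: str) -> int:
        """Return the storage level a data category belongs on."""
        return TIER_FOR[category]

    print(storage_tier("lane_bam"))  # 2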
12. Data release + archiving: iRODS
iRODS – the integrated Rule-Oriented Data System
Open source – origins in particle physics world
Most important feature of iRODS is the Rule Engine
Akin to a source control system
Customise your own application-level metadata
e.g. run, lane, plex, sample, library, …
[Diagram: data objects distributed across NFS servers (nfs01, nfs02, nfs03, … nfs20) with an off-site copy]
Stores/searches key-value metadata on files.
List all files from UK10K studies:
    imeta -z seq qu -d study like 'UK10K_%'
    /seq/5363/5363_1.bam
    /seq/5363/5363_2.bam (... and a whole lot more)
Get metadata about a file:
    imeta ls -d /seq/6534/6534_3#7.bam sample
    attribute: sample
    value: QTL191953
(A Python wrapper sketch for these queries follows at the end of this slide.)
Sanger production: BAM files from each run are deposited per lane, per plex
BMC Bioinformatics 2011, 12:361
Recently adopted for UK10K internal data release and archiving
Users use meta-data queries to find their data
Files can be part of multiple releases
http://www.irods.org
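The imeta queries shown above are plain shell commands, so a pipeline can drive them programmatically. A hypothetical wrapper, assuming the iRODS icommands are installed; the output parsing is deliberately simplified to one path per line (real imeta output needs slightly more parsing):

    import subprocess

    def find_study_files(zone: str, study_pattern: str) -> list[str]:
        """Run the imeta metadata query from this slide and collect file paths.
        Assumes a simplified one-path-per-line output format."""
        result = subprocess.run(
            ["imeta", "-z", zone, "qu", "-d", "study", "like", study_pattern],
            capture_output=True, text=True, check=True,
        )
        return [ln for ln in result.stdout.splitlines() if ln.startswith("/")]

    # e.g. find_study_files("seq", "UK10K_%") -> ["/seq/5363/5363_1.bam", ...]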
13. Compute Pipeline Management: VRPipe
VRPipe
Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
Error handling, optimal memory requests, batching of jobs, retrying failures (sketched below), failure reporting, highly extendable, detailed job statistics
1000 Genomes Phase 2 processed through VRPipe
Tracked ~1 million jobs
Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
bwa_aln_fastq: ~2443 days total serial wall time
Mean memory: 941MB/job (max 5637)
Plans for 2012 (sb10@sanger.ac.uk):
Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
Management front-ends
Create distributable VM for cloud rollout
http://www.github.com/VertebrateResequencing/vr-pipe/wiki
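VRPipe itself is written in Perl; the retry-on-failure behaviour described above can nonetheless be sketched in a few lines of Python, as a conceptual illustration rather than VRPipe's actual code:

    import subprocess
    import time

    def run_with_retries(cmd: list[str], max_attempts: int = 3) -> None:
        """Run one pipeline step, resubmitting on failure as VRPipe does."""
        for attempt in range(1, max_attempts + 1):
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(60 * attempt)  # back off before retrying
        raise RuntimeError(f"step failed after {max_attempts} attempts: {cmd}")

    # e.g. run_with_retries(["bwa", "aln", "ref.fa", "lane.fastq"])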
14. Even more scale up in 2012 – HiSeq 2500
Currently takes 1-2 weeks to sequence a human genome
High depth human genomes in a single day – Illumina HiSeq 2500
Caucasian family with a severe T-cell deficiency in the affected sibling
Single run on HiSeq 2500 by Illumina per individual
Sample     PF Yield (Gbp)   ≥Q30 (%)   Align (%)   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
Father     117.7            89         92.6        0.4               0.5               25.5
Mother     125.7            90.2       92.8        0.4               0.5               25.5
Affected   124.4            90.3       92.4        0.4               0.5               25.5
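For context, the PF yields in the table translate into deep coverage. A rough sketch assuming a ~3.2Gbp human genome:

    # Approximate fold coverage implied by the PF yields above
    GENOME_GBP = 3.2  # approximate human genome size
    for sample, yield_gbp in [("Father", 117.7), ("Mother", 125.7), ("Affected", 124.4)]:
        print(f"{sample}: ~{yield_gbp / GENOME_GBP:.0f}x coverage")
    # Father: ~37x, Mother: ~39x, Affected: ~39x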
15. What does the data look like?
16. Upcoming Changes in 2012
We cannot keep all of the data
2007-2008: Keep everything including images from runs
2009: BAM/Fastq – all of the base quality information
2010-2011: Stripping original qualities and other unused tags
2012-: Current formats contain lots of repetition
Reference based compression
Reducing quality information, e.g. quality binning or quality budgets
Potential formats: CRAM and/or Reduced BAM
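Quality binning collapses the ~40 distinct Phred scores into a handful of representative values, which makes the quality stream far more compressible. A minimal sketch; the bin edges here are illustrative, not a published scheme:

    def bin_quality(q: int) -> int:
        """Map a Phred score to its bin's representative value (illustrative bins)."""
        bins = [(0, 2, 2), (3, 14, 8), (15, 29, 22), (30, 60, 37)]
        for lo, hi, rep in bins:
            if lo <= q <= hi:
                return rep
        raise ValueError(f"quality out of range: {q}")

    print([bin_quality(q) for q in [33, 35, 12, 2, 38, 20]])  # [37, 37, 8, 2, 37, 22]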
17. CRAM Format
[Figure: CRAM compression models applied to an example read (TGAGCTCTAAGTACC) and its quality string – "do nothing", horizontal and vertical lossless models, and a lossy quality treatment – with a log-scale chart (100 down to 0.1) comparing untreated size against CRAM lossless, a combination model, and the current substitutions/insertions model]
CRAM v0.6 released 13.2.12:
• Option to preserve all unmapped reads
• Pairing information preservation regardless of distance
• Revised and improved lossless mode
• Performance and bug fixes
• Arbitrary tags
http://www.ebi.ac.uk/ena/about/cram_toolkit
Source: Ewan Birney/Guy Cochrane, EBI
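The core idea behind reference-based compression as used by CRAM: an aligned read is stored as a position plus its differences from the reference, so bases that match the reference cost almost nothing. A toy sketch of that encoding (substitutions only; not the actual CRAM codec):

    def encode_read(ref: str, read: str, pos: int):
        """Encode an aligned read as (position, substitutions vs the reference)."""
        return pos, [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]

    def decode_read(ref: str, pos: int, subs, length: int) -> str:
        bases = list(ref[pos:pos + length])
        for i, b in subs:
            bases[i] = b
        return "".join(bases)

    ref = "ACGT" + "TGAGCTCTAAGTACC" + "GGA"   # toy reference around the figure's read
    pos, subs = encode_read(ref, "TGAGCTCTAAGAACC", 4)
    print(pos, subs)                            # 4 [(11, 'A')] – one mismatch stored
    assert decode_read(ref, pos, subs, 15) == "TGAGCTCTAAGAACC"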
18. Any questions?
Acknowledgements: Richard Durbin, David Adams
URLs
• VRPipe: https://github.com/VertebrateResequencing/vr-pipe
• iRODS@Sanger: BMC Bioinformatics 2011, 12:361
• http://www.slideshare.net/thomaskeane