The document describes the Genome Analysis Toolkit (GATK), a MapReduce framework for analyzing large DNA sequencing datasets. The GATK aims to simplify the development of analysis tools for next-generation sequencing data by providing structured access to sequencing reads and reference context, and a plug-in model for writing analysis tools. It uses a MapReduce approach to divide data into independent chunks that can be processed in parallel. The document outlines the GATK workflow and concepts, and provides an example of a simple Bayesian genotyper implementation within the GATK framework.
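The Bayesian genotyper example lends itself to a compact sketch. The toy model below (all names hypothetical, and far simpler than GATK's actual walker API) scores the three diploid genotypes at one reference position from the pileup of read bases, which is essentially what the map step of such a tool computes:

```python
from math import log

def genotype_log_likelihoods(bases, ref, alt, err=0.01):
    """Log-likelihood of each diploid genotype given the bases piled
    up at one site (toy model, not GATK's actual implementation)."""
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    out = {}
    for name, (a1, a2) in genotypes.items():
        ll = 0.0
        for b in bases:
            # A read samples either chromosome with probability 1/2 and
            # reports the true allele with probability (1 - err).
            p1 = (1 - err) if b == a1 else err
            p2 = (1 - err) if b == a2 else err
            ll += log(0.5 * p1 + 0.5 * p2)
        out[name] = ll
    return out

def call_genotype(bases, ref, alt):
    """Return the maximum-likelihood genotype for one pileup."""
    ll = genotype_log_likelihoods(bases, ref, alt)
    return max(ll, key=ll.get)
```

In a MapReduce framing, `call_genotype` would run independently at each reference position, which is what makes the chunked, parallel traversal possible.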
This document discusses scalable data analysis using R. It introduces Revolution Analytics' RevoScaleR package, which provides scalability from small to huge data sets using chunking and parallel external memory algorithms. This allows R code and analysis to remain the same regardless of data size or compute environment. The package scales to multiple cores, computers, and clouds to enable analysis of large data sets.
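The chunking idea is easy to illustrate outside of R. The sketch below (plain Python, hypothetical names; RevoScaleR itself works very differently under the hood) shows the external-memory pattern: reduce each chunk to small sufficient statistics, then combine only the partials, so the same code works whether chunks come from memory, disk, or other nodes:

```python
def chunked_mean(chunks):
    """Global mean from per-chunk sufficient statistics.

    Each chunk is reduced independently to a tiny (count, sum)
    partial; only the partials are combined, so no chunk ever needs
    to be co-resident with another.
    """
    total_n, total_sum = 0, 0.0
    for chunk in chunks:           # chunks could be streamed from disk
        total_n += len(chunk)
        total_sum += sum(chunk)
    return total_sum / total_n
```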
This document discusses distributed database systems and distributed query processing. It begins with an introduction that notes the differences between distributed and centralized query processing, including considering the physical data distribution and communication costs during query optimization in distributed systems. The document then provides an overview of its contents, which include discussions of centralized query processing, the basics of distributed query processing, global query optimization, and a summary. It also gives examples of motivations for distributed query processing like low response times, high throughput, and efficient hardware usage.
This document summarizes and compares three high-level parallel processing models: Pig Latin, SCOPE, and Hive. It discusses how each aims to address the limitations of traditional approaches to large-scale data analysis by providing a high-level scripting language that is compiled into optimized parallel tasks. While the ideas are similar, there are differences in programming style, extensibility, data models, and optimization strategies. Overall, the models evaluate tradeoffs between flexibility, performance, and usability for large-scale data analysis.
Query Processing: Query Processing Problem, Layers of Query Processing. Query Processing in Centralized Systems: Parsing & Translation, Optimization, Code Generation, Example. Query Processing in Distributed Systems: Mapping Global Query to Local, Optimization.
RAMSES: Robust Analytic Models for Science at Extreme Scales (Ian Foster)
This document discusses the RAMSES project, which aims to develop a new science of end-to-end analytical performance modeling of science workflows in extreme-scale science environments. The RAMSES research agenda involves developing component and end-to-end models, tools to provide performance advice, data-driven estimation methods, automated experiments, and a performance database. The models will be evaluated using five challenge workflows: high-performance file transfer, diffuse scattering experimental data analysis, data-intensive distributed analytics, exascale application kernels, and in-situ analysis placement.
The document discusses Oracle system catalogs which contain metadata about database objects like tables and indexes. System catalogs allow accessing information through views with prefixes like USER, ALL, and DBA. Examples show how to query system catalog views to get information on tables, columns, indexes and views. Query optimization and evaluation are also covered, explaining how queries are parsed, an execution plan is generated, and the least cost plan is chosen.
Mahout is a machine learning library that provides implementations of common machine learning algorithms like recommender systems, clustering, and classification. It started as a subproject of Apache Lucene in 2008 and became a top-level Apache project in 2010. Mahout algorithms can run locally or on Hadoop for distributed processing of large datasets. Key areas Mahout supports include recommender systems, clustering algorithms like k-means, and classification using algorithms like naive Bayes.
Multi-data-types Interval Decision Diagrams for XACML Evaluation Engine (Canh Ngo)
The document describes research on improving the performance of XACML policy evaluation engines using multi-data-type interval decision diagrams (MIDDs). The researchers propose transforming XACML policies into MIDD representations called X-MIDDs to enable more efficient evaluation. X-MIDDs decompose attribute logic expressions and combine decision paths. Policies are then evaluated by traversing the X-MIDD graph and applying combining algorithms at nodes. Experiments show the approach handles complex XACML logic and achieves high performance evaluation.
This document presents a comparative evaluation of Galaxy and Ruffus-based scripting workflows for DNA sequencing analysis pipelines. The research aims to identify the optimal workflow system by implementing DNA-seq analysis pipelines in both Galaxy and Ruffus and benchmarking their performance. Literature on existing workflow systems is reviewed. The document outlines the research objectives, design, methodology, and requirements for the DNA-seq analysis pipeline use case. Preliminary results indicate pros and cons of each approach, with further analysis of performance metrics still needed.
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ... (Kyong-Ha Lee)
This document proposes HadoopXML, a system for efficiently processing massive XML data and multiple twig pattern queries in parallel using Hadoop. Key features of HadoopXML include:
1) It partitions large XML files and processes them in parallel across nodes while preserving structural information.
2) It simultaneously processes multiple twig pattern queries with a shared input scan, without needing separate MapReduce jobs for each query.
3) It enables query processing tasks to share input scans and intermediate results like path solutions, reducing redundant processing and I/O.
4) It provides load balancing to fairly distribute twig join operations across nodes.
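The shared-input-scan idea in point 2 can be sketched in a few lines. The toy below (hypothetical names; real HadoopXML matches twig patterns over partitioned XML, not Python predicates) answers several queries while reading the input exactly once, instead of launching one job per query:

```python
def shared_scan(records, queries):
    """Evaluate many queries in a single pass over the input.

    'queries' maps a query name to a predicate; every record is read
    once and routed to all queries it satisfies, mirroring
    HadoopXML's shared input scan for multiple twig queries.
    """
    results = {name: [] for name in queries}
    for rec in records:                      # the input is scanned exactly once
        for name, pred in queries.items():
            if pred(rec):
                results[name].append(rec)
    return results
```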
This document summarizes Spark, a fast and general engine for large-scale data processing. Spark addresses limitations of MapReduce by supporting efficient sharing of data across parallel operations in memory. Resilient distributed datasets (RDDs) allow data to persist across jobs for faster iterative algorithms and interactive queries. Spark provides APIs in Scala and Java for programming RDDs and a scheduler to optimize jobs. It integrates with existing Hadoop clusters and scales to petabytes of data.
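The caching behavior that distinguishes RDDs from plain MapReduce can be mimicked in miniature. The class below is a single-process toy, not Spark's API: transformations are lazy closures over a lineage, and `cache()` memoizes the materialized result so later operations reuse it instead of recomputing:

```python
class ToyRDD:
    """A toy, in-memory stand-in for Spark's RDD (illustrative only)."""

    def __init__(self, compute):
        self._compute = compute     # lineage: how to (re)build the data
        self._cached = None

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = self._compute()      # materialize once, keep in memory
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

# Usage: the squared values are computed once and shared by later jobs.
base = ToyRDD(lambda: list(range(10))).map(lambda x: x * x).cache()
evens = base.filter(lambda x: x % 2 == 0).collect()
```

Without `cache()`, every `collect()` would re-run the whole lineage, which is the recomputation cost that makes iterative algorithms slow on plain MapReduce.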
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
Beyond Map/Reduce: Getting Creative With Parallel Processing (Ed Kohlwey)
While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN and NextGen Map/Reduce have been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, and Accumulo.
We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
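The Bulk-Synchronous-Parallel model mentioned above reduces to a simple loop of per-vertex compute, message exchange, and a global barrier. A single-process sketch (hypothetical names, no real parallelism) follows:

```python
def bsp_run(initial_states, compute, supersteps):
    """A minimal Bulk-Synchronous-Parallel loop (toy, single-process).

    Each superstep: every vertex computes from its own state and its
    inbox, emits messages, then a barrier delivers all messages
    before the next superstep begins.
    """
    states = dict(initial_states)
    inboxes = {v: [] for v in states}
    for _ in range(supersteps):
        outboxes = {v: [] for v in states}
        for v in states:
            states[v], msgs = compute(v, states[v], inboxes[v])
            for dest, m in msgs:
                outboxes[dest].append(m)
        inboxes = outboxes          # barrier: messages visible next superstep
    return states

# Usage: propagate the maximum value around a 3-node ring.
def ring_max(v, state, inbox):
    new = max([state] + inbox)
    return new, [((v + 1) % 3, new)]

final = bsp_run({0: 1, 1: 5, 2: 3}, ring_max, supersteps=3)
```

This compute/barrier structure is the model implemented, at scale, by systems such as Giraph and Golden Orb.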
IRJET: Review of Existing Methods in K-Means Clustering Algorithm (IRJET Journal)
The document reviews existing methods for the k-means clustering algorithm. It discusses how k-means clustering works and some of its limitations when dealing with large datasets, such as being dependent on the initial choice of centroids. It then proposes using Hadoop to overcome big data challenges and calculate preliminary centroids for k-means clustering in a distributed manner. Finally, it reviews different techniques that have been proposed in other research to improve k-means clustering, such as methods for selecting better initial centroids or determining the optimal number of clusters.
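One family of "better initial centroids" methods can be shown concretely. The deterministic farthest-first seeding below (a toy on 1-D points, not any specific paper's method) spreads the starting centroids across the data instead of picking them at random, reducing k-means' sensitivity to initialization:

```python
def farthest_first_init(points, k):
    """Deterministic 'farthest-first' seeding for k-means.

    Start from the first point, then repeatedly add the point whose
    squared distance to its nearest already-chosen centroid is
    largest, so the seeds cover the data's spread.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min((p - c) ** 2 for c in centroids))
        centroids.append(farthest)
    return centroids
```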
Implementation of p-PIC Algorithm in MapReduce to Handle Big Data (eSAT Publishing House)
This document presents an implementation of the p-PIC clustering algorithm using the MapReduce framework to handle big data. P-PIC is a parallel version of the Power Iteration Clustering (PIC) algorithm that is able to cluster large datasets in a distributed environment. The document first provides background on PIC and challenges with scaling to big data. It then describes how p-PIC addresses these challenges using MPI for parallelization. The design of implementing p-PIC within MapReduce is presented, including the map and reduce functions. Experimental results on synthetic datasets up to 100,000 records show that p-PIC using MapReduce has increased performance and scalability compared to the original p-PIC implementation using MPI.
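The core of PIC, which p-PIC parallelizes, fits in a few lines. The toy below (pure Python on tiny dense matrices; real implementations distribute the matrix-vector products via MPI or MapReduce) runs truncated power iteration on the row-normalized affinity matrix and thresholds the resulting 1-D embedding:

```python
def pic_embedding(affinity, iters=20):
    """Toy Power Iteration Clustering embedding.

    Seed a vector with node degrees, repeatedly apply the
    row-normalized affinity matrix, and renormalize; truncated
    iteration leaves well-separated clusters at distinct values.
    """
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    norm = [[a / d for a in row] for row, d in zip(affinity, degrees)]
    v = [d / sum(degrees) for d in degrees]          # degree-based seed
    for _ in range(iters):
        v = [sum(norm[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(abs(x) for x in v)
        v = [x / s for x in v]
    return v

def pic_cluster(affinity, iters=20):
    """Split nodes into two clusters by thresholding the embedding at its mean."""
    v = pic_embedding(affinity, iters)
    mean = sum(v) / len(v)
    return [0 if x < mean else 1 for x in v]
```

The parallel versions the paper compares distribute exactly the matrix-vector product inside the loop, which dominates the cost on large affinity matrices.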
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help protect against developing mental illness and improve symptoms for those who already suffer from conditions like anxiety and depression.
This excerpt tells of a soothsayer named Pak Belalang who is summoned by the king to determine the sex of a duckling. Pak Belalang explains that the way to tell a duckling's sex is to release it into the water: the one that dives first is female, while the one that dives later is male.
Cytoscape Web is an interactive, web-based network browser: a pared-down version of Cytoscape, the open source software platform for visualizing and analyzing molecular interaction networks. It allows users to visualize networks, perform basic operations such as filtering nodes and edges, and export images of the network. Performance depends on factors such as the number of elements in the network; networks with more than 2,000 elements are usually sluggish.
The document provides troubleshooting tips for when things go wrong with cameras: first reset the camera, try a different or new tape, and never remove a damaged tape by hand. It advises taking a camera to a repair shop if it suffers liquid, shock, or sand damage, and being honest about any self-induced faults when requesting help, listing all symptoms and intermittent faults.
The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism (Stephan Kinsella)
"The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism,” by Stephan Kinsella. Rothbard Memorial Lecture, Austrian Scholars Conference, Mar. 13 2008. Accompanying audio/video available at
http://www.stephankinsella.com/media/
The key points are:
- The BSE Sensex plunged 769 points or almost 4% to close at 18,598 while the NSE Nifty lost 234 points or 4.08% to close at 5,507 on Friday, August 16, 2013.
- Stocks fell sharply due to the steep decline in the rupee which hit a new low of 62.03 against the US dollar and fears that the US Federal Reserve will begin tapering its bond buying program.
- Measures by the RBI to curb capital outflows and speculation have had little effect in stemming the rupee's slide and dampened investor sentiment, leading to the sharp falls in the Indian stock markets.
Social media presents both opportunities and risks for companies. It allows new ways to interact with stakeholders through marketing and recruitment. However, it also risks sensitive information leaks and legal/IP issues. In-house counsel should understand new technologies and provide early legal advice to address reputational, security and compliance risks. Companies need social media policies and employee training to mitigate risks while leveraging opportunities.
Educational Model to Illustrate HIV Infection Cycle (kcmurphy3)
This is the final paper for my Fall 2009 design project. We worked with our client, Marge Sutinen, who teaches the social science course: Contemporary Issues in HIV/AIDS at the University of Wisconsin - Madison. She asked us to create an educational model that would illustrate to students the process of HIV attachment, replication, and lysis; as well as portraying how HIV is different from other viruses and what makes it more dangerous than a virus that your body can fight off itself.
The document outlines teacher development programs offered by the Educational Technologies and Teacher Training Institute (INTEF) in Madrid, Spain. It describes INTEF's face-to-face and online training courses, tools and resources for teachers, and research initiatives. Key programs include summer courses, MOOCs, online courses, the E-Twinning platform, and resources like eXeLearning and Procomún. INTEF also works on interoperability solutions, connectivity plans, and developing a digital competence framework for teachers.
The document reports on issues from the Canadian Water Summit discussing water management. It summarizes recent regulations and guidance from the US and Canada on companies disclosing environmental impacts and water usage. It also outlines voluntary reporting standards from the Carbon Disclosure Project and Global Reporting Initiative for companies to disclose their water usage, impacts, and conservation efforts. Finally, it mentions a water management plan by Suncor and the development of standardized integrated reporting guidelines.
Os netbooks são pequenos computadores portáteis que são mais baratos e leves que os notebooks tradicionais. Eles possuem menos memória e processamento do que os notebooks, mas são ideais para navegar na internet e usar aplicativos básicos. Esta aula irá ensinar sobre as características e usos dos netbooks.
How digital can deliver your business goals (Chris Woods)
Marketers know new technology encourages greater customer engagement – but can digital also help brands realise their core objectives?
Putting digital technology at the heart of your business can encourage harmonisation and break down silo thinking – but how is that achieved?
What methods can be employed to turn customers into communities and integrate businesses departments around your brand’s core principles?
This discussion will help you understand:
How to encourage internal change through new technology
How business aims can be met through focusing on digital
Why building and managing communities matters
Speakers:
- Matt Ballantine, Principal Evangelist, Microsoft
- Chris Woods, Head of Digital, Hanover
- Louis Georgiou, Owner and Director, Code Computerlove
>> View the webinar recording
http://www.themarketer.co.uk/knowledge-centre/marketing-transformation-how-digital-can-deliver-your-business-goals/
Azure Machine Learning: Deep Learning with Python, R, Spark, and CNTK (Herman Wu)
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
This deck was presented at the Spark meetup in Bangalore. The key idea behind the presentation was to focus on the limitations of Hadoop MapReduce and to introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
The Performance of MapReduce: An In-depth Study (Kevin Tong)
This document summarizes a study that evaluated techniques for improving the performance of MapReduce-based systems. The study identified several bottlenecks including I/O mode, record parsing, and sorting. It implemented optimizations like direct I/O, mutable decoding, and fingerprint-based sorting. Benchmarking showed these optimizations improved performance by 2.5-3.5 times, bringing MapReduce performance closer to parallel database systems. The authors conclude the techniques are effective but note further work is needed to develop a complete queryable MapReduce framework.
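Fingerprint-based sorting is simple to demonstrate. The sketch below is illustrative, not the paper's implementation: it sorts by a cheap fixed-size CRC32 fingerprint and breaks ties with the full key. For MapReduce this is sufficient, since the grouping that feeds the reduce phase only needs equal keys to end up adjacent, not in lexicographic order:

```python
import zlib

def fingerprint_sort(keys):
    """Sort/group records by a cheap fingerprint of the key.

    Comparing fixed-size integers is much cheaper than comparing
    long string keys; the full key is consulted only to break
    fingerprint collisions.
    """
    return sorted(keys, key=lambda k: (zlib.crc32(k.encode()), k))
```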
PyCVF is a Python framework for computer vision that aims to improve upon traditional frameworks. It avoids archaic file formats, uses better software design inspired by other frameworks, and provides essential concepts like datatypes, databases, nodes, models, and applications to build computer vision systems in a more unified and performant way. The preliminary version provides wrappers for common computer vision libraries and tools to browse and analyze image databases.
This document discusses large scale computing with MapReduce. It provides background on the growth of digital data, noting that by 2020 there will be over 5,200 GB of data for every person on Earth. It introduces MapReduce as a programming model for processing large datasets in a distributed manner, describing the key aspects of Map and Reduce functions. Examples of MapReduce jobs are also provided, such as counting URL access frequencies and generating a reverse web link graph.
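The URL-access-frequency example can be written as the classic two-phase job. The single-process sketch below assumes a hypothetical log format in which the URL is the first whitespace-separated field; in a real deployment the framework would shuffle the map output across nodes between the two phases:

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map: emit a (url, 1) pair for every request in the access log."""
    for line in log_lines:
        url = line.split()[0]          # assumed log format: URL first
        yield url, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct URL."""
    counts = defaultdict(int)
    for url, n in pairs:
        counts[url] += n
    return dict(counts)
```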
Large Scale Machine Learning with Apache Spark – Cloudera, Inc.
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLLib, a library of machine learning algorithms for large data. The presentation will cover the state of MLLib and the details of some of the scalable algorithms it includes.
The document summarizes two use cases for Hadoop in biotech companies. The first case discusses a large biotech firm "N" that implemented Hadoop to improve their drug development workflow using next generation DNA sequencing. Hadoop reduced the workflow from 6 weeks to 2 days. The second case discusses challenges at another biotech firm "M" around scaling genomic data analysis and Hadoop's role in addressing those challenges through improved data ingestion, storage, querying and analysis capabilities.
Dynamically Optimizing Queries over Large Scale Data Platforms – INRIA-OAK
Enterprises are adopting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this talk, I will present new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.
Fedbench - A Benchmark Suite for Federated Semantic Data Processing – Peter Haase
(1) FedBench is a benchmark suite for evaluating federated semantic data processing systems.
(2) It includes parameterized benchmark drivers, a variety of RDF datasets and SPARQL queries, and an evaluation framework to measure system performance.
(3) An initial evaluation was conducted to demonstrate FedBench's flexibility in comparing centralized and federated query processing using different systems and scenarios.
MapReduce is a programming model for processing large datasets in a distributed system. It allows parallel processing of data across clusters of computers. A MapReduce program defines a map function that processes key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The MapReduce framework handles parallelization of tasks, scheduling, input/output handling, and fault tolerance.
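As a hedged illustration of the model just described, here is a minimal single-machine word-count sketch in Python (the real framework distributes the map and reduce phases across a cluster; `map_fn`, `reduce_fn`, and `map_reduce` are illustrative names, not part of any MapReduce API):

```python
from itertools import groupby

# Map: emit intermediate (key, value) pairs for each input record.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# Reduce: merge all intermediate values associated with one key.
def reduce_fn(word, counts):
    return (word, sum(counts))

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase over every input record.
    intermediate = [kv for doc in inputs for kv in map_fn(doc)]
    # Shuffle/sort phase: group intermediate pairs by key.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one call per distinct key.
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(map_reduce(["a b a", "b a"], map_fn, reduce_fn))  # [('a', 3), ('b', 2)]
```

In a distributed run, the framework would also partition the input, schedule map and reduce tasks on different machines, and re-execute failed tasks, exactly the responsibilities listed above.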
1. The document describes Glacier, a component library and compiler for implementing continuous queries on FPGAs.
2. Glacier includes common streaming operators as well as specialized building blocks for the FPGA context. It can implement a variety of streaming queries by composing these components.
3. The paper evaluates the performance of queries implemented on an FPGA using Glacier, finding they can process over 1 million tuples per second directly from the network interface.
Spark on Hadoop is highly scalable. Cloud computing is highly scalable. R, the extensible open-source data science software, is not, at least not on its own. But what happens when we combine Spark on Hadoop, cloud computing, and Microsoft R Server into a scalable data science platform? Imagine being able to explore, transform, and model data of any size from your favorite R environment. Now imagine deploying the resulting models, with just a few clicks, as a scalable, cloud-based web service API. In this session, Sascha Dittmann shows how you can use your R code, thousands of open-source R packages, and distributed implementations of the most popular machine learning algorithms to do exactly that. He demonstrates how to create an HDInsight Spark cluster including a Microsoft R Server cluster, and how to deploy the resulting model in SQL Server or as a Swagger-based API for application developers.
LDBC 8th TUC Meeting: Introduction and status update – LDBC council
The document summarizes an 8th Technical User Community meeting on the LDBC benchmark. It discusses:
1) The LDBC Organization which sponsors benchmarks and task forces to develop them.
2) The key elements of a benchmark - data/schema, workloads, performance metrics, and execution rules.
3) The Semantic Publishing Benchmark and Social Network Benchmark being developed to evaluate graph and RDF databases on industry workloads.
4) The workloads include interactive, business intelligence, and graph analytics to test different database capabilities.
5) Various database systems that can be evaluated using the benchmarks.
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget – Cloudera, Inc.
This document discusses YapMap, a visual search platform built on Hadoop and HBase. It summarizes how YapMap interfaces with HBase data, uses HBase as a data processing pipeline with checkpoints, and had to adjust schemas and migrate data as the system evolved. It also covers how YapMap constructs search indexes in shards based on HBase regions and stored indexes on HDFS. The document concludes with some lessons learned around optimizing HBase operations.
Revolution Analytics provides an advanced analytics platform called Revolution R Enterprise that allows users to leverage the open source R language for big data analytics. The presentation discusses how R can be used to extract value from large, complex datasets through data exploration, visualization, and predictive modeling. It also outlines best practices for implementing an advanced analytics stack and how Revolution R Enterprise optimizes R for distributed computing across multiple data platforms like Hadoop and databases. The key benefits of the Revolution R platform are that it makes R scalable for big data, provides an enterprise-ready environment, and allows organizations to leverage R's flexibility for analytics innovation.
This document provides an overview of the Apache Spark framework. It discusses how Spark allows distributed processing of large datasets across computer clusters using simple programming models. It also describes how Spark can scale from single servers to thousands of machines. Spark is designed to provide high availability by detecting and handling failures at the application layer. The document also summarizes Resilient Distributed Datasets (RDDs), which are Spark's fundamental data abstraction, and transformations and actions that can be performed on RDDs.
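A toy sketch of the RDD idea in plain Python (illustrative only; `ToyRDD` is a made-up class, not Spark's API): transformations are recorded lazily, and only an action such as `collect()` triggers evaluation of the whole pipeline.

```python
# Minimal sketch of Spark's RDD model: transformations build a lazy
# pipeline; an action forces evaluation. In real Spark the data would be
# partitioned across a cluster and recomputed from lineage on failure.

class ToyRDD:
    def __init__(self, data):
        self._data = data       # source data (one partition here)
        self._pipeline = []     # deferred transformations

    def map(self, fn):          # transformation: lazy, returns a new RDD
        rdd = ToyRDD(self._data)
        rdd._pipeline = self._pipeline + [("map", fn)]
        return rdd

    def filter(self, pred):     # transformation: lazy
        rdd = ToyRDD(self._data)
        rdd._pipeline = self._pipeline + [("filter", pred)]
        return rdd

    def collect(self):          # action: evaluates the recorded pipeline
        out = list(self._data)
        for kind, fn in self._pipeline:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

squares = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x > 1)
print(squares.collect())  # [4, 9, 16]
```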
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum... – Spark Summit
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
The Histogrammar package, a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics, and plotting in Scala, is introduced to enable interactive data analysis in the Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
Scaling Application on High Performance Computing Clusters and Analysis of th... – Rusif Eyvazli
The document discusses techniques for scaling applications across computing nodes in high performance computing (HPC) clusters. It analyzes the performance of different computing nodes on various applications like BLASTX, HPL, and JAGS. Array job facilities are used to parallelize applications by dividing iterations into independent tasks assigned across nodes. Python programs are created to analyze system performance based on log files and produce plots showing differences in node performance on different applications. The plots help with preventative maintenance and capacity management of the HPC system.
The Microsoft Biology Foundation (MBF) is an open-source library of bioinformatics algorithms and services built on .NET. MBF provides modular and reusable code for tasks like genomics, sequencing, and analysis. It leverages existing Microsoft technologies and allows distribution of computations across platforms from local to cloud. The first version was released in June 2010. MBF is developed openly on CodePlex and aims to benefit both commercial and non-commercial users.
The document discusses using cloud-scale computing for genomic analysis. It provides timing and cost estimates for running a genomic analysis pipeline called Myrna on Amazon EC2 using different numbers of compute nodes. The analysis of 1.1 billion reads would take 4 hours and 20 minutes on 1 master and 10 worker nodes at a cost of $44, or 1 hour and 38 minutes on 1 master and 40 workers at a cost of $66. It also discusses strategies for running genomic tools on cloud infrastructure or single computers.
This document summarizes a study on the persistence and availability of bioinformatics web services. The study analyzed over 900 web services listed in the Nucleic Acids Research journal between 2003-2009. It found that 17% of the original web addresses were no longer reachable. More recent services had higher quality standards but 24% of authors said their services would not be maintained long-term. The document provides recommendations for web service authors to improve long-term availability, such as using persistent URLs, releasing source code, and planning for the future maintenance of the service.
The document describes MOLGENIS, an open-source software system that allows users to define data models and generate full-featured web applications and databases from those models. Key features include a graphical user interface, database integration, support for common data formats, and the ability to rapidly develop applications by editing simple domain-specific models. The system has been applied to build several genomic and biomedical databases.
The document provides an update on the EMBOSS European Molecular Biology Open Software Suite project. It discusses new features added in the latest release including support for next-generation sequencing formats, additional data sources, and integration of ontologies. The EMBOSS team continues to work on improving interfaces and providing support to other projects.
The document discusses Evoker, a visualization tool for genotype intensity data from genome-wide association studies (GWAS). It provides background on GWAS and highlights the importance of rigorous quality control procedures for GWAS to eliminate sources of false positives like poor quality DNA, population structure, and genotyping artifacts. The document then discusses Evoker's implementation and software features for visualizing quality control metrics and genotype intensity data to assist with quality control checks.
This document contains 6 repeated links to the website http://www.g-language.org/PathwayProjector. The links all point to a pathway projection tool on the G-language website that can be used to visualize biological pathways.
This document discusses establishing a national repository for microarray gene expression data using MOLGENIS and MAGE-TAB. The objectives are to populate the repository with well-annotated microarray experiments from over 6,500 biobank samples, share the software as a microarray database solution for all biobanks, and combine gene expression data with GWAS studies to create novel eQTL datasets for complex diseases. The repository was created using MOLGENIS and populated with over 12,000 curated experiments from GEO and ArrayExpress for testing purposes. Future work includes populating with local data, integrating analysis tools, and enabling data and tool sharing between local installations while maintaining privacy.
This document discusses using Python to access libraries implemented in R through Bioconductor. It provides background on both Bioconductor and popular Python libraries for bioinformatics. As an example, it shows how to run an edgeR analysis from Python to identify differentially expressed genes from microarray data, accessing the R code and edgeR package from Python. This allows leveraging powerful statistical methods from R while taking advantage of Python's scripting abilities.
The document discusses the history and operations of the Apache Software Foundation. It began in 1995 with 8 developers working on the Apache HTTP Server. It is now a large organization with over 2,500 committers across 70+ projects. The ASF operates under an open governance model called "The Apache Way" which emphasizes merit-based consensus decision making. It also discusses how the ASF scales its operations through project oversight, incubating new projects, and community education programs like mentoring.
This document describes IPRStats, a visualization tool for InterProScan results. IPRStats allows users to view summaries and charts of protein domain annotations from InterProScan. It imports InterProScan XML files, generates statistics and taxonomy summaries, and exports results as HTML or Excel files. IPRStats uses a wxPython GUI, SQLite or PyTables for data storage, and generates pie charts, bar graphs and other visualizations of the annotation data.
The document summarizes updates to BioPerl, an open source Perl package for biological research. It discusses addressing new bioinformatics problems through collaborations, using modern Perl features to lower the barrier for new users, and potential approaches for BioPerl 2.0, including using Moose and preparing for Perl 6. The core of BioPerl provides classes for biological sequences, sequence I/O and features.
This document discusses the challenges of open source biological software projects including community engagement, integration with other tools, and increasing accessibility (democratization). It provides examples of how the Biopython project addresses these challenges such as through the Google Summer of Code program, improving documentation, and leveraging cloud computing resources to more easily distribute and access data and tools.
BioRuby is a bioinformatics library for the Ruby programming language. It provides object-oriented tools for tasks like sequence analysis, format conversion, running bioinformatics tools, and working with biological data. The latest version added features like improved support for phylogenetic XML (PhyloXML), next-generation sequencing FASTQ format reading/writing, and a REST API wrapper for the NCBI database. BioRuby development follows agile principles and its large developer community contributes new code frequently on GitHub. The project aims to improve integration with R and data visualization while maintaining a stable core.
This document discusses BioPython modules for handling RNA sequences containing modified nucleosides. There are 115 known post-transcriptionally modified nucleosides in RNA and several nomenclature schemes exist. The solution involves cloning a branch of the BioPython repository containing an RNA alphabet with modified nucleotides and using it to represent sequences containing modifications like 2-O-methyloadenosine. Example applications presented are ModeRNA for RNA structure modeling and CompaRNA for benchmarking RNA structure prediction methods, both of which use open source tools including BioPython.
Bio.Phylo is a new phylogenetics library in Biopython for exploring, modifying, annotating, reading, writing, and visualizing trees and for connecting computational pipelines. It supports common file formats like Newick and Nexus and can read/write the XML-based PhyloXML format which allows for annotations. The demo shows how to read a Newick tree, inspect it, draw it, promote it to PhyloXML to add branch colors, and write it out.
Archaeopteryx is a tool for visualizing and analyzing evolutionary trees. It is based on ATV and built using the open source Forester framework. Archaeopteryx allows users to visualize large trees with over 20,000 nodes. It supports various file formats and can access online databases. Key features include zooming, duplication inference tools, and editing trees. An example biological study analyzed functional profiles of genomes using Forester, phyloXML, and Archaeopteryx.
The document discusses the transition from BioMoby to SADI as a framework for semantic web services. It provides statistics on BioMoby usage and describes demonstrations of complex queries being answered through SADI and SHARE without a centralized database. The demonstrations include finding pathways for a protein and lab results for transplant patients. It advocates for SADI to support the scientific method and personal hypotheses through distributed ontologies rather than centralized ones.
ONTO-Toolkit is a collection of tools within the Galaxy framework that enables bio-ontology engineering using OBO file format ontologies. It includes wrappers for functions from the ONTO-PERL API to retrieve ontology terms and substructures. Two use cases are demonstrated: 1) identifying common ancestor terms between two molecular functions, and 2) finding the intersection between sub-ontologies for two biological processes to investigate overlap. The toolkit provides rich ontology-driven solutions for biologists within Galaxy.
This document provides an overview of the Hadoop/MapReduce/HBase framework and its applications in bioinformatics. It discusses Hadoop and its components, how MapReduce programs work, HBase which enables random access to Hadoop data, related projects like Pig and Hive, and examples of applications in bioinformatics and benchmarking of these systems.
1. The Genome Analysis Toolkit
A MapReduce framework for analyzing next-generation DNA sequencing data
Matt Hanna and Mark DePristo
Genome Sequencing and Analysis Group
Medical and Population Genetics Program
Broad Institute of Harvard and MIT
2. The Genome Analysis Toolkit
Agenda
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
3. GATK: Overview and Concepts
Motivation
[Figure: coverage in the xMHC region of JPT individuals]
• Dataset size greatly increases analysis complexity.
• Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.
4. GATK: Overview
Simplifying the process of writing analysis tools for resequencing data
• The framework is designed to support most common paradigms of analysis algorithms
  – Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
• General-purpose
  – Optimized for ease of use and completeness of functionality within scope
• Efficient
  – Engineering investment on performance of critical data structures and manipulation routines
• Convenient
  – Structured plug-in model makes developing in Java against the framework relatively pain-free
5. GATK: Overview
The MapReduce design philosophy
[Diagram]
• Data elements: a, b, c, d, e
• X = f(x) is applied to each element, producing A, B, C, D, E; the operations are independent of each other.
• r(x, y, …, z) combines the results; R = r(A, r(B, …, E)) depends on all sites.
The result is:
• Map: function f applied to each element of the list
• Reduce: function r recursively reduced over each f(…)
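The map/reduce scheme on this slide can be sketched in a few lines of Python (illustrative only; `f` and `r` are stand-in functions, not part of the GATK, and `functools.reduce` folds from the left rather than with the slide's exact nesting):

```python
from functools import reduce

# Map: a per-element operation, independent across elements.
f = lambda x: x * 2
# Reduce: a combining operation whose result depends on all elements.
r = lambda acc, y: acc + y

data = [1, 2, 3, 4, 5]            # a, b, c, d, e
mapped = [f(x) for x in data]     # A, B, C, D, E
R = reduce(r, mapped)             # R = r(r(r(r(A, B), C), D), E)
print(R)  # 30
```

Because each f(x) is independent, the map phase can run in parallel over data chunks, which is exactly what the GATK exploits.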
6. GATK: Overview
Rapid development of efficient and robust analysis tools
[Diagram] The Genome Analysis Toolkit (GATK) provides the boilerplate infrastructure code required to perform any NGS analysis: the traversal engine is provided by the framework, while the analysis tool is implemented by the user.
7. GATK: Workflow
Introduction
• GATK Overview and Concepts
• GATK Workflow
  – An example of one of the GATK's most common workflows
  – Data access pattern: by locus
  – Inputs: reads, reference, dbSNP
• Example: A Simple Bayesian Genotyper
8. GATK: Workflow
The sharding system: dividing data into processor-sized pieces
Inputs: reads, reference, dbSNP
• Divides data into small chunks that can be processed independently
• Handles extraction of subsets of data
• Groups small intervals together to avoid repetitive decompression
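The chunking idea can be pictured as cutting a genomic interval into fixed-size pieces. The sketch below uses a hypothetical `Interval` record and `shard` method, not the GATK's actual sharding classes:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardSketch {
    // A half-open genomic interval [start, stop) on one contig.
    record Interval(String contig, int start, int stop) {}

    // Cut an interval into shards of at most shardSize bases each;
    // every shard can then be processed independently.
    static List<Interval> shard(Interval interval, int shardSize) {
        List<Interval> shards = new ArrayList<>();
        for (int pos = interval.start(); pos < interval.stop(); pos += shardSize) {
            shards.add(new Interval(interval.contig(), pos,
                                    Math.min(pos + shardSize, interval.stop())));
        }
        return shards;
    }

    public static void main(String[] args) {
        // 1000 bases in 300-base shards: 4 shards, the last one shorter.
        System.out.println(shard(new Interval("chr6", 0, 1000), 300).size()); // prints 4
    }
}
```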
9. GATK: Workflow
Traversal engines: preparing data for processing
Builds data structures easily consumed by the analysis tool.
10. GATK: Workflow
Interaction between sharding system and traversal engines
• Datasets are split into shards, which can be processed sequentially or in parallel.
• When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
• When processing in parallel, the result of each shard is computed independently and then "tree-reduced" together.
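For an associative reduce operation, combining shard results pairwise ("tree-reducing") gives the same answer as the sequential bootstrap. A small sketch with invented names, using integers to stand in for per-shard reduce values:

```java
import java.util.List;

public class TreeReduceSketch {
    // Sequential mode: each shard's reduce value bootstraps the next shard.
    static int sequentialReduce(List<Integer> shardResults) {
        int sum = 0;
        for (int r : shardResults) sum = sum + r;
        return sum;
    }

    // Parallel mode: shard results are computed independently and combined
    // pairwise; for an associative operation (like +) this matches the
    // sequential answer. Assumes a non-empty list.
    static int treeReduce(List<Integer> shardResults) {
        if (shardResults.size() == 1) return shardResults.get(0);
        int mid = shardResults.size() / 2;
        return treeReduce(shardResults.subList(0, mid))
             + treeReduce(shardResults.subList(mid, shardResults.size()));
    }

    public static void main(String[] args) {
        List<Integer> results = List.of(3, 1, 4, 1, 5);
        System.out.println(sequentialReduce(results)); // prints 14
        System.out.println(treeReduce(results));       // prints 14
    }
}
```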
11. GATK: Workflow
Walkers: Analyses written by end-users
[Figure: pileup of reads and the reference base at a single locus, with reference metadata tracks (dbSNP, exons), feeding into the analysis tool]
• Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.
• Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.
• The GATK provides tools to filter the pileup automatically or on demand.
12. GATK: Workflow
Other data access patterns
Traversal Type | Description
Reads | Call map per read, along with the reference and reference-ordered metadata spanning that read.
Duplicates | Call map for each set of duplicate reads.
Read pair (naïve) | Call map for each read and its mate (naïve; requires the input BAM to be sorted in query-name order).

Straightforward (but not necessarily easy) to add any new access pattern involving streaming data.
13. GATK: Additional features
Additional inputs and outputs
Reference metadata
• Support for additional input data that is sorted in reference order can easily be added to the GATK.
• Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
• New file formats are indexed automatically.
• New data types are autodiscovered via a classpath search.
• Joint initiative with IGV.
Additional I/O
• Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
• Command-line argument types can become very sophisticated.
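The @Argument mechanism can be pictured as annotation-driven reflection over walker fields. The sketch below is a simplified stand-in, not the GATK's actual argument engine; the nested `Argument` annotation and `parse` helper here are hypothetical:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;

public class ArgumentSketch {
    // Hypothetical stand-in for the GATK's @Argument annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Argument { String shortName(); }

    static class Walker {
        @Argument(shortName = "LOD")
        double lodScore = 3.0; // default used when the argument is absent
    }

    // Scan the walker's fields for @Argument and fill the matching one
    // from a command-line name/value pair (doubles only, for brevity).
    static void parse(Object walker, String name, String value) {
        for (Field field : walker.getClass().getDeclaredFields()) {
            Argument arg = field.getAnnotation(Argument.class);
            if (arg != null && arg.shortName().equals(name)) {
                try {
                    field.setAccessible(true);
                    field.setDouble(walker, Double.parseDouble(value));
                } catch (IllegalAccessException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Walker walker = new Walker();
        parse(walker, "LOD", "5.0");
        System.out.println(walker.lodScore); // prints 5.0
    }
}
```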
14. Walkers: Example
A simple Bayesian genotyper
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
– A functional genotyper in under 150 lines of code
– A minimal example: calls are much lower in quality than the UnifiedGenotyper
15. Walkers: Example
A simple Bayesian genotyper: the model
Bayesian model, with the independent base model for the data likelihood:
L(G | D) = P(G) · P(D | G),  P(D | G) = ∏_{b ∈ good bases} P(b | G)
Here P(G) is the prior for the genotype and P(D | G) is the likelihood of the data given the genotype.
• Likelihood of data computed using pileup of bases and associated quality scores at given locus
• Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
• L(G|D) computed for all 10 genotypes
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
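Under the independent base model, the log10 likelihood of a diploid genotype given a pileup can be sketched as follows. This is a stand-alone simplification with invented names (and without the prior or the "good bases" filters), not the UnifiedGenotyper:

```java
public class GenotypeLikelihoodSketch {
    // log10 P(D | G) = sum over pileup bases of log10 P(b | G). For a diploid
    // genotype, each allele contributes (1 - eps) if it matches the observed
    // base and eps/3 otherwise, averaged over the two alleles.
    static double log10Likelihood(String genotype, byte[] bases, byte[] quals) {
        double log10 = 0.0;
        for (int i = 0; i < bases.length; i++) {
            double epsilon = Math.pow(10, quals[i] / -10.0); // de-Phred error prob
            double p = 0.0;
            for (char allele : genotype.toCharArray())
                p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
            log10 += Math.log10(p / genotype.length());
        }
        return log10;
    }

    public static void main(String[] args) {
        byte[] bases = {'A', 'A', 'A', 'C'}; // pileup: three A reads, one C read
        byte[] quals = {30, 30, 30, 30};     // all Q30 (error prob 0.001)
        for (String g : new String[] {"AA", "AC", "CC"})
            System.out.printf("%s: %.3f%n", g, log10Likelihood(g, bases, quals));
        // AC scores highest: the het genotype explains the mixed pileup best.
    }
}
```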
16. Walkers: Example
A simple Bayesian genotyper
• Walker specifies the data access pattern and declares command-line arguments.
• Inheritance defines traversal type.
• Annotation defines command-line argument.
public class GATKPaperGenotyper extends LocusWalker<Integer, Long> {
    @Argument(fullName = "log_odds_score",
              shortName = "LOD",
              doc = "The LOD threshold",
              required = false)
    private double LODScore = 3.0;
17. Walkers: Example
A simple Bayesian genotyper
• Walker prepares the input dataset.
• ReadBackedPileup utility can be used to filter the pileup on demand.
public Integer map(RefMetaDataTracker tracker,
                   ReferenceContext ref,
                   AlignmentContext context) {
    double[] likelihoods =
        DiploidGenotypePriors.getReferencePolarizedPrior(
            ref.getBase(),
            DiploidGenotypePriors.HUMAN_HETEROZYGOSITY,
            0.01);
    // get the bases and qualities from the pileup
    ReadBackedPileup pileup = context.getBasePileup()
        .getPileupWithoutMappingQualityZeroReads();
    byte[] bases = pileup.getBases();
    byte[] quals = pileup.getQuals();
    …
18. Walkers: Example
A simple Bayesian genotyper
• Calculate the likelihood for each possible genotype.
• Determine the best of the calculated genotypes.
for (GENOTYPE genotype : GENOTYPE.values()) {
    for (int index = 0; index < bases.length; index++) {
        // our epsilon is the de-Phred scored base quality
        double epsilon = Math.pow(10, quals[index] / -10.0);
        byte pileupBase = bases[index];
        double p = 0;
        for (char r : genotype.toString().toCharArray())
            p += r == pileupBase ? 1 - epsilon : epsilon / 3;
        likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
    }
}
Integer[] sortedList = MathUtils.sortPermutation(likelihoods);
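The de-Phred conversion used in the loop above maps a quality score Q to the error probability 10^(-Q/10); a tiny stand-alone illustration:

```java
public class PhredSketch {
    // Convert a Phred-scaled quality score to the probability that
    // the base call is wrong: eps = 10^(-Q/10).
    static double phredToErrorProb(int q) {
        return Math.pow(10, q / -10.0);
    }

    public static void main(String[] args) {
        System.out.println(phredToErrorProb(10)); // ~0.1: a 1-in-10 error rate
        System.out.println(phredToErrorProb(20)); // ~0.01: 1-in-100
        System.out.println(phredToErrorProb(30)); // ~0.001: 1-in-1000
    }
}
```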
19. Walkers: Example
A simple Bayesian genotyper
• Conditionally output the results.
• Use reduce to calculate number of genotypes called.
• Writing to provided output stream is guaranteed to be
thread-safe.
    …
    if (lod > LODScore)
        out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(),
                   selectedGenotype, lod, (char) ref.getBase());
    return 1;
} // end of map() function

public Long reduce(Integer value, Long sum) {
    return value + sum;
}

public void onTraversalDone(Long result) {
    out.printf("Simple Genotyper genotyped %d loci.", result);
}
20. Walkers: Threading performance
A simple Bayesian genotyper
[Figure: GATK performance improves nearly linearly as processors are added]
21. Genome Analysis Toolkit
1000 Genomes Project
Pipeline: Initial alignment → MSA realignment → Q-score recalibration → Base error cluster modeling → Genotyping → SNP filtering
• Supports any BAM-compatible aligner
• All of these tools have been developed in the GATK
• They are memory and CPU efficient, cluster friendly, and easily parallelized
• They are now publicly available and are being used at many sites around the world
More info: http://www.broadinstitute.org/gsa/wiki/
Support: http://www.getsatisfaction.com/gsa/
22. Acknowledgments
Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko
Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker
1000 Genomes Project (in general but notably): Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li
Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll
Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz
Genome Sequencing Platform (in general but notably): Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom
Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir
MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly