The document describes Ancestry's journey moving from a single machine DNA matching process to a scalable Hadoop and HBase solution. It details how they first parallelized the ethnicity prediction step using Hadoop as a job scheduler. This freed resources for the more challenging matching algorithm. It then explains how they developed "Jermline", storing matching data in HBase and using MapReduce to efficiently find new matches for incremental DNA samples. The new distributed solution allowed matching to scale to millions of DNA samples.
Utah Big Mountain Conference: AncestryDNA, HBase, Hadoop (9-7-2013), by William Yetman
This presentation was given at Adobe with support from Utah Geek Events. It is the story of creating a business in an Agile way and digs into the technology that we used to support that business. It was a large open room with no microphone. I had a great audience with experienced Big Data developers and people new to the technology.
This document discusses AncestryDNA's use of Hadoop to scale their DNA analysis pipeline as their database and processing needs grew rapidly over time. It describes how they initially ran the entire pipeline on a single machine, and then incrementally moved each step of the pipeline to run on Hadoop clusters, including running Admixture ethnicity processing with MapReduce, replacing GERMLINE matching with a new Jermline algorithm implemented in MapReduce, and moving phasing from Beagle to a new Underdog implementation in MapReduce. Each change significantly improved performance and allowed them to keep up with the growth of their DNA database and user base.
Cassandra data structures and algorithms, by Duyhai Doan
This document discusses Cassandra data structures and algorithms. It begins with an introduction and agenda, then covers Cassandra's use of CRDTs, bloom filters, and Merkle trees for its data model. It explains how Cassandra columns can be modeled as a CRDT join semilattice and proves their eventual convergence. The document also covers Cassandra's write path, read path optimized with bloom filters, and the math behind bloom filter probabilities.
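The bloom-filter math mentioned above follows a standard formula: with m bits, n inserted items, and k hash functions, the false-positive probability is approximately (1 − e^(−kn/m))^k, minimized at k ≈ (m/n)·ln 2. A minimal sketch in Python (illustrative only; not Cassandra's implementation):

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive probability of a Bloom filter
    with m bits, n inserted items, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round((m / n) * math.log(2)))

# Example: 10,000 items in a 96,000-bit filter.
m, n = 96_000, 10_000
k = optimal_k(m, n)                  # 7 hash functions
p = false_positive_rate(m, n, k)     # roughly a 1% false-positive rate
```

This is why a read path backed by bloom filters can skip most SSTables that cannot contain a key, at the cost of a small, tunable rate of unnecessary disk reads.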
Know your platform: 7 things every Scala developer should know about the JVM, by Pawel Szulc
The document discusses the importance for Scala developers to understand the basics of the Java Virtual Machine (JVM) platform that Scala code runs on. It provides examples of Java bytecode produced from simple Scala code snippets to demonstrate how code is executed by the JVM. Key points made include that the JVM is a stack-based virtual machine that compiles source code to bytecode instructions, and that understanding the level below the code helps developers write more efficient, robust and performant code.
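The same "look at the level below your code" exercise can be tried in Python, whose virtual machine is also stack based: the stdlib `dis` module shows the bytecode the interpreter executes, analogous to inspecting JVM bytecode with `javap`. (This is an analogy, not the talk's Scala/JVM examples.)

```python
import dis

# CPython, like the JVM, is a stack-based virtual machine: source code
# compiles to bytecode that pushes operands and applies opcodes to them.
def add(a, b):
    return a + b

# Arguments are pushed onto the operand stack before the add opcode runs.
instructions = [ins.opname for ins in dis.get_instructions(add)]
```

Printing `instructions` shows load instructions for the two arguments, an addition opcode, and a return: the stack discipline that both VMs share.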
A brief introduction to the Bayesian analysis program PyRate for paleobiology colleagues. Given at a lab meeting, so the format is casual and a good chunk of prior knowledge is assumed.
Lucene 4.0 is on its way to delivering a tremendous number of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues (a.k.a. Column Stride Fields) is one of the "next generation" features.
The document discusses MacRuby, a Ruby implementation for Mac OS X. It describes creating objects like jobs, companies, and products using Ruby. It also covers Cocoa integration using HotCocoa, and default mappings between Ruby and Cocoa classes and methods. Finally, it provides information about the MacRuby project and the author's background and contact details.
Published by the Finnish Information Processing Association, the yearly IT Barometer charts the importance of IT to Finnish organizations. In the IT Barometer, we study Finnish IT and business management’s views on how IT is utilized in their organizations, how IT produces value for their business, and what factors and competences are seen to contribute to future success.
This is the fourth IT Barometer and during these four years, we have seen dramatic changes in IT and the role of IT in Finnish companies. During these four years, we have gone through one downturn and we are now potentially entering another. During 2009, 2010 and 2011, we have monitored the effect that the general economic trend has on IT and perceptions on IT. During these four years, we also have seen the rise of consumerization, including social media services and a new class of smart phones and tablet computers. IT has also undergone a process of consumerization – new services and devices now first come to consumer markets and move from there to corporate use – oftentimes after years of delay.
This document contains 16 questions about the main components and characteristics of computers. It describes internal components such as the processor, RAM and ROM, expansion cards, and storage devices. It also explains concepts such as how a computer works, the types of peripherals, units of memory measurement, and the types of monitors, printers, personal computers, and mice. Finally, it refers to the first computers and their characteristics.
Discrimination against transsexual people can cause depression. The document analyzes the effects of discrimination against transsexual people in Iguala, Guerrero, Mexico in 2012. Discrimination is defined as giving someone unfavorable treatment based on criteria such as gender identity. Transsexual people suffer a great deal of social and family rejection, which can lead to the feelings of sadness and isolation known as depression. The objective is to better understand the effects of discrimination in order to promote respect for all gender preferences.
Solano Ortega Luis Enrique, Unit 2, AA1, The Communicative Phenomenon, by Chitomix7812
This document defines the fundamental elements of the communicative phenomenon: the source or message, the sender, the medium or channel, and the receiver. It explains that communication is the process by which information is transmitted from a sender to a receiver through a channel, and that the key elements are the message encoded by the sender, the medium through which it travels, and the receiver who decodes it.
Various Views of Golden Dawn in the Centre of Athens, by dorethvanmanen
This document discusses various views of the far-right Greek political party Golden Dawn. It begins with an introduction providing context on Golden Dawn's rise in popularity since 2008 and gaining 7% of the parliamentary vote in 2012 elections.
The document is divided into two main sections. Section one discusses different views of Golden Dawn held by Greek citizens, immigrants, other countries, and various authorities in Greece. Section two analyzes how Golden Dawn sees itself versus how others within Greece view the party, including citizens, immigrants, former immigrants, a police officer, and a lawyer.
The conclusion reflects on the challenges of obtaining direct input from Golden Dawn, and notes that the views presented are based on interviews conducted in Athens over a one-week period in January.
Communication involves the transmission of information from a sender to a receiver through a channel or medium. The sender encodes a message into signs that are sent through a channel and received by the receiver, who decodes them to extract the information.
The document discusses a study that compares user preferences and perceptions of the Yahoo and MSN web portals. It provides background on the history and changes made to each portal over the years. The study aims to investigate how attributes like information architecture, aesthetics, and navigation impact user preference and evaluation. Fifteen participants performed tasks to assess their pre-use perceptions and test usability while using the portals. A questionnaire addressed how attributes influenced perceptions. The study seeks to understand why changes were made and how services and usability have evolved to inform portal design.
HBaseCon 2013: Apache HBase, Apache Hadoop, DNA and YOU! Cloudera, Inc.
Ancestry.com is the world's largest online family history resource with over 30,000 historical collections, 11 billion records, and 4 petabytes of data. It allows users to spit in a tube, pay $99, and learn about their family origins and find long-lost relatives through Ancestry DNA, which has over 120,000 samples in its database. However, the original DNA matching algorithm called GERMLINE did not scale well to large data sets. Ancestry.com developed its own algorithm called Jermline that uses Hadoop and HBase to match DNA samples in parallel across a cluster of servers, providing a 1700% performance improvement over GERMLINE.
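The key to making matching incremental is an inverted index from hashed genome segments ("words") to the samples that carry them, so a new sample is compared only against samples sharing at least one word instead of against everyone. A toy sketch of that idea (a plain dict stands in for the HBase table; the hashing and match-extension details of the real algorithm are omitted):

```python
from collections import defaultdict

# (window, word) -> ids of samples carrying that word in that window.
index = defaultdict(set)

def add_sample(sample_id, words):
    """Register a new sample; return the candidate matches seen so far."""
    candidates = set()
    for window, word in enumerate(words):
        candidates |= index[(window, word)]   # samples sharing this word
        index[(window, word)].add(sample_id)  # make this sample findable
    return candidates

# Incremental runs touch only the rows for the new sample's words.
add_sample("s1", ["AAB", "CCD", "EEF"])
add_sample("s2", ["AAB", "XXX", "EEF"])
matches = add_sample("s3", ["ZZZ", "CCD", "EEF"])  # shares words with s1 and s2
```

Because each lookup is a row read keyed by (window, word), the work per new sample stays roughly constant as the database grows, rather than scaling with the total number of samples.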
This week, the MBTS team welcomed a new intern, Sean Pinto. They purchased their own 3D printer to print faster and with multiple colors. Unfortunately, the tardigrades in their sample disappeared again without a trace. The team will examine archived petri dishes under a plate scanner. They are starting a new experiment to test how moss and water type affect tardigrade lifespan and reproduction. Progress was made on the plate scanner, including revising the 3D model and adding code. The team created a website to publicize their project. Their budget is lower after the 3D printer purchase. Team members rated the week between 8 to 9 out of 10, feeling they had a slow start but were happy to
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB, by Cody Ray
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Many startups collect and display stats and other time-series data for their users. A supposedly-simple NoSQL option such as MongoDB is often chosen to get started... which soon becomes 50 distributed replica sets as volume increases. This talk describes how we designed a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, though, it has some limitations: millisecond time granularity and lack of atomic upsert operations which make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. In this deep-dive talk, we explore how we've used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.
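One common workaround for the missing atomic upserts, hinted at above, is to pre-aggregate counts in the stream processor and flush a single summed data point per metric per interval, so the datastore never has to do read-modify-write. A minimal sketch with hypothetical names (the talk's actual Storm topology is more involved):

```python
import time
from collections import Counter

class CounterBuffer:
    """Buffer increments in memory; emit one summed point per flush."""

    def __init__(self, flush_interval_s=10):
        self.flush_interval_s = flush_interval_s
        self.counts = Counter()

    def increment(self, metric, n=1):
        self.counts[metric] += n

    def flush(self):
        """Return (metric, timestamp_ms, count) rows to write, then reset."""
        now_ms = int(time.time() * 1000)
        rows = [(m, now_ms, c) for m, c in self.counts.items()]
        self.counts.clear()
        return rows

buf = CounterBuffer()
for _ in range(3):
    buf.increment("api.requests")
rows = buf.flush()   # one row with count 3, not three racy writes of 1
```

The trade-off is bounded data loss on crash (at most one flush interval of counts), which is usually acceptable for stats pipelines.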
Vsevolod Polyakov (DevOps Team Lead at Grammarly), by Provectus
This document discusses Graphite metrics storage and summarizes performance testing of various Graphite components. It finds that go-carbon with carbon-c-relay provides the best performance at over 1 million requests per second. Various tuning options are discussed, including cache sizing, write strategies, and OS configuration to optimize performance. Alternative time series databases like Influx and OpenTSDB are also benchmarked.
This document discusses Graphite and options for optimizing its performance for high volumes of metrics data. It summarizes the default Graphite architecture using Carbon and Whisper and different approaches for scaling it up including using go-carbon, carbon-c-relay, and evaluating alternative time series databases like Influx and OpenTSDB. Various techniques for optimizing whisper and cache configurations, I/O performance, and system parameters are also explored. Overall the best performing combination found was go-carbon with carbon-c-relay to handle over 1 million requests per second.
This talk shows what is possible with the huge datasets that are becoming more prevalent in the era of big data. I will demonstrate this, together with 3d visualization, in the Jupyter notebook, by now the almost-standard environment of (data) scientists.
With large astronomical catalogues containing more than a billion stars becoming common, we are preparing methods to visualize and explore these large datasets. Data volumes of this size require different visualization techniques, since scatter plots become too slow and meaningless due to overplotting. We solve the performance and visualization issue using binned statistics, e.g. histograms, density maps, and volume rendering in 3d. The calculation of statistics on N-dimensional grids is handled by a Python library called vaex, which I will introduce. It can process at least a billion samples per second, to produce for instance the mean of a quantity on a regular grid. These statistics can be calculated for any mathematical expression on the data (numpy style), over the full dataset or over subsets specified by queries/selections.
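The binned-statistics idea described above can be sketched in a few lines of plain Python (vaex does the same thing vectorized over billions of rows): instead of scatter-plotting every point, compute the mean of a quantity on a regular grid.

```python
def binned_mean(xs, values, x_min, x_max, n_bins):
    """Mean of `values` per bin on a regular 1-d grid over [x_min, x_max)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = (x_max - x_min) / n_bins
    for x, v in zip(xs, values):
        if x_min <= x < x_max:
            i = int((x - x_min) / width)
            sums[i] += v
            counts[i] += 1
    # Empty bins yield None rather than a divide-by-zero.
    return [s / c if c else None for s, c in zip(sums, counts)]

xs = [0.1, 0.2, 0.6, 0.7, 0.9]
vs = [1.0, 3.0, 10.0, 20.0, 5.0]
grid = binned_mean(xs, vs, 0.0, 1.0, 2)   # bins [0.0, 0.5) and [0.5, 1.0)
```

The output size depends only on the grid resolution, not the number of input rows, which is what makes the approach scale where scatter plots cannot.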
However, to visualize higher dimensional data in the notebook interactively, no proper solution existed. This led to the development of ipyvolume, which can render 3d volumes and up to a million glyphs (scatter plots and quiver) in the Jupyter notebook as a widget. With the browser as a platform, and the release of ipywidgets 6.0, these 3d plots can also be embedded in static html files and render on nbviewer. This allows for sharing with colleagues, rendering on your tablet (paperless office), outreach, press release material, etc. Full screen stereo rendering allows for a virtual reality experience using your phone and Google Cardboard, a minor investment compared to other VR head-mounted displays. Overlaying 3d quiver plots on a 3d volume rendering allows exploring a 6d (or higher) space.
Vaex and ipyvolume can be used together to explore and visualize any large tabular data set, or separately to calculate statistics, and render 3d plots in the notebook and outside.
CTF3, Stripe's third Capture-the-Flag, focused on distributed systems engineering with a goal of learning to build fault-tolerant, performant software while playing around with a bunch of cool cutting-edge technologies.
More here: https://stripe.com/blog/ctf3-launch.
Abstract: Nowadays only the lazy haven't written their own metrics storage and aggregation system. I am lazy, which is why I had to choose what to use and how to use it. I don't want you to repeat that work, so I decided to share my considerations on architectures and my test results.
The document discusses using OpenCL to accelerate genomic analysis through parallelization. It introduces OpenCL and provides examples of using it to parallelize algorithms for copy number inference in tumors, computing relatedness between individuals, and performing variable selection in regression. Key applications discussed include hidden Markov models for copy number inference, principal component analysis on relatedness matrices, and coordinate descent algorithms for lasso regression. Performance gains of up to 155x are reported for the parallel implementations compared to serial code.
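The reason these analyses map well onto OpenCL is that they are embarrassingly parallel: the same kernel runs over many independent work items (e.g. one per pair of individuals). That shape can be sketched with a thread pool and a toy allele-sharing score (hypothetical data and scoring, not the talk's actual estimator or an OpenCL kernel):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy genotypes: allele counts (0, 1, or 2) at four markers.
genotypes = {
    "ind1": [0, 1, 2, 1],
    "ind2": [0, 1, 1, 1],
    "ind3": [2, 2, 0, 0],
}

def relatedness(pair):
    """Score one pair: 1.0 for identical genotypes, lower as they diverge."""
    a, b = pair
    ga, gb = genotypes[a], genotypes[b]
    score = sum(2 - abs(x - y) for x, y in zip(ga, gb)) / (2 * len(ga))
    return (a, b, score)

# Each pair is an independent work item, like one OpenCL kernel invocation.
pairs = [("ind1", "ind2"), ("ind1", "ind3"), ("ind2", "ind3")]
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(relatedness, pairs))
```

On a GPU the same per-pair computation would run across thousands of work items at once, which is where speedups of the magnitude reported above come from.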
Programming the Cell Processor: A simple raytracer from pseudo-code to SPU code, by Slide_N
This document provides an overview of programming the Cell processor by describing how to optimize a raytracing algorithm for parallel execution across the Synergistic Processing Elements (SPEs) of the Cell. It discusses strategies like partitioning the image into rows for each SPE to process, using vectors and SIMD instructions to perform the 3D math calculations efficiently, avoiding branches by restructuring code, and leveraging direct memory access (DMA) to transfer data efficiently between the SPEs' local stores and main memory. The document also notes more advanced techniques like adaptive work partitioning, object caching, and dynamic code loading that could further improve a complex raytracer's performance on the Cell architecture.
This document discusses high-throughput screening (HTS) workflows for identifying biologically active small molecules. It describes how robots are used to rapidly screen large libraries of compounds in assays and generate large datasets. Statistical and machine learning methods in R can then be used to build predictive models from these datasets to identify promising leads and guide the screening of additional compounds. Caveats regarding the applicability of models to new chemical spaces are also discussed.
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale, by Andy Petrella
A talk given at the BioBankCloud conference in Feb 2015 about distributed computing in the contexts of genomics and health.
In this talk, we presented the results we obtained exploring the 1000genomes data using ADAM, followed by an introduction to our scalable GA4GH server implementation built using ADAM, Apache Spark and Play Framework 2.
Top 5 mistakes when writing Spark applications, by hadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
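The salting technique recommended above can be sketched in plain Python (in Spark the same idea is applied to RDD or DataFrame keys; the names here are illustrative): a hot key is split into N sub-keys so its records spread across N partitions, and the partial aggregates are combined in a second pass.

```python
import random

N_SALTS = 4

def salt(key):
    """Append a random salt so one hot key becomes N_SALTS sub-keys."""
    return f"{key}#{random.randrange(N_SALTS)}"

def unsalt(salted_key):
    """Strip the salt to recover the original key."""
    return salted_key.rsplit("#", 1)[0]

records = [("hot_key", 1)] * 1000          # heavily skewed input

stage1 = {}                                 # first pass: reduce by salted key
for key, value in records:
    sk = salt(key)
    stage1[sk] = stage1.get(sk, 0) + value

stage2 = {}                                 # second pass: combine the partials
for sk, partial in stage1.items():
    k = unsalt(sk)
    stage2[k] = stage2.get(k, 0) + partial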
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv..., by MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed using sh
This document discusses techniques for finding duplicate records in large datasets. It describes a machine learning framework with two steps: candidate selection and candidate scoring. The candidate selection step uses domain knowledge, information retrieval techniques like "more like this" queries, and approximate nearest neighbors to find candidate duplicates. The candidate scoring step then uses machine learning models trained on pairwise record comparisons to identify true duplicates among the candidates. Features for the models include differences in fields, text similarity measures, image hashes and embeddings. Approximate techniques like locality sensitive hashing allow scaling these methods to very large datasets.
Top 5 mistakes when writing Spark applicationsmarkgrover
This document discusses 5 common mistakes people make when writing Spark applications.
The first mistake is improperly sizing Spark executors by not considering factors like the number of cores, amount of memory, and overhead needed. The second mistake is running into the 2GB limit on Spark shuffle blocks, which can cause jobs to fail. The third mistake is not addressing data skew during joins and shuffles, which can cause some tasks to be much slower than others. The fourth mistake is poorly managing the DAG by overusing shuffles, not using techniques like ReduceByKey instead of GroupByKey, and not using complex data types. The fifth mistake is classpath conflicts between the versions of libraries used by Spark and those added by the user.
The document discusses quantum computing and its potential uses and limitations. It begins with an explanation of how far a person could count in their lifetime compared to what conventional computing and quantum computing are capable of. It then covers the basics of quantum computing including qubits, quantum gates, quantum circuits, and measurement. Examples of different approaches to building quantum computers are provided. While progress is being made, quantum computing is still in its early stages with devices currently having just tens of qubits. The document concludes with a discussion of the investments and doubling of quantum volume needed each year to achieve quantum advantage in the 2020s and potential early uses in areas like cryptography.
1. Ancestry DNA at Scale
Using Hadoop and HBase
September 7, 2013
2. What does this talk cover?
What does Ancestry do?
How did our journey with Hadoop start?
Using Hadoop as a Job Processor
DNA Matching with Hadoop and HBase
What’s next?
4. Discoveries Are the Key
We are the world's largest online family history resource.
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to the 16th century
• 4 petabytes
6. Discoveries With DNA
Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 120,000 DNA samples
700,000 SNPs for each sample
6,000,000+ 4th cousin matches
[Chart: genotyped samples over time, growing from zero toward 150,000]
A SNP: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism).
(http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
10. What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Scientists think they can code:
• Linux
• MySQL
• PERL and/or Python

Software Engineers think they are Scientists:
• Biology in HS and College
• Math/Statistics
• Read science papers

Pressures of a startup business – release a product, learn, and then scale
Sr. Manager, 5 developers, and a 4-member Science Team
11. DNA Input
Raw Data (A,C,T,G,0):
3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A
A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G
G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)

Map File (tells you where each SNP sits on a particular chromosome; the remaining columns in this excerpt are all zero):
rs10005853, rs10015934, rs1004236, rs10059646, rs10085382, rs10123921, rs10127827, rs10155688, rs10162780, rs1017484, rs10188129, …
12. What Did “Get Something Running” Look Like?
[Diagram, old version: a Pipeline Control service creates runs; an Enqueuer performs DNA validation; a Watch Dog monitors via heart beat, polls status, and reruns failures; results processing, finalize, and disc management (V2) complete the flow. AdMixture (ethnicity), Beagle (phasing), and GERMLINE (matching) all run on the same machine, the "Beefy Box".]

Single Beefy Box – only option is to scale vertically
13. Measure Everything Principle
• Start time, end time, duration in seconds, and sample count for every step in the pipeline. Also the full end-to-end processing time
• Put the data in pivot tables and graph each step
• Normalize the data (sample size was changing)
#1
• Use the data collected to predict future performance
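The prediction idea behind "measure everything" can be sketched in a few lines: fit a curve to per-step timings, then extrapolate to future pool sizes. The sample counts and hours below are invented for illustration, not Ancestry's actual measurements; only the technique (fit, then predict) reflects the talk.

```python
# Measured (DNA pool size, hours per batch) pairs for a matching-like step.
# Numbers are illustrative; the talk's real data showed the matching steps
# growing quadratically with pool size.
data = [(2500, 0.1), (5000, 0.4), (10000, 1.5), (20000, 6.0), (40000, 24.0)]

# Least-squares fit of hours ~ a * n^2 (a pure quadratic model):
# a = sum(n^2 * h) / sum(n^4)
a = sum(n * n * h for n, h in data) / sum((n * n) ** 2 for n, _ in data)

def predict_hours(pool_size):
    """Extrapolate the fitted quadratic to a future DNA pool size."""
    return a * pool_size * pool_size

print(round(predict_hours(40000), 1))  # reproduces the measured point
print(round(predict_hours(120000)))    # hundreds of hours: the time bomb
```

With a fit like this, you can see the wall coming long before you hit it, which is exactly how the team knew when each quadratic step would blow up.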
14. Challenges and Pain Points
Performance degrades as the DNA pool grows:
• Static (by batch size)
• Linear (by DNA pool size)
• Quadratic (matching-related steps) – the time bomb

(Courtesy of Keith's plotting)
16. Why Attack Ethnicity First?
• Smart developers, little Hadoop experience
– Using Hadoop as a job scheduler and scaling the ethnicity step was easier than redesigning the matching step
• AdMixture is a self-contained application
– Reference panel, the user's DNA, and a seed value for inputs
– CPU-intensive job that writes to stdout
• Easy to split up the input
• Looked hard enough at the matching problem to realize an HBase, MapReduce solution was realistic
17. Parallel Ethnicity Jobs
Typical run of 1000 samples: queue up one Hadoop job with 40 tasks, 25 samples per task.

[Diagram: Hadoop cluster of 20 servers (4 slots x 96GB each); MapReduce fans the Admixture tasks out across the servers]
#2
19. Freed up the “Beefy Box”
• Moving AdMixture off left an additional 10 threads for phasing and matching
• Memory was freed up for phasing and matching
• Just moving AdMixture off saved over 6 hours of processing on the single box
– Bought us time
21. What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
• GERMLINE also refers to the reference implementation of that algorithm written in C++
• You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
22. So what's the problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting
• Stateless
• Single threaded
• Prone to swapping (heavy memory usage)
• Generic – used for any DNA (fish, fruit fly, human, …)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
23. GERMLINE Run Times (in hours)
[Chart: run time in hours (0–25) vs. number of samples (2,500 to 60,000); run time climbs steeply as the pool grows]
24. Projected GERMLINE Run Times (in hours)
[Chart: measured GERMLINE run times with projections extrapolated out to 122,500 samples; projected run times approach 700 hours]
25. The Mission : Create a Scalable Matching Engine
... and thus was born Jermline (aka "Jermline with a J")
26. DNA Matching : How it Works
The Input:
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC

Kara Thrace, aka Starbuck:
• Ace viper pilot
• Has a special destiny
• Not to be trifled with

Admiral Adama:
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction
27. DNA Matching : How it Works
Separate into words (positions 0, 1, 2):
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
28. DNA Matching : How it Works
Build the hash table:
Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
29. DNA Matching : How it Works
Iterate through the genome and find matches:
Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC

ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Starbuck and Adama match from position 1 to position 2
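The three steps above (split into words, build the hash table, read matches off shared entries) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual GERMLINE or Jermline code:

```python
# Word-based matching sketch: split each genome into fixed-width "words",
# index (word, position) -> users, then any two users sharing an entry
# match at that position.
WORD = 5

def to_words(dna):
    """Split a DNA string into consecutive fixed-width words."""
    return [dna[i:i + WORD] for i in range(0, len(dna), WORD)]

samples = {
    "Starbuck": "ACTGACCTAGTTGAC",
    "Adama":    "TTAAGCCTAGTTGAC",
}

# Build the hash table: (word, position) -> set of users with that word there
table = {}
for user, dna in samples.items():
    for pos, word in enumerate(to_words(dna)):
        table.setdefault((word, pos), set()).add(user)

# Every (word, position) shared by two users is a match at that position
matches = {}
for (word, pos), users in table.items():
    for a in users:
        for b in users:
            if a < b:
                matches.setdefault((a, b), set()).add(pos)

print(matches)  # {('Adama', 'Starbuck'): {1, 2}}
```

The payoff of the hash table is that candidate pairs come straight from shared entries, instead of comparing every genome against every other base by base.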
33. The GERMLINE Way
Step one : Rebuild the entire hash table from scratch, including the new sample

Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
34. The GERMLINE Way
Step two : Find everybody's matches all over again, including the new sample (n x n comparisons)

Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar

Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
35. The GERMLINE Way
Step three : Now, throw away the evidence!

Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar

Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1

You have done this before, and you will have to do it ALL OVER AGAIN.

36. Not so good, right?
Now let's take a look at the Jermline way.
37. The Jermline way
Step one : Update the hash table.

Already stored in HBase:
2_ACTGA_0 : { Starbuck: 1 }
2_TTAAG_0 : { Adama: 1 }
2_CCTAG_1 : { Starbuck: 1, Adama: 1 }
2_TTGAC_2 : { Starbuck: 1, Adama: 1 }

New sample to add:
Baltar : TTAAG CCTAG GGGCG – add a column for the new sample to each word row it touches

Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
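The incremental update can be sketched with a plain dict standing in for the HBase table (a simplification: real HBase rows live in a distributed store and are written via MapReduce, not touched in-process):

```python
# Simulated Jermline hash table. Row key: [CHROMOSOME]_[WORD]_[POSITION];
# qualifiers are user IDs; the cell value is a byte set to 1.
hbase = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
}

def add_sample(table, chromosome, user, words):
    """Incremental update: touch only the rows for the new sample's words."""
    for pos, word in enumerate(words):
        row_key = f"{chromosome}_{word}_{pos}"
        table.setdefault(row_key, {})[user] = 1

add_sample(hbase, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
# Baltar now appears in 2_TTAAG_0 and 2_CCTAG_1, plus a new row 2_GGGCG_2;
# existing rows and users are untouched -- no full rebuild required.
```

This is the key contrast with the GERMLINE way: adding a sample costs work proportional to that one sample, not to the whole pool.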
38. The Jermline way
Step two : Find matches.

Already stored in HBase:
2_Starbuck x 2_Adama : { (1, 2), ... }

New matches to add:
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1

"Fuzzy match" the consecutive words. Worst case: identical twins.
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
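Turning the shared word positions into the stored ranges amounts to merging consecutive positions into runs. A minimal sketch (this omits the "fuzzy" tolerance for occasional mismatched words that the real implementation needs):

```python
# Collapse the positions where two users share words into contiguous
# (start, end) match ranges -- the cell value stored per user pair.
def to_ranges(positions):
    """Merge sorted word positions into runs, e.g. {0, 1} -> [(0, 1)]."""
    ranges, start, prev = [], None, None
    for p in sorted(positions):
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            ranges.append((start, prev))
            start = prev = p
    if start is not None:
        ranges.append((start, prev))
    return ranges

# Baltar shares words with Adama at positions 0 and 1, with Starbuck at 1
print(to_ranges({0, 1}))  # [(0, 1)]
print(to_ranges({1}))     # [(1, 1)]
```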
40. But wait ... what about Zarek, Roslin, Hera, and Helo?
41. Run them in parallel with Hadoop!
(Photo by Benh Lieu Song)
42. Parallelism with Hadoop
• Batches are usually about a thousand people.
• Each mapper takes a single chromosome for a single person.
o Three samples per task means 22 jobs with 334 tasks (1000/3) each
• MapReduce Jobs :
Job #1 : Match Words – updates the hash table
Job #2 : Match Segments – identifies areas where the samples match
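The partitioning arithmetic above can be sanity-checked in a couple of lines (a sketch; the 22 jobs correspond to the 22 autosomal chromosomes):

```python
# One MapReduce job per chromosome; each task handles three samples.
batch_size = 1000
chromosomes = 22        # autosomal chromosomes, one job each
samples_per_task = 3

jobs = chromosomes
tasks_per_job = -(-batch_size // samples_per_task)  # ceil(1000 / 3)
print(jobs, tasks_per_job)  # 22 jobs of 334 tasks each
```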
43. How does Jermline perform?
A 1700% improvement over GERMLINE! Along with more accurate results.
#3
46. Incremental Changes Over Time
• Support the business, move incrementally and adjust
• After H2, pipeline speed stays flat

(Courtesy of Bill's plotting)
47. Dramatically Increased our Capacity
Bottom line : Without Hadoop and HBase, this would have been expensive and difficult.

• Previously, we ran GERMLINE on a single "beefy box".
• 12-core 2.2GHz Opteron 6174 with 256GB of RAM
• We had upgraded this machine until it couldn't be upgraded any more.
• Processing time was unacceptable, growth was unsustainable.
• To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level – at considerable cost!

• Now, we run Jermline on a cluster.
• 20 x 12-core 2GHz Xeon E5-2620 with 96GB of RAM
• We can now run 16 batches per day, whereas before we could only run one.
• Most importantly, growth is sustainable. To add capacity, we need only add more nodes.
49. Continue to Evolve the Software
• Azkaban for job control
– Nearly complete
• Phasing
– Still runs on the "Beefy Box"; 1000 samples take over 11 hours
– Total run time for 1000 samples is about 14 hours
– Re-implement with HBase, MapReduce, Hadoop
• Version Updates
– New algorithms require us to re-run the entire DNA pool
– Burst capacity to the cloud
• Machine Learning
– Matching (V2) and Ethnicity (V3) would both benefit from a Machine Learning approach
Job Processor: As you will see, we started our Hadoop/DNA journey with something fairly basic, and then we moved to the matching problem. DNA Matching: We will walk through an example of how matching works, discuss how GERMLINE implemented the matching, and contrast that with the Hadoop/HBase implementation we created.
At Ancestry.com our mission is to help people discover, preserve and share their family history.
Everything from birth certificates, obituaries, immigration records, census records, voter registration, old phone books, everything.
Typically, the way it works is this: You search through our records to find one of your relatives. Once you've found enough records that you're satisfied you've found your relative, you attach them to your family tree. After that, Ancestry goes to work for you. Our search engine takes a look at your whole tree to find relatives that you may not know about yet, and presents these to you as hints (the shaky leaf). You can then examine these hints and see if they are, in fact, related to you. It's pretty cool! And the beauty of it is, say you've found a relative who's researched their family tree pretty extensively. Well, you get to piggyback on all that research by simply adding their family tree to yours. A fine example of crowdsourcing.
Spit in a tube, pay $99, and learn about your past. That is how Derrick Harris of GigaOm described what we do. DNA is found in every living cell; it is the genetic material that encodes all of the information required to create and maintain life. DNA is passed down from parent to child and is like breadcrumbs left by our ancestors, and changes in DNA across generations give us a view into history. We can take those breadcrumbs and determine with a large degree of accuracy what your ethnicity is and who else in our database might be your cousin. If we determine that you have a 4th cousin, then you likely share a common ancestor with that person between 7 and 10 generations ago, or 150-300 years ago. We have a team of data scientists and bioinformatics PhDs working on this effort and have very quickly acquired over 120,000 DNA samples from people that have family trees on our site. Each DNA sample is composed of over 700,000 SNPs, or location markers. In order to compare the 700,000 SNPs from each new sample with the 700,000 SNPs from each existing sample already in our database, we have a sophisticated pipeline of algorithms that runs using Hadoop, HBase and MapReduce for parallel distributed processing. What is our confidence rate for a 4th cousin match? The average customer has close to 30 fourth-cousin matches.
Top left our ethnicity chart. To the right, Tree view with cousin hints and surnames in another member’s public tree. Maps pinpointing birth locations. List of surnames that appear in both trees.
The bottom red line is the size of our DNA pool (i.e. each unique sample in our database). The black line is the number of cousin matches we've calculated at that particular DNA pool size. As you can see, the matches start to compound and grow quadratically as the pool size increases. This is a good thing: it means we can find genetic relatives for most customers who take the DNA test. The cousin matches are actually a Big Data problem for our front end. We are looking at different ways to handle the transfer, storage, and growth of the cousin-match data as the DNA pool size increases.
Every scientist thinks they can code, because they have been doing it for a long time on their own or in an academic environment. But they don't know what it means to build, deploy, and support "production" code. Software engineers understand production code; they just think they understand the math and statistics (after all, they are computer scientists), and they think they can understand the science behind DNA (after all, they took biology in high school). That is nowhere near the education of a Bioinformatics or Population Genetics PhD. The Science Team are the domain experts, and the engineers are required to build a production system that meets the domain experts' needs. We really started light: 3 developers and 2 scientists. In fact, for the first 3 months we "borrowed" engineers from other projects to get this started.
5 possible values, not 4: A, C, T, G, and zero. Zero indicates a "read" failure at that position. No sample is perfect, extraction can be off, and each run on the same sample will come up with zeros in different spots. We run QC checks on the sample: if there are too many zeros, we have the lab try the extraction again. If that fails 2 more times, we issue a recollect (send another kit to the customer and ask them to submit their DNA again). The map file tells you where each value is on a particular chromosome.
We ran AdMixture on 10 threads, and Phasing and Germline on 10 threads. AdMixture would usually finish before Beagle (Phasing), and that freed up more memory and threads for Germline. In all, a 500-sample run took about 24 hours to complete (pool size < 25K). IF WE STAYED IN THIS CONFIGURATION (WHICH MATCHED MANY ACADEMIC ENVIRONMENTS), THE ONLY OPTION WAS TO INCREASE THE HARDWARE: MORE CPUS, MORE MEMORY. SCALING VERTICALLY JUST PLAIN SUCKS!
Critically important. In software development you must measure your performance at every step. It does not matter what you are doing: if you are not measuring your performance, you can't improve. The last point is critical. We could determine the formula for the performance of key phases and used that formula to predict future performance at particular DNA pool sizes. We could see the problems coming and knew when we were going to have performance issues. Story #1: The first step that went out of control (going quadratic) was the first implementation of the relationship calculation, which happens just after matching. This step was basically two nested for loops that walked over the entire DNA pool for each input sample. Simple code; it worked with small numbers and fell over fast. Time was approaching 5 hours to run. Two of my developers rewrote this in PERL and got it down to 2 minutes 30 seconds. They were ecstatic. One of our DNA Scientists (PhD in Bioinformatics, MS in Computer Science, he knows how to code) wrote an AWK command (nasty regular expressions) that ran in less than 10 seconds. My devs were humbled. For the next week, whenever they ran into Keith, they formally bowed to his skills. (All in good nature, all fun.)
Static by batch size (Phasing): some steps took a long time but were very consistent. A worry, but not critical to fix up front. Linear by DNA pool size (Pipeline Initialization): we looked at ways to streamline and improve the performance of these steps. Quadratic: those are the time bombs (Germline, relationship processing, Germline results processing). The only way we knew this was coming was because we measured each step in the pipeline.
KEY POINT: We knew we wanted to move this to Hadoop to solve matching. With that end goal in mind, we attacked the AdMixture/Ethnicity step. Without that initial investigation and discovery step, we could have used an MPI Linux environment or some other way to scale AdMixture.
Story #2: The first job we put through was a single job with 500 tasks (sample size of 500). AdMixture is a C++, multithreaded app. When we kicked up all the tasks, it did not leave enough CPU time for the "task health check" to run in the background on the Hadoop node. So the Job Controller would reach out and kill some jobs because they were "misbehaving", when in fact they were running just fine. Remember, Hadoop is intimately aware of the JVM and how it is running, but it does not have a good view into other applications you choose to run. Since AdMixture was C++, Hadoop had no idea how much memory, how many threads, or how much CPU was being used per "slot". We had to back things off so there was enough room for the Job Controller to get an "ACK" indicating the jobs were running fine. THE ONLY WAY TO UNDERSTAND HADOOP'S CAPABILITIES AND LIMITATIONS IS TO USE IT! BE READY FOR SOME SURPRISES.
Really happy with this performance: 1,000 samples usually ran in 2 hours 30 minutes to 2 hours 45 minutes. There are two spikes to explain, and they come down to a bug. AdMixture (remember, created by someone finishing a CS Master's in an academic setting) had a bug that showed up occasionally: the program would literally "get lost" and never complete. It would not GPF or throw an error; it just swallowed the CPU and never finished. Even worse, it usually happened on chromosome 1 or 2, the biggest chromosomes we process. We put a timeout on our tasks: if a task did not finish in 2 hours, we killed it, changed the seed value, and resubmitted a new task. That fixed the problem, and it explains the spikes.
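The kill-and-reseed workaround is a generic pattern worth spelling out. A minimal sketch, assuming a command line that accepts a seed (the `cmd_template` callable and the retry count are hypothetical, not the actual job-control code):

```python
import random
import subprocess

def run_with_timeout(cmd_template, timeout_s=2 * 60 * 60, max_tries=3):
    """Run a task; if it exceeds the timeout, kill it and resubmit
    with a fresh random seed -- the workaround for a solver that can
    'get lost' and spin forever without erroring out.

    cmd_template: callable taking a seed and returning the argv list.
    Returns the seed of the run that completed.
    """
    for _ in range(max_tries):
        seed = random.randrange(1 << 31)
        try:
            subprocess.run(cmd_template(seed), check=True, timeout=timeout_s)
            return seed                      # finished within the window
        except subprocess.TimeoutExpired:
            continue                         # hung: new seed, try again
    raise RuntimeError("task never completed within the timeout")
```

The key design point is that the retry changes the seed: rerunning the identical inputs would just hit the same pathological path again.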
This was a great first step. We got valuable Hadoop, MapReduce, and job-control experience, and this first step BOUGHT US TIME! It gave us the confidence to start working on the GERMLINE matching problem.
Very smart people at Columbia University came up with GERMLINE.
Remember, for an academic, running a 1,000-sample set through GERMLINE was "large". I've talked to people who kept re-running the same 50 fish DNA samples through GERMLINE to clean up the variations between sample extractions (think of it as eliminating all the zeros). In a lot of ways, we were using GERMLINE in a way it was not built for.
Mention how we kept upgrading and tightening things up
Our projections showed how bad the execution time would get. As we approached 120K for the DNA pool size, each additional 500 sample set would require 700 hours to complete – over 4 weeks.
Jermline, with a "J" (the lead engineer's first name is Jeremy). This was a "clean room" implementation of the algorithm: read the reference paper, don't look at the C++ reference implementation. Work from the original (brilliant) paper.
Using Battlestar Galactica characters for the matching example.
For each person-to-person comparison, we add up the total length of their shared DNA and run that through a statistical model to see how closely they're related. This is the “Relationship Calculation” step that works on the GERMLINE output.
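A minimal sketch of that step, with illustrative numbers (the segment representation, units, and thresholds below are placeholders, not the actual statistical model):

```python
def total_shared_cm(segments):
    """Sum the lengths of all shared DNA segments for one pair of
    users. Each segment is a (start, end) pair in centimorgans."""
    return sum(end - start for start, end in segments)

def rough_relationship(shared_cm):
    """Toy stand-in for the statistical model: bucket total shared
    DNA into a coarse relationship estimate. Thresholds are
    illustrative only."""
    if shared_cm > 2300:
        return "parent/child or identical"
    if shared_cm > 1300:
        return "sibling"
    if shared_cm > 500:
        return "1st cousin range"
    if shared_cm > 90:
        return "2nd-3rd cousin range"
    return "distant or none"
```

The point is the two-stage shape: matching produces segments per pair, and the relationship calculation reduces them to one number and classifies it.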
Remind people that GERMLINE was stateless
Anytime you see an N-by-N comparison in a problem you are working on, it should send up huge red flags.
HBase holds the data (a mix between a spreadsheet and a hash table). Adding columns is easy, and a very sparse matrix is fine. The row key is the chromosome, the word value, and the position (which word). Each new sample adds a column to the table; a value of 1 in a cell indicates this user has this value at this location, so a row holds all the samples in our DNA pool with that same value. This is really a pretty simple implementation. Remember: SIMPLE SCALES.
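The table layout can be modeled in a few lines. Here the HBase table is stood in for by a plain dictionary (an illustration of the schema, not HBase client code): one row per (chromosome, position, word value), one entry per sample that carries that word.

```python
from collections import defaultdict

# Toy model of the Jermline match table:
# row key = (chromosome, word position, word value)
# row contents = the set of samples carrying that word there.
table = defaultdict(set)

def add_sample(sample_id, words):
    """Insert one new sample and return its match candidates.

    words: {(chromosome, position): word_value} for this sample.
    Everyone already present in a row shares that word with the new
    sample -- a row lookup replaces the old N x N scan of the pool.
    """
    candidates = set()
    for (chrom, pos), value in words.items():
        row = table[(chrom, pos, value)]
        candidates |= row        # existing samples sharing this word
        row.add(sample_id)       # then add the new sample as a "column"
    return candidates
```

This is why incremental batches stay cheap: each new sample only touches the rows for its own words, regardless of how large the total pool has grown.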
There is a second table for the fuzzy matching phase. It holds the list of ranges where two users match on a chromosome, and it is used to create the output of the matching phase: exactly where two individuals match on each chromosome.
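The cell values in that second table are ranges built from individual word hits. A small sketch of that collapse step, assuming word positions are integer indices (the function and representation are illustrative):

```python
def to_ranges(positions):
    """Collapse the word positions where two users match on one
    chromosome into contiguous (start, end) ranges -- the shape of
    the fuzzy-matching table's cell values."""
    ranges = []
    for p in sorted(positions):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], p)   # extend the open range
        else:
            ranges.append((p, p))             # start a new range
    return ranges
```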
There were a whole bunch of characters on Battlestar Galactica!
On the first run we kicked off one job with 11,000 tasks (500 samples x 22 chromosomes) using HBase 0.92, and we panicked the HBase region server. That's where we came up with 22 jobs (one per chromosome) with about 334 tasks per job. (Moving to HBase 0.94 was much more stable.)
Story #3: We would run samples through both the old GERMLINE and the new Hadoop Jermline. For the most part, they always matched. We finally found a few runs with discrepancies and had to pull in the Science Team to check – we had actually found a bug in the original GERMLINE implementation for an edge case. The clean-room Hadoop implementation was "more correct" than the original C++ GERMLINE reference code. Very gratifying to see – but the truth is it had us concerned and confused for about 3 days. We had made the natural assumption that the base GERMLINE implementation (with a 'G') was 100% correct. That assumption was wrong.
This slide is a huge relief. We’ve been released and steady for a while. One note, the curve for H2 is not totally flat. It is going up ever so slightly. No worries. We can always add more nodes to the cluster and reduce the time.
This is an "Agile" development story. Point out a few colors: dark green, orange, light green at the top, and purple.
The darker green is AdMixture; you can see when we moved it to Hadoop (our H1 release).
Orange is the matching step (GERMLINE to Jermline, our H2 release).
The lighter green is the pipeline finalization step. We eliminated most of this step when we released H2. We had a failsafe way to fall back to the completed steps of the previous run; we never wanted to fail in the middle of a run, destroy everything, and then have to rerun the entire pool from scratch. That was a key part of finalization pre-Jermline.
The purple is phasing (Beagle). Static based on input size, very stable, and on our hit list.
The "beefy box" would be a good candidate for a large database server or a single node in a heavily used distributed cache (Memcached or Redis).