VariantSpark - a Spark library for genomics

VariantSpark is a custom Apache Spark library for genomic data. It implements a custom wide random forest machine learning algorithm, designed for workloads with millions of features.

VariantSpark: a library for Genomics
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Lynn Langit

Natalie Twine
Transformational Bioinformatics Team
Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson
Adrian White
Mia Champion
Gaetan Burgio
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai
Kaylene Simpson
Iva Nikolic
Ian Blair
Kelly Williams

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
VariantSpark | Denis C. Bauer @allPowerde

Unsupervised ML : K-Means
www.cloudaccess.eu
1000 x 40 million variants matrix * → k-means → predict super population
14 ethnic groups and 4 super populations
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
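
The clustering step on this slide can be illustrated with a minimal sketch. It is not VariantSpark's implementation: it runs plain Lloyd's k-means in pure Python on a toy genotype matrix, where each row is an individual and each value is an assumed alternate-allele count (0, 1 or 2) for one variant. The toy data, the function names and the deterministic farthest-point seeding are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(rows, k):
    """Deterministic farthest-point seeding (a simplification, not from the slides)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Pick the row whose nearest existing centroid is farthest away.
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(rows, k, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = init_centroids(rows, k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each individual joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                  for row in rows]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy matrix: 6 individuals x 8 variants (hypothetical values).
genotypes = [
    [0, 0, 1, 0, 2, 2, 2, 1],
    [0, 1, 0, 0, 2, 2, 1, 2],
    [0, 0, 0, 0, 2, 1, 2, 2],
    [2, 2, 1, 2, 0, 0, 0, 1],
    [2, 1, 2, 2, 0, 0, 1, 0],
    [2, 2, 2, 2, 1, 0, 0, 0],
]

labels, centroids = kmeans(genotypes, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

At the scale on the slide the matrix would not fit this in-memory approach; the sketch only shows what "cluster individuals by their variant profile" means.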

Comparing K-Means Implementations
[Figure: bar chart of time in seconds (axis 0 to 2000) per method: Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark. Bars are split by task: binary conversion, clustering, pre-processing. Runtime labels, in slide order: 103, 75, 29, 28, 18, 4 min.]

Supervised ML: Wide Random Forests

Genomic Research Workflow
https://www.projectmine.com/about/
Focus

Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
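
To make the "wide" random forest idea concrete, here is a hedged sketch of a single node split when features vastly outnumber samples. It is not VariantSpark's algorithm. It assumes genotypes coded 0/1/2, stores the data one variant per list (feature-major), samples a fraction of the variants, and keeps the split with the largest Gini impurity decrease. All names and the toy data are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gain(column, labels, threshold):
    """Impurity decrease when splitting samples on column <= threshold."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_split(columns, labels, mtry_fraction, rng):
    """Try a random subset of variants; return (gain, variant index, threshold)."""
    n_try = max(1, int(len(columns) * mtry_fraction))
    candidates = rng.sample(range(len(columns)), n_try)
    best = (0.0, None, None)
    for j in candidates:
        for threshold in (0, 1):  # the only useful cut points for 0/1/2 genotypes
            gain = split_gain(columns[j], labels, threshold)
            if gain > best[0]:
                best = (gain, j, threshold)
    return best


# Toy data: 8 variants (one list each) x 6 samples; variant 3 separates the classes.
columns = [
    [0, 1, 0, 0, 1, 0],
    [1, 1, 2, 1, 2, 1],
    [0, 0, 1, 0, 0, 1],
    [2, 2, 2, 0, 0, 0],
    [2, 1, 2, 2, 1, 2],
    [0, 2, 1, 1, 0, 2],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
]
phenotype = [1, 1, 1, 0, 0, 0]  # hypothetical case/control labels

gain, variant, threshold = best_split(columns, phenotype, 1.0, random.Random(0))
print(gain, variant, threshold)  # 0.5 3 0
```

With `mtry_fraction` below 1.0 only some variants are examined at each node, so an informative variant is found only in the trees and nodes where it happens to be sampled; this is one reason forests over millions of variants need many trees.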

Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~$8,620.00
• YARN cluster (12 workers)
• 16 x Intel Xeon E5-2660@2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
Whole Genome Range
GWAS Range
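
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the slide. Reading mtry = 0.25 as the fraction of variables tried at each split is an assumption, as are the variable names.

```python
# Cluster memory: 128 executors at 6 GB each.
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor   # 768 GB
total_tb = total_gb / 1024               # 0.75 TB, matching the slide

# Assumed meaning of mtry = 0.25: fraction of variables tried per split.
variables = 50_000_000
mtry = 0.25
variables_per_split = int(variables * mtry)  # 12,500,000

# Ratios between the two "100K trees" scenarios on the slide.
runtime_ratio_low = 200 / 5              # lower bounds of the runtime ranges
runtime_ratio_high = 2000 / 50           # upper bounds of the runtime ranges
cost_ratio = 8620.00 / 215.50            # quoted AWS cost estimates

print(total_tb, variables_per_split, runtime_ratio_low, runtime_ratio_high, cost_ratio)
```

Both runtime bounds and the cost estimate differ by the same factor of 40, so the two AWS figures are consistent with cost scaling linearly with runtime.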

Databricks & VariantSpark via a Jupyter notebook

Solving Important Questions…
Cancer genomics?

• Quickly access a managed Spark cluster - AWS EC2 / spot instances
• Link to your data and perform whole genome analysis in real-time
VariantSpark & Databricks Notebooks
Jupyter Notebook

Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html


(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...

Scintica Instrumentation

Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast high-resolution imaging of cellular processes over time and space and were studied in its natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provide insights into the progression of disease, response to treatments or developmental processes. In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.

4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf

ssuserbfdca9

EY - Supply Chain Services 2018_template.pptx

AlguinaldoKong

Lab report on liquid viscosity of glycerin

ossaicprecious19

platelets- lifespan -Clot retraction-disorders.pptx

muralinath2

Citrus Greening Disease and its Management

subedisuryaofficial

ESR_factors_affect-clinic significance-Pathysiology.pptx

muralinath2

Hemostasis_importance& clinical significance.pptx

muralinath2

platelets_clotting_biogenesis.clot retractionpptx

muralinath2

Recently uploaded (20)

Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...

PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION

GBSN - Biochemistry (Unit 5) Chemistry of Lipids

SCHIZOPHRENIA Disorder/ Brain Disorder.pdf

Richard's aventures in two entangled wonderlands

Mammalian Pineal Body Structure and Also Functions

Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...

Orion Air Quality Monitoring Systems - CWS

Nutraceutical market, scope and growth: Herbal drug technology

Richard's entangled aventures in wonderland

Multi-source connectivity as the driver of solar wind variability in the heli...

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...

4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf

EY - Supply Chain Services 2018_template.pptx

Lab report on liquid viscosity of glycerin

platelets- lifespan -Clot retraction-disorders.pptx

Citrus Greening Disease and its Management

ESR_factors_affect-clinic significance-Pathysiology.pptx

Hemostasis_importance& clinical significance.pptx

platelets_clotting_biogenesis.clot retractionpptx

VariantSpark - a Spark library for genomics

1. VariantSpark: a library for Genomics Transformational Bioinformatics | Denis C. Bauer | @allPowerde Lynn Langit

2. “Genomical” Big Data

3. Natalie Twine Transformational Bioinformatics Team Denis Bauer Oscar Luo Rob Dunne Piotr Szul Aidan O’Brien Laurence Wilson Adrian White Mia Champion Gaetan Burgio Collaborators David Levy News Software Dan Andrews Kaitao Lai Kaylene Simpson Iva Nikolic Ian Blair Kelly Williams

4. BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4) Cited 4

5. Unsupervised ML: K-Means. 1000 x 40 Million variants matrix *, k-means, predict super population. 14 ethnic groups and 4 super populations (www.cloudaccess.eu). * VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants

6. Comparing K-Means Implementations [Figure: bar chart of time in seconds (axis 0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing); bar labels: 103, 75, 29, 28, 18, 4 min]

7. Supervised ML: Wide Random Forests

8. Genomic Research Workflow https://www.projectmine.com/about/ Focus

9. Performance – Faster and More Accurate. VariantSpark is the only method to scale to 100% of the genome

10. Scaling to 50 M variables and 10 K samples. 100K trees: 5 – 50h, AWS: ~$215.50. 100K trees: 200 – 2000h, AWS: ~$8620.00. • Yarn Cluster (12 workers) • 16 x Intel Xeon E5-2660@2.20GHz CPU • 128 GB of RAM • Spark 1.6.1 on YARN • 128 executors • 6GB / executor (0.75TB) • Synthetic dataset (mtry = 0.25). Chart ranges: Whole Genome Range, GWAS Range

11.

12. Databricks & VariantSpark via a Jupyter notebook

13. Solving Important Questions… Cancer genomics?

14. DEMO: Who is a Hipster?

15. VariantSpark & Databricks Notebooks • Quickly access a managed Spark cluster - AWS EC2 / spot instances • Link to your data and perform whole genome analysis in real-time. Jupyter Notebook

16. Joint-loci association test. Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2])); Label = 1 if Hipster-Index > 10. Genomic profile, Label, Samples (n=2500)

17. Try it out: VariantSpark Notebook https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html

18. VariantSpark: a library for Genomics Transformational Bioinformatics | Denis C. Bauer | @allPowerde Lynn Langit
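
The synthetic phenotype on slide 16 can be sketched in a few lines of plain Python. This is only an illustration of the formula as printed on the slide, not VariantSpark code: the 0/1/2 genotype encoding, the uniform random sampling, and the helper names `hipster_index`, `label` and `simulate` are all assumptions made for the example.

```python
import random

# Loci named on the slide; the 0/1/2 genotype encoding is an assumption.
LOCI = ("B6", "R1", "C2", "B2")


def hipster_index(gt):
    """Evaluate the slide's Hipster-Index formula for one sample's genotypes."""
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def label(gt, threshold=10):
    """Return 1 if the Hipster-Index exceeds the threshold, else 0."""
    return 1 if hipster_index(gt) > threshold else 0


def simulate(n=2500, seed=0):
    """Draw n hypothetical samples with uniformly random genotypes (assumption)."""
    rng = random.Random(seed)
    samples = [{locus: rng.randint(0, 2) for locus in LOCI} for _ in range(n)]
    return [(sample, label(sample)) for sample in samples]


if __name__ == "__main__":
    data = simulate()
    positives = sum(lab for _, lab in data)
    print(f"{positives} of {len(data)} simulated samples are labelled 1")
```

Because the index multiplies genotype terms in pairs, no single locus fixes the label on its own, which is presumably why the slide calls it a joint-loci association test.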

Editor's Notes

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
http://www.cloudaccess.eu/blog/wp-content/uploads/2014/06/genetic_roots.png
Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.
https://academics.cloud.databricks.com/#notebook/170398/command/170419
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
