This document describes using Spark for spatial analysis of histological images to characterize the tumor microenvironment. The goal is to provide actionable data on the location and density of immune cells and blood vessels. Over 100,000 objects are annotated in each whole slide image. Spark is used to efficiently calculate over 5 trillion pairwise distances between objects within a neighborhood window. This enables profiling of co-localization and spatial clustering of objects. Initial results show the runtime scales linearly with the number of objects. Future work includes integrating clinical and genomic data to characterize variation between tumor types and patients.
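As an illustration of the computation described above, here is a minimal pure-Python sketch of one neighborhood window's pairwise-distance step (the step Spark would parallelize across partitions); the function name, coordinates, and window radius are hypothetical, not from the original work:

```python
from math import hypot

def pairwise_distances_in_window(objects, center, radius):
    """Distances between all pairs of objects whose centroids fall
    within `radius` of `center` (a single neighborhood window)."""
    window = [(x, y) for (x, y) in objects
              if hypot(x - center[0], y - center[1]) <= radius]
    return [hypot(ax - bx, ay - by)
            for i, (ax, ay) in enumerate(window)
            for (bx, by) in window[i + 1:]]

# Toy example: three annotated objects, window centered at the origin.
# (50, 50) falls outside the radius-10 window, leaving one pair.
dists = pairwise_distances_in_window([(0, 0), (3, 4), (50, 50)], (0, 0), 10)
```

At whole-slide scale, each of the 100,000+ objects defines such a window, which is what drives the trillions of pairwise distances the cluster must evaluate.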
Data Science Solutions by Materials Scientists: The Early Case Studies - Tony Fast
Improvements in algorithms, technology, and computation are directly impacting the landscape of information use in materials science. The 3 V’s of Big Data (volume, velocity, and variety) are becoming ever more apparent within all sectors of the field. Novel approaches will be required to confront the emerging data deluge and extract the richest knowledge from simulated and empirical information in complex evolving 3-D spaces. Microstructure Informatics (μInformatics) is an emerging suite of signal processing techniques, advanced statistical tools, and data science methods tailored specifically for this new frontier. μInformatics curates and transforms large collections of materials science information using efficient workflows to extract knowledge of bi-directional structure-property/processing connections for most material classes.
In this talk, a few early case studies in data-driven methods to solve materials science problems will be explored. Emerging spatial statistics tools will be explored that enable an objective comparison of static and evolving 3-D material volumes from molecular dynamics simulation, micro-CT, and Scanning Electron Microscopy. Also, the statistics will provide a foundation to create improved bottom-up homogenization relationships in fuel cell materials. Lastly, applications of the Materials Knowledge System, a data-driven meta-model to create top-down localization relationships will be explored for phase field model and finite element model information.
Near duplicate detection algorithms have been proposed and implemented to detect and eliminate duplicate entries from massive datasets. Due to differences in data representation (such as measurement units) across data sources, potential duplicates may not be textually identical, even though they refer to the same real-world entity. As data warehouses typically contain data from several heterogeneous sources, detecting near duplicates in a data warehouse requires considerable memory and processing power.
Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. While parallel and distributed frameworks have recently been exploited to scale the existing algorithms to larger datasets, such efforts often focus on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of the existing similarity join algorithms are still lacking.
In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of ∂u∂u, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. ∂u∂u leverages the distributed shared memory and execution model provided by IMDGs to execute existing near duplicate detection algorithms in a parallel and multi-tenanted environment. As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
Master's Thesis - deep genomics: harnessing the power of deep neural networks... - Enrico Busto
The Human Genome Project [1], an international scientific research project with the goal of determining the sequence of nucleotide base pairs that make up human DNA, lasted roughly 15 years and cost $5 billion (adjusted for inflation). With recent advances in genome sequencing technology, that cost has now dropped to a few hundred dollars [2], and sequencing can be done overnight.
Being able to access this kind of information may have a deep impact on the way complex diseases are treated: physicians will shift from general-purpose treatments to specific ones, tailored to the individual patient’s genomic features. This approach is referred to as precision medicine.
There are, however, several caveats. First, due to the nature of the problem, knowledge of both the biomedical and the computer science domains is required to approach it correctly. Second, unlike more classical scenarios such as image classification or object detection, it is much more difficult to determine the accuracy of the system, owing to the complex and multifactorial nature of diseases such as cancer and neurodegenerative diseases.
Moreover, a black-box solution is unlikely to be of any use, for legal and ethical reasons: interpretability of the model is more crucial than ever.
The goal of this thesis is to explore the possibilities and the limits of techniques based on deep neural networks for the analysis of biomolecular data, experimenting with publicly available datasets.
It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations, rather than a single genetic variation. Multi-SNP interactions, also known as epistatic interactions, have the potential to provide information about the causes of complex diseases and build on GWAS studies that look at associations between single SNPs and phenotypes. However, epistatic analysis methods are computationally expensive and, being command-line based, have limited accessibility for biologists wanting to analyse GWAS datasets. Here we present APPistatic, a prototype desktop version of a pipeline for epistatic analysis of GWAS datasets. This application combines ease of use, via a GUI, with accelerated implementations of the BOOST and FaST-LMM epistatic analysis methods.
Highlights topics of discussion on remote sensing during Day 1 of Program on Mathematical and Statistical Methods for Climate and the Earth System Opening Workshop.
Drug Repurposing using Deep Learning on Knowledge Graphs - Databricks
Discovering new drugs is a lengthy and expensive process, which means that finding new uses for existing drugs can yield new treatments in less time and at lower cost. The difficulty is in finding these potential new uses.
How do we find these undiscovered uses for existing drugs?
We can unify the available structured and unstructured data sets into a knowledge graph. This is done by fusing the structured data sets, and performing named entity extraction on the unstructured data sets. Once this is done, we can use deep learning techniques to predict latent relationships.
In this talk we will cover:
Building the knowledge graph
Predicting latent relationships
Using the latent relationships to repurpose existing drugs
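The "predicting latent relationships" step above can be illustrated with a minimal TransE-style scoring sketch (pure Python; the tiny 2-D embeddings and the drug/disease names are made up for illustration and are not from the talk, which does not specify a particular embedding model):

```python
def transe_score(h, r, t):
    """TransE plausibility: smaller ||h + r - t|| means the
    (head, relation, tail) triple is more plausible."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

# Toy 2-D embeddings: drug A translated by "treats" lands on disease X.
emb = {"drugA": (0.0, 0.0), "treats": (1.0, 1.0),
       "diseaseX": (1.0, 1.0), "diseaseY": (5.0, 5.0)}

candidates = ["diseaseX", "diseaseY"]
best = min(candidates,
           key=lambda d: transe_score(emb["drugA"], emb["treats"], emb[d]))
```

Ranking candidate tail entities by this score is one common way a trained knowledge-graph model surfaces unobserved (drug, treats, disease) edges as repurposing hypotheses.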
Building an informatics solution to sustain AI-guided cell profiling with hig... - Ola Spjuth
Presentation at SLAS Europe 2019 in Barcelona on 28 June 2019.
High-content microscopy in automated laboratories presents many challenges for storing and processing data and for building AI models to aid decision making. We have established an informatics system to serve a robotized cell-profiling setup with incubators, liquid handling, and high-content microscopy for microplates. The informatics system consists of computational infrastructure (CPUs, GPUs, storage), middleware (Kubernetes), an imaging database and software (OMERO), and a workflow system (Pachyderm) to perform online prioritization of new data and automate the process from acquired images to continuously updated and deployed AI models. The AI methodologies include deep learning models trained on image data and conventional machine learning models trained on data from Cell Painting experiments. The microservice architecture makes the system scalable and expandable, and a key objective is improving screening and toxicity assessment using AI-aided intelligent experimental design.
Climate Science presents several data intensive challenges that are the intersection of software architecture and data science. This includes developing approaches for scaling the analysis of highly distributed data across institutional and system boundaries. JPL has been developing approaches for quantitatively evaluating software architectures to consider different topologies in the deployment of computing capabilities and methodologies in order to support the analysis of distributed climate data. This talk will cover those approaches and also needed research in new methodologies as remote sensing and climate model output data continue to increase in their size and distribution.
Artificial neural network based Chinese medicine diagnosis in decision suppor... - Dr. Wilfred Lin (Ph.D.)
HerbMiners Informatics Limited is a clinical Traditional Chinese Medicine (TCM) intelligence software solutions company. It focuses on research in TCM data mining, which aims to reveal relationships between symptoms, illnesses, herbs, and prescriptions, and it provides artificial intelligence software solutions that assist hospitals and clinics with TCM modernization and patient-record digitization.
Photo Rendering with Swarms: From Figurative to Abstract Pherogenic Imaging - Carlos Cotta
Paper presented at the IEEE Symposium on Computational Intelligence for Creativity and Affective Computing (CICAC 2013), held as a part of the IEEE Symposium Series on Computational Intelligence (SSCI 2013), Singapore, 15-19 April 2013
This is an introduction to a knowledge engineering methodology called 'Knowledge Engineering from Experimental Design' (KEfED). This methodology provides a powerful, intuitive method for modeling the design of scientific experiments and provides the foundation for work at the Biomedical Knowledge Engineering Group at the Information Sciences Institute (run by Gully Burns).
RISELab: Enabling Intelligent Real-Time Decisions - Jen Aman
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Learning, Training, Classification, Common Sense and Exascale Computing - Joel Saltz
In this talk, I will describe work my group has carried out in the development of deep learning methods that target semantic segmentation and object identification tasks in terapixel pathology datasets and satellite data. I will describe what we have been able to achieve and how this work can generalize to additional types of problems, and I will outline how exascale computing could be used to transform and integrate our methods and pipelines. I will then outline a broad research program in exascale computing and deep learning that promises to identify common deep learning methods for previously disparate large and extreme-scale data tasks.
In this deck from the 2014 HPC User Forum in Seattle, Jack Collins from the National Cancer Institute presents: Genomes to Structures to Function: The Role of HPC.
Watch the video presentation: http://wp.me/p3RLHQ-d28
Enabling Real Time Analysis & Decision Making - A Paradigm Shift for Experime... - PyData
By Kerstin Kleese van Dam
PyData New York City 2017
New instrument technologies are enabling a new generation of in-situ and in-operando experiments, with extremely fine spatial and temporal resolution, that allow researchers to observe physics, chemistry, and biology as they happen. These new methodologies go hand in hand with an exponential growth in data volumes and rates: petabyte-scale data collections and terabyte-per-second streams. At the same time, scientists are pushing for a paradigm shift: now that they can observe processes in intricate detail, they want to analyze, interpret, and control those processes. Given the multitude of voluminous, heterogeneous data streams involved in every single experiment, novel real-time, data-driven analysis and decision-support approaches are needed to realize this vision. This talk will discuss state-of-the-art streaming analysis for experimental facilities, its challenges, and early successes. It will present where commercial technologies can be leveraged and how many of the novel approaches differ from commonly available solutions.
Building bioinformatics resources for the global community - ExternalEvents
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
Building bioinformatics resources for the global community. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management and GMI-9, 23-25 May 2016, Rome, Italy.
Spatial Biology - The New Frontier of Microbiology - KML Vision
We summarized the most important facts about state-of-the-art spatial biology techniques and their uses in different areas of life science.
If you want to know more, we dive deep into this exciting subject in our blogpost https://www.kmlvision.com/spatial-biology-introducing-spatial-metrics-to-bioimage-analysis/
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio... - Joel Saltz
Description of an NCI Information Technology for Cancer Research project dedicated to 1) development of digital pathology pipelines, databases, data modeling, and visualization methods, and 2) support for digital pathology/radiology/"omics"-based precision medicine
Presented at Spring 2015 Information Technology for Cancer Research PI Meeting
Keynote presentation at GlobusWorld 2021. Highlights product updates and roadmap, as well as user success stories in research data management. Presented by Ian Foster, Rachana Ananthakrishnan, Kyle Chard and Vas Vasiliadis.
dkNET Webinar: The Human BioMolecular Atlas Program (HuBMAP) 10/14/2022 - dkNET
Abstract
HuBMAP aims to catalyze the development of an open, global framework for comprehensively mapping the human body at cellular resolution. HuBMAP goals include: (1) Accelerate the development of the next generation of tools and techniques for constructing high resolution spatial tissue maps. (2) Generate foundational 3D tissue atlases. (3) Establish an open data platform. (4) Coordinate and collaborate with other funding agencies, programs, and the biomedical research community. (5) Support projects that demonstrate the value of the resources developed by the program. The HuBMAP Portal can be found at https://portal.hubmapconsortium.org and the Visible Human MOOC describes the compilation and coverage of HuBMAP data, demonstrates new single-cell analysis and mapping techniques, and introduces major features of the HuBMAP portal.
The top 3 key questions that HuBMAP can answer:
1. What assay types are best to map the human body in 3D and across scales?
2. What Common Coordinate System (CCF) is best to construct the Human Reference Atlas?
3. How can others help construct and/or use the Human Reference Atlas?
Presenters:
Katy Börner, PhD, Victor H. Yngve Distinguished Professor of Engineering and Information Science, Department of Intelligent Systems Engineering and Information Science, Indiana University
Jeffrey Spraggins, PhD, Assistant Professor, Department of Cell and Developmental Biology, Vanderbilt University
Upcoming webinars schedule: https://dknet.org/about/webinar
Automated Analysis of Microscopy Images using Deep Convolutional Neural Network - AdetayoOkunoye
General cell quantification and identification methods have technical limitations concerning the fast and accurate detection of cells with complex morphology, especially overlapping cells, irregular cell shapes, and bad focal planes, among other factors. We use deep convolutional neural networks (DCNN) to classify annotated images of five types of white blood cells. The accuracy and performance of the proposed framework are evaluated on the blood cell classifications. The results demonstrate that the DCNN model achieves an accuracy close to 80% and provides an accurate and fast method for hematological laboratories.
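A minimal sketch of the kind of classification-accuracy evaluation reported above (pure Python; the five cell-type labels below are ordinary white-blood-cell types chosen for illustration, and the predictions are made up, not the study's actual outputs):

```python
def accuracy(predicted, actual):
    """Fraction of cells whose predicted class matches the annotation."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Five annotated cells, one per white-blood-cell type; the model
# misclassifies the basophil, giving 4/5 = 0.8 accuracy.
actual    = ["neutrophil", "lymphocyte", "monocyte", "eosinophil", "basophil"]
predicted = ["neutrophil", "lymphocyte", "monocyte", "eosinophil", "neutrophil"]
acc = accuracy(predicted, actual)
```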
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia - Jen Aman
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Snorkel: Dark Data and Machine Learning with Christopher Ré - Jen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
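The "simple programs that label data" idea can be sketched as follows (pure Python, not the Snorkel API; the spam-detection labeling functions are hypothetical, and Snorkel replaces the naive majority vote below with a learned generative model that weights the functions by estimated accuracy):

```python
ABSTAIN, NEG, POS = -1, 0, 1

# Hypothetical labeling functions for a spam-detection task:
# each votes POS/NEG or abstains when it has no opinion.
def lf_contains_link(text):
    return POS if "http" in text else ABSTAIN

def lf_long_message(text):
    return NEG if len(text.split()) > 5 else ABSTAIN

def lf_money_words(text):
    return POS if "free" in text.lower() or "$" in text else ABSTAIN

def majority_label(text, lfs):
    """Combine labeling-function votes, ignoring abstentions."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

lfs = [lf_contains_link, lf_long_message, lf_money_words]
label = majority_label("Click http://x.co for FREE $$$", lfs)
```

Applying the labeling functions over a large unlabeled corpus yields the programmatically generated training set that addresses the bottleneck described above.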
Deep Learning on Apache® Spark™: Workflows and Best Practices - Jen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
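As a hedged illustration of the cluster-setup point, here is a configuration sketch (the specific values are hypothetical, and the exact keys that matter depend on the Spark version and deep-learning framework in use):

```python
from pyspark.sql import SparkSession

# Example: give each executor enough memory for framework buffers, and
# set spark.task.cpus equal to spark.executor.cores so only one
# TensorFlow task runs per executor, avoiding contention for the GPU.
spark = (SparkSession.builder
         .appName("dl-pipeline")
         .config("spark.executor.memory", "16g")
         .config("spark.executor.cores", "4")
         .config("spark.task.cpus", "4")
         .getOrCreate())
```

Pinning one task per executor is one simple way to realize the "avoid task conflicts on GPUs" advice above.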
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Spatial Analysis On Histological Images Using Spark
1. Spatial Analysis on Histological Images Using Spark
Wei-Yi Cheng and Franziska Mech
Roche Pharma Research and Early Development (pRED)
Informatics, Data Science
Roche Innovation Center New York / Munich
2. Disclaimer
• This presentation is …
– NOT about computer vision / image processing
– NOT about drugs, biology, or biochemistry
– NOT about new algorithms or infrastructure
3. Disclaimer
• This presentation is …
– An application of Spark to spatial analysis of biomedical images
– A proof of concept of a small module in a complex pipeline
– A work in progress
4. Towards a systematic characterization of tumor context
Challenges:
• Location and density of different immune cell populations are associated with patient prognosis and outcome prediction
• Huge variation in immune infiltrates across tumor entities and patients
• Inconsistent data in the literature
Nat Rev Drug Discov. 2015;14(9):603-22.
5. Roche pRED tissue image analysis workflow
[Workflow diagram: whole slide image analysis → in silico multiplexing → spatial profiling of imaging results]
• T cell location and subtype
• Vessel location, class assignment, and object features
• Objects in same reference coordinate system
• >10,000 slides per year (~6 slides per block)
• >100,000 objects per slide
7. POC data set
• 57 “blocks” of 3 cancer types
• Each block contains 6 or 7 slides:
– histology stain
– cancer biomarker
– microenvironment biomarker
• Each object annotated with object type (T cell, lymph vessels, etc.), shape (point / polygon), and coordinates
8. Distance calculation
• Basic statistic for distribution, co-localization, spatial clustering, etc.
• Distance: shortest distance between two contours
• Total number of pairwise distances: 5.3 trillion (10¹²) pairs – computationally prohibitive
• Workaround: only calculate the distance between each object and its “neighbors” within a window of radius r
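The neighborhood workaround can be sketched with a fixed-grid partition in plain Python. This is illustrative only: point coordinates stand in for the contour-to-contour distances used in the talk, and the function and cell size below are assumptions, not SpatialSpark's API:

```python
import math
from collections import defaultdict

def neighbor_distances(points, r):
    """Pairwise distances only between points within radius r.

    Points are bucketed into grid cells of side r, so each point only
    needs to be compared against points in its own and adjacent cells.
    Returns {(i, j): distance} for i < j with distance <= r.
    """
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // r), int(y // r))].append(i)
    out = {}
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            (x1, y1), (x2, y2) = points[i], points[j]
                            d = math.hypot(x1 - x2, y1 - y2)
                            if d <= r:
                                out[(i, j)] = d
    return out
```

Because distant pairs are never even enumerated, the work grows roughly linearly with the number of objects (for bounded local density) instead of quadratically, which is what makes the 5.3 trillion-pair problem tractable.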
9. Spatial Spark
• An open-source library developed by Dr. Simin You and Prof. Jianting Zhang from CUNY
• Divide-and-conquer, following a design similar to Hadoop-GIS
• Supports multiple spatial partitioning methods: sort-tile partition, binary-split partition, fixed-grid partition
http://simin.me/projects/spatialspark/
11. Distance calculation: run time
• Using 18 cores, 4 threads / core (72× parallelization)
• Radius = 232.5 μm
• Under a shared test environment
• Execution time is roughly linear in the number of objects (matching the theoretical time complexity)
14. Spatial clustering of objects: DBSCAN
• Density-based spatial clustering of applications with noise (DBSCAN) groups closely connected points into the same cluster, marking high-density regions
• Parameters:
– minPts: minimum number of neighbors for a core point
– eps: maximum distance to define a neighbor
• Example (minPts = 3): A is a core point; B and C are reachable; N is an outlier
https://en.wikipedia.org/wiki/DBSCAN
http://scikit-learn.org/stable/modules/clustering.html
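The procedure above can be sketched in plain Python. This is a minimal O(n²) illustration of the algorithm, not the pipeline's implementation, which clusters on the distances precomputed with Spark (and production code would typically use e.g. scikit-learn's DBSCAN):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch. Returns a label per point:
    a cluster id >= 0, or -1 for noise."""
    n = len(points)
    dist = lambda a, b: math.hypot(points[a][0] - points[b][0],
                                   points[a][1] - points[b][1])
    # Brute-force neighborhoods; real pipelines use precomputed distances.
    neighbors = [[j for j in range(n) if j != i and dist(i, j) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) + 1 < min_pts:   # count includes the point itself
            labels[i] = -1                    # tentatively noise
            continue
        labels[i] = cluster                   # start a new cluster from a core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) + 1 >= min_pts:  # expand only from core points
                queue.extend(neighbors[j])
        cluster += 1
    return labels
```

With distances already materialized (e.g. from Parquet), the neighborhood lookup becomes a simple filter, which is why the clustering step scales linearly with the number of objects.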
15. DBSCAN: run time
• Distances loaded directly from Parquet
• Run time is linear in the number of objects (given already-calculated distances)
16. Future work
• Clinical information
• Genomic data
• Scale up and upstream integration
• UI integration
17. Acknowledgement
• Spark exploratory / support
– Sittichoke Saisanit (Roche pREDi)
– Xing Yang (Roche pREDi)
– Padmanabha Udupa (Roche pREDi)
– Ivan San Antonio Martinez (Roche IDW)
– Zayed Albertyn (Novocraft)
• Tumor image spatial analysis research
– Angelika Fuchs (Roche pREDi)
– Gerlind Herberich (Roche pREDi)
– Jurriaan Brouwer (Roche pREDi)
• Spatial Spark
– Jianting Zhang (CUNY)
– Simin You (CUNY)