Digital biomarkers for preventive personalised healthcare - Paolo Missier
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data... - Paolo Missier
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
ReComp and P4@NU: Reproducible Data Science for Health - Paolo Missier
A brief overview of the ReComp project (http://recomp.org.uk) on selective recurring re-computation of complex analytics, and a brief outlook for the P4@NU project on seeking digital biomarkers for age-related metabolic diseases.
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter Users - Paolo Missier
Talk for the paper published at ICWE 2019:
Primo F, Missier P, Romanovsky A, Mickael F, Cacho N. A customisable pipeline for continuously harvesting socially-minded Twitter users. In: Procs. ICWE'19. Daejeon, Korea; 2019.
This document introduces digital biomarkers and their use in image classification algorithms. It discusses how digital biomarkers are extracted from images as quantifiable features and optimized to develop multivariate classifiers. The document outlines Contiguity's approach, which extracts obvious and non-obvious features to generate digital biomarkers from histology images. These biomarkers are optimized and combined in classification algorithms. Contiguity applied this method to the CAMELYON16 Grand Challenge dataset, analyzing lymph node images to detect cancer metastases through sampling, filtering, and decision tree classification.
Deep learning for biomedical discovery and data mining II - Deakin University
(1) The document discusses deep learning techniques for analyzing biomedical data from electronic medical records (EMRs).
(2) It describes models like DeepPatient that use autoencoders to learn representations of patient records that can predict diseases.
(3) Other models like Deepr and DeepCare use convolutional and recurrent neural networks to model temporal patterns in EMRs and predict future health risks and care trajectories.
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op... - IRJET Journal
The document proposes an improved model for big data analytics using dynamic multi-swarm optimization and unsupervised learning algorithms. It develops an algorithm called DynamicK-reference Clustering that combines dynamic multi-swarm optimization with a k-reference clustering algorithm. The k-reference clustering algorithm uses reference distance weighting, Euclidean distance, and chi-square relative frequency to cluster mixed datasets. It was tested on several datasets from a machine learning repository and was shown to more efficiently cluster large, mixed datasets than other clustering algorithms like k-means and particle swarm optimization. The dynamic multi-swarm optimization helps guide the clustering algorithm to obtain more accurate cluster formations by providing the best initial value of k clusters.
An Extensive Review on Generative Adversarial Networks GAN's - ijtsrd
This paper provides a high-level understanding of Generative Adversarial Networks. It covers how GANs work by explaining the background idea of the framework, the types of GANs in the industry, their advantages and disadvantages, the history of how GANs have been developed and enhanced over time, and some applications where GANs excel. Atharva Chitnavis | Yogeshchandra Puranik "An Extensive Review on Generative Adversarial Networks (GAN's)" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42357.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42357/an-extensive-review-on-generative-adversarial-networks-gan’s/atharva-chitnavis
This document is a 36-page bachelor's thesis written by Duc Minh Luong Nguyen titled "Detect COVID-19 from Chest X-Ray images using Deep Learning". The thesis was submitted to Metropolia University of Applied Sciences in May 2020. It aims to build a deep convolutional neural network to detect COVID-19 using only chest X-ray images. The model achieves an accuracy of 93% at detecting COVID-19 patients versus healthy patients, despite being trained on a small dataset of 115 images for each class.
The document discusses using deep learning models to analyze episodic healthcare data and make predictions. It proposes:
1) Viewing healthcare processes as executable computer programs with hidden "grammars" that can be learned from observational data.
2) Modeling health dynamics as a system of state transitions where treatments shift illness states, and historical events' importance is person-specific.
3) Training models by minimizing prediction loss to forecast outcomes like readmission, mortality, and disease progression based on patients' diseases, treatments, and visits over time.
IRJET- Classifying Chest Pathology Images using Deep Learning Techniques - IRJET Journal
This document discusses classifying chest pathology images using deep learning techniques. It explores using pre-trained convolutional neural networks (CNNs) to classify chest radiograph images as either healthy or pathological, and to identify specific pathologies. The document reviews previous work on applying deep learning to medical image analysis. It then proposes using features extracted from pre-trained CNN models to classify chest radiographs, focusing on classifying images as healthy vs. pathological as an important screening task. The strengths of deep learning approaches for analyzing various chest diseases are explored.
IRJET- Prediction of Heart Disease using RNN Algorithm - IRJET Journal
This document discusses using a recurrent neural network (RNN) algorithm to predict heart disease. It proposes a method called prognosis prediction using RNN (PP-RNN) that uses multiple RNNs to learn from patient diagnosis code sequences in order to predict high-risk diseases. The experimental results show that the proposed PP-RNN method can achieve more accurate results than existing methods for predicting heart disease risk. It also provides background on related works using other techniques like decision trees, clustering, and AdaBoost for heart disease prediction.
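As a rough illustration of this kind of model (not the authors' PP-RNN implementation), the sketch below shows a recurrent network over diagnosis-code sequences producing a disease-risk score in PyTorch; the vocabulary size, dimensions and the toy batch are invented for the example.

```python
# Hedged sketch: a GRU over diagnosis-code sequences predicting disease risk.
# Vocabulary size, dimensions and the toy sequences are illustrative only.
import torch
import torch.nn as nn

class DiagnosisRNN(nn.Module):
    def __init__(self, n_codes=500, embed_dim=64, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_codes, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, code_seqs):              # code_seqs: (batch, seq_len) of code ids
        x = self.embed(code_seqs)              # (batch, seq_len, embed_dim)
        _, h = self.rnn(x)                     # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))         # (batch, n_classes) risk logits

model = DiagnosisRNN()
toy_batch = torch.randint(1, 500, (8, 20))     # 8 patients, 20 diagnosis codes each
logits = model(toy_batch)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()
```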
DATA AUGMENTATION TECHNIQUES AND TRANSFER LEARNING APPROACHES APPLIED TO FACI... - ijaia
The facial expression is the first thing we pay attention to when we want to understand a person’s state of mind, so the ability to recognize facial expressions automatically is a very interesting research field. In this paper, because of the small size of available training datasets, we propose a novel data augmentation technique that improves performance on the recognition task. We apply geometrical transformations and build GAN models from scratch that are able to generate new synthetic images for each emotion type. We then fine-tune pretrained convolutional neural networks with different architectures on the augmented datasets. To measure the generalization ability of the models, we apply an extra-database protocol, namely we train the models on the augmented versions of the training dataset and test them on two different databases. The combination of these techniques reaches average accuracy values of around 85% for the InceptionResNetV2 model.
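As a hedged sketch of the general recipe (geometric augmentation plus fine-tuning a pretrained CNN), the example below uses torchvision and a ResNet-18 in place of the paper's InceptionResNetV2 and its GAN-generated images; the dataset folder, class count and hyperparameters are placeholders.

```python
# Hedged sketch: geometric augmentation plus fine-tuning a pretrained CNN.
# The dataset path, class count and hyperparameters are placeholders; the
# paper's GAN-based augmentation is not reproduced here.
import torch, torch.nn as nn
from torchvision import datasets, transforms, models

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),                    # geometric transformations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("faces/train", transform=augment)  # hypothetical folder
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 7)         # e.g. 7 emotion classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```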
The document provides background information on machine learning and discusses its application to predicting COVID-19. It outlines the objectives of developing a machine learning model to predict whether a patient has COVID-19 based on their clinical information and identifying influential features. The document describes conducting a literature review and experiment to determine the most suitable machine learning techniques and influential features. It also defines the scope of the thesis and provides an outline of the following chapters.
A BINARY BAT INSPIRED ALGORITHM FOR THE CLASSIFICATION OF BREAST CANCER DATA - IJSCAI Journal
Advances in information technology have made a major impact on medical science, where researchers come up with new ideas for improving the classification rate of various diseases. Breast cancer is one such disease, killing a large number of people around the world, and diagnosing it at the earliest instance makes a huge impact on its treatment. The authors propose a Binary Bat Algorithm (BBA) based Feedforward Neural Network (FNN) hybrid model, where the advantages of BBA and the efficiency of FNN are exploited to classify three benchmark breast cancer datasets into malignant and benign cases. Here BBA is used with a V-shaped hyperbolic tangent function for training the network, and a fitness function is used for error minimization. FNNBBA based classification produces 92.61% accuracy on training data and 89.95% on testing data.
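The binary update at the heart of such V-shaped binary bat optimisers can be sketched as follows; this is an illustrative stand-in, with a placeholder fitness function rather than the paper's FNN error.

```python
# Hedged sketch of the binary update step used in Binary Bat style optimisers:
# a V-shaped hyperbolic-tangent transfer function maps a bat's velocity to a
# probability of flipping each bit. The fitness function here is a placeholder.
import numpy as np

rng = np.random.default_rng(0)

def v_transfer(v):
    """V-shaped transfer function: |tanh(v)| in [0, 1]."""
    return np.abs(np.tanh(v))

def binary_bat_step(positions, velocities, best, f_min=0.0, f_max=2.0):
    n_bats, dim = positions.shape
    freq = f_min + (f_max - f_min) * rng.random((n_bats, 1))   # per-bat frequency
    velocities = velocities + (positions - best) * freq        # velocity update
    flip = rng.random((n_bats, dim)) < v_transfer(velocities)  # V-shaped rule
    positions = np.where(flip, 1 - positions, positions)       # flip selected bits
    return positions, velocities

# Toy use: optimise a 10-bit mask; fitness = number of ones (placeholder for FNN error).
pos = rng.integers(0, 2, (5, 10)).astype(float)
vel = np.zeros_like(pos)
best = pos[pos.sum(axis=1).argmax()].copy()
for _ in range(20):
    pos, vel = binary_bat_step(pos, vel, best)
    fitness = pos.sum(axis=1)
    if fitness.max() > best.sum():
        best = pos[fitness.argmax()].copy()
print("best mask:", best)
```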
REVIEWING PROCESS MINING APPLICATIONS AND TECHNIQUES IN EDUCATION - ijaia
Process Mining (PM) emerged from business process management but has recently been applied to educational data and has been found to facilitate the understanding of the educational process. Educational Process Mining (EPM) bridges the gap between process analysis and data analysis, based on the techniques of model discovery, conformance checking and extension of existing process models. We present a systematic review of the recent and current status of research in the EPM domain, focusing on application domains, techniques, tools and models, to highlight the use of EPM in comprehending and improving educational processes.
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR ML - ijaia
Given the impact of Machine Learning (ML) on individuals and society, understanding how harm might occur throughout the ML life cycle becomes more critical than ever. By offering a framework to determine distinct potential sources of downstream harm in the ML pipeline, the paper demonstrates the importance of choices throughout the distinct phases of data collection, development, and deployment that extend far beyond just model training. Relevant mitigation techniques are also suggested, rather than merely relying on generic notions of what counts as fairness.
Intelligent data analysis for medicinal diagnosis - IRJET Journal
The document describes a proposed privacy-preserving patient-centric clinical decision support system called PPCD that uses naive Bayesian classification to help doctors predict disease risks for patients in a privacy-preserving manner. PPCD allows medical diagnosis and prediction of disease risks for new patients without leaking any individual patient medical information. It utilizes historical medical information from past patients, stored privately in the cloud, to train a naive Bayesian classifier. This trained classifier can then be used to diagnose diseases for new patients based on their symptoms while preserving privacy. The system also introduces a new aggregation technique called additive homomorphic proxy aggregation to allow training of the naive Bayesian classifier without revealing individual patient medical records.
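A minimal sketch of the statistical core only is shown below: per-class symptom counts are aggregated additively (the quantity that, in PPCD, would be summed under additive homomorphic encryption) and then used for naive Bayes risk prediction; the toy records are invented and no encryption is performed here.

```python
# Hedged sketch of the statistical core only: naive Bayes risk prediction from
# additively aggregated per-class symptom counts. In PPCD these sums would be
# computed under additive homomorphic encryption; plain integers are used here.
import numpy as np

# Toy historical records: rows = patients, columns = binary symptoms, y = disease.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 1, 0]])
y = np.array([1, 1, 1, 0, 0])

def aggregate_counts(X, y, n_classes=2):
    """Per-class symptom counts and class totals (additive statistics, so they
    could be summed encrypted across contributors and decrypted in aggregate)."""
    counts = np.array([X[y == c].sum(axis=0) for c in range(n_classes)])
    totals = np.array([(y == c).sum() for c in range(n_classes)])
    return counts, totals

def predict(symptoms, counts, totals, alpha=1.0):
    """Naive Bayes posterior over disease classes with Laplace smoothing."""
    priors = (totals + alpha) / (totals.sum() + alpha * len(totals))
    theta = (counts + alpha) / (totals[:, None] + 2 * alpha)
    lik = np.prod(np.where(symptoms, theta, 1 - theta), axis=1)
    post = priors * lik
    return post / post.sum()

counts, totals = aggregate_counts(X, y)
print(predict(np.array([1, 0, 1]), counts, totals))   # risk for a new patient
```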
This talk will cover various medical applications of deep learning including tumor segmentation in histology slides, MRI, CT, and X-Ray data. Also, more complicated tasks such as cell counting where the challenge is to count how many objects are in an image. It will also cover generative adversarial networks and how they can be used for medical applications. This presentation is accessible to non-doctors and non-computer scientists.
This document provides information about a computational intelligence and soft computing course including the instructor's contact information, class times, required text, and an overview of upcoming lectures on data mining with neural networks. It then discusses key issues in data mining such as theory, methods/algorithms, processes, applications, and tools/techniques. Several example data mining projects are also summarized along with homework and exam due dates for the course.
This thesis aims to develop deep learning models to detect COVID-19 pneumonia in chest X-ray images. The author trains two models: 1) A binary classifier to distinguish COVID-19 pneumonia from non-COVID cases, which classifies all test cases correctly. 2) A four-class classifier to identify COVID-19, viral pneumonia, bacterial pneumonia, and normal cases, which achieves an average accuracy of 93% on the test set. Gradient-weighted Class Activation Mapping is used to interpret the four-class model and finds it can focus on patchy areas characteristic of COVID-19 pneumonia to make accurate predictions.
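A minimal Grad-CAM sketch is shown below to illustrate the interpretation step; it uses a stock ResNet-18 and a random tensor rather than the thesis's chest X-ray model and data.

```python
# Hedged sketch of Gradient-weighted Class Activation Mapping (Grad-CAM) using
# forward/backward hooks on a pretrained CNN; the chest X-ray model and image
# are replaced by a stock ResNet-18 and a random tensor for illustration.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["maps"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["maps"] = grad_out[0].detach()

layer = model.layer4                       # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed X-ray
scores = model(image)
scores[0, scores.argmax()].backward()      # gradient of the predicted class score

weights = gradients["maps"].mean(dim=(2, 3), keepdim=True)    # global-avg-pool grads
cam = torch.relu((weights * activations["maps"]).sum(dim=1))  # weighted sum of maps
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalise to [0, 1]
print(cam.shape)                           # (1, 7, 7) heatmap to upsample over the image
```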
University at Buffalo’s Center for Computational Research - AllineaSoftware
Creating a holistic geoscientific model is complicated enough. So when scientists have to debug their computer code, they turn to Allinea DDT, a tool easy enough for undergraduates to use.
“People were impressed with the results Christine achieved using Allinea DDT and assumed she was a highly educated computer science technician; whereas, at the time, I think she might have taken just one ‘intro to computer science’ course.” – Dr. Shawn Matott, computational scientist, University at Buffalo’s Center for Computational Research.
Read more at http://www.allinea.com/case-studies/
GASCAN: A Novel Database for Gastric Cancer Genes and Primers - ijdmtaiir
GasCan is a specialized database of gastric cancer protein-encoding genes expressed in human and mouse. The features that make GasCan unique are the availability of gene information and of primers for each gene, with the features and conditions useful in PCR amplification, especially in cloning experiments. A built-in sequence analysis facility analyses gene sequences within the database itself, and the resulting information can be valuable for researchers in different experiments. Furthermore, a DNA sequence analysis tool is provided that can be accessed freely. GasCan will expand in the future to other species and genes and cover more useful information about other species. Flexible database design, expandability and easy access to information for all users are the main features of the database. The database is publicly available at http://www.gastric-cancer.site40.net.
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med - Sri Ambati
Machine learning and AI company H2O.ai presented on machine learning applications in modern medicine. They discussed how electronic health records, genomics, wearables, and other data sources can be used with machine learning for personalized healthcare, disease prediction and prevention. H2O's software platform allows building models at scale from large datasets using algorithms like random forests, deep learning and ensembles. Demonstrations showed predicting HIV treatment failure and classifying breast cancer malignancy from medical images, achieving high accuracy. H2O aims to make machine learning accessible and scalable for improving medical research and care.
This document summarizes an application of data mining techniques to analyze customer data. It discusses using decision trees to model customer response to marketing campaigns. Decision trees partition customers into groups based on attributes like income and age to predict their response rates to mailings. Groups with a response rate over 3.5% would be targeted for direct marketing. Decision trees provide a flexible yet simple model for segmentation and targeting of customers.
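A hedged sketch of that approach follows: fit a decision tree on attributes such as income and age, treat each leaf as a segment, and target the segments whose observed response rate exceeds 3.5%; the data here is synthetic.

```python
# Hedged sketch of decision-tree segmentation for direct marketing.
# Synthetic data; the 3.5% targeting threshold comes from the description above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
income = rng.normal(40_000, 15_000, 1_000)
age = rng.integers(18, 80, 1_000)
X = np.column_stack([income, age])
responded = (rng.random(1_000) < 0.02 + 0.0000005 * income).astype(int)  # toy response

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50).fit(X, responded)

# Each leaf is a customer segment; mail only to segments with response rate > 3.5%.
leaf_ids = tree.apply(X)
for leaf in np.unique(leaf_ids):
    rate = responded[leaf_ids == leaf].mean()
    if rate > 0.035:
        print(f"target segment (leaf {leaf}): response rate {rate:.1%}")
```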
Introduction to Data and Computation: Essential capabilities for everyone in ... - Kim Flintoff
An overview seminar about the themes of the Curtin Institute for Computation, and some thoughts on the future role of these capabilities in Learning and Teaching.
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
Analysing a Complex Agent-Based Model Using Data-Mining Techniques - Bruce Edmonds
A talk given at "Social Simulation 2014" at Barcelona in September.
A complex “Data Integration Model” of voter behaviour is described. However it is very complex and hard to analyse. For such a model “thin” samples of the outcomes using classic parameter sweeps are inadequate. In order to get a more holistic picture of its behaviour data- mining techniques are applied to the data generated by many runs of the model, each with randomised parameter values.
Paper is at: http://cfpm.org/aacabm/analysing a complex model-v3.4.pdf
The document discusses various challenges in social network analysis including collecting and extracting network data at scale from sources such as the web, validating automated data extraction methods, and developing algorithms and software that can analyze large and complex network datasets. It also outlines different network analysis methods, visualization and simulation techniques, and recommendations for how tools can better support networking, referrals, and workflows across multiple data sources and programs. Scaling methods and algorithms to very large network sizes and developing standards to integrate diverse data and tools are highlighted as key challenges.
The document summarizes Anita de Waard's presentation on Elsevier's experiments with big and small data. It discusses Elsevier's work with text mining and knowledge graphs to extract information from over 14 million articles. It also describes Elsevier's Medical Graph which predicts the probability of over 2,000 medical conditions occurring based on analysis of clinical data from 6 million patients. Finally, it reviews Elsevier's various tools and services to help researchers preserve, process, share, comprehend, access, and discover research data and publications.
REPRESENTATION OF UNCERTAIN DATA USING POSSIBILISTIC NETWORK MODELS - cscpconf
Uncertainty is pervasive in real-world environments: vagueness is associated with the difficulty of making sharp distinctions, and ambiguity is associated with situations in which the choice among several precise alternatives cannot be perfectly resolved. Analysing large collections of uncertain data is a primary task in real-world applications, because data is incomplete, inaccurate and inefficient. Uncertain data can be represented in various forms, such as data stream models, linkage models and graphical models, which offer a simple, natural way to process the data and produce optimized results through query processing. In this paper, we propose that the uncertain data model can be represented as a possibilistic data model and vice versa, using data models such as the possibilistic linkage model, data streams and possibilistic graphs. The paper presents the representation and processing of the possibilistic linkage model through possible worlds with the use of a product-based operator.
1) The document discusses the differences between explanatory and predictive modeling in scientific research.
2) Explanatory models are used to test causal theories, while predictive models are used to predict new records or scenarios.
3) Explanatory power and predictive accuracy are different and one cannot be inferred from the other. The best explanatory model is often not the best predictive model and vice versa.
Get hands-on with Explainable AI at Machine Learning Interpretability (MLI) Gym! - Sri Ambati
This meetup took place in Mountain View on January 24th, 2019.
Description:
With the effort and contributions from researchers and practitioners in academia and industry, Machine Learning Interpretation has become a young sub-field of ML. However, the norms around its definition and understanding are still in their infancy, and numerous different approaches are emerging rapidly. There also seems to be a lack of a consistent explanation framework to evaluate and consistently benchmark different algorithms - evaluating them against interpretation, completeness and consistency.
The idea with the gym is to provide a controlled interactive environment for all forms of Machine Learning algorithms - initially focusing on supervised predictive modeling problems - to allow analysts and data scientists to explore, debug and generate insightful understanding of the models by:
1. Model Validation: ways to explore and validate black-box ML systems, enabling model comparison both globally and locally, and identifying biases in the training data through interpretation.
2. What-if Analysis: an interactive environment where communication can happen, i.e. learning through interactions, with the user able to conduct "what-if" analysis on the effect of single or multiple features and their interactions.
3. Model Debugging: ways to analyze the misbehavior of the model by exploring counterfactual examples (adversarial examples and training).
4. Interpretable Models: the ability to build natively interpretable models, with the goal of simplifying complex models to enable better understanding.
The central concept of the MLI gym is to have an interactive environment where one can explore and simulate variations in the world (the world after a model is operationalized) beyond the defined point-estimate model metrics, e.g. ROC-AUC, confusion matrix, RMSE, R2 score and others.
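As a small, hedged illustration of the "what-if" idea (using scikit-learn as a stand-in, not the MLI Gym itself), one can vary a single feature over a grid while holding the rest of a reference row fixed and watch the prediction move:

```python
# Hedged sketch of a simple "what-if" probe: vary one feature over a grid while
# holding a reference row fixed and observe the model's prediction. The model
# and dataset are stand-ins, not the MLI Gym API.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = GradientBoostingClassifier(random_state=0).fit(X, y)

feature = list(data.feature_names).index("mean radius")
row = X[0].copy()
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 10)

for value in grid:
    probe = row.copy()
    probe[feature] = value                       # the "what if" intervention
    p = model.predict_proba(probe.reshape(1, -1))[0, 1]
    print(f"mean radius = {value:6.2f} -> P(benign) = {p:.3f}")
```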
Speaker's Bio:
Pramit is a Lead Data Scientist at H2O.ai. His area of interest is building statistical/machine learning models (Bayesian and frequentist modeling techniques) to help businesses realize their data-driven goals.
Currently, he is exploring "Model Interpretation" as a means to efficiently understand the true nature of predictive models, to enable model robustness and security. He believes effective model inference coupled with adversarial training could lead to building trustworthy models with known blind spots. He has started an open source project, Skater: https://github.com/datascienceinc/Skater, to address the need for model inference (the project is still in its early stages of development, but check it out; he is always eager for feedback).
The document summarizes five papers that address challenges in context-aware recommendation systems using factorization methods. Three key challenges are high dimensionality, data sparsity, and cold starts. The papers propose various algorithms using matrix factorization and tensor factorization to address these challenges. COT models each context as an operation on user-item pairs to reduce dimensionality. Another approach extracts latent contexts from sensor data using deep learning and matrix factorization. CSLIM extends the SLIM algorithm to incorporate contextual ratings. TAPER uses tensor factorization to integrate various contexts for expert recommendations. Finally, GFF provides a generalized factorization framework to handle different recommendation models. The document analyzes how well each paper meets the challenges.
Simplified Knowledge Prediction: Application of Machine Learning in Real Life - Peea Bal Chakraborty
Machine learning is the scientific study of algorithms and statistical models used by machines to perform a specific task based on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict whether a patient suspected of breast cancer has malignant or benign cancer. In this paper, the classification of cancer type and the prediction of risk levels are done with various machine learning models and are pictorially depicted with various visual analytics tools.
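A minimal sketch of that prediction task, using the Wisconsin breast cancer dataset bundled with scikit-learn rather than the paper's own data and models, might look like this:

```python
# Hedged sketch of malignant/benign prediction on the scikit-learn breast
# cancer dataset; the model choice and "risk level" definition are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)           # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Risk level as the predicted probability of malignancy for each test patient.
risk = clf.predict_proba(X_te)[:, 0]
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["malignant", "benign"]))
print("highest-risk patients:", risk.argsort()[::-1][:5])
```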
Data Science: Origins, Methods, Challenges and the future? - Cagatay Turkay
Slides for my talk at City Unrulyversity on 18.03.15 in London. I discuss the term Data Science, touch upon its origins and the types of data scientist, and give a longer discussion of the Data Science process and the challenges analysts face.
And here is the abstract of the talk:
Data Science ... the term is everywhere now, on the news, recruitment sites, technology boards. "Data scientist" is even named to be sexiest job title of the century. But what is it, really? Is it just a hype or a term that will be with us for some time?
This session will investigate where the term is originating from and how it relates to decades of research in established fields such as statistics, data mining, visualisation and machine learning. We will investigate how the field is evolving with the emergence of large, heterogeneous data resources. We will discuss the objectives, tools and challenges of data science as a practice, and look at examples from research and industrial applications.
Challenges and opportunities for machine learning in biomedical research - FranciscoJAzuajeG
1. Machine learning faces challenges in biomedical research due to data heterogeneity, lack of labeled data, and complexity in biological patterns and networks.
2. Combining machine learning and biological network models can help address these challenges by encoding data in biologically meaningful networks and extracting network-based features for prediction.
3. Examples applying this approach to cancer datasets showed that models based on network centrality features outperformed other methods, and deep learning using these features achieved the best prediction performance across multiple neuroblastoma datasets.
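A hedged sketch of the network-based feature idea is shown below: centrality measures computed with networkx feed a standard classifier; the random graph and labels are toy stand-ins for a real gene interaction network and phenotype.

```python
# Hedged sketch: encode each node (e.g. a gene) by network centrality features
# and train a classifier on them. Graph and labels are toy stand-ins.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

graph = nx.erdos_renyi_graph(200, 0.05, seed=0)        # stand-in for a gene network

degree = nx.degree_centrality(graph)
betweenness = nx.betweenness_centrality(graph)
closeness = nx.closeness_centrality(graph)

nodes = list(graph.nodes)
X = np.array([[degree[n], betweenness[n], closeness[n]] for n in nodes])
y = np.random.default_rng(0).integers(0, 2, len(nodes))  # toy phenotype labels

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```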
Comprehensive Survey of Data Classification & Prediction Techniques - ijsrd.com
In this paper, we present a literature survey of modern data classification and prediction algorithms. All of these algorithms are very important in real-world applications such as heart disease prediction and cancer prediction. Classification of data is a very popular and computationally expensive task. The fundamentals of data classification are also discussed in brief.
1. The document discusses model interpretation and techniques for interpreting machine learning models, especially deep neural networks.
2. It describes what model interpretation is, its importance and benefits, and provides examples of interpretability algorithms like dimensionality reduction, manifold learning, and visualization techniques.
3. The document aims to help make machine learning models more transparent and understandable to humans in order to build trust and improve model evaluation, debugging and feature engineering.
The document discusses the increasing scale and complexity of knowledge generation in science domains like astronomy and medicine over recent centuries. It argues that knowledge generation can be viewed as a systems problem involving many actors and processes. The document proposes a service-oriented approach using web services as an integrating framework to address challenges of scale, complexity, and distributed collaboration in e-Science. Key challenges discussed include semantics, documentation, scaling issues, and sociological factors like incentives.
This document discusses data mining techniques used in finance, including predicting stock prices and evaluating credit risk. It describes common data mining methods like decision trees, neural networks, and genetic algorithms. It also discusses applications like predicting stock price movements using techniques like time series analysis, neural networks, genetic algorithms, and hybrid models. The document notes that financial time series are complex and nonlinear, so artificial intelligence techniques often provide more accurate predictions than traditional regression models.
On Tuesday 18 September 2007, Ben Shneiderman gave a talk at the Centre for HCI Design, City University London, on the topic of information visualisation for high-dimensional spaces. Over 100 people from industry and academia attended the talk.
http://hcid.soi.cty.ac.uk/
Data science is an area at the interface of statistics, computer science, and mathematics.
• Statisticians contributed a large inferential framework, important Bayesian perspectives, the bootstrap, CART and random forests, and the concepts of sparsity and parsimony.
• Computer scientists contributed an appetite for big, challenging problems. They also pioneered neural networks, boosting, and PAC bounds, and developed frameworks such as Spark and Hadoop for handling Big Data.
• Mathematicians contributed support vector machines, modern optimization, tensor analysis, and (maybe) topological data analysis.
Similar to algorithmic-decisions, fairness, machine learning, provenance, transparency (20)
Design and Development of a Provenance Capture Platform for Data Science - Paolo Missier
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records - Paolo Missier
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea... - Paolo Missier
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
please see paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit... - Paolo Missier
A keynote talk given to the IDEAL 2023 conference (Évora, Portugal, Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance are in fact in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... - Paolo Missier
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) - Paolo Missier
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview - Paolo Missier
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in ... - Paolo Missier
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
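A minimal sketch of the capture idea (not the DPDS implementation) is to compare a dataframe before and after each operation and record what was added, removed or changed; the helper name and the toy pipeline step below are invented for illustration.

```python
# Hedged sketch of dataframe-level provenance capture by input/output comparison.
# Not the DPDS tool: a toy "capture_provenance" helper over pandas dataframes.
import pandas as pd

def capture_provenance(op_name, df_in, df_out):
    entry = {
        "operation": op_name,
        "cols_added": sorted(set(df_out.columns) - set(df_in.columns)),
        "cols_removed": sorted(set(df_in.columns) - set(df_out.columns)),
        "rows_removed": sorted(set(df_in.index) - set(df_out.index)),
        "rows_added": sorted(set(df_out.index) - set(df_in.index)),
    }
    common_cols = df_in.columns.intersection(df_out.columns)
    common_rows = df_in.index.intersection(df_out.index)
    changed = (df_in.loc[common_rows, common_cols]
               != df_out.loc[common_rows, common_cols])
    entry["cells_changed"] = int(changed.to_numpy().sum())
    return entry

df = pd.DataFrame({"age": [34, None, 51], "sex": ["F", "M", "F"]})
cleaned = df.dropna()                                   # a preprocessing step
log = [capture_provenance("dropna", df, cleaned)]
print(log[0])
```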
Tracking trajectories of multiple long-term conditions using dynamic patient... - Paolo Missier
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Capturing and querying fine-grained provenance of preprocessing pipelines in ... - Paolo Missier
A talk given at the VLDB 2021 conference (August 2021), presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Analytics of analytics pipelines: from optimising re-execution to general Dat... - Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha... - Paolo Missier
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
This document discusses efficient re-computation of big data analytics processes when changes occur. It presents the ReComp framework which uses process execution history and provenance to selectively re-execute only the relevant parts of a process that are impacted by changes, rather than fully re-executing the entire process from scratch. This approach estimates the impact of changes using type-specific difference functions and impact estimation functions. It then identifies the minimal subset of process fragments that need to be re-executed based on change impact analysis and provenance traces. The framework is able to efficiently re-compute complex processes like genomics analytics workflows in response to changes in reference databases or other dependencies.
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Paolo Missier
a talk given at the 2nd IEEE Blockchain conference, Atlanta, US, July 2019.
here is the paper: http://homepages.cs.ncl.ac.uk/paolo.missier/doc/Decentralised_Marketplace_USA_Conference___Accepted_Version_.pdf
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
This document discusses an efficient framework called ReComp for re-computing big data analytics processes when inputs or algorithms change. ReComp uses fine-grained process provenance and execution history to estimate the impact of changes and selectively re-execute only affected parts. This can provide significant time savings over fully re-running processes from scratch. The framework was tested on two case studies: genomic variant analysis (SVI tool) and simulation modeling, demonstrating savings of 28-37% compared to complete re-execution. ReComp provides a generic approach but allows customization for specific processes and change types.
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Paolo Missier
A paper presented at the annual Italian Database conference (SEBD): http://sisinflab.poliba.it/sebd/2018/
here is the paper: http://sisinflab.poliba.it/sebd/2018/papers/June-27-Wednesday/1-Big-Data/SEBD_2018_paper_23.pdf
1. Paolo Missier
School of Computing
Newcastle University
Supporting Algorithm Accountability using Provenance
A ProvenanceWeek 2018 workshop
London, July 12th, 2018
Transparency and fairness of predictive models, and the provenance of the data used to build them: thoughts and challenges
2. 2
One of my favourite books
How much of Big Data is My Data?
Is Data the problem?
Or the algorithms?
Or how much we trust them?
Is there a problem at all?
3. 3
What matters?
Decisions made based on algorithmically-generated knowledge:
• automatically filtering job applicants
• approving loans or other credit
• approving access to benefits schemes
• predicting insurance risk levels
• user profiling for policing purposes and to predict risk of criminal recidivism
• identifying health risk factors
• …
4. 4
GDPR and algorithmic decision making
Article 22: Automated individual decision-making, including profiling, paragraph 1 (see figure 1) prohibits any "decision based solely on automated processing, including profiling" which "significantly affects" a data subject.
It stands to reason that an algorithm can only be explained if the trained model can be articulated and understood by a human.
It is reasonable to suppose that any adequate explanation would provide an account of how input features relate to predictions:
- Is the model more or less likely to recommend a loan if the applicant is a minority?
- Which features play the largest role in prediction?
B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," Proc. 2016 ICML Work. Hum. Interpret. Mach. Learn. (WHI 2016), Jun. 2016.
7. 8
Interpretability (of machine learning models)
Z. C. Lipton, "The Mythos of Model Interpretability," Proc. 2016 ICML Work. Hum. Interpret. Mach. Learn. (WHI 2016), Jun. 2016.
- Transparency
  - Are features understandable?
  - Which features are more important?
- Post hoc interpretability
  - Natural language explanations
  - Visualisations of models
  - Explanations by example
    - "this tumor is classified as malignant because to the model it looks a lot like these other tumors"
W. Samek, T. Wiegand, and K.-R. Müller, "Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models," Aug. 2017.
Interpretability: ability to provide a qualitative understanding between the input variables and the response
8. 9
Black-box approaches
Model agnostic:
An explainer should be able to explain any model, and thus be model-agnostic (i.e. treat the original model as a black box)
Local fidelity:
For an explanation to be meaningful it must at least be locally faithful, i.e. it must correspond to how the model behaves in the vicinity of the instance being predicted
9. 10
Occlusion testing
M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, 2016, pp. 1135–1144.
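To make the occlusion idea concrete: slide a masking patch over the input and record how much the classifier's confidence drops; regions whose occlusion causes a large drop are the ones the model relies on. A minimal sketch, assuming a generic image classifier exposed as a `predict_proba` function (the function name, patch size and baseline value are illustrative, not from the slides):

```python
import numpy as np

def occlusion_map(predict_proba, image, target_class, patch=8, baseline=0.0):
    """Slide a square occluding patch over `image` (H x W x C) and record
    the drop in the classifier's probability for `target_class`."""
    h, w = image.shape[:2]
    p_orig = predict_proba(image[np.newaxis])[0, target_class]
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch, :] = baseline   # grey out one region
            p_occ = predict_proba(occluded[np.newaxis])[0, target_class]
            heat[i // patch, j // patch] = p_orig - p_occ      # large drop = important region
    return heat
```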
10. 11
Expected accuracy not enough for trust
SVM classifier, 94% accuracy
…but questionable!
11. 13
LIME
Model agnostic
Locally faithful: it must correspond to how the model behaves in the vicinity of the instance being predicted
M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, 2016, pp. 1135–1144.
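As an illustration of these two properties, a minimal sketch using the authors' open-source `lime` package to explain a single prediction of an arbitrary tabular classifier; the dataset and model below are placeholders, not from the talk:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# LIME treats `model` as a black box: it only needs predict_proba (model agnostic)
explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    discretize_continuous=True,
)

# Explain one instance by fitting a sparse linear model on perturbations
# sampled in its neighbourhood (local fidelity)
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(exp.as_list())   # top features with their local weights
```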
12. 14
Other model explanation approaches
[1] Lakkaraju, H., Kamar, E., Caruana, R., & Leskovec, J. (2017). Interpretable & Explorable Approximations of Black Box Models. arXiv preprint arXiv:1707.01154.
[2] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1721–1730.
1. Black Box Explanations through Transparent Approximations (BETA) [1]
• Decision Set approximation of black box models
• Fidelity + interpretability of the explanation
• Global (unlike LIME)
2. Intelligible additive models [2]
• Generalized Additive Models (GAM)
• GAMs with pairwise interactions (GA2M)
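A minimal sketch of the additive-model idea: each feature enters the model only through its own smooth shape function, so every feature's contribution can be plotted and inspected. This uses scikit-learn's spline expansion plus logistic regression as an approximation of a GAM; it is not the GA2M algorithm of Caruana et al., which additionally selects pairwise interaction terms (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

X, y = load_breast_cancer(return_X_y=True)
gam_like = make_pipeline(
    SplineTransformer(n_knots=5, degree=3),     # per-feature basis expansion (shape functions)
    LogisticRegression(max_iter=5000, C=0.5),   # additive combination of the per-feature bases
)
gam_like.fit(X, y)
print("training accuracy:", gam_like.score(X, y))
```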
13. 15
Data → Model → Predictions
[Pipeline diagram: Data collection → Raw datasets → Population data pre-processing → Instances and features → Model → Predictions about you (ranking, score, class)]
Key decisions are made during data collection:
- Where does the data come from?
- What's in the dataset?
Complementing current ML approaches to model interpretability
14. 16
Possible roles for provenance
1) Data acquisition: Provenance → Transparency → Trust
15. 17
Data → Model → Predictions
[Pipeline diagram as before: Data collection → Raw datasets → Population data pre-processing → Instances and features → Model → Predictions about you (ranking, score, class)]
Key decisions are made during:
- Data collection: where does the data come from? What's in the dataset?
- Data preparation: how was it pre-processed?
1. Can we explain these decisions?
2. Are these explanations useful?
16. 18
Explaining data preparation
Paolo Missier (Computing), Dennis Prangle (Stats)
[Pipeline diagram as before: Data collection → Raw datasets → Population data pre-processing → Instances and features → Model → Predictions (ranking, score, class)]
Data pre-processing steps:
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused / repurposed?
- What is their quality?
Pre-processing is implemented as:
- Scripts: Python / TensorFlow, Pandas, Spark
- Workflows: Knime, …
Provenance → Transparency
17. 19
Provenance for transparency
1. Collection
- Program-level
- System-level
2. Representation
- W3C PROV (for interoperability)
- Multiple proprietary formats (for efficient encoding)
3. Querying / analysis
• RDBMS
• GDBMS
• RDF / SPARQL
• Configuration of each pre-processing step
• Data dependency graph
- Which kind of normalisation did you apply?
- Was the data (down/up) sampled? How?
- How did you define / remove outliers?
- How did you window your time series?
- Was the data repurposed (acquired from a repository)?
- How was the original protocol defined?
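As an illustration of the representation point (W3C PROV for interoperability), a minimal sketch using the open-source `prov` Python package; the namespace, entity names and the imputation step are illustrative, borrowed from the Titanic example later in the deck:

```python
from prov.model import ProvDocument

# A tiny provenance document describing one pre-processing step
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/titanic#')

doc.entity('ex:raw_dataset')
doc.entity('ex:imputed_dataset', {'ex:imputation': 'mean Age per Pclass'})
doc.activity('ex:impute_age')
doc.used('ex:impute_age', 'ex:raw_dataset')
doc.wasGeneratedBy('ex:imputed_dataset', 'ex:impute_age')
doc.wasDerivedFrom('ex:imputed_dataset', 'ex:raw_dataset')

print(doc.get_provn())        # human-readable PROV-N
# doc.serialize('prov.json')  # or PROV-JSON, loadable into an RDF / graph store for querying
```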
18. 20
Example
• The classic ”Titanic” dataset
• Can you predict survival probabilities?
• A simple logistic regression analysis
Survived - Survival (0 = No; 1 = Yes)
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare (British pound)
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q =
Queenstown; S = Southampton)
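A minimal sketch of the kind of analysis this slide refers to, assuming the Kaggle training file is available locally as titanic.csv (the file name and feature choices are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('titanic.csv')                       # assumed local copy of the training set

# Minimal preparation: encode Sex, keep a few predictors, drop rows with missing Age
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
X = df[['Pclass', 'Sex', 'Age', 'Fare']].dropna()
y = df.loc[X.index, 'Survived']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print('F1:', round(f1_score(y_te, clf.predict(X_te)), 2))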
19. 21
Enable analysis of data pre-processing
• The data preparation workflow includes a number of decisions:
- Managing missing values: Age is present in only 714 of the 891 records; "Pclass is a good predictor for Age", so missing Age values are imputed using the average age for each Pclass
- Is the target class balanced?
- Dropping irrelevant attributes: 'PassengerId', 'Name', 'Ticket', 'Cabin'
- Dropping correlated features (?): drop 'Fare' and 'Pclass'
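The decisions above map directly onto a few lines of pandas. A minimal sketch, again assuming a local titanic.csv (file name illustrative):

```python
import pandas as pd

df = pd.read_csv('titanic.csv')   # assumed local copy of the Kaggle training set

# Decision 1: drop attributes judged irrelevant for prediction
df = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

# Decision 2: impute missing Age with the average age of the passenger's Pclass
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))

# Decision 3: drop features considered redundant / correlated
df = df.drop(columns=['Fare', 'Pclass'])

# Decision 4: check whether the target class is balanced
print(df['Survived'].value_counts(normalize=True))
```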
21. 23
Exploring the effect of alternative pre-processing
[Diagram: the raw dataset D is processed by two alternative pipelines P1 and P2, producing D1 and D2; models M1 and M2 are learned from them, and for the same instance x they predict y1 and y2, with y1 ≠ y2]
How can knowledge of P1, P2 help understand why y1 ≠ y2?
Ex. Alternative imputation methods for missing values
Ex. Boost minority class / downsample majority class
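A minimal sketch of the setup in the diagram: the same raw Titanic data run through two imputation choices, yielding two models that may disagree on the same passenger (whether y1 and y2 actually differ depends on the instance chosen; file name and pipeline details are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

raw = pd.read_csv('titanic.csv')                       # raw dataset D (assumed local copy)
raw['Sex'] = raw['Sex'].map({'male': 0, 'female': 1})
cols = ['Pclass', 'Sex', 'Age', 'Fare']

def prepare(df, imputation):
    """One candidate pre-processing pipeline, parameterised by the imputation choice."""
    out = df.copy()
    if imputation == 'global_mean':                    # pipeline P1
        out['Age'] = out['Age'].fillna(out['Age'].mean())
    else:                                              # pipeline P2: per-Pclass mean
        out['Age'] = out['Age'].fillna(out.groupby('Pclass')['Age'].transform('mean'))
    return out[cols], out['Survived']

D1, target = prepare(raw, 'global_mean')
D2, _ = prepare(raw, 'pclass_mean')
M1 = LogisticRegression(max_iter=1000).fit(D1, target)
M2 = LogisticRegression(max_iter=1000).fit(D2, target)

i = 5                                                  # the same passenger x under both pipelines
print('y1 =', M1.predict(D1.iloc[[i]])[0], ' y2 =', M2.predict(D2.iloc[[i]])[0])
```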
22. 24
Also: script alludes to human decisions
How do we capture these decisions?
To what extent can they be inferred from code?
23. 25
Correlation analysis
• Is Pclass really a good predictor for Age?
• Why drop both Pclass and Fare?
Alternative pre-processing:
1. Dropped Age only (nearly identical performance: F1 = 0.77 vs 0.76)
2. Use Sex and Pclass only
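The correlation question on this slide can be checked directly on the data. A minimal sketch, assuming the same local titanic.csv as above:

```python
import pandas as pd

df = pd.read_csv('titanic.csv')   # assumed local copy

# How strongly are Age, Pclass and Fare related? (pairwise, ignoring missing Age)
print(df[['Age', 'Pclass', 'Fare']].corr(method='pearson'))

# Average Age per passenger class: the basis of the "impute Age from Pclass" rule
print(df.groupby('Pclass')['Age'].mean())
```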
24. 26
Possible roles for provenance
1) Data acquisition: Provenance → Transparency → Trust
2) Data transformation: Provenance → explanations
- Is data preparation correct?
- Is training data fit to learn from?
- What is the effect of alternative pre-processing?
- Can we infer data prep decisions from pre-processing code?
25. 27
Bias (in ML)
(*) Mitchell, T. M. (1980). The need for biases in learning generalizations. Tech. rep. CBM-TR-117, Rutgers University, New Brunswick, NJ
Bias: "Any basis for choosing one generalization [hypothesis] over another, other than strict consistency with the observed training instances." (*)
Absolute bias:
• certain hypotheses are entirely eliminated from the hypothesis space
• e.g. an a priori choice of model (decision trees, SVM, NN, …)
Relative bias:
• certain hypotheses are preferred over others
• e.g. "prefer shallow simple decision trees to deep ones"
26. 28
Fairness and bias: the (notorious) COMPAS case
• Increasingly popular within the criminal justice system
• Used or considered for use in pre-trial decision-making (USA)
1: The initial claim
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. 2016.
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
Black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts (45 percent vs. 23 percent).
White defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent).
27. 29
Model Fairness and data bias
A. Chouldechova, "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," Big Data, vol. 5, no. 2, pp. 153–163, Jun. 2017.
"In this paper we show that the differences in false positive and false negative rates cited as evidence of racial bias in the ProPublica article are a direct consequence of applying an instrument that is free from predictive bias to a population in which recidivism prevalence differs across groups."
COMPAS complies with the test fairness condition:
Observed P(Y | S=s) largely independent of R
28. 30
COMPAS Scores are skewed
- Scores for white defendants were skewed toward lower-risk categories, while black defendants were evenly distributed across scores
- Large discrepancies in FPR and FNR between black and white defendants
- …but this does not mean that the score itself is unfair
[Figure: score distributions for 6,172 defendants who had not been arrested for a new offense or who had recidivated within two years]
29. 31
FPR / FNR
[Slide shows the definitions of: the positive predictive value (PPV) of Sc, recidivism prevalence within groups, the false positive rate (FPR), and the false negative rate (FNR)]
The test fairness condition (2.1) can be expressed as the constraint that PPV does not depend on R.
When the recidivism prevalence differs between two groups, a test-fair score cannot have equal FPR and FNR across those groups.
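The formulas on the original slide were images; for reference, here is my transcription of the standard definitions from Chouldechova (2017), for a coarsened score Sc in {HR, LR}, outcome Y and group R = r, together with the identity that links them:

```latex
\mathrm{PPV}_r = P(Y{=}1 \mid S_c{=}\mathrm{HR},\, R{=}r), \qquad
p_r = P(Y{=}1 \mid R{=}r)

\mathrm{FPR}_r = P(S_c{=}\mathrm{HR} \mid Y{=}0,\, R{=}r), \qquad
\mathrm{FNR}_r = P(S_c{=}\mathrm{LR} \mid Y{=}1,\, R{=}r)

\mathrm{FPR}_r \;=\; \frac{p_r}{1-p_r}\cdot\frac{1-\mathrm{PPV}_r}{\mathrm{PPV}_r}\cdot\bigl(1-\mathrm{FNR}_r\bigr)
```

If PPV is the same for both groups (test fairness) while the prevalences p_r differ, the identity implies that FPR and FNR cannot both be equal across groups, which is exactly the discrepancy reported by ProPublica.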
30. 32
The actual “provenance” of the analysis
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
Data acquisition + transformation → Model bias and fairness
- Can knowledge of data prep explain model bias?
- Does data prep introduce / remove bias?
31. 33
Fairness: many possible definitions
(*) M. J. Kusner, J. Loftus, C. Russell, and R. Silva, "Counterfactual Fairness," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4066–4076.
32. 34
Causality and counterfactual fairness
[Causal diagram, after Kusner et al.: latent U = aggressive driving, protected A = driver's race, observable X = red-car preference, predicted Y = accident rate; aggressive driving influences both red-car preference and accident rate, and race influences red-car preference]
• Individuals belonging to a race A are more likely to drive red cars (A → X)
• However, race is not a good predictor for either U or Y
• Aggressive drivers tend to prefer red cars (U → X)
Using X to predict Y leads to a counterfactually unfair model:
• it may charge individuals of a certain race more than others, even though no race is more likely to have an accident
Is knowledge of data prep useful at all to determine this kind of fairness?
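A small simulation of the red-car example, purely illustrative and with made-up parameters: race A affects only the red-car preference X, latent aggressiveness U affects both X and the accident outcome Y, and a model trained on X alone ends up assigning different average risks to the two groups even though their true accident rates are essentially identical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
A = rng.integers(0, 2, n)                                  # protected attribute, independent of U
U = rng.normal(size=n)                                     # latent aggressiveness
X = (U + 1.5 * A + rng.normal(size=n) > 1).astype(int)     # red-car preference: depends on U and A
Y = (U + rng.normal(size=n) > 1).astype(int)               # accident: depends on U only

model = LogisticRegression().fit(X.reshape(-1, 1), Y)
risk = model.predict_proba(X.reshape(-1, 1))[:, 1]

print('true accident rate  A=0 / A=1:', Y[A == 0].mean().round(3), Y[A == 1].mean().round(3))
print('mean predicted risk A=0 / A=1:', risk[A == 0].mean().round(3), risk[A == 1].mean().round(3))
```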
33. 35
Possible roles for provenance
1) Data acquisition: Provenance → Transparency → Trust
2) Data transformation: Provenance → explanations
- Is data preparation correct?
- Is training data fit to learn from?
- What is the effect of alternative pre-processing?
3) Data acquisition + transformation → Model bias and fairness
- Is provenance useful to diagnose an unfair / biased model?
- Does data prep introduce / remove bias?
34. 36
Opportunities and challenges: Summary
1) Data acquisition: Provenance → Transparency → Trust
2) Data transformation: Provenance → explanations
- Is data preparation correct?
- Is training data fit to learn from?
- What is the effect of alternative pre-processing?
3) Data acquisition + transformation → Model bias and fairness
- Is provenance useful to diagnose an unfair / biased model?
- Does data prep introduce / remove bias?
35. 37
A few initial references
[1] C. O'Neil, Weapons of Math Destruction. Crown, 2016.
[2] B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," Proc. 2016 ICML Work. Hum. Interpret. Mach. Learn. (WHI 2016), Jun. 2016.
[3] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, 2016, pp. 1135–1144.
[4] H. Lakkaraju, S. H. Bach, and J. Leskovec, "Interpretable Decision Sets: A Joint Framework for Description and Prediction," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1675–1684.
[5] K. Yang and J. Stoyanovich, "Measuring Fairness in Ranked Outputs," in Proceedings of the 29th International Conference on Scientific and Statistical Database Management - SSDBM '17, 2017, pp. 1–6.
[6] T. Gebru et al., "Datasheets for Datasets," 2018.
[7] Z. Abedjan, L. Golab, and F. Naumann, "Profiling relational data: a survey," VLDB J., vol. 24, no. 4, pp. 557–581, 2015.
[8] A. Weller, "Challenges for Transparency," in Proceedings of the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016).
[9] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1721–1730.
Individuals as well as businesses, which we will initially refer to as subjects (and later upgrade to active participants), increasingly find themselves at the receiving end of impactful decisions made by organisations on their behalf, based on processes that use algorithmically-generated knowledge.
This brings about the issue of trust in the models.
Should I use the prediction?
"Determining trust in individual predictions is an important problem when the model is used for decision making. When using machine learning for medical diagnosis [6] or terrorism detection, for example, predictions cannot be acted upon on blind faith, as the consequences may be catastrophic"
How about the data used to train / build the model?
Relatively easy to keep track of data pre-processing provenance