Robust Feature Learning with Deep Neural Networks
http://snu-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/display.do?tabs=viewOnlineTab&doc=82SNU_INST21557911060002591
Wajdi Khattel presented a proposal for a terrorist-detection model in social networks. The model takes a multi-dimensional network as input and consists of three sub-models: a text classification model, an image classification model, and a general-information classification model. Each sub-model outputs a score, which a decision-making module compares against a threshold to classify a user as a terrorist or not. The implementation collected offline training data from banned Twitter accounts, Google Images, and a public dataset; online data was also collected from Facebook, Instagram, and Twitter using their APIs. Several machine learning models were tested for each sub-model, and the proposed full model uses a neural network for text and a CNN with data augmentation.
Explainable AI: Building trustworthy AI models? - Raheel Ahmad
Building trustworthy, transparent and unbiased machine learning models?
Get started with explainX, which brings state-of-the-art explainability techniques under one roof, accessible via a single line of code.
Learn the major modules within the explainX explainable AI and model interpretability framework.
These slides are taken from Raheel's presentation at UnpackAI's forum on Data Ethics in AI.
This document provides an overview of generative adversarial networks (GANs). It explains that GANs were introduced in 2014 and involve two neural networks, a generator and discriminator, that compete against each other. The generator produces synthetic data to fool the discriminator, while the discriminator learns to distinguish real from synthetic data. As they train, the generator improves at producing more realistic outputs that match the real data distribution. Examples of GAN applications discussed include image generation, text-to-image synthesis, and face aging.
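The adversarial game described above fits in a few lines of PyTorch. This is a minimal sketch under assumed sizes (a 64-dimensional latent, flattened 28x28 outputs), not the architecture from any specific talk summarized here:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # assumed sizes, e.g. flattened 28x28 images

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    """One GAN step; `real` is a (batch, data_dim) tensor of training data."""
    batch = real.size(0)
    fake = G(torch.randn(batch, latent_dim))

    # Discriminator: push real samples toward label 1, generated ones toward 0.
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

As the two losses push against each other, the generator's samples drift toward the real data distribution, which is exactly the dynamic the summary describes.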
Can we use data to train machine learning models and perform statistical analysis without putting private data at risk? Tools and techniques such as Federated Learning, Differential Privacy, and Homomorphic Encryption enable safer work on the data.
Speaker: Yunjey Choi (master's student, Korea University)
Yunjey Choi majored in computer science at Korea University and is currently a master's student studying machine learning. He enjoys coding and sharing what he has learned with others. He studied deep learning with TensorFlow for a year and is now studying generative adversarial networks with PyTorch. He has implemented several papers in TensorFlow and published a PyTorch tutorial on GitHub.
Overview:
The Generative Adversarial Network (GAN), first proposed by Ian Goodfellow in 2014, is a generative model that estimates the distribution of real data through adversarial training. GANs have recently emerged as one of the most popular research areas, with countless related papers pouring out every day.
Finding it hard to keep up with the flood of GAN papers? That's fine: once you thoroughly understand the basic GAN, newly published papers become easy to follow.
In this talk I aim to share everything I know about GANs. It should suit those who are completely new to GANs, those curious about the theory behind them, and those wondering how GANs can be applied.
Video: https://youtu.be/odpjk7_tGY0
Federated learning makes it possible to build machine learning systems without direct access to the training data. The data remains in its original location, which helps ensure privacy, reduces network communication costs, and taps the computing resources of edge devices. The data-minimization principles established by the GDPR and the growing prevalence of smart sensors make its advantages all the more compelling. Federated learning is a great fit for smartphones, industrial and consumer IoT, industrial sensor applications, and healthcare and other privacy-sensitive use cases.
We’ll present the Fast Forward Labs team’s research on this topic and the accompanying prototype application, “Turbofan Tycoon”: a simplified working example of federated learning applied to a predictive maintenance problem. In the demo scenario, customers of an industrial turbofan manufacturer are unwilling to share with the manufacturer the details of how their components failed, but still want the manufacturer to provide them with a strategy for maintaining the part. Federated learning lets us satisfy the customers' privacy concerns while providing them with a model that leads to fewer costly failures and less maintenance downtime.
We’ll discuss the advantages and tradeoffs of taking the federated approach. We’ll assess the state of tooling for federated learning, circumstances in which you might want to consider applying it, and the challenges you’d face along the way.
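The core mechanic behind this, federated averaging, is simple to sketch: each client trains the shared model on data that never leaves its premises, and only the updated weights travel back to be combined. Below is a toy numpy version with linear-regression clients; the data, learning rate, and round count are illustrative assumptions, not the Turbofan Tycoon internals:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: plain linear regression by gradient descent."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """clients: list of (X, y) held on-device; only weight updates leave."""
    updates = [local_update(global_w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return np.average(updates, axis=0, weights=sizes)  # size-weighted mean

# Three 'customers' with private data, one shared model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, clients)
print(w)  # approaches [2.0, -1.0] without pooling any client's raw data
```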
Speaker
Chris Wallace
Data Scientist
Cloudera
[unofficial] Pyramid Scene Parsing Network (CVPR 2017) - Shunta Saito
Pyramid Scene Parsing Network introduces the Pyramid Pooling Module to improve semantic segmentation. The module captures context at different regions and scales by performing average pooling at different pyramid levels on the final convolutional feature map. Experiments on ADE20K and PASCAL VOC datasets show the Pyramid Pooling Module improves mean Intersection-over-Union by over 4% compared to global average pooling, achieving state-of-the-art performance.
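The module itself is compact enough to sketch in PyTorch. Channel counts and pyramid levels below follow the configuration commonly reported for PSPNet (levels 1, 2, 3, 6 on a 2048-channel backbone); treat it as an illustration rather than the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=2048, levels=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(levels)  # each level contributes 1/N of the channels
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s), nn.Conv2d(in_ch, out_ch, 1))
            for s in levels
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # Pool at each pyramid scale, project, and upsample back to input size.
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)  # local features + multi-scale context

feats = torch.randn(1, 2048, 60, 60)   # e.g. a ResNet stage-5 feature map
print(PyramidPooling()(feats).shape)   # torch.Size([1, 4096, 60, 60])
```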
The document discusses the potential applications of deep learning in healthcare. It begins by explaining that deep learning models can improve accuracy of diagnosis, prognosis, and risk prediction by analyzing large datasets. It then discusses how deep learning can optimize hospital processes like resource allocation and patient flow by early and accurate prediction of diseases. Finally, it mentions that deep learning can help identify patient subgroups for personalized and precision medicine approaches.
Generative Adversarial Networks (GANs) - Ian Goodfellow, OpenAI (WithTheBest)
This presentation explains how Generative Adversarial Networks (GANs) work and how they benefit the tech and dev industry. Although GANs still have room for improvement, they are important generative models that learn how to create realistic samples.
GANS
Ian Goodfellow, OpenAI Research Scientist
Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data; such algorithms overcome strictly static program instructions by making data-driven predictions or decisions through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working toward a data breach, optical character recognition (OCR), learning to rank, and computer vision.
A sharing talk at Hsinchu Coders.
The materials (i.e. images) are from their respective owners:
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
A short presentation on the emerging research on normalizing flows. The presentation follows two recent survey papers on the topic; a toy example of the change-of-variables idea follows the references below:
[1] Kobyzev, Ivan, Simon Prince, and Marcus Brubaker. Normalizing flows: An introduction and review of current methods, T-PAMI 2020.
[2] Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference, arXiv preprint arXiv:1912.02762 (2019).
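To make the core idea concrete, here is a deliberately tiny flow: a single invertible element-wise affine map, with the change-of-variables formula log p_x(x) = log p_z(f(x)) + log |det J_f(x)| spelled out. The parameters s and b are stand-ins for learned quantities; real flows stack many richer bijections (coupling layers, autoregressive maps):

```python
import numpy as np

s, b = np.array([0.5, -0.3]), np.array([1.0, 2.0])  # would be learned in practice

def forward(x):
    """Map data to the base space: z = (x - b) * exp(-s), trivially invertible."""
    return (x - b) * np.exp(-s)

def log_prob(x):
    z = forward(x)
    log_base = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)  # N(0, I) density
    log_det = -np.sum(s)   # dz_i/dx_i = exp(-s_i), so log|det J| = -sum(s)
    return log_base + log_det

def sample(n, rng=np.random.default_rng(0)):
    """Invert the map: draw z from the base and push it to data space."""
    z = rng.normal(size=(n, 2))
    return z * np.exp(s) + b

x = sample(5)
print(log_prob(x))  # exact log-densities, which GANs and VAEs cannot provide
```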
In this project we classify images from the CIFAR-10 dataset, which consists of airplanes, dogs, cats, and other objects. We'll preprocess the images, then train a convolutional neural network on all the samples: the images need to be normalized and the labels one-hot encoded.
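Both preprocessing steps are one-liners; a numpy sketch, with shapes assuming the standard CIFAR-10 layout:

```python
import numpy as np

def normalize(images):
    """Scale uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return images.astype(np.float32) / 255.0

def one_hot(labels, num_classes=10):
    """Map integer labels to one-hot rows, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0]."""
    return np.eye(num_classes, dtype=np.float32)[labels]

images = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
labels = np.array([0, 3, 7, 9])
print(normalize(images).min(), normalize(images).max())  # 0.0 ... 1.0
print(one_hot(labels).shape)                             # (4, 10)
```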
This thesis presents research on using deep learning methods for feature extraction from satellite imagery to identify landslide pixels. The objectives are to classify land cover using machine learning algorithms like SVM and random forests in Google Earth Engine, design and evaluate a deep neural network for landslide identification, and compare performance of deep learning models in MATLAB. Results show that a neural network achieved over 98% accuracy at identifying landslide pixels. Future work proposes developing new indices for improved identification and an automatic landslide monitoring platform.
This document provides an introduction to deep learning in medical imaging. It explains that artificial neural networks are modeled after biological neurons and use multiple hidden layers to approximate complex functions. Convolutional neural networks are commonly used for image data, applying filters over images to extract features. Modern deep learning platforms perform cross-correlation instead of convolution for efficiency. The key process for improving deep learning models is backpropagation, which calculates the gradient of the loss function to update weights and biases in a direction that reduces loss. Deep learning has applications in medical imaging modalities like MRI, ultrasound, CT, and PET.
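The cross-correlation point is worth seeing in code: frameworks slide the kernel over the image as-is, while true convolution flips the kernel first. A minimal numpy sketch (single channel, stride 1, no padding), not taken from the document itself:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """What a 'convolutional' layer actually computes: no kernel flip."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def convolve2d(image, kernel):
    """True convolution: flip the kernel in both axes, then slide."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

img = np.arange(25.0).reshape(5, 5)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(img, k))
print(convolve2d(img, k))  # differs unless the kernel is symmetric
```

Since the kernel weights are learned anyway, skipping the flip changes nothing about what the network can represent, which is why frameworks prefer the cheaper operation.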
The document discusses the role of a full-stack data scientist. It begins with an introduction of the author, Alexey Grigorev, as a data scientist. It then outlines the plan: the data science process, roles in a data science team, what defines a full-stack data scientist, and how to become one. It proceeds to explain the CRISP-DM process for data science projects and describes the different roles in a data science team, including product manager, data analyst, data engineer, data scientist, and ML engineer. It defines a full-stack data scientist as someone who can work across the entire data science lifecycle and discusses the breadth of skills required to become a full-stack data scientist.
Generative adversarial networks (GANs) are a class of machine learning frameworks where two neural networks, a generator and discriminator, compete against each other. The generator learns to generate new data with the same statistics as the training set to fool the discriminator, while the discriminator learns to better distinguish real samples from generated samples. GANs have applications in image generation, image translation between domains, and image completion. Training GANs can be challenging due to issues like mode collapse.
Chest X-ray Pneumonia Classification with Deep Learning - BaoTramDuong2
This document discusses using deep learning models to classify chest X-ray images as either normal or pneumonia. The authors obtained a dataset of over 5,800 pediatric chest X-rays from a Chinese hospital. Various deep learning models were explored, including multilayer perceptrons, convolutional neural networks, and transfer learning with VGG16, which achieved 92% validation accuracy. The document recommends future work such as distinguishing between viral and bacterial pneumonia and combining models with SVM. It also discusses recommendations to reduce childhood pneumonia prevalence.
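The transfer-learning setup the summary mentions is a standard pattern; a hedged Keras sketch, where the input size, head layers, and training details are assumptions rather than the original work's exact configuration:

```python
import tensorflow as tf

# ImageNet-pretrained VGG16 as a frozen feature extractor.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(pneumonia)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # with tf.data datasets
```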
Anomaly detection is a topic with many different applications. From social media tracking to cybersecurity, anomaly detection (or outlier detection) algorithms can have a huge impact in your organisation.
For the video please visit: https://www.youtube.com/watch?v=XEM2bYYxkTU
This slideshare has been produced by the Tesseract Academy (http://tesseract.academy), a company that educates decision makers in deep technical topics such as data science, analytics, machine learning and blockchain.
If you are interested in data science and related topics, make sure to also visit The Data Scientist: http://thedatascientist.com.
Explainability for Natural Language Processing - Yunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian, and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
  title = {Explainability for Natural Language Processing},
  author = {Marina Danilevsky and Shipi Dhanorkar and Yunyao Li and Lucian Popa and Kun Qian and Anbang Xu},
  journal = {KDD},
  year = {2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature-browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain with a review of the state-of-the-art research, and findings from a qualitative interview study of individuals working on real-world NLP projects as applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component introduces core concepts related to explainability in NLP; we then discuss explainability for NLP tasks and report on a systematic review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
This document discusses methods for SMS spam classification using natural language processing. It reviews approaches such as preprocessing text data, creating bag-of-words models, adding features like text length and profanity, and implementing machine learning classifiers like logistic regression, Naive Bayes, and gradient boosting. The key findings: preprocessing text by removing stopwords and lemmatizing improves accuracy; support vector machines perform best, with an accuracy of 98%; and spam texts tend to contain words like "call", "txt", and "prize" and to be longer, with less readable syntax, than non-spam texts.
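That recipe maps naturally onto a scikit-learn pipeline. A minimal sketch with two made-up messages; the 98% figure above comes from the paper's real corpus, not from anything this small:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["WINNER!! Call now to claim your prize",   # spam-like
         "are we still meeting for lunch today?"]   # ham-like
labels = [1, 0]

# stop_words="english" drops stopwords; a lemmatizer could be plugged in as a
# custom preprocessor to match the paper's pipeline more closely.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["txt PRIZE to 80082 to win cash"]))  # expected: [1] (spam)
```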
Slides, thesis dissertation defense, deep generative neural networks for nove... - mehdi Cherti
In recent years, significant advances in deep neural networks enabled the creation of groundbreaking technologies such as self-driving cars and voice-enabled personal assistants. Almost all successes of deep neural networks are about prediction, whereas the initial breakthroughs came from generative models. Today, although we have very powerful deep generative modeling techniques, these techniques are essentially being used for prediction or for generating known objects (i.e., good-quality images of known classes): any generated object that is a priori unknown is considered a failure mode (Salimans et al., 2016) or spurious (Bengio et al., 2013b). In other words, when prediction seems to be the only possible objective, novelty is seen as an error that researchers have been trying hard to eliminate. This thesis defends the point of view that, instead of trying to eliminate these novelties, we should study them and the generative potential of deep nets to create useful novelty, especially given the economic and societal importance of creating new objects in contemporary societies.
The thesis sets out to study novelty generation in relationship with data-driven knowledge models produced by deep generative neural networks. Our first key contribution is the clarification of the importance of representations and their impact on the kinds of novelty that can be generated: a key consequence is that a creative agent might need to re-represent known objects to access various kinds of novelty. We then demonstrate that traditional objective functions of statistical learning theory, such as maximum likelihood, are not necessarily the best theoretical framework for studying novelty generation, and we propose several alternatives at the conceptual level. A second key result is the confirmation that current models, with traditional objective functions, can indeed generate unknown objects; this shows that even though objectives like maximum likelihood are designed to eliminate novelty, practical implementations do generate novelty. Through a series of experiments, we study the behavior of these models and the novelty they generate, and in particular we propose a new task setup and metrics for selecting good generative models. Finally, the thesis concludes with a series of experiments clarifying the characteristics of models that can exhibit novelty. Experiments show that sparsity, noise level, and restricting the capacity of the net eliminate novelty, and that models that are better at recognizing novelty are also better at generating it.
This document summarizes a research paper about blockchain technology from a sustainability perspective. It discusses how blockchain could help achieve the 17 sustainable development goals set by the UN, such as increasing transparency, reducing fraud and corruption, and enabling new funding opportunities. However, it also notes blockchain has sustainability drawbacks. The energy intensive "proof of work" algorithm used by Bitcoin requires massive electricity consumption from fossil fuel power sources, undermining climate goals. While blockchain aims to increase accessibility, its current infrastructure poses environmental risks that could threaten sustainability if left unaddressed.
Impact of big data congestion in IT: An adaptive knowledge-based Bayesian network - IJECEIAES
Progress on real-time systems in information technology is rapid and is proving important in every innovative field. Different IT applications simultaneously produce enormous amounts of data that must be handled. In this paper, a novel adaptive knowledge-based Bayesian network algorithm is proposed to deal with the impact of big data congestion in decision processing. A Bayesian network model is used to manage the knowledge structure for the decision-making process. Knowledge in Bayesian networks is typically expressed as an optimal structure, where the analysis task is to find a structure that maximizes a statistically motivated score. Generally, available data mining tools search for this optimal structure with ordinary search techniques; because this requires an enormous search space, it is a time-consuming approach, and the situation becomes critical once big data is involved in the search. An algorithm is introduced to achieve faster processing of the optimal structure by constraining the search space, using a recursive calculation over the query space. The results demonstrate that the proposed algorithm can handle big data in terms of processing time and achieves higher prediction rates.
Analysis of IT Monitoring Using Open Source Software Techniques: A Review - IJERD Editor
Network administrators usually rely on generic, built-in monitoring tools for network security. Ideally, the network infrastructure is supposed to have carefully designed strategies to scale up monitoring tools and techniques as the network grows over time. Without this, there can be network performance challenges, downtime due to failures and, most importantly, penetration attacks, which can lead to monetary losses as well as loss of reputation. Thus, there is a need for best practices to monitor network infrastructure in an agile manner. Network security monitoring involves collecting network packet data, segregating it among all 7 OSI layers, and applying intelligent algorithms to get answers to security-related questions. The purpose is to know in real time what is happening on the network at a detailed level, and to strengthen security by hardening processes, devices, appliances, software policies, etc. The Multi Router Traffic Grapher (MRTG) is free software for monitoring and measuring the traffic load on network links; it lets the user see the traffic load on a network over time in graphical form.
The document discusses the Internet of Things (IoT) and some of the key challenges. It notes that IoT data is multi-modal, distributed, heterogeneous, noisy and incomplete. It raises issues around data management, actuation and feedback, service descriptions, real-time analysis, and privacy and security. The document outlines research challenges around transforming raw data to actionable information, machine learning for large datasets, making data accessible and discoverable, and energy efficient data collection and communication. It emphasizes that IoT data integration requires solutions across physical, cyber and social domains.
Analytics of Performance and Data Quality for Mobile Edge Cloud Applications - Hong-Linh Truong
The document discusses performance and data quality analytics for mobile edge cloud applications. It presents MECCA, a mobile edge cloud application for providing cornering recommendations to cars. MECCA has a complex architecture using microservices and third party services. Analyzing MECCA's performance and data quality across different edge and cloud deployments is challenging due to dependencies between application parameters, streaming processing, and third party services. Future work aims to develop toolsets and datasets to better evaluate performance and data quality metrics for mobile edge cloud applications.
The document summarizes the evolution of the semantic grid from its origins in 2001 to the present. It describes how early work on the semantic grid aimed to close the gap between grid applications and the vision of global e-science collaboration. Key developments included linking grid services with semantic web technologies to enable automation and advanced functionality through machine-processable descriptions. The semantic grid is now seen as an important approach for virtual research environments that support both formal and informal scientific processes through collaborative tools and persistent representations of discussions.
The document provides an introduction to big data, including:
1) It defines big data and discusses its key characteristics of volume, velocity, and variety.
2) It describes sources of big data like sensors, social media, and purchase transactions.
3) It discusses big data analytics including descriptive, predictive, and prescriptive analytics and the stages of capture, organize, analyze, and act.
The document discusses using machine learning for efficient attack detection in IoT devices without feature engineering. It proposes a feature-engineering-less machine learning (FEL-ML) process that uses raw packet byte streams as input instead of engineered features. This approach is lighter weight and faster than traditional methods. The FEL-ML model is trained directly on unprocessed packet data to perform malware detection on resource-constrained IoT devices. Prior approaches that used engineered features or complex deep learning models are not suitable for IoT due to limits on memory and processing power. The proposed FEL-ML approach aims to enable effective network traffic security for IoT using minimal resources.
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH... - ijcsit
Through the generalization of deep learning, the research community has addressed critical challenges in the network security domain, like malware identification and anomaly detection. However, it has yet to discuss deploying these models on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often limited in memory and processing power, rendering the compute-intensive deep learning environment unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the deep learning pipeline and using raw packet data as input. We introduce a feature-engineering-less machine learning (ML) process to perform malware detection on IoT devices. Our proposed model, "Feature-engineering-less ML (FEL-ML)," is a lighter-weight detection algorithm that expends no extra computation on "engineered" features, effectively accelerating the low-powered IoT edge. It is trained on unprocessed byte streams of packets. Aside from providing better results, it is quicker than traditional feature-based methods. FEL-ML facilitates resource-sensitive network traffic security, with the added benefit of eliminating the significant investment by subject matter experts in feature engineering.
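To make "raw packet bytes as input" concrete, here is a hedged PyTorch sketch of a small 1D CNN over byte streams; the layer sizes and the 1500-byte packet length are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RawPacketCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(256, 8)  # one learned vector per byte value
        self.net = nn.Sequential(
            nn.Conv1d(8, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),       # makes the model length-independent
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, bytes_batch):        # (batch, length) ints in 0..255
        x = self.embed(bytes_batch).transpose(1, 2)  # -> (batch, 8, length)
        return self.head(self.net(x).squeeze(-1))

pkts = torch.randint(0, 256, (16, 1500))  # a batch of zero-padded packets
print(RawPacketCNN()(pkts).shape)         # torch.Size([16, 2])
```

Note there is no feature-extraction stage at all: the first layer consumes the packet bytes directly, which is the point of the feature-engineering-less design.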
Big Data in Distributed Analytics, Cybersecurity and Digital Forensics - SherinMariamReji05
This document provides an overview of big data and its applications in distributed analytics, cyber security, and digital forensics. It discusses how big data can reduce the processing time of large volumes of data in distributed computing environments using Hadoop. Examples of big data applications include using social media, search engine, and aircraft black box data for analysis. The document also outlines the challenges of traditional systems and how distributed big data architectures help address them by allowing data to be processed across clustered computers.
Concept Drift Identification using Classifier Ensemble Approach - IJECEIAES
Abstract: In internetworked systems, huge amounts of data are scattered, generated, and processed over the network. Data mining techniques are used to discover unknown patterns in the underlying data. A traditional classification model classifies data based on past labelled data. However, in many current applications data grows in size with fluctuating patterns, so new features may appear in the data. This occurs in applications such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and pricing driven by demand and supply. Such changes in data distribution reduce classification accuracy: some patterns may be discovered as frequent while others tend to disappear and be wrongly classified. To mine such data, traditional classification techniques may not be suitable, since the distribution generating the items can change over time, and data from the past may become irrelevant or even false for the current prediction. To handle such varying patterns, concept drift mining is used to improve the accuracy of classification. In this paper we propose an ensemble approach for improving classifier accuracy. The ensemble classifier is applied on 3 different data sets; we investigated different features for different chunks of data, which are then given to the ensemble classifier, and observed that the proposed approach improves classifier accuracy across chunks.
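A minimal sketch of the chunk-wise ensemble idea with scikit-learn; the chunking, member cap, and recency weighting below are assumptions for illustration, not the paper's exact method:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    """Train one model per data chunk; old members age out as data drifts."""
    def __init__(self, max_members=5):
        self.members, self.max_members = [], max_members

    def fit_chunk(self, X, y):
        self.members.append(DecisionTreeClassifier(max_depth=5).fit(X, y))
        if len(self.members) > self.max_members:
            self.members.pop(0)            # drop the oldest model

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        w = np.arange(1, len(self.members) + 1)[:, None]  # recent chunks weigh more
        return (np.sum(votes * w, axis=0) / w.sum() > 0.5).astype(int)

rng = np.random.default_rng(1)
ens = ChunkEnsemble()
for t in range(6):                         # a stream whose distribution drifts
    X = rng.normal(size=(200, 2)) + t * 0.3
    y = (X[:, 0] + X[:, 1] > t * 0.6).astype(int)
    ens.fit_chunk(X, y)
print(ens.predict(rng.normal(size=(5, 2)) + 1.8))
```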
This document discusses using data mining techniques to help with crime investigation by analyzing large amounts of crime data. It compares the performance of three data mining algorithms (J48, Naive Bayes, JRip) on a sample criminal database to identify the best performing algorithm. The best algorithm would then be used on the criminal database to help identify possible suspects for a crime based on evidence and attributes. The document provides details on each of the three algorithms and evaluates them based on classification accuracy and other metrics to select the best technique for the criminal investigation application.
Enhanced Privacy Preserving Access Control in Incremental Data using Microaggre... - rahulmonikasharma
In microdata releases, the main task is to protect the privacy of data subjects. Microaggregation is a disclosure-limitation technique for protecting the privacy of microdata. It is an alternative to generalization and suppression for generating k-anonymous data sets, in which the identity of each subject is hidden within a group of k subjects. Microaggregation perturbs the data, and additional masking allows refining data utility in several ways: increasing data granularity, avoiding discretization of numerical data, and reducing the impact of outliers. If the variability of the private data values in a group of k subjects is too small, k-anonymity does not protect against attribute disclosure. In this work, role-based access control is assumed: access control policies define selection predicates for roles, and an imprecision bound for each permission defines a threshold on the amount of imprecision that can be tolerated, so the proposed approach reduces the imprecision for each selection predicate. Whereas existing papers anonymize only static relational tables, here the privacy-preserving access control mechanism is applied to incremental data.
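Microaggregation itself is easy to sketch: sort the records, group them k at a time, and release each group's centroid in place of the raw values, so every released value is shared by at least k records. Real methods such as MDAV group by multivariate distance; this univariate toy shows only the core idea:

```python
import numpy as np

def microaggregate(values, k=3):
    order = np.argsort(values)
    out = np.empty(len(values))
    for start in range(0, len(values), k):
        group = order[start:start + k]
        if len(group) < k:                 # fold a short tail into the last group
            group = order[start - k:]
        out[group] = values[group].mean()  # centroid replaces the raw values
    return out

salaries = np.array([31000, 90000, 33000, 35000, 88000, 41000, 86000])
print(microaggregate(salaries, k=3))
# Each released salary now equals a group mean shared by >= 3 individuals.
```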
IRJET - Fault Detection and Prediction of Failure using Vibration Analysis - IRJET Journal
This document discusses fault detection and prediction of failures in rotating equipment using vibration analysis. It begins by introducing vibration analysis as a method to monitor machines and detect faults in rotating components that may cause failures. It then discusses how motor vibration is measured and analyzed using techniques like spectrum analysis to identify faults like unbalance, bearing issues, or broken rotor bars. The document proposes decomposing vibration signals using intrinsic mode functions and calculating the Gabor representation's frequency marginal to identify fault types using classifiers like support vector machines or random forests. It provides context on data mining techniques relevant to this type of fault prediction problem.
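A hedged sketch of the overall signal path (vibration signal -> magnitude spectrum -> band-energy features -> classifier); the synthetic "unbalance" signal with a dominant running-speed tone and the simple band-energy features are illustrative assumptions, cruder than the intrinsic-mode/Gabor analysis the document proposes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FS = 5000  # sampling rate in Hz (assumed)

def band_energies(signal, n_bands=16):
    spec = np.abs(np.fft.rfft(signal))     # magnitude spectrum
    return np.array([b.sum() for b in np.array_split(spec, n_bands)])

def make_signal(faulty, rng):
    t = np.arange(FS) / FS                 # one second of vibration data
    base = 0.1 * rng.normal(size=FS)       # background noise
    if faulty:                             # unbalance: strong 25 Hz (1x) component
        base += np.sin(2 * np.pi * 25 * t)
    return base

rng = np.random.default_rng(0)
X = np.array([band_energies(make_signal(f, rng)) for f in [0, 1] * 50])
y = np.array([0, 1] * 50)
clf = RandomForestClassifier(n_estimators=50).fit(X, y)
print(clf.score(X, y))                     # in-sample sanity check only
```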
Data Mining Framework for Network Intrusion Detection using Efficient Techniques - IJAEMSJORNAL
The implementation measures the classification accuracy on benchmark datasets after combining SIS and ANNs. In order to put a number on the gains made by using SIS as a strategic tool in data mining, extensive experiments and analyses are carried out. The predicted results of this investigation will have implications for both theoretical and applied settings. Predictive models in a wide variety of disciplines may benefit from the enhanced classification accuracy enabled by SIS inside ANNs. An invaluable resource for scholars and practitioners in the fields of AI and data mining, this study adds to the continuing conversation about how to maximize the efficacy of machine learning methods.
IDENTITY DISCLOSURE PROTECTION IN DYNAMIC NETWORKS USING KW-STRUCTURAL DIV... - IJITE
Data mining extracts accurate information for the requesting user after the raw data is analyzed. Among its many developments, data mining faces pressing issues of security, privacy, and integrity. One of the latest techniques, privacy-preserving data publishing (PPDP), enforces security for the digital information provided by governments, corporations, companies, and individuals in social networks. People are embarrassed when an adversary learns the sensitive information they share. Sensitive information is gathered through the vertex and multi-community identities of the user: vertex identity denotes the user's own information, such as name, address, and mobile number, while multi-community identity denotes the community groups in which the user participates. To prevent such identity disclosures, this paper proposes a KW-structural diversity anonymity technique for protecting against vertex and multi-community identity disclosure, where k is the privacy level applied to users and W is the adversary's monitoring time.
Rao Mikkilineni discusses the emergence of cognitive computing models and a new cognitive infrastructure. He argues that increasing data volumes and the need for real-time insights are driving the need for intelligent, sentient, and resilient systems. The new cognitive infrastructure will include a cognitive and infrastructure agnostic control overlay, composable services, and cognitive deep learning integration. It will enable a post-hypervisor cognitive computing era with intelligent, distributed systems.
Review Paper on Shared and Distributed Memory Parallel Algorithms to Solve Bi... - JIEMS Akkalkuwa
This document presents a review of parallel algorithms to solve big data problems in biological, social network, and spatial domains using shared and distributed memory. It discusses sequential and parallel algorithms for community detection in protein-protein interaction networks and social networks. It also discusses techniques for processing and analyzing large LiDAR point cloud data for applications like forest monitoring and 3D modeling. The document reviews relevant literature on algorithms for community detection, network partitioning, and LiDAR data reduction and interpolation. It then describes the BLLP algorithm for community detection in biological networks and discusses how it could be extended to distributed memory systems.
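As a concrete stand-in for the community-detection step, here is networkx's label propagation on a toy graph. This illustrates the general label-propagation technique, not the BLLP algorithm discussed in the review:

```python
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

# A toy stand-in for a protein-protein interaction or social graph.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # dense triangle 1
                  ("d", "e"), ("e", "f"), ("d", "f"),   # dense triangle 2
                  ("c", "d")])                          # weak bridge between them

# Each node repeatedly adopts its neighbors' majority label until stable;
# the procedure is randomized, so runs may differ on larger graphs.
for community in label_propagation_communities(G):
    print(sorted(community))
```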
Image Recognition Expert System based on deep learning - PRATHAMESH REGE
The document summarizes literature on image recognition expert systems and deep learning. It discusses two papers:
1. The Low-Power Image Recognition Challenge, which established a benchmark for comparing low-power image recognition solutions based on both accuracy and energy efficiency, using datasets like ILSVRC.
2. The role of knowledge-based systems and expert systems in automatic interpretation of aerial images. It discusses techniques like semantic networks, frames and logical inference used to solve ill-defined problems with limited information. Frameworks like the blackboard model, ACRONYM and SIGMA are discussed.
Integrated Analytics for IIoT Predictive Maintenance using IoT Big Data Cloud... - Hong-Linh Truong
For predictive maintenance of equipment with Industrial Internet of Things (IIoT) technologies, existing IoT Cloud systems provide strong monitoring and data analysis capabilities for detecting and predicting equipment status. However, we need to support complex interactions among different software components and human activities to provide integrated analytics, as software algorithms alone cannot deal with the complexity and scale of data collection and analysis and the diversity of equipment, due to the difficulty of capturing and modeling uncertainties and domain knowledge in predictive maintenance. In this paper, we describe how we design and augment complex IoT big data cloud systems for integrated analytics of IIoT predictive maintenance. Our approach is to identify various complex interactions for solving system incidents together with relevant critical analytics results about equipment. We incorporate humans into various parts of complex IoT Cloud systems to enable situational data collection, services management, and data analytics. We leverage serverless functions, cloud services, and domain knowledge to support dynamic interactions between humans and software for maintaining equipment. We use a real-world maintenance scenario for Base Transceiver Stations to illustrate our engineering approach, which we have prototyped with state-of-the-art cloud and IoT technologies such as Apache NiFi, Hadoop, Spark, and Google Cloud Functions.
Similar to "Detection of fraud in financial blockchain-based transactions through big data analytics":
The document describes research on social network monitoring and disinformation in Europe. The project seeks to develop a hybrid platform that uses deterministic techniques and artificial intelligence to classify and analyze network content, detect bots, and measure virality. The ultimate goal is to help verify information and combat disinformation.
This document discusses engineering digitalization through task automation and reuse in the development lifecycle. It proposes a knowledge-centric approach to systems engineering using a knowledge management strategy. This includes defining a controlled vocabulary, relating terms through relationships and clusters, representing textual patterns for matching, and combining rules and tasks to infer information. This knowledge graph could then enable capabilities like requirements extraction, model population, quality checking, and reuse of system artifacts. The approach aims to automate tasks, link different artifact types, and leverage semantics and AI/ML to better understand and exploit knowledge embedded in systems artifacts.
Presentation adapted from the ProSTEP symposium to present the concept and advances in the digitalization of the lifecycle with a focus on task automation and reuse.
1) The document discusses how systems engineering methods can be integrated with the AI/ML lifecycle to engineer intelligent systems. It identifies 10 major challenges for this integration, including describing AI/ML model needs and capabilities, integrating AI/ML into specification, verification, and other systems engineering processes.
2) The document proposes concepts for tackling each challenge, such as using standards to describe AI/ML model lifecycles and digital twin environments for verification. It also discusses opportunities like reusing existing AI/ML models and the need to educate new professionals.
3) Key points are that research is active in integrating systems engineering and AI/ML to build safer, more cost-effective cyber-physical systems, and
This document discusses digitalizing the engineering lifecycle through task automation and reuse. It proposes a knowledge-centric systems engineering approach using a knowledge management strategy called "Sailing the V". This involves defining a controlled vocabulary and formalizing relationships between terms, textual patterns, and rules to infer information and link system artifacts like requirements, models, and simulations. The goal is to automate tasks, enable reuse, ensure quality, and provide a more integrated environment for engineers. Future work will focus on data integration, semantics, artificial intelligence, and enhancing engineering methods.
This document presents an introduction to Deep Learning. It begins with an agenda that includes an overview of Deep Learning, Keras, and use-case examples. It then covers architectures and configurations of deep neural networks, including activation functions, loss functions, and example networks such as AlexNet and ResNet. It also describes the technology environment, including frameworks such as TensorFlow and Keras, and cloud infrastructure. Finally, it provides a working methodology and a list of practical examples.
This presentation is a keynote from the AI4SE International Workshop exploring the challenges and opportunities of bringing Systems Engineering to the development of AI/ML functions for safety-critical systems.
This is the presentation of the paper about the integration of artificial intelligence and the systems engineering lifecycle.
You can find more information in the following link: https://event.conflr.com/IS2019/sessiondetail_395325
The objective of this presentation to present some challenges and opportunities in the integration of Systems Engineering and the Artificial Intelligence/Machine Learning model lifecycle.
A presentation of the ongoing work on interoperability within the toolchain. A new domain, OSLC KM, is introduced; some experiments for reusing models are also presented, and some videos are used to present user stories.
This document introduces software architecture and provides examples using GitHub. It defines software architecture as the fundamental concepts or properties of a system embodied in its elements, relationships, and design principles. The document outlines Philippe Kruchten's 4+1 view model for describing software architecture, including logical, process, physical and development views in addition to scenarios. Diagrams for GitHub's class, component, sequence and deployment architectures are presented as examples.
This is the final degree project of Eduardo Cibrián, who developed a semantic system to generate news headlines for several sports based on a set of patterns.
In this presentation, an overview of blockchain foundations is presented. The presentation introduces the use of blockchain in the music industry. To do so, a good number of platforms are presented. It mainly reviews the use of blockchain for intellectual property management, digital identity, monetization, etc.
OJP data from firms like Vicinity Jobs have emerged as a complement to traditional sources of labour demand data, such as the Job Vacancy and Wages Survey (JVWS). Ibrahim Abuallail, PhD Candidate, University of Ottawa, presented research relating to bias in OJPs and a proposed approach to effectively adjust OJP data to complement existing official data (such as from the JVWS) and improve the measurement of labour demand.
Lecture slides titled Fraud Risk Mitigation, delivered as a webinar at the Society for West African Internal Audit Practitioners (SWAIAP) on Wednesday, November 8, 2023.
5 Tips for Creating Standard Financial Reports, by EasyReports
Well-crafted financial reports serve as vital tools for decision-making and transparency within an organization. By following the tips below, you can create standardized financial reports that effectively communicate your company's financial health and performance to stakeholders.
Economic Risk Factor Update: June 2024 [SlideShare], by Commonwealth
May’s reports showed signs of continued economic growth, said Sam Millette, director, fixed income, in his latest Economic Risk Factor Update.
For more market updates, subscribe to The Independent Market Observer at https://blog.commonwealth.com/independent-market-observer.
STREETONOMICS: Exploring the Uncharted Territories of Informal Markets through..., by sameer shah
Delve into the world of STREETONOMICS, where a team of 7 enthusiasts embarks on a journey to understand unorganized markets. By engaging with a coffee street vendor and crafting questionnaires, this project uncovers valuable insights into consumer behavior and market dynamics in informal settings.
BONKMILLON Unleashes Its Bonkers Potential on Solana, by coingabbar
Introducing BONKMILLON - The Most Bonkers Meme Coin Yet
Let's be real for a second – the world of meme coins can feel like a bit of a circus at times. Every other day, there's a new token promising to take you "to the moon" or offering some groundbreaking utility that'll change the game forever. But how many of them actually deliver on that hype?
Detection of fraud in financial blockchain-based transactions through big data analytics
1. Detection of fraud in financial blockchain-based transactions through big data analytics
Jessica Páez Bonilla
Director: Jose María Álvarez Rodríguez
Universidad Carlos III de Madrid
Master in Big Data Analytics
2017-2018
July 11, 2018
2. Overview
1 Introduction
2 Project Objectives
3 System Design
4 Implementation
5 Experiment
6 Project Budget and Plan
7 Legal Framework and socio-economic environment
8 Conclusions and Future works
3. Introduction
Using analytical techniques (data gathering, preprocessing, and model building), it could be possible to detect and prevent financial fraud.
The aim is to describe complex fraud in terms of patterns suitable for system-driven detection and analysis.
Network analysis can provide useful insight into large datasets based on the interconnectedness of the agents in the network being analyzed.
4. Introduction
Network: shows relationships among the blockchain users and the flow of money. It enables the discovery of fraud patterns.
Network graph analysis offers a method for capturing the context of fraud in a standard, machine-readable and transferable format.
Associations learned from visually observing fraudulent transactions could be used as knowledge input to create analytical models.
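As a minimal illustration of this idea (a sketch, not code from the thesis), the snippet below builds a directed transaction graph with networkx from an edge list of (sender, receiver, amount) tuples; the addresses and amounts are made up:

```python
# Hedged sketch: a directed transaction network in networkx.
# Addresses and amounts are placeholders for illustration only.
import networkx as nx

edges = [
    ("addr_A", "addr_B", 0.5),
    ("addr_B", "addr_C", 0.2),
    ("addr_A", "addr_C", 1.3),
]

G = nx.DiGraph()  # money flows are directed
for sender, receiver, amount in edges:
    G.add_edge(sender, receiver, weight=amount)
```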
5. Project Objectives
1 Research techniques used for fraud detection and explore blockchain data.
2 Design a system that could take into account the patterns surrounding the fraudulent transactions.
3 Implement a system using big data analytic tools like R and Python.
4 Experiment with and validate the designed system.
7. System Design - Network Metrics

Metric       Interpretation
Degree       Influence on the network
Closeness    How quickly other nodes in the network can be reached
Betweenness  Node location: does it lie on the shortest paths between other nodes?
Density      Level of linkage among the nodes
Modularity   How modular the network is
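As a hedged sketch of how these metrics can be computed (the built-in graph below is a stand-in for an undirected projection of the transaction network; the `community` package is python-louvain, listed on the implementation slide):

```python
# Hedged sketch (not thesis code): computing the metrics above with
# networkx and the python-louvain `community` package.
import networkx as nx
import community as community_louvain  # python-louvain

G = nx.karate_club_graph()  # placeholder graph

degree = dict(G.degree())                   # influence on the network
closeness = nx.closeness_centrality(G)      # how quickly other nodes are reached
betweenness = nx.betweenness_centrality(G)  # presence on shortest paths
density = nx.density(G)                     # level of linkage among nodes

partition = community_louvain.best_partition(G)         # Louvain communities
modularity = community_louvain.modularity(partition, G) # how modular the network is
```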
8. Implementation - Technology used
BigQuery, R (igraph) and Python have been used in the development of this system.

Table 1: Used Package Versions
Package     Version
matplotlib  1.5.1
pandas      0.19.2
networkx    1.11
community   0.9
numpy       1.11.3
scipy       0.18.1
9. Experiment - Steps
1 Data Exploration.
2 Network metrics and extraction of communities.
3 Features and ML algorithms selection.
4 Performance Measures.
5 Execution.
6 Analysis of Results.
7 Experiment Limitations.
10. Experiment - 1. Data Exploration
Bitcoin blockchain data was explored using BigQuery. A data segment containing fraudulent movements was chosen as the sample for analysis in this project.
Figure 1: Blocks over time
Figure 2: Transactions in the sample
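A minimal sketch of this kind of exploration with the BigQuery Python client is shown below; the public dataset and column names are assumptions for illustration, not taken from the slides:

```python
# Hedged sketch: querying public Bitcoin data with the BigQuery client.
# Dataset/table and columns below are assumed, not from the thesis.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT block_timestamp, `hash`, output_value
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
    LIMIT 1000
"""
for row in client.query(query).result():
    print(row.block_timestamp, row.output_value)
```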
12. Experiment - 3. Features and ML algorithms selection
Figure 4: Selected features
ML Algorithms
1 Decision Tree
  1 White-box model; it can be interpreted.
  2 Performs well on imbalanced datasets.
2 Random Forest
  1 Ensemble: combines the predictions of several base estimators in order to improve robustness over a single estimator.
  2 Each tree in the ensemble is built from a sample drawn with replacement.
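A hedged scikit-learn sketch of the two classifiers above; synthetic placeholder data stands in for the thesis's network-metric features and suspicious/non-suspicious labels:

```python
# Hedged sketch (not thesis code): the two classifiers via scikit-learn,
# fitted on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
```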
13. Experiment - 4. Performance Measures
Classification Accuracy
It gives the percentage of correct predictions.
Confusion Matrix
It is a 2x2 matrix that tells us the types of errors that the classifier is making.
AUC - Area Under the (ROC) Curve
It is a single-number summary of classifier performance, useful even when there is class imbalance.
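A hedged sketch of the three measures with sklearn.metrics, reusing the fitted `forest` and the test split from the previous sketch:

```python
# Hedged sketch: the three performance measures via sklearn.metrics.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_pred = forest.predict(X_test)
print(accuracy_score(y_test, y_pred))         # percentage of correct predictions
print(confusion_matrix(y_test, y_pred))       # 2x2 matrix of the error types
y_score = forest.predict_proba(X_test)[:, 1]  # probability of the positive class
print(roc_auc_score(y_test, y_score))         # area under the ROC curve
```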
14. Experiment - 5. Execution
Once the features (transaction network metrics) are obtained, and the ML algorithms and their performance metrics are defined, two main tasks need to be run before fitting the system.
Observations Labeling
Analysis of a real fraudulent transaction.
Dataset Balancing
Once the dataset was labeled, there were many more observations of one class than of the other. An oversampling technique was applied in order to balance it.
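A minimal imbalanced-learn sketch of this balancing step, reusing the placeholder split from the earlier classifier sketch:

```python
# Hedged sketch: oversampling the minority class with imbalanced-learn.
# Current releases expose fit_resample; very old ones used fit_sample.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_balanced, y_balanced = ros.fit_resample(X_train, y_train)
```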
15. Experiment - 5.1. Analysis of a fraudulent transaction
Figure 5: Fraudster Neighbours
16. Experiment - 5.2. Dataset Balancing
The dataset used has around 30k observations in the training set and around 7k in the test set. The Python package Imbalanced-learn was used. It applies an oversampling on the minority class.

Table 2: Proportion of classes
Dataset  Class           Proportion
Train    Suspicious      0.498627
Train    Non-suspicious  0.501373
Test     Suspicious      0.500343
Test     Non-suspicious  0.499657
17. Experiment - 6. Analysis of Results
The obtained metrics of the selected ML algorithms are summarized in the table below:

Table 3: Classification Metrics Comparison
Model          Class. Accuracy  Sensitivity  AUC
Decision Tree  0.9989           0.9979       0.9994
Random Forest  0.9619           0.9752       0.9974

The Random Forest was selected, as it was the model that gave more weight to the different network metrics while still achieving a high accuracy.
18. Experiment - 6. Analysis of Results
The weight given to each feature by the Random Forest is presented in a bar chart (figure not reproduced here).
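Since the original chart is not reproduced in this transcript, here is a hedged matplotlib sketch of how such a feature-importance bar chart can be produced; the feature names are placeholders and `forest` is the fitted Random Forest from the earlier sketch:

```python
# Hedged sketch: feature-importance bar chart with matplotlib
# (listed among the used packages). Feature names are placeholders.
import matplotlib.pyplot as plt

feature_names = ["degree", "closeness", "betweenness", "density", "modularity"]
plt.bar(feature_names, forest.feature_importances_)
plt.ylabel("Feature weight")
plt.title("Random Forest feature importances")
plt.show()
```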
21. Experiment - 7. Limitations
By studying more known cases of fraud within the Bitcoin blockchain, it could be possible to expand the set of known fraudulent transaction patterns.
Having more data would also help to prevent overfitting with decision trees, as the tree design would not be able to cover all the training data.
22. Project Budget
A summary of the project budget is presented in the table.

Cost            Total (€)
Direct Costs    8,827.5
Indirect Costs  882.75
Total Costs     9,710.25
Profit (10%)    971.025
Cost + Profit   10,681.275
IVA (21%)       2,243.06
TOTAL + IVA     12,924.343
24. Legal Framework and socio-economic environment
Legal Framework: The Bitcoin blockchain data is now available for exploration with BigQuery, using Google Cloud services. The data is public and no licensing is required.
Socio-economic environment: Blockchain technology is rapidly evolving and will be widely used in the finance world in the coming years.
It has been forecast that 10% of world GDP will be stored in blockchains by 2020.
The IoT era also promotes the Fintech revolution.
This creates the challenge of developing and applying different sets of techniques in order to detect fraud on these new digital platforms.
25. Conclusions
1 Business: Detecting and flagging activity suspected of being fraudulent before it actually takes place could save billions annually in both developed and developing economies.
2 Technical: The proposed system can flag a suspicious blockchain transaction with high accuracy, taking into account network metrics resulting from modeling the giant components of the transaction network.
3 Personal: Learning about a growing sector ("Fintech") that combines finance and technology, as well as how analytic techniques can be applied to it.
26. Future works
1 Create a software platform that could access and integrate both the R and Python environments.
2 This platform could run continuously and flag, by means of a UI, whenever the model classifies a new observation as Suspicious.
3 Knowing more patterns of fraudulent transactions can help to avoid overfitting in the models.
4 Try other network metrics (such as mean neighbour degree, node correlation similarity, etc.) as features for the classification model; see the sketch below for the first of these.
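As a pointer, networkx already ships the mean neighbour degree metric; a minimal sketch with a placeholder graph:

```python
# Hedged sketch: mean neighbour degree via networkx on a placeholder graph.
import networkx as nx

G = nx.karate_club_graph()  # placeholder graph
mean_neighbour_degree = nx.average_neighbor_degree(G)  # dict: node -> value
```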
27. Thank you for your attention