This document discusses using machine learning to classify malware into families based on the DREBIN dataset. It covers:
1. Preprocessing the dataset, including integer encoding and one-hot encoding to convert categorical data to numeric form for modeling.
2. Addressing overfitting by splitting the data into training and test sets and using cross-validation.
3. Using classifiers like Random Forest and SVM with strategies like one-vs-all and one-vs-one to perform multiclass classification of malware families.
4. Training a separate binary classifier for each malware family, then combining their outputs to assign each sample to the appropriate family.
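The preprocessing and one-vs-all steps above can be sketched in a few lines. This is a minimal illustration, not the DREBIN pipeline itself: the permission names and family scores below are made up for the example.

```python
# Minimal sketch of the described preprocessing: integer encoding followed
# by one-hot encoding, plus the one-vs-all decision rule over per-family
# binary classifier scores. All feature values and scores are illustrative.

def integer_encode(values):
    """Map each distinct categorical value to a stable integer code."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def one_hot(code, width):
    """Expand an integer code into a one-hot vector of the given width."""
    return [1 if i == code else 0 for i in range(width)]

def one_vs_all(scores):
    """Given {family: binary-classifier score}, pick the highest-scoring family."""
    return max(scores, key=scores.get)

perms = ["SEND_SMS", "INTERNET", "SEND_SMS", "READ_CONTACTS"]
encoded, codes = integer_encode(perms)
vectors = [one_hot(c, len(codes)) for c in encoded]
family = one_vs_all({"FakeInstaller": 0.91, "DroidKungFu": 0.34, "Plankton": 0.55})
```

In practice the per-family scores would come from trained binary classifiers (e.g. Random Forest or SVM); the argmax-style combination is the one-vs-all strategy the document names.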
The document describes using a VGG model for image classification of Venice boat types from the MarDCT dataset. It discusses:
1. Using the VGG16 and VGG19 pre-trained models from Keras to extract features from images in the MarDCT training and test sets.
2. Training linear SVM and Random Forest classifiers on the extracted features to classify images into 24 boat types.
3. Evaluating the classifiers using techniques like k-fold cross-validation, and calculating accuracy, precision, recall, and F1 scores.
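The evaluation step in point 3 reduces to computing the standard metrics from predicted versus true labels. The sketch below does this in pure Python for a single class; a real run of the described pipeline would feed VGG16 features into an SVM first, and the boat labels here are illustrative.

```python
# Hedged sketch of the evaluation step: accuracy, precision, recall, and F1
# for one positive class, computed directly from label lists.

def binary_metrics(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["gondola", "gondola", "vaporetto", "vaporetto", "gondola"]
y_pred = ["gondola", "vaporetto", "vaporetto", "vaporetto", "gondola"]
acc, prec, rec, f1 = binary_metrics(y_true, y_pred, positive="gondola")
```

For the 24-class problem these per-class figures would typically be macro- or micro-averaged across classes.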
Black-Box attacks against Neural Networks - technical project report - Roberto Falconi
This document summarizes a technical report on practical black-box attacks against machine learning. It describes how the authors implemented black-box attacks against deep neural network classifiers without any knowledge of the model's architecture or parameters. The attack strategy involves training a substitute model using synthetic inputs generated from the target model's outputs, then crafting adversarial examples using the substitute model that are misclassified by the target model. The authors validated the attacks on MNIST and CIFAR classifiers using two different attack techniques and also tested attacks on a locally trained dataset. Defenses such as adversarial training and defensive distillation were discussed.
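The core crafting step described above can be sketched with a sign-gradient (FGSM-style) perturbation computed against a known "substitute" model. A tiny logistic model stands in for the substitute network here, and the weights and inputs are illustrative, not from the report.

```python
# FGSM-style sketch: step each feature against the gradient sign of the
# substitute model's loss, then check the perturbed input flips the decision.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Positive-class probability of a logistic 'substitute' model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm_perturb(w, x, eps):
    """For a logistic model, stepping x by -eps*sign(w) reduces the
    positive-class score — the sign-gradient direction for this loss."""
    sign = lambda v: (v > 0) - (v < 0)
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w, b = [2.0, -1.0, 0.5], 0.1      # illustrative substitute parameters
x = [0.6, 0.2, 0.4]               # clean input, classified positive
x_adv = fgsm_perturb(w, x, eps=0.5)
clean, adv = predict(w, b, x), predict(w, b, x_adv)
```

The black-box premise is that `x_adv`, crafted only from the substitute, also transfers to (is misclassified by) the unseen target model.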
Automatic reverse engineering of malware emulators - UltraUploader
This document proposes techniques for automatically reverse engineering malware emulators. It presents an algorithm using dynamic analysis to execute emulated malware, record the x86 instruction trace, and use data flow and taint analysis to identify the bytecode program and extract syntactic and semantic information about the bytecode instruction set. The authors implemented a proof-of-concept system called Rotalumé, which accurately revealed the syntax and semantics of emulated instruction sets for programs obfuscated by VMProtect and Code Virtualizer.
Software Birthmark for Theft Detection of JavaScript Programs: A Survey - Swati Patel
The document discusses software birthmarks, which are characteristics of a program that uniquely identify it. A birthmark can be used to detect software theft by searching for the birthmark of a plaintiff program in a suspected program. Specifically, the document discusses heap graph-based birthmarks, which are generated from a program's runtime heap structure and object references. A subgraph of the heap graph forms the birthmark. Subgraph monomorphism is used to search for the birthmark in a suspected program's heap graph to detect copying of code. Heap graph-based birthmarks are robust against attacks like code obfuscation that aim to disguise stolen code.
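The subgraph monomorphism search at the heart of the detection step can be illustrated with a brute-force matcher. This is an exponential toy, not what a production tool would use (pruned matchers such as VF2 are typical); the "birthmark" and "heap" graphs below are invented two- and three-node examples.

```python
def subgraph_monomorphic(pattern, target):
    """Brute-force subgraph monomorphism: an injective mapping of pattern
    nodes into target nodes that preserves every pattern edge. Graphs are
    dicts mapping a node to the set of its successors."""
    p_nodes = list(pattern)
    t_nodes = list(target)

    def edges_ok(mapping):
        # every pattern edge with both endpoints mapped must exist in target
        return all(mapping[v] in target.get(mapping[u], set())
                   for u in mapping for v in pattern[u] if v in mapping)

    def extend(mapping, used):
        if len(mapping) == len(p_nodes):
            return True
        u = p_nodes[len(mapping)]
        for t in t_nodes:
            if t in used:
                continue
            mapping[u] = t
            if edges_ok(mapping) and extend(mapping, used | {t}):
                return True
            del mapping[u]
        return False

    return extend({}, set())

# A two-node "birthmark" pattern searched for inside a larger heap graph.
birthmark = {"a": {"b"}, "b": set()}
heap = {"x": {"y", "z"}, "y": {"z"}, "z": set()}
found = subgraph_monomorphic(birthmark, heap)
```

Monomorphism (rather than induced-subgraph isomorphism) is the right notion here because extra edges added by obfuscation in the suspected program's heap graph must not hide the birthmark.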
This document outlines a project on analyzing sentiment from Twitter data using Python. Chapter 1 introduces the tools and packages used, including Tweepy, tkinter, TextBlob and Matplotlib. Chapter 2 describes collecting tweets using the Twitter API, preprocessing the data through tokenization and removing stop words. Chapter 3 presents the results of the sentiment analysis but does not provide details. Chapter 4 concludes that the project covered basics of Twitter data collection and preprocessing in Python as an introduction to more advanced analysis.
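The preprocessing described for Chapter 2 — tokenization and stop-word removal — can be sketched directly. The stop-word list below is a tiny illustrative subset, not the one the project used, and the URL/mention stripping rules are an assumption about typical tweet cleanup.

```python
# Sketch of tweet preprocessing: lowercase, strip URLs/mentions/'#',
# tokenize, and drop stop words.
import re

STOP_WORDS = {"the", "is", "a", "an", "to", "and", "of", "this"}

def preprocess(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|@\w+|#", " ", tweet)   # URLs, mentions, '#'
    tokens = re.findall(r"[a-z']+", tweet)               # word tokens only
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("This is a GREAT phone! @user https://t.co/xyz #review")
```

The cleaned token list is what a sentiment scorer such as TextBlob (mentioned in Chapter 1) would then consume.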
These are some of the FAQs asked in the TCS NQT exam. Preparing these questions can help you obtain good marks.
NOTE: These are FAQs; do not rely on them exclusively.
This document is a thesis submitted by Iason Papapanagiotakis-Bousy to University College London for the degree of Master of Science in Information Security. The thesis defines external metamorphic obfuscation engines using term rewriting systems and analyzes the problem of learning the rewriting rules of such obfuscations given a finite set of malware samples. Specifically, it proves the impossibility of exactly learning the rules but provides an algorithm for approximating the rules under certain assumptions. The work aims to lay the foundations for further research on analyzing metamorphic malware obfuscations.
This document proposes quality measures for assessing linkset quality in linked data. It defines quality indicators, scoring functions, and aggregate metrics for evaluating linksets. Quality indicators examine aspects like entity types and counts. Scoring functions measure type coverage, completeness, and entity coverage within linksets. Interpretation tables help users understand scoring results and determine next steps. The measures specifically address linkset completeness for complementing datasets. The work contributes a first formalization and prototype for linkset quality assessment.
A hybrid model to detect malicious executables - UltraUploader
This document presents a hybrid model for detecting malicious executables that uses three types of features: binary n-grams extracted from executable files, assembly n-grams extracted from disassembled executables, and DLL function calls extracted from program headers. A classifier like SVM is trained on the combined "hybrid feature set" to distinguish between benign and malicious executables. The model achieves high detection accuracy and low false positive rates compared to other feature-based approaches.
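One of the three feature families, binary n-grams, is simple to sketch: sliding windows of n consecutive bytes from the executable, counted as features. The byte string below is an illustrative stand-in for real PE file contents, not data from the paper.

```python
# Hedged sketch of binary n-gram extraction for the hybrid feature set.
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2):
    """Count every n-byte sliding window as a hex-string feature."""
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))

features = byte_ngrams(b"\x4d\x5a\x90\x00\x4d\x5a", n=2)
```

Assembly n-grams work the same way over disassembled opcode sequences, and the resulting counts (together with DLL-call features) form the vectors an SVM is trained on.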
Clone group mapping is important in the study of code clone evolution. This work applies topic modeling techniques to code clones for the first time and proposes a new clone group mapping method. By using topic modeling to transform the mapping problem from a high-dimensional code space into a low-dimensional topic space, clone group mapping is achieved indirectly by mapping clone group topics. Experiments on four open-source projects show recall and precision of up to 0.99, so the method can effectively and accurately perform clone group mapping.
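The topic-space mapping idea can be sketched as follows: represent each clone group by its topic distribution and map a group in one version to the most similar group in the next version by cosine similarity. The topic vectors and group ids below are invented for illustration; the paper's actual topic model and similarity choice may differ.

```python
# Illustrative sketch: clone group mapping via cosine similarity of
# low-dimensional topic distributions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def map_clone_group(group_topics, candidate_topics):
    """Return the candidate group id whose topic distribution is closest."""
    return max(candidate_topics,
               key=lambda g: cosine(group_topics, candidate_topics[g]))

old_group = [0.7, 0.2, 0.1]                       # topic mix in version n
new_groups = {"g1": [0.1, 0.8, 0.1],              # candidates in version n+1
              "g2": [0.65, 0.25, 0.1]}
match = map_clone_group(old_group, new_groups)
```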
Architecture of a morphological malware detector - UltraUploader
This document proposes an architecture for a morphological malware detector that combines syntactic and semantic analysis. It builds an efficient signature matching engine using tree automata techniques to represent control flow graphs (CFG). It also describes a graph rewriting engine to handle common malware mutations. The detector extracts CFGs from malware binaries to generate signatures, which are compiled into a minimal automaton database for efficient matching. Experiments showed promising results with a low false positive rate.
The document discusses analyzing Twitter data on the 2016 Chevrolet Camaro to detect consumer sentiment. It describes conducting word cloud analysis and sentiment classification using R programming. Key steps include collecting Twitter data on the Camaro, preprocessing the text data, generating a word cloud to identify frequent keywords, classifying tweets by emotion using naive Bayes classification, and classifying polarity as positive or negative sentiment. Graphs are produced to show the results of the emotion and polarity classification analyses.
An Approach to Software Testing of Machine Learning Applications - butest
This document describes an approach to software testing of machine learning applications. The approach involves analyzing the problem domain, algorithm, and implementation options to generate test cases. Two case studies applying this approach found bugs in implementations of MartiRank and SVM ranking algorithms. Analyzing the problem domain uncovered issues like how to handle missing/negative values. Analyzing algorithms revealed potential specification imprecisions. Analyzing options showed how inputs could be manipulated. The approach helped find bugs, create regression test cases, and validate implementations by comparing results across different versions and ML algorithms.
Source code recovery is one of the most tedious, yet most interesting, tasks in reverse engineering. In this talk, the author presents a tool, under development (on and off) since last year, that aims to generate auto-compilable source code from binaries. The tool currently works, though it needs considerably more work.
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS - ijseajournal
This paper investigates using categorical features of bug reports, such as the component a bug belongs to, to build a classification model for bug assignment. The model is trained to predict the developer assigned to a bug report based on its categorical fields rather than textual content. An evaluation on three projects found that using both categorical features and textual content improved accuracy over using textual content alone. Using only categorical features provided some improvement over prior approaches but was less accurate than using both data types.
This document provides an overview of source code plagiarism detection. It discusses different types of source code plagiarism including textual similarity and functional similarity. It also describes various source code plagiarism detection algorithms such as text-based, token-based, parse tree-based, and metrics-based approaches. Detection techniques including lexical analysis and parse tree comparisons are explained. Popular source code plagiarism detection tools like JPlag, MOSS, and YAP are outlined. The document concludes that plagiarism detection in programming assignments is challenging and detection depends on the programming languages supported.
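The token-based approach mentioned above can be sketched with a crude lexer that normalizes identifiers (so variable renaming has no effect), followed by k-gram Jaccard similarity — the rough idea behind tools in the MOSS family, though this is not MOSS's winnowing algorithm itself. The keyword set and snippets are illustrative.

```python
# Token-based plagiarism-detection sketch: normalize identifiers to 'ID',
# take k-grams of the token stream, compare by Jaccard similarity.
import re

KEYWORDS = {"if", "else", "for", "while", "return", "int"}

def tokenize(source):
    tokens = re.findall(r"[A-Za-z_]\w*|==|[-+*/=(){};<>]", source)
    # identifiers collapse to ID; keywords and operators are kept as-is
    return [t if t in KEYWORDS or not re.match(r"[A-Za-z_]", t) else "ID"
            for t in tokens]

def kgrams(tokens, k=3):
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

p1 = "int sum = a + b ; return sum ;"
p2 = "int total = x + y ; return total ;"    # p1 with variables renamed
sim = jaccard(kgrams(tokenize(p1)), kgrams(tokenize(p2)))
```

Because both snippets normalize to the same token stream, renaming alone does not reduce the similarity — which is exactly why token-based detection is more robust than text-based diffing.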
This document is a project report submitted by four students - Anil Shrestha, Bijay Sahani, Bimal Shrestha, and Deshbhakta Khanal - to the Department of Electronics and Computer Engineering at Tribhuvan University in partial fulfillment of the requirements for a Bachelor's degree in Computer Engineering. The report details the development of a web application called "Tweezer" to perform sentiment analysis on tweets in order to determine public sentiment towards various products, services, or personalities. Literature on previous work related to sentiment analysis, especially on social media data like tweets, is also reviewed in the report.
This document summarizes a machine learning approach called SuSi that automatically identifies sources and sinks in Android applications without using predefined lists. SuSi analyzes Android framework and pre-installed app code to generate a categorized list of sources and sinks. It achieves over 92% accuracy using support vector machines. SuSi can detect previously unknown sources and sinks in new Android versions and provides a list that can be directly used by static and dynamic analysis tools to identify privacy leaks.
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-Studio - PVS-Studio
After Visual Studio 2012 was released with a new static analysis unit included in all of the product's editions, a natural question arises: "Is PVS-Studio still relevant as a static analysis tool or can it be replaced by the tool integrated into VS?". A detailed answer with examples is given in this article. We have performed interface and usability comparison as well as a comparison of error diagnosis strength in real software code. The comparison was carried out on the source code of three open-source projects by id Software: Doom 3, Quake 3: Arena, Wolfenstein: Enemy Territory.
The first lab introduces MATLAB for signal analysis. It covers entering matrices, basic matrix operations, plotting signals, and saving and loading variables. Complex numbers and variables are also introduced. Plotting commands allow visualizing signals in both the time and frequency domains. Key MATLAB functions taught include plot, xlabel, ylabel, size, length, clear, save, and load.
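The lab's time/frequency-domain workflow can be mirrored outside MATLAB. The sketch below generates a sinusoid and locates its dominant frequency with a naive DFT, using only the Python standard library (in MATLAB this would be a few lines with `plot` and `fft`; the sample rate and tone frequency are made up for the example).

```python
# Dependency-free sketch: a 5 Hz tone sampled at 64 Hz, analyzed with a
# naive per-bin DFT to find the dominant frequency bin.
import math

fs, n = 64, 64                      # sample rate (Hz) and number of samples
signal = [math.sin(2 * math.pi * 5 * t / fs) for t in range(n)]  # 5 Hz tone

def dft_magnitude(x, k):
    """Magnitude of the k-th DFT bin of x (naive O(n) per bin)."""
    re = sum(xi * math.cos(2 * math.pi * k * i / len(x)) for i, xi in enumerate(x))
    im = -sum(xi * math.sin(2 * math.pi * k * i / len(x)) for i, xi in enumerate(x))
    return math.hypot(re, im)

mags = [dft_magnitude(signal, k) for k in range(n // 2)]
peak_bin = max(range(len(mags)), key=mags.__getitem__)
```

With fs = n = 64 each bin is 1 Hz wide, so the 5 Hz tone lands exactly in bin 5 — the frequency-domain view of the time-domain signal the lab has students plot.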
The document presents a study that analyzes sentiment on Twitter using various classification algorithms. It compares the performance of Naive Bayes, Bayes Net, Discriminative Multinomial Naive Bayes, Sequential Minimal Optimization, Hyperpipes, and Random Forest algorithms on a Twitter sentiment dataset. The study finds that Discriminative Multinomial Naive Bayes and Sequential Minimal Optimization algorithms have the best performance with overall F-scores of 0.769 and 0.75, respectively. The study aims to determine the most accurate and efficient algorithms for Twitter sentiment classification.
The document describes the Like2Vec recommender system model. It transforms sparse user-item rating matrices into a graph representation, and then uses the DeepWalk algorithm to learn embeddings of nodes in the graph. These embeddings are trained with the Skip-Gram language model on random walks generated through the graph. Like2Vec is evaluated on the Netflix dataset and is shown to outperform baselines in Recall-at-N, which directly measures the quality of top recommendations compared to RMSE which does not. Recall-at-N is argued to be a superior evaluation metric for recommender systems.
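The random-walk stage that DeepWalk feeds to Skip-Gram is easy to sketch: fixed-length walks over the user-item graph, each walk acting as a "sentence" of node "words". The bipartite graph and walk parameters below are illustrative, not from the Netflix experiments.

```python
# DeepWalk-style random walks over a tiny user-item graph.
import random

def random_walk(graph, start, length, rng):
    """Generate one walk of up to `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

graph = {"u1": ["i1", "i2"], "u2": ["i1"],
         "i1": ["u1", "u2"], "i2": ["u1"]}
rng = random.Random(0)                      # seeded for reproducibility
walk = random_walk(graph, "u1", length=5, rng=rng)
```

A corpus of such walks is then passed to Skip-Gram training, yielding the node embeddings Like2Vec uses for recommendation.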
This document summarizes methods of dynamic binary analysis and the Valgrind tool. It discusses how dynamic binary analysis tools like Valgrind use techniques like dynamic binary instrumentation and shadow memory to detect errors in programs as they execute. Valgrind's Memcheck tool tracks definedness of values at the bit level to detect errors like bad memory accesses and uses of uninitialized data with low false positive rates. The document also explains Valgrind's use of disassembly and re-synthesis to translate machine code to an intermediate representation for instrumentation.
Mining Fix Patterns for FindBugs Violations - Dongsun Kim
Several static analysis tools, such as Splint or FindBugs, have been proposed to the software development community to help detect security vulnerabilities or bad programming practices. However, the adoption of these tools is hindered by their high false positive rates. If the false positive rate is too high, developers may become acclimated to violation reports from these tools, causing concrete and severe bugs to be overlooked. Fortunately, some violations are actually addressed and resolved by developers. We claim that those violations that are recurrently fixed are likely to be true positives, and that an automated approach can learn to repair similar unseen violations. However, there is a lack of both a systematic way to investigate the distributions of existing and fixed violations in the wild, which could provide insights into prioritizing violations for developers, and an effective way to mine code and fix patterns that could help developers easily understand the reasons behind violations and how to fix them.
In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal discrepancies between the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread, and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that uses convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers accepted and merged a majority (69/116) of the fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in the Defects4J benchmark for software testing and automated repair.
This document summarizes a research paper on sentiment analysis of tweets from Twitter. It discusses how tweets are collected and preprocessed, including removing punctuation and stop words. A Naive Bayes classifier is used to classify the preprocessed tweets as positive, negative, or neutral based on a lexicon dictionary. The results are evaluated to check accuracy. Future work proposed includes computing an overall sentiment score for topics and creating a web app for users to input keywords to analyze sentiment.
IRJET - Automation in Python using Speech Recognition - IRJET Journal
This document describes a project to automate Python using speech recognition. The system allows a user to compile and execute Python scripts and Java codes using voice commands. It works by monitoring a specified directory for file system changes. When a voice command is received with the name of a file, a batch file is dynamically created to execute the corresponding Python script or Java code. The output is then displayed in the command prompt window. The overall goal is to reduce the effort required to run programs by automating the process through speech commands.
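The dispatch step — mapping a recognized file name to the command a dynamically written batch file would run — can be sketched as below. Actual speech capture (e.g. with the SpeechRecognition package) and directory monitoring are stubbed out, and the extension-to-command table is an assumption about the project's setup, not taken from it.

```python
# Hedged sketch: build the shell command a generated batch file would run,
# chosen by the extension of the file named in the voice command.
from pathlib import Path

RUNNERS = {".py": "python", ".java": "java"}

def build_command(filename):
    """Return the command line to execute, based on file extension."""
    ext = Path(filename).suffix
    if ext not in RUNNERS:
        raise ValueError(f"unsupported file type: {ext}")
    if ext == ".java":
        # Java needs a compile step before running the class by its stem name
        return f"javac {filename} && java {Path(filename).stem}"
    return f"{RUNNERS[ext]} {filename}"

cmd = build_command("hello.py")
```

In the described system this command string would be written into a temporary .bat file and executed, with its output shown in the command prompt window.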
This document discusses how to create a private version of CRAN packages behind a company firewall for internal use. It describes using the miniCRAN R package to selectively download and create a local mirror of specific CRAN packages and their dependencies. This allows controlling the packages available internally. It also discusses using MRAN and the RRT package to reproduce analyses and ensure scripts work across package and R version updates.
This document is a thesis submitted by Iason Papapanagiotakis-Bousy to University College London for the degree of Master of Science in Information Security. The thesis defines external metamorphic obfuscation engines using term rewriting systems and analyzes the problem of learning the rewriting rules of such obfuscations given a finite set of malware samples. Specifically, it proves the impossibility of exactly learning the rules but provides an algorithm for approximating the rules under certain assumptions. The work aims to lay the foundations for further research on analyzing metamorphic malware obfuscations.
This document proposes quality measures for assessing linkset quality in linked data. It defines quality indicators, scoring functions, and aggregate metrics for evaluating linksets. Quality indicators examine aspects like entity types and counts. Scoring functions measure type coverage, completeness, and entity coverage within linksets. Interpretation tables help users understand scoring results and determine next steps. The measures specifically address linkset completeness for complementing datasets. The work contributes a first formalization and prototype for linkset quality assessment.
A hybrid model to detect malicious executablesUltraUploader
This document presents a hybrid model for detecting malicious executables that uses three types of features: binary n-grams extracted from executable files, assembly n-grams extracted from disassembled executables, and DLL function calls extracted from program headers. A classifier like SVM is trained on the combined "hybrid feature set" to distinguish between benign and malicious executables. The model achieves high detection accuracy and low false positive rates compared to other feature-based approaches.
Clone group mapping has a very important significance in the evolution of code clone. The topic modeling techniques were applied into code clone firstly and a new clone group mapping method was proposed. By using topic modeling techniques to transform the mapping problem of
high-dimensional code space into a low-dimensional topic space, the goal of clone group mapping was indirectly reached by mapping clone group topics. Experiments on four open source software show that the recall and precision are up to 0.99, thus the method can effectively and accurately reach the goal of clone group mapping.
Architecture of a morphological malware detectorUltraUploader
This document proposes an architecture for a morphological malware detector that combines syntactic and semantic analysis. It builds an efficient signature matching engine using tree automata techniques to represent control flow graphs (CFG). It also describes a graph rewriting engine to handle common malware mutations. The detector extracts CFGs from malware binaries to generate signatures, which are compiled into a minimal automaton database for efficient matching. Experiments showed promising results with a low false positive rate.
The document discusses analyzing Twitter data on the 2016 Chevrolet Camaro to detect consumer sentiment. It describes conducting word cloud analysis and sentiment classification using R programming. Key steps include collecting Twitter data on the Camaro, preprocessing the text data, generating a word cloud to identify frequent keywords, classifying tweets by emotion using naive Bayes classification, and classifying polarity as positive or negative sentiment. Graphs are produced to show the results of the emotion and polarity classification analyses.
An Approach to Software Testing of Machine Learning Applicationsbutest
This document describes an approach to software testing of machine learning applications. The approach involves analyzing the problem domain, algorithm, and implementation options to generate test cases. Two case studies applying this approach found bugs in implementations of MartiRank and SVM ranking algorithms. Analyzing the problem domain uncovered issues like how to handle missing/negative values. Analyzing algorithms revealed potential specification imprecisions. Analyzing options showed how inputs could be manipulated. The approach helped find bugs, create regression test cases, and validate implementations by comparing results across different versions and ML algorithms.
Source code recovery is one of the most tedious, and interesting, tasks in reverse engineering. During the course of this talk, the author will talk about a tool being developed (on and off) since last year that aims to generate auto-compilable source code from binaries. The tool is currently working though it needs a lot more work.
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTSijseajournal
This paper investigates using categorical features of bug reports, such as the component a bug belongs to, to build a classification model for bug assignment. The model is trained to predict the developer assigned to a bug report based on its categorical fields rather than textual content. An evaluation on three projects found that using both categorical features and textual content improved accuracy over using textual content alone. Using only categorical features provided some improvement over prior approaches but was less accurate than using both data types.
This document provides an overview of source code plagiarism detection. It discusses different types of source code plagiarism including textual similarity and functional similarity. It also describes various source code plagiarism detection algorithms such as text-based, token-based, parse tree-based, and metrics-based approaches. Detection techniques including lexical analysis and parse tree comparisons are explained. Popular source code plagiarism detection tools like JPlag, MOSS, and YAP are outlined. The document concludes that plagiarism detection in programming assignments is challenging and detection depends on the programming languages supported.
This document is a project report submitted by four students - Anil Shrestha, Bijay Sahani, Bimal Shrestha, and Deshbhakta Khanal - to the Department of Electronics and Computer Engineering at Tribhuvan University in partial fulfillment of the requirements for a Bachelor's degree in Computer Engineering. The report details the development of a web application called "Tweezer" to perform sentiment analysis on tweets in order to determine public sentiment towards various products, services, or personalities. Literature on previous work related to sentiment analysis, especially on social media data like tweets, is also reviewed in the report.
This document summarizes a machine learning approach called SuSi that automatically identifies sources and sinks in Android applications without using predefined lists. SuSi analyzes Android framework and pre-installed app code to generate a categorized list of sources and sinks. It achieves over 92% accuracy using support vector machines. SuSi can detect previously unknown sources and sinks in new Android versions and provides a list that can be directly used by static and dynamic analysis tools to identify privacy leaks.
Comparing static analysis in Visual Studio 2012 (Visual C++ 2012) and PVS-StudioPVS-Studio
After Visual Studio 2012 was released with a new static analysis unit included in all of the product's editions, a natural question arises: "Is PVS-Studio still relevant as a static analysis tool or can it be replaced by the tool integrated into VS?". A detailed answer with examples is given in this article. We have performed interface and usability comparison as well as a comparison of error diagnosis strength in real software code. The comparison was carried out on the source code of three open-source projects by id Software: Doom 3, Quake 3: Arena, Wolfenstein: Enemy Territory.
The first lab introduces MATLAB for signal analysis. It covers entering matrices, basic matrix operations, plotting signals, and saving and loading variables. Complex numbers and variables are also introduced. Plotting commands allow visualizing signals in both the time and frequency domains. Key MATLAB functions taught include plot, xlabel, ylabel, size, length, clear, save, and load.
The document presents a study that analyzes sentiment on Twitter using various classification algorithms. It compares the performance of Naive Bayes, Bayes Net, Discriminative Multinomial Naive Bayes, Sequential Minimal Optimization, Hyperpipes, and Random Forest algorithms on a Twitter sentiment dataset. The study finds that Discriminative Multinomial Naive Bayes and Sequential Minimal Optimization algorithms have the best performance with overall F-scores of 0.769 and 0.75, respectively. The study aims to determine the most accurate and efficient algorithms for Twitter sentiment classification.
The document describes the Like2Vec recommender system model. It transforms sparse user-item rating matrices into a graph representation, and then uses the DeepWalk algorithm to learn embeddings of nodes in the graph. These embeddings are trained with the Skip-Gram language model on random walks generated through the graph. Like2Vec is evaluated on the Netflix dataset and is shown to outperform baselines in Recall-at-N, which directly measures the quality of top recommendations compared to RMSE which does not. Recall-at-N is argued to be a superior evaluation metric for recommender systems.
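Recall-at-N, the metric the summary argues for, simply measures how many of a user's held-out relevant items appear in the model's top-N recommendations. A minimal sketch, with made-up item IDs:

```python
# Hedged sketch: Recall-at-N over a ranked recommendation list.
def recall_at_n(recommended, relevant, n):
    top_n = set(recommended[:n])        # the N highest-ranked items
    hits = len(top_n & set(relevant))   # held-out items we actually surfaced
    return hits / len(relevant) if relevant else 0.0

recommended = ["m1", "m7", "m3", "m9", "m2"]  # model's ranking
relevant = {"m3", "m2", "m8"}                 # user's held-out items
print(recall_at_n(recommended, relevant, n=3))  # only m3 is in the top 3 → 1/3
```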
This document summarizes methods of dynamic binary analysis and the Valgrind tool. It discusses how dynamic binary analysis tools like Valgrind use techniques like dynamic binary instrumentation and shadow memory to detect errors in programs as they execute. Valgrind's Memcheck tool tracks definedness of values at the bit level to detect errors like bad memory accesses and uses of uninitialized data with low false positive rates. The document also explains Valgrind's use of disassembly and re-synthesis to translate machine code to an intermediate representation for instrumentation.
Mining Fix Patterns for FindBugs Violations – Dongsun Kim
Several static analysis tools, such as Splint or FindBugs, have been proposed to the software development community to help detect security vulnerabilities or bad programming practices. However, the adoption of these tools is hindered by their high false positive rates. If the false positive rate is too high, developers may become desensitized to violation reports from these tools, causing concrete and severe bugs to be overlooked. Fortunately, some violations are actually addressed and resolved by developers. We claim that violations that are recurrently fixed are likely to be true positives, and that an automated approach can learn to repair similar unseen violations. However, there is a lack of a systematic way to investigate the distributions of existing and fixed violations in the wild, which could provide insights for prioritizing violations for developers, and of an effective way to mine code and fix patterns that could help developers easily understand the causes of violations and how to fix them.
In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal discrepancies between the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread, and categories, which can provide insights for prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that uses convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers accept and merge a majority (69/116) of the fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in the Defects4J benchmark for software testing and automated repair.
This document summarizes a research paper on sentiment analysis of tweets from Twitter. It discusses how tweets are collected and preprocessed, including removing punctuation and stop words. A Naive Bayes classifier is used to classify the preprocessed tweets as positive, negative, or neutral based on a lexicon dictionary. The results are evaluated to check accuracy. Future work proposed includes computing an overall sentiment score for topics and creating a web app for users to input keywords to analyze sentiment.
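As a rough sketch of the preprocessing and lexicon-based scoring described above: the stop-word list and lexicon here are toy stand-ins, and a simple majority vote replaces the paper's Naive Bayes classifier.

```python
import string

# Toy stand-ins; the paper uses a full stop-word list and lexicon dictionary.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "it"}
LEXICON = {"great": "positive", "love": "positive", "awful": "negative"}

def preprocess(tweet):
    # Strip punctuation, lowercase, and drop stop words.
    cleaned = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def classify(tweet):
    # Majority vote of lexicon polarities over the cleaned tokens.
    votes = [LEXICON[w] for w in preprocess(tweet) if w in LEXICON]
    pos, neg = votes.count("positive"), votes.count("negative")
    if pos > neg:
        return "positive"
    return "negative" if neg > pos else "neutral"

print(classify("I love the new phone, it is great!"))  # → positive
```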
IRJET - Automation in Python using Speech Recognition – IRJET Journal
This document describes a project to automate Python using speech recognition. The system allows a user to compile and execute Python scripts and Java code using voice commands. It works by monitoring a specified directory for file system changes. When a voice command is received with the name of a file, a batch file is dynamically created to execute the corresponding Python script or Java code. The output is then displayed in the command prompt window. The overall goal is to reduce the effort required to run programs by automating the process through speech commands.
This document discusses how to create a private version of CRAN packages behind a company firewall for internal use. It describes using the miniCRAN R package to selectively download and create a local mirror of specific CRAN packages and their dependencies. This allows controlling the packages available internally. It also discusses using MRAN and the RRT package to reproduce analyses and ensure scripts work across package and R version updates.
Parallel and Distributed Algorithms for Large Text Datasets Analysis – Illia Ovchynnikov
This document is a thesis submitted for the degree of Bachelor of Computer Science at Opole University of Technology. It explores using distributed systems for processing large text datasets in the context of near duplicate text detection. The study reviews big data concepts, popular analytics frameworks like Hadoop and Spark, and algorithms for determining document duplication levels. The results were applied to develop a prototype distributed anti-plagiarism system that showed improved performance over existing solutions for analyzing large collections of text data.
This document discusses using machine learning algorithms to predict employee attrition and understand factors that influence turnover. It evaluates different machine learning models on an employee turnover dataset to classify employees who are at risk of leaving. Logistic regression and random forest classifiers are applied and achieve accuracy rates of 78% and 98% respectively. The document also discusses preprocessing techniques and visualizing insights from the models to better understand employee turnover.
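As a minimal illustration of the logistic-regression side of this setup, here is a from-scratch classifier trained by gradient descent on invented toy features; the real study used a full employee-turnover dataset and library implementations:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    # Stochastic gradient descent on the logistic loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))          # predicted leave-probability
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0

# Toy features: [satisfaction score, overtime flag]; label 1 = employee left.
X = [[0.9, 0], [0.8, 0], [0.2, 1], [0.3, 1]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
print([predict(w, b, xi) for xi in X])  # should recover the training labels
```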
This document provides guidance on analyzing malicious software through a multi-stage process including automated analysis, behavioral analysis, and static and dynamic code analysis. It outlines tips and tools for each stage, with the goal of understanding a malware sample's capabilities and origins to strengthen an organization's security. Key stages involve examining static properties, behavioral interactions, and reversing code using tools like Ghidra and x64dbg.
This document discusses strategies for fuzzing complex file formats that contain multiple data types, encodings, and embedded files. It recommends separating fuzzing into modular components that focus on individual data types, encodings, and objects. This allows fuzzing ASCII, binary, images, fonts and other embedded objects independently before combining them back into a single test case in a manner similar to the complex file format. Taking this modular approach helps address issues like protocol awareness, code coverage, and handling multiple encoding levels within a single complex format.
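The modular strategy can be sketched as independent fuzzers for each component that are then recombined into one test case; the mutators and file layout below are toy assumptions, not a real format:

```python
import random

random.seed(7)  # deterministic mutations for the example

def fuzz_ascii(header):
    # Mutate one character of the ASCII portion, preserving its length.
    chars = list(header)
    i = random.randrange(len(chars))
    chars[i] = random.choice("ABC%\x00")
    return "".join(chars)

def fuzz_binary(payload):
    # Flip one byte of the binary portion.
    data = bytearray(payload)
    i = random.randrange(len(data))
    data[i] ^= 0xFF
    return bytes(data)

def build_test_case(header, payload):
    # Recombine the independently fuzzed parts into a single input,
    # mirroring the layout of the complex format.
    return fuzz_ascii(header).encode() + b"\n" + fuzz_binary(payload)

case = build_test_case("HDR v1.0", b"\x00\x01\x02\x03")
print(len(case))  # → 13 (8-byte header + separator + 4-byte payload)
```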
This document provides an overview of the RAPTOR protein structure prediction software developed by Bioinformatics Solutions Inc. It describes the installation process, user interface, and how to perform a sample run of the software. Key information covered includes installing required files and registering the software, navigating the menu system and output panels, configuring a run using different threading methods and databases, and interpreting the output results including alignments and structural predictions.
Data Science - Part II - Working with R & R Studio – Derek Kane
This tutorial is a basic primer for individuals who want to get started with predictive analytics by downloading the open-source (FREE) language R. I will go through some tips to get up and running and start building predictive models ASAP.
1) The document proposes developing a web-based course enrollment system using PHP, MySQL, JavaScript, HTML, and CSS.
2) It will allow students to enroll in courses online and provide reports to staff.
3) The system will be tested at the database level and interface level before full implementation. Maintenance of the system will be conducted regularly to ensure functionality.
IRJET - Pseudocode to Python Translation using Machine Learning – IRJET Journal
This document describes a system that translates pseudocode written in natural language into executable Python code. It uses recurrent neural networks with sequence-to-sequence translation to first convert the pseudocode into an intermediate XML representation, and then recursively parses that XML to produce the final Python code. The system aims to help students learn programming by allowing them to test algorithms written in pseudocode. It was implemented using Keras and trained on a dataset containing pseudocode statements and their Python translations.
1) The document outlines the tasks, tools, and topics explored by Vipul Divyanshu during a summer internship at India Innovation Labs, including data analytics on a medium-sized database and building a recommender engine.
2) Key tools explored include Mahout for machine learning algorithms, Hadoop for distributed processing, and Rush Analyzer (with KNIME) for data visualization and analytics.
3) Vipul implemented recommendation engines including user-based, item-based, and SlopeOne recommenders and evaluated performance using recommender evaluators.
This document provides a summary of the key points from the document "Consumer-Centric API Design".
1. The document discusses best practices for designing APIs that are consumer-centric and easy for developers to use. It emphasizes data abstraction, using common HTTP methods and patterns, and focusing on the needs of API consumers.
2. The author advocates designing APIs around core CRUD concepts to abstract complex business logic and data structures. Real-world examples show both good and bad approaches to data abstraction.
3. Additional chapters will cover topics like HTTP requests and responses, API versioning, authentication, permissions, documentation, and testing. The goal is for readers to understand how to build APIs that third-party developers will enjoy using.
Telemetry doesn't have to be scary; Ben Ford – Puppet
This document discusses Puppet telemetry and metrics collection. It introduces Dropsonde, an open source tool for collecting anonymous usage data from Puppet servers. Dropsonde plugins define metrics that are collected and sent to Google BigQuery for analysis. The data is aggregated and made public to help understand Puppet module usage and ecosystem trends, while keeping individual server data private. Users are encouraged to contribute plugins and use the public data for their own analysis and tools.
1. Reproducible research is the ability to reproduce an experiment or study by independently reproducing the entire process and obtaining the same results. This is a core principle of the scientific method.
2. Using R and RStudio aids reproducibility by encouraging researchers to structure projects systematically, automate analyses with code rather than manual steps, and connect analyses and results to written reports through tools like R Markdown.
3. Version control systems like git allow researchers to track changes, revert to previous versions of documents and code, and facilitate collaboration through online repositories like GitHub.
This document provides an introduction to the Python programming language. It discusses what Python is, why it was created, its basic features and uses. Python is an interpreted, object-oriented programming language that is designed to be readable. It can be used for tasks such as web development, scientific computing, and scripting. The document also covers Python basics like variables, data types, operators, and input/output functions. It provides examples of Python code and discusses best practices for writing and running Python programs.
Language-agnostic data analysis workflows and reproducible research – Andrew Lowe
This was a talk that I gave at CERN at the Inter-experimental Machine Learning (IML) Working Group Meeting in April 2017 about language-agnostic (or polyglot) analysis workflows. I show how it is possible to work in multiple languages and switch between them without leaving the workflow you started. Additionally, I demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (with a raw LaTeX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible. For experimental particle physics, ROOT has been the ubiquitous data analysis tool for the last 20 years, so I also talk about how to exchange data to and from ROOT.
This document describes a study that uses data mining techniques to detect malware. The study involved extracting opcode frequencies from 300 malware samples and 150 benign software samples. The opcode data was analyzed using the WEKA machine learning tool to generate rules for classifying software as malware or benign. Through a recursive process of removing the top predictive opcode and re-analyzing the data, the study identified a set of opcodes that predicted malware versus benign software with 96% accuracy. Testing the rules against noise added to the data showed the classification remained over 91% accurate, demonstrating the robustness of the approach. The document outlines the full methodology used in the study.
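A minimal sketch of the opcode-frequency features described above, with invented opcode traces standing in for real disassembly output; ranking opcodes by the gap between the two classes loosely mimics the study's "remove the top predictive opcode and re-analyze" loop:

```python
from collections import Counter

def opcode_frequencies(opcodes):
    # Relative frequency of each opcode in a sample's trace.
    total = len(opcodes)
    return {op: n / total for op, n in Counter(opcodes).items()}

# Invented traces; the study extracted these from 300 malware and
# 150 benign samples.
malware_trace = ["mov", "xor", "jmp", "xor", "call", "xor"]
benign_trace = ["mov", "push", "call", "mov", "ret", "add"]

mal = opcode_frequencies(malware_trace)
ben = opcode_frequencies(benign_trace)

# Rank opcodes by the frequency gap between the two classes.
gap = {op: abs(mal.get(op, 0) - ben.get(op, 0)) for op in set(mal) | set(ben)}
top = max(gap, key=gap.get)
print(top)  # → xor (it dominates the malware trace here)
```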
This document provides a summary of Jake VanderPlas' book "A Whirlwind Tour of Python". It introduces Python as a teaching and scripting language embraced by programmers, engineers, researchers, and data scientists. The book aims to provide a brief but comprehensive tour of the Python language for readers familiar with other languages, rather than starting from the basics. It covers Python's syntax, built-in types and data structures, functions, control flow, and other aspects to provide a foundation for exploring Python's data science ecosystem.
River Trail: A Path to Parallelism in JavaScript – Roberto Falconi
River Trail enables new web usages with a positive impact on performance, using high-level parallel patterns, bounds-checked array accesses, automatic heap management, and familiar JavaScript libraries.
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS – Roberto Falconi
This document discusses automating video streaming analysis using Microsoft Azure and Amazon Web Services. It explores using .NET Core, OpenCV, Face and Computer Vision APIs from Azure Cognitive Services, and Amazon Rekognition from AWS. Experiments were conducted using the Extended Cohn-Kanade Dataset to compare the APIs from Azure and AWS for tasks like face detection, recognition, and emotion analysis. The document concludes that Azure provided more accurate and user-friendly experiences compared to AWS.
Biometric Systems - Automate Video Streaming Analysis with Azure and AWS – Roberto Falconi
Perform near-real-time analysis on faces (emotions, gender, age, etc.), taken from a live video stream with Azure Cognitive Services and AWS Rekognition.
Black-Box attacks against Neural Networks - technical project presentation – Roberto Falconi
Project paper at: https://www.slideshare.net/RobertoFalconi4/blackbox-attacks-against-neural-networks-technical-project-report
Python implementation of a practical black-box attack against machine learning. This is the technical report for the Neural Networks course by Professor A. Uncini, PhD S. Scardapane and PhD D. Comminiello. The report covers Practical Black-Box Attacks against Machine Learning, a scientific paper by N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik and A. Swami. The work was done by Dr S. Clinciu and Dr R. Falconi while studying for the MSc in Engineering in Computer Science at Sapienza University of Rome.
The project's goal is to present the first demonstration that black-box attacks against deep neural network (DNN) classifiers are practical for real-world adversaries with no knowledge of the model. We assume the adversary has no information about the structure or parameters of the DNN, and that the defender does not have access to any large training dataset. The adversary can only observe the labels assigned by the DNN to chosen inputs, in a manner analogous to a cryptographic oracle.
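The core of the strategy, stripped to its essentials, is that the adversary labels synthetic inputs by querying the oracle and trains a substitute model on those labels alone. The oracle below is a toy stand-in rule, not a real DNN:

```python
def oracle(x):
    # Stand-in for the remote black-box classifier: the adversary cannot
    # see this rule, only the labels it returns.
    return 1 if x[0] > 0 else 0

def build_substitute_dataset(inputs):
    # The adversary observes only (input, label) pairs from the oracle;
    # a substitute model would then be trained on these pairs.
    return [(x, oracle(x)) for x in inputs]

# Synthetic inputs chosen by the adversary.
synthetic_inputs = [(-0.5, 0.1), (0.8, -0.3), (0.2, 0.9), (-0.9, -0.7)]
labeled = build_substitute_dataset(synthetic_inputs)
print([label for _, label in labeled])  # → [0, 1, 1, 0]
```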
SUOMI - UCD approach to build an IoT smart guide for spa – Roberto Falconi
SUOMI is a web application that guides spa and wellness center users through the best possible journey using IoT.
Developed by Federico Guidi, Roberto Falconi and Chiara Navarra for Pervasive Systems course by Prof. Ioannis Chatzigiannakis and later improved for Human-Computer Interaction course by Prof. Tiziana Catarci and Mobile Applications and Cloud Computing course by Prof. Roberto Beraldi of MSc in Engineering in Computer Science at Sapienza, University of Rome.
Kalypso: She who hides. Encryption and decryption web app. – Roberto Falconi
GitHub: https://github.com/RobertoFalconi/Kalypso
Web app that lets users encrypt messages and send them via any social network, IM, or QR code. Bachelor's degree in Control Systems and Computer Science Engineering, thesis in Information Security and Software Architecture.
GitHub: https://github.com/RobertoFalconi/GameRatingsPredictor
Brief description and useful links:
Hi everyone!
This is a project originally made by Roberto Falconi and Federico Guidi for the course "Quantitative Methods for Computer Science" and its teacher Luigi Freda, based at Sapienza - University of Rome.
The code is open source and written in Python 3.x, but it is also backward compatible with Python 2.x.
The project's goal is to classify each video game in the dataset by ESRB rating; to do this we used Logistic Regression, Random Forest, and k-NN.
GitHub repository with full code: https://github.com/RobertoFalconi/GameRatingsPredictor
This document provides an overview of a Star Wars video game developed using Three.js and WebGL. It discusses the following key points:
1. The game environments use Three.js and WebGL frameworks. Models include imported X-Wing and rocks, as well as hierarchical BB-8 droid.
2. Shadows and lighting are implemented using shadow maps, directional light, and Lambert materials. Textures are added to models.
3. The game includes a start screen, a rotating spherical world populated with randomly spawned rocks, and player control of BB-8 droid movement.
Visual Analytics: Traffic Collisions in Italy – Roberto Falconi
The document describes a visual analytics project analyzing traffic collision statistics in Italy. It uses an interactive dashboard with an Italy map, histograms, and sliders to filter data by year, region, and other factors. Principal component analysis is applied to reduce the dataset dimensions before representation. The dashboard allows users to gain insights through interactive exploration of quantitative relationships between variables like accident rates in different regions.
Visual Analytics: Traffic Collisions in Italy – Roberto Falconi
This document describes a visual analytics project analyzing traffic collision data in Italy from 2003 to 2013. It discusses the tasks, dataset, data preprocessing with PCA, and various visualizations and interactive elements in the dashboard, including an interactive map of Italy, histograms, and slider filters for year and PCA scaler. The project aims to provide insights into traffic collisions and identify relationships between different factors.
SUOMI - Web and mobile app for spa users, using STM32 IoT, Microsoft Azure Cl... – Roberto Falconi
SUOMI is a web and mobile app that guides spa and wellness center users through the best possible journey using IoT.
Developed by Federico Guidi, Roberto Falconi and Chiara Navarra for Pervasive Systems course by Prof. Ioannis Chatzigiannakis and later improved for Human-Computer Interaction course by Prof. Tiziana Catarci and Mobile Applications and Cloud Computing course by Prof. Roberto Beraldi of MSc in Engineering in Computer Science at Sapienza, University of Rome.
This document discusses the development of a Star Wars video game using Three.js and WebGL. It describes importing 3D models like the X-Wing and creating simple models. It also covers setting up environments, adding animations, lights, textures, and user interactions. Hierarchical models like BB-8 are created. The document provides details on the game logic including moving and spawning objects on a spherical world. It includes a user manual for playing the game.
Game Ratings Predictor - machine learning software to predict video games co... – Roberto Falconi
House Temperature Monitoring using AWS IoT And Raspberry Pi – Roberto Falconi
Brief description and useful links:
Developed a smart home automation project to measure your house's temperature and send it to your smartphone.
LinkedIn profile: https://www.linkedin.com/in/roberto-falconi
GitHub repository: https://github.com/RobertoFalconi/HouseTemperatureMonitoring
Hackster full description: https://www.hackster.io/Falkons/house-temperature-monitoring-using-aws-iot-and-raspberry-pi-3b6410
SlideShare presentation: https://www.slideshare.net/RobertoFalconi4/house-temperature-monitoring-using-aws-iot-and-raspberry-pi
YouTube video: https://www.youtube.com/watch?v=gQxOSbcN79s
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as the cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
Graspan: A Big Data System for Big Code Analysis – Aftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
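The dynamic transitive closure at the heart of these analyses can be illustrated naively on a tiny graph; Graspan itself uses a disk-based, edge-pair centric model to make this scale to program graphs with millions of edges:

```python
def transitive_closure(edges):
    # Naive fixpoint: keep joining edge pairs (a,b),(b,d) into (a,d)
    # until no new edges appear.
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

edges = {("x", "y"), ("y", "z")}
print(sorted(transitive_closure(edges)))  # → [('x','y'), ('x','z'), ('y','z')]
```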
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
Essentials of Automations: The Art of Triggers and Actions in FME – Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
8 Best Automated Android App Testing Tool and Framework in 2024.pdf – kalichargn70th171
Two major mobile operating systems dominate the market: Android and iOS. With Android leading, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
What is Augmented Reality Image Tracking – pavan998932
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
SMS API Integration in Saudi Arabia | Best SMS API Service – Yara Milbes
Discover the benefits and implementation of SMS API integration in the UAE and Middle East. This comprehensive guide covers the importance of SMS messaging APIs, the advantages of bulk SMS APIs, and real-world case studies. Learn how CEQUENS, a leader in communication solutions, can help your business enhance customer engagement and streamline operations with innovative CPaaS, reliable SMS APIs, and omnichannel solutions, including WhatsApp Business. Perfect for businesses seeking to optimize their communication strategies in the digital age.
Using Query Store in Azure PostgreSQL to Understand Query Performance – Grant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
E-commerce Application Development Company.pdf – Hornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
Microservice Teams - How the cloud changes the way we work – Sven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions – Peter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
E-commerce Development Services – Hornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf – Undress Baby
The best AI face swap tools combine technological prowess with artistic finesse, using cutting-edge algorithms to replace faces in images or videos with striking realism. Leveraging advanced deep learning techniques, they analyze facial features, lighting conditions, and expressions to execute flawless transformations, producing natural-looking results that blur the line between reality and illusion.
Web:- https://undressbaby.com/
Hand Rolled Applicative User Validation Code Kata – Philip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and ready exercise to reinforce your muscle-memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long-running systems, adding new cryptographic algorithms, certificate revocation, and hardening against DoS attacks.
www.robertodaguarcino.com
Summary
1. Abstract
2. Why I used Python
3. Program setup
3.1 Windows setup
3.2 Linux or macOS setup
4. Dataset preparation
4.1 Drebin dataset
4.2 Data pre-processing
4.3 From nominal data to numeric data
4.4 Integer Encoding
4.5 One-Hot Encoder
5. Overfitting and underfitting avoidance
5.1 Bias and Variance
5.2 Training set and test set
5.3 Cross-validation
6. The classification problem
6.1 Random Forest
6.2 SVM
6.3 Classify malwares by family
6.4 Binary classifiers
6.5 From binary to multiclass
6.6 One vs All and One vs One
6.7 Accuracy score
6.8 Confusion Matrix
6.9 Precision score
6.10 Recall score
6.11 F1 score
7. Final results
1. Abstract
The goal of this project is to understand the data contained in the DREBIN dataset and to define a classification problem for malware analysis (the target function).
I decided to perform a multiclass classification: I classified all malware samples by family.
First, I classified malware by family using a binary classifier for each family.
Second, I calculated the probability that a malware sample really belongs to a given family and finally, with One vs All and related methodologies, I came back to the multiclass problem to classify the malware by family.
The evaluation procedure and the results are described in the following report, which covers (but is not limited to) dataset preprocessing, integer encoding, one-hot encoding, overfitting avoidance through splitting the dataset into a training set and a test set and through cross-validation, the use of classifiers such as Random Forest and Support Vector Machine (also known as SVM, or Support Vector Classification, SVC, in Scikit-learn), the return from the binary to the multiclass problem using One vs All, and finally the calculation of the confusion matrix, accuracy, misclassification rate, precision, recall, F1 and other scores (each described and discussed).
2. Why I used Python
First, I want to explain why I used Python to develop this project.
Python is popular in machine learning for many interrelated reasons: it is simple, elegant, consistent, and math-like.
Python code has been described as readable pseudocode. It is easy to pick up thanks to its consistent syntax and the way it mirrors human language and its mathematical counterparts.
It is math-like in that some "objects" that are very much part of a mathematician's vocabulary are part of the language without you having to install or import them, and they resemble their mathematical counterparts. With carefully chosen variable and function names, the code can be read almost like math or English, also because you do not need to declare the type of a variable or to cast it manually.
The latter point (largely thanks to libraries such as Pandas, NumPy and Scikit-learn) is something one appreciates when implementing a machine learning algorithm whose core is essentially mathematical optimization.
3. Program setup
3.1 Windows setup
1. Download Visual Studio Code
https://code.visualstudio.com/Download
2. Install Python plugin for VS Code
https://marketplace.visualstudio.com/items?itemName=ms-python.python
3. Download and install Python 3.7.1 64 bit for Windows
https://www.python.org/downloads/release/python-371
(It is important to mark "Add Python 3.7 to PATH" and "Disable max path length" at
the end of the setup)
4. Open VS Code terminal (ctrl + ò)
5. In the terminal, type the "pip install pandas" command
6. Again, type the "pip install scikit-learn" command (the package name on PyPI is scikit-learn)
7. Finally, run "py <homeworkpath.py>" to execute the code and read the printed results described and discussed in the following chapters of this report.
3.2 Linux or macOS setup
As with the Windows setup, it is important to download the 64-bit version of Python (adding it to the PATH and disabling the max path length); then you can run the program in an IDE such as Visual Studio Code or directly in the terminal using the same commands, replacing the "pip" command with "sudo pip3" and "py" with "sudo python3".
4. Dataset preparation
4.1 Drebin dataset
Since the limited resources of mobile devices impede monitoring applications at run time, DREBIN performs a broad static analysis, gathering as many features of an application as possible. These features are embedded in a joint vector space, such that typical patterns indicative of malware can be automatically identified and used to explain the method's decisions.
In an evaluation with 123,453 applications and 5,560 malware samples, DREBIN outperforms several related approaches and detects 94% of the malware with few false alarms, and the explanations provided for each detection reveal relevant properties of the detected malware.
DREBIN performs a broad static analysis, gathering as many features from an application’s
code and manifest as possible. These features are organized in sets of strings (such as
permissions, API calls and network addresses) and embedded in a joint vector space.
As an example, an application sending premium SMS messages is cast to a specific region in
the vector space associated with the corresponding permissions, intents and API calls.
To foster research in the area of malware detection and to enable a comparison of different approaches, the authors make the malicious Android applications used in their work, as well as all extracted feature sets, available to other researchers in the DREBIN dataset.
4.2 Data pre-processing
It is possible to analyze malware samples and classify them into families with a machine learning program; in this case I chose to use Python, as explained above.
Data preprocessing involves preparing the data and dividing it into training and test sets.
Drebin's sha256_family.csv has been loaded as the dataset and used for data pre-processing and data analysis through several methodologies and techniques. The dataset has two fields: the sha256 string (the file name) and the malware's family. I keep only families with more than 20 elements, while all the other families (and their elements) are deleted, because families with fewer than 20 elements can be statistically irrelevant, or worse.
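The filtering step above can be sketched in pandas as follows; the column names ("sha256", "family") and the toy data are illustrative assumptions, not the real Drebin contents:

```python
import pandas as pd

# Toy stand-in for Drebin's sha256_family.csv (real column names may differ).
data = pd.DataFrame({
    "sha256": [f"hash{i}" for i in range(100)],
    "family": ["FakeInstaller"] * 60 + ["Plankton"] * 30 + ["Rare"] * 10,
})

# Keep only families with more than 20 samples; drop the rest.
counts = data["family"].value_counts()
kept = counts[counts > 20].index
data = data[data["family"].isin(kept)]

print(sorted(data["family"].unique()))  # ['FakeInstaller', 'Plankton']
```

On the real dataset one would read the CSV with pd.read_csv and apply the same value_counts filter.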
The features of the files were loaded into the design matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$ and the outputs into $\mathbf{y} \in \mathbb{R}^{N}$, with $y_i \in \{\text{Plankton}, \text{FakeInstaller}, \dots\}$:

$$\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1D} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{ND} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$$
4.3 From nominal data to numeric data
Nominal data (or categorical data) are variables that contain label values rather than numeric
values. The number of possible values is often limited to a fixed set. Categorical variables are
often called nominal.
Some examples include:
A “pet” variable with the values: “dog” and “cat“.
A “color” variable with the values: “red“, “green” and “blue“.
A “place” variable with the values: “first”, “second” and “third“.
Each value represents a different category.
Some categories may have a natural relationship to each other, such as a natural ordering.
The “place” variable above does have a natural ordering of values. This type of categorical
variable is called an ordinal variable.
What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly.
For example, a decision tree can be learned directly from categorical data with no data
transform required (this depends on the specific implementation).
Many machine learning algorithms cannot operate on label data directly. They require all
input variables and output variables to be numeric.
In general, this is mostly a constraint of the efficient implementation of machine learning
algorithms rather than hard limitations on the algorithms themselves.
This means that categorical data must be converted to a numerical form. If the categorical
variable is an output variable, you may also want to convert predictions by the model back
into a categorical form in order to present them or use them in some application.
How to Convert Categorical Data to Numerical Data?
This involves two steps: integer encoding and one-hot encoding.
4.4 Integer Encoding
As a first step, each unique category value is assigned an integer value.
For example, “red” is 1, “green” is 2, and “blue” is 3.
This is called a label encoding or an integer encoding and is easily reversible.
For some variables, this may be enough. The integer values have a natural ordered
relationship between each other and machine learning algorithms may be able to understand
and harness this relationship.
For example, ordinal variables like places would be a good example where a label/integer
encoding would be enough.
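A minimal sketch of integer encoding with scikit-learn's LabelEncoder (note that scikit-learn assigns the integers in alphabetical order of the categories, not in the order used in the running example above):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(list(encoder.classes_))  # ['blue', 'green', 'red'] (alphabetical)
print(list(encoded))           # [2, 1, 0, 1, 2]
# The encoding is easily reversible:
print(list(encoder.inverse_transform(encoded)))  # back to the labels
```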
4.5 One-Hot Encoder
For categorical variables where no such ordinal relationship exists, the integer encoding is not
enough.
In fact, using this encoding and allowing the model to assume a natural ordering between
categories may result in poor performance or unexpected results (predictions halfway
between categories).
In this case, a one-hot encoder can be applied to the integer representation. This is where the
integer encoded variable is removed and a new binary variable is added for each unique
integer value.
In the “color” variable example, there are 3 categories and therefore 3 binary variables are
needed. A “1” value is placed in the binary variable for the color and “0” values for the other
colors.
For example, for three different color elements:
red green blue
1 0 0
0 1 0
0 0 1
The binary variables are often called “dummy variables” in other fields, such as statistics.
In our case, I applied this last technique to all the features, including the families and the other categorical values inside the feature vectors.
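The table above can be reproduced with scikit-learn's OneHotEncoder, shown here as a minimal sketch (by default it returns a sparse matrix, hence the toarray() call; columns come out in alphabetical order):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of categorical data, one row per sample.
colors = np.array([["red"], ["green"], ["blue"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(colors).toarray()

print(encoder.categories_)  # [array(['blue', 'green', 'red'], ...)]
print(onehot)
# [[0. 0. 1.]   red
#  [0. 1. 0.]   green
#  [1. 0. 0.]]  blue
```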
5. Overfitting and underfitting avoidance
5.1 Bias and Variance
When evaluating a machine learning model, it is important to balance Bias and Variance.
High Bias refers to a scenario where the model is “underfitting” the dataset. This is bad
because the model is not presenting a very accurate or representative picture of the
relationship between inputs and predicted output and is often outputting high error.
High Variance represents the opposite scenario. In cases of high Variance, or "overfitting", the machine learning model fits the example dataset so closely that it often fails to generalize to future datasets. While this may seem like a good outcome, it is a cause for concern: the model works well for the existing data, but it is not known how well it will perform on other examples.
5.2 Training set and test set
Learning the parameters of a prediction function and testing it on the same data is a
methodological mistake: a model that would just repeat the labels of the samples that it has
just seen would have a perfect score but would fail to predict anything useful on yet-unseen
data. This situation is called overfitting. To avoid it, it is common practice when performing a
(supervised) machine learning experiment to hold out part of the available data as a test set
X_test, y_test. Note that the word “experiment” is not intended to denote academic use only,
because even in commercial settings machine learning usually starts out experimentally.
In scikit-learn a random split into training and test sets can be quickly computed with the
train_test_split helper function.
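A minimal sketch of holding out a test set with train_test_split (the data here is a toy example, not the Drebin matrix):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy design matrix: 10 samples, 2 features
y = np.array([0] * 5 + [1] * 5)

# Hold out 30% of the data as a test set; random_state gives reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```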
5.3 Cross-validation
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting
that must be manually set for an SVM, there is still a risk of overfitting on the test set because
the parameters can be tweaked until the estimator performs optimally. This way, knowledge
about the test set can “leak” into the model and evaluation metrics no longer report on
generalization performance. To solve this problem, yet another part of the dataset can be held
out as a so-called “validation set”: training proceeds on the training set, after which evaluation
is done on the validation set, and when the experiment seems to be successful, final
evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number
of samples which can be used for learning the model, and the results can depend on a
particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set
should still be held out for final evaluation, but the validation set is no longer needed when
doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets
(other approaches are described below, but generally follow the same principles). The
following procedure is followed for each of the k “folds”:
A model is trained using k − 1 of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set
to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values
computed in the loop. This approach can be computationally expensive but does not waste
too much data (as is the case when fixing an arbitrary validation set), which is a major
advantage in problems such as inverse inference where the number of samples is very small.
The simplest way to use cross-validation is to call the cross_val_score helper function on the
estimator and the dataset, but I have used also the cross_val_predict method in order to come
back from a binary to a multiclass problem.
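Both helpers can be sketched as follows on synthetic data (the dataset and classifier settings are illustrative assumptions, not the report's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(clf, X, y, cv=3)   # one accuracy score per fold
preds = cross_val_predict(clf, X, y, cv=3)  # out-of-fold prediction per sample

print(len(scores), preds.shape)  # 3 (200,)
```

cross_val_predict is what makes it possible to collect one prediction per sample, which is used later when moving from the binary to the multiclass problem.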
6. The classification problem
6.1 Random Forest
Random Forest is intrinsically suited for multiclass problems. It works well with a mixture of numerical and categorical features, and it is also fine when features are on various scales. Roughly speaking, with Random Forest you can use the data as they are: one-hot encoding of categorical features is not strictly required, and min-max or other scaling is unnecessary at the preprocessing step. Finally, for a classification problem Random Forest gives you the probability of belonging to a class.
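The class-probability output mentioned above comes from the predict_proba method; a minimal sketch on synthetic data (all settings here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_classes=3, n_informative=5,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])  # one probability per class
print(proba.shape)                # (1, 3)
print(proba.sum())                # each row sums to 1
```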
6.2 SVM
SVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
LinearSVC is an implementation of Support Vector Classification for the case of a linear kernel.
SVC implements the One vs One approach for multiclass classification. LinearSVC, on the other hand, implements the One vs Rest multiclass strategy, thus training one model per class.
Both the One vs One and One vs Rest strategies are discussed in the next chapters.
One-hot encoding initially leads us to multiple binary classifiers instead of a single multiclass classifier:
$$\mathbf{y} = \mathbf{w}_{\text{Plankton}} + \mathbf{w}_{\text{FakeInstaller}} + \cdots$$

where $\mathbf{w}_{\text{Plankton}} \in \{0,1\}$, $\mathbf{w}_{\text{FakeInstaller}} \in \{0,1\}$, and so on for every family.
6.3 Classify malwares by family
Running a binary classifier for each family returns the partial functions:
$f_{\text{Plankton}}, f_{\text{FakeInstaller}}, \dots$

With Scikit-learn's predict method we get, for each family $i \in \{\text{Plankton}, \text{FakeInstaller}, \dots\}$:

$$\tilde{\mathbf{y}}_i = (\tilde{y}_1, \dots, \tilde{y}_n), \qquad \tilde{y}_1, \dots, \tilde{y}_n \in \{0,1\}$$
6.4 Binary classifiers
Scikit-learn provides methods in the sklearn.metrics module such as accuracy_score, precision_score, recall_score and f1_score (discussed later). With these methods I print the misclassifications and the accuracy, precision, recall and F1 scores of every binary classifier for each family.
In our case, we have good results for each binary classifier (discussed in the last chapter), but that is not enough to tell whether the results are as good as they seem, because at this point we are considering each classifier in isolation, not all of them together, nor the probability that an element belongs to the predicted family rather than to the others.
When performing classification, you often want to predict not only the class, but also the
associated probability. This probability gives you some kind of confidence on the prediction.
However, not all classifiers provide well-calibrated probabilities, some being over-confident
while others being under-confident. Thus, a separate calibration of predicted probabilities is
often desirable as a postprocessing.
In the next chapters, I will go deeper into the various scores, analyzing them re-calculated using the predict_proba method instead of the predict method, normalizing the probabilities and applying the One vs All methodology.
6.5 From binary to multiclass
Some metrics are essentially defined for binary classification tasks. In these cases, by default
only the positive label is evaluated, assuming by default that the positive class is labelled 1
(though this may be configurable through the pos_label parameter).
In extending a binary metric to multiclass problems, the data is treated as a collection of binary
problems, one for each class. There are then several ways to average binary metric
calculations across the set of classes, each of which may be useful in some scenario. Where
available, Scikit-learn suggests selecting among these using the average parameter:
"macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, so macro-averaging will over-emphasize the typically low performance on an infrequent class;
"weighted" accounts for class imbalance by computing the average of binary metrics in which each class's score is weighted by its presence in the true data sample.
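A sketch of the average parameter on a small hand-made multiclass example (labels and predictions here are invented for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

for avg in ("macro", "weighted", "micro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = f1_score(y_true, y_pred, average=avg)
    print(f"{avg:9s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# With "micro" averaging, precision = recall = f1 = overall accuracy
# (7 correct predictions out of 10 here, i.e. 0.70).
```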
6.6 One vs All and One vs One
One vs All (also known as One vs Rest) strategy involves training a single classifier per class,
with the samples of that class as positive samples and all other samples as negatives. This
strategy requires the base classifiers to produce a real-valued confidence score for its
decision, rather than just a class label; discrete class labels alone can lead to ambiguities,
where multiple classes are predicted for a single sample. In the One vs One reduction, one
trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the
samples of a pair of classes from the original training set and must learn to distinguish these
two classes.
One vs All and One vs One methodologies bring us to the original problem of multiclass
classification.
Making decisions means applying all classifiers to an unseen sample x and predicting the class
k for which the corresponding classifier reports the highest confidence score:
$$\hat{y} = \underset{k \in \{1, \dots, K\}}{\arg\max}\, f_k(x)$$
Thanks to the predict_proba method of Scikit-learn we have, for each family $i \in \{\text{Plankton}, \text{FakeInstaller}, \dots\}$:

$$\hat{\mathbf{y}}_i = (\hat{y}_1, \dots, \hat{y}_n), \qquad \hat{y}_1, \dots, \hat{y}_n \in [0,1],$$

which represent this confidence score as a probability.
To normalize the partial results (so that $\sum_i \bar{\mathbf{y}}_i = 1$):

$$\bar{\mathbf{y}}_i = \frac{\hat{\mathbf{y}}_i}{\hat{\mathbf{y}}_{\text{Plankton}} + \hat{\mathbf{y}}_{\text{FakeInstaller}} + \cdots}, \qquad i \in \{\text{Plankton}, \text{FakeInstaller}, \dots\}$$

Finally, we can apply the majority rule with a 0.50 threshold on the confidence score: if $\bar{\mathbf{y}}_i > 0.50$, the sample is assigned to family $i$; otherwise, it counts as a misclassification.
Once the confidence score (or the misclassification) has been computed for every family, we can calculate scores to understand whether the classification is good enough. The next paragraphs discuss each score used, and the final chapter analyzes my code's results.
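The pipeline above (per-class confidence scores, normalization, arg-max decision) can be sketched with scikit-learn's OneVsRestClassifier; the base estimator and the synthetic data are illustrative assumptions, not the report's actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=150, n_classes=3, n_informative=5,
                           random_state=0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One confidence score per binary classifier, renormalized to sum to 1,
# then the class with the highest normalized score wins.
proba = ovr.predict_proba(X[:5])
normalized = proba / proba.sum(axis=1, keepdims=True)
y_hat = np.argmax(normalized, axis=1)

print(normalized.sum(axis=1))  # each row sums to 1
print(y_hat.shape)             # (5,)
```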
6.7 Accuracy score
The accuracy_score function computes the accuracy, either as the fraction (default) or as the count (normalize=False) of correct predictions. If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_\text{samples}$ is defined as

$$\mathrm{Accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$$
Where 1(𝑥) is the indicator function (function defined on a set X that indicates membership
of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for
all elements of X not in A).
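A minimal sketch of both modes of accuracy_score, on invented labels:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

print(accuracy_score(y_true, y_pred))                   # fraction: 0.8
print(accuracy_score(y_true, y_pred, normalize=False))  # count: 4
```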
Is accuracy score enough? No. Accuracy is not the be-all and end-all model metric to use when
selecting the best model. When performing classification, one often wants to predict not only
the class label, but also the associated probability. This probability gives confidence on the
prediction.
6.8 Confusion Matrix
The confusion_matrix function computes the confusion matrix to evaluate the accuracy of a classification.
By definition, a confusion matrix $C$ is such that $C_{i,j}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$.
Thus, in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
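The index convention above can be checked with a tiny binary example (labels invented for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

C = confusion_matrix(y_true, y_pred)
print(C)
# [[2 1]    C[0,0] = TN = 2, C[0,1] = FP = 1
#  [1 2]]   C[1,0] = FN = 1, C[1,1] = TP = 2
```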
6.9 Precision score
Precision is the probability that a (randomly selected) retrieved document is relevant.
The precision is intuitively the ability of the classifier not to label as positive a sample that is
negative. The best value is 1 and the worst value is 0.
$$\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}}$$
Immediately, you can see that Precision describes how precise your model is: of those predicted positive, how many are actually positive.
Precision is a good measure when the cost of a False Positive is high, for instance in email spam detection. There, a false positive means that a non-spam email (actual negative) has been identified as spam (predicted spam). The email user might lose important emails if the precision of the spam detection model is not high.
6.10 Recall score
Recall is the probability that a (randomly selected) relevant document is retrieved in a
search.
The recall is intuitively the ability of the classifier to find all the positive samples. The best
value is 1 and the worst value is 0.
$$\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$$
So, Recall calculates how many of the actual positives our model captures by labeling them as positive (True Positive).
By the same reasoning, Recall is the metric to use for selecting the best model when there is a high cost associated with a False Negative, for instance in fraud detection or sick-patient detection.
If a fraudulent transaction (actual positive) is predicted as non-fraudulent (predicted negative), the consequence can be very bad for the bank.
Similarly, in sick-patient detection: if a sick patient (actual positive) goes through the test and is predicted as not sick (predicted negative), the cost associated with the False Negative will be extremely high if the sickness is contagious.
6.11 F1 score
F1 score, also known as balanced F-score or F-measure, can be interpreted as a weighted
average of the precision and recall, where an F1 score reaches its best value at 1 and worst
value at 0. The relative contribution of precision and recall to the F1 score are equal. In the
multi-class and multi-label case, this is the average of the F1 score of each class with weighting
depending on the average parameter.
$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The F1 score is needed when you want to seek a balance between Precision and Recall. We have previously seen that accuracy can be largely driven by a large number of True Negatives, which in most business circumstances we do not focus on much, whereas False Negatives and False Positives usually carry business costs (tangible and intangible). Thus, the F1 score may be a better measure to use when we need a balance between Precision and Recall and there is an uneven class distribution (a large number of actual negatives).
7. Final results
Having discussed my program and the procedures I followed to solve the multiclass classification problem, this chapter reports the results and the conclusions I drew from them.
All the binary classifiers, applied to each family, have very good scores, including accuracy, balanced accuracy, misclassification rate, recall, precision and F1.
Here are the results for some families, using 3-fold cross-validation with Random Forest.
Family #0: GinMaster
accuracy: [0.98825372 0.99451411 0.99137255]
…
I am not going to print the results of the SVM classifier for each family, because they are not so important for our multiclass classification purposes. What follows are the multiclass classification results for both Random Forest and SVM, after some considerations.
We can then come back to the multiclass classification and analyze its results.
The script I used to test the classifier implemented cross-validation and many other techniques to avoid overfitting and to balance bias and variance. I was skeptical of the relatively high precision, recall and F1 scores recorded by the single binary classifiers, but looking through the script, I saw that the random seed for the cross-validation split had been set to a fixed value in order to generate reproducible results. I changed the random seed and, sure enough, the performance of my model decreased. Therefore, I must have made the classic mistake of overfitting on my training set for the given cross-validation random seed. Despite all the precautions against overfitting, I had optimized my model for one specific split of the data. In order to get a better indicator of the performance of the model, I ran 10 tests with different random seeds and averaged the performance metrics. The final results for my model are summarized below:
Random Forest:
Average Accuracy Precision Recall F1
weighted 0.77 0.97 0.77 0.86
micro 0.76 0.76 0.76 0.76
macro 0.78 0.94 0.66 0.75
Linear SVC:
Average Accuracy Precision Recall F1
weighted 0.82 0.87 0.84 0.86
micro 0.81 0.81 0.81 0.81
macro 0.81 0.86 0.84 0.85
The results are good given the nature of the classification. We can immediately notice that SVM obtains noticeably better results than Random Forest.
Due to the small size of the available data, even a minor change such as altering the random
seed when doing a train-test split can have significant effects on the performance of the
algorithm which must be accounted for by testing over many subsets and calculating the
average performance.
This project highlights the importance and value of machine learning in classifying complex, intricate and artificial objects such as malware, an activity that is almost always too difficult (if not simply impossible) for humans to carry out unaided.
Roberto Falconi