M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Brambilla Marco

SCUOLA DI INGEGNERIA
INDUSTRIALE E DELL’INFORMAZIONE
Thesis Proposals
2024
Marco Brambilla
Data Science Lab
marco.brambilla@polimi.it

TOC
1. Proposals
2. References and pointers

Explainable AI
The final aim of the Explainable Artificial Intelligence (XAI) research field can be summarised as:
“Developing inherently explainable systems and explainability techniques that faithfully explicit the behaviour of complex machine learning models tailoring their explanation in an understandable way for humans.”

Gamified Data Collection for NLP Explainability Tasks
Development of a gamified platform to collect structured human knowledge for
multiple, different NLP tasks.
[Diagram boxes: NLP Task Selection; Gamified Activity, Task #1; Gamified Activity, Task #N; Data Structuring; Data Storing]

IMAGE ANALYSIS AND EXPLAINABILITY
► Development of NLP Techniques for Semantic Clustering of Words to be used as
labels in the context of explainability of image analysis.
► Development of Gamification and Machine Learning Techniques for Debugging and
Improvement of Image Classification Algorithms.
► Feasibility of crowdsourcing approaches for image classification in categories that
are very similar → double problem: human task complexity (learning cost, type and
quantity of learning), and ML task complexity (much more expensive training).
► How to define training for humans to ensure explainability of objects from very
similar classes?
► Feasibility of crowdsourcing approaches for tagging actions (on videos?)
► Automatic generation of explanations from crowdsourced tagging of relevance
heatmaps of image features for classification.

► Study of single-class and cross-class explainability labels to see whether the same labels on
different classes are generated by the same part of the network.
► Study of techniques that, given the extracted features (blackbox), classify
explainability labels.
► Classification of labels obtained from the crowd, starting from labels produced for
explainability, and generating classifier results.
► Crowdsourcing techniques for a priori image explanation, composing sets of
descriptive features of objects/concepts. Comparative study on image explainability
techniques.
► Extension of the work "A Flexible Metric-Based Approach to Assess Neural Network
Interpretability in Image Classification"

► Testing and comparing different Grad-CAM variants and saliency map methods beyond
the basic one.
► Testing the method using more complex datasets (e.g., PASCAL) and similar-class
datasets to assess explanations (e.g., shape vs. background).
► Validation of the final ranking of the models through human-in-the-loop approaches
(e.g., showing participants the explanations produced by different models for the same
input and having them rank them) to compare it with the obtained scores.
► Study of the level of detail necessary for a classifier to correctly classify various
objects starting from the segmentation of a class (e.g., parachute).
► When training the model on the segmentation of a concept, is that enough to learn
the concept, or is extra detail needed? (e.g., a "Soccer ball" shape alone may not
be enough, but with color it could work).
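One way to compare a human ordering of models with the scores obtained from a metric, as in the human-in-the-loop validation bullet above, is a rank correlation. The sketch below computes Spearman's rho by hand; the model names, scores, and human ranking are invented for illustration, and the choice of Spearman's rho is an assumption, not something the proposal prescribes.

```python
def ranks(values, descending=True):
    """Return 1-based ranks for a list of values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of equal length."""
    n = len(rank_a)
    squared_diff = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * squared_diff / (n * (n * n - 1))


# Hypothetical interpretability scores produced by a metric (higher is better).
metric_scores = {"model_a": 0.81, "model_b": 0.64, "model_c": 0.72, "model_d": 0.55}
# Hypothetical ordering given by human participants, best first.
human_ranking = ["model_a", "model_c", "model_d", "model_b"]

models = list(metric_scores)
metric_rank = ranks([metric_scores[m] for m in models])
human_rank = [human_ranking.index(m) + 1 for m in models]

rho = spearman_rho(metric_rank, human_rank)
print(f"Spearman rho between metric and human rankings: {rho:.2f}")
```

A value close to 1 would suggest that the metric-based ranking agrees with human judgment; with real data, ties and multiple annotators would need additional handling.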

TEXT ANALYSIS EXPLAINABILITY
► Explainability of ML models (deep, LLM, …) on text processing and NLP.
► Linguistic and behavioural methods

Large Language Models - LLMs
► Design and use of LLMs
– Exploration of different LLMs, experiments and comparison
► Model refinement / verticalization
– Legal
– Tech
– Security

LLM and Deep Learning for Security
Development of a Multimedia Pipeline for Data Extraction and Annotation to Support Investigations: An Integrated Approach with Apache NiFi and Streamlit
Problem:
In security investigations, multimedia material such as audio, video, and images may contain crucial information. However,
extracting such information in a structured and efficient manner poses a significant challenge.
Proposed Solution:
Create an automated pipeline for transforming multimedia material into annotations useful for investigative purposes. This
pipeline will integrate technologies like neural networks for speaker identification, automatic translation, summarization, entity
extraction, metadata extraction, and similar tasks. It will use Apache NiFi for managing data flows and Streamlit for the user
interface, allowing operators to view and use the generated annotations.
Technologies:
Kubernetes and Docker for scalability and maintainability. DevOps techniques for configuring and managing infrastructure.
Pipeline Construction: Use of Apache NiFi for managing data flows and integration with Streamlit for annotation visualization.
Final Use Case:
In the end, it will be possible to demonstrate how these extractions can create alerts that are useful for investigative authorities.

LLM and Deep Learning for Security
Using Large Language Models to Guide Investigative Decisions: Prioritizing Actions in a Sea of Options
Problem:
Operators in the field of security investigations often face a wide range of operational choices. Some of these options, such as accessing
specialized databases, can be costly and time-consuming. At the same time, not all actions have the same likelihood of leading to useful
results. The challenge is, therefore, to determine which operations to undertake to maximize effectiveness and reduce costs.
Proposed Solution:
The idea is to use a large language model trained to assess the various investigative options available and estimate their likelihood of
success. This way, the operator will be guided towards actions that are more likely to be fruitful, avoiding unnecessary expenses and
efforts.
Added Value:
Operational Efficiency: Saving time and resources by focusing on options with high probabilities of success.
Decision Support: Providing operators with a system that helps them make more informed decisions quickly and securely.
Workflow Optimization: The possibility of integrating the model with existing platforms, making the decision-making process smoother
and more integrated.
Possible Steps:
Testing existing open-source models in zero-shot or multi-shot settings, and evaluating further training of language models for this purpose.
Evaluation of the system with real or simulated use cases.
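The prioritization idea above can be illustrated with a small sketch. It assumes each investigative option already has an estimated success probability (in the proposal this would come from a language model; here the numbers are hard-coded placeholders) and a cost in arbitrary units. Ranking by probability per unit cost and selecting greedily within a budget is one possible heuristic chosen for illustration, not the method defined by the proposal; the `Option` class, the `prioritize` function, and all option names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    success_probability: float  # placeholder for a model-produced estimate
    cost: float                 # arbitrary cost units (time, money, effort)


def prioritize(options, budget):
    """Greedy selection: best probability-per-cost first, while budget allows."""
    ranked = sorted(
        options,
        key=lambda o: o.success_probability / o.cost,
        reverse=True,
    )
    selected, remaining = [], budget
    for option in ranked:
        if option.cost <= remaining:
            selected.append(option)
            remaining -= option.cost
    return selected


# Invented example values, for illustration only.
options = [
    Option("access_specialized_database", 0.60, 8.0),
    Option("query_open_sources", 0.30, 1.0),
    Option("interview_contact", 0.45, 3.0),
    Option("request_phone_records", 0.50, 5.0),
]

plan = prioritize(options, budget=9.0)
for option in plan:
    print(option.name, option.success_probability, option.cost)
```

In this toy example the costly database access has the highest raw probability but the lowest probability per unit cost, so it is left out once the budget is consumed by cheaper actions.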

Generative approaches for Security
Large-Scale Simulation of Realistic Data for Testing National Security Analysis Tools in the Octostar Environment
Objective:
Develop an advanced simulation model to generate a realistic dataset representing the dynamics of the daily lives of 10 million people, to
be used as a testbed for the Octostar platform in the context of national security.
Methodology:
Collecting Open Datasets: Using open and anonymized datasets to model various aspects of urban life (e.g., traffic, economic
transactions, communications).
Model Creation: Using neural networks to generate different types of data, such as daily movements (e.g., home-work, school, shopping),
use of private vehicles or public transportation, banking transactions (e.g., withdrawals, online purchases), communications (e.g., calls,
messages, emails).
Computational Optimization: Methods such as parallelism and distributed computing to address the creation of 100-1000 billion records.
Model Validation: Comparison with real or simulated data from accredited sources to ensure accuracy.
Integration with Octostar: Importing generated data into the Octostar platform for testing and demonstrations.
Computational Requirements:
Generating large-scale data, estimating a total of 100-1000 billion records. Implementing efficient algorithms to minimize the time and
computational resources required.
Octostar will provide the required computational resources.
Scientific and Practical Value:
Provides a realistic dataset for testing national security algorithms and tools. Offers the opportunity to experiment with advanced
simulation methods and neural networks to generate realistic human behaviors.
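To get a feel for the scale stated above, the following back-of-the-envelope sketch works out what 100-1000 billion records over 10 million people implies per person. The population and record totals come from the proposal; the bytes-per-record figure and the number of parallel workers are assumptions made only for illustration.

```python
POPULATION = 10_000_000                       # from the proposal
RECORD_TOTALS = (100 * 10**9, 1000 * 10**9)   # 100-1000 billion, from the proposal

ASSUMED_BYTES_PER_RECORD = 100   # assumption: compact record encoding
ASSUMED_WORKERS = 1_000          # assumption: number of parallel generators


def scale_estimate(total_records):
    """Derive per-person, storage, and per-worker figures for a record total."""
    return {
        "records_per_person": total_records // POPULATION,
        "terabytes": total_records * ASSUMED_BYTES_PER_RECORD / 10**12,
        "people_per_worker": POPULATION // ASSUMED_WORKERS,
        "records_per_worker": total_records // ASSUMED_WORKERS,
    }


estimates = [scale_estimate(total) for total in RECORD_TOTALS]
for total, estimate in zip(RECORD_TOTALS, estimates):
    print(f"{total:,} records -> {estimate}")
```

Under these assumptions each simulated person accounts for 10,000 to 100,000 records, and the raw dataset occupies roughly 10 to 100 TB, which is consistent with the proposal's emphasis on parallel and distributed generation.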

Other Topics
► Data science analysis
► Network analysis
► Robotic Process Automation (RPA)
► …
► (see second slide deck too)

Pointers
https://marco-brambilla.com/blog/
For examples of past theses: POLITESI WEBSITE (search by advisor)
Big data and data science
https://marco-brambilla.com/2022/11/04/exploring-the-bi-verse-a-trip-across-the-digital-and-physical-ecospheres/
Explainability
https://marco-brambilla.com/2022/07/11/the-role-of-human-knowledge-in-explainable-ai/
https://marco-brambilla.com/2022/06/01/exp-crowd-gamified-crowdsourcing-for-ai-explainability/
