DEEP NEURAL ENCODING MODELS
OF THE HUMAN VISUAL CORTEX
TO PREDICT FMRI RESPONSES
TO NATURAL VISUAL SCENES
University of Milano-Bicocca
Department of Informatics, Systems and Communication
Master's Degree in Data Science
Academic Year 2022-2023
Master’s Degree Thesis by:
Giorgio Carbone
ID 811974
Supervisor:
Prof. Simone Bianco
Co-supervisor:
Prof. Paolo Napoletano
[ INTRODUCTION ]
/ Visual Encoding in Neuroscience
Visual Neural Encoding
▪ humans understand complex visual stimuli
▪ visual information is represented as neural
activations in the visual cortex
▪ neural activations (or responses) ➨ patterns of
measurable electrical activity
Visual Encoding Models [1]
▪ mimic the human visual system
▪ explain natural visual stimulus ⬌ neural activations
relationship
▪ structured system to test biological hypotheses about
the visual pathways
[Figure: diagram with labels Stimulus, Visual cortex, Brain scan, Neural responses, Visual Encoding model]
[1] Naselaris et al. (2011). Encoding and decoding in fMRI. NeuroImage 56.
/ The Algonauts Project 2023: Challenge and Dataset
Algonauts 2023 Challenge goals:
▪ promote artificial intelligence and computational
neuroscience interdisciplinary research
▪ develop cutting-edge image-fMRI encoding models of
the visual brain
Natural Scenes Dataset (NSD) [2]:
▪ fMRI responses to ~73,000 images from MS COCO
▪ each of the eight subjects was shown ~9,000-10,000
training images and ~150-400 test images
▪ measured the fMRI activity in the 39,548 voxels of the
visual cortex
▪ betas ➨ single value response estimates
▪ functional Region of Interest (ROI) label for each voxel
[2] Allen et al. (2021). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience 25.
[Figure: stimulus images and fMRI betas for the 39,548 voxels (color scale from -2 to 2), left (LH) and right (RH) hemispheres]
[Figure: functional classes of regions of interest (ROIs) on LH and RH: early retinotopic, body-selective, face-selective, place-selective and word-selective ROIs]
/ Evaluation Metric
Median Noise-Normalized Squared Correlation (MNNSC) ➨ voxel-wise accuracy metric across 𝑁 voxels
▪ squared Pearson’s correlation between predicted and true responses for each voxel
▪ normalized by the Noise Ceiling of that voxel
▪ median taken over the 𝑁 voxels
[Figure: evaluation pipeline for subjects S1 to S8: test images ➨ fMRI scan ➨ measured responses (voxel-wise true betas vectors); test images ➨ visual encoder ➨ predicted responses (voxel-wise predictions vectors); both ➨ squared Pearson’s correlation ➨ median ➨ MNNSC]
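The metric on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the challenge's official scoring code: the function names `pearson` and `mnnsc` and the toy numbers are hypothetical, and noise ceilings are assumed to be expressed as fractions in (0, 1].

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mnnsc(predicted, measured, noise_ceiling):
    """Median over voxels of (squared correlation / noise ceiling).

    predicted, measured: one list of responses per voxel, same image order.
    noise_ceiling: one value per voxel, assumed to lie in (0, 1].
    """
    scores = [
        pearson(p, g) ** 2 / nc
        for p, g, nc in zip(predicted, measured, noise_ceiling)
    ]
    return median(scores)


# Toy data (made-up numbers): 3 voxels, 4 test images.
measured = [[0.1, 0.4, 0.3, 0.9], [1.0, 0.2, 0.5, 0.1], [0.3, 0.3, 0.8, 0.6]]
predicted = [[0.2, 0.5, 0.2, 0.8], [0.9, 0.3, 0.4, 0.2], [0.6, 0.1, 0.2, 0.7]]
noise_ceiling = [0.8, 0.9, 0.5]
score = mnnsc(predicted, measured, noise_ceiling)
```

Dividing by the noise ceiling rescales each voxel's score by the share of variance that any model could explain, so noisy voxels are not penalised for noise alone.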
/ Research Goals
Main goal:
develop subject-specific image-fMRI encoders of
the visual cortices of the eight subjects
▪ based on deep neural networks and transfer
learning
▪ characterised by high stimulus compatibility
▪ mappability
▪ high predictivity across the entire visual cortex
Research Questions:
1. how well can variations in neural activity be
predicted given the stimulus that evoked them?
2. how relevant are the visual features extracted
from pre-trained DNNs for the neural encoding
task?
3. is there a similarity between the visual processing
in the DNNs and the visual cortex?
[ METHODS ]
/ A Two-Step Voxel-Based Deep Visual Encoder
1. Non-Linear Feature Mapping
using a pre-trained DNN backbone
▪ input image ➨ low-level, mid-level and high-level visual features
▪ output layer(s) selection
▪ flattened and concatenated feature maps ➨ visual features
2. Linear Activity Mapping
▪ dimensionality reduction
▪ voxel-based linear regression ➨ predicted response for voxel 𝒗
[Figure: two-step encoder pipeline from input image to predicted response; the backbone is illustrated with example class outputs (Bird, Cow, Face, Ship)]
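The two steps can be sketched end to end with NumPy. This is only an illustration under loud assumptions: random numbers stand in for the DNN activations of step 1, the voxel responses are synthetic, PCA is computed with a plain SVD rather than an incremental variant, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for step 1: in a real pipeline these would be flattened
# activations of a pre-trained DNN layer; here they are random numbers.
n_train, n_test, n_features, n_voxels = 200, 50, 500, 30
feats_train = rng.normal(size=(n_train, n_features))
feats_test = rng.normal(size=(n_test, n_features))

# Synthetic voxel responses: linear in the features plus noise.
true_weights = rng.normal(size=(n_features, n_voxels)) / np.sqrt(n_features)
resp_train = feats_train @ true_weights + 0.1 * rng.normal(size=(n_train, n_voxels))


def fit_pca(x, n_components):
    """Return the feature mean and the top principal axes of x (via SVD)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components].T


# Step 2a: dimensionality reduction, fitted on the training features only.
mean, axes = fit_pca(feats_train, n_components=20)
z_train = (feats_train - mean) @ axes
z_test = (feats_test - mean) @ axes

# Step 2b: one linear regression per voxel, solved jointly by least squares.
design_train = np.hstack([z_train, np.ones((n_train, 1))])  # intercept column
coefs, *_ = np.linalg.lstsq(design_train, resp_train, rcond=None)
design_test = np.hstack([z_test, np.ones((n_test, 1))])
pred_test = design_test @ coefs  # shape: (n_test, n_voxels)
```

Each column of `coefs` is an independent model for one voxel, which is what makes the mapping voxel-based even though all voxels are solved in a single call.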
/ Activity Mapping Methods
Goal
▪ find the activity mapping method that maximises the
10-fold cross-validation accuracy on Subject 1
▪ feature mapping ➨ pre-trained AlexNet
Dimensionality reduction ➨ 300-component Incremental PCA
Linear regression
▪ Ordinary Least Squares (OLS)
▪ Ridge Regression with optimization of the α parameter
Non-linear regression
▪ Regression Trees (RTs)
▪ Support Vector Regression (SVR)
Regression Model            MNNSC on Subject 1
Linear: OLS Regression      0.35
Linear: Ridge Regression    0.45
Non-linear: RTs             0.15
Non-linear: SVR             0.08
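As an illustration of ridge regression with optimization of the α parameter, the sketch below selects α by 10-fold cross-validation using the closed-form ridge solution. It is a minimal sketch, not the thesis code: it scores folds by mean squared error rather than MNNSC, and the candidate α values, data and function names are assumptions.

```python
import numpy as np


def ridge_fit(x, y, alpha):
    """Closed-form ridge weights for centred x and y (no intercept term)."""
    n_features = x.shape[1]
    return np.linalg.solve(x.T @ x + alpha * np.eye(n_features), x.T @ y)


def cv_select_alpha(x, y, alphas, n_folds=10, seed=0):
    """Pick the alpha with the lowest mean validation MSE over the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    mean_errors = []
    for alpha in alphas:
        fold_errors = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Centre with training statistics only to avoid leakage.
            x_mean, y_mean = x[train].mean(axis=0), y[train].mean(axis=0)
            w = ridge_fit(x[train] - x_mean, y[train] - y_mean, alpha)
            pred = (x[val] - x_mean) @ w + y_mean
            fold_errors.append(np.mean((pred - y[val]) ** 2))
        mean_errors.append(float(np.mean(fold_errors)))
    return alphas[int(np.argmin(mean_errors))], mean_errors


# Toy data: 120 images, 40 reduced features, 5 voxels (synthetic).
rng = np.random.default_rng(1)
x = rng.normal(size=(120, 40))
y = x @ rng.normal(size=(40, 5)) + rng.normal(size=(120, 5))
alphas = [0.01, 1.0, 100.0, 10000.0]
best_alpha, mean_errors = cv_select_alpha(x, y, alphas)
```

With α = 0 the same formula reduces to OLS, so the comparison in the table amounts to asking whether any shrinkage helps on held-out images.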
/ Feature Mapping Methods
Goals:
1. find the overall and ROI-wise best-performing
feature mapping methods on Subject 1
2. compare pre-trained DNNs with:
▪ different architectures and depths
▪ different training parameters (learning tasks,
learning methods and datasets)
▪ output layer(s) at varying depths
3. test a fused features approach
Pre-trained Convolutional Neural Networks (CNNs)
Architecture              Learning task/method    Dataset
AlexNet                   Image classification    ImageNet-1K
ZFNet                     Image classification    ImageNet-1K
VGG-16/19                 Image classification    ImageNet-1K
EfficientNet-B2           Image classification    ImageNet-1K
ResNet-50                 Image classification    ImageNet-1K
ResNet-50 (DINOv1) [3]    Self-supervised         ImageNet-1K
RetinaNet                 Object detection        MS COCO

Pre-trained Vision Transformers (ViTs)
Architecture              Learning task/method    Dataset
ViT-S/14 (DINOv2)         Self-supervised         LVD-142M
ViT-B/14 (DINOv2)         Self-supervised         LVD-142M
ViT-L/14 (DINOv2)         Self-supervised         LVD-142M
ViT-B/16-GPT2             Image captioning        MS COCO
[3] M. Caron et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. IEEE/CVF ICCV.
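The fused features approach (goal 3) can be illustrated as flattening the outputs of several selected layers and concatenating them per image. The sketch below is an assumption-laden toy: the layer names, shapes and random activations are hypothetical stand-ins for real backbone outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 8

# Stand-ins for activations from three selected layers (random numbers).
# Convolutional outputs follow the (images, channels, height, width) layout.
layer_outputs = {
    "layer_a": rng.normal(size=(n_images, 16, 14, 14)),
    "layer_b": rng.normal(size=(n_images, 32, 7, 7)),
    "layer_c": rng.normal(size=(n_images, 64)),  # already one vector per image
}


def fuse_features(outputs):
    """Flatten each layer's output per image and concatenate along features."""
    flat = [act.reshape(act.shape[0], -1) for act in outputs.values()]
    return np.concatenate(flat, axis=1)


fused = fuse_features(layer_outputs)  # shape: (n_images, 3136 + 1568 + 64)
```

The fused matrix grows quickly with the number of layers, which is one reason a dimensionality reduction step follows the feature mapping.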
[Figure: similarity between DNNs and the human visual cortex, with features extracted from output layers at increasing depths, for ResNet-50 and ViT-L/14 (DINOv2).
(a) contribution rate (%) to the highest voxel-wise accuracy for each ROI class, by layer index
(b) ROI-wise accuracy (MNNSC) for each ROI class, by layer index
ROI classes: early visual, body-selective, face-selective, place-selective and word-selective ROIs]
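One plausible reading of the contribution rate in panel (a) is, for each ROI class, the percentage of its voxels that reach their highest accuracy with a given layer. The sketch below implements that reading only; the definition, the accuracy values and the ROI labels are assumptions made for illustration.

```python
import numpy as np

# accuracy[l, v]: voxel-wise accuracy of the encoder built on layer l
# for voxel v (made-up values, 3 layers and 5 voxels).
accuracy = np.array([
    [0.6, 0.5, 0.1, 0.2, 0.3],
    [0.4, 0.7, 0.3, 0.1, 0.4],
    [0.2, 0.1, 0.5, 0.6, 0.5],
])
roi_class = np.array(["early", "early", "face", "face", "face"])

best_layer = accuracy.argmax(axis=0)  # index of the winning layer per voxel


def contribution_rate(best_layer, roi_class, n_layers):
    """Percentage of voxels in each ROI class won by each layer."""
    rates = {}
    for roi in np.unique(roi_class):
        winners = best_layer[roi_class == roi]
        counts = np.bincount(winners, minlength=n_layers)
        rates[str(roi)] = 100.0 * counts / counts.sum()
    return rates


rates = contribution_rate(best_layer, roi_class, accuracy.shape[0])
```

Under this reading, a profile that peaks at shallow layers for early visual ROIs and at deeper layers for category-selective ROIs would indicate a hierarchy shared by the network and the cortex.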
/ A Mixed and ROI-wise Encoding Model
Proposed architecture: a mixed (multi-layer and multi-network) subject-specific encoding model
[Figure: for each ROI 𝑗 = 1, …, 𝐽: image pre-processing ➨ pre-trained feature extractor ➨ output layer(s) selection ➨ visual features ➨ PCA (𝒏 comp.) ➨ voxel-based Ridge (α) regression ➨ ROI 𝑗 voxels responses; the ROI voxel masks combine the per-ROI responses into the all-voxels responses]
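The ROI-wise assembly can be sketched as one encoder per ROI whose predictions are written into the full response matrix through that ROI's voxel mask. In this illustration the per-ROI encoders are dummy functions returning constants, and the ROI labels, sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_voxels = 6, 10

# Hypothetical ROI label for each voxel of one subject.
roi_labels = np.array(
    ["V1", "V1", "V1", "FFA", "FFA", "PPA", "PPA", "PPA", "PPA", "V1"]
)


def make_dummy_encoder(value):
    """Stand-in for a trained per-ROI encoder (backbone + PCA + ridge)."""
    def predict(images, n_roi_voxels):
        return np.full((len(images), n_roi_voxels), value, dtype=float)
    return predict


# Each ROI may use a different backbone and layer; here only the constant differs.
encoders = {
    "V1": make_dummy_encoder(1.0),
    "FFA": make_dummy_encoder(2.0),
    "PPA": make_dummy_encoder(3.0),
}

images = rng.normal(size=(n_images, 3, 8, 8))
all_voxels = np.full((n_images, n_voxels), np.nan)
for roi, predict in encoders.items():
    mask = roi_labels == roi  # boolean voxel mask for this ROI
    all_voxels[:, mask] = predict(images, int(mask.sum()))
```

Starting from a NaN-filled matrix makes it easy to check that every voxel was covered by some ROI model.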
[ RESULTS ]
/ Best ROI-wise Encoder: All Subjects Cross-Validation
MNNSC:
All subjects    0.62
Subj 1          0.64
Subj 2          0.66
Subj 3          0.65
Subj 4          0.68
Subj 5          0.65
Subj 6          0.54
Subj 7          0.57
Subj 8          0.60
[Figure: (a) distributions of the voxel-wise accuracies (NNSC) across all subjects, conditioned on the hemisphere (left, right) and the ROI class (early retinotopic visual, body-selective, face-selective, place-selective and word-selective ROIs)
(b) voxel-wise prediction accuracies (NNSC) across all subjects visualized on a common cortical surface
Legend: ResNet-50 (DINOv1), RetinaNet, ViT-L/14 (DINOv2), ViT-B/16-GPT2]
/ Best ROI-wise Encoder: Test Set Performance
[Figure: MNNSC (0 to 100) per subject (1 to 8) over all voxels and over the early visual, body-selective, face-selective, place-selective and word-selective ROIs; one plot for the proposed ROI-wise encoding model, one for the baseline encoding model (AlexNet-based, not ROI-wise)]
Overall and subject-specific MNNSC:
Encoder                                     All Subjects  Subj. 1  Subj. 2  Subj. 3  Subj. 4  Subj. 5  Subj. 6  Subj. 7  Subj. 8
Proposed ROI-wise Encoder                   0.52          0.53     0.51     0.56     0.54     0.50     0.59     0.40     0.57
Baseline Encoder                            0.41          0.39     0.39     0.47     0.42     0.37     0.44     0.32     0.47
State-of-the-art Pure Neural Encoder [4]    0.64
[4] Adeli et al. (2023). Predicting brain activity using transformers. Preprint at bioRxiv.
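The gap between the proposed and baseline encoders can be read directly off the table. The short sketch below only redoes that arithmetic on the reported values; the variable names are illustrative.

```python
# Test-set MNNSC per subject (Subj. 1 to Subj. 8), copied from the table.
proposed = [0.53, 0.51, 0.56, 0.54, 0.50, 0.59, 0.40, 0.57]
baseline = [0.39, 0.39, 0.47, 0.42, 0.37, 0.44, 0.32, 0.47]

# Absolute gain of the proposed encoder for each subject.
gains = [round(p - b, 2) for p, b in zip(proposed, baseline)]

# Overall (all subjects) gain, absolute and relative to the baseline.
overall_gain = round(0.52 - 0.41, 2)
relative_gain_pct = round(100 * (0.52 - 0.41) / 0.41, 1)
```

The proposed encoder improves on the baseline for every subject, with the smallest gain on Subject 7 and the largest on Subject 6.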
[ CONCLUSIONS AND FUTURE WORK ]
/ Conclusions and Future Work
Conclusions:
▪ effectiveness of transfer learning-based image-
fMRI encoding
▪ generalizability of visual features extracted
from computer vision models, particularly
those pre-trained in a self-supervised
manner
▪ functional alignment between DNNs and the
human visual cortex
▪ a mixed (multi-layer and multi-network) and
independent encoding of each ROI
guarantees mappability and high predictivity
over the entire visual cortex
Future Work:
• apply a voxel-wise encoding optimization strategy
to voxels that exhibit poor performance
• implement auxiliary input data: physiological data,
eye tracking data and COCO annotations
• develop a pure neural encoder trained in an end-
to-end way for the image-fMRI task
{ Thank you for your attention }
/ Bibliography
[1] Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56.
[2] Allen, E. J., St-Yves, G., Wu, Y., et al. (2021). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25.
[3] Caron, M., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. IEEE/CVF ICCV.
[4] Adeli, H., Minni, S., & Kriegeskorte, N. (2023). Predicting brain activity using transformers. Preprint at bioRxiv.
[5] Gifford, A. T., Lahner, B., Saba-Sadiya, S., Vilas, M. G., Lascelles, A., Oliva, A., Kay, K., Roig, G., & Cichy, R. M. (2023). The Algonauts Project 2023 Challenge: How the Human Brain Makes Sense of Natural Scenes. Preprint at arXiv.
[6] Yamins, D. L. K., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3).
[7] Dwivedi, K., Bonner, M. F., Cichy, R. M., & Roig, G. (2021). Unveiling functions of the visual cortex using task-specific deep neural networks. PLOS Computational Biology, 17(8).
/ Natural Scenes Dataset: Details
Distribution of the Algonauts Project 2023 Challenge dataset
images in the full training and test sets across the eight subjects,
and in the training and validation subsets defined in the 10-fold
cross-validation phase:
Number of vertices composing the cortical challenge surface and
the cortical fsaverage surface, considering the right and left
hemispheres of the eight subjects:
Lists of the ROIs belonging to each functional ROI class:
• Early retinotopic visual regions: V1v, V1d, V2v, V2d, V3v, V3d, hV4 (V4).
• Body-selective regions: EBA, FBA-1, FBA-2, mTL-bodies.
• Face-selective regions: OFA, FFA-1, FFA-2, mTL-faces, aTL-faces.
• Place-selective regions: OPA, PPA, RSC.
• Word-selective regions: OWFA, VWFA-1, VWFA-2, mfs-words, mTL-words.
/ Evaluation Metric: Details
(1) Median Noise-Normalized Squared Correlation (MNNSC) over N voxels.
(2) Voxel-wise Pearson’s correlation between the voxel-wise vector of the predicted (P) responses for the voxel v and the ground truth (G) voxel-wise vector (t is the index of the stimulus image).
(3) Noise Ceiling for voxel v from the corresponding noise ceiling signal-to-noise ratio (considering the responses to m images, of which A responses are averaged over three trials, B over two trials, and C over one trial).
(4) Noise Ceiling (NC) and (5) noise ceiling signal-to-noise ratio (ncsnr) formal definitions.
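A hedged reconstruction of these definitions follows, written from the verbal descriptions on this slide and from the metric as it is commonly stated for the Algonauts 2023 challenge [5] and the NSD [2]. The symbols R_v, NC_v, the bars for means over images, σ_signal, σ_noise and n (number of averaged trials) are notational assumptions, A + B + C = m is assumed, and some sources scale NC by a factor of 100.

```latex
\begin{align}
\mathrm{MNNSC} &= \operatorname{median}\left\{ \frac{R_1^2}{NC_1}, \dots, \frac{R_N^2}{NC_N} \right\} \tag{1} \\
R_v &= \frac{\sum_{t} \left(G_{v,t} - \bar{G}_v\right)\left(P_{v,t} - \bar{P}_v\right)}
            {\sqrt{\sum_{t} \left(G_{v,t} - \bar{G}_v\right)^2 \, \sum_{t} \left(P_{v,t} - \bar{P}_v\right)^2}} \tag{2} \\
NC_v &= \frac{\mathrm{ncsnr}_v^2}{\mathrm{ncsnr}_v^2 + \dfrac{A/3 + B/2 + C/1}{A + B + C}} \tag{3} \\
NC &= \frac{\sigma_{\mathrm{signal}}^2}{\sigma_{\mathrm{signal}}^2 + \sigma_{\mathrm{noise}}^2 / n} \tag{4} \\
\mathrm{ncsnr} &= \frac{\sigma_{\mathrm{signal}}}{\sigma_{\mathrm{noise}}} \tag{5}
\end{align}
```

Substituting (5) into (4) gives NC = ncsnr² / (ncsnr² + 1/n), and (3) is the same expression with 1/n replaced by its average over images seen three, two and one times.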
/ Non-Linear Activity Mapping Methods: Details
Supervised Regression Trees (RTs) learning approach, tested and
chosen parameters:
• Split criterion: Mean Squared Error (MSE)
• maximum depth of the tree [5, 10, 15] ➨ 5
• minimum number of samples required to split an internal
node [2,3] ➨ 2
• minimum number of samples needed to define a node as
a leaf node [1,2] ➨ 1
• number of features considered when searching for the
best split ➨ number of PCA components
Support Vector Regression (SVR) learning approach, chosen
parameters:
• tube width ε (maximum distance between predicted and
true values within which a penalty on the loss function is
not generated) ➨ 0.1
• regularization parameter C (high values lead to more
accurate fits on the training data but increase the
sensitivity of the model to noise) ➨ 1.0
• kernel ➨ Radial Basis Function (RBF)
• Gaussian kernel
• gamma parameter (how far the influence of
individual training examples can reach) ➨ 1 /
number of PCA components
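The tested and chosen settings above can be written down as plain configuration dictionaries. The key names below follow scikit-learn's `DecisionTreeRegressor` and `SVR` parameter names, which is an assumption: the slide does not say which library was used. The value of 300 PCA components is taken from the activity mapping comparison.

```python
import math

n_pca_components = 300  # from the activity mapping comparison slide

# Regression Trees: tested values per parameter (scikit-learn-style names).
regression_tree_grid = {
    "criterion": ["squared_error"],        # MSE split criterion
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 3],
    "min_samples_leaf": [1, 2],
    "max_features": [n_pca_components],    # all features considered per split
}
regression_tree_chosen = {
    "criterion": "squared_error",
    "max_depth": 5,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "max_features": n_pca_components,
}

# Support Vector Regression: chosen values.
svr_chosen = {
    "epsilon": 0.1,                    # tube width
    "C": 1.0,                          # regularization parameter
    "kernel": "rbf",                   # Gaussian kernel
    "gamma": 1.0 / n_pca_components,   # 1 / number of PCA components
}

# Number of tree configurations implied by the grid.
n_tree_configs = math.prod(len(v) for v in regression_tree_grid.values())
```

Every chosen tree setting is one of the tested values, and the grid implies 12 candidate configurations.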
/ Feature Mapping Methods: Details
Summary of the properties of the pre-trained models used as
feature extractors:
Summary of the different sets of image pre-processing steps
applied to image inputs:
/ Comparing Fused Feature and Single Layer Approaches
[Figure: panels (a) to (f), with axes labelled Single Layer and Fused Features]
Comparison of the voxel-wise prediction accuracy for Subject 1, between an encoding model based on fused feature mapping (ViT-S/14
(DINOv2) 5+6+7), and encoding models using a single feature layer approach (ViT-S/14 (DINOv2) with output layers 5, 6 or 7):
• randomized permutation test to determine a minimum threshold MNNSC value significantly different from zero ➨ 0.19 (p < 0.001)
• (a, b, c): NNSC (abscissae) of the single feature models and NNSC (ordinates) of the 5+6+7 fused feature encoding model
• (d, e, f): distributions of voxel-wise differences between the accuracy of the fused feature model and the single feature layer models
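A randomized permutation test of this kind can be sketched as follows: the pairing between images and measured responses is shuffled many times, the score is recomputed each time, and the threshold is the quantile of the resulting null distribution at the chosen significance level. This is a minimal sketch under assumptions: it uses the squared correlation without noise normalization, pools all voxels into one null distribution, and runs on random toy data, so it does not reproduce the 0.19 threshold reported above.

```python
import numpy as np


def squared_corr(pred, true):
    """Squared Pearson correlation per voxel; inputs are (images, voxels)."""
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    r = (p * t).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return r ** 2


def permutation_threshold(pred, true, n_permutations=1000, p_value=0.001, seed=0):
    """Score below which a fraction (1 - p_value) of null scores fall."""
    rng = np.random.default_rng(seed)
    null_scores = []
    for _ in range(n_permutations):
        # Shuffling the image order breaks the image/response pairing.
        shuffled = true[rng.permutation(len(true))]
        null_scores.append(squared_corr(pred, shuffled))
    return float(np.quantile(np.concatenate(null_scores), 1.0 - p_value))


# Toy data: 100 images, 20 voxels, predictions unrelated to responses.
rng = np.random.default_rng(42)
pred = rng.normal(size=(100, 20))
true = rng.normal(size=(100, 20))
threshold = permutation_threshold(pred, true, n_permutations=200)
```

Voxels whose accuracy exceeds the threshold are unlikely to have reached it by chance, which is why the comparison is restricted to them.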
/ Comparing CNNs Pre-Trained with Different Training Tasks and Learning Methods
Comparison of the voxel-wise prediction accuracy for Subject 1 between the two best configurations of the ROI-wise mixed encoding
models based respectively on the:
• (a) pre-trained ResNet-50 (self-supervised DINOv1, ImageNet-1K) and ResNet-50 (image classification, ImageNet-1K) models
• (b) pre-trained RetinaNet (object detection, MS COCO) and ResNet-50 (image classification, ImageNet-1K) models
• (c) pre-trained RetinaNet (object detection, MS COCO) and ResNet-50 (self-supervised DINOv1, ImageNet-1K) models
[Figure: panels (a), (b) and (c)]
/ Best ROI-wise Encoder: Cross-Validation Details
Overall and functional ROI class-specific 10-fold cross-validation
accuracies (MNNSC) considering the voxels of all subjects:
ROI-specific 10-fold cross-validation accuracies (MNNSC)
considering the voxels of all subjects:
/ Best ROI-wise and Baseline Models: Cross-Validation
Proposed ROI-wise and mixed encoding model
Baseline encoding model
/ Best Subject 1 and 2 Encoders: Cross-Validation Details
S1
S2
/ Best Subject 3 and 4 Encoders: Cross-Validation Details
S3
S4
/ Best Subject 5 and 6 Encoders: Cross-Validation Details
S5
S6
/ Best Subject 7 and 8 Encoders: Cross-Validation Details
S7
S8
/ DNNs/Visual Cortex Similarity: AlexNet - ZFNet
AlexNet ZFNet
/ DNNs/Visual Cortex Similarity: VGG-19
VGG-19 VGG-19 (BN)
/ DNNs/Visual Cortex Similarity: ResNet-50
ResNet-50 (image classification) ResNet-50 (DINOv1)
/ DNNs/Visual Cortex Similarity: RetinaNet – EfficientNet-B2
RetinaNet EfficientNet-B2
/ DNNs/Visual Cortex Similarity: ViT-S/14 - ViT-B/14 (DINOv2)
ViT-S/14 (DINOv2) ViT-B/14 (DINOv2)
/ DNNs/Visual Cortex: ViT-L/14 (DINOv2) - ViT-B/16-GPT2
ViT-L/14 (DINOv2)
ViT-B/16-GPT2