1. Deep Neural Encoding Models of the Human Visual Cortex to Predict fMRI Responses to Natural Visual Scenes
University of Milano-Bicocca
Department of Informatics, Systems and Communication
Master's Degree in Data Science
Academic Year 2022-2023
Master’s Degree Thesis by:
Giorgio Carbone
ID 811974
Supervisor:
Prof. Simone Bianco
Co-supervisor:
Prof. Paolo Napoletano
3. / Visual Encoding in Neuroscience
Visual Neural Encoding
▪ humans understand complex visual stimuli
▪ visual information is represented as neural
activations in the visual cortex
▪ neural activations (or responses) ➨ patterns of
measurable electrical activity
Visual Encoding Models [1]
▪ mimic the human visual system
▪ explain the relationship between natural visual stimuli and neural activations
▪ structured system to test biological hypotheses about the visual pathways
[Figure: a visual encoding model predicts, for a given stimulus, the neural responses that a brain scan measures in the visual cortex]
[1] Naselaris et al. (2011). Encoding and decoding in fMRI. NeuroImage 56.
4. / The Algonauts Project 2023: Challenge and Dataset
Algonauts 2023 Challenge goals:
▪ promote artificial intelligence and computational
neuroscience interdisciplinary research
▪ develop cutting-edge image-fMRI encoding models of
the visual brain
Natural Scenes Dataset [2]:
▪ fMRI responses to ~73,000 images from MS COCO
▪ each of the eight subjects was shown ~9,000-10,000 training images and ~150-400 test images
▪ measured the fMRI activity in the 39,548 voxels of the
visual cortex
▪ betas ➨ single value response estimates
▪ functional Region of Interest (ROI) label for each voxel
[2] Allen et al. (2021). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience 25.
[Figure: example stimulus images and the corresponding fMRI betas for the 39,548 voxels, shown on the left (LH) and right (RH) hemispheres with a colour scale from -2 to 2]
Functional classes of regions of interest (ROIs): early retinotopic, body-selective, face-selective, place-selective and word-selective ROIs (LH and RH).
6. / Research Goals
Main goal:
develop subject-specific image-fMRI encoders of
the visual cortices of the eight subjects
▪ based on deep neural networks and transfer
learning
▪ characterised by high stimulus compatibility
▪ mappability
▪ high predictivity across the entire visual cortex
Research Questions:
1. how well can variations in neural activity be
predicted given the stimulus that evoked them?
2. how relevant are the visual features extracted
from pre-trained DNNs for the neural encoding
task?
3. is there a similarity between the visual processing
in the DNNs and the visual cortex?
8. / A Two-Step Voxel-Based Deep Visual Encoder
1. Non-Linear Feature Mapping using a pre-trained DNN backbone
[Figure: an input image is passed through a pre-trained DNN; low-, mid- and high-level visual features are extracted from the selected output layer(s), then flattened and concatenated into a single feature vector]
9. / A Two-Step Voxel-Based Deep Visual Encoder
1. Non-Linear Feature Mapping using a pre-trained DNN backbone
2. Linear Activity Mapping via dimensionality reduction and voxel-based linear regression
[Figure: the flattened and concatenated feature maps from the selected output layer(s) undergo dimensionality reduction and are mapped, through a voxel-based linear regression, to the predicted response for each voxel v]
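As a concrete sketch of the two-step pipeline, the following Python fragment uses random matrices as hypothetical stand-ins for the DNN activations and the fMRI betas, reduces the features with PCA, and fits a closed-form voxel-based ridge regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the thesis the features are flattened and
# concatenated activations of a pre-trained DNN (e.g. AlexNet) and the
# targets are the measured fMRI betas; random matrices keep the sketch
# self-contained.
n_images, n_features, n_voxels, n_comp = 200, 1024, 50, 100
X = rng.standard_normal((n_images, n_features))   # visual features
Y = rng.standard_normal((n_images, n_voxels))     # fMRI betas

# Step 1b: dimensionality reduction via PCA (SVD on centred features;
# the thesis uses a 300-component Incremental PCA).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:n_comp].T                            # reduced features

# Step 2: voxel-based ridge regression in closed form; solving against the
# multi-column target fits an independent weight vector per voxel.
alpha = 1.0
W = np.linalg.solve(Z.T @ Z + alpha * np.eye(n_comp), Z.T @ Y)
pred = Z @ W                      # predicted response per image and voxel
print(pred.shape)                 # (200, 50)
```

The closed-form solve is just for illustration; any ridge implementation with a multi-output target fits the same voxel-wise mapping.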
10. / Activity Mapping Methods
Goal
▪ find the activity mapping method that maximises the
10-fold cross-validation accuracy on Subject 1
▪ feature mapping ➨ pre-trained AlexNet
▪ dimensionality reduction ➨ 300-component Incremental PCA
Linear regression
▪ Ordinary Least Squares (OLS)
▪ Ridge Regression with optimization of the α parameter
Non-linear regression
▪ Regression Trees (RTs)
▪ Support Vector Regression (SVR)
Regression model ➨ MNNSC on Subject 1:
▪ Linear: OLS Regression ➨ 0.35
▪ Linear: Ridge Regression ➨ 0.45
▪ Non-linear: RTs ➨ 0.15
▪ Non-linear: SVR ➨ 0.08
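Ridge regression with an optimised α performed best; a minimal sketch of such an α search on synthetic data (the grid values and data are illustrative, not those used in the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data standing in for PCA-reduced features (X) and one voxel's betas (y);
# a true linear relation plus noise makes the ridge fit meaningful.
X = rng.standard_normal((300, 50))
w_true = rng.standard_normal(50)
y = X @ w_true + rng.standard_normal(300)

X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution for one voxel.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Grid search over alpha, keeping the value with the best held-out
# correlation (the thesis optimises alpha; this grid is illustrative).
best_alpha, best_r = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, alpha)
    r = np.corrcoef(X_va @ w, y_va)[0, 1]
    if r > best_r:
        best_alpha, best_r = alpha, r
print(best_alpha, round(best_r, 3))
```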
11. / Feature Mapping Methods
Goals:
1. find the overall and ROI-wise best-performing
feature mapping methods on Subject 1
2. compare pre-trained DNNs with:
▪ different architectures and depths
▪ different training parameters (learning tasks,
learning methods and datasets)
▪ output layer(s) at varying depths
3. test a fused features approach
Pre-trained Convolutional Neural Networks (CNNs):
Architecture | Learning task/method | Dataset
AlexNet | Image classification | ImageNet-1K
ZFNet | Image classification | ImageNet-1K
VGG-16/19 | Image classification | ImageNet-1K
EfficientNet-B2 | Image classification | ImageNet-1K
ResNet-50 | Image classification | ImageNet-1K
ResNet-50 (DINOv1) [3] | Self-supervised | ImageNet-1K
RetinaNet | Object detection | MS COCO

Pre-trained Vision Transformers (ViTs):
Architecture | Learning task/method | Dataset
ViT-S/14 (DINOv2) | Self-supervised | LVD-142M
ViT-B/14 (DINOv2) | Self-supervised | LVD-142M
ViT-L/14 (DINOv2) | Self-supervised | LVD-142M
ViT-B/16-GPT2 | Image captioning | MS COCO
[3] M. Caron et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. IEEE/CVF ICCV.
12. Similarity between DNNs and the human visual cortex: feature extraction from output layers at increasing depths of ResNet-50 and ViT-L/14 (DINOv2).
(a) contribution rate (%) to the highest voxel-wise accuracy for each ROI class, as a function of layer index
(b) ROI-wise accuracy (MNNSC) for each ROI class, as a function of layer index
[Figure: panels (a) and (b) for each model; ROI classes: early visual, body-selective, face-selective, place-selective and word-selective ROIs]
13. / A Mixed and ROI-wise Encoding Model
Proposed architecture: a mixed (multi-layer and multi-network) subject-specific encoding model.
[Figure: for each ROI j = 1, …, J, a dedicated pipeline (image pre-processing ➨ pre-trained feature extractor ➨ output layer(s) selection ➨ n-component PCA ➨ voxel-based Ridge(α) regression) predicts the responses of the voxels selected by that ROI's mask; the per-ROI predictions are assembled into the responses of all voxels]
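The ROI-wise design can be sketched as follows; the ROI masks, feature matrices and α value are hypothetical placeholders for the dataset's functional ROI labels and the per-ROI backbone/layer choices of the mixed model:

```python
import numpy as np

rng = np.random.default_rng(2)

n_images, n_voxels = 120, 40
betas = rng.standard_normal((n_images, n_voxels))  # all-voxel responses

# Hypothetical ROI masks partitioning the voxels (in the thesis these come
# from the dataset's functional ROI labels).
roi_masks = {
    "early_visual": np.arange(n_voxels) < 20,
    "face_selective": np.arange(n_voxels) >= 20,
}

# Hypothetical per-ROI reduced features: the mixed model lets each ROI use
# a different backbone / output-layer combination, so each ROI gets its own
# feature matrix.
roi_features = {
    "early_visual": rng.standard_normal((n_images, 30)),    # e.g. an early layer
    "face_selective": rng.standard_normal((n_images, 30)),  # e.g. a deep layer
}

def ridge_fit_predict(X, Y, alpha=1.0):
    # Closed-form voxel-based ridge regression, one weight vector per voxel.
    W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)
    return X @ W

# Fit one independent encoder per ROI and write its predictions into the
# corresponding voxel columns of the full-cortex prediction matrix.
pred = np.zeros_like(betas)
for roi, mask in roi_masks.items():
    pred[:, mask] = ridge_fit_predict(roi_features[roi], betas[:, mask])
print(pred.shape)  # (120, 40)
```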
18. / Conclusions and Future Work
Conclusions:
▪ effectiveness of transfer learning-based image-fMRI encoding
▪ generalizability of visual features extracted
from computer vision models, particularly
those pre-trained in a self-supervised
manner
▪ functional alignment between DNNs and the
human visual cortex
▪ a mixed (multi-layer and multi-network) and
independent encoding of each ROI
guarantees mappability and high predictivity
over the entire visual cortex
Future Work:
• apply a voxel-wise encoding optimization strategy
to voxels that exhibit poor performance
• implement auxiliary input data: physiological data,
eye tracking data and COCO annotations
• develop a pure neural encoder trained in an end-to-end way for the image-fMRI task
19. { Thank you for your attention }
20. / Bibliography
[1] Naselaris T., Kay K.N., Nishimoto S., Gallant J.L. (2011). Encoding and decoding in fMRI. NeuroImage 56.
[2] Allen E.J., St-Yves G., Wu Y., et al. (2021). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience.
[3] Caron M., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. IEEE/CVF ICCV.
[4] Adeli H., Minni S., Kriegeskorte N. (2023). Predicting brain activity using transformers. Preprint at bioRxiv.
[5] Gifford A.T., Lahner B., Saba-Sadiya S., et al. (2023). The Algonauts Project 2023 Challenge: How the Human Brain Makes Sense of Natural Scenes. Preprint at arXiv.
[6] Yamins D.L.K., DiCarlo J.J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3).
[7] Dwivedi K., Bonner M.F., Cichy R.M., Roig G. (2021). Unveiling functions of the visual cortex using task-specific deep neural networks. PLOS Computational Biology, 17(8).
21. / Natural Scenes Dataset: Details
Distribution of the Algonauts Project 2023 Challenge dataset
images in the full training and test sets across the eight subjects,
and in the training and validation subsets defined in the 10-fold
cross-validation phase:
Number of vertices composing the cortical challenge surface and
the cortical fsaverage surface, considering the right and left
hemispheres of the eight subjects:
Lists of the ROIs belonging to each functional ROI class:
• Early retinotopic visual regions: V1v, V1d, V2v, V2d, V3v, V3d, hV4
(V4).
• Body-selective regions: EBA, FBA-1, FBA-2, mTL-bodies.
• Face-selective regions: OFA, FFA-1, FFA-2, mTL-faces, aTL-faces.
• Place-selective regions: OPA, PPA, RSC.
• Word-selective regions: OWFA, VWFA-1, VWFA-2, mfs-words, mTL-words.
22. / Evaluation Metric: Details
(1) Median Noise-Normalized Squared Correlation (MNNSC) over N voxels.
(2) Voxel-wise Pearson correlation between the vector of predicted (P) responses for voxel v and the corresponding ground-truth (G) vector (t is the index of the stimulus image).
(3) Noise Ceiling for voxel v, derived from the corresponding noise ceiling signal-to-noise ratio (considering the responses to m images, of which A responses are averaged over three trials, B over two trials, and C over one trial).
(4) Noise Ceiling (NC) and (5) noise ceiling signal-to-noise ratio (ncsnr) formal definitions.
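The formulas on this slide were rendered as images in the original deck; the LaTeX below is a reconstruction from the slide's textual descriptions and the noise-ceiling definitions of Allen et al. [2], and should be checked against the thesis:

```latex
% (1) Median noise-normalized squared correlation over N voxels
\mathrm{MNNSC} = \operatorname*{median}_{v = 1, \dots, N} \frac{r_v^2}{\mathrm{NC}_v}

% (2) Pearson correlation between predicted (P) and ground-truth (G)
%     responses of voxel v across stimulus images t
r_v = \frac{\sum_t \bigl(P_{v,t} - \bar{P}_v\bigr)\bigl(G_{v,t} - \bar{G}_v\bigr)}
           {\sqrt{\sum_t \bigl(P_{v,t} - \bar{P}_v\bigr)^2}\,
            \sqrt{\sum_t \bigl(G_{v,t} - \bar{G}_v\bigr)^2}}

% (3)-(4) Noise ceiling from ncsnr, for m images with A three-trial,
%         B two-trial and C one-trial averaged responses
\mathrm{NC}_v = \frac{\mathrm{ncsnr}_v^2}
                     {\mathrm{ncsnr}_v^2 + \dfrac{A/3 + B/2 + C/1}{m}}

% (5) ncsnr: ratio of estimated signal and noise standard deviations
\mathrm{ncsnr}_v = \frac{\hat{\sigma}_{\mathrm{signal}}}{\hat{\sigma}_{\mathrm{noise}}}
```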
23. / Non-Linear Activity Mapping Methods: Details
Supervised Regression Trees (RTs) learning approach, tested and
chosen parameters:
• Split criterion: Mean Squared Error (MSE)
• maximum depth of the tree [5, 10, 15] ➨ 5
• minimum number of samples required to split an internal
node [2,3] ➨ 2
• minimum number of samples needed to define a node as
a leaf node [1,2] ➨ 1
• number of features considered when searching for the
best split ➨ number of PCA components
Support Vector Regression (SVR) learning approach, chosen parameters:
• tube width ε (maximum distance between predicted and
true values within which a penalty on the loss function is
not generated) ➨ 0.1
• regularization parameter C (high values lead to more
accurate fits on the training data but increase the
sensitivity of the model to noise) ➨ 1.0
• kernel ➨ Radial Basis Function (RBF)
• Gaussian kernel
• gamma parameter (how far the influence of
individual training examples can reach) ➨ 1 /
number of PCA components
24. / Feature Mapping Methods: Details
Summary of the properties of the pre-trained models used as
feature extractors:
Summary of the different sets of image pre-processing steps
applied to image inputs:
25. / Comparing Fused Feature and Single Layer Approaches
[Figure: panels (a)-(f) comparing the Fused Features and Single Layer approaches]
Comparison of the voxel-wise prediction accuracy for Subject 1, between an encoding model based on fused feature mapping (ViT-S/14
(DINOv2) 5+6+7), and encoding models using a single feature layer approach (ViT-S/14 (DINOv2) with output layers 5, 6 or 7):
• randomized permutation test to determine a minimum threshold MNNSC value significantly different from zero ➨ 0.19 (p < 0.001)
• (a, b, c): NNSC (abscissae) of the single feature models and NNSC (ordinates) of the 5+6+7 fused feature encoding model
• (d, e, f): distributions of voxel-wise differences between the accuracy of the fused feature model and the single feature layer models
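The randomized permutation test can be sketched as below on toy data: shuffling the predictions across images destroys any real stimulus correspondence, and the 99.9th percentile of the resulting null squared correlations gives a p < 0.001 significance threshold (the 0.19 value reported above comes from the real predictions, not from this toy example):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy ground-truth (g) and predicted (p) responses for one voxel.
n_images = 300
g = rng.standard_normal(n_images)
p = 0.5 * g + rng.standard_normal(n_images)

def sq_corr(a, b):
    # Squared Pearson correlation between two response vectors.
    return np.corrcoef(a, b)[0, 1] ** 2

# Null distribution: squared correlation after randomly permuting the
# image order of the predictions, repeated many times.
null = np.array([sq_corr(rng.permutation(p), g) for _ in range(1000)])
threshold = np.quantile(null, 0.999)  # exceeded by chance with p < 0.001
observed = sq_corr(p, g)
print(observed > threshold)
```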
26. / Comparing CNNs Pre-Trained with Different Training Tasks and Learning Methods
Comparison of the voxel-wise prediction accuracy for Subject 1 between the two best configurations of the ROI-wise mixed encoding
models based respectively on the:
• (a) pre-trained ResNet-50 (self-supervised DINOv1, ImageNet-1K) and ResNet-50 (image classification, ImageNet-1K) models
• (b) pre-trained RetinaNet (object detection, MS COCO) and ResNet-50 (image classification, ImageNet-1K) models
• (c) pre-trained RetinaNet (object detection, MS COCO) and ResNet-50 (self-supervised DINOv1, ImageNet-1K) models
27. / Best ROI-wise Encoder: Cross-Validation Details
Overall and functional ROI class-specific 10-fold cross-validation accuracies (MNNSC) considering the voxels of all subjects:
ROI-specific 10-fold cross-validation accuracies (MNNSC) considering the voxels of all subjects:
28. / Best ROI-wise and Baseline Models: Cross-Validation
Proposed ROI-wise and mixed encoding model
Baseline encoding model
29. / Best Subject 1 and 2 Encoders: Cross-Validation Details
S1
S2
30. / Best Subject 3 and 4 Encoders: Cross-Validation Details
S3
S4
31. / Best Subject 5 and 6 Encoders: Cross-Validation Details
S5
S6
32. / Best Subject 7 and 8 Encoders: Cross-Validation Details
S7
S8