TASK-OPTIMIZED DEEP NEURAL NETWORK TO REPLICATE THE HUMAN AUDITORY CORTEX

MODEL TO REPLICATE HUMAN AUDITORY SYSTEM

AN OVERVIEW OF TASK-OPTIMIZED NEURAL NETWORK TO REPLICATE HUMAN AUDITORY BEHAVIOUR


 Introduction to the article and some insights on it
 Mechanism of hearing by the ear and brain
 Organisation in the auditory cortex
 Earlier models on cortex and need for a new one
 Deep-learning model based on 6 contexts
 The method of generating cortical responses by model and
traditional method with comparisons
 Findings and inferences through different perspectives and
analyses
 Achievements by the model
 Added features in model in comparison to its precursors
 Disadvantages of model and future directions
ROAD MAP


 This paper is about the development of a
Convolutional Neural Network (CNN) model which
is task optimised to perform some real-world
auditory tasks.
 This model can be used to understand the
architecture of human auditory system. (why need
model?)
 This model also predicts fMRI voxel responses throughout the auditory cortex better than the standard spectrotemporal filter model.
 The model provides details about the primary and
non-primary responses
Introduction


 There are two types of auditory responses namely
the primary and non-primary responses.
 PRIMARY RESPONSE is the response obtained from the cochlea and the primary pathway (purely auditory). This represents the primary auditory cortex (A1). These regions carry out simpler tasks (reason given later).
 NON-PRIMARY RESPONSE is the response obtained from the non-primary pathway (mixed senses). This represents the regions beyond the primary auditory cortex. These regions carry out complex tasks.
Auditory responses


 The neuronal processing that begins in the human ear transforms sound into cortical representations.
 These representations make behaviourally important sound properties explicit (like an amplifier).
 The organisation of the human auditory cortex remains unresolved, and there are no competent models to explain this transformation of sounds into representations.
Ear mechanism


 This question has been debated, with researchers favouring either side (distributed or hierarchical).
 Formisano and Staeren proposed an anatomically distributed organisation.
 There is a tripartite hierarchical organisation seen in non-human animals which carry out simple tasks.
 However, these findings cannot confirm whether the human auditory cortex is distributed or hierarchical.
Organisation of auditory
cortex


 Some early models used linear filtering of
cochleogram (sound in image format) using 1-2
stages.
 However the process of transformation is non-linear.
So those methods failed to address the purpose.
 So it is essential to develop models which carry out
non-linear functions.
 The model has to account for both the organisation and the transformation.
Early models and cause of failure

Sl. no. | Name | Description | Our case
1 | Data | Type of data provided as input | Cochleogram
2 | Task | Operation to be performed on the input | Classification (multiple), Regression (prediction)
3 | Model | The mathematical relation between input and output. This varies with the task and its complexity and may involve layers | CNN (Convolutional Neural Network)
4 | Error | A measure of the difference between two quantities | Comparison of the model's classification with the human's classification
5 | Algorithm | A learning procedure which tries to reduce the error computed before | Stochastic gradient descent
6 | Evaluation | Finding how well the model has performed | Comparison with human behaviour


 The data on which the CNN is trained are cochleograms.
 The cochleogram is the visual representation of the sound signal.
 The cochleogram is a spectro-temporal representation of sound (frequency against time).
 A 2-second sound signal is taken as input.
Data
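To make the idea of a spectro-temporal representation concrete, here is a minimal sketch. It is a crude stand-in for a cochleogram, not the procedure used in the article: it measures the energy of each short frame at a few octave-spaced frequencies and compresses it. The function name `toy_cochleogram`, the centre frequencies, the frame length and the 0.3 exponent are all assumptions made for illustration.

```python
import math

def toy_cochleogram(signal, sample_rate, center_freqs, frame_len):
    """Return a matrix: rows = frequency channels, columns = time frames."""
    n_frames = len(signal) // frame_len
    rows = []
    for freq in center_freqs:
        row = []
        for k in range(n_frames):
            frame = signal[k * frame_len:(k + 1) * frame_len]
            # Correlate the frame with a cosine and a sine at this frequency.
            re = sum(s * math.cos(2 * math.pi * freq * n / sample_rate)
                     for n, s in enumerate(frame))
            im = sum(s * math.sin(2 * math.pi * freq * n / sample_rate)
                     for n, s in enumerate(frame))
            magnitude = math.hypot(re, im) / frame_len
            row.append(magnitude ** 0.3)  # assumed compressive non-linearity
        rows.append(row)
    return rows

sample_rate = 8000
# 0.25 s pure tone at 440 Hz (hypothetical test input)
tone = [math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 4)]
center_freqs = [110 * 2 ** i for i in range(5)]  # 110, 220, 440, 880, 1760 Hz
cgram = toy_cochleogram(tone, sample_rate, center_freqs, frame_len=400)
```

For the pure tone, the channel centred at 440 Hz carries the most energy in every frame, which is what a time-frequency image of that sound should show.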


 There are two tasks to be performed namely word
identification and music genre recognition
 The task is made difficult by introducing
background noises with the music/word sound
 The task is to find one word out of 587 or to find one genre out of 41 categories.
 The model's layer responses are also used to predict cortical responses.
Task


 The model contains Convolution, Pooling, Dense,
Filter response normalisation and Dropout layers.
 It is a hierarchical model. Convolution followed by rectification, together with pooling and normalisation, makes the overall mapping non-linear.
 The model had five convolutional, three pooling, two normalization, and two fully connected layers.
 The first 7 layers are shared by both tasks, after which each task has its own branch of 5 layers (including the FC layers). Sharing the early layers means fewer parameters than two separate networks.
 The hyperparameters were task optimized.
Model
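The shared-then-branched layout described in this slide can be written down as a small data structure. This is only a sketch: the layer names below are hypothetical labels chosen to match the layer counts on the slide (five convolutional, three pooling, two normalization, two fully connected; 7 shared and 5 per branch), and no layer sizes are implied.

```python
from collections import Counter

# Hypothetical layer names; only the counts come from the slide.
SHARED_LAYERS = ["conv1", "norm1", "pool1", "conv2", "norm2", "pool2", "conv3"]
BRANCH_LAYERS = ["conv4", "conv5", "pool5", "fc6", "fc_top"]

def build_path(task):
    """Ordered layers that one input passes through for the given task."""
    return SHARED_LAYERS + [f"{name}_{task}" for name in BRANCH_LAYERS]

def layer_type_counts(path):
    """Count layers by type, e.g. 'conv4_word' -> 'conv'."""
    kinds = [name.split("_")[0].rstrip("0123456789") for name in path]
    return Counter(kinds)

word_path = build_path("word")
genre_path = build_path("genre")
counts = layer_type_counts(word_path)
```

Both paths are 12 layers long and are identical for the first 7 layers, so only the last 5 layers have to be duplicated per task.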


 The model was selected in two steps.
 The first step involved 180 architectures, each with 12 layers and trained on a single task.
 The second step involved 7 architectures with 12 layers, each trained on both tasks.
Model selection


 To evaluate how closely the model's responses resemble human responses, the model is compared with human listeners on the same tasks.
 For WORD IDENTIFICATION, the human is allowed to use a UI which auto-completes the word (to ensure that it belongs to one of the 587 classes).
 For GENRE IDENTIFICATION, the human is allowed to list five genre preferences (top 5).
 The error here is the wrong predictions.
 An interesting observation was that the model's error patterns resembled those of humans.
Error


 The algorithm used here is stochastic gradient descent.
 The role of the algorithm is to find parameter values for which the loss is as small as possible (theoretically 0).
 The word stochastic refers to the way the input is taken: the gradient is estimated from one randomly chosen sample (or a small batch) at a time rather than from the whole dataset.
 Gradient descent refers to repeatedly moving the parameters against the gradient of the loss, which leads towards a (local) minimum of the loss.
Algorithm
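The update rule can be illustrated on a toy problem. The sketch below fits a straight line with stochastic gradient descent, one sample per update; the data, learning rate and number of epochs are assumptions chosen for the illustration and have nothing to do with the settings used to train the CNN.

```python
import random

def sgd_fit(samples, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)            # "stochastic": random sample order
        for x, y in data:            # one sample per parameter update
            err = (w * x + b) - y
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * x
            b -= lr * err
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
samples = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_fit(samples)
```

Each step moves the parameters a little against the gradient of the loss for the current sample, and the estimates settle close to the slope and intercept that generated the data.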


 The confusion matrix is used to evaluate the performance of the model in the genre recognition task (41 classes).
 The confusion matrix has one row and one column per class and compares the truth with the model prediction. For two classes it has 4 fields; for 41 classes it has 41 × 41.
 The same can be plotted for word identification, but the plot would be hard to read with 587 classes.
Evaluation
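A confusion matrix is simple to build by counting. The sketch below uses three classes and made-up labels purely for illustration; the genre task would use `n_classes=41` in the same way.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] = number of items of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[truth][guess] += 1
    return matrix

def accuracy(matrix):
    """Fraction of items on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical labels for six test items.
truth = [0, 0, 1, 1, 2, 2]
guess = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(truth, guess, n_classes=3)
acc = accuracy(cm)
```

The off-diagonal entries show which classes get confused with which, and this is the information used when comparing the model's error pattern with that of humans.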


 The next task for the model is to predict fMRI voxel responses throughout the auditory cortex. In short, it has to produce cortical responses.
 A voxel is a single unit block in a 3-D image (like a block in Minecraft).
 The data used here are 165 regularly heard natural sounds, of which 52 were words and music.
 These sounds were presented to the model, and the responses predicted for each voxel were collected.
 These were compared with the predictions of the standard spectrotemporal filter model.
Cortical responses


 Listening versus hearing...
 An important part of processing auditory signals is 'attention':
 taking in the required signal and eliminating the unwanted rest.
 Hence a filter is formed inside the auditory cortex, like neurons which respond maximally to given input frequencies. It has two functions:
 To incorporate information about both the timing (rhythm)
and the frequency content of the relevant auditory stimulus
stream.
 To enhance the sensory representation of attended stimuli
along these two feature dimensions.
Spectrotemporal filters


 The response of each layer of the model, averaged over time, was taken into consideration.
 For each layer, linear regression was used to map these time-averaged responses onto the measured response of each voxel.
 In other words, the unit responses of a layer were linearly combined to artificially create a 'voxel'.
 As a result, we have a predicted voxel response for all 165 sounds from every layer.
 The BOLD signal is too slow to follow changes within a 2-s sound, hence the time average is used.
Method of extraction
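The regression step can be sketched in its simplest form: one time-averaged model feature predicting one voxel, fitted on some sounds and checked on held-out sounds. The article's procedure combines many units per layer; the single predictor, the numbers and the train/test split below are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def variance_explained(predicted, measured):
    """Squared Pearson correlation between prediction and measurement."""
    n = len(measured)
    mean_p, mean_m = sum(predicted) / n, sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov * cov / (var_p * var_m)

# Hypothetical values: one entry per sound.
feature = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]   # time-averaged unit response
voxel = [0.3, 0.9, 1.2, 1.9, 2.3, 3.1, 3.3, 4.1]     # measured voxel response

train_x, test_x = feature[:5], feature[5:]
train_y, test_y = voxel[:5], voxel[5:]
slope, intercept = fit_line(train_x, train_y)
predicted = [slope * x + intercept for x in test_x]
r2 = variance_explained(predicted, test_y)
```

Evaluating on sounds that were not used for fitting is what keeps a model with many features from looking good simply because it has more parameters.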


 The comparisons were made using four elements:
 The trained model with learned (task-optimised) weights
 The untrained model with random weights
 The traditional spectrotemporal filter model
 The random model from selection
Comparisons


 The comparison must be done with the truth.
 The truth is obtained by playing the same sounds to subjects in an fMRI machine to get the voxel responses.
 At first, the BOLD variance explained was computed for all 4 methods.
 This was corrected for the reliability of both the measured voxel response and the predicted voxel response.
 The comparisons were made on all voxels and on some specified voxels.
 As expected, the trained model explained more variance than the spectrotemporal model and the untrained one.
BOLD variance


 Then the median variance explained across voxels was taken for the same.
 The trained model (70%) explained more variance than the spectrotemporal filter model (55%).
 The filter model had the highest number of
parameters it can withstand.
 And it eventually saturated.
 The untrained and random models were worse than the trained model and the spectrotemporal filter model.
 The trained model explained the most variance in all ROIs and proved to be better than the traditional one.
Median variance
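The reliability correction and the median from the last two slides can be sketched as follows. The formula is the standard correction for attenuation applied to a squared correlation; whether the article uses exactly this form is an assumption, and all numbers are made up.

```python
from statistics import median

def corrected_variance_explained(r_squared, reliability_measured,
                                 reliability_predicted):
    """Divide raw variance explained by the product of the two reliabilities."""
    return r_squared / (reliability_measured * reliability_predicted)

# Hypothetical values for three voxels.
raw_r2 = [0.30, 0.45, 0.50]
rel_measured = [0.60, 0.75, 0.80]    # reliability of the measured response
rel_predicted = [0.90, 0.95, 1.00]   # reliability of the predicted response

corrected = [corrected_variance_explained(r, m, p)
             for r, m, p in zip(raw_r2, rel_measured, rel_predicted)]
median_corrected = median(corrected)
```

The correction raises the score of noisy voxels so that a model is not penalised for measurement noise it could never predict, and the median then summarises all voxels in one number per model.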


 The trained deep learning model performed the best
and was far better than the spectrotemporal one.
 This improved prediction of voxel responses is due to the hierarchical organisation of the model.
 The convolution and the pooling layers of the model produced a receptive field (spectrum of signals) similar to that of the cortical system.
 The model also performed better than the spectrotemporal model in the regions of interest.
Findings


 This shows that the model predicts responses to natural sounds better than the spectrotemporal filter model throughout the auditory cortex.
 This is due to the hierarchical organisation of the
model.
 The task optimization has resulted in a good cortical
response
Inference


 The responses obtained from the later layers of the network were more non-linear than those of the other layers.
 To assess this property, it is essential to compare the response from each layer of the model.
 The median variance explained by each individual layer was taken into consideration for comparison.
 Based on these, there were some important findings which led to some inferences about the organisation of the human auditory system.
Procedure to assess the
hierarchical organisation


 The median variance explained increased across the early layers and then decreased for the last layers.
 All layers except the first and last performed better than the spectrotemporal filter model.
 All layers except the last layer in the trained model explained more variance than those of the untrained model, even though their dependencies with data were the same.
 The intermediate layers made the best predictions whereas the final layers made poor predictions.
Findings


 The receptive fields of some of the layers in the network were similar to those in the auditory cortex, and this may be the reason for their high performance.
 The task optimisation has helped in replicating some of the cortical properties in the model.
 As per the task, the neurons in the final layers are involved in perceptual decisions.
 Such neurons may be present in the auditory cortex, but their organisation may not be accessible by conventional fMRI.
 Or these might lie beyond the auditory cortex, in other brain lobes.
Inferences


Summary map
 The variance explained by each layer was plotted as maps.
 The heatmaps of the variance and predictions of the individual layers were mapped onto the probabilistic map which involves three anatomically defined regions of the primary auditory cortex. This is done for each individual subject.
 The average taken over all subjects is a summary map.
 This relates the model to the human cortex.
 The black outlines are the
anatomical regions.


Findings
 The intermediate layers best predicted the voxels that correspond to the primary auditory cortex (core).
 The last layers of the network best predicted the regions away from the primary auditory cortex (non-core).
 The same results were not
seen in an untrained model
with random weights.
 Also the same results were
seen when words and music
were removed from training
data.


 This indicates that the intermediate and the last layers of the network generate primary and non-primary responses, respectively.
 Also the intermediate layers perform simpler tasks when compared to the later layers (reason given later).
 The same results were seen, i.e. primary voxels best predicted by the intermediate layers and non-primary voxels best predicted by the last layers, even when words and music were removed.
 This suggests that the hierarchical structure of the model
helped it in generating better cortical responses for
everyday sounds
Inference


 There are four functionally defined regions of interest (ROIs), namely:
 frequency selective
 pitch selective
 word selective
 music selective
Regions of interest


VOXEL TYPE/LAYER   INTERMEDIATE LAYER   DEEP LAYER
FREQUENCY-SELECTIVE  
PITCH-SELECTIVE  
MUSIC-SELECTIVE  
SPEECH-SELECTIVE  
Findings


 The frequency voxels which were best explained by the
intermediate layers are found early in hierarchy and the
speech voxels which were best explained by the later layers
were found later in the hierarchy.
 This may be why the intermediate layers perform simpler functions and the later layers perform complex functions.
 As before, the untrained network explained less variance than the trained network and also the spectrotemporal model.
 The dependencies did not affect the performance of the model, suggesting that the task optimization was critical to map the features in the layers to the auditory cortex.
 The ROI analysis supports a hierarchical organisation.
Inference


HENCE BOTH THE MODEL
AND THE HUMAN CORTEX
ARE ORGANISED
HIERARCHICALLY!!


 The representation of the acoustic features by the network was compared with that of the spectrotemporal model.
 The aim was to check whether the representations of both models were linearly decodable.
 For this, the data was divided into two subsets, of which the first was used for fitting the mapping and the second for checking its quality.
Acoustic features


 The ability for the network layers to extract spectral
information from the data decreased
as the layers progressed.
 The extraction ability was
constant for the spectrotemporal
model which peaked at the
intermediate layer.
 The prediction of the later layer
is worse than the earlier and this
was prominent in the untrained model.
Findings and inference


 It is essential that the model performs well on real-world tasks in order to replicate the auditory cortex.
 The model was analysed layer-wise on the existing tasks and on a new speaker identification task for which the model wasn't trained.
 This was done by fixing the network weights and training a softmax classifier on the output of each layer.
Real-world task performance
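Reading a task out of a fixed layer can be sketched as training a small softmax classifier on frozen features. The feature vectors, labels, learning rate and epoch count below are hypothetical; in the analysis described above, the features would be the activations of one network layer.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, biases, x):
    return [sum(w * v for w, v in zip(weights[c], x)) + biases[c]
            for c in range(len(biases))]

def train_readout(features, labels, n_classes, lr=0.5, epochs=200):
    """Train only the readout; the features themselves are never changed."""
    n_dims = len(features[0])
    weights = [[0.0] * n_dims for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for d in range(n_dims):
                    weights[c][d] -= lr * grad * x[d]
                biases[c] -= lr * grad
    return weights, biases

def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return scores.index(max(scores))

# Hypothetical frozen layer outputs for four sounds from two classes.
frozen_features = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
labels = [0, 0, 1, 1]
weights, biases = train_readout(frozen_features, labels, n_classes=2)
predictions = [predict(weights, biases, x) for x in frozen_features]
```

Because only the readout is trained, its accuracy measures how easily the task information can be extracted from that layer, which is why the same procedure can be repeated layer by layer.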


Findings
 The findings were contrary to
that seen previously
 The performance improved
from early to the deeper
layers of the network.
 The same level of performance
was seen also in the speaker
identification task except
for final layer.
 This suggests that the network
representations are task-
generalised.
(same for most auditory tasks)


 All of the previous findings and analyses portray the process of
transformation from cochlea to cortex
 The role of the cortex is to transform acoustic features obtained
from the cochlea into meaningful representations and the role
of this transformation is unknown
 These analyses suggest that task-related information which is not clearly expressed in the cochlea (implicit) is transformed by the auditory cortex into representations in which it is clearly expressed (explicit).
 In simpler terms, the transformation has provided some
meaning and explanation to the information using which both
the brain and the model figured out the output.
Inference


 The input data involved the incorporation of
background noise with the sound signal
 The noise was added at different SNRs (Signal to Noise Ratios).
 The analysis done on this constitutes the SNC (Signal to Noise Characteristics).
 The signals were categorised according to the SNR
and were fed to the network for analysis.
 The objective is to find the role of noise in processing
information from the signal.
SNC
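Adding background noise at a chosen SNR amounts to scaling the noise before summing. A minimal sketch, assuming the SNR is defined as the power ratio in decibels; the signal, the deterministic pseudo-noise and the 3 dB target are made up for the example.

```python
import math

def power(samples):
    """Mean squared amplitude."""
    return sum(v * v for v in samples) / len(samples)

def mix_at_snr(signal, noise, snr_db):
    """Scale noise so that 10 * log10(P_signal / P_noise) equals snr_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(signal, scaled_noise)]
    return mixture, scaled_noise

# Hypothetical 100-sample signal and deterministic pseudo-noise.
signal = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noise = [((n * 37) % 11 - 5) / 5 for n in range(100)]

mixture, scaled_noise = mix_at_snr(signal, noise, snr_db=3.0)
achieved_db = 10 * math.log10(power(signal) / power(scaled_noise))
```

A lower SNR means relatively louder noise, so sorting the inputs by SNR gives a graded scale of difficulty on which each layer can be tested.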


 The signals with less noise were
well classified by the intermediate
layers as well as the deep layers.
 But, the signals with more noise
were well classified by the
deep layers only.
 The later layers of the model are less sensitive to noise, i.e. they are noise-robust.
Findings and inference


 The sounds used here were the same as in the fMRI experiment, but the words and music were excluded (113 sounds).
 These sounds were divided into two subsets based on stationarity (the stability of mean, SD etc.).
 They divided the cochleogram into categories and took the standard deviation over time.
 Then the individual layer responses to the two sets of sounds were measured.
 Later the same was compared with the voxel responses from the fMRI machine.
Noise-stimuli sensitivity
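The idea of measuring stationarity as variability over time can be sketched with a toy proxy: split a sound into windows, compute the energy of each window, and take the standard deviation across windows. This is an assumed simplification that works on the raw waveform rather than on the cochleogram; the window length and test signals are hypothetical.

```python
import math
from statistics import pstdev

def window_energies(signal, window_len):
    """Mean squared amplitude of each consecutive window."""
    n_windows = len(signal) // window_len
    return [sum(v * v for v in signal[k * window_len:(k + 1) * window_len])
            / window_len
            for k in range(n_windows)]

def nonstationarity(signal, window_len):
    """Standard deviation of window energy over time (0 = perfectly steady)."""
    return pstdev(window_energies(signal, window_len))

# A steady tone, and the same tone switched off in every other window.
steady = [math.sin(2 * math.pi * n / 20) for n in range(400)]
bursty = [v if (n // 100) % 2 == 0 else 0.0 for n, v in enumerate(steady)]

steady_score = nonstationarity(steady, window_len=100)
bursty_score = nonstationarity(bursty, window_len=100)
```

A sound whose statistics stay the same over time scores near zero, while a sound that changes from window to window scores higher, which is the basis for splitting the sounds into two subsets.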


 The deep layers of the trained network exhibited a greater response to the non-stationary sounds than to the stationary sounds.
 However, the same effect was not observed in the untrained network.
 From the fMRI, the responses to stationary and non-stationary sounds were similar in the primary areas (A1), but a greater response to non-stationary sounds was seen in the non-primary areas.
Findings


 There is a functional differentiation between the primary and the non-primary regions, and these results support the correspondence found earlier (intermediate layers with primary regions, deep layers with non-primary regions).
 Stationary (noise-like) sounds are suppressed in the later layers of the model and in the non-primary regions, which contributes to the stronger response to non-stationary sounds in the deep layers and the non-primary cortex.
 This has helped the model to predict responses to natural sounds even though they were affected by noise.
Inference


Task-performance
 It was found that
networks with better
performance on a real-
world visual object
recognition task better
predict cortical responses
in the visual stream.
 To test the same for audition, 57 different models from stage-1 were taken at 14 different training points (798 in total) for either the word or the genre task.
 The median variance was
measured for each layer.


 The performance of a network on a task strongly
correlated with the variance it explained in auditory
cortical responses.
 The word task had a Spearman correlation of 0.87
and the genre task had a Spearman correlation of
0.85
 These results suggest that the task-based optimization of deep neural networks can help yield more predictive models of sensory systems.
Continued…
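The Spearman correlation used here is the Pearson correlation of the ranks. A minimal sketch with made-up values for five hypothetical networks (these are not the article's data):

```python
def ranks(values):
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Hypothetical task accuracy and median variance explained for five networks.
task_accuracy = [0.20, 0.35, 0.50, 0.65, 0.80]
variance_explained = [0.30, 0.42, 0.40, 0.55, 0.61]
rho = spearman(task_accuracy, variance_explained)
```

Because only the ordering matters, the measure captures "better task performance goes with better cortical prediction" without assuming the relationship is a straight line.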


 The model performed as well as humans in the tasks of word recognition and genre identification.
 The model produced human-like error patterns.
 The task optimization resulted in the model replicating the auditory cortex in one aspect (branching of layers for specific tasks).
 The model predicted fMRI responses throughout the auditory cortex far better than the standard method (spectrotemporal filter).
Achievements


 Task optimization resulted in better cortical responses by
the model, without which the predictions were poor
(untrained model)
 Intermediate layers of model predicted the primary
response and deep layers of model predicted non-
primary response.
 The model has provided evidence that the organisation of the human auditory cortex is hierarchical.
 The hierarchical organisation and task optimization made the model general and powerful.
Continued…


 The model had some non-linear operations like normalization and pooling, and this is the reason for its improved response. Research suggests that the operations within the cortex are non-linear, so the model did better than the filter model, which lacked these features.
 An alternative method for evaluating the cortex organisation was provided by the model (model and human were given the same task and performed similarly, which suggests that the model architecture resembles the human one).
Continued…


 The task optimization resulted in powerful models
which can replicate the visual and auditory system.
 The primary visual responses were best given by the
early layers of the model and the primary auditory
responses were best given by the intermediate layers
of the model.
 This suggests that the auditory cortex sits deeper in the computational hierarchy than the visual cortex.
 This is in accordance with the fact that the auditory pathway passes through more subcortical nuclei than the visual pathway.
Comparisons with the visual
system


 This deep learning model (12 layers) is deeper than its predecessors (2 or 3 layers).
 This depth helped in a good representation of complex real-world tasks and better cortical responses.
 The branching of the network in the deep layers as a result of task optimisation is consistent with the functional segregation seen in the non-primary cortex.
 The model could perform other sound related tasks even
though not trained on them.
 The parameters were based only on half of the data and
the model performed better for the untrained data also.
Advantages


 The individual units used in the model are not readily interpretable.
 The choice of task wasn't so important for analysis of the human cortex. The genre task was chosen because a large dataset was readily available, but genre labels are culturally biased.
 The model couldn't replicate the human in terms of learning; humans learn by experience and feedback whereas the machine learns from data.
Disadvantages


 The model provided evidence that the human cortex has a hierarchical organisation, but an even better one is required to determine whether it is tripartite, as seen in animals.
 Research says that the auditory pathway has more subcortical nuclei; this can be tested by predicting the subcortical responses from the early layers of the model.
 Training the model for additional music-related tasks, or
tasks not specific to speech or music, could yield a more
complete model of human behaviour.
 Improving the model from the learning point of view can
make the model more correlated to that of the human.
Future updates


REFERENCE…
Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., &
McDermott, J. H. (2018). A Task-Optimized Neural Network Replicates Human
Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing
Hierarchy. Neuron, 98(3), 630–644.e16. doi:10.1016/j.neuron.2018.03.044
**All information has been taken from this research article.**
