Stacked Convolutional Neural Networks as a Language Model of Imagined
Speech in EEG Data
Barak Oshri
Stanford University
Manu Chopra
Stanford University
Nishith Khandwala
Stanford University
1. Introduction
1.1. Imagined Speech Faculty
Imagined speech refers to thinking in the form of sound,
often in a person’s own spoken language, without moving
any muscles. It happens both consciously, when people deliberately imagine their own vocalization, and subconsciously as a form of thinking, whether during reading, reciting, or silently talking to oneself.
This phenomenon includes an imaginative component and a
vocal component, with the brain calling on parts of the men-
tal circuit responsible for speech production, falling short of
producing the signals for moving the muscles of the vocal
cords. Instead, the language signals are embedded and pro-
cessed in the buffer that leads into the imaginative space of
the mind.
Despite a limited understanding of what this mental circuit looks like, we can reason about necessary and sufficient conditions on what the silent language formation pipeline must entail, for example that ”sound representation deeply informs... linguistic expressions at a much higher level than previously thought,” indicating that speech production must be tightly subsumed within language formation, instead of the circuits running in parallel with some auxiliary connections [3].
1.2. Brain Representations of Imagined Speech
What makes imagined speech research promising is that the rich linguistic features present in language are what the brain anchors to as it develops its language facilities. This means that the structures present in the brain for speech formation must be ordered, at least in part, in the same way that language is structured abstractly. Hence to succeed in imagined speech, we need only learn how these speech patterns interface with imagination in a deterministic way. Once we do, we can apply many of the same techniques used in traditional speech recognition, such as hidden Markov models, language tree models, and deep learning.
So whether this is possible is not just a question of how lan-
guage is formed in the brain, but how it is represented in
the high-level faculties such as imagination and consciousness. Even if language models are deterministically present in language formation, we still do not have any understanding of how they are manifested in imaginative capacities; imagination could just as well add noise to the clear language representations evoked by fundamental language structures such as Broca’s and Wernicke’s areas.
And what understanding we have of how imagination works merely points to how complex and unconfined it is: it is not localized to any specific structure in the brain and is spread out in a highly diffuse manner. The fact alone that mental states strongly affect its functioning, by changing neural oscillation frequencies with the level of waking attention, shows why the output from electroencephalography (EEG) appears indeterminate and chaotic. That is, assuming such a ”layered understanding” of the brain, we don’t know how close imagination is to the act of producing the language.
Therefore, imagined speech research is a neuroscientific endeavor as much as it is a computational one. Any successes and failures in performance are indications of how well the model fits the data, which in turn allows us to extrapolate claims about how the brain is organized.
1.3. The Trouble with Data
Attaining well-formed data for imagined speech research is
the largest challenge in this field. Most studies have used
EEG imaging for its high temporal resolution. The trouble with this data is that each channel captures only an ”average” electric signal across an axis relative to some root (often the top of the spine at the back of the head).
Experimentation with imagined speech is exceedingly challenging. The human brain is incredibly sensitive to external stimuli, any of which could confound the controlled lab environment of the experiment. It is difficult to design a setup that determines exactly when a word is thought. If a pitch is used to cue the subject, then the brain response to the pitch affects the immediate signal. This is a serious danger to the integrity of the experiment, as a classifier could inadvertently classify the differing responses to the different pitches rather than the imagined speech.
And even when data is well-formed and meets the task descriptions, there is no guarantee that the resolution of the technology has captured the relevant information. EEG has a high noise-to-signal ratio, and noise reduction methods are effective but not complete. Technological advances in
brain-imaging in the future will increasingly help research
in this field.
1.4. The Brain as a Personal Computer
To summarize the challenge of this task in an analogy, con-
sider a modern personal computer. Such a PC computes a
myriad of tasks at any given moment: resting, active, back-
ground, etc. We would like to identify the current state of
exactly one of these processes. We have faint ideas of where
this process appears within the entire system, but we are not
sure what its form is, or how loud or prominent it is in com-
parison with other processes. In fact, it is not a single pro-
cess, but a parallel one, merged, influenced, and blockaded
by all the other processes happening in the computer. Also consider that our apparatus for measuring this computer, call it our voltmeter, does not get to measure the state of every transistor, but only some start and end points.
To find this elusive process, we start with the assumption that our voltmeter is omniscient and knows the states of all transistors. Given the entire state of the computer, we
exploit any understanding we have of the structure of the
process, such as expectations for how some of the wires
should be arranged in the process, allowing us to ”pattern
match” with the data. Without understanding of the struc-
ture of this process, any algorithm could only blindly tra-
verse the data as it doesn’t know what to expect from this
process. Therefore the success of generalizing this process
to new situations depends on the correctness of our exter-
nal ”fitting”. It seems like pattern recognition, even without
hand-engineered features, is doomed to needing some level
of assumption about the organization of the data, or at least
how it could be learnt.
This analogy is important in explaining why we need a ro-
bust language model suitable for the task to be able to com-
prehensively solve imagined speech. Keeping the challenge
of this field in mind, any research done in the present times
will aid efforts in the future when the technology and neuro-
science catch up with the immensity of the task. Therefore,
it is important not to lose aspiration to solve this problem so
that the methodology and foundations are ready and an op-
timistic tone has been set for when science and technology
allow us to converge upon a solution.
1.5. Uses of Imagined Speech
These challenges should be viewed against the many bene-
fits and opportunities opened by commercial, medical, and
research aspects of imagined speech. Thousands of severely
disabled patients are unable to communicate due to paralysis, locked-in syndrome, Lou Gehrig’s disease, or other neurological diseases. Restoring communication in these patients has proven to be a major challenge. Prosthetic devices that are operated by electrical signals measured by sensors implanted in the brain are being developed in an effort to solve this problem. Researchers at UC Berkeley have worked on developing models to generate speech, including arbitrary words and sentences, using brain recordings from the human cortex [4]. Success in understanding imagined speech will enable these patients to talk as
thoughts can be directly synthesized into sound. Imagined
speech may finally restore the lost speech functions to thou-
sands, if not millions, of patients.
Furthermore, the capability to comprehend imagined
speech has the clear potential for a variety of other uses,
such as silent communication when visual or oral communication is undesirable. As fluency in and dependence on technology raise the demand for faster, cleaner, and more productive interfaces, the pathways such an accomplishment would open in our communicative abilities would lead to a revolution in our natural and digital interaction with the world.
Humans have created a world of messages, expressions, and
meanings central to our living individually and in commu-
nities, and this is true in the mind just as much as it is in our
writings and other artistic mediums.
Tapping into the sheer vastness and wealth of color exhib-
ited by the mind is sure to revolutionize the dimensionality
of human expression, its data, and our relationship with the
digital world. The ability to understand imagined speech
will fundamentally change the way we interact with our de-
vices, as digital technology shows trends of connecting with
our bodies and activities more and more, from eyewear to watches. While such research is controversial at the least and invasive at worst, these fears should not inhibit attempts at studying how much information could be mined from the
Tapping into imagined speech data is the holy grail of human communication interfacing. There are few problems that reach so close to human demands and intentions, as imagined speech happens involuntarily and irrepressibly.
This makes imagined speech an underestimated advance
in our ability to make use of human abilities in our activ-
1.6. The State of Imagined Speech
Progress in imagined speech has been slow, lacking, and misguided. Studies in this field tend to be simple, blanket studies that are unfavorable to progressive growth and experimentation. Most of these studies perform classification tasks on words, syllables, or phonemes, such as classifying between several thousand samples of the syllables ba and ku using an EEG, or classifying between yes and no using an Emotiv. The first of these can be credited with showing that phonetically disparate syllables can be strongly differentiated [2], and the second with showing that, even given the small number of channels, an Emotiv can differentiate between two
Neither of these studies addresses the larger challenges posed by general imagined speech progress or advises a robust approach to experimentation with imagined speech that exploits domain-specific knowledge. Neither study can claim to have classified patterns that would still activate on those syllables or words when the class space is grown. That is, each symbol studied is classified relative to the other symbols in the task, which restricts anything learnt in the final model entirely to the task definition.
The only major sponsored effort for imagined speech re-
search was by DARPA in 2008 in a $4 million grant to UC
Irvine to conduct experiments that will allow user-to-user
communication on the battlefield without the use of vocal-
ized speech through neural signals analysis” on the basis
that ”the brain generates word-specific signals prior to send-
ing electrical impulses to the vocal cords.” While promis-
ing, this research has not produced a foundation for general
imagined speech research, instead focusing on military as-
pects and small tasks that benefit battlefield communication
in the short-term.
There is thus a theoretical need to direct this field into an ex-
perimental understanding of how decoding speech from the
brain could work. In this way, research in this field should
not focus on results or data evaluation, but on approaches
and methodologies, especially given that data produced is
heavily dependent on specific context, technology, and sub-
1.7. Learning Linguistic Features
In a previous study of ours, we showed that multi-class
classification of four syllables is possible and effective with
neural networks to an accuracy of greater than 80%. These
results, however, are not as promising as the performance
indicates. There is no reason to believe that the model is
classifying the syllables for their linguistic features; any
other patterns such as associative memory activations could
be the differentiating factors. So what we repeatedly see
in the examples discussed is that the tasks used to research
imagined speech are inherently ill-posed. Therefore, for any
model to succeed, we need to be able to assure ourselves
that the classifier is classifying for the right reasons, in this
case linguistic features, which are invariable and context-
We guarantee that the model is learning linguistic features
by teaching it to classify the same linguistic features in dif-
ferent words and expressions. That is, for robustness the
model must not only generalize to new samples, but to new
language situations. By defining a learning model that is
built around the foundations of general language under-
standing, we can at least guarantee that the parameter space
includes solutions that capture the dependencies inherent in
the language so that the differentiating factors include the
constituents of the language.
2. A Language Model using Convolutional Neural Networks
Having discussed the main foundations of imagined speech,
we now devote the rest of the paper to suggesting the re-
search questions and methods that need to be used to make
progress in imagined speech research, as well as some pre-
liminary results using these approaches.
2.1. Segmentation and Composition
The first of these questions is related to proving that clas-
sification is a valid approach to solving imagined speech.
Before we can assume that language features are static sym-
bols in imagined speech, and thus before we can prove that
classification is suited to this problem, we must verify that
the linguistic features ”add up” in EEG data in such a way
that they appropriately and convincingly capture the patterns of higher-level features. That is, we must rule out the possibility that the language observed in the imagination does not correspond to these features and is instead tainted by associative caches of different words that obscure the layered foundation of language models.
We can answer this question by proving that classifications
of components in imagined speech can be segmented into
correct classifications of the sub-components, and that the
converse is also true, namely that two language symbols
composed next to each other have the same classifications
as if the two symbols were featured as one symbol.
This is akin to proving the ”inductive step”: if the basic rudiments are known, which the studies seem to show is possible and which can only improve as our computational resources increase with time, then these components can be ”built up” and ”built down” without losing the linguistic integrity of the classifications.
This is a necessary condition for speech recognition of EEG
data. In the classical setting of vocal speech recognition, it
is trivial that language features are at least approximately
built up and built down in the signal (if a person says ”ha” and ”bit”, then the signal of ”habit” can deviate only slightly from the sum of the signals of the compo-
In imagined speech, this is not necessarily true. A
person could have multiple representations and activations
for the word ”habit”: one through its phonemes, and one
through single symbols, such as a unique association to the
concept of a ”habit”. Therefore, the EEG signal of ”habit”
will encode the concept of the word while the signals of
”ha” and ”bit” will be purely linguistic concepts.
That is, the noise of imagined speech recognition is sig-
nificantly more complicated and meaningful than the noise
of vocal speech recognition, which is mainly interference
and voice fluctuations. We need to prove segmentation and
composition to show that the noise does not interfere with
the classification in a destructive way, and that it is at least
interfering in a consistent manner across all linguistic levels
of classification. In imagined speech, we have two modes
of classification, ”linguistic” and ”associative”, while in vo-
cal speech recognition there is only the former. Classifying
based on associative features is immediately ill-fated in that
each word has its own unique representation and that two
words linguistically similar could have wildly different as-
sociative representations.
Proving segmentation and composition will show that asso-
ciative meanings do not interfere with the phonetic qualities
of imagined language. This is believed to be true since the
language formation circuitry must regardless be activated if
the word is thought of, hence segmentation and composition
will be true if the linguistic signal is not hidden by concept
activations of the word thought of.
2.2. CNN Features in Sequence Alignment
One approach to proving segmentation and composition is sequence alignment. Sequence alignment, mostly used in arranging sequences of DNA and RNA to identify regions of similarity, can be useful in imagined speech to test how close a fit the features of similar components are in different words. That is, given EEG signals for ”habit” and signals for ”ha” and ”bit”, we align the signal for ”ha” at each point in the ”habit” signal and do the same with ”bit”. If the representation of ”ha” shares features with the first part of ”habit” then ”ha” will successfully align at the beginning of ”habit” and the opposite with
Figure 1. Sequence Alignment Algorithm
This approach to proving segmentation is ill-fated, as it already depends on some successful feature extraction. Doing sequence alignment with Euclidean distance as a baseline, we found that the results were insignificant, showing that whatever meaningful signal is encased in the EEG data is intricate enough to demand sophisticated pattern extraction. Therefore this approach is useful merely as a way of validating the results of some feature extraction, by creating a metric for how closely representations of similar language features match across different strings of words.
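The Euclidean-distance baseline described above can be sketched as a sliding-window comparison. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name sliding_alignment is hypothetical, signals are assumed to be arrays shaped channels x time, and the toy data is random noise constructed so that the alignment succeeds, which real EEG did not do in the experiment reported here.

```python
import numpy as np

def sliding_alignment(long_sig, short_sig):
    """Euclidean distance between short_sig and every equally long
    window of long_sig (both arrays are shaped channels x time)."""
    t_long = long_sig.shape[1]
    t_short = short_sig.shape[1]
    n_offsets = t_long - t_short + 1
    dists = np.empty(n_offsets)
    for k in range(n_offsets):
        window = long_sig[:, k:k + t_short]
        dists[k] = np.linalg.norm(window - short_sig)  # Frobenius norm
    return dists

# Toy stand-in for EEG: "habit" is "ha" followed by "bit" plus small noise.
rng = np.random.default_rng(0)
ha = rng.standard_normal((4, 30))
bit = rng.standard_normal((4, 30))
habit = np.concatenate([ha, bit], axis=1) + 0.05 * rng.standard_normal((4, 60))

# The best alignment is the offset with the smallest distance.
ha_offset = int(np.argmin(sliding_alignment(habit, ha)))
bit_offset = int(np.argmin(sliding_alignment(habit, bit)))
```

On this idealized data the ”ha” window aligns at the start of ”habit” and ”bit” aligns at its second half. The same function applied to learned feature maps instead of raw samples would give the validation metric the paragraph describes.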
2.3. Convolutional Neural Networks as EEG Classifiers
Sequence alignment suffers from a larger problem: it fails to deal with compression. That is, without introducing computationally expensive alignment methods such as gaps and duplicates, if ”ha” is said faster in ”habit”, then sequence alignment cannot account for the fact that the same representation may exist in both trials but unfold faster in one than in the other.
It is in response to this challenge that we consider the use of Convolutional Neural Networks (CNNs) for feature representation of imagined speech EEG data. CNNs were originally created to solve the problem of image recognition by building up low-level image features such as lines and edges and using smooth features to build higher-level understandings. It turns out, however, that the philosophy used by CNNs to solve image recognition is applicable in the context of EEG data: by accounting for transformational variance using pooling layers, patterns are allowed to deviate between trials without affecting the outcome of the classification. This is needed when the language circuitry of one person is the same as another’s but occurs in different regions of the brain or is activated in a different manner. Pooling layers are also an effective solution to compressed data, as evidenced by how CNNs are still able to classify compressed images of the same content.
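The tolerance that pooling gives to small deviations between trials can be seen in a one-dimensional toy case. The sketch below is an illustration only; the helper max_pool_1d and the pool size of 4 are assumptions chosen for the example, not parameters from the paper.

```python
import numpy as np

def max_pool_1d(x, size):
    """Non-overlapping max pooling along the last (time) axis.
    Trailing samples that do not fill a whole window are dropped."""
    usable = (x.shape[-1] // size) * size
    trimmed = x[..., :usable]
    return trimmed.reshape(*x.shape[:-1], usable // size, size).max(axis=-1)

# The same activation, occurring one sample later in the second trial.
trial_a = np.zeros(8)
trial_a[1] = 1.0
trial_b = np.zeros(8)
trial_b[2] = 1.0

pooled_a = max_pool_1d(trial_a, 4)
pooled_b = max_pool_1d(trial_b, 4)
```

Both trials pool to the same vector because the shift stays inside one pooling window. A shift that crosses a window boundary would change the output, so the invariance is only local, and stacking several pooling layers widens it.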
CNNs must rely on the assumption that the data is smooth and continuous. This is made possible to some extent in this situation by the fact that neighboring EEG channels record data from roughly the same regions of the brain (which makes the ordering of the EEG channels in the data especially important). If EEG in the future were to provide 3D time-based images of activity, this would be an immense gain to CNN feature extraction. But currently, CNN approaches are most suited to EEG caps with many channels distributed evenly around the brain. CNNs would work even better with technologies that capture a uniform image of the brain, such as fMRI scanning.
CNNs lose out on important global patterns, in this case cor-
related neural wires that are disconnected from each other,
when there are not enough layers for both patterns to be
weighted together. However, it is a fair hypothesis that lo-
cal language features must be represented in local circuits
of the brain while large and more comprehensive models are
more diffuse since they require more reasoning and memory
around the brain.
2.4. Stacked CNNs as a Robust Language Model
2.4.1 Overview
It is this assumption that denser and more expansive neu-
ral connections encode more language information that
makes it fitting to experiment with component-based mod-
In this section, we present a model that we believe is sophisticated yet logically coherent for experimenting with classifying imagined speech using language models. Our central idea, which we believe tackles our initial questions of decomposition and summation, is a large CNN that has a loss function every two layers, with the first few layers being trained to classify low-level symbols in our data and higher levels using the feature representations of the earlier ones to understand higher-level language elements. The lower levels are trained first so that higher levels use effective features of the smaller components.
For example, suppose we intended a model to learn the word ”signal”. We divide the sample of the word into small component images (each component can be likened to a single letter), train the classifier on those letters, use the learned feature representations of the letters to train the classification of syllables, and build upwards so that every language element is classified using trained representations of the highest language feature below it. This should continue to the point that if the net had a representation of ”signal” and, through some other source, of ”ing”, it should also be able to classify ”signaling” as long as the original data was split into the components in a suitable manner (which we can easily ensure using small
What is innovative is that this is a ”component-based approach” in which nothing but the initial components is classified on the raw data. This forces the net to classify
the language features exclusively on known and recognized
language features as opposed to random and irrelevant pat-
terns that happen to distinguish classes. We have reason
to believe that this is the most appropriate way to build a
framework to read language in the brain because the brain
seems to build abstract layers of symbols for language fea-
2.4.2 Architecture
The model described is a stacked CNN, with each CNN taking the feature representation of the previous level. A level here denotes a CNN with a varying number of convolutional layers, ending with its own affine layer (not present in the stacked net) and loss function that classifies to that
Therefore, the image size of each layer is the image size of the output of the previous one, with the first level taking raw input with the dimensions of a single component. Each CNN is trained independently in ascending order, to ensure that the features each level takes as input are the best formed possible.
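The ascending, level-by-level training order can be sketched with a deliberately simple stand-in for each level. Everything named here is an assumption made for illustration: CentroidLevel is a nearest-centroid classifier used in place of a CNN, train_stack and the group parameter are hypothetical names, and the data is synthetic. The point of the sketch is only the order of operations: fit level 1 on raw components, freeze it, then fit level 2 on the concatenated level-1 features.

```python
import numpy as np

class CentroidLevel:
    """Stand-in for one level's CNN (an assumption for brevity): a
    nearest-centroid classifier whose features are the negative
    distances from a sample to each class centroid."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def features(self, X):
        diffs = X[:, None, :] - self.centroids_[None, :, :]
        return -np.linalg.norm(diffs, axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self.features(X), axis=1)]


def train_stack(components, labels_l1, labels_l2, group):
    """components: (trials, components per trial, dim). Level 2 sees the
    features of `group` consecutive components concatenated together."""
    n_trials, n_comp, dim = components.shape
    flat = components.reshape(-1, dim)
    level1 = CentroidLevel().fit(flat, labels_l1.reshape(-1))
    # Level 1 is now fixed; its features become the input of level 2.
    feats = level1.features(flat).reshape(n_trials * (n_comp // group), -1)
    level2 = CentroidLevel().fit(feats, labels_l2.reshape(-1))
    return level1, level2, feats


# Synthetic trials: four "letters" per trial, two "syllables" of two letters.
rng = np.random.default_rng(0)
prototypes = 3.0 * np.eye(4, 8)
l1 = np.tile(np.arange(4), (50, 1))        # letter labels, shape (50, 4)
l2 = np.tile(np.array([0, 1]), (50, 1))    # syllable labels, shape (50, 2)
components = prototypes[l1] + 0.1 * rng.standard_normal((50, 4, 8))

level1, level2, feats = train_stack(components, l1, l2, group=2)
acc1 = np.mean(level1.predict(components.reshape(-1, 8)) == l1.reshape(-1))
acc2 = np.mean(level2.predict(feats) == l2.reshape(-1))
```

The accuracies here are measured on the training data of an easy synthetic problem, so they say nothing about EEG performance. Swapping CentroidLevel for a CNN with its own affine layer and loss would keep train_stack unchanged.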
When testing on raw data, the data is first split into pieces of the size of the first component (with some stride). The stacked net then outputs the classification at each of the levels, giving a low-level to high-level breakdown. How these level classes are analyzed is left to the purview of the
Importantly, each level can take in a variable number of
component extractions. That is, if the first level takes as
input a data sample equivalent to a single letter, then the
second level does not have to take the feature extraction of
one letter but can take the extractions of two or three or
more letters side by side. This level parameter denotes how many language features of the previous layer it takes in to compose the next level of features.
Figure 2. MultiLevel ConvNet
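The data plumbing between levels, splitting a raw trial into components with a stride and then handing several neighbouring feature vectors to the next level, might look like the following. The function names split_components and group_features and the array layout (channels x time) are assumptions for this sketch; flattening the raw windows stands in for the features a trained first level would produce.

```python
import numpy as np

def split_components(trial, size, stride):
    """Cut a (channels x time) trial into windows of `size` samples taken
    every `stride` samples; returns (n_components, channels, size)."""
    starts = range(0, trial.shape[1] - size + 1, stride)
    return np.stack([trial[:, s:s + size] for s in starts])

def group_features(features, group):
    """Concatenate the feature vectors of `group` neighbouring components
    so the next level sees several lower-level symbols side by side.
    Leftover components that do not fill a group are dropped."""
    n = (features.shape[0] // group) * group
    return features[:n].reshape(n // group, -1)

trial = np.arange(2 * 12, dtype=float).reshape(2, 12)   # 2 channels, 12 samples
comps = split_components(trial, size=4, stride=4)        # 3 non-overlapping pieces
flat = comps.reshape(len(comps), -1)                     # stand-in for level-1 features
grouped = group_features(flat, group=3)                  # one level-2 input
overlapping = split_components(trial, size=4, stride=2)  # 5 overlapping pieces
```

With T samples per trial the number of components is floor((T - size) / stride) + 1, so a smaller stride or component size increases the number of first-level classifications per trial.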
2.4.3 Component-Size Tradeoffs
A component represents the smallest block of time accept-
able for dividing the raw input into sizes that encapsulate the
right amount of abstraction for a complete language model
of the data. It is the image size of the first layer of the
A natural trade-off exists between the size of the compo-
nent and the classification accuracies at different parts of
the model. If a smaller component is chosen (more specific
knowledge), then many more components need to be clas-
sified at the lowest level of the model (since there are more
components per given time span). However, since more specific feature representations are learnt, the language model is more successful at higher levels when generalizing to new linguistic features, because more detailed knowledge of the rudimentary components has been learnt.
2.4.4 Flexibility and Robustness
The model is a naturally flexible approach to building dif-
ferent kinds of language models used to fit the situation and
training scenario. A many-level model with incremental ad-
vances in the size of the components will allow for a very
robust, tight, and high-level classification of the data. A
model with large components assumes that the basic lan-
guage symbols are long. The chief benefit of this approach
is that the symbols used to define the model are self-defined.
There are no definitional ties to concepts such as ”phonemes” or ”syllables” present in the linguistic analysis of vocal content. This allows imagined speech recognition to define and settle on its own kind and size of sym-
This model, crucially, allows for variable-sized images, since the stacked net classifies to components and not to entire raw images. We note, importantly, that not every level needs to be a CNN; each can be a different pattern recognizer in its own right. What we are proposing is the stacked learning model, which we test in this paper with CNNs at each level. Its use is certainly not limited to this kind of pattern recognizer.
3. Data Collection
3.1. Experimentation
There are no publicly available datasets of imagined speech trials. We held three experiments at Professor Takako Fujioka’s EEG lab at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford to generate our own dataset. Each of these experiments was aimed at acquiring some understanding of the levels of complexity in comprehending imagined speech. The first collected samples of four phonetically distinct syllables, the second of words and the decompositions of the words, and the third of words that differ by only one phoneme.
For the purposes of our experiment, we used a 10-20 system EEG with 64 channels covering the entirety of the subject’s head. Three additional nodes tracked eye and upper-facial movements to assist in removing blinking and face movement artifacts from the data. The EEG sampled at a rate of
Each of the six experiments involved a subject imagining
speaking a pair of symbols alternating between trials. A low
and a high pitch tone were predecided before the experiment
to correspond to these pairs, the lower tone corresponding
to one pair and the upper tone corresponding to the other.
In one round of readings, 200 trials of a symbol pair, 100
of each symbol, were mixed randomly and presented to the
subject as the tones. After a short break, the experiment
was repeated with the other pair of symbols. We then per-
formed the first pair again in another round and the second
pair in the next. In total, 200 readings of each symbol were
The cueing sound lasted 0.2 seconds, long enough for the pitch to be perceived but not so long that the response to the tone interferes with thinking. The subject was given 2.3 seconds and asked to utter the correct syllable once, after which he was asked to rest his mind until the next beep was heard.
A time line for a single trial for a symbol pair is shown
3.2. Labelling
The labels were given to the trials based on the construction of the stacked CNN. That is, for each level, there were as many classes as there are linguistic features trained on at that level. For example, suppose a component was roughly the size of a single letter, and the second level took three components of the level before. Then the word ”signal” would have 6 classes in its first level, one for each letter, and 2 classes in its second level, for ”sig” and ”nal”. Note that in the first level we have 6 different classes because each letter is different. Classes are assigned based on the unique linguistic features that are matched on that level. If the component is sufficiently small that no relevant linguistic feature can be matched to its size, then the classes are just assigned in order. As an example, if we did letter classifications of the words ”had” and ”hat”, then the samples on the first level would have labels of [0, 1, 2] and [0, 1, 3] respec-
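The labelling rule, one class per distinct linguistic feature on a level, shared across words, can be written down in a few lines. The helper names assign_labels and chunk are hypothetical, and numbering classes in order of first appearance is an assumption that happens to reproduce the example labels given in the text.

```python
def assign_labels(sequences):
    """Map each distinct feature to an integer class in order of first
    appearance and return the label list for every sequence."""
    class_of = {}
    labelled = []
    for seq in sequences:
        labels = []
        for feature in seq:
            if feature not in class_of:
                class_of[feature] = len(class_of)
            labels.append(class_of[feature])
        labelled.append(labels)
    return labelled, class_of

def chunk(word, n):
    """Split a word into consecutive pieces of n letters."""
    return [word[i:i + n] for i in range(0, len(word), n)]

# Letter-level labels shared between two words that differ in one letter.
letter_labels, letter_classes = assign_labels([list("had"), list("hat")])

# "signal" with letter-sized components and three components per level-2 input.
level1_labels, level1_classes = assign_labels([list("signal")])
level2_labels, level2_classes = assign_labels([chunk("signal", 3)])
```

Because the class map is shared across all sequences passed in, the ”h” and ”a” of ”had” and ”hat” receive the same labels, which is what lets a level be tested on the same feature in a different word.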
3.3. Preprocessing
For the purposes of this project, we used the Neuroscan software to interact with the experiment environment, record EEG signals in their raw form, and store them in the CNT format. EEG recordings, in their raw form, have a very low signal-to-noise ratio and hence need to undergo heavy preprocessing before being operated upon. To do so, we used Brainstorm, a MATLAB toolkit, to refine our dataset. This toolkit facilitated the process of removing eye blinks and other involuntary muscle movements recorded by the EEG. It also offers functionality to detect blatantly corrupt samples and outliers. After the data cleansing step, the toolkit stores the trials as MATLAB structures in .mat files.
As a final step, we bridge the gap between our coding language and the dataset format so that our Python scripts are able to interface with the data. Python’s SciPy (Scientific Python) library provides a few input-output (IO) functions, the most important one in this context being the loadmat function, which reads a MATLAB file into a dictionary that maps variable names to NumPy arrays. We stack the instances of each symbol on top of one another to obtain the dataset in the right form.
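A minimal sketch of this loading step follows. The variable name ”eeg” inside the .mat files is a placeholder assumption, since the names written by the preprocessing toolkit are not given here, and the helpers load_trial and stack_trials are hypothetical. To keep the example self-contained it writes small in-memory .mat files with savemat rather than reading real recordings.

```python
import io
import numpy as np
from scipy.io import loadmat, savemat

def load_trial(source, key="eeg"):
    """Read one trial from a .mat file (path or file-like object).
    `key` is a hypothetical variable name; inspect the dictionary
    returned by loadmat to find the names your files actually use."""
    contents = loadmat(source)   # dict: variable name -> ndarray
    return np.asarray(contents[key], dtype=float)

def stack_trials(sources, key="eeg"):
    """Stack the trials of one symbol into (n_trials, channels, time)."""
    return np.stack([load_trial(s, key) for s in sources])

# Round trip through in-memory .mat files standing in for real recordings.
buffers = []
for i in range(3):
    buf = io.BytesIO()
    savemat(buf, {"eeg": np.full((64, 10), float(i))})
    buf.seek(0)
    buffers.append(buf)

dataset = stack_trials(buffers)
```

In practice the list of buffers would be replaced by the paths of the .mat files for one symbol, and np.stack requires every trial to have the same number of channels and samples.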
4. Results and Evaluation
We trained and tested our data using a three-level stacked CNN with two layers at each level. The levels above the first tend to perform better than the first because they have fewer classes, so we focus our analysis on training the first level. We found that the CNN would severely overfit, and that adjustments intended to encourage generalization had very little of the desired effect.
Our tests were run against our samples of ”sig”, ”nal”, and ”signal”. We decided to split each of the syllables into ten classes, making 20 classes in total in the first level. There were 200 samples of each word; with 10 components per syllable and 20 for ”signal” (8000 components in total), and with 90% of the trials taken for training, we had a training set of 7200 samples. Since we ran our experiments on a computer with limited resources, we advise future experiments to perform data augmentation by averaging different combinations of samples to produce artificially new trials.
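The training-set size and the suggested augmentation can both be sketched briefly. The function names are hypothetical, and averaging pairs of trials (k=2) is one assumed reading of ”averaging different combinations of samples”; the paper does not specify how many trials to average or how many synthetic trials to generate.

```python
from itertools import combinations
import numpy as np

def n_training_components(trials_per_word, components_per_word, train_fraction):
    """Total number of component samples that end up in the training split."""
    total = sum(trials_per_word * c for c in components_per_word)
    return round(total * train_fraction)

# 200 trials each of "sig" (10 components), "nal" (10) and "signal" (20), 90% train.
train_size = n_training_components(200, [10, 10, 20], 0.9)

def augment_by_averaging(trials, k=2, limit=None):
    """Average every combination of k trials of the same class.
    trials: array shaped (n_trials, channels, time)."""
    out = []
    for idx in combinations(range(len(trials)), k):
        out.append(trials[list(idx)].mean(axis=0))
        if limit is not None and len(out) >= limit:
            break
    return np.stack(out)

toy = np.random.default_rng(0).standard_normal((5, 4, 20))
extra = augment_by_averaging(toy, k=2)   # one synthetic trial per pair
```

Averaging should only combine trials from the training split and from the same class, otherwise validation trials leak into training. Averaging also reduces trial-specific noise, so the synthetic trials are cleaner than real ones, which may or may not be desirable.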
We use an SVM loss function to capture the notion that the symbolic components should be as distinct and dissimilar as possible. Of our two best models, the first has a training accuracy of 94% with a validation accuracy of 10%, and the second has a training accuracy of 45% with a validation accuracy of 29%. The performance of the second was rare and not reproducible, being affected by weight initializations.
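For reference, one common form of the multiclass SVM (hinge) loss is sketched below: for each sample, every wrong class whose score comes within a margin of the correct class score contributes to the loss. The exact variant and margin used in the experiments are not stated, so the formulation and the margin of 1.0 are assumptions.

```python
import numpy as np

def multiclass_svm_loss(scores, y, margin=1.0):
    """Mean multiclass hinge loss.
    scores: (n_samples, n_classes) class scores; y: integer labels."""
    n = scores.shape[0]
    correct = scores[np.arange(n), y][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(n), y] = 0.0   # the correct class never contributes
    return margins.sum() / n

scores = np.array([[3.0, 1.0, 0.0],    # confidently correct: no loss
                   [1.0, 2.0, 2.5]])   # class 2 outscores the true class 1
y = np.array([0, 1])
loss = multiclass_svm_loss(scores, y)
```

The loss is zero only when the correct class beats every other class by at least the margin, which matches the stated aim of pushing the symbolic components apart.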
In every training case, RMSProp reliably reduced the loss
function in a clearly exponential manner. Strangely, the
loss often fell without any improvement in the training or
validation accuracy. This suggests that the optimization
problem does not sufficiently represent the task and that a
stronger pattern recognition model should be used.
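The RMSProp update itself is short. The sketch below applies it to a toy quadratic objective; the learning rate, decay rate, epsilon, and the objective are assumptions chosen for illustration and are not the settings of our experiments.

```python
import numpy as np


def rmsprop_step(w, grad, cache, learning_rate=1e-3, decay_rate=0.9, eps=1e-8):
    """One RMSProp update; returns the new weights and the new cache."""
    # Running average of squared gradients, one entry per weight.
    cache = decay_rate * cache + (1.0 - decay_rate) * grad ** 2
    # Each weight's step is scaled by its recent gradient magnitude.
    w = w - learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache


# Toy problem: minimise f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
initial_loss = float(np.sum(w ** 2))
for _ in range(500):
    w, cache = rmsprop_step(w, 2.0 * w, cache, learning_rate=1e-2)
final_loss = float(np.sum(w ** 2))
print(initial_loss, final_loss)
```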
Using dropout, increasing the regularization rate, and
increasing the number of training samples had little effect
on the validation accuracy. Increasing the number of
filters consistently improved the validation accuracy. The
ineffectiveness of dropout and of stronger regularization
likely indicates that the patterns observed are highly
intricate, with weights that are best not biased to be smooth or
The low validation accuracy should not eclipse the fact that
training accuracy was extremely high, suggesting that even
with one convolutional layer and one affine layer, strongly
distinguishing features were found. This suggests that a
significantly stronger model, one that matches the extent
of the problem discussed at the beginning of the paper and
is trained using GPUs and other advanced methods, may find
more global patterns.
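To make the size of such a model concrete, the forward pass of a network with one convolutional layer and one affine layer can be sketched as below. This is an illustrative assumption, not our trained network: the convolution over the time axis, the ReLU and max-pooling steps, the 8 channels, 50 time steps, 4 filters of width 5, and the random weights are all hypothetical.

```python
import numpy as np


def conv1d_valid(x, filters, bias):
    """Convolve along time. x: (channels, time); filters: (n_filters, channels, width)."""
    n_filters, _, width = filters.shape
    steps = x.shape[1] - width + 1
    out = np.empty((n_filters, steps))
    for t in range(steps):
        window = x[:, t:t + width]  # (channels, width)
        out[:, t] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    return out


def max_pool_time(features, pool):
    """Max-pool along time, dropping any remainder that does not fill a window."""
    usable = (features.shape[1] // pool) * pool
    trimmed = features[:, :usable]
    return trimmed.reshape(features.shape[0], -1, pool).max(axis=2)


def forward(x, filters, conv_bias, weights, affine_bias, pool=2):
    """Convolution, ReLU, pooling, then one affine layer producing class scores."""
    hidden = np.maximum(0.0, conv1d_valid(x, filters, conv_bias))
    pooled = max_pool_time(hidden, pool)
    return weights @ pooled.ravel() + affine_bias


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 50))                   # one hypothetical EEG trial
filters = 0.1 * rng.standard_normal((4, 8, 5))
weights = 0.1 * rng.standard_normal((20, 4 * 23))  # 46 conv steps pooled to 23
scores = forward(x, filters, np.zeros(4), weights, np.zeros(20))
print(scores.shape)  # (20,)
```

The 20 output scores correspond to the 20 first-level classes; these scores are what a loss function such as the SVM loss would be computed on.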
5. Future Work
In conclusion, we found in our limited experiment that
CNNs with few layers offer modest accuracy in basic
imagined speech tasks but likely come nowhere near to fully
realizing or testing the validity of a stacked language
model. Further experiments in this field must increase the
size of the dataset, improve the breadth of the data, and train
on much larger CNNs to saturate their effectiveness on the
We encourage further experimentation with pattern
recognition methods that fit well into the stacked model of
language understanding. For example, this paper does not
explore Recurrent Neural Networks, whose widely observed
success in vocal speech recognition [1] suggests they could
also prove effective in imagined speech recognition.
6. Acknowledgements
We would like to acknowledge the efforts of the following
people: Prof. Fei-Fei Li for overseeing this project, Dave
Deriso for his inspiring support, Andrej Karpathy and the
CS231N TAs for instruction in the art of CNNs, Prof. Takako
Fujioka for lending her time, her effort, and the use of
her lab, and Rafael Cosman for helping fuel some of the
ideas presented in this paper.
References
[1] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro,
Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh,
Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep
speech: Scaling up end-to-end speech recognition. ArXiv,
2014.
[2] B. V. K. Vijaya Kumar and K. Brigham. Imagined speech
classification with EEG signals for silent communication:
A preliminary investigation into synthetic telepathy.
Bioinformatics and Biomedical Engineering (iCBBE), 2010.
[3] Lorenzo Magrassi, Giuseppe Aromataris, Alessandro
Cabrini, Valerio Annovazzi-Lodi, and Andrea Moro. Sound
representation in higher language areas during language
generation. Proceedings of the National Academy of
Sciences of the United States of America,
[4] Marc Oettinger. Decoding heard speech and imagined
speech from human brain signals. Office of Intellectual
Property Industry Research Alliances, 2012.


  • 1. Stacked Convolutional Neural Networks as a Language Model of Imagined Speech in EEG Data Barak Oshri Stanford University Manu Chopra Stanford University Nishith Khandwala Stanford University 1. Introduction 1.1. Imagined Speech Faculty Imagined speech refers to thinking in the form of sound, often in a person’s own spoken language, without moving any muscles. It happens consciously and subconsciously, when people use it to imagine their vocalization or sub- consciously as a form of thinking, whether during reading, reciting, or silently talking to oneself. This phenomenon includes an imaginative component and a vocal component, with the brain calling on parts of the men- tal circuit responsible for speech production, falling short of producing the signals for moving the muscles of the vocal cords. Instead, the language signals are embedded and pro- cessed in the buffer that leads into the imaginative space of the mind. Despite a limited understanding of what this mental cir- cuit looks like, we can reason necessary and sufficient con- ditions about what the silent language formation pipeline must entail, for example that ”sound representation deeply informs... linguistic expressions at a much higher level than previously thought,” indicating that speech production must be tightly subsumed within language formation, instead of the circuits running in parallel with some auxiliary connec- tions [3]. 1.2. Brain Representations of Imagined Speech What makes imagined speech research promising and opti- mistic is that the rich linguistic features present in language is what the brain anchors to as it develops its language fa- cilities. This means that the structures present in the brain for speech formation must be ordered at least in part in the same way that language is structured abstractly. Hence to succeed in imagined speech, we need only learn how these speech patterns interface with imagination in a determinis- tic way. 
Once we do, was can apply many of the same tech- niques used in traditional speech recognition such as hidden markov models, language tree models, and deep learning approaches. So whether this is possible is not just a question of how lan- guage is formed in the brain, but how it is represented in the high-level faculties such as imagination and conscious- ness. Even if language models are deterministically present in language formation, we still do not have any understand- ing of how they are manifested in imaginative capacities; imagination could just as well noise the lucid language rep- resentations evoked from fundamental language structures such as the Broca’s and Wernicke’s areas. And what understanding we have of how imagination works merely points to how complex and unconfined it is, being unlocalized to any specific structure in the brain and be- ing spread out in a highly diffuse manner. Just alone the fact that mental states severely affect its functionality via changing neural oscillation frequencies depending on levels of waking attention shows how the output from electroen- cephalography (EEG) appears indeterminate and chaotic. That is, assuming such a ”layered understanding” of the brain, we don’t know how close imagination is from the act of producing the language. Therefore, imagined speech research is a neuroscientific en- deavor as much as it is a computational one. Any successes and failures in performance are indications on how well the model fits to the data, which allows us to then make extrap- olating claims about how the brain is organized. 1.3. The Trouble with Data Attaining well-formed data for imagined speech research is the largest challenge in this field. Most studies have used EEG imaging for its high temporal resolution. The trouble with this data is that each channel only captures an ”av- erage” electric signal across an axis relative to some root (often the top of the spin at the back of the head). 
Experimentation with imagined speech is exceedingly chal- lenging. The human brain is incredibly sensitive to external stimuli, anything of which could confound the controlled 1
  • 2. lab environment of the experiment. It is difficult to make a setup that determines exactly when a word is thought. If a pitch is used to cue the subject, then the brain response to the pitch affects the immediate signal. This is a serious danger to the integrity of the experiment as a classifier could inadvertently classify the variable responses to the different level pitches. And even when data is well-formed and meets the task de- scriptions, there is no guarantee that the resolution of the technology has captured the relevant information. EEG has high noise-to-signal ratio, and noise reduction methods are effective but not complete. Technological advances in brain-imaging in the future will increasingly help research in this field. 1.4. The Brain as a Personal Computer To summarize the challenge of this task in an analogy, con- sider a modern personal computer. Such a PC computes a myriad of tasks at any given moment: resting, active, back- ground, etc. We would like to identify the current state of exactly one of these processes. We have faint ideas of where this process appears within the entire system, but we are not sure what its form is, or how loud or prominent it is in com- parison with other processes. In fact, it is not a single pro- cess, but a parallel one, merged, influenced, and blockaded by all the other processes happening in the computer. Also consider that our apparatus for measuring this brain, call it our voltmeter, does not get to measure the state of every transistor, but only some start and end points. To find this elusive process, we start with the assumption that our voltmeter is omnipotent and knows the states of all transistors. Given the entire state of the computer, we exploit any understanding we have of the structure of the process, such as expectations for how some of the wires should be arranged in the process, allowing us to ”pattern match” with the data. 
Without understanding of the struc- ture of this process, any algorithm could only blindly tra- verse the data as it doesn’t know what to expect from this process. Therefore the success of generalizing this process to new situations depends on the correctness of our exter- nal ”fitting”. It seems like pattern recognition, even without hand-engineered features, is doomed to needing some level of assumption about the organization of the data, or at least how it could be learnt. This analogy is important in explaining why we need a ro- bust language model suitable for the task to be able to com- prehensively solve imagined speech. Keeping the challenge of this field in mind, any research done in the present times will aid efforts in the future when the technology and neuro- science catch up with the immensity of the task. Therefore, it is important not to lose aspiration to solve this problem so that the methodology and foundations are ready and an op- timistic tone has been set for when science and technology allow us to converge upon a solution. 1.5. Uses of Imagined Speech These challenges should be viewed against the many bene- fits and opportunities opened by commercial, medical, and research aspects of imagined speech. Thousands of severely disabled patients are unable to communicate due to paraly- sis, locked-in syndrome, Lou Gehrigs disease, or other neu- rological diseases. Restoring communication in these pa- tients has proven to be a major challenge. Prosthetic de- vices that are operated by electrical signals measured by sensors implanted in the brain are being developed in an effort to solve this problem. Researchers at U.C Berkeley have worked on developing models to generate speech, in- cluding arbitrary words and sentences, using brain record- ings from the human cortex. [4] Success in understand- ing imagined speech will enable these patients to talk as thoughts can be directly synthesized into sound. 
Imagined speech may finally restore the lost speech functions to thou- sands, if not millions, of patients. Furthermore, the capability to comprehend imagined speech has the clear potential for a variety of other uses, such as silent communication when visual or oral commu- nication is undesirable. As fluency and dependence on tech- nology raises the demands for faster, cleaner, and more pro- ductive interfaces, the pathways such an accomplishment would pave in the scope of our communicative abilities would lead to a revolution in our natural and digital interac- tion with the world. Humans have created a world of messages, expressions, and meanings central to our living individually and in commu- nities, and this is true in the mind just as much as it is in our writings and other artistic mediums. Tapping into the sheer vastness and wealth of color exhib- ited by the mind is sure to revolutionize the dimensionality of human expression, its data, and our relationship with the digital world. The ability to understand imagined speech will fundamentally change the way we interact with our de- vices, as digital technology shows trends of connecting with our bodies and activities increasingly more, from eye-wear to watches. While such research is controversial in the least, invasive at worst, these fears should not inhibit attempts at studying how much information could be mined from the brain. Tapping into imagined speech data is the holy grail of hu- man communication interfacing. There are fewer problems that reach so closely to human demands and intentions, as imagined speech happens involuntarily and irrepressibly. 2
  • 3. This makes imagined speech an underestimated advance in our ability to make use of human abilities in our activ- ities. 1.6. The State of Imagined Speech Progress in imagined speech has been slow, lacking, and misguided. Studies in this field make simple and blanketing studies unfavorable to progressive growth and experimenta- tion. Most of these studies perform classification tasks of words, syllables, or phonemes such as classifying between several thousand samples of syllables ba and ku using an EEG or classifying between yes or no using an Emotiv. The first of these can be given recognition as showing that pho- netically disparate syllables can be strongly differentiated [2], and the second of these that even given the small num- ber of channels an Emotiv can differentiate between two words. Neither of these studies address the larger challenges posed by general imagined speech progress or advise for a robust approach to experimentation with imagined speech by ex- ploiting domain-specific knowledge. Neither of the studies can claim to have classified patterns that activate on those syllables or words when the class space is grown. That is, each symbol studied is classified dependently on the other symbols classified which makes anything learnt in the final model completely restricted to the task-definition. The only major sponsored effort for imagined speech re- search was by DARPA in 2008 in a $4 million grant to UC Irvine to conduct experiments that will allow user-to-user communication on the battlefield without the use of vocal- ized speech through neural signals analysis” on the basis that ”the brain generates word-specific signals prior to send- ing electrical impulses to the vocal cords.” While promis- ing, this research has not produced a foundation for general imagined speech research, instead focusing on military as- pects and small tasks that benefit battlefield communication in the short-term. 
There is thus a theoretical need to direct this field into an ex- perimental understanding of how decoding speech from the brain could work. In this way, research in this field should not focus on results or data evaluation, but on approaches and methodologies, especially given that data produced is heavily dependent on specific context, technology, and sub- jects. 1.7. Learning Linguistic Features In a previous study of ours, we showed that multi-class classification of four syllables is possible and effective with neural networks to an accuracy of greater than 80%. These results, however, are not as promising as the performance indicates. There is no reason to believe that the model is classifying the syllables for their linguistic features; any other patterns such as associative memory activations could be the differentiating factors. So what we repeatedly see in the examples discussed is that the tasks used to research imagined speech are inherently ill-posed. Therefore, for any model to succeed, we need to be able to assure ourselves that the classifier is classifying for the right reasons, in this case linguistic features, which are invariable and context- independent. We guarantee that the model is learning linguistic features by teaching it to classify the same linguistic features in dif- ferent words and expressions. That is, for robustness the model must not only generalize to new samples, but to new language situations. By defining a learning model that is built around the foundations of general language under- standing, we can at least guarantee that the parameter space includes solutions that capture the dependencies inherent in the language so that the differentiating factors include the constituents of the language. 2. 
A Language Model using Convolutional Neural Networks Having discussed the main foundations of imagined speech, we now devote the rest of the paper to suggesting the re- search questions and methods that need to be used to make progress in imagined speech research, as well as some pre- liminary results using these approaches. 2.1. Segmentation and Composition The first of these questions is related to proving that clas- sification is a valid approach to solving imagined speech. Before we can assume that language features are static sym- bols in imagined speech, and thus before we can prove that classification is suited to this problem, we must verify that the linguistic features ”add up” in EEG data in such a way that they appropriately and convincingly capture the pat- terns of higher level features. That is, the language ob- served in the imagination is not necessarily correspondent and tainted by associative caches of different words that obscure the layered foundation of language models as ob- served in the imagination. We can answer this question by proving that classifications of components in imagined speech can be segmented into correct classifications of the sub-components, and that the converse is also true, namely that two language symbols composed next to each other have the same classifications as if the two symbols were featured as one symbol. 3
  • 4. This is alike to proving the ”inductive step” that if the basic rudiments are known, which the studies seem to be showing as possible and which can only be improved as our com- putational resources increase with time, then these compo- nents can be ”built up” and ”built down” without losing the linguistic integrity of the classifications. This is a necessary condition for speech recognition of EEG data. In the classical setting of vocal speech recognition, it is trivial that language features are at least approximately built up and built down in the signal (if a person says ”ha” and ”bit”, then the signal of ”habit” can only deviate a lit- tle bit different from the sums of the signals of the compo- nents). In imagined speech, this is certainly not necessarily true. A person could have multiple representations and activations for the word ”habit”: one through its phonemes, and one through single symbols, such as a unique association to the concept of a ”habit”. Therefore, the EEG signal of ”habit” will encode the concept of the word while the signals of ”ha” and ”bit” will be purely linguistic concepts. That is, the noise of imagined speech recognition is sig- nificantly more complicated and meaningful than the noise of vocal speech recognition, which is mainly interference and voice fluctuations. We need to prove segmentation and composition to show that the noise does not interfere with the classification in a destructive way, and that it is at least interfering in a consistent manner across all linguistic levels of classification. In imagined speech, we have two modes of classification, ”linguistic” and ”associative”, while in vo- cal speech recognition there is only the former. Classifying based on associative features is immediately ill-fated in that each word has its own unique representation and that two words linguistically similar could have wildly different as- sociative representations. 
Proving segmentation and composition will show that asso- ciative meanings do not interfere with the phonetic qualities of imagined language. This is believed to be true since the language formation circuitry must regardless be activated if the word is thought of, hence segmentation and composition will be true if the linguistic signal is not hidden by concept activations of the word thought of. 2.2. CNN Features in Sequence Alignment One approach of attempting to prove segmentation and composition is using sequence alignment. Sequence align- ment, mostly used in arranging sequences of DNA and RNA to identify regions of similarity, can be useful in imagined speech to test how close of a fit the features are of similar components in different words. That is, given EEG signals for ”habit” and signals for ”ha” and ”bit”, we align the sig- Figure 1. Sequence Alignment Algorithm nal for ”ha” at each point in the ”habit” signal and do the same with ”bit”. If the representation of ”ha” shares fea- tures with the first part of ”habit” then ”ha” will success- fully align at the beginning of ”habit” and the opposite with ”bit”. This approach of proving segmentation is ill-fated as it al- ready depends on some successful feature extraction. Do- ing sequence alignment with euclidean distance as a base- line, we found that the results were insignificant, showing that whatever meaningful signal is encased in the EEG data is intricate enough to demand sophisticated pattern extrac- tion. Therefore this approach is merely useful as a way of validating the results of some feature extraction by creating a metric for how closely representations across similar lan- guage features match in different strings of words. 2.3. Convolutional Neural Networks as EEG Classifiers Sequence alignment suffers from a larger problem of fail- ing to deal with compression. 
That is, without introducing computationally expensive alignment methods such as gaps and duplicates, if "ha" is said faster in "habit", then sequence alignment cannot account for the fact that the same representation may exist in both trials, only unfolding faster in one than in the other.

It is in response to this challenge that we consider Convolutional Neural Networks (CNNs) for feature representation of imagined speech EEG data. CNNs were originally created to solve image recognition by building up low-level image features such as lines and edges and using smooth features to build higher-level understandings. It turns out, however, that the philosophy CNNs use to solve image recognition is applicable in the context of EEG data: by accounting for transformational variance using pooling layers, patterns are allowed to deviate between trials without affecting the outcome of the classification. This is needed when the language circuitry of one person is the same as another's but occurs in different regions of the brain or is activated in a different manner. Pooling layers are also an effective solution to compressed data, as evidenced by how CNNs are still able to classify compressed images of the same content.

CNNs must rely on the assumption that the data is smooth and continuous. This is partly satisfied here by the fact that neighboring EEG channels record from roughly the same regions of the brain (which makes the ordering of the EEG channels in the data especially important). If EEG in the future were to provide 3D time-based images of activity, this would be an immense gain for CNN feature extraction. Currently, CNN approaches are best suited to EEG caps with many channels distributed evenly around the brain. CNNs would work even better with technologies that capture a uniform image of the brain, such as fMRI scanning.

CNNs miss important global patterns, in this case correlated neural wires that are disconnected from each other, when there are not enough layers for both patterns to be weighted together. However, it is a fair hypothesis that local language features are represented in local circuits of the brain, while larger and more comprehensive models are more diffuse since they require more reasoning and memory around the brain.

2.4. Stacked CNNs as a Robust Language Model

2.4.1 Overview

It is this assumption that denser and more expansive neural connections encode more language information that makes it fitting to experiment with component-based models.
In this section, we present a model that we believe is sophisticated yet logically coherent for classifying imagined speech using language models. Our central idea, which we believe addresses our initial questions of decomposition and summation, is a large CNN that has a loss function every two layers, with the first few layers trained to classify low-level symbols in our data and higher levels using the feature representations of the earlier ones to understand higher-level language elements. The lower levels are trained first so that higher levels use effective features of the smaller components.

For example, suppose we intend a model to learn the word "signal". We divide the sample of the word into small component images (each component can be likened to a single letter), train the classifier on those letters, use the learned feature representations of the letters to train the classification of syllables, and continue building upwards so that every language element is classified using trained representations of the highest language feature below it. This should reach the point that if the net has a representation of "signal" and, through some other source, of "ing", it should also be able to classify "signaling", as long as the original data was split into components in a suitable manner (which we can easily ensure using small strides).

What is innovative is that this is a "component-based approach" in which nothing but the initial components is classified on the raw data. This forces the net to classify language features exclusively from known and recognized language features, as opposed to random and irrelevant patterns that happen to distinguish classes. We have reason to believe that this is the most appropriate way to build a framework for reading language in the brain, because the brain seems to build abstract layers of symbols for language features.
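The component split mentioned above can be sketched as a sliding window over the time axis. This is a minimal illustration under stated assumptions: `split_into_components`, `component_size`, and `stride` are hypothetical names, and a trial is represented as a plain Python list of time samples (each sample could itself be a list of channel values).

```python
def split_into_components(trial, component_size, stride):
    """Return windows of `component_size` samples taken every `stride` samples.

    `trial` is a list indexed by time. Windows that would run past the end
    of the trial are dropped, so a trial shorter than one component yields
    an empty list.
    """
    if component_size <= 0 or stride <= 0:
        raise ValueError("component_size and stride must be positive")
    last_start = len(trial) - component_size
    return [
        trial[start:start + component_size]
        for start in range(0, last_start + 1, stride)
    ]


# A made-up trial of 10 time samples.
trial = list(range(10))
overlapping = split_into_components(trial, component_size=4, stride=2)
dense = split_into_components(trial, component_size=4, stride=1)
```

A smaller stride produces more, overlapping components from the same trial, which is what makes it likely that some window lines up with each language element.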
2.4.2 Architecture

The model described is a stacked CNN, with each CNN taking the feature representation of the previous level. A level here denotes a CNN with some number of convolutional layers, ending with its own affine layer (not present in the stacked net) and a loss function that classifies to that level. Therefore, the input image size of each level is the size of the output of the previous level, with the first level taking raw input of the dimensions of a single component. Each CNN is trained independently in ascending order, to ensure that the features each level takes as input are as well formed as possible.

When testing on raw data, the data is first split into pieces of the size of the first component (with some stride). The stacked net then outputs the classification at each of the levels, giving a low-level to high-level breakdown. How these level classes are analyzed is left to the user.

Importantly, each level can take in a variable number of component extractions. That is, if the first level takes as input a data sample equivalent to a single letter, then the second level does not have to take the feature extraction of one letter but can take the extractions of two, three, or more letters side by side. This level parameter denotes how many language features of the previous level it takes in to compose the next level of features.

Figure 2. MultiLevel ConvNet

2.4.3 Component-Size Tradeoffs

A component represents the smallest block of time into which the raw input is divided, sized to encapsulate the right amount of abstraction for a complete language model of the data. It is the image size of the first layer of the net.

A natural trade-off exists between the size of the component and the classification accuracies at different parts of the model. If a smaller component is chosen (more specific knowledge), then many more components need to be classified at the lowest level of the model (since there are more components per given time span). However, since more specific feature representations are learnt, the language model is more successful at higher levels when generalizing to new linguistic features, because more detailed knowledge of the rudimentary components has been learnt.

2.4.4 Flexibility and Robustness

The model is a naturally flexible approach to building different kinds of language models to fit the situation and training scenario. A many-level model with incremental increases in the size of the components allows for a very robust, tight, and high-level classification of the data. A model with large components assumes that the basic language symbols are long.

The chief benefit of this approach is that the symbols used to define the model are self-defined. The model is not encumbered by definitions of concepts such as "phonemes" or "syllables" used in linguistic analysis of vocal content. This allows imagined speech recognition to define and settle on its own kind and size of symbols.

This model, crucially, allows for variable-sized images, since the stacked net classifies components rather than entire raw images. We note importantly that not every level needs to be a CNN; a level can be a different pattern recognizer in its own right.
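To illustrate that a level only needs to turn its input into features and a class, the sketch below stacks two levels built from a toy nearest-centroid recognizer instead of a CNN. Everything here is an assumption made for illustration: the class `NearestCentroidLevel`, the helpers `group_features`, `train_stack`, and `predict_stack`, and the three made-up 2-D components. Levels are trained in ascending order, and `group` plays the role of the level parameter that sets how many lower-level features feed one higher-level input.

```python
class NearestCentroidLevel:
    """Toy stand-in for one level; the paper uses a CNN here."""

    def fit(self, inputs, labels):
        sums, counts = {}, {}
        for x, y in zip(inputs, labels):
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[y] = counts.get(y, 0) + 1
        self.classes = sorted(sums)
        self.centroids = [[v / counts[k] for v in sums[k]] for k in self.classes]
        return self

    def features(self, x):
        # One feature per class: negative squared distance to its centroid.
        return [-sum((p - q) ** 2 for p, q in zip(x, m)) for m in self.centroids]

    def predict(self, x):
        scores = self.features(x)
        return self.classes[scores.index(max(scores))]


def group_features(feature_seq, group):
    """Concatenate each run of `group` consecutive feature vectors."""
    return [
        sum(feature_seq[i:i + group], [])
        for i in range(0, len(feature_seq) - group + 1, group)
    ]


def train_stack(sequences, level1_labels, level2_labels, group):
    """Train level 1 on raw components, then level 2 on level-1 features."""
    level1 = NearestCentroidLevel().fit(
        [comp for seq in sequences for comp in seq],
        [y for labels in level1_labels for y in labels],
    )
    inputs, targets = [], []
    for seq, labels in zip(sequences, level2_labels):
        feats = [level1.features(comp) for comp in seq]
        inputs.extend(group_features(feats, group))
        targets.extend(labels)
    level2 = NearestCentroidLevel().fit(inputs, targets)
    return level1, level2


def predict_stack(level1, level2, seq, group):
    """Return the low-level and high-level classifications for one sequence."""
    feats = [level1.features(comp) for comp in seq]
    low = [level1.predict(comp) for comp in seq]
    high = [level2.predict(g) for g in group_features(feats, group)]
    return low, high


# Made-up components standing in for three distinct low-level symbols.
a, b, c = [0.0, 0.0], [4.0, 0.0], [0.0, 4.0]
sequences = [[a, b, a, c], [b, a, c, a]]
level1_labels = [[0, 1, 0, 2], [1, 0, 2, 0]]
level2_labels = [["ab", "ac"], ["ba", "ca"]]

level1, level2 = train_stack(sequences, level1_labels, level2_labels, group=2)
low, high = predict_stack(level1, level2, [a, c, b, a], group=2)
```

The second level never sees raw components, only level-1 features, so a new ordering of known pairs such as `[a, c, b, a]` is classified from pieces the stack already recognizes.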
What we are proposing is the stacked learning model, which we test in this paper with a CNN at each level. Its use is certainly not limited to this kind of pattern recognizer.

3. Data Collection

3.1. Experimentation

There are no publicly available datasets of imagined speech trials. We held three experiments at Professor Takako Fujioka's EEG lab at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford to generate our own dataset. Each of these experiments was aimed at acquiring some understanding of the levels of complexity in comprehending imagined speech. The first collected samples of four phonetically distinct syllables, the second of words and the decompositions of those words, and the third of words that differ by only one phoneme.

For the purposes of our experiment, we used a 10-20 system EEG with 64 channels covering the entirety of the subject's head. Three additional nodes tracked eye and upper-facial movements to assist in removing blinking and face movement artifacts from the data. The EEG sampled at a rate of 1000 Hz.

Each of the six experiments involved a subject imagining speaking a pair of symbols, alternating between trials. A low-pitched and a high-pitched tone were chosen before the experiment to correspond to these pairs, the lower tone corresponding to one pair and the higher tone to the other. In one round of readings, 200 trials of a symbol pair, 100 of each symbol, were mixed randomly and presented to the subject as the tones. After a short break, the experiment was repeated with the other pair of symbols. We then performed the first pair again in another round and the second pair in the next. In total, 200 readings of each symbol were collected.

The cueing sound lasted 0.2 seconds, long enough to perceive the pitch but not so long that the response to the tone interferes with thinking. The subject was given 2.3 seconds and asked to utter the correct syllable once, after which he was asked to rest his mind until the next beep was heard. A timeline for a single trial for a symbol pair is shown below:

3.2. Labelling

Labels were assigned to the trials based on the construction of the stacked CNN. That is, for each level, there are as many classes as there are linguistic features trained on at that level. For example, suppose a component is roughly the size of a single letter, and the second level takes three components of the level before. Then the word "signal" would have 6 classes in its first level, one for each letter, and 2 classes in its second level, for "sig" and "nal". Note that in the first level we have 6 different classes because each letter is different. Classes are assigned based on the unique linguistic features that are matched at that level. If the component is so small that no relevant linguistic feature can be matched to its size, then the classes are simply assigned in order. As an example, if we did letter classifications of the words "had" and "hat", then the samples on the first level would have labels of [0, 1, 2] and [0, 1, 3] respectively.

3.3. Preprocessing

For the purposes of this project, we used the Neuroscan software to interact with the experiment environment, record EEG signals in their raw form, and store them in the CNT format. EEG recordings, in their raw form, have a very low signal-to-noise ratio and hence need heavy preprocessing before being operated upon.
In order to do so, we used Brainstorm, a MATLAB toolkit, to refine our dataset. This toolkit facilitated the removal of eye blinks and other involuntary muscle movements recorded by the EEG. It also offers functionality to detect blatantly corrupt samples and outliers. After the data cleansing step, the toolkit stores the samples as MATLAB structures in .mat files.

As a final step, we bridge the gap between our coding language and the dataset format so that our Python scripts can interface with the data. The SciPy module hosts a few input-output (IO) functions, the most important in this context being loadmat, which reads a MATLAB file into a dictionary of NumPy arrays. We stack the instances of each symbol on top of one another to obtain the dataset in the right form.

4. Results and Evaluation

We trained and tested on our data using a three-level stacked CNN with two layers at each level. The levels above the first tend to perform better than the first because the number of classes is smaller, so we focus our analysis on training the first level. We found that the CNN would severely overfit, and tuning to encourage generalization had very little of the desired effect.

Our tests were run against our samples of "sig", "nal", and "signal". We decided to split each of the syllables into ten classes, making 20 classes in total in the first level. There were 200 samples of each word; with 10 components per syllable and 20 for "signal", that is 200 × (10 + 10 + 20) = 8000 components, and with 90% of the trials taken for training, we had a training set size of 7200 samples.

Since we ran our experiments on a computer with limited resources, we advise future experiments to perform data augmentation by averaging different combinations of samples to produce artificially new trials.

We use an SVM loss function to capture the notion that the symbolic components should be as distinct and dissimilar as possible.
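For reference, a common multiclass SVM (hinge) loss for a single sample can be written as below. The paper does not state the exact formulation or margin it used, so the margin of 1.0 and the function name `multiclass_hinge_loss` are assumptions, and the scores are made-up values.

```python
def multiclass_hinge_loss(scores, correct_class, margin=1.0):
    """Multiclass hinge loss for one sample.

    `scores` holds one raw score per class. Every wrong class whose score
    comes within `margin` of the correct class's score adds a penalty.
    """
    correct_score = scores[correct_class]
    return sum(
        max(0.0, score - correct_score + margin)
        for index, score in enumerate(scores)
        if index != correct_class
    )


# Made-up scores for a three-class problem.
scores = [2.0, 0.5, 3.0]
loss_when_class_0_is_correct = multiclass_hinge_loss(scores, correct_class=0)
loss_when_class_2_is_correct = multiclass_hinge_loss(scores, correct_class=2)
```

The loss is zero only when the correct class beats every other class by at least the margin, which matches the stated goal of keeping the symbolic components well separated. In practice this per-sample loss would be averaged over a batch and combined with a regularization term.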
Of our two best models, the first has a training accuracy of 94% with a validation accuracy of 10%, and the second has a training accuracy of 45% with a validation accuracy of 29%. The performance of the second was rare and not reproducible, being sensitive to weight initialization.

In every training case, RMSProp reliably solved the optimization problem, reducing the loss in a strongly exponential manner. Strangely, the loss often fell without improvements in the training or validation rate. This suggests that the optimization problem does not sufficiently represent the situation and that a stronger pattern recognition model should be used.

Using dropout, increasing the regularization rate, and increasing the number of samples trained on had little effect on the validation rate. Increasing the number of filters consistently improved the validation rate. The ineffectiveness of dropout and of the regularization rate likely indicates that the patterns observed are highly intricate, with weights that are best not biased to be smooth or averaged.

The low validation rate should not eclipse the fact that training produced extremely high accuracy, suggesting that even with one convolutional layer and one affine layer, strongly distinguishing features were found. This suggests that with a significantly stronger model, one that matches the extent of the problem discussed at the beginning of the paper and is trained using GPUs and other advanced methods, more global patterns may be found.

5. Future Work

In conclusion, we found in our limited experiment that small-layered CNNs offer modest accuracy in basic imagined speech tasks but likely come nowhere near fully reaching or testing the validity of a stacked language model. Further experiments in this field must increase the size of the dataset, improve the breadth of the data, and train much larger CNNs to saturate their effectiveness on the data.

We encourage further experimentation with pattern recognition methods that fit well into the stacked model of language understanding. For example, this paper does not cover Recurrent Neural Networks and their widely observed success in vocal speech recognition [1], which could also prove effective in imagined speech recognition.

6. Acknowledgements

We would like to acknowledge the efforts of the following people: Prof. Fei-Fei Li for overseeing this project, Dave Deriso for his inspiring support, Andrej Karpathy and the CS231N TAs for instruction in the art of CNNs, Prof. Takako Fujioka for lending her time and effort in using her lab, and Rafael Cosman for helping fuel some of the ideas presented in this paper.
References

[1] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition. ArXiv, 2014.

[2] B.V.K Kumar Vijaya and K Brigham. Imagined speech classification with EEG signals for silent communication: A preliminary investigation into synthetic telepathy. Bioinformatics and Biomedical Engineering (iCBBE), 2010.

[3] Lorenzo Magrassi, Giuseppe Aromataris, Alessandro Cabrini, Valerio Annovazzi-Lodi, and Andrea Moro. Sound representation in higher language areas during language generation. Proceedings of the National Academy of Sciences of the United States of America, 2014.

[4] Marc Oettinger. Decoding heard speech and imagined speech from human brain signals. Office of Intellectual Property Industry Research Alliances, 2012.