Components of an IR QA System
Data Ingestion and Indexing
Types of Answers
Types of Information Retrieval
Data Ingestion and Indexing
Focus on text: typically a custom process
for inputting data
Indexing can be done on a per-token
(per-word) basis
Augment the index with certain kinds of additional information
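A minimal sketch of a per-token inverted index in Python; the documents and the tokenization (lowercased whitespace split) are invented for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Toy corpus (invented)
docs = ["Watson answers questions", "Siri answers by voice"]
index = build_inverted_index(docs)
```

Looking up a token then returns the ids of every document that contains it, which is the core operation behind per-token search.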
Cosine similarity measures how “close” a document is to a query:
the query is vectorized as a bag of words and compared to each
document, also vectorized, typically with TF-IDF weighting.
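The TF-IDF cosine comparison can be sketched as follows; the idf smoothing variant and the toy documents are assumptions for illustration, not any specific system's formula:

```python
import math
from collections import Counter

def tfidf_vector(text, docs):
    """Bag-of-words TF-IDF vector as a {term: weight} dict (smoothed idf)."""
    tf = Counter(text.lower().split())
    n = len(docs)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d.lower().split())
        idf = math.log((1 + n) / (1 + df)) + 1  # one common smoothing variant
        vec[term] = count * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus and query (invented)
docs = ["the cat sat on the mat",
        "dogs chase cats",
        "vector space retrieval ranks documents"]
query = "cat on the mat"
sims = [cosine(tfidf_vector(query, docs), tfidf_vector(d, docs)) for d in docs]
best = max(range(len(docs)), key=sims.__getitem__)
```

The document sharing the most (rarest) query terms gets the highest score, so `best` points at the first document here.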
Different Kinds of QA
Definition (several relevant paragraphs
concatenated into one)
Direct Answer (Watson)
Interactive (Siri, voice assistants)
Question Answering Systems
Are a step beyond search engines…
Have several data sources from which a question
can be answered
A classifier for the question type is used on the query
Several data sources can answer different kinds of questions.
This is used to compile a list of question-answering
candidates, or “documents most likely to succeed”.
Question Answering Systems (cont.)
Take the input text and classify it, to determine
the type of question
Use different answer sources (search engines,
triple stores/graph databases and computational
engines such as Wolfram Alpha)
Compile a list of answer candidates from each
Rank each answer candidate by relevance
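The classify-then-route steps above can be sketched as a toy pipeline; the question types, keyword rules, and routing table are invented placeholders (a real system would train a classifier rather than match keywords):

```python
def classify_question(question):
    """Toy question-type classifier; real systems learn this from data."""
    q = question.lower()
    if q.startswith(("who", "where", "when")):
        return "factoid"
    if q.startswith(("what is", "define")):
        return "definition"
    if "how many" in q or "how much" in q:
        return "computational"
    return "other"

# Route each type to the answer sources most likely to succeed (invented table)
SOURCES = {
    "factoid": ["triple store", "search engine"],
    "definition": ["search engine"],
    "computational": ["Wolfram Alpha"],
    "other": ["search engine"],
}

def candidate_sources(question):
    """Pick answer sources for a question based on its classified type."""
    return SOURCES[classify_question(question)]
```

Each selected source would then contribute answer candidates, which are ranked by relevance in the final step.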
Deep Learning with DBNs
● Any NLP (Collobert and Weston 2011)
● Sound with phonetics (Mohamed, Dahl, and Hinton)
● Computer Vision (Lee, Grosse, Ng)
● Watson (DeepQA)
● Image search via object recognition (Google)
● Recommendation Engines (Netflix)
Restricted Boltzmann Machines
● Units – binary, Gaussian, rectified linear, softmax, multinomial
● Hidden/Visible Units – visible units represent the data; hidden units learn features that explain it
● Contrastive divergence is used for learning the weights
● Positive phase: drive the network toward the training inputs (visible).
Negative phase: balance out with the model’s own samples, approximating
the partition function (hidden)
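The contrastive-divergence update described above can be sketched in NumPy as CD-1 for a binary RBM; the toy data, layer sizes, learning rate, and epoch count are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary-binary RBM trained with one step of contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def sample_h(self, v):
        p = sigmoid(v @ self.W + self.b_h)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        p = sigmoid(h @ self.W.T + self.b_v)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1(self, v0):
        # Positive phase: hidden statistics driven by the training data
        ph0, h0 = self.sample_h(v0)
        # Negative phase: one reconstruction step stands in for sampling
        # the model's own distribution (the intractable partition-function term)
        pv1, _ = self.sample_v(h0)
        ph1, _ = self.sample_h(pv1)
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - pv1) ** 2))  # reconstruction error

# Toy data: two repeated binary patterns (invented)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 50, dtype=float)
rbm = RBM(n_visible=4, n_hidden=2)
errors = [rbm.cd1(data) for _ in range(200)]
```

Reconstruction error falls as the hidden units learn to explain the two patterns in the data.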
Real-valued inputs: use Gaussian visible units
● Stacked Restricted Boltzmann Machines – compose to learn higher-level
correlations in the data
● Creates feature extractors
Use any sort of output layer with different objective functions to do different tasks:
● Logistic/softmax regression – negative log-likelihood – classification
● Mean squared error – regression
● Cross entropy – reconstruction
Deep Learning and QA Systems
Part of the problem with answer-candidate searches
is speed: they are slow, because answering each
question is computationally intensive.
Deep learning allows for fast lookup of various kinds
of answer candidates by encoding them.
Deep autoencoders allow for the encoding and decoding
of images as well as text.
Deep autoencoders are two deep-belief networks:
The first is a series of RBMs that encode the input into
a very small set of numbers, called the codes.
The codes are what is indexed and stored for search.
The second DBN is the decoder, which reconstructs
the input from the codes.
The Encoding Layer: A How-To
Take the input, and make the parameters of the
first hidden layer slightly bigger than that input.
That allows for more information representation
on the first layer.
Progressively decrease the hidden-layer sizes at
each layer until you reach the final coding layer,
which is small (10–30 numbers).
Make the final hidden layer’s output linear
(i.e. real numbers); linear acts as a pass-through.
Transpose the matrices of the encoder and reverse the
order of its layers.
Each parameter, after training the encoder, is used
to create the decoding net.
The decoder’s hidden layers are the exact mirror of the encoder’s.
The output layer of the decoder is then trained to reconstruct
the original input.
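The how-to above (first layer slightly wider than the input, shrink down to a small linear code, decode with the transposed matrices in reverse order) can be wired up as a sketch. The layer sizes are invented and the network is untrained, so this shows only the architecture, not learned codes:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Encoder: first hidden layer slightly wider than the input,
# then progressively smaller; the last size is the tiny code layer
sizes = [20, 24, 12, 6, 3]                      # invented sizes
weights = [rng.normal(0, 0.1, size=(a, b)) for a, b in zip(sizes, sizes[1:])]

def encode(x):
    h = x
    for W in weights[:-1]:
        h = sigmoid(h @ W)
    return h @ weights[-1]          # final layer is linear (real-valued code)

def decode(code):
    # Decoder: the encoder's matrices transposed, applied in reverse order
    h = code
    for W in reversed(weights):
        h = sigmoid(h @ W.T)
    return h

x = rng.random(20)                  # a fake 20-dimensional input
code = encode(x)
x_hat = decode(code)
```

In practice the weights would come from layerwise RBM pretraining, and the whole encoder-decoder stack would then be fine-tuned with backpropagation on reconstruction error.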
Connecting the Dots
Deep autoencoders can assist in creating answer
candidates for information-retrieval systems.
This works for image or text search.
This technique is called semantic hashing.
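A toy illustration of semantic hashing, assuming codes already produced by a trained encoder (random stand-ins here): binarize each code into a hash key, bucket documents by key, and look up candidates within a small Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(2)

def to_binary_code(code):
    """Threshold a real-valued autoencoder code into a binary hash key."""
    return tuple(int(c > 0) for c in code)

# Pretend these are 8-dimensional codes from a trained encoder (invented)
codes = {f"doc{i}": rng.normal(size=8) for i in range(100)}

buckets = {}
for doc_id, code in codes.items():
    buckets.setdefault(to_binary_code(code), []).append(doc_id)

def lookup(query_code, max_hamming=1):
    """Return docs whose binary code is within max_hamming bits of the query's."""
    q = np.array(to_binary_code(query_code))
    results = []
    for key, doc_ids in buckets.items():
        if int(np.sum(np.abs(np.array(key) - q))) <= max_hamming:
            results.extend(doc_ids)
    return results

hits = lookup(codes["doc7"])
```

Because similar inputs map to nearby codes, scanning only the matching (and near-matching) buckets retrieves answer candidates without comparing the query against every document.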