3. FLARE
Abstract
Most existing retrieval-augmented LMs employ a
retrieve-and-generate setup that only retrieves information once based
on the input.
This is limiting, however, in more general scenarios involving
generation of long texts, where continually gathering information
throughout the generation process is essential.
3 / 51
4. FLARE
Abstract
This paper proposes Forward-Looking Active REtrieval augmented
generation (FLARE), a generic retrieval-augmented generation
method that iteratively uses a prediction of the upcoming sentence
to anticipate future content; if the prediction contains
low-confidence tokens, it is used as a query to retrieve relevant
documents and the sentence is regenerated.
4 / 51
6. FLARE
Introduction
Introduction
Generative language models (LMs) still tend to hallucinate and create
imaginary content.
To address the issue of hallucination, one promising direction is to
augment generation with retrieval, which involves augmenting
parametric LMs with non-parametric retrieval components that can
look up relevant information from external knowledge resources such
as document corpora [5].
6 / 51
7. FLARE
Introduction
Introduction
However, there is a problem: initial retrieval based on the topic
name (e.g., Joe Biden) may not cover all aspects and details.
Therefore, it is crucial to retrieve extra information as needed
during the generation process, such as when generating a certain
aspect (e.g., the education history of Joe Biden) or a specific detail
(e.g., when did Joe Biden announce his candidacy for the 2020
presidential campaign).
7 / 51
8. FLARE
Introduction
Figure 1: FLARE, starting with the user input x and initial retrieval results
Dx; low-probability tokens (indicated with underlines) trigger retrieval.
8 / 51
9. FLARE
Introduction
Contribution
Forward-Looking Active REtrieval augmented generation (FLARE),
as illustrated in Figure 1, iteratively generates a temporary next
sentence, uses it as the query to retrieve relevant documents if it
contains low-probability tokens, and regenerates the next sentence
until the end of the output is reached.
FLARE is applicable to any existing LM at inference time without
additional training. Following the release of GPT-3.5 [6], the authors
examine the effectiveness of their methods on text-davinci-003.
9 / 51
11. FLARE
Retrieval-Augmented Generation
Notations and Definitions
Given a user input x and a document corpus D = {d1, d2, . . . , d|D|},
the goal of retrieval-augmented LMs is to generate the answer
y = [s1, s2, . . . , sm] = [w1, w2, . . . , wn], containing m sentences or n
tokens, leveraging information retrieved from the corpus.
A retriever returns a list of documents Dq = ret(q) for a query q.
Following existing methods [8, 11], they prepend the retrieved
documents before the user input to aid future generation, for both
baselines and FLARE, for fair comparison: y = LM([Dq, x]),
where [·, ·] is concatenation following the specified order.
11 / 51
12. FLARE
Retrieval-Augmented Generation
Active Retrieval Augmented Generation
Unlike single-time retrieval-augmented generation [5], active retrieval
augmented generation is a generic framework that actively decides
when and what to retrieve throughout the generation process.
Formally, at step t (t ≥ 1), the retrieval query qt is formulated based on
both the user input x and the previously generated output y<t.
12 / 51
13. FLARE
Retrieval-Augmented Generation
Active Retrieval Augmented Generation
The query formulation function is
qt = qry(x, y<t)
and when t = 1, y<t = ∅.
Given the retrieved documents Dqt, the LM continues generating the
answer until the next retrieval is triggered or the end is reached:
yt = LM([Dqt, x, y<t])
At each step, previously retrieved documents ∪t′<t Dqt′ are discarded,
and only the documents retrieved at the current step are used to
generate the next tokens.
13 / 51
14. FLARE
Forward-Looking Active REtrieval Augmented Generation
Table of contents I
1 Introduction
2 Retrieval-Augmented Generation
3 Forward-Looking Active REtrieval Augmented Generation
FLARE with Retrieval Instructions
Direct FLARE
Implementation Details
4 Multi-time Retrieval Baselines
5 Experimental Setup / Results
14 / 51
16. FLARE
Forward-Looking Active REtrieval Augmented Generation
Forward-Looking Active REtrieval Augmented Generation
FLARE follows two principles:
1 LMs should retrieve information only when they lack the
necessary knowledge, to avoid unnecessary or inappropriate
retrieval
2 The retrieval queries should reflect the intent of future
generation
Inspired by Toolformer [10], the paper proposes two variants:
1 FLAREinstruct prompts the LM to generate retrieval queries when
necessary while generating the answer, using
retrieval-encouraging instructions
2 FLAREdirect directly uses the LM’s generation as search queries;
if uncertain tokens are present, it retrieves and regenerates
16 / 51
17. FLARE
Forward-Looking Active REtrieval Augmented Generation
FLARE with Retrieval Instructions
FLARE with Retrieval Instructions
A straightforward way of expressing information needs for retrieval is
to generate “[Search(query)]” when additional information is needed
[10].
Prompt 3.1: retrieval instructions
Skill 1: An instruction to guide LMs to generate search queries.
Skill 2: An instruction to guide LMs to perform a specific
downstream task (e.g., multihop QA).
An instruction to guide LMs to combine skills 1 and 2 for the test case.
17 / 51
18. FLARE
Forward-Looking Active REtrieval Augmented Generation
FLARE with Retrieval Instructions
When the LM generates “[Search(query)]”, it stops generation and
uses the query to retrieve relevant documents, as shown in Figure 2.
Figure 2: An illustration of FLAREinstruct
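A minimal way to detect such calls is a regular expression; the “[Search(query)]” surface form follows the slide above, while the helper name and example text are illustrative assumptions:

```python
import re

# Pattern for "[Search(query)]" calls emitted under retrieval instructions.
SEARCH_CALL = re.compile(r"\[Search\((.*?)\)\]")

def extract_search_query(text):
    """Return the query of the first [Search(...)] call, or None."""
    m = SEARCH_CALL.search(text)
    return m.group(1) if m else None
```

When a match is found, generation is paused, the query is sent to the retriever, and decoding resumes with the retrieved documents in context.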
18 / 51
19. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Direct FLARE
The retrieval instructions of FLAREinstruct might not be reliable, so
they also propose FLAREdirect.
It has two tricks:
1 Confidence-based Active Retrieval
2 Confidence-based Query Formulation
Masked sentences as implicit queries
Generated questions as explicit queries
19 / 51
20. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Confidence-based Active Retrieval
First, it generates a temporary next sentence ŝt = LM([x, y<t])
without conditioning on retrieved documents.
Then it decides whether to trigger retrieval and formulates
queries based on ŝt. If the LM is confident in ŝt, the sentence is
accepted. Otherwise, ŝt is used to formulate search queries qt to
retrieve relevant documents, and the next sentence st is regenerated.
Retrieval is actively triggered if any token of ŝt has a probability lower
than a threshold 𝜃 ∈ [0, 1]. 𝜃 = 0 means that retrieval is never
triggered, while 𝜃 = 1 triggers retrieval for every sentence.
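The triggering rule amounts to a one-line check; `should_retrieve` and `token_probs` are hypothetical names for illustration:

```python
def should_retrieve(token_probs, theta):
    """Trigger retrieval iff any token of the tentative sentence
    has probability below the threshold theta in [0, 1]."""
    return any(p < theta for p in token_probs)
```

With theta = 0 no token can fall below the threshold, so retrieval never fires; with theta = 1 any realistic token probability is below it, so retrieval fires for every sentence.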
20 / 51
21. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Confidence-based Query Formulation
One way to perform retrieval is to directly use the next sentence ŝt as
the query qt; this already achieves significantly better results than
querying with the previous context.
However, it risks perpetuating the errors contained in ŝt: using an
erroneous sentence as a query could prompt the retriever to
retrieve irrelevant information and mislead future generation.
So they propose two methods to overcome this drawback:
Masked sentences as implicit queries
Generated questions as explicit queries
21 / 51
22. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Implicit vs. explicit query formulation
Figure 3: Tokens with low probabilities are marked with underlines.
22 / 51
23. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Masked sentences as implicit queries
Queries qt are formulated based on ŝt as follows:
qt = ∅ if all tokens of ŝt have probs ≥ 𝜃
qt = mask(ŝt) or qgen(ŝt) otherwise
The first method masks out low-confidence tokens in ŝt with
probabilities below a threshold 𝛽 ∈ [0, 1]; the higher 𝛽, the more
aggressive the masking.
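A minimal sketch of mask(ŝt), assuming the sentence's tokens and their probabilities are available as parallel lists (`mask_low_confidence` is a hypothetical name):

```python
def mask_low_confidence(tokens, probs, beta):
    """Implicit query: drop tokens whose probability is below beta,
    keeping the confident tokens as the retrieval query."""
    return " ".join(t for t, p in zip(tokens, probs) if p >= beta)
```

Raising beta removes more tokens, trading query specificity for robustness against the LM's own errors.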
23 / 51
24. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Generated questions as explicit queries
Another method is to generate explicit questions that target the
low-confidence spans in ŝt.
Self-ask [7] achieved this by manually inserting follow-up questions
into downstream task exemplars, as shown later in Prompt 4.1, which
requires task-specific annotation effort.
They instead develop a universal approach that generates questions
for low-confidence spans without additional annotation.
24 / 51
25. FLARE
Forward-Looking Active REtrieval Augmented Generation
Direct FLARE
Generated questions as explicit queries
It first extracts all spans from ŝt with probabilities below 𝛽. For each
extracted span z, it prompts gpt-3.5-turbo to generate a question
qt,z that can be answered with the span, using the following prompt:
Prompt 3.2: zero-shot question generation
The user input x.
The generated output so far y≤t.
Given the above passage, ask a question to which the answer is the
term/entity/phrase “z”.
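A hypothetical helper that assembles such a prompt; the wording paraphrases the spirit of Prompt 3.2 rather than reproducing the paper's verbatim template:

```python
def build_qgen_prompt(x, y_so_far, span):
    """Zero-shot question-generation prompt for a low-confidence span."""
    return (
        f"{x}\n{y_so_far}\n"
        "Given the above passage, ask a question to which the answer "
        f'is the term/entity/phrase "{span}".'
    )
```

The returned string would be sent to gpt-3.5-turbo, and the generated question used as the explicit retrieval query qt,z.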
25 / 51
26. FLARE
Forward-Looking Active REtrieval Augmented Generation
Implementation Details
Implementation Details
The initial generation is conditioned on documents retrieved with the
user input: ŝ1 = LM([Dx, x]).
Sentence tokenization: at each step t, it generates 64 tokens, which is
longer than most sentences, then uses the NLTK sentence tokenizer1 to
extract the first sentence and discards the rest.
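The paper uses NLTK's PunktSentenceTokenizer for this step; the regex below is only a rough stand-in that sketches the idea (it will mis-split abbreviations such as "e.g."):

```python
import re

def first_sentence(text):
    """Keep the text up to the first sentence-final punctuation mark
    and discard the rest (rough approximation of Punkt)."""
    m = re.search(r".+?[.!?](?=\s|$)", text)
    return m.group(0) if m else text
```

In practice one would call the NLTK tokenizer and take its first returned sentence instead.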
Document corpus and retrievers:
Wikipedia dump (document corpus) [3]
BM25 (retriever) [9]
1nltk.tokenize.PunktSentenceTokenizer
26 / 51
27. FLARE
Forward-Looking Active REtrieval Augmented Generation
Implementation Details
Implementation Details
Retrieved document formatting:
Prompt 3.3: document formatting
Search results:
[1] Document 1
[2] Document 2
. . .
The user input x
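A small helper sketching this formatting (`format_context` is a hypothetical name; the layout follows Prompt 3.3 above):

```python
def format_context(docs, x):
    """Prepend numbered search results before the user input."""
    lines = ["Search results:"]
    lines.extend(f"[{i}] {d}" for i, d in enumerate(docs, 1))
    lines.append(x)
    return "\n".join(lines)
```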
Efficiency:
On average, retrieval is triggered for 30% ∼ 60% of sentences
depending on the downstream task. Compared to single-time retrieval,
interleaving retrieval and generation with a naive implementation
indeed increases overhead.
27 / 51
29. FLARE
Multi-time Retrieval Baselines
Multi-time Retrieval Baselines
They formally introduce three baseline categories based on when and
what to retrieve.
1 Previous-window
2 Previous-sentence
3 Question decomposition
29 / 51
30. FLARE
Multi-time Retrieval Baselines
Previous-window
Previous-window approaches trigger retrieval every l tokens, where l
represents the window size. Tokens generated in the previous window
are used as the query:
qt = yt−1 (t ≥ 2)
yt = [w(t−1)l+1, . . . , wtl]
Existing methods in this category include RETRO [1],
IC-RALM [8], and kNN-LM [4].
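A sketch of the window-splitting step, assuming the generated tokens are available as a list (`previous_window_queries` is a hypothetical name):

```python
def previous_window_queries(tokens, l):
    """Fixed-interval baseline: split generated tokens into windows of
    size l; each completed window y_{t-1} serves as the query for
    retrieval step t."""
    return [" ".join(tokens[i:i + l]) for i in range(0, len(tokens), l)]
```

Because each query only looks backward, it may not reflect what the LM intends to generate next, which is the gap FLARE targets.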
30 / 51
31. FLARE
Multi-time Retrieval Baselines
Previous-sentence / Question decomposition
Previous-sentence approaches trigger retrieval every sentence and
use the previous sentence as the query:
qt = yt−1 (t ≥ 2)
yt = st
Question decomposition approaches use manually annotated
task-specific exemplars to guide LMs to generate decomposed
sub-questions while producing outputs.
31 / 51
32. FLARE
Multi-time Retrieval Baselines
Prompt 4.1: multihop QA with self-ask
Question: Who lived longer, Theodor Haecker or Harry Vaughan
Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he
died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when
he died.
So the final answer is: Harry Vaughan Watkins.
32 / 51
33. FLARE
Multi-time Retrieval Baselines
Notable drawbacks
1 Fixed-interval approaches use previously generated tokens as
queries, which might not reflect what LMs intend to generate
in the future
2 Retrieving information at a fixed interval can be inefficient
because it might occur at inappropriate points
3 Question decomposition approaches require task-specific prompt
engineering, which restricts their generalizability to new tasks
33 / 51
35. FLARE
Experimental Setup / Results
Experimental Setup
They evaluate the effectiveness of FLARE on 4 diverse
knowledge-intensive tasks using few-shot in-context learning, as
summarized in Table 1.
They compare FLARE with baselines under the same setting,
sub-sampling at most 500 examples from each dataset due to the
cost of running experiments.
The hyperparameters of FLARE are selected based on the
development set and listed in Table 2.
35 / 51
36. FLARE
Experimental Setup / Results
Experimental Setup
Datasets used in the experiments:
1 Multihop QA
2 Commonsense Reasoning
3 Long-form QA
4 Open-domain Summarization
36 / 51
37. FLARE
Experimental Setup / Results
Experimental Results
Figure 4: Comparison between FLARE and baselines across all
tasks/datasets.
37 / 51
38. FLARE
Experimental Setup / Results
Ablation Study
Importance of forward-looking retrieval.
They first validate their hypothesis that forward-looking retrieval is
indeed more powerful than past-context-based retrieval.
Figure 5: A head-to-head comparison between using the previous sentence
and the next sentence for retrieval.
38 / 51
39. FLARE
Experimental Setup / Results
Ablation Study
Importance of active retrieval.
Next, they investigate the relationship between performance and the
active retrieval threshold 𝜃.
Figure 6: Performance (EM) of FLARE with respect to the percentage of
steps/sentences with retrieval on 2WikiMultihopQA and StrategyQA.
39 / 51
40. FLARE
Experimental Setup / Results
Ablation Study
Effectiveness of different query formulation methods.
Last, they study implicit query formulation by masking and explicit
query formulation through question generation.
Figure 7: Performance of FLARE with respect to the masking threshold 𝛽 on
2WikiMultihopQA.
40 / 51
42. FLARE
Conclusion / Limitation
Conclusion
This paper implements a framework with forward-looking active
retrieval that iteratively uses a prediction of the upcoming sentence to
retrieve relevant information if it contains low-confidence tokens and
regenerates the next sentence.
42 / 51
43. FLARE
Conclusion / Limitation
Limitation
FLARE did not provide significant gains on the Wizard of Wikipedia
dataset: its outputs are relatively short, so retrieving multiple
disparate pieces of information might not be necessary.
From an engineering perspective, the LM needs to be activated
multiple times (once for each retrieval), and a caching-free
implementation also requires recomputing the previous activations
after each retrieval.
A remedy is to design an architecture that encodes the retrieved
documents Dqt and the input/generation (x/yt) independently.
43 / 51
45. FLARE
Reflection
Reflection
Retrieval-augmented generation (RAG) can help specialize an
LLM to a particular domain and use case. In this work, they provide a
more trustworthy way to use retrieval.
If we want the language model to generate more thoughtful (more
emotionally aware) responses, it would be interesting to fine-tune a
model on a psychology chat dataset [2]. Finding a suitable dataset or
prompt may be a good direction to focus on, and parameter-efficient
methods may help a lot for dialogue systems.
45 / 51
46. FLARE
Reflection
References I
[1] Sebastian Borgeaud et al. “Improving Language Models by
Retrieving from Trillions of Tokens”. In: Proceedings of the
39th International Conference on Machine Learning. Ed. by
Kamalika Chaudhuri et al. Vol. 162. Proceedings of Machine
Learning Research. PMLR, July 2022, pp. 2206–2240. url:
https://proceedings.mlr.press/v162/borgeaud22a.html.
[2] Minhajul Hoque. Making Chat-bot more emotional. 2023. url:
https://www.kaggle.com/discussions/questions-and-answers/418148#2309888.
46 / 51
47. FLARE
Reflection
References II
[3] Vladimir Karpukhin et al. “Dense Passage Retrieval for
Open-Domain Question Answering”. In: Proceedings of the
2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP). Online: Association for Computational
Linguistics, Nov. 2020, pp. 6769–6781. doi:
10.18653/v1/2020.emnlp-main.550. url:
https://aclanthology.org/2020.emnlp-main.550.
[4] Urvashi Khandelwal et al. Generalization through
Memorization: Nearest Neighbor Language Models. 2020.
arXiv: 1911.00172 [cs.CL].
47 / 51
48. FLARE
Reflection
References III
[5] Patrick Lewis et al. “Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks”. In: Advances in Neural
Information Processing Systems. Ed. by H. Larochelle et al.
Vol. 33. Curran Associates, Inc., 2020, pp. 9459–9474. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
[6] Long Ouyang et al. Training language models to follow
instructions with human feedback. 2022. arXiv: 2203.02155
[cs.CL].
[7] Ofir Press et al. Measuring and Narrowing the Compositionality
Gap in Language Models. 2023. arXiv: 2210.03350 [cs.CL].
48 / 51
49. FLARE
Reflection
References IV
[8] Ori Ram et al. In-Context Retrieval-Augmented Language
Models. 2023. arXiv: 2302.00083 [cs.CL].
[9] Stephen Robertson and Hugo Zaragoza. “The Probabilistic
Relevance Framework: BM25 and Beyond”. In: Foundations
and Trends® in Information Retrieval 3.4 (2009), pp. 333–389.
issn: 1554-0669. doi: 10.1561/1500000019. url:
http://dx.doi.org/10.1561/1500000019.
[10] Timo Schick et al. Toolformer: Language Models Can Teach
Themselves to Use Tools. 2023. arXiv: 2302.04761 [cs.CL].
[11] Harsh Trivedi et al. Interleaving Retrieval with
Chain-of-Thought Reasoning for Knowledge-Intensive
Multi-Step Questions. 2022. arXiv: 2212.10509 [cs.CL].
49 / 51
50. FLARE
Reflection
A1: Experimental settings

Settings           | 2WikiMultihopQA     | StrategyQA     | ASQA                     | WikiAsp
Task               | multihop QA         | commonsense QA | long-form QA             | open-domain summarization
#Examples          | 500                 | 229            | 500                      | 500
Metrics            | EM, F1, Prec., Rec. | EM             | EM, Disambig-F1, ROUGE-L | UniEval, entity-F1, ROUGE
Corpus             | Wikipedia           | Wikipedia      | Wikipedia                | open web
Retriever          | BM25                | BM25           | BM25                     | Bing
Top-k              | 2                   | 3              | 3                        | 5
#Exemplars         | 8                   | 6              | 8                        | 4
Ret. for exemplars | ✓                   | ✗              | ✗                        | ✗

Table 1: Statistics and experimental settings of different tasks/datasets
50 / 51
51. FLARE
Reflection
A1: Experimental settings

Dataset          | 𝜃   | 𝛽   | Query formulation | Combine single-/multi-time retrieval
2WikiMultihopQA  | 0.8 | 0.4 | implicit          | ✗
StrategyQA       | 0.4 | 0.4 | implicit          | ✗
ASQA / ASQA-hint | 0.8 | 0.4 | explicit          | ✓
WikiAsp          | 0.8 | 0.4 | explicit          | ✓

Table 2: Hyperparameters of FLARE for different tasks/datasets
51 / 51