Open Data Synthesis For Deep Research
#1 Paper of the day
© NABLAS Inc. All Rights Reserved 2
There is a huge accuracy gap between open-source deep research agents and proprietary ones
Motivation
A deep research task is a complex information-seeking activity characterized by multi-layered
information dependencies.
Deep research
Good questions should be unambiguous and verifiable
What are good questions?
The goal is to identify the unique answer set A that simultaneously satisfies all constraints extracted
from the question (e.g., Sudoku)
If |A| is larger than 1, the answer is ambiguous
Constraint Satisfaction Problem (CSP)
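The CSP view above can be sketched in a few lines: filter a candidate pool by every constraint and inspect the size of the resulting answer set A. This is a minimal illustration, not the paper's implementation; the `CITIES` table, the constraint functions, and the name `answer_set` are hypothetical and exist only for this example.

```python
from typing import Callable, Iterable

Constraint = Callable[[str], bool]

def answer_set(candidates: Iterable[str], constraints: list[Constraint]) -> set[str]:
    """Return every candidate that satisfies all constraints simultaneously."""
    return {c for c in candidates if all(check(c) for check in constraints)}

# Toy knowledge base (hypothetical data, for illustration only).
CITIES = {
    "Tokyo": {"country": "Japan", "capital": True},
    "Osaka": {"country": "Japan", "capital": False},
    "Kyoto": {"country": "Japan", "capital": False},
    "Paris": {"country": "France", "capital": True},
}

def in_japan(city: str) -> bool:
    return CITIES[city]["country"] == "Japan"

def is_capital(city: str) -> bool:
    return CITIES[city]["capital"]

# One constraint leaves several candidates: |A| > 1, so the question is ambiguous.
ambiguous = answer_set(CITIES, [in_japan])

# Adding a second constraint narrows A to a single element: the question is well-posed.
unique = answer_set(CITIES, [in_japan, is_capital])
```

With this framing, checking whether a generated question is ambiguous reduces to checking `len(answer_set(...)) > 1`.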
Multi-hop Problem (MHP)
Hierarchical Constraint Satisfaction Problem (HCSP)
Let’s make it hierarchical
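One possible reading of "hierarchical" is that some constraints at one level refer to the answer of a sub-question, which is itself a small CSP. The sketch below follows that reading only; the slide does not give the formal definition, so the `Node` structure, the `solve` function, and the toy people data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One level of the hierarchy: a small CSP over `candidates`."""
    candidates: set
    # Plain constraints: predicates over a single candidate.
    constraints: list = field(default_factory=list)
    # Dependent constraints: (child node, relation(candidate, child_answer)).
    children: list = field(default_factory=list)

def solve(node: Node) -> set:
    """Resolve child sub-questions first, then filter this node's candidates."""
    answers = set(node.candidates)
    for check in node.constraints:
        answers = {c for c in answers if check(c)}
    for child, relation in node.children:
        child_answers = solve(child)
        if len(child_answers) != 1:
            raise ValueError("sub-question is ambiguous or unsatisfiable")
        (child_answer,) = child_answers
        answers = {c for c in answers if relation(c, child_answer)}
    return answers

# Hypothetical toy facts.
BORN = {"Bob": 1950, "Eve": 1950, "Alice": 1960}
JOB = {"Bob": "engineer", "Eve": "doctor", "Alice": "engineer"}
PARENT = {"Carl": "Bob", "Dan": "Eve"}

# Sub-question: "the engineer born in 1950" (two constraints, one answer).
parent_node = Node(
    candidates=set(BORN),
    constraints=[lambda p: BORN[p] == 1950, lambda p: JOB[p] == "engineer"],
)

# Root question: "Who is the son of the engineer born in 1950?"
root = Node(
    candidates=set(PARENT),
    children=[(parent_node, lambda c, parent: PARENT[c] == parent)],
)

result = solve(root)
```

Raising an error when a sub-question has more than one answer mirrors the ambiguity criterion from the CSP slide: every level must resolve to a unique answer for the whole question to be verifiable.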
InfoSeek builds research trees, which are then used to generate question/answer pairs
InfoSeek
[Figure: research tree example from Wikipedia. Nodes: "Facheiroa cephaliomelana + page content" and "Albert Frederik Hendrik Buining + page content". Labels: Parent node, Child node. Example edge claim: "A is the son of B".]
Example
A Planner and a Browser take actions interactively to construct research trees
InfoSeek
[Figure: action pipeline with panels labeled ACTION 1, ACTION 2, ACTION 3]
Initializing the root node
Setting claims and extending the tree
Generating question/answer pairs (questions should be difficult enough and verifiable)
Example
Question: What was the public health program that was managed by a general who was appointed by Mario Draghi, and was mandatory for people over 50 from 15/02/2022 to 30/06/2022?
Answer: COVID-19 vaccination in Italy
Question: What is the shape of the object featured in the 2025 episode of the television show known for celebrity gossip and scandals that was spotted over a clandestine military installation of the nation known for its fifty stars on its flag in 2015?
Answer: Diamond-shaped
Qwen2.5-32B-Inst can answer only 2% of the questions (no tools)
InfoSeeker - distillation
Message: … **Question:** What was an election won by a person whose nomination was engineered by Frank Hague?
Response: <answer> Nguyễn Ngọc Kiều Duy </answer>
Qwen2.5-72B → Qwen2.5-3B-Inst
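The response format on this slide wraps the final answer in `<answer>` tags, so a distillation pipeline presumably needs to extract that span and compare it with the reference answer. The sketch below shows one way to do this; the helper names and the normalized exact-match criterion are assumptions, and the actual filtering rule used for InfoSeeker may differ.

```python
import re
from typing import Optional

# Non-greedy match so that several <answer> blocks in one response stay separate.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> block, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple rule)."""
    return " ".join(text.lower().split())

def is_correct(response: str, gold: str) -> bool:
    """Keep a teacher trajectory only if its extracted answer matches the reference."""
    predicted = extract_answer(response)
    return predicted is not None and normalize(predicted) == normalize(gold)

response = "reasoning steps ... <answer> COVID-19 vaccination in Italy </answer>"
predicted = extract_answer(response)
```

Taking the last match is a design choice for responses where the model drafts an answer and later revises it.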
GRPO
2 rounds
InfoSeeker - RL
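GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses sampled for the same question, rather than using a learned value function. A minimal sketch of that group-relative advantage step follows; the binary reward values, the use of the population standard deviation, and the `eps` term are assumptions for illustration, not details taken from the slide.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one question, rewarded 1.0 if the answer was correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every response gets the same reward, all advantages are zero and the group
# contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, which is why questions that are neither trivially easy nor impossible are the most useful for this kind of training.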
Notably, most baselines rely heavily on large amounts of in-domain supervision (i.e., more than
100K NQ & HQA examples), while our approach focuses on leveraging the purpose-built InfoSeek dataset for
training
Experiment
GPT-5…
Experiment
Can we apply it to data unlike Wikipedia, especially data without explicit links?
Can we use it to generate fact-check datasets for GENIAC #3?
Discussion

Internal study session materials: Open Data Synthesis For Deep Research
