Writing a deep-research agent
for fun, learning and costs

Nitin Bansal · 2nd May 2026 · ~13 min read

Over the last year, deep research quietly became one of the most interesting features in modern AI products.

You type a question. A model disappears for a few minutes. Then it comes back with citations, sources, structured analysis, and an answer that feels less like autocomplete and more like actual research.

It is genuinely impressive.

It is also usually hidden behind:

  1. strict usage limits,
  2. daily or weekly quotas,
  3. queue times,
  4. premium pricing tiers,
  5. and a black box you are not supposed to look inside.

After hitting those limits enough times, a different question starts to feel more interesting than the research itself:

What would it actually take to build one of these systems myself?

Not a demo. Not a "search five URLs and summarize them" toy. A real system—something that can search broadly, dig deeper, handle messy sources, survive failures, manage costs, keep citations intact, and produce something you would actually trust.

From 30,000 feet, it sounds straightforward.

Up close, it is absolutely not.

Want to first see what the generated reports actually look like? Jump to Sample Reports →

It is easy to assume "deep research" is just:

  1. Google a thing.
  2. Open a few tabs.
  3. Ask an LLM to summarize.
  4. Ship it.

That works until you try it twice. The third time, you discover the secret final boss of the internet: reality. Reality comes with popups, PDFs, infinite scroll, "accept cookies", rate limits, paywalls, 9MB hero images, and pages that are basically one giant `<div>` pretending to be content.

If the goal is to build a real deep-research agent, the system has to start simple and then grow all the defensive features that only show up after getting burned a few times.

And yes: cost matters. If raw web pages are fed into an LLM like an all-you-can-eat buffet, the wallet does a disappearing act.


The "Hello World" Research Agent (It’s Cute, But It Lies)

Let’s start with the obvious pipeline:

  1. Take a user query.
  2. Use web search to get a handful of URLs.
  3. Fetch each URL.
  4. Dump all the text into an LLM prompt.
  5. Ask for a summary.

In pseudocode:

```js
// The naive pipeline: one query, a handful of URLs, one giant prompt.
const query = "my topic";
const urls = await search(query);                      // web search API
const pages = await Promise.all(urls.map(fetchText));  // fetch + extract text
const report = await llm(`Summarize this:\n\n${pages.join("\n\n")}`);
```

This is the research-agent equivalent of claiming cooking skills because toast happened once.

Full pipeline this system grows into — from naive summarizer to what it actually needs to be

Problem #1: "A handful of URLs" is a handful of bias

If you only look at 5 results, the report inherits whatever bias the search engine's first page happens to have.

So the first upgrade is: don’t do one query. Do multiple better queries.


Step 1: Turn One Query Into Several Good Queries

Humans are decent at writing search queries. LLMs can be great at it when constrained properly.

So the agent starts by asking an LLM to generate a small set of high-leverage queries. The intent is coverage: several queries that approach the question from different angles, not five rewordings of the same one.

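A minimal sketch of that step, reusing the generic llm() helper from the earlier pseudocode; the prompt wording, query count, and the parseJsonArray() helper are illustrative rather than the exact implementation:

```js
// Ask the model for a handful of complementary search queries.
// Constraining the output to a fenced JSON array keeps it parseable
// (and when it is not, the repair layer below takes over).
async function generateQueries(userQuery, count = 5) {
  const prompt = `You are planning web research for: "${userQuery}".
Return exactly ${count} diverse search queries as a JSON array of strings
inside a \`\`\`json fenced block. Cover different angles, not rephrasings.`;
  const raw = await llm(prompt);
  const queries = parseJsonArray(raw); // best-effort parser, sketched below
  return queries && queries.length > 0 ? queries : [userQuery]; // graceful fallback
}
```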

Of course, the internet immediately tries to sabotage you.

Problem #2: LLMs occasionally return… interpretive JSON

Sometimes you ask for JSON and you get a fenced block wrapped in apologetic prose, Python-style True/None literals, or a trailing comma that JSON.parse refuses to forgive.

This is where you plug in a JSON repair layer. LLMs do not just break JSON one way; they break it in ten slightly annoying ways. A repair utility that can recover fenced JSON, strip surrounding prose, normalize Python literals, and fix low-grade syntax damage is a lot more robust than pretending a plain JSON.parse(...) is enough.

But the repair layer cannot handle every kind of damage. That is why it is treated as a "best effort" step: if a clean list cannot be parsed, the system falls back to the original user query and continues. Research should degrade gracefully, not faceplant because of a curly brace.
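A best-effort parser along those lines might look like this. It is a sketch of the idea rather than the actual repair utility (dedicated libraries such as jsonrepair go further):

```js
// Best-effort extraction of a JSON array from LLM output.
// Returns null instead of throwing, so the caller can fall back
// to the original user query and keep going.
function parseJsonArray(raw) {
  // 1. Prefer the contents of a ```json fenced block, if present.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  let text = fenced ? fenced[1] : raw;

  // 2. Trim surrounding prose down to the outermost [...].
  const start = text.indexOf("[");
  const end = text.lastIndexOf("]");
  if (start === -1 || end === -1) return null;
  text = text.slice(start, end + 1);

  // 3. Normalize common damage: Python literals, trailing commas.
  text = text
    .replace(/\bTrue\b/g, "true")
    .replace(/\bFalse\b/g, "false")
    .replace(/\bNone\b/g, "null")
    .replace(/,\s*([\]}])/g, "$1");

  try {
    const value = JSON.parse(text);
    return Array.isArray(value) ? value : null;
  } catch {
    return null; // "best effort" means exactly that
  }
}
```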


Step 2: Search Is Cheap; Fetching and Reading Is Not

Once we have good queries, we search them (through a hosted search provider) and collect URLs.

Problem #3: Duplicate URLs, tracking params, and the "www." multiverse

Search results are full of near-duplicates: the same page with and without www., with and without a trailing slash, and dressed up in a parade of utm_ tracking parameters.

So we normalize and dedupe aggressively before any fetch or cache lookup. This does three things:

  1. Reduces wasted fetches — no re-crawling the same page via five superficially different URLs.
  2. Prevents citation spam — the report won't cite example.com/article?utm_source=twitter and www.example.com/article/ as separate sources.
  3. Keeps the visited set trustworthy — a hash-map of raw URLs is nearly useless; a hash-map of normalized URLs is an actual dedup layer.

The normalization pipeline we built applies a fixed sequence of cleanups to every URL before any fetch or cache lookup.
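A minimal sketch of what that step can look like, using the standard URL API; the exact rule set shown here is illustrative, not a verbatim copy of the pipeline:

```js
// Collapse superficially different URLs into one canonical form so the
// visited set and the citations both treat them as the same page.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl); // throws on malformed input; callers catch
  url.protocol = "https:";
  url.hostname = url.hostname.toLowerCase().replace(/^www\./, "");
  url.hash = ""; // fragments never change the fetched content
  // Strip tracking parameters, keep everything else.
  for (const key of [...url.searchParams.keys()]) {
    if (/^(utm_|fbclid|gclid|ref$)/i.test(key)) url.searchParams.delete(key);
  }
  // Trailing slashes are cosmetic on most sites.
  url.pathname = url.pathname.replace(/\/+$/, "") || "/";
  return url.toString();
}

// normalizeUrl("http://www.example.com/article/?utm_source=twitter#top")
//   -> "https://example.com/article"
```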

We also maintain "already visited" sets for both queries and URLs. A deep-research agent without memory is just a goldfish with an API key — it'll refetch the same Reuters article six times and call it thoroughness.


Step 3: Fetch Pages… But Don’t Trust Them

Now we try to fetch each URL and extract usable text.

This is the point where naive systems go to die, because "fetch the page" is not one problem. It’s a bag of problems wearing a trench coat.

Problem #4: Not every URL is HTML

Some URLs are PDFs. Some are HTML. Some are "HTML" that is actually an error page politely wearing a 200 OK badge.

So the agent resolves a content handler per URL before committing to a full download.

The content-type sniffing is intentionally cautious: it tries HEAD, and if that’s blocked or unhelpful, it uses a minimal ranged GET (fetching the tiniest sliver possible). This is faster and reduces bandwidth, and it also avoids downloading a whole novel just to learn the cover says "PDF".
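Roughly, that sniffing step looks like this (a sketch using the standard fetch API; plenty of servers ignore Range headers, so the downstream handler still has to cope with full responses):

```js
// Decide which handler (HTML, PDF, ...) should process a URL without
// downloading the whole resource: HEAD first, then a tiny ranged GET.
async function sniffContentType(url) {
  try {
    const head = await fetch(url, { method: "HEAD", redirect: "follow" });
    const type = head.headers.get("content-type");
    if (head.ok && type) return type.split(";")[0].trim();
  } catch {
    // Some servers reject HEAD outright; fall through to the ranged GET.
  }
  const res = await fetch(url, { headers: { Range: "bytes=0-1023" } });
  const type = res.headers.get("content-type");
  if (type) return type.split(";")[0].trim();
  // Last resort: sniff magic bytes from the sliver we already have.
  const bytes = new Uint8Array(await res.arrayBuffer());
  return new TextDecoder().decode(bytes.slice(0, 5)) === "%PDF-"
    ? "application/pdf"
    : "text/html";
}
```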

The handler resolution and fetch fallback path in detail

Problem #5: HTML is massive, noisy, and allergic to being summarized

Even when you get HTML successfully, you can't just feed it to an LLM: it is mostly navigation, scripts, cookie banners, and markup, and the junk costs tokens just like the content does.

So the safer pattern is to convert HTML into clean markdown locally, before any model ever sees it.

This is the cost-saving heart of the system: do the messy cleanup without paying LLM token prices for garbage.
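One way to do that locally, assuming jsdom, @mozilla/readability, and turndown (which may or may not be the exact libraries this system uses):

```js
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

// Strip navigation, scripts, and boilerplate locally, then convert the
// remaining article body to markdown. No LLM tokens are spent on cleanup.
function htmlToMarkdown(html, url) {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article || !article.content) return null; // nothing readable here
  return new TurndownService({ headingStyle: "atx" }).turndown(article.content);
}
```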


Step 4: "But Wait, The Web Is Hostile Now" (Anti-bot Measures)

Let’s pretend you wrote the perfect extractor. You still have a problem:

More and more sites do not want your scripts reading their pages.

You'll see the usual anti-bot arsenal: 403s, CAPTCHAs, JavaScript challenges, and pages that only render for something that looks like a real browser.

Problem #6: Direct fetching fails more often than you'd like

So the fetching strategy should become:

  1. Try a direct origin fetch first (fast, cheapest).
  2. If it fails, try proxy routes (slower, costs money, but often works).
  3. If it still fails, mark the source as unusable and move on.

The key is that proxy usage should be a fallback, not the default. Proxies are like taxis: extremely useful, but you do not want to take one to walk across the street.

Also: not all content types should be treated equally. For example, PDF proxying can be enabled explicitly rather than always-on (because downloading large binaries through proxies is a special kind of fun you didn’t ask for).
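In code, the escalation might look like this; fetchDirect() and fetchViaProxy() are placeholders for whatever fetching and proxy plumbing your stack provides:

```js
// Cheapest path first, paid path second, graceful surrender last.
async function fetchWithFallback(url, { isPdf = false, proxyPdfs = false } = {}) {
  try {
    return await fetchDirect(url); // fast and free
  } catch {
    // Only spend proxy money when it is allowed to help:
    // PDFs go through the proxy only when explicitly enabled.
    if (isPdf && !proxyPdfs) return null;
    try {
      return await fetchViaProxy(url); // slower, costs money, often works
    } catch {
      return null; // soft failure: mark the source unusable and move on
    }
  }
}
```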


Step 5: "Now We Have Content. Still Can’t Just Shove It Into An LLM."

At this point we’ve got markdown for a bunch of sources. We still don’t want to do the naive thing:

```txt
Here are 25 sources of markdown. Please produce a report.
```

That can work… but it’s brittle and expensive, and you lose control over what was extracted versus what was inferred.

Problem #7: You need structure, not vibes

So instead, the safer approach is to ask the LLM to extract structured notes per source rather than a free-form summary.

Two important details make this practical:

  1. The output format is constrained to JSON (inside a fenced block), so it’s parseable.
  2. The code treats "malformed output" as a recoverable error and keeps going.
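Sketching that per-source extraction (the note fields shown are illustrative; the parts that matter are the fenced-JSON constraint and the recoverable failure path):

```js
// Turn one source's markdown into structured, citable notes.
// A malformed response downgrades the source instead of killing the run.
async function extractNotes(source) {
  const prompt = `From the source below, extract findings as JSON inside a
\`\`\`json fenced block, shaped like:
{ "claims": [{ "text": "...", "quote": "..." }], "relatedUrls": ["..."] }
Only include claims supported by the text.

SOURCE (${source.url}):
${source.markdown}`;
  const raw = await llm(prompt);
  const notes = parseJsonObject(raw); // best-effort repair, as in Step 1
  if (!notes) return { url: source.url, claims: [], relatedUrls: [] };
  return { url: source.url, ...notes };
}
```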

This step is the difference between a "summary generator" and a "research agent." Summaries are vibes. Notes are evidence.


Step 6: "We Need More Sources" (Recursive Expansion Without Spiraling)

Now comes the part that makes it "deep research" instead of "search-and-summarize":

From each good source, we extract follow-up leads: related URLs worth fetching and new angles worth searching.


Then we repeat, depth by depth.

Problem #8: Recursion is how you accidentally invent a web crawler

If you expand without guardrails, you'll end up crawling half the internet, burning tokens on fifth-order tangents, and never actually finishing.

So hard caps are needed: a maximum depth, a cap on sources per depth, and a cap on total sources per run.

This creates a controlled exploration: it spreads wide, but it does not spiral into chaos.
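The expansion loop, with the caps and visited set spelled out; the default limits and the fetchAndExtract() helper (standing in for the fetch, markdown, and note-extraction steps above) are illustrative:

```js
// Breadth-first expansion with hard caps, so "deep research" never
// degenerates into an accidental web crawler.
async function explore(seedUrls, { maxDepth = 2, maxPerDepth = 20, maxTotal = 60 } = {}) {
  const visited = new Set();
  const sources = [];
  let frontier = seedUrls.map(normalizeUrl);

  for (let depth = 0; depth <= maxDepth && sources.length < maxTotal; depth++) {
    const batch = frontier.filter((u) => !visited.has(u)).slice(0, maxPerDepth);
    batch.forEach((u) => visited.add(u));

    const results = await Promise.all(batch.map(fetchAndExtract));
    const usable = results.filter(Boolean).slice(0, maxTotal - sources.length);
    sources.push(...usable);

    // Next frontier: related URLs that survived the grounding check
    // described in Problem #8.5 below.
    frontier = usable.flatMap((s) => s.relatedUrls);
  }
  return sources;
}
```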

Problem #8.5: LLMs invent "related URLs" more often than you want

Even after per-source extraction is working, one ugly problem remains: the model will sometimes suggest URLs that look plausible but are brittle, stale, malformed, or simply hallucinated.

That creates a bad downstream loop: wasted fetches on dead links, a visited set polluted with junk, and citations that point nowhere.

So related-URL expansion needs another guardrail: only follow URLs that are actually present in the extracted source content, or a deterministic canonical form of one that is.

In practice, that means checking every suggested URL against the URLs that actually appear in the source's extracted text, after pushing both sides through the same canonicalization.

The canonicalization step does real work here — a source might link to an arXiv PDF, a dx.doi.org redirect, and an ar5iv.org render of the same paper. Without it, all three get crawled and cited separately. With it, they collapse to one arxiv.org/abs/ URL and the visited-set correctly suppresses the duplicates.

The rule is simple: if it wasn't in the source, don't follow it. The model's URL suggestions are a starting point for query expansion, not a crawl frontier.
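A sketch of that guardrail; safeCanonicalize() shows a single example collapse rule (arXiv PDFs to abs pages), while a real pipeline also folds in things like DOI redirects and ar5iv renders as described above:

```js
// Follow a model-suggested URL only if it (or a canonical form of it)
// actually appears in the extracted source content.
function groundedRelatedUrls(suggestedUrls, sourceMarkdown, visited) {
  const inSource = new Set(
    (sourceMarkdown.match(/https?:\/\/[^\s)\]">]+/g) || [])
      .map(safeCanonicalize)
      .filter(Boolean)
  );
  return suggestedUrls
    .map(safeCanonicalize)
    .filter((u) => u && inSource.has(u) && !visited.has(u));
}

function safeCanonicalize(rawUrl) {
  try {
    // Example collapse rule: arXiv PDF links -> the abs page.
    const collapsed = rawUrl.replace(
      /arxiv\.org\/pdf\/([0-9.]+)(v\d+)?(\.pdf)?/i,
      "arxiv.org/abs/$1"
    );
    return normalizeUrl(collapsed);
  } catch {
    return null; // malformed or hallucinated suggestion: drop it
  }
}
```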


Step 7: "Images Exist, and Sometimes They’re the Whole Point"

The web loves to put the important content in charts, diagrams, screenshots, and tables that are really just images.


If you ignore images, you’ll miss exactly the information someone cared about.

Problem #9: You can't send 50 images to a model and pretend it's fine

So image understanding should be bounded: a hard cap on how many images per source (and per run) get sent to a vision-capable model, and only images that look like they actually carry information.

This is a useful compromise: most of the multimodal value, at a small and predictable cost.

Also, images are fetched directly (no proxy routing for images), because image fetching is a slippery slope that can turn your networking layer into a thriller novel.
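A small sketch of that bounding logic; the per-source cap and the "looks informative" filter here are placeholders for whatever heuristics you prefer:

```js
// Pick at most a few promising images per source; everything else is skipped.
function selectImages(imageUrls, { perSource = 3, budgetLeft = 10 } = {}) {
  const informative = imageUrls.filter(
    (u) =>
      /\.(png|jpe?g|webp)(\?|$)/i.test(u) &&      // actual raster content
      !/(logo|icon|avatar|sprite|badge)/i.test(u) // skip obvious page chrome
  );
  return informative.slice(0, Math.min(perSource, budgetLeft));
}
```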


Step 8: Make the Final Report… Without Losing Citation Integrity

After per-source notes have been collected, the next challenge is synthesis. This is where a lot of systems quietly become unreliable.


Problem #10: If everything is sent to the model at once, it forgets things

Once the source set gets large, a single giant "write the report" prompt starts causing familiar failures: details get dropped, weaker sources get ignored entirely, and citations start drifting away from the claims they are supposed to support.

So instead of doing one monolithic final-report pass, it is safer to use a two-stage synthesis flow:

  1. Split the collected sources into N chunks.
  2. Generate a detailed intermediate report for each chunk, keeping the original source citation numbers intact.
  3. Merge those intermediate reports into the actual final report.

This does two useful things: each chunk is small enough for the model to stay detailed and faithful to its sources, and no single prompt has to hold the entire research run at once.

The two-stage synthesis approach that keeps citation numbers stable

Problem #11: Chunking is only useful if citation integrity survives it

Chunked synthesis sounds nice until citations drift.

If chunk 2 renumbers its local sources from [1] to [8], or if the final merge invents references that do not map cleanly back to the original source list, the whole thing becomes unreliable.

So the safer rule is: keep the original global source numbers all the way through the chunk reports and into the final report. Then rebuild the final references section deterministically from the full source list.
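Concretely, the chunked pass can carry the global numbers like this; the chunk size, prompts, and note shape are illustrative:

```js
// Two-stage synthesis: per-chunk intermediate reports, then a merge pass.
// Sources keep their global [n] numbers the whole way through.
async function synthesize(query, sources, chunkSize = 8) {
  const numbered = sources.map((s, i) => ({ ...s, ref: i + 1 })); // global numbering

  const chunkReports = [];
  for (let i = 0; i < numbered.length; i += chunkSize) {
    const chunk = numbered.slice(i, i + chunkSize);
    const block = chunk
      .map((s) => `[${s.ref}] ${s.url}\n${s.claims.map((c) => `- ${c.text}`).join("\n")}`)
      .join("\n\n");
    chunkReports.push(
      await llm(`Write a detailed intermediate report on "${query}".
Cite sources ONLY as [n], using the numbers given below. Do not renumber.

${block}`)
    );
  }

  const body = await llm(`Merge these intermediate reports into one final report on "${query}".
Keep every [n] citation exactly as written.

${chunkReports.join("\n\n---\n\n")}`);

  // The references section is rebuilt deterministically from the source list,
  // never trusted from the model.
  const references = numbered.map((s) => `[${s.ref}] ${s.url}`).join("\n");
  return `${body}\n\n## References\n\n${references}`;
}
```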

That means the post-processing step still matters: every [n] in the final text has to resolve to a real entry in the global source list, and the references section is regenerated from that list rather than trusted from the model.

This is unglamorous engineering, which is another way of saying: it is the part that makes the output feel like a product.


Step 9: Concurrency, Retries, and "Please Don’t Panic"

Real web research is mostly waiting for networks, so performance and resilience matter: fetch in parallel, retry transient failures, and never let one bad URL take down the whole run.


The spirit here is: be parallel where the network is the bottleneck, be patient where servers are flaky, and be indifferent to individual failures.

Some failures are treated as "soft" (skip source, continue). A few are "hard" (for example: you can’t generate a report if you extracted zero usable sources).
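A minimal version of that machinery, with a small worker pool and a retry wrapper (the concurrency limit and backoff values are illustrative):

```js
// Run tasks with bounded concurrency; one slow or broken URL never blocks the rest.
async function mapWithConcurrency(items, worker, limit = 8) {
  const results = new Array(items.length);
  let next = 0;
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, async () => {
      while (next < items.length) {
        const i = next++;
        results[i] = await worker(items[i]).catch(() => null); // soft failure: skip it
      }
    })
  );
  return results;
}

// Retry transient failures with exponential backoff before giving up.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // hard failure: let the caller decide
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
}
```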


A Quick Tour of "Stuck Moments" (And How We Unstuck Them)

If a deep-research build had a subtitle, it would be:

"How many edge cases can fit inside a simple idea?"

Here’s the sequence of pain, in order:

  1. "Let’s summarize a few URLs."
  2. "But the URLs are too few. We need broader coverage."
  3. "But now we have lots of URLs. Some are duplicates and tracking junk."
  4. "But now the HTML is massive and mostly noise."
  5. "But some URLs are PDFs and behave differently."
  6. "But some sites block direct fetching."
  7. "But proxies cost money; we can’t use them everywhere."
  8. "But the important information is sometimes inside images."
  9. "But image understanding can also cost money; it must be bounded."
  10. "But now citations and references need to be consistent and clickable."
  11. "But the model is losing detail because the final synthesis prompt is too big."
  12. "But chunking the synthesis will break things unless citation numbers stay stable."

A deep-research agent is not "one clever prompt." It is a pipeline of small, defensive decisions that keep the system stable, affordable, and honest.

Sample Reports

Below are outputs from a few runs of this deep-research agent:

| Report | Sources used | Max depth | Time (mm:ss) | Approx cost (USD) |
| --- | --- | --- | --- | --- |
| What Happens If Compute Becomes a Sovereign Reserve Asset? | 38 | 2 | 22:40 | 0.06 |
| The Financialization of Compute Futures | 44 | 2 | 24:15 | 0.08 |
| The Coming "Power Wars" Between Humans and Datacenters: Rising Costs, Grid Strain, and the Battle for Electricity | 54 | 2 | 30:32 | 0.09 |

Note: these pages were generated from prepared markdown files so they could be showcased on the web.

The exact numbers can vary wildly based on the model, the search provider, the depth and source caps, and how cooperative the fetched sites happen to be.


What Is Still Missing?

Even with all of this working, there are still several obvious ways to make the system better. A deep-research agent is never really “done”; it just becomes more useful as its memory, verification, and planning layers improve.

Future Improvements

The most interesting next step is probably not “make the model smarter.” It is making the surrounding system more disciplined: better source selection, better verification, better memory, and better ways to notice when the answer is uncertain.

In other words, the agent should become less like a fast summarizer and more like a careful researcher with a notebook, a checklist, and a healthy fear of being confidently wrong.


Closing Thoughts: It Looked Simple, Then It Wasn’t

Building a system like this is a reminder that "research" is a system problem:

coverage vs depth
cost vs quality
speed vs reliability
automation vs truthfulness

From above, it looks like "search + summarize." Up close, it’s:

query strategy
dedupe and normalization
content-type handling
extraction pipelines
proxy fallbacks
bounded multimodal enrichment
structured note-taking
recursive expansion with caps
source-grounded URL expansion
chunked intermediate synthesis
post-processing for citation hygiene

This started as a fun exercise — mostly curiosity, partly frustration with existing rate limits, and a simple question: what would it actually take to build one of these myself?

More than anything, though, it was genuinely rewarding. There is something deeply satisfying about taking an idea that looks simple from the outside, wrestling with all its ugly edge cases, and slowly turning it into something that actually works.

Source code will be published on GitHub in a few days. It needs a bit of refactoring, cleanup, removal of many temp files, and hopefully a few more improvements (mostly around claim-level citations and contradiction detection).

Thanks for Reading :)