Firecrawl Research Index: the fastest way to give your AI agent access to every arXiv paper

If you're building a research agent and you need it to actually find the right paper, Firecrawl Research Index is the one to wire in first. It indexes 3M+ arXiv papers plus the GitHub artifacts behind them, and on the arXivQA benchmark its recall tops the next-best provider's by 18% in relative terms at comparable cost (per Firecrawl's launch post). You call it with one HTTP request. The other options are not bad. They are just more work for less recall, and recall is the whole game when your agent reads the results.

We say "first" deliberately. There are four sane ways to give an agent access to AI/ML literature in 2026, and three of them make you do plumbing that the Research Index already did. Here is the clear-eyed map of the four, the numbers behind the claim, the actual code to call it, and which tier you should be on.

The four ways to give an agent paper search

The job is narrower than "search the web." Your agent needs to take a fuzzy query like "what optimizer did the recent muon follow-up paper use," pull the right arXiv IDs, and ideally read the passage that confirms the answer before it builds on anything. Web search hands you blog posts. A research index hands you papers, ranked, with the full text one call away.

Four candidates show up when teams scope this out:

Firecrawl Research Index (/search/research) is a purpose-built paper index with a research-specific toolset: search papers, inspect metadata, read passages, expand to related work, and search GitHub history.
Semantic Scholar API is the long-standing academic graph. Huge corpus across every field, a free public API, and the citation graph that a lot of tooling already leans on.
Raw arXiv (the RSS feeds plus the export API) is the build-it-yourself route. Free, official, and yours to host. You own the index, the embeddings, and the refresh job.
Perplexity Deep Research API is the answer-engine route. You ask a question, it runs a multi-step web search and hands back a synthesized answer with citations.

They are not really four versions of the same thing. Two of them (Firecrawl, Semantic Scholar) give your agent a search tool it drives. One (raw arXiv) gives you raw material to build that tool. One (Perplexity) takes the search loop away from your agent entirely and returns prose. The right pick depends on whether you want a retrieval tool, a corpus, or an answer.

The comparison, with the cost columns that matter

Here is the decision table. Costs are pulled from each vendor's current pricing; where a provider does not publish a clean per-call number, the cell says so rather than inventing one.

	Firecrawl Research Index	Semantic Scholar API	Raw arXiv (DIY)	Perplexity Deep Research API
What it is	Purpose-built paper + code index with a research toolset	Academic graph + citation API	Official feeds you index yourself	Answer engine over web search
Corpus	3M+ arXiv papers + GitHub artifacts, refreshed daily (per Firecrawl)	~200M+ papers, all fields	All of arXiv, you host it	Open web, not a paper index
arXivQA recall	53.3% (per Firecrawl's launch post)	Not benchmarked on arXivQA	Depends entirely on your retrieval build	Not a paper-retrieval tool
Reads full-text passages	Yes, one call per paper	Abstracts via API; full text varies	Whatever you parse and store	Returns synthesized prose, not passages
GitHub code search	Yes, issues / PRs / READMEs	No	No	Web-wide, not repo-structured
Setup effort	One API call	API client + ranking layer	Indexing pipeline + embeddings + refresh	One API call
Cost model	Credit-based: Search is 2 credits per 10 results (per Firecrawl's pricing page); Standard 100k credits/mo at $83/mo billed yearly	Free public API with rate limits; higher-throughput keys on request	Free corpus; you pay your own infra	Per-request pricing, query-dependent
Best when	You want top recall and code, with zero index maintenance	You need cross-field coverage and the citation graph	You need full control and have engineers to spare	You want a finished answer, not a tool

A few of those cells deserve a sentence instead of a box.

On recall, the headline holds up against the source. On arXivQA the index hits 53.3% recall at $0.32 per task, against 45.4% for the next-best provider at similar cost (per Firecrawl's launch post). That is a 7.9-point gap, and the "18% above" framing describes the relative lead (53.3 / 45.4 is about 1.17), not percentage points. It also posts 0.750 MRR, meaning the correct paper tends to land in the top one or two results (per the changelog). Higher MRR is the part agent builders should care about: fewer wasted tokens before the agent finds the paper it actually needed.
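
To make the MRR figure concrete, here is a minimal sketch of how mean reciprocal rank is usually computed. The function names and the toy queries are hypothetical, and the source does not say exactly how Firecrawl scored MRR, so treat this as the textbook definition rather than their harness.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, paper_id in enumerate(ranked_ids, start=1):
        if paper_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical toy data: the IDs are placeholders, not real papers.
runs = [
    (["a", "b", "c"], {"a"}),  # first hit at rank 1 -> 1.0
    (["d", "e", "f"], {"d"}),  # first hit at rank 1 -> 1.0
    (["g", "h", "i"], {"h"}),  # first hit at rank 2 -> 0.5
    (["j", "k", "l"], {"k"}),  # first hit at rank 2 -> 0.5
]
mrr = mean_reciprocal_rank(runs)
print(mrr)  # 0.75
```

In this toy case an MRR of 0.75 comes from an even split between rank-1 and rank-2 hits, which is one way to picture "top one or two results."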

On cost, nobody publishes a clean "dollars per 1,000 paper searches" number, so do the credit math yourself. Search bills at 2 credits per 10 results (per Firecrawl's pricing page). The Standard plan is 100,000 credits a month at $83/mo billed yearly, and Growth is 500,000 credits at $333/mo. Semantic Scholar's API has a free tier with rate limits and offers higher-throughput keys on request, which is genuinely hard to beat on raw cost if your volume is modest and you can live with the limits. Raw arXiv is free as a corpus, but "free" stops at the download. You pay in the embedding pipeline, the vector store, and the daily refresh job that keeps you from missing this week's papers.
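
The credit math is short enough to sketch. This version uses the plan prices quoted above and makes three assumptions the pricing text does not confirm: that results are billed in blocks of 10 rounded up, that every credit goes to search, and that you use the whole monthly allowance. The function names are illustrative.

```python
import math

CREDITS_PER_BLOCK = 2    # per Firecrawl's pricing page: 2 credits per 10 results
RESULTS_PER_BLOCK = 10


def search_credits(k):
    """Credits for one search returning k results.

    Assumption: billing rounds up to the next block of 10 results.
    """
    blocks = max(1, math.ceil(k / RESULTS_PER_BLOCK))
    return CREDITS_PER_BLOCK * blocks


def usd_per_1000_searches(monthly_usd, monthly_credits, k=10):
    """Effective price per 1,000 searches if every credit is spent on search."""
    searches_per_month = monthly_credits // search_credits(k)
    return 1000 * monthly_usd / searches_per_month


standard = usd_per_1000_searches(83, 100_000)   # Standard plan, billed yearly
growth = usd_per_1000_searches(333, 500_000)    # Growth plan, billed yearly
print(f"Standard: ${standard:.2f} per 1,000 ten-result searches")  # $1.66
print(f"Growth:   ${growth:.2f} per 1,000 ten-result searches")    # $1.33
```

Under those assumptions, asking for 20 results per search doubles the per-search cost, so the `k` you pass is the main lever on your bill.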

On setup, this is the real spread. Firecrawl and Perplexity are both one call. Semantic Scholar is a client plus whatever re-ranking you bolt on to make abstract search behave. Raw arXiv is a project.

How the benchmark was actually measured

The 53.3% number is worth a paragraph because recall benchmarks are easy to game and this one is reasonably clear about its method. Firecrawl ran roughly 200 queries from alphaXiv's arXivQA set, each labeled with up to 10 ground-truth arXiv IDs, and let Opus 4.8 drive each provider through its MCP and SKILL.md, then scored the papers each one surfaced against the labels (per Firecrawl's launch post). The arXivQA set itself comes from alphaXiv's work on training retrieval agents for arXiv search, which maps real AI-research queries to arXiv IDs (per alphaXiv).

Two things follow. First, 53.3% recall is not "it finds the right paper half the time." It is "across queries with up to ten correct papers each, it surfaces a little over half of the labeled papers," a harder bar than top-1 accuracy. Second, the test ran agent-in-the-loop, not as a single keyword query, which matches how you would actually use it.
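
A small sketch shows why recall over a labeled set reads differently from "right half the time." It assumes recall is computed per query and then averaged; the launch post does not spell out the averaging, and the IDs below are hypothetical placeholders.

```python
def recall(surfaced_ids, labeled_ids):
    """Fraction of the labeled papers that appear anywhere in the surfaced set."""
    labeled = set(labeled_ids)
    if not labeled:
        return 0.0
    return len(labeled & set(surfaced_ids)) / len(labeled)


def mean_recall(tasks):
    """Average per-query recall over (surfaced_ids, labeled_ids) pairs."""
    if not tasks:
        return 0.0
    return sum(recall(surfaced, labeled) for surfaced, labeled in tasks) / len(tasks)


# Hypothetical toy tasks: each pairs what the agent surfaced with the labels.
tasks = [
    (["p1", "p2", "x1"], ["p1", "p2", "p3", "p4"]),  # 2 of 4 labeled -> 0.5
    (["q1", "x2"], ["q1"]),                          # 1 of 1 labeled -> 1.0
    (["x3", "x4"], ["r1", "r2"]),                    # 0 of 2 labeled -> 0.0
]
print(mean_recall(tasks))  # 0.5
```

Here the average is 0.5 even though one query found everything and another found nothing, which is why the number is worth checking per query on your own workload.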

The catch worth naming: it is a benchmark on AI/ML queries specifically. If your agent researches molecular biology or macroeconomics, arXivQA tells you nothing, and the index's arXiv-plus-CS-GitHub focus is a mismatch. This is a tool for agents working the AI/ML frontier, not a general academic search.

Wire it into an agent in one call

Here is the part that earns the "fastest" claim. You do not stand up an index. You make a GET request. The endpoint returns ranked papers with a canonical paperId, the preferred primaryId, title, abstract, and a relevance score (per the docs).

This Python snippet runs a research search and parses the first result. It is the shape you would drop into a tool function your agent calls.

import os
import requests

API = "https://api.firecrawl.dev/v2/search/research/papers"

# An API key is optional to start; add it for higher rate limits.
headers = {}
if os.getenv("FIRECRAWL_API_KEY"):
    headers["Authorization"] = f"Bearer {os.environ['FIRECRAWL_API_KEY']}"

resp = requests.get(
    API,
    params={"query": "muon optimizer training stability", "k": 20},
    headers=headers,
    timeout=30,
)
resp.raise_for_status()

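papers = resp.json().get("data", [])
if not papers:
    # An empty result list would otherwise raise IndexError on papers[0].
    raise SystemExit("No papers returned for this query.")

top = papers[0]  # results come back ranked, so the first entry is the best match
print(top["title"])
print(top["primaryId"])     # e.g. arxiv:2502.xxxxx
print(round(top["score"], 3))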

From there, the natural next call is to read the passage that confirms the paper is relevant before your agent cites it. Same paper path, plus a query:

curl -s "https://api.firecrawl.dev/v2/search/research/papers/arxiv:1706.03762?query=what%20is%20the%20attention%20mechanism&k=4"

That two-step (search, then verify the passage) is the whole reason a purpose-built index beats raw web search for an agent. The agent does not have to trust a title. It can check the claim against the full text in the same loop. If it wants the implementation, there is a GitHub history endpoint (/search/research/github) that searches issues, merged PRs, and READMEs for the code behind the paper. An agent tuning a training run overnight can pull an optimizer from a recent paper and a stability fix from a related GitHub issue in two calls.
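
For the verify step, an agent tool needs to turn a primaryId and a question into the passage URL. The sketch below only builds that URL, matching the curl call above. The helper name is hypothetical, and it assumes nothing about the response schema, which the text here does not describe.

```python
from urllib.parse import quote, urlencode

BASE = "https://api.firecrawl.dev/v2/search/research/papers"


def passage_url(primary_id, question, k=4):
    """Build the passage-read URL for one paper (hypothetical helper)."""
    # Keep the colon in IDs like "arxiv:1706.03762"; escape anything else unsafe.
    path_id = quote(primary_id, safe=":")
    # quote_via=quote encodes spaces as %20, as in the curl example.
    params = urlencode({"query": question, "k": k}, quote_via=quote)
    return f"{BASE}/{path_id}?{params}"


url = passage_url("arxiv:1706.03762", "what is the attention mechanism")
print(url)
```

Pass the resulting URL to the same `requests.get` call used for search, with the same optional Authorization header.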

The index is available now via the API, CLI, MCP, and SDKs, and it plugs into Codex, Claude Code, and Grok Build (per Firecrawl's launch post). It shipped on June 17, 2026, with the changelog dating it June 16 (per the changelog). For a Claude Code or n8n loop, the cleanest setup is the dedicated research skill, installed with npx skills add firecrawl/skills@firecrawl-research-index, which wires the toolset into the harness for you.

Who should use which tier

For a solo dev or a small team running an autonomous research agent on the AI/ML frontier, start on Firecrawl's free credits to confirm the recall difference on your own queries. Move to the Standard plan once your agent runs enough searches that the daily refresh and top recall justify the $83/mo billed yearly (per Firecrawl's pricing page). That is the sweet spot: at 2 credits per 10-result search, 100,000 credits covers about 50,000 searches a month, and you skip building an index entirely.

If you are running a high-volume research platform where paper search is the product, the Growth plan (500,000 credits at $333/mo billed yearly, per the pricing page) is the one, and the GitHub history endpoint stops being a nice-to-have once your users expect code alongside papers.

If your budget is zero and you have engineering time to spare, raw arXiv plus your own embeddings is a real option. Just be clear-eyed that you are signing up to maintain a retrieval system, and your recall will be whatever your build manages. If you need broad cross-field academic coverage rather than AI/ML depth, Semantic Scholar's free API is the better corpus, with the citation graph as a bonus.

For most teams building an AI/ML research agent in 2026, the math is simple. The recall lead is documented, the setup is a single call, and you are not on the hook for an index. Grab a key and run your own queries against it before you commit to building anything yourself. If the recall holds on your workload, the build-vs-buy question answers itself.