
Building an AI Research Agent with Firecrawl

Karan · Software Developer · 8 min read

The challenge facing modern AI applications isn't the intelligence itself. It's feeding that intelligence with clean, structured data. Anyone who has tried scraping the web at scale knows the pain: navigating dynamic JavaScript rendering, managing rate limits, dealing with CAPTCHA walls, and parsing inconsistent HTML structures. These aren't trivial problems.

This is where purpose-built tooling makes the difference. In this guide, we'll build a research agent that autonomously searches the web, extracts structured content, and synthesizes findings using an LLM. Think of it as a proof-of-concept for more ambitious systems like market intelligence dashboards, competitive analysis tools, or automated research assistants.

The Architecture

Our agent follows a three-phase pipeline:

Discovery - Search for relevant sources based on a query
Extraction - Convert raw web pages into structured, LLM-ready text
Synthesis - Aggregate and analyze the extracted data

This pattern scales well. The same core loop powers everything from simple Q&A bots to sophisticated autonomous research systems.

What You'll Need

  • Node.js (v18 or later recommended)
  • Firecrawl API key from firecrawl.dev
  • OpenAI API key for the analysis layer

Firecrawl handles the heavy lifting of web scraping - JavaScript rendering, proxy rotation, and anti-bot evasion - so you can focus on building features rather than fighting infrastructure.

Project Setup

Initialize your project and install dependencies:

mkdir research-agent
cd research-agent
npm init -y
npm install @mendable/firecrawl-js openai dotenv

Create a .env file for your credentials:

FIRECRAWL_API_KEY=fc-YOUR_KEY_HERE
OPENAI_API_KEY=sk-YOUR_KEY_HERE

Security note: Never commit API keys to version control. Use a secrets manager for production deployments.

Building the Agent

Phase 1: Discovery Layer

Rather than manually curating URLs, we'll use Firecrawl's search capability to dynamically discover relevant sources. This makes the agent adaptable to any query without hardcoded assumptions.

import Firecrawl from '@mendable/firecrawl-js';
import dotenv from 'dotenv';

dotenv.config();

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

async function searchTopic(query, maxResults = 3) {
  console.log(`🔍 Searching for: "${query}"...`);

  const searchResult = await firecrawl.search(query, { limit: maxResults });

  if (!searchResult.success || !searchResult.data.length) {
    throw new Error(`No results found for query: ${query}`);
  }

  const urls = searchResult.data.map(item => item.url);
  console.log(`Found ${urls.length} sources`);

  return urls;
}

The search API returns ranked results, similar to what you'd see in a traditional search engine. For production use, consider implementing relevance filtering or diversity checks to avoid redundant sources.
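One lightweight diversity check is to keep at most one result per hostname before scraping, so the report doesn't end up synthesized from three pages of the same site. Here's a sketch; the helper name and defaults are illustrative, not part of Firecrawl's API:

```javascript
// Keep at most one URL per hostname so sources stay diverse.
// A production filter might also apply relevance scoring.
function diversifyByHost(urls, maxResults = 3) {
  const seen = new Set();
  const picked = [];
  for (const url of urls) {
    const host = new URL(url).hostname;
    if (seen.has(host)) continue; // skip duplicate domains
    seen.add(host);
    picked.push(url);
    if (picked.length >= maxResults) break;
  }
  return picked;
}
```

You could call this on the output of `searchTopic` before handing the URLs to the extraction layer.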

Phase 2: Extraction Layer

Web scraping is deceptively complex. Modern sites render content via JavaScript, employ anti-bot measures, and vary wildly in structure. Firecrawl abstracts this complexity, returning clean Markdown that preserves semantic structure without HTML noise.

async function scrapeContent(urls) {
  console.log(`🕷️ Scraping ${urls.length} pages...`);

  const scrapePromises = urls.map(async (url) => {
    try {
      const result = await firecrawl.scrape(url, {
        formats: ['markdown']
      });

      if (result.success && result.markdown) {
        return {
          url,
          content: result.markdown,
          success: true
        };
      }
    } catch (err) {
      console.error(`Failed to scrape ${url}: ${err.message}`);
    }

    return { url, success: false };
  });

  const results = await Promise.all(scrapePromises);
  const successful = results.filter(r => r.success);

  console.log(`✓ Successfully scraped ${successful.length}/${urls.length} pages`);

  return successful.map(r =>
    `SOURCE: ${r.url}\n\n${r.content}\n\n---\n`
  ).join('\n');
}

Why Markdown? LLMs are trained on vast amounts of Markdown text from documentation, GitHub, and technical writing. The format preserves semantic hierarchy - headers, lists, code blocks - while remaining token-efficient, which matters when you're working with context window limits.

Phase 3: Synthesis Layer

With structured data in hand, we can now leverage an LLM to synthesize findings. This is where the magic happens: transforming disparate sources into coherent, actionable insights.

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function generateReport(topic, context) {
  console.log('🧠 Analyzing and synthesizing data...');

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0.3, // Lower temperature for factual accuracy
    messages: [
      {
        role: 'system',
        content: 'You are a research analyst. Synthesize information from multiple sources into a clear, well-structured technical briefing. Cite sources when making specific claims.'
      },
      {
        role: 'user',
        content: `Research Topic: ${topic}\n\nGathered Information:\n\n${context}\n\nProvide a comprehensive summary with key findings and insights.`
      }
    ],
    max_tokens: 1500
  });

  return completion.choices[0].message.content;
}

Model selection matters here. We're using gpt-4o-mini for cost efficiency, but for production research tools, consider gpt-4o for improved reasoning and source citation accuracy.

Putting It All Together

Here's the complete agent orchestrating all three phases:

import Firecrawl from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import dotenv from 'dotenv';

dotenv.config();

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function researchAgent(topic, maxSources = 3) {
  const startTime = Date.now();

  try {
    // Phase 1: Discovery
    console.log(`\n🔍 Starting research on: "${topic}"\n`);
    const urls = await searchTopic(topic, maxSources);
    console.log(`Sources identified:\n${urls.join('\n')}\n`);

    // Phase 2: Extraction
    console.log('🕷️ Extracting content from sources...\n');
    const context = await scrapeContent(urls);

    if (!context.trim()) {
      throw new Error('Failed to extract content from any sources');
    }

    // Phase 3: Synthesis
    console.log('🧠 Generating research report...\n');
    const report = await generateReport(topic, context);

    // Output results
    const duration = ((Date.now() - startTime) / 1000).toFixed(2);
    console.log('═'.repeat(60));
    console.log('RESEARCH REPORT');
    console.log('═'.repeat(60));
    console.log(`\n${report}\n`);
    console.log('═'.repeat(60));
    console.log(`✓ Research completed in ${duration}s\n`);

  } catch (error) {
    console.error(`\n❌ Research failed: ${error.message}`);
    process.exit(1);
  }
}

async function searchTopic(query, maxResults = 3) {
  const searchResult = await firecrawl.search(query, { limit: maxResults });

  if (!searchResult.success || !searchResult.data.length) {
    throw new Error(`No results found for query: ${query}`);
  }

  return searchResult.data.map(item => item.url);
}

async function scrapeContent(urls) {
  const scrapePromises = urls.map(async (url) => {
    try {
      const result = await firecrawl.scrape(url, {
        formats: ['markdown']
      });

      if (result.success && result.markdown) {
        return `SOURCE: ${url}\n\n${result.markdown}\n\n---\n`;
      }
    } catch (err) {
      console.error(`  ⚠️ Failed to scrape ${url}`);
    }
    return null;
  });

  const results = await Promise.all(scrapePromises);
  const successful = results.filter(r => r !== null);

  console.log(`  ✓ Successfully scraped ${successful.length}/${urls.length} pages\n`);

  return successful.join('\n');
}

async function generateReport(topic, context) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0.3,
    messages: [
      {
        role: 'system',
        content: 'You are a research analyst. Synthesize information from multiple sources into a clear, well-structured technical briefing. Cite sources when making specific claims.'
      },
      {
        role: 'user',
        content: `Research Topic: ${topic}\n\nGathered Information:\n\n${context}\n\nProvide a comprehensive summary with key findings and insights.`
      }
    ],
    max_tokens: 1500
  });

  return completion.choices[0].message.content;
}

// Execute the agent
const topic = process.argv[2] || 'Latest developments in WebAssembly';
researchAgent(topic);

Running the Agent

Execute from your terminal:

node agent.js "What is retrieval-augmented generation?"

The agent will discover relevant sources, extract their content, and produce a synthesized report - all in one command.

Taking It to Production

This implementation is a solid foundation, but production systems need additional work. Here's what to consider:

Error Handling - Implement exponential backoff for rate limits and transient failures. Consider circuit breakers for consistently failing sources.
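As a starting point, a generic retry wrapper with exponential backoff and jitter could wrap any of the agent's API calls; the helper name and default values below are illustrative:

```javascript
// Retry a flaky async operation with exponential backoff plus jitter.
// `fn` can be any async call (a Firecrawl scrape, an OpenAI completion, ...).
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // budget exhausted, surface the error
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look like `await withRetry(() => firecrawl.scrape(url, { formats: ['markdown'] }))`.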

Observability - Add structured logging (Winston, Pino) and metrics to track success rates, latency, and token usage.

Cost Management - Monitor API usage closely. At scale, LLM calls dominate costs. Consider caching frequently requested topics or implementing tiered analysis: quick summaries vs. deep dives.
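A minimal sketch of topic caching, using an in-memory TTL map (a production system would more likely reach for Redis or similar; the factory name and TTL here are illustrative):

```javascript
// Tiny in-memory cache for research reports, keyed by topic.
// Entries expire after `ttlMs` to avoid serving stale research.
function createReportCache(ttlMs = 60 * 60 * 1000) {
  const store = new Map();
  return {
    get(topic) {
      const entry = store.get(topic);
      if (!entry || Date.now() - entry.at > ttlMs) return null; // miss or expired
      return entry.report;
    },
    set(topic, report) {
      store.set(topic, { report, at: Date.now() });
    }
  };
}
```

Checking the cache before calling `researchAgent` turns repeated queries on hot topics into free lookups.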

Content Quality - Not all scraped content is equally useful. Implement filtering based on content length, language detection, or relevance scoring before sending to the LLM.
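One simple pre-LLM filter drops pages that are too short to be informative or that consist mostly of Markdown links (typically navigation or index pages). The thresholds below are illustrative starting points, not tuned values:

```javascript
// Filter scraped pages before sending them to the LLM.
// `pages` is an array of { url, content } objects with Markdown content.
function filterUseful(pages, { minChars = 500 } = {}) {
  return pages.filter(({ content }) => {
    if (!content || content.length < minChars) return false;
    // Count characters inside Markdown links; link-heavy pages are usually nav.
    const linkChars = (content.match(/\[[^\]]*\]\([^)]*\)/g) || []).join('').length;
    return linkChars / content.length < 0.5;
  });
}
```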

Rate Limiting - Both Firecrawl and OpenAI have rate limits. Implement request queuing and concurrency controls for batch operations.
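One dependency-free way to cap concurrency is a small worker-pool helper (a hand-rolled stand-in for libraries like p-limit):

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at once,
// preserving the order of results.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: single-threaded JS)
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}
```

In `scrapeContent`, replacing the unbounded `urls.map(...)` + `Promise.all` with `mapWithConcurrency(urls, 2, scrapeOne)` keeps you under Firecrawl's rate limits on larger batches.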

Context Window Management - Large scrapes can exceed LLM context windows. Implement chunking strategies or use map-reduce patterns for processing extensive content.
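A naive chunker using the rough four-characters-per-token heuristic might look like this; each chunk can be summarized separately and the summaries merged in a final pass (the map-reduce pattern):

```javascript
// Split gathered context into chunks that fit a rough token budget.
// ~4 characters per token is a common heuristic for English text.
function chunkContext(text, maxTokens = 3000) {
  const maxChars = maxTokens * 4;
  const chunks = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```

A smarter version would split on source boundaries (the `---` separators) rather than mid-document, but the budget arithmetic is the same.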

WebSocket Support - For long-running crawls, Firecrawl supports WebSocket connections that provide real-time updates. This is useful when you need to process documents as they arrive rather than waiting for the entire crawl to complete:

const { id } = await firecrawl.startCrawl('https://docs.example.com', {
  limit: 50
});

const watcher = firecrawl.watcher(id, {
  kind: 'crawl',
  pollInterval: 2
});

watcher.on('document', (doc) => {
  // Process each document as it arrives
  console.log('New document:', doc.url);
});

watcher.on('done', (state) => {
  console.log('Crawl complete:', state.status);
});

await watcher.start();

Pagination for Large Datasets - When dealing with extensive crawls, you can control pagination behavior to manage memory and processing:

// Auto-pagination with limits
const crawlLimited = await firecrawl.getCrawlStatus(jobId, {
  autoPaginate: true,
  maxPages: 5,
  maxResults: 100,
  maxWaitTime: 30
});

// Manual pagination for fine control
const crawlPage = await firecrawl.getCrawlStatus(jobId, {
  autoPaginate: false
});
// Process crawlPage.data, then fetch next page if crawlPage.next exists

Scaling the Pattern

The Search - Scrape - Synthesize loop is remarkably flexible. Here are some real-world applications:

Competitive Intelligence - Monitor competitor websites for product changes, pricing updates, or new features. Schedule the agent to run daily and alert your team when significant changes are detected.

Market Research - Track industry trends by analyzing news sites, blogs, and technical forums. Aggregate insights weekly to inform strategic decisions.

Documentation Assistants - Aggregate and synthesize scattered documentation into coherent guides. This is particularly useful when working with microservices where documentation is spread across multiple repositories.

Fact-Checking Pipelines - Cross-reference claims across multiple authoritative sources. Useful for journalism, research validation, or content moderation.

Each application shares the same core architecture with domain-specific refinements. The key is identifying what makes sense to automate and what still requires human judgment.

Final Thoughts

Building AI agents isn't about complex algorithms - it's about orchestrating the right tools effectively. By combining specialized APIs like Firecrawl with LLMs, you can build sophisticated research systems without reinventing web scraping infrastructure.

The agent we've built demonstrates the fundamental pattern, but it's just the starting point. The real power comes from iterating on this foundation: adding persistence, implementing feedback loops, or chaining multiple agents together for complex workflows.

Start simple, measure everything, and scale what works. That's how you build systems that last.