Website Crawling with Puppeteer: Resource Blocking and DOM Extraction

January 3, 2026
Stefan Mentović
puppeteer, web-scraping, seo, nodejs, automation

Master efficient web crawling with Puppeteer for SEO analysis. Learn resource blocking, visible DOM extraction, and production-ready patterns.

#Website Crawling with Puppeteer: Resource Blocking and DOM Extraction

You've built an SEO analysis tool, scraped some pages with Cheerio, and everything works beautifully... until you encounter a modern React or Vue.js application. Suddenly, your crawler sees empty <div id="root"></div> containers instead of actual content.

Sound familiar? This is the critical moment where developers realize not all web scraping is created equal. While Cheerio excels at parsing static HTML, modern web applications require a headless browser that can execute JavaScript and render the DOM.
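
To see the problem concretely, here's a minimal sketch of the failure mode: fetching a client-rendered app's HTML and parsing it with Cheerio returns only the empty shell. The URL and the #root id are placeholders; adjust them to your target.

import * as cheerio from 'cheerio';

// The server response for a typical SPA is just an empty mount point plus script tags.
const response = await fetch('https://spa.example.com'); // placeholder URL
const html = await response.text();

const $ = cheerio.load(html);
console.log($('#root').html()); // '' (nothing rendered yet)
console.log($('h1').length); // 0 headings in the raw HTML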

In this guide, we'll explore production-ready patterns for web crawling with Puppeteer, focusing on performance optimization through resource blocking, efficient DOM extraction, and respectful crawling practices. These techniques power SEO analysis tools processing millions of pages monthly.

#When to Use Puppeteer vs Cheerio

Choosing the right tool saves both time and resources.

#Use Cheerio When:

  • Static HTML: The content exists in the initial server response
  • Simple parsing: You need to extract meta tags, headings, or structured data
  • Speed is critical: Cheerio is 10-100x faster than headless browsers
  • Low memory footprint: Processing thousands of pages concurrently

#Use Puppeteer When:

  • JavaScript-rendered content: Single-page applications (React, Vue, Angular)
  • Dynamic content: Content loaded via AJAX or lazy-loaded on scroll
  • Client-side rendering: Next.js client components, Gatsby with JS hydration
  • Interactive elements: You need to click, scroll, or wait for animations
  • Accurate visibility detection: Filtering content that's actually visible to users

#The Hybrid Approach

The most efficient solution combines both tools (Cheerio + Puppeteer):

import * as cheerio from 'cheerio';
import puppeteer from 'puppeteer-core';

async function crawlPage(url: string) {
	// Launch browser with resource blocking enabled.
	// Note: puppeteer-core does not bundle Chromium, so in practice you must also pass
	// executablePath (or channel) pointing at a locally installed Chrome.
	const browser = await puppeteer.launch({
		headless: 'new',
		args: ['--no-sandbox', '--disable-setuid-sandbox'],
	});

	const page = await browser.newPage();

	// Block heavy resources (covered in the next section).
	// Note: blocking stylesheets speeds crawling up, but it also weakens the visibility
	// checks further down, since CSS-driven display:none rules are never applied.
	await page.setRequestInterception(true);
	page.on('request', (req) => {
		if (['image', 'stylesheet', 'font', 'media'].includes(req.resourceType())) {
			req.abort();
		} else {
			req.continue();
		}
	});

	// Navigate and wait for content
	await page.goto(url, {
		waitUntil: 'networkidle2',
		timeout: 60000,
	});

	// Get the rendered HTML
	const html = await page.content();

	// Parse with Cheerio for static elements (meta tags, JSON-LD)
	const $ = cheerio.load(html);
	const metaTags = $('meta')
		.map((_, el) => ({
			name: $(el).attr('name'),
			property: $(el).attr('property'),
			content: $(el).attr('content'),
		}))
		.get();

	// Use Puppeteer for dynamic content (visible headings)
	const headings = await page.$$eval('h1,h2,h3,h4,h5,h6', (els) =>
		els
			.filter((el) => {
				const style = window.getComputedStyle(el);
				return style.display !== 'none' && style.visibility !== 'hidden' && !el.closest('[aria-hidden="true"]');
			})
			.map((el) => ({
				tag: el.tagName.toLowerCase(),
				text: el.innerText.trim(),
			})),
	);

	await browser.close();

	return { metaTags, headings };
}

Why this works: Cheerio handles the static HTML parsing (faster), while Puppeteer focuses on dynamic content that requires JavaScript execution. This hybrid approach reduces browser execution time by 60-80%.

#Request Interception: Blocking Non-Essential Resources

Puppeteer downloads every resource by default—images, fonts, stylesheets, videos. For SEO analysis, most of these resources are unnecessary and dramatically slow down crawling.

#The Performance Impact

Without resource blocking:

  • Average page load: 3-8 seconds
  • Data transfer: 2-5 MB per page
  • Memory usage: 150-300 MB per browser instance

With resource blocking:

  • Average page load: 0.8-2 seconds
  • Data transfer: 50-200 KB per page
  • Memory usage: 50-100 MB per browser instance

That's a 60-75% reduction in load time and 90-95% reduction in bandwidth.
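
If you want to sanity-check those numbers on your own pages, a rough timing harness like the sketch below works. timeLoad is an illustrative helper, the executablePath is an assumption for your environment, and actual figures vary by site.

import puppeteer from 'puppeteer-core';

// Compare load time for one URL with and without resource blocking.
async function timeLoad(url: string, block: boolean): Promise<number> {
	const browser = await puppeteer.launch({
		headless: 'new',
		executablePath: '/usr/bin/google-chrome', // adjust for your environment
	});
	const page = await browser.newPage();

	if (block) {
		await page.setRequestInterception(true);
		page.on('request', (req) =>
			['image', 'stylesheet', 'font', 'media'].includes(req.resourceType()) ? req.abort() : req.continue(),
		);
	}

	const start = Date.now();
	await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
	const elapsed = Date.now() - start;

	await browser.close();
	return elapsed;
}

// Usage
console.log('unblocked:', await timeLoad('https://example.com', false), 'ms');
console.log('blocked:', await timeLoad('https://example.com', true), 'ms');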

#Implementing Request Interception

import { Page } from 'puppeteer-core';

interface BlockingOptions {
	blockImages?: boolean;
	blockStylesheets?: boolean;
	blockFonts?: boolean;
	blockMedia?: boolean;
	allowedDomains?: string[];
}

async function setupResourceBlocking(page: Page, options: BlockingOptions = {}) {
	const {
		blockImages = true,
		blockStylesheets = true,
		blockFonts = true,
		blockMedia = true,
		allowedDomains = [],
	} = options;

	await page.setRequestInterception(true);

	page.on('request', (request) => {
		const resourceType = request.resourceType();
		const url = request.url();

		// Allow requests from specific domains (for CDNs, analytics)
		if (allowedDomains.length > 0) {
			const requestDomain = new URL(url).hostname;
			if (allowedDomains.some((domain) => requestDomain.includes(domain))) {
				request.continue();
				return;
			}
		}

		// Block based on resource type
		if (
			(blockImages && resourceType === 'image') ||
			(blockStylesheets && resourceType === 'stylesheet') ||
			(blockFonts && resourceType === 'font') ||
			(blockMedia && resourceType === 'media')
		) {
			request.abort();
			return;
		}

		// Allow everything else (HTML, scripts, XHR, fetch)
		request.continue();
	});
}

// Usage
const page = await browser.newPage();
await setupResourceBlocking(page, {
	blockImages: true,
	blockStylesheets: true,
	blockFonts: true,
	blockMedia: true,
	allowedDomains: ['googleapis.com'], // Allow Google Fonts if needed
});

#What to Block and What to Keep

Always block:

  • Images: Not needed for SEO text analysis (except for alt text, which comes from HTML)
  • Fonts: Only affect rendering, not content
  • Media: Videos and audio files

Sometimes block:

  • Stylesheets: Block for pure SEO analysis, keep for visibility detection
  • Third-party scripts: Block analytics, ads; keep for authentication or content delivery

Never block:

  • HTML documents: The core content
  • JavaScript: Required for SPA rendering
  • XHR/Fetch: API calls that populate content
  • WebSocket: Real-time content updates

#Advanced Pattern: Selective Blocking

// Assumes await page.setRequestInterception(true) has already been called
page.on('request', (request) => {
	const resourceType = request.resourceType();
	const url = request.url();

	// Block third-party analytics and ads
	const blockedDomains = ['google-analytics.com', 'googletagmanager.com', 'facebook.com', 'doubleclick.net'];

	if (blockedDomains.some((domain) => url.includes(domain))) {
		request.abort();
		return;
	}

	// Block heavy resources
	if (['image', 'font', 'media'].includes(resourceType)) {
		request.abort();
		return;
	}

	// Allow everything else
	request.continue();
});

#Extracting Visible DOM Elements vs Full HTML

A common mistake in SEO crawling is extracting all content from the HTML source, including hidden elements, CSS-injected text, and content marked as invisible.

Search engines prioritize visible content. Google's algorithms explicitly penalize hidden content used for keyword stuffing. Your crawler should do the same.

#The Problem with page.content()

// ❌ BAD: Gets ALL content including hidden elements
const html = await page.content();
const $ = cheerio.load(html);
const allHeadings = $('h1,h2,h3,h4,h5,h6')
	.map((_, el) => $(el).text())
	.get();

// Result: ['Visible H1', 'Hidden SEO spam', 'Display none heading', ...]

This captures:

  • Elements with display: none or visibility: hidden
  • Content inside <noscript> tags
  • Text with opacity: 0 or positioned off-screen
  • Elements with aria-hidden="true"

#The Solution: $$eval with Visibility Checks

// ✅ GOOD: Only visible headings
const visibleHeadings = await page.$$eval('h1,h2,h3,h4,h5,h6', (elements) =>
	elements
		.filter((el) => {
			// Get computed styles (after CSS applies)
			const style = window.getComputedStyle(el);

			// Check multiple visibility criteria
			const isVisible =
				style.display !== 'none' &&
				style.visibility !== 'hidden' &&
				parseFloat(style.opacity) > 0 &&
				!el.closest('[aria-hidden="true"]') &&
				!el.hasAttribute('hidden') &&
				el.offsetWidth > 0 &&
				el.offsetHeight > 0;

			// Ensure element has actual text content
			const hasContent = el.innerText.trim().length > 0;

			return isVisible && hasContent;
		})
		.map((el) => ({
			tag: el.tagName.toLowerCase(),
			text: el.innerText.trim(), // innerText respects CSS and line breaks
		})),
);

// Result: [{ tag: 'h1', text: 'Visible H1' }]

#Why innerText Over textContent

// Example DOM
<div style="display: none;">Hidden text</div>
<div>Visible <span style="visibility: hidden;">Hidden span</span> text</div>

// innerText (respects CSS)
const inner = el.innerText; // "Visible text"

// textContent (ignores CSS)
const text = el.textContent; // "Hidden text Visible Hidden span text"

Use innerText for SEO analysis: because it respects CSS styling and visibility, it matches what users and search engines actually see.
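
In a Puppeteer context, a quick way to see the difference is a small sketch like this, assuming page has already navigated somewhere:

// Compare what the two properties return for the document body.
const { inner, text } = await page.$eval('body', (el) => ({
	inner: (el as HTMLElement).innerText, // respects CSS: hidden text is excluded
	text: el.textContent ?? '', // raw DOM text: hidden nodes are included
}));

console.log('innerText length:', inner.length);
console.log('textContent length:', text.length); // usually larger, since hidden text and extra whitespace survive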

#Extracting Images with Proper Attributes

interface ImageData {
	src: string;
	alt: string;
	loading?: string;
	fetchpriority?: string;
	srcset?: string;
	isDecorative: boolean;
}

const images: ImageData[] = await page.$$eval('img', (elements) =>
	elements.map((el) => ({
		src: el.src || el.getAttribute('data-src') || '', // Handle lazy loading
		alt: el.alt || '',
		loading: el.getAttribute('loading') || undefined,
		fetchpriority: el.getAttribute('fetchpriority') || undefined,
		srcset: el.srcset || undefined,
		// Note: el.alt === '' is true for both alt="" and a missing alt attribute;
		// check el.hasAttribute('alt') if missing alt text should be flagged separately
		isDecorative:
			el.getAttribute('role') === 'presentation' || el.getAttribute('aria-hidden') === 'true' || el.alt === '',
	})),
);

// Filter out decorative images for SEO
const contentImages = images.filter((img) => !img.isDecorative);

#Extracting Internal and External Links

interface LinkData {
	href: string;
	text: string;
	rel?: string;
	type: 'internal' | 'external';
}

const links = await page.evaluate((baseUrl) => {
	const base = new URL(baseUrl);
	const linkElements = document.querySelectorAll('a[href]');

	return Array.from(linkElements)
		.map((a) => {
			try {
				// Resolve relative URLs
				const href = new URL(a.getAttribute('href')!, base).href;
				const url = new URL(href);
				const isInternal = url.hostname === base.hostname;

				return {
					href,
					text: a.textContent?.trim() || '',
					rel: a.getAttribute('rel') || undefined,
					type: isInternal ? 'internal' : 'external',
				} as LinkData;
			} catch {
				return null; // Invalid URL
			}
		})
		.filter((link): link is LinkData => link !== null);
}, page.url());

const internalLinks = links.filter((l) => l.type === 'internal');
const externalLinks = links.filter((l) => l.type === 'external');

#Parsing Meta Tags, Structured Data, and robots.txt

SEO analysis requires extracting metadata that doesn't necessarily render visually.

#Meta Tags Extraction

import * as cheerio from 'cheerio';

interface MetaTag {
	name?: string;
	property?: string;
	content?: string;
	httpEquiv?: string;
}

async function extractMetaTags(page: Page): Promise<MetaTag[]> {
	const html = await page.content();
	const $ = cheerio.load(html);

	return $('meta')
		.map((_, el) => ({
			name: $(el).attr('name'),
			property: $(el).attr('property'),
			content: $(el).attr('content'),
			httpEquiv: $(el).attr('http-equiv'),
		}))
		.get();
}

// Usage
const meta = await extractMetaTags(page);
const description = meta.find((m) => m.name === 'description' || m.property === 'og:description');
const robots = meta.find((m) => m.name === 'robots');

#JSON-LD Structured Data

interface StructuredData {
	type: string;
	data: unknown;
}

async function extractStructuredData(page: Page): Promise<StructuredData[]> {
	const html = await page.content();
	const $ = cheerio.load(html);
	const results: StructuredData[] = [];

	$('script[type="application/ld+json"]').each((_, el) => {
		try {
			const text = $(el).html() || '{}';
			const parsed = JSON.parse(text);

			// Helper to add structured data items
			const addItem = (item: Record<string, unknown>) => {
				results.push({ type: (item['@type'] as string) || 'Unknown', data: item });
			};

			// Handle array, @graph format, or single object
			const items = Array.isArray(parsed) ? parsed : parsed['@graph'] ?? [parsed];
			items.forEach(addItem);
		} catch {
			// Invalid JSON-LD, skip
		}
	});

	return results;
}

// Usage
const structuredData = await extractStructuredData(page);
const organizationData = structuredData.find((s) => s.type === 'Organization');
const articleData = structuredData.find((s) => s.type === 'Article');

#robots.txt Parsing and Compliance

interface RobotsRule {
	userAgent: string;
	disallow: string[];
	allow: string[];
	crawlDelay?: number;
	sitemap?: string[];
}

async function fetchRobotsTxt(baseUrl: string): Promise<string | null> {
	try {
		const robotsUrl = new URL('/robots.txt', baseUrl).href;
		const response = await fetch(robotsUrl);

		if (response.ok) {
			return await response.text();
		}
	} catch {
		// robots.txt doesn't exist
	}

	return null;
}

function parseRobotsTxt(robotsTxt: string): RobotsRule[] {
	const rules: RobotsRule[] = [];
	let currentRule: RobotsRule | null = null;

	robotsTxt.split('\n').forEach((line) => {
		const cleanLine = line.split('#')[0].trim(); // Remove comments
		if (!cleanLine) return;

		// Split only on the first colon so values like sitemap URLs keep their "https://"
		const separatorIndex = cleanLine.indexOf(':');
		if (separatorIndex === -1) return;
		const key = cleanLine.slice(0, separatorIndex).trim();
		const value = cleanLine.slice(separatorIndex + 1).trim();

		switch (key.toLowerCase()) {
			case 'user-agent':
				if (currentRule) rules.push(currentRule);
				currentRule = { userAgent: value, disallow: [], allow: [], sitemap: [] };
				break;
			case 'disallow':
				currentRule?.disallow.push(value);
				break;
			case 'allow':
				currentRule?.allow.push(value);
				break;
			case 'crawl-delay':
				if (currentRule) currentRule.crawlDelay = parseFloat(value); // crawl-delay may be fractional (e.g. 0.5)
				break;
			case 'sitemap':
				currentRule?.sitemap?.push(value);
				break;
		}
	});

	if (currentRule) rules.push(currentRule);
	return rules;
}

function isUrlAllowed(robotsTxt: string, url: string): boolean {
	const rules = parseRobotsTxt(robotsTxt);
	const path = new URL(url).pathname;

	// Find the applicable group (this example checks the Googlebot group first, then falls back to *)
	const rule = rules.find((r) => r.userAgent.toLowerCase() === 'googlebot') || rules.find((r) => r.userAgent === '*');

	if (!rule) return true; // No rules = allowed

	// Check disallow rules
	for (const disallowPath of rule.disallow) {
		if (!disallowPath) continue; // an empty "Disallow:" value means everything is allowed
		if (path.startsWith(disallowPath)) {
			// Check if there's a more specific allow rule
			for (const allowPath of rule.allow) {
				if (path.startsWith(allowPath)) {
					return true;
				}
			}
			return false;
		}
	}

	return true; // Not explicitly disallowed
}

// Usage
const robotsTxt = await fetchRobotsTxt('https://example.com');
if (robotsTxt) {
	const canCrawl = isUrlAllowed(robotsTxt, 'https://example.com/admin');
	const rules = parseRobotsTxt(robotsTxt);
	const crawlDelay = rules.find((r) => r.userAgent === '*')?.crawlDelay || 0;
}
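
To make the parsing and matching behavior concrete, here's a small worked example against the helpers above (the robots.txt content and domain are made up):

const sampleRobotsTxt = [
	'User-agent: *',
	'Disallow: /admin',
	'Allow: /admin/public',
	'Crawl-delay: 2',
	'Sitemap: https://example.com/sitemap.xml',
].join('\n');

console.log(parseRobotsTxt(sampleRobotsTxt));
// [{ userAgent: '*', disallow: ['/admin'], allow: ['/admin/public'],
//    crawlDelay: 2, sitemap: ['https://example.com/sitemap.xml'] }]

console.log(isUrlAllowed(sampleRobotsTxt, 'https://example.com/admin/secret')); // false
console.log(isUrlAllowed(sampleRobotsTxt, 'https://example.com/admin/public/page')); // true
console.log(isUrlAllowed(sampleRobotsTxt, 'https://example.com/blog')); // true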

#Memory Management and Browser Instance Pooling

Running hundreds of Puppeteer instances simultaneously will crash your server. Proper resource management is critical.

#The Problem: Memory Leaks

// ❌ BAD: Creates a new browser for every page
async function crawlUrls(urls: string[]) {
	const results = [];

	for (const url of urls) {
		const browser = await puppeteer.launch(); // New browser instance!
		const page = await browser.newPage();
		// ... crawl logic
		await browser.close();
		results.push(data);
	}

	return results;
}

// Run this concurrently (e.g. Promise.all over the URLs) and you get one live browser per URL

Each browser instance consumes 50-150 MB, so 100 concurrent browsers means 5-15 GB of memory. Even run sequentially, every URL pays a full browser cold start, and any error thrown before browser.close() leaks the instance.

#The Solution: Browser Instance Pooling

import puppeteer, { Browser, Page } from 'puppeteer-core';

class BrowserPool {
	private browsers: Browser[] = [];
	private readonly maxBrowsers: number;
	private readonly pagesPerBrowser: number;
	private currentBrowserIndex = 0;

	constructor(maxBrowsers = 3, pagesPerBrowser = 5) {
		this.maxBrowsers = maxBrowsers;
		this.pagesPerBrowser = pagesPerBrowser;
	}

	async initialize() {
		for (let i = 0; i < this.maxBrowsers; i++) {
			const browser = await puppeteer.launch({
				headless: 'new',
				args: [
					'--no-sandbox',
					'--disable-setuid-sandbox',
					'--disable-dev-shm-usage', // Overcome limited resource problems
					'--disable-accelerated-2d-canvas',
					'--no-first-run',
					'--no-zygote',
					'--disable-gpu',
				],
			});
			this.browsers.push(browser);
		}
	}

	async getPage(): Promise<{ page: Page; release: () => Promise<void> }> {
		// Round-robin browser selection
		const browser = this.browsers[this.currentBrowserIndex];
		this.currentBrowserIndex = (this.currentBrowserIndex + 1) % this.browsers.length;

		const page = await browser.newPage();

		// Set memory limits
		await page.setViewport({ width: 1280, height: 720 });
		await page.setDefaultNavigationTimeout(60000);
		await page.setDefaultTimeout(30000);

		const release = async () => {
			await page.close();
		};

		return { page, release };
	}

	async destroy() {
		await Promise.all(this.browsers.map((b) => b.close()));
		this.browsers = [];
	}
}

// Usage
const pool = new BrowserPool(3, 5); // 3 browsers, 5 pages each = 15 concurrent pages
await pool.initialize();

async function crawlUrls(urls: string[]) {
	// Note: Promise.all opens a page for every URL at once; for large batches,
	// process urls in chunks (roughly maxBrowsers * pagesPerBrowser) to bound concurrency
	const results = await Promise.all(
		urls.map(async (url) => {
			const { page, release } = await pool.getPage();

			try {
				await page.goto(url, { waitUntil: 'networkidle2' });
				const title = await page.title();
				return { url, title };
			} finally {
				await release(); // Always release the page
			}
		}),
	);

	return results;
}

// Cleanup
process.on('SIGINT', async () => {
	await pool.destroy();
	process.exit(0);
});

#Monitoring Memory Usage

class BrowserPool {
	// ... existing code

	async getMemoryUsage() {
		const usage = await Promise.all(
			this.browsers.map(async (browser, index) => {
				const pages = await browser.pages();
				const metrics = await Promise.all(pages.map((page) => page.metrics()));

				const totalMemory = metrics.reduce((sum, m) => sum + (m.JSHeapUsedSize || 0), 0);

				return {
					browserIndex: index,
					pageCount: pages.length,
					memoryMB: (totalMemory / 1024 / 1024).toFixed(2),
				};
			}),
		);

		return usage;
	}
}

// Monitor every 30 seconds
setInterval(async () => {
	const usage = await pool.getMemoryUsage();
	console.log('Browser memory usage:', usage);
}, 30000);

#Respecting robots.txt and Crawl Delays

Ethical crawling respects website owners' wishes and prevents server overload.

#Implementing Crawl Delays

interface CrawlerConfig {
	respectRobotsTxt: boolean;
	defaultCrawlDelay: number; // milliseconds
	maxConcurrentPages: number;
	userAgent: string;
}

class RespectfulCrawler {
	private config: CrawlerConfig;
	private lastCrawlTimes = new Map<string, number>();
	private robotsCache = new Map<string, { rules: RobotsRule[]; txt: string }>();

	constructor(config: Partial<CrawlerConfig> = {}) {
		this.config = {
			respectRobotsTxt: true,
			defaultCrawlDelay: 1000,
			maxConcurrentPages: 5,
			userAgent: 'SEOAnalyzer/1.0 (+https://example.com/bot)',
			...config,
		};
	}

	async canCrawl(url: string): Promise<{ allowed: boolean; delay: number }> {
		const baseUrl = new URL(url).origin;

		// Check robots.txt
		if (this.config.respectRobotsTxt) {
			let robotsData = this.robotsCache.get(baseUrl);

			if (!robotsData) {
				const robotsTxt = await fetchRobotsTxt(baseUrl);
				if (robotsTxt) {
					robotsData = {
						rules: parseRobotsTxt(robotsTxt),
						txt: robotsTxt,
					};
					this.robotsCache.set(baseUrl, robotsData);
				}
			}

			if (robotsData && !isUrlAllowed(robotsData.txt, url)) {
				return { allowed: false, delay: 0 };
			}

			// Get crawl delay from robots.txt
			const rule = robotsData?.rules.find((r) => r.userAgent === '*');
			const crawlDelay = (rule?.crawlDelay || 0) * 1000; // Convert to ms

			return {
				allowed: true,
				delay: Math.max(crawlDelay, this.config.defaultCrawlDelay),
			};
		}

		return { allowed: true, delay: this.config.defaultCrawlDelay };
	}

	async crawlWithDelay(url: string, crawlFn: (url: string) => Promise<void>) {
		const { allowed, delay } = await this.canCrawl(url);

		if (!allowed) {
			throw new Error(`Crawling ${url} is disallowed by robots.txt`);
		}

		const baseUrl = new URL(url).origin;
		const lastCrawl = this.lastCrawlTimes.get(baseUrl) || 0;
		const timeSinceLastCrawl = Date.now() - lastCrawl;

		if (timeSinceLastCrawl < delay) {
			const waitTime = delay - timeSinceLastCrawl;
			await new Promise((resolve) => setTimeout(resolve, waitTime));
		}

		this.lastCrawlTimes.set(baseUrl, Date.now());
		await crawlFn(url);
	}

	async crawlMultiple(urls: string[], crawlFn: (url: string) => Promise<void>) {
		// Group by domain to respect per-domain delays
		const urlsByDomain = new Map<string, string[]>();

		for (const url of urls) {
			const domain = new URL(url).origin;
			if (!urlsByDomain.has(domain)) {
				urlsByDomain.set(domain, []);
			}
			urlsByDomain.get(domain)!.push(url);
		}

		// Crawl each domain sequentially (respecting delays)
		// but crawl different domains concurrently
		await Promise.all(
			Array.from(urlsByDomain.entries()).map(async ([_, domainUrls]) => {
				for (const url of domainUrls) {
					await this.crawlWithDelay(url, crawlFn);
				}
			}),
		);
	}
}

// Usage
const crawler = new RespectfulCrawler({
	respectRobotsTxt: true,
	defaultCrawlDelay: 1000,
	maxConcurrentPages: 5,
});

await crawler.crawlMultiple(urls, async (url) => {
	const { page, release } = await pool.getPage();
	try {
		await page.goto(url);
		// ... extract data
	} finally {
		await release();
	}
});

#Error Handling: Timeouts and Navigation Failures

Crawling the open web means dealing with broken sites, slow servers, and unexpected errors.

#Common Failure Scenarios

  1. Navigation timeout: Site takes >60s to load
  2. 404 errors: Page doesn't exist
  3. 500 errors: Server errors
  4. Connection refused: Site is down
  5. SSL errors: Certificate issues
  6. Infinite redirects: Misconfigured redirects
  7. JavaScript errors: Client-side crashes

#Robust Error Handling Pattern

interface CrawlResult {
	url: string;
	success: boolean;
	data?: unknown;
	error?: {
		type: string;
		message: string;
		statusCode?: number;
	};
}

// Error pattern mapping for cleaner categorization
const ERROR_PATTERNS: Array<{ pattern: string; type: string; message: string }> = [
	{ pattern: 'Timeout', type: 'TIMEOUT', message: 'Page load timeout exceeded' },
	{ pattern: 'net::ERR_NAME_NOT_RESOLVED', type: 'DNS_ERROR', message: 'Domain does not exist' },
	{ pattern: 'net::ERR_CONNECTION_REFUSED', type: 'CONNECTION_REFUSED', message: 'Server refused connection' },
	{ pattern: 'net::ERR_CERT', type: 'SSL_ERROR', message: 'SSL certificate error' },
	{ pattern: 'ERR_TOO_MANY_REDIRECTS', type: 'REDIRECT_ERROR', message: 'Too many redirects' },
];

function categorizeError(error: unknown): { type: string; message: string } {
	if (!(error instanceof Error)) {
		return { type: 'UNKNOWN_ERROR', message: 'An unknown error occurred' };
	}

	const match = ERROR_PATTERNS.find(({ pattern }) => error.message.includes(pattern));
	return match ?? { type: 'UNKNOWN_ERROR', message: error.message };
}

async function crawlPageSafely(page: Page, url: string): Promise<CrawlResult> {
	try {
		const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
		const statusCode = response?.status() || 0;

		if (statusCode >= 400) {
			return { url, success: false, error: { type: 'HTTP_ERROR', message: `HTTP ${statusCode}`, statusCode } };
		}

		await page.waitForSelector('body', { timeout: 10000 });

		const title = await page.title();
		const headings = await page.$$eval('h1', (els) => els.map((el) => el.innerText));

		return { url, success: true, data: { title, headings, statusCode } };
	} catch (error) {
		return { url, success: false, error: categorizeError(error) };
	}
}

// Usage with retry logic
async function crawlWithRetry(page: Page, url: string, maxRetries = 3): Promise<CrawlResult> {
	let lastResult: CrawlResult | null = null;

	for (let attempt = 1; attempt <= maxRetries; attempt++) {
		lastResult = await crawlPageSafely(page, url);

		if (lastResult.success) {
			return lastResult;
		}

		// Retry only on timeout or temporary errors
		const shouldRetry = lastResult.error?.type === 'TIMEOUT' || lastResult.error?.type === 'CONNECTION_REFUSED';

		if (!shouldRetry || attempt === maxRetries) {
			break;
		}

		// Exponential backoff: 2s, 4s, 8s
		const backoff = Math.pow(2, attempt) * 1000;
		await new Promise((resolve) => setTimeout(resolve, backoff));
	}

	return lastResult!;
}

#Handling JavaScript Errors on the Page

async function crawlWithConsoleMonitoring(page: Page, url: string) {
	const errors: string[] = [];

	// Monitor console errors
	page.on('console', (msg) => {
		if (msg.type() === 'error') {
			errors.push(msg.text());
		}
	});

	// Monitor page errors
	page.on('pageerror', (error) => {
		errors.push(`Page error: ${error.message}`);
	});

	// Monitor failed requests
	page.on('requestfailed', (request) => {
		errors.push(`Request failed: ${request.url()}`);
	});

	await page.goto(url, { waitUntil: 'networkidle2' });

	return {
		url,
		errors,
		hasErrors: errors.length > 0,
	};
}

#Production-Ready Complete Example

Here's a production-ready crawler combining all the techniques:

import puppeteer, { Browser, Page } from 'puppeteer-core';
import * as cheerio from 'cheerio';

interface CrawlOptions {
	maxConcurrentPages?: number;
	maxBrowsers?: number;
	respectRobotsTxt?: boolean;
	crawlDelay?: number;
	timeout?: number;
}

interface CrawlData {
	url: string;
	title: string;
	metaTags: MetaTag[];
	structuredData: StructuredData[];
	headings: { tag: string; text: string }[];
	links: { internal: LinkData[]; external: LinkData[] };
	images: ImageData[];
	textContent: string;
	statusCode: number;
}

class ProductionCrawler {
	private pool: BrowserPool;
	private crawler: RespectfulCrawler;

	constructor(options: CrawlOptions = {}) {
		this.pool = new BrowserPool(options.maxBrowsers || 3, options.maxConcurrentPages || 5);
		this.crawler = new RespectfulCrawler({
			respectRobotsTxt: options.respectRobotsTxt ?? true,
			defaultCrawlDelay: options.crawlDelay || 1000,
		});
	}

	async initialize() {
		await this.pool.initialize();
	}

	async crawl(url: string): Promise<CrawlResult> {
		const { page, release } = await this.pool.getPage();

		try {
			// Setup resource blocking
			await page.setRequestInterception(true);
			page.on('request', (req) => {
				if (['image', 'stylesheet', 'font', 'media'].includes(req.resourceType())) {
					req.abort();
				} else {
					req.continue();
				}
			});

			// Crawl with error handling and retry
			const result = await crawlWithRetry(page, url);

			if (!result.success) {
				return result;
			}

			// Extract all SEO data (extractMetaTags and extractStructuredData read page.content() themselves)

			// Meta tags
			const metaTags = await extractMetaTags(page);

			// Structured data
			const structuredData = await extractStructuredData(page);

			// Visible headings
			const headings = await page.$$eval('h1,h2,h3,h4,h5,h6', (els) =>
				els
					.filter((el) => {
						const style = window.getComputedStyle(el);
						return (
							style.display !== 'none' &&
							style.visibility !== 'hidden' &&
							!el.closest('[aria-hidden="true"]') &&
							el.innerText.trim().length > 0
						);
					})
					.map((el) => ({
						tag: el.tagName.toLowerCase(),
						text: el.innerText.trim(),
					})),
			);

			// Links
			const links = await page.evaluate((baseUrl) => {
				const base = new URL(baseUrl);
				const linkElements = document.querySelectorAll('a[href]');

				return Array.from(linkElements)
					.map((a) => {
						try {
							const href = new URL(a.getAttribute('href')!, base).href;
							const url = new URL(href);
							return {
								href,
								text: a.textContent?.trim() || '',
								type: url.hostname === base.hostname ? 'internal' : 'external',
							};
						} catch {
							return null;
						}
					})
					.filter((l): l is NonNullable<typeof l> => l !== null);
			}, page.url());

			// Images
			const images = await page.$$eval('img', (els) =>
				els.map((el) => ({
					src: el.src || el.getAttribute('data-src') || '',
					alt: el.alt || '',
					loading: el.getAttribute('loading') || undefined,
					isDecorative: el.getAttribute('role') === 'presentation',
				})),
			);

			// Text content
			const textContent = await page.$eval(
				'body',
				(el) => el.innerText.trim().substring(0, 5000), // Limit to 5KB
			);

			return {
				url,
				success: true,
				data: {
					url,
					title: await page.title(),
					metaTags,
					structuredData,
					headings,
					links: {
						internal: links.filter((l) => l.type === 'internal'),
						external: links.filter((l) => l.type === 'external'),
					},
					images,
					textContent,
					statusCode: (result.data as { statusCode?: number })?.statusCode ?? 0, // from the initial navigation; avoids a redundant second page.goto()
				},
			};
		} finally {
			await release();
		}
	}

	async crawlMultiple(urls: string[]): Promise<CrawlResult[]> {
		const results: CrawlResult[] = [];

		await this.crawler.crawlMultiple(urls, async (url) => {
			const result = await this.crawl(url);
			results.push(result);
		});

		return results;
	}

	async destroy() {
		await this.pool.destroy();
	}
}

// Usage
const crawler = new ProductionCrawler({
	maxConcurrentPages: 5,
	maxBrowsers: 3,
	respectRobotsTxt: true,
	crawlDelay: 1000,
});

await crawler.initialize();

const results = await crawler.crawlMultiple([
	'https://example.com',
	'https://example.com/about',
	'https://example.com/blog',
]);

console.log(`Crawled ${results.filter((r) => r.success).length} pages successfully`);

await crawler.destroy();

#Key Takeaways

  • Choose the right tool: Use Cheerio for static HTML, Puppeteer for JavaScript-rendered content, or combine both for optimal performance
  • Block non-essential resources: Reduce page load time by 60-75% and bandwidth by 90-95% with request interception
  • Extract visible content only: Use $$eval with visibility checks and innerText to match what users and search engines see
  • Parse metadata separately: Use Cheerio for static HTML parsing (meta tags, JSON-LD) and Puppeteer for dynamic content
  • Manage memory carefully: Implement browser instance pooling to handle hundreds of concurrent crawls without crashes
  • Respect robots.txt: Parse and honor crawl directives to be an ethical web citizen
  • Handle errors gracefully: Implement retry logic with exponential backoff and categorize errors for proper handling
  • Monitor and optimize: Track memory usage, crawl speed, and error rates to identify bottlenecks

Ready to build your own SEO analysis tool? Check out the official Puppeteer documentation for advanced features like PDF generation, screenshot capture, and performance profiling.

For network interception details, see the Puppeteer request interception guide.

Note: Always respect website terms of service, robots.txt directives, and rate limits when crawling. This guide is for educational purposes and building compliant SEO analysis tools.
