BullMQ Parent-Child Workflows: Handling Failures Without Losing Results

January 3, 2026
Stefan Mentović
bullmq · error-handling · job-queues · typescript · resilience

Learn how to build resilient parent-child job workflows in BullMQ that preserve successful results when children fail, enabling smart selective retries.

#BullMQ Parent-Child Workflows: Handling Failures Without Losing Results

You've built a parent job that spawns three child jobs to query different LLMs. Two children succeed, one fails. Now your entire orchestration is stuck, and you have to retry everything from scratch, wasting time and money on redundant API calls.

Sound familiar? This is one of the most frustrating problems in distributed job orchestration. Parent-child workflows are powerful for decomposing complex tasks, but they create a brittle dependency chain: if any child fails, the entire parent workflow grinds to a halt.

BullMQ 5.0+ introduces a solution: the ignoreDependencyOnFailure pattern. It allows parent jobs to continue even when children fail, enabling smart retry logic that preserves successful results while only retrying failed children. In this guide, we'll build a production-ready orchestrator that demonstrates this pattern in action.

#The Problem: Parent Jobs Get Stuck on Child Failures

Let's start with the classic orchestrator pattern:

import { Queue, Worker, Job, WaitingChildrenError } from 'bullmq';

// Parent orchestrator spawns 3 child jobs
async function orchestratorJob(job: Job, token?: string) {
	// Spawn child jobs
	await geminiQueue.add(
		'query-gemini',
		{ prompt: job.data.prompt },
		{
			parent: { id: job.id!, queue: job.queueQualifiedName },
		},
	);

	await openaiQueue.add(
		'query-openai',
		{ prompt: job.data.prompt },
		{
			parent: { id: job.id!, queue: job.queueQualifiedName },
		},
	);

	await anthropicQueue.add(
		'query-anthropic',
		{ prompt: job.data.prompt },
		{
			parent: { id: job.id!, queue: job.queueQualifiedName },
		},
	);

	// Wait for all children to complete
	const shouldWait = await job.moveToWaitingChildren(token ?? '');
	if (shouldWait) {
		throw new WaitingChildrenError();
	}

	// Collect results
	const results = await job.getChildrenValues();
	return { results: Object.values(results) };
}

The problem: if the Anthropic child exhausts all of its retries and fails, the parent either stays blocked waiting on it forever (the default) or fails outright (with failParentOnFailure). Either way, you lose the successful Gemini and OpenAI results, and the next retry spawns all three children again, even though two already succeeded.

This creates cascading failures and wastes resources.

#Why BullMQ Flows Require Manual Failure Handling

When we first built our multi-LLM orchestrator, we assumed BullMQ would handle parent-child failures gracefully out of the box. We were wrong - and understanding why taught us a lot about how flow dependencies actually work.

By default, BullMQ flows use a strict dependency model. According to the official BullMQ flows documentation, when you create a parent-child relationship, the parent waits for all children to complete successfully before resuming. If any child fails, the parent is blocked indefinitely - there's no automatic notification or retry mechanism.

The failParentOnFailure option changes this behavior to a fail-fast approach: when a child fails, the parent immediately fails too. This is useful when all children are required - if one fails, there's no point waiting for the others. But it comes with a significant downside: all successful child results are lost, and retrying the parent means re-running every child from scratch.

When a child job with the option failParentOnFailure fails, it will verify if the parent has other children that have not yet been processed. If so, they will be removed, and the parent job will fail immediately. - BullMQ Fail Parent Documentation
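
For reference, the fail-fast behavior is just a flag on the child's job options. A minimal sketch, reusing the queues and job from the first example:

await anthropicQueue.add(
	'query-anthropic',
	{ prompt: job.data.prompt },
	{
		parent: { id: job.id!, queue: job.queueQualifiedName },
		failParentOnFailure: true, // if this child exhausts retries, the parent fails and unprocessed siblings are removed
	},
);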

This is where we discovered the need for manual failure watching. Neither the default behavior (parent blocked) nor failParentOnFailure (parent fails, loses results) gives us what we actually wanted: the ability to preserve successful child results and selectively retry only the failed children.

#The Solution: ignoreDependencyOnFailure

BullMQ 5.0+ adds ignoreDependencyOnFailure to the child's job options. When set to true, failed children are moved to an "ignored" category instead of blocking the parent:

await geminiQueue.add(
	'query-gemini',
	{ prompt: job.data.prompt },
	{
		parent: { id: job.id!, queue: job.queueQualifiedName },
		ignoreDependencyOnFailure: true, // ✅ Don't block parent on failure
	},
);

How it works:

  1. Child job retries automatically (based on attempts config)
  2. While retrying, child stays in "delayed"/"waiting" state
  3. Only after exhausting ALL retries does child move to "failed" state
  4. With ignoreDependencyOnFailure: true, failed child moves to "ignored" category
  5. Parent job resumes and can check for ignored failures using getIgnoredChildrenFailures()

The parent can now handle failures intelligently instead of being blocked.
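
Note that the child's own retry policy (attempts, backoff) still runs before anything lands in the ignored category. A sketch combining the two, with illustrative values:

await anthropicQueue.add(
	'query-anthropic',
	{ prompt: job.data.prompt },
	{
		attempts: 3, // retried up to 3 times before the failure is "ignored"
		backoff: { type: 'exponential', delay: 2000 },
		parent: { id: job.id!, queue: job.queueQualifiedName },
		ignoreDependencyOnFailure: true,
	},
);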

#Why Manual Failure Watching Is Required

Here's the catch we learned the hard way: ignoreDependencyOnFailure doesn't automatically surface failures to your workflow. The parent simply continues as if nothing happened. If you want step-based retries that preserve progress, you must explicitly check for failures using getIgnoredChildrenFailures().

The BullMQ documentation on ignored dependencies explains:

Since completed children results will be available in the parent via the getChildrenValues, we can use a similar method getFailedChildrenValues to fetch the errors (if any) of the children that failed.

This means your orchestrator code is responsible for:

  1. Detecting failures - Calling getIgnoredChildrenFailures() or getFailedChildrenValues() after children complete
  2. Deciding what to do - Retry the parent (which re-spawns failed children) or mark as unrecoverable
  3. Tracking progress - Remembering which children already succeeded so you don't re-run them on retry
  4. Handling retries gracefully - The ignored failures persist across parent retries, so you need to track which ones you've already handled

Without this manual handling, your parent job will happily complete with partial results - or worse, silently lose data from failed children. The framework gives you the building blocks, but the resilience logic is up to you.
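
As a baseline, the simplest possible check just surfaces failures with the method quoted above and fails the parent. A minimal sketch, before we add any bookkeeping:

// Failed children come back keyed by child job, with their failure reasons as values
const failedChildren = await job.getFailedChildrenValues();

if (Object.keys(failedChildren).length > 0) {
	// Decide here: retry the parent, or give up entirely
	throw new Error(`Child job(s) failed: ${Object.keys(failedChildren).join(', ')}`);
}

The rest of this post adds the bookkeeping this naive version is missing.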

#Tracking Handled Failures Across Retries

Here's the critical insight: the failures reported by getIgnoredChildrenFailures() persist across parent retries. If you don't track which ones you've already handled, you'll process the same errors repeatedly.

Let's build a utility to handle this:

interface FailedChildInfo {
	jobId: string;
	failedReason: string;
}

/**
 * Check for child job failures and throw if any NEW failures detected
 * Tracks handled failures in job.data._handledFailures to avoid duplicates
 */
async function checkChildJobFailures(parentJob: Job): Promise<FailedChildInfo[]> {
	// Get ignored failures (children that exhausted all retries)
	const ignoredFailures = await parentJob.getIgnoredChildrenFailures();

	if (!ignoredFailures || Object.keys(ignoredFailures).length === 0) {
		return []; // No failures
	}

	// Get previously handled failures from job data
	const handledFailures: string[] = parentJob.data._handledFailures || [];

	// Filter out already-handled failures
	const newFailures = Object.entries(ignoredFailures).filter(([jobId]) => !handledFailures.includes(jobId));

	if (newFailures.length === 0) {
		// All failures were already handled in previous retry
		return [];
	}

	// Convert to array format
	const failedChildren: FailedChildInfo[] = newFailures.map(([jobId, failedReason]) => ({
		jobId,
		failedReason: failedReason || 'Unknown error',
	}));

	// Track these failures as handled for next retry
	const updatedHandledFailures = [...handledFailures, ...failedChildren.map((c) => c.jobId)];

	await parentJob.updateData({
		...parentJob.data,
		_handledFailures: updatedHandledFailures,
	});

	const errorMsg = `${failedChildren.length} child job(s) failed: ${failedChildren
		.map((c) => `${c.jobId} (${c.failedReason})`)
		.join(', ')}`;

	// Throw regular Error (NOT UnrecoverableError) to allow retry
	throw new Error(errorMsg);
}

Key points:

  • getIgnoredChildrenFailures() returns { jobId: failedReason } for all ignored failures
  • We store handled failure IDs in job.data._handledFailures (persists across retries)
  • Only throw if there are NEW failures that weren't handled before
  • Throw regular Error (not UnrecoverableError) to allow parent retry

#getIgnoredChildrenFailures() vs getChildrenValues()

BullMQ provides two methods for checking child job results:

getChildrenValues() - Returns results from successful children only:

const childrenValues = await job.getChildrenValues();
// {
//   'gemini-job-123': { llmName: 'gemini', result: '...' },
//   'openai-job-456': { llmName: 'openai', result: '...' }
// }
// Note: Failed children are NOT included here

getIgnoredChildrenFailures() - Returns only failed children (after retries exhausted):

const ignoredFailures = await job.getIgnoredChildrenFailures();
// {
//   'anthropic-job-789': 'Rate limit exceeded'
// }
// Note: Only children that exhausted ALL retries appear here

When to use each:

  • Use getChildrenValues() to collect successful results
  • Use getIgnoredChildrenFailures() to detect failures and decide whether to retry parent
  • Children still retrying won't appear in either method (they're in "delayed"/"waiting" state)
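
Putting the two side by side makes for a handy progress summary while debugging. A sketch reusing the same methods:

const succeeded = await job.getChildrenValues();
const failed = await job.getIgnoredChildrenFailures();

console.log(
	`${Object.keys(succeeded).length} succeeded, ${Object.keys(failed ?? {}).length} failed after exhausting retries`,
);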

#Preserving Successful Results When Retrying Failed Children

The power of this pattern is selective retry: only re-spawn failed children while keeping successful results. Here's how:

async function orchestratorJob(job: Job, token?: string) {
	let step = job.data.step || 'spawn-children';

	// ============================================================
	// STEP 1: Spawn Children
	// ============================================================
	if (step === 'spawn-children') {
		// Check which children already succeeded (from previous run)
		const existingChildren = await job.getChildrenValues();
		const existingLLMs = Object.values(existingChildren)
			.filter((r): r is { llmName: string } => typeof r === 'object' && r !== null && 'llmName' in r)
			.map((r) => r.llmName);

		const spawnPromises: Promise<unknown>[] = [];

		// Only spawn children that don't already have successful results
		if (!existingLLMs.includes('gemini')) {
			spawnPromises.push(
				geminiQueue.add(
					'query-gemini',
					{ prompt: job.data.prompt },
					{
						parent: { id: job.id!, queue: job.queueQualifiedName },
						ignoreDependencyOnFailure: true,
					},
				),
			);
		}

		if (!existingLLMs.includes('openai')) {
			spawnPromises.push(
				openaiQueue.add(
					'query-openai',
					{ prompt: job.data.prompt },
					{
						parent: { id: job.id!, queue: job.queueQualifiedName },
						ignoreDependencyOnFailure: true,
					},
				),
			);
		}

		if (!existingLLMs.includes('anthropic')) {
			spawnPromises.push(
				anthropicQueue.add(
					'query-anthropic',
					{ prompt: job.data.prompt },
					{
						parent: { id: job.id!, queue: job.queueQualifiedName },
						ignoreDependencyOnFailure: true,
					},
				),
			);
		}

		if (spawnPromises.length > 0) {
			await Promise.all(spawnPromises);
		}

		// Update to next step BEFORE waiting
		await job.updateData({ ...job.data, step: 'wait-for-children' });
		step = 'wait-for-children'; // keep the local variable in sync so step 2 runs if no wait is needed

		// Wait for children to complete
		const shouldWait = await job.moveToWaitingChildren(token ?? '');
		if (shouldWait) {
			throw new WaitingChildrenError();
		}
	}

	// ============================================================
	// STEP 2: Check Failures and Collect Results
	// ============================================================
	if (step === 'wait-for-children') {
		// Check for failures (throws if any NEW failures detected)
		try {
			await checkChildJobFailures(job);
		} catch (childError) {
			// Reset step to spawn-children so retry only re-spawns failed children
			await job.updateData({
				...job.data,
				step: 'spawn-children',
			});
			throw childError; // BullMQ retries parent
		}

		// All children succeeded, collect results
		const childrenValues = await job.getChildrenValues();
		const results = Object.values(childrenValues);

		return {
			success: true,
			results,
		};
	}

	throw new Error(`Unknown step: ${step}`);
}

Smart retry flow:

  1. On first run: spawn all 3 children
  2. Gemini and OpenAI succeed, Anthropic fails
  3. checkChildJobFailures() detects Anthropic failure and throws Error
  4. Parent job retries (BullMQ's automatic retry)
  5. On retry: check existingChildren, find Gemini and OpenAI already succeeded
  6. Only spawn Anthropic child (the one that failed)
  7. Anthropic succeeds on retry
  8. Collect all 3 results and complete

Result: We saved 2 redundant API calls and preserved successful work.

#When to Use Error vs UnrecoverableError

BullMQ has two error types with different behaviors:

Regular Error - Allows job to retry:

// Child failed, but parent should retry
throw new Error('Child job failed after exhausting retries');

UnrecoverableError - Prevents retries (job goes to failed state immediately):

import { UnrecoverableError } from 'bullmq';

// Fatal error, don't retry
throw new UnrecoverableError('Database connection lost');

When to use each in parent-child workflows:

Use regular Error when:

  • Child job failed but parent can retry intelligently
  • Temporary failures (rate limits, network timeouts)
  • Partial failures where some children succeeded
  • You want to preserve successful results and retry only failed children

Use UnrecoverableError when:

  • Invalid input data (won't succeed on retry)
  • Authentication/authorization failures
  • Resource not found (404)
  • Business logic violations
  • Fatal errors where retry would waste resources

Example decision tree:

async function checkChildJobFailures(job: Job) {
	const ignoredFailures = await job.getIgnoredChildrenFailures();

	if (!ignoredFailures || Object.keys(ignoredFailures).length === 0) {
		return; // No failures
	}

	for (const [jobId, failedReason] of Object.entries(ignoredFailures)) {
		// Check failure reason to decide error type
		if (failedReason.includes('Invalid API key')) {
			// Fatal error - don't retry
			throw new UnrecoverableError(`Authentication failed for ${jobId}`);
		}

		if (failedReason.includes('Rate limit exceeded')) {
			// Temporary error - retry parent
			throw new Error(`Rate limit hit for ${jobId}, retrying...`);
		}

		if (failedReason.includes('Timeout')) {
			// Temporary error - retry parent
			throw new Error(`Timeout for ${jobId}, retrying...`);
		}
	}

	// Default: allow retry
	throw new Error('Child job(s) failed, retrying parent...');
}

#Complete Working Example: Multi-LLM Orchestrator

Here's a production-ready orchestrator that demonstrates all patterns together:

import { Queue, Worker, Job, QueueEvents, WaitingChildrenError, UnrecoverableError } from 'bullmq';
import { Redis } from 'ioredis';

// ============================================================
// TYPE DEFINITIONS
// ============================================================

interface LLMQueryData {
	prompt: string;
	model: string;
}

interface LLMQueryResult {
	llmName: string;
	response: string;
	tokensUsed: number;
}

interface OrchestratorData {
	prompt: string;
	step?: 'spawn-children' | 'wait-for-children';
	_handledFailures?: string[];
}

interface OrchestratorResult {
	success: boolean;
	results: LLMQueryResult[];
}

interface FailedChildInfo {
	jobId: string;
	failedReason: string;
}

// ============================================================
// REDIS CONNECTION
// ============================================================

const redisConnection = new Redis({
	host: 'localhost',
	port: 6379,
	maxRetriesPerRequest: null,
});

// ============================================================
// QUEUE DEFINITIONS
// ============================================================

const defaultJobOptions = {
	attempts: 3,
	backoff: { type: 'exponential' as const, delay: 2000 },
	removeOnComplete: { count: 100 },
	removeOnFail: { count: 500 },
};

const geminiQueue = new Queue<LLMQueryData>('gemini-query', {
	connection: redisConnection,
	defaultJobOptions,
});

const openaiQueue = new Queue<LLMQueryData>('openai-query', {
	connection: redisConnection,
	defaultJobOptions,
});

// ... anthropicQueue with same pattern

const orchestratorQueue = new Queue<OrchestratorData>('llm-orchestrator', {
	connection: redisConnection,
	defaultJobOptions: {
		...defaultJobOptions,
		backoff: { type: 'exponential', delay: 5000 }, // Longer delay for orchestrator
	},
});

// ============================================================
// UTILITY: CHECK CHILD JOB FAILURES
// ============================================================

async function checkChildJobFailures(parentJob: Job<OrchestratorData>): Promise<FailedChildInfo[]> {
	const ignoredFailures = await parentJob.getIgnoredChildrenFailures();

	if (!ignoredFailures || Object.keys(ignoredFailures).length === 0) {
		return [];
	}

	const handledFailures: string[] = parentJob.data._handledFailures || [];

	const newFailures = Object.entries(ignoredFailures).filter(([jobId]) => !handledFailures.includes(jobId));

	if (newFailures.length === 0) {
		return [];
	}

	const failedChildren: FailedChildInfo[] = newFailures.map(([jobId, failedReason]) => ({
		jobId,
		failedReason: failedReason || 'Unknown error',
	}));

	const updatedHandledFailures = [...handledFailures, ...failedChildren.map((c) => c.jobId)];

	await parentJob.updateData({
		...parentJob.data,
		_handledFailures: updatedHandledFailures,
	});

	// Check for fatal errors
	for (const child of failedChildren) {
		if (child.failedReason.includes('Invalid API key')) {
			throw new UnrecoverableError(`Authentication failed for ${child.jobId}: ${child.failedReason}`);
		}
	}

	const errorMsg = `${failedChildren.length} child job(s) failed: ${failedChildren
		.map((c) => `${c.jobId} (${c.failedReason})`)
		.join(', ')}`;

	throw new Error(errorMsg);
}

// ============================================================
// ORCHESTRATOR WORKER
// ============================================================

async function processOrchestratorJob(job: Job<OrchestratorData>, token?: string): Promise<OrchestratorResult> {
	let step = job.data.step || 'spawn-children';

	// ============================================================
	// STEP 1: Spawn Children
	// ============================================================
	if (step === 'spawn-children') {
		console.log(`[Orchestrator ${job.id}] Spawning child jobs...`);

		// Check which children already succeeded
		const existingChildren = await job.getChildrenValues();
		const existingLLMs = Object.values(existingChildren)
			.filter((r): r is LLMQueryResult => typeof r === 'object' && r !== null && 'llmName' in r)
			.map((r) => r.llmName);

		const spawnPromises: Promise<unknown>[] = [];

		// Only spawn missing children
		if (!existingLLMs.includes('gemini')) {
			spawnPromises.push(
				geminiQueue.add(
					'query-gemini',
					{ prompt: job.data.prompt, model: 'gemini-pro' },
					{
						parent: { id: job.id!, queue: job.queueQualifiedName },
						ignoreDependencyOnFailure: true,
					},
				),
			);
			console.log(`[Orchestrator ${job.id}] Spawning Gemini job`);
		} else {
			console.log(`[Orchestrator ${job.id}] Gemini result already exists`);
		}

		if (!existingLLMs.includes('openai')) {
			spawnPromises.push(
				openaiQueue.add(
					'query-openai',
					{ prompt: job.data.prompt, model: 'gpt-4' },
					{
						parent: { id: job.id!, queue: job.queueQualifiedName },
						ignoreDependencyOnFailure: true,
					},
				),
			);
			console.log(`[Orchestrator ${job.id}] Spawning OpenAI job`);
		} else {
			console.log(`[Orchestrator ${job.id}] OpenAI result already exists`);
		}

		if (!existingLLMs.includes('anthropic')) {
			spawnPromises.push(
				anthropicQueue.add(
					'query-anthropic',
					{ prompt: job.data.prompt, model: 'claude-3-opus' },
					{
						parent: { id: job.id!, queue: job.queueQualifiedName },
						ignoreDependencyOnFailure: true,
					},
				),
			);
			console.log(`[Orchestrator ${job.id}] Spawning Anthropic job`);
		} else {
			console.log(`[Orchestrator ${job.id}] Anthropic result already exists`);
		}

		if (spawnPromises.length > 0) {
			await Promise.all(spawnPromises);
			console.log(`[Orchestrator ${job.id}] Spawned ${spawnPromises.length} child job(s)`);
		}

		await job.updateData({ ...job.data, step: 'wait-for-children' });
		step = 'wait-for-children'; // stay in sync so step 2 runs immediately if no wait is needed

		const shouldWait = await job.moveToWaitingChildren(token ?? '');
		if (shouldWait) {
			console.log(`[Orchestrator ${job.id}] Waiting for children to complete...`);
			throw new WaitingChildrenError();
		}
	}

	// ============================================================
	// STEP 2: Check Failures and Collect Results
	// ============================================================
	if (step === 'wait-for-children') {
		console.log(`[Orchestrator ${job.id}] Children completed, checking failures...`);

		try {
			await checkChildJobFailures(job);
		} catch (childError) {
			await job.updateData({
				...job.data,
				step: 'spawn-children',
			});
			console.log(`[Orchestrator ${job.id}] Child failure detected, resetting for retry`);
			throw childError;
		}

		console.log(`[Orchestrator ${job.id}] All children succeeded, collecting results...`);

		const childrenValues = await job.getChildrenValues();
		const results = Object.values(childrenValues).filter(
			(r): r is LLMQueryResult => typeof r === 'object' && r !== null && 'llmName' in r,
		);

		console.log(`[Orchestrator ${job.id}] Collected ${results.length} LLM results`);

		return {
			success: true,
			results,
		};
	}

	throw new Error(`Unknown step: ${step}`);
}

const orchestratorWorker = new Worker<OrchestratorData, OrchestratorResult>(
	'llm-orchestrator',
	processOrchestratorJob,
	{
		connection: redisConnection,
		concurrency: 1,
	},
);

// ============================================================
// CHILD WORKERS (Mock LLM APIs)
// ============================================================

async function simulateLLMQuery(llmName: string, data: LLMQueryData): Promise<LLMQueryResult> {
	// Simulate API call with random failure
	await new Promise((resolve) => setTimeout(resolve, 1000));

	// Simulate occasional failures for demonstration
	if (Math.random() < 0.2) {
		throw new Error(`${llmName} rate limit exceeded`);
	}

	return {
		llmName,
		response: `${llmName} response to: ${data.prompt}`,
		tokensUsed: Math.floor(Math.random() * 1000) + 100,
	};
}

// Factory function to create LLM workers with consistent configuration
function createLLMWorker(queueName: string, llmName: string) {
	return new Worker<LLMQueryData, LLMQueryResult>(
		queueName,
		async (job) => {
			console.log(`[${llmName} ${job.id}] Processing...`);
			const result = await simulateLLMQuery(llmName, job.data);
			console.log(`[${llmName} ${job.id}] Completed`);
			return result;
		},
		{ connection: redisConnection, concurrency: 2 },
	);
}

const geminiWorker = createLLMWorker('gemini-query', 'gemini');
const openaiWorker = createLLMWorker('openai-query', 'openai');
// ... anthropicWorker with same pattern

// ============================================================
// EXAMPLE USAGE
// ============================================================

async function runExample() {
	console.log('Starting multi-LLM orchestrator example...\n');

	const job = await orchestratorQueue.add('orchestrate', {
		prompt: 'Explain quantum computing in simple terms',
	});

	console.log(`Created orchestrator job: ${job.id}\n`);

	// Wait for completion (waitUntilFinished requires a QueueEvents instance, not a raw Redis connection)
	const queueEvents = new QueueEvents('llm-orchestrator', { connection: { host: 'localhost', port: 6379 } });
	const result = await job.waitUntilFinished(queueEvents);

	console.log('\n=== FINAL RESULT ===');
	console.log(JSON.stringify(result, null, 2));

	// Cleanup
	await queueEvents.close();
	await Promise.all([
		orchestratorWorker.close(),
		geminiWorker.close(),
		// ... other workers
	]);
	await redisConnection.quit();
}

// Run example
runExample().catch(console.error);

What this example demonstrates:

  1. Smart spawning: Only spawns children that don't already have successful results
  2. Failure tracking: Tracks handled failures across retries to avoid duplicate errors
  3. Selective retry: On parent retry, only re-spawns failed children
  4. Error classification: Uses UnrecoverableError for fatal errors, regular Error for retryable failures
  5. Step-based workflow: Uses step pattern to resume correctly after parent retry

#Key Takeaways

  • Default BullMQ behavior blocks parents - When a child fails, the parent is blocked indefinitely unless you use failParentOnFailure or ignoreDependencyOnFailure
  • failParentOnFailure loses results - The parent fails immediately, but all successful child results are lost and must be re-computed on retry
  • ignoreDependencyOnFailure requires manual watching - The parent continues, but you must explicitly check for failures using getIgnoredChildrenFailures() or getFailedChildrenValues()
  • Track handled failures - Store handled failure IDs in job.data._handledFailures to avoid duplicate error handling across parent retries
  • Preserve successful results - Use getChildrenValues() to check which children succeeded before re-spawning only the failed ones
  • Use step-based workflow - Implement a step pattern (spawn-children → wait-for-children) to resume correctly after parent retry
  • Choose error types carefully - Throw regular Error for retryable failures, UnrecoverableError for fatal errors that shouldn't be retried

The ignoreDependencyOnFailure pattern transforms brittle parent-child workflows into resilient, cost-efficient orchestrations. Instead of losing work and retrying everything, you can preserve successful results and intelligently retry only what failed. The tradeoff is that you must implement the failure detection and retry logic yourself - BullMQ gives you the tools, but the resilience pattern is your responsibility.
