Welcome to the Co3deX
Hello and welcome to the CO3DEX, a blog of my Journeys in Real-time 3D Graphics and Technical Art. My name is Jonny Galloway. I am a polymath technical art leader who bridges art, tools, engine, and product. I work as a Principal Technical Artist and tools/engine specialist with 30+ years in AAA game development, working across content, design, production, and technology.
When Parallelization IS the Answer: Building BATS 🦇
After simplifying sequential work with QProcess in my last post, I wanted to think bigger: batch processing as infrastructure. Here's how I built BATS, a tool-agnostic gRPC orchestration system that any DCC tool can use, turning 7 hours of sequential work into 1 hour of distributed processing.
TL;DR (5-Minute Version)
⚠️ This is a long post. Deep technical content, real production code, implementation war stories. If you're short on time, this section covers everything. Skim the headers to find what's relevant to you, or read straight through if you want the full picture.
Part 1 added logging to tools. Part 2 made tools async with QProcess. This isn't Part 3. It's an architectural inversion. I stopped building better tools and started building distributed orchestration infrastructure. Tools become clients. Workers become permanent. Jobs become data.
The Inversion: Traditional batch processing spawns subprocesses on demand; script launches Maya, processes one asset, Maya closes, repeat 50 times. I flipped this completely: persistent worker swarm running 24/7, pulling jobs on demand. Workers and jobs became first-class primitives. Everything else (orchestrator, UI tools, APIs) is just plumbing.
The Architecture: Built BATS (Background Automation Task Swarm), a distributed gRPC system with persistent Maya/Houdini worker pools. Workers boot once, process hundreds of jobs, shut down when idle. Any tool can submit work: DCC plugins, game editors, web dashboards, CLI scripts, CI/CD pipelines.
The Results:
- Sequential processing: 50 assets × 8 minutes = 400 minutes ≈ 6.7 hours
- 8-worker swarm: 50 assets ÷ 8 workers ≈ 7 rounds of 8 minutes ≈ 1 hour (6.7x speedup)
- 50-worker swarm: 50 assets concurrently ≈ 10 minutes (40x speedup)
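The arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch, assuming every asset takes the same 8 minutes and ignoring worker boot time and dispatch overhead; `estimate_minutes` is a hypothetical helper, not part of BATS.

```python
import math


def estimate_minutes(assets: int, workers: int, minutes_per_asset: float) -> float:
    """Wall-clock estimate: assets are processed in rounds of `workers` at a time."""
    rounds = math.ceil(assets / workers)
    return rounds * minutes_per_asset


sequential = estimate_minutes(50, 1, 8)   # 50 rounds of 8 minutes
swarm_8 = estimate_minutes(50, 8, 8)      # 7 rounds of 8 minutes
swarm_50 = estimate_minutes(50, 50, 8)    # a single 8-minute round
print(sequential, swarm_8, swarm_50)
```

Under these assumptions the 8-worker case lands at 56 minutes, and the 50-worker case at one 8-minute round, which is consistent with the quoted 10 minutes once some overhead is added.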
The Paradigm: Workers poll for jobs (request when ready) rather than orchestrator pushing assignments. This eliminates push-related race conditions. The orchestrator actively dispatches by finding matching jobs, tracking worker state (busy/free, current job, heartbeats), and transmitting complete job packages (script + parameters + files + metadata). Request-dispatch pattern + centralized orchestration = robust distributed system. Workers boot once and stay hot (orchestrator pre-spawns them for instant readiness).
Core Lesson: This isn't an upgrade to Part 2. It's a paradigm inversion. Part 2 made tools async (launch DCC, wait, process). BATS makes tools obsolete for batch work. You don't build "better tools that spawn processes"; you build infrastructure where workers are always running and jobs are just data. The Part 2 tool becomes a BATS client. Build the orchestration backbone once, everything else becomes thin clients. Workers outlive jobs. Scale becomes a primitive. When you stop thinking "tool" and start thinking "distributed job substrate," the architecture writes itself.
What's in the full post:
- The Architectural Inversion: why workers-pull-jobs beats orchestrator-pushes-jobs, and what "infrastructure thinking" actually means
- The Technology Choices: why gRPC over REST, and what the system needed to handle
- BATS Architecture: three-layer design, protobuf schemas, key design decisions with code
- Implementation Journey: protobuf schema evolution, worker bootstrapping hell, priority queue gotchas, monitoring
- The Results: performance numbers, complexity tradeoffs, honest assessment
- Design Patterns That Worked: request dispatch, idempotent jobs, structured logging, health checks
- What I'd Do Differently: SQLite state, Prometheus metrics, scaffolding tools, C# clients
- Lessons from the Journey: distilled takeaways from building a production distributed system
The Architectural Inversion
I built a swarm. Not "the tool that sometimes batches work" or "the script that spawns Maya instances." A living pool of workers (eight, twenty, fifty) running continuously, hungry for work.
Traditional batch processing: Your script launches Maya → processes one asset → Maya closes → repeat 50 times. Seven hours of sequential execution, much of it spent booting and shutting down DCCs.
I inverted it entirely.
Workers boot once, stay running, pull jobs from a queue. When work arrives, they grab it immediately. When the queue empties, they idle. Not shutdown. Idle. New work appears, they're already hot. Process hundreds of assets without ever restarting.
The QProcess refactor was still tool thinking: "make this Python script launch Houdini without blocking." BATS is infrastructure thinking: "Houdini workers are always running, jobs are just data packets." The moment you stop asking "how do I launch a DCC?" and start asking "how do I dispatch work to a pool of always-running DCCs?", that's when the architecture inverts. You're not building a better tool. You're building distributed job substrate.
Workers Pull Jobs
This matters more than it sounds.
Push-based systems: Orchestrator tracks worker state ("Worker 3 is free"), assigns jobs ("Worker 3, run this"), handles failures ("Worker 3 crashed, reassign job"). Complex. Fragile. Race conditions everywhere.
Request-based systems: Workers request jobs when ready. Orchestrator still tracks state (busy/free, current job, heartbeats), but workers initiate the exchange. No push race conditions, no "did the assignment succeed?" retries. Worker crashes? Job stays in queue. Worker reboots? Immediately requests a job. Natural backpressure: busy workers don't poll.
# Traditional push-based (orchestrator assigns)
def assign_next_job():
    available_workers = get_idle_workers()  # Complex state tracking
    if not available_workers:
        return  # What if worker crashes between check and assignment?
    job = queue.pop()
    worker = available_workers[0]
    send_job_to_worker(worker, job)  # Network call might fail
    mark_worker_busy(worker)  # State mutation

# BATS request-dispatch pattern (workers poll, orchestrator dispatches)
async def RequestJob(request):
    client_id = request.client_id
    dcc_type = request.dcc_type
    # Orchestrator finds matching job for this worker
    instance = await pool_manager.get_instance_by_client(client_id)
    job = await job_manager.dispatch_jobs(dcc_type, instance.execution_mode)
    if job:
        # Mark worker busy and track assignment
        await pool_manager.mark_instance_busy(instance.instance_id, job.job_id)
        # Send COMPLETE job package: script, parameters, files, metadata, everything
        return job.to_protobuf()  # Worker receives full job specification
    return empty_job()  # Signal: no work available
The key insight: workers request jobs when ready (eliminating push race conditions), but the orchestrator actively dispatches by finding matching jobs, tracking assignments, and transmitting complete job packages including the entire script, all parameters, input files, output files, and metadata. The worker receives everything needed to execute, not just a job ID.
The orchestrator plays an active role in job dispatch:
Worker State Tracking (Explicit):
- Worker registration: Maps client_id to DCC instance (Maya worker #3, Houdini worker #1, etc.)
- Busy/free state: Marks workers busy when dispatching jobs, free when jobs complete
- Current job assignment: Tracks which job each worker is executing (instance.current_job_id)
- Health monitoring: Heartbeat every 30s, 60s timeout, auto-respawn on failure
- Capability registry: Knows which workers handle Maya vs Houdini, headless vs GUI
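The health-monitoring item can be illustrated with a small tracker. This is a minimal sketch under stated assumptions: `HeartbeatMonitor` and its method names are hypothetical, only the 30s interval and 60s timeout come from the list above, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
import time
from typing import Dict, List, Optional

HEARTBEAT_TIMEOUT_S = 60.0  # timeout quoted in the post


class HeartbeatMonitor:
    """Records the last heartbeat per worker and reports workers that went silent."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def beat(self, worker_id: str, now: Optional[float] = None) -> None:
        # Workers call this (via the Heartbeat RPC) roughly every 30 seconds
        self._last_seen[worker_id] = time.monotonic() if now is None else now

    def stale_workers(self, now: Optional[float] = None) -> List[str]:
        # Anything silent for longer than the timeout is a respawn candidate
        now = time.monotonic() if now is None else now
        return sorted(w for w, seen in self._last_seen.items() if now - seen > self.timeout_s)


monitor = HeartbeatMonitor()
monitor.beat("maya_headless_1", now=0.0)
monitor.beat("houdini_headless_1", now=0.0)
monitor.beat("maya_headless_1", now=30.0)  # only the Maya worker keeps beating
print(monitor.stale_workers(now=75.0))
```

In this run only the Houdini worker is reported, because its last heartbeat is 75 seconds old while the Maya worker was seen 45 seconds ago.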
Job Dispatch Flow:
- Worker polls RequestJob() with its client_id and DCC type
- Orchestrator looks up worker's instance and execution mode
- Orchestrator finds highest-priority job matching worker capabilities
- Orchestrator marks worker busy and assigns job
- Orchestrator sends complete job package (script/module + parameters + files + metadata)
- Worker executes locally, reports progress, signals completion
- Orchestrator marks worker free, ready for next job
Request-based polling eliminates push race conditions (no "did the assignment succeed?" retries), but the orchestrator actively manages dispatch, state tracking, and transmits the full job specification, not just a reference.
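The dispatch flow can be modelled in memory without gRPC. The following is a simplified sketch, not the BATS implementation: `MiniOrchestrator`, `Job`, and `Worker` are hypothetical stand-ins, matching is reduced to the DCC type, and there is no persistence or locking.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    job_id: str
    dcc_type: str
    priority: int = 5


@dataclass
class Worker:
    client_id: str
    dcc_type: str
    current_job_id: Optional[str] = None


class MiniOrchestrator:
    def __init__(self) -> None:
        self._queue: List[Tuple[int, int, Job]] = []  # heap of (-priority, seq, job)
        self._seq = itertools.count()
        self.workers: Dict[str, Worker] = {}

    def register(self, client_id: str, dcc_type: str) -> None:
        self.workers[client_id] = Worker(client_id, dcc_type)

    def submit(self, job: Job) -> None:
        heapq.heappush(self._queue, (-job.priority, next(self._seq), job))

    def request_job(self, client_id: str) -> Optional[Job]:
        """Worker-initiated: return the best matching job, or None if nothing fits."""
        worker = self.workers[client_id]
        if worker.current_job_id is not None:
            return None  # busy workers are not handed more work
        skipped, match = [], None
        while self._queue:
            entry = heapq.heappop(self._queue)
            if entry[2].dcc_type == worker.dcc_type:
                match = entry[2]
                break
            skipped.append(entry)  # wrong DCC type, put it back afterwards
        for entry in skipped:
            heapq.heappush(self._queue, entry)
        if match is not None:
            worker.current_job_id = match.job_id  # mark busy, track assignment
        return match

    def complete(self, client_id: str) -> None:
        self.workers[client_id].current_job_id = None  # worker is free again


orch = MiniOrchestrator()
orch.register("maya_1", "maya")
orch.submit(Job("sim_001", "houdini", priority=9))
orch.submit(Job("export_001", "maya", priority=5))
job = orch.request_job("maya_1")
print(job.job_id)
```

The Maya worker receives `export_001` even though the Houdini job has higher priority, because the higher-priority job does not match its capabilities and stays queued for a Houdini worker.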
Workers and Jobs as Primitives
Once you have persistent workers pulling jobs from a queue, everything else becomes thin clients. You don't build "tools that automate DCCs." You build one orchestration layer, and everything else (game editor, CLI script, web dashboard, CI/CD pipeline) becomes a client: 20 lines of code that submits a job, gets an ID back, and optionally streams progress. The infrastructure is permanent. The clients are throwaway.
Workers don't care who submitted the job. When they call RequestJob(), the orchestrator sends a complete job package:
job = job_pb2.Job(
    job_id="abc123",
    type="maya",
    execution_mode=HEADLESS,
    priority=7,
    # THE COMPLETE EXECUTABLE CONTENT
    script="""import maya.cmds as cmds
cmds.polySphere(radius={radius}, name='{name}')
cmds.file(rename="{output_path}")
cmds.file(save=True)
""",
    # ALL PARAMETERS (worker substitutes {placeholders} or passes dict to module)
    parameters={"radius": "2.0", "name": "rock_scan_047", "output_path": "D:/Assets/rock_scan_047.mb"},
    # FILE PATHS (worker validates existence before execution)
    input_files=["D:/Scans/rock_scan_047.fbx"],
    output_files=["D:/Assets/rock_scan_047.mb"],
    # METADATA
    metadata={"asset_type": "prop", "biome": "canyon"},
    submitter="game_editor",
)
The worker receives everything: executable code, parameters, file paths, metadata. It executes locally, streams progress to orchestrator, returns results. The job is self-contained: no callbacks to the orchestrator mid-execution, no fetching additional data. Submit once, execute once, report once.
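On the worker side, placeholder substitution and input validation might look roughly like this. It is a sketch that assumes the `{placeholders}` are filled with Python's `str.format`; the post does not say which mechanism BATS uses, and `render_script` and `missing_inputs` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict, List


def render_script(script: str, parameters: Dict[str, str]) -> str:
    """Fill {placeholders} in an inline script; a missing parameter raises KeyError."""
    return script.format(**parameters)


def missing_inputs(input_files: List[str]) -> List[str]:
    """Return the input paths that do not exist, so the job can fail before the DCC runs."""
    return [path for path in input_files if not Path(path).exists()]


script = "cmds.polySphere(radius={radius}, name='{name}')\n"
rendered = render_script(script, {"radius": "2.0", "name": "rock_scan_047"})
print(rendered)
```

One consequence of `str.format`, if that is the mechanism: literal braces in the script (dict literals, nested f-strings) would need to be doubled as `{{` and `}}`, which is one more reason module mode is attractive for anything beyond quick experiments.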
The Orchestrator: Intelligent Hub
The orchestrator is BATS's brain, managing the entire distributed system.
Job Queue Management:
- Priority queue (0-10 scale, heapq-based with negative values for max-heap)
- Dependency resolution (directed acyclic graph for multi-stage workflows)
- Capability matching (routes Maya jobs to Maya workers, Houdini jobs to Houdini workers)
- Preemption support (urgent jobs can checkpoint and pause lower-priority work)
- Job history and result caching (1-hour TTL)
Worker Pool Management:
- Pre-spawns base workers (3 headless + 1 GUI per DCC) for warm start, so there is no boot delay
- Monitors health via heartbeats (60s timeout, 30s check interval)
- Auto-respawns crashed workers (process dies? new worker spawns automatically)
- Configuration-based auto-scaling (spawns workers when queue > 10 jobs)
- Shuts down idle workers after 5 minutes to free resources
- Data-driven worker registry (eliminated ~400 lines of hardcoded DCC logic)
System-Wide Services:
- Real-time progress streaming to all subscribed clients
- Result caching with TTL (query results without re-running jobs)
- Per-job isolated logging (.temp/logs/{dcc}/jobs/ with 5-day retention)
- Environment profile management per worker type
- gRPC dual-service architecture (internal OrchestratorService + external ExternalJobAPI)
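Result caching with a TTL, listed above, is simple to sketch. This assumes an in-memory dictionary and lazy expiry on read; `ResultCache` is a hypothetical name, and only the one-hour TTL comes from the post.

```python
import time
from typing import Any, Dict, Optional, Tuple


class ResultCache:
    """Keeps job results for a fixed time so clients can query without re-running jobs."""

    def __init__(self, ttl_s: float = 3600.0) -> None:  # 1-hour TTL, as in the post
        self.ttl_s = ttl_s
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, job_id: str, result: Any, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[job_id] = (now + self.ttl_s, result)

    def get(self, job_id: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(job_id)
        if entry is None:
            return None
        expires_at, result = entry
        if now >= expires_at:
            del self._entries[job_id]  # expire lazily on read
            return None
        return result


cache = ResultCache()
cache.put("abc123", {"success": True}, now=0.0)
print(cache.get("abc123", now=1800.0), cache.get("abc123", now=3600.0))
```

Thirty minutes after the job finishes the result is still served; at the one-hour mark the entry has expired and the lookup returns None.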
The orchestrator doesn't execute jobs. Workers do that. But it controls everything else: when workers start, which jobs they get, what happens when they crash, when to scale up or down. Workers declare capabilities ("I'm a Maya 2026 headless worker"), jobs declare requirements ("I need a Maya worker"), orchestrator matches them intelligently.
Any Tool Can Submit Work
This was the vision: build the backbone once, let anything connect.
# DCC plugin submits batch export
from bats_client import submit_job

for scene_file in selected_scenes:
    submit_job(
        dcc_type="maya",
        script_path="jobs/animation/export_fbx.py",
        parameters={"scene": scene_file, "frame_range": "1-120"},
        priority=8
    )

# Game editor sends procedural generation
for building_config in city_block:
    submit_job(
        dcc_type="houdini",
        script_path="jobs/procedural/generate_building.py",
        parameters=building_config,
        priority=5
    )

# CI/CD pipeline validates assets
submit_job(
    dcc_type="maya",
    script_path="jobs/validation/check_naming_conventions.py",
    parameters={"asset_dir": "/assets/characters/"},
    priority=10  # Urgent, blocks merge
)
Each submitter is a few lines of code. The infrastructure (worker pools, job queuing, progress streaming, result caching, fault tolerance) lives in BATS. Build once, reuse everywhere.
The Bigger Vision: Universal Warm-Start Infrastructure
But here's where it gets interesting.
Any Python-callable process becomes a worker. Maya and Houdini were first because those were my immediate needs. But this pattern works for anything:
- Blender: Python API, headless rendering, procedural generation
- Nuke: Python scripting, compositing automation, render farm integration
- Substance Designer: Substance Automation Toolkit for graph manipulation
- Substance Painter: Python API for texture baking and export pipelines
- Photoshop: COM wrapper for Windows automation, batch processing
- FFmpeg: Video encoding, format conversion, frame extraction
- ImageMagick: Image processing, format conversion, compositing
- Pandoc: Document conversion (Markdown, PDF, HTML, DOCX)
- Any CLI tool: Wrap it, pipe it, call it from Python
The pattern is identical: boot once, stay hot, poll for jobs. The orchestrator doesnβt care if workers use Python API, C++ bindings, COM wrappers, or shell commands. It just dispatches job packages.
This isn't a DCC orchestrator. It's a universal job orchestrator. Content creation was just the entry point.
Python workers are the ultimate flexibility. dcc_type="python" workers aren't tied to any DCC. They're just Python interpreters staying warm. What can you keep loaded?
- AI/ML Models: Load PyTorch/TensorFlow once, process 1000 inferences without reload penalty
- Data Science Pipelines: Keep pandas/numpy/scipy loaded for ETL jobs
- Web Scraping: Maintain browser sessions (Selenium/Playwright) across jobs
- API Orchestration: Keep HTTP connection pools warm, manage rate limits globally
- Database Operations: Connection pools stay open, queries execute instantly
- Document Processing: Keep parsers loaded (PDF, Excel, CSV)
- Code Analysis: AST parsers, linters, formatters as persistent services
- Test Execution: Test frameworks loaded once, run suites in seconds
Inference-as-a-service becomes a BATS worker pool:
# AI worker stays hot with model loaded
@register_python_worker
def stable_diffusion_worker():
    model = load_stable_diffusion_model()  # Load once on boot (5-10s)
    while True:
        job = request_job(dcc_type="python", capability="stable_diffusion")
        if job:
            prompt = job.parameters["prompt"]
            image = model.generate(prompt)  # Inference on warm model (<1s)
            save_image(image, job.output_files[0])
            report_complete(job)

# Submit 100 image generation jobs
for prompt in prompts:
    submit_job(dcc_type="python", capability="stable_diffusion",
               parameters={"prompt": prompt})
Model loads once (5-10 seconds). 100 inferences run on the warm worker (<1 second each). No Python startup penalty, no model reload penalty. Batch ML inference becomes embarrassingly parallel.
Web automation becomes infrastructure:
# Selenium worker keeps browser session warm
from selenium import webdriver

@register_python_worker
def web_scraper_worker():
    driver = webdriver.Chrome()  # Launch once
    driver.get("https://api.example.com/login")
    login(driver)  # Authenticate once
    while True:
        job = request_job(dcc_type="python", capability="web_scraper")
        if job:
            data = scrape_page(driver, job.parameters["url"])
            save_data(data, job.output_files[0])
            report_complete(job)
Browser launches once. Authentication happens once. 1000 scraping jobs reuse the same session. No re-login penalty.
Data pipelines become parallel:
# Pandas worker processes CSV batches
import pandas as pd

@register_python_worker
def data_processor():
    schema = load_schema()  # Parse schema once
    while True:
        job = request_job(dcc_type="python", capability="data_transform")
        if job:
            df = pd.read_csv(job.input_files[0])
            transformed = transform(df, schema, job.parameters)
            transformed.to_csv(job.output_files[0])
            report_complete(job)
Parse 1000 CSVs? Spawn 20 workers, dispatch 50 jobs each. Embarrassingly parallel ETL.
This isn't limited to content creation anymore. Any warm-start Python workload becomes infrastructure. The architecture doesn't change. Add a worker type, define job schemas, submit work. The orchestrator handles the rest.
What's next? In Part 4, I'll show how Model Context Protocol (MCP) integration makes BATS agentic: you can describe jobs in natural language, and AI assistants translate to API calls. "Generate 10 hero swords with varying blade lengths" becomes 10 submitted jobs. Infrastructure becomes conversational.
When you build infrastructure where workers outlive jobs, you're not building a DCC automation tool. You're building a universal job substrate where anything Python can become a persistent, scalable worker pool. That's the real paradigm shift.
Why Centralized Orchestration Matters
Single Job Queue: All submitters (game editor, CLI tools, web dashboards, CI/CD pipelines) feed into one centralized priority queue. This means:
- Consistent priority resolution globally (canyon biome rocks always beat forest props)
- Dependency tracking across all jobs (simulation completes before rendering starts)
- No coordination overhead between multiple queues
- Fair resource allocation across all clients
Worker Pool Intelligence: Orchestrator decides when to scale, not individual workers:
- Pre-spawns base workers for instant job pickup (no 30-second Maya boot delay)
- Monitors queue depth to spawn additional workers dynamically
- Detects crashed workers via missed heartbeats and respawns automatically
- Shuts down idle workers after timeout (free resources when queue empties)
- Configuration-driven (add new DCC types in 15 minutes via config vs 8 hours of code)
Single Point of Visibility: Real-time system state from one source:
- "What's Worker 3 doing right now?" (Pool monitor dashboard)
- "How many jobs are pending?" (Queue depth)
- "What's the error rate for Maya jobs?" (Job history)
- "Why did job ABC123 fail?" (Per-job isolated logs)
Production Reliability: Orchestrator provides:
- Idempotent job retry (failed jobs retry safely without state corruption)
- Job result persistence (query results without re-running)
- Health monitoring with automatic recovery
- Resource management (prevent worker pool exhaustion)
- Audit trail (5-day log retention per job)
Without centralized orchestration, you'd need distributed coordination protocols, leader election, consensus algorithms, all far more complex than request-based dispatch alone. The orchestrator keeps it simple: workers pull jobs, orchestrator manages dispatch and state.
The Technology Choices
What the System Needed
- Multiple DCC instances running simultaneously (8 Houdinis processing different rocks)
- Work queue management (50 jobs distributed across available workers)
- Priority handling (canyon biome rocks urgent, forest biome can wait)
- Dependency support (simulation must complete before rendering)
- External API (any tool should submit jobs)
- Real-time visibility (monitor what's running where)
- Fault tolerance (Maya crash on job #23 shouldn't lose jobs #1-22)
What Workers Don't Need
Workers run one job at a time. No threading within workers, no async I/O complexity, no elaborate state machines. Each worker is a simple execution loop: boot DCC, request job, execute, report progress, request next job. Keep it simple.
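That execution loop is small enough to write out. The sketch below assumes the transport is injected as plain callables (`request_job`, `execute`, `report` are hypothetical stand-ins for the gRPC calls) and adds an idle sleep so an empty queue does not turn into a busy loop; `max_idle_polls` exists only so the example terminates.

```python
import time
from typing import Callable, Optional


def run_worker(
    request_job: Callable[[], Optional[dict]],
    execute: Callable[[dict], None],
    report: Callable[[dict, str], None],
    idle_sleep_s: float = 1.0,
    max_idle_polls: Optional[int] = None,
) -> int:
    """Poll, execute one job at a time, report, repeat. Returns the number of jobs processed."""
    processed = 0
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        job = request_job()
        if job is None:
            idle_polls += 1
            time.sleep(idle_sleep_s)  # back off while the queue is empty
            continue
        idle_polls = 0
        try:
            execute(job)
            report(job, "COMPLETED")
        except Exception as exc:  # a failed job must not kill the worker
            report(job, f"FAILED: {exc}")
        processed += 1
    return processed


pending = [{"job_id": "a"}, {"job_id": "b"}]
statuses = []
count = run_worker(
    request_job=lambda: pending.pop(0) if pending else None,
    execute=lambda job: None,
    report=lambda job, status: statuses.append((job["job_id"], status)),
    idle_sleep_s=0.0,
    max_idle_polls=1,
)
print(count, statuses)
```

A real worker would leave `max_idle_polls` unset and keep polling until the orchestrator shuts it down.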
gRPC: Streaming and Strong Typing
I started by prototyping with a simple HTTP REST API. Clients POST job requests, GET status updates, DELETE to cancel. Works fine for hello-world demos.
Production use exposed the cracks. Polling for status every 2 seconds = chatty, inefficient. No real-time output streaming. JSON schema drift causing runtime errors when client and server versions diverge. Manual connection retry logic.
gRPC solved all of this:
Server streaming: Clients submit a job, then receive a stream of progress updates without polling. Workers send output line by line as they execute. No artificial 2-second delay before seeing "Error: texture not found". You see it immediately.
Strong typing via protobuf: API contract is explicit. If I change job_id from string to int32, statically typed clients such as C# stop compiling against the regenerated stubs, and Python clients raise a TypeError as soon as they assign the wrong type. Schema drift surfaces at build time or first use instead of as a silent runtime error.
Language-agnostic: Python orchestrator, Python workers for DCC jobs, C# client for game editor, JavaScript client for web dashboard. All generated from the same .proto files. Change the API once, regenerate the stubs, and every client picks up the new contract.
Built-in resilience: Connection retries, keepalive pings, graceful shutdown. A hand-rolled REST client needs all of this implemented manually. gRPC handles it.
Tradeoff: debugging is harder. Can't just curl an endpoint. Need gRPC clients or tools like grpcurl. But I'll take difficult debugging over production runtime errors any day.
BATS Architecture: The Core Design
High-Level Overview
┌─────────────────────────────────────────────────────────────┐
│                      External Clients                       │
│  Mock Editor │ Game Editor │ CLI Tools │ Web Dashboard      │
└────────────────┬────────────────────────────────────────────┘
                 │ gRPC (port 50051)
                 │ ExternalJobAPI
                 ▼
┌─────────────────────────────────────────────────────────────┐
│                     Orchestrator Server                     │
│                                                             │
│  ┌──────────────┐      ┌─────────────────┐                  │
│  │ Job Manager  │──────│ Priority Queue  │                  │
│  │              │      │  (heapq 0-10)   │                  │
│  └──────────────┘      └─────────────────┘                  │
│         │                                                   │
│         ▼                                                   │
│  ┌──────────────────────────────────────┐                   │
│  │        DCC Pool Manager              │                   │
│  │  • 3 headless Maya workers           │                   │
│  │  • 1 GUI Maya worker                 │                   │
│  │  • 3 headless Houdini workers        │                   │
│  │  • 1 GUI Houdini worker              │                   │
│  │  = 8 workers total (configurable)    │                   │
│  └──────────────────────────────────────┘                   │
│         │                                                   │
│         │ gRPC (internal)                                   │
│         │ OrchestratorService                               │
└─────────┼───────────────────────────────────────────────────┘
          │
     ┌────┴───────┬────────────┬──────────┐
     │            │            │          │
┌─────────┐  ┌─────────┐  ┌─────────┐    ...
│  Maya   │  │  Maya   │  │ Houdini │  (8 workers)
│ Worker  │  │ Worker  │  │ Worker  │
│ v4.0.0  │  │ v4.0.0  │  │ v2.0.0  │
└─────────┘  └─────────┘  └─────────┘
The Three-Layer Design
Layer 1: External Job API (Client-facing)
service ExternalJobAPI {
    rpc SubmitJob(JobRequest) returns (JobResponse);
    rpc StreamJobStatus(JobQuery) returns (stream JobUpdate);
    rpc GetJobResult(JobQuery) returns (JobResult);
    rpc CancelJob(JobQuery) returns (JobResponse);
    rpc ListJobs(JobListQuery) returns (JobList);
}
Clients submit jobs, monitor progress, retrieve results. They don't know about workers or pools. Just jobs.
Layer 2: Orchestrator (Central coordinator)
- Priority queue (heapq-based, 0-10 scale where 10 = urgent)
- Worker pool management (spawn, track, health check)
- Job-to-worker assignment
- Result caching (1-hour TTL)
- Dependency resolution
Layer 3: DCC Workers (Job executors)
service OrchestratorService {
    rpc RegisterClient(DCCClientInfo) returns (RegisterResponse);
    rpc RequestJob(DCCClientInfo) returns (Job);
    rpc UpdateStatus(JobStatusUpdate) returns (Empty);
    rpc Heartbeat(DCCClientInfo) returns (RegisterResponse);
}
Workers register, poll for jobs matching their capabilities, execute, report progress.
Design Decisions That Mattered
Request-based job dispatch was the first critical choice. Covered in depth above: workers initiate the exchange, eliminating push race conditions while the orchestrator actively manages state, assignment tracking, and transmits complete job packages. The practical payoff: crashed workers don't lose jobs, busy workers don't get overloaded, and the orchestrator stays simple.
Dual-mode job execution came from hard lessons in Part 2. Inline scripts work great for prototyping:
job = JobRequest(
    dcc_type="maya",
    execution_mode=ExecutionMode.HEADLESS,
    script="""
import maya.cmds as cmds
sphere = cmds.polySphere(radius={radius}, name="{name}")[0]
cmds.move(0, {height}, 0, sphere)
cmds.file(rename="{output_path}")
cmds.file(save=True, type="mayaBinary")
print(f"Created: {output_path}")
""",
    parameters={"radius": "2.0", "height": "5.0", "name": "test_sphere", "output_path": "C:/temp/sphere.mb"}
)
But production needs reusable modules:
# job_orchestrator/jobs/examples/maya/generate_hero_sword.py
from typing import Any

def main(parameters: dict[str, Any]) -> None:
    """Entry point called by orchestrator."""
    asset_name = parameters.get('asset_name', 'hero_sword')
    output_dir = parameters.get('output_dir')
    # Parameters arrive as strings, and bool("False") is True, so compare the text instead
    export_fbx = str(parameters.get('export_fbx', 'True')).strip().lower() == 'true'
    # Full IDE support: autocomplete, debugging, type checking
    result = create_hero_sword(
        asset_name=asset_name,
        blade_length=float(parameters.get('blade_length', 1.2)),
        handle_length=float(parameters.get('handle_length', 0.3)),
        output_dir=output_dir,
        export_fbx=export_fbx
    )
    print(f"Created: {result['maya_file']}")

# Submit using MODULE MODE
job = JobRequest(
    dcc_type="maya",
    execution_mode=ExecutionMode.HEADLESS,
    module_path="job_orchestrator.jobs.examples.maya.generate_hero_sword",
    entry_point="main",
    parameters={"asset_name": "hero_sword_v1", "blade_length": "1.5", "output_dir": "D:/Assets/Weapons", "export_fbx": "True"}
)
MODULE MODE gives you proper IDE support, breakpoint debugging in Maya/Houdini GUI, unit tests, version control. STRING MODE remains perfect for quick experiments.
Priority queuing with dependencies handles both urgency and workflow constraints:
# Urgency: higher priority runs first
job1 = JobRequest(priority=5, ...)   # Normal
job2 = JobRequest(priority=10, ...)  # Urgent, runs first

# Workflow: simulation must complete before rendering
sim_job = JobRequest(job_id="houdini_sim_001", dcc_type="houdini", priority=8, script="# Run fluid simulation")
render_job = JobRequest(job_id="maya_render_001", dcc_type="maya", priority=8, dependencies=["houdini_sim_001"], script="# Render")
Job manager ensures high-priority jobs run first unless blocked by dependencies. Failed dependencies cascade: if simulation fails, rendering auto-cancels.
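The two dependency rules (a job is dispatchable only once its dependencies completed, and failures cascade) can be sketched over a plain state table. The `State` enum and both function names are hypothetical; the job IDs are the ones from the example above.

```python
from enum import Enum
from typing import Dict, List


class State(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


def ready_jobs(states: Dict[str, State], dependencies: Dict[str, List[str]]) -> List[str]:
    """Pending jobs whose dependencies have all completed."""
    return sorted(
        job_id
        for job_id, state in states.items()
        if state is State.PENDING
        and all(states.get(dep) is State.COMPLETED for dep in dependencies.get(job_id, []))
    )


def cascade_failures(states: Dict[str, State], dependencies: Dict[str, List[str]]) -> List[str]:
    """Cancel pending jobs that depend, directly or transitively, on a failed or cancelled job."""
    cancelled: List[str] = []
    changed = True
    while changed:  # repeat until no further job gets cancelled
        changed = False
        for job_id, state in states.items():
            if state is not State.PENDING:
                continue
            deps = dependencies.get(job_id, [])
            if any(states.get(dep) in (State.FAILED, State.CANCELLED) for dep in deps):
                states[job_id] = State.CANCELLED
                cancelled.append(job_id)
                changed = True
    return cancelled


states = {"houdini_sim_001": State.PENDING, "maya_render_001": State.PENDING}
deps = {"maya_render_001": ["houdini_sim_001"]}
runnable = ready_jobs(states, deps)         # only the simulation can start
states["houdini_sim_001"] = State.FAILED
cancelled = cascade_failures(states, deps)  # the render is cancelled with it
print(runnable, cancelled)
```

The repeated pass in `cascade_failures` is what makes the cancellation transitive: a job cancelled in one pass can trigger the cancellation of its own dependents in the next.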
Per-job logging keeps debugging sane. Each job gets an isolated log file:
.temp/logs/
├── maya/
│   ├── jobs/
│   │   ├── job_abc123_20260321_143022.log
│   │   └── job_def456_20260321_143045.log
│   ├── worker_headless_1.log
│   └── worker_gui.log
├── houdini/
└── orchestrator.log
Script source, parameters, execution milestones, DCC output, error tracebacks, metadata. Everything you need to debug "why did job #23 fail?" Five-day retention, automatic cleanup.
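The naming scheme and the five-day cleanup could be implemented along these lines. This is a sketch assuming retention is judged by file modification time; `job_log_path` and `prune_old_logs` are hypothetical names, and the directory layout is taken from the tree above.

```python
import time
from datetime import datetime
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 5


def job_log_path(log_root: Path, dcc_type: str, job_id: str, started: datetime) -> Path:
    """Build <root>/<dcc>/jobs/job_<id>_<YYYYMMDD_HHMMSS>.log"""
    stamp = started.strftime("%Y%m%d_%H%M%S")
    return log_root / dcc_type / "jobs" / f"job_{job_id}_{stamp}.log"


def prune_old_logs(log_root: Path, retention_days: int = RETENTION_DAYS,
                   now: Optional[float] = None) -> List[Path]:
    """Delete per-job logs older than the retention window; return what was removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    removed = []
    for log_file in log_root.glob("*/jobs/*.log"):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            removed.append(log_file)
    return removed


example = job_log_path(Path(".temp/logs"), "maya", "abc123", datetime(2026, 3, 21, 14, 30, 22))
print(example.as_posix())
```

The example path comes out as `.temp/logs/maya/jobs/job_abc123_20260321_143022.log`, matching the first entry in the tree. Worker and orchestrator logs sit outside the `jobs/` folders, so the glob leaves them alone.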
Worker pool auto-scaling balances resource usage with throughput:
{
    "worker_pools": {
        "maya": { "headless_count": 3, "gui_count": 1, "max_workers": 20 },
        "houdini": { "headless_count": 3, "gui_count": 1, "max_workers": 20 }
    }
}
Pool manager pre-spawns base workers (3 headless + 1 GUI per DCC). Queue backs up? Spawn more workers. Queue empties? Shut down idle workers after 5 minutes. Worker crashes? Respawn automatically. The swarm adjusts to workload.
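The scaling rules reduce to a small decision function. The thresholds below come from the post (queue deeper than 10 jobs, 5-minute idle timeout, 20 workers maximum); everything else is an assumption for illustration, including the names `PoolConfig` and `scaling_delta`, spawning one worker per evaluation, and never shrinking below the pre-spawned base pool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PoolConfig:
    base_workers: int = 4           # 3 headless + 1 GUI per DCC
    max_workers: int = 20
    scale_up_queue_depth: int = 10  # spawn when the queue exceeds this
    idle_shutdown_s: float = 300.0  # shut down after 5 idle minutes


def scaling_delta(config: PoolConfig, current_workers: int, queue_depth: int,
                  idle_seconds: Sequence[float]) -> int:
    """Return +1 to spawn a worker, -n to shut down n idle workers, 0 to do nothing.

    idle_seconds lists how long each currently idle worker has been idle.
    """
    if queue_depth > config.scale_up_queue_depth and current_workers < config.max_workers:
        return 1
    expired = sum(1 for s in idle_seconds if s >= config.idle_shutdown_s)
    surplus = current_workers - config.base_workers
    if queue_depth == 0 and expired > 0 and surplus > 0:
        return -min(expired, surplus)  # assumption: the base pool always stays warm
    return 0


config = PoolConfig()
print(scaling_delta(config, current_workers=4, queue_depth=15, idle_seconds=[]))
print(scaling_delta(config, current_workers=6, queue_depth=0, idle_seconds=[400, 10, 350]))
```

The first call asks for one more worker because the queue is backed up; the second retires the two extra workers that have been idle past the timeout.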
Implementation Journey: What I Learned
Phase 1: Protobuf Schema Evolution (Week 1)
Protocol buffers turned out more complex than expected. My first schema:
message JobRequest {
    string dcc_type = 1;
    string script = 2;
}

message JobResult {
    bool success = 1;
    string output = 2;
}
This lasted about two hours before I hit reality: no parameters (jobs were hardcoded scripts), no execution mode (couldn't specify headless vs GUI), no priority (FIFO processing), no dependencies (couldn't chain workflows), no progress tracking (just "running" or "done").
Evolution happened in waves. Parameters added, then execution modes, then priority, then dependencies, then streaming progress updates. Each addition meant regenerating code for Python, updating orchestrator logic, updating worker logic, and testing all permutations.
By week's end: 400 lines of protobuf definitions, comprehensive job model, but also: every schema change broke existing clients. Learned to version the API early.
Phase 2: Worker Bootstrapping (Week 2)
Booting Maya workers sounds simple. It's not.
First attempt: launch mayapy.exe, sleep 30 seconds, hope for the best.
process = subprocess.Popen([
    "C:/Program Files/Autodesk/Maya2026/bin/mayapy.exe",
    "maya_rpc_server.py",
    "--port", str(worker_port)
])
time.sleep(30)  # How long is long enough? ¯\_(ツ)_/¯
Maya takes 15-45 seconds to boot depending on machine. No way to know if it booted successfully or crashed. 30-second delay isn't enough on slow machines, too long on fast ones. Import errors fail silently until Maya's up.
Second attempt: health check polling.
def start_worker(dcc_type: str, port: int, timeout: int = 120):
    process = subprocess.Popen([dcc_exe, worker_script, "--port", str(port)])
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            channel = grpc.insecure_channel(f'localhost:{port}')
            stub = OrchestratorServiceStub(channel)
            stub.Heartbeat(DCCClientInfo(dcc_type=dcc_type))
            _LOGGER.info(f"Worker {dcc_type} ready on port {port}")
            return process
        except grpc.RpcError:
            time.sleep(1)
    raise TimeoutError(f"Worker {dcc_type} failed to start in {timeout}s")
Better. But Maya sometimes starts but gRPC server fails (import errors). Need separate timeout for "process started" vs "gRPC ready". Need stdout/stderr capture to debug boot failures.
Final solution: worker registration protocol. Workers register themselves when ready.
class MayaRPCWorker:
    def __init__(self, orchestrator_host: str, orchestrator_port: int):
        self.orchestrator_channel = grpc.insecure_channel(f'{orchestrator_host}:{orchestrator_port}')
        self.stub = OrchestratorServiceStub(self.orchestrator_channel)

    def start(self):
        response = self.stub.RegisterClient(DCCClientInfo(
            dcc_type="maya",
            version=cmds.about(version=True),
            capabilities=["modeling", "rendering", "animation"],
            worker_id=self.worker_id
        ))
        if response.status == "READY":
            _LOGGER.info("Registered with orchestrator")
            self.poll_loop()
Orchestrator side:
class DCCPoolManager:
    def start_instance(self, dcc_type: str) -> subprocess.Popen:
        port = self._allocate_port()
        # Note: piped stdout/stderr must be drained (for example by a reader thread),
        # otherwise the DCC can block once the OS pipe buffer fills up
        process = subprocess.Popen(
            [dcc_exe, worker_script, "--orchestrator-host", "localhost", "--orchestrator-port", "50051"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        registered = self._wait_for_registration(dcc_type, timeout=120)
        if not registered:
            process.kill()
            raise TimeoutError(f"Worker {dcc_type} failed to register")
        return process
Workers register when actually ready, not based on arbitrary sleep. Orchestrator knows worker capabilities before assigning jobs. Failed registration = kill process cleanly. Workers report DCC version, Python version, etc.
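The post does not show `_wait_for_registration`, so here is one way such a wait could work: the RegisterClient handler sets an event, and the pool manager blocks on it with a timeout. `RegistrationTracker` is a hypothetical helper keyed by worker ID, written for a threaded server; an asyncio orchestrator would use `asyncio.Event` instead.

```python
import threading
from typing import Dict


class RegistrationTracker:
    """Lets the pool manager block until a specific worker has registered."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: Dict[str, threading.Event] = {}

    def _event_for(self, worker_id: str) -> threading.Event:
        # Created on first use by whichever side arrives first (waiter or worker)
        with self._lock:
            return self._events.setdefault(worker_id, threading.Event())

    def mark_registered(self, worker_id: str) -> None:
        """Called from the RegisterClient handler when a worker reports in."""
        self._event_for(worker_id).set()

    def wait_for_registration(self, worker_id: str, timeout: float) -> bool:
        """True if the worker registered within `timeout` seconds, False otherwise."""
        return self._event_for(worker_id).wait(timeout)


tracker = RegistrationTracker()
# Simulate a worker that finishes booting after 50 ms
threading.Timer(0.05, tracker.mark_registered, args=["maya_headless_1"]).start()
booted = tracker.wait_for_registration("maya_headless_1", timeout=2.0)
never_came = tracker.wait_for_registration("maya_headless_2", timeout=0.05)
print(booted, never_came)
```

Because the event is created by whichever side arrives first, a worker that registers before the pool manager starts waiting is still detected.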
Lesson: Don't poll from outside. Let workers push their ready state.
Phase 3: Understanding Priority Queue Behavior (Week 3)
Priority system works, but not how people initially expected.
The implementation uses heapq with negative priorities for max-heap behavior:
async def enqueue_job(self, job: Job) -> None:
    async with self.lock:
        # Use negative priority for max-heap (higher priority first)
        heapq.heappush(self._job_queue, (-job.priority, self._queue_counter, job))
        self._queue_counter += 1
        self.active_jobs[job.job_id] = job
Priority 10 jobs dequeue before priority 5 jobs. Works perfectly... when there is a backlog in the queue.
But here's what surprised people:
Timeline:
T=0s: Submit 8 low-priority jobs (priority 3)
T=1s: All 8 workers pick up jobs immediately (queue was empty)
T=2s: Submit 15 high-priority jobs (priority 10)
T=2s: High-priority jobs wait in queue
T=8m: First low-priority job completes, high-priority job starts
The issue: Priority determines queue position, not execution interruption. Once a worker grabs a job, it runs to completion. High-priority jobs arriving later must wait for workers to become free.
This is non-preemptive priority scheduling (queueing theory calls it "head-of-the-line" priority). The priority queue stops queued low-priority work from delaying queued high-priority work, but it does nothing about low-priority work that is already running.
Why no preemption? I considered job cancellation and checkpoint/resume, but the complexity didn't justify the benefit:
- Most jobs run 2-10 minutes (not multi-hour)
- Urgent work (priority 9-10) is rare in practice
- Simple mitigation: keep a few workers idle for urgent work, or temporarily spawn extras
The priority system does exactly what it should: ensures high-priority work executes next, not immediately. For batch processing, that’s usually sufficient.
Lesson: Priority queues control dispatch order, not running job interruption. If you need preemption, design jobs to be interruptible from the start (checkpoint at natural boundaries, support resume). Don’t retrofit preemption onto long-running tasks; it’s almost always more complex than it’s worth.
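The dispatch-order behaviour described above can be reproduced in a few lines. This is a standalone sketch with made-up job names, using the same negated-priority `heapq` trick plus a counter as a FIFO tie-breaker; it is not the orchestrator's actual queue code.

```python
import heapq
import itertools

job_queue: list[tuple[int, int, str]] = []
counter = itertools.count()  # tie-breaker: preserves FIFO order within a priority


def enqueue(name: str, priority: int) -> None:
    # Negate the priority so the largest value pops first from the min-heap.
    heapq.heappush(job_queue, (-priority, next(counter), name))


def dequeue() -> str:
    return heapq.heappop(job_queue)[2]


# Two workers are free and the queue is empty, so low-priority jobs start at once.
enqueue("low_a", 3)
enqueue("low_b", 3)
running = [dequeue(), dequeue()]

# Urgent work arrives while both workers are busy; it can only wait.
enqueue("low_c", 3)
enqueue("urgent_a", 10)
enqueue("urgent_b", 10)

# As workers free up, the queue hands out the urgent jobs first.
dispatch_order = [dequeue() for _ in range(len(job_queue))]
print(running)         # ['low_a', 'low_b']
print(dispatch_order)  # ['urgent_a', 'urgent_b', 'low_c']
```

The urgent jobs jump ahead of `low_c`, which was queued before them, but nothing in the queue can touch `low_a` and `low_b` once they are running.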
Phase 4: Monitoring and Debugging (Week 4-5)
8-50 workers running across multiple machines makes debugging… interesting.
Three constant questions:
- “Job ABC123 failed”: which worker ran it? What logs?
- “System is slow”: are all workers busy? Crashed? Idle?
- “Houdini jobs hang”: hanging or just slow?
Per-job logging (already covered) solved #1. Every job gets an isolated log file. Job fails? Error message includes log path.
System tray monitor solved quick access:
class BATSTrayApp(QSystemTrayIcon):
    def __init__(self):
        super().__init__()
        self.setIcon(QIcon("bats_icon.png"))
        # Keep a reference on self: the tray icon does not take ownership of the menu
        self.menu = QMenu()
        self.menu.addAction("Start Orchestrator", self.start_orchestrator)
        self.menu.addAction("Stop Orchestrator", self.stop_orchestrator)
        self.menu.addAction("Open Pool Monitor", self.open_pool_monitor)
        self.menu.addSeparator()
        self.menu.addAction("View Logs", self.open_logs)
        self.menu.addAction("Quit", self.quit)
        self.setContextMenu(self.menu)
DCC Pool Monitor solved #2 and #3, a real-time dashboard showing all workers:
╔════════════════════════════════════════════════════════════╗
║ BATS Pool Monitor - localhost:50051                        ║
╠════════════════════════════════════════════════════════════╣
║ Maya Workers (4/4 ready)                                   ║
║  • maya_headless_1    [READY]   CPU:  5%  Mem: 1.2GB       ║
║  • maya_headless_2    [BUSY]    CPU: 82%  Mem: 2.8GB       ║
║      └─ Job: hero_sword_v1 [Progress: 65%]                 ║
║  • maya_headless_3    [BUSY]    CPU: 91%  Mem: 3.1GB       ║
║      └─ Job: rock_scan_023 [Progress: 45%]                 ║
║  • maya_gui           [READY]   CPU: 12%  Mem: 2.1GB       ║
║                                                            ║
║ Houdini Workers (4/4 ready)                                ║
║  • houdini_headless_1 [BUSY]    CPU: 88%  Mem: 4.2GB       ║
║      └─ Job: fluid_sim_canyon [Progress: 23%]              ║
║  • houdini_headless_2 [READY]   CPU:  8%  Mem: 1.8GB       ║
║  • houdini_headless_3 [CRASHED] - Respawning in 5s...      ║
║  • houdini_gui        [READY]   CPU: 15%  Mem: 2.5GB       ║
║                                                            ║
║ Queue: 12 jobs pending (2 priority 10, 10 priority 5)      ║
╚════════════════════════════════════════════════════════════╝
Updates every 2 seconds via gRPC polling. Worker crashes? See it immediately. Job hung? CPU at 0% reveals the problem. Queue backing up? Spawn more workers.
Distributed systems need visibility. Build monitoring tools early, not after drowning in mystery failures.
The Results: Was It Worth It?
Performance Comparison
Scenario: Process 50 rock scans (Houdini → Substance → Maya)
| Approach | Time | Speedup | Notes |
|---|---|---|---|
| Original Threading | 6.7 hours | 1.0x | Sequential, complex code |
| QProcess (Part 2) | 6.7 hours | 1.0x | Sequential, simple code |
| BATS (8 workers) | 1.0 hour | 6.7x | Parallel, distributed |
| BATS (20 workers) | 24 min | 16.7x | Full machine utilization |
| BATS (50 workers) | 10 min | 40x | One worker per asset (theoretical max) |
Real Production Numbers (from logs):
2026-03-15 14:23:01 [Orchestrator] Batch job started: canyon_biome_rocks
2026-03-15 14:23:01 [Orchestrator] Jobs queued: 50 (priority 9)
2026-03-15 14:23:05 [Pool] Workers ready: 8 Maya, 8 Houdini
2026-03-15 15:18:43 [Orchestrator] Batch job completed: canyon_biome_rocks
Total time: 55 minutes 42 seconds
Average per asset: 1 minute 7 seconds (parallelized)
Sequential estimate: 50 × 8 minutes = 6.7 hours
Actual speedup: 7.2x
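The figures in that summary can be re-derived from the two log timestamps and the 8-minute sequential estimate. A small sketch of the arithmetic, in case you want to run the same check against your own logs:

```python
from datetime import datetime

started = datetime(2026, 3, 15, 14, 23, 1)    # "Batch job started"
finished = datetime(2026, 3, 15, 15, 18, 43)  # "Batch job completed"
assets = 50
minutes_per_asset_sequential = 8  # the per-asset sequential estimate

elapsed_s = (finished - started).total_seconds()            # 3342.0
sequential_s = assets * minutes_per_asset_sequential * 60   # 24000
per_asset_s = elapsed_s / assets                            # 66.84
speedup = sequential_s / elapsed_s

print(f"elapsed: {int(elapsed_s // 60)} min {int(elapsed_s % 60)} s")  # elapsed: 55 min 42 s
print(f"per asset: {per_asset_s:.1f} s")                               # per asset: 66.8 s
print(f"speedup: {speedup:.1f}x")                                      # speedup: 7.2x
```

Note that the baseline is an estimate (50 × 8 minutes), not a measured sequential run, so the speedup is only as accurate as that per-asset figure.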
Code Complexity Comparison
| Metric | QProcess (Part 2) | BATS (Part 3) | Delta |
|---|---|---|---|
| Lines of code (tool) | 150 | 50 (client only) | -100 |
| Lines of code (infrastructure) | 0 | 2,500 (orchestrator + workers) | +2,500 |
| External dependencies | 0 | grpc, protobuf | +2 |
| Deployment complexity | Single .py file | Distributed system | High |
| Maintenance burden | Low | Medium-High | ↑ |
When BATS Makes Sense
BATS shines when you have truly parallelizable work: independent assets, concurrent API calls, batch processing where each item has zero dependencies on others. The parallelizable portion needs to be significant (over 50% of runtime) or you’re just adding complexity for marginal gains.
Scaling beyond single-machine limits? Multiple clients need to submit jobs (game editor, web tools, CLI, CI/CD)? Priority and dependency management valuable? BATS handles all of this.
QProcess makes more sense when work is sequential or mostly sequential, processing one asset at a time, UI responsiveness is the only goal (not speed), or simplicity and maintainability trump raw performance. Self-contained tools don’t need distributed infrastructure.
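The "over 50% of runtime" rule of thumb lines up with Amdahl's law, which caps the speedup at 1 / ((1 - p) + p / n) for a parallelizable fraction p and n workers. A quick sketch (the function name is mine, not part of BATS) shows how fast the ceiling drops when part of the job stays serial:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the runtime parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for fraction in (0.5, 0.9, 1.0):
    print(f"{fraction:.0%} parallel, 8 workers: {amdahl_speedup(fraction, 8):.2f}x")
# 50% parallel, 8 workers: 1.78x
# 90% parallel, 8 workers: 4.71x
# 100% parallel, 8 workers: 8.00x
```

With only half the runtime parallelizable, eight workers buy less than a 2x gain, which is rarely worth a distributed system.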
The Honest Assessment
BATS gave me 7x speedup (8 workers) in production, with room to scale to 40x (50 workers).
But it also added:
- 2,500 lines of infrastructure code
- gRPC/protobuf compilation step
- Distributed system debugging complexity
- Process lifecycle management
- Worker health monitoring
Worth it for 50 concurrent assets? Absolutely. For single-asset processing? No. QProcess remains the better choice.
That’s the real win: universal job infrastructure, not just “the tool that batches rocks.”
Design Patterns That Worked
Request-Based Work Dispatch
The request-dispatch pattern in action. Workers poll when ready; the orchestrator matches, assigns, and transmits complete job packages. What this looks like in the worker loop:
# Worker loop
while True:
    job = stub.RequestJob(DCCClientInfo(
        worker_id=self.worker_id,
        dcc_type="maya",
        capabilities=["modeling", "rendering"]
    ))
    if job.job_id:
        result = self.execute_job(job)
        stub.UpdateStatus(JobStatusUpdate(job_id=job.job_id, status=result.status, progress=100))
    else:
        time.sleep(5)
Workers control their own request rate (can’t be overloaded). Crashed workers don’t lose assigned jobs (job stays in queue). Fast workers automatically request more jobs (natural load balancing).
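The load-balancing claim is easy to see without gRPC. The sketch below swaps the orchestrator for an in-process `queue.Queue` and the DCC work for `time.sleep`; the worker names and timings are invented for the demonstration.

```python
import queue
import threading
import time
from collections import Counter

job_queue = queue.Queue()
for index in range(12):
    job_queue.put(f"job_{index:02d}")

completed = Counter()  # worker name -> number of jobs finished
completed_lock = threading.Lock()


def worker_loop(name: str, seconds_per_job: float) -> None:
    while True:
        try:
            job = job_queue.get_nowait()  # the worker asks; nothing is pushed to it
        except queue.Empty:
            return  # no work left, so this demo worker exits
        time.sleep(seconds_per_job)  # stand-in for real DCC work on `job`
        with completed_lock:
            completed[name] += 1


threads = [
    threading.Thread(target=worker_loop, args=("fast", 0.01)),
    threading.Thread(target=worker_loop, args=("slow", 0.05)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(dict(completed))  # the fast worker usually ends up with most of the jobs
```

Nobody decides up front how many jobs each worker gets. The split falls out of how quickly each one comes back for more.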
Idempotent Job Execution
Jobs can be retried safely without side effects.
import os

class Job:
    def __init__(self, job_id: str, ...):  # remaining parameters elided
        self.job_id = job_id
        self.output_path = f"D:/Assets/{job_id}_output.mb"  # Deterministic
        self.retry_count = 0
        self.max_retries = 3

    def execute(self):
        if os.path.exists(self.output_path):
            os.remove(self.output_path)  # Always start clean
        # Do work...
        # Atomic write (temp + replace)
        temp_path = f"{self.output_path}.tmp"
        write_output(temp_path)
        # os.replace overwrites an existing target on every platform;
        # os.rename raises on Windows if the target already exists
        os.replace(temp_path, self.output_path)
Failed jobs retry without corrupting state. Easy to implement “retry last N failed jobs” commands. No complex rollback logic needed.
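For the temp-then-swap step on its own, here is a self-contained sketch. `write_atomically` is a hypothetical helper, not the BATS implementation; it assumes the output fits in memory as bytes and uses `tempfile.mkstemp` in the destination directory so the final `os.replace` stays on a single filesystem.

```python
import os
import tempfile


def write_atomically(output_path: str, data: bytes) -> None:
    """Write data to output_path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(output_path))
    # Create the temp file next to the target so os.replace never crosses filesystems.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(temp_path, output_path)  # overwrites an existing file
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "rock_001_output.mb")
    write_atomically(target, b"first attempt")
    write_atomically(target, b"retry")  # a retry simply replaces the old output
    with open(target, "rb") as handle:
        final_contents = handle.read()
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]

print(final_contents, leftovers)  # b'retry' []
```

A unique temp name per attempt also means two retries of the same job cannot trample each other's half-written file.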
Structured Logging Everywhere
Every component logs to predictable locations with consistent formatting.
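import logging
import os
from datetime import datetime
from typing import Optional

def setup_logging(component: str, job_id: Optional[str] = None) -> logging.Logger:
    if job_id:
        # Timestamp format is an assumption; the original timestamp() helper is not shown
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        log_path = f".temp/logs/{component}/jobs/job_{job_id}_{stamp}.log"
    else:
        log_path = f".temp/logs/{component}/{component}.log"
    # FileHandler does not create missing directories
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(name)s: %(message)s'))
    logger = logging.getLogger(component)
    logger.addHandler(handler)
    return logger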
When debugging: find job ID from error message, open .temp/logs/maya/jobs/job_{job_id}_*.log, see complete execution trace.
Health Checks and Auto-Recovery
Workers send heartbeats. Orchestrator respawns silent workers.
import time

class DCCPoolManager:
    def health_check_loop(self):
        while True:
            # Iterate over a snapshot so workers registering mid-loop cannot break iteration
            for worker_id, worker in list(self.workers.items()):
                if time.time() - worker.last_heartbeat > 60:
                    _LOGGER.warning(f"Worker {worker_id} missed heartbeat, respawning")
                    self.respawn_worker(worker_id)
            time.sleep(30)

    def respawn_worker(self, worker_id: str):
        worker = self.workers[worker_id]
        worker.process.kill()
        worker.process = self.start_instance(dcc_type=worker.dcc_type)
        # Reset the timer so the fresh process is not flagged again on the next pass
        worker.last_heartbeat = time.time()
System self-heals from worker crashes. No manual intervention for transient failures. Long-running orchestrator stays stable.
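The staleness check itself is worth pulling out into a pure function so it can be unit tested without spawning processes. A sketch, with `WorkerRecord` and `find_stale_workers` as illustrative names that are not part of BATS:

```python
from dataclasses import dataclass


@dataclass
class WorkerRecord:
    worker_id: str
    last_heartbeat: float  # seconds, from the same clock as `now`


def find_stale_workers(
    workers: list[WorkerRecord], now: float, timeout_s: float = 60.0
) -> list[str]:
    """Return the IDs of workers whose last heartbeat is older than timeout_s."""
    return [w.worker_id for w in workers if now - w.last_heartbeat > timeout_s]


pool = [
    WorkerRecord("maya_headless_1", last_heartbeat=1000.0),
    WorkerRecord("maya_headless_2", last_heartbeat=930.0),
    WorkerRecord("houdini_headless_3", last_heartbeat=1005.0),
]
stale = find_stale_workers(pool, now=1010.0)
print(stale)  # ['maya_headless_2']
```

Passing `now` in keeps the function deterministic. A real loop would likely feed it `time.monotonic()`, which is not affected by system clock adjustments.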
What I’d Do Differently
1. Start with SQLite for Job State
I used an in-memory dictionary for job results:
class JobResultStore:
    def __init__(self):
        self.results: dict[str, JobResult] = {}  # Lost on restart
Better approach:
import json
import sqlite3
from datetime import datetime

class JobResultStore:
    def __init__(self, db_path: str = ".temp/jobs.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS job_results (
                job_id TEXT PRIMARY KEY,
                status TEXT,
                result_json TEXT,
                created_at TIMESTAMP,
                completed_at TIMESTAMP
            )
        """)

    def store(self, job_id: str, result: JobResult):
        self.conn.execute(
            "INSERT OR REPLACE INTO job_results VALUES (?, ?, ?, ?, ?)",
            # Store an ISO-8601 string: the implicit datetime adapter is deprecated since Python 3.12
            (job_id, result.status, json.dumps(result.to_dict()),
             datetime.now().isoformat(), result.completed_at)
        )
        self.conn.commit()
Job history persists across orchestrator restarts. Can query “all failed jobs from last week”. Easy to build analytics dashboard.
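As an illustration of that query, here is a sketch against an in-memory database. It assumes timestamps are stored as ISO-8601 text and that the status values are the strings `completed` and `failed`; both are assumptions, since the real `JobResult` fields are not shown here.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demonstration
conn.execute("""
    CREATE TABLE job_results (
        job_id TEXT PRIMARY KEY,
        status TEXT,
        result_json TEXT,
        created_at TEXT,
        completed_at TEXT
    )
""")

now = datetime(2026, 3, 15, 12, 0, 0)
rows = [
    ("rock_001", "completed", "{}", now - timedelta(days=1)),
    ("rock_002", "failed", "{}", now - timedelta(days=2)),
    ("rock_003", "failed", "{}", now - timedelta(days=30)),
]
for job_id, status, result_json, finished in rows:
    stamp = finished.isoformat()
    conn.execute(
        "INSERT INTO job_results VALUES (?, ?, ?, ?, ?)",
        (job_id, status, result_json, stamp, stamp),
    )

# ISO-8601 strings sort chronologically, so a plain string comparison works.
cutoff = (now - timedelta(days=7)).isoformat()
failed_last_week = [
    row[0]
    for row in conn.execute(
        "SELECT job_id FROM job_results"
        " WHERE status = 'failed' AND completed_at >= ? ORDER BY completed_at",
        (cutoff,),
    )
]
print(failed_last_week)  # ['rock_002']
```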
2. Expose Prometheus Metrics
I built a custom monitoring UI. Prometheus would be more standard.
from prometheus_client import Counter, Histogram, Gauge, start_http_server
jobs_started = Counter('bats_jobs_started_total', 'Total jobs started')
jobs_completed = Counter('bats_jobs_completed_total', 'Total jobs completed', ['status'])
job_duration = Histogram('bats_job_duration_seconds', 'Job execution time')
active_workers = Gauge('bats_active_workers', 'Number of active workers', ['dcc_type'])
# In code:
jobs_started.inc()
job_duration.observe(elapsed_time)
active_workers.labels(dcc_type='maya').set(len(maya_workers))
# Expose metrics on :9090
start_http_server(9090)
Now Grafana can visualize: jobs per minute over time, average job duration by DCC type, worker pool utilization, error rates.
3. Add Job Scaffolding Tools
MODULE MODE is the first-class, recommended approach for production jobs. STRING MODE exists for quick experiments and prototyping, but all real work should be modules with proper IDE support, debugging, and version control.
Future improvement: Add CLI scaffolding for job creation:
$ bats create-job maya/my_new_asset
Created: job_orchestrator/jobs/studio/maya/my_new_asset.py
Template includes:
- main(parameters) entry point
- Example parameter parsing
- Error handling boilerplate
- Unit test stub
This would make it even easier to start new jobs with the right patterns from day one. The dual-mode design (STRING for prototyping, MODULE for production) is solid; scaffolding would just remove friction from the MODULE path.
4. Build the C# Client Earlier
I focused on Python clients for the first 6 months. But our game editor was C#, and the team had to wait for C# bindings.
Lesson: If you know you’ll need multi-language clients, build them in parallel. gRPC makes this easy (protoc generates client code), but testing takes time.
Lessons from the Journey
Parallelize at the Right Level
Part 2 attempted to thread pipeline steps (Houdini → Substance → Maya). Wrong: they’re sequential. Part 3 parallelized across assets (50 independent meshes). Correct: they’re truly independent.
def can_parallelize(work_items: list) -> bool:
    """Test if work items can truly run in parallel."""
    for i, item_a in enumerate(work_items):
        for j, item_b in enumerate(work_items):
            if i != j and depends_on(item_a, item_b):
                return False
    return True

# Part 2: Pipeline steps
depends_on("Substance", "Houdini") == True
can_parallelize(["Houdini", "Substance", "Maya"]) == False

# Part 3: Asset batch
depends_on("rock_002.fbx", "rock_001.fbx") == False
can_parallelize(["rock_001.fbx", ..., "rock_050.fbx"]) == True
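One way to turn that check into something runnable is the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which groups items into waves that can run side by side. The dependency maps below are hand-written for illustration and `parallel_batches` is my own helper name; BATS's own dependency resolution may well work differently.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group items into waves; everything in one wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Pipeline steps: each step maps to the set of steps it depends on.
pipeline = {"Houdini": set(), "Substance": {"Houdini"}, "Maya": {"Substance"}}
# Asset batch: no asset depends on any other.
assets = {f"rock_{i:03d}.fbx": set() for i in range(1, 6)}

print(parallel_batches(pipeline))     # [['Houdini'], ['Substance'], ['Maya']]
print(len(parallel_batches(assets)))  # 1
```

Three waves of one item each means there is nothing to parallelize. One wave holding every item means the whole batch can fan out across workers.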
Pull Beats Push for Work Distribution
The single architectural decision that made everything else simpler. Workers control their own capacity: busy workers don’t poll, crashed workers don’t lose jobs, the orchestrator doesn’t need push-coordination logic. The whole system gets more reliable for free.
Monitor from Day One
Don’t wait for production failures to build monitoring. Essential: worker pool status, queue depth, job history, error rates, performance metrics.
I built the system tray monitor in Phase 4. Should have been Phase 1.
Start Simple, Scale Later
BATS started as: 1 orchestrator, 2 Maya workers, no Houdini, STRING MODE only, no priorities, no dependencies.
It evolved to: 1 orchestrator, 8-50 workers, Maya + Houdini + Python, dual-mode execution, priority queue with dependency resolution.
Build the minimal viable version first. Add features when you feel the pain of not having them.
Separate Infrastructure from Jobs
Keep them in distinct directories and treat them as different codebases:
job_orchestrator/
├── orchestrator/      # Stable infrastructure
├── dcc_workers/       # DCC-specific workers
└── jobs/              # Frequently updated
    ├── examples/
    └── studio/        # Your code here
Infrastructure changes don’t touch job code. Job changes don’t require an orchestrator restart. New team members onboard by writing jobs; they never need to understand the infrastructure to contribute.
Conclusion: The Right Parallelism at the Right Time
From single-asset QProcess (Part 2) to distributed BATS orchestration (Part 3), the most important lesson: parallelism is only useful when work can actually run in parallel.
Sounds obvious. But I’ve seen developers (including past me) add threading, async, multiprocessing, distributed systems without first asking: “What actually runs in parallel?”
The Three-Part Arc
Part 1: Tool Logging built visibility into the tool. Without logging, I never would have seen the threading anti-patterns in Part 2.
Part 2: Threading Anti-patterns removed threading from sequential work. The Houdini → Substance → Maya pipeline cannot be parallelized, so simplifying to QProcess was correct.
Part 3 (this post) added distributed orchestration for parallel work. Processing 50 independent assets can and should be parallelized. BATS was the right solution.
Key Takeaways
Draw the dependency graph. If work items have dependencies, threading won’t help. If they’re independent, parallelism is valuable.
Measure before and after. I knew my speedup (about 7x with 8 workers) because I compared the sequential estimate against the measured parallel time.
Start simple, scale later. QProcess for single assets, BATS for batch processing. Don’t build distributed systems until you need them.
Infrastructure and jobs are separate concerns. Keep them decoupled: infrastructure changes shouldn’t touch job code, and vice versa.
Monitoring is not optional. Distributed systems are invisible without monitoring. Build visibility first.
gRPC scales. Strong typing, streaming, multi-language support. Worth the learning curve for distributed systems.
What I Achieved
About 7x speedup (8 workers) in production, measured at 7.2x on the canyon_biome_rocks batch. Scales to a projected 40x (50 workers) when needed. Clean separation: infrastructure (stable) vs jobs (frequently updated). Dual-mode execution: STRING (prototyping) + MODULE (production). Priority + dependency support for complex workflows. Real-time monitoring and debugging.
The Bigger Picture
These three posts tell the complete story of building production-ready game dev tools:
Logging - See what’s happening (observability)
Simplify - Remove needless complexity (architecture)
Scale - Add parallelism when provably useful (performance)
Each step builds on the previous. Logging revealed the threading anti-patterns. Simplifying to QProcess made the baseline fast and maintainable. Scaling with BATS gave true parallelism where it mattered.
The Golden Rules
“Don’t parallelize work that can’t run in parallel.” (Part 2 lesson)
“Do parallelize work that can and should run in parallel.” (Part 3 lesson)
“Draw the dependency graph first, then decide.” (The unifying principle)
What’s Next?
BATS is production-ready, but there’s always more to build:
- C# client library - Unity/Unreal integration for in-editor job submission
- Web dashboard - Browser-based monitoring with Prometheus + Grafana
- Docker containerization - Deploy BATS as a containerized service
- Cloud scaling - Spin up AWS EC2 workers for massive batches (1000+ assets)
- More DCCs - Blender, Substance Painter, 3ds Max support
But the foundation is solid. No more anti-patterns. No more needless complexity. Just the right architecture for the right problem.
Want to Learn More?
Distributed Systems:
- Designing Data-Intensive Applications by Martin Kleppmann - ~20 hours: The bible of distributed systems design. Essential reading.
- gRPC Documentation - Official guide: Protocol buffers, streaming patterns, best practices.
Concurrency Patterns:
- The Little Book of Semaphores - Free PDF: Classic concurrency problems and solutions.
- Python Async Best Practices - ~30 min read: When to use async, threading, multiprocessing.
DCC Pipeline Architecture:
- Open Source Pipeline Conference talks - YouTube: Industry experts on VFX/game pipelines.
Final thought: Distributed systems are hard. Start simple, measure everything, scale when needed.
What’s next? In Part 4, I’ll cover how I made BATS AI-callable through Model Context Protocol (MCP) integration - turning infrastructure into natural language APIs that Claude Desktop, Cursor, and VS Code Copilot can operate.
This post is Part 3 of a series on building maintainable game development tools:
- Part 1: Tool Logging with Python
- Part 2: Don’t Thread What You Can’t Parallelize
- Part 3: When Parallelization IS the Answer (this post)
- Part 4: Natural Language Infrastructure with MCP (coming soon)
Have you built distributed DCC pipelines? Fought with gRPC? Scaled beyond single-machine limits? Let me know in the comments or reach out on Twitter or LinkedIn.
Have questions or improvements to this pattern? Did you find errors, omissions, inaccurate statements, or flaws in the code snippets? Open an issue on the CO3DEX repository. Find me on the Discord (in O3DE).
Want to see the full BATS implementation? The orchestrator is open-source (coming soon to GitHub). Check back for the repository link!
import logging as _logging
_MODULENAME = 'co3dex.posts.distributed_orchestration_bats'
_LOGGER = _logging.getLogger(_MODULENAME)
_LOGGER.info(f'Initializing: {_MODULENAME} ... parallelism: know what, when, and how')
Disclaimer: Views expressed are my own. All opinions are my own. The opinions expressed here belong solely to myself, and do not reflect the views of my current employer Sony Interactive Entertainment (SIE), any previous employers including AWS/Amazon, the Open 3D Foundation, or their parent the Linux Foundation. I am bound by NDAs with my current and previous employers and am not authorized to speak on their behalf. If you are a reporter or news outlet seeking official statements, please contact the respective company's PR department (I will not reply to such requests).
License: For terms please see the LICENSE*.TXT file at the root of this distribution.
// Copyright HogJonny-AMZN. or his affiliates. All Rights Reserved.
// SPDX-License-Identifier: CC-BY-4.0