When Parallelization IS the Answer: Building BATS

Posted by : on

Category : python


Welcome to the Co3deX

Hello and welcome to the CO3DEX, a blog of my Journeys in Real-time 3D Graphics and Technical Art. My name is Jonny Galloway. I am a polymath technical art leader who bridges art, tools, engine, and product. I work as a Principal Technical Artist and tools/engine specialist with 30+ years in AAA game development, working across content, design, production, and technology.

When Parallelization IS the Answer: Building BATS πŸ¦‡

After simplifying sequential work with QProcess in my last post, I wanted to think bigger: batch processing as infrastructure. Here’s how I built BATS, a tool-agnostic gRPC orchestration system that any DCC tool can use, turning 7 hours of sequential work into 1 hour of distributed processing.


TL;DR (5-Minute Version)

⚠️ This is a long post. Deep technical content, real production code, implementation war stories. If you’re short on time, this section covers everything. Skim the headers to find what’s relevant to you, or read straight through if you want the full picture.


Part 1 added logging to tools. Part 2 made tools async with QProcess. This isn’t Part 3. It’s an architectural inversion. I stopped building better tools and started building distributed orchestration infrastructure. Tools become clients. Workers become permanent. Jobs become data.

The Inversion: Traditional batch processing spawns subprocesses on demand; script launches Maya, processes one asset, Maya closes, repeat 50 times. I flipped this completely: persistent worker swarm running 24/7, pulling jobs on demand. Workers and jobs became first-class primitives. Everything else (orchestrator, UI tools, APIs) is just plumbing.

The Architecture: Built BATS (Background Automation Task Swarm), a distributed gRPC system with persistent Maya/Houdini worker pools. Workers boot once, process hundreds of jobs, shut down when idle. Any tool can submit work: DCC plugins, game editors, web dashboards, CLI scripts, CI/CD pipelines.

The Results:

  • Sequential processing: 50 assets Γ— 8 minutes = 6.7 hours
  • 8-worker swarm: 50 assets Γ· 8 = 1 hour (6.7x speedup)
  • 50-worker swarm: 50 assets concurrently = 10 minutes (40x speedup)

The Paradigm: Workers poll for jobs (request when ready) rather than orchestrator pushing assignments. This eliminates push-related race conditions. The orchestrator actively dispatches by finding matching jobs, tracking worker state (busy/free, current job, heartbeats), and transmitting complete job packages (script + parameters + files + metadata). Request-dispatch pattern + centralized orchestration = robust distributed system. Workers boot once and stay hot (orchestrator pre-spawns them for instant readiness).

Core Lesson: This isn’t an upgrade to Part 2. It’s a paradigm inversion. Part 2 made tools async (launch DCC, wait, process). BATS makes tools obsolete for batch work. You don’t build β€œbetter tools that spawn processes”—you build infrastructure where workers are always running and jobs are just data. The Part 2 tool becomes a BATS client. Build the orchestration backbone once, everything else becomes thin clients. Workers outlive jobs. Scale becomes a primitive. When you stop thinking β€œtool” and start thinking β€œdistributed job substrate,” the architecture writes itself.

What’s in the full post:


The Architectural Inversion

I built a swarm. Not β€œthe tool that sometimes batches work” or β€œthe script that spawns Maya instances.” A living pool of workersβ€”eight, twenty, fiftyβ€”running continuously, hungry for work.

Traditional batch processing: Your script launches Maya β†’ processes one asset β†’ Maya closes β†’ repeat 50 times. Seven hours of sequential execution, much of it spent booting and shutting down DCCs.

I inverted it entirely.

Workers boot once, stay running, pull jobs from a queue. When work arrives, they grab it immediately. When the queue empties, they idle. Not shutdown. Idle. New work appears, they’re already hot. Process hundreds of assets without ever restarting.

The QProcess refactor was still tool thinking: β€œmake this Python script launch Houdini without blocking.” BATS is infrastructure thinking: β€œHoudini workers are always running, jobs are just data packets.” The moment you stop asking β€œhow do I launch a DCC?” and start asking β€œhow do I dispatch work to a pool of always-running DCCs?”—that’s when the architecture inverts. You’re not building a better tool. You’re building distributed job substrate.

Workers Pull Jobs

This matters more than it sounds.

Push-based systems: Orchestrator tracks worker state (β€œWorker 3 is free”), assigns jobs (β€œWorker 3, run this”), handles failures (β€œWorker 3 crashed, reassign job”). Complex. Fragile. Race conditions everywhere.

Request-based systems: Workers request jobs when ready. Orchestrator still tracks state (busy/free, current job, heartbeats), but workers initiate the exchange. No push race conditions, no β€œdid the assignment succeed?” retries. Worker crashes? Job stays in queue. Worker reboots? Immediately requests a job. Natural backpressure: busy workers don’t poll.

# Traditional push-based (orchestrator assigns)
def assign_next_job():
    available_workers = get_idle_workers()  # Complex state tracking
    if not available_workers:
        return  # What if worker crashes between check and assignment?

    job = queue.pop()
    worker = available_workers[0]
    send_job_to_worker(worker, job)  # Network call might fail
    mark_worker_busy(worker)  # State mutation

# BATS request-dispatch pattern (workers poll, orchestrator dispatches)
async def RequestJob(request):
    client_id = request.client_id
    dcc_type = request.dcc_type

    # Orchestrator finds matching job for this worker
    instance = await pool_manager.get_instance_by_client(client_id)
    job = await job_manager.dispatch_jobs(dcc_type, instance.execution_mode)

    if job:
        # Mark worker busy and track assignment
        await pool_manager.mark_instance_busy(instance.instance_id, job.job_id)

        # Send COMPLETE job package: script, parameters, files, metadata, everything
        return job.to_protobuf()  # Worker receives full job specification

    return empty_job()  # Signal: no work available

The key insight: workers request jobs when ready (eliminating push race conditions), but the orchestrator actively dispatches by finding matching jobs, tracking assignments, and transmitting complete job packages including the entire script, all parameters, input files, output files, and metadata. The worker receives everything needed to execute, not just a job ID.

The orchestrator plays an active role in job dispatch:

Worker State Tracking (Explicit):

  • Worker registration: Maps client_id to DCC instance (Maya worker #3, Houdini worker #1, etc.)
  • Busy/free state: Marks workers busy when dispatching jobs, free when jobs complete
  • Current job assignment: Tracks which job each worker is executing (instance.current_job_id)
  • Health monitoring: Heartbeat every 30s, 60s timeout, auto-respawn on failure
  • Capability registry: Knows which workers handle Maya vs Houdini, headless vs GUI

Job Dispatch Flow:

  1. Worker polls RequestJob() with its client_id and DCC type
  2. Orchestrator looks up worker’s instance and execution mode
  3. Orchestrator finds highest-priority job matching worker capabilities
  4. Orchestrator marks worker busy and assigns job
  5. Orchestrator sends complete job package (script/module + parameters + files + metadata)
  6. Worker executes locally, reports progress, signals completion
  7. Orchestrator marks worker free, ready for next job

Request-based polling eliminates push race conditions (no β€œdid the assignment succeed?” retries), but the orchestrator actively manages dispatch, state tracking, and transmits the full job specification, not just a reference.

Workers and Jobs as Primitives

Once you have persistent workers pulling jobs from a queue, everything else becomes thin clients. You don’t build β€œtools that automate DCCs.” You build one orchestration layer, and everything elseβ€”game editor, CLI script, web dashboard, CI/CD pipelineβ€”becomes a client: 20 lines of code that submits a job, gets an ID back, and optionally streams progress. The infrastructure is permanent. The clients are throwaway.

Workers don’t care who submitted the job. When they call RequestJob(), the orchestrator sends a complete job package:

job = job_pb2.Job(
    job_id="abc123",
    type="maya",
    execution_mode=HEADLESS,
    priority=7,

    # THE COMPLETE EXECUTABLE CONTENT
    script="""import maya.cmds as cmds
cmds.polySphere(radius={radius}, name='{name}')
cmds.file(rename="{output_path}")
cmds.file(save=True)
""",

    # ALL PARAMETERS (worker substitutes {placeholders} or passes dict to module)
    parameters={"radius": "2.0", "name": "rock_scan_047", "output_path": "D:/Assets/rock_scan_047.mb"},

    # FILE PATHS (worker validates existence before execution)
    input_files=["D:/Scans/rock_scan_047.fbx"],
    output_files=["D:/Assets/rock_scan_047.mb"],

    # METADATA
    metadata={"asset_type": "prop", "biome": "canyon"},
    submitter="game_editor",
)

The worker receives everything: executable code, parameters, file paths, metadata. It executes locally, streams progress to orchestrator, returns results. The job is self-contained: no callbacks to the orchestrator mid-execution, no fetching additional data. Submit once, execute once, report once.

The Orchestrator: Intelligent Hub

The orchestrator is BATS’s brain, managing the entire distributed system.

Job Queue Management:

  • Priority queue (0-10 scale, heapq-based with negative values for max-heap)
  • Dependency resolution (directed acyclic graph for multi-stage workflows)
  • Capability matching (routes Maya jobs to Maya workers, Houdini jobs to Houdini workers)
  • Preemption support (urgent jobs can checkpoint and pause lower-priority work)
  • Job history and result caching (1-hour TTL)

Worker Pool Management:

  • Pre-spawns base workers (3 headless + 1 GUI per DCC) for warm-startβ€”no boot delay
  • Monitors health via heartbeats (60s timeout, 30s check interval)
  • Auto-respawns crashed workers (process dies? new worker spawns automatically)
  • Configuration-based auto-scaling (spawns workers when queue > 10 jobs)
  • Shuts down idle workers after 5 minutes to free resources
  • Data-driven worker registry (eliminated ~400 lines of hardcoded DCC logic)

System-Wide Services:

  • Real-time progress streaming to all subscribed clients
  • Result caching with TTL (query results without re-running jobs)
  • Per-job isolated logging (.temp/logs/{dcc}/jobs/ with 5-day retention)
  • Environment profile management per worker type
  • gRPC dual-service architecture (internal OrchestratorService + external ExternalJobAPI)

The orchestrator doesn’t execute jobs. Workers do that. But it controls everything else: when workers start, which jobs they get, what happens when they crash, when to scale up or down. Workers declare capabilities (β€œI’m a Maya 2026 headless worker”), jobs declare requirements (β€œI need a Maya worker”), orchestrator matches them intelligently.

Any Tool Can Submit Work

This was the vision: build the backbone once, let anything connect.

# DCC plugin submits batch export
from bats_client import submit_job

for scene_file in selected_scenes:
    submit_job(
        dcc_type="maya",
        script_path="jobs/animation/export_fbx.py",
        parameters={"scene": scene_file, "frame_range": "1-120"},
        priority=8
    )

# Game editor sends procedural generation
for building_config in city_block:
    submit_job(
        dcc_type="houdini",
        script_path="jobs/procedural/generate_building.py",
        parameters=building_config,
        priority=5
    )

# CI/CD pipeline validates assets
submit_job(
    dcc_type="maya",
    script_path="jobs/validation/check_naming_conventions.py",
    parameters={"asset_dir": "/assets/characters/"},
    priority=10  # Urgent, blocks merge
)

Each submitter is a few lines of code. The infrastructureβ€”worker pools, job queuing, progress streaming, result caching, fault toleranceβ€”lives in BATS. Build once, reuse everywhere.

The Bigger Vision: Universal Warm-Start Infrastructure

But here’s where it gets interesting.

Any Python-callable process becomes a worker. Maya and Houdini were first because those were my immediate needs. But this pattern works for anything:

  • Blender: Python API, headless rendering, procedural generation
  • Nuke: Python scripting, compositing automation, render farm integration
  • Substance Designer: Substance Automation Toolkit for graph manipulation
  • Substance Painter: Python API for texture baking and export pipelines
  • Photoshop: COM wrapper for Windows automation, batch processing
  • FFmpeg: Video encoding, format conversion, frame extraction
  • ImageMagick: Image processing, format conversion, compositing
  • Pandoc: Document conversion (Markdown β†’ PDF β†’ HTML β†’ DOCX)
  • Any CLI tool: Wrap it, pipe it, call it from Python

The pattern is identical: boot once, stay hot, poll for jobs. The orchestrator doesn’t care if workers use Python API, C++ bindings, COM wrappers, or shell commands. It just dispatches job packages.

This isn’t a DCC orchestrator. It’s a universal job orchestrator. Content creation was just the entry point.

Python workers are the ultimate flexibility. dcc_type="python" workers aren’t tied to any DCC. They’re just Python interpreters staying warm. What can you keep loaded?

  • AI/ML Models: Load PyTorch/TensorFlow once, process 1000 inferences without reload penalty
  • Data Science Pipelines: Keep pandas/numpy/scipy loaded for ETL jobs
  • Web Scraping: Maintain browser sessions (Selenium/Playwright) across jobs
  • API Orchestration: Keep HTTP connection pools warm, manage rate limits globally
  • Database Operations: Connection pools stay open, queries execute instantly
  • Document Processing: Keep parsers loaded (PDF, Excel, CSV)
  • Code Analysis: AST parsers, linters, formatters as persistent services
  • Test Execution: Test frameworks loaded once, run suites in seconds

Inference-as-a-service becomes a BATS worker pool:

# AI worker stays hot with model loaded
@register_python_worker
def stable_diffusion_worker():
    model = load_stable_diffusion_model()  # Load once on boot (5-10s)
    
    while True:
        job = request_job(dcc_type="python", capability="stable_diffusion")
        if job:
            prompt = job.parameters["prompt"]
            image = model.generate(prompt)  # Inference on warm model (<1s)
            save_image(image, job.output_files[0])
            report_complete(job)

# Submit 100 image generation jobs
for prompt in prompts:
    submit_job(dcc_type="python", capability="stable_diffusion", 
               parameters={"prompt": prompt})

Model loads once (5-10 seconds). 100 inferences run on the warm worker (<1 second each). No Python startup penalty, no model reload penalty. Batch ML inference becomes embarrassingly parallel.

Web automation becomes infrastructure:

# Selenium worker keeps browser session warm
@register_python_worker  
def web_scraper_worker():
    driver = webdriver.Chrome()  # Launch once
    driver.get("https://api.example.com/login")
    login(driver)  # Authenticate once
    
    while True:
        job = request_job(dcc_type="python", capability="web_scraper")
        if job:
            data = scrape_page(driver, job.parameters["url"])
            save_data(data, job.output_files[0])
            report_complete(job)

Browser launches once. Authentication happens once. 1000 scraping jobs reuse the same session. No re-login penalty.

Data pipelines become parallel:

# Pandas worker processes CSV batches
@register_python_worker
def data_processor():
    schema = load_schema()  # Parse schema once
    
    while True:
        job = request_job(dcc_type="python", capability="data_transform")
        if job:
            df = pd.read_csv(job.input_files[0])
            transformed = transform(df, schema, job.parameters)
            transformed.to_csv(job.output_files[0])
            report_complete(job)

Parse 1000 CSVs? Spawn 20 workers, dispatch 50 jobs each. Embarrassingly parallel ETL.

This isn’t limited to content creation anymore. Any warm-start Python workload becomes infrastructure. The architecture doesn’t change. Add a worker type, define job schemas, submit work. The orchestrator handles the rest.

What’s next? In Part 4, I’ll show how Model Context Protocol (MCP) integration makes BATS agenticβ€”you can describe jobs in natural language, and AI assistants translate to API calls. β€œGenerate 10 hero swords with varying blade lengths” becomes 10 submitted jobs. Infrastructure becomes conversational.

When you build infrastructure where workers outlive jobs, you’re not building a DCC automation tool. You’re building a universal job substrate where anything Python can become a persistent, scalable worker pool. That’s the real paradigm shift.

Why Centralized Orchestration Matters

Single Job Queue: All submitters (game editor, CLI tools, web dashboards, CI/CD pipelines) feed into one centralized priority queue. This means:

  • Consistent priority resolution globally (canyon biome rocks always beat forest props)
  • Dependency tracking across all jobs (simulation completes before rendering starts)
  • No coordination overhead between multiple queues
  • Fair resource allocation across all clients

Worker Pool Intelligence: Orchestrator decides when to scale, not individual workers:

  • Pre-spawns base workers for instant job pickup (no 30-second Maya boot delay)
  • Monitors queue depth to spawn additional workers dynamically
  • Detects crashed workers via missed heartbeats and respawns automatically
  • Shuts down idle workers after timeout (free resources when queue empties)
  • Configuration-driven (add new DCC types in 15 minutes via config vs 8 hours of code)

Single Point of Visibility: Real-time system state from one source:

  • β€œWhat’s Worker 3 doing right now?” (Pool monitor dashboard)
  • β€œHow many jobs are pending?” (Queue depth)
  • β€œWhat’s the error rate for Maya jobs?” (Job history)
  • β€œWhy did job ABC123 fail?” (Per-job isolated logs)

Production Reliability: Orchestrator provides:

  • Idempotent job retry (failed jobs retry safely without state corruption)
  • Job result persistence (query results without re-running)
  • Health monitoring with automatic recovery
  • Resource management (prevent worker pool exhaustion)
  • Audit trail (5-day log retention per job)

Without centralized orchestration, you’d need distributed coordination protocols, leader election, consensus algorithmsβ€”far more complex than request-based dispatch alone. The orchestrator keeps it simple: workers pull jobs, orchestrator manages dispatch and state.


The Technology Choices

What the System Needed

  1. Multiple DCC instances running simultaneously (8 Houdinis processing different rocks)
  2. Work queue management (50 jobs distributed across available workers)
  3. Priority handling (canyon biome rocks urgent, forest biome can wait)
  4. Dependency support (simulation must complete before rendering)
  5. External API (any tool should submit jobs)
  6. Real-time visibility (monitor what’s running where)
  7. Fault tolerance (Maya crash on job #23 shouldn’t lose jobs #1-22)

What Workers Don’t Need

Workers run one job at a time. No threading within workers, no async I/O complexity, no elaborate state machines. Each worker is a simple execution loop: boot DCC, request job, execute, report progress, request next job. Keep it simple.

gRPC: Streaming and Strong Typing

I started by prototyping with a simple HTTP REST API. Clients POST job requests, GET status updates, DELETE to cancel. Works fine for hello-world demos.

Production use exposed the cracks. Polling for status every 2 seconds = chatty, inefficient. No real-time output streaming. JSON schema drift causing runtime errors when client and server versions diverge. Manual connection retry logic.

gRPC solved all of this:

Bidirectional streaming: Clients submit a job, get a stream of progress updates without polling. Worker sends output line-by-line as it executes. No artificial 2-second delay before seeing β€œError: texture not found”. You see it immediately.

Strong typing via protobuf: API contract is explicit. If I change job_id from string to int32, code generation fails at compile time. Clients and servers can’t drift out of sync; they won’t even build.

Language-agnostic: Python orchestrator, Python workers for DCC jobs, C# client for game editor, JavaScript client for web dashboard. All generated from the same .proto files. Change the API once, all clients update automatically.

Built-in resilience: Connection retries, keepalive pings, graceful shutdown. HTTP requires implementing all of this manually. gRPC handles it.

Tradeoff: debugging is harder. Can’t just curl an endpoint. Need gRPC clients or tools like grpcurl. But I’ll take difficult debugging over production runtime errors any day.


BATS Architecture: The Core Design

High-Level Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    External Clients                         β”‚
β”‚   Mock Editor β”‚ Game Editor β”‚ CLI Tools β”‚ Web Dashboard     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚ gRPC (port 50051)
                 β”‚ ExternalJobAPI
                 ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Orchestrator Server                            β”‚
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚  β”‚ Job Manager  │─────→│ Priority Queue  β”‚                  β”‚
β”‚  β”‚              β”‚      β”‚ (heapq 0-10)    β”‚                  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β”‚         β”‚                                                   β”‚
β”‚         ↓                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
β”‚  β”‚      DCC Pool Manager                β”‚                   β”‚
β”‚  β”‚  β€’ 3 headless Maya workers           β”‚                   β”‚
β”‚  β”‚  β€’ 1 GUI Maya worker                 β”‚                   β”‚
β”‚  β”‚  β€’ 3 headless Houdini workers        β”‚                   β”‚
β”‚  β”‚  β€’ 1 GUI Houdini worker              β”‚                   β”‚
β”‚  β”‚  = 8 workers total (configurable)    β”‚                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
β”‚         β”‚                                                   β”‚
β”‚         β”‚ gRPC (internal)                                   β”‚
β”‚         β”‚ OrchestratorService                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
    β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    ↓              ↓          ↓          ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  ...
β”‚ Maya    β”‚  β”‚ Maya    β”‚  β”‚ Houdini β”‚  (8 workers)
β”‚ Worker  β”‚  β”‚ Worker  β”‚  β”‚ Worker  β”‚
β”‚ v4.0.0  β”‚  β”‚ v4.0.0  β”‚  β”‚ v2.0.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The Three-Layer Design

Layer 1: External Job API (Client-facing)

service ExternalJobAPI {
    rpc SubmitJob(JobRequest) returns (JobResponse);
    rpc StreamJobStatus(JobQuery) returns (stream JobUpdate);
    rpc GetJobResult(JobQuery) returns (JobResult);
    rpc CancelJob(JobQuery) returns (JobResponse);
    rpc ListJobs(JobListQuery) returns (JobList);
}

Clients submit jobs, monitor progress, retrieve results. They don’t know about workers or pools. Just jobs.

Layer 2: Orchestrator (Central coordinator)

  • Priority queue (heapq-based, 0-10 scale where 10 = urgent)
  • Worker pool management (spawn, track, health check)
  • Job-to-worker assignment
  • Result caching (1-hour TTL)
  • Dependency resolution

Layer 3: DCC Workers (Job executors)

service OrchestratorService {
    rpc RegisterClient(DCCClientInfo) returns (RegisterResponse);
    rpc RequestJob(DCCClientInfo) returns (Job);
    rpc UpdateStatus(JobStatusUpdate) returns (Empty);
    rpc Heartbeat(DCCClientInfo) returns (RegisterResponse);
}

Workers register, poll for jobs matching their capabilities, execute, report progress.

Design Decisions That Mattered

Request-based job dispatch was the first critical choice. Covered in depth aboveβ€”workers initiate the exchange, eliminating push race conditions while the orchestrator actively manages state, assignment tracking, and transmits complete job packages. The practical payoff: crashed workers don’t lose jobs, busy workers don’t get overloaded, and the orchestrator stays simple.

Dual-mode job execution came from hard lessons in Part 2. Inline scripts work great for prototyping:

job = JobRequest(
    dcc_type="maya",
    execution_mode=ExecutionMode.HEADLESS,
    script="""
import maya.cmds as cmds
sphere = cmds.polySphere(radius={radius}, name="{name}")[0]
cmds.move(0, {height}, 0, sphere)
cmds.file(rename="{output_path}")
cmds.file(save=True, type="mayaBinary")
print(f"Created: {output_path}")
""",
    parameters={"radius": "2.0", "height": "5.0", "name": "test_sphere", "output_path": "C:/temp/sphere.mb"}
)

But production needs reusable modules:

# job_orchestrator/jobs/examples/maya/generate_hero_sword.py
def main(parameters: dict[str, Any]) -> None:
    """Entry point called by orchestrator."""
    asset_name = parameters.get('asset_name', 'hero_sword')
    output_dir = parameters.get('output_dir')

    # Full IDE support: autocomplete, debugging, type checking
    result = create_hero_sword(
        asset_name=asset_name,
        blade_length=float(parameters.get('blade_length', 1.2)),
        handle_length=float(parameters.get('handle_length', 0.3)),
        output_dir=output_dir,
        export_fbx=bool(parameters.get('export_fbx', True))
    )

    print(f"Created: {result['maya_file']}")

# Submit using MODULE MODE
job = JobRequest(
    dcc_type="maya",
    execution_mode=ExecutionMode.HEADLESS,
    module_path="job_orchestrator.jobs.examples.maya.generate_hero_sword",
    entry_point="main",
    parameters={"asset_name": "hero_sword_v1", "blade_length": "1.5", "output_dir": "D:/Assets/Weapons", "export_fbx": "True"}
)

MODULE MODE gives you proper IDE support, breakpoint debugging in Maya/Houdini GUI, unit tests, version control. STRING MODE remains perfect for quick experiments.

Priority queuing with dependencies handles both urgency and workflow constraints:

# Urgency: higher priority runs first
job1 = JobRequest(priority=5, ...)  # Normal
job2 = JobRequest(priority=10, ...) # Urgent, runs first

# Workflow: simulation must complete before rendering
sim_job = JobRequest(job_id="houdini_sim_001", dcc_type="houdini", priority=8, script="# Run fluid simulation")
render_job = JobRequest(job_id="maya_render_001", dcc_type="maya", priority=8, dependencies=["houdini_sim_001"], script="# Render")

Job manager ensures high-priority jobs run first unless blocked by dependencies. Failed dependencies cascade: if simulation fails, rendering auto-cancels.

Per-job logging keeps debugging sane. Each job gets an isolated log file:

.temp/logs/
β”œβ”€β”€ maya/
β”‚   β”œβ”€β”€ jobs/
β”‚   β”‚   β”œβ”€β”€ job_abc123_20260321_143022.log
β”‚   β”‚   β”œβ”€β”€ job_def456_20260321_143045.log
β”‚   β”œβ”€β”€ worker_headless_1.log
β”‚   └── worker_gui.log
β”œβ”€β”€ houdini/
└── orchestrator.log

Script source, parameters, execution milestones, DCC output, error tracebacks, metadata. Everything you need to debug β€œwhy did job #23 fail?” Five-day retention, automatic cleanup.

Worker pool auto-scaling balances resource usage with throughput:

{
  "worker_pools": {
    "maya": { "headless_count": 3, "gui_count": 1, "max_workers": 20 },
    "houdini": { "headless_count": 3, "gui_count": 1, "max_workers": 20 }
  }
}

Pool manager pre-spawns base workers (3 headless + 1 GUI per DCC). Queue backs up? Spawn more workers. Queue empties? Shut down idle workers after 5 minutes. Worker crashes? Respawn automatically. The swarm adjusts to workload.


Implementation Journey: What I Learned

Phase 1: Protobuf Schema Evolution (Week 1)

Protocol buffers turned out more complex than expected. My first schema:

message JobRequest {
    string dcc_type = 1;
    string script = 2;
}

message JobResult {
    bool success = 1;
    string output = 2;
}

This lasted about two hours before I hit reality: no parameters (jobs were hardcoded scripts), no execution mode (couldn’t specify headless vs GUI), no priority (FIFO processing), no dependencies (couldn’t chain workflows), no progress tracking (just β€œrunning” or β€œdone”).

Evolution happened in waves. Parameters added, then execution modes, then priority, then dependencies, then streaming progress updates. Each addition meant regenerating code for Python, updating orchestrator logic, updating worker logic, and testing all permutations.

By week’s end: 400 lines of protobuf definitions, comprehensive job model, but also: every schema change broke existing clients. Learned to version the API early.

Phase 2: Worker Bootstrapping (Week 2)

Booting Maya workers sounds simple. It’s not.

First attempt: launch mayapy.exe, sleep 30 seconds, hope for the best.

process = subprocess.Popen([
    "C:/Program Files/Autodesk/Maya2026/bin/mayapy.exe",
    "maya_rpc_server.py",
    "--port", str(worker_port)
])

time.sleep(30)  # How long is long enough? Β―\_(ツ)_/Β―

Maya takes 15-45 seconds to boot depending on machine. No way to know if it booted successfully or crashed. 30-second delay isn’t enough on slow machines, too long on fast ones. Import errors fail silently until Maya’s up.

Second attempt: health check polling.

def start_worker(dcc_type: str, port: int, timeout: int = 120):
    process = subprocess.Popen([dcc_exe, worker_script, "--port", str(port)])

    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            channel = grpc.insecure_channel(f'localhost:{port}')
            stub = OrchestratorServiceStub(channel)
            stub.Heartbeat(DCCClientInfo(dcc_type=dcc_type))
            _LOGGER.info(f"Worker {dcc_type} ready on port {port}")
            return process
        except grpc.RpcError:
            time.sleep(1)

    raise TimeoutError(f"Worker {dcc_type} failed to start in {timeout}s")

Better. But Maya sometimes starts but gRPC server fails (import errors). Need separate timeout for β€œprocess started” vs β€œgRPC ready”. Need stdout/stderr capture to debug boot failures.

Final solution: worker registration protocol. Workers register themselves when ready.

class MayaRPCWorker:
    def __init__(self, orchestrator_host: str, orchestrator_port: int):
        self.orchestrator_channel = grpc.insecure_channel(f'{orchestrator_host}:{orchestrator_port}')
        self.stub = OrchestratorServiceStub(self.orchestrator_channel)

    def start(self):
        response = self.stub.RegisterClient(DCCClientInfo(
            dcc_type="maya",
            version=cmds.about(version=True),
            capabilities=["modeling", "rendering", "animation"],
            worker_id=self.worker_id
        ))

        if response.status == "READY":
            _LOGGER.info(f"Registered with orchestrator")
            self.poll_loop()

Orchestrator side:

class DCCPoolManager:
    def start_instance(self, dcc_type: str) -> subprocess.Popen:
        port = self._allocate_port()
        process = subprocess.Popen(
            [dcc_exe, worker_script, "--orchestrator-host", "localhost", "--orchestrator-port", "50051"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )

        registered = self._wait_for_registration(dcc_type, timeout=120)
        if not registered:
            process.kill()
            raise TimeoutError(f"Worker {dcc_type} failed to register")

        return process

Workers register when actually ready, not based on arbitrary sleep. Orchestrator knows worker capabilities before assigning jobs. Failed registration = kill process cleanly. Workers report DCC version, Python version, etc.

Lesson: Don’t poll from outside. Let workers push their ready state.

Phase 3: Understanding Priority Queue Behavior (Week 3)

Priority system works, but not how people initially expected.

The implementation uses heapq with negative priorities for max-heap behavior:

async def enqueue_job(self, job: Job) -> None:
    async with self.lock:
        # Use negative priority for max-heap (higher priority first)
        heapq.heappush(self._job_queue, (-job.priority, self._queue_counter, job))
        self._queue_counter += 1
        self.active_jobs[job.job_id] = job

Priority 10 jobs dequeue before priority 5 jobs. Works perfectly… when the queue is full.

But here’s what surprised people:

Timeline:
T=0s:   Submit 8 low-priority jobs (priority 3)
T=1s:   All 8 workers pick up jobs immediately (queue was empty)
T=2s:   Submit 15 high-priority jobs (priority 10)
T=2s:   High-priority jobs wait in queue
T=8m:   First low-priority job completes, high-priority job starts

The issue: Priority determines queue position, not execution interruption. Once a worker grabs a job, it runs to completion. High-priority jobs arriving later must wait for workers to become free.

This is called β€œhead-of-line blocking” in queue theory. The priority queue protects you from low-priority work preventing high-priority work in the queue, but not from low-priority work already running.

Why no preemption? I considered job cancellation and checkpoint/resume, but the complexity didn’t justify the benefit:

  • Most jobs run 2-10 minutes (not multi-hour)
  • Urgent work (priority 9-10) is rare in practice
  • Simple mitigation: keep a few workers idle for urgent work, or temporarily spawn extras

The priority system does exactly what it should: ensures high-priority work executes next, not immediately. For batch processing, that’s usually sufficient.

Lesson: Priority queues control dispatch order, not running job interruption. If you need preemption, design jobs to be interruptible from the start (checkpoint at natural boundaries, support resume). Don’t retrofit preemption onto long-running tasksβ€”it’s almost always more complex than it’s worth.

Phase 4: Monitoring and Debugging (Week 4-5)

8-50 workers running across multiple machines makes debugging… interesting.

Three constant questions:

  1. β€œJob ABC123 failed”—which worker ran it? What logs?
  2. β€œSystem is slow”—are all workers busy? Crashed? Idle?
  3. β€œHoudini jobs hang”—hanging or just slow?

Per-job logging (already covered) solved #1. Every job gets an isolated log file. Job fails? Error message includes log path.

System tray monitor solved quick access:

class BATSTrayApp(QSystemTrayIcon):
    def __init__(self):
        super().__init__()
        self.setIcon(QIcon("bats_icon.png"))

        menu = QMenu()
        menu.addAction("Start Orchestrator", self.start_orchestrator)
        menu.addAction("Stop Orchestrator", self.stop_orchestrator)
        menu.addAction("Open Pool Monitor", self.open_pool_monitor)
        menu.addSeparator()
        menu.addAction("View Logs", self.open_logs)
        menu.addAction("Quit", self.quit)

        self.setContextMenu(menu)

DCC Pool Monitor solved #2 and #3, a real-time dashboard showing all workers:

╔═══════════════════════════════════════════════════════════════╗
β•‘  BATS Pool Monitor - localhost:50051                          β•‘
╠═══════════════════════════════════════════════════════════════╣
β•‘  Maya Workers (4/4 ready)                                     β•‘
β•‘    β€’ maya_headless_1 [READY]    CPU: 5%   Mem: 1.2GB          β•‘
β•‘    β€’ maya_headless_2 [BUSY]     CPU: 82%  Mem: 2.8GB          β•‘
β•‘      └─ Job: hero_sword_v1 [Progress: 65%]                    β•‘
β•‘    β€’ maya_headless_3 [BUSY]     CPU: 91%  Mem: 3.1GB          β•‘
β•‘      └─ Job: rock_scan_023 [Progress: 45%]                    β•‘
β•‘    β€’ maya_gui [READY]           CPU: 12%  Mem: 2.1GB          β•‘
β•‘                                                               β•‘
β•‘  Houdini Workers (4/4 ready)                                  β•‘
β•‘    β€’ houdini_headless_1 [BUSY]  CPU: 88%  Mem: 4.2GB          β•‘
β•‘      └─ Job: fluid_sim_canyon [Progress: 23%]                 β•‘
β•‘    β€’ houdini_headless_2 [READY] CPU: 8%   Mem: 1.8GB          β•‘
β•‘    β€’ houdini_headless_3 [CRASHED] - Respawning in 5s...       β•‘
β•‘    β€’ houdini_gui [READY]        CPU: 15%  Mem: 2.5GB          β•‘
β•‘                                                               β•‘
β•‘  Queue: 12 jobs pending (2 priority 10, 10 priority 5)        β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Updates every 2 seconds via gRPC polling. Worker crashes? See it immediately. Job hung? CPU at 0% reveals the problem. Queue backing up? Spawn more workers.

Distributed systems need visibility. Build monitoring tools early, not after drowning in mystery failures.


The Results: Was It Worth It?

Performance Comparison

Scenario: Process 50 rock scans (Houdini β†’ Substance β†’ Maya)

Approach Time Speedup Notes
Original Threading 6.7 hours 1.0x Sequential, complex code
QProcess (Part 2) 6.7 hours 1.0x Sequential, simple code
BATS (8 workers) 1.0 hour 6.7x Parallel, distributed
BATS (20 workers) 24 min 16.7x Full machine utilization
BATS (50 workers) 10 min 40x One worker per asset (theoretical max)

Real Production Numbers (from logs):

2026-03-15 14:23:01 [Orchestrator] Batch job started: canyon_biome_rocks
2026-03-15 14:23:01 [Orchestrator] Jobs queued: 50 (priority 9)
2026-03-15 14:23:05 [Pool] Workers ready: 8 Maya, 8 Houdini
2026-03-15 15:18:43 [Orchestrator] Batch job completed: canyon_biome_rocks
Total time: 55 minutes 42 seconds
Average per asset: 1 minute 7 seconds (parallelized)
Sequential estimate: 50 Γ— 8 minutes = 6.7 hours
Actual speedup: 7.2x

Code Complexity Comparison

Metric QProcess (Part 2) BATS (Part 3) Delta
Lines of code (tool) 150 50 (client only) -100
Lines of code (infrastructure) 0 2,500 (orchestrator + workers) +2,500
External dependencies 0 grpc, protobuf +2
Deployment complexity Single .py file Distributed system High
Maintenance burden Low Medium-High ↑

When BATS Makes Sense

BATS shines when you have truly parallelizable work: independent assets, concurrent API calls, batch processing where each item has zero dependencies on others. The parallelizable portion needs to be significant (over 50% of runtime) or you’re just adding complexity for marginal gains.

Scaling beyond single-machine limits? Multiple clients need to submit jobs (game editor, web tools, CLI, CI/CD)? Priority and dependency management valuable? BATS handles all of this.

QProcess makes more sense when work is sequential or mostly sequential, processing one asset at a time, UI responsiveness is the only goal (not speed), or simplicity and maintainability trump raw performance. Self-contained tools don’t need distributed infrastructure.

The Honest Assessment

BATS gave me 7x speedup (8 workers) in production, with room to scale to 40x (50 workers).

But it also added:

  • 2,500 lines of infrastructure code
  • gRPC/protobuf compilation step
  • Distributed system debugging complexity
  • Process lifecycle management
  • Worker health monitoring

Worth it for 50 concurrent assets? Absolutely. For single-asset processing? No. QProcess remains the better choice.

That’s the real win: universal job infrastructure, not just β€œthe tool that batches rocks.”


Design Patterns That Worked

Request-Based Work Dispatch

The request-dispatch pattern in action. Workers poll when ready; the orchestrator matches, assigns, and transmits complete job packages. What this looks like in the worker loop:

# Worker loop
while True:
    job = stub.RequestJob(DCCClientInfo(
        worker_id=self.worker_id,
        dcc_type="maya",
        capabilities=["modeling", "rendering"]
    ))

    if job.job_id:
        result = self.execute_job(job)
        stub.UpdateStatus(JobStatusUpdate(job_id=job.job_id, status=result.status, progress=100))
    else:
        time.sleep(5)

Workers control their own request rate (can’t be overloaded). Crashed workers don’t lose assigned jobs (job stays in queue). Fast workers automatically request more jobs (natural load balancing).

Idempotent Job Execution

Jobs can be retried safely without side effects.

class Job:
    def __init__(self, job_id: str, ...):
        self.job_id = job_id
        self.output_path = f"D:/Assets/{job_id}_output.mb"  # Deterministic
        self.retry_count = 0
        self.max_retries = 3

    def execute(self):
        if os.path.exists(self.output_path):
            os.remove(self.output_path)  # Always start clean

        # Do work...

        # Atomic write (temp + rename)
        temp_path = f"{self.output_path}.tmp"
        write_output(temp_path)
        os.rename(temp_path, self.output_path)

Failed jobs retry without corrupting state. Easy to implement β€œretry last N failed jobs” commands. No complex rollback logic needed.

Structured Logging Everywhere

Every component logs to predictable locations with consistent formatting.

def setup_logging(component: str, job_id: str = None):
    if job_id:
        log_path = f".temp/logs/{component}/jobs/job_{job_id}_{timestamp()}.log"
    else:
        log_path = f".temp/logs/{component}/{component}.log"

    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(name)s: %(message)s'))

    logger = logging.getLogger(component)
    logger.addHandler(handler)
    return logger

When debugging: find job ID from error message, open .temp/logs/maya/jobs/job_{job_id}_*.log, see complete execution trace.

Health Checks and Auto-Recovery

Workers send heartbeats. Orchestrator respawns silent workers.

class DCCPoolManager:
    def health_check_loop(self):
        while True:
            for worker_id, worker in self.workers.items():
                last_heartbeat = worker.last_heartbeat

                if time.time() - last_heartbeat > 60:
                    _LOGGER.warning(f"Worker {worker_id} missed heartbeat, respawning")
                    self.respawn_worker(worker_id)

            time.sleep(30)

    def respawn_worker(self, worker_id: str):
        old_process = self.workers[worker_id].process
        old_process.kill()

        new_process = self.start_instance(dcc_type=self.workers[worker_id].dcc_type)
        self.workers[worker_id].process = new_process

System self-heals from worker crashes. No manual intervention for transient failures. Long-running orchestrator stays stable.


What I’d Do Differently

1. Start with SQLite for Job State

I used an in-memory dictionary for job results:

class JobResultStore:
    def __init__(self):
        self.results: dict[str, JobResult] = {}  # Lost on restart

Better approach:

class JobResultStore:
    def __init__(self, db_path: str = ".temp/jobs.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS job_results (
                job_id TEXT PRIMARY KEY,
                status TEXT,
                result_json TEXT,
                created_at TIMESTAMP,
                completed_at TIMESTAMP
            )
        """)

    def store(self, job_id: str, result: JobResult):
        self.conn.execute(
            "INSERT OR REPLACE INTO job_results VALUES (?, ?, ?, ?, ?)",
            (job_id, result.status, json.dumps(result.to_dict()),
             datetime.now(), result.completed_at)
        )
        self.conn.commit()

Job history persists across orchestrator restarts. Can query β€œall failed jobs from last week”. Easy to build analytics dashboard.

2. Expose Prometheus Metrics

I built a custom monitoring UI. Prometheus would be more standard.

from prometheus_client import Counter, Histogram, Gauge, start_http_server

jobs_started = Counter('bats_jobs_started_total', 'Total jobs started')
jobs_completed = Counter('bats_jobs_completed_total', 'Total jobs completed', ['status'])
job_duration = Histogram('bats_job_duration_seconds', 'Job execution time')
active_workers = Gauge('bats_active_workers', 'Number of active workers', ['dcc_type'])

# In code:
jobs_started.inc()
job_duration.observe(elapsed_time)
active_workers.labels(dcc_type='maya').set(len(maya_workers))

# Expose metrics on :9090
start_http_server(9090)

Now Grafana can visualize: jobs per minute over time, average job duration by DCC type, worker pool utilization, error rates.

3. Add Job Scaffolding Tools

MODULE MODE is the first-class, recommended approach for production jobs. STRING MODE exists for quick experiments and prototyping, but all real work should be modules with proper IDE support, debugging, and version control.

Future improvement: Add CLI scaffolding for job creation:

$ bats create-job maya/my_new_asset
Created: job_orchestrator/jobs/studio/maya/my_new_asset.py
Template includes:
  - main(parameters) entry point
  - Example parameter parsing
  - Error handling boilerplate
  - Unit test stub

This would make it even easier to start new jobs with the right patterns from day one. The dual-mode design (STRING for prototyping, MODULE for production) is solid; scaffolding would just remove friction from the MODULE path.

4. Build the C# Client Earlier

I focused on Python clients for the first 6 months. But our game editor was C#, and the team had to wait for C# bindings.

Lesson: If you know you’ll need multi-language clients, build them in parallel. gRPC makes this easy (protoc generates client code), but testing takes time.


Lessons from the Journey

Parallelize at the Right Level

Part 2 attempted to thread pipeline steps (Houdini β†’ Substance β†’ Maya). Wrong: they’re sequential. Part 3 parallelized across assets (50 independent meshes). Correct: they’re truly independent.

def can_parallelize(work_items: list) -> bool:
    """Test if work items can truly run in parallel."""
    for i, item_a in enumerate(work_items):
        for j, item_b in enumerate(work_items):
            if i != j and depends_on(item_a, item_b):
                return False
    return True

# Part 2: Pipeline steps
depends_on("Substance", "Houdini") == True
can_parallelize(["Houdini", "Substance", "Maya"]) == False

# Part 3: Asset batch
depends_on("rock_002.fbx", "rock_001.fbx") == False
can_parallelize(["rock_001.fbx", ..., "rock_050.fbx"]) == True

Pull Beats Push for Work Distribution

The single architectural decision that made everything else simpler. Workers control their own capacity: busy workers don’t poll, crashed workers don’t lose jobs, the orchestrator doesn’t need push-coordination logic. The whole system gets more reliable for free.

Monitor from Day One

Don’t wait for production failures to build monitoring. Essential: worker pool status, queue depth, job history, error rates, performance metrics.

I built the system tray monitor in Phase 5. Should have been Phase 1.

Start Simple, Scale Later

BATS started as: 1 orchestrator, 2 Maya workers, no Houdini, STRING MODE only, no priorities, no dependencies.

It evolved to: 1 orchestrator, 8-50 workers, Maya + Houdini + Python, dual-mode execution, priority queue with dependency resolution.

Build the minimal viable version first. Add features when you feel the pain of not having them.

Separate Infrastructure from Jobs

Keep them in distinct directories and treat them as different codebases:

job_orchestrator/
β”œβ”€β”€ orchestrator/          # Stable infrastructure
β”œβ”€β”€ dcc_workers/           # DCC-specific workers
└── jobs/                  # Frequently updated
    β”œβ”€β”€ examples/
    └── studio/            # Your code here

Infrastructure changes don’t touch job code. Job changes don’t require an orchestrator restart. New team members onboard by writing jobs; they never need to understand the infrastructure to contribute.


Conclusion: The Right Parallelism at the Right Time

From single-asset QProcess (Part 2) to distributed BATS orchestration (Part 3), the most important lesson: parallelism is only useful when work can actually run in parallel.

Sounds obvious. But I’ve seen developers (including past me) add threading, async, multiprocessing, distributed systems without first asking: β€œWhat actually runs in parallel?”

The Three-Part Arc

Part 1: Tool Logging built visibility into the tool. Without logging, I never would have seen the threading anti-patterns in Part 2.

Part 2: Threading Anti-patterns removed threading from sequential work. Houdini β†’ Substance β†’ Maya pipeline cannot be parallelized, so simplifying to QProcess was correct.

Part 3 (this post) added distributed orchestration for parallel work. Processing 50 independent assets can and should be parallelized. BATS was the right solution.

Key Takeaways

Draw the dependency graph. If work items have dependencies, threading won’t help. If they’re independent, parallelism is valuable.

Measure before and after. I knew my speedup (6.7x with 8 workers) because I measured sequential time vs parallel time.

Start simple, scale later. QProcess for single assets, BATS for batch processing. Don’t build distributed systems until you need them.

Infrastructure and jobs are separate concerns. Keep them decoupled: infrastructure changes shouldn’t touch job code, and vice versa.

Monitoring is not optional. Distributed systems are invisible without monitoring. Build visibility first.

gRPC scales. Strong typing, streaming, multi-language support. Worth the learning curve for distributed systems.

What I Achieved

6.7x speedup (8 workers) in production. Scales to 40x speedup (50 workers) when needed. Clean separation: infrastructure (stable) vs jobs (frequently updated). Dual-mode execution: STRING (prototyping) + MODULE (production). Priority + dependency support for complex workflows. Real-time monitoring and debugging.

The Bigger Picture

These three posts tell the complete story of building production-ready game dev tools:

Logging - See what’s happening (observability)
Simplify - Remove needless complexity (architecture)
Scale - Add parallelism when provably useful (performance)

Each step builds on the previous. Logging revealed the threading anti-patterns. Simplifying to QProcess made the baseline fast and maintainable. Scaling with BATS gave true parallelism where it mattered.

The Golden Rules

β€œDon’t parallelize work that can’t run in parallel.” (Part 2 lesson)

β€œDo parallelize work that can and should run in parallel.” (Part 3 lesson)

β€œDraw the dependency graph first, then decide.” (The unifying principle)

What’s Next?

BATS is production-ready, but there’s always more to build:

  • C# client library - Unity/Unreal integration for in-editor job submission
  • Web dashboard - Browser-based monitoring with Prometheus + Grafana
  • Docker containerization - Deploy BATS as a containerized service
  • Cloud scaling - Spin up AWS EC2 workers for massive batches (1000+ assets)
  • More DCCs - Blender, Substance Painter, 3ds Max support

But the foundation is solid. No more anti-patterns. No more needless complexity. Just the right architecture for the right problem.

Want to Learn More?

Distributed Systems:

Concurrency Patterns:

DCC Pipeline Architecture:

Final thought: Distributed systems are hard. Start simple, measure everything, scale when needed.

What’s next? In Part 4, I’ll cover how I made BATS AI-callable through Model Context Protocol (MCP) integration - turning infrastructure into natural language APIs that Claude Desktop, Cursor, and VS Code Copilot can operate.


This post is Part 3 of a series on building maintainable game development tools:

Have you built distributed DCC pipelines? Fought with gRPC? Scaled beyond single-machine limits? Let me know in the comments or reach out on Twitter or LinkedIn.


Have questions or improvements to this pattern? Did you find errors, omissions, inaccurate statements, or flaws in the code snippets? Open an issue on the CO3DEX repository. Find me on the Discord (in O3DE).

Want to see the full BATS implementation? The orchestrator is open-source (coming soon to GitHub). Check back for the repository link!


import logging as _logging
_MODULENAME = 'co3dex.posts.distributed_orchestration_bats'
_LOGGER = _logging.getLogger(_MODULENAME)
_LOGGER.info(f'Initializing: {_MODULENAME} ... parallelism: know what, when, and how')




Disclaimer: Views expressed are my own. All opinions are my own. The opinions expressed here belong solely to myself, and do not reflect the views of my current employer Sony Interactive Entertainment (SIE), any previous employers including AWS/Amazon, the Open 3D Foundation, or their parent the Linux Foundation. I am bound by NDAs with my current and previous employers and am not authorized to speak on their behalf. If you are a reporter or news outlet seeking official statements, please contact the respective company's PR department (I will not reply to such requests).

License: For terms please see the LICENSE*.TXT file at the root of this distribution.

// Copyright HogJonny-AMZN. or his affiliates. All Rights Reserved.

// SPDX-License-Identifier: CC-BY-4.0


About Jonny Galloway
Jonny Galloway

Howdy I am HogJonny, a Sr. Design Technologist with AWS, working on technical art for Open 3D Engine.

Email : AMA_HogJonny@co3dex.com

Website : http://jonathankimballgalloway.me

About Jonny Galloway (HogJonny)

I'm a polymath technical art leader who bridges art, tools, engine, and product. Principal Technical Artist at Bluepoint Games (Sony/PlayStation) with 30+ years in AAA game development, specializing in real-time rendering, Python tooling, procedural world-building, and performance optimization.

Star
Categories
Useful Links