Determinism Over Convenience: Building Automation That Can Be Trusted

A practical rulebook for designing automation systems that are reproducible, observable, auditable, and safe to re-run.

I design and build production-grade automation systems. My work spans infrastructure automation, AI workflow orchestration, developer tooling, iOS Shortcuts systems, backend automation, and high-fidelity interfaces.

Most automation fails for one reason:

It was designed to work once, not to be trusted repeatedly.

A script that succeeds on a clean machine is useful. A workflow that survives retries, partial failure, stale state, bad inputs, and operator mistakes is infrastructure.

That distinction matters.

Automation is not just about removing manual work. Good automation creates repeatable execution. It gives you the same result from the same inputs, exposes what it changed, and makes failure recoverable instead of mysterious.

This is the operating principle I use:

If a system cannot be re-run safely, audited clearly, and rolled back deliberately, it is not production-grade automation.


Convenience Is Not Reliability

Convenient automation optimizes for speed.

Production automation optimizes for trust.

A convenient script might:

  • assume the current directory

  • mutate files in place

  • depend on hidden environment variables

  • skip validation

  • fail halfway through without recording state

  • require the operator to “just know” what happened

That may be acceptable for a one-off local task.

It is not acceptable for systems that affect infrastructure, data, deployments, credentials, workflows, or client-facing behavior.

The problem is not that scripts are bad. The problem is that many scripts are built without a system model.

A reliable automation system needs to answer five questions before it changes anything:

  1. What state exists now?

  2. What state do we want?

  3. What changes are required?

  4. How do we verify success?

  5. How do we recover if execution fails?

Without those answers, automation becomes accelerated uncertainty.


Determinism Means Predictable State Transitions

Determinism does not mean nothing ever fails.

It means behavior is predictable.

Given the same inputs, configuration, environment assumptions, and prior state, the system should produce the same result or fail in the same controlled way.

For automation, determinism usually requires:

  • explicit inputs

  • validated dependencies

  • known execution context

  • stable configuration

  • idempotent operations

  • structured logging

  • bounded side effects

  • clear success criteria

A deterministic workflow should not depend on guesswork.

It should not silently behave differently because the shell changed, a path was missing, a package version drifted, or a previous run left partial state behind.
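
As a minimal sketch of what "explicit inputs" and "known execution context" can look like in bash: the script below pins the locale, resolves its own directory instead of trusting the caller's working directory, and validates one input against an allow-list. `TARGET_ENV`, its allowed values, and `validate_target_env` are hypothetical names chosen for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pin the locale so sorting and pattern matching do not vary by machine.
export LC_ALL=C

# Resolve the script's own directory rather than trusting the caller's
# working directory. Falls back to $0 when BASH_SOURCE is unavailable.
SCRIPT_DIR="$(cd "$(dirname -- "${BASH_SOURCE[0]:-$0}")" && pwd)"

# validate_target_env VALUE
# Accepts only values from a fixed allow-list (hypothetical for this sketch).
validate_target_env() {
  case "${1:-}" in
    staging|production) return 0 ;;
    *) echo "ERROR: TARGET_ENV must be 'staging' or 'production', got '${1:-}'" >&2
       return 1 ;;
  esac
}

# The value is passed explicitly; nothing is inferred from ambient state.
TARGET_ENV="staging"
validate_target_env "$TARGET_ENV"

# Record the context so a later run can be compared against this one.
echo "[context] script_dir=$SCRIPT_DIR target_env=$TARGET_ENV bash=$BASH_VERSION"
```

Printing the resolved context is a cheap way to explain, after the fact, why two runs behaved differently.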


Idempotency Is the Foundation

An idempotent operation can be safely repeated.

That is the difference between this:

# Non-idempotent: appends every time it runs.
echo "PORT=3000" >> .env

And this:

# Idempotent: only appends if the line is missing.
touch .env
grep -qxF "PORT=3000" .env || echo "PORT=3000" >> .env

The first command changes the file every time it runs.

The second command only changes the file if the desired line is missing.

That small difference becomes critical when automation is retried.

Retries are not edge cases. Retries are normal. Networks fail. APIs time out. Processes crash. Operators re-run commands. CI jobs restart.

If re-running a workflow corrupts state, duplicates configuration, recreates resources incorrectly, or destroys existing work, the workflow is not safe.

Idempotency should be designed into every layer:

  • file writes

  • database migrations

  • deployment scripts

  • API calls

  • infrastructure provisioning

  • generated artifacts

  • notification systems

  • AI workflow outputs

The rule is simple:

Re-running the same automation should converge the system toward the desired state, not push it further into drift.
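
The `grep || echo` pattern above handles a missing line, but not a line whose value has changed. A slightly fuller sketch of convergence, assuming simple `KEY=VALUE` files and keys without regex metacharacters, could look like this. `ensure_kv` is a hypothetical helper, not a standard tool.

```bash
#!/usr/bin/env bash
set -euo pipefail

# ensure_kv FILE KEY VALUE
# Converges FILE so that it contains exactly one "KEY=VALUE" line.
ensure_kv() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"

  # Already in the desired state: change nothing.
  if grep -qxF "${key}=${value}" "$file" \
     && [[ "$(grep -c "^${key}=" "$file")" -eq 1 ]]; then
    echo "[skip] ${key} already set in ${file}"
    return 0
  fi

  # Write a temp file next to the target, then rename it into place.
  # Note: mktemp creates the file with restrictive permissions, so the
  # target's mode may change after the rename.
  tmp="$(mktemp "${file}.XXXXXX")"
  grep -v "^${key}=" "$file" > "$tmp" || true   # grep exits 1 if no lines remain
  echo "${key}=${value}" >> "$tmp"
  mv "$tmp" "$file"
  echo "[change] ${key}=${value} written to ${file}"
}

# Demo in a throwaway directory.
DEMO_DIR="$(mktemp -d)"
ensure_kv "$DEMO_DIR/.env" PORT 3000   # first run: writes the line
ensure_kv "$DEMO_DIR/.env" PORT 3000   # second run: detects no change
ensure_kv "$DEMO_DIR/.env" PORT 8080   # new value: replaces, never duplicates
```

However many times it runs, the file ends with one `PORT` line holding the most recently requested value.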


Preflight Before Mutation

A production-grade automation flow should separate diagnostics from change.

Before making changes, it should verify:

  • required commands exist

  • expected files exist

  • permissions are sufficient

  • environment variables are present

  • target paths are correct

  • remote services are reachable

  • configuration is parseable

  • the operation is safe for the current environment

A basic shell pattern looks like this:

#!/usr/bin/env bash
set -euo pipefail

APP_DIR="${APP_DIR:-}"
REQUIRED_COMMANDS=("git" "docker")

fail() {
  echo "ERROR: $*" >&2
  exit 1
}

check_command() {
  command -v "$1" >/dev/null 2>&1 || fail "Missing required command: $1"
}

preflight() {
  [[ -n "$APP_DIR" ]] || fail "APP_DIR is not set"
  [[ -d "$APP_DIR" ]] || fail "APP_DIR does not exist: $APP_DIR"

  for cmd in "${REQUIRED_COMMANDS[@]}"; do
    check_command "$cmd"
  done

  [[ -f "$APP_DIR/docker-compose.yml" ]] || fail "Missing docker-compose.yml"
}

main() {
  preflight
  echo "Preflight passed. Safe to continue."

  # Mutation logic goes here.
}

main "$@"

This script does not assume the environment is correct.

It proves it.

That is the difference between automation and hope.
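
One possible variation, sketched here as a design choice rather than a requirement: instead of stopping at the first failed check, collect every failure and report them together, so the operator fixes the environment in one pass. `check`, `WORK_DIR`, and `app.conf` are hypothetical names for this example.

```bash
#!/usr/bin/env bash
set -euo pipefail

failures=0

# check DESCRIPTION COMMAND...
# Runs COMMAND quietly and records the outcome without stopping the run.
check() {
  local description="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[preflight] ok: $description"
  else
    echo "[preflight] FAILED: $description" >&2
    failures=$(( failures + 1 ))
  fi
}

# A temp directory keeps the sketch self-contained.
WORK_DIR="$(mktemp -d)"

check "bash is on PATH"            command -v bash
check "work directory is writable" test -w "$WORK_DIR"
check "app.conf exists"            test -f "$WORK_DIR/app.conf"   # fails on purpose here

if (( failures > 0 )); then
  # A real script would exit non-zero at this point, before any mutation.
  echo "[preflight] $failures check(s) failed; no changes were made" >&2
else
  echo "[preflight] all checks passed"
fi
```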


Logs Are Part of the Interface

If automation changes something, it should say what it changed.

If it skips something, it should say why.

If it fails, it should say where and how.

Logs should not be treated as afterthoughts. They are the interface between the system and the operator.

Useful logs answer:

  • What was attempted?

  • What inputs were used?

  • What state was detected?

  • What changed?

  • What was skipped?

  • What failed?

  • What should happen next?

Poor logging says:

Done.

Useful logging says:

[preflight] docker found: /usr/bin/docker
[preflight] config found: /srv/app/docker-compose.yml
[deploy] current revision: 9f23a81
[deploy] target revision: b6c77ad
[deploy] pulling image: app:b6c77ad
[verify] health check passed: 200 OK
[result] deployment completed successfully

Automation should leave a trail.

Not noise. Evidence.
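
A minimal sketch of a logging helper that produces lines in the format shown above, with a UTC timestamp added. `log` and `LOG_FILE` are hypothetical names. The temp-file default exists only so the example runs on its own.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical log destination; defaults to a temp file for this sketch.
LOG_FILE="${LOG_FILE:-$(mktemp)}"

# log PHASE MESSAGE...
# Writes "<UTC timestamp> [phase] message" to stderr and appends it to LOG_FILE.
log() {
  local phase="$1"
  shift
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$phase" "$*" \
    | tee -a "$LOG_FILE" >&2
}

log deploy "current revision: 9f23a81"
log deploy "target revision: b6c77ad"
log result "deployment completed successfully"
```

Sending log lines to stderr keeps stdout free for data, so the script can still be piped into other tools.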


Rollback Is Not Optional

Rollback should not be invented during an incident.

If a workflow can change production state, it needs a recovery path before it runs.

Rollback may be simple:

  • restore a previous config file

  • redeploy the previous container image

  • revert a symlink

  • restore a database snapshot

  • disable a feature flag

  • reapply the last known-good artifact

The mechanism depends on the system.

The requirement does not.

A deployment without rollback is not a deployment process. It is a bet.

For small systems, even a basic release structure helps:

releases/
  2025-01-01-120000/
  2025-01-03-090000/
current -> releases/2025-01-03-090000
previous -> releases/2025-01-01-120000

With that structure, rollback becomes a controlled state transition:

ln -sfn "$PREVIOUS_RELEASE" "$APP_ROOT/current"

The point is not complexity.

The point is reversibility.
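
To make the symlink layout concrete, here is a sketch of activation and rollback against that directory structure. `activate_release` and `rollback` are hypothetical helpers, and the demo runs in a temp directory rather than a real application root.

```bash
#!/usr/bin/env bash
set -euo pipefail

# activate_release APP_ROOT RELEASE_NAME
# Records the active release as "previous", then repoints "current".
activate_release() {
  local app_root="$1" release="$2" active=""
  if [[ ! -d "$app_root/releases/$release" ]]; then
    echo "ERROR: no such release: $release" >&2
    return 1
  fi
  if [[ -L "$app_root/current" ]]; then
    active="$(readlink "$app_root/current")"
  fi
  # Re-activating the current release is a no-op, so "previous" is preserved.
  if [[ "$active" == "releases/$release" ]]; then
    return 0
  fi
  if [[ -n "$active" ]]; then
    ln -sfn "$active" "$app_root/previous"
  fi
  ln -sfn "releases/$release" "$app_root/current"
}

# rollback APP_ROOT
# Repoints "current" at whatever "previous" recorded.
rollback() {
  local app_root="$1"
  if [[ ! -L "$app_root/previous" ]]; then
    echo "ERROR: no previous release recorded" >&2
    return 1
  fi
  ln -sfn "$(readlink "$app_root/previous")" "$app_root/current"
}

# Demo in a throwaway application root.
APP_ROOT="$(mktemp -d)"
mkdir -p "$APP_ROOT/releases/2025-01-01-120000" "$APP_ROOT/releases/2025-01-03-090000"
activate_release "$APP_ROOT" 2025-01-01-120000
activate_release "$APP_ROOT" 2025-01-03-090000
rollback "$APP_ROOT"
readlink "$APP_ROOT/current"
```

Because the rollback target is recorded at activation time, nobody has to work out during an incident which release was last known-good.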


AI Workflows Need the Same Discipline

AI workflows are often treated as inherently fuzzy.

That is a mistake.

The model may be probabilistic, but the surrounding system does not have to be chaotic.

A production-grade AI workflow should still define:

  • input schema

  • prompt version

  • model version

  • temperature and parameters

  • expected output format

  • validation rules

  • retry behavior

  • storage location

  • audit trail

  • fallback behavior

Without those controls, AI automation becomes difficult to debug.

If an output changes, you need to know why.

Was it the input? The prompt? The model? The parameters? The retrieval context? The post-processing logic?

A reproducible AI workflow treats prompts, schemas, and evaluations as system components, not magic text.
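
A rough sketch of how a run manifest and an output check might wrap a model call. Everything here is a placeholder: `call_model` is a stub that returns fixed text so the example runs offline, and the version strings, the `SUMMARY:` output format, and the file names are invented for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical run parameters, pinned and recorded with every run.
PROMPT_VERSION="summarize-v3"
MODEL_VERSION="example-model-2025-01"
TEMPERATURE="0"
RUN_DIR="$(mktemp -d)"

# Stand-in for a real model call; echoes part of the input back.
call_model() {
  echo "SUMMARY: $(head -c 40 "$1")"
}

# Validation rule for this sketch: exactly one line, starting with "SUMMARY: ".
validate_output() {
  [[ "$(grep -c '' "$1")" -eq 1 ]] && grep -q '^SUMMARY: .' "$1"
}

run_workflow() {
  local input="$1"
  call_model "$input" > "$RUN_DIR/output.txt"

  # Record what is needed to explain this output later.
  {
    echo "prompt_version=$PROMPT_VERSION"
    echo "model_version=$MODEL_VERSION"
    echo "temperature=$TEMPERATURE"
    echo "input_cksum=$(cksum < "$input")"
    echo "output_cksum=$(cksum < "$RUN_DIR/output.txt")"
  } > "$RUN_DIR/manifest.txt"

  if ! validate_output "$RUN_DIR/output.txt"; then
    echo "ERROR: output failed validation; see $RUN_DIR/manifest.txt" >&2
    return 1
  fi
}

printf 'Quarterly report text\n' > "$RUN_DIR/input.txt"
run_workflow "$RUN_DIR/input.txt"
cat "$RUN_DIR/manifest.txt"
```

With a manifest per run, "why did the output change?" becomes a comparison between two small files rather than a guess.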


My Automation Rulebook

When I design automation, I use these rules.

1. No Hidden State

The system should not depend on undocumented assumptions.

2. No Irreversible Change Without a Checkpoint

If the operation can damage state, create a recovery path first.

3. No Mutation Before Validation

Preflight checks should run before changes.

4. No Silent Failure

Every failure should be visible, structured, and actionable.

5. No Unsafe Retries

Re-running should converge or stop safely.
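
One way to make "stop safely" concrete is a bounded retry wrapper: it tries a command a fixed number of times and then gives up with a visible error. This is a sketch, and `retry` and `flaky` are hypothetical names. `flaky` simulates a command that fails twice before succeeding.

```bash
#!/usr/bin/env bash
set -euo pipefail

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max )); then
      echo "[retry] giving up after $attempt attempt(s): $*" >&2
      return 1
    fi
    echo "[retry] attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Simulated flaky command: succeeds on the third call.
COUNTER_FILE="$(mktemp)"
echo 0 > "$COUNTER_FILE"
flaky() {
  local n
  n=$(( $(cat "$COUNTER_FILE") + 1 ))
  echo "$n" > "$COUNTER_FILE"
  (( n >= 3 ))
}

retry 5 0 flaky
echo "[result] succeeded after $(cat "$COUNTER_FILE") attempt(s)"
```

A wrapper like this is only safe around commands that are themselves idempotent. Retrying a non-idempotent step just repeats the damage.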

6. No Success Without Verification

Completion is not success.

Verified state is success.

7. No Production Workflow Without Rollback

If rollback does not exist, the change process is incomplete.


A Simple Mental Model

Every automation system can be modeled as:

Observed State → Desired State → Planned Change → Executed Change → Verified State

Weak automation jumps directly from desire to execution.

Strong automation observes, plans, changes, verifies, and records.

That sequence creates trust.

It also makes systems easier to maintain because every phase has a purpose:

  • Observation prevents false assumptions.

  • Planning reduces unintended change.

  • Execution performs bounded mutation.

  • Verification proves the result.

  • Logging preserves evidence.
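
The phases above can be sketched for the smallest possible target: a file whose content should match a desired value. `converge_file` and `STATE_FILE` are hypothetical names, and a real system would substitute its own observe, execute, and verify steps.

```bash
#!/usr/bin/env bash
set -euo pipefail

# converge_file PATH DESIRED_CONTENT
# Observe -> plan -> execute -> verify, logging each phase.
converge_file() {
  local path="$1" desired="$2" observed=""

  # Observe: read the current state instead of assuming it.
  if [[ -f "$path" ]]; then
    observed="$(cat "$path")"
  fi
  echo "[observe] $path: ${observed:-<missing or empty>}"

  # Plan: decide whether any change is needed.
  if [[ "$observed" == "$desired" ]]; then
    echo "[plan] no change required"
    return 0
  fi
  echo "[plan] set content to: $desired"

  # Execute: one bounded mutation.
  printf '%s\n' "$desired" > "$path"
  echo "[execute] wrote $path"

  # Verify: read back the state rather than trusting the write.
  if [[ "$(cat "$path")" == "$desired" ]]; then
    echo "[verify] $path matches desired state"
  else
    echo "[verify] $path does not match desired state" >&2
    return 1
  fi
}

STATE_FILE="$(mktemp -d)/version"
converge_file "$STATE_FILE" v1   # creates the file
converge_file "$STATE_FILE" v2   # changes it
converge_file "$STATE_FILE" v2   # detects nothing to do
```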


Final Thought

Automation should not merely make work faster.

It should make work safer.

The goal is not to build scripts that succeed under ideal conditions. The goal is to build systems that behave predictably under real conditions.

That means deterministic inputs, idempotent execution, observable behavior, explicit verification, and deliberate rollback.

Convenience is useful.

But in production, trust wins.

Production-Grade Automation

Part 1 of 2

A technical series on building automation systems that are deterministic, observable, auditable, idempotent, and safe to operate in real production environments.
