// COMING 2026

Prompt Recovery

A novel about building AI systems that actually work. Think The Goal and The Phoenix Project — but for teams shipping large language models into production.

by Michael John Peña

Get notified at launch · Read the premise
Prompt Recovery book cover

⚠ Cover art is a work-in-progress placeholder

ssh sarah@autoscale — agentos-prod — tmux
sarah@autoscale:~$ kubectl get pods -n agentos-prod
NAME                   READY  STATUS            RESTARTS
agentos-router-7b4f9   0/1    CrashLoopBackOff  47
agentos-llm-worker-3   0/1    OOMKilled         12
agentos-eval-runner    1/1    Running (lying)
agentos-gateway-a2c1   1/1    Running           0

sarah@autoscale:~$ cat /var/log/billing-alert.log | tail -3
[CRIT] OpenAI spend: $47,231.89 / 24h (budget: $2,000)
[WARN] Token burn rate: 4.2M tok/min — 3x normal
[INFO] Cost anomaly detection triggered at 11:47 PM

sarah@autoscale:~$ ./recover.sh --plan --agents=all
▸ Loading recovery playbook...
▸ 25 chapters. 90 days. One shot.
sarah@autoscale:~$
$ agentctl status --watch
┌─ Agent Fleet ─────────────┐
│ router     ✗ crashed (47x) │
│ planner    ⚠ looping       │
│ retriever  ✓ healthy       │
│ executor   ⚠ throttled     │
│ evaluator  ✗ lying         │
└────────────────────────────┘
↻ refreshing in 5s...
$ stern -n agentos-prod --since 5m
02:44:12 planner  → "Retrying prompt… attempt 94"
02:44:13 router   → panic: nil pointer dereference
02:45:01 executor → rate limit hit (429)
02:46:58 eval     → assert failed: "accuracy" > 0.7
02:47:02 gateway  → Sarah connected from 10.0.1.42
02:47:03 gateway  → "Let's fix this."
incident 0:ops-triage* 1:agents 2:eval 3:logs sarah@autoscale   02:47 AM
// The Premise

Day one.
Everything is already on fire.

Sarah Chen is a seasoned engineering leader who has just been hired to run the AI platform at AutoScale — a fast-growing startup whose crown jewel, AgentOS, is held together by one exhausted engineer, duct tape, and good intentions.

Sarah's Slack notification sounded at 11:47 PM. Then again at 11:48. By 11:50, her phone was buzzing with the intensity of a trapped bee. She knew what that meant: production was on fire, and she was about to do something reckless about it.

She untangled herself from the couch, dislodging Kernel from her lap and earning a look of betrayal that only a cat could deliver with such precision.

— Chapter 1: Into the Fire

She has 90 days before the board pulls the plug. What follows is a crash course in building AI systems that survive contact with reality — told through the lens of one team's fight to turn chaos into something they can be proud of.

// What You'll Learn

Real engineering. Real consequences.

Every chapter embeds production-grade AI engineering concepts inside a story you can't put down.

🪟

Context Window Architecture

Why your prompts break at scale and how to design context as a contract, not an afterthought.

🛡️

Guardrails & Safety

Prompt injection, jailbreaks, and the layered defense patterns that keep AI systems from going off the rails.

📊

Evaluation That Doesn't Lie

Moving beyond vibes-based testing. Building eval frameworks that catch failures before your customers do.

🔁

Agent Orchestration

Multi-agent systems, cascade failures, circuit breakers, and the patterns that make AI agents reliable.

👁️

Observability & Cost Control

Tracing LLM calls, spotting the $47,000 Tuesday before it happens, and building dashboards that matter.

⚖️

AI Ethics in Practice

Not theory — the messy, real-world moments when technically correct recommendations have devastating human consequences.

// The Structure

Three acts. Twenty-five chapters.

A ninety-day journey from inherited chaos to production confidence.

Act I — Days 1–30

Inheriting Chaos

Sarah discovers what's broken: runaway costs, a single point of failure, shadow agents nobody owns, evaluations that lie, and a team on the edge of burnout.

Chapters 1–8
Act II — Days 31–72

Building the Foundation

With the clock ticking, the team rebuilds around three principles: treat context as a contract, build reliability through orchestration, and deploy with humility.

Chapters 9–18
Act III — Days 73–90

Trial by Fire

The enterprise demo. A crisis of values. A walkout. And the beginning of everything that comes after.

Chapters 19–25

// Who It's For

If you've ever been paged at 2 AM over an AI system, this book is for you.

// If You Liked

The DNA of this book.

Prompt Recovery lives at the intersection of these five books.

📕
The Goal
Eliyahu Goldratt
The theory of constraints — applied to AI pipelines
📕
The Phoenix Project
Gene Kim et al.
DevOps through narrative — our format inspiration
🤖
AI Engineering
Chip Huyen
Production ML systems — the textbook behind the story
⚙️
Release It!
Michael Nygard
Stability patterns — circuit breakers and bulkheads in practice
🧠
The Alignment Problem
Brian Christian
AI ethics and safety — the human cost of getting it wrong

If any of these are on your shelf, Prompt Recovery was written for you.

// About the Author

Michael John Peña

Michael is a software engineer, author, and Microsoft MVP who has spent over a decade shipping AI and cloud systems across startups and enterprises in Australia, the Philippines, and beyond. He has worked on everything from LLM orchestration to IoT platforms — and now writes fiction about the messy, human side of deploying AI at scale.

Website → LinkedIn → GitHub →

// Stay in the Loop

Get notified when it launches.

No spam. Just a single email on launch day — plus an optional early chapter preview for subscribers.