Showing My Work: Closing an API Key Exposure
Memory, constraints, and review gates, applied to a recent update
Key Takeaways
- Systems beat tools. Anyone can access the models; advantage comes from process.
- Write the rules down. The biggest accelerant was clarity on intent, success criteria, and “do not touch” boundaries.
- The “memory layer” enables human/bot governance. A durable record of decisions, changes, and constraints is what turned ad-hoc sessions into a repeatable process.
- This is a personal build, but it gave me a hands-on way to internalize how memory, controls, and review gates make agentic work shippable.
A few weeks ago I wrote about building a concert tracker app with my bot friends. The most common follow-up: “What’s actually happening when you’re doing it?”
The honest answer is that it feels a lot like working with Leonard from Memento: brilliant, relentless, and completely capable, but always at risk of forgetting what he was just doing. The workflow I’ve built is basically the system of tattoos and notes that keeps everything on track between sessions.
This post walks through one real upgrade, migrating exposed API keys behind a secure gateway, to show how the workflow plays out in practice. One note: the tools differ from the last post, since Codex 5.3 shipped in early February.
What needed to change
I recently asked Gemini and Codex to perform code reviews to spot defects and security risks. One of those reviews flagged that several third-party API keys were exposed in the frontend code, and the corresponding API calls were being made directly from the browser. Anyone with browser dev tools open could extract those keys and misuse them, not ideal.
The fix: route external API calls through API Gateway and Lambda, with authentication handled by Cognito and enforced at the gateway. Provider keys stay server-side, and every request goes through one authenticated layer.
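As a rough sketch, the proxy Lambda for this pattern can be as small as the handler below. Everything here is illustrative: the provider URL, the `PROVIDER_API_KEY` environment variable name, and the response handling are assumptions for the sketch, not the app’s actual code.

```python
import os
from urllib import parse, request

# Hypothetical provider endpoint; the real app proxies several third-party APIs.
PROVIDER_BASE_URL = "https://api.example-provider.com/v1/search"


def build_upstream_request(query_params: dict) -> request.Request:
    # The provider key lives in the Lambda environment (or Secrets Manager),
    # so it never ships to the browser.
    api_key = os.environ["PROVIDER_API_KEY"]
    url = f"{PROVIDER_BASE_URL}?{parse.urlencode(query_params)}"
    return request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def handler(event, context):
    # API Gateway proxy integration; Cognito auth is enforced at the gateway,
    # so any request that reaches this handler is already authenticated.
    params = event.get("queryStringParameters") or {}
    upstream = build_upstream_request(params)
    with request.urlopen(upstream, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": body,
    }
```

The frontend then calls the gateway URL with a Cognito token instead of calling the provider directly, and the provider key stays out of the browser entirely.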
Here’s what that architecture update looks like:
The rest of this post is the migration path.
The approach
Before any agent writes code or takes action, it reads the project notes. The notebook is the system of record, a living spec, decision log, and change log in one place, written in Markdown for the agents and viewed in Obsidian for me. That’s how it gets up to speed without reconstructing the project state from scratch.
Once the agent has context, I give the design agent my intent and it writes the prompt.
This is the Memento problem from the opening, and the notebook is what solves it. I’ve had an agent that was mid-implementation of a feature in AWS, hit a compaction (the tool summarizing the session to fit its context window), and then finished the work locally without deploying it. When I asked it to push to AWS, it acted like it didn’t know the configuration, minutes after it had been working in it.
The prompt is the other half. The design agent (Claude Code in this case) takes my plain-English description of what I want and turns it into a build-ready prompt for the coding agent: the overall goal, detailed instructions, success criteria, and constraints (what not to touch).
See that red line at the bottom? “Do not change any other behavior. Do not modify prod Lambda, frontend hooks, or any other files.” Those are the constraints, and they exist because I learned what happens without them. When the goal, success criteria, and constraints are all defined before work starts, the prompt works like a contract. When any of them is missing, it’s a wish list.
Executing the change
The coding agent gets the prompt, explores the codebase, and starts planning. Codex does an excellent job of articulating a complete plan before executing, including looking for any failure modes.
During the production migration, the agent discovered that simply copying the dev Lambda code over to production would silently drop two fields, concert.venueId and concert.url, that the production app relies on. It flagged the regression risk, explained why it mattered, and adjusted its plan before building anything.
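The check the agent effectively performed can be sketched as a field diff. The two field names, concert.venueId and concert.url, come from the review itself; the other fields and the sample payload are made up for illustration.

```python
# Fields the production frontend depends on. venueId and url are the two the
# agent caught; the remaining names are placeholders for this sketch.
REQUIRED_CONCERT_FIELDS = {"id", "artist", "date", "venueId", "url"}


def missing_fields(concert: dict) -> set:
    # Which prod-required fields would a candidate payload silently drop?
    return REQUIRED_CONCERT_FIELDS - concert.keys()


# A dev-shaped payload, pre-fix: copying it to prod would lose venueId and url.
dev_concert = {"id": "c-101", "artist": "Wilco", "date": "2026-03-01"}
print(sorted(missing_fields(dev_concert)))  # → ['url', 'venueId']
```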
During the same migration, I asked: “Are you considering compatibility with both stage and prod front ends?” The agent answered honestly: “partially”, and offered to add a staging validation checklist to the plan.
One more detail worth showing: when it’s time to deploy, the agent asks for permission first. Claude Code has always done this, but Codex takes it a step further; it summarizes what it wants to do in plain language before showing the command.
QA with three sets of eyes
I run the reviews without shared chat history; each agent starts from the repo plus the notebook and a fresh prompt, each with a different objective: audit the plan, audit the code, and verify fixes. Gemini CLI reviewed the migration plan against the original spec. Claude Code reviewed the actual implementation and flagged, for example:
- Must fix: null response edge case in the Unsplash fallback
- Must fix: missing input validation on artist name and destination queries
- Should fix: stale third-party keys still sitting in the staging environment
- Should fix: duplicate variable declaration (not a bug, but a readability trap for the next person or agent who reads the code)
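To make the first two findings concrete, here is a hedged sketch of what the fixes might look like. The function names, the allow-list regex, and the placeholder asset path are mine, not the app’s.

```python
import re

PLACEHOLDER_IMAGE = "/assets/placeholder.jpg"          # illustrative fallback asset
QUERY_PATTERN = re.compile(r"^[\w .,&'()-]{1,100}$")   # illustrative allow-list


def image_or_fallback(unsplash_response) -> str:
    # Null-safe fallback: a None response or an empty results list both
    # degrade to a placeholder instead of raising.
    if not unsplash_response:
        return PLACEHOLDER_IMAGE
    results = unsplash_response.get("results") or []
    return results[0]["urls"]["small"] if results else PLACEHOLDER_IMAGE


def validate_query(value: str) -> str:
    # Basic input validation for artist/destination strings before they are
    # forwarded to any provider.
    value = (value or "").strip()
    if not QUERY_PATTERN.match(value):
        raise ValueError("invalid query")
    return value
```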
The fixes went back through the same loop: audit agent prioritized them, coding agent implemented them, audit agent verified them.
Before promoting anything to staging, the coding agent runs its own smoke tests against the live dev environment: auth guard, all seven external endpoints, response-shape validation across every provider.
These are health checks, not exhaustive UAT. What passes my testing moves to staging, and what doesn’t goes back into the backlog with a priority.
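The response-shape part of those smoke tests is easy to sketch. The endpoint names and expected fields below are placeholders; the real pass runs against the live dev environment across all seven external endpoints.

```python
# Hypothetical per-endpoint shape expectations for the smoke-test pass.
EXPECTED_SHAPES = {
    "concerts": {"id", "artist", "date", "venueId", "url"},  # placeholder fields
    "photos": {"id", "urls"},
}


def has_shape(payload: dict, required: set) -> bool:
    # Response-shape validation: every required key present and non-null.
    return all(payload.get(k) is not None for k in required)


def smoke_check(endpoint: str, sample: dict) -> bool:
    # One health check per endpoint; a failure sends the work back to the backlog.
    return has_shape(sample, EXPECTED_SHAPES[endpoint])
```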
I’m currently maintaining two frontends in parallel: a V1 in production and a V2 in staging. V2 doesn’t yet have every feature I want, so the two coexist, and I can validate changes in both versions against a common backend.
And after all of that (the specs, the prompts, the builds, the reviews, the smoke tests), here’s the app running in Prod and Staging with real data.
Close the loop
The last step in every cycle: the notebook gets updated. Release notes, decision log entries, backlog reprioritization, and any changes to specs. The agents do most of this automatically; the coding agent commits the changes and appends a changelog entry to the project notes as part of the build.
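The mechanics of that append step are simple. A minimal sketch, assuming a Markdown entry format of my own invention:

```python
from datetime import date


def render_entry(summary: str, decisions: list[str]) -> str:
    # Render a Markdown changelog entry like the ones the agents append.
    lines = [f"## {date.today().isoformat()}: {summary}", ""]
    lines += [f"- {d}" for d in decisions]
    return "\n".join(lines) + "\n\n"


def append_changelog(path: str, summary: str, decisions: list[str]) -> None:
    # Append-only: earlier entries are never rewritten, so the notebook stays
    # a trustworthy record for the next session.
    with open(path, "a", encoding="utf-8") as f:
        f.write(render_entry(summary, decisions))
```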
The next time I start a session, the agent reads this notebook first.
What breaks
I want to be clear: not everything works perfectly. This process still requires human intervention and troubleshooting.
- Vague specs produce confident wrong answers. I mentioned the constraint line in the prompt earlier; it wasn’t always there. Early on, I’d give agents a task without clear boundaries and they’d “improve” things I didn’t ask them to touch. Defining scope up front is the single most effective thing I’ve done to reduce rework.
- The “Memento” problem is real. The notebook helps, but it’s not a complete fix. I’ve learned to monitor where I am in the context window and compact early when a task looks like it might not fit; managing context is part of the workflow.
- The agent doesn’t know what it doesn’t know. The automated smoke tests all passed, but they were the agent testing its own work. My UAT caught things the automated pass missed. You still need a human in the loop, at least for now.
- Costs can add up. Right now this small project isn’t costing me much, but costs could scale with the size of the project and team. In an enterprise setting, you’d want guardrails around token consumption and agent-initiated infrastructure changes to prevent cost sprawl across cloud and API usage.
Closing thoughts: the system is a differentiator
The screenshots in this post could be from any feature or update. The loop is the same:
read notes → prompt → build → review → test → ship → update notes
Everyone has access to the same models; the system you build around them is what compounds.
If you’re wondering whether this could work for one of your teams, give it a try. Define a process and collaboration system that includes the agents, start with one small but well-defined task, and use it to learn what works before you scale it.
These are my personal observations from tinkering on a side project—they don’t necessarily reflect the views of my firm. That said, we have some great AI tools and solutions, and I’d love to tell you about them.