Showing My Work: Closing an API Key Exposure
Memory, constraints, and review gates, applied to a recent update
Key Takeaways
- Systems beat tools. Anyone can access the models; advantage comes from process.
- Write the rules down. The biggest accelerant was clarity on intent, success criteria, and “do not touch” boundaries.
- The “memory layer” enables human/bot governance. A durable record of decisions, changes, and constraints is what turned ad-hoc sessions into a repeatable process.
- This is a personal build, but it gave me a hands-on way to internalize how memory, controls, and review gates make agentic work shippable.
A few weeks ago I wrote about building a concert tracker app with my bot friends. The most common follow-up: “What’s actually happening when you’re doing it?”
The honest answer is that it feels a lot like working with Leonard from Memento: brilliant, relentless, and completely capable, but always at risk of forgetting what he was just doing. The workflow I’ve built is basically the system of tattoos and notes that keeps everything on track between sessions.
This post walks through one real upgrade, migrating exposed API keys behind a secure gateway, to show how the workflow plays out in practice. One note: the tools differ from the last post, since Codex 5.3 shipped in early February.
What needed to change
I recently asked Gemini and Codex to perform code reviews to spot defects and security risks. One of those reviews flagged that several third-party API keys were exposed in the frontend code, and the corresponding API calls were being made directly from the browser. Anyone with browser dev tools open could extract those keys and misuse them, not ideal.
The fix: route external API calls through API Gateway and Lambda, with authentication handled by Cognito and enforced at the gateway. Provider keys stay server-side, and every request goes through one authenticated layer.
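As a rough sketch, the proxy Lambda for this pattern can be as small as the handler below. Everything here is illustrative: the provider URL, the `PROVIDER_API_KEY` environment variable name, and the response handling are assumptions for the sketch, not the app’s actual code.

```python
import os
from urllib import parse, request

# Hypothetical provider endpoint; the real app proxies several third-party APIs.
PROVIDER_BASE_URL = "https://api.example-provider.com/v1/search"


def build_upstream_request(query_params: dict) -> request.Request:
    # The provider key lives in the Lambda environment (or Secrets Manager),
    # so it never ships to the browser.
    api_key = os.environ["PROVIDER_API_KEY"]
    url = f"{PROVIDER_BASE_URL}?{parse.urlencode(query_params)}"
    return request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def handler(event, context):
    # API Gateway proxy integration; Cognito auth is enforced at the gateway,
    # so any request that reaches this handler is already authenticated.
    params = event.get("queryStringParameters") or {}
    upstream = build_upstream_request(params)
    with request.urlopen(upstream, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": body,
    }
```

The frontend then calls the gateway URL with a Cognito token instead of calling the provider directly, and the provider key stays out of the browser entirely.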
Here’s what that architecture update looks like:
The rest of this post is the migration path.
The approach
Before any agent writes code or takes action, it reads the project notes. The notebook is the system of record, a living spec, decision log, and change log in one place, written in Markdown for the agents and viewed in Obsidian for me. That’s how it gets up to speed without reconstructing the project state from scratch.
Once the agent has context, I give the design agent my intent and it writes the prompt.
This is the Memento problem from the opening, and the notebook is what solves it. I’ve had an agent that was mid-implementation of a feature in AWS, hit a compaction (the tool summarizing the session to fit its context window), and then finished the work locally without deploying it. When I asked it to push to AWS, it acted like it didn’t know the configuration, minutes after it had been working in it.
The prompt is the other half. The design agent (Claude Code in this case) takes my plain-English description of what I want and turns it into a build-ready prompt for the coding agent: the overall goal, detailed instructions, success criteria, and constraints (what not to touch).
See that red line at the bottom? “Do not change any other behavior. Do not modify prod Lambda, frontend hooks, or any other files.” Those are the constraints, and they exist because I learned what happens without them. When the goal, success criteria, and constraints are all defined before work starts, the prompt works like a contract. When any of them is missing, it’s a wish list.
Executing the change
The coding agent gets the prompt, explores the codebase, and starts planning. Codex does an excellent job of articulating a complete plan before executing, including looking for any failure modes.
During the production migration, the agent discovered that simply copying the dev Lambda code over to production would silently drop two fields, concert.venueId and concert.url, that the production app relies on. It flagged the regression risk, explained why it mattered, and adjusted its plan before building anything.
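The check the agent effectively performed can be sketched as a field diff. The two field names, concert.venueId and concert.url, come from the review itself; the other fields and the sample payload are made up for illustration.

```python
# Fields the production frontend depends on. venueId and url are the two the
# agent caught; the remaining names are placeholders for this sketch.
REQUIRED_CONCERT_FIELDS = {"id", "artist", "date", "venueId", "url"}


def missing_fields(concert: dict) -> set:
    # Which prod-required fields would a candidate payload silently drop?
    return REQUIRED_CONCERT_FIELDS - concert.keys()


# A dev-shaped payload, pre-fix: copying it to prod would lose venueId and url.
dev_concert = {"id": "c-101", "artist": "Wilco", "date": "2026-03-01"}
print(sorted(missing_fields(dev_concert)))  # → ['url', 'venueId']
```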
During the same migration, I asked: “Are you considering compatibility with both stage and prod front ends?” The agent answered honestly: “partially”, and offered to add a staging validation checklist to the plan.
One more detail worth showing: when it’s time to deploy, the agent asks for permission first. Claude Code has always done this, but Codex takes it a step further; it summarizes what it wants to do in plain language before showing the command.
QA with three sets of eyes
I run the reviews without shared chat history; each agent starts from the repo plus the notebook and a fresh prompt, each with a different objective: audit the plan, audit the code, and verify fixes. Gemini CLI reviewed the migration plan against the original spec. Claude Code reviewed the actual implementation and flagged, for example:
- Must fix: null response edge case in the Unsplash fallback
- Must fix: missing input validation on artist name and destination queries
- Should fix: stale third-party keys still sitting in the staging environment
- Should fix: duplicate variable declaration (not a bug, but a readability trap for the next person or agent who reads the code)
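To make the first two findings concrete, here is a hedged sketch of what the fixes might look like. The function names, the allow-list regex, and the placeholder asset path are mine, not the app’s.

```python
import re

PLACEHOLDER_IMAGE = "/assets/placeholder.jpg"          # illustrative fallback asset
QUERY_PATTERN = re.compile(r"^[\w .,&'()-]{1,100}$")   # illustrative allow-list


def image_or_fallback(unsplash_response) -> str:
    # Null-safe fallback: a None response or an empty results list both
    # degrade to a placeholder instead of raising.
    if not unsplash_response:
        return PLACEHOLDER_IMAGE
    results = unsplash_response.get("results") or []
    return results[0]["urls"]["small"] if results else PLACEHOLDER_IMAGE


def validate_query(value: str) -> str:
    # Basic input validation for artist/destination strings before they are
    # forwarded to any provider.
    value = (value or "").strip()
    if not QUERY_PATTERN.match(value):
        raise ValueError("invalid query")
    return value
```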
The fixes went back through the same loop: audit agent prioritized them, coding agent implemented them, audit agent verified them.
Before promoting anything to staging, the coding agent runs its own smoke tests against the live dev environment: auth guard, all seven external endpoints, response-shape validation across every provider.
These are health checks, not exhaustive UAT. What passes my testing moves to staging, and what doesn’t goes back into the backlog with a priority.
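The response-shape part of those smoke tests is easy to sketch. The endpoint names and expected fields below are placeholders; the real pass runs against the live dev environment across all seven external endpoints.

```python
# Hypothetical per-endpoint shape expectations for the smoke-test pass.
EXPECTED_SHAPES = {
    "concerts": {"id", "artist", "date", "venueId", "url"},  # placeholder fields
    "photos": {"id", "urls"},
}


def has_shape(payload: dict, required: set) -> bool:
    # Response-shape validation: every required key present and non-null.
    return all(payload.get(k) is not None for k in required)


def smoke_check(endpoint: str, sample: dict) -> bool:
    # One health check per endpoint; a failure sends the work back to the backlog.
    return has_shape(sample, EXPECTED_SHAPES[endpoint])
```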
I’m currently maintaining two frontends in parallel: a V1 in production and a V2 in staging. V2 doesn’t yet have every feature I want, so the two coexist, and I can validate changes in both versions against a common backend.
And after all of that (the specs, the prompts, the builds, the reviews, the smoke tests), here’s the app running in Prod and Staging with real data.
Close the loop
The last step in every cycle: the notebook gets updated. Release notes, decision log entries, backlog reprioritization, and any changes to specs. The agents do most of this automatically; the coding agent commits the changes and appends a changelog entry to the project notes as part of the build.
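The mechanics of that append step are simple. A minimal sketch, assuming a Markdown entry format of my own invention:

```python
from datetime import date


def render_entry(summary: str, decisions: list[str]) -> str:
    # Render a Markdown changelog entry like the ones the agents append.
    lines = [f"## {date.today().isoformat()}: {summary}", ""]
    lines += [f"- {d}" for d in decisions]
    return "\n".join(lines) + "\n\n"


def append_changelog(path: str, summary: str, decisions: list[str]) -> None:
    # Append-only: earlier entries are never rewritten, so the notebook stays
    # a trustworthy record for the next session.
    with open(path, "a", encoding="utf-8") as f:
        f.write(render_entry(summary, decisions))
```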
The next time I start a session, the agent reads this notebook first.
What breaks
I want to be clear: not everything works perfectly. This process still requires human intervention and troubleshooting.
- Vague specs produce confident wrong answers. I mentioned the constraint line in the prompt earlier; it wasn’t always there. Early on, I’d give agents a task without clear boundaries and they’d “improve” things I didn’t ask them to touch. Defining scope up front is the single most effective thing I’ve done to reduce rework.
- The “Memento” problem is real. The notebook helps, but it’s not a complete fix. I’ve learned to monitor where I am in the context window and compact early when a task looks like it might not fit; managing context is part of the workflow.
- The agent doesn’t know what it doesn’t know. The automated smoke tests all passed, but they were the agent testing its own work. My UAT caught things the automated pass missed. You still need a human in the loop, at least for now.
- Costs can add up. Right now this small project isn’t costing me much, but costs could scale with the size of the project and team. In an enterprise setting, you’d want guardrails around token consumption and agent-initiated infrastructure changes to prevent cost sprawl across cloud and API usage.
Closing thoughts: the system is a differentiator
The screenshots in this post could be from any feature or update. The loop is the same:
read notes → prompt → build → review → test → ship → update notes
Everyone has access to the same models; the system you build around them is what compounds.
If you’re wondering whether this could work for one of your teams, give it a try. Define a process and collaboration system that includes the agents, start with one small but well-defined task, and use it to learn what works before you scale it.
These are my personal observations from tinkering on a side project—they don’t necessarily reflect the views of my firm. That said, we have some great AI tools and solutions, and I’d love to tell you about them.