13 points syx 5 days ago 14 comments
I'm curious how people are catching subtle bugs or technical debt when the LLM produces something that works but might be unoptimized.
throwup238 5 days ago | parent
You then take a photo of the cracked bone and feed it back to your coding agent, which has been properly trained in interpreting Oracle Bones to extract PR review comments.
If the PR is too big to fit on the bone, you reject it for being too big. If after three rounds of review the bones keep cracking in the same spot, reject the PR. You accept the PR once the bone starts to seep bone marrow before cracking (it will crack first if there are any PR comments left).
Davidbrcz 4 days ago | parent
- Compile it with the maximum number of warnings enabled
- Run linters/analyzers/fuzzers on it (see the sketch after this list)
- Ask another LLM to review it
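A minimal sketch of automating the first two steps, assuming a C codebase with gcc and clang-tidy on the PATH (both tool choices are stand-ins for whatever the project actually uses):

    import subprocess, sys

    # Gate LLM-generated C code: compile with maximum warnings treated as
    # errors, then run a static analyzer. Any non-zero exit fails the gate.
    def gate(path: str) -> bool:
        checks = [
            ["gcc", "-Wall", "-Wextra", "-Wpedantic", "-Werror",
             "-c", path, "-o", "/dev/null"],
            ["clang-tidy", path, "--"],
        ]
        for cmd in checks:
            if subprocess.run(cmd).returncode != 0:
                print("FAILED:", " ".join(cmd), file=sys.stderr)
                return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if gate(sys.argv[1]) else 1)

The point is that the gate is mechanical: the agent's output never reaches human (or second-LLM) review until the compiler and analyzer are silent.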
raw_anon_1111 3 days ago | parent
1. I wrote the code in BASIC
2. I wrote the code in assembly
3. I got a further improvement because storing to and reading from the first page of memory took two clock cycles instead of three.
But this isn’t 1986, this is 2026. I “vibe coded” my first project this year. I designed the AWS architecture from an empty account using IaC, I chose every service, I verified every permission, I chose and designed the orchestration and the concurrency model, and I gathered the requirements. What I didn’t do is look at a line of Python code or infrastructure code, aside from the permissions that Codex generated.
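The comment doesn't say how the permissions were verified; one concrete way to check a generated role on AWS is the IAM policy simulator rather than eyeballing the policy JSON. A sketch with made-up ARNs:

    import boto3  # assumes AWS credentials are already configured

    iam = boto3.client("iam")

    # Simulate a single action against a generated role instead of
    # trusting the policy document by eye. Role and bucket ARNs are made up.
    result = iam.simulate_principal_policy(
        PolicySourceArn="arn:aws:iam::123456789012:role/my-worker",
        ActionNames=["s3:GetObject"],
        ResourceArns=["arn:aws:s3:::my-bucket/*"],
    )
    for r in result["EvaluationResults"]:
        print(r["EvalActionName"], r["EvalDecision"])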
Now to answer your questions:
How did I validate the correctness? Just as if I had written it myself. I had Codex create a shell script to do end-to-end tests of all the scenarios I cared about, and when one broke, I went back to Codex to fix it. I was very detailed about the scenarios.
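The commenter used a shell script; here is the same idea as a Python sketch, with a hypothetical endpoint and payload standing in for the real scenarios:

    import json, urllib.request

    BASE = "https://example.invalid/api"  # hypothetical endpoint

    # One end-to-end scenario: submit a transaction through the write
    # path, then assert on what the service reports back.
    def run_scenario(payload: dict, expected_status: str) -> None:
        req = urllib.request.Request(
            f"{BASE}/transactions",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        assert body["status"] == expected_status, body

    run_scenario({"amount": 100, "currency": "USD"}, "ACCEPTED")

Each scenario becomes one such call with its own payload and expected outcome.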
The web front end that I used was built by another developer. I haven’t touched web dev in a decade. I told Codex what changes I needed and I verified the changes by deploying it and testing it manually.
How did I validate the performance? Again, just like I would on something I wrote myself. I tested it first with a few hundred transactions to verify the functionality, and then I stress tested it with a real-world volume of transactions. The first iteration broke horribly. Not because of Claude Code. It was a bad design.
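A stress test of that shape might look like the following sketch, with the actual system call left as a placeholder:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def one_call(i: int) -> float:
        start = time.perf_counter()
        # ... invoke the real system here (HTTP request, queue publish, etc.)
        return time.perf_counter() - start

    # Fire n transactions across a worker pool and report tail latency,
    # which is usually where a bad design shows itself first.
    def stress(n: int = 10_000, workers: int = 64) -> None:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = sorted(pool.map(one_call, range(n)))
        print(f"p50={latencies[n // 2]:.3f}s  p99={latencies[int(n * 0.99)]:.3f}s")

    stress()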
But here’s the beauty: the bad implementation took me a day instead of the three or four days it would have taken by hand. I then redesigned it, dropped the AWS service, and came up with a design that was much more scalable, and that took a day too. I knew in theory how it worked under the hood, but not in practice. Again, I tested for scalability by testing the result.
The architectural quality? I validated it by synthesizing real-world traffic. ChatGPT in thinking mode did find a subtle concurrency bug. That was my fault, though: I designed the concurrency implementation, and Codex just did what I told it to do.
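The comment doesn't say what the bug was; the classic shape of a subtle concurrency bug is a check-then-act race, illustrated here (not the commenter's actual bug):

    import threading

    available = 1
    lock = threading.Lock()

    def reserve_racy() -> None:
        global available
        if available > 0:      # check
            available -= 1     # act: another thread may interleave in between

    def reserve_safe() -> None:
        global available
        with lock:             # check and act under one lock, atomically
            if available > 0:
                available -= 1

Races like this pass small functional tests and only surface under synthesized concurrent traffic, which is why that kind of load matters.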
Subtle bugs happen whether a person writes the code or an agent does. You do the best you can with your tests, and when bugs come up, you fix them.
How do I prevent technical debt? All large implementations have technical debt. Again, just like when I lead a team: I componentize everything with clean interfaces. That makes things easier for coding agents and people alike.
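One way that componentization might look in Python (names here are hypothetical): business logic depends on a narrow interface rather than a concrete service, so any backend, including an in-memory stand-in for tests, can slot in.

    from typing import Protocol

    class TransactionStore(Protocol):
        def put(self, txn_id: str, payload: dict) -> None: ...
        def get(self, txn_id: str) -> dict: ...

    # Satisfies TransactionStore structurally; the real backend could be
    # DynamoDB, Postgres, or anything else with the same two methods.
    class InMemoryStore:
        def __init__(self) -> None:
            self._data: dict[str, dict] = {}

        def put(self, txn_id: str, payload: dict) -> None:
            self._data[txn_id] = payload

        def get(self, txn_id: str) -> dict:
            return self._data[txn_id]

    def process(store: TransactionStore, txn_id: str, payload: dict) -> dict:
        store.put(txn_id, payload)
        return store.get(txn_id)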