13 points syx 5 days ago 14 comments
I'm curious how people are catching subtle bugs or technical debt when the LLM produces something that works but might be unoptimized.
throwup238 5 days ago | parent
You then take a photo of the cracked bone and feed it back to your coding agent, which has been properly trained in interpreting Oracle Bones to extract PR review comments.
If the PR is too big to fit on the bone, you reject it for being too big. If after three rounds of review the bones keep cracking in the same spot, reject the PR. You accept the PR once the bone starts to seep bone marrow before cracking (it will crack first if there are any PR comments left).
Davidbrcz 4 days ago | parent
- Compile it with the maximum number of warnings enabled
- Run linters/analyzers/fuzzers on it (see the sketch after this list)
- Ask another LLM to review it
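A minimal sketch of automating the first two steps, assuming a C codebase with gcc and clang-tidy on the PATH (both tool choices are stand-ins for whatever the project actually uses):

    import subprocess, sys

    # Gate LLM-generated C code: compile with maximum warnings treated as
    # errors, then run a static analyzer. Any non-zero exit fails the gate.
    def gate(path: str) -> bool:
        checks = [
            ["gcc", "-Wall", "-Wextra", "-Wpedantic", "-Werror",
             "-c", path, "-o", "/dev/null"],
            ["clang-tidy", path, "--"],
        ]
        for cmd in checks:
            if subprocess.run(cmd).returncode != 0:
                print("FAILED:", " ".join(cmd), file=sys.stderr)
                return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if gate(sys.argv[1]) else 1)

The point is that the gate is mechanical: the agent's output never reaches human (or second-LLM) review until the compiler and analyzer are silent.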
raw_anon_1111 3 days ago | parent
1. I wrote the code in BASIC
2. I wrote the code in assembly
3. I got a further improvement because storing to and reading from the first page of memory took two clock cycles instead of three.
But this isn’t 1986, this is 2026. I “vibe coded” my first project this year. I designed the AWS architecture from an empty account using IaC, I chose every service, I verified every permission, I chose and designed the orchestration and the concurrency model, and I gathered the requirements. What I didn’t do is look at a line of Python code or infrastructure code, aside from the permissions that Codex generated.
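The comment doesn't say how the permissions were verified; one concrete way to check a generated role on AWS is the IAM policy simulator rather than eyeballing the policy JSON. A sketch with made-up ARNs:

    import boto3  # assumes AWS credentials are already configured

    iam = boto3.client("iam")

    # Simulate a single action against a generated role instead of
    # trusting the policy document by eye. Role and bucket ARNs are made up.
    result = iam.simulate_principal_policy(
        PolicySourceArn="arn:aws:iam::123456789012:role/my-worker",
        ActionNames=["s3:GetObject"],
        ResourceArns=["arn:aws:s3:::my-bucket/*"],
    )
    for r in result["EvaluationResults"]:
        print(r["EvalActionName"], r["EvalDecision"])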
Now to answer your questions:
How did I validate the correctness? Just as if I had written it myself. I had Codex create a shell script to do end-to-end tests of all the scenarios I cared about, and when one broke, I went back to Codex to fix it. I was very detailed about the scenarios.
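The commenter used a shell script; here is the same idea as a Python sketch, with a hypothetical endpoint and payload standing in for the real scenarios:

    import json, urllib.request

    BASE = "https://example.invalid/api"  # hypothetical endpoint

    # One end-to-end scenario: submit a transaction through the write
    # path, then assert on what the service reports back.
    def run_scenario(payload: dict, expected_status: str) -> None:
        req = urllib.request.Request(
            f"{BASE}/transactions",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        assert body["status"] == expected_status, body

    run_scenario({"amount": 100, "currency": "USD"}, "ACCEPTED")

Each scenario becomes one such call with its own payload and expected outcome.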
The web front end that I used was built by another developer. I haven’t touched web dev in a decade. I told Codex what changes I needed and I verified the changes by deploying it and testing it manually.
How did I validate the performance? Again, just like I would on something I wrote myself. I tested it first with a few hundred transactions to verify the functionality, and then I stress tested it with a real-world volume of transactions. The first iteration broke horribly. Not because of Claude Code. It was a bad design.
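A stress test of that shape might look like the following sketch, with the actual system call left as a placeholder:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def one_call(i: int) -> float:
        start = time.perf_counter()
        # ... invoke the real system here (HTTP request, queue publish, etc.)
        return time.perf_counter() - start

    # Fire n transactions across a worker pool and report tail latency,
    # which is usually where a bad design shows itself first.
    def stress(n: int = 10_000, workers: int = 64) -> None:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = sorted(pool.map(one_call, range(n)))
        print(f"p50={latencies[n // 2]:.3f}s  p99={latencies[int(n * 0.99)]:.3f}s")

    stress()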
But here’s the beauty: the bad implementation took me a day instead of the three or four days it would have taken by hand. I then redesigned it, dropped the AWS service, and came up with a design that was much more scalable, and that took a day too. I knew in theory how it worked under the hood, but not in practice. Again, I tested for scalability by testing the result.
The architectural quality? I validated it by synthesizing real-world traffic. ChatGPT in thinking mode did find a subtle concurrency bug. That was my fault, though: I designed the concurrency implementation, and Codex just did what I told it to do.
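The comment doesn't say what the bug was; the classic shape of a subtle concurrency bug is a check-then-act race, illustrated here (not the commenter's actual bug):

    import threading

    available = 1
    lock = threading.Lock()

    def reserve_racy() -> None:
        global available
        if available > 0:      # check
            available -= 1     # act: another thread may interleave in between

    def reserve_safe() -> None:
        global available
        with lock:             # check and act under one lock, atomically
            if available > 0:
                available -= 1

Races like this pass small functional tests and only surface under synthesized concurrent traffic, which is why that kind of load matters.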
Subtle bugs happen whether a person writes the code or an agent does. You do the best you can with your tests, and when bugs come up, you fix them.
How do I prevent technical debt? All large implementations have technical debt. Again, just like when I lead a team: I componentize everything with clean interfaces. That makes things easier for coding agents and people alike.
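One way that componentization might look in Python (names here are hypothetical): business logic depends on a narrow interface rather than a concrete service, so any backend, including an in-memory stand-in for tests, can slot in.

    from typing import Protocol

    class TransactionStore(Protocol):
        def put(self, txn_id: str, payload: dict) -> None: ...
        def get(self, txn_id: str) -> dict: ...

    # Satisfies TransactionStore structurally; the real backend could be
    # DynamoDB, Postgres, or anything else with the same two methods.
    class InMemoryStore:
        def __init__(self) -> None:
            self._data: dict[str, dict] = {}

        def put(self, txn_id: str, payload: dict) -> None:
            self._data[txn_id] = payload

        def get(self, txn_id: str) -> dict:
            return self._data[txn_id]

    def process(store: TransactionStore, txn_id: str, payload: dict) -> dict:
        store.put(txn_id, payload)
        return store.get(txn_id)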