The clearest thing Karpathy has said in a while is that LLMs are not animals.
We've spent three years using the wrong analogy. We talk about LLMs the way we talk about a child or a dog: it learned, it understood, it made a mistake, it will get better with experience. The metaphor is sticky because it's convenient. It lets you treat the model like a small person and skip the part where you ask what it actually is.
The model is not a small person. The model is a statistical distillate of human text. It compresses every regularity that survived the training cut, including the regularity of being right about most things most of the time. That's not understanding. It's pattern coverage with very high resolution. A different thing, and one that fails in patterns of its own.
The reason this matters: we didn't copy birds to fly. We built something different that also flies, by understanding what flight actually was — pressure differentials over a curved surface, not flapping. Aviation got somewhere only after we stopped pretending birds were the spec. Karpathy's argument is that we're at the equivalent moment with intelligence. An LLM that "knows" Python doesn't know Python. It knows the distribution of Python that appeared on the open web, weighted by what humans wrote about it, with all the lopsidedness that implies.
That's why he's pushing the term agent engineering, with the same earnestness software took on when it stopped being a craft and became a discipline. Specs. Evals. Verification. The unglamorous middle layer.
Three things this shift makes concrete.
Vibe coding dies in production
Vibe coding works on a Saturday afternoon. A hackathon, a demo, the first internal pilot that everyone gets excited about because the thing speaks back. The illusion is reliable up to the moment a stranger uses it. Then the surface area expands by an order of magnitude, the inputs stop looking like the curated test set, and the system starts producing confident outputs nobody can trace.
A team that ships agents without specs and evals accumulates a particular kind of debt: behavior nobody can audit. Not technical debt. Interpretive debt. When something goes wrong, the only person who can explain why is the prompt author, and probably not even them, because the prompt was iterated by feel. Multiply that across an organization and you get systems whose correctness lives in someone's head, with no documentation, no regression tests, no contract.
Software took thirty years to learn this with conventional code. Type systems, unit tests, integration tests, contract tests, observability stacks. The discipline isn't enjoyable but it's what allows software to be operated by people other than its author. Agents are not exempt. If anything, they need it more, because the failure modes are silent. A wrong answer that sounds correct is harder to detect than a stack trace.
The discipline that works is borrowed from regular software. Eval suites in CI, with failure cases captured from real traffic. System prompts versioned and reviewed like code. Tool schemas typed. Model outputs validated against strict contracts that fail loud if the shape drifts. None of this is exotic. It's what backend engineers have been doing for decades. The novelty is having to insist it applies to the new component too.
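A minimal sketch of that last point, in the spirit of a pydantic-style contract; the InvoiceExtraction fields, the parse_or_fail helper, and the failure handling are illustrative, not a prescription:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical contract for one agent step: the only shape the rest of
# the pipeline is allowed to depend on.
class InvoiceExtraction(BaseModel):
    vendor: str
    invoice_number: str
    total_cents: int  # integer cents, so a drifting format fails instead of rounding

def parse_or_fail(raw_model_output: str) -> InvoiceExtraction:
    """Validate the model's JSON against the contract. No silent coercion."""
    try:
        return InvoiceExtraction.model_validate_json(raw_model_output)
    except ValidationError as exc:
        # Fail loud: log it, alert, route to a human queue. Never pass a
        # malformed object downstream.
        raise RuntimeError(f"Model output violated contract: {exc}") from exc
```

The library doesn't matter; what matters is that the contract is explicit, typed, and fails before the bad shape reaches anything downstream.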
Jagged skills are unmeasured risk
The phrase that keeps coming up — "talking to a PhD and a ten-year-old at the same time" — is the right one. A model that aces a Leetcode hard and fails to extract three fields from an invoice is not "almost there". It's a distribution of competence that doesn't map onto any human profile, and that's the part most teams skip.
When a senior engineer makes a mistake, you can predict roughly where the mistake will be. They get distracted on the boring tasks; they overengineer the simple ones; they push back on requirements; they leave gaps in documentation. The mistakes correlate with the kind of person they are. With LLMs the failure surface is not correlated with anything intuitive. The same model that wrote a sonnet will misread a date format. The same model that explained a transformer architecture will mis-escape a quote in a JSON payload.
This is what "jagged" means in practice, and it has a direct operational consequence: you cannot extrapolate competence. Performance on tasks A and B tells you nothing about C. You have to measure C separately. Most teams skip this because measuring is expensive and the demo was convincing.
The discipline that works is to map the jaggedness before you go live. Not "does the model work" — that's the wrong question. The right question is "where does it collapse, and is the collapse loud or silent?". Loud collapses are fine; you build a fallback. Silent collapses are the ones that kill you, because by the time you notice, the system has been wrong for weeks and the data downstream is contaminated.
In practice, that means adversarial evals at the seams — wherever a model's output becomes a downstream input. Held-out historical data, pathological inputs (truncated, malformed, multilingual, adversarial), edge cases mined from production. The measurement that matters is not whether the model gets the input right; it is whether it knows when it doesn't. Calibrated uncertainty is more valuable than calibrated answers.
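A rough sketch of what scoring that separately can look like; the seam (date extraction), the CASES list, and the call_model hook are all invented for illustration:

```python
# Hypothetical eval at one seam: date extraction from free-text invoices.
# Each case is (input, expected); expected=None means the only correct
# behaviour is to abstain.
CASES = [
    ("Invoice dated 3 April 2025, net 30", "2025-04-03"),  # well-formed
    ("Rechnung vom 3. April 2025",         "2025-04-03"),  # multilingual
    ("Invoice dated 03/04/2025",           None),          # ambiguous day/month
    ("Invoice dated [TRUNC",               None),          # malformed
]

def evaluate(call_model):
    """call_model(text) returns a date string, or None when it abstains."""
    scores = {"correct": 0, "wrong": 0, "abstained_ok": 0, "silent_failures": 0}
    for text, expected in CASES:
        answer = call_model(text)
        if expected is None:
            # A confident answer to garbage is the failure mode that hurts most.
            scores["abstained_ok" if answer is None else "silent_failures"] += 1
        else:
            scores["correct" if answer == expected else "wrong"] += 1
    return scores
```

The silent_failures count is the one to watch: it measures exactly the collapse nobody notices until the downstream data is contaminated.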
SWE-bench 87.6% is not your codebase
The benchmark numbers are useful for vendor comparisons and useless for operational planning. SWE-bench is a clean dataset of well-formed issues in popular open-source repositories, solved by a model and a scaffold tuned for exactly that setting. The 87.6% number describes what the model can do under those conditions, not under yours.
Your codebase is different. The issues are filed by product managers who don't speak engineering. The repository has six years of accreted context, abandoned migrations, dead modules, and three competing conventions for naming. The dependencies are partially internal and not documented anywhere a model can read. The success criterion is "the customer stops complaining", not "the test suite passes".
This is not a complaint about benchmarks. Benchmarks have their place. It's a warning about extrapolation. A model that hits 87.6% on a polished public dataset will hit something much lower on your code, and the drop will not be evenly distributed. The places it loses are precisely the places where context matters most: ambiguous requirements, conflicting constraints, code that was correct for a reason nobody bothered to write down.
The lesson I keep coming back to: the organization is the carrier of understanding. The output of the model — the diff, the answer, the matched candidate — is a byproduct. The understanding lives in the people, the documentation, the eval datasets, the post-mortems. If you let the model write the code without anybody on your team developing the comprehension of why that code is right, you have built a system that runs only as long as the model does.
What this looks like on a Wednesday
Three habits I've watched separate the teams that survive their AI adoption from the ones that quietly stall.
They read what the model writes. Sounds obvious. Most teams stop reading after the first month. The output looks good enough, the developer moves on. The compounding effect is that nobody on the team can explain the system anymore, and when it breaks, time-to-repair is time-to-re-derive what it was supposed to do in the first place. The teams that keep reading — even when it's slow, even when 90% of the output is fine — are the ones that retain the muscle.
They version their prompts like code. Prompts that ship to production live in the repo, go through code review, have changelogs. There is no situation where the prompt is "whatever someone put in the dashboard last week". That's a feature flag with no audit trail, and it will fail at the worst possible moment.
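One way that can look in code, sketched with hypothetical paths; the point is that the prompt the service runs is the file in the repo, at a known digest, and nothing else:

```python
from pathlib import Path
import hashlib

# The prompt that ships is a file in the repo (path illustrative), reviewed
# like code, with a changelog entry next to it.
PROMPT_PATH = Path("prompts/triage_agent/v3.md")

def load_prompt() -> tuple[str, str]:
    """Return the prompt plus a short digest, so every trace records exactly
    which prompt produced which output."""
    text = PROMPT_PATH.read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, digest
```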
They run evals on every change. Not just on the first deploy. Every prompt change, every model version bump, every tool schema modification triggers a benchmark run against a curated dataset of historical traffic. The dataset grows as production traffic surfaces new failure modes. This is unglamorous, expensive in setup time, and the single highest-leverage thing a team can do. It's also the part most teams skip because "it works" and the cost feels theoretical until it isn't.
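A sketch of the check itself, written pytest-style; the dataset path, the pass-rate threshold, and the run_agent harness are assumptions standing in for whatever your pipeline uses:

```python
import json
from pathlib import Path

# Curated cases mined from production traffic; path, format, and threshold
# are placeholders to adjust to your own setup.
DATASET = Path("evals/historical_traffic.jsonl")
MIN_PASS_RATE = 0.95

def test_agent_against_historical_traffic(run_agent):
    """Run in CI on every prompt, model, or tool-schema change.
    `run_agent` is whatever harness invokes the agent end to end."""
    cases = [json.loads(line) for line in DATASET.read_text().splitlines() if line]
    passed = sum(run_agent(c["input"]) == c["expected"] for c in cases)
    rate = passed / len(cases)
    assert rate >= MIN_PASS_RATE, f"Eval regression: {rate:.1%} < {MIN_PASS_RATE:.0%}"
```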
None of this is the future. It's the present, badly distributed.
The thesis that keeps standing
You can outsource thinking. You can't outsource understanding.
The model can do the thinking — the search, the draft, the synthesis, the cleanup. That's real, that's useful, that compresses weeks of work into hours. The understanding is what stays with you: knowing why the answer is what it is, what would change it, where it breaks, what the next version should look like.
A team that outsources both — that lets the model think and be the only thing that understands — has built a black box and forgotten where they put the key. They will be productive for some number of months, until they aren't, and the recovery is slow, because they have to rebuild the understanding from the ground up.
The teams that win the next five years are not the ones with the most aggressive AI adoption. They're the ones that adopt aggressively and keep their comprehension. They use the model to generate, but they read the output. They use the model to draft specs, but they revise them. They use the model to code, but they review the diffs as though a junior wrote them. The relationship is supervisory, not delegative, and the difference is where the understanding lives.
That's the part agent engineering is actually about. Not the frameworks, not the prompts, not the tool-calling APIs. The discipline of staying smarter than the thing you built.