A good sanity check for long-horizon agents is not a benchmark. It is a task that is easy to verify and hard to fake.
That is why I still like my ...
Most “long-horizon agents” today are just fragile loops that haven’t failed yet.
They look great in controlled demos, but once you let them run longer, things start drifting in subtle ways, and it adds up fast. Not because the model is bad, but because we're stretching it into workflows it wasn't really designed for.
What worries me is we’re pushing toward more autonomy without really solving control. Especially with memory in the mix… that’s not just a feature, it’s also a new way for things to go wrong over time.
Feels like we’re a bit early to be talking about “autopilot” when we don’t fully understand how these systems behave after step 20 or 30.
Curious if you think this gets solved with better tooling; or if we’re missing something more fundamental in how we design these agents.
I'm also under the impression that the autonomy we get today is fragile; you often get a minefield that is hard to navigate. Yet I also think that expectations of agents producing fine deliverables on the first attempt with little human involvement are both (a) inflated and (b) contradictory.
Inflated because we noticed the step change in capability at the end of last year and very quickly started demanding and expecting more from AI agents. The already huge increase in capability is taken for granted now; people just push their demands forward.
And contradictory in a philosophical sense, because what is control if you can't make sense of the subject? How much involvement and understanding is required on the human side? To exaggerate the arrogance/ignorance side of the question: what if the human asking to draw 7 parallel lines, 2 of which must cross, doesn't understand (or isn't able to understand) their own ignorance?
Yeah that makes sense; especially the part about expectations getting ahead of reality.
I think what you said about “control” is where things get tricky. Right now it feels like we don’t really have control in the traditional sense… more like influence and correction after the fact.
And that works fine when a human is in the loop. But as soon as we try to reduce that involvement, the gaps show up pretty quickly.
The “7 parallel lines” example actually fits well; a lot of the time the system looks like it understands, until you push it into edge cases or longer chains.
Maybe that’s the real limitation right now:
not capability, but lack of reliable grounding over time.
This resonates deeply -- I've been running a 24/7 autonomous agent for 2+ months, and the "operational shift" you describe is the real story.
One thing I'd add: the long-horizon problem isn't just about whether the model can sustain coherent work over time. It's about whether the architecture around it can make efficient decisions about what level of intelligence each step actually needs.
In my agent's production data, 87.4% of decisions (routing, classification, quality checks) run on 0.8B-1.5B parameter models. The frontier model only handles the remaining ~12% that genuinely require deep reasoning. This isn't about cost optimization -- it's about matching cognitive complexity to task complexity. Most of the agent's "thinking" is more like reflexes than deliberation.
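The routing idea above can be sketched roughly like this. Everything here is illustrative: the model names, the task-type set, and the token threshold are assumptions, not the commenter's actual implementation.

```python
# Sketch of complexity-based routing: cheap heuristic checks decide whether
# a step needs the frontier model or a small local one. Model identifiers
# and the heuristic are hypothetical, for illustration only.

SMALL_MODEL = "local-1.5b"      # hypothetical small-model id
FRONTIER_MODEL = "frontier-xl"  # hypothetical frontier-model id

# "Reflex" work: routing, classification, quality checks.
REFLEX_TASKS = {"route", "classify", "quality_check"}

def pick_model(task_type: str, context_tokens: int) -> str:
    """Match cognitive complexity to task complexity."""
    if task_type in REFLEX_TASKS and context_tokens < 2000:
        return SMALL_MODEL
    # Anything creative or long-context goes to the big model.
    return FRONTIER_MODEL

print(pick_model("classify", 300))           # local-1.5b
print(pick_model("implement_feature", 300))  # frontier-xl
```

The interesting part isn't the `if` statement; it's that the router itself must be cheap and accurate, or the ~12% of genuinely hard steps get misrouted to a model that can't handle them.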
Your hyperlink_button test is a great example of where this matters. Those cross-language, cross-framework tasks are exactly the 12% that need the big model. The question isn't "can agents do long-horizon tasks?" -- it's "can the routing infrastructure accurately identify which steps need what level of intelligence?"
The verification loop you describe (making the agent prove it didn't cheat) maps to the same insight. Verification is almost always a small-model task -- pattern matching against expected outputs. But the original creative implementation? That needs the frontier model. Getting this split right is, I think, the real engineering challenge most agent builders haven't confronted yet.
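The generation/verification asymmetry can be shown in a toy sketch: verification is just "run the artifact and pattern-match the output", which needs no deep reasoning. The marker string and command are made up for illustration.

```python
# Toy verification step: execute the artifact and pattern-match its output
# against expected markers -- mechanical work, no deliberation required.
import subprocess
import sys

def verify_run(cmd: list[str], expected_markers: list[str]) -> bool:
    """Return True iff the command exits cleanly and its output
    contains every expected marker."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return result.returncode == 0 and all(m in output for m in expected_markers)

# Writing the script would be the frontier model's job; checking it is not.
ok = verify_run([sys.executable, "-c", "print('button rendered')"],
                ["button rendered"])
print(ok)  # True
```

The expensive creative step and the cheap checking step have completely different cost profiles, which is exactly why the split matters.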
The "easy to verify, hard to fake" framing is the most useful thing I have read about agent evaluation in a while. Most benchmark games get played on synthetic tasks where plausible-looking output is enough to score. A task with a cross-language integration surface - Python backend, React frontend, packaging, docs - is adversarial in the right way: it exposes whether the agent actually understood the constraint or just pattern-matched toward something that looks right. That gap between looks plausible and works is where most production agent failures live.
The "very fast worker inside a good harness" framing is exactly right. The harness question is where most setups overcomplicate things. I figured out a easy way to run long-horizon agents with zero infrastructure , using Notion as persistent state, SKILL.md files as behavior, scheduled Cowork sessions as the executor. Stateless between runs = no drift. If you're curious check out: github.com/srf6413/cstack
Long-horizon tasks are definitely where current models struggle the most. I've found this is especially true in complex frameworks like Next.js 15, where an agent might get the first 3 steps right but then 'hallucinate' a deprecated API pattern in the 4th. This gap between 'helpful assistant' and 'full autopilot' is exactly why we need better rules and safety rails. Great breakdown of the current state of agency!
The hyperlink_button test is a good framing. Verification is harder than execution for agents.
The gap you're describing isn't just about whether agents produce working code -- it's about whether you can trust the report of what they did. An agent that says 'I implemented the requirement and ran the tests' and one that actually did look identical from the outside if you're only checking the output.
METR's task-length metric measures capability, not trustworthiness. The question of how you prove an agent's actions match its claimed actions is a separate problem that doesn't get enough attention yet.
Trust, verification, reliability: those are the same questions you have when a team of devs is shipping a product. And what you typically do about it is have other people tasked with breaking the product and hunting for bugs.
The gap isn't capability, it's accountability. Long-horizon agents execute well but still can't explain why they took a specific path when something goes wrong. Until you can replay an agent's decision tree at the tool-call level, full trust is premature. Logging every tool call with the input that triggered it is a start — it's not full autopilot, but it's auditable.
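The tool-call logging idea can be sketched as a decorator that records each call with the exact input that triggered it. A minimal sketch, assuming an in-memory log; a real setup would append to durable storage, and the `read_file` tool here is a placeholder.

```python
# Audit every tool call with its triggering input, so the agent's path
# can be replayed at the tool-call level after the fact.
import functools
import time

AUDIT_LOG = []  # in production: an append-only file or database

def logged_tool(fn):
    """Wrap a tool so every invocation is recorded, even on failure."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        entry = {"tool": fn.__name__, "args": args,
                 "kwargs": kwargs, "ts": time.time()}
        try:
            entry["result"] = fn(*args, **kwargs)
            return entry["result"]
        finally:
            AUDIT_LOG.append(entry)  # logged whether the tool raised or not
    return wrapper

@logged_tool
def read_file(path: str) -> str:
    return f"<contents of {path}>"   # placeholder tool body

read_file("config.yaml")
print(AUDIT_LOG[0]["tool"], AUDIT_LOG[0]["args"])  # read_file ('config.yaml',)
```

This doesn't explain *why* the agent chose a path, but it makes the claimed path and the actual path the same record, which is the auditability floor.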
You can always interrogate the agent about its motivations and decisions; that doesn't seem to be a problem... though I rarely find value in those recalls.