Why structured workflows beat one-shot prompting on hard reasoning

Hand the same arithmetic puzzle to the same model twice. Asked to think it through step by step, GPT-4 solves the Game of 24 about four times in a hundred. Wrapped in a structured search instead, the very same model solves it seventy-four times in a hundred. Nothing about the intelligence changed between those two runs. Only the process around it changed, and the process was worth almost everything. A second study, on an entirely different family of tasks, found the same shape of result, a structured action chain scoring 70.7 where plain step-by-step reasoning scored 42.5. The pattern keeps repeating across tasks and across research groups. On problems that are long, layered and easy to get wrong, how a model is made to work matters far more than which model it is.

It is worth being clear about what is being compared. A one-shot prompt asks for the whole answer in a single pass. Chain-of-thought asks for the reasoning in that same single pass, the model narrating its steps before committing to an answer. A structured workflow does something different in kind. It breaks the task into stages, lets the model retrieve, check or evaluate between them, and allows it to abandon a line of attack rather than ride the first one to the end. The three look like cousins. On a hard problem they behave like different algorithms.

Search instead of a single guess

The cleanest demonstration is Tree of Thoughts, the work of Shunyu Yao and colleagues at Princeton and Google DeepMind. Their idea was to stop treating reasoning as a single line and start treating it as a search over a tree, where each node is a partial solution. For the Game of 24, the puzzle breaks into three moves, and at each move the model proposes several possible next steps rather than one. A second pass then scores those candidates, marking each as sure, likely or impossible depending on whether that line can still reach 24. The search keeps the most promising handful, expands them, and quietly drops the branches that have been judged hopeless.
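The propose, score, keep and prune loop described above can be sketched in a few lines. This is only an illustration of the control flow, not the authors' code. In the paper both the proposing and the scoring are done by the language model. Here, so the sketch runs on its own, `propose` simply enumerates every arithmetic move and `evaluate` is an exact brute-force check standing in for the model's judgment. Those two functions, the name `tree_search`, and the two-level score (the text describes three labels) are all assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

TARGET = Fraction(24)

def propose(state):
    """Stand-in for the model's 'propose next steps' call (assumption):
    every way to combine two remaining numbers into one."""
    nums, steps = state
    out = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        moves = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                 (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            moves.append((a / b, f"{a} / {b}"))
        if a != 0:
            moves.append((b / a, f"{b} / {a}"))
        for value, text in moves:
            new_nums = tuple(sorted(rest + [value]))
            out.append((new_nums, steps + (f"{text} = {value}",)))
    return out

def reachable(nums):
    """Exact check: can these numbers still be combined into 24?"""
    if len(nums) == 1:
        return nums[0] == TARGET
    return any(reachable(child[0]) for child in propose((nums, ())))

def evaluate(state):
    """Stand-in for the model's scoring call (assumption):
    1.0 if the partial solution can still reach 24, 0.0 if it cannot."""
    return 1.0 if reachable(state[0]) else 0.0

def tree_search(numbers, beam_width=5):
    """Breadth-first search that keeps only the best few partial solutions."""
    frontier = [(tuple(sorted(Fraction(n) for n in numbers)), ())]
    for _ in range(len(numbers) - 1):      # three moves for four numbers
        candidates = [c for s in frontier for c in propose(s)]
        scored = [(evaluate(c), c) for c in candidates]
        scored = [(v, c) for v, c in scored if v > 0]   # drop hopeless branches
        scored.sort(key=lambda vc: vc[0], reverse=True)
        frontier = [c for _, c in scored[:beam_width]]  # keep the top handful
        if not frontier:
            return None
    return frontier[0][1]                  # the steps of one surviving line

if __name__ == "__main__":
    for step in tree_search([4, 9, 10, 13]):
        print(step)
```

With a real model the evaluator would be noisy rather than exact, which is why a beam wider than one matters: it leaves room for a good branch that was scored a little too low.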

The numbers are what make the point land. Plain prompting solved about seven puzzles in a hundred. Chain-of-thought solved four. Even chain-of-thought with self-consistency, which generates a hundred separate reasoning chains and lets them vote, climbed only to about nine. The structured search reached seventy-four. That comparison with self-consistency is the one to sit with, because it kills the obvious objection. Sampling a hundred attempts barely helped, since every one of those hundred attempts carried the same defect: a single left-to-right pass with no way to look ahead or take back a bad move. Adding more tries at a flawed method did almost nothing. Adding a method that could search and prune did almost everything.
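For readers who have not seen it, the voting step in self-consistency is roughly the following. It is a minimal sketch with made-up placeholder answers, not output from any real run, and the function name `self_consistency` is an assumption for the example.

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the most frequent final answer and its share of the votes."""
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)

# Hypothetical final answers from ten independent chains on one puzzle.
chains = ["wrong-A"] * 5 + ["wrong-B"] * 3 + ["correct"] * 2
winner, share = self_consistency(chains)
print(winner, share)
```

The vote can only pick among answers the chains already produced. When most chains fail in the same way, the most common answer is a wrong one, and more samples do not change that.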

The reason is buried in how these models write. They generate left to right, one token at a time, committing as they go. A puzzle like this has a few good opening moves and a great many dead ends, and a model that picks a bad opening has no machinery to notice, let alone reverse it. It simply writes out a fluent, assured, wrong solution and stops. The search and the scoring are exactly the two things a single pass cannot do: hold several possibilities open at once, and judge a half-finished answer before finishing it.

Decomposition and grounding

The 70.7 figure comes from Chain-of-Action, published at ICLR 2025 by Zhenyu Pan, Haozheng Luo, Manling Li and Han Liu at Northwestern. It goes after two weaknesses that the puzzle work does not address. The first is that models invent things, answering from a hazy memory rather than the facts. The second is that they reason poorly over questions that have to be stitched together from several pieces of information.

The method takes a complicated question apart into a chain of smaller reasoning steps, and then, instead of treating retrieval as one lookup at the start, it lets each step reach for evidence as it needs it. A step can query the live web for current text, pull domain knowledge in as vectors, or run a query against tabular data. Each of those reaches does three things in turn. It retrieves what is relevant, it checks the step's tentative answer against what came back, and it notices when something is still missing and goes again. The checking is not a gesture. The paper builds a faith score that compares the model's guess at each step against the retrieved evidence, and when the guess and the evidence disagree, the guess is thrown out and rebuilt from the source.
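The retrieve, check and go-again loop can be sketched as below. This is a toy under stated assumptions and not the paper's implementation. The function `support_score` is a crude word-overlap measure standing in for the paper's faith score, whose actual definition is not reproduced here. The names `run_chain`, `StepResult`, `guess`, `rebuild` and the dictionary-backed retrievers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepResult:
    question: str
    answer: str
    corrected: bool

def support_score(answer: str, evidence: List[str]) -> float:
    """Toy stand-in for a faithfulness check (assumption): the fraction
    of the answer's words that also appear in the retrieved passages."""
    words = set(answer.lower().split())
    if not words or not evidence:
        return 0.0
    seen = set(" ".join(evidence).lower().split())
    return len(words & seen) / len(words)

def run_chain(questions: List[str],
              guess: Callable[[str], str],
              retrievers: List[Callable[[str], List[str]]],
              rebuild: Callable[[str, List[str]], str],
              threshold: float = 0.5) -> List[StepResult]:
    results = []
    for question in questions:
        answer, corrected = guess(question), False
        evidence: List[str] = []
        for retrieve in retrievers:        # try the next source if one comes back empty
            evidence = retrieve(question)
            if evidence:
                break
        # If the guess disagrees with the evidence, rebuild it from the source.
        if evidence and support_score(answer, evidence) < threshold:
            answer, corrected = rebuild(question, evidence), True
        results.append(StepResult(question, answer, corrected))
    return results

# Hypothetical sources and model stubs, for illustration only.
web = {}                                   # this source has nothing for the question
table = {"venue?": ["Chain-of-Action appeared at ICLR 2025"]}
demo = run_chain(
    questions=["venue?"],
    guess=lambda q: "NeurIPS 2023",        # a confident, wrong first guess
    retrievers=[lambda q: web.get(q, []), lambda q: table.get(q, [])],
    rebuild=lambda q, ev: ev[0],
)
print(demo[0])
```

The design point is that verification happens per step, so a wrong intermediate answer is replaced before later steps build on it.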

What gives the result its weight is the consistency. The structured method did not just win the headline matchup of 70.7 to 42.5. It beat plain step-by-step reasoning on every benchmark in the table, often by twenty points or more, across fact-checking, strategy questions and long-form answers alike. The authors even ran it against a popular agent design on complex financial questions, a domain they chose precisely because the answers there depend on many current, easily faked facts. A single win can be luck. A clean sweep is a structural advantage showing through.

Why a stronger model does not close the gap

The natural assumption is that all of this is temporary, that the next and larger model will handle the hard question in one breath and make the scaffolding pointless. The evidence points the other way, and the reason is that the failures being repaired are not failures of intelligence. They are failures of procedure, and a bigger mind run through the same bad procedure inherits the same problems.

Look at what happens along a chain of dependent steps. Each step is right with some probability, and if the steps fail independently, the chance that the whole chain holds is the product of those probabilities. Ten steps that are each ninety percent reliable do not multiply out to a ninety percent answer. They multiply out to about thirty-five percent. A single sweeping reply hides this completely, because it never shows the steps, so a thesis that quietly broke at the third link still arrives looking whole and sounding sure. A structured process refuses to let the error pile up unseen. It pauses between steps to check, retrieve and score, and that pausing is the only thing standing between the model and a compounding it cannot feel.
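The arithmetic is easy to verify. The sketch below assumes the steps fail independently. The second function goes one step beyond the text and is purely illustrative: it assumes a check that catches a wrong step with some probability, never rejects a right one, and allows one redo. The 0.8 catch rate is an invented number, not a figure from either paper.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that every step in an unchecked chain is right."""
    return p_step ** n_steps

def checked_step(p_step: float, p_catch: float) -> float:
    """Probability a step ends up right when a check catches a wrong
    answer with probability p_catch and the step is redone once."""
    return p_step + (1 - p_step) * p_catch * p_step

unchecked = chain_success(0.9, 10)                   # about 0.349
checked = chain_success(checked_step(0.9, 0.8), 10)  # each step rises to about 0.972
print(round(unchecked, 3), round(checked, 3))
```

Even an imperfect check between steps raises the per-step reliability, and because the gain is compounded over ten steps, the chain as a whole improves far more than any single step does.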

There is also the plain matter of how the words come out. Writing left to right and choosing greedily is a fine way to cross a space with no traps and a poor way to cross one full of them, and hard reasoning is full of them. The lines that look best at the first step often die at the third, and the only way through is to keep a few open, weigh them, and walk back from the ones that fail. That is what the tree search adds, and what the retrieve-and-correct loop adds, and what no amount of extra model size adds on its own. Scale sharpens the single greedy pass. It does not turn the single pass into a search.

Underneath both lives the same quiet asymmetry that makes the whole thing work. Judging whether a half-finished line is going anywhere is far easier than producing the finished answer, and a system that can make that cheap judgment can discard its bad attempts before they cost anything. A one-shot prompt never reaches that fork in the road. It picks a direction at once and spends all of its effort making that direction sound convincing.

Where the architecture pays off

The tasks in these studies, planning that needs several moves and questions that have to be assembled from many facts, describe a very large class of real problems, and equity research sits right inside it. Understanding a company is long, it is layered, and a fluent error buried in the middle of it is both expensive and invisible. A view on a business is built from many smaller views: what it sells, whether the revenue lasts, what the balance sheet can take, what the price already assumes. Each of those is a place where a single confident pass will sometimes be confidently wrong. The same properties that let a tree search beat step-by-step reasoning on a number puzzle are what let a structured research workflow beat a single prompt on a stock: the ordered decomposition, the checking and retrieving at each step, the freedom to revise rather than commit. The advantage lives in the architecture, which is why aiming a newer and bigger model at a one-shot question tends to disappoint. It produces more convincing output, not more reliable analysis.

What the evidence does not claim

One line is worth drawing in bright ink. These are reasoning and question-answering benchmarks, and what they show is the quality of the process: higher accuracy, fewer inventions, steadier handling of multi-step problems. They do not show that any workflow can predict a share price, and the jump from a more faithful answer to a market return is one the data does not pay for. The honest reading is the narrower and sturdier one. On long, complex reasoning a structured workflow beats a single prompt by a wide and repeatable margin, and the gain comes from search, decomposition, checking and grounding rather than from the model. Wherever a confident mistake is costly, that is the architecture the research stands behind.

Sources

Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan. Princeton and Google DeepMind, NeurIPS 2023. arXiv 2305.10601

Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models. Pan, Luo, Li, Liu. Northwestern University, ICLR 2025. arXiv 2403.17359
