Training large language models (LLMs) to master complex tasks, especially those requiring structured outputs like generating precise code or engaging in multi-step reasoning, is challenging even for current state-of-the-art (SOTA) models. Reinforcement Learning (RL) offers a powerful theoretical framework for teaching models to do "what works", but applying these techniques to LLMs has been messy in practice.
We’ve run into this problem at our startup, Levro. We want to be the easiest way to conduct international commerce (including high-yield deposits and low FX fees!). That requires tools for the places where business and banking intersect. For example, if you want to update basic information about who owns your business, having the correct documentation to satisfy banking regulations is important and nuanced. (To mention one case: while many consumers think of Wise as a bank, Wise statements rarely qualify as official bank statements for anti-money-laundering / KYC purposes.) A truly ‘personal’ banker should advise you on how to meet all these requirements so you don’t have unpleasant surprises later.
Since this domain is highly structured but also highly complex, we wanted to build agents to handle complex tasks like technical customer support and structured reasoning. Specifically, we wanted models that generated Python code to handle user queries, call tools and APIs, and process structured data to resolve a range of user inquiries. This is complex enough that the LLMs weren’t getting everything right out of the box, so we turned to RL.
To get RL to actually help, though, we had to solve a persistent problem: how do you give your model feedback that is specific enough to improve it without crushing the parts it already does well? The DeepSeek team ran into a version of this while building their models: if your model generates 8 imperfect responses (which often happens!) and your judge penalizes all of them for not being perfect, the model never improves, because every answer was flawed. As a practical example, suppose we ask the LLM to make the tool calls that generate 1099s for every vendor a business paid more than $600 USD during 2024, and its only mistake is ignoring non-USD payments. A per-token reward still credits the important parts the LLM got right, which in turn helps it converge on the “true” answer much more quickly.
Techniques Used in Reinforcement Learning
So how do models get better today?
- You generate multiple outputs for a given prompt.
- You use a reward model to score these outputs.
- You fine-tune the model (often using techniques like GRPO) to increase the likelihood of the model producing high-scoring outputs in the future.
This process involves two main “moving pieces”:
- A reward model that determines what constitutes “good” output.
- An RL loop that “nudges” your main model based on the feedback from the reward model.
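To make these two pieces concrete, here is a minimal sketch of the outer loop in Python, using hypothetical `generate`, `score`, and `grpo_update` helpers (the names are illustrative, not from any particular library):

```python
# A minimal sketch of the outer RL loop (hypothetical helpers, not a specific library).

def rl_training_step(policy, reward_model, prompt, group_size=8):
    # 1. Generate several candidate outputs for the same prompt.
    outputs = [policy.generate(prompt) for _ in range(group_size)]

    # 2. Score each output with the reward model.
    rewards = [reward_model.score(prompt, out) for out in outputs]

    # 3. Nudge the policy toward the higher-scoring outputs (e.g., with GRPO).
    policy.grpo_update(prompt, outputs, rewards)
```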
The Reward Model: Deciding What’s “Good”
The reward model is your system’s arbiter of quality. In many setups, including those for code tasks, it’s rule-based: for code, it might check whether the program ran successfully or the tests passed; for formatting, it would assess consistency and structure. Good outputs receive positive scores, while broken ones are penalized.
Levro’s reward model, for example, evaluates outputs based on syntax correctness, correct use of tools, and output quality and relevance.
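As an illustration of what a rule-based reward can look like (a toy sketch, not our production scorer), here is a scorer for generated Python code that checks syntax, tool usage, and a crude relevance proxy:

```python
import ast

def score_code_output(code: str, allowed_tools: set[str]) -> float:
    """Toy rule-based reward: syntax, tool usage, and a basic relevance proxy."""
    reward = 0.0

    # Syntax correctness: does the code even parse?
    try:
        tree = ast.parse(code)
        reward += 1.0
    except SyntaxError:
        return -1.0

    # Correct tool usage: only call functions we actually expose.
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    reward += 1.0 if called <= allowed_tools else -1.0

    # Output quality / relevance proxy: penalize empty or trivial programs.
    reward += 0.5 if len(code.strip()) > 20 else -0.5

    return reward
```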
The “Nudge”: Steering the Model
Once you have a reward model, the next step is to steer the LLM towards generating better outputs. This is where various RL techniques come into play:
Historically, researchers leveraged techniques like Proximal Policy Optimization (PPO) (and its parent, TRPO) to guide the model. PPO in particular uses a value model, a separate network that estimates the expected discounted reward for any partial trajectory. For example, a value model might learn that for the prompt “Fish love,” the partial trajectory “to swim” will lead to better final outcomes than “to fly.”
The Challenges with PPO’s Value Model: While promising, building a good value model is tricky. It needs to be learned during training, sometimes requiring a “bootstrap phase” where the main model’s training is paused. Additionally, the value model itself can be large and memory-intensive.
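For reference, the heart of PPO is a clipped surrogate objective that uses those value-model advantage estimates per token. A bare-bones PyTorch sketch:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Per-token PPO surrogate: clip the probability ratio so updates stay small.

    All arguments are 1-D tensors over the tokens of one trajectory; `advantages`
    would come from the value model (e.g., via GAE)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```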
Another popular technique is Direct Preference Optimization (DPO), which trains the model based on static pairs of good/bad responses and steers the model towards selecting the “good” response.
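A stripped-down sketch of the DPO loss, assuming you already have log-probabilities for the preferred (“chosen”) and dispreferred (“rejected”) responses under both the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: push the policy to prefer the chosen response, measured relative to the reference model."""
    chosen_margin = chosen_logps - ref_chosen_logps
    rejected_margin = rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```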
DeepSeek popularized Group Relative Policy Optimization (GRPO), which uses the reward model’s scores to steer the LLM. Unlike supervised fine-tuning, which simply imitates labeled examples, GRPO allows the model to learn directly from outcomes. This is especially critical for open-ended tasks where clear-cut labels are scarce.
Of course, there is a catch with GRPO. Traditionally, GRPO assigns a single reward to the entire output. Imagine your LLM generates 200 tokens of perfectly structured code, but with one tiny syntax error. Under this traditional approach, the entire 200-token output receives the same negative score. It’s analogous to submitting a meticulously written 50-page report with a single typo on page 5 and having the entire report “docked across the board,” rather than just fixing the typo.
On a more technical level, GRPO typically does not employ discounting: it estimates the expectation by averaging the rewards of a fixed number of sampled trajectories, so it has no mechanism for crediting or penalizing sub-trajectories individually.
It’s important to remember that the “signal” for the model in RL is an expected discounted reward. Because it’s computationally intractable to roll out all possible trajectories and compute their average rewards, algorithms in the REINFORCE family (which includes GRPO) use a “surrogate objective” that estimates this signal. This objective often involves maximizing a term related to the “advantage,” which is a proxy for the expected discounted reward.
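In GRPO’s case, that advantage estimate is simply each trajectory’s reward normalized against the rest of its sampled group, and the same scalar is applied to every token of the trajectory. A minimal sketch:

```python
import torch

def grpo_advantages(group_rewards, eps=1e-8):
    """One scalar reward per trajectory in the group -> one scalar advantage per trajectory."""
    rewards = torch.as_tensor(group_rewards, dtype=torch.float32)
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # In standard GRPO, advantages[i] is broadcast to every token of trajectory i.
    return advantages
```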
Process Supervision: A Good Idea That Didn’t Work (Effectively)
To address the granularity problem in single-reward systems like GRPO, researchers proposed the concept of process supervision. The idea was compelling: instead of just scoring the final output, score each step in the output so the model could learn which specific parts were correct.
In theory, this approach promised more granular feedback, allowing the model to improve specific weaknesses while preserving its strengths.
In practice, process supervision faced significant hurdles for LLMs:
- It was difficult to consistently define what a “step” meant within complex, varied outputs.
- Reward models struggled to consistently grade partial outputs.
- Perhaps most critically, models often learned to “game the reward function” without genuinely improving their underlying capabilities, a phenomenon known as “reward hacking.”
Ultimately, process supervision remained an intriguing concept that, in real-world scenarios, didn’t work effectively.
What Levro Built: Token-Level Rewards That Actually Work
Realizing the limitations of current approaches specifically in the context of making real-world agents for Levro, we developed a new technique focused on rewarding the parts of an output that worked while precisely targeting specific failures. We extended the idea of process supervision into a practical, implementable solution: token-level reward attribution.
This approach makes RL training significantly more effective for LLMs tackling complex tasks, by combining the memory efficiency and simplicity of GRPO with the token-sensitive rewards of PPO.
Here’s a breakdown of how token-level rewards work:
Detailed Scoring and Attribution for Each Metric
Instead of just one overall score, for every quality being evaluated (e.g., syntax correctness, correct tool usage, output quality and relevance), a “judge LLM” is asked to provide both a score and identify the specific parts of the output that most influenced that score. Crucially, each quality becomes its own separate reward system. This allows for a multi-faceted, independent evaluation.
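To show the shape of this feedback (the field names here are illustrative, not our production schema), each metric’s judge can be asked to return a score plus the token spans it attributes that score to:

```python
from dataclasses import dataclass

@dataclass
class Attribution:
    start_token: int   # first token of the span the judge points at
    end_token: int     # last token of the span (inclusive)
    weight: float      # how strongly this span drove the score

@dataclass
class JudgeVerdict:
    metric: str                      # e.g. "syntax", "tool_usage", "relevance"
    score: float                     # scalar reward for this metric
    attributions: list[Attribution]  # the spans that most influenced the score
```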
Creating “Token Heatmaps”
Once scores are given for each metric, they are distributed to the specific tokens the judge LLM attributed them to. Additionally, reduced credit is given, in a discounted fashion, to the tokens that preceded those attributed spans, since they helped set up the good or bad parts. This process creates a “heatmap” for each metric, showing which tokens were most important or influential for that particular quality.
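A minimal sketch of the heatmap construction, assuming the judge returns spans as `(start_token, end_token, weight)` tuples: attributed tokens receive full credit, and the tokens before each span receive exponentially decaying credit for setting it up.

```python
def build_heatmap(score, attributions, num_tokens, decay=0.9):
    """Spread one metric's score over tokens.

    `attributions` is a list of (start_token, end_token, weight) spans the judge
    pointed at; attributed tokens get full credit, preceding tokens get
    exponentially decaying credit."""
    heat = [0.0] * num_tokens
    for start, end, weight in attributions:
        credit = score * weight
        for t in range(start, end + 1):        # full credit on the span itself
            heat[t] += credit
        for t in range(start - 1, -1, -1):     # decayed credit going backwards
            credit *= decay
            heat[t] += credit
    return heat
```

For example, `build_heatmap(-1.0, [(37, 42, 1.0)], 200)` concentrates a negative syntax score around the offending tokens instead of spreading it evenly across all 200.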
Precise Advantage Calculation
For each individual reward metric, “advantages” are computed. An advantage is essentially how much each part of the output helped or hurt the overall score relative to expectations. These advantages are calculated independently for each reward function based on the total reward for that sequence.
This independent calculation is vital because it addresses a common problem: if an output performs very well on one metric but poorly on another, a single average reward might dilute the feedback. By treating each metric separately, highly influential good parts can receive very positive advantages, and highly influential bad parts can receive very negative advantages. Finally, for the overall token-level advantages, the advantages from all individual metrics are summed up (with weights if appropriate).
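Putting it together for one group of sampled outputs, here is a sketch of the combination step (the exact normalization shown here is an assumption for illustration): each metric’s heatmap is turned into group-relative advantages independently, and the metrics are then summed with optional weights.

```python
import torch

def combined_token_advantages(heatmaps, weights=None, eps=1e-8):
    """heatmaps: dict mapping metric name -> tensor of shape (group_size, num_tokens).
    weights: optional dict mapping metric name -> float.
    Each metric is normalized against the whole group independently,
    then the per-metric advantages are summed."""
    weights = weights or {}
    total = None
    for metric, heat in heatmaps.items():
        adv = (heat - heat.mean()) / (heat.std() + eps)  # group-relative, per metric
        adv = weights.get(metric, 1.0) * adv
        total = adv if total is None else total + adv
    return total  # shape (group_size, num_tokens)
```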
Targeted Training
These token-level advantages are then directly used in the model’s optimization or loss functions. This means that instead of applying the same learning adjustment to every token, the LLM receives precise feedback on exactly where it performed well or poorly. The model learns exactly which tokens drove good or bad performance across different qualities, leading to much more focused and effective training.
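The last step is standard policy-gradient machinery; the only difference is that each token carries its own advantage instead of a shared sequence-level one. A minimal REINFORCE-style sketch:

```python
import torch

def token_level_policy_loss(token_logprobs, token_advantages, mask):
    """All inputs have shape (group_size, num_tokens); `mask` is 1 for real tokens
    and 0 for padding. Advantages are treated as fixed targets (detached)."""
    per_token = -token_logprobs * token_advantages.detach() * mask
    return per_token.sum() / mask.sum()
```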
Results
What were we able to accomplish with this new technique? Once we implemented token-level rewards, we were able to:
- Improve training speed by 25%.
- Increase evaluation reward on code generation benchmarks.
- Reduce “gaming” of the reward function.
- Improve the model’s ability to retain good structure while simultaneously fixing errors.
Most important, we were able to train the model to handle the real-world questions for our fintech agent.
That’s why we wanted to share this technique with others tuning LLMs for complex, high-stakes applications such as code generation, agentic workflows (where the model interacts with tools or environments), and structured reasoning tasks. For these use cases, moving to token-level reward attribution is a critical step.
By leveraging per-token rewards, you can speed up learning, ensure that good partial outputs are preserved, and make RL training practical even for smaller, more specialized models without requiring additional compute resources.
Conclusion
In essence, token-level reward attribution has transformed process supervision from an interesting concept into a practical tool. It enables LLMs to learn with unprecedented precision, leading to more efficient training, higher quality outputs, and models that genuinely understand why their outputs are good or bad, token by token. This approach can save compute, improve training efficiency, and boost output quality without the frustration of seeing your model regress on what it already does well.