Supercharging LLM Training: The Granular Revolution of Token-Level Rewards

2025-07-29 • Abeer Agrawal

Training large language models (LLMs) to master complex tasks, especially those requiring structured outputs like generating precise code or engaging in multi-step reasoning, is challenging even for current state-of-the-art (SOTA) models. While Reinforcement Learning (RL) offers a powerful theoretical framework for teaching models to do "what works", applying these techniques to LLMs has been far messier in practice.

At Levro, we wanted to build agents to handle complex, domain-specific tasks like technical customer support and structured reasoning. This meant we wanted models that generated Python code to handle complex user queries, call tools and APIs, and process structured data to resolve a range of user inquiries. To get RL to actually help rather than slow us down, we had to solve a persistent problem: how do you give your model feedback that is specific enough to improve it without crushing the parts it already does well?

What is Reinforcement Learning?

Causal LLMs (the popular type of open-weight LLM) generate their outputs one token (a word or sub-word piece) at a time. The final output is often called a "completion" in LLM APIs or a "trajectory" (or "rollout" or "episode") in RL. In RL, the model's fundamental goal is to maximize an expected discounted reward, which serves as its signal for learning.
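
For reference, that objective can be written as maximizing the expected discounted reward over generated trajectories; the notation below is generic RL shorthand rather than anything specific to a given paper or to Levro's setup:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]
```

Here π_θ is the model being trained (the policy), τ is a sampled completion, r_t is the reward at step t, and γ ∈ (0, 1] is the discount factor that down-weights rewards earned later in the trajectory.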

So how does RL work with LLMs today?

If you've ventured into applying RL for LLMs, the typical workflow follows a clear pattern:

  • You generate multiple outputs for a given prompt.
  • You use a reward model to score these outputs.
  • You fine-tune the model (often using techniques like GRPO) to increase the likelihood of the model producing high-scoring outputs in the future.

This process involves two main "moving pieces":

  1. A reward model that determines what constitutes "good" output.
  2. An RL loop that "nudges" your main model based on the feedback from the reward model.

Let's break down these critical components:

The Reward Model: Deciding What's "Good"

The reward model is your system's arbiter of quality. In many setups, including those for code tasks, it's rule-based: for code, it might check whether the code ran successfully or whether tests passed; for formatting, it would assess consistency and structure. Good outputs receive positive scores, while broken ones are penalized.

Levro's reward model, for example, evaluates outputs based on syntax correctness, correct use of tools, and output quality and relevance.
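
For illustration only (this is not Levro's actual scoring code), a minimal rule-based reward for generated Python might look like the following sketch, where the specific checks and weights are assumptions:

```python
import ast

def rule_based_reward(code: str, tests_passed: bool) -> float:
    """Toy rule-based reward for generated Python.

    Hypothetical sketch: the checks and weights are illustrative.
    """
    # Syntax correctness: code that doesn't parse is penalized outright.
    try:
        ast.parse(code)
    except SyntaxError:
        return -1.0

    reward = 0.5  # the code at least parses

    # Functional correctness: did the caller's tests pass?
    if tests_passed:
        reward += 0.5

    # Formatting: a crude structural proxy (e.g. penalize very long lines).
    if any(len(line) > 120 for line in code.splitlines()):
        reward -= 0.1

    return reward

# Example: a completion that parses and passes its tests scores 1.0.
print(rule_based_reward("def add(a, b):\n    return a + b\n", tests_passed=True))
```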

The "Nudge": Steering the Model

Once you have a reward model, the next step is to steer the LLM towards generating better outputs. This is where various RL techniques come into play:

Historically, researchers leveraged techniques like Proximal Policy Optimization (PPO) and its parent, Trust Region Policy Optimization (TRPO), to guide the model. PPO in particular uses a value model – a separate model that estimates the expected discounted reward for any partial trajectory. For example, a value model might learn that for the prompt "Fish love," the partial trajectory "to swim" will lead to better final outcomes than "to fly".

The Challenges with PPO's Value Model: While promising in theory, a good value model is tricky to build. It needs to be learned during training, sometimes requiring a "bootstrap phase" where the main model's training is paused. Additionally, the value model itself can be large and memory-intensive.

Another popular technique is Direct Preference Optimization (DPO), which trains the model based on static pairs of good/bad responses and steers the model towards selecting the “good” response.

DeepSeek popularized Group Relative Policy Optimization (GRPO), which uses the reward model's scores to steer the LLM. Unlike supervised fine-tuning, which simply imitates labeled examples, GRPO allows the model to learn directly from outcomes. This is especially critical for open-ended tasks where clear-cut labels are scarce.

Of course, there is a catch with GRPO. Traditionally, GRPO assigns a single reward to the entire output. Imagine your LLM generates 200 tokens of perfectly structured code, but with one tiny syntax error. Under this traditional approach, the entire 200-token output receives the same negative score. It's analogous to submitting a meticulously written 50-page report with a single typo on page 5 and having the entire report "docked across the board," rather than just fixing the typo.

On a more technical level, GRPO typically does not employ discounting and estimates the expectation by averaging rewards over a fixed number of sampled trajectories, which means it cannot credit or penalize sub-trajectories individually.

It's important to remember that the "signal" for the model in RL is an expected discounted reward. Because it's computationally intractable to roll out all possible trajectories and compute their average rewards, algorithms in the REINFORCE family (which includes GRPO) use a "surrogate objective" that estimates this signal. This objective often involves maximizing a term related to the "advantage," which is a proxy for the expected discounted reward.
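
To make the single-reward limitation concrete, here is a minimal sketch of how GRPO-style advantages are typically computed for a group of completions sampled from the same prompt; the normalization details vary across implementations and the numbers are illustrative:

```python
import numpy as np

def grpo_advantages(group_rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: compare each completion's single reward to
    the mean of its group, optionally scaling by the group's std deviation."""
    rewards = np.asarray(group_rewards, dtype=np.float64)
    advantages = rewards - rewards.mean()
    std = rewards.std()
    return advantages / std if std > 1e-8 else advantages

# Four completions sampled for one prompt, each scored once by the reward model.
advs = grpo_advantages([1.0, 0.2, -0.5, 0.8])

# The catch: every token in a completion inherits that completion's single
# advantage, so a 200-token output with one syntax error is pushed down
# uniformly across all 200 tokens.
print(advs)
```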

Process Supervision: A Good Idea That Didn’t Work (Effectively)

To address the granularity problem in single-reward systems like GRPO, researchers proposed the concept of process supervision. The idea was compelling: instead of just scoring the final output, score each step in the output so the model could learn which specific parts were correct.

In theory, this approach promised more granular feedback, allowing the model to improve specific weaknesses while preserving its strengths.

In practice, process supervision faced significant hurdles for LLMs:

  • It was difficult to consistently define what a "step" meant within complex, varied outputs.
  • Reward models struggled to consistently grade partial outputs.
  • Perhaps most critically, models often learned to "game the reward function" without genuinely improving their underlying capabilities, a phenomenon known as "reward hacking".

Ultimately, process supervision remained an intriguing concept that, in real-world scenarios, didn't work effectively.

What Levro Built: Token-Level Rewards That Actually Work

Having run into these limitations while building real-world agents at Levro, we developed a new technique focused on rewarding the parts of an output that worked while precisely targeting specific failures. We extended the idea of process supervision into a practical, implementable solution: token-level reward attribution.

This approach makes RL training significantly more effective for LLMs tackling complex tasks, by combining the memory efficiency and simplicity of GRPO with the token-sensitive rewards of PPO.

Here’s a breakdown of how token-level rewards work:

Detailed Scoring and Attribution for Each Metric

Instead of just one overall score, for every quality being evaluated (e.g., syntax correctness, correct tool usage, output quality and relevance), a "judge LLM" is asked to provide both a score and identify the specific parts of the output that most influenced that score. Crucially, each quality becomes its own separate reward system. This allows for a multi-faceted, independent evaluation.
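
As a concrete illustration, the judge's per-metric output could be represented along the following lines; the field names, score range, and span format are hypothetical, not Levro's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MetricJudgment:
    """One metric's verdict from the judge LLM (illustrative schema)."""
    metric: str                              # e.g. "syntax", "tool_use", "relevance"
    score: float                             # e.g. in [-1.0, 1.0]
    attributed_spans: list[tuple[int, int]]  # (start_token, end_token) ranges the
                                             # judge says most influenced the score

# Example: a syntax error localized to tokens 143-151 is penalized, while
# correct tool usage around tokens 40-65 is credited independently.
judgments = [
    MetricJudgment("syntax", score=-0.8, attributed_spans=[(143, 151)]),
    MetricJudgment("tool_use", score=0.9, attributed_spans=[(40, 65)]),
    MetricJudgment("relevance", score=0.6, attributed_spans=[(0, 30), (160, 200)]),
]
```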

Creating "Token Heatmaps"

Once scores are given for each metric, they are distributed to the specific tokens the judge LLM attributed them to. Discounted, partial credit is also given to the tokens that preceded those spans, since they helped set up the good or bad parts. This process creates a "heatmap" for each metric, visually showing which tokens were most important or influential for that particular quality.
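
Here is a minimal sketch of how such a heatmap could be built from a single metric's judgment, with geometrically decaying credit flowing back to earlier tokens; the decay rate and the exact spreading rule are illustrative choices rather than prescribed values:

```python
import numpy as np

def build_heatmap(num_tokens: int,
                  score: float,
                  attributed_spans: list[tuple[int, int]],
                  decay: float = 0.9) -> np.ndarray:
    """Distribute one metric's score onto its attributed tokens, with
    geometrically decaying credit for the tokens that preceded each span."""
    heatmap = np.zeros(num_tokens)
    for start, end in attributed_spans:
        # Full credit or blame on the tokens the judge pointed at.
        heatmap[start:end] += score
        # Discounted credit for earlier tokens that set this part up.
        for t in range(start - 1, -1, -1):
            heatmap[t] += score * decay ** (start - t)
    return heatmap

# Example: a syntax penalty attributed to tokens 143-151 of a 200-token output.
syntax_heatmap = build_heatmap(200, score=-0.8, attributed_spans=[(143, 151)])
```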

Precise Advantage Calculation

For each individual reward metric, "advantages" are computed. An advantage is essentially how much each part of the output helped or hurt the overall score relative to expectations. These advantages are calculated independently for each reward function based on the total reward for that sequence.

This independent calculation is vital because it addresses a common problem: if an output performs very well on one metric but poorly on another, a single average reward might dilute the feedback. By treating each metric separately, highly influential good parts can receive very positive advantages, and highly influential bad parts can receive very negative advantages. Finally, for the overall token-level advantages, the advantages from all individual metrics are summed up (with weights if appropriate).
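
Roughly, that combination step could look like the following sketch; normalizing each metric within the sequence (rather than across a group of completions) and the specific weights are assumptions made for illustration:

```python
import numpy as np

def combine_metric_advantages(heatmaps: dict[str, np.ndarray],
                              weights: dict[str, float]) -> np.ndarray:
    """Turn each metric's token-level rewards into advantages on its own scale,
    then take a weighted sum to get the overall token-level advantages."""
    total = None
    for metric, heatmap in heatmaps.items():
        # Independent normalization keeps one strong metric from diluting
        # another: a sharp syntax penalty and a solid relevance bonus both survive.
        centered = heatmap - heatmap.mean()
        std = heatmap.std()
        advantage = centered / std if std > 1e-8 else centered
        weighted = weights.get(metric, 1.0) * advantage
        total = weighted if total is None else total + weighted
    return total

# Two toy heatmaps over a 200-token completion: a localized syntax penalty
# and a localized tool-use bonus, combined with unequal weights.
heatmaps = {
    "syntax": np.concatenate([np.zeros(140), -0.8 * np.ones(10), np.zeros(50)]),
    "tool_use": np.concatenate([np.zeros(40), 0.9 * np.ones(25), np.zeros(135)]),
}
token_advantages = combine_metric_advantages(heatmaps, {"syntax": 1.0, "tool_use": 0.5})
```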

Targeted Training

These token-level advantages are then directly used in the model's optimization or loss functions. This means that instead of applying the same learning adjustment to every token, the LLM receives precise feedback on exactly where it performed well or poorly. The model learns exactly which tokens drove good or bad performance across different qualities, leading to much more focused and effective training.
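
For completeness, here is one simplified way token-level advantages can enter a REINFORCE/GRPO-style surrogate loss, with each token's log-probability weighted by its own advantage; clipping, KL penalties, and batching are omitted for brevity:

```python
import torch

def token_level_policy_loss(token_logprobs: torch.Tensor,
                            token_advantages: torch.Tensor) -> torch.Tensor:
    """Scale each generated token's log-probability by *its own* advantage.

    With a single sequence-level reward, `token_advantages` would just be one
    value broadcast over all tokens; token-level attribution replaces that
    constant with a per-token signal.

    token_logprobs:   (seq_len,) log-probabilities of the generated tokens
    token_advantages: (seq_len,) combined per-token advantages
    """
    # Negative because optimizers minimize; detach the advantages so gradients
    # flow only through the policy's log-probabilities.
    return -(token_logprobs * token_advantages.detach()).mean()

# Toy usage: five generated tokens; the fourth carries a strong penalty, so the
# update pushes hardest against that token while leaving the rest mostly alone.
logprobs = torch.log(torch.tensor([0.6, 0.5, 0.7, 0.4, 0.8], requires_grad=True))
advantages = torch.tensor([0.2, 0.2, 0.2, -1.5, 0.2])
loss = token_level_policy_loss(logprobs, advantages)
loss.backward()
```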

Results

What were we able to accomplish with this new technique? Once we implemented token-level rewards, we were able to:

  • Improve training speed by 25%.
  • Increase evaluation reward on code generation benchmarks.
  • Reduce "gaming" of the reward function.
  • Improve the model's ability to retain good structure while simultaneously fixing errors.

Most importantly, we were able to train the model to handle real-world questions for our fintech agent.

As a result, we wanted to share this technique: for anyone tuning LLMs for complex, high-stakes applications such as code generation, agentic workflows (where the model interacts with tools or environments), or structured reasoning tasks, moving to token-level reward attribution is a critical step.

By leveraging per-token rewards, you can speed up learning, ensure that good partial outputs are preserved, and make RL training practical even for smaller, more specialized models without requiring additional compute resources.

Conclusion

In essence, token-level reward attribution has transformed process supervision from an interesting concept into a practical tool. It enables LLMs to learn with unprecedented precision, leading to more efficient training, higher quality outputs, and models that genuinely understand why their outputs are good or bad, token by token. This approach can save compute, improve training efficiency, and boost output quality without the frustration of seeing your model regress on what it already does well.