Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned RL

ICLR 2026
Aravind Venugopal¹, Jiayu Chen², Xudong Wu², Chongyi Zheng³, Benjamin Eysenbach³, Jeff Schneider¹
¹ Carnegie Mellon University  ·  ² The University of Hong Kong  ·  ³ Princeton University
How do we learn reward functions for credit assignment that accurately capture long-term temporal dependencies and goal-reaching information present in offline data?

ORS learns the occupancy measure (the distribution over future states) and distills its goal-reaching information into a reward function, so that states whose future states lie closer to the goal receive higher rewards.

Overview
Mean success rate bar chart

Average relative success rate (task-normalized) on 7 challenging OGBench tasks. Higher is better; 100% matches the best algorithm per task. ORS performs best, improving over the best baseline by 31%.

  • Offline goal-conditioned RL with sparse rewards suffers from a credit assignment problem. This triggers a cascade: sparse rewards → noisy value functions → bad policies → ineffective GCRL. Prior reward shaping methods either require per-task manual reward design or rely on graph-building methods that don't scale with task complexity.
  • We hypothesize that generative world models — specifically, flow-matching models of the occupancy measure — implicitly encode the temporal geometry of the state space, capturing which future states lead toward a goal.
  • We formalize how such models encode temporal geometry and propose Occupancy Reward Shaping (ORS): a method that extracts goal-reaching information from the occupancy model as a reward-shaping term \(r^W(s,a,g)\) while preserving the optimal policy.
  • On 13 sparse-reward robotics tasks (OGBench), ORS achieves 23%–217% improvements over competitive baselines.
Method
Learn flow-matching occupancy model »» Extract occupancy-shaped rewards »» Use for downstream GCRL
  1. Learning the Occupancy Model
    • The discounted occupancy measure \(d^\pi(s^+ \mid s, a)\) aggregates future states reachable from \((s,a)\) under policy \(\pi\), weighted by temporal distance: \[ d^\pi(s^+ \mid s, a) = (1 - \gamma) \sum_{\Delta t=1}^{\infty} \gamma^{\Delta t - 1} \mathbb{P}(s_{t + \Delta t} = s^+ \mid s, a, \pi) \]
    • Analogous to SARSA bootstrapping \(Q(s,a) = r(s,a,s') + \gamma Q(s',a')\), we bootstrap occupancy learning over offline data \(\mathcal{D}\): \[ d_\theta^{\pi_\mathcal{D}}(s^+ \mid s, a) = (1-\gamma)\, p(s^+ \mid s, a) + \gamma\, d_{\theta^-}^{\pi_\mathcal{D}}(s^+ \mid s', a'); \quad \forall (s, a, s', a') \in \mathcal{D} \]
    • We parameterize the occupancy model using flow matching [Farebrother et al., 2025], learning a velocity field \(v_\theta(t, s, a, x_t)\) with loss \(\mathcal{L}_{\text{flow}}(\theta) = (1-\gamma)\mathcal{L}_{\text{next}} + \gamma \mathcal{L}_{\text{future}}\), where \(\mathcal{L}_{\text{next}}\) matches the immediate next state and \(\mathcal{L}_{\text{future}}\) bootstraps long-horizon occupancy with a target network.
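As a toy illustration (not the paper's implementation), the Monte-Carlo view of the discounted occupancy measure and the flow-matching regression target can be sketched in numpy; the 1-D chain trajectory, the discount value, and the untrained zero velocity field below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9  # illustrative discount

# --- Monte-Carlo view of the discounted occupancy measure ---
# Toy 1-D chain trajectory that always moves right: s_{t+1} = s_t + 1.
traj = list(range(20))

def empirical_occupancy(traj, t, gamma):
    """Estimate d(s+ | s_t): future states weighted by (1-gamma) * gamma^(dt-1)."""
    occ = {}
    for dt, s_plus in enumerate(traj[t + 1:], start=1):
        occ[s_plus] = occ.get(s_plus, 0.0) + (1 - gamma) * gamma ** (dt - 1)
    return occ

occ = empirical_occupancy(traj, t=0, gamma=gamma)
# Near futures dominate: occ[1] = (1 - gamma) = 0.1, with geometric decay after.

# --- Flow-matching regression target for one (x0, x1, t) triple ---
# Along the linear path x_t = (1-t) x0 + t x1, the conditional target
# velocity is constant, x1 - x0; the loss regresses v_theta onto it.
x0 = rng.standard_normal(4)
x1 = np.ones(4)                      # stand-in sample of a "future state"
t = rng.uniform()
x_t = (1 - t) * x0 + t * x1
v_pred = np.zeros(4)                 # an untrained velocity field predicts ~0
flow_loss = float(np.sum((v_pred - (x1 - x0)) ** 2))
```

Note how the geometric weights mirror the \((1-\gamma)\mathcal{L}_{\text{next}} + \gamma\mathcal{L}_{\text{future}}\) split: mass \(1-\gamma\) on the immediate next state, the rest bootstrapped over longer horizons.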
  2. Extracting Goal-Reaching Information (Without Solving an ODE)
    • States closer to the goal \(g\) occupy inner level-set layers around \(g\), while distant states lie in outer layers. The \(W_2^2\) distance between \(\delta_g\) and the occupancy measure \(d^{\pi_\mathcal{D}}(s^+ \mid s, a)\) grows monotonically with distance to the goal, making it a natural dense credit signal.
    • The flow-matching loss provides a tractable upper bound on \(W_2^2\) [Haviv et al., 2025; Lv et al., 2025; Park et al., 2025], computable without ODE integration: \[ W_2^2(\delta_g, d^{\pi_\mathcal{D}}(s^+ \mid s, a)) \leq \mathbb{E}_{\substack{x_1 = g,\, x_0 \sim \mathcal{N}(0, I) \\ t \sim \mathcal{U}(0,1)}} \left[\| v_\theta(t, s, a, x_t) - (x_1 - x_0) \|_2^2 \right] \]
    • The ORS reward \(r^W(s,a,g)\) is learned by regressing onto the negated flow-matching loss: \[ \mathcal{L}_{\text{rew}}(\psi) = \mathbb{E}_{s,a,g \sim \mathcal{D}} \left( r^W_\psi(s,a,g) + \mathbb{E}_{\substack{x_1=g,\, x_0 \sim \mathcal{N}(0,I) \\ t \sim \mathcal{U}([0,1])}} \left[ \|v_\theta(t,s,a,x_t) - (x_1-x_0)\|_2^2 \right] \right)^2 \]
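As a sanity check on this construction, the Monte-Carlo regression target can be evaluated directly for a toy velocity field. This is a sketch, not the paper's implementation: the constant field `v_theta` and the "predicted future state" `mu = s + 1` are invented stand-ins for a trained flow model, chosen so the bound has a closed form (\(\|\mu - g\|^2 + d\)):

```python
import numpy as np

def flow_loss_to_goal(v_theta, s, a, g, n_samples=4096, seed=0):
    """Monte-Carlo estimate of the W2^2 upper bound for goal g:
    E_{x0 ~ N(0,I), t ~ U(0,1)} || v_theta(t,s,a,x_t) - (g - x0) ||^2,
    with x1 fixed to g and x_t = (1-t) x0 + t g on the linear path."""
    rng = np.random.default_rng(seed)
    d = g.shape[0]
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.standard_normal(d)
        t = rng.uniform()
        x_t = (1.0 - t) * x0 + t * g
        total += np.sum((v_theta(t, s, a, x_t) - (g - x0)) ** 2)
    return total / n_samples

def v_theta(t, s, a, x_t):
    """Hypothetical stand-in for a trained velocity field: a crude constant
    field predicting the toy 'expected future state' mu(s) = s + 1."""
    return s + 1.0

s = np.zeros(2)
g_near, g_far = np.ones(2), 5.0 * np.ones(2)
loss_near = flow_loss_to_goal(v_theta, s, None, g_near)   # ~ 0 + d = 2
loss_far = flow_loss_to_goal(v_theta, s, None, g_far)     # ~ 32 + d = 34
# The reward is the negative of this bound: closer goals score higher.
r_near, r_far = -loss_near, -loss_far
```

The ordering `r_near > r_far` is the monotonicity property the method exploits: goals nearer the model's predicted futures receive larger (less negative) shaped rewards.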
  3. Use for Downstream GCRL: The ORS rewards can be used to train policies for any downstream goal-reaching task, independent of the choice of RL algorithm.
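To make the "algorithm-agnostic" claim concrete, here is a minimal sketch of plugging a dense shaped reward into off-the-shelf tabular Q-learning on a 1-D chain. The reward `-|s - g|` is a stand-in with the same qualitative shape as \(r^W\) (it decays with distance to the goal); the real ORS reward comes from the learned occupancy model:

```python
import numpy as np

N, GOAL = 10, 9          # toy 1-D chain of 10 states, goal at the right end
ACTIONS = [-1, +1]

def shaped_reward(s, g):
    return -abs(s - g)   # stand-in dense reward, decays with distance to goal

def q_learning(reward_fn, episodes=500, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Vanilla epsilon-greedy Q-learning with the shaped reward plugged in."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N, len(ACTIONS)))
    for _ in range(episodes):
        s = 0
        for _ in range(50):
            a = int(rng.integers(2)) if rng.uniform() < eps else int(np.argmax(Q[s]))
            s2 = int(np.clip(s + ACTIONS[a], 0, N - 1))
            r = reward_fn(s2, GOAL)
            done = s2 == GOAL
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if done:
                break
    return Q

Q = q_learning(shaped_reward)

# Greedy rollout from the start state under the learned Q.
s, steps = 0, 0
while s != GOAL and steps < 20:
    s = int(np.clip(s + ACTIONS[int(np.argmax(Q[s]))], 0, N - 1))
    steps += 1
reached = (s == GOAL)
```

Swapping `q_learning` for any other GCRL learner (IQL, CRL, etc.) changes nothing about how the reward is consumed, which is the point of step 3.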
Evaluation
Environment renderings: antmaze · cube · scene · puzzle

We evaluate ORS on 13 sparse-reward tasks of varying dataset quality and task complexity from 4 categories of OGBench datasets — antmaze, cube, scene and puzzle.

ORS vs baselines: overall average (binary) success rate (%) across 5 test-time goals over 8 seeds per task. Bold = best within 95% bootstrapped CI.
| Dataset | GCBC | GC-IVL | QRL | CRL | GC-IQL | HIQL | SAW | SMORE | n-step GCIQL | OTA | Go-Fresh | ORS (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-large-navigate | 25±3 | 18±3 | 64±18 | 90±4 | 34±4 | 91±2 | 86±5 | 22±5 | 53±9 | 90±4 | 88±3 | 88±7 |
| antmaze-giant-navigate | 0±0 | 0±0 | 9±4 | 39±8 | 0±0 | 72±7 | 48±9 | 1±1 | 1±1 | 26±5 | 30±10 | 56±9 |
| cube-double-play | 1±1 | 36±3 | 1±0 | 10±2 | 40±5 | 6±2 | 40±7 | 2±2 | 4±3 | 3±2 | 17±6 | 45±7 |
| cube-triple-play | 0±0 | 1±1 | 0±0 | 6±3 | 7±3 | 3±2 | 6±6 | 0±0 | 1±1 | 2±2 | 18±5 | 37±8 |
| puzzle-4x4-play | 0±0 | 13±2 | 0±0 | 0±0 | 26±3 | 7±2 | 17±12 | 0±0 | 46±5 | 85±4 | 74±6 | 70±5 |
| puzzle-4x5-play | 0±0 | 7±1 | 0±0 | 1±0 | 14±1 | 8±4 | 8±4 | 0±0 | 5±2 | 19±1 | 20±1 | 20±0 |
| puzzle-4x6-play | 0±0 | 10±2 | 0±0 | 4±1 | 12±1 | 3±1 | 8±4 | 0±0 | 14±3 | 15±3 | 17±4 | 20±2 |
| scene-play | 5±1 | 42±4 | 5±1 | 19±2 | 51±4 | 38±3 | 63±6 | 8±2 | 26±7 | 42±7 | 56±10 | 80±4 |
| antmaze-large-explore | 0±0 | 8±6 | 0±0 | 0±0 | 1±1 | 0±0 | 15±8 | 0±0 | 0±0 | 0±0 | 38±10 | 22±7 |
| cube-triple-noisy | 1±1 | 9±1 | 1±0 | 3±1 | 2±1 | 2±1 | 0±0 | 1±1 | 2±1 | 2±1 | 5±4 | 22±7 |
| puzzle-4x4-noisy | 0±0 | 20±3 | 0±0 | 0±0 | 29±7 | 3±3 | 3±3 | 0±0 | 0±0 | 0±0 | 50±5 | 56±7 |
| puzzle-4x6-noisy | 0±0 | 17±2 | 0±0 | 6±3 | 18±2 | 2±1 | 8±6 | 0±0 | 12±6 | 15±3 | 19±4 | 19±1 |
| scene-noisy | 1±1 | 26±5 | 9±1 | 1±1 | 26±2 | 25±4 | 33±6 | 3±2 | 2±2 | 3±2 | 34±5 | 40±5 |
| Mean | 2.5 | 15.9 | 6.9 | 13.7 | 20.0 | 20.0 | 25.8 | 2.8 | 12.7 | 23.2 | 35.8 | 44.2 |
Does ORS improve value function learning?
Value non-monotonicity vs noise level
Value function estimates over trajectories
ORS reward scatter over state space

Left: ORS yields lower average value non-monotonicity \(\delta_V\) at low noise levels \((\sigma_v)\) over expert trajectories, compared to sparse rewards or using \(\hat{V}(s,g) = r^W(s,g)\) directly. Center: ORS induces less noisy estimates of \(\hat{V}(s,g)\) over expert trajectories, even for long horizons. Right: ORS rewards over 5000 state-action pairs for a single fixed goal decay smoothly in magnitude with temporal distance from the goal. All plots are computed on antmaze-giant-navigate.
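The precise definition of \(\delta_V\) is given in the paper; as a loose sketch, assuming \(\delta_V\) averages the magnitude of value *decreases* along a goal-reaching trajectory (an ideal \(\hat{V}(s_t,g)\) should increase monotonically as the agent approaches \(g\)), it could be computed as:

```python
import numpy as np

def value_nonmonotonicity(values):
    """Average magnitude of value decreases along a goal-reaching trajectory.
    A perfectly monotone value sequence scores 0; noisy estimates that dip
    score higher. (Assumed definition, for illustration only.)"""
    diffs = np.diff(np.asarray(values, dtype=float))
    return float(np.clip(-diffs, 0.0, None).mean())

smooth = [-9, -8, -7, -5, -3, 0]   # monotone increase: no violations
noisy = [-9, -6, -8, -4, -5, 0]    # two dips, of magnitude 2 and 1
print(value_nonmonotonicity(smooth))  # 0.0
print(value_nonmonotonicity(noisy))   # 0.6
```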

Citation

If you find this work useful, please cite:

@inproceedings{venugopal2026ors,
  title     = {Occupancy Reward Shaping: Improving Credit Assignment
               for Offline Goal-Conditioned RL},
  author    = {Aravind Venugopal and Jiayu Chen and Xudong Wu and
               Chongyi Zheng and Benjamin Eysenbach and Jeff Schneider},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=EW8DskWQ1K}
}