where $T = \{o_{i \le n}, a_{i \le n}\} \sim \mathcal{T}_{agent}$, $x_0 = \phi(o_n)$, $t \sim \mathcal{U}(0, 1)$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, $v(\epsilon, x_0, t) = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1 - \bar{\alpha}_t}\, x_0$, and $v_{\theta'}$ is the v-prediction output of the model $f_\theta$. The noise schedule $\bar{\alpha}_t$ is linear, as in Rombach et al. (2022).
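The following is a minimal sketch of how the noisy input $x_t$ and the v-prediction target $v(\epsilon, x_0, t)$ defined above can be computed. It rests on several assumptions that do not come from the source. The linear schedule is taken to be linear in $\beta$ over 1000 discrete steps from $10^{-4}$ to $0.02$ (DDPM-style values; the exact values behind the schedule described above may differ). The continuous $t \sim \mathcal{U}(0,1)$ is mapped to a discrete step index. Plain Python lists stand in for latent tensors, and the latent `x0` is a hypothetical placeholder for $\phi(o_n)$. All function names are illustrative only. The helper `x0_from_v` uses the identity $x_0 = \sqrt{\bar{\alpha}_t}\, x_t - \sqrt{1 - \bar{\alpha}_t}\, v$, which follows directly from the two definitions above and shows why a v-prediction still determines the clean latent.

```python
import math
import random


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for a schedule linear in beta.

    The step count and beta range are assumed, DDPM-style values.
    """
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def noisy_latent_and_v_target(x0, eps, alpha_bar_t):
    """Return x_t = sqrt(a)x0 + sqrt(1-a)eps and v = sqrt(a)eps - sqrt(1-a)x0."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * x + s * e for x, e in zip(x0, eps)]
    v = [a * e - s * x for x, e in zip(x0, eps)]
    return x_t, v


def x0_from_v(x_t, v, alpha_bar_t):
    """Recover x0 from x_t and v: x0 = sqrt(a)x_t - sqrt(1-a)v."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [a * xt - s * vv for xt, vv in zip(x_t, v)]


def v_loss(v_pred, v_target):
    """Mean squared error between a predicted v and the target v."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


# Usage: one training-target sample (the model call itself is not shown).
rng = random.Random(0)
alpha_bars = linear_alpha_bar()
t = rng.random()  # t ~ U(0, 1)
step = min(int(t * len(alpha_bars)), len(alpha_bars) - 1)  # assumed t -> index map
alpha_bar_t = alpha_bars[step]
x0 = [0.5, -1.0, 2.0]  # hypothetical stand-in for phi(o_n)
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
x_t, v_target = noisy_latent_and_v_target(x0, eps, alpha_bar_t)
```

A model output $v_{\theta'}$ would then be compared against `v_target`, for example with `v_loss`. Predicting $v$ rather than $\epsilon$ is a design choice. Because $x_0$ can be recovered exactly from $x_t$ and $v$ at any noise level, the target stays well-defined even when $\bar{\alpha}_t$ approaches 0.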