Translations:Diffusion Models Are Real-Time Game Engines/18/zh

    Message definition (Diffusion Models Are Real-Time Game Engines)
    Given an input interactive environment <math>\mathcal{E}</math>, and an initial state <math>s_{0} \in \mathcal{S}</math>, an ''Interactive World Simulation'' is a ''simulation distribution function'' <math>q \left( o_{n} \,|\, \{o_{< n}, a_{\leq n}\} \right), \; o_{i} \in \mathcal{O}, \; a_{i} \in \mathcal{A}</math>. Given a distance metric between observations <math>D: \mathcal{O} \times \mathcal{O} \rightarrow \mathbb{R}</math>, a ''policy'', i.e., a distribution on agent actions given past actions and observations <math>\pi \left( a_{n} \,|\, o_{< n}, a_{< n} \right)</math>, a distribution <math>S_{0}</math> on initial states, and a distribution <math>N_{0}</math> on episode lengths, the ''Interactive World Simulation'' objective consists of minimizing <math>E \left( D \left( o_{q}^{i}, o_{p}^{i} \right) \right)</math> where <math>n \sim N_{0}</math>, <math>0 \leq i \leq n</math>, and <math>o_{q}^{i} \sim q, \; o_{p}^{i} \sim V(p)</math> are sampled observations from the environment and the simulation when enacting the agent’s policy <math>\pi</math>. Importantly, the conditioning actions for these samples are always obtained by the agent interacting with the environment <math>\mathcal{E}</math>, while the conditioning observations can either be obtained from <math>\mathcal{E}</math> (the ''teacher forcing objective'') or from the simulation (the ''auto-regressive objective'').

    给定输入交互环境 <math>\mathcal{E}</math> 和初始状态 <math>s_{0} \in \mathcal{S}</math>,一个“交互世界模拟”是一个“模拟分布函数” <math>q \left( o_{n} \,|\, \{o_{< n}, a_{\leq n}\} \right), \; o_{i} \in \mathcal{O}, \; a_{i} \in \mathcal{A}</math>。给定观测值之间的距离度量 <math>D: \mathcal{O} \times \mathcal{O} \rightarrow \mathbb{R}</math>,一个“策略”,即给定过去动作和观测的代理动作分布 <math>\pi \left( a_{n} \,|\, o_{< n}, a_{< n} \right)</math>,初始状态分布 <math>S_{0}</math> 和回合长度分布 <math>N_{0}</math>,交互世界模拟的目标是最小化 <math>E \left( D \left( o_{q}^{i}, o_{p}^{i} \right) \right)</math>,其中 <math>n \sim N_{0}</math>,<math>0 \leq i \leq n</math>,以及 <math>o_{q}^{i} \sim q, \; o_{p}^{i} \sim V(p)</math> 是在执行代理策略 <math>\pi</math> 时从环境和模拟中抽取的观测值。重要的是,这些样本的条件动作总是通过代理与环境 <math>\mathcal{E}</math> 交互获得,而条件观测既可以从 <math>\mathcal{E}</math> 获得(“教师强迫目标”),也可以从模拟中获得(“自回归目标”)。
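
To make the definition above concrete, the following is a minimal Python sketch of how the objective <math>E \left( D \left( o_{q}^{i}, o_{p}^{i} \right) \right)</math> could be estimated over one episode under either conditioning regime. The <code>env</code>, <code>sim</code>, <code>policy</code>, and <code>distance</code> interfaces are hypothetical placeholders, not code from the paper; the sketch only illustrates that actions always come from the agent acting in the real environment <math>\mathcal{E}</math>, while the simulator is conditioned on real observations (teacher forcing) or on its own previous predictions (auto-regressive).

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical interfaces (assumptions, not from the paper's code):
#   env.reset() -> o_0, env.step(a) -> o_n        (true environment E)
#   sim.predict(obs_hist, act_hist) -> o_n        (simulation distribution q)
#   policy.act(obs_hist, act_hist) -> a_n         (policy pi)
#   distance(o, o_hat) -> float                   (observation metric D)

def simulation_loss(env, sim, policy, distance, episode_len, teacher_forcing=True):
    """Estimate E[D(o_q^i, o_p^i)] for a single episode of length `episode_len`.

    Actions are always obtained by the agent interacting with the real
    environment. The simulator's conditioning observations come from the
    environment under teacher forcing, or from its own previous predictions
    under the auto-regressive objective.
    """
    o_env = env.reset()
    env_obs, sim_obs, actions = [o_env], [o_env], []
    losses = []

    for _ in range(episode_len):
        # The policy acts on real observations in both settings.
        a = policy.act(env_obs, actions)
        actions.append(a)
        o_env = env.step(a)

        # Conditioning context for the simulator: o_{<n} (real or simulated), a_{<=n}.
        context = env_obs if teacher_forcing else sim_obs
        o_sim = sim.predict(context, actions)

        losses.append(distance(o_env, o_sim))
        env_obs.append(o_env)
        sim_obs.append(o_sim)

    return float(np.mean(losses))
</syntaxhighlight>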