All translations

Found 3 translations.

Name	Current message text
English (en)	Given an input interactive environment <math>\mathcal{E}</math>, and an initial state <math>s_{0} \in \mathcal{S}</math>, an ''Interactive World Simulation'' is a ''simulation distribution function'' <math>q \left( o_{n} \mid \{o_{< n}, a_{\leq n}\} \right), \; o_{i} \in \mathcal{O}, \; a_{i} \in \mathcal{A}</math>. Given a distance metric between observations <math>D: \mathcal{O} \times \mathcal{O} \rightarrow \mathbb{R}</math>, a ''policy'', i.e., a distribution on agent actions given past actions and observations <math>\pi \left( a_{n} \mid o_{< n}, a_{< n} \right)</math>, a distribution <math>S_{0}</math> on initial states, and a distribution <math>N_{0}</math> on episode lengths, the ''Interactive World Simulation'' objective consists of minimizing <math>E \left( D \left( o_{q}^{i}, o_{p}^{i} \right) \right)</math> where <math>n \sim N_{0}</math>, <math>0 \leq i \leq n</math>, and <math>o_{q}^{i} \sim q, \; o_{p}^{i} \sim V(p)</math> are observations sampled from the simulation and the environment, respectively, when enacting the agent’s policy <math>\pi</math>. Importantly, the conditioning actions for these samples are always obtained by the agent interacting with the environment <math>\mathcal{E}</math>, while the conditioning observations can either be obtained from <math>\mathcal{E}</math> (the ''teacher forcing objective'') or from the simulation (the ''auto-regressive objective'').
Spanish (es)	Dado un entorno interactivo de entrada <math>\mathcal{E}</math>, y un estado inicial <math>s_{0} \in \mathcal{S}</math>, una ''simulación de mundo interactivo'' es una ''función de distribución de simulación'' <math>q \left( o_{n} \mid \{o_{< n}, a_{\leq n}\} \right), \; o_{i} \in \mathcal{O}, \; a_{i} \in \mathcal{A}</math>. Dada una métrica de distancia entre observaciones <math>D: \mathcal{O} \times \mathcal{O} \rightarrow \mathbb{R}</math>, una ''política'', es decir, una distribución sobre las acciones del agente dadas las acciones pasadas y las observaciones <math>\pi \left( a_{n} \mid o_{< n}, a_{< n} \right)</math>, una distribución <math>S_{0}</math> sobre los estados iniciales, y una distribución <math>N_{0}</math> sobre la duración de los episodios, el objetivo de la ''simulación de mundo interactivo'' consiste en minimizar <math>E \left( D \left( o_{q}^{i}, o_{p}^{i} \right) \right)</math> donde <math>n \sim N_{0}</math>, <math>0 \leq i \leq n</math>, y <math>o_{q}^{i} \sim q, \; o_{p}^{i} \sim V(p)</math> son observaciones muestreadas de la simulación y del entorno, respectivamente, al aplicar la política del agente <math>\pi</math>. Es importante destacar que las acciones de condicionamiento para estas muestras siempre se obtienen mediante la interacción del agente con el entorno <math>\mathcal{E}</math>, mientras que las observaciones de condicionamiento pueden obtenerse de <math>\mathcal{E}</math> (el ''objetivo de forzamiento por el maestro'') o de la simulación (el ''objetivo autorregresivo'').
Chinese (zh)	给定输入交互环境 <math>\mathcal{E}</math> 和初始状态 <math>s_{0} \in \mathcal{S}</math>，一个“交互世界模拟”是一个“模拟分布函数” <math>q \left( o_{n} \mid \{o_{< n}, a_{\leq n}\} \right), \; o_{i} \in \mathcal{O}, \; a_{i} \in \mathcal{A}</math>。给定观测值之间的距离度量 <math>D: \mathcal{O} \times \mathcal{O} \rightarrow \mathbb{R}</math>，一个“策略”，即给定过去动作和观测的代理动作分布 <math>\pi \left( a_{n} \mid o_{< n}, a_{< n} \right)</math>，初始状态分布 <math>S_{0}</math> 和回合长度分布 <math>N_{0}</math>，交互世界模拟的目标是最小化 <math>E \left( D \left( o_{q}^{i}, o_{p}^{i} \right) \right)</math>，其中 <math>n \sim N_{0}</math>，<math>0 \leq i \leq n</math>，以及 <math>o_{q}^{i} \sim q, \; o_{p}^{i} \sim V(p)</math> 分别是在执行代理策略 <math>\pi</math> 时从模拟和环境中抽取的观测值。重要的是，这些样本的条件动作总是通过代理与环境 <math>\mathcal{E}</math> 交互获得，而条件观测既可以从 <math>\mathcal{E}</math> 获得（“教师强迫目标”），也可以从模拟中获得（“自回归目标”）。
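
The difference between the teacher forcing and auto-regressive objectives in the message above can be illustrated with a minimal Monte Carlo sketch. Everything in it is a hypothetical toy chosen for illustration and does not come from the message text: a one-dimensional environment whose observation equals its state, a uniformly random policy, a simulator that is off by a fixed `bias` per step, the absolute difference as the distance metric, and uniform distributions for the initial state and episode length. The initial observation (i = 0) is skipped for simplicity.

```python
import random

def env_step(state, action):
    """Toy environment E (assumed): observation equals state, state moves by the action."""
    return state + action

def policy(observations, actions, rng):
    """Toy policy pi(a_n | o_<n, a_<n) (assumed): ignores history, picks -1 or +1."""
    return rng.choice([-1, 1])

def simulate(cond_observations, actions, bias=0.1):
    """Toy simulator q(o_n | o_<n, a_<=n) (assumed): last observation plus action plus a fixed error."""
    return cond_observations[-1] + actions[-1] + bias

def distance(o_q, o_p):
    """Distance metric D on observations (assumed: absolute difference)."""
    return abs(o_q - o_p)

def objective(mode, episodes=200, max_len=10, seed=0):
    """Monte Carlo estimate of E[D(o_q^i, o_p^i)] for the chosen conditioning mode."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(episodes):
        s0 = rng.randint(-5, 5)       # s_0 ~ S_0 (assumed uniform)
        n = rng.randint(1, max_len)   # n ~ N_0 (assumed uniform)
        env_obs, sim_obs, actions = [s0], [s0], []
        for _ in range(n):
            # Conditioning actions always come from the agent acting in the environment.
            actions.append(policy(env_obs, actions, rng))
            o_p = env_step(env_obs[-1], actions[-1])
            # Conditioning observations come from E (teacher forcing)
            # or from the simulation's own past outputs (auto-regressive).
            cond = env_obs if mode == "teacher_forcing" else sim_obs
            o_q = simulate(cond, actions)
            total += distance(o_q, o_p)
            count += 1
            env_obs.append(o_p)
            sim_obs.append(o_q)
    return total / count

teacher_forcing_loss = objective("teacher_forcing")
auto_regressive_loss = objective("auto_regressive")
```

In this toy setup the teacher forcing estimate stays at the per-step bias, while the auto-regressive estimate is larger because the simulator's error accumulates as it conditions on its own earlier outputs.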