All translations


Found 3 translations.

Name	Current message text
English (en)	We repurpose a pre-trained text-to-image diffusion model, Stable Diffusion v1.4 (Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022]). We condition the model <math>f_{\theta}</math> on trajectories <math>T \sim \mathcal{T}_{agent}</math>, i.e., on a sequence of previous actions <math>a_{< n}</math> and observations (frames) <math>o_{< n}</math>, and remove all text conditioning. Specifically, to condition on actions, we learn an embedding <math>A_{emb}</math> that maps each action (e.g., a specific key press) to a single token, and we replace the cross-attention over the text with cross-attention over this encoded action sequence. To condition on observations (i.e., previous frames), we encode them into latent space using the auto-encoder <math>\phi</math> and concatenate them to the noised latents along the latent channel dimension (see Figure [https://arxiv.org/html/2408.14837v1#S3.F3 3]). We also experimented with conditioning on these past observations via cross-attention but observed no meaningful improvements.
Spanish (es)	Reutilizamos un modelo de difusión de texto a imagen preentrenado, Stable Diffusion v1.4 (Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022]). Condicionamos el modelo <math>f_{\theta}</math> en trayectorias <math>T \sim \mathcal{T}_{agent}</math>, es decir, en una secuencia de acciones previas <math>a_{< n}</math> y observaciones (fotogramas) <math>o_{< n}</math>, y eliminamos todo condicionamiento textual. Específicamente, para condicionar en las acciones, aprendemos una incrustación <math>A_{emb}</math> que asigna a cada acción (por ejemplo, una pulsación de tecla específica) un único token, y sustituimos la atención cruzada sobre el texto por atención cruzada sobre esta secuencia de acciones codificadas. Para condicionar en las observaciones (es decir, los fotogramas anteriores), las codificamos en el espacio latente utilizando el autocodificador <math>\phi</math> y las concatenamos a los latentes ruidosos a lo largo de la dimensión de los canales latentes (véase la figura [https://arxiv.org/html/2408.14837v1#S3.F3 3]). También experimentamos con el condicionamiento en estas observaciones anteriores mediante atención cruzada, pero no observamos mejoras significativas.
Chinese (zh)	我们重新利用预训练的文本到图像扩散模型 Stable Diffusion v1.4（Rombach 等人，[https://arxiv.org/html/2408.14837v1#bib.bib26 2022]）。我们将模型 <math>f_{\theta}</math> 置于轨迹 <math>T \sim \mathcal{T}_{agent}</math> 的条件下，即以之前的动作 <math>a_{< n}</math> 和观察（帧） <math>o_{< n}</math> 的序列为条件，并移除所有文本条件。具体来说，为了以动作为条件，我们学习一个嵌入 <math>A_{emb}</math>，将每个动作（例如按下特定按键）映射为单个标记，并将对文本的交叉注意力替换为对该编码动作序列的交叉注意力。为了以观察（即之前的帧）为条件，我们使用自动编码器 <math>\phi</math> 将它们编码到潜在空间中，并沿潜在通道维度将它们拼接到加噪的潜变量上（见图 [https://arxiv.org/html/2408.14837v1#S3.F3 3]）。我们还尝试通过交叉注意力以这些过去的观察为条件，但没有观察到有意义的改进。
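
The two conditioning paths described in the message text above can be sketched at the level of array shapes. This is a minimal NumPy illustration, not the authors' implementation: the names `embed_actions` and `build_unet_input`, all sizes, the random (untrained) embedding table, and the choice to place past-frame latents before the noised latent in the channel axis are assumptions made for the example. The message text specifies only that each action maps to a single token and that encoded frames are concatenated to the noised latents along the channel dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_ACTIONS = 16           # size of the discrete action vocabulary
EMBED_DIM = 768            # width of the cross-attention context tokens
CONTEXT_LEN = 4            # number of past actions/frames conditioned on
LATENT_CH, H, W = 4, 8, 8  # shape of one encoded frame (auto-encoder output)

# Stand-in for A_emb: a lookup table with one row per action.
# Randomly initialised here; in training it would be learned.
action_table = rng.normal(size=(NUM_ACTIONS, EMBED_DIM))


def embed_actions(actions):
    """Map integer actions (batch, CONTEXT_LEN) to one token per action.

    Returns an array of shape (batch, CONTEXT_LEN, EMBED_DIM) that would
    take the place of the text tokens as the cross-attention context.
    """
    return action_table[actions]


def build_unet_input(noised_latent, past_latents):
    """Concatenate past-frame latents with the noised latent on the channel axis.

    noised_latent: (batch, LATENT_CH, H, W)
    past_latents:  (batch, CONTEXT_LEN, LATENT_CH, H, W)
    Returns:       (batch, (CONTEXT_LEN + 1) * LATENT_CH, H, W)
    """
    batch = past_latents.shape[0]
    # Fold the frame axis into the channel axis: frame 0 channels first.
    stacked = past_latents.reshape(batch, -1, *past_latents.shape[-2:])
    # Assumed ordering: past frames first, noised latent last.
    return np.concatenate([stacked, noised_latent], axis=1)


actions = rng.integers(0, NUM_ACTIONS, size=(2, CONTEXT_LEN))
past = rng.normal(size=(2, CONTEXT_LEN, LATENT_CH, H, W))
noised = rng.normal(size=(2, LATENT_CH, H, W))

action_tokens = embed_actions(actions)
unet_input = build_unet_input(noised, past)
print(action_tokens.shape, unet_input.shape)
```

Because the channel count of the denoiser input grows with the number of context frames, a pre-trained network's first convolution would need its input channels widened to accept this tensor; how that is done is not stated in the message text.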