Advantage Actor-Critic

    Topic area: Reinforcement Learning
    Prerequisites: Policy Gradient, Actor-Critic Methods, Markov Decision Process


    Overview

    Advantage Actor-Critic (A2C) is a Policy Gradient algorithm for Reinforcement Learning that combines a parameterized policy (the actor) with a learned value function (the critic), and uses the advantage function as the signal for policy updates. By replacing raw returns with advantages, A2C reduces the variance of gradient estimates while keeping them unbiased, which makes training substantially more stable than vanilla REINFORCE. A2C is the synchronous variant of Asynchronous Advantage Actor-Critic (A3C), introduced by Mnih and colleagues in 2016, and it has become a standard baseline for both discrete-action benchmarks such as Atari and continuous-control tasks.[1]

    The defining idea is simple: rather than scaling each policy gradient by the noisy Monte Carlo return, scale it by how much better an action was than the average action available in that state. The critic supplies that "average," and the difference between the observed return and the critic's prediction is the advantage. This article describes the formulation, the synchronous training procedure that distinguishes A2C from A3C, common variants, and the relationship to neighboring algorithms such as Proximal Policy Optimization and Trust Region Policy Optimization.

    Intuition

    In a Policy Gradient update, the gradient direction is weighted by some scalar score that says "this action was good" or "this action was bad." The simplest choice is the trajectory return, but returns have very high variance: an action taken near the start of an episode may be assigned credit for events occurring hundreds of steps later, most of which are unrelated. The variance of the resulting estimator can dwarf its mean, which slows learning to a crawl.

    A2C addresses this by introducing a learned baseline. Subtracting any state-dependent baseline from the return leaves the gradient estimator unbiased, because the baseline term, weighted by the score function, has zero expectation under the policy. Choosing the Value Function as the baseline turns the score into the advantage, which intuitively measures the action's deviation from the policy's average behavior. Actions that beat the baseline are reinforced; actions that fall short are discouraged. The critic is trained in parallel by temporal-difference regression toward bootstrapped targets, so the baseline tracks the policy as it improves.
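
    The unbiasedness of the baseline follows from a one-line calculation: for any state-dependent baseline $ b(s) $, the subtracted term has zero expectation under the policy,

    $ {\displaystyle \mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s) \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0.} $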

    Formulation

    Let $ \pi_\theta(a \mid s) $ denote the policy parameterized by $ \theta $ and let $ V_\phi(s) $ denote the critic parameterized by $ \phi $. The state-value function under the current policy is

    $ {\displaystyle V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s\right],} $

    and the advantage of taking action $ a $ in state $ s $ is

    $ {\displaystyle A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).} $

    The policy gradient with the advantage as the score is

    $ {\displaystyle \nabla_\theta J(\theta) = \mathbb{E}_{\pi}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi}(s, a)\right].} $
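
    The expectation is estimated by sampling: a single (state, action, advantage-estimate) triple yields one Monte Carlo sample of the gradient. A minimal sketch of that score-function step, assuming PyTorch as the autodiff library; the toy dimensions and the advantage value are placeholders, not part of any reference implementation:

        import torch
        from torch.distributions import Categorical

        policy = torch.nn.Linear(4, 2)              # toy policy: 4-dim state, 2 discrete actions
        state = torch.randn(1, 4)
        advantage = torch.tensor(0.7)               # stand-in for an advantage estimate from the critic

        dist = Categorical(logits=policy(state))    # pi_theta(. | s)
        action = dist.sample()

        # Ascending E[log pi(a|s) * A] is implemented as descending its negation.
        loss = -(dist.log_prob(action) * advantage).mean()
        loss.backward()                             # gradient sample now sits in policy.weight.grad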

    In practice the advantage is estimated from rollouts. The simplest one-step estimator is

    $ {\displaystyle \hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t),} $

    which is just the TD error. An $ n $-step estimator that bootstraps after $ n $ reward terms is widely used,

    $ {\displaystyle \hat{A}_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_\phi(s_{t+n}) - V_\phi(s_t),} $

    and modern implementations frequently replace it with Generalized Advantage Estimation (GAE), an exponentially weighted average of $ n $-step estimators (equivalently, a $ \gamma\lambda $-discounted sum of one-step TD errors) whose parameter $ \lambda $ acts as a bias-variance knob.[2]
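
    A minimal sketch of this estimator family, assuming a single rollout stored as NumPy arrays; the function name and argument layout are illustrative. Setting lam=1.0 recovers the $ n $-step estimator above (the TD errors telescope), and lam=0.0 recovers the one-step TD error:

        import numpy as np

        def advantages_from_rollout(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
            """GAE over one rollout of length T.
            rewards, dones: length-T arrays; values: critic outputs for the T visited states;
            last_value: critic estimate for the state reached after the final step."""
            T = len(rewards)
            adv = np.zeros(T)
            gae, next_value = 0.0, last_value
            for t in reversed(range(T)):
                nonterminal = 1.0 - dones[t]
                delta = rewards[t] + gamma * next_value * nonterminal - values[t]   # one-step TD error
                gae = delta + gamma * lam * nonterminal * gae                        # discounted sum of TD errors
                adv[t] = gae
                next_value = values[t]
            return adv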

    The critic is trained by regression: it minimizes the squared error between its prediction and the bootstrapped $ n $-step target,

    $ {\displaystyle \mathcal{L}_V(\phi) = \mathbb{E}\!\left[\bigl(V_\phi(s_t) - \hat{R}_t\bigr)^2\right], \quad \hat{R}_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_\phi(s_{t+n}).} $

    A small entropy bonus on the policy is almost always added to discourage premature collapse to deterministic actions, giving the combined loss

    $ {\displaystyle \mathcal{L}(\theta, \phi) = -\mathbb{E}\!\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right] + c_v \mathcal{L}_V(\phi) - c_e \mathbb{E}\!\left[\mathcal{H}\bigl(\pi_\theta(\cdot \mid s_t)\bigr)\right],} $

    where $ c_v $ and $ c_e $ are scalar coefficients and $ \mathcal{H} $ is the policy entropy.
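
    A sketch of this combined objective for a discrete-action policy, assuming PyTorch; the function name is illustrative, and the network outputs (logits and value predictions for a batch of states) are assumed to be computed elsewhere:

        import torch
        from torch.distributions import Categorical

        def a2c_loss(logits, values, actions, returns, advantages, c_v=0.5, c_e=0.01):
            """Policy-gradient term plus value regression minus entropy bonus.
            Advantages are detached so the policy term does not backpropagate into the critic."""
            dist = Categorical(logits=logits)
            policy_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
            value_loss = (values.squeeze(-1) - returns).pow(2).mean()
            entropy = dist.entropy().mean()
            return policy_loss + c_v * value_loss - c_e * entropy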

    Training

    A2C is on-policy and synchronous. A vector of parallel environments steps in lockstep for $ n $ transitions, producing a batch of size $ n \times N $ where $ N $ is the number of workers. Advantages are computed on this batch using the bootstrapped values from the critic, and a single gradient step updates both networks. The collected transitions are then discarded and a fresh rollout begins, since gradient estimates are valid only for the current policy.
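
    A compact sketch of this loop, assuming Gymnasium and PyTorch, a shared-trunk network, and CartPole with illustrative, untuned hyperparameters; a production implementation would also handle truncation bootstrapping, observation normalization, and logging:

        import gymnasium as gym
        import torch
        import torch.nn as nn
        from torch.distributions import Categorical

        N, n, gamma = 8, 5, 0.99                          # workers, rollout length, discount
        envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(N)])
        obs_dim = envs.single_observation_space.shape[0]
        act_dim = envs.single_action_space.n

        trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())   # shared lower layers
        policy_head, value_head = nn.Linear(64, act_dim), nn.Linear(64, 1)
        params = [*trunk.parameters(), *policy_head.parameters(), *value_head.parameters()]
        optimizer = torch.optim.RMSprop(params, lr=7e-4)

        obs, _ = envs.reset(seed=0)
        for update in range(2000):
            log_probs, entropies, values, rewards, dones = [], [], [], [], []
            for _ in range(n):                                     # all workers step in lockstep
                h = trunk(torch.as_tensor(obs, dtype=torch.float32))
                dist = Categorical(logits=policy_head(h))
                action = dist.sample()
                obs, rew, term, trunc, _ = envs.step(action.numpy())
                log_probs.append(dist.log_prob(action))
                entropies.append(dist.entropy())
                values.append(value_head(h).squeeze(-1))
                rewards.append(torch.as_tensor(rew, dtype=torch.float32))
                dones.append(torch.as_tensor(term | trunc, dtype=torch.float32))
            with torch.no_grad():                                  # bootstrap from the final state
                next_value = value_head(trunk(torch.as_tensor(obs, dtype=torch.float32))).squeeze(-1)
            returns, R = [], next_value
            for t in reversed(range(n)):                           # n-step bootstrapped targets
                R = rewards[t] + gamma * (1.0 - dones[t]) * R
                returns.append(R)
            returns = torch.stack(returns[::-1])
            values = torch.stack(values)
            advantages = (returns - values).detach()
            loss = (-(torch.stack(log_probs) * advantages).mean()
                    + 0.5 * (values - returns).pow(2).mean()
                    - 0.01 * torch.stack(entropies).mean())
            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(params, 0.5)
            optimizer.step()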

    This synchronous design is what differentiates A2C from its predecessor A3C. In A3C each worker holds a local copy of the parameters, computes gradients independently, and applies them asynchronously to a shared parameter server, which means workers operate on slightly stale parameters. The OpenAI baselines team observed that, with a sufficiently fast GPU implementation, batching all workers' transitions and performing one synchronous update is at least as sample-efficient as A3C and substantially simpler to implement and debug.[3] The actor and critic typically share lower layers when the observation is high-dimensional (for example, the convolutional trunk of an Atari agent), and split into separate heads near the output.

    Hyperparameters that matter most in practice are the number of parallel environments, the rollout length $ n $, the discount factor $ \gamma $, the entropy coefficient $ c_e $, and the learning rate. The effective batch size couples these: doubling the worker count doubles the batch per update and therefore halves the number of gradient steps taken within a fixed budget of environment steps, which usually requires retuning the learning rate. Commonly reported starting values are sketched below.
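
    For orientation, the values below are in the neighborhood of the defaults used by popular A2C implementations for Atari-scale experiments; they are starting points to tune from, not canonical settings:

        # Illustrative A2C starting values (roughly the defaults of popular implementations).
        a2c_defaults = {
            "num_envs": 16,          # parallel workers
            "rollout_length": 5,     # n
            "gamma": 0.99,           # discount factor
            "learning_rate": 7e-4,   # RMSProp, often linearly annealed
            "value_coef": 0.5,       # c_v
            "entropy_coef": 0.01,    # c_e
            "max_grad_norm": 0.5,    # global gradient clipping
        }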

    Variants

    Several variants extend the basic recipe. A3C, the original asynchronous formulation, runs many CPU workers without a GPU and was historically attractive on commodity hardware. ACKTR replaces the SGD optimizer with a Kronecker-factored natural-gradient step, exploiting the structure of the Fisher Information Matrix for faster convergence at modest extra cost.[4] Proximal Policy Optimization (PPO) extends A2C with a clipped surrogate objective and multiple epochs of minibatch updates per rollout, which permits larger effective step sizes without policy collapse and has largely supplanted A2C as the default policy-gradient baseline. The Soft Actor-Critic family generalizes the framework to off-policy training with a maximum-entropy objective and is preferred for continuous control with replay buffers.

    Comparisons

    Compared to value-based methods such as DQN, A2C handles continuous action spaces directly and avoids the need for an explicit max-over-actions operation, which is intractable in continuous domains. Compared to vanilla REINFORCE, A2C achieves dramatically lower variance through the learned baseline at the cost of introducing bias from the bootstrapped critic, although in practice the bias is small and well-controlled. Compared to PPO, A2C is simpler and uses each rollout exactly once, which makes it less sample-efficient but easier to reason about; PPO is the better default when wall-clock training time is plentiful and sample efficiency matters. Compared to off-policy actor-critic methods such as Soft Actor-Critic or DDPG, A2C cannot reuse old transitions, so its sample efficiency on hard continuous-control benchmarks is meaningfully worse, although stability and ease of distributed training are often better.

    Limitations

    A2C inherits the central weakness of on-policy methods: every transition is used in exactly one gradient step before being discarded, so sample efficiency is poor compared to off-policy alternatives. The synchronous design also means the slowest environment in the worker pool throttles all the others, which becomes a problem when episodes have variable length or wall-clock cost. The algorithm is sensitive to the entropy coefficient and to the value-loss coefficient; too little entropy causes premature convergence to a deterministic but suboptimal policy, while too much prevents exploitation of any structure the critic has learned. Finally, because both networks are trained on the same trajectories with shared lower layers, the value loss can dominate early in training and starve the policy of useful gradient signal, which is one reason most modern implementations prefer PPO with separate networks.

    References

    1. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K. "Asynchronous Methods for Deep Reinforcement Learning," 2016.
    2. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P. "High-Dimensional Continuous Control Using Generalized Advantage Estimation," 2015.
    3. Template:Cite arxiv
    4. Wu, Y., Mansimov, E., Liao, S., Grosse, R., Ba, J. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation," 2017.