Deep Q-Networks
| Article | |
|---|---|
| Topic area | Reinforcement Learning |
| Prerequisites | Neural Networks, Stochastic Gradient Descent, Backpropagation |
Overview
A Deep Q-Network (DQN) is a reinforcement learning algorithm that combines classical Q-learning with a deep neural network function approximator. It was introduced by Mnih et al. in 2013 and refined in a 2015 Nature paper, in which a single architecture, learning directly from raw pixels and game scores, reached a performance level comparable to that of a professional human games tester across a set of 49 Atari 2600 games.[1] DQN demonstrated that the same network, trained end-to-end with stochastic gradient descent, could master a wide range of tasks without task-specific feature engineering, and it is widely credited with launching the modern era of deep reinforcement learning.
The core idea is to approximate the optimal action-value function $ Q^*(s, a) $ — the expected discounted return obtainable by taking action $ a $ in state $ s $ and following the optimal policy thereafter — with a parameterised neural network $ Q(s, a; \theta) $. Two engineering ingredients, experience replay and a periodically-updated target network, stabilise training and prevent the divergence that had historically plagued attempts to combine Q-learning with non-linear function approximators.
Background: Q-learning
Reinforcement learning problems are usually formalised as a Markov decision process with states $ s \in \mathcal{S} $, actions $ a \in \mathcal{A} $, rewards $ r $, and a discount factor $ \gamma \in [0, 1) $. The optimal action-value function satisfies the Bellman optimality equation:
- $ {\displaystyle Q^*(s, a) = \mathbb{E}_{s'} \left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]} $
Tabular Q-learning iteratively updates an estimate $ Q(s, a) $ from sampled transitions $ (s, a, r, s') $ via:
- $ {\displaystyle Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]} $
This converges to $ Q^* $ under mild conditions when every state-action pair is visited infinitely often, but tabular methods are infeasible for large or continuous state spaces. The natural fix is to replace the table with a parameterised function $ Q(s, a; \theta) $ and learn the parameters by gradient descent — a recipe that, with non-linear approximators, can diverge in practice. DQN's contribution is a set of stabilisation tricks that make this recipe work reliably.
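As a concrete illustration, the sketch below implements the tabular update in Python; the hyper-parameter values, the `actions` argument and the `done` flag are illustrative assumptions rather than part of any particular reference implementation.

    from collections import defaultdict

    alpha, gamma = 0.1, 0.99           # illustrative step size and discount factor
    Q = defaultdict(float)             # table mapping (state, action) -> value, default 0.0

    def q_update(s, a, r, s_next, actions, done):
        # One-step TD target: just r for terminal transitions, bootstrapped otherwise.
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])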
The DQN algorithm
DQN parameterises $ Q(s, a; \theta) $ with a convolutional neural network whose input is a stack of recent observations (for Atari, four consecutive grey-scale frames) and whose output is a vector of Q-values, one per discrete action. At every environment step the agent selects an action by an epsilon-greedy policy: with probability $ \epsilon $ it picks uniformly at random, otherwise it picks $ \arg\max_a Q(s, a; \theta) $. The exploration rate $ \epsilon $ is annealed linearly from 1.0 to a small terminal value (typically 0.1 or 0.05) over roughly the first million frames.
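The action-selection rule is simple enough to sketch directly; the snippet below assumes the network's output for the current state is already available as a NumPy vector of per-action values, and the function names and annealing schedule are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def linear_epsilon(step, start=1.0, end=0.1, anneal_steps=1_000_000):
        # Linearly anneal the exploration rate from `start` to `end`, then hold it fixed.
        frac = min(step / anneal_steps, 1.0)
        return start + frac * (end - start)

    def select_action(q_values, epsilon):
        # q_values: 1-D array holding Q(s, a; theta) for each discrete action.
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))   # explore: uniform random action
        return int(np.argmax(q_values))               # exploit: greedy action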
The transition $ (s, a, r, s') $ is stored in a replay buffer and a mini-batch of past transitions is sampled for the gradient update. The Q-network is trained to minimise the temporal-difference error between its prediction and a bootstrapped target computed from a slowly-updated copy of itself.
Experience replay
The replay buffer is a fixed-size circular store, typically holding the last $ 10^6 $ transitions. Each gradient step samples a mini-batch uniformly at random from the buffer rather than using the most recent transition. This addresses three problems with online learning from correlated trajectories:
- Sample efficiency — every transition can contribute to many gradient updates.
- Decorrelation — successive transitions are highly correlated; sampling breaks this dependency and makes the data closer to i.i.d., which is what stochastic gradient descent assumes.
- Smoothing — the distribution of training data changes more gradually, reducing oscillations and feedback loops in which a recent policy change biases the next batch of data toward similar states.
Prioritised experience replay later refined the scheme by sampling transitions in proportion to their absolute TD error, focusing learning on surprising experiences.[2]
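A uniform replay buffer needs only a few lines; the sketch below uses a deque as the circular store, and the class and method names are illustrative rather than taken from any reference implementation.

    import random
    from collections import deque

    class ReplayBuffer:
        # Fixed-size circular store; the oldest transitions are evicted when full.
        def __init__(self, capacity=1_000_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniform sampling decorrelates consecutive transitions before the gradient step.
            batch = random.sample(self.buffer, batch_size)
            s, a, r, s_next, done = zip(*batch)
            return s, a, r, s_next, done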
Target network
If the bootstrap target $ r + \gamma \max_{a'} Q(s', a'; \theta) $ is computed from the same parameters being updated, every gradient step shifts the target it is chasing — a moving-target problem that frequently leads to divergence. DQN introduces a separate target network with parameters $ \theta^- $ that are held fixed for $ C $ steps (typically 10 000) and then copied from the online network. The target then becomes:
- $ {\displaystyle y = r + \gamma \max_{a'} Q(s', a'; \theta^-)} $
Decoupling the target from the live parameters dramatically improves stability, at the cost of slightly slower learning because the target lags behind the latest knowledge.
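Synchronisation is just a periodic hard copy of the online parameters; the sketch below assumes the parameters live in plain dictionaries of NumPy arrays, which is an illustrative simplification rather than how a deep-learning framework would store them.

    import numpy as np

    C = 10_000   # target-network update period

    def maybe_sync_target(step, theta, theta_minus):
        # Hard update: theta_minus stays frozen between syncs and is overwritten
        # with a copy of the online parameters every C steps.
        if step % C == 0:
            for name, value in theta.items():
                theta_minus[name] = np.copy(value)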
Loss function and training
The loss for a sampled mini-batch $ \mathcal{B} $ is the mean squared TD error:
- $ {\displaystyle \mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]} $
For terminal transitions the bootstrap term is dropped: $ y = r $. Gradients are taken only with respect to $ \theta $ — the target parameters $ \theta^- $ are treated as constants. The original DQN paper used the RMSProp optimiser, error clipping (clipping the TD error $ r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) $ to $ [-1, 1] $, which amounts to a squared loss for small errors and an absolute-value loss for large ones, i.e. the Huber loss), and frame-skipping (the agent picks an action every fourth frame and repeats it on the intervening frames).
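The target computation, terminal masking and Huber-style loss fit in a short NumPy sketch. The `q_online` and `q_target` callables, assumed to map a batch of states to a (batch_size, n_actions) array, are placeholders for the two networks; the sketch only computes the scalar loss, whereas a real implementation would differentiate it with respect to the online parameters only.

    import numpy as np

    gamma = 0.99

    def dqn_loss(batch, q_online, q_target):
        # batch: arrays (s, a, r, s_next, done); done is 1.0 for terminal transitions.
        s, a, r, s_next, done = batch
        q_sa = q_online(s)[np.arange(len(a)), a]          # Q(s, a; theta)
        max_next = q_target(s_next).max(axis=1)           # max_a' Q(s', a'; theta^-)
        y = r + gamma * (1.0 - done) * max_next           # bootstrap dropped on terminals
        td_error = y - q_sa
        # Huber loss: quadratic for |error| <= 1 and linear beyond, which matches
        # clipping the TD error in the squared loss to [-1, 1].
        huber = np.where(np.abs(td_error) <= 1.0,
                         0.5 * td_error ** 2,
                         np.abs(td_error) - 0.5)
        return huber.mean()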
In pseudo-code:
    initialise online network parameters theta, and target network theta_minus = theta
    initialise replay buffer D
    for episode = 1 to M:
        observe initial state s
        while s is not terminal:
            with probability epsilon select a random action a,
                otherwise a = argmax_a Q(s, a; theta)
            execute a, observe reward r and next state s'
            store (s, a, r, s') in D
            sample a mini-batch of transitions from D
            for each sampled transition, compute target y = r if s' is terminal,
                else y = r + gamma * max_a' Q(s', a'; theta_minus)
            take a gradient step on theta to reduce (y - Q(s, a; theta))^2
            every C steps: theta_minus <- theta
            s <- s'
Variants and extensions
A flurry of follow-up work addressed specific failure modes of vanilla DQN:
- Double DQN — uses the online network to choose the next action and the target network to evaluate it, reducing the systematic over-estimation of Q-values caused by the max operator (a sketch of the modified target follows this list).[3]
- Dueling DQN — decomposes $ Q(s, a) $ into a state-value $ V(s) $ and an action-advantage $ A(s, a) $, allowing the network to assess state value without learning the effect of every action.[4]
- Prioritised replay — see above.
- Multi-step returns — replaces the one-step bootstrap with an $ n $-step return, reducing the bias introduced by bootstrapping at the cost of higher variance.
- Distributional DQN (C51) — predicts the distribution of returns rather than just the mean, capturing risk-related information.[5]
- Noisy Nets — replaces epsilon-greedy with parametric noise on the network weights, providing state-dependent exploration.
- Rainbow — combines six of the above improvements into a single agent that substantially outperforms each in isolation.[6]
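To make the Double DQN modification concrete, the sketch below reuses the same assumed `q_online` and `q_target` callables as in the loss example above; only the construction of the target changes relative to vanilla DQN.

    import numpy as np

    gamma = 0.99

    def double_dqn_target(r, s_next, done, q_online, q_target):
        # The online network selects the next action...
        a_star = q_online(s_next).argmax(axis=1)
        # ...while the frozen target network evaluates it, reducing the
        # over-estimation that comes from maximising over noisy estimates.
        next_q = q_target(s_next)[np.arange(len(a_star)), a_star]
        return r + gamma * (1.0 - done) * next_q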
Comparisons with policy-gradient methods
DQN is a value-based method: it learns a Q-function and derives the policy by greedy action selection. It is restricted to discrete action spaces because the $ \max_{a'} $ operator is intractable for continuous actions. Policy-gradient methods such as REINFORCE, A2C and PPO instead parameterise the policy directly and optimise it with gradient ascent on expected return. Actor-critic methods combine the two, learning a value baseline alongside a policy. For continuous control the standard recipe is a deterministic actor-critic such as DDPG, which can be viewed as a continuous-action analogue of DQN that replaces the discrete max with the output of a learned actor network.
Limitations
DQN inherits the brittleness of off-policy bootstrapping. Training can be unstable when the environment, the buffer contents, and the moving target interact unfavourably, and small implementation details (loss function, optimiser, network architecture, exploration schedule) measurably affect final performance. The algorithm is also sample-inefficient: tens of millions of frames are typical for Atari, far more than a human player needs. Discrete actions limit applicability to continuous control without modifications. Finally, function approximation, bootstrapping, and off-policy learning together form what Sutton and Barto term the deadly triad, a combination under which the convergence guarantees of the tabular setting no longer apply and divergence is possible; DQN's stabilisation tricks are effective heuristics rather than convergence proofs.
See also
- Neural Networks
- Convolutional Neural Networks
- Stochastic Gradient Descent
- Backpropagation
- Loss Functions
References
- Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning". Nature 518, 529–533.
- Schaul, T., Quan, J., Antonoglou, I. and Silver, D. (2016). "Prioritized Experience Replay". ICLR.
- van Hasselt, H., Guez, A. and Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning". AAAI.
- Wang, Z. et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning". ICML.
- Bellemare, M. G., Dabney, W. and Munos, R. (2017). "A Distributional Perspective on Reinforcement Learning". ICML.
- Hessel, M. et al. (2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning". AAAI.