Welcome
The papers that changed everything
The research that bent the field — read the originals.
Attention Is All You Need
The paper that quietly took over modern AI. Replaced recurrence and convolution with self-attention — every LLM you've heard of descends from this 2017 architecture.
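The core computation is compact. A minimal single-head sketch in NumPy (no batching, masking, or multi-head split; `Wq`, `Wk`, `Wv` stand in for the learned projection matrices):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the same sequence into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output mixes information from all positions
```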
Generative Adversarial Nets
Two networks playing a minimax game: one fakes data, one tries to spot the fakes. The framework behind a decade of generative models.
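The two objectives, sketched in NumPy. `d_real` and `d_fake` are hypothetical discriminator outputs in (0, 1) on real data and on generator samples:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative.
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def generator_loss(d_fake):
    # Non-saturating variant from the paper: G maximizes log D(G(z)).
    return -np.log(d_fake).mean()
```

Training alternates between the two: one gradient step for the discriminator, then one for the generator.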
Deep Residual Learning for Image Recognition
Skip connections that let you train networks 1,000+ layers deep. Won ImageNet, fixed gradient flow, and changed everything that came after.
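The whole idea fits in one function. A fully connected sketch in NumPy (the paper uses convolutional blocks, but the skip connection works the same way):

```python
import numpy as np

def residual_block(x, W1, W2):
    # Learn a residual F(x) rather than a full mapping.
    h = np.maximum(0.0, x @ W1)  # linear layer + ReLU
    return x + h @ W2  # skip connection: gradients flow through the identity
```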
Language Models are Few-Shot Learners
GPT-3, and the surprise that prompted the LLM era — large enough models learn new tasks from a handful of examples in the prompt.
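No gradient updates are involved; the examples live in the prompt itself. A sketch, with translation pairs adapted from the paper's opening figure:

```python
# The task is specified entirely by demonstrations in the context window.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
# A sufficiently large model completes this with "fromage";
# the "learning" happens inside a single forward pass.
```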
Adam: A Method for Stochastic Optimization
The optimizer behind nearly every neural network trained in the last decade. Adaptive per-parameter learning rates, in 7 lines of pseudocode.
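Those lines, in NumPy. `t` is the 1-indexed step count, needed for the bias correction:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Correct the bias from zero-initializing m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Each parameter gets a step size scaled by its own gradient history.
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```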
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Randomly zero out half your neurons during training and the model gets better. A simple trick with deep theoretical roots.
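The standard "inverted" formulation in NumPy, which rescales at training time so nothing changes at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x  # the full network runs at test time
    # Zero each unit with probability p; rescale survivors by 1/(1-p)
    # so the expected activation matches test time.
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask
```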
How modern AI actually works
Concrete techniques powering current systems — diffusion, attention, fine-tuning, distillation.
Diffusion Models
The math behind Stable Diffusion, Sora, and DALL-E. Learn the noising process, then learn the denoising — that's the whole trick.
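The forward (noising) half has a closed form, sketched here in NumPy. `alpha_bar` stands for the cumulative noise schedule, as in DDPM-style models:

```python
import numpy as np

def forward_noise(x0, t, alpha_bar):
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # the network is trained to predict eps from (xt, t)
```

Sampling then runs the learned denoiser in reverse, from pure noise back to data.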
FlashAttention
How to compute attention 3× faster without changing the math — by exploiting the GPU memory hierarchy that the textbook formulation ignores.
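The enabling trick is an online softmax: process keys in blocks while keeping only running statistics, never the full attention matrix. A sketch for a single query with scalar values (the real kernel does this per tile, in on-chip SRAM):

```python
import numpy as np

def streaming_attention(score_blocks, value_blocks):
    m, l, acc = -np.inf, 0.0, 0.0  # running max, normalizer, weighted sum
    for s, v in zip(score_blocks, value_blocks):
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale what we accumulated so far
        l = l * scale + np.exp(s - m_new).sum()
        acc = acc * scale + (np.exp(s - m_new) * v).sum()
        m = m_new
    return acc / l  # identical to softmax(scores) @ values
```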
LoRA Adapters
Fine-tune a 70-billion-parameter model by training a few million extra parameters. The technique behind every Stable Diffusion variant on the internet.
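The forward pass, sketched in NumPy. `W` is the frozen pretrained weight; only the low-rank factors `A` and `B` are trained:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=8):
    # W: (d_out, d_in) frozen. A: (r, d_in), B: (d_out, r) trainable.
    # B is initialized to zero, so training starts at the pretrained model.
    return x @ W.T + ((x @ A.T) @ B.T) * (alpha / r)
```

With r in the tens, the trainable update B @ A is a tiny fraction of the full matrix.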
Cross-Attention
How transformer decoders look at the encoder. The mechanism behind text-to-image conditioning, machine translation, and tool use.
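It is self-attention with one change: queries come from one sequence, keys and values from another. A NumPy sketch:

```python
import numpy as np

def cross_attention(decoder_h, encoder_h, Wq, Wk, Wv):
    Q = decoder_h @ Wq                      # queries from the decoder
    K, V = encoder_h @ Wk, encoder_h @ Wv   # keys/values from the encoder
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V  # each decoder position reads a mix of encoder states
```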
Knowledge Distillation
Take a giant model, train a small one to mimic it. The reason your phone can run something close to GPT-class quality.
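The student matches the teacher's softened output distribution. A NumPy sketch of the distillation loss, with temperature `T` as in Hinton et al.:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Temperature > 1 softens both distributions, exposing the teacher's
    # knowledge about which wrong classes are almost right.
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    # Cross-entropy against soft targets, scaled by T^2 to keep
    # gradient magnitudes comparable across temperatures.
    return -(T**2) * (p_t * np.log(p_s)).sum(axis=-1).mean()
```

In practice this is blended with the ordinary cross-entropy on the true labels.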
Mixed Precision Training
Train neural networks at half the memory and twice the speed by mixing 16-bit and 32-bit numbers — without losing accuracy.
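The recipe: compute in fp16, keep an fp32 master copy of the weights, and scale the loss so small gradients survive fp16's limited range. A sketch, where `grad_fn` is a hypothetical stand-in for the backward pass:

```python
import numpy as np

def mixed_precision_step(weights_fp32, grad_fn, lr=1e-3, loss_scale=1024.0):
    w16 = weights_fp32.astype(np.float16)  # half-precision compute copy
    # grad_fn (hypothetical) returns fp16 gradients of (loss * loss_scale);
    # scaling keeps tiny gradients from flushing to zero in fp16.
    grad16 = grad_fn(w16)
    grad32 = grad16.astype(np.float32) / loss_scale  # unscale in fp32
    return weights_fp32 - lr * grad32  # update the fp32 master weights
```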
Classifier-Free Guidance
The trick that makes diffusion models actually follow your prompt. A clever way to steer generation without an extra classifier.
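At each denoising step the model runs twice, once with the prompt and once with an empty prompt, and the two predictions are extrapolated:

```python
def guided_noise(eps_cond, eps_uncond, guidance_scale=7.5):
    # Push the prediction away from "unconditional", toward "conditional".
    # scale = 1 recovers plain conditional sampling; larger values
    # follow the prompt more closely at some cost in diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```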
Start here
The foundational ideas everything else builds on. New to the field? Read these first.
Backpropagation
The algorithm that makes neural networks possible. Take the chain rule, apply it backwards through a computation graph — and you can train anything.
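A worked example on a tiny graph, y = (w·x + b)²: cache intermediates on the way forward, then apply the chain rule on the way back:

```python
def forward_backward(x, w, b):
    # Forward pass, caching the intermediate z.
    z = w * x + b
    y = z ** 2
    # Backward pass: chain rule from the output toward each input.
    dy_dz = 2 * z
    dy_dw = dy_dz * x    # dz/dw = x
    dy_db = dy_dz * 1.0  # dz/db = 1
    return y, dy_dw, dy_db
```

Autodiff frameworks do exactly this, mechanically, for graphs with billions of nodes.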
Gradient Descent
Roll downhill on the loss landscape. Every modern ML system, from logistic regression to GPT-4, is a variant of this idea.
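The whole algorithm, plus a one-line example minimizing f(x) = (x - 3)²:

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    # Step opposite the gradient: downhill on the loss surface.
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

x_best = gradient_descent(lambda x: 2 * (x - 3), theta=0.0)  # ~3.0
```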
Word Embeddings
Turn words into vectors so geometry encodes meaning. The first step from symbol-pushing AI to the language models we have today.
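The classic demonstration, with toy hand-picked 2-D vectors (real embeddings have hundreds of dimensions, learned from text):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}
# Geometry encodes meaning: king - man + woman lands on queen.
target = emb["king"] - emb["man"] + emb["woman"]
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - target))
print(nearest)  # queen
```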
Neural Networks
Stack matrix multiplications, sprinkle nonlinearities, repeat. A simple recipe with surprising expressiveness.
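The recipe, literally, in NumPy. `layers` is a list of (W, b) weight/bias pairs:

```python
import numpy as np

def mlp_forward(x, layers):
    for W, b in layers[:-1]:
        x = np.maximum(0.0, x @ W + b)  # matrix multiply, then ReLU
    W, b = layers[-1]
    return x @ W + b  # last layer stays linear (logits)
```

Without the nonlinearity, the whole stack would collapse into a single matrix multiplication.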
Convolutional Neural Networks
Translation-equivariant filters that revolutionized computer vision. Why your phone can identify a cat in 5 milliseconds.
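The core operation: slide one small filter over the image, applying the same weights everywhere. A naive single-channel sketch in NumPy:

```python
import numpy as np

def conv2d(image, kernel):
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Same weights at every position: translation equivariance.
            out[i, j] = (image[i:i+kH, j:j+kW] * kernel).sum()
    return out
```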
Cross-Entropy Loss
The loss function behind nearly every classification system. From information theory to training your first MNIST classifier.
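For a single example, in NumPy, computed the numerically stable way:

```python
import numpy as np

def cross_entropy(logits, label):
    # log-softmax via the log-sum-exp trick, then negative log-likelihood.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]  # small when the model puts mass on `label`
```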
Transfer Learning
Train once on a giant dataset, reuse the representations forever. The economic engine behind today's AI tools.
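A sketch of the pattern, with a hypothetical frozen backbone and a small trainable head:

```python
import numpy as np

def features(x, backbone_W):
    # Frozen pretrained layer(s): the reusable representation.
    return np.maximum(0.0, x @ backbone_W)

def predict(x, backbone_W, head_W):
    # Only head_W is trained on the new task; the pretraining cost is
    # amortized across every downstream application.
    return features(x, backbone_W) @ head_W
```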
The hard parts
Where AI gets uncomfortable — the failure modes, ethics, and open problems.
Algorithmic Fairness
When models inherit bias from data, who do they harm? Formal definitions, impossibility results, and what they imply for deployment.
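One of those formal definitions, demographic parity, sketched in NumPy: positive-prediction rates should match across groups:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    # y_pred: 0/1 predictions; group: boolean membership in group A.
    rate_a = y_pred[group].mean()
    rate_b = y_pred[~group].mean()
    return abs(rate_a - rate_b)  # 0 means parity under this definition
```

The impossibility results say you generally cannot satisfy several such definitions at once, which is what makes deployment decisions hard.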
Calibration of Predictions
A model that says "90% confident" should be right 90% of the time. Most aren't. Why this matters and how to fix it.
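The standard diagnostic, expected calibration error: bin predictions by confidence and compare each bin's average confidence to its actual accuracy:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences in [0, 1]; correct is a boolean array.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by bin population
    return ece  # 0 for a perfectly calibrated model
```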
Bias in Machine Learning
Bias enters through data, labels, objectives, and deployment. A taxonomy of how it gets in and what we can do about it.
Overfitting and Regularization
The eternal tension: a model that memorizes training data fails on new examples. Why every effective technique trades flexibility for generalization.
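The simplest version of that trade, sketched as ridge regression: an L2 penalty that shrinks weights toward zero:

```python
import numpy as np

def ridge_loss(w, X, y, lam=0.1):
    residual = X @ w - y
    # Fit term plus penalty: lam trades flexibility (fitting the data)
    # for generalization (keeping the weights small and simple).
    return (residual**2).mean() + lam * (w**2).sum()
```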
Attention Rollout
Transformers are black boxes — but you can still trace what each layer attended to. A first step toward interpreting them.
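The rollout itself, following Abnar and Zuidema: add the identity for the residual stream, renormalize, and multiply the per-layer attention maps together:

```python
import numpy as np

def attention_rollout(attentions):
    # attentions: per-layer (tokens, tokens) matrices, heads already averaged.
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for A in attentions:
        A = A + np.eye(n)                      # account for skip connections
        A = A / A.sum(axis=-1, keepdims=True)  # keep rows as distributions
        rollout = A @ rollout
    return rollout  # rollout[i, j]: how much token j feeds into token i
```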
Grad-CAM
Visual explanations for vision models. Highlight the pixels that drove the prediction — sometimes you'll be surprised what the model is looking at.
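The computation, given the last conv layer's activations and the gradient of the target class score with respect to them:

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: (channels, H, W).
    weights = gradients.mean(axis=(1, 2))             # per-channel importance
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum of maps
    return np.maximum(cam, 0.0)  # ReLU keeps only positive evidence
```

Upsample the result to image size and overlay it on the input to get the familiar heatmap.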