Alex: Welcome to another episode of ResearchPod. Sam, we've got a notable puzzle today about why a key tool in AI training keeps succeeding despite warnings it shouldn't.
Sam: This episode looks at a paper called "Adam Converges Without Any Modification On Update Rules" by Yushun Zhang and colleagues. The central puzzle is this: Adam is the go-to method for training huge AI models like large language models, and a prior study showed it can fail to converge in certain setups—yet in practice, it powers successes like Llama and GPT without any changes.
Alex: So the paper is basically asking why Adam works so reliably in real training, even when theory predicts problems?
Sam: Yes, exactly. The key mismatch they spot is in how things are set up: the earlier theory picked Adam's settings first—like β₁ and β₂, which control how much it remembers past steps—then built a tricky problem to make it fail. In real life, you fix your dataset and training problem first, then tune those settings. The paper proves Adam can converge—meaning steadily improve toward the best solution—without tweaks, if you tune β₂ large enough relative to the problem and keep β₁ below the square root of β₂.
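For listeners who want the update rule on the page: here is a minimal sketch of the standard, unmodified Adam step the episode discusses. The hyperparameter defaults and the toy problem are illustrative, not the paper's tuned settings.

```python
import numpy as np

def adam_step(theta, grad, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One unmodified Adam step (Kingma-Ba form). The paper's safe regime
    keeps beta1 < sqrt(beta2), with beta2 tuned large for the problem."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: memory of past directions
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: memory of past gradient sizes
    m_hat = m / (1 - beta1 ** k)                # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2 starting from theta = 5.
theta, m, v = 5.0, 0.0, 0.0
for k in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, k)
```

Because the step is scaled by 1/√v, each move has roughly constant size lr, regardless of the raw gradient's magnitude—this is the normalization that the β₂ memory controls.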
Alex: Right, so the core problem is that theory and practice flipped the order of decisions?
Sam: Precisely. When the problem—like how many data batches you have—is fixed first, there's a safe zone in the β₁-β₂ plane where Adam reliably converges to good points, and a danger zone where it might blow up if β₂ is too small. This reveals a phase transition—a boundary that shifts with batch size, shrinking as batches grow larger. Their tests on datasets like MNIST confirm Adam succeeds in what theory called the danger zone, once you tune right.
Alex: And that boundary depends on the batch size?
Sam: It does—the smaller your batches, the higher β₂ needs to be to cross into the safe zone. They suggest tuning β₂ up as batch size drops, then keeping β₁ under its square root, matching what works in real large-model training.
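One purely illustrative way to encode that rule of thumb in code: size the effective averaging window of the second-moment memory, 1/(1 − β₂), to cover several full passes over the mini-batches. The helper name and the constant c below are made up for this sketch, not a formula from the paper.

```python
import math

def suggest_betas(num_minibatches, c=10.0):
    """Illustrative heuristic (not the paper's formula): make the effective
    window 1/(1 - beta2) span c full cycles of the n mini-batches, then
    keep beta1 strictly below sqrt(beta2)."""
    beta2 = 1.0 - 1.0 / (c * num_minibatches)
    beta1 = min(0.9, math.sqrt(beta2) - 1e-3)   # stay under sqrt(beta2)
    return beta1, beta2

print(suggest_betas(10))    # few batches  -> beta2 = 0.99
print(suggest_betas(1000))  # many batches -> beta2 = 0.9999
```

The direction of the trend—β₂ rising toward 1 as the number of mini-batches grows—is what matches the paper's guidance; the specific constant is a free choice.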
Alex: Okay, so tuning β₂ higher for smaller batches gets you into that safe zone. But how does that actually stabilize things inside the algorithm—what's the key mechanism making the steps reliable?
Sam: The main idea is that β₂ controls how much the algorithm remembers past gradient sizes to adjust its step length. When β₂ is close to 1—like 0.999—it changes that memory very slowly, like a heavy flywheel in a machine that resists sudden jolts and keeps spinning smoothly. This makes the step-size adjustment—specifically, one over the square root of that memory—behave predictably, clustering tightly around its average value from the data.
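The flywheel effect is easy to see numerically: feed the same noisy gradient stream through the second-moment recursion with a small and a large β₂, and compare how much the resulting scaling factor 1/√v fluctuates. This is a synthetic illustration, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(1.0, 0.5, size=5000)        # synthetic noisy gradient stream

def scaling_trace(beta2):
    v, trace = 0.0, []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2   # second-moment memory
        trace.append(1.0 / np.sqrt(v + 1e-8))  # per-step scaling factor
    return np.array(trace[1000:])              # discard warm-up

print(scaling_trace(0.999).std())  # heavy flywheel: scaling clusters tightly
print(scaling_trace(0.5).std())    # light memory: scaling swings widely
```

With β₂ = 0.999 the memory averages roughly a thousand recent squared gradients, so single noisy samples barely move it; with β₂ = 0.5 only a handful of samples matter and the scaling jumps around.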
Alex: A flywheel... so it smooths out the ups and downs from noisy small batches?
Sam: Yes. With large β₂, the memory is a weighted average of many past gradients, mostly recent ones but stretching far back. This average stays close to its expected value with high probability, because the weights form a stable mix over which data batches get sampled—like a predictable recipe pulling from history. This stabilizes the whole update, mimicking a steady full-dataset step even with mini-batches.
Alex: So without that, small β₂ lets noise throw it off course?
Sam: Exactly. They also handle the momentum part—β₁'s running average of past directions—with an extra trick: define a corrected position z_k that subtracts out the bias built up over a full cycle of all data batches. This z_k sequence steadily descends toward better points, as proven by a standard inequality linking gradient direction to progress. Together, it ensures Adam reaches critical points—or very near them—under realistic data assumptions like smooth functions and controlled gradient variance.
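The "standard inequality" mentioned here is the descent lemma for L-smooth functions, applied along the corrected sequence z_k. A sketch of its generic form (the paper's exact constants are omitted):

```latex
f(z_{k+1}) \le f(z_k) + \langle \nabla f(z_k),\, z_{k+1} - z_k \rangle
            + \frac{L}{2}\,\big\| z_{k+1} - z_k \big\|^2 .
```

When the update direction correlates with the negative gradient and the step is small enough, the inner-product term is negative and dominates the quadratic term, so f(z_k) decreases in expectation.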
Alex: Right, and those assumptions cover common setups without forcing gradients to stay tiny?
Sam: They do—things like Lipschitz smoothness, where changes don't explode, and variance bounds that let noise grow reasonably with the signal, fitting empirical risk minimization over batches. The paper notes this works even where prior theory demanded stricter limits.
Alex: So those bounds fit real training pretty well. But in practice, how does this play out—like, does tuning β₂ bigger for tiny batches actually make a difference on datasets?
Sam: The paper checks this directly on MNIST, a standard image dataset split into many mini-batches. They run Adam sweeps varying batch size and β₂, with β₁ fixed at 0.9. The pattern matches theory: when batches are small—meaning more mini-batches total—you need a larger β₂, closer to 1, to drive training loss down reliably. With bigger batches, even moderate β₂ works fine.
Alex: So smaller batches demand more memory in that variance estimate to smooth the noise?
Sam: Precisely—small batches bring noisier gradient samples, so the flywheel effect from high β₂ becomes essential to keep the memory stable. This guides large language model training, where tiny batches are common due to memory limits.
Alex: That lines up with what you've said about the phase boundary shifting.
Sam: Yes, and here's a key point: as β₂ climbs from small values toward 1, Adam crosses from a divergence zone—where steps blow up—to convergence. The paper calls this a phase transition, the first mapped out in the β₁-β₂ plane, with the safe β₂ threshold growing inversely with batch size. When noise exists in the data—called the non-realizable case—Adam settles near critical points, not exactly on them, but that neighborhood shrinks to nearly zero as β₂ nears 1, observed in their MNIST plots with diminishing steps.
Alex: Okay, so even with real-world messiness, pushing β₂ high gets you arbitrarily close.
Sam: Correct. Their divergence proofs reinforce this: below that threshold, on crafted but realistic problems, iterates and gradients explode, especially as mini-batches multiply. Tuning β₂ problem-dependently isn't optional—it's what flips the switch to reliable progress, explaining Adam's real-world wins.
Alex: High β₂ flips the switch. But inside the math, how do they show that step-size part stays so predictable with small batches?
Sam: They focus on the term one over the square root of the memory, which scales each step based on recent gradient sizes. The memory itself builds as a weighted sum of squared gradients from past steps, with newer ones weighted more but old ones lingering if β₂ is high. Large β₂ makes this sum change gradually, pulling in history from many batches. To prove it hugs its average closely, they recast the weights as geometric sums over which batches get picked—each pick is like a fair coin flip from the full set. A statistical tool then bounds how far this weighted count strays from even sampling, holding with high probability when β₂ nears 1.
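That concentration argument can be mimicked in a toy simulation: sample batch indices uniformly, weight the indicator of one batch by the geometric weights β₂ induces, and watch the deviation from the expected frequency 1/n shrink as β₂ approaches 1. This is an illustration of the statistical phenomenon, not the paper's actual proof.

```python
import numpy as np

def mean_abs_dev(beta2, n=10, T=5000, trials=200, seed=1):
    """Average |geometrically weighted hit-frequency of batch 0 minus 1/n|
    under uniform batch sampling, with the newest step weighted most."""
    rng = np.random.default_rng(seed)
    hits = (rng.integers(0, n, size=(trials, T)) == 0).astype(float)
    w = (1 - beta2) * beta2 ** np.arange(T)[::-1]  # geometric weights, newest last
    w /= w.sum()                                   # normalize the truncated series
    return float(np.mean(np.abs(hits @ w - 1.0 / n)))

for b2 in (0.9, 0.99, 0.999):
    print(b2, mean_abs_dev(b2))   # deviation shrinks as beta2 -> 1
```

Intuitively, the effective number of "coin flips" in the weighted average scales like 1/(1 − β₂), so pushing β₂ toward 1 is exactly what tightens the concentration.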
Alex: So... it's like proving a weighted average of many coin flips stays near its expected value, even when the weights favor recent flips?
Sam: Yes—those coin flips are independent trials, one per past step, indicating which batch was sampled. The geometric weights shrink older flips' influence just enough for the math to show the deviation shrinks as β₂ grows, without assuming gradients stay small. This pins one over the square root of the memory between tight bounds around its expected value, stabilizing the update direction toward descent.
Alex: And that handles the noise from random batch picks?
Sam: It does—the bounds ensure the scaling factor doesn't swing wildly, even if one gradient spikes. For the momentum distortion from β₁'s long memory, they introduce an auxiliary path z_k: subtract a fixed fraction of the position n steps back, canceling bias over a full data cycle. This z_k telescopes the updates into a clean descent lemma—yielding overall convergence to critical points or near them.
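One construction consistent with that verbal description—subtracting a β₁-discounted copy of the iterate from n steps earlier so that momentum contributions accumulated over one full cycle of the n mini-batches cancel—would look like the following. The paper's exact definition may differ; treat this as a plausible sketch.

```latex
z_k \;=\; \frac{x_k - \beta_1^{\,n}\, x_{k-n}}{1 - \beta_1^{\,n}} .
```

Differencing consecutive z_k then telescopes the momentum terms, leaving an update whose inner product with the gradient can be lower-bounded via the descent lemma.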
Alex: So z_k wipes the historical slate clean every cycle.
Sam: Precisely, and it relaxes prior requirements on β₂, expanding the safe tuning zone. The paper stresses this holds under standard assumptions like smooth losses and variance growing reasonably, without prior theory's tight gradient caps—matching large-model practice.
Alex: To tie it all together, how do they actually show the expected progress is bounded below for descent?
Sam: They break the key progress term into three parts based on events: one for bounded gradients where steps are controlled, one main descent part under good sampling, and a rare tail where bad samples cluster. The tail—when sampling strays far from even—has tiny probability, so its contribution vanishes as β₂ grows. The core descent term splits further: a direct gradient-alignment piece, proven positive via concentration around the expected memory, and cross-history terms bounded as errors that fade with large β₂.
Alex: Like isolating rare fluke samples that could derail the average?
Sam: Yes. As β₂ approaches 1, the errors drop, ensuring reliable convergence under practical smoothness and variance growth—without needing the old theory's strict gradient limits.
Alex: So the proof wraps up by showing those error terms fade reliably as β₂ gets closer to 1. Overall, it feels like a solid bridge between why Adam fails in some contrived setups and why it succeeds in real training.
Sam: It does bridge that gap. The paper establishes that when you fix the problem first—like real datasets and batch sizes—and then tune β₂ high enough, with β₁ under its square root, Adam converges to critical points or very near them, under standard assumptions on smoothness and reasonable gradient noise. That said, the paper notes its convergence threshold for β₂ is sufficient but not necessarily tight; the exact shape of that phase boundary remains uncharted. It also requires diminishing step sizes, and the bounds carry factors depending on dimension and batch count that could be sharpened.
Alex: Right, and that tuning scales β₂ up as batches shrink, matching what works for giant models. Those are fair caveats—nothing's fully nailed down yet, but the core logic holds up.
Sam: In practice, this gives a clear recipe: tune β₂ up as batch size shrinks, even for trillion-parameter language models, dodging the divergence pitfalls seen in some recent studies. It grounds why vanilla Adam powers successes without modifications, despite earlier warnings. The paper's contribution is mapping that first phase transition in the tuning plane, proving convergence in the realistic order of decisions, and offering batch-aware guidance—advancing both theory and practice for large-scale AI training.
Alex: Well put. That's our look at why Adam converges without changes in the real world. Thanks for listening to ResearchPod.