

TLDR: the mixed-precision training module forthcoming in PyTorch 1.6 delivers on its promise: speed-ups of 50-60% in large model training jobs with just a handful of new lines of code.

One of the most exciting additions expected to land in PyTorch 1.6, coming soon, is support for automatic mixed-precision training. Mixed-precision training is a technique for substantially reducing neural net training time by performing as many operations as possible in half-precision floating point, fp16, instead of the (PyTorch default) single-precision floating point, fp32. Recent generations of NVIDIA GPUs come loaded with special-purpose tensor cores designed for fast fp16 matrix operations.
However, up until now these tensor cores have remained difficult to use, as it has required writing reduced precision operations into your model by hand. This is where the automatic in automatic mixed-precision training comes in.
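In PyTorch 1.6 this functionality lives in torch.cuda.amp. Below is a minimal sketch of what a mixed-precision training step looks like with it; the toy model, optimizer, and random data are placeholders, and a CUDA GPU is assumed.

import torch

# A minimal sketch of an automatic mixed-precision training loop with
# torch.cuda.amp (PyTorch 1.6+). The tiny model and random data below are
# placeholders for illustration; a CUDA GPU is assumed.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    X = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # Inside autocast, each op runs in fp16 or fp32 depending on how safe it is.
        loss = loss_fn(model(X), y)
    scaler.scale(loss).backward()   # scale the loss so fp16 gradients don't underflow
    scaler.step(optimizer)          # unscales the gradients, then applies the update
    scaler.update()                 # adjusts the scale factor for the next iteration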
Python's float type is fp64; PyTorch, which is much more memory-sensitive, uses fp32 as its default dtype instead. The basic idea behind mixed precision training is simple: halve the precision (fp32 → fp16), halve the training time. Notice that the smaller the floating point, the larger the rounding errors it incurs. Any operation performed on a "small enough" floating point number will round the value to zero! This is known as underflowing, and it's a problem because many to most of the gradient update values created during backpropagation are extremely small but nevertheless non-zero. Rounding error accumulation during backpropagation can turn these numbers into zeroes or NaNs, creating inaccurate gradient updates and preventing your network from converging. The 2018 ICLR paper Mixed Precision Training found that naively using fp16 everywhere "swallows" gradient updates smaller than 2^-24 in value - around 5% of all gradient updates made by their example network:

[Figure: distribution of gradient updates, FP32 vs FP16]
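To make the underflow problem concrete, here is a small illustration (the specific values are just examples): fp16 bottoms out at roughly 2^-24, so anything smaller rounds to zero, whether it arrives as an input or is produced mid-computation.

import torch

print(torch.get_default_dtype())         # torch.float32 -- PyTorch's default

# fp16 can represent values down to roughly 2^-24; anything smaller underflows to zero.
print(torch.tensor(2.0 ** -24).half())   # tensor(5.9605e-08, dtype=torch.float16)
print(torch.tensor(2.0 ** -25).half())   # tensor(0., dtype=torch.float16)

# The same thing happens mid-computation: the true product below is about 1e-8,
# which is too small for fp16, so the result rounds to zero.
a = torch.tensor(1e-4, dtype=torch.float16)
print(a * a)                             # tensor(0., dtype=torch.float16)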

Mixed precision training is a set of techniques which allows you to use fp16 without causing your model training to diverge. It's a combination of three different techniques.

One, maintain two copies of the weights matrix, a "master copy" in fp32, and a half-precision copy of it in fp16. Gradient updates are calculated using the fp16 matrix but applied to the fp32 matrix. This makes applying the gradient update much safer.
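A hand-rolled sketch of this first technique might look something like the following; the tiny model, data, and learning rate are made up for illustration, and a CUDA GPU is assumed.

import torch

# A manual sketch of the "master copy" technique (illustrative only; assumes a CUDA GPU).
# Forward and backward run on an fp16 copy of the weights, but the update is applied
# to an fp32 master copy, which is then used to refresh the fp16 copy.
model = torch.nn.Linear(16, 1).cuda().half()                       # fp16 working copy
master = [p.detach().clone().float() for p in model.parameters()]  # fp32 master copy
lr = 1e-3

X = torch.randn(64, 16, device="cuda", dtype=torch.float16)
y = torch.randn(64, 1, device="cuda", dtype=torch.float16)

loss = torch.nn.functional.mse_loss(model(X), y)   # forward pass in fp16
loss.backward()                                    # fp16 gradients

with torch.no_grad():
    for p16, p32 in zip(model.parameters(), master):
        p32 -= lr * p16.grad.float()   # update applied to the fp32 master weights
        p16.copy_(p32)                 # refresh the fp16 working copy from the master
        p16.grad = None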

Two, different vector operations accumulate errors at different rates, so treat them differently. Some operations are always safe in fp16, but others are only reliable in fp32.
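For example, matrix multiplies are generally well-behaved in fp16 and map straight onto the tensor cores, while large reductions such as sums accumulate rounding error quickly and are safer in fp32. A manual version of that per-op choice looks like this (illustrative only; assumes a CUDA GPU); the automatic machinery described above makes these decisions for you.

import torch

# Illustrative only: do the matmul in fp16, where tensor cores shine,
# but carry out the large reduction in fp32 to limit error accumulation.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

prod = a @ b                  # matrix multiply: generally safe in fp16
total = prod.float().sum()    # large sum: accumulate in fp32 instead of fp16
print(total.dtype)            # torch.float32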
