Studying Triton One Kernel at a Time: Softmax

November 24, 2025



In the previous article of this series, we looked at an operation used in all fields of computer science: matrix multiplication. It's heavily used in neural networks to compute the activations of linear layers. However, activations on their own are difficult to interpret, since their values and statistics (mean, variance, min-max amplitude) can vary wildly from layer to layer. This is one of the reasons why we use activation functions, for example the logistic function (aka sigmoid), which projects any real number into the [0; 1] range.

The softmax function, also known as the normalised exponential function, is a multi-dimensional generalisation of the sigmoid. It converts a vector of raw scores (logits) into a probability distribution over M classes. We can interpret it as a weighted average that behaves as a smooth function and can be conveniently differentiated. It's a crucial component of dot-product attention, language modeling, and multinomial logistic regression.
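To make the definition concrete, here is a minimal pure-Python sketch of softmax and its link to the sigmoid. It is only a reference illustration, not a Triton kernel; the function names `sigmoid` and `softmax` and the sample logits are assumptions chosen for this example. Subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Shifting by the max leaves the output unchanged (the factor cancels
    # in the ratio) but keeps exp() from overflowing on large logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for M = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # positive values, largest logit gets the largest share
print(sum(probs))  # approximately 1.0

# With two classes and the second logit fixed at 0, softmax reduces to sigmoid.
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```

The last line shows why softmax is described as a generalisation of the sigmoid: for the logits `[x, 0]`, the first output is `exp(x) / (exp(x) + 1)`, which is the logistic function of `x`.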

In this article, we'll cover: