
Image by Editor
# Introduction
Thanks to large language models (LLMs), we now have impressive, extremely useful applications like Gemini, ChatGPT, and Claude, to name just a few. However, few people realize that the underlying architecture behind an LLM is called a transformer. This architecture is carefully designed to "think", that is, to process data describing human language, in a very particular and somewhat special way. Are you interested in gaining a broad understanding of what happens inside these so-called transformers?
This article describes, in a gentle, understandable, and fairly non-technical tone, how the transformer models sitting behind LLMs analyze input information like user prompts, and how they generate coherent, meaningful, and relevant output text word by word (or, slightly more technically, token by token).
# Preliminary Steps: Making Language Understandable by Machines
The first key idea to grasp is that AI models do not actually understand human language; they only understand and operate on numbers, and the transformers behind LLMs are no exception. Therefore, human language, i.e. text, must be converted into a form the transformer can fully understand before it is able to process it in depth.
Put another way, the first few steps that take place before entering the core, innermost layers of the transformer focus primarily on turning this raw text into a numerical representation that preserves the key properties and characteristics of the original text. Let's examine these three steps.


Making language understandable by machines
// Tokenization
The tokenizer is the first actor to come onto the scene. Working in tandem with the transformer model, it is responsible for chunking the raw text into small pieces called tokens. Depending on the tokenizer used, these tokens are usually equivalent to words, but they can also be parts of words or punctuation marks. Furthermore, every token in a language has a unique numerical identifier. This is the moment when text stops being text and becomes numbers, all at the token level, as shown in this example in which a simple tokenizer converts a text containing five words into five token identifiers, one per word:


Tokenization of text into token identifiers
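To make this concrete, below is a minimal sketch of a toy word-level tokenizer with a tiny made-up vocabulary. Real LLM tokenizers (for example, byte-pair encoding tokenizers) are far more sophisticated and operate on subwords, but the basic mapping from text to token IDs follows the same idea.

```python
# A toy word-level tokenizer with a made-up vocabulary (illustration only;
# real tokenizers use learned subword vocabularies with tens of thousands of entries).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its numerical token ID."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("the cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```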
// Token Embeddings
Next, every token ID is transformed into a *d*-dimensional vector, that is, a list of *d* numbers. This full representation of a token as an embedding is like a description of the overall meaning of that token, be it a word, a part of a word, or a punctuation mark. The magic lies in the fact that tokens associated with similar concepts or meanings, like queen and empress, will have similar embedding vectors.
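Conceptually, an embedding layer is little more than a lookup table: a matrix with one row of *d* numbers per token in the vocabulary. The sketch below uses random values purely for illustration; in a real model these vectors are learned during training so that related tokens end up close together.

```python
import numpy as np

vocab_size, d = 5, 8                      # tiny vocabulary, 8-dimensional embeddings
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d))  # random stand-in for learned weights

token_ids = [0, 1, 2, 3, 0, 4]            # token IDs produced by the tokenizer above
token_embeddings = embedding_table[token_ids]       # look up one row per token
print(token_embeddings.shape)             # (6, 8): six tokens, one vector each
```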
// Positional Encoding
Up to this point, a token embedding contains information in the form of a collection of numbers, yet that information still relates to a single token in isolation. However, in a "piece of language" like a text sequence, it is important not only to know the words or tokens it contains, but also their position in the text they are part of. Positional encoding is a process that, using mathematical functions, injects into each token embedding some extra information about its position in the original text sequence.
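One classic choice, used in the original transformer paper, is sinusoidal positional encoding, where sine and cosine waves of different frequencies encode each position. Many modern LLMs use other schemes (learned or rotary embeddings, for instance), but the idea of blending position information into each token embedding is the same. A minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position encodings as in the original transformer paper (d must be even here)."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d, 2)[None, :]            # even dimension indices
    angles = positions / (10000 ** (dims / d))    # lower frequencies for higher dimensions
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add position information to the token embeddings from the previous sketch:
# positioned = token_embeddings + sinusoidal_positional_encoding(6, 8)
```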
# The Transformation Through the Core of the Transformer Model
Now that each token's numerical representation incorporates information about its position in the text sequence, it is time to enter the first layer of the main body of the transformer model. The transformer is a very deep architecture, with many stacked components replicated throughout the system. There are two types of transformer layers, the encoder layer and the decoder layer, but for the sake of simplicity, we will not make a nuanced distinction between them in this article. Just keep in mind for now that there are two types of layers in a transformer, even though they have a lot in common.


The transformation through the core of the transformer model
// Multi-Headed Attention
This is the first major subprocess taking place inside a transformer layer, and perhaps the most impactful and distinctive feature of transformer models compared to other types of AI systems. Multi-headed attention is a mechanism that lets a token observe or "pay attention to" the other tokens in the sequence. It collects and incorporates useful contextual information into that token's own representation, specifically linguistic aspects like grammatical relationships, long-range dependencies among words that are not necessarily next to each other in the text, or semantic similarities. In sum, thanks to this mechanism, various aspects of the relevance and relationships among parts of the original text are successfully captured. After a token representation travels through this component, it ends up with a richer, more context-aware representation of itself and the text it belongs to.
Some transformer architectures built for specific tasks, like translating text from one language to another, also use this mechanism to analyze possible dependencies among tokens in both the input text and the output (translated) text generated so far, as shown below:


Multi-headed attention in translation transformers
// Feed-Forward Neural Network Sublayer
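Under the hood, each attention head relies on a scaled dot-product operation: every token scores how relevant every other token is to it, turns those scores into weights with a softmax, and builds a weighted mix of the other tokens' representations. The sketch below shows a single head on random toy data; actual multi-headed attention runs several such heads in parallel on learned projections of the token representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One attention head: each token gathers a weighted mix of all tokens' information."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # token-to-token relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # context-aware token representations

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))   # 6 tokens, 8 dimensions each (toy data)
# In a real layer, Q, K, and V are separate learned projections of x.
print(scaled_dot_product_attention(x, x, x).shape)    # (6, 8)
```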
In simple terms, after passing through attention, the second common stage inside every replicated layer of the transformer is a set of chained neural network layers that further process our enriched token representations and help learn additional patterns from them. This process is akin to further sharpening these representations, identifying and reinforcing the features and patterns that are relevant. Ultimately, these layers are the mechanism used to progressively build a general, increasingly abstract understanding of the entire text being processed.
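In practice, this sublayer is typically a small two-layer neural network applied to each token representation independently: it expands the representation, applies a non-linearity, and projects it back down. The sketch below uses random weights and a ReLU activation purely for illustration.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer: expand, apply a non-linearity, project back."""
    hidden = np.maximum(0, x @ W1 + b1)     # ReLU (GELU is also common in practice)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32                              # 8-dim tokens, 32-dim hidden layer
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
x = rng.normal(size=(6, d))                  # 6 token representations from attention
print(feed_forward(x, W1, b1, W2, b2).shape) # (6, 8)
```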
The process of going through the multi-headed attention and feed-forward sublayers is repeated several times in that order: as many times as the number of replicated transformer layers we have.
// Final Destination: Predicting the Next Word
After repeating the previous two steps in an alternating fashion several times, the token representations derived from the initial text should have allowed the model to acquire a very deep understanding, enabling it to recognize complex and subtle relationships. At this point, we reach the final component of the transformer stack: a special layer that converts the final representation into a probability for every possible token in the vocabulary. That is, we calculate, based on all the information learned along the way, a probability for each word in the target language being the next word the transformer model (or the LLM) should output. The model finally chooses the token or word with the highest probability as the next one it generates as part of the output for the end user. The entire process repeats for every word to be generated as part of the model's response.
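In code, this final step boils down to projecting the last token's representation onto one score per vocabulary entry and turning those scores into probabilities with a softmax. The sketch below uses random weights and the tiny five-token vocabulary from earlier, and picks the most likely token greedily; real LLMs often sample from the distribution instead of always taking the top choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 5
W_out = rng.normal(size=(d, vocab_size))   # projection from model dimension to vocabulary size

final_hidden = rng.normal(size=(d,))       # final representation of the last position (toy data)
logits = final_hidden @ W_out              # one raw score per token in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax: scores become probabilities

next_token_id = int(np.argmax(probs))      # greedy choice of the most likely next token
print(next_token_id, round(float(probs[next_token_id]), 3))
```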
# Wrapping Up
This article provides a gentle and conceptual tour through the journey experienced by text-based information as it flows through the signature model architecture behind LLMs: the transformer. After reading it, you will hopefully have gained a better understanding of what goes on inside models like those behind ChatGPT.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
