The Journey of a Token: What Really Happens Inside a Transformer

By Josh
December 1, 2025
in AI, Analytics and Automation


In this article, you will learn how a transformer converts input tokens into context-aware representations and, ultimately, next-token probabilities.

Topics we will cover include:

  • How tokenization, embeddings, and positional information prepare inputs
  • What multi-headed attention and feed-forward networks contribute inside each layer
  • How the final projection and softmax produce next-token probabilities

Let’s get our journey underway.

The Journey of a Token: What Really Happens Inside a Transformer (diagram)
Image by Editor

The Journey Begins

Large language models (LLMs) are based on the transformer architecture, a deep neural network whose input is a sequence of token embeddings. After a deep process that looks like a parade of stacked attention and feed-forward transformations, it outputs a probability distribution indicating which token should be generated next as part of the model's response. But how can this journey from inputs to outputs be explained for a single token in the input sequence?


In this article, you will learn what happens inside a transformer model — the architecture behind LLMs — at the token level. In other words, we will see how input tokens or parts of an input text sequence turn into generated text outputs, and the rationale behind the changes and transformations that take place inside the transformer.

The description of this journey through a transformer model will be guided by the above diagram that shows a generic transformer architecture and how information flows and evolves through it.

Entering the Transformer: From Raw Input Text to Input Embedding

Before the input reaches the depths of the transformer model, a few transformations are applied to the raw text, primarily so that it is represented in a form the internal layers of the transformer can fully work with.

Tokenization

The tokenizer is an algorithmic component typically working in symbiosis with the LLM's transformer model. It takes the raw text sequence, e.g. the user prompt, and splits it into discrete tokens (often subword units or bytes, sometimes whole words), with each token being mapped to an integer identifier in the model's vocabulary.
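
To make this concrete, here is a minimal sketch of the tokenization step, assuming the Hugging Face transformers package and the GPT-2 tokenizer purely as an illustration (the article does not prescribe any particular tokenizer):

```python
# Hedged sketch: raw text -> discrete tokens -> integer identifiers.
# Assumes the Hugging Face "transformers" package and the GPT-2 tokenizer
# as one possible example; any subword tokenizer follows the same pattern.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The journey of a token begins here."
token_ids = tokenizer.encode(text)                    # integer identifiers
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # the subword pieces themselves

print(tokens)      # subword strings, e.g. ['The', 'Ġjourney', ...]
print(token_ids)   # one integer identifier per token
```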

Token Embeddings

There is a learned embedding table E with shape |V| × d (vocabulary size by embedding dimension). Looking up the identifiers for a sequence of length n yields an embedding matrix X with shape n × d. That is, each token identifier is mapped to a d-dimensional embedding vector that forms one row of X. Two embedding vectors will be similar to each other if they are associated with tokens that have similar meanings, e.g. king and emperor, and dissimilar otherwise. Importantly, at this stage, each token embedding carries semantic and lexical information for that single token, without incorporating information about the rest of the sequence (at least not yet).
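
As a rough sketch of this lookup, assuming PyTorch and illustrative sizes for |V| and d (the values below are placeholders, not figures from the article):

```python
# Hedged sketch of the embedding lookup: identifiers -> rows of the table E.
# vocab_size and d_model are illustrative placeholders.
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768
embedding = nn.Embedding(vocab_size, d_model)    # the learned table E, shape |V| x d

token_ids = torch.tensor([[464, 7002, 286]])     # a toy sequence of n = 3 identifiers
X = embedding(token_ids)                         # embedding matrix X, shape (1, n, d)
print(X.shape)                                   # torch.Size([1, 3, 768])
```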

Positional Encoding

Before fully entering the core parts of the transformer, it is necessary to inject into each token embedding vector (i.e. into each row of the embedding matrix X) information about the position of that token in the sequence. This is also called injecting positional information, and it is typically done with trigonometric functions like sine and cosine, although there are techniques based on learned positional embeddings as well. A position-dependent vector is added to the embedding vector e_t associated with a token, as follows:

\[
x_t^{(0)} = e_t + p_{\text{pos}}(t)
\]

with p_pos(t) typically being a trigonometric function of the token position t in the sequence. As a result, an embedding vector that formerly encoded only “what a token is” now encodes “what the token is and where in the sequence it sits”. This corresponds to the “input embedding” block in the above diagram.
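
A minimal sketch of the classic sinusoidal variant, assuming PyTorch (sizes are illustrative):

```python
# Hedged sketch: sinusoidal positional encodings added to the embedding matrix.
import torch

def sinusoidal_positions(n: int, d_model: int) -> torch.Tensor:
    """Return an (n, d_model) matrix of sine/cosine positional encodings."""
    positions = torch.arange(n, dtype=torch.float32).unsqueeze(1)     # shape (n, 1)
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * even_dims / d_model)
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions
    return pe

# x_t^(0) = e_t + p_pos(t), applied to every row of X at once
n, d_model = 3, 768
X = torch.randn(1, n, d_model)                   # stand-in for the embedding matrix
X0 = X + sinusoidal_positions(n, d_model)        # positions broadcast over the batch
```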

Now, time to enter the depths of the transformer and see what happens inside!

Deep Inside the Transformer: From Input Embedding to Output Probabilities

Let’s explain what happens to each “enriched” single-token embedding vector as it goes through one transformer layer, and then zoom out to describe what happens across the entire stack of layers.

The formula

\[
h_t^{(0)} = x_t^{(0)}
\]

is used to denote a token's representation at layer 0 (the input to the first layer), whereas more generically we will use h_t^{(l)} to denote the token's embedding representation at layer l.

Multi-headed Attention

The first major component inside each replicated layer of the transformer is multi-headed attention. This is arguably the most influential component in the entire architecture when it comes to identifying, and incorporating into each token's representation, meaningful information about the token's role in the entire sequence and its relationships with other tokens in the text, be they syntactic, semantic, or any other sort of linguistic relationship. The multiple heads of this so-called attention mechanism each specialize in capturing different linguistic aspects and patterns of the token and the sequence it belongs to, all simultaneously.

The result of a token representation h_t^{(l)} (with positional information injected beforehand, don't forget!) traveling through this multi-headed attention inside a layer is a context-enriched, or context-aware, token representation. Thanks to residual connections and layer normalization across the transformer layer, the newly generated vectors become stabilized blends of their own previous representations and the multi-headed attention output. This helps keep the process coherent as it is applied repeatedly across layers.
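
As a rough illustration of this sublayer, here is a hedged sketch using PyTorch's built-in multi-head attention module, with a residual connection and layer normalization (sizes are illustrative, not prescribed by the article):

```python
# Hedged sketch: multi-head self-attention sublayer with residual + layer norm.
# (In a decoder-style LLM a causal mask would also be applied; omitted for brevity.)
import torch
import torch.nn as nn

d_model, n_heads, n = 768, 12, 3
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

H = torch.randn(1, n, d_model)            # token representations h_t^(l)
attn_out, attn_weights = mha(H, H, H)     # queries, keys, and values all come from H
H_ctx = norm(H + attn_out)                # residual add + layer normalization
print(H_ctx.shape)                        # torch.Size([1, 3, 768])
```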

Feed-forward Neural Network

Next comes something comparatively simpler: a few feed-forward neural network (FFN) layers. For instance, these can be per-token multilayer perceptrons (MLPs) whose goal is to further transform and refine the token features that are gradually being learned.

The main difference between the attention stage and this one is that attention mixes and incorporates, in each token representation, contextual information from across all tokens, whereas the FFN step is applied independently to each token, refining the contextual patterns already integrated to extract useful “knowledge” from them. These layers are also supplemented with residual connections and layer normalizations, and as a result of this process, we have at the end of a transformer layer an updated representation h_t^{(l+1)} that will become the input to the next transformer layer, thereby entering another multi-headed attention block.
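
A minimal sketch of this position-wise sublayer, assuming PyTorch and an illustrative hidden width:

```python
# Hedged sketch: position-wise feed-forward sublayer (a per-token MLP)
# with its residual connection and layer normalization.
import torch
import torch.nn as nn

d_model, d_ff, n = 768, 3072, 3
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),    # expand
    nn.GELU(),                   # non-linearity
    nn.Linear(d_ff, d_model),    # project back to d_model
)
norm = nn.LayerNorm(d_model)

H = torch.randn(1, n, d_model)   # output of the attention sublayer
H_next = norm(H + ffn(H))        # applied independently at each token position
```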

The whole process is repeated as many times as there are stacked layers in the architecture, progressively enriching the token embedding with higher-level, more abstract, and longer-range linguistic information hidden behind those seemingly indecipherable numbers.
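
Putting the two sublayers together, here is a hedged sketch of how one layer is composed and then repeated across a stack (again assuming PyTorch; the depth and sizes are illustrative):

```python
# Hedged sketch: composing attention + FFN into one layer and stacking L of them.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(h, h, h)
        h = self.norm1(h + attn_out)          # attention sublayer + residual + norm
        return self.norm2(h + self.ffn(h))    # feed-forward sublayer + residual + norm

layers = nn.ModuleList([TransformerLayer(768, 12, 3072) for _ in range(12)])  # L = 12
h = torch.randn(1, 3, 768)      # h^(0) for a toy 3-token sequence
for layer in layers:            # h^(l) -> h^(l+1), one step per stacked layer
    h = layer(h)
```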

Final Destination

So, what happens at the very end? At the top of the stack, after going through the last replicated transformer layer, we obtain a final token representation h_{t*}^{(L)} (where t* denotes the current prediction position) that is projected through a linear output layer followed by a softmax.

The linear layer produces unnormalized scores called logits, and the softmax converts these logits into next-token probabilities.

Logits computation:

\[
\text{logits}_j = W_{\text{vocab}, j} \cdot h_{t^*}^{(L)} + b_j
\]

Applying softmax to calculate normalized probabilities:

\[
\text{softmax}(\text{logits})_j = \frac{\exp(\text{logits}_j)}{\sum_{k} \exp(\text{logits}_k)}
\]

Using softmax outputs as next-token probabilities:

\[
P(\text{token} = j) = \text{softmax}(\text{logits})_j
\]

These probabilities are calculated for all possible tokens in the vocabulary. The next token to be generated by the LLM is then selected — often the one with the highest probability, though sampling-based decoding strategies are also common.
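
The last leg of the journey can be sketched in a few lines, again assuming PyTorch and illustrative sizes; greedy selection (argmax) is shown, with sampling left as a commented alternative:

```python
# Hedged sketch: final linear projection to vocabulary logits, then softmax.
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768
output_proj = nn.Linear(d_model, vocab_size)     # W_vocab and b

h_final = torch.randn(d_model)                   # h_{t*}^(L), the state at position t*
logits = output_proj(h_final)                    # unnormalized scores, one per token
probs = torch.softmax(logits, dim=-1)            # next-token probabilities, sum to 1

next_token_id = torch.argmax(probs).item()               # greedy decoding
# next_token_id = torch.multinomial(probs, 1).item()     # or sample from the distribution
```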

Journey’s End

This article took a journey, with a gentle level of technical detail, through the transformer architecture to provide a general understanding of what happens to the text provided to an LLM (the most prominent family of models built on the transformer architecture), and how this text is processed and transformed inside the model at the token level until it finally turns into the model's output: the next token to generate.

We hope you have enjoyed our travels together, and we look forward to the opportunity to embark upon another in the near future.


