Linear Layers and Activation Functions in Transformer Models

By Josh
July 21, 2025


Attention operations are the signature of transformer models, but they are not the only building blocks. Linear layers and activation functions are equally essential. In this post, you will learn about:

  • Why linear layers and activation functions enable non-linear transformations
  • The typical design of feed-forward networks in transformer models
  • Common activation functions and their characteristics

Let’s get started.

Photo by Svetlana Gumerova. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Why Linear Layers and Activations are Needed in Transformers
  • Typical Design of the Feed-Forward Network
  • Variations of the Activation Functions

Why Linear Layers and Activations are Needed in Transformers

The attention layer is the core component of a transformer model. It relates different elements in a sequence to one another and transforms the input sequence into an output sequence. Once the attention weights are computed, the output at each position is a weighted sum of the (linearly projected) input elements, so the layer essentially mixes its inputs linearly.

Neural networks derive their power not from linear layers alone, but from activation functions that introduce non-linearity. In transformer models, you need non-linearity after the attention layer to learn complex patterns. This is achieved by adding a feed-forward network (FFN), also called a multi-layer perceptron (MLP), after each attention layer. A typical transformer block looks like this:

Architecture of the BERT model

The gray box in the figure above repeats multiple times in a transformer model. In each block (excluding normalization layers), the input first passes through an attention layer, then through a feed-forward network whose linear layers are implemented with nn.Linear in PyTorch. An activation function within the feed-forward network adds non-linearity to the transformation.

The feed-forward network allows the model to learn more complex patterns. Typically, it contains multiple linear layers: the first expands the dimension to explore different representations, while the last contracts it back to the original dimension. Activations are usually applied to the output of the first linear layer.

Due to this design, we typically call the first half of the block the “attention sub-layer” and the second half the “MLP sub-layer.”
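To see why an activation is needed between the linear layers, the following sketch stacks two small linear maps with and without ReLU in between. It uses plain Python lists rather than PyTorch, omits biases for brevity, and the weight values are made up purely for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n), both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Apply ReLU elementwise to a vector."""
    return [max(0.0, vi) for vi in v]


# Illustrative weights (not from any real model): expand 2 -> 3, contract 3 -> 2
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]]
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
x = [1.0, 2.0]

# Without an activation, two linear layers equal one layer with weight W2 @ W1
stacked = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)

# With ReLU in between, the output differs from the collapsed linear map
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked, collapsed, with_relu)
```

Without the activation, the expand-then-contract pair is no more expressive than a single linear layer, which is why the activation in the middle matters.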

Typical Design of the Feed-Forward Network

In the BERT model, the MLP sublayer is implemented as follows:

import torch.nn as nn

class BertMLP(nn.Module):
    def __init__(self, dim, intermediate_dim):
        super().__init__()
        # First linear layer expands the dimension
        self.fc1 = nn.Linear(dim, intermediate_dim)
        # Second linear layer contracts it back to the original size
        self.fc2 = nn.Linear(intermediate_dim, dim)
        self.gelu = nn.GELU()

    def forward(self, hidden_states):
        hidden_states = self.fc1(hidden_states)
        # Non-linearity is applied to the expanded representation
        hidden_states = self.gelu(hidden_states)
        hidden_states = self.fc2(hidden_states)
        return hidden_states

The MLP sublayer contains two linear modules. When an input sequence enters the MLP sublayer, the first linear module expands the dimension, then applies the GELU activation function. The result passes through the second linear module to contract the dimension back to the original size.

The intermediate dimension is typically 4 times the original dimension: a common design pattern in transformer models.
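As a quick arithmetic check of what the 4x expansion costs, the sketch below counts the trainable parameters of the two linear layers for a hidden size of 768 (the BERT-base hidden size, giving an intermediate size of 3072). The helper `linear_params` is a hypothetical name introduced here; it assumes each layer has a weight matrix plus a bias vector, as nn.Linear does by default.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


dim = 768                    # hidden size of BERT-base
intermediate_dim = 4 * dim   # 3072

fc1 = linear_params(dim, intermediate_dim)   # expansion layer
fc2 = linear_params(intermediate_dim, dim)   # contraction layer
total = fc1 + fc2

print(intermediate_dim, fc1, fc2, total)
# prints: 3072 2362368 2360064 4722432
```

Under these assumptions, each MLP sublayer holds roughly 4.7 million parameters.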

Variations of the Activation Functions

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. While traditional neural networks commonly use hyperbolic tangent (tanh), sigmoid, and rectified linear unit (ReLU), transformer models typically employ GELU and SwiGLU activations.

Below are the mathematical definitions of some common activation functions:

$$
\begin{aligned}
\text{Sigmoid}(x) &= \frac{1}{1 + e^{-x}} \\
\tanh(x) &= \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\text{Sigmoid}(2x) - 1 \\
\text{ReLU}(x) &= \max(0, x) \\
\text{GELU}(x) &= x \cdot \Phi(x) \approx \frac{x}{2}\Big(1 + \tanh\big(\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\big)\Big) \\
\text{Swish}_\beta(x) &= x \cdot \text{Sigmoid}(\beta x) = \frac{x}{1 + e^{-\beta x}} \\
\text{SiLU}(x) &= \frac{x}{1 + e^{-x}} = \text{Swish}_1(x) \\
\text{SwiGLU}(x) &= \text{SiLU}(xW + b) \cdot (xV + c)
\end{aligned}
$$
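The simpler definitions above translate directly into scalar Python functions. This is a minimal sketch using only the standard `math` module, not how PyTorch implements them, and it is meant for moderate inputs (very negative arguments would overflow `math.exp`). It also numerically checks the identity between tanh and sigmoid.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def swish(x, beta=1.0):
    return x * sigmoid(beta * x)


def silu(x):
    # SiLU is Swish with beta fixed to 1
    return swish(x, beta=1.0)


# Check tanh(x) = 2 * Sigmoid(2x) - 1 at a few sample points
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```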

ReLU (rectified linear unit) is popular in modern deep learning because it is computationally simple and mitigates the vanishing gradient problem: its gradient is exactly 1 for every positive input.

GELU (Gaussian Error Linear Unit) is more computationally expensive due to its use of the cumulative distribution function of the standard normal distribution $\Phi(x)$. An approximation formula exists, as shown above. GELU is not monotonic, as you can see in the figure below.

Monotonic activation functions are generally preferred because they ensure consistent gradient direction, potentially leading to faster convergence. However, monotonicity is not strictly required; a non-monotonic activation may just need longer training times. You trade off model complexity with training duration.
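The exact GELU and its tanh approximation can be compared numerically. The sketch below assumes the standard identity $\Phi(x) = \frac{1}{2}\big(1 + \text{erf}(x/\sqrt{2})\big)$ and uses only the `math` module; the function names are chosen here for illustration.

```python
import math


def gelu_exact(x):
    # x * Phi(x), with Phi written in terms of the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # tanh-based approximation from the formula above
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest gap between the two versions on a grid over [-4, 4]
xs = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

# Non-monotonic: the value dips between x = -3 and x = -0.75, then rises to 0
left, dip, zero = (gelu_exact(x) for x in (-3.0, -0.75, 0.0))

print(max_gap, left, dip, zero)
```

The two versions agree closely over this range, and the dip for small negative inputs is what makes GELU non-monotonic.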

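Swish is another non-monotonic activation function with a parameter $\beta$ that controls how sharply the function bends around $x=0$: as $\beta$ grows, Swish approaches ReLU, while $\beta=0$ gives the linear function $x/2$. When $\beta=1$, it’s called SiLU (Sigmoid Linear Unit).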
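To get a feel for the role of $\beta$, the sketch below evaluates Swish at a few points for three values of $\beta$ and estimates the slope at the origin with a central difference. The sample points and step size are arbitrary choices made for illustration.

```python
import math


def swish(x, beta):
    return x / (1.0 + math.exp(-beta * x))


points = (-2.0, -0.5, 0.5, 2.0)
for beta in (0.0, 1.0, 10.0):
    values = [round(swish(x, beta), 4) for x in points]
    print(f"beta={beta}: {values}")

# Central-difference estimate of the slope at x = 0 for each beta
h = 1e-3
slopes = [(swish(h, b) - swish(-h, b)) / (2 * h) for b in (0.0, 1.0, 10.0)]
print(slopes)
```

With $\beta=0$ the function is exactly $x/2$, and with $\beta=10$ it is already close to ReLU. The estimated slope at the origin comes out as 0.5 in every case, so what $\beta$ changes is how quickly the curve moves from the flat negative side to the linear positive side.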

SwiGLU (Swish-Gated Linear Unit) is a recent activation function common in modern transformer models. It’s the product of a Swish function and a linear function, with parameters learned during training. Its popularity stems from its complexity: expanding the formula reveals a quadratic term in the numerator, helping models learn complex patterns without additional layers.
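The quadratic term can be made explicit by expanding the definition. The derivation below treats $x$, $W$, $V$, $b$, and $c$ as scalars for simplicity; in a real model they are vectors and matrices, and the product is elementwise.

$$
\begin{aligned}
\text{SwiGLU}(x) &= \text{SiLU}(xW + b) \cdot (xV + c) \\
&= \frac{(xW + b)(xV + c)}{1 + e^{-(xW + b)}} \\
&= \frac{WV\,x^2 + (Wc + bV)\,x + bc}{1 + e^{-(xW + b)}}
\end{aligned}
$$

For example, with $W = V = 1$, $b = 0$, and $c = 1$, this reduces to $\frac{x^2 + x}{1 + e^{-x}}$.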

Plot of some common activation functions

The figure above shows plots of these activation functions. The SwiGLU function shown is $f(x) = \text{SiLU}(x) \cdot (x+1)$.

Switching activation functions in the Python code is straightforward. PyTorch provides built-in nn.Sigmoid, nn.ReLU, nn.Tanh, nn.GELU, and nn.SiLU. SwiGLU, however, requires a custom implementation. Below is a simplified PyTorch version of the MLP sublayer in the style of the Llama model:

import torch.nn as nn

class LlamaMLP(nn.Module):
    def __init__(self, dim, intermediate_dim):
        super().__init__()
        # Note: the reference Llama implementation creates these layers with bias=False
        self.gate_proj = nn.Linear(dim, intermediate_dim)
        self.up_proj = nn.Linear(dim, intermediate_dim)
        self.down_proj = nn.Linear(intermediate_dim, dim)
        self.act = nn.SiLU()

    def forward(self, hidden_states):
        gate = self.gate_proj(hidden_states)
        up = self.up_proj(hidden_states)
        # SiLU is applied to the gate branch, which then scales the up branch
        swish = self.act(gate)
        output = self.down_proj(swish * up)
        return output

This implementation uses two linear layers to process the input hidden_states. The output of one passes through the SiLU function and is then multiplied elementwise with the output of the other; the product goes through a final linear layer. The “up” and “down” layers are named for dimension expansion and contraction, while the layer feeding SiLU is called “gate” for its gating role. Gating means scaling each element of one linear layer’s output by a weight, which here is produced by applying the Swish (SiLU) activation to the gate layer’s output.
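The gating effect is easier to see on a tiny example. The sketch below re-implements the same data flow in plain Python, without biases and with hand-picked weights that are purely illustrative: one gate unit is strongly positive, one is zero, and one is strongly negative.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def linear(W, x):
    """Multiply weight matrix W (a list of rows) by vector x, with no bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical weights: dim = 2, intermediate_dim = 3
W_gate = [[10.0, 0.0], [0.0, 0.0], [-10.0, 0.0]]
W_up = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

x = [1.0, 1.0]

gate = [silu(g) for g in linear(W_gate, x)]   # gating weights
up = linear(W_up, x)                          # values to be gated
hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
output = linear(W_down, hidden)               # contract back to dim

print(gate, hidden, output)
```

Here every "up" unit carries the same value, yet only the first one survives the gate almost unchanged; the other two are scaled to zero or nearly zero before the down projection.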

The Llama model architecture appears as follows, showing the MLP module’s dual-branch structure:

Architecture of a Llama model

Summary

In this post, you learned about linear layers and activation functions in transformer models. Specifically, you learned about:

  • Why linear layers and activation functions are necessary for non-linear transformations
  • The characteristics and implementations of ReLU, GELU, and SwiGLU activation functions
  • How to build complete feed-forward networks as used in transformer models
