
Evaluating Perplexity on Language Models

By Josh
December 29, 2025
in AI, Analytics and Automation


A language model is a probability distribution over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. Specifically, you will learn:

  • What is perplexity, and how to compute it
  • How to evaluate the perplexity of a language model with sample data

Let’s get started.

Evaluating Perplexity on Language Models
Photo by Lucas Davis. Some rights reserved.

Overview

This article is divided into two parts; they are:


  • What Is Perplexity and How to Compute It
  • Evaluate the Perplexity of a Language Model with HellaSwag Dataset

What Is Perplexity and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities the model assigns to the tokens in the sample, each conditioned on the tokens before it. Mathematically, perplexity is defined as:

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i \mid x_{1:i-1})^{-1/L} = \exp\Big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i \mid x_{1:i-1})\Big)
$$

Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient to compute it from the mean of the log probabilities, as in the right-hand form above, since multiplying many small probabilities directly would underflow.

Perplexity is a metric that quantifies how much a language model hesitates about the next token on average. If the language model is absolutely certain, the perplexity is 1. If the language model is completely uncertain, then every token in the vocabulary is equally likely; the perplexity is equal to the vocabulary size. You should not expect perplexity to go beyond this range.
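To make the computation concrete, here is a minimal sketch in plain Python, assuming you already have the per-token log probabilities (the function name and example values are illustrative):

import math

def perplexity(token_log_probs):
    # perplexity is the exponential of the negative mean log probability
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# a model that assigns probability 0.25 to every token behaves like a
# uniform choice among 4 tokens, so its perplexity is 4
print(perplexity([math.log(0.25)] * 4))  # ≈ 4.0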

Evaluate the Perplexity of a Language Model with HellaSwag Dataset

Perplexity is a dataset-dependent metric. One dataset you can use is HellaSwag. It is a dataset with train, test, and validation splits. It is available on the Hugging Face hub, and you can load it with the following code:

import datasets

dataset = datasets.load_dataset("HuggingFaceFW/hellaswag")
print(dataset)

# print the first sample from the validation split
for sample in dataset["validation"]:
    print(sample)
    break

Running this code will print the following:


DatasetDict({
    train: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 39905
    })
    test: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10003
    })
    validation: Dataset({
        features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                   'source_id', 'split', 'split_type', 'label'],
        num_rows: 10042
    })
})
{'ind': 24, 'activity_label': 'Roof shingle removal',
 'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he',
 'ctx': 'A man is sitting on a roof. he', 'endings': [
    'is using wrap to wrap a pair of skis.', 'is ripping level tiles off.',
    "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'
], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain',
 'label': '3'}

You can see that the validation split has 10,042 samples. This is the dataset you will use in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context that needs to be completed. The model is expected to complete the sequence by selecting one of the four endings. The key "label", with values 0 to 3, indicates which ending is correct.
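As a quick illustration of these fields, this small sketch (using the dataset loaded above) reconstructs the full correct sentence from the first validation sample:

# reconstruct the correct completion from a validation sample
sample = dataset["validation"][0]
correct_ending = sample["endings"][int(sample["label"])]
print(sample["ctx"] + " " + correct_ending)
# A man is sitting on a roof. he starts pulling up roofing on a roof.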

With this, you can write a short script to evaluate your own language model. Let’s use a small model from Hugging Face as an example:


import datasets
import torch
import torch.nn.functional as F
import tqdm
import transformers

model = "openai-community/gpt2"

# Load the model
torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(model)
model = transformers.AutoModelForCausalLM.from_pretrained(model)

# Load the dataset: HellaSwag has train, test, and validation splits
dataset = datasets.load_dataset("hellaswag", split="validation")

# Evaluate the model: Compute the perplexity of each ending
num_correct = 0
for sample in tqdm.tqdm(dataset):
    # tokenize text from the sample
    text = tokenizer.encode(" " + sample["activity_label"] + ". " + sample["ctx"])
    endings = [tokenizer.encode(" " + x) for x in sample["endings"]]  # 4 endings
    groundtruth = int(sample["label"])  # integer, 0 to 3
    # generate logits for each ending
    perplexities = [0.0] * 4
    for i, ending in enumerate(endings):
        # run the entire input and ending through the model
        input_ids = torch.tensor(text + ending).unsqueeze(0)
        output = model(input_ids).logits
        # extract the logits for each token in the ending
        logits = output[0, len(text)-1:, :]
        token_probs = F.log_softmax(logits, dim=-1)
        # accumulate the log probability of generating the ending
        log_prob = 0.0
        for j, token in enumerate(ending):
            log_prob += token_probs[j, token]
        # convert the sum of log probabilities to perplexity
        perplexities[i] = torch.exp(-log_prob / len(ending))
    # print the perplexity of each ending
    print(sample["activity_label"] + ". " + sample["ctx"])
    correct = perplexities[groundtruth] == min(perplexities)
    for i, p in enumerate(perplexities):
        if i == groundtruth:
            symbol = '(O)' if correct else '(!)'
        elif p == min(perplexities):
            symbol = '(X)'
        else:
            symbol = '   '
        print(f"Ending {i}: {p:.4g} {symbol} - {sample['endings'][i]}")
    if correct:
        num_correct += 1

print(f"Accuracy: {num_correct}/{len(dataset)} = {num_correct / len(dataset):.4f}")

This code loads the smallest GPT-2 model from the Hugging Face Hub, a 124M-parameter model that runs easily on modest hardware. The model and tokenizer are loaded using the Hugging Face transformers library, and the HellaSwag validation dataset is loaded as before.

In the for-loop, you tokenize the activity label and the context, as well as each of the four endings. Note that tokenizer.encode() is how you invoke a tokenizer from the transformers library; it is different from the tokenizer object you used in the previous article.
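If you have not used it before, tokenizer.encode() maps a string to a list of integer token IDs, and tokenizer.decode() reverses it. A quick check, using a string built from the sample shown earlier:

ids = tokenizer.encode(" Roof shingle removal. A man is sitting on a roof. he")
print(ids)                    # a list of integer token IDs
print(tokenizer.decode(ids))  # round-trips back to the original string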

Next, for each ending, you run the concatenated context and ending through the model. The input_ids tensor is a 2D tensor of integer token IDs with a batch dimension of 1. The model returns an output object from which you extract the logits tensor. This differs from the model you built in the previous article because it is a model object from the transformers library, but you can swap in your own trained model with minor changes.

GPT-2 is a decoder-only transformer model that processes its input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ is the model’s estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert the logits to log probabilities and average them over the length of each ending.
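The off-by-one alignment is the easiest part to get wrong, so here is the shape arithmetic of the loop body spelled out as a sketch (variable names match the code above):

n = len(text)                                         # prompt length
input_ids = torch.tensor(text + ending).unsqueeze(0)  # shape (1, n + len(ending))
logits = model(input_ids).logits                      # shape (1, n + len(ending), V)
# logits[0, p] predicts the token at position p + 1, so the predictions for
# ending[0], ending[1], ... sit at positions n - 1, n, ...
# The final position would predict the token after the ending; it is trimmed
# here, while the loop in the code above simply never indexes it.
ending_logits = logits[0, n - 1:-1, :]                # shape (len(ending), V)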

The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability over the ending’s tokens is then converted to a perplexity. A good model should give the correct ending the lowest perplexity, so you can evaluate a model by counting the number of correct predictions over the entire HellaSwag validation dataset.
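As a design note, the inner loop that accumulates token_probs[j, token] can be replaced by a single advanced-indexing operation; this sketch is equivalent to the loop above:

# equivalent to the j-loop: gather each ending token's log probability
ending_ids = torch.tensor(ending)
log_prob = token_probs[torch.arange(len(ending_ids)), ending_ids].sum()
perplexities[i] = torch.exp(-log_prob / len(ending_ids))

When you run the code, you will see output like the following: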

…
Finance and Business. [header] How to buy a peridot Look at a variety of stones…
Ending 0: 13.02 (X) - You will want to watch several of the gemstones, particularly eme…
Ending 1: 30.19     - Not only are they among the delicates among them, but they can be…
Ending 2: 34.96 (!) - Familiarize yourself with the different shades that it comes in, …
Ending 3: 28.85     - Neither peridot nor many other jade or allekite stones are necess…
Family Life. [header] How to tell if your teen is being abused Pay attention to…
Ending 0: 16.58     - Try to figure out why they are dressing something that is frowned…
Ending 1: 22.01     - Read the following as a rule for determining your teen's behaviou…
Ending 2: 15.21 (O) - [substeps] For instance, your teen may try to hide the signs of a…
Ending 3: 23.91     - [substeps] Ask your teen if they have black tights (with stripper…
Accuracy: 3041/10042 = 0.3028

The code prints the perplexity of each ending, marking the correct answer with (O) or (!) and the model’s wrong prediction with (X). You can see that GPT-2 produces perplexities of roughly 10 to 20 even for correct answers, while advanced LLMs can achieve perplexity below 10 despite a much larger vocabulary than GPT-2. What matters more is whether the model identifies the correct ending, the one that naturally completes the sentence: it should be the one with the lowest perplexity, otherwise the model would not generate it. GPT-2 achieves only 30% accuracy on this dataset.

You can also repeat the code with a different model. Here are the results:

  • model openai-community/gpt2: the smallest GPT-2 model with 124M parameters, used in the code above. The accuracy is 3041/10042, or 30.28%.
  • model openai-community/gpt2-medium: a larger GPT-2 model with 355M parameters. The accuracy is 3901/10042, or 38.85%.
  • model meta-llama/Llama-3.2-1B: the smallest model in the Llama 3.2 family, with 1B parameters. The accuracy is 5731/10042, or 57.07%.

As you would expect, larger models achieve higher accuracy.

Note that you should not compare perplexities across models with vastly different architectures. Since perplexity ranges from 1 to the vocabulary size, it depends heavily on the tokenizer. You can see this by rerunning the code above with GPT-2 replaced by Llama 3.2 1B: the perplexity is an order of magnitude higher for Llama 3, yet the accuracy is better. This is because GPT-2 has a vocabulary size of only 50,257, while Llama 3.2 1B has a vocabulary size of 128,256.
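A quick derivation shows why the ceiling scales with the vocabulary: under a completely uncertain (uniform) model, every token has probability $1/V$, so

$$
PPL = \exp\Big(-\frac{1}{L} \sum_{i=1}^L \log \frac{1}{V}\Big) = \exp(\log V) = V
$$

A uniform GPT-2 would therefore top out at 50,257 and a uniform Llama 3.2 at 128,256, so the raw perplexities of the two models live on different scales.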


Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:

  • Perplexity measures how much a model hesitates about the next token on average.
  • Perplexity is a metric sensitive to vocabulary size.
  • Computing perplexity means taking the inverse of the geometric mean of the probabilities of the tokens in the sample.


