Training a Tokenizer for BERT Models

By Josh
November 30, 2025
In AI, Analytics and Automation


BERT is an early transformer-based model for NLP tasks that’s small and fast enough to train on a home computer. Like all deep learning models that operate on text, it needs a tokenizer to convert text into integer tokens. This article shows how to train a WordPiece tokenizer following BERT’s original design.

Let’s get started.

Training a Tokenizer for BERT Models
Photo by JOHN TOWNER. Some rights reserved.

Overview

This article is divided into two parts; they are:


  • Picking a Dataset
  • Training a Tokenizer

Picking a Dataset

To keep things simple, we’ll use English text only. WikiText is a popular preprocessed dataset for experiments, available through the Hugging Face datasets library:

import random
from datasets import load_dataset

# path and configuration name of the dataset on the Hugging Face Hub
path, name = "wikitext", "wikitext-2-raw-v1"
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Print a few random samples
for idx in random.sample(range(len(dataset)), 5):
    text = dataset[idx]["text"].strip()
    print(f"{idx}: {text}")

On first run, the dataset downloads to ~/.cache/huggingface/datasets and is cached for future use. WikiText-2, used above, is a smaller dataset suitable for quick experiments, while WikiText-103 is larger and more representative of real-world text, making it a better choice for training a serious model.

The output of this code may look like this:

size: 36718
23905: Dudgeon Creek
4242: In 1825 the Congress of Mexico established the Port of Galveston and in 1830 …
7181: Crew : 5
24596: On March 19 , 2007 , Sports Illustrated posted on its website an article in its …
12920: The most recent building included in the list is in the Quantock Hills . The …

The dataset contains strings of varying lengths with spaces around punctuation marks. While you could split on whitespace, this wouldn’t capture sub-word components. That’s what the WordPiece tokenization algorithm is good at.
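
To make this concrete, here is a tiny illustration in plain Python. Splitting on whitespace keeps every word whole, so a rare word needs its own vocabulary entry; the WordPiece-style split shown in the comment is hypothetical and depends on the vocabulary you actually train:

# Whitespace splitting never breaks a word into smaller pieces
print("BERT handles tokenization efficiently".split())
# ['BERT', 'handles', 'tokenization', 'efficiently']
# A trained WordPiece tokenizer could instead cover a rare word with sub-word pieces,
# e.g. something like ['token', '##ization'], depending on the learned vocabulary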

Training a Tokenizer

Several tokenization algorithms support sub-word components. BERT uses WordPiece, while modern LLMs often use Byte-Pair Encoding (BPE). We’ll train a WordPiece tokenizer following BERT’s original design.

The tokenizers library implements multiple tokenization algorithms that can be configured to your needs, saving you the hassle of implementing an algorithm from scratch. You can install it with pip:
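
pip install tokenizers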

Let’s train a tokenizer:

import tokenizers
from datasets import load_dataset

path, name = "wikitext", "wikitext-103-raw-v1"
vocab_size = 30522
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Collect texts, skipping the title lines that start with "="
texts = []
for line in dataset["text"]:
    line = line.strip()
    if line and not line.startswith("="):
        texts.append(line)

# Configure a WordPiece tokenizer with NFKC normalization and special tokens
tokenizer = tokenizers.Tokenizer(tokenizers.models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
tokenizer.decoder = tokenizers.decoders.WordPiece()
tokenizer.normalizer = tokenizers.normalizers.NFKC()
trainer = tokenizers.trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"],
)

# Train the tokenizer and save it
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer_path = f"{name}_wordpiece.json"
tokenizer.save(tokenizer_path, pretty=True)

# Test the tokenizer by reloading it from disk
tokenizer = tokenizers.Tokenizer.from_file(tokenizer_path)
print(tokenizer.encode("Hello, world!").tokens)
print(tokenizer.decode(tokenizer.encode("Hello, world!").ids))

Running this code may print the following output:

wikitext-103-raw-v1/train-00000-of-00002(…): 100%|█████| 157M/157M [00:46<00:00, 3.40MB/s]
wikitext-103-raw-v1/train-00001-of-00002(…): 100%|█████| 157M/157M [00:04<00:00, 37.0MB/s]
Generating test split: 100%|███████████████| 4358/4358 [00:00<00:00, 174470.75 examples/s]
Generating train split: 100%|████████| 1801350/1801350 [00:09<00:00, 199210.10 examples/s]
Generating validation split: 100%|█████████| 3760/3760 [00:00<00:00, 201086.14 examples/s]
size: 1801350
[00:00:04] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:00] Tokenize words ████████████████████████████ 606445 / 606445
[00:00:00] Count pairs ████████████████████████████ 606445 / 606445
[00:00:04] Compute merges ████████████████████████████ 22020 / 22020
['Hell', '##o', ',', 'world', '!']
Hello, world!

This code uses the WikiText-103 dataset. The first run downloads the data (two shards of 157MB each, about 1.8 million lines of text), and the training itself takes only a few seconds. The example shows how "Hello, world!" becomes 5 tokens, with “Hello” split into “Hell” and “##o” (the “##” prefix indicates a sub-word component).

The tokenizer created in the code above has the following properties:

  • Vocabulary size: 30,522 tokens (matching the original BERT model)
  • Special tokens: [PAD], [CLS], [SEP], [MASK], and [UNK] are added to the vocabulary even though they are not in the dataset.
  • Pre-tokenizer: Whitespace splitting (since the dataset has spaces around punctuation)
  • Normalizer: NFKC normalization for Unicode text. You can also configure the tokenizer to lowercase everything, as the common uncased BERT models do; a minimal sketch of that configuration follows this list.
  • Algorithm: WordPiece. The decoder must be set to match, so that the “##” prefix marking sub-word components is handled correctly when converting tokens back to text.
  • Padding: Enabled with [PAD] token for batch processing. This is not demonstrated in the code above, but it will be useful when you are training a BERT model.
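
As noted in the normalizer bullet above, switching to an uncased tokenizer only requires chaining a Lowercase normalizer after NFKC. This is a minimal sketch of that change, not part of the original script:

import tokenizers

# Chain NFKC Unicode normalization with lowercasing, as an uncased BERT tokenizer does
tokenizer = tokenizers.Tokenizer(tokenizers.models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = tokenizers.normalizers.Sequence([
    tokenizers.normalizers.NFKC(),
    tokenizers.normalizers.Lowercase(),
])
# The rest of the configuration and training is unchanged
print(tokenizer.normalizer.normalize_str("Héllo WORLD"))  # "héllo world"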

The tokenizer saves to a fairly large JSON file containing the full vocabulary, allowing you to reload the tokenizer later without retraining.

To convert a string into a list of tokens, use tokenizer.encode(text).tokens, where each token is a string. To feed a model, use tokenizer.encode(text).ids instead, which returns a list of integers. The decode() method converts a list of integer IDs back into a string. All of this is demonstrated in the code above.
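
The padding enabled before saving also applies when encoding a batch of texts, which is where it matters for model training. Below is a small sketch of that behavior; it assumes the tokenizer file produced by the training script above (wikitext-103-raw-v1_wordpiece.json, following the naming used there):

import tokenizers

# Reload the trained tokenizer; the padding settings are stored in the JSON file
tokenizer = tokenizers.Tokenizer.from_file("wikitext-103-raw-v1_wordpiece.json")

# encode_batch pads every sequence to the length of the longest one in the batch
encodings = tokenizer.encode_batch(["Hello, world!", "A longer sentence that produces more tokens."])
for enc in encodings:
    print(enc.tokens)  # string tokens, with trailing [PAD] tokens on the shorter text
    print(enc.ids)     # integer IDs of equal length, ready to batch into a model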


This article demonstrated how to train a WordPiece tokenizer for BERT using the WikiText dataset. You learned to configure the tokenizer with appropriate normalization and special tokens, and how to encode text to tokens and decode back to strings. This is just a starting point for tokenizer training. Consider leveraging existing libraries and tools to optimize tokenizer training speed so it doesn’t become a bottleneck in your training process.


