Monday, May 18, 2026
mGrowTech

A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor

by Josh
May 18, 2026
in AI, Analytics and Automation


# Install the quantization stack quietly (Colab-style bootstrap).
import subprocess, sys
def pip(*pkgs):
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
pip("llmcompressor", "compressed-tensors",
   "transformers>=4.45", "accelerate", "datasets")
import os, gc, time, json, math
from pathlib import Path
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
# The helpers below call torch.cuda directly, so fail early without a GPU.
assert torch.cuda.is_available(), \
   "Enable a GPU: Runtime > Change runtime type > T4 GPU"
print("GPU:", torch.cuda.get_device_name(0),
     "| CUDA:", torch.version.cuda,
     "| torch:", torch.__version__)
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
# parents=True also creates /content when running outside Colab.
WORKDIR = Path("/content/quant_lab"); WORKDIR.mkdir(parents=True, exist_ok=True)
os.chdir(WORKDIR)
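def free_mem():
   """Release Python garbage and cached CUDA blocks between model loads."""
   gc.collect(); torch.cuda.empty_cache()
def dir_size_gb(path):
   """Total size of all files under `path`, in decimal gigabytes (1e9 bytes)."""
   total = 0
   for root, _, files in os.walk(path):
       for f in files:
           total += os.path.getsize(os.path.join(root, f))
   return total / 1e9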
def time_generation(model, tok, prompt, max_new_tokens=64):
   """Greedy decode; reports latency & tokens/sec after a brief warmup."""
   inputs = tok(prompt, return_tensors="pt").to(model.device)
   # Warmup run so one-time CUDA setup is not counted in the timing.
   _ = model.generate(**inputs, max_new_tokens=4, do_sample=False,
                      pad_token_id=tok.eos_token_id)
   torch.cuda.synchronize()
   t0 = time.perf_counter()
   out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                        do_sample=False, pad_token_id=tok.eos_token_id)
   torch.cuda.synchronize()
   dt = time.perf_counter() - t0
   new_ids = out[0][inputs["input_ids"].shape[1]:]
   # Count the tokens actually generated: decoding can stop early at EOS,
   # so dividing max_new_tokens by dt would overstate throughput.
   return tok.decode(new_ids, skip_special_tokens=True), dt, len(new_ids)/dt
@torch.no_grad()
def wikitext_ppl(model, tok, seq_len=512, max_chunks=20, stride=512):
   """Light WikiText-2 perplexity probe (fast, indicative, not a full evaluation)."""
   ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
   # Only the first 400 non-empty rows are used to keep the probe quick.
   text = "\n\n".join(t for t in ds["text"][:400] if t.strip())
   enc = tok(text, return_tensors="pt").input_ids.to(model.device)
   nll_sum, tok_count, n_chunks = 0.0, 0, 0
   for begin in range(0, enc.size(1) - seq_len, stride):
       chunk = enc[:, begin:begin+seq_len]
       out = model(chunk, labels=chunk)
       # out.loss is a mean over the seq_len - 1 shifted targets,
       # so weight it by that count rather than by seq_len.
       n_pred = chunk.size(1) - 1
       nll_sum += out.loss.float().item() * n_pred
       tok_count += n_pred
       n_chunks += 1
       if n_chunks >= max_chunks: break
   return math.exp(nll_sum / tok_count)
results = {}
PROMPT = ("<|im_start|>user\nIn two sentences, explain why post-training "
         "quantization works for large language models.<|im_end|>\n"
         "<|im_start|>assistant\n")
def benchmark(label, model_path_or_id):
   """Load a model, then record sample output, speed, perplexity and on-disk size."""
   free_mem()
   print(f"\n──── benchmarking: {label} ────")
   tok = AutoTokenizer.from_pretrained(model_path_or_id)
   m = AutoModelForCausalLM.from_pretrained(
           model_path_or_id, torch_dtype="auto", device_map="cuda").eval()
   sample, dt, tps = time_generation(m, tok, PROMPT)
   ppl = wikitext_ppl(m, tok)
   # Size is only measurable for local checkpoints; Hub IDs are recorded as None.
   size = dir_size_gb(model_path_or_id) if os.path.isdir(str(model_path_or_id)) else None
   results[label] = {"size_gb": size, "ppl": round(ppl, 3),
                     "latency_s": round(dt, 3), "tok_per_s": round(tps, 1),
                     "sample": sample.strip().replace("\n", " ")[:180]}
   print(json.dumps(results[label], indent=2))
   # Drop the model before the next run so GPU memory does not accumulate.
   del m; free_mem()
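
The perplexity probe above multiplies each chunk's mean loss by a token count, sums across chunks, and exponentiates the average. The sketch below isolates that arithmetic in plain Python so it can be checked without a GPU. The names perplexity_from_chunks and relative_ppl_change are hypothetical helpers introduced here for illustration, not part of the original listing or of llmcompressor.

```python
import math

def perplexity_from_chunks(chunks):
    """Combine per-chunk mean losses into one perplexity.

    chunks: iterable of (mean_nll, n_tokens) pairs, one per scored window.
    """
    nll_sum, tok_count = 0.0, 0
    for mean_nll, n_tokens in chunks:
        nll_sum += mean_nll * n_tokens   # undo the per-chunk mean
        tok_count += n_tokens
    if tok_count == 0:
        raise ValueError("no tokens were scored")
    return math.exp(nll_sum / tok_count)

def relative_ppl_change(baseline_ppl, quantized_ppl):
    """Fractional perplexity change of a compressed model versus its baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl
```

Weighting by token count matters only when chunks differ in length; with the equal-length windows used in the probe, the result reduces to the exponential of the average chunk loss. A helper like relative_ppl_change could be applied to two entries of the results dictionary once both a baseline and a compressed checkpoint have been benchmarked.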



