
A Coding Implementation to Run Qwen3.5 Reasoning Models Distilled with Claude-Style Thinking Using GGUF and 4-Bit Quantization

By Josh
March 26, 2026
In AI, Analytics and Automation


In this tutorial, we work directly with Qwen3.5 models distilled with Claude-style reasoning and set up a Colab pipeline that lets us switch between a 27B GGUF variant and a lightweight 2B 4-bit version with a single flag. We start by validating GPU availability, then conditionally install either llama.cpp or transformers with bitsandbytes, depending on the selected path. Both branches are unified through shared generate_fn and stream_fn interfaces, ensuring consistent inference across backends. We also implement a ChatSession class for multi-turn interaction and build utilities to parse <think> traces, allowing us to explicitly separate reasoning from final outputs during execution.

MODEL_PATH = "2B_HF"  # or "27B_GGUF"


import torch


if not torch.cuda.is_available():
   raise RuntimeError(
       "❌ No GPU! Go to Runtime → Change runtime type → T4 GPU."
   )


gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"✅ GPU: {gpu_name} — {vram_gb:.1f} GB VRAM")


import subprocess, sys, os, re, time


generate_fn = None
stream_fn = None

We initialize the execution by setting the model path flag and checking whether a GPU is available on the system. We retrieve and print the GPU name along with total VRAM to ensure the environment meets the requirements. We also import all required base libraries and define placeholders for the unified generation functions that will be assigned later.
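The backend-dispatch pattern that the rest of the notebook relies on can be sketched in isolation; here two hypothetical echo backends stand in for the real llama.cpp and transformers branches, so only the flag and the shared `generate_fn` name matter:

```python
# Minimal sketch of the flag-driven dispatch (echo backends are hypothetical
# stand-ins for the real llama.cpp / transformers code paths).
MODEL_PATH = "2B_HF"  # or "27B_GGUF"

generate_fn = None  # assigned by whichever branch runs

if MODEL_PATH == "27B_GGUF":
    def generate_fn(prompt, **kwargs):
        return f"[gguf] {prompt}"
elif MODEL_PATH == "2B_HF":
    def generate_fn(prompt, **kwargs):
        return f"[hf] {prompt}"
else:
    raise ValueError("MODEL_PATH must be '27B_GGUF' or '2B_HF'")

# Downstream code never touches the backend directly, only generate_fn.
print(generate_fn("hello"))
```

Because every later cell calls only `generate_fn`, `stream_fn`, and `ChatSession`, swapping backends never requires touching the test suite.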


if MODEL_PATH == "27B_GGUF":
   print("\n📦 Installing llama-cpp-python with CUDA (takes 3-5 min)...")
   env = os.environ.copy()
   env["CMAKE_ARGS"] = "-DGGML_CUDA=on"
   subprocess.check_call(
       [sys.executable, "-m", "pip", "install", "-q", "llama-cpp-python", "huggingface_hub"],
       env=env,
   )
   print("✅ Installed.\n")


   from huggingface_hub import hf_hub_download
   from llama_cpp import Llama


   GGUF_REPO = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
   GGUF_FILE = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf"


   print(f"⏳ Downloading {GGUF_FILE} (~16.5 GB)... grab a coffee ☕")
   model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)
   print(f"✅ Downloaded: {model_path}\n")


   print("⏳ Loading into llama.cpp (GPU offload)...")
   llm = Llama(
       model_path=model_path,
       n_ctx=8192,
       n_gpu_layers=40,
       n_threads=4,
       verbose=False,
   )
   print("✅ 27B GGUF model loaded!\n")


   def generate_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs
   ):
       output = llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
       )
       return output["choices"][0]["message"]["content"]


   def stream_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       print("⏳ Streaming output:\n")
       for chunk in llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
           stream=True,
       ):
           delta = chunk["choices"][0].get("delta", {})
           text = delta.get("content", "")
           if text:
               print(text, end="", flush=True)
       print()


   class ChatSession:
       def __init__(self, system_prompt="You are a helpful assistant. Think step by step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"role": "user", "content": user_message})
           output = llm.create_chat_completion(
               messages=self.messages, max_tokens=2048,
               temperature=temperature, top_p=0.95,
           )
           resp = output["choices"][0]["message"]["content"]
           self.messages.append({"role": "assistant", "content": resp})
           return resp

We handle the 27B GGUF path by installing llama.cpp with CUDA support and downloading the Qwen3.5 27B distilled model from Hugging Face. We load the model with GPU offloading and define a standardized generate_fn and stream_fn for inference and streaming outputs. We also implement a ChatSession class to maintain conversation history for multi-turn interactions.
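The multi-turn behavior is easiest to see with the model call stubbed out; this sketch replaces `llm.create_chat_completion` with a hypothetical canned reply so the history bookkeeping alone is visible:

```python
# Sketch of how ChatSession accumulates context (the stub reply replaces
# the real llm.create_chat_completion call).
class ChatSession:
    def __init__(self, system_prompt="You are a helpful assistant."):
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_message):
        self.messages.append({"role": "user", "content": user_message})
        resp = f"reply to: {user_message}"  # stand-in for model output
        self.messages.append({"role": "assistant", "content": resp})
        return resp

s = ChatSession()
s.chat("first question")
s.chat("follow-up")
# system + 2 user turns + 2 assistant turns = 5 messages in context
print(len(s.messages))
```

Every turn appends two messages, so with an 8192-token context window long sessions will eventually need truncation or summarization, which this tutorial does not implement.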

elif MODEL_PATH == "2B_HF":
   print("\n📦 Installing transformers + bitsandbytes...")
   subprocess.check_call([
       sys.executable, "-m", "pip", "install", "-q",
       "transformers @ git+https://github.com/huggingface/transformers.git@main",
       "accelerate", "bitsandbytes", "sentencepiece", "protobuf",
   ])
   print("✅ Installed.\n")


   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer


   HF_MODEL_ID = "Jackrong/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled"


   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=True,
   )


   print(f"⏳ Loading {HF_MODEL_ID} in 4-bit...")
   tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)
   model = AutoModelForCausalLM.from_pretrained(
       HF_MODEL_ID,
       quantization_config=bnb_config,
       device_map="auto",
       trust_remote_code=True,
       torch_dtype=torch.bfloat16,
   )
   print(f"✅ Model loaded! Memory: {model.get_memory_footprint() / 1e9:.2f} GB\n")


   def generate_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
       repetition_penalty=1.05, do_sample=True, **kwargs
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(text, return_tensors="pt").to(model.device)
       with torch.no_grad():
           output_ids = model.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,
           )
       generated = output_ids[0][inputs["input_ids"].shape[1]:]
       return tokenizer.decode(generated, skip_special_tokens=True)


   def stream_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(text, return_tensors="pt").to(model.device)
       streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
       print("⏳ Streaming output:\n")
       with torch.no_grad():
           model.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, do_sample=True, streamer=streamer,
           )


   class ChatSession:
       def __init__(self, system_prompt="You are a helpful assistant. Think step by step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"role": "user", "content": user_message})
           text = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
           inputs = tokenizer(text, return_tensors="pt").to(model.device)
           with torch.no_grad():
               output_ids = model.generate(
                   **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,
               )
           generated = output_ids[0][inputs["input_ids"].shape[1]:]
           resp = tokenizer.decode(generated, skip_special_tokens=True)
           self.messages.append({"role": "assistant", "content": resp})
           return resp
else:
   raise ValueError("MODEL_PATH must be '27B_GGUF' or '2B_HF'")

We implement the lightweight 2B path using transformers with 4-bit quantization through bitsandbytes. We load the Qwen3.5 2B distilled model efficiently onto the GPU and configure generation parameters for controlled sampling. We again define unified generation, streaming, and chat session logic so that both model paths behave identically during execution.
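A back-of-envelope estimate shows why NF4 makes the 2B model fit comfortably on a T4: 4-bit weights cost roughly 0.5 bytes per parameter, plus a small overhead for the double-quantized absmax scales. The overhead factor below is illustrative, not measured:

```python
# Rough VRAM estimate for NF4 4-bit weights: ~0.5 bytes/param plus a
# small (assumed) overhead for quantization scales and buffers.
def est_4bit_gb(n_params: float, overhead: float = 1.1) -> float:
    return n_params * 0.5 * overhead / 1e9

print(f"2B model:  ~{est_4bit_gb(2e9):.2f} GB")
print(f"27B model: ~{est_4bit_gb(27e9):.2f} GB")
```

This is weights only; the KV cache and activations add more at generation time, which is why the 27B path uses a pre-quantized GGUF with partial GPU offload instead.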

def parse_thinking(response: str) -> tuple:
   m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
   if m:
       return m.group(1).strip(), response[m.end():].strip()
   return "", response.strip()




def display_response(response: str):
   thinking, answer = parse_thinking(response)
   if thinking:
       print("🧠 THINKING:")
       print("-" * 60)
       print(thinking[:1500] + ("\n... [truncated]" if len(thinking) > 1500 else ""))
       print("-" * 60)
   print("\n💬 ANSWER:")
   print(answer)




print("✅ All helpers ready. Running tests...\n")

We define helper functions to extract reasoning traces enclosed within <think> tags and separate them from final answers. We create a display utility that formats and prints both the thinking process and the response in a structured way. This allows us to inspect how the Qwen-based model reasons internally during generation.
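The parser from above is self-contained enough to exercise on a synthetic response (the sample string below is made up for illustration):

```python
import re

def parse_thinking(response: str):
    # Split a <think>...</think> trace from the final answer.
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()

sample = "<think>3 apples, give half -> 1.5, buy 5 -> 6.5</think>You end up with 6.5 apples."
thinking, answer = parse_thinking(sample)
print(thinking)  # 3 apples, give half -> 1.5, buy 5 -> 6.5
print(answer)    # You end up with 6.5 apples.
```

Responses without a `<think>` block (such as the "thinking OFF" test later) fall through to the second return, yielding an empty trace and the full text as the answer.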

print("=" * 70)
print("📝 TEST 1: Basic reasoning")
print("=" * 70)


response = generate_fn(
   "If I have 3 apples and give away half, then buy 5 more, how many do I have? "
   "Explain your reasoning."
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 2: Streaming output")
print("=" * 70)


stream_fn(
   "Explain the difference between concurrency and parallelism. "
   "Give a real-world analogy for each."
)


print("\n" + "=" * 70)
print("📝 TEST 3: Thinking ON vs OFF")
print("=" * 70)


question = "What is the capital of France?"


print("\n--- Thinking ON (default) ---")
resp = generate_fn(question)
display_response(resp)


print("\n--- Thinking OFF (concise) ---")
resp = generate_fn(
   question,
   system_prompt="Answer directly and concisely. Do not use <think> tags.",
   max_new_tokens=256,
)
display_response(resp)


print("\n" + "=" * 70)
print("📝 TEST 4: Bat & ball trick question")
print("=" * 70)


response = generate_fn(
   "A bat and a ball cost $1.10 in total. "
   "How much does the ball cost? Show complete reasoning and verify.",
   system_prompt="You are a precise mathematical reasoner. Set up equations and verify.",
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 5: Train meeting problem")
print("=" * 70)


response = generate_fn(
   "A train leaves Station A at 9:00 AM at 60 mph toward Station B. "
   "Another leaves Station B at 10:00 AM at 80 mph toward Station A. "
   "Stations are 280 miles apart. When and where do they meet?",
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 6: Logic puzzle (five houses)")
print("=" * 70)


response = generate_fn(
   "Five houses in a row are painted different colors. "
   "The red house is left of the blue house. "
   "The green house is in the middle. "
   "The yellow house is not next to the blue house. "
   "The white house is at one end. "
   "What is the order from left to right?",
   temperature=0.3,
   max_new_tokens=3000,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 7: Code generation — longest palindromic substring")
print("=" * 70)


response = generate_fn(
   "Write a Python function to find the longest palindromic substring "
   "using Manacher's algorithm. Include docstring, type hints, and tests.",
   system_prompt="You are an expert Python programmer. Think through the algorithm carefully.",
   max_new_tokens=3000,
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 8: Multi-turn conversation (physics tutor)")
print("=" * 70)


session = ChatSession(
   system_prompt="You are a knowledgeable physics tutor. Explain clearly with examples."
)


turns = [
   "What is the Heisenberg uncertainty principle?",
   "Can you give me a concrete example with actual numbers?",
   "How does this relate to quantum tunneling?",
]


for i, q in enumerate(turns, 1):
   print(f"\n{'─'*60}")
   print(f"👤 Turn {i}: {q}")
   print(f"{'─'*60}")
   resp = session.chat(q, temperature=0.5)
   _, answer = parse_thinking(resp)
   print(f"🤖 {answer[:1000]}{'...' if len(answer) > 1000 else ''}")


print("\n" + "=" * 70)
print("📝 TEST 9: Temperature comparison — creative writing")
print("=" * 70)


creative_prompt = "Write a one-paragraph opening for a sci-fi story about AI consciousness."


configs = [
   {"label": "Low temp (0.1)",  "temperature": 0.1, "top_p": 0.9},
   {"label": "Med temp (0.6)",  "temperature": 0.6, "top_p": 0.95},
   {"label": "High temp (1.0)", "temperature": 1.0, "top_p": 0.98},
]


for cfg in configs:
   print(f"\n🎛️  {cfg['label']}")
   print("-" * 60)
   start = time.time()
   resp = generate_fn(
       creative_prompt,
       system_prompt="You are a creative fiction writer.",
       max_new_tokens=512,
       temperature=cfg["temperature"],
       top_p=cfg["top_p"],
   )
   elapsed = time.time() - start
   _, answer = parse_thinking(resp)
   print(answer[:600])
   print(f"⏱️  {elapsed:.1f}s")


print("\n" + "=" * 70)
print("📝 TEST 10: Speed benchmark")
print("=" * 70)


start = time.time()
resp = generate_fn(
   "Explain how a neural network learns, step by step, for a beginner.",
   system_prompt="You are a patient, clear teacher.",
   max_new_tokens=1024,
)
elapsed = time.time() - start


approx_tokens = int(len(resp.split()) * 1.3)
print(f"~{approx_tokens} tokens in {elapsed:.1f}s")
print(f"~{approx_tokens / elapsed:.1f} tokens/sec")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")


import gc


for name in ["model", "llm"]:
   if name in globals():
       del globals()[name]
gc.collect()
torch.cuda.empty_cache()


print(f"\n✅ Memory freed. VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print("\n" + "=" * 70)
print("🎉 Tutorial complete!")
print("=" * 70)

We run a comprehensive test suite that evaluates the model across reasoning, streaming, logic puzzles, code generation, and multi-turn conversations. We compare outputs under different temperature settings and measure performance in terms of speed and token throughput. Finally, we clean up memory and free GPU resources, ensuring the notebook remains reusable for further experiments.
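The benchmark's throughput figure is an approximation, not a tokenizer count: it assumes roughly 1.3 tokens per English word, a common heuristic. Factored out, the calculation is:

```python
# Approximate tokens/sec from word count (~1.3 tokens per English word
# is a heuristic, not an exact tokenizer count).
def approx_tokens_per_sec(text: str, elapsed_s: float) -> float:
    approx_tokens = int(len(text.split()) * 1.3)
    return approx_tokens / elapsed_s

# 100 words in 2 seconds -> ~130 tokens -> ~65 tok/s
print(f"{approx_tokens_per_sec('one two three four five ' * 20, 2.0):.1f} tok/s")
```

For an exact count, tokenize the response with the model's own tokenizer instead; the heuristic is only meant for quick relative comparisons between runs.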

In conclusion, we have a compact but flexible setup for running Qwen3.5-based reasoning models enhanced with Claude-style distillation across different hardware constraints. The script abstracts backend differences while exposing consistent generation, streaming, and conversational interfaces, making it easy to experiment with reasoning behavior. Through the test suite, we probe how the model handles structured reasoning, edge-case questions, and longer multi-step tasks, while also measuring speed and memory usage. What we end up with is not just a demo, but a reusable scaffold for evaluating and extending Qwen-based reasoning systems in Colab without changing the core code.


Check out the Full Notebook and Source Page.
